Tap Into Domain Specific AI Systems Using Gradient’s Multi-Model, Shared Foundation
Nov 21, 2023
Multi-Model, Shared Foundation
Enterprise businesses looking to use their private data to create custom AI solutions are discovering that smaller fine-tuned LLMs outperform larger LLMs - especially for domain- or task-specific needs. While the choice between the two may seem obvious, businesses still need to decide whether the benefits of a multitude of smaller domain-specific LLMs are worth the computational cost of deploying each model.
With Gradient, enterprise businesses can have the best of both worlds by leveraging a multi-tenant infrastructure that enables businesses to deploy thousands of custom LLMs at the same time, using the same compute and cost as they would with a single LLM. This is the magic behind Gradient’s multi-model, shared foundation.
10x More Cost Effective: Save 10x in costs compared to other development and productionization platforms.
Incremental Testing with Virtually Zero Latency: Leverage Just-in-Time (JIT) fine-tuning to incrementally test your models instantaneously, saving hundreds of hours throughout the process.
Thousands of Models, Same Inference: Run thousands of custom fine-tuned LLMs on the same compute as you would with a single LLM.
Host All Your Models in One Place: Reduce infrastructure complexity, overhead, and maintenance by hosting all your models in one place using Gradient’s scalable AI Cloud Platform.
If you’re new to AI and would like to learn more about LLMs and fine-tuning before diving into the technicalities of our platform, check out our Fine-Tuning 101 for Enterprise and RAG 101 for Enterprise walkthroughs.
Discover Some of Gradient’s Unique Capabilities
Gradient enables JIT fine-tuning by automatically scaling its tuning clusters so that multiple teams can test and fine-tune models concurrently. To make this possible, Low-Rank Adaptation (LoRA) allows each team to fine-tune on top of the same foundational model without adding computational resources. Fine-tuning actually happens within lightweight adapters that introduce a small set of new parameters, rather than modifying the foundational model itself. Once a team finishes fine-tuning, its adapter weights are simply applied on top of the shared foundational model - keeping the original model’s weights untouched.
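The core idea can be sketched in a few lines of NumPy. This is a minimal illustration of the LoRA mechanism described above, not Gradient's implementation: the shared weight `W` is frozen, and each team trains only its own small `A` and `B` matrices (all names and dimensions here are illustrative).

```python
import numpy as np

# Hypothetical dimensions for illustration; real LLM layers are far larger.
d, r = 64, 4  # hidden size and LoRA rank (r << d)

rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))  # shared, frozen foundation-model weight

def make_adapter(rng, d, r):
    """Each team trains only these small matrices; W is never modified."""
    A = rng.normal(scale=0.01, size=(r, d))
    B = np.zeros((d, r))  # B starts at zero, so a fresh adapter is a no-op
    return A, B

def forward(x, W, adapter=None, alpha=8, r=4):
    """Apply the shared weight, plus a team's low-rank update if present."""
    y = x @ W.T
    if adapter is not None:
        A, B = adapter
        y = y + (alpha / r) * (x @ A.T @ B.T)
    return y

# Two teams can share the same W and diverge only through their adapters.
team_a = make_adapter(rng, d, r)
x = rng.normal(size=(1, d))
base_out = forward(x, W)                  # un-adapted foundation model
tuned_out = forward(x, W, adapter=team_a) # same W, plus team A's adapter
```

Because `B` is zero-initialized, a fresh adapter leaves the model's behavior unchanged; training then moves only `A` and `B`, never `W`.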
Multi-Tenant Concurrent Inference
Given Gradient’s multi-tenant architecture, teams can run concurrent inference jobs on their own models without having to scale computational resources. By leveraging LoRA Dynamic Batching and masking, businesses can run thousands of custom fine-tuned models simultaneously without worrying about adapters interfering with one another, since they all share the same foundational model. This significantly improves productivity across teams, capturing the benefits - and none of the drawbacks - of traditional and LoRA fine-tuning, as detailed below.
Traditional Fine-Tuning: To run inference on fine-tuned models, businesses need a dedicated GPU to support each model - significantly increasing cost and overhead every time iterations are made. The one upside is that, since each model has its own dedicated GPU, you never have to wait for another model to complete its inference before starting the next.
LoRA Fine-Tuning Without LoRA Dynamic Batching: With LoRA, fine-tuned models can share the same GPU, significantly decreasing the costs and overhead of traditional fine-tuning. The drawback is that to run inference, the job in flight must complete before the next model’s adapter can be loaded in - creating an unnecessary bottleneck. Teams end up competing for the same resources, which puts a heavy burden on operational efficiency as you scale.
Train and Deploy Inference with Zero Latency
Time is one of the most valuable assets for a business. With Gradient, inference is done on-demand, and a single cluster can support thousands of LoRA fine-tuned models running inference. This means businesses can test changes to their models immediately, instead of waiting for the entire training process to run its course plus an additional ~15 minutes to deploy the inference server. With iterations being a necessary and frequent step in LLM optimization, that wait time compounds quickly as more models and use cases are introduced.
Gradient also offers the ability to run inference while your model is still training, whereas most APIs force you to wait until training is complete. By running incremental benchmarks on a model as it trains, teams can discover issues much earlier and avoid wasted time - freeing them up to address other high-priority tasks.
Made Possible by Gradient’s Architectural Setup
In order to fine-tune LLMs, businesses typically need to update all of a model’s parameters, which can be quite resource intensive - especially for pre-trained models with billions of parameters. To make the fine-tuning process less resource intensive, Gradient uses Low-Rank Adaptation (LoRA) to create adapters that introduce a small set of new parameters. Instead of fine-tuning all of the pre-trained model’s original weights, businesses train only the new parameters and simply apply them on top of the original model, leaving the old parameters untouched.
Although this may seem counterintuitive - adding parameters should mean more GPU usage - a LoRA fine-tuned model adds less than 0.1% of parameters to the pre-trained model. In practice, a LoRA adapter increases storage by only roughly 10 to 200 MB depending on the configuration. The lower resources required for LoRA-based fine-tuning far outweigh this storage increase, making it 10 to 50x more efficient and cost effective overall. It’s also important to note that there is virtually no performance loss between LoRA fine-tuned models and full fine-tuning.
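A quick back-of-the-envelope calculation shows why the numbers are so small. The configuration below is an assumption for illustration (a 7B-parameter base model with 32 layers, hidden size 4096, and LoRA rank 8 applied to two projections per layer), not a specific Gradient setup:

```python
# Back-of-the-envelope check of the adapter-size claim.
# Assumed (hypothetical) configuration: 7B-parameter base model,
# 32 transformer layers, hidden size 4096, LoRA rank r = 8 applied
# to two weight matrices per layer (e.g. query and value projections).
layers, d, r = 32, 4096, 8
targets_per_layer = 2               # matrices adapted per layer
params_per_target = 2 * r * d       # A is (r x d), B is (d x r)

lora_params = layers * targets_per_layer * params_per_target
base_params = 7_000_000_000

added_fraction = lora_params / base_params
storage_mb = lora_params * 2 / 1e6  # 2 bytes per fp16 parameter

print(f"{lora_params:,} new parameters "
      f"({added_fraction:.4%} of the base model), ~{storage_mb:.1f} MB")
```

Under these assumptions the adapter adds about 4.2M parameters, roughly 0.06% of the base model and under 10 MB of storage - consistent with the "less than 0.1%" and "10 to 200 MB" figures above.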
Thousands of Models for the Price of 1
While we’ve covered how LoRA improves cost efficiency for fine-tuning a single model, let’s look at how Gradient accomplishes this at scale. As mentioned before, smaller fine-tuned LLMs outperform larger LLMs - especially for domain- or task-specific needs. This is especially beneficial for businesses in highly regulated industries, such as financial services or healthcare, that require specificity. However, even with LoRA minimizing the GPU resources used per model, a typical setup would still need to spin up a new instance for each adapter. Over time this adds up and consumes a lot of computational resources as you scale.
Let’s put this in perspective for a business. Imagine that Company X is looking to implement custom AI solutions across the following four organizations: Customer Success (Chatbot), Legal (Policy Summarization), Finance (KYC), and Operations (Project Management). If Company X were to build this out using a typical development provider, they would need to create four separate instances to support each model - resulting in 4x the GPU consumption.
Gradient, on the other hand, leverages a multi-tenant architecture that enables businesses to use the same instance to create as many fine-tuned models as necessary. That means regardless of where you sit in the organization (e.g. Customer Success, Legal, etc.) or how many use cases you might have, all your fine-tuned models will share the same computational resource - minimizing costs and maximizing efficiency across your GPU.
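The Company X scenario can be made concrete with some simple arithmetic. The hourly rate below is an illustrative assumption, not Gradient pricing:

```python
# Rough cost comparison for the Company X example (illustrative numbers).
models = 4                 # chatbot, summarization, KYC, project management
gpu_cost_per_hour = 2.00   # assumed hourly rate for one GPU instance

dedicated = models * gpu_cost_per_hour  # one instance per fine-tuned model
shared = 1 * gpu_cost_per_hour          # one shared multi-tenant instance

print(f"dedicated: ${dedicated:.2f}/hr, shared: ${shared:.2f}/hr "
      f"({dedicated / shared:.0f}x difference)")
```

The gap grows linearly with the number of use cases: forty models means 40x the GPU spend on dedicated instances, but still a single shared instance under the multi-tenant approach.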
Dynamic LoRA Batching and Loading
Instead of pre-loading all of the fine-tuned model weights from the start, Gradient loads only the pre-trained model’s weights up front. Each fine-tuned adapter is dynamically loaded at the right moment during runtime - optimizing efficiency and removing unnecessary bottlenecks. Importantly, loading a new adapter won’t interfere with ongoing requests that are further along in the process; those requests carry on as intended.
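This lazy-loading pattern can be sketched as a simple cache. Everything here is illustrative (the class and the fetch function are hypothetical, not Gradient's actual API): adapter weights are fetched the first time a request for that adapter arrives, then reused from memory.

```python
# Minimal sketch of on-demand adapter loading; names are illustrative,
# not Gradient's actual API.
class AdapterRegistry:
    def __init__(self, fetch_fn):
        self._fetch = fetch_fn  # e.g. reads adapter weights from storage
        self._loaded = {}       # adapter_id -> weights resident in memory

    def get(self, adapter_id):
        if adapter_id not in self._loaded:
            # First request for this adapter: load it now. In-flight
            # requests using other adapters are unaffected.
            self._loaded[adapter_id] = self._fetch(adapter_id)
        return self._loaded[adapter_id]

# Usage with a stand-in fetch function:
registry = AdapterRegistry(fetch_fn=lambda aid: f"<weights for {aid}>")
w1 = registry.get("legal-summarizer")  # loaded on first use
w2 = registry.get("legal-summarizer")  # served from cache thereafter
```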
To further enhance throughput, Gradient has embedded Dynamic LoRA Batching into the process to address the challenges that typically come with continuous batching. For those unfamiliar, continuous batching enables high-performance text generation by batching multiple requests together between each token generation, as new requests flow in and older requests complete. The challenge is that only one set of adapter weights can be active at a time when exchanging LoRA weights between requests, forcing incoming requests to either:
Wait on standby until all active requests are complete for a particular adapter, creating longer latency for adapters that aren’t in use.
Constantly swap between adapters throughout the process, which is counterproductive.
To get around this, Gradient uses a scheduler to optimize throughput in batches so that you’ll be able to retain the benefits from continuous batching while having it work concurrently across multiple sets of LoRA adapters. Here’s a breakdown of how it works.
Active Adapters: At all times, a set number of adapters (N) is designated as active, with their weights loaded onto the GPU and ready to be used during decoding.
Batched and Masked: Requests from active adapters will be exhausted from their queues and batched together continuously. To avoid mismatches, a mask is applied to ensure the right adapter is applied to each request in the batch.
Scheduler: Once the allotted time slice has passed, the scheduler selects the next set of adapters that 1) have pending requests and 2) have been waiting in the queue the longest. Those adapters are marked as active, replacing the active adapters that have held their slots the longest. Note that the interval between adapter exchanges can be adjusted to trade off throughput versus latency.
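The three steps above can be sketched as a toy scheduling loop. All names, queue contents, and the slot count here are illustrative assumptions, not Gradient's implementation:

```python
from collections import deque

# Toy sketch of the scheduling scheme: N adapters are "active" at once,
# their queued requests are batched with a per-request adapter mask, and
# after each time slice the longest-waiting idle adapter replaces the
# longest-active one.
N = 2  # number of active adapter slots

queues = {
    "chatbot": deque(["q1", "q2"]),
    "legal":   deque(["q3"]),
    "kyc":     deque(["q4", "q5"]),
}
active = deque(["chatbot", "legal"])  # longest-active adapter at the left
waiting = deque(["kyc"])              # adapters waiting for an active slot

def run_time_slice():
    # Step 1-2: batch requests from all active adapters; the mask records
    # which adapter's weights apply to each request in the batch.
    batch, mask = [], []
    for adapter in active:
        if queues[adapter]:
            batch.append(queues[adapter].popleft())
            mask.append(adapter)
    # ... decode `batch` on the GPU, applying weights per `mask` ...
    # Step 3: after the slice, the longest-waiting adapter with pending
    # requests replaces the adapter that has been active the longest.
    if waiting and queues[waiting[0]]:
        evicted = active.popleft()
        active.append(waiting.popleft())
        waiting.append(evicted)
    return batch, mask

batch, mask = run_time_slice()
# After one slice: "chatbot" and "legal" requests were batched together,
# and "kyc" has rotated into the active set in place of "chatbot".
```

Lengthening the time slice favors throughput (fewer adapter exchanges), while shortening it favors latency for adapters waiting in the queue - the tradeoff noted above.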