Achieving GPT-4 Level Performance At Lower Cost Using DSPy
May 14, 2024
Erin Mccarthy, ML Engineer at Gradient
Overview
As companies look to leverage LLMs more widely in production workflows, multiple frameworks have arisen over the last year to provide abstractions for interacting with LLMs and to help build AI systems more programmatically – DSPy, LMQL, Outlines, and more.
At Gradient, we’ve learned that full stack agentic workflow automation systems can be quite brittle and onerous to iterate on without well-engineered abstractions. To create more robust AI systems, we explored several open-source frameworks that caught our eye in addition to building our own in-house abstractions.
In this blog post, we share a case study to demonstrate the following learnings from our deep dive into DSPy.
You can use DSPy to beat GPT-4 performance on a specific task, at 10x lower cost.
The main improvement from DSPy is structured formatting of the prompt, leading to better formatting of the response. This both increases accuracy and reduces response size (lowering cost).
Background on DSPy
DSPy is a flexible framework for optimizing LLM prompts, containing building blocks for more complex workflows. Each DSPy program is expressed in Python and involves a series of transformation steps, beginning with an input (e.g. a question) and culminating in an output (e.g. an answer).
The general flow of DSPy resembles the following.
Signature
Defines the structure of your data and how the module handles it, and is included as part of the prompt.
Create a dataset that consists of example input-output pairs, with a signature that abstracts the input/output behavior of a module – for example, defining a question and an answer for the “question → answer” signature.
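For illustration, here is a minimal sketch of a signature and a couple of example pairs in DSPy's Python API (the field names and examples are our own, and exact syntax may vary slightly across DSPy versions):

```python
import dspy

class QuestionAnswer(dspy.Signature):
    """Answer the question concisely."""  # the docstring becomes part of the prompt
    question = dspy.InputField(desc="the question to answer")
    answer = dspy.OutputField(desc="a short answer")

# Example input-output pairs; .with_inputs() marks which fields are inputs.
trainset = [
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
]
```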
Module Design
Create or use existing modules to represent the dataflow of your program. This can include simple modules or more complex flows, such as Chain of Thought (CoT).
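As a sketch, a simple module wraps a single predictor around the signature above; swapping in a more complex flow such as CoT is a one-line change (the class and flag names here are illustrative):

```python
class SimpleQA(dspy.Module):
    def __init__(self, use_cot=False):
        super().__init__()
        # dspy.Predict maps inputs straight to outputs;
        # dspy.ChainOfThought adds an intermediate reasoning step.
        predictor_cls = dspy.ChainOfThought if use_cot else dspy.Predict
        self.generate = predictor_cls(QuestionAnswer)

    def forward(self, question):
        return self.generate(question=question)
```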
Optimizer
Optimizes all modules in the pipeline to maximize a metric.
Depending on the amount of data that is available to train on, select an appropriate optimizer from BootstrapFewShot, BootstrapFewShotWithRandomSearch, and MIPRO. See DSPy's documentation for more details.
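For reference, these optimizers live in dspy.teleprompt; the dataset-size guidance in the comments below is a rough rule of thumb from the DSPy docs, not a hard threshold:

```python
from dspy.teleprompt import (
    BootstrapFewShot,                  # very little data (on the order of 10 examples)
    BootstrapFewShotWithRandomSearch,  # a bit more data (around 50 examples)
    MIPRO,                             # larger datasets (hundreds of examples)
)
```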
Evaluation
Create a new metric or use a pre-defined one, such as exact match or F1, to be used with the optimizer.
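A metric is just a Python function that scores a prediction against a gold example. Here is a minimal exact-match sketch wired into DSPy's Evaluate helper (devset is assumed to be a list of dspy.Example objects like the trainset above):

```python
from dspy.evaluate import Evaluate

def exact_match(example, pred, trace=None):
    # True when the predicted answer matches the gold answer exactly (case-insensitive).
    return example.answer.strip().lower() == pred.answer.strip().lower()

evaluator = Evaluate(devset=devset, metric=exact_match, num_threads=4, display_progress=True)
# evaluator(program)  # prints the aggregate score of `program` over devset
```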
DSPy composes these four steps into a few lines of code, in addition to allowing you to specify the model endpoint. At Gradient, we have been using DSPy with great success to drastically improve the efficiency of our agentic solutions. In the next section, we'll describe a recent use case.
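Putting it together stays compact. A sketch of the end-to-end flow using the pieces above (we show the OpenAI client for illustration; the client class and model names depend on your DSPy version and endpoint):

```python
# Point DSPy at a model endpoint (dspy.OpenAI in DSPy 2.x; newer releases use dspy.LM).
lm = dspy.OpenAI(model="gpt-3.5-turbo", max_tokens=1024)
dspy.settings.configure(lm=lm)

# Compile the module against the training examples, then run it.
optimizer = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=4)
compiled_qa = optimizer.compile(SimpleQA(), trainset=trainset)

print(compiled_qa(question="What is the capital of France?").answer)
```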
Case Study
In this case study, we will compare manual optimization vs DSPy to get results in a situation where the data is complex and noisy. This is a challenge we have had to solve while building full stack agentic systems, in particular for the financial services industry.
While SOTA GPT-4 has demonstrated strong capabilities with simple prompts, it is cost prohibitive for many enterprises. Here we will take a look at different model options and the cost associated with the different prompt optimization strategies. To explore more complex workflows using the DSPy framework, we compare the simplest module with a slightly more complex module that implements the technique Chain of Thought (CoT).
Through this case study, we’ll show that you can use DSPy to beat GPT-4 performance on a specific task with 10x lower cost per table.
Problem Statement
Given a table, we want to extract all entities and format them as a list of JSON dictionaries.
For these examples, we use generated tables with consistent formatting. In production, there are unique table formats and PDF parsing mistakes that make it more challenging for the LLM to extract entities accurately.
Example table:
Modeled on a real-world financial services client use case, the tables we used in this case study consist of a mix of companies and people and can include information such as investment amount, location, DOB (if the entity is a person), and registration number (if it is a company).
Desired output:
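To give a rough sense of the shape of the problem (the actual tables and outputs in our evaluation are generated and more varied, so treat the values below as purely illustrative):

```python
# Illustrative input table (after parsing) and the desired structured output.
sample_table = """
| Name       | Type    | Location | Investment Amount | DOB        | Registration Number |
|------------|---------|----------|-------------------|------------|---------------------|
| Acme Corp  | Company | New York | $1,000,000        |            | 12-345678           |
| Jane Smith | Person  | London   | $250,000          | 1980-01-01 |                     |
"""

desired_output = [
    {"name": "Acme Corp", "type": "Company", "location": "New York",
     "investment_amount": "$1,000,000", "dob": None, "registration_number": "12-345678"},
    {"name": "Jane Smith", "type": "Person", "location": "London",
     "investment_amount": "$250,000", "dob": "1980-01-01", "registration_number": None},
]
```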
Approaches
Approach #1: Manual Optimization
The most basic approach is to manually optimize the prompt using various prompt engineering strategies. We chose codellama based on specific size criteria and initial performance. When optimizing, we largely followed a cycle of crafting a starting prompt → evaluating against golden samples → iterating with different instructions or prompt structure → evaluating again, and so on.
In-context learning greatly improved the model’s accuracy for this use case. A few different strategies we tried for presenting an example table and output to the LLM included the following:
Table structure – using Markdown or XML to structure the table within the prompt
Output structure – how to represent empty or null values
Approach #2: DSPy BootstrapFewShot Optimization
As the next level following manual optimization, we introduced DSPy into the process.
DSPy Optimizer
In this use case, we had a limited golden dataset available, so we went with the simple optimizer, BootstrapFewShot, which only needs about 10 samples and makes ~10 LLM inference API calls. It uses your pipeline to generate complete demonstrations of your program, including the question and correct answer. It will simply use the generated demonstrations (if they pass the provided metric) without any further optimization.
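Concretely, the setup for this use case is short. A sketch, with a signature and wrapper module named by us (field_exact_match is the field-level metric described under Results below):

```python
class TableToEntities(dspy.Signature):
    """Extract all entities from the table as a list of JSON dictionaries."""
    table = dspy.InputField(desc="a parsed table containing companies and people")
    entities = dspy.OutputField(desc="a JSON list with one dictionary per entity")

class ExtractEntities(dspy.Module):
    def __init__(self):
        super().__init__()
        self.extract = dspy.Predict(TableToEntities)

    def forward(self, table):
        return self.extract(table=table)

# table_trainset: ~10 golden dspy.Example(table=..., entities=...) pairs.
optimizer = BootstrapFewShot(metric=field_exact_match, max_bootstrapped_demos=4)
compiled_extractor = optimizer.compile(ExtractEntities(), trainset=table_trainset)
```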
Prompt Design
The prompt generated from DSPy using BootstrapFewShot includes three parts:
Templated intro explaining the signature
Base prompt
Dataset examples chosen by optimizer
See Appendix for full prompt example.
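If you want to see the compiled prompt yourself (templated intro, base prompt, and chosen demonstrations), DSPy can print the most recent call sent to the model; in the 2.x API this is a method on the LM client:

```python
# `compiled_extractor`, `sample_table`, and `lm` come from the earlier sketches.
compiled_extractor(table=sample_table)
lm.inspect_history(n=1)  # prints the last prompt/response pair
```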
Approach #3: DSPy BootstrapFewShot Optimization with Chain of Thought (CoT)
"Chain of thought" (CoT) is a problem-solving technique used to break down complex problems into simpler steps or thoughts, allowing a clearer and more detailed approach to reaching a solution. DSPy provides a pre-built CoT module that can easily be added into the workflow – we used the default settings in this case study, but all the tunable parameters can be found in the documentation here.
DSPy Optimizer
With the default CoT module, the optimizer runs two evaluations per training example: one that prompts the model using CoT and one that does not.
Prompt Design
The signature that includes CoT is question → reasoning → answer, which results in a different prompt intro from Approach #2 (without CoT).
The base prompt and dataset examples chosen by the optimizer are still included afterwards.
Results
Accuracy Comparison
To capture the incremental gains through optimization, we use exact match per entry of an entity in the JSON dictionary to evaluate the accuracies. For example, if the model gets location, name, and type of one entity correct, it would be +3 out of a possible 6 fields. Then, an overall accuracy for each approach is calculated as the total % of fields correct over 20 eval samples.
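As a sketch of this scoring scheme (our own helper, not a DSPy built-in; it aligns entities by position, which is a simplification):

```python
import json

FIELDS = ["name", "type", "location", "investment_amount", "dob", "registration_number"]

def field_exact_match(example, pred, trace=None):
    """Fraction of gold fields reproduced exactly by the prediction."""
    try:
        gold = json.loads(example.entities)
        predicted = json.loads(pred.entities)
    except (ValueError, TypeError):
        return 0.0  # unparseable model output counts as zero fields correct

    total = len(gold) * len(FIELDS)
    correct = 0
    for i, gold_entity in enumerate(gold):
        pred_entity = predicted[i] if i < len(predicted) else {}
        correct += sum(gold_entity.get(f) == pred_entity.get(f) for f in FIELDS)
    return correct / max(total, 1)
```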
Compared to a short base prompt containing only instructions and no optimizations, we saw a +4% improvement in accuracy with manual optimization (Approach #1) and an additional +12% when using the prompt DSPy compiled.
Is There a Better Model to Use?
DSPy allows for compiling prompts per model with minimal additional effort, so we also compared gpt-3.5-turbo and mixtral-8x7B. We added unoptimized gpt-4 for comparison, as it performs well out of the box but is a costly model to use.
We tried out CoT to test DSPy’s more complex modules, but this use case does not require advanced multistep logic and does not appear to benefit from the longer prompt and more verbose outputs.
Cost Comparison
The total number of tokens used is the primary driver of total cost for each approach. Request size varies greatly depending on the size of the table. The data below reflects an average based on our use case.
In terms of inference cost, OpenAI charges separate rates for input and output tokens. mixtral-8x7B-instruct and codellama-13b can be accessed through several inference platforms, including Gradient for mixtral, so we calculated cost using the cheapest option at time of writing.
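Per-table cost is then simple arithmetic over average token counts and the per-token rates of whichever endpoint you pick; the numbers below are placeholders rather than our measured values:

```python
# Hypothetical averages and prices, for illustration only (USD per 1K tokens).
avg_input_tokens, avg_output_tokens = 2000, 400
input_rate_per_1k, output_rate_per_1k = 0.0005, 0.0015

cost_per_table = (avg_input_tokens / 1000) * input_rate_per_1k \
               + (avg_output_tokens / 1000) * output_rate_per_1k
print(f"~${cost_per_table:.4f} per table")
```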
Prompt optimization and evals using DSPy cost a total of less than $0.50 for each model tested – a low cost in exchange for the ability to improve performance and select the best model for our requirements.
Conclusion
From a performance perspective, DSPy allowed us to beat GPT-4 performance on this task of extracting data from messy tables, at 10x lower cost per table and with 10x less manual effort. This is the greatest advantage of using DSPy. Overall, the experimentation with DSPy cost less than $2 and enables us to use a cheaper model at lower cost over the long term.
From an implementation perspective, DSPy is a great tool for exploring different techniques (CoT, RAG, etc.) and models to find what is right for each use case and budget, by making it extremely easy to test different combinations in a reproducible way. In this particular case study, with few golden samples, BootstrapFewShot improved the response format of the model and ensured the model response only included the necessary text. While some of these prompting techniques could be gleaned and applied manually outside the framework, DSPy simplifies the code necessary to do so and does the thinking for you.
All AI use cases require refining a prompt, regardless of how good the model is. At Gradient, we’ve incorporated DSPy into our in-house prompt optimization tooling to maximize the efficiency and quality of our AI solutions.
Appendix
Example of full DSPy-generated prompt