Achieving GPT-4 Level Performance At Lower Cost Using DSPy

May 14, 2024

Erin Mccarthy, ML Engineer at Gradient

Data out in the wild can be messy and inconsistently formatted. While wrangling this data with LLMs has shown promise, out-the-box models have pain points: GPT-4 has decent accuracy but is expensive at scale, and lower-cost models are too inaccurate without time-intensive manual prompt tuning. We showcase DSPy for prompt optimization, achieving better than GPT-4 accuracy at 10x lower cost and substantially lower manual investment.

Overview

As companies are trying to leverage LLMs in production workflows more widely, multiple frameworks have arisen over the last year to help create abstractions for interacting with LLMs and build AI systems more programmatically – DSPy, LMQL, Outlines, and more.

At Gradient, we’ve learned that full stack agentic workflow automation systems can be quite brittle and onerous to iterate on without well-engineered abstractions. To create more robust AI systems, we explored several open-source frameworks that caught our eye in addition to building our own in-house abstractions.

In this blog post, we share a case study to demonstrate the following learnings from our deep dive into DSPy.

  • You can use DSPy to beat GPT-4 performance on a specific task, at 10x lower cost.

  • The main improvement from DSPy is structured formatting of the prompt, leading to better formatting of the response. This both increases accuracy and reduces response size (lowering cost).

Background on DSPy

DSPy is a flexible framework for optimizing LLM prompts, containing building blocks for more complex workflows. Each DSPy program is expressed in Python and involves a series of transformation steps, beginning with an input (e.g. a question) and culminating in an output (e.g. an answer).

The general flow of DSPy resembles the following.

  1. Signature

    1. Defines the structure of your data and how the module handles it; the signature is also included as part of the prompt.

    2. Create a dataset that consists of example input-output pairs, with a signature that abstracts the input/output behavior of a module – for example, defining a question and an answer for the “question → answer” signature.

  2. Module Design

    1. Create or use existing modules to represent the dataflow of your program. This can include simple modules or more complex flows, such as Chain of Thought (CoT).

  3. Optimizer

    1. Optimizes all modules in the pipeline to maximize a metric.

    2. Depending on the amount of data available to train on, select an appropriate optimizer from BootstrapFewShot, BootstrapFewShotWithRandomSearch, and MIPRO. See DSPy’s documentation for more details.

  4. Evaluation

    1. Create a new metric or use a pre-defined one, such as exact match or F1, to be used with the optimizer.

DSPy composes these four steps into a few lines of code, in addition to allowing you to specify the model endpoint. At Gradient, we have been using DSPy to great success to drastically improve the efficiency of our agentic solutions. In the next section, we’ll describe a recent use case.

Case Study

In this case study, we compare manual optimization with DSPy on a task where the data is complex and noisy. This is a challenge we have had to solve while building full stack agentic systems, particularly for the financial services industry.

While SOTA GPT-4 has demonstrated strong capabilities with simple prompts, it is cost prohibitive for many enterprises. Here we will take a look at different model options and the cost associated with the different prompt optimization strategies. To explore more complex workflows using the DSPy framework, we compare the simplest module with a slightly more complex module that implements the technique Chain of Thought (CoT).

Through this case study, we’ll show that you can use DSPy to beat GPT-4 performance on a specific task with 10x lower cost per table.

Problem Statement

Given a table, we want to extract all entities and format them as a list of JSON dictionaries.

For these examples, we use generated tables with consistent formatting. In production, there are unique table formats and PDF parsing mistakes that make it more challenging for the LLM to extract entities accurately.

Example table:

Modeled off a real-world financial services client use case, the tables we used in this case study consist of a mix of companies and people and can include information like investment amount, location, DOB (if a person), and registration number (if a company).

| ID | Profile                        | Location       | Registration Number | Capital Invested | Share % | Status Updates         | Role          |
|----|--------------------------------|----------------|---------------------|------------------|---------|------------------------|---------------|
| 1  | Gina Linetti, born October 31, 1992 | Denver      | None                | 500            | 16.13%  | New partnership        | Co-founder    |
| 2  | Hank Moody                     | Seattle        | None                | 100,000        | 14.74%  | No changes             | Advisor       |
| 3  | Alice Johnson, born July 1, 1985 | Los Angeles   | None                | 10,000         | 23.19%  | Expanded operations    | Investor      |
| 4  | Orion Innovations              | Austin         | REG234567           | 5,000          | 14.38%  | Expanded operations    | Co-founder    |
| 5  | Diana Prince, born April 10, 1980 | New York     | None                | 200,000        | 19.87%  | No changes             | Board member  |
| 6  | Bob Smith, born May 22, 1975   | Chicago        | None                | 100            | 11.68%  | Changed job            | Co-founder    |

Desired output:

[
    {
        "name": "Gina Linetti",
        "type": "person",
        "date_of_birth": "October 31, 1992",
        "location": "Denver",
        "registration_number": "",
        "percentage": "16.13%"
    },
    {
        "name": "Hank Moody",
        "type": "person",
        "date_of_birth": "", 
        "location": "Seattle",
        "registration_number": "",
        "percentage": "14.74%"
    },
    {
        "name": "Alice Johnson",
        "type": "person",
        "date_of_birth": "July 1, 1985",
        "location": "Los Angeles",
        "registration_number": "",
        "percentage": "23.19%"
    },
    {
        "name": "Orion Innovations",
        "type": "company",
        "date_of_birth": "",
        "location": "Austin",
        "registration_number": "REG234567",
        "percentage": "14.38%"
    },
    {
        "name": "Diana Prince",
        "type": "person",
        "date_of_birth": "April 10, 1980",
        "location": "New York",
        "registration_number": "",
        "percentage": "19.87%"
    },
    {
        "name": "Bob Smith",
        "type": "person",
        "date_of_birth": "May 22, 1975",
        "location": "Chicago",
        "registration_number": "",
        "percentage": "11.68%"
    }
]
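In practice, a model response also needs to be checked against this shape before use. A simple validator might look like the following (illustrative only; the helper name, key set taken from the example above, and error handling are our own):

```python
import json

# Expected keys per entity, taken from the desired output example above.
EXPECTED_KEYS = {"name", "type", "date_of_birth", "location",
                 "registration_number", "percentage"}

def parse_entities(response: str) -> list:
    """Parse a model response into entity dicts, validating the schema."""
    entities = json.loads(response)
    if not isinstance(entities, list):
        raise ValueError("expected a JSON list of entities")
    for entity in entities:
        if set(entity) != EXPECTED_KEYS:
            raise ValueError(f"unexpected keys: {sorted(entity)}")
        if entity["type"] not in ("person", "company"):
            raise ValueError(f"unknown entity type: {entity['type']}")
    return entities
```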

Approaches

Approach #1: Manual Optimization

The most basic approach is to manually optimize the prompt using various prompt engineering strategies. We chose codellama based on specific size criteria and initial performance. When optimizing, we largely followed a cycle: craft a starting prompt → evaluate against golden samples → iterate with different instructions or prompt structure → evaluate again, and so on.

In-context learning greatly improved the model’s accuracy for this use case. A few different strategies we tried for presenting an example table and output to the LLM included the following:

  • Table structure – using Markdown or XML to structure the table within the prompt

  • Output structure – how to represent empty or null values


Approach #2: DSPy BootstrapFewShot Optimization

As the next level following manual optimization, we introduced DSPy into the process.

DSPy Optimizer

In this use case, we had a limited golden dataset available, so we went with the simple optimizer, BootstrapFewShot, which only needs about 10 samples and makes ~10 LLM inference API calls. It uses your pipeline to generate complete demonstrations of your program, including the question and correct answer. It will simply use the generated demonstrations (if they pass the provided metric) without any further optimization.
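Independent of DSPy's internals, the bootstrapping idea can be sketched in a few lines (a deliberate simplification for intuition, not DSPy's actual implementation):

```python
# Run the program over training examples and keep only demonstrations whose
# output passes the metric; the kept demos become few-shot examples.
def bootstrap_demos(program, trainset, metric, max_demos=10):
    demos = []
    for example in trainset:
        prediction = program(example["question"])
        if metric(example, prediction):
            demos.append({"question": example["question"],
                          "answer": prediction})
        if len(demos) >= max_demos:
            break
    return demos
```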

Prompt Design

The prompt generated from DSPy using BootstrapFewShot includes three parts:

  1. Templated intro explaining the signature

Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: ${question}
Answer: ${answer}

  2. Base prompt

  3. Dataset examples chosen by the optimizer

See Appendix for full prompt example.


Approach #3: DSPy BootstrapFewShot Optimization with Chain of Thought (CoT)

"Chain of thought" (CoT) is a problem-solving technique that breaks complex problems down into simpler steps or thoughts, allowing a clearer and more detailed approach to reaching a solution. DSPy provides a pre-built CoT module that can easily be added into the workflow – we used the default settings in this case study, but all the tunable parameters can be found in the documentation.

DSPy Optimizer

With the default CoT module, the optimizer runs two evals per training example: one prompting the model using CoT and one without.

Prompt Design

The signature that includes CoT is question → reasoning → answer, which results in a different prompt intro from Approach #2 (without CoT):

Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: ${answer}

The base prompt and dataset examples chosen by the optimizer are still included afterwards.

Results

Accuracy Comparison

To capture the incremental gains through optimization, we use exact match per entry of an entity in the JSON dictionary to evaluate the accuracies. For example, if the model gets location, name, and type of one entity correct, it would be +3 out of a possible 6 fields. Then, an overall accuracy for each approach is calculated as the total % of fields correct over 20 eval samples.
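The field-level exact-match metric described above can be computed with a small helper (illustrative; the function name and the choice to pair gold and predicted entities by index are our assumptions):

```python
# +1 for every field of every gold entity that the prediction matches
# exactly; accuracy is correct fields over total gold fields.
def field_accuracy(gold_entities, predicted_entities):
    total = sum(len(entity) for entity in gold_entities)
    correct = 0
    for i, gold in enumerate(gold_entities):
        pred = predicted_entities[i] if i < len(predicted_entities) else {}
        correct += sum(1 for key, value in gold.items()
                       if pred.get(key) == value)
    return correct / total if total else 0.0
```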

Compared to a short base prompt containing only instructions and no optimizations, we saw a +4% improvement in accuracy with manual optimization (Approach #1) and an additional +12% when using the prompt DSPy compiled.

Is There a Better Model to Use?

DSPy allows for compiling prompts per model with minimal additional effort, so we also compared gpt-3.5-turbo and mixtral-8x7B. We include unoptimized gpt-4 for comparison, as it performs well out of the box but is a costly model to use.

We tried out CoT to test DSPy’s more complex modules, but this use case does not require advanced multistep logic and does not appear to benefit from the longer prompt and more verbose outputs.

Cost Comparison

The total number of tokens used is the primary driver of total cost for each approach. Request size varies greatly depending on the size of the table. The data below reflects an average based on our use case.

In terms of inference cost, OpenAI charges separate rates for input and output tokens. mixtral-8x7B-instruct and codellama-13b can be accessed through several inference platforms, including Gradient for mixtral, so we calculated cost using the cheapest option at time of writing.
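Per-request cost then follows directly from the token counts and per-token rates (a sketch; the rates in the usage example below are placeholders, not the actual prices we compared):

```python
# Cost of one request given token counts and rates quoted per 1K tokens.
# Input and output tokens are priced separately, as with OpenAI's API.
def request_cost(input_tokens, output_tokens,
                 input_rate_per_1k, output_rate_per_1k):
    return (input_tokens / 1000) * input_rate_per_1k \
         + (output_tokens / 1000) * output_rate_per_1k
```

For example, a 2,000-token prompt with a 500-token response at hypothetical rates of $0.01 / $0.03 per 1K input/output tokens costs $0.035; shorter, better-formatted responses shrink the output term directly, which is one way DSPy's tighter formatting lowers cost.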

Prompt optimization and evals using DSPy cost a total of less than $0.50 for each model tested – a low cost in exchange for the ability to improve performance and select the best model for our requirements.

Conclusion

From a performance perspective, DSPy allowed us to beat GPT-4 performance on this task of extracting data from messy tables, at 10x lower cost per table and with 10x lower manual effort. This is the greatest advantage of using DSPy. Overall, the experimentation with DSPy cost less than $2 and enables using a cheaper model at lower cost over the long term.

From an implementation perspective, DSPy is a great tool for exploring different techniques (CoT, RAG, etc) and models to find what is right for each use case and budget, by making it extremely easy to test different combinations in a reproducible way. In this particular case study, with few golden samples, BootstrapFewShot improved the response format of the model and ensured the model response only included necessary response text. While some of the prompting techniques could be gleaned and utilized manually outside of the framework, DSPy simplifies the code necessary to do so and does the thinking for you.

All AI use cases require refining a prompt, regardless of how good the model is. At Gradient, we’ve incorporated DSPy into our in-house prompt optimization tooling to maximize the efficiency and quality of our AI solutions.

Appendix

Example of full DSPy-generated prompt

Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: ${question}
Answer: ${answer}

---

Question:
Extract the entities from the provided tables.
Return a response as a list of JSON objects, where each object represents an entity that must be a person or a company.
Each object should have the following keys:
- 'name': the name of a person or company
- 'type': either 'person' or 'company'
- 'date_of_birth': will only be present if the 'type' is 'person' otherwise 'None'
- 'location': city of residence of the person or company
- 'registration_number': will only be present if the 'type' is 'company' otherwise 'None'
- 'percentage': the percentage of ownership or number of shares owned by the person or company
Table: | Business Share Serial | Shareholder Info                                                             | Share Amount in  | Share Percentage | Share Changes        |
|------------------------|------------------------------------------------------------------------------|-------------------|------------------|----------------------|
| 1                      | Charlie Brown, born on May 22, 1975, residing in Los Angeles                 | 100000          | 0.07%            | Changed job          |
| 2                      | Bob Smith, born on July 1, 1985, residing in New York                       | 100             | 0.08%            | Changed address      |
| 3                      | Diana Prince, born on June 30, 1990, residing in San Francisco              | 1000            | 0.13%            | No changes           |
| 4                      | Alice Johnson, born on April 10, 1980, residing in Chicago                  | 10000           | 0.04%            | Updated contact info |

Answer: [{'name': 'Charlie Brown', 'type': 'person', 'date_of_birth': 'May 22, 1975', 'location': 'Los Angeles', 'registration_number': '', 'percentage': '0.07%'}, {'name': 'Bob Smith', 'type': 'person', 'date_of_birth': 'July 1, 1985', 'location': 'New York', 'registration_number': '', 'percentage': '0.08%'}, {'name': 'Diana Prince', 'type': 'person', 'date_of_birth': 'June 30, 1990', 'location': 'San Francisco', 'registration_number': '', 'percentage': '0.13%'}, {'name': 'Alice Johnson', 'type': 'person', 'date_of_birth': 'April 10, 1980', 'location': 'Chicago', 'registration_number': '', 'percentage': '0.04%'}]

---

Question:
Extract the entities from the provided tables.
Return a response as a list of JSON objects, where each object represents an entity that must be a person or a company.
Each object should have the following keys:
- 'name': the name of a person or company
- 'type': either 'person' or 'company'
- 'date_of_birth': will only be present if the 'type' is 'person' otherwise 'None'
- 'location': city of residence of the person or company
- 'registration_number': will only be present if the 'type' is 'company' otherwise 'None'
- 'percentage': the percentage of ownership or number of shares owned by the person or company
Table: | Business Share Serial | Shareholder Info                                                             | Share Amount in  | Share Percentage | Share Changes        |
|------------------------|------------------------------------------------------------------------------|-------------------|------------------|----------------------|
| 1                      | Charlie Brown, born on April 10, 1980, residing in Chicago                   | 100             | 1.4%             | Changed address      |
| 2                      | Bob Smith, born on July 1, 1985, residing in New York                       | 1000            | 0.3%             | Updated contact info |
| 3                      | Alice Johnson, born on May 22, 1975, residing in San Francisco              | 10000           | 0.56%            | No changes           |

Answer: [{'name': 'Charlie Brown', 'type': 'person', 'date_of_birth': 'April 10, 1980', 'location': 'Chicago', 'registration_number': '', 'percentage': '1.4%'}, {'name': 'Bob Smith', 'type': 'person', 'date_of_birth': 'July 1, 1985', 'location': 'New York', 'registration_number': '', 'percentage': '0.3%'}, {'name': 'Alice Johnson', 'type': 'person', 'date_of_birth': 'May 22, 1975', 'location': 'San Francisco', 'registration_number': '', 'percentage': '0.56%'}]

---
