RULER vs. Gradient’s 1M Context Length Llama-3 70B

May 27, 2024

Leo Pekelis, Chief Scientist at Gradient

Recently we shared some thoughts around how we evaluate long context models, including a deep dive into NIAH and RULER. Join us as we dive deeper into the RULER evals and show why variation in the instruction prompt should be taken into account when evaluating performance.

A Deeper Look Into RULER

A couple of weeks ago, we shared some thoughts on how we’re evaluating the quality of our long context models. More comprehensive and systematic evaluation methods, like RULER, are critical to achieving production-grade performance on real-world use cases. Using RULER, we benchmarked the progress of our 8B model, putting it in 7th place overall and 1st place among models with fewer than 30B parameters.

Today we’ll take it a bit further, as we bring back RULER and apply it to our 1M context length 70B model. While we’re at it, we’ll double click into the RULER evals themselves and show why variation in the instruction prompt should be taken into account when evaluating performance.

A New Challenger: Gradient's 1M Context Length Llama-3 70B

In the past week, we’ve had the opportunity to apply RULER to our 1M context length 70B model, and the results are in: Gradient’s 70B model ranks 4th overall, just behind Gemini-1.5-pro, GPT-4-1106-preview, and Command-R-plus.

Our goal was simple – to evaluate our 70B model more comprehensively. To do that, we took extra time to evaluate the full set of RULER benchmarks over 500 samples. We also compared our models on wAvg (dec), the weighted average score across the 13 benchmark tasks and context lengths up to 128k, with weights decreasing linearly as context length increases. Both the 500-sample count and the wAvg (dec) metric match how the official leaderboard ranks long context models.
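For concreteness, here’s a minimal Python sketch of a linearly decreasing weighted average over context lengths; the scores and the exact weighting scheme are illustrative assumptions on our part, not the official leaderboard implementation.

```python
import numpy as np

# Hypothetical per-context-length scores for one model, already averaged
# over the 13 RULER tasks. The numbers are illustrative only.
context_lengths = np.array([4_000, 8_000, 16_000, 32_000, 64_000, 128_000])
scores = np.array([96.5, 95.8, 96.3, 95.0, 92.8, 88.5])

def weighted_avg_dec(scores: np.ndarray) -> float:
    """Weighted average with weights decreasing linearly as context grows.

    The shortest context gets weight L and the longest gets weight 1, so
    long-context scores are discounted. This is one plausible reading of
    wAvg (dec); the leaderboard's exact weights may differ.
    """
    L = len(scores)
    weights = np.arange(L, 0, -1)  # [L, L-1, ..., 1]
    return float(np.average(scores, weights=weights))

print(f"wAvg (dec): {weighted_avg_dec(scores):.2f}")
```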

Our augmented Llama-3 70B model* improved its RULER score by 12.7 points, putting it in 4th place, just behind Command-R-plus. This is especially notable considering we trained on only 1.4B tokens in total, < 0.01% of Llama-3’s original training data. It is evidence that context length extension is a form of transfer learning when the rotary embeddings are chosen intentionally.

The model’s particular strengths are retrieval, with only 13% degradation between its 4k and 128k performance, and Q&A, with 22% degradation.

*Long-context, Continued Pre-training, and Alignment of Llama3 70B

The Prompt Strikes Back

However, this story is not over. While digging into RULER results, we found that changes in the instruction prompt can have a significant impact on eval scores.

The following table shows how scores change when we run the CWE task – extracting the most common words in the prompt – on our 70B 1048k model and vary the prompt in two ways:

  1. Removing newline characters

  2. Moving instructions from the beginning of the prompt to the end

Scores are for the CWE task run on Gradient-70b-1048k, averaged over context lengths.
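To make the two variations concrete, here is a rough sketch of a CWE-style prompt before and after the restructuring; the wording and the word_list placeholder are hypothetical, not the verbatim RULER template.

```python
# Hypothetical CWE-style prompt templates (not the verbatim RULER wording).
word_list = "apple banana apple cherry apple banana ..."  # long word list elided

# Baseline: instructions at the beginning, sections separated by newlines.
prompt_instructions_first = (
    "Below is a long list of words.\n"
    "Memorize the words, then answer the question at the end.\n\n"
    f"{word_list}\n\n"
    "Question: What are the 10 most common words in the list above?"
)

# Variant: newlines removed and the instructions moved to the end.
prompt_instructions_last = (
    f"Here is a long list of words: {word_list} "
    "Question: What are the 10 most common words in the list above?"
)
```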

Updating the prompt structure raises the task score by a whopping 10 points. We are not the first to showcase Llama-3’s particular preferences for how you communicate with it, and it’s not surprising that LLMs perform better when prompts stylistically match their instruction tuning. But a question remains: for eval results, which number should you trust?

Here’s a short statistical argument we came up with for why the answer is “probably all of them.” And for those of you who like equations, don’t worry, we included them in the appendix for you.

Phantom Statistics

When running an eval, there are broadly 3 choices to make – the model, the test data, and the prompt. This defines a tuple (M, X, π), and we consider a setup where an evaluator chooses two models to compare, M1 and M2, while behind the scenes a probability model generates the test data X at random and draws the prompt π from a prior distribution g.

We’ll model the score’s dependence on the prompt as a linear decomposition of the average score, with variation around it coming from randomness in the test data, denoted σ²:

S_X(M, π) ∼ (μ_M + μ_π + μ_{M,π}, σ²).
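As a purely illustrative sketch, here’s what that generative model looks like in Python; the normal draw and the specific effect sizes are assumptions for demonstration, since the argument only relies on the mean and variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def eval_score(mu_model, mu_prompt, mu_interaction, sigma, rng):
    """One realization of the eval score S_X(M, pi).

    Its mean decomposes as mu_M + mu_pi + mu_{M,pi}; sigma^2 is the
    variance contributed by the randomly sampled test data X. The normal
    draw is a convenient stand-in for that randomness.
    """
    mean = mu_model + mu_prompt + mu_interaction
    return rng.normal(mean, sigma)

# Made-up effect sizes: a model effect, a prompt effect, and their interaction.
score = eval_score(mu_model=80.0, mu_prompt=1.5, mu_interaction=-0.5,
                   sigma=0.5, rng=rng)
print(f"S_X(M, pi) = {score:.2f}")
```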

Working through some equations (see the appendix), and appealing to the Central Limit Theorem, we find that for large samples, the probability that model 1 ranks higher than model 2 looks like

P_{X,π}(S(M1) > S(M2)) ≈ P_π(δ_M + δ_{M,π} > 0),

where δ_M = μ_{M1} − μ_{M2} and δ_{M,π} = μ_{M1,π} − μ_{M2,π}. In words, the chance that model 1 beats model 2 depends on how much the prompt impacts the eval score, averaged over prompts. If the interaction delta is small, δ_{M,π} ≈ 0, then the model with the largest average score wins, which is exactly what you’d expect: P_{X,π}(S(M1) > S(M2)) ≈ 1_{δ_M > 0}.

But what if the model with the better average score is also more susceptible to prompt variation? A stylized example where model 1 is better on average but performs worse on 50% of prompts is shown in the chart below. Since we are picking a prompt at random, the ranking of the two models is effectively a coin toss: P_{X,π}(S(M1) > S(M2)) → 0.5!
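Here’s a quick Monte Carlo sketch of that stylized example, with made-up numbers: model 1 is 2 points better on average, but on half of the prompts the interaction term swings 6 points against it, and its head-to-head win rate drops to roughly a coin flip.

```python
import numpy as np

rng = np.random.default_rng(42)

N_TRIALS = 20_000   # number of (prompt, test data) draws to average over

DELTA_M = 2.0       # model 1 is 2 points better on average (made-up)
SWING = 6.0         # interaction size: half the prompts favor model 2
GAP_NOISE_SD = 0.7  # noise in the observed gap S(M1) - S(M2); small because
                    # each score averages over many test samples

wins = 0
for _ in range(N_TRIALS):
    # Draw a prompt from the prior g: delta_{M,pi} is +SWING on half the
    # prompts and -SWING on the other half, so it averages to zero.
    delta_interaction = SWING if rng.random() < 0.5 else -SWING

    # Observed gap = true mean gap + test-data noise.
    observed_gap = rng.normal(DELTA_M + delta_interaction, GAP_NOISE_SD)
    wins += int(observed_gap > 0)

print(f"P(S(M1) > S(M2)) ~ {wins / N_TRIALS:.3f}")  # close to 0.5
```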

The Rise of Prompt Aware Evals

The reality is somewhere in between these two extremes, but they highlight why picking a single prompt to run evals with can be misleading. There are at least two alternative strategies for ranking models that take prompt variation into account:

  • max-max: Choose the best scoring model after picking the best prompt for each model.

  • max-avg: Choose the best scoring model after averaging over prompts.

Max-max scores the model after allowing for some prompt optimization, while max-avg requires prompt robustness from the model. Which method to use depends on the situation, and that is the key takeaway of this analysis: one-number summaries are rarely enough to pick the right model for the job.
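Given a matrix of scores indexed by (model, prompt), both strategies are a few lines of code; the numbers below are made up purely to show how the two rankings can disagree.

```python
import numpy as np

# Rows = models, columns = prompt variants. Illustrative numbers only.
models = ["model_A", "model_B"]
scores = np.array([
    [84.0, 62.0, 85.0],   # model_A: strong on two prompts, weak on one
    [78.0, 77.0, 79.0],   # model_B: consistent across prompts
])

# max-max: give each model its best prompt, then compare.
max_max_winner = models[int(np.argmax(scores.max(axis=1)))]

# max-avg: average each model over prompts, then compare.
max_avg_winner = models[int(np.argmax(scores.mean(axis=1)))]

print(f"max-max picks {max_max_winner}")  # rewards prompt optimization -> model_A
print(f"max-avg picks {max_avg_winner}")  # rewards prompt robustness  -> model_B
```

On these made-up numbers the two strategies disagree, which is exactly the situation where a single summary number hides something important.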

Besides serving as a yardstick against the rest of the field, the real benefit of evals is the direction they give on how to improve. Like an effective objective function, a well-crafted eval highlights the next step in optimizing a model. We hope our work continues to push the community forward, and definitely stay tuned for more long-context content!

Thanks

A big thanks to the team at NVIDIA for putting together the RULER evals, and to Crusoe for the compute used to run them on our models.

Appendix

Here’s a proof of the result that

P_{X,π}(S(M1) > S(M2)) ≈ P_π(δ_M + δ_{M,π} > 0).

Under pretty general conditions, the score, which averages over N test samples, has approximately a normal distribution for large samples,

S_X(M, π) ≈ N(μ_M + μ_π + μ_{M,π}, σ²), with σ² shrinking like 1/N,

so that, conditional on the prompt π, the observed gap S_X(M1, π) − S_X(M2, π) is approximately normal with mean δ_M + δ_{M,π}, and

P_X(S(M1) > S(M2) | π) ≈ Φ̄(−(δ_M + δ_{M,π}) / τ),

where τ is the standard deviation of that gap (for example, τ = σ√2 if the two evals see independent test noise) and Φ̄ is the right tail of the standard normal CDF. As N gets large, τ → 0 and the normal CDF tends to an indicator function, giving the result

P_{X,π}(S(M1) > S(M2)) ≈ E_π[1_{δ_M + δ_{M,π} > 0}] = P_π(δ_M + δ_{M,π} > 0).