The Haystack Matters for NIAH Evals

May 2, 2024

Leo Pekelis, Chief Scientist at Gradient

Evals are often used to help provide a sense of quality when it comes to models. With the recent release of our 1M context window Llama 3 8B model, we take a deeper look into one of the most popular evals – Needle in a Haystack. Take a look at why haystack diversity matters and the significant impact that haystack composition has on this eval.

Long Context Models

Update 5/10/2024: Our team is constantly working on improving our evals for long context. Check out our latest post on RULER - providing a more comprehensive view of quality on top of NIAH.

The release of Llama 3 just a couple of weeks ago has already generated many exciting fine-tunes and adaptations. Iterating through all of them would require a series of blog posts all of its own!

For us at Gradient, one particularly relevant thread has been extending Llama 3 models to handle long context. Llama 3 models were trained on a default context length of 8k tokens. For reference, that’s about 6,000 words, or 10 pages (12 pt font, single spaced). While this is plenty for typical workloads, enterprise solutions often call for considerably longer context (e.g. financial institutions routinely ingest company 10-K reports at 150K+ words each).

To unlock long context use cases, we recently published an extension of Llama 3 to over 1 million tokens. This lets Llama 3 hold the first 5 books of the Harry Potter series, and then some! We achieved this using our partner Crusoe’s high performance, high capacity L40S cluster.

We didn’t do this in a vacuum. A number of other folks have also been working in the same direction, and we wouldn’t have had as much impact without their contributions and feedback. Notably Wing Lian, Zhang Peiyuan, and the BAIR lab have all released long context models, greatly benefitting the open source community.

Needles in Haystacks

But how do you compare the long context models out there? It’s quickly becoming standard to produce “green rug” plots of a needle in a haystack (NIAH) evaluation. In short, you hide a needle of content (e.g. an 8-digit number) at various depths (D) in a haystack of content up to a certain cutoff length (L), and record the % of the time the model is able to accurately retrieve the needle when queried.
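To make the procedure concrete, here is a minimal sketch of a single (L, D) cell of the eval. It is not the harness we used: whitespace-separated words stand in for tokens, the needle sentence and question wording are made up for illustration, and `build_prompt`, `retrieval_rate`, and `toy_model` are hypothetical names. Any callable that maps a prompt string to an answer string can be passed as `model`.

```python
import random
import re


def make_needle(rng):
    """Return a random 8-digit number as a string."""
    return str(rng.randint(10_000_000, 99_999_999))


def build_prompt(hay_words, needle, length, depth):
    """Cut the hay to `length` words and hide the needle at relative depth in [0, 1].

    Assumption: words stand in for tokens; a real harness would use the
    model's tokenizer to measure L.
    """
    hay = hay_words[:length]
    pos = int(round(depth * len(hay)))
    needle_words = f"The magic number is {needle}.".split()
    words = hay[:pos] + needle_words + hay[pos:]
    return " ".join(words) + "\nWhat is the magic number?"


def retrieval_rate(model, hay_words, length, depth, trials=5, seed=0):
    """Fraction of trials in which the model's answer contains the needle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        needle = make_needle(rng)
        answer = model(build_prompt(hay_words, needle, length, depth))
        hits += needle in answer
    return hits / trials


def toy_model(prompt):
    """Stand-in for a real model: returns the first 8-digit number it sees."""
    match = re.search(r"\b\d{8}\b", prompt)
    return match.group(0) if match else ""
```

Running `retrieval_rate` over a grid of L and D values gives the numbers that get colored into the rug plot, one box per (L, D) pair.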

But as Facebook Marketplace will quickly teach you, not all rugs are equal. In short, the “hay” that you make the rug out of definitely matters.

Not All (Green) Rugs are Equal

As a quick demo, I’m going to compare 3 different generator haystacks:

  1. The sentences: The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. (25 tokens)

  2. A set of Paul Graham essays (150k tokens)

  3. The contents of (2) plus the entire Project Gutenberg eBook of War and Peace (925k tokens)

What’s important to notice is that none of the 3 haystacks are long enough by themselves to create a haystack up to L=1M context length. Instantiating the haystack for any L is done by repeating the generator haystack.
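As a rough sketch of what that repetition looks like, the snippet below tiles a generator haystack until it reaches the requested length. The function name `instantiate_haystack` is hypothetical, and the token counts are the approximate figures quoted in the list above, not exact tokenizer output.

```python
import math
from itertools import cycle, islice


def instantiate_haystack(generator_tokens, length):
    """Repeat the generator haystack end to end and cut it at `length` tokens."""
    if not generator_tokens:
        raise ValueError("generator haystack must not be empty")
    return list(islice(cycle(generator_tokens), length))


def copies_needed(generator_len, length):
    """Number of (possibly partial) copies of the generator needed to fill `length`."""
    return math.ceil(length / generator_len)


# Approximate generator sizes from the post, filling L = 1M tokens.
target = 1_000_000
print(copies_needed(25, target))       # 40000 copies of the five sentences
print(copies_needed(150_000, target))  # 7 copies of the Paul Graham essays
print(copies_needed(925_000, target))  # 2 copies of essays + War and Peace
```

The gap is stark: at L=1M the first haystack is the same five sentences seen forty thousand times, while the third barely repeats at all.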

You might’ve guessed where this is going. It should be easier for a model to pick a needle out of a haystack when all the hay looks more uniform. And that’s exactly what happens.

Here are green rug plots using each haystack to eval our 1M context length Llama 3 model. I only evaluate cutoff lengths L > 780K because the model gets perfect scores at every depth D for all shorter L. Haystack1 is easier than Haystack2, which is easier than Haystack3.

Putting it all together, I plot the generator haystack token count vs perfect recall %, or the percentage of (L,D) boxes where the model is able to achieve 100% retrieval accuracy. The x-axis isn’t perfectly informative – the three haystacks also differ in number of unique words, structure, and thematic length – but serves to illustrate the point. The more complex and diverse the generator haystack, the more difficult the benchmark.
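For readers who want to compute the same summary from their own rug plot, here is a small sketch. It assumes the metric is the share of (L, D) boxes with exactly 100% retrieval accuracy; the function name `perfect_recall_pct` and the grid values are invented for illustration and are not results from our model.

```python
def perfect_recall_pct(grid):
    """Percentage of (L, D) boxes whose retrieval accuracy is exactly 1.0.

    `grid` maps (cutoff_length, depth) -> accuracy in [0, 1].
    """
    if not grid:
        raise ValueError("grid must contain at least one (L, D) box")
    perfect = sum(1 for accuracy in grid.values() if accuracy == 1.0)
    return 100.0 * perfect / len(grid)


# Made-up example grid: two cutoff lengths, two depths.
example_grid = {
    (800_000, 0.0): 1.0,
    (800_000, 0.5): 1.0,
    (1_000_000, 0.0): 1.0,
    (1_000_000, 0.5): 0.8,
}
print(perfect_recall_pct(example_grid))  # 75.0
```

Counting only perfect boxes is a strict summary: a box at 80% accuracy counts the same as one at 0%, so it is worth reading alongside the full rug plot.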

More Representative Haystacks

At Gradient, we feel that more complex generator haystacks are closer to real world use cases for long context models, providing a more accurate representation of how the model would behave for a practical use case. That’s why we have been using the Paul Graham essays haystack (#2) for our published evals. In addition, in the spirit of transparency, we have updated our Hugging Face model card with results from all three haystack generators and specified the remaining configs. I earnestly encourage the rest of the community to do the same.

Shoutouts

In addition to the killer team at Gradient, I want to thank Wing Lian and Ethan Peterson for the fruitful discussions that led to this post!

About Gradient

Gradient is a full stack AI platform that enables businesses to build customized agents to power enterprise workloads. By leveraging Gradient AI Foundry, you’ll be able to develop custom-tailored solutions using the most comprehensive solution for AI transformation.

© 2024 Gradient. All rights reserved.
