RULER: Evaluating Models Beyond Needle-in-a-Haystack

May 10, 2024

Leo Pekelis, Chief Scientist at Gradient

With the recent release of our long context models, we’ve been taking a deeper look into model quality. Take a look at Gradient’s journey on evaluating long context models using de facto standards like NIAH and more comprehensive evaluation methods like RULER.

Rise of Long Context Models

Recently, long context models have been gaining popularity, due to continued advancements in deep learning architectures and enhanced hardware capabilities. While 16k context length was impressive just a year ago, today it is table stakes. The release of commercial models with long context (Anthropic Claude - 200k tokens, Google Gemini 1.5 Pro - 1M tokens), as well as numerous research papers, highlights the evolution and critical role of long context models in AI applications.

Working with partners like Crusoe, our team at Gradient was able to develop a range of long context models. Today we’re seeing a particular need for long context models in text modality, across a variety of our enterprise use cases including:

  • Generating code suggestions based on the context of an entire repository

  • Synthesizing nuanced investment analysis from company reports that span across time and industry sectors

  • Automating analysis of large sets of poorly structured tabular data

  • Generating legal analysis of a case using the historical precedent of previous court proceedings

For detail-critical tasks, where individual pieces of related information matter to the final output, typical RAG and summarization pipelines are often unsatisfying, and long context models show strong promise.

NIAH Evals - The De Facto Standard

While there’s an undeniable amount of interest in the emergence of long context models, there’s currently no established method to evaluate these models. As of today, the de facto standard amongst the community has been the use of NIAH (Needle-in-a-Haystack) - a method which embeds specific targeted information (the “needle”) within a larger, more complex body of text (the “haystack”). The objective is to assess a model’s ability to identify and utilize this specific piece of information amidst a vast amount of data.
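To make the setup concrete, below is a minimal sketch of a NIAH-style eval in Python. The needle, question, filler corpus, and `generate(prompt)` callable are all hypothetical stand-ins for illustration, not the harness we used for the results later in this post.

```python
# Minimal NIAH-style eval sketch (illustrative only). Assumes a `generate`
# callable that wraps the model under test; the needle and question below
# are hypothetical examples.

NEEDLE = "The magic number for the blue door is 7481."
QUESTION = "What is the magic number for the blue door?"
ANSWER = "7481"

def build_haystack(filler_sentences, needle, depth_fraction):
    """Insert the needle at a given relative depth within the filler text."""
    insert_at = int(len(filler_sentences) * depth_fraction)
    sentences = filler_sentences[:insert_at] + [needle] + filler_sentences[insert_at:]
    return " ".join(sentences)

def niah_score(generate, filler_sentences, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Fraction of insertion depths at which the model returns the needle's answer."""
    hits = 0
    for depth in depths:
        prompt = f"{build_haystack(filler_sentences, NEEDLE, depth)}\n\n{QUESTION}"
        if ANSWER in generate(prompt):
            hits += 1
    return hits / len(depths)
```

In practice, the same sweep is repeated across context lengths (by growing the filler text) and needle depths, which is what produces the familiar NIAH heatmap.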

Despite the impact of NIAH, the eval comes with its own set of nuances, which we recently documented in our blog post. Recognizing these gaps is crucial, and we view NIAH as a foundational stepping stone since it evaluates a primitive widely considered necessary for long in-context learning: induction heads.

If a long context model can’t get NIAH right, there is little hope it can achieve the ambitious use cases from our intro. The long context models that Gradient recently released (e.g. 1M Context Length Llama-3 70B and 1M Context Length Llama-3 8B) all achieved perfect scores on NIAH evals up to the context lengths they were trained on, and then some.

Still, the biggest challenge is that NIAH is far from a real world scenario. In this blog post, we describe the most interesting work we’ve seen to bridge this gap with more sophisticated evals, and share the results of the one we are most excited about - RULER - on our long context models.

Evolving the Way We Evaluate Long Context Models

We’re thrilled to see a lot of conversation and contribution from the community to advance the way we assess long context models. A non-comprehensive list of benchmarks we’ve had a chance to dive into includes:

  • ZeroSCROLLS: A zero-shot benchmark for natural language understanding over long texts, which contains only test and small validation sets, without training data. It adapts six tasks from the SCROLLS benchmark and adds four new datasets, including two novel information fusing tasks, such as aggregating the percentage of positive reviews.

  • BAMBOO: A multi-task long context benchmark that’s been designed with four principles: comprehensive capacity evaluation, avoidance of data contamination, accurate automatic evaluation, and different length levels. It consists of 10 datasets from 5 different long text understanding tasks to cover core capacities and various domains of LLMs.

  • InfiniteBench: Designed to push the boundaries of language models by testing them against a context length of 100k+ — 10x longer than traditional datasets.

  • LooGLE: Features relatively new documents, with human annotators meticulously crafting more than 1.1K high-quality question-answer pairs. These pairs underwent thorough cross-validation, yielding a precise assessment of LLMs' long dependency capabilities.

  • RULER: A synthetic benchmark of 13 tasks that expands on the vanilla NIAH test. The tasks encompass NIAH variations with diverse types and quantities of needles, as well as new task categories for multi-hop tracing, aggregation, and Q&A.

For the team at Gradient, the benchmark that stood out was RULER, which provides a more comprehensive evaluation of a model’s ability to perform across diverse types and quantities of data. Specifically, our team found that RULER:

  • Provided a systematic increase in complexity from NIAH

  • Intentionally measured different characteristics of long context

  • Presented the model with progressively more challenging tasks along controlled dimensions - making it easier to determine where performance broke down

For those unfamiliar with RULER, the benchmark groups 13 tasks into 4 categories:

  • Retrieval (NIAH): Single NIAH, Multi-key NIAH, Multi-value NIAH, Multi-query NIAH

  • Multi-hop tracing: Variable tracking

  • Aggregation: Common word extraction (CWE), Frequent word extraction (FWE)

  • Question Answering (QA)

Taken together, the 13 tasks iterate on vanilla NIAH by further evaluating a model’s ability to disambiguate similarly sounding information (multi-key NIAH), retrieve without missing critical information (multi-value and multi-query), establish a chain of reference (variable tracking), summarize long passages (CWE and FWE), and answer questions in the face of distracting information (QA). Full details of the tasks as well as parametrization of task complexity are covered in the RULER research paper.
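To give a flavor of how these variations raise the difficulty beyond vanilla NIAH, here are two simplified, illustrative generators: a multi-key NIAH sample with distractor needles, and a variable-tracking chain. These are sketches in the spirit of the benchmark, not the official RULER task generators (see the RULER paper and codebase for those).

```python
import random
import string

def multi_key_niah(num_distractors=4):
    """One target key-value pair hidden among similar-looking distractor pairs."""
    keys = random.sample(string.ascii_uppercase, num_distractors + 1)
    pairs = {k: random.randint(1000, 9999) for k in keys}
    target = keys[0]
    needles = [f"The special number for key {k} is {v}." for k, v in pairs.items()]
    random.shuffle(needles)
    question = f"What is the special number for key {target}?"
    return needles, question, str(pairs[target])

def variable_tracking(chain_length=4):
    """A chain of variable assignments the model must follow (multi-hop tracing)."""
    names = random.sample(string.ascii_uppercase, chain_length)
    value = random.randint(10000, 99999)
    statements = [f"VAR {names[0]} = {value}."]
    statements += [f"VAR {names[i]} = VAR {names[i-1]}." for i in range(1, chain_length)]
    question = f"Which variables are equal to {value}? List all of them."
    return statements, question, names  # every name in the chain is a correct answer
```

Both generators produce needles or statements that get scattered through long filler text, so the model has to disambiguate near-duplicate information or follow a chain of references rather than match a single distinctive sentence.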

A model that scores well on these tasks, at long context length, is well on its way to achieving production grade performance on the use cases we set out.

Crusoe’s GPUs have been hard at work cranking through task iterations and ablations (they take a while to run!). As an initial result, we are sharing the scores for our Llama-3 8B 1M context length model. Since the RULER codebase only supports up to 128k context length, that’s where we cut off our evaluations too.

[Table: RULER task scores for our Llama-3 8B 1M context length model at sequence lengths up to 128k.]

*- evaluated on 50 repeated task samples. All other tasks were evaluated on 100 samples.

The grand average score for our 8B model is 81.1 across sequence lengths. Compared to the leaderboard on RULER’s Github page, this puts the model at 7th place, just behind Mixtral-8x22B, which scored 81.9. For an 8B model, this is quite an impressive feat! It has about 75% fewer parameters than the next smallest higher ranked model - the 34B parameter Yi model, and was only trained on 1.4B tokens on top of the base Llama-3 8B Instruct model.
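For readers curious how a single headline number rolls up from per-task results, here is a minimal sketch of one plausible aggregation: an unweighted mean over tasks within each sequence length, then over sequence lengths. Consult the RULER repository for the exact convention behind its leaderboard averages.

```python
# Sketch of a RULER-style "grand average" (assumption: unweighted means).
# `scores_by_length` maps sequence length -> {task name: score}.

def grand_average(scores_by_length: dict[int, dict[str, float]]) -> float:
    per_length = [
        sum(task_scores.values()) / len(task_scores)
        for task_scores in scores_by_length.values()
    ]
    return sum(per_length) / len(per_length)
```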

The model’s particular strengths are retrieval - score of 94.7 overall average - and Q&A - score of 65.6. These scores are just behind GPT-4 and Yi, the top 2 scoring models in these categories from the original 10 evaluated in the research paper.

We consider this a motivating result for Gradient’s custom built long context models and encouraging for open-source long context LLMs. In the coming weeks, we are very excited to share results for our 70B long context models, as well as to apply our custom infrastructure to evaluate the RULER benchmark at longer context lengths.

Shoutouts

As per usual, our GPU grill masters here at Gradient have been cooking up a storm to get these evals ready. Crusoe has been an invaluable partner, and sponsored this study with their high performance, high capacity L40S cluster. In addition, thanks to Cheng-Ping Hsieh @ Nvidia for helpful comments, and the rest of the RULER team for open sourcing their codebase!

About Gradient

Gradient is a full stack AI platform that enables businesses to build customized agents to power enterprise workloads. By leveraging Gradient AI Foundry, you’ll be able to develop custom-tailored solutions with the most comprehensive platform for AI transformation.

If you’re interested in our work with long context models or model development, you can reach out to us here, join our community on Discord, or join our custom agent and long context (262k-1M+) waitlist.