Synthetic Data Generation for Contexts Up to 1 Million Tokens Using Short-Context Models

Jun 18, 2024

Michael Feil, Engineer at Gradient

A few weeks ago, we released a series of Llama-3 long context models. Contrary to popular belief, the major challenge in training these models isn't just improving attention implementation, but also the scarcity of adequate long context training data. To make training our Llama-3 long context models possible, we developed a synthetic data generation pipeline that composes coherent training data for contexts up to 1 million tokens.

From Pre-Training to Post-Training

With the latest research developments, it has become clear that context length is no longer bottlenecked by the complexity of the attention operation. The first milestone was the introduction of Flash-Attention [3], which made training drastically cheaper and decreased memory requirements, unlocking a new generation of models with context lengths of up to 64k. The second milestone was the introduction of Ring-Attention [4], which enabled the training of the LWM-series models at up to 1048k context length, as well as our own Llama-3 long context series. Models in both of these context ranges (Nous-Yarn-Llama-2-13b-128k, LWM, Llama-3-70B-Instruct-Gradient-1048k) are no longer pre-trained from scratch at the target length, but rather extended to longer context lengths during post-training.

The Need for Synthetic Data

Figure: Sequence length distribution for classical long-context training data [1]

Narrowing down to non-synthetic data sources that offer longer context sequences reveals a significant limitation: only a select few sources, such as Wikipedia, the Gutenberg Project, and extensive C++ code repositories, provide coherent sequences exceeding 16k tokens. The graph above [1] displays the distribution of context lengths in the Wikipedia and Gutenberg datasets. Even for the English Gutenberg subset (sedthh/gutenberg_english), 95% of the books contain fewer than 300k tokens.
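To make the limitation concrete, a length distribution like this can be estimated with a few lines of Python. This is a minimal sketch assuming the Hugging Face datasets and transformers libraries; the tokenizer choice and the dataset's text column name are assumptions, not the exact setup behind the figure.

```python
# Minimal sketch: estimate the token-length distribution of a book corpus.
# The tokenizer and the "TEXT" column name are assumptions; swap in whatever
# matches your setup (the Llama-3 tokenizer is gated on the Hugging Face Hub).
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
books = load_dataset("sedthh/gutenberg_english", split="train", streaming=True)

lengths = []
for i, book in enumerate(books):
    if i >= 1_000:                 # sample a subset to keep the estimate cheap
        break
    text = book.get("TEXT") or ""  # column name assumed
    lengths.append(len(tokenizer(text).input_ids))

lengths = np.array(lengths)
print(f"median length:    {np.median(lengths):,.0f} tokens")
print(f"95th percentile:  {np.percentile(lengths, 95):,.0f} tokens")
print(f"share above 300k: {(lengths > 300_000).mean():.1%}")
```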

Due to this limitation, synthetically generated datasets are the only viable path for training longer context models.

The Lack of Synthetic Datasets Suitable for Long Context Training

The existing synthetic datasets and generation approaches can be grouped into three categories:

  1. Short context, pre-training style fully-synthetic datasets generated using short context models based on WebCrawl data

  2. Short context, instruction style fully-synthetic datasets generated using short context models based on human questions

  3. Long context, pre-training style datasets generated by appending WebCrawl data

Short Context, Pre-Training Style

HuggingFaceTB/cosmopedia is a great example of a synthetic pre-training style dataset generated by a short context model. Mixtral-8x7B-Instruct was prompted to generate content related to seed sequences sampled from RefinedWeb.
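To illustrate this seeded-generation pattern (this is not Cosmopedia's actual prompt or code), the sketch below asks an instruct model served behind an OpenAI-compatible endpoint to expand a sampled web passage into longer, textbook-style text; the endpoint URL, model name, and prompt wording are all assumptions.

```python
# Sketch of seeded synthetic generation in the spirit of Cosmopedia: a
# short-context instruct model writes new text grounded in a web snippet.
# The endpoint (e.g. a local vLLM server) and prompt wording are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

seed_passage = "..."  # a passage sampled from a web corpus such as RefinedWeb

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{
        "role": "user",
        "content": (
            "Write a clear, self-contained textbook-style article about the "
            f"topic touched on in the following web extract:\n\n{seed_passage}"
        ),
    }],
    max_tokens=2048,
)
print(response.choices[0].message.content)
```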

Short Context, Instruction Style

Synthetically-created instruct datasets are commonly used to create models like WizardLM, Starling-LM, or OpenChat. In most cases, they consist of a user-written question followed by one or more answers from LLMs. Some great examples of these datasets are:

  • LDJnr/Capybara (multi-turn conversations, Amplify-Instruct method, 2k tokens)

    Contains over 10,000 multi-turn conversational examples synthesized using the Amplify-Instruct method [5]. This approach expands single-turn seeds from diverse, high-quality datasets into detailed multi-turn conversations. The dataset has an average context length of ~2k tokens per conversation.

  • vicgalle/alpaca-gpt4 (single-turn instructions, based on user queries)

    English instruction-following dataset generated by GPT-4 using Alpaca prompts for fine-tuning LLMs. These instructions are single-turn and based on user queries.

  • berkeley-nest/Nectar (generated using GPT-4, GPT-3.5-turbo, GPT-3.5-turbo-instruct, Llama-2-7B-chat, and Mistral-7B-Instruct)

    7-wise comparison dataset. It includes 182,954 diverse chat prompts with seven responses per prompt, sourced from models like GPT-4, GPT-3.5-turbo, Llama-2-7B-chat, and Mistral-7B-Instruct. Each response is ranked by GPT-4 based on helpfulness and harmlessness criteria, resulting in 3.8 million pairwise comparisons.

Long Context, Pre-Training Style

On the opposite end of the spectrum, there are dataset augmentation methods for producing longer context datasets. Most of these methods focus mainly on generating longer context and more diversity, and less on chat instructions. One example is the long context data engineering paper [2], which takes the approach of concatenating a number of passages. However, we found that the random-sampling algorithm of the accompanying implementation has several biases and does not reflect ideal bin-packing with drawing without replacement. Additionally, start tokens are added to each passage, and the concatenated passages have little overlap in content, potentially harming the model's learning of certain adjacencies in the attention operation.
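For illustration, a packing scheme with the properties we would want here (each passage drawn exactly once, sequences filled greedily up to the target length) can be sketched as follows. This is our reading of the general idea, not the implementation accompanying the referenced paper.

```python
# Sketch: pack short documents into long-context training sequences by drawing
# without replacement and greedy first-fit bin-packing. Illustrative only.
import random

def pack_documents(doc_token_lengths, target_len=128_000):
    """Assign document indices to bins holding at most `target_len` tokens."""
    order = list(range(len(doc_token_lengths)))
    random.shuffle(order)                       # each document is used exactly once
    bins, free_space = [], []
    for doc_id in order:
        n = doc_token_lengths[doc_id]
        for i, space in enumerate(free_space):  # first-fit: first bin with room
            if n <= space:
                bins[i].append(doc_id)
                free_space[i] -= n
                break
        else:                                   # no bin fits: open a new one
            bins.append([doc_id])
            free_space.append(target_len - n)
    return bins

doc_lengths = [random.randint(1_000, 60_000) for _ in range(500)]
sequences = pack_documents(doc_lengths)
print(f"{len(sequences)} packed sequences of up to 128k tokens")
```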

None of These are Long Enough

None of the existing three categories of synthetic datasets filled the need for a high quality 1M context length dataset, based on the criteria in the table below. This motivated us to devise a novel dataset generation pipeline for long context chat instructions.


Gradient’s Data Generation Pipeline for Long Context

To overcome these challenges, we implemented a new pipeline for synthetic long context data generation. We grounded the generated text in pre-training style data, using coherent context from long sources such as code or books from cerebras/SlimPajama-627B as the starting point. This helped generate a diverse set of instructions.

The pipeline involved the following steps:

  1. Select a 50-150k token chunk of context to ground the generation.

  2. Use a short context model such as Mixtral-8x7B to generate a single user-assistant pair. Then, append the user-assistant pair to the content.

    1. As a novel approach to cope with very long contexts, we used context masking to fit the large context length into a short context model. We masked down to a target length of 1k to 4k tokens by splitting the source material into paragraphs and sampling them until the target length was reached. Additionally, if any user/assistant pairs were already available, we also included a few sampled pairs. The rationale was to avoid the effects described in the “Lost in the Middle” [6] paper and encourage context usage over the entire sequence.

  3. Repeat step 2 until the target length is reached for this section.

  4. Repeat steps 1-3 until the 1M token length is reached for the full context.

Multiple jobs can run in parallel for more efficient batched inference.
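The sketch below shows how steps 2 and 3 can be wired together in Python. The helper names, the rough characters-to-tokens estimate, the prompt wording, and the OpenAI-compatible endpoint are illustrative assumptions, not our actual implementation, which differs in its sampling heuristics and batches many such jobs in parallel.

```python
# Simplified sketch of the generation loop (steps 2-3). Helper names, prompt
# wording, the chars -> tokens estimate, and the endpoint are assumptions.
import random
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # any capable short context model

def masked_view(paragraphs, qa_pairs, target_tokens=4_000):
    """Context masking: sample paragraphs (plus a few earlier Q/A pairs)
    until roughly `target_tokens` worth of text has been collected."""
    view, budget = [], target_tokens
    for p in random.sample(paragraphs, len(paragraphs)):
        cost = len(p) // 4                      # crude chars -> tokens estimate
        if cost > budget:
            continue                            # skip paragraphs that don't fit
        view.append(p)
        budget -= cost
    view += random.sample(qa_pairs, min(2, len(qa_pairs)))
    return "\n\n".join(view)

def generate_qa(context_view):
    """Ask the short context model for one user/assistant pair grounded in the view."""
    prompt = (
        "Based only on the excerpt below, write one challenging user question "
        f"and a detailed, correct answer.\n\n{context_view}"
    )
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
    )
    return resp.choices[0].message.content

def build_section(chunk_text, pairs_per_section=8):
    """Step 2 applied repeatedly (step 3) to one 50-150k token source chunk."""
    paragraphs = [p for p in chunk_text.split("\n\n") if p.strip()]
    qa_pairs = []
    for _ in range(pairs_per_section):
        qa_pairs.append(generate_qa(masked_view(paragraphs, qa_pairs)))
    return chunk_text + "\n\n" + "\n\n".join(qa_pairs)
```

Concatenating such sections over successive source chunks (step 4) then yields a single coherent training example approaching 1M tokens.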

Using this pipeline, we generated a dataset that met all our quality requirements for model training: coherent context over 1M tokens with instruction following. This specific method also ensured that the model was trained to focus on the entire attention matrix.

Conclusion

High quality synthetic dataset generation requires a combination of AI-generated samples and human-verified training samples. With constantly increasing model context windows, longer context datasets are increasingly necessary and useful for the community.

Sources

[1] In the long context run, https://www.harmdevries.com/post/context-length/

[2] Data Engineering for Scaling Language Models to 128K Context, https://arxiv.org/abs/2402.10171

[3] Flash-Attention, https://arxiv.org/abs/2205.14135

[4] Ring-Attention, https://arxiv.org/abs/2310.01889

[5] Amplify Instruct, https://huggingface.co/datasets/LDJnr/Capybara

[6] Lost-in-the-middle, https://arxiv.org/abs/2307.03172