Imagine you have a large language model (LLM) with a very large context window (e.g., > 64K tokens), and the fine-tuning training examples or the prompts at inference time are much shorter than the context window. In this case, sample packing can help reduce model training time and/or increase LLM inference throughput (in tokens/sec).
Sample Packing (also known as multipack) is a technique used in machine learning, particularly in tasks involving sequential data such as LLM training and time series modeling, to process variable-length sequences efficiently. It works by concatenating several shorter sequences into a single fixed-length sample and applying masking mechanisms so that padded positions are ignored and the packed sequences do not interfere with one another during computation, thereby maximizing computational efficiency without compromising learning efficacy.
Traditional neural network architectures, such as recurrent neural networks (RNNs) and transformer models, excel at processing sequential data. However, their performance can be hindered by the inherent variability of sequence lengths in real-world datasets, since the model expects a fixed-length input window. Sample packing enables multiple inputs to be included in a single training sample or inference request, improving the model's computational efficiency and thereby reducing training time and/or increasing inference throughput.
In LLM tasks, data is represented as sequences of tokens, where each token typically corresponds to a word, character, or sub-word. These sequences form the input to the models, which aim to learn meaningful representations and capture intricate patterns within the data. However, the lengths of these sequences can vary substantially, posing a computational challenge during training.
To illustrate this challenge, consider a scenario where we aim to fine-tune an LLM on a corpus of text. Each sentence in the corpus may have a different length, ranging from a few words to several dozen or even hundreds. When processing these sentences in batches, traditional approaches pad shorter sequences with special tokens to match the length of the longest sequence in the batch. While this ensures uniformity for batch processing, it introduces inefficiencies by including irrelevant padding tokens in the computation, which not only wastes GPU resources but also dilutes the model's learning signal.
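To see how much compute padding can waste, here is a minimal sketch using a Hugging Face tokenizer (the GPT-2 tokenizer and the example sentences are purely illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

sentences = [
    "Short sentence.",
    "A somewhat longer sentence with a few more tokens in it.",
    "A much longer passage that dominates the batch length " * 4,
]

# Pad every sequence to the length of the longest one in the batch.
batch = tokenizer(sentences, padding="longest", return_tensors="pt")
total = batch["input_ids"].numel()
real = int(batch["attention_mask"].sum())
print(f"{total - real} of {total} positions are padding "
      f"({100 * (total - real) / total:.0f}% wasted compute)")
```

With the longest sequence dominating the batch, a large fraction of the positions fed to the GPU carry no information at all.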
Sample Packing offers a compelling solution to this challenge. It organizes sequences within a batch in a manner that maximizes computational efficiency without sacrificing learning efficacy: several shorter sequences are concatenated into a single sample close to the target length, so the whole batch is processed in parallel with almost no padding. By largely eliminating padding tokens, Sample Packing minimizes wasted computation, leading to significant improvements in training efficiency and batch inference throughput.
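As a rough illustration of the idea, the following plain-Python sketch greedily concatenates tokenized examples into fixed-length packed samples (the token IDs, PAD_ID, and MAX_LEN are made up), so only the tail of each pack needs padding:

```python
PAD_ID = 0
MAX_LEN = 16

examples = [
    [101, 7, 42, 9],          # 4 tokens
    [101, 5, 5, 5, 5, 5, 9],  # 7 tokens
    [101, 3, 9],              # 3 tokens
    [101, 8, 8, 8, 8, 9],     # 6 tokens
]

packs, current = [], []
for ex in examples:
    if len(current) + len(ex) > MAX_LEN:  # start a new pack when the next example no longer fits
        packs.append(current)
        current = []
    current += ex
if current:
    packs.append(current)

# Only the tail of each pack is padded, instead of padding every example to the longest one.
packs = [p + [PAD_ID] * (MAX_LEN - len(p)) for p in packs]
for p in packs:
    print(p)
```

Production implementations use smarter bin-packing heuristics than this first-fit loop, but the effect is the same: far fewer padding tokens per batch.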
Central to the implementation of Sample Packing is the concept of masking. Masking mechanisms enable neural network models to selectively ignore padded regions during computation, and to prevent sequences packed into the same sample from attending to one another, ensuring that these positions do not contribute to the model's output or gradients. This enables the model to focus exclusively on the relevant parts of the input sequences, thereby preserving the integrity of the learning process while mitigating the effects of variable-length sequences.
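For instance, in a transformer a packed sample can use a block-diagonal (and, for causal LMs, lower-triangular) attention mask together with per-sequence position IDs. The PyTorch sketch below, with assumed sequence lengths, shows one way to build them:

```python
import torch

lengths = [4, 7, 3]   # lengths of the sequences packed into one sample (illustrative)
total = sum(lengths)

mask = torch.zeros(total, total, dtype=torch.bool)
start = 0
for n in lengths:
    # Each sequence may only attend to positions inside its own block ...
    block = torch.ones(n, n, dtype=torch.bool)
    # ... and, for causal LMs, only to earlier positions within that block.
    mask[start:start + n, start:start + n] = torch.tril(block)
    start += n

# Position IDs restart at 0 for every packed sequence, so positional
# embeddings behave as if each sequence were alone in the batch.
position_ids = torch.cat([torch.arange(n) for n in lengths])
print(mask.int())
print(position_ids)
```

In practice, efficient attention kernels (such as flash-attention with variable-length support) consume the per-sequence lengths directly rather than materializing this full mask, but the logical effect is the same.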
In RNNs, Sample Packing entails packing sequences into a single tensor and applying masking to disregard padded elements during computation. Similarly, in transformer models such as the GPT family, Sample Packing optimizes batch processing by concatenating multiple examples into each fixed-length input, with masking mechanisms keeping the examples independent of one another.
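In PyTorch, for example, the built-in pack_padded_sequence utility lets an RNN skip padded timesteps entirely; a minimal sketch (with made-up tensors) looks like this:

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Three variable-length sequences with feature dimension 8 (illustrative data).
seqs = [torch.randn(5, 8), torch.randn(3, 8), torch.randn(2, 8)]
lengths = torch.tensor([5, 3, 2])

padded = pad_sequence(seqs, batch_first=True)                  # shape (3, 5, 8), zero-padded
packed = pack_padded_sequence(padded, lengths, batch_first=True)

rnn = torch.nn.GRU(input_size=8, hidden_size=16, batch_first=True)
out, h = rnn(packed)   # padded steps are never fed to the GRU
```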
Sample Packing not only improves computational efficiency, it also accelerates the training process, allowing researchers and practitioners to experiment with larger datasets and more complex models. Moreover, it fosters the development of more robust and scalable neural network architectures.
In summary, Sample Packing offers a pragmatic solution to the challenge of processing variable-length sequences efficiently. By combining dense packing of sequences with masking mechanisms, Sample Packing empowers neural network models to unlock new levels of performance and scalability in sequence processing tasks.
Axolotl supports sample packing (together with flash-attention) for open-source models such as Llama-2 and Mistral. It is enabled by adding the following to the YAML configuration file:
sample_packing: true
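A fuller, illustrative configuration might combine packing with flash-attention and a long sequence length (the option names below should be checked against the Axolotl version you are using):

```yaml
sequence_len: 4096        # pack examples up to this many tokens
sample_packing: true      # enable multipack
pad_to_sequence_len: true # pad only the tail of each packed sample
flash_attention: true     # efficient attention over packed samples
```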
Axolotl can be used for fine-tuning models on Hopsworks by simply installing it as a Python dependency in your project. Your fine-tuning training data can be loaded from Hopsworks by Axolotl using the built-in FUSE support that makes your training data, stored on HopsFS-S3, available as local files to Axolotl.