As large language models (LLMs) have advanced, many industries have been developing and serving LLM-based systems for diverse purposes. Models like GPT-4, Claude, Gemini, and many others have demonstrated impressive abilities to understand, generate, and reason about human language. However, serving LLMs to a large audience comes with significant memory and computational requirements.
Serving systems often struggle to achieve high throughput due to inefficient memory usage. One key area of focus has been the models' key-value (KV) caches: the large memory stores that hold the contextual information the models use during inference. PagedAttention is an innovative technique, proposed by Kwon et al. in their paper “Efficient Memory Management for Large Language Model Serving with PagedAttention”, that aims to dramatically reduce the memory footprint of LLM KV caches and thereby make LLMs more memory-efficient and accessible.
LLMs like GPT-4 can have trillions of parameters, making them extremely powerful but also incredibly memory-hungry during inference. Beyond the model weights themselves, the main memory bottleneck when serving is the KV cache.
During the decoding process of transformer-based LLMs, as each token is processed, the model generates corresponding attention key and value tensors. These key and value tensors encode important contextual information about the current input and its relationship to the broader context. Rather than recomputing these attention-related tensors from scratch at each step of the decoding process, the model stores them in GPU memory. This stored collection of key and value tensors is commonly referred to as the KV cache.
By maintaining the KV cache, LLMs can retrieve and reuse the pre-computed contextual information when generating the next output token during inference. The cache acts as a sort of "memory" for the model to draw upon.
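To make this concrete, here is a minimal, framework-agnostic sketch (plain NumPy, a single attention head, random vectors standing in for real projections, and an illustrative `head_dim`) of how key and value tensors are appended to a cache once and reused at every decoding step instead of being recomputed for the whole prefix.

```python
import numpy as np

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all cached keys/values."""
    scores = keys @ q / np.sqrt(q.shape[-1])   # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over cached positions
    return weights @ values                    # (head_dim,)

head_dim = 64
k_cache, v_cache = [], []                      # the "KV cache" for one sequence

for step in range(8):                          # toy decode loop
    # In a real model, q/k/v come from projecting the newest token's hidden state.
    q = np.random.randn(head_dim)
    k = np.random.randn(head_dim)
    v = np.random.randn(head_dim)

    # Store this step's key/value once; they are never recomputed later.
    k_cache.append(k)
    v_cache.append(v)

    # Attend over every cached position, including the new one.
    out = attend(q, np.stack(k_cache), np.stack(v_cache))
```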
However, the size of the KV cache can quickly become a bottleneck. For a 13B-parameter LLM, the KV cache of a single request can grow to roughly 1.6 GB, and at serving capacity the cache can consume around a third of a 40 GB A100 GPU's memory (as illustrated in the paper). The size of the KV cache is also dynamic and unpredictable, since it grows with the lengths of the input and generated sequences, which are not known in advance.
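As a back-of-the-envelope check, using the OPT-13B shape referenced in the paper (40 layers, hidden size 5120, FP16 keys and values, 2048-token maximum sequence length), the per-token and per-request cache sizes work out as follows:

```python
num_layers = 40        # OPT-13B
hidden_size = 5120     # OPT-13B (num_heads * head_dim)
bytes_per_elem = 2     # FP16
max_seq_len = 2048

# One key vector and one value vector per layer, per token.
kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_elem
kv_bytes_per_request = kv_bytes_per_token * max_seq_len

print(f"{kv_bytes_per_token / 1024:.0f} KB per token")         # ~800 KB
print(f"{kv_bytes_per_request / 1024**3:.1f} GB per request")  # ~1.6 GB
```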
Therefore, efficiently storing and accessing all that information during inference is a major computational and memory challenge, especially for deployments on resource-constrained hardware like edge devices or mobile phones.
The core idea behind PagedAttention is to partition the KV cache of each sequence into smaller, more manageable blocks, or "pages". Each block contains the key and value vectors for a fixed number of tokens. This way, the KV cache can be allocated and accessed more efficiently during attention computation.
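As a rough illustration (plain Python, with a hypothetical block size of 16 tokens and string placeholders instead of real tensors), partitioning a sequence's per-token key/value entries into fixed-size blocks might look like this:

```python
from typing import List

BLOCK_SIZE = 16  # number of tokens stored per KV block (illustrative value)

def partition_into_blocks(kv_per_token: List, block_size: int = BLOCK_SIZE) -> List[List]:
    """Split a sequence's per-token (key, value) entries into fixed-size blocks.

    The last block may be only partially filled; it keeps accepting new tokens
    until it reaches block_size, at which point a fresh block is started.
    """
    return [kv_per_token[i:i + block_size]
            for i in range(0, len(kv_per_token), block_size)]

# Example: a 37-token sequence yields 3 blocks of sizes 16, 16 and 5.
kv_pairs = [(f"k{i}", f"v{i}") for i in range(37)]
blocks = partition_into_blocks(kv_pairs)
print([len(b) for b in blocks])  # [16, 16, 5]
```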
PagedAttention's approach to managing the memory used for storing key and value vectors is similar to how operating systems handle virtual memory. PagedAttention decouples the logical organization of the data from its physical storage: the logical blocks belonging to a sequence are mapped to potentially non-consecutive physical blocks using a block table. This abstraction allows for more flexible memory utilization, since new physical blocks are allocated only as new tokens are generated.
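The following toy sketch (illustrative only, not vLLM's actual implementation; class and method names are made up) shows the idea of a block table: the logical block indices of a sequence map to physical block IDs that can live anywhere in a shared pool, and a new physical block is claimed only when the current one fills up.

```python
class BlockManager:
    """Toy logical-to-physical block mapping in the spirit of PagedAttention."""

    def __init__(self, num_physical_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))  # pool of physical block IDs
        self.block_tables = {}   # seq_id -> list of physical block IDs (the block table)
        self.num_tokens = {}     # seq_id -> number of tokens stored so far

    def append_token(self, seq_id: int) -> tuple:
        """Record one new token's KV entry; allocate a physical block only if needed."""
        table = self.block_tables.setdefault(seq_id, [])
        count = self.num_tokens.get(seq_id, 0)
        if count % self.block_size == 0:          # current block is full (or none yet)
            table.append(self.free_blocks.pop())  # grab any free physical block
        self.num_tokens[seq_id] = count + 1
        # The token's KV entry lives at (physical block, offset within that block).
        return table[-1], count % self.block_size

mgr = BlockManager(num_physical_blocks=8, block_size=4)
for _ in range(6):
    mgr.append_token(seq_id=0)
print(mgr.block_tables[0])  # two physical block IDs, not necessarily adjacent in memory
```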
PagedAttention is employed by vLLM, an open-source library for fast LLM inference and serving developed at UC Berkeley.
LMSYS has adopted vLLM as a backend in FastChat, which powers Chatbot Arena, to keep up with growing demand. LMSYS reports that vLLM with PagedAttention achieves up to 30x higher throughput than their initial HuggingFace Transformers backend.
To use PagedAttention, one can use vLLM, which can be installed via the following pip command.
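```bash
pip install vllm
```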
vLLM supports both online serving and offline inference; the latter can be used directly in a Python script through its Python library. The details can be found in the documentation.
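For offline inference, a minimal example along the lines of vLLM's documented quickstart looks like the sketch below; the model name, prompts, and sampling values are chosen here purely for illustration.

```python
from vllm import LLM, SamplingParams

# Any HuggingFace-hosted model supported by vLLM can be used here;
# "facebook/opt-125m" is simply a small model convenient for a quick test.
llm = LLM(model="facebook/opt-125m")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "The capital of France is",
    "PagedAttention helps LLM serving by",
]

# vLLM batches the prompts and manages the KV cache with PagedAttention internally.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```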
The PagedAttention approach offers several key advantages that can make LLMs more memory-efficient and accessible: because blocks are allocated on demand, very little KV cache memory is wasted on fragmentation, and because physical blocks are shared through block tables, the cache can be reused across sequences, for example during parallel sampling.
While PagedAttention offers many benefits, there are also some potential challenges and limitations to consider: the block-table indirection adds some overhead to attention computation, and the technique requires custom attention kernels that can read key and value vectors from non-contiguous blocks of memory.
In conclusion, PagedAttention is an innovative technique that addresses the significant memory challenges faced when serving LLMs. By partitioning the KV cache into smaller, more manageable pages and leveraging a block table for efficient access, PagedAttention can dramatically reduce the memory footprint of these powerful models, paving the way for their wider adoption and deployment in a variety of real-world applications.