Large language models (LLMs) have revolutionized various domains. However, deploying these models in real-world applications can be challenging due to their high computational demands. This is where vLLM steps in. vLLM stands for Virtual Large Language Model and is an actively developed open-source library for efficient LLM inference and model serving.
vLLM was first introduced in the paper Efficient Memory Management for Large Language Model Serving with PagedAttention by Kwon et al. The paper identifies memory allocation as a key challenge in serving LLMs and measures its impact on performance. Specifically, it highlights the inefficiency of Key-Value (KV) cache memory management in existing LLM serving systems, which often results in slow inference and a high memory footprint.
To address this, the paper presents PagedAttention, an attention algorithm inspired by the virtual memory and paging techniques used in operating systems. PagedAttention enables efficient memory management by allowing attention keys and values to be stored non-contiguously. Building on this idea, the paper develops vLLM, a high-throughput distributed LLM serving engine with near-zero waste in KV cache memory. By further borrowing operating-system ideas such as virtual memory mapping and copy-on-write, vLLM efficiently manages the KV cache and supports various decoding algorithms. The result is a 2-4x throughput improvement over state-of-the-art systems such as FasterTransformer and Orca, with the gains most pronounced for longer sequences, larger models, and more complex decoding algorithms.
The attention mechanism allows LLMs to focus on relevant parts of the input sequence while generating output. During generation, the attention scores of each new token against all previous tokens must be computed, which requires keeping the keys and values of those tokens in the KV cache. Existing systems store these KV pairs in contiguous memory, which limits memory sharing between sequences and leads to fragmentation and inefficient memory management.
PagedAttention is an attention algorithm inspired by the concept of paging in operating systems. It partitions the KV cache of each sequence into fixed-size blocks and uses a block table to map logical blocks to physical blocks, so that logically contiguous KV pairs can be stored in non-contiguous memory. This enables flexible management of the KV vectors across layers and attention heads, optimizing memory usage, reducing fragmentation, and minimizing redundant duplication, for example by sharing blocks between sequences that have a common prefix.
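To make the block-table idea concrete, here is a small illustrative sketch in Python. It is not vLLM's actual implementation, just a toy model of how logical KV blocks can map to arbitrary physical blocks:

```python
# Toy sketch of PagedAttention-style block tables (illustrative, not vLLM code).
# A sequence's KV cache is split into fixed-size blocks; the block table maps
# logical block numbers to physical blocks that may live anywhere in GPU memory.

BLOCK_SIZE = 16  # tokens per KV block (illustrative value)


class BlockAllocator:
    """Hands out free physical blocks; they need not be contiguous."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()


class BlockTable:
    """Maps a sequence's logical KV blocks to physical blocks."""

    def __init__(self):
        self.logical_to_physical: list[int] = []

    def append_token(self, position: int, allocator: BlockAllocator) -> None:
        # A new physical block is only needed when the current one is full.
        if position % BLOCK_SIZE == 0:
            self.logical_to_physical.append(allocator.allocate())

    def locate(self, position: int) -> tuple[int, int]:
        # Return the physical block and offset holding this token's K/V vectors.
        return self.logical_to_physical[position // BLOCK_SIZE], position % BLOCK_SIZE


allocator = BlockAllocator(num_blocks=1024)
table = BlockTable()
for pos in range(40):          # cache KV vectors for 40 tokens
    table.append_token(pos, allocator)
print(table.locate(37))        # physical block and offset of token 37
```

Because two sequences can point their block tables at the same physical blocks, a shared prompt prefix is stored only once, and copy-on-write is applied when the sequences diverge.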
vLLM doesn't stop at PagedAttention. It incorporates a suite of techniques, such as continuous batching of incoming requests, optimized CUDA kernels, and quantization support, to further optimize LLM serving.
vLLM is easy to use. Here is a glimpse of how it can be used in Python:
One can install vLLM via pip:
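```bash
pip install vllm
```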
Then import the vLLM module into your code and run offline inference with vLLM's engine. The LLM class initializes the vLLM engine with a specific model; models are downloaded from Hugging Face by default. The SamplingParams class sets the parameters for inference:
Next, define an input prompt and the sampling parameters, then initialize vLLM's engine for offline inference with the LLM class and a model of your choice:
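```python
from vllm import LLM, SamplingParams
```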
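For example (the model name below is just an illustration; any Hugging Face model supported by vLLM works):

```python
prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Downloads the model from Hugging Face (if not already cached) and
# initializes the vLLM engine.
llm = LLM(model="facebook/opt-125m")
```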
Finally, the output/response can be generated by:
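Each returned RequestOutput holds the original prompt and the generated completions:

```python
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated text: {output.outputs[0].text!r}")
```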
The code example can be found here.
For online serving, vLLM provides an OpenAI-compatible server that implements OpenAI's Completions and Chat APIs. The server can be started with Python:
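```bash
# The model name is just an example; the server listens on port 8000 by default.
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct
```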
To call the server, the official OpenAI Python client library can be used. Alternatively, any other HTTP client works as well.
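For example, pointing the client at the locally running server (the base URL, API key, and model name below are illustrative):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the API key can be a placeholder
# unless the server was started with one configured.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="The capital of France is",
    max_tokens=32,
)
print(completion.choices[0].text)
```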
More examples can be found on the official vLLM documentation.
vLLM's efficient operation of LLMs opens numerous practical applications. Here are some compelling scenarios that highlight vLLM's potential:
vLLM is quickly becoming a preferred solution for those looking to optimize their Large Language Model (LLM) deployments. Traditional LLM frameworks often require extensive resources and infrastructure, which can be costly and challenging to scale. However, vLLM is designed specifically for efficient memory management, leveraging a sophisticated paging mechanism that reduces the memory footprint and enhances overall performance. By adopting vLLM, teams can reduce hardware costs while improving inference speed and efficiency, especially when scaling up for high-demand applications. This makes vLLM ideal for organizations aiming to deploy LLMs at scale without sacrificing speed or requiring extensive resources.
When combined with Hopsworks, vLLM seamlessly integrates into a robust MLOps pipeline, enabling teams to deploy, monitor, and optimize LLM applications with ease. Hopsworks offers end-to-end MLOps capabilities, such as experiment tracking, model versioning, and monitoring, which can be directly applied to manage vLLM deployments. Additionally, Hopsworks’ feature store provides data consistency and high performance, critical for training and deploying LLMs. By integrating vLLM with Hopsworks, MLOps teams gain a scalable, efficient way to manage and monitor large-scale LLM deployments, bringing the benefits of vLLM optimization into the broader MLOps ecosystem. With Hopsworks 4.0 you can build and operate LLMs end to end: from creating instruction datasets and fine-tuning, to model serving with vLLM on KServe, to monitoring and RAG. We added a vector index to our feature store, so a single feature pipeline can both index your documents for RAG and create instruction datasets.
The vLLM project provides an implementation of an OpenAI-compatible server that initializes a vLLM engine, loads a given LLM, and handles incoming user prompts at endpoints following OpenAI's Completions and Chat APIs. Hopsworks enables users to deploy LLMs using the vLLM OpenAI server by providing a configuration YAML file containing the parameters to be passed to the server. These parameters include information about the tokenizer, the tool parser used for extracting tool calls from answers (used for function calling), and the chat template, among other things. The list of all available parameters can be found in the vLLM OpenAI server documentation.
For example, to deploy a fine-tuned Llama3.1 model using the vLLM OpenAI server in Hopsworks, you can pass the server configuration file in the config_file parameter as shown in the code snippet below.
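The sketch below illustrates what this could look like with the Hopsworks Python library; the registry and deployment calls are illustrative and the model and file names are hypothetical, but the config_file parameter carries the vLLM OpenAI server configuration described above:

```python
import hopsworks

# Connect to the Hopsworks project (illustrative sketch; see the Hopsworks
# documentation for the exact LLM deployment API in your version).
project = hopsworks.login()
mr = project.get_model_registry()

# Assumes the fine-tuned Llama 3.1 model has already been registered in the
# model registry under this (hypothetical) name.
llama_model = mr.get_model("llama31-instruct-finetuned", version=1)

# Deploy it with the vLLM OpenAI server, passing the YAML file that holds the
# server parameters (tokenizer, tool parser, chat template, ...).
deployment = llama_model.deploy(
    name="llama31vllm",
    config_file="vllm_openai_config.yaml",  # parameters for the vLLM OpenAI server
)
deployment.start()
```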
Pros:
- Configuration only: the vLLM OpenAI server is set up entirely through the YAML file, with no predictor code to write or maintain.
- The full OpenAI-compatible API is available out of the box, including features such as tool calling and chat templates.
Cons:
- Less flexibility: the handling of the /completions and /chat/completions endpoints cannot be customized.
Alternatively, Hopsworks supports a second, more customizable approach for deploying LLMs with vLLM that allows users to initialize the vLLM engine and provide their own implementation of the /chat/completions and /completions endpoints. This approach leverages a KServe-provided OpenAI-compatible server that offers a default implementation of the endpoints to handle completion requests and user prompts, but that allows users to override these methods.
For example, to deploy a fine-tuned Llama3.1 model using the KServe vLLM server in Hopsworks, you can implement the predictor script and pass it in the script_file parameter as shown in the code snippet below. Optionally, you can also pass a configuration file that will be available inside the server.
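As with the previous approach, the following is an illustrative sketch rather than the exact Hopsworks API; the script_file and config_file parameters come from the description above, while the model, file, and deployment names are hypothetical:

```python
import hopsworks

project = hopsworks.login()
mr = project.get_model_registry()

# Fine-tuned Llama 3.1 model previously registered in the model registry
# (hypothetical name and version).
llama_model = mr.get_model("llama31-instruct-finetuned", version=1)

# Deploy with the KServe vLLM server: the predictor script initializes the vLLM
# engine and may override the /completions and /chat/completions endpoints.
deployment = llama_model.deploy(
    name="llama31kserve",
    script_file="llama_predictor.py",      # custom predictor implementation
    config_file="llama_vllm_config.yaml",  # optional, made available inside the server
)
deployment.start()
```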
Pros:
- Full control: users can provide their own implementation of the /chat/completions and /completions endpoints, for example to add custom pre- or post-processing around the vLLM engine.
Cons:
- More effort: a predictor script has to be implemented and maintained, and any features of the vLLM OpenAI server that are needed must be provided by that implementation.
vLLM addresses a critical bottleneck in LLM deployment: inefficient inferencing and serving. Using the innovative PagedAttention technique, vLLM optimizes memory usage during the core attention operation, leading to significant performance gains. This translates to faster inference speeds and the ability to run LLMs on resource-constrained hardware. Beyond raw performance, vLLM offers advantages like scalability and cost-efficiency. With its open-source nature and commitment to advancement, vLLM positions itself as a key player in the future of LLM technology.