Model quantization can reduce the memory footprint and computation requirements of deep neural network models. Weight quantization is a common quantization technique that converts a model’s weights from the standard floating-point data type (e.g., 32-bit floats) to a lower-precision data type (e.g., 8-bit integers), saving memory and speeding up inference through reduced computational cost. Model quantization can make large models, such as LLMs, more practical for real-world applications on edge devices.
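To make the idea concrete, the sketch below shows one simple form of weight quantization: symmetric, per-tensor conversion of 32-bit float weights to 8-bit integers plus a single scale factor. The function names and the choice of symmetric per-tensor scaling are illustrative assumptions, not a specific library's API; real frameworks typically use per-channel scales, zero points, or calibration data.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization of float32 weights to int8 (illustrative sketch)."""
    # The scale maps the largest absolute weight onto the int8 range [-127, 127].
    scale = float(np.max(np.abs(weights))) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from the int8 values and the scale."""
    return q.astype(np.float32) * scale

# Example: an int8 round-trip of a small weight matrix.
w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs error:", np.max(np.abs(w - w_hat)))          # small relative to weight magnitudes
print("bytes: float32 =", w.nbytes, ", int8 =", q.nbytes)   # 4x smaller storage
```

Storing the weights as int8 cuts their memory footprint by roughly 4x compared with float32, at the cost of a small, bounded rounding error per weight; the same idea underlies 8-bit (and lower) weight quantization schemes used for LLMs.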