Model quantization can reduce the memory footprint and computation requirements of deep neural network models. Weight quantization is a common quantization technique that converts a model’s weights from the standard floating-point data type (e.g., 32-bit floats) to a lower-precision data type (e.g., 8-bit integers), saving memory and speeding up inference through reduced computational cost. Model quantization can make large models, such as LLMs, more practical for real-world applications on edge devices.
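To make the idea concrete, the sketch below shows one simple form of weight quantization: symmetric, per-tensor conversion of 32-bit float weights to 8-bit integers plus a single scale factor. The function names and the choice of symmetric per-tensor scaling are illustrative assumptions, not a specific library's API; real frameworks typically use per-channel scales, zero points, or calibration data.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization of float32 weights to int8 (illustrative sketch)."""
    # The scale maps the largest absolute weight onto the int8 range [-127, 127].
    scale = float(np.max(np.abs(weights))) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from the int8 values and the scale."""
    return q.astype(np.float32) * scale

# Example: an int8 round-trip of a small weight matrix.
w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs error:", np.max(np.abs(w - w_hat)))          # small relative to weight magnitudes
print("bytes: float32 =", w.nbytes, ", int8 =", q.nbytes)   # 4x smaller storage
```

Storing the weights as int8 cuts their memory footprint by roughly 4x compared with float32, at the cost of a small, bounded rounding error per weight; the same idea underlies 8-bit (and lower) weight quantization schemes used for LLMs.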