MLOps was born from the need to build infrastructural software to support production AI systems. In 2025, an estimated 52% of AI systems will still not make it to production. One reason is that developers make false assumptions when building AI systems. This article introduces ten fallacies of MLOps that should help developers understand and identify potential problems ahead of time, enabling them to build better AI systems and get more models into production.
MLOps is a set of principles and practices that guide developers when building any type of AI system - from batch to real-time to LLM-powered systems. However, the foundations of MLOps are not deep. There are no biblical commandments internalized by the members of its church. Many members of the MLOps church worship false gods - fallacies that cause AI systems to never make it to production. These fallacies are inspired by a more mature computing discipline, distributed systems, which has a core set of tenets that developers agree on - the eight fallacies of distributed computing.
The ten fallacies of MLOps listed below have been informed by my experience building AI systems on Hopsworks used by everyone from Fortune 500 companies to AI-powered startups, teaching a course on MLOps at KTH University, creating the world’s first free MLOps course that builds batch and real-time AI systems, and writing a book on Building AI Systems for O’Reilly.
When you write your first batch AI system, it is possible to write it as a single program that can be parameterized to run in either training or inference mode. This can lead to the false assumption that you can run any AI system as a single ML pipeline. You cannot run a real-time AI system as a single ML pipeline. It consists of at least an offline training pipeline, run whenever you train a new version of the model, and an online inference pipeline that runs 24/7. This leads to confusion as to what exactly an ML pipeline is. What are its inputs and outputs? Are the data pipelines that create feature data also ML pipelines? After all, they create the features (the inputs to our ML models).
So what should you do? You should decompose your AI system into feature/training/inference pipelines (FTI pipelines) that are connected to make up your AI system, see Figure 1. Feature pipelines transform data from many different sources into features. Training pipelines take features/labels as input and output a trained model. Inference pipelines take one or more trained models and feature data as input and output predictions. Further decomposition of these pipelines is also possible - generally following the principle that you name an ML pipeline after its output. For example, feature pipelines can be classified as stream processing (streaming) feature pipelines, batch transformation pipelines, feature validation pipelines, and vector embedding pipelines (that transform source data into vector embeddings and store them in a vector index). Similarly, training pipelines can be further decomposed into training dataset creation pipelines (useful for CPU-bound image/video/audio deep learning training pipelines, where you shift-left data transformations to a separate pipeline run on CPUs, not GPUs), model validation pipelines, and model deployment pipelines. Inference pipelines can be decomposed into batch inference pipelines and online inference pipelines.
Reference: https://www.hopsworks.ai/post/mlops-to-ml-systems-with-fti-pipelines
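To make the decomposition concrete, here is a minimal sketch of the FTI pipelines for a batch AI system, using pandas and scikit-learn. The function names, feature names, and model path are illustrative only, and in practice each pipeline would be a separately scheduled program rather than three functions in one file.

```python
# Minimal sketch of the feature/training/inference (FTI) decomposition for a
# batch AI system. Names and the model path are illustrative only.
import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def feature_pipeline(raw: pd.DataFrame) -> pd.DataFrame:
    """Transform raw source data into features (and a label for training)."""
    features = pd.DataFrame()
    features["amount"] = raw["amount"]
    features["log_amount"] = np.log1p(raw["amount"].clip(lower=0))
    features["is_fraud"] = raw["is_fraud"]  # label
    return features


def training_pipeline(features: pd.DataFrame, model_path: str = "model.joblib") -> str:
    """Take features/labels as input and output a trained, saved model."""
    X, y = features.drop(columns=["is_fraud"]), features["is_fraud"]
    model = RandomForestClassifier(n_estimators=50).fit(X, y)
    joblib.dump(model, model_path)
    return model_path


def batch_inference_pipeline(model_path: str, features: pd.DataFrame) -> pd.Series:
    """Take a trained model and feature data as input and output predictions."""
    model = joblib.load(model_path)
    X = features.drop(columns=["is_fraud"], errors="ignore")
    return pd.Series(model.predict(X), index=features.index)
```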
In a real-time AI system, a client issues a prediction request with some parameters. A model deployment receives the prediction request and can use any provided entity ID(s) to retrieve precomputed features for that entity. Precomputing features reduces online prediction latency by removing the need to compute them at prediction time. However, some features require data that is only available as part of the prediction request, and these need to be computed at request time. If we precompute features, we would like them to be reusable across different models. But my decision tree model doesn’t need the numerical feature to be scaled, while my deep learning model needs it to be zero-centered and normalized. Similarly, my CatBoost model can take the categorical string as input, but XGBoost requires me to encode the string before passing it to the model.
There is a data transformation taxonomy for AI with three different types of data transformation: model-independent transformations, model-dependent transformations, and on-demand transformations.
Model-independent transformations are the same as those found in data pipelines (extract-transform-load (ETL) pipelines). However, if you want to support real-time AI systems, you need to support on-demand transformations. They enable both real-time feature computation and offline feature computation using historical data - to backfill feature data in feature pipelines. If you want to support feature reuse, you need model-dependent transformations, delaying the scaling/encoding of feature data until it is used. If you don’t have explicit support for all three transformations, you will not be able to log untransformed feature data in your inference pipelines. For example, Databricks only supports two of the three transformations, and its inference tables store the inputs to models - the scaled/encoded feature data. That makes it very hard to monitor and debug your features and predictions. For example, if you are predicting credit card fraud and the scaled transaction amount is 0.78, there is no real-world interpretability for that value. What’s more, model monitoring frameworks like NannyML work best with untransformed feature data (from the original feature space). To enable observability for AI systems, untangle your data transformations by following the data transformation taxonomy.
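As an illustration of the taxonomy, the sketch below shows one transformation of each type for a credit card fraud use case. The feature names and functions are hypothetical examples, not a prescribed API.

```python
# Sketch of the data transformation taxonomy. Feature names are illustrative.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


# 1. Model-independent transformation: runs in the feature pipeline; its output
#    is reusable by any model and can be logged/monitored in the original space.
def avg_spend_last_7_txns(transactions: pd.DataFrame) -> pd.Series:
    return transactions.groupby("cc_num")["amount"].transform(
        lambda s: s.rolling(7, min_periods=1).mean()
    )


# 2. On-demand transformation: the same function is called at request time with
#    request parameters, and in the feature pipeline on historical data (backfill).
def haversine_distance_km(lat1, lon1, lat2, lon2) -> float:
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))


# 3. Model-dependent transformation: applied just before training/inference,
#    because it is specific to one model (a decision tree would skip it).
scaler = StandardScaler()  # fit on the training split, reused at inference time
```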
The feature store is the data layer that connects the feature pipelines and the model training/inference pipelines. It is possible to build a batch AI system without a feature store if you do not care about reusing features, and you are willing to implement your own solutions for governance, lineage, feature/prediction logging, and monitoring. However, if you are working with time-series data, you will also have to roll your own solution for creating point-in-time correct training data from your tables. If you are building a real-time AI system, you will need a feature store (or build one yourself) to provide precomputed features (as context/history) for online models. The feature store also ensures there is no skew between your model-dependent and on-demand transformations, see Figure 3. It also helps you backfill feature data from historical data sources.
In short, without a feature store you may be able to roll out your first batch AI system, but with no platform for collaboration, governance, or feature reuse, your velocity will not improve with each additional batch model. Building batch AI systems without a feature store is akin to analytics without a data warehouse: it can work, but it won’t scale. For real-time AI systems, you will need a feature store to provide history/context to online models and the infrastructure for ensuring correct, governed, and observable features.
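For illustration, the sketch below shows roughly what rolling your own point-in-time correct training data looks like with pandas merge_asof; a feature store performs an equivalent (but governed and scalable) point-in-time join for you. The column names and values are made up.

```python
# Rough sketch of a point-in-time correct join: for each label event, pick the
# latest feature values whose event time is not later than the label's event time.
import pandas as pd

labels = pd.DataFrame({
    "cc_num": [1, 1],
    "event_time": pd.to_datetime(["2025-01-10", "2025-01-20"]),
    "is_fraud": [0, 1],
})
features = pd.DataFrame({
    "cc_num": [1, 1, 1],
    "event_time": pd.to_datetime(["2025-01-05", "2025-01-15", "2025-01-25"]),
    "avg_spend_7d": [42.0, 55.0, 400.0],
})

training_df = pd.merge_asof(
    labels.sort_values("event_time"),
    features.sort_values("event_time"),
    on="event_time", by="cc_num", direction="backward",
)
# Each label row only sees feature values available at its event time (no leakage).
```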
Many teams erroneously believe that the starting point for building AI systems is installing an experiment tracking service. Making experiment tracking a prerequisite will slow you down in getting to your first minimum viable AI system. Experiment tracking is premature optimization in MLOps. For operational needs, such as model storage, governance, model performance/bias evaluation, and model cards, you should use a model registry. Experiment tracking is primarily for research. However, like the monkey ladder experiment, where monkeys continue to beat up any monkey that tries to climb the ladder (even though they don’t know why they do it), many ML engineers believe the starting point in MLOps is to install an experiment tracking service.
DevOps is a software development process where you write unit, integration, and system tests for your software and, whenever you make changes to your source code, automatically execute those tests using a continuous integration/continuous deployment (CI/CD) process. This typically involves a developer pushing source code changes to a source code repository, which triggers a CI/CD service to check out the source code onto containers, compile/build the code, run the tests, package the binaries, and deploy the binaries if all the tests pass.
MLOps, however, is more than DevOps. In MLOps, in addition to the automated testing of the source code for your machine learning pipelines, you also need to version and test the input data. Data tests could be evals for LLMs that test whether changes to your prompt template, multi-shot prompts, RAG, or LLM improve or worsen the performance of your AI system. Similarly, data validation tests for classical ML systems prevent garbage in (training data) from producing garbage out (model predictions). There is also the challenge that AI system performance tends to degrade over time due to data drift and model drift. For this, you need to monitor the distributions of inference data and model predictions.
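Below is a sketch of what a data validation test could look like in CI for a classical ML system, using plain pandas in a pytest-style test; the parquet path, column names, and thresholds are illustrative.

```python
# Sketch of a data validation test over feature data, runnable under pytest.
# The parquet path, column names, and thresholds are illustrative.
import pandas as pd


def validate_transactions(df: pd.DataFrame) -> list[str]:
    errors = []
    if df["amount"].lt(0).any():
        errors.append("negative transaction amounts found")
    if df["amount"].isna().mean() > 0.01:
        errors.append("more than 1% of transaction amounts are missing")
    if df.duplicated(subset=["cc_num", "event_time"]).any():
        errors.append("duplicate (cc_num, event_time) rows found")
    return errors


def test_feature_data_is_valid():
    df = pd.read_parquet("transactions.parquet")  # hypothetical feature data
    assert validate_transactions(df) == []
```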
For a real-time AI system (with a model deployment), your versioned model should be tightly coupled to any versioned precomputed feature data (feature group) it uses. It is not enough to just upgrade the version of your model. You need to upgrade the model version in sync with upgrading the version of the feature group used by the online model.
In Figure 4, you can see that when you upgrade the airquality model to v2, you need to connect it to the precomputed features in v2 of the air_quality Feature Group. V1 of the model was connected to v1 of the air_quality Feature Group. The same is true for rolling back a model to a previous version: the rollback needs to be done in sync with the feature group version.
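One lightweight way to enforce this coupling is to pin the feature group version inside each model version's deployment configuration, so upgrades and rollbacks always move both together. The structure below is a hypothetical sketch, not a specific registry API.

```python
# Hypothetical sketch: pin the feature group version to the model version so
# that deployments and rollbacks always move both in sync.
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentConfig:
    model_name: str
    model_version: int
    feature_group: str
    feature_group_version: int


CONFIGS = {
    1: DeploymentConfig("airquality", 1, "air_quality", 1),
    2: DeploymentConfig("airquality", 2, "air_quality", 2),
}


def rollback(to_version: int) -> DeploymentConfig:
    # Rolling back the model also rolls back the feature group it reads from.
    return CONFIGS[to_version]
```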
Reproducibility of training data (often needed for compliance) requires data versioning. For example, consider Figure 5, where we have late-arriving data after Training Dataset v1 was created. Without data versioning, if you re-create Training Dataset v1 at a later point in time using only the event dates of the desired air quality measurements, the late measurements that arrived just after v1 was created will be included in the training data.
Data versioning enables you to re-create the training data exactly as it was at the point in time when it was originally created. Data versioning requires a data layer that knows both the ingestion time and the event time of each data point.
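The sketch below illustrates the difference: filtering on event time alone silently includes the late-arriving measurement, while also filtering on ingestion time recreates the training data exactly as it was when v1 was created. The column names and timestamps are made up.

```python
# Sketch: recreate Training Dataset v1 "as of" its original creation time by
# filtering on ingestion time as well as event time. Columns are illustrative.
import pandas as pd

measurements = pd.DataFrame({
    "event_time":     pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-02"]),
    "ingestion_time": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-07"]),  # last row arrived late
    "pm2_5": [12.0, 15.0, 14.0],
})

dataset_created_at = pd.Timestamp("2025-01-03")
event_cutoff = pd.Timestamp("2025-01-02")

# Event-time filter only: the late-arriving row sneaks into the "recreated" v1.
leaky_v1 = measurements[measurements["event_time"] <= event_cutoff]

# Event-time AND ingestion-time filter: exactly the rows that existed when v1 was created.
faithful_v1 = measurements[
    (measurements["event_time"] <= event_cutoff)
    & (measurements["ingestion_time"] <= dataset_created_at)
]
```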
A real-time AI system uses a model deployment that makes predictions in response to prediction requests. The parameters that are sent by the client to the Model Deployment API are typically not the same as the input parameters to the model (the model signature). In Figure 6, you can see an example of an online inference pipeline for a credit card fraud detection model. You can see here that the Deployment API includes details about the credit card transaction (amount, credit_card_number, ip_address (of the payment provider)). This is the interface between clients and the model deployment. Following the information hiding principle, you can redeploy a new version of the model (even changing its signature) without requiring clients to be rebuilt, as long as the deployment API remains unchanged. In this example, the parameters sent by the client are used to look up precomputed features (1hr_spend, 1day_spend), compute on-demand features (card_present, location), and apply a model-dependent transformation to one of the features (normalizing the amount). The model’s deployment API is decoupled from the API to the model - the model signature.
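A sketch of such an online inference handler is shown below: the request fields are the deployment API, while the feature vector assembled inside the handler matches the model signature. The feature store, geo-IP lookup, and scaler objects are hypothetical stand-ins, as is the rule used to derive card_present.

```python
# Sketch of an online inference handler for the credit card fraud example.
# The request fields (amount, credit_card_number, ip_address) form the
# deployment API; the feature vector passed to model.predict() is the model
# signature. feature_store, geoip, and scaler are hypothetical stand-ins.
import pandas as pd


def handle_prediction_request(request: dict, model, feature_store, scaler, geoip) -> dict:
    # 1. Look up precomputed features by entity ID.
    precomputed = feature_store.get_feature_vector(cc_num=request["credit_card_number"])
    # 2. Compute on-demand features from request parameters.
    #    (Illustrative rule: card-present transactions carry no payment-provider IP.)
    card_present = request.get("ip_address") is None
    location = None if card_present else geoip.lookup(request["ip_address"])
    # 3. Apply model-dependent transformations (e.g., normalize the amount).
    amount_scaled = scaler.transform([[request["amount"]]])[0][0]
    # 4. Assemble the feature vector matching the model signature and predict.
    #    (Untransformed feature values would be logged here for monitoring.)
    features = pd.DataFrame([{
        "amount": amount_scaled,
        "1hr_spend": precomputed["1hr_spend"],
        "1day_spend": precomputed["1day_spend"],
        "card_present": card_present,
        "location": location,
    }])
    return {"is_fraud": bool(model.predict(features)[0])}
```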
You may think that LLM AI systems are exempt from this fallacy, but LLM deployment APIs that use retrieval-augmented generation (RAG) or function calling often take both the prompt text and non-text parameters that are used to retrieve examples that are included in the final encoded prompt. The LLM’s signature is the encoded prompt.
Model prediction can be fast on your laptop but slow in a model deployment. Why is that? When you serve a model behind a network endpoint, you typically have to perform a lot of operations before you finally call model.predict() with the final feature vector(s) as input. You may need to retrieve precomputed features from a feature store or a vector index, create features from request parameters with on-demand transformations, encode/scale/shift feature values with model-dependent transformations, log untransformed feature values, and finally call predict on the model before returning a result. All of these steps add latency to the prediction request, as does the network round trip from the client to the model deployment endpoint, see Figure 7.
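When diagnosing where that latency goes, it helps to time each step inside the deployment. A minimal sketch (the step names mirror the operations above):

```python
# Minimal sketch: time each step of an online inference pipeline to see where
# prediction latency accumulates.
import time
from contextlib import contextmanager

timings: dict[str, float] = {}


@contextmanager
def timed(step: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = (time.perf_counter() - start) * 1000.0  # milliseconds

# Inside the prediction handler (helpers as in the earlier sketch):
# with timed("feature_lookup"):             precomputed = feature_store.get_feature_vector(...)
# with timed("on_demand_transforms"):       location = geoip.lookup(...)
# with timed("model_dependent_transforms"): x = scaler.transform(...)
# with timed("predict"):                    y = model.predict(x)
```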
LLMs need GPUs for inference and fine-tuning. They also need support for scalable compute, scalable storage, and scalable model serving. However, many MLOps platforms support neither GPUs nor scale, and the result is that LLMs are often seen as outside of MLOps - part of a new LLMOps discipline. Yet LLMs still follow the same FTI architecture, see Figure 8. If your MLOps platform supports GPUs and scale, LLMOps is just MLOps with LLMs.
Feature pipelines are used to chunk, clean, and score text for instruction and alignment datasets. They are also used to compute vector embeddings that are stored in a vector index for RAG. Training pipelines are used to fine-tune and align foundation LLMs. Tokenization is a model-dependent transformation that needs to be consistent between training and inference - without platform support, users often slip up by using the wrong version of the tokenizer for their LLM at inference time. Agents and workflows are found in online inference pipelines, as are calls to external systems via RAG and function calling. Your MLOps team should be able to bring the same architecture and tools to bear on LLM systems as it does on batch and real-time AI systems.
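To illustrate, here is a sketch of an LLM feature pipeline (chunking and embedding documents) and the retrieval step of an online RAG inference pipeline. It assumes the sentence-transformers package and uses an in-memory array in place of a real vector index; the embedding model name and chunking strategy are just examples.

```python
# Sketch of an LLM feature pipeline (chunk + embed) and the retrieval step of an
# online RAG inference pipeline. Assumes the sentence-transformers package; the
# in-memory numpy array stands in for a real vector index.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model


def feature_pipeline(documents: list[str], chunk_size: int = 500):
    """Chunk source documents and compute vector embeddings for RAG."""
    chunks = [doc[i:i + chunk_size]
              for doc in documents
              for i in range(0, len(doc), chunk_size)]
    embeddings = embedder.encode(chunks, normalize_embeddings=True)
    return chunks, np.asarray(embeddings)


def retrieve(query: str, chunks: list[str], embeddings: np.ndarray, k: int = 3) -> list[str]:
    """Online inference step: fetch the top-k chunks for the query."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = embeddings @ q  # cosine similarity (embeddings are normalized)
    # The retrieved chunks are included in the final encoded prompt - the LLM's signature.
    return [chunks[i] for i in np.argsort(-scores)[:k]]
```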
The MLOps fallacies presented here are assumptions that architects and designers of AI systems can make that work against the main goals of MLOps - to get to a working AI system as quickly as possible, to tighten the development loop, and to improve system quality through continuous delivery and automated testing and versioning. Falling for the MLOps fallacies results in AI projects either taking longer to reach production or failing to reach production.
Thanks to the following people for reviewing a draft of this post: Raphaël Hoogvliets, Maria Vechtomova, Paul Iusztin, Miguel Otero Pedrido, and Aurimas Griciūnas.