The 10 Fallacies of MLOps

And the Effects of Falling for these Bad Assumptions
February 25, 2025
20 min read
Jim Dowling
CEO and Co-Founder, Hopsworks

TL;DR

MLOps was born from the need to build infrastructural software to support production AI systems. In 2025, 52% of AI systems will not make it to production. One of the reasons is that developers make false assumptions when building AI systems. This article introduces 10 fallacies of MLOps that should help developers understand and identify potential problems ahead of time, enabling them to build better AI systems and get more models in production.

Top 10 Fallacies of MLOps

MLOps is a set of principles and practices that guide developers when building any type of AI system - from batch to real-time to LLM-powered systems. However, the foundations of MLOps are not deep. There are no biblical commandments internalized by the members of its church. Many members of the MLOps church worship false gods - fallacies that cause AI systems to never make it to production. These fallacies are inspired by a more mature computing discipline, distributed systems, which has a core set of pitfalls that developers agree on - the 8 fallacies of distributed computing.

The ten fallacies of MLOps listed below have been informed by my experience building AI systems on Hopsworks used by everyone from Fortune 500 companies to AI-powered startups, teaching a course on MLOps at KTH University, creating the world’s first free MLOps course that built batch and real-time AI systems, and writing a book on Building AI Systems for O’Reilly.

1. Do it all in one ML Pipeline

When you write your first batch AI system, it is possible to write it as a single program that can be parameterized to run in either training or inference mode. This can lead to the false assumption that you can run any AI system as a single ML pipeline. You cannot run a real-time AI system as a single ML pipeline: it consists of at least an offline training pipeline that runs when you train a new version of the model and an online inference pipeline that runs 24/7. This leads to confusion as to what exactly an ML pipeline is. What are its inputs and outputs? Are the data pipelines that create feature data also ML pipelines? After all, they create the features (the inputs to our ML models).

So what should you do? You should decompose your AI system into feature/training/inference pipelines (FTI pipelines) that are connected together to make up your AI system, see Figure 1. Feature pipelines transform data from many different sources into features. Training pipelines take features/labels as input and output a trained model. Inference pipelines take one or more trained models as input and feature data and output predictions. Further decomposition of these pipelines is also possible - generally following the principle that you name the ML pipeline after its output. For example, feature pipelines can be classified as stream processing (streaming) feature pipelines, batch transformation pipelines, feature validation pipelines, and vector embedding pipelines (that transform source data into vector embeddings and store them in a vector index). Similarly, training pipelines can be further decomposed into training dataset creation pipelines (useful for CPU-bound image/video/audio deep learning training pipelines, where you shift-left data transformations to a separate pipeline run on CPUs, not GPUs), model validation pipelines, and model deployment pipelines. Inference pipelines can be decomposed into batch inference pipelines and online inference pipelines.

Figure 1. The feature/training/inference pipeline architecture is a unified architecture for creating batch/real-time/LLM AI systems. The ML pipelines are connected via well-defined APIs to the feature store and model registry.

Reference: https://www.hopsworks.ai/post/mlops-to-ml-systems-with-fti-pipelines 
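As a rough sketch (not tied to any particular platform), the three pipelines can be written as separate programs, each named after its output and connected only through a feature store and a model registry. The feature_store and model_registry objects and their methods below are hypothetical stand-ins for whatever platform you use, and the feature logic is purely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def feature_pipeline(raw_df: pd.DataFrame, feature_store) -> None:
    """Output: reusable feature data, written to the feature store."""
    features = raw_df.assign(amount_log=np.log1p(raw_df["amount"].abs()))
    feature_store.insert("transactions", features)          # hypothetical API


def training_pipeline(feature_store, model_registry) -> None:
    """Output: a trained, versioned model saved to the model registry."""
    X, y = feature_store.get_training_data("transactions", label="is_fraud")  # hypothetical API
    model = LogisticRegression(max_iter=1000).fit(X, y)
    model_registry.save(model, name="fraud_model")           # hypothetical API


def batch_inference_pipeline(feature_store, model_registry) -> None:
    """Output: predictions, written back for downstream consumers."""
    model = model_registry.load("fraud_model", version="latest")
    X = feature_store.get_batch_inference_data("transactions")
    feature_store.insert("fraud_predictions", pd.DataFrame({"fraud_score": model.predict(X)}))
```

Each of these programs can be scheduled, tested, and scaled independently, which is the point of the decomposition.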

2. All Data Transformations for AI are Created Equal

In a real-time AI system, a client issues a prediction request with some parameters. A model deployment receives the prediction request and can use any provided entity ID(s) to retrieve precomputed features for that entity. Precomputing features reduces online prediction latency by removing the need to compute them at prediction time. However, some features require data only available as part of the prediction request, and need to be computed at request time. If we precompute features, we would like them to be reusable across different models. But my decision tree model doesn’t need to scale the numerical feature, while my deep learning model needs it to be zero-centered and normalized. Similarly, my CatBoost model can take the categorical string as input, but XGBoost requires me to encode the string before inputting it to the model. 

Figure 2. The Data Transformation Taxonomy for AI Systems.

There is a data transformation taxonomy for AI that has three different types of data transformation:

  • Model-independent transformations create reusable feature data; they are the transformations found in feature pipelines.
  • Model-dependent transformations are parameterized by the training dataset or are otherwise model-specific. They are performed in both the training pipeline and the batch/online inference pipelines.
  • On-demand (real-time) transformations create features using request-time data. They are performed in both feature pipelines (when backfilling with historical data) and online inference pipelines (on request-time data).

For model-dependent and on-demand transformations, any difference between the offline and online implementations introduces skew that can cause subtle but devastating bugs.

Model-independent transformations are the same as those found in data pipelines (extract-transform-load (ETL) pipelines). However, if you want to support real-time AI systems, you need to support on-demand transformations. They enable both real-time feature computation and offline feature computation over historical data - to backfill feature data in feature pipelines. If you want to support feature reuse, you need model-dependent transformations, delaying the scaling/encoding of feature data until it is used by a specific model. If you don’t have explicit support for all three types of transformation, you will not be able to log untransformed feature data in your inference pipelines. For example, Databricks only supports two of the three transformations, and its inference tables store the inputs to models - the scaled/encoded feature data. That makes it very hard to monitor and debug your features and predictions. For example, if you are predicting credit card fraud and the scaled transaction amount is 0.78, there is no real-world interpretation of that value. What’s more, model monitoring frameworks like NannyML work best with untransformed feature data (from the original feature space). To enable observability for AI systems, untangle your data transformations by following the data transformation taxonomy.
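To make the taxonomy concrete, here is a small sketch of one transformation of each type for a card-fraud feature set. All column names and the feature logic are illustrative:

```python
import numpy as np
import pandas as pd


# 1. Model-independent: runs in the feature pipeline; its output is reusable by any model.
def one_day_spend(transactions: pd.DataFrame) -> pd.DataFrame:
    return (transactions
            .groupby("cc_num")
            .agg(one_day_spend=("amount", "sum"))
            .reset_index())


# 2. Model-dependent: parameterized by the training dataset (mean/std); the same code is
#    applied in the training pipeline and in the online inference pipeline.
def normalize_amount(amount: float, train_mean: float, train_std: float) -> float:
    return (amount - train_mean) / train_std


# 3. On-demand: needs request-time data (the transaction's coordinates); the same function
#    is reused to backfill from historical request logs, which prevents skew.
def km_from_home(lat: float, lon: float, home_lat: float, home_lon: float) -> float:
    dlat, dlon = np.radians(lat - home_lat), np.radians(lon - home_lon)
    a = (np.sin(dlat / 2) ** 2
         + np.cos(np.radians(home_lat)) * np.cos(np.radians(lat)) * np.sin(dlon / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))   # haversine distance in km
```

Note that only the first function writes reusable data; the other two must run with identical logic offline and online.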

3. There is no need for a Feature Store

The feature store is the data layer that connects the feature pipelines and the model training/inference pipelines. It is possible to build a batch AI system without a feature store if you do not care about reusing features, and you are willing to implement your own solutions for governance, lineage, feature/prediction logging, and monitoring. However, if you are working with time-series data, you will also have to roll your own solution for creating point-in-time correct training data from your tables. If you are building a real-time AI system, you will need a feature store (or build one yourself) to provide precomputed features (as context/history) for online models. The feature store also ensures there is no skew between your model-dependent and on-demand transformations, see Figure 3. It also helps you backfill feature data from historical data sources.

Figure 3. A feature store manages mutable data and data transformations to support the creation of versioned training data, batch inference data, and online inference data.

In short, without a feature store you may be able to roll out your first batch AI system, albeit without any platform for collaboration, governance, or reuse of features, but your velocity for each additional batch model will not improve. Building batch AI systems without a feature store is akin to doing analytics without a data warehouse: it can work, but it won’t scale. For real-time AI systems, you will need a feature store to provide history/context to online models and the infrastructure for ensuring correct, governed, and observable features.
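If you do roll your own, the core of point-in-time correct training data is an ASOF join: for each label event, take the latest feature values known before that event, never after (which would leak future information into training). A minimal pandas sketch with made-up data:

```python
import pandas as pd

labels = pd.DataFrame({
    "cc_num": [1, 1, 2],
    "event_time": pd.to_datetime(["2025-01-10", "2025-01-20", "2025-01-15"]),
    "is_fraud": [0, 1, 0],
}).sort_values("event_time")

features = pd.DataFrame({
    "cc_num": [1, 1, 2],
    "event_time": pd.to_datetime(["2025-01-05", "2025-01-18", "2025-01-14"]),
    "one_day_spend": [120.0, 950.0, 80.0],
}).sort_values("event_time")

# For each label row, join the most recent feature row with an earlier event_time,
# per credit card - never a feature value from the future.
training_df = pd.merge_asof(labels, features, on="event_time", by="cc_num",
                            direction="backward")
```

A feature store does this join (and keeps the online store in sync) for you, across many feature tables.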

4. Experiment Tracking is part of MLOps

Many teams erroneously believe that the starting point for building AI systems is installing an experiment tracking service. Making experiment tracking a prerequisite will slow you down in getting to your first minimal viable AI system. Experiment tracking is premature optimization in MLOps. For operational needs, such as model storage, governance, model performance/bias evaluation, and model cards, you should use a model registry. Experiment tracking is primarily for research. However, like the monkeys in the ladder experiment, who continue to beat up any monkey that tries to climb the ladder (even though they no longer know why they do it), many ML engineers still believe the starting point in MLOps is to install an experiment tracking service.

5. MLOps is just DevOps for ML

DevOps is a software development process where you write unit, integration, and system tests for your software, and whenever you change your source code you automatically execute those tests using a continuous integration/continuous deployment (CI/CD) process. This typically involves a developer pushing source code changes to a repository, which triggers automated tests on a CI/CD service that checks out the source code onto containers, compiles/builds the code, runs the tests, packages the binaries, and deploys them if all the tests pass.

MLOps, however, is more than DevOps. In addition to the automated testing of the source code for your machine learning pipelines, you also need to version and test the input data. Data tests could be evals for LLMs that check whether changes in your prompt template, multi-shot prompts, RAG, or LLM improve or worsen the performance of your AI system. Similarly, data validation tests for classical ML systems prevent garbage in (training data) from producing garbage out (model predictions). There is also the challenge that AI system performance tends to degrade over time due to data drift and model drift, so you need to monitor the distribution of inference data and model predictions.
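As a minimal sketch of what a data test in CI can look like (the column names, thresholds, and fixture path are all illustrative):

```python
import pandas as pd


def validate_transactions(df: pd.DataFrame) -> list[str]:
    """Return a list of data quality problems; an empty list means the batch is clean."""
    errors = []
    if df["amount"].lt(0).any():
        errors.append("negative transaction amounts")
    if df["cc_num"].isna().mean() > 0.01:
        errors.append("more than 1% of rows are missing cc_num")
    if not df["event_time"].is_monotonic_increasing:
        errors.append("rows are not ordered by event_time")
    return errors


def test_transactions_are_clean():
    # The fixture path is hypothetical; in CI this would be a representative data sample.
    df = pd.read_parquet("tests/fixtures/transactions_sample.parquet")
    assert validate_transactions(df) == []
```

The same validation function can then be run against every incoming batch in the feature pipeline, not just in CI.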

6. Versioning Models is enough for Safe Upgrade/Rollbacks

For a real-time AI system (with a model deployment), your versioned model should be tightly coupled to any versioned precomputed feature data (feature group) it uses. It is not enough to just upgrade the version of your model. You need to upgrade the model version in sync with upgrading the version of the feature group used by the online model.

In Figure 4, you can see that when you upgrade the airquality model to v2, you need to connect it to the precomputed features in v2 of the air_quality Feature Group. V1 of the model was connected to v1 of the air_quality Feature Group. The same is true for rolling back a model to a previous version: the rollback needs to be done in sync with the feature group version.

Figure 4. Deployed models are versioned, and each model version is tightly coupled to the versioned feature groups it uses.
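One way to keep the two versions in sync is to store the feature group version as model metadata and resolve it at deployment time. The model registry and feature store calls below are hypothetical sketches of that idea:

```python
def register_model(model_registry, model, feature_group_version: int):
    """Save the model together with the feature group version it was trained against."""
    return model_registry.save(                                   # hypothetical API
        model,
        name="air_quality",
        metadata={"feature_group": "air_quality",
                  "feature_group_version": feature_group_version},
    )


def deploy_model(model_registry, feature_store, model_version: int):
    """Resolve the feature group version from the model's metadata at deployment time."""
    model, metadata = model_registry.load("air_quality", version=model_version)   # hypothetical API
    feature_group = feature_store.get_feature_group(
        metadata["feature_group"], version=metadata["feature_group_version"])
    return model, feature_group   # upgrades and rollbacks now move both versions together
```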

7. There is no need for Data Versioning

Reproducibility of training data (often needed for compliance) requires data versioning. For example, consider Figure 5, where data arrives late, after Training Dataset v1 was created. Without data versioning, if you re-create training dataset v1 at a later point in time using only the event-time (date) of the desired air quality measurements, the late measurements that arrived just after v1 was created will erroneously be included in the training data.

Figure 5. Exactly re-creating Training Dataset v1 at a later point in time requires data versioning; otherwise, late-arriving data will erroneously be included when training datasets are re-created using only event time (not ingestion time).

Data versioning enables you to re-create the training data exactly as it was at the point-in-time when it was originally created. Data versioning requires a data layer that knows about the ingestion time for data points and the event-time of data points.
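A small pandas sketch (with made-up measurements) of the difference between filtering on event time alone and filtering on both event time and ingestion time:

```python
import pandas as pd

measurements = pd.DataFrame({
    "event_time":     pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-02"]),
    "ingestion_time": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-05"]),  # last row arrived late
    "pm2_5": [12.0, 40.0, 38.0],
})

v1_created_at = pd.Timestamp("2025-01-03")
v1_event_cutoff = pd.Timestamp("2025-01-02")

# Event time only: the late-arriving row sneaks in (3 rows, not the 2 that were in v1).
naive_recreation = measurements[measurements["event_time"] <= v1_event_cutoff]

# Event time AND ingestion time: exactly the rows that existed when v1 was created.
exact_recreation = measurements[(measurements["event_time"] <= v1_event_cutoff)
                                & (measurements["ingestion_time"] <= v1_created_at)]
```

Table formats with time-travel support (and feature stores built on them) apply this ingestion-time filter for you.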

8. The Model Signature is the API for Model Deployments

A real-time AI system uses a model deployment that makes predictions in response to prediction requests. The parameters sent by the client to the model deployment API are typically not the same as the input parameters to the model (the model signature). In Figure 6, you can see an example of an online inference pipeline for a credit card fraud detection model. The deployment API includes details about the credit card transaction (amount, credit_card_number, ip_address (of the payment provider)). This is the interface between clients and the model deployment. Following the information hiding principle, you can redeploy a new version of the model (even one with a different signature) without requiring clients to be rebuilt, as long as the deployment API remains unchanged. In this example, the parameters sent by the client are used to look up precomputed features (1hr_spend, 1day_spend), compute on-demand features (card_present, location), and apply a model-dependent transformation to one of the features (normalizing the amount). The model’s deployment API is decoupled from the API to the model - the model signature.

Figure 6. Model deployments do not just make predictions on models. They also build feature vectors from precomputed features (1hr_spend, 1day_spend), passed features (amount), and on-demand features (card_present, location), and encode/normalize/scale those features before calling predict on the model and logging the untransformed features and predictions.
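A sketch of what such a deployment handler can look like for the fraud example. The feature_store, logger, and train_stats objects are hypothetical, and the on-demand logic is purely illustrative:

```python
def handle_prediction_request(request: dict, model, feature_store, train_stats: dict, logger) -> dict:
    # 1. Look up precomputed features using the entity id from the request.
    precomputed = feature_store.get_feature_vector(               # hypothetical API
        "cc_features", key=request["credit_card_number"])

    # 2. Compute on-demand features from request-time data (illustrative logic).
    card_present = not request["ip_address"].startswith("10.")

    untransformed = {**precomputed, "amount": request["amount"], "card_present": card_present}
    logger.log_features(untransformed)        # log in the original (interpretable) feature space

    # 3. Apply model-dependent transformations just before predict().
    features = dict(untransformed)
    features["amount"] = (features["amount"] - train_stats["mean_amount"]) / train_stats["std_amount"]

    # 4. Only here does the model signature begin: an ordered feature vector.
    #    (A real system resolves the feature order from the model signature.)
    prediction = model.predict([list(features.values())])[0]
    logger.log_prediction(prediction)
    return {"fraud_score": float(prediction)}
```

The request schema (step 1) is the API clients depend on; the feature vector in step 4 is the model signature, and the two can evolve independently.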

You may think that LLM AI systems are exempt from this fallacy, but LLM deployment APIs that use retrieval augmented generation (RAG) or function calling often take both the prompt text and non-text parameters that are used to retrieve content that is included in the final encoded prompt. The LLM’s signature is the encoded prompt.

9. Prediction Latency is the Time taken for the Model Prediction

Model prediction can be fast on your laptop but slow in a deployed model. Why is that? When you serve a model behind a network endpoint, you typically have to perform many operations before you finally call model.predict() with the final feature vector(s) as input. You may need to retrieve precomputed features from a feature store or a vector index, create features from request parameters with on-demand transformations, encode/scale/shift feature values with model-dependent transformations, log untransformed feature values, and finally call predict on the model before returning a result. All of these steps add latency to the prediction request, as does the network latency between the client and the model deployment endpoint, see Figure 7.

Figure 7. Prediction request latency is the sum of all operations in your online inference pipeline, from building the feature vector to calling model.predict() and logging the features/prediction.
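A sketch of instrumenting each step so you can see where the request latency actually goes (the feature store, transform, and logger objects are hypothetical, mirroring the handler above):

```python
import time


def timed(timings: dict, step: str, fn, *args, **kwargs):
    """Run one step of the inference pipeline and record its latency in milliseconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    timings[step] = (time.perf_counter() - start) * 1000.0
    return result


def predict_with_timings(request: dict, model, feature_store, transform, logger):
    timings: dict[str, float] = {}
    features = timed(timings, "feature_lookup", feature_store.get_feature_vector,
                     "cc_features", key=request["credit_card_number"])   # hypothetical API
    vector = timed(timings, "model_dependent_transforms", transform, features)
    prediction = timed(timings, "model_predict", model.predict, [vector])
    timed(timings, "feature_and_prediction_logging", logger.log, features, prediction)
    return prediction, timings   # model_predict is often a small fraction of the total
```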

10. LLMOps is not MLOps

LLMs need GPUs for inference and fine-tuning. Similarly, LLMs need support for scalable compute, scalable storage, and scalable model serving. However, many MLOps platforms do not support either GPUs or scale, and the result is that LLMs are often seen as outside of MLOps, part of a new LLMOps discipline. However, LLMs still follow the same FTI architecture, see Figure 8. If your MLOps platform supports GPUs and scale, LLMOps is just MLOps with LLMs.

Figure 8. The FTI architecture applied to LLM systems. The only real change is syntactic - online inference pipelines are now called agents. 

Feature pipelines are used to chunk, clean, and score text for instruction and alignment datasets. They are also used to compute vector embeddings that are stored in a vector index for RAG. Training pipelines are used to fine-tune and align foundation LLMs. Tokenization is a model-dependent transformation that needs to be consistent between training and inference - without platform support, users often slip up using the wrong version of the tokenizer for their LLM in inference. Agents and workflows are found in online inference pipelines, as are calls to external systems with RAG and function calling. Your MLOps team should be able to bring the same architecture and tools to bear on LLM systems as it does with batch and real-time AI systems.
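One way to avoid tokenizer skew is to save the tokenizer alongside the fine-tuned model and always load both from the same artifact. A sketch using the Hugging Face transformers API (the artifact directory is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer


def save_fine_tuned(model, tokenizer, artifact_dir: str) -> None:
    """Ship the exact tokenizer together with the fine-tuned model artifact."""
    model.save_pretrained(artifact_dir)
    tokenizer.save_pretrained(artifact_dir)


def load_for_inference(artifact_dir: str):
    """Load both from the same artifact so inference never uses a mismatched tokenizer."""
    model = AutoModelForCausalLM.from_pretrained(artifact_dir)
    tokenizer = AutoTokenizer.from_pretrained(artifact_dir)
    return model, tokenizer
```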

The Effects of the Fallacies

  1. Without a unified architecture for building AI systems, every new batch or real-time AI system you build will be like starting from scratch. This makes it difficult for developers to transition from building one type of AI system to another. Without a clear architecture, data scientists only learn to fit training data to models, not how to create features from non-static data sources or how to build inference data for predictions. (“Data is not static” was another fallacy we considered, but it was deemed a bit too obvious for inclusion.)
  2. Good luck building an observable AI system (one that logs untransformed feature data) when you have tangled data transformations. If you don’t untangle your data transformations into model-independent, model-dependent, and on-demand transformations, you will struggle to log and monitor interpretable features.
  3. You will end up building your own AI platform and spending most of your time figuring out how to manage mutable data, how to create point-in-time correct training data, and how to synchronize data in columnar data stores with low-latency row-oriented stores for online inference. You will use fewer features in your online models, because of the pain of making them available as precomputed features. The cost to build and deploy every new model will remain high and will not fall over time as fast as it would with a feature store.
  4. Developers will not know the API to a model deployment. Some developers will think it is the same as the ordered set of input data types for the model - the model signature. You will not have an explicit schema for accessing the deployment, and mistakes will be made when using the model deployment and whenever maintenance or upgrades occur.
  5. The data layer for your AI system may get contaminated with bad data if you do not validate ingested data. Your AI system’s performance may degrade over time due to a lack of feature monitoring and model performance monitoring.
  6. You will show your boss loss curves for model training to demonstrate your progress. But your boss will know that experiment tracking sidetracks you from real MLOps - building your minimal viable AI system. Experiment tracking should be the last capability you add, not the first. Focus on adding value first through data-centric AI, and only progress to model-centric AI (and experiment tracking) when you have to.
  7. If you do not couple model versions with feature data versions, you can introduce subtle bugs. For example, if your new model uses the old feature group version, but the new feature group version is schema-compatible with the previous one, the system will appear to work as before. However, as the implementation of one or more features differs, the model’s performance will suffer, and it will be a hard bug to find.
  8. AI regulation is coming, and the source of bias in many models is their training data. If you cannot provide lineage for training datasets and you cannot recreate them from their source data, you may be in legal jeopardy. If you rely only on the event time of your data when creating training datasets, you will not be able to recreate training data or batch inference data as of the point in time it was originally created. Your data storage system needs support for time-travel using ingestion time to support reproducible training datasets.
  9. You cannot assume that prediction latency for network-hosted models is only the time taken for the model prediction. You have to include the time for all pre-processing (building feature vectors, RAG, etc.) and post-processing (feature/prediction logging).
  10. You may duplicate your AI infrastructure by supporting a separate LLMOps stack alongside your MLOps stack. For example, most feature stores now support vector indexes - you do not need a separate vector DB for RAG. Similarly, LLMs and deep learning models both require GPUs, and they should be managed in the same platform (to improve GPU utilization levels). Finally, developers should be able to easily transition from batch/real-time AI systems to LLM AI systems - if you follow the FTI architecture.

Summary

The MLOps fallacies presented here are assumptions that architects and designers of AI systems can make that work against the main goals of MLOps - to get to a working AI system as quickly as possible, to tighten the development loop, and to improve system quality through continuous delivery and automated testing and versioning. Falling for the MLOps fallacies results in AI projects either taking longer to reach production or failing to reach production.

Thanks to the following people for reviewing a draft of this post: Raphaël Hoogvliets, Maria Vechtemova, Paul Izustin, Miguel Otero Pedrido, and Aurimas Griciūnas.
