This blog guides you through the practical process of creating embeddings and storing them efficiently in the Hopsworks Feature Store, and explores their significance and diverse applications in data-driven decision making.
In the rapidly expanding landscape of data-driven decision making, embeddings have emerged as one of the most powerful tools. By representing data as points in a high-dimensional vector space, they capture semantic relationships that can be exploited across a wide range of machine learning applications.
We begin this blog by discussing the profound implications of embeddings in machine learning, their diverse applications, and their crucial role in reshaping the way we interact with data. Once the embeddings are created, the next step is to store them in a vector database for efficient retrieval, so we discuss what a vector database is and how it compares with traditional databases. Lastly, we delve into the process of using LangChain for data loading and chunking, harnessing OpenAI models to create embeddings, and seamlessly storing them in the Hopsworks Feature Store.
The machine learning models we use every day operate on numerical data; they cannot consume raw text directly. This means that, to perform an NLP task, we first need to convert the textual information into a vector of numbers that can be given to the model as input. These numerical vectors capture the semantic and contextual information in the text, enabling models to learn meaningful relationships and patterns. Beyond text, embeddings can also encode other unstructured data such as images, audio, categorical variables, and numerical values.
Before the advent of modern embedding models, methods such as one-hot encoding, Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Latent Semantic Analysis (LSA) were used to convert text into numerical representations. Researchers also built custom embeddings for domain-specific data, but these proved too complex to create at scale. These earlier methods had limitations in capturing the nuanced semantics and contextual information present in natural language.
Then came the era of neural network-based embeddings, where techniques such as Word2Vec, GloVe, FastText, ELMo, and BERT gained prominence. These embeddings played a significant role in advancing NLP tasks by capturing semantic information and context, enabling models to understand and generate human-like text.
Embeddings have applications across a wide range of machine learning tasks. In this section, we explore some of the applications where they play a pivotal role.
A vector database is a specialized type of database designed to efficiently store, manage, and query high-dimensional vector data, such as embeddings, feature vectors, and other numerical representations. Unlike traditional relational databases that primarily handle structured data, vector databases focus on unstructured or semi-structured data represented as vectors in multi-dimensional spaces.
Querying a vector database is very different from querying a traditional database. In a traditional database, the query is matched exactly against stored values. In a vector database, a similarity metric such as cosine similarity is applied to find the vectors most similar to the query, as sketched below. To scale this beyond brute-force comparison, vector databases use algorithms such as Random Projection, Product Quantization, and Locality-Sensitive Hashing to perform Approximate Nearest Neighbour (ANN) search.
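To make the contrast concrete, here is a minimal sketch of similarity-based lookup, assuming only NumPy; it ranks a toy set of stored vectors against a query exhaustively (a real vector database would use an ANN index instead of scanning everything):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "database" of three stored embeddings and one query vector.
stored = np.array([[0.1, 0.9, 0.2],
                   [0.8, 0.1, 0.3],
                   [0.2, 0.7, 0.4]])
query = np.array([0.15, 0.85, 0.25])

# Rank stored vectors by similarity to the query (exhaustive scan, not ANN).
scores = [cosine_similarity(query, v) for v in stored]
best = int(np.argmax(scores))
print(f"Most similar vector: index {best}, score {scores[best]:.3f}")
```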
Another aspect which differentiates the vector database from a traditional database is the concept of indexes. In a vector database, an "index" refers to a data structure that is used to optimize the retrieval of high-dimensional vectors or embeddings.
High-dimensional vector data, such as embeddings or feature vectors, can be challenging to search through efficiently without indexing. The purpose of indexing is to reduce the search space and accelerate the process of finding the nearest neighbors or matching vectors to a query vector. An index structure in a vector database typically stores a subset of the dataset's vectors and organizes them in a way that enables fast similarity search.
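As a rough illustration of what an index gives you, here is a hedged sketch using the FAISS library (mentioned later as one of the engines behind OpenSearch's k-NN plugin) as a stand-in; the `faiss-cpu` package and the random data are assumptions for illustration only, and `IndexFlatL2` is an exact index, with IVF or HNSW variants trading accuracy for speed:

```python
import numpy as np
import faiss

dim = 128                                  # embedding dimensionality
vectors = np.random.rand(10_000, dim).astype("float32")

index = faiss.IndexFlatL2(dim)             # exact L2 index; ANN variants exist for larger datasets
index.add(vectors)                         # store the dataset vectors in the index

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)    # retrieve the 5 nearest neighbors
print(ids[0], distances[0])
```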
Now we will explore the steps involved in generating embeddings and saving them as features in the Hopsworks Feature Store. This procedure encompasses document retrieval, chunking, embedding generation, and storage in the Feature Store. For document retrieval and chunking we rely on the LangChain library, while the OpenAI embedding model is used to generate embeddings for the chunks.
Langchain:
LangChain is a framework designed to simplify the creation of applications using large language models. Chatbots, question answering systems, and summarization tools are some of the use cases LangChain supports.
OpenAI Embeddings:
OpenAI's text embedding model "text-embedding-ada-002" outperforms OpenAI's earlier embedding models on text search, code search, and sentence similarity tasks, and achieves comparable performance on text classification.
Hopsworks:
Hopsworks includes OpenSearch as a multi-tenant service in projects. OpenSearch provides vector database capabilities through its k-NN plugin, which supports the FAISS and nmslib embedding indexes. Through Hopsworks, OpenSearch also provides enterprise capabilities, including authentication and access control to indexes (an index can be private to a Hopsworks project), filtering, scalability, high availability, and disaster recovery support.
The dataset used for this blog consists of Wikipedia articles on the elements hydrogen, helium, and lithium. These files are stored in text format in the Elements directory. You can access the code repository here, where you'll find a hands-on example applying the concepts discussed in this blog. Dive in and start turning theory into practice!
If we want our application to answer questions or make recommendations based on custom data, or data the model has not been trained on, we need to connect that external data source to the LLM. The first step is therefore to load data from external sources, which may come in different formats, into a standard one.
Document loaders load data from a source, which can be a single document or a folder containing multiple documents. LangChain's document loaders handle multiple formats such as CSV, JSON, PDF, and plain text files. In our case we use the DirectoryLoader to load the directory called "Elements", which holds our data as text files, as shown below.
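A minimal sketch of this step, assuming the classic `langchain` import paths (newer releases move the loaders into `langchain_community`):

```python
from langchain.document_loaders import DirectoryLoader, TextLoader

# Load every .txt file under the "Elements" directory as a LangChain Document.
loader = DirectoryLoader("Elements", glob="**/*.txt", loader_cls=TextLoader)
documents = loader.load()

print(f"Loaded {len(documents)} documents")
```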
Once the documents are loaded, we move on to document splitting (chunking). Chunking is required before embedding primarily due to size limitations and the need to preserve contextual information. Language models often have token limits, so breaking the text into smaller chunks ensures it fits within these limits. It is important to split the documents into semantically relevant chunks so that downstream tasks work well.
The input text is split based on a defined chunk size with a defined chunk overlap. The chunk size is the maximum size of each chunk, measured by a length function (typically character count), and the chunk overlap ensures continuity between consecutive chunks. The values of these parameters are determined through experimentation and depend mainly on the data and the task at hand.
The recommended TextSplitter is the RecursiveCharacterTextSplitter. It splits documents recursively by different separators, starting with "\n\n", then "\n", then " ". This is useful because it tries to keep semantically related content together for as long as possible.
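A sketch of the chunking step, continuing from the loaded documents above; the chunk_size and chunk_overlap values here are illustrative, not prescribed by the original post:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # maximum characters per chunk (measured by the default len())
    chunk_overlap=50,   # characters shared between consecutive chunks to preserve context
)
chunks = splitter.split_documents(documents)

print(f"Created {len(chunks)} chunks from {len(documents)} documents")
```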
So far we have loaded the documents and converted them into meaningful chunks; now we have to create embeddings from them so that our machine learning models can make sense of the text. Several providers offer embedding models, including OpenAI, Cohere, and Hugging Face. For our use case, we have chosen OpenAI's text-embedding-ada-002 model, one of the strongest general-purpose embedding models available.
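A minimal sketch of embedding the chunks with text-embedding-ada-002 via LangChain, assuming an OPENAI_API_KEY is available in the environment:

```python
from langchain.embeddings import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002")

# One 1536-dimensional vector per chunk.
texts = [chunk.page_content for chunk in chunks]
vectors = embedding_model.embed_documents(texts)

print(len(vectors), len(vectors[0]))   # (number of chunks, 1536)
```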
Once we have the embeddings, we need to store them in a vector database to enable fast retrieval and efficient querying. We use Hopsworks as the feature store for the embeddings. It also supports finding the k-nearest neighbors of a query point through the OpenSearch k-NN plugin.
We store each embedding along with its chunk in a newly created feature group, as sketched below.
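A hedged sketch of writing the chunks and embeddings to a Hopsworks feature group; the feature group name and column names here are illustrative rather than taken from the original example:

```python
import pandas as pd
import hopsworks

project = hopsworks.login()          # reads or prompts for the Hopsworks API key
fs = project.get_feature_store()

# One row per chunk: an id, the chunk text, and its embedding vector.
df = pd.DataFrame({
    "chunk_id": range(len(chunks)),
    "chunk_text": [c.page_content for c in chunks],
    "embedding": vectors,
})

fg = fs.get_or_create_feature_group(
    name="element_embeddings",       # hypothetical feature group name
    version=1,
    primary_key=["chunk_id"],
    description="Wikipedia chunks and their OpenAI embeddings",
)
fg.insert(df)
```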
In the next phase of your journey, you can dive deeper into the world of recommendation systems. With Hopsworks, you can build Personalized Search Retrieval and Ranking systems. This repository contains notebooks for exploring the creation and retrieval of Embedding Features for a recommendation system use case.
In this blog we walked through the essential steps for creating embeddings and storing them as features in the feature store using tools like LangChain and OpenAI. We also discussed the significance of embeddings in machine learning, their applications, and the need for vector databases to store them, and examined how vector databases differ from traditional ones. Embeddings have proven remarkably effective at capturing semantic content, and their integration with feature stores opens new avenues for data-driven decision making.