What are Embeddings and Vector Databases?

Community Article Published August 20, 2024

Embeddings are numerical representations of any kind of information. They let us measure similarity and power fast search, classification, and recommendations. Imagine a digital library with a vast collection (our dataset). Each book is represented by coordinates: a numerical vector that captures the essence of the book's content, genre, style, and other features, giving every book a unique "digital fingerprint". When a user searches for a book, they provide a search prompt. The library's search system converts this prompt into vector coordinates using the same embedding method it used for all the books, then searches the library's database for the book vectors most similar to the prompt vector. The books with the closest matching coordinates are returned to the user as search results.

Another simple use case is looking for a synonym of a word: embeddings can help you find similar or "close" words, but they can do much more than that. Semantic search of this kind is a very effective way to quickly find information related to your prompt, and it is one of the core techniques behind modern search engines such as Google.
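To make the library analogy concrete, here is a minimal sketch of such a semantic search in Python. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 model purely for illustration (neither is prescribed in this article), and the book descriptions are made up:

```python
# A minimal semantic-search sketch (assumes the `sentence-transformers` package;
# the model name is just one common choice, not a requirement).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# A tiny "library" of made-up book descriptions.
books = [
    "A witty romance about first impressions and social class in Regency England.",
    "A space-opera epic following a rebellion against a galactic empire.",
    "A detective novel set in foggy Victorian London.",
]
prompt = "a classic love story about misjudging people at first sight"

# Encode the library and the user prompt with the same embedding model.
book_vectors = model.encode(books, convert_to_tensor=True)
prompt_vector = model.encode(prompt, convert_to_tensor=True)

# Cosine similarity: the closest vector is the recommended book.
scores = util.cos_sim(prompt_vector, book_vectors)[0]
best = scores.argmax().item()
print(f"Best match: {books[best]} (score={scores[best].item():.2f})")
```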

The classic novel "Pride and Prejudice" by Jane Austen is known under a different name in some countries: it is called "First Impressions" in some translations and adaptations. Despite the different titles and languages, embedding both in a vector database would reveal their close semantic relationship, placing them near each other in the vector space. How do embedding models achieve this? They are trained on large datasets in which such correlations appear, including the link between "Pride and Prejudice" and "First Impressions"; if a model has never seen a particular pair during training, it will be less accurate at finding the correlation.

Let me give you another example, best understood by comparing how humans look at data with how computers do. Imagine you are looking on a map for cities near Chicago, IL. If the computer knows Chicago's coordinates, roughly (41.8781° N, 87.6298° W), then to find a nearby city it doesn't need a map at all, just the list of coordinates of all other cities. Among them, the point (41.8456° N, 87.7539° W) is the closest: Cicero, IL. For a computer, the task is now a purely mathematical problem; notice how close the latitude and longitude numbers are. Now we can add an additional "dimension", such as city size by population, and if the user asks for the closest city to Chicago of a similar size, the answer to the same prompt may change. We can keep adding dimensions: computers can find similarities between TV comedies, clothes, or many other kinds of information with the same approach. In more scientific language, this is "placing semantically similar inputs close together in the embedding space", and this coordinate space is also referred to as the latent space.
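Here is a toy sketch of that geographic intuition with NumPy: first find the nearest city by coordinates alone, then add a population dimension and see how the answer can change. Coordinates and populations are approximate and purely illustrative:

```python
import numpy as np

# Approximate (illustrative) latitude/longitude pairs for cities near Chicago.
cities = {
    "Cicero, IL":     np.array([41.8456, -87.7539]),
    "Evanston, IL":   np.array([42.0451, -87.6877]),
    "Naperville, IL": np.array([41.7508, -88.1535]),
}
chicago = np.array([41.8781, -87.6298])

# "Closest" is simply the smallest Euclidean distance between coordinate vectors.
nearest = min(cities, key=lambda name: np.linalg.norm(cities[name] - chicago))
print("Nearest to Chicago:", nearest)  # Cicero, IL

# Add a third dimension (log-scaled population, rough illustrative numbers) and
# the same nearest-neighbour question can give a different answer.
populations = {"Cicero, IL": 85_000, "Evanston, IL": 78_000, "Naperville, IL": 150_000}
chicago_3d = np.append(chicago, np.log10(2_700_000))
cities_3d = {n: np.append(v, np.log10(populations[n])) for n, v in cities.items()}
nearest_3d = min(cities_3d, key=lambda n: np.linalg.norm(cities_3d[n] - chicago_3d))
print("Nearest with population dimension:", nearest_3d)
```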


Embeddings are also a powerful tool for enriching a user prompt with relevant information: the prompt is placed into the categories it belongs to, and similar information is found in other sources via those shared categories. A good example is daily news that our model is not yet aware of. Instead of baking this new information into the model every day, we simply retrieve the news from other sources and pass the closest, most relevant pieces as additional context alongside the original user prompt.
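Here is a minimal sketch of that prompt-enrichment idea, again assuming sentence-transformers; the news snippets are hypothetical placeholders and the final call to the LLM is left out:

```python
# Sketch of prompt enrichment: retrieve the most relevant fresh snippets and
# prepend them to the user's question before calling the model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical fresh news snippets the model never saw during training.
news = [
    "Hypothetical headline A about today's central-bank rate decision.",
    "Hypothetical headline B about a local sports result.",
    "Hypothetical headline C about a new open-source model release.",
]
question = "What happened with interest rates today?"

# Rank the snippets by similarity to the question and keep the top two.
news_vecs = model.encode(news, convert_to_tensor=True)
q_vec = model.encode(question, convert_to_tensor=True)
scores = util.cos_sim(q_vec, news_vecs)[0]
top = scores.argsort(descending=True)[:2].tolist()

# Prepend the retrieved snippets as extra context for the original prompt.
context = "\n".join(news[i] for i in top)
augmented_prompt = f"Context:\n{context}\n\nQuestion: {question}"
print(augmented_prompt)  # this enriched prompt would then be sent to the LLM
```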

Why do we need to encode our dataset as embeddings, convert the user prompt into embeddings as well, and then search over vectors instead of simply searching the original dataset for the text of the prompt? Because this representation is fast to process and makes the relationships between pieces of information easy for computers to capture. In other words, embeddings that are numerically similar correspond to text that is semantically similar.

In the first phase of preparing our RAG application, the information in the entire dataset is split into overlapping chunks, encoded as numerical representations, and stored in a database (called a vector DB), so that later, in the second phase, we can quickly retrieve a smaller portion of relevant information as additional context for the user prompt. During the first phase, the embedding model encodes text from our dataset into an index of vectors, and both the chunks and their vectors are stored in the vector database. During the second phase, at application runtime, the user prompt is encoded with the same embedding model, and the resulting query vector is used to search the vector DB and retrieve chunks of text, much like a search engine does. Because the documents and the query are encoded independently by the same encoder, such models are called bi-encoders. The embedding model used for this encoding is typically much smaller than an LLM. The beauty of searching over embedding similarities stored in a vector DB is that you don't need to know your data or define any schema to make it work. Today, many widely used embedding models are flavors of BERT-style transformer encoders.
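Below is a condensed sketch of both phases, assuming sentence-transformers for the bi-encoder and FAISS as the vector index; neither library is prescribed by this article, and the chunking parameters are arbitrary:

```python
# Phase 1: chunk the dataset, embed the chunks, and store the vectors in an index.
# Phase 2: embed the user prompt with the same bi-encoder and retrieve chunks.
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (phase-1 preprocessing)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

corpus = "..."  # placeholder: your full dataset as plain text
chunks = chunk(corpus)

# Phase 1: encode the chunks and add the vectors to the index
# (inner product over normalized vectors equals cosine similarity).
vectors = encoder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

# Phase 2: encode the user prompt with the same model and retrieve top chunks.
query = encoder.encode(["user question here"], normalize_embeddings=True)
scores, ids = index.search(query, min(3, index.ntotal))
retrieved = [chunks[i] for i in ids[0]]  # extra context for the LLM prompt
```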

Advantages & Disadvantages of Embeddings:

Embeddings, despite their popularity, have a notable limitation: similarity is not transitive, and embeddings alone cannot summarize concepts across large amounts of data. This has implications for how RAG systems interpret and answer queries. In vector space, if vector A is similar to vector B, and vector B is similar to vector C, it does not necessarily follow that vector A is similar to vector C. So when a user's query, represented as vector A, retrieves B but actually seeks information aligned with vector C, the connection may not be apparent from direct similarity via vector B; traversing disparate chunks of information through their shared attributes to synthesize new insights is exactly where this breaks down. The same weakness shows up when trying to provide synthesized insights or to holistically understand summarized semantic concepts over a large dataset.
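A tiny numeric illustration of why similarity is not transitive, using three 2-D vectors and cosine similarity:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

A = np.array([1.0, 0.0])
B = np.array([1.0, 1.0])
C = np.array([0.0, 1.0])

print(cosine(A, B))  # ~0.71 -> A is fairly similar to B
print(cosine(B, C))  # ~0.71 -> B is fairly similar to C
print(cosine(A, C))  #  0.0  -> yet A and C are not similar at all
```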

These limitations can lead to suboptimal situations where RAG systems return only 60%, 70%, or 90% correct answers rather than consistently achieving 100% accuracy.

While embeddings may not always be correct, they always return something, which makes them dependable in that limited sense. You might wonder what the use of such retrieval is if no quality is guaranteed; still, its simplicity often makes it a prerequisite for working with more complex data, such as a semantic layer, so vector search is just the first step in retrieving your data (more about this in my next posts). One of the key advantages is that you do not need to understand your data or have a schema to retrieve information, which simplifies the initial stages of working with complex data. When implemented correctly and combined with other techniques, embeddings can have a positive compounding effect, which explains their widespread use despite their inherent limitations.

Retrieving from a vector database is not the only option; you can retrieve data in many ways, for example from tables in a relational database or via APIs such as Google Maps or Yelp. A vector database is a good choice when you don't have a more convenient way of storing and retrieving your data.


Enjoyed This Story?

If you like this topic and you want to support me:

  1. Upvote ⬆️ my article; that will help me out
  2. Follow me on Hugging Face Blog to get my latest articles and Join AI Sky Discord Server 🫶
  3. Share this article on social media ➡️🌐
  4. Give me feedback in the comments 💬 on LinkedIn. It'll help me understand whether this work was useful; even a simple "thanks" will do. Share good or bad, whatever you think, as long as you tell me where to improve and how.
  5. Connect with me or follow me on LinkedIn or Discord.

Disclaimer: This blog is not affiliated with, endorsed by, or sponsored in any way by any companies or any of their subsidiaries. Any references to products, services, logos, or trademarks are used solely to provide information and commentary and belong to respective owners. The views and opinions expressed in this blog post are the author’s own and do not necessarily reflect the views or opinions of corresponding companies.