Understanding Embedding: Enhancing Text Data Analysis with AI

Author: Kyle Kim (AI Lead)

May 26, 2023


Brief Introduction to Embedding

In this article, we will talk about the concept of embedding, one of the most important concepts in AI. Embeddings are a critical building block for getting the intended results out of the massive prior knowledge held by large language models (LLMs), powering semantic search, recommendation, clustering, and other core features of applications that work with text data. Syncly uses embeddings in various features such as feedback auto-categorization, sentiment classification, and others.


<Contents>

  • What is an embedding?

  • How can you generate an embedding?

  • When should you use an embedding?

  • What is a vector database?

  • References


What is an embedding?

An embedding (or embedding vector) is a representation of text as a real-valued vector (i.e., a fixed-size array of floating-point numbers). If you input a certain word, sentence, or document into an embedding model, you obtain a vector of real numbers as shown in the figure below.

The numbers themselves are difficult for humans to interpret directly. However, they carry information about the semantic relationships between different words or documents.


Embedding Projector, provided by TensorFlow, is a visualization tool developed to help users better understand embeddings. Using an embedding technique called Word2Vec, the tool extracts embeddings for 10,000 words and projects them into a 3D space. For example, if you click the point that corresponds to the embedding of the word "geographic", you can see the points closest to it, which correspond to the embeddings of other words. The words with meanings similar to the one you clicked (e.g., geographical, coordinates, map, location) are listed in descending order of semantic similarity.
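
You can reproduce the same kind of nearest-neighbor lookup in a few lines of Python using the gensim library and a pretrained Word2Vec model. This is only a minimal sketch; the pretrained model name below is an illustrative choice, not the exact model behind the Embedding Projector demo.

```python
# Minimal sketch: nearest neighbors in a Word2Vec embedding space with gensim.
# The pretrained model name is illustrative (and large to download).
import gensim.downloader as api

model = api.load("word2vec-google-news-300")      # pretrained Word2Vec vectors
print(model.most_similar("geographic", topn=5))   # closest words by cosine similarity
```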


Why do you need embeddings?

An AI model is a function: it can only receive numbers as input and it can only output numbers as well. Since the text that humans type does not consist of numbers, it must be converted into numbers that an AI model can understand. That is one of the fundamental reasons embeddings are needed.

If you have a good knowledge of how computers work, you may wonder why text needs to be converted again, since the text handled by a computer is already a set of encoded characters. The reason you need embeddings is that text is generally long and its length varies, while an AI model's internal structure is not designed to handle variable-length input. You therefore have to convert the original text into an embedding, a numeric input of fixed length, before giving it to an AI model.


Word (token) embedding vs. Sentence/document embedding

Embeddings are largely classified into word (token) embedding and sentence/document embedding based on the type of the original text.

Original text should be split into smaller pieces before being given to an AI model. Here, the pieces are referred to as tokens, and the model in charge of that procedure is called a tokenizer. A token can be a word or a subword (a piece of word) depending on the type of tokenizer you use. An embedding extracted from a token is usually called word (token) embedding.

The GPT-3 Tokenizer provided by OpenAI is a tool that makes it easy to understand how the tokenizers used by GPT models operate. If you type English text into the tool, you can see at a glance how the text is split into (subword) tokens before being fed into the AI model.
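
If you prefer to do this programmatically, OpenAI's tiktoken library exposes the same kind of subword tokenization. A minimal sketch, assuming the cl100k_base encoding (the exact encoding used by a given GPT model may differ):

```python
# Minimal sketch: subword tokenization with OpenAI's tiktoken library.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")        # encoding name is an assumption
tokens = enc.encode("Embeddings turn text into vectors.")

print(tokens)                                     # token ids
print([enc.decode([t]) for t in tokens])          # the subword pieces as strings
```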

However, in most cases what users want to input is not just a single word but a sentence, or a document with multiple sentences. An embedding extracted from a sentence or an entire document is called a sentence embedding or a document embedding. A sentence/document embedding is typically obtained by aggregating the word (token) embeddings calculated from its (sub)words, e.g. by averaging them, as in the sketch below.
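
A toy sketch of this averaging step, with made-up token vectors just for illustration:

```python
# Toy sketch: building a sentence embedding by averaging token embeddings.
# The token vectors below are made up for illustration.
import numpy as np

token_embeddings = np.array([
    [0.12, -0.40,  0.88],   # embedding of token 1
    [0.05,  0.33, -0.10],   # embedding of token 2
    [-0.27, 0.19,  0.44],   # embedding of token 3
])

sentence_embedding = token_embeddings.mean(axis=0)  # same fixed length as each token vector
print(sentence_embedding)
```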


How can you generate an embedding?

Past: One-hot encoding

In the past, a relatively simple technique was used to generate embeddings. Given a massive collection of documents used to train an AI model, you count every word that appears in the collection, create a word book (vocabulary), and assign an index number to each word in it. Then, when a word is given, you first generate a vector of 0s whose length equals the number of unique words in the word book, and place a 1 only at the position corresponding to the index of the given word. This technique for generating embeddings is called one-hot encoding.
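
A minimal sketch of one-hot encoding over a tiny, made-up word book:

```python
# Minimal sketch: one-hot encoding over a toy vocabulary ("word book").
vocabulary = ["apple", "banana", "map", "geography"]       # toy word book
index = {word: i for i, word in enumerate(vocabulary)}     # word -> index number

def one_hot(word):
    vector = [0] * len(vocabulary)   # as many zeros as unique words
    vector[index[word]] = 1          # a single 1 at the word's index
    return vector

print(one_hot("map"))  # [0, 0, 1, 0]
```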

Although one-hot encoding is easy for humans to understand as it is, the dimensionality of the embedding vector (i.e. the count of numbers included in the vector) grows markedly with the number of words, up to 100,000 or even 1,000,000. Also, the embedding vector is extremely sparse, as most of its elements are 0. Thus, it was somewhat difficult for AI models to handle effectively.


Present: Learned embedding

Learned embedding is a newer embedding generation technique designed to overcome the shortcomings of one-hot encoding explained above. The embeddings described in this article refer to learned embeddings. In general, a learned embedding model is obtained by training an AI model (or LLM) with a neural network architecture on a massive collection of documents. During training, the model sees many words and learns their semantic relationships so that the distance between the embeddings of two words becomes smaller if the words have similar meanings, and larger if they do not.

In contrast to one-hot encoding, a learned embedding is hard for humans to interpret as it is. However, the dimensionality of the embedding vector is much lower (typically 384 to 1,536) than that of one-hot encoding, and all elements of the vector are densely used. Thus, an AI model can handle it more efficiently.

OpenAI Embeddings (GPT-3)

The OpenAI Embeddings API is one of the most widely used ways to extract embeddings from text effectively. There is no need to maintain the computing resources required to run a massive LLM yourself; instead, you can extract embeddings at a relatively low cost by calling the API from a Python script with a few lines, as suggested below. This is why OpenAI embeddings are popular with many developers.
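
A minimal sketch of such a call, assuming the v0.x-style openai Python package that was current when this article was written (the API key is a placeholder):

```python
# Minimal sketch: extracting an embedding via the OpenAI Embeddings API
# (v0.x-style openai package; newer versions use a client object instead).
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="Customer feedback: the onboarding flow was confusing.",
)
embedding = response["data"][0]["embedding"]  # a list of 1,536 floats
print(len(embedding))
```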

According to a guide document from OpenAI, OpenAI embeddings are extracted using a GPT-3 based LLM. Specifically, an embedding vector from OpenAI embeddings (when using the text-embedding-ada-002 model) has 1,536 dimensions, which gives it enough capacity to hold the semantic information of lengthy text.


When should you use an embedding?

As mentioned above, you can identify the semantic relationships between different words or documents using embeddings. There are several cases where this capability is particularly useful. They largely fall into the following two categories.

Case 1: When exploring or comparing multiple documents based on their meanings

The first case is searching for a document or comparing documents within a large collection based on their meanings. Semantic search, recommendation, and clustering features fall into this category.

If you are using OpenAI embeddings, you can find how each feature can be implemented in Python in detail by referring to the Semantic text search using embeddings, Recommendation using embeddings and nearest neighbor search, and Clustering notebooks provided by the OpenAI Cookbook.


Semantic search

Semantic search is a feature that retrieves documents that are semantically relevant to a query (search term) that a user submits as text. The following is a simple semantic search process using embeddings; a code sketch follows the steps.

  1. Calculate the embedding of each document in the collection and store them in separate storage (e.g. local drive, vector database, etc.).

  2. Calculate the embedding of the query.

  3. Calculate the cosine similarity between the query embedding and the embedding of each document, and sort all the documents in descending order of similarity score.

  4. Load and return the documents corresponding to the top k results.
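
A minimal sketch of these steps in Python, where get_embedding is a hypothetical helper that wraps an embedding model such as the OpenAI API shown earlier:

```python
# Minimal sketch: semantic search over precomputed document embeddings.
# get_embedding is a hypothetical helper wrapping an embedding model.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_search(query, documents, doc_embeddings, get_embedding, top_k=3):
    query_embedding = get_embedding(query)                                     # step 2
    scores = [cosine_similarity(query_embedding, e) for e in doc_embeddings]   # step 3
    ranked = sorted(zip(scores, documents), key=lambda x: x[0], reverse=True)
    return ranked[:top_k]                                                      # step 4
```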


Recommendation

Recommendation is a feature that suggests other documents that have high semantic relatedness to the one a user is currently viewing. The process is almost the same as that of semantic search, except that the query embedding is replaced with the embedding of the document the user is currently viewing; a short sketch follows the steps below.

  1. Calculate the embedding of each document in the collection and store them in separate storage (e.g. local drive, vector database, etc.).

  2. Calculate the cosine similarity between the embedding of the currently viewed document and those of the other documents, and sort all the documents in descending order of similarity score.

  3. Load and return the documents corresponding to the top k results.
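
The ranking logic is the same as in the semantic search sketch above; the embedding of the currently viewed document plays the role of the query embedding (cosine_similarity is the helper defined there):

```python
# Minimal sketch: recommendation by reusing the cosine_similarity helper above.
def recommend(current_index, documents, doc_embeddings, top_k=3):
    query_embedding = doc_embeddings[current_index]    # the currently viewed document
    scored = [
        (cosine_similarity(query_embedding, emb), doc)
        for i, (doc, emb) in enumerate(zip(documents, doc_embeddings))
        if i != current_index                          # exclude the document itself
    ]
    return sorted(scored, key=lambda x: x[0], reverse=True)[:top_k]
```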


Clustering

Clustering groups multiple documents into several clusters based on their semantic relationships. The main difference between clustering and semantic search is that you need to calculate the distances between the embeddings of many document pairs (although the number of pairs may differ depending on the types of clustering algorithms).

If you use Python ML libraries like scikit-learn, you can easily perform clustering by simply feeding the embedding vectors into the relevant function, without needing extensive knowledge of clustering algorithms.
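
For example, a minimal sketch using scikit-learn's KMeans on a stand-in embedding matrix (the number of clusters is an arbitrary choice for illustration):

```python
# Minimal sketch: clustering document embeddings with scikit-learn's KMeans.
import numpy as np
from sklearn.cluster import KMeans

doc_embeddings = np.random.rand(100, 1536)      # stand-in for real embeddings

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(doc_embeddings)     # cluster id for each document
print(labels[:10])
```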


Case 2: When additional information has to be given to an LLM to generate a result

The second case is a bit more complex. It relates to a fundamental characteristic of the LLMs commonly used today. An LLM has general knowledge about the information available on the Internet. However, it does not have knowledge of private information that only you may have. If you want an LLM to produce an answer based on your private information, you have to send the LLM a prompt that includes the text containing that information.

However, the maximum length of text (i.e. the total number of tokens) that can be added to a prompt is limited in LLM services. If you want to input text long enough to fill a book, you have to split it into chunks and select the ones most relevant to a given question to add to the prompt.

This is a significant point to consider when implementing a feature such as question answering. The following is a simple question answering process that allows an LLM to answer a question (query) based on additional information. It is quite similar to the semantic search process.

  1. Split the entire text containing the relevant information into fixed-length chunks, calculate the embedding of each chunk, and store them in separate storage.

  2. Calculate the embedding of the query containing the question.

  3. Calculate the cosine similarity between the query embedding and the embedding of each chunk, and sort all the chunks in descending order of similarity score.

  4. Load the chunks corresponding to the top k results and add them to the prompt. Here, k is the maximum value allowed under the LLM service's limit on prompt length.

  5. Input the completed prompt to the LLM and return the answer it generates.

If you are using OpenAI embeddings, you can find how such a feature can be implemented in Python in detail by referring to the Question answering using embeddings-based search notebook provided by the OpenAI Cookbook.
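
A minimal sketch of these steps, reusing the hypothetical semantic_search and get_embedding helpers from the earlier sketches and the v0.x-style openai package:

```python
# Minimal sketch: question answering over text chunks via embeddings-based search.
# semantic_search and get_embedding are the hypothetical helpers defined earlier.
import openai

def answer_question(question, chunks, chunk_embeddings, get_embedding, top_k=3):
    top_chunks = semantic_search(question, chunks, chunk_embeddings, get_embedding, top_k)
    context = "\n\n".join(chunk for _, chunk in top_chunks)        # steps 2-4

    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    response = openai.ChatCompletion.create(                       # step 5
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]
```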


What is a vector database?

When additional information is required for an LLM to generate the intended result, you can use embeddings as described above. However, there is a limitation: you have to repeat this process every time you want to generate the intended result, because an LLM does not "remember" the information it is given. That is why LLMs are sometimes described as stateless.

A vector database is a new type of database developed to overcome this absence of long-term memory in LLMs and other AI models. Unlike conventional databases such as an RDBMS, vector databases specialize in efficiently storing and indexing high-dimensional, real-valued vectors. Rather than retrieving the item that exactly matches a query expressed in e.g. SQL, they retrieve the items whose embeddings have the highest similarity scores to the query embedding. In other words, they are optimized for storing embeddings obtained from AI models and retrieving data based on them.

If the task you give an LLM involves high-dimensional embeddings (512 dimensions or more) and a huge number of embeddings in total (10,000 or more), we highly recommend using a vector database. Chroma is an open-source vector database that you can use right away. You can find how to implement the question answering feature by referring to the Robust Question Answering with Chroma and OpenAI notebook provided by the OpenAI Cookbook.
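
A minimal sketch of storing and querying text with Chroma (the collection name and documents are made up; by default Chroma computes embeddings itself, or you can pass precomputed ones):

```python
# Minimal sketch: storing documents in Chroma and querying by semantic similarity.
import chromadb

client = chromadb.Client()                            # in-memory client
collection = client.create_collection("feedback")     # collection name is made up

collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "The onboarding flow was confusing.",
        "Exporting reports takes too long.",
    ],
)

results = collection.query(
    query_texts=["How do users feel about onboarding?"],
    n_results=1,
)
print(results["documents"])
```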


References


