What are embeddings produced by LLMs?

2.12.2024
[Image: A magician's hat into which the letter A and a set of numbers fall]

An ordinary user interacts with large language models, like ChatGPT, by writing prompts through the user interface. In addition to this, large language models offer another functionality for technically skilled users – the creation of embeddings based on text. But what exactly are these embeddings, and what are they used for?

Meaning of text in vectors

When a large language model is given some text to embed, it produces a vector as a result. A vector is a list of numbers that may not be immediately interpretable to the human eye, but it enables the exploration of the text’s meaning through mathematical methods. These vectors produced by the language model are called embeddings.

The UralicNLP Python library provides tools for embedding text using different language models. Here is an example of how text can be embedded with OpenAI’s model using UralicNLP.

from uralicNLP.llm import get_llm
llm = get_llm("chatgpt", "REPLACE WITH YOUR API KEY", model="text-embedding-3-small")
llm.embed("The text you want to embed")
>> [-0.1803697, 1.1973963, 0.5283669, 1.5049516, -0.27077377…]

As seen in the example, the result of an embedding is a list of numbers. These numbers represent the meaning of the text and can be used to compare the similarity of texts through mathematical methods.
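One of those mathematical methods is cosine similarity: the closer the value is to 1, the more similar the meanings of the two texts are. The sketch below uses short toy vectors instead of real model embeddings, but the calculation is the same regardless of vector length.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
v1 = [0.2, 0.5, -0.1]
v2 = [0.2, 0.5, -0.1]
v3 = [-0.5, 0.1, 0.9]

print(cosine_similarity(v1, v2))  # identical vectors -> 1.0
print(cosine_similarity(v1, v3))  # dissimilar vectors -> a lower value
```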

What are the benefits of embeddings?

With embeddings, large volumes of text can be stored in a vector database for quick retrieval. This means database searches are based on meaning rather than character strings. The most common use case for such vector databases currently is the RAG model.

RAG stands for Retrieval-Augmented Generation, which refers to a process where a large language model is provided with not just the user prompt but also source material to help generate a response. Retrieving the source material involves using embeddings to find documents relevant to the user’s input from a vector database. For example, Metropolia’s own Mikro-Mikko operates based on this principle.
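The retrieval step can be sketched in plain Python: store pre-computed embeddings alongside their texts, embed the user's question with the same model, and return the stored text whose embedding is most similar. The vectors and texts below are invented for illustration; a real system would use a proper vector database and actual model embeddings.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical pre-computed embeddings acting as a tiny "vector database"
database = {
    "Dogs are loyal pets.": [0.9, 0.1, 0.0],
    "Trucks transport goods.": [0.0, 0.2, 0.9],
}

def retrieve(query_embedding, db):
    # Return the stored text whose embedding is most similar to the query
    return max(db, key=lambda text: cosine_similarity(db[text], query_embedding))

query = [0.8, 0.2, 0.1]  # pretend embedding of "Tell me about dogs"
print(retrieve(query, database))  # -> "Dogs are loyal pets."
```

The retrieved document is then passed to the language model together with the user's prompt, which is the "augmented generation" part of RAG.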

Embeddings can also be used to automatically group text documents into clusters of similar texts. This can be done with UralicNLP as follows.

from uralicNLP.llm import get_llm
from uralicNLP import semantics
llm = get_llm("chatgpt", "REPLACE WITH YOUR API KEY", model="text-embedding-3-small")
texts = ["dogs are fun", "cars drive fast", "cats play together", "trucks drive from city to city"]
semantics.cluster(texts, llm)
>> [["dogs are fun", "cats play together"], ["cars drive fast", "trucks drive from city to city"]]

As a result, the texts are grouped into clusters of semantically similar texts by embedding each one and computing the similarity between the embeddings.

Does the model matter when embedding?

Embeddings can be generated using both commercial large language models and open-source language models. When choosing a model, it’s important to remember that embeddings are not compatible across models. For example, you cannot create some embeddings with OpenAI’s GPT-4 and others with an open-source LLaMA model and expect them to work together. Each model has learned its own representation of meaning from its training data, so the numerical content of the embeddings varies between models.

When choosing a model, it’s important to consider the cost of the model, the languages it supports, and its context window. Larger models can accommodate a large amount of text within the context window, allowing for a single embedding of an entire text. Smaller models require the text to be split into segments. This technical limitation can be significant depending on how the embeddings are intended to be used.
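Splitting text for a model with a small context window can be as simple as cutting it into fixed-size pieces, each of which is then embedded separately. The word-based splitter below is a naive sketch; real pipelines typically split on tokens or sentence boundaries instead.

```python
def split_into_segments(text, max_words=100):
    # Naive word-based splitter; production systems usually count
    # model tokens rather than whitespace-separated words
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

long_text = " ".join(["word"] * 250)
segments = split_into_segments(long_text, max_words=100)
print(len(segments))  # 250 words in chunks of 100 -> 3 segments
```

Each segment then gets its own embedding, which is why segment boundaries can matter: a sentence split across two segments contributes to two different vectors.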

Not all models support all languages. If a language model produces poor Finnish responses to prompts, it likely does not understand Finnish very well. Consequently, embeddings generated for Finnish text may not capture the meaning accurately enough.
