Embeddings are a powerful tool for working with text. By "embedding" text into vectors, you encode its meaning into a representation that can more easily be used for tasks like semantic search, clustering, and classification. If you're new to embeddings, check out this awesome introduction by Simon Willison to get up to speed. These days, embeddings are being used for even more interesting applications like Retrieval Augmented Generation, which uses semantic search over embeddings to improve the quality of responses from language models.
In this guide, we'll see how to use the `BAAI/bge-large-en-v1.5` model on Replicate to generate text embeddings. The "BAAI General Embedding" (BGE) suite of models, released by the Beijing Academy of Artificial Intelligence (BAAI), is open source and available on the Hugging Face Hub.
As of October 2023, the large BGE model we'll use here is the current state-of-the-art open source model for text embeddings. It is ranked higher than OpenAI embeddings on the MTEB leaderboard, and is 4x cheaper to run on Replicate for large-scale text embedding (more on this later!).
👇 The code in this post is also available as a hosted, interactive Google Colab notebook:
You’ll need:

- A Replicate account and API token
- Python, with the dependencies below installed
👀 See the model in the Replicate UI here, and more ways to run it (Node.js, cURL, Docker, etc.) here.
Start by installing the following dependencies:
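```bash
# py7zr is an assumption here: the SAMSum dataset loader we use later
# downloads a .7z archive and may need it to extract
pip install replicate datasets transformers numpy py7zr
```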
Grab a Replicate API token from replicate.com/account/api-tokens and set it as an environment variable:
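```bash
export REPLICATE_API_TOKEN=<paste-your-token-here>
```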
Now you can run the embedding model. We'll use the `replicate` library to run the model on Replicate.
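Here's a minimal sketch. We're assuming the model's `texts` input takes a JSON-encoded list of strings, and that your client may require pinning a specific version hash on the model name:

```python
import json
import replicate

texts = [
    "the happy cat",
    "the quick brown fox jumps over the lazy dog",
    "lorem ipsum dolor sit amet",
    "this is a test",
]

output = replicate.run(
    "nateraw/bge-large-en-v1.5",  # pin a version hash if your client requires one
    input={"texts": json.dumps(texts)},  # JSON-encoded list of strings (assumption)
)

print(len(output))     # one embedding per input text
print(len(output[0]))  # 1024 dimensions for bge-large
```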
The output will be a list of embeddings, one per input text.
JSONL (or "JSON Lines") is a text-based, line-delimited format for storing structured data: each line in the file is a standalone JSON object.
Here's an example of a JSONL file, `dummy_example.jsonl`:
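```json
{"text": "the happy cat"}
{"text": "the quick brown fox jumps over the lazy dog"}
{"text": "lorem ipsum dolor sit amet"}
```

(The `text` field name is an assumption here; match whatever field the model's input schema expects.)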
Run the model on this file by specifying it as the `path` input:
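```python
import replicate

output = replicate.run(
    "nateraw/bge-large-en-v1.5",
    input={"path": open("dummy_example.jsonl", "rb")},  # the client uploads the file for you
)

print(len(output))  # one embedding per line in the file
```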
The SAMSum dataset is a collection of ~14k example dialogues with manually annotated summaries. It is often used for training and evaluating language models.
Here we'll encode the whole SAMSum dataset. We'll use the `datasets` library to load the dataset, convert it to a JSONL file, and then run the BGE model on it to generate text embeddings.
To convert the dataset to a JSONL file, call `.to_json` on the dataset.
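For example, loading the train split and keeping just the dialogue text, renamed to the `text` field we've been assuming the model reads:

```python
from datasets import load_dataset

ds = load_dataset("samsum", split="train")

# Keep only the dialogue text, renamed to "text"
ds = ds.remove_columns(["id", "summary"]).rename_column("dialogue", "text")

# to_json writes JSON Lines (one object per line) by default
ds.to_json("samsum_dialogue.jsonl")
```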
If all goes well, the dataset should be written to `samsum_dialogue.jsonl`. Use the `head` command to see the first few lines of the file:
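```bash
head -n 3 samsum_dialogue.jsonl
```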
You should see a few JSON objects, one per line, each containing a dialogue from the dataset.
Let's embed the dataset. This time we'll specify `convert_to_numpy=True` to get the embeddings back as a NumPy array, which is a more efficient output format for such a large dataset.
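Here we assume the model returns a URL to a saved `.npy` file when `convert_to_numpy=True`, rather than a JSON list:

```python
import replicate

output = replicate.run(
    "nateraw/bge-large-en-v1.5",
    input={
        "path": open("samsum_dialogue.jsonl", "rb"),
        "convert_to_numpy": True,
    },
)

print(output)  # URL to a .npy file (assumption, per the loading step below)
```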
Since we chose `convert_to_numpy=True`, we'll load the output with `numpy`:
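```python
import numpy as np
from urllib.request import urlretrieve

# Download the .npy file the model produced, then load it
urlretrieve(output, "embeddings.npy")
embeddings = np.load("embeddings.npy")

print(embeddings.shape)  # (num_dialogues, 1024)
```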
At the time of this writing, OpenAI's Ada v2 model costs $0.0001 / 1K tokens.
On Replicate, you're charged by the second for the hardware you're running on. The `nateraw/bge-large-en-v1.5` model we're using here runs on A40 (Large) instances, which cost $0.000725/sec.
Below, we'll compare both OpenAI and Replicate. To do so, we'll need to count the number of tokens in the dataset. We'll use the `transformers` library to do this:
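```python
from transformers import AutoTokenizer

# Tokenizer matching the BGE model
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")

# A sketch: build one string that tokenizes to exactly 512 tokens (the model's
# max sequence length), accounting for the [CLS]/[SEP] specials the tokenizer adds
n_special = tokenizer.num_special_tokens_to_add()
text = " ".join(["hello"] * (512 - n_special))
assert len(tokenizer(text)["input_ids"]) == 512

# Repeat it: 10,000 lines x 512 tokens/line = 5,120,000 tokens
benchmark_texts = [text] * 10_000
```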
In the snippet above, we prepare a benchmark file with 512 tokens per line. This is the maximum number of tokens supported by the BGE model. In total, the dataset has 5,120,000 tokens. Let's double-check that:
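```python
# Re-tokenize every line and sum the token counts
total = sum(len(tokenizer(t)["input_ids"]) for t in benchmark_texts)
print(f"{total:,}")  # 5,120,000
```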
Finally, we'll write this dataset to a JSONL file, just as we did earlier.
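```python
import json

# One {"text": ...} object per line, matching the earlier files
with open("benchmark.jsonl", "w") as f:
    for t in benchmark_texts:
        f.write(json.dumps({"text": t}) + "\n")
```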
Now we'll run the benchmark. We'll use `replicate.predictions.create` to run the model asynchronously. This will return a `Prediction` object, which we can use to get the results of the run, as well as its associated metrics. We can then use the `predict_time` metric to calculate the price of the run:
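```python
import replicate

model = replicate.models.get("nateraw/bge-large-en-v1.5")
version = model.versions.list()[0]  # most recent version

prediction = replicate.predictions.create(
    version=version,
    input={
        "path": open("benchmark.jsonl", "rb"),
        "convert_to_numpy": True,
    },
)

# Block until the run finishes, then read the billed compute time
prediction.wait()
predict_time = prediction.metrics["predict_time"]  # seconds
print(predict_time)
```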
Let's see what the price of this run would have been using the OpenAI API:
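```python
n_tokens = 5_120_000

# Ada v2 pricing: $0.0001 per 1K tokens
openai_price = (n_tokens / 1000) * 0.0001
print(f"${openai_price:.4f}")  # $0.5120
```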
And the price on Replicate:
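```python
# A40 (Large) pricing: $0.000725 per second of predict time
replicate_price = predict_time * 0.000725
print(f"${replicate_price:.4f}")
```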
The price on Replicate is more than 4x cheaper than OpenAI's, and that's with a model ranked higher on the MTEB leaderboard. 🎉
If you enjoyed this post and want to see a more in-depth example of using this text embedding model in the wild, check out this blog post by @jakedahn, which covers how to do Retrieval Augmented Generation (RAG) with ChromaDB and Mistral.
Happy hacking!