Local Embeddings with HuggingFace
LlamaIndex has support for HuggingFace embedding models, including Sentence Transformer models like BGE, Mixedbread, Nomic, Jina, E5, etc. We can use these models to create embeddings for our documents and queries for retrieval.
Furthermore, we provide utilities to create and use ONNX and OpenVINO models using the Optimum library from HuggingFace.
HuggingFaceEmbedding
The base HuggingFaceEmbedding class is a generic wrapper around any HuggingFace model for embeddings. All embedding models on Hugging Face should work. You can refer to the embeddings leaderboard for more recommendations.
This class depends on the sentence-transformers package, which you can install with pip install sentence-transformers.
NOTE: if you were previously using a HuggingFaceEmbeddings from LangChain, this should give equivalent results.
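If you want to sanity-check that equivalence yourself, a minimal comparison could look like the sketch below. It assumes the separate langchain-huggingface package is installed; cosine similarity is used so that any difference in normalization defaults between the two wrappers does not affect the result.

import numpy as np
from langchain_huggingface import HuggingFaceEmbeddings  # assumed installed: pip install langchain-huggingface
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

text = "Hello World!"
li_vec = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5").get_text_embedding(text)
lc_vec = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5").embed_documents([text])[0]

# Cosine similarity should be ~1.0 if both wrappers run the same underlying model
cos = np.dot(li_vec, lc_vec) / (np.linalg.norm(li_vec) * np.linalg.norm(lc_vec))
print(cos)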
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-embeddings-huggingface
!pip install llama-index
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# loads https://huggingface.co/BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))
print(embeddings[:5])
384
[-0.003275700844824314, -0.011690810322761536, 0.041559211909770966, -0.03814814239740372, 0.024183044210076332]
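For retrieval, queries are embedded with get_query_embedding and documents with get_text_embedding. A minimal sketch comparing the two (the query string and the manual cosine similarity below are illustrative, not part of the original example):

import numpy as np

query_embedding = embed_model.get_query_embedding("What does Hello World mean?")
text_embedding = embed_model.get_text_embedding("Hello World!")

# Cosine similarity between the query and the document text
similarity = np.dot(query_embedding, text_embedding) / (
    np.linalg.norm(query_embedding) * np.linalg.norm(text_embedding)
)
print(similarity)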
Benchmarking
Let's compare the backends using a classic large document: the IPCC climate report, chapter 3.
!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import Settings
documents = SimpleDirectoryReader(
    input_files=["IPCC_AR6_WGII_Chapter03.pdf"]
).load_data()
Base HuggingFace Embeddings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# loads BAAI/bge-small-en-v1.5 with the default torch backend
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    device="cpu",
    embed_batch_size=8,
)
test_embeds = embed_model.get_text_embedding("Hello World!")
Settings.embed_model = embed_model
%%timeit -r 1 -n 1
index = VectorStoreIndex.from_documents(documents, show_progress=True)
Parsing nodes: 100%|██████████| 172/172 [00:00<00:00, 428.44it/s]
Generating embeddings: 100%|██████████| 459/459 [00:19<00:00, 23.32it/s]
20.2 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
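The %%timeit magic only works inside a notebook. In a plain Python script you could time just the embedding step directly; this is a rough sketch, and the 100-chunk sample size is arbitrary:

import time

# Script-friendly alternative to %%timeit: time a batch of text embeddings directly
sample_texts = [doc.text for doc in documents[:100]]

start = time.perf_counter()
embed_model.get_text_embedding_batch(sample_texts, show_progress=True)
print(f"{time.perf_counter() - start:.1f}s for {len(sample_texts)} texts")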
ONNX Embeddings
# pip install sentence-transformers[onnx]
# loads BAAI/bge-small-en-v1.5 with the onnx backend
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    device="cpu",
    backend="onnx",
    model_kwargs={
        "provider": "CPUExecutionProvider"
    },  # For ONNX, you can specify the provider, see https://sbert.net/docs/sentence_transformer/usage/efficiency.html
)
test_embeds = embed_model.get_text_embedding("Hello World!")
Settings.embed_model = embed_model
%%timeit -r 1 -n 1
index = VectorStoreIndex.from_documents(documents, show_progress=True)
Parsing nodes: 100%|██████████| 172/172 [00:00<00:00, 421.63it/s]
Generating embeddings: 100%|██████████| 459/459 [00:31<00:00, 14.53it/s]
32.1 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
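Speed aside, it is worth confirming that a non-default backend still produces essentially the same vectors. A hedged check against the default torch backend (the exact difference you see will depend on hardware and library versions):

import numpy as np

torch_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", device="cpu")
onnx_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5", device="cpu", backend="onnx"
)

torch_vec = np.array(torch_model.get_text_embedding("Hello World!"))
onnx_vec = np.array(onnx_model.get_text_embedding("Hello World!"))

# The backends should agree up to small numerical differences
print(np.max(np.abs(torch_vec - onnx_vec)))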
OpenVINO Embeddings
# pip install sentence-transformers[openvino]
# loads BAAI/bge-small-en-v1.5 with the openvino backend
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    device="cpu",
    backend="openvino",  # OpenVINO is very strong on CPUs
    revision="refs/pr/16",  # BAAI/bge-small-en-v1.5 itself doesn't have an OpenVINO model currently, but there's a PR with it that we can load: https://huggingface.co/BAAI/bge-small-en-v1.5/discussions/16
    model_kwargs={
        "file_name": "openvino_model_qint8_quantized.xml"
    },  # If we're using an optimized/quantized model, we need to specify the file name like this
)
test_embeds = embed_model.get_text_embedding("Hello World!")
Settings.embed_model = embed_model
%%timeit -r 1 -n 1
index = VectorStoreIndex.from_documents(documents, show_progress=True)
Parsing nodes: 100%|██████████| 172/172 [00:00<00:00, 403.15it/s]
Generating embeddings: 100%|██████████| 459/459 [00:08<00:00, 53.83it/s]
9.03 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
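The %%timeit cells above are for timing only; to actually query the data, build the index once in a regular cell and retrieve with the current embed_model. A minimal retrieval sketch (the query string and top-k value are illustrative):

# Build the index once (outside %%timeit) and retrieve the most similar chunks
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=2)

for node in retriever.retrieve("How much will sea levels rise by 2100?"):
    print(node.score, node.node.get_content()[:100])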
References
- Local Embedding Models explains more about using local models like these.
- Sentence Transformers > Speeding up Inference contains extensive documentation on how to use the backend options effectively, including optimization and quantization for ONNX and OpenVINO.