Optimized Embedding Model using Optimum-Intel

LlamaIndex supports loading quantized embedding models optimized for Intel hardware through the Optimum-Intel library.

Optimized models are smaller and faster, with minimal accuracy loss. See the documentation and an optimization guide that uses the IntelLabs/fastRAG library.

The optimization relies on math instructions available in 4th generation Intel Xeon® processors and newer.
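To check whether a Linux machine exposes such instructions, one option is to inspect the CPU flags. The sketch below is only illustrative: the text above does not name the instructions, so the flag names (`amx_tile`, `amx_int8`, `amx_bf16`, `avx512_vnni`) are an assumption about what to look for, and `find_cpu_flags` is a hypothetical helper, not part of LlamaIndex or Optimum-Intel.

```python
from pathlib import Path

# Assumption: these Linux /proc/cpuinfo flag names correspond to the
# relevant math instructions. The documentation above does not name them.
ASSUMED_FLAGS = ("amx_tile", "amx_int8", "amx_bf16", "avx512_vnni")


def find_cpu_flags(cpuinfo_text, wanted=ASSUMED_FLAGS):
    """Return the flags from `wanted` that appear in cpuinfo-style text."""
    present = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Each logical CPU has one "flags : ..." line listing its features.
        if key.strip() == "flags":
            present.update(value.split())
    return [flag for flag in wanted if flag in present]


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # only exists on Linux
    if cpuinfo.exists():
        print(find_cpu_flags(cpuinfo.read_text()))
    else:
        print("No /proc/cpuinfo on this system; check the CPU model manually.")
```

An empty result does not necessarily mean the model cannot be loaded. It only suggests the CPU may not benefit from the quantized kernels.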

To load and use the quantized models, install the required dependencies:

pip install optimum[exporters] optimum-intel neural-compressor intel_extension_for_pytorch

Models are loaded with the IntelEmbedding class, and usage is similar to any local HuggingFace embedding model. For example:

%pip install llama-index-embeddings-huggingface-optimum-intel

from llama_index.embeddings.huggingface_optimum_intel import IntelEmbedding

embed_model = IntelEmbedding("Intel/bge-small-en-v1.5-rag-int8-static")

embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))
print(embeddings[:5])

384
[-0.0032782123889774084, -0.013396517373621464, 0.037944991141557693, -0.04642259329557419, 0.027709005400538445]
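Once embeddings are available as lists of floats, a common next step is comparing them. The following is a minimal sketch of ranking texts by cosine similarity to a query. `cosine_similarity` and `rank_by_similarity` are hypothetical helpers written for illustration, not part of the LlamaIndex API, and `embed` stands for any callable that maps a string to a vector, such as `embed_model.get_text_embedding`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(embed, query, texts):
    """Return (text, score) pairs sorted from most to least similar.

    `embed` is any callable mapping a string to a list of floats.
    """
    query_vec = embed(query)
    scored = [(text, cosine_similarity(query_vec, embed(text))) for text in texts]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Usage with the model loaded above (requires the package and model download):
# ranked = rank_by_similarity(
#     embed_model.get_text_embedding,
#     "Hello World!",
#     ["Hi there, world.", "Quarterly revenue report"],
# )
```

Passing the embedding function as an argument keeps the helper independent of any particular model, so it can be tried with a stub before the quantized model is downloaded.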
