Skip to content

OpenVINO LLMs

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference. OpenVINO™ Runtime can enable running the same model optimized across various hardware devices. Accelerate your deep learning performance across use cases like: language + LLMs, computer vision, automatic speech recognition, and more.

OpenVINO models can be run locally through OpenVINOLLM entitiy wrapped by LlamaIndex :

In the below line, we install the packages necessary for this demo:

%pip install llama-index-llms-openvino transformers huggingface_hub

Now that we’re set up, let’s play around:

If you’re opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.

!pip install llama-index
from llama_index.llms.openvino import OpenVINOLLM
def messages_to_prompt(messages):
prompt = ""
for message in messages:
if message.role == "system":
prompt += f"<|system|>\n{message.content}</s>\n"
elif message.role == "user":
prompt += f"<|user|>\n{message.content}</s>\n"
elif message.role == "assistant":
prompt += f"<|assistant|>\n{message.content}</s>\n"
# ensure we start with a system prompt, insert blank if needed
if not prompt.startswith("<|system|>\n"):
prompt = "<|system|>\n</s>\n" + prompt
# add final assistant prompt
prompt = prompt + "<|assistant|>\n"
return prompt
def completion_to_prompt(completion):
return f"<|system|>\n</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"

Models can be loaded by specifying the model parameters using the OpenVINOLLM method.

If you have an Intel GPU, you can specify device_map="gpu" to run inference on it.

ov_config = {
"PERFORMANCE_HINT": "LATENCY",
"NUM_STREAMS": "1",
"CACHE_DIR": "",
}
ov_llm = OpenVINOLLM(
model_id_or_path="HuggingFaceH4/zephyr-7b-beta",
context_window=3900,
max_new_tokens=256,
model_kwargs={"ov_config": ov_config},
generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
messages_to_prompt=messages_to_prompt,
completion_to_prompt=completion_to_prompt,
device_map="cpu",
)
response = ov_llm.complete("What is the meaning of life?")
print(str(response))

It is possible to export your model to the OpenVINO IR format with the CLI, and load the model from local folder.

!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta ov_model_dir

It is recommended to apply 8 or 4-bit weight quantization to reduce inference latency and model footprint using --weight-format:

!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int8 ov_model_dir
!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int4 ov_model_dir
ov_llm = OpenVINOLLM(
model_id_or_path="ov_model_dir",
context_window=3900,
max_new_tokens=256,
model_kwargs={"ov_config": ov_config},
generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
messages_to_prompt=messages_to_prompt,
completion_to_prompt=completion_to_prompt,
device_map="gpu",
)

You can get additional inference speed improvement with Dynamic Quantization of activations and KV-cache quantization. These options can be enabled with ov_config as follows:

ov_config = {
"KV_CACHE_PRECISION": "u8",
"DYNAMIC_QUANTIZATION_GROUP_SIZE": "32",
"PERFORMANCE_HINT": "LATENCY",
"NUM_STREAMS": "1",
"CACHE_DIR": "",
}

Using stream_complete endpoint

response = ov_llm.stream_complete("Who is Paul Graham?")
for r in response:
print(r.delta, end="")

Using stream_chat endpoint

from llama_index.core.llms import ChatMessage
messages = [
ChatMessage(
role="system", content="You are a pirate with a colorful personality"
),
ChatMessage(role="user", content="What is your name"),
]
resp = ov_llm.stream_chat(messages)
for r in resp:
print(r.delta, end="")

For more information refer to: