---
title: IPEX-LLM on Intel CPU | Developer Documentation
---

> [IPEX-LLM](https://github.com/intel-analytics/ipex-llm/) is a PyTorch library for running LLMs on Intel CPUs and GPUs (e.g., a local PC with an iGPU, or a discrete GPU such as Arc, Flex and Max) with very low latency.

This example shows how to use LlamaIndex with [`ipex-llm`](https://github.com/intel-analytics/ipex-llm/) for text generation and chat on CPU.

> **Note**
>
> See the [examples directory](https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/llms/llama-index-llms-ipex-llm/examples) for full examples of `IpexLLM`. When running those examples on an Intel CPU, pass `-d 'cpu'` as a command-line argument.

Install `llama-index-llms-ipex-llm`. This also installs `ipex-llm` and its dependencies.

```
%pip install llama-index-llms-ipex-llm
```

In this example we’ll use the [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) model for demonstration. It requires updating the `transformers` and `tokenizers` packages.

```
%pip install -U transformers==4.37.0 tokenizers==0.15.2
```

Before loading the Zephyr model, you’ll need to define `completion_to_prompt` and `messages_to_prompt` for formatting prompts. These functions turn plain strings and chat messages into the prompt layout the model expects.
```
# Transform a string into zephyr-specific input
def completion_to_prompt(completion):
    return f"<|system|>\n\n<|user|>\n{completion}\n<|assistant|>\n"


# Transform a list of chat messages into zephyr-specific input
def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}\n"

    # ensure we start with a system prompt, insert blank if needed
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n\n" + prompt

    # add final assistant prompt
    prompt = prompt + "<|assistant|>\n"

    return prompt
```

## Basic Usage

Load the Zephyr model locally with `IpexLLM.from_model_id`. It loads the model directly in its Hugging Face format and automatically converts it to a low-bit format for inference.

```
import warnings

# Silence the padding_mask UserWarning emitted during generation
warnings.filterwarnings(
    "ignore", category=UserWarning, message=".*padding_mask.*"
)

from llama_index.llms.ipex_llm import IpexLLM

llm = IpexLLM.from_model_id(
    model_name="HuggingFaceH4/zephyr-7b-alpha",
    tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
    context_window=512,
    max_new_tokens=128,
    generate_kwargs={"do_sample": False},
    completion_to_prompt=completion_to_prompt,
    messages_to_prompt=messages_to_prompt,
)
```

```
Loading checkpoint shards: 0%| | 0/8 [00:00
```

Note that the saved path for the low-bit model contains only the model itself, not the tokenizer. If you wish to have everything in one place, you will need to manually download or copy the tokenizer files from the original model’s directory to the location where the low-bit model is saved.
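The manual copy described above can be sketched with the standard library alone. This is a minimal sketch and not part of `ipex-llm`: the helper name `copy_tokenizer_files` and both directory arguments are hypothetical, and the list of file names is an assumption based on files commonly shipped with Hugging Face tokenizers, so check the original model's directory for the actual set.

```python
import shutil
from pathlib import Path

# Assumption: file names commonly shipped with Hugging Face tokenizers.
# Check the original model's directory for the files it actually contains.
TOKENIZER_FILES = (
    "tokenizer.json",
    "tokenizer.model",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "added_tokens.json",
)


def copy_tokenizer_files(src_dir, dst_dir):
    """Copy any known tokenizer files from src_dir into dst_dir.

    Returns the sorted names of the files that were copied.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in TOKENIZER_FILES:
        candidate = src / name
        if candidate.is_file():
            # copy2 also preserves file metadata such as timestamps
            shutil.copy2(candidate, dst / name)
            copied.append(name)
    return sorted(copied)
```

With a local copy of the original model in a (hypothetical) `original_model_dir`, calling `copy_tokenizer_files(original_model_dir, saved_lowbit_model_path)` would place the tokenizer next to the low-bit weights, after which `tokenizer_name` can point at the saved path as well.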
```
# saved_lowbit_model_path is the directory where the low-bit model was saved
llm_lowbit = IpexLLM.from_model_id_low_bit(
    model_name=saved_lowbit_model_path,
    tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
    # tokenizer_name=saved_lowbit_model_path,  # copy the tokenizers to saved path if you want to use it this way
    context_window=512,
    max_new_tokens=64,
    completion_to_prompt=completion_to_prompt,
    generate_kwargs={"do_sample": False},
)
```

```
2024-04-11 21:38:06,151 - INFO - Converting the current model to sym_int4 format......
```

Try stream completion using the loaded low-bit model.

```
response_iter = llm_lowbit.stream_complete("What is Large Language Model?")
for response in response_iter:
    print(response.delta, end="", flush=True)
```

```
A large language model (LLM) is a type of artificial intelligence (AI) model that is trained on a massive amount of text data. These models are capable of generating human-like responses to text inputs and can be used for various natural language processing (NLP) tasks, such as text classification, sentiment analysis
```
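The loop above prints each chunk's `delta` as it arrives. If the complete text is also needed afterwards, the deltas can be joined as they stream in. A minimal sketch: `collect_stream` is a hypothetical helper, and `fake_stream` is a stand-in for `llm_lowbit.stream_complete(...)` so the block runs without loading a model. It assumes only that each streamed item exposes a `delta` attribute, as in the loop above.

```python
from types import SimpleNamespace


def collect_stream(response_iter, echo=True):
    """Optionally print each streamed delta and return the joined text."""
    parts = []
    for response in response_iter:
        delta = response.delta or ""  # treat a missing delta as empty text
        if echo:
            print(delta, end="", flush=True)
        parts.append(delta)
    return "".join(parts)


def fake_stream(chunks):
    """Stand-in generator: yields objects with a `.delta` attribute."""
    for chunk in chunks:
        yield SimpleNamespace(delta=chunk)


# With the real model this would be:
#   collect_stream(llm_lowbit.stream_complete("What is Large Language Model?"))
full_text = collect_stream(
    fake_stream(["A large ", "language ", "model"]), echo=False
)
```

Keeping the printing optional lets the same helper serve both interactive use and code that only needs the final string.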