vLLM
There are two modes of using vLLM: local and remote. Let's start with the former, which requires a CUDA environment available locally.
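Before going further, it can help to confirm that a GPU is actually visible on the machine. The following is a minimal sketch, not part of vLLM or LlamaIndex: it assumes the NVIDIA driver ships the `nvidia-smi` tool on the `PATH`, and the helper name `visible_gpus` is hypothetical. The local examples on this page set `tensor_parallel_size=4`, which generally expects that many GPUs, so the count is worth checking.

```python
import shutil
import subprocess
from typing import List


def visible_gpus() -> List[str]:
    """Return the GPU names reported by nvidia-smi, or [] if none can be found.

    Assumption: nvidia-smi on PATH is a reasonable proxy for a usable CUDA driver.
    """
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=10,
        )
    except (subprocess.SubprocessError, OSError):
        # The tool exists but failed (for example, no driver loaded).
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = visible_gpus()
print(f"{len(gpus)} GPU(s) visible: {gpus}")
```

If the list is empty, the remote mode described later on this page may be the better fit.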
Install vLLM
    pip install vllm
Alternatively, you can compile vLLM from source.
Orca-7b Completion Example

    %pip install llama-index-llms-vllm

    import os

    # Store downloaded Hugging Face model files under ./model/
    os.environ["HF_HOME"] = "model/"

    from llama_index.llms.vllm import Vllm, VllmServer

    llm = Vllm(
        model="microsoft/Orca-2-7b",
        tensor_parallel_size=4,  # shard the model across 4 GPUs
        max_new_tokens=100,
        vllm_kwargs={"swap_space": 1, "gpu_memory_utilization": 0.5},
    )

    llm.complete("[INST]You are a helpful assistant[/INST] What is a black hole ?")

LLama-2-7b Completion Example

    # This example loads the CodeLlama-7b checkpoint, which is built on Llama 2.
    llm = Vllm(
        model="codellama/CodeLlama-7b-hf",
        dtype="float16",
        tensor_parallel_size=4,
        temperature=0,
        max_new_tokens=100,
        vllm_kwargs={
            "swap_space": 1,
            "gpu_memory_utilization": 0.5,
            "max_model_len": 4096,
        },
    )

    llm.complete("import socket\n\ndef ping_exponential_backoff(host: str):")

Mistral chat 7b Completion Example

    llm = Vllm(
        model="mistralai/Mistral-7B-Instruct-v0.1",
        dtype="float16",
        tensor_parallel_size=4,
        temperature=0,
        max_new_tokens=100,
        vllm_kwargs={
            "swap_space": 1,
            "gpu_memory_utilization": 0.5,
            "max_model_len": 4096,
        },
    )

    llm.complete(" What is a black hole ?")

Calling vLLM via HTTP
In this mode there is no need to install vLLM or to have CUDA available locally. To set up the vLLM API server, you can follow the guide linked here.
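To make the remote mode concrete, here is a rough sketch of the kind of HTTP request a client sends, using only the Python standard library. It assumes the server is the demo `vllm.entrypoints.api_server` listening at `http://localhost:8000/generate`, that it accepts a JSON body with `prompt`, `stream`, and sampling fields such as `max_tokens` and `temperature`, and that it answers with a JSON object containing a `text` list. The helper names `build_generate_request` and `generate` are hypothetical; check the field names against the server version you run.

```python
import json
import urllib.request


def build_generate_request(
    api_url: str, prompt: str, max_tokens: int = 100, temperature: float = 0.0
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the demo /generate endpoint."""
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,  # assumed sampling field name
        "temperature": temperature,
        "stream": False,
    }
    return urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(request: urllib.request.Request, timeout: float = 60.0) -> list:
    """Send the request and return the generated texts (assumed 'text' key)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["text"]


request = build_generate_request(
    "http://localhost:8000/generate", "What is a black hole?"
)
# Uncomment once a server is running at the URL above:
# print(generate(request))
```

Nothing in this sketch needs a GPU or the `vllm` package on the client side, which is the point of the remote mode.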
Note: the llama-index-llms-vllm module is a client for vllm.entrypoints.api_server, which is only a demo.
If the vLLM server is launched with vllm.entrypoints.openai.api_server as an OpenAI-compatible server, or via Docker, you need the OpenAILike class from the llama-index-llms-openai-like module instead.
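For the OpenAI-compatible case, a minimal sketch of configuring `OpenAILike` might look like the following. The base URL, the `/v1` path prefix, the model name, and the dummy API key are all assumptions about a typical local deployment rather than values from this page, and the helper names are hypothetical. The import is deferred into a function so that the settings can be inspected without the package installed.

```python
def openai_like_settings(base_url: str, model: str) -> dict:
    """Collect constructor arguments for OpenAILike.

    Assumptions: the server exposes its API under /v1 and was started
    without authentication, so the API key is only a placeholder.
    """
    return {
        "model": model,
        "api_base": base_url.rstrip("/") + "/v1",
        "api_key": "not-needed",
        "is_chat_model": True,  # route requests to the chat completions endpoint
    }


def build_llm(settings: dict):
    """Create the client; requires `pip install llama-index-llms-openai-like`."""
    from llama_index.llms.openai_like import OpenAILike

    return OpenAILike(**settings)


settings = openai_like_settings(
    "http://localhost:8000", "mistralai/Mistral-7B-Instruct-v0.1"
)
# With a server running and the package installed:
# llm = build_llm(settings)
# print(llm.complete("What is a black hole?"))
```

The `model` value should match the model name the server was launched with; otherwise the server is likely to reject the request.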
Completion Response

    from llama_index.core.llms import ChatMessage
    from llama_index.llms.vllm import VllmServer

    llm = VllmServer(
        api_url="http://localhost:8000/generate", max_new_tokens=100, temperature=0
    )

    llm.complete("what is a black hole ?")

    message = [ChatMessage(content="hello", role="user")]
    llm.chat(message)

Streaming Response

    # Consume the whole stream and keep only the final item.
    list(llm.stream_complete("what is a black hole"))[-1]

    message = [ChatMessage(content="what is a black hole", role="user")]
    [x for x in llm.stream_chat(message)][-1]

Async Response
    import asyncio

    # Top-level `await` works in a notebook; in a plain script these calls
    # must run inside a coroutine.
    await llm.acomplete("What is a black hole")

    await llm.achat(message)

    [x async for x in await llm.astream_complete("what is a black hole")][-1]

    [x async for x in await llm.astream_chat(message)][-1]
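Outside a notebook, the async calls need an event loop. The sketch below shows one way to drive an async stream to completion with `asyncio.run` and keep the last item, mirroring the `[...][-1]` pattern used on this page. To stay runnable without a server, it uses a hypothetical stand-in generator, `fake_stream`, in place of a real streaming call; the assumption that each item carries the text generated so far is part of the stand-in, not a statement about the library.

```python
import asyncio
from typing import AsyncIterator


async def fake_stream(text: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming call: yields growing prefixes."""
    words = text.split()
    for count in range(1, len(words) + 1):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield " ".join(words[:count])


async def last_chunk(stream: AsyncIterator[str]) -> str:
    """Consume an async stream and return its final item ('' if empty)."""
    last = ""
    async for chunk in stream:
        last = chunk
    return last


def main() -> str:
    # asyncio.run creates the event loop that a notebook provides implicitly.
    return asyncio.run(last_chunk(fake_stream("a black hole is very dense")))


print(main())
```

With a real client, the stand-in would be replaced by the stream obtained from the client's async streaming method inside the coroutine.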