# NVIDIA LLM Text Completion API
The `llama-index-llms-nvidia` package extends the `NVIDIA` class to support the `/completions` API for code completion models such as:

- `bigcode/starcoder2-7b`
- `bigcode/starcoder2-15b`
## Installation

```python
%pip install --upgrade --quiet llama-index-llms-nvidia
```

To get started:
- Create a free account with NVIDIA, which hosts NVIDIA AI Foundation models.
- Click on your model of choice.
- Under **Input**, select the **Python** tab, and click **Get API Key**. Then click **Generate Key**.
- Copy and save the generated key as `NVIDIA_API_KEY`. From there, you should have access to the endpoints.
```python
import getpass
import os

# del os.environ['NVIDIA_API_KEY']  ## delete key and reset
if os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    print("Valid NVIDIA_API_KEY already in environment. Delete to reset")
else:
    nvapi_key = getpass.getpass("NVAPI Key (starts with nvapi-): ")
    assert nvapi_key.startswith(
        "nvapi-"
    ), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key
```

```python
# llama-parse is async-first; running async code in a notebook requires nest_asyncio
import nest_asyncio

nest_asyncio.apply()
```

## Working with the NVIDIA API Catalog
### Usage of the `use_chat_completions` argument

- Set `None` (default) to decide per invocation whether to use the `/chat/completions` or `/completions` endpoint, via query keyword arguments.
- Set `False` to use the `/completions` endpoint.
- Set `True` to use the `/chat/completions` endpoint.
```python
from llama_index.llms.nvidia import NVIDIA

llm = NVIDIA(model="bigcode/starcoder2-15b", use_chat_completions=False)
```
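With the default `use_chat_completions=None`, the endpoint can instead be chosen at call time. The sketch below assumes the per-call override uses the same `use_chat_completions` keyword argument as the constructor; treat that keyword name as an assumption rather than confirmed API.

```python
from llama_index.llms.nvidia import NVIDIA

# Leave use_chat_completions at its default (None) so the endpoint is decided per call
llm = NVIDIA(model="bigcode/starcoder2-15b")

# Assumption: the per-call keyword mirrors the constructor argument name
print(llm.complete("# Function that does quicksort:", use_chat_completions=False))
```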
## Available models

Use `is_chat_model` to filter the available text completion models:
```python
print([model for model in llm.available_models if model.is_chat_model])
```

## Working with NVIDIA NIMs
Section titled “Working with NVIDIA NIMs”In addition to connecting to hosted NVIDIA NIMs, this connector can be used to connect to local NIM instances. This helps you take your applications local when necessary.
For instructions on how to set up local NIM instances, refer to NVIDIA NIM.
```python
from llama_index.llms.nvidia import NVIDIA

# Connect to a NIM running at localhost:8080
llm = NVIDIA(base_url="http://localhost:8080/v1")
```

## Complete: `.complete()`
We can use `.complete()`/`.acomplete()` (which take a string prompt) to request a completion from the selected model.
Let’s use our default model for this task.
```python
print(llm.complete("# Function that does quicksort:"))
```

As expected, LlamaIndex returns a `CompletionResponse`.
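If you want the generated text as a value rather than printing the whole response object, `CompletionResponse` exposes it on its `text` attribute:

```python
from llama_index.llms.nvidia import NVIDIA

llm = NVIDIA(model="bigcode/starcoder2-15b", use_chat_completions=False)

response = llm.complete("# Function that does quicksort:")
# The generated completion string lives on the response's `text` attribute
print(response.text)
```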
## Async Complete: `.acomplete()`

There is also an async implementation which can be leveraged in the same way!

```python
await llm.acomplete("# Function that does quicksort:")
```

## Streaming
```python
x = llm.stream_complete(prompt="# Reverse string in python:", max_tokens=512)
for t in x:
    print(t.delta, end="")
```

## Async Streaming
```python
x = await llm.astream_complete(
    prompt="# Reverse program in python:", max_tokens=512
)
async for t in x:
    print(t.delta, end="")
```