HuggingFace LLM - StableLM
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-llms-huggingface
!pip install llama-index
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core import Settings
INFO:numexpr.utils:Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
/home/loganm/miniconda3/envs/gpt_index/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
Download Data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
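If wget is not available in your environment (for example on Windows), an equivalent download in plain Python is sketched below; it is not part of the original notebook, just the standard library doing what the shell commands above do.

import os
import urllib.request

# same directory and URL as the shell commands above
os.makedirs("data/paul_graham", exist_ok=True)
urllib.request.urlretrieve(
    "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt",
    "data/paul_graham/paul_graham_essay.txt",
)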
Load documents, build the VectorStoreIndex
# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
# setup prompts - specific to StableLM
from llama_index.core import PromptTemplate
system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""
# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")
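To check what the wrapper produces before it reaches the model, you can format it directly; a minimal sketch (not in the original notebook), assuming PromptTemplate.format substitutes the query_str variable like str.format:

# prints: <|USER|>What did the author do growing up?<|ASSISTANT|>
print(query_wrapper_prompt.format(query_str="What did the author do growing up?"))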
import torch
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
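Before wiring the model into an index, it can help to smoke-test it with a direct completion call; a minimal sketch (not a step from the original notebook), assuming the generic llm.complete interface that llama-index LLMs expose:

# optional smoke test: the first call downloads/loads the model, so it may take a while
completion = llm.complete("What is StableLM?")
print(completion.text)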
Settings.llm = llm
Settings.chunk_size = 1024
Loading checkpoint shards: 100%|██████████| 2/2 [00:24<00:00, 12.21s/it]
index = VectorStoreIndex.from_documents(documents)
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 20729 tokens
Query Index
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 8 tokens
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 2126 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
print(response)
The author is a computer scientist who has written several books on programming languages and software development. He worked on the IBM 1401 and wrote a program to calculate pi. He also wrote a program to predict how high a rocket ship would fly. The program was written in Fortran and used a TRS-80 microcomputer. The author is a PhD student and has been working on multiple projects, including a novel and a PBS documentary. He is envious of the author's work and feels that he has made significant contributions to the field of computer science. He is working on multiple projects and is envious of the author's work. He is also interested in learning Italian and is considering taking the entrance exam in Florence. The author is not aware of how he managed to pass the written exam and is not sure how he will manage to do so.
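Besides the answer text, the response object also carries the retrieved chunks the answer was grounded on; a short sketch of inspecting them (not in the original notebook), assuming the standard source_nodes attribute on the response:

# show the similarity score and the first 100 characters of each retrieved chunk
for source in response.source_nodes:
    print(source.score, source.node.get_content()[:100])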
Query Index - Streaming
query_engine = index.as_query_engine(streaming=True)
# set Logging to DEBUG for more detailed outputs
response_stream = query_engine.query("What did the author do growing up?")
INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 8 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 0 tokens
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
# can be slower to start streaming since llama-index often involves many LLM calls
response_stream.print_response_stream()
The author is a computer scientist who has written several books on programming languages and software development. He worked on the IBM 1401 and wrote a program to calculate pi. He also wrote a program to predict how high a rocket ship would fly. The program was written in Fortran and used a TRS-80 microcomputer. The author is a PhD student and has been working on multiple projects, including a novel and a PBS documentary. He is envious of the author's work and feels that he has made significant contributions to the field of computer science. He is working on multiple projects and is envious of the author's work. He is also interested in learning Italian and is considering taking the entrance exam in Florence. The author is not aware of how he managed to pass the written exam and is not sure how he will manage to do so.<|USER|>
# can also get a normal response object
response = response_stream.get_response()
print(response)
# can also iterate over the generator yourself
generated_text = ""
for text in response.response_gen:
    generated_text += text
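Once the generator is exhausted, generated_text holds the full streamed answer; for example:

print(generated_text)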