Metadata Extraction
In this notebook, we will demonstrate the following:
- RAG using Metadata Extractors.
- Extract Metadata using PydanticProgram.
Installation
```python
!pip install llama-index
!pip install llama-index-readers-web
```

```python
import nest_asyncio

nest_asyncio.apply()

import os
```
Setup API Key
```python
os.environ["OPENAI_API_KEY"] = "sk-..."
```
Define LLM
```python
from llama_index.llms.openai import OpenAI
from llama_index.core.schema import MetadataMode
from llama_index.core import Settings

llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo", max_tokens=512)
Settings.llm = llm
```
Node Parser and Metadata Extractors
```python
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
)

node_parser = TokenTextSplitter(
    separator=" ", chunk_size=256, chunk_overlap=128
)

question_extractor = QuestionsAnsweredExtractor(
    questions=3, llm=llm, metadata_mode=MetadataMode.EMBED
)
```
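The notebook applies these transformations further on; as a point of reference, here is a minimal sketch (not part of the original notebook) of how the splitter and extractor can be chained with LlamaIndex's `IngestionPipeline`. The `docs` variable is loaded in the next section, and the metadata key shown is what the extractor is expected to add.

```python
from llama_index.core.ingestion import IngestionPipeline

# Chain the splitter and the question extractor; each resulting node should
# carry the extracted questions in its metadata.
pipeline = IngestionPipeline(transformations=[node_parser, question_extractor])

# `docs` is loaded in the "Load Data" section below.
# nodes = pipeline.run(documents=docs)
# print(nodes[0].metadata)  # e.g., "questions_this_excerpt_can_answer"
```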
Load Data
```python
from llama_index.readers.web import SimpleWebPageReader

reader = SimpleWebPageReader(html_to_text=True)
docs = reader.load_data(urls=["https://eugeneyan.com/writing/llm-patterns/"])
```

```python
print(docs[0].get_content())
```
# [eugeneyan](/)
* [Start Here](/start-here/ "Start Here") * [Writing](/writing/ "Writing") * [Speaking](/speaking/ "Speaking") * [Prototyping](/prototyping/ "Prototyping") * [About](/about/ "About")
# Patterns for Building LLM-based Systems & Products
[ [llm](/tag/llm/) [engineering](/tag/engineering/) [production](/tag/production/) [🔥](/tag/🔥/) ] · 66 min read
> Discussions on [HackerNews](https://news.ycombinator.com/item?id=36965993), [Twitter](https://twitter.com/eugeneyan/status/1686531758701899776), and [LinkedIn](https://www.linkedin.com/posts/eugeneyan_patterns-for-building-llm-based-systems-activity-7092300473981927424-_wVo)
“There is a large class of problems that are easy to imagine and build demos for, but extremely hard to make products out of. For example, self-driving: It’s easy to demo a car self-driving around a block, but making it into a product takes a decade.” - [Karpathy](https://twitter.com/eugeneyan/status/1672692174704766976)
This write-up is about practical patterns for integrating large language models (LLMs) into systems & products. We’ll build on academic research, industry resources, and practitioner know-how, and distill them into key ideas and practices.
There are seven key patterns. They’re also organized along the spectrum of improving performance vs. reducing cost/risk, and closer to the data vs. closer to the user.
* Evals: To measure performance
* RAG: To add recent, external knowledge
* Fine-tuning: To get better at specific tasks
* Caching: To reduce latency & cost
* Guardrails: To ensure output quality
* Defensive UX: To anticipate & manage errors gracefully
* Collect user feedback: To build our data flywheel
(Also see this addendum on [how to match these LLM patterns to potential problems](/writing/llm-problems/).)

LLM patterns: From data to user, from defensive to offensive (see connections between patterns)
## Evals: To measure performance
Evaluations are a set of measurements used to assess a model’s performance on a task. They include benchmark data and metrics. From a [HackerNews comment](https://news.ycombinator.com/item?id=36789901):
> How important evals are to the team is a major differentiator between folks rushing out hot garbage and those seriously building products in the space.
### Why evals?
Evals enable us to measure how well our system or product is doing and detect any regressions. (A system or product can be made up of multiple components such as LLMs, prompt templates, retrieved context, and parameters like temperature.) A representative set of evals takes us a step towards measuring system changes at scale. Without evals, we would be flying blind, or would have to visually inspect LLM outputs with each change.
### More about evals
**There are many benchmarks in the field of language modeling**. Some notable ones are:
* **[MMLU](https://arxiv.org/abs/2009.03300)**: A set of 57 tasks that span elementary math, US history, computer science, law, and more. To perform well, models must possess extensive world knowledge and problem-solving ability.
* **[EleutherAI Eval](https://github.com/EleutherAI/lm-evaluation-harness)**: Unified framework to test models via zero/few-shot settings on 200 tasks. Incorporates a large number of evals including BigBench, MMLU, etc.
* **[HELM](https://arxiv.org/abs/2211.09110)**: Instead of specific tasks and metrics, HELM offers a comprehensive assessment of LLMs by evaluating them across domains. Metrics include accuracy, calibration, robustness, fairness, bias, toxicity, etc. Tasks include Q&A, information retrieval, summarization, text classification, etc.
* **[AlpacaEval](https://github.com/tatsu-lab/alpaca_eval)**: Automated evaluation framework which measures how often a strong LLM (e.g., GPT-4) prefers the output of one model over a reference model. Metrics include win rate, bias, latency, price, variance, etc. Validated to have high agreement with 20k human annotations.
We can group metrics into two categories: context-dependent or context-free.
* **Context-dependent**: These take context into account. They’re often proposed for a specific task; repurposing them for other tasks will require some adjustment.
* **Context-free**: These aren’t tied to the context when evaluating generated output; they only compare the output with the provided gold references. As they’re task agnostic, they’re easier to apply to a wide variety of tasks.
To get a better sense of these metrics (and their potential shortfalls), we’ll explore a few of the commonly used metrics such as BLEU, ROUGE, BERTScore, and MoverScore.
**[BLEU](https://dl.acm.org/doi/10.3115/1073083.1073135) (Bilingual Evaluation Understudy)** is a precision-based metric: It counts the number of n-grams in the generated output that also show up in the reference, and then divides it by the total number of words in the output. It’s predominantly used in machine translation and remains a popular metric due to its cost-effectiveness.
First, precision for various values of \\(n\\) is computed:
\\[\text{precision}_n = \frac{\sum_{p \in \text{output}} \sum_{\text{n-gram}\in p} \text{Count}_{\text{clip}} (\text{n-gram})}{\sum_{p \in \text{output}}\sum_{\text{n-gram} \in p} \text{Count}(\text{n-gram})}\\]
\\(Count_{clip}(\text{n-gram})\\) is clipped by the maximum number of times an n-gram appears in any corresponding reference sentence.
\\[\text{Count}_{\text{clip}}(n\text{-gram}) = \min \left(\text{matched }n\text{-gram count}, \max_{r \in R} \left(n\text{-gram count in }r\right)\right)\\]
Once we’ve computed precision at various \\(n\\), a final BLEU-N score is computed as the geometric mean of all the \\(precision_n\\) scores.
However, since precision relies solely on n-grams and doesn’t consider the length of the generated output, an output containing just one unigram of a common word (like a stop word) would achieve perfect precision. This can be misleading and encourage outputs that contain fewer words to increase BLEU scores. To counter this, a brevity penalty is added to penalize excessively short sentences.
\\[BP = \begin{cases} 1 & \text{if } |p| > |r| \\\ e^{1-\frac{|r|}{|p|}} &\text{otherwise} \end{cases}\\]
Thus, the final formula is:
\\[\text{BLEU-N} = BP \cdot \exp\left(\sum_{n=1}^{N} W_n\log(\text{precision}_n)\right)\\]
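To make the formulas concrete, here is a toy sketch of BLEU-N with uniform weights \\(W_n = 1/N\\) for a single output/reference pair (illustrative only; use a tested package such as sacrebleu in practice).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def bleu(output: str, reference: str, max_n: int = 4) -> float:
    out, ref = output.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        out_counts = Counter(ngrams(out, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each output n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
        total = max(sum(out_counts.values()), 1)
        precisions.append(max(matched, 1e-9) / total)
    # Brevity penalty for outputs shorter than the reference.
    bp = 1.0 if len(out) > len(ref) else math.exp(1 - len(ref) / max(len(out), 1))
    # Geometric mean of the n-gram precisions (uniform weights 1/N).
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


print(bleu("the cat sat on the mat", "the cat is on the mat"))
```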
**[ROUGE](https://aclanthology.org/W04-1013/) (Recall-Oriented Understudy for Gisting Evaluation)**: In contrast to BLEU, ROUGE is recall-oriented. It counts the number of words in the reference that also occur in the output. It’s typically used to assess automatic summarization tasks.
There are several ROUGE variants. ROUGE-N is most similar to BLEU in that it also counts the number of matching n-grams between the output and the reference.
\\[\text{ROUGE-N} = \frac{\sum_{s_r \in \text{references}} \sum_{n\text{-gram}\in s_r} \text{Count}_{\text{match}} (n\text{-gram})}{\sum_{s_r \in\text{references}} \sum_{n\text{-gram} \in s_r} \text{Count}(n\text{-gram})}\\]
Other variants include:
* ROUGE-L: This measures the longest common subsequence (LCS) between the output and the reference. It considers sentence-level structure similarity and zeros in on the longest series of co-occurring in-sequence n-grams.
* ROUGE-S: This measures the skip-bigram between the output and reference. Skip-bigrams are pairs of words that maintain their sentence order regardless of the words that might be sandwiched between them.
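For reference, a toy ROUGE-N recall calculation against a single reference might look like the sketch below (illustrative only; in practice, use a tested implementation such as the `rouge-score` package, given the reproducibility issues discussed later).

```python
from collections import Counter


def rouge_n(output: str, reference: str, n: int = 1) -> float:
    out_tokens, ref_tokens = output.split(), reference.split()
    out = [tuple(out_tokens[i : i + n]) for i in range(len(out_tokens) - n + 1)]
    ref = [tuple(ref_tokens[i : i + n]) for i in range(len(ref_tokens) - n + 1)]
    ref_counts, out_counts = Counter(ref), Counter(out)
    # Count reference n-grams that also appear in the output (clipped).
    matched = sum(min(c, out_counts[g]) for g, c in ref_counts.items())
    return matched / max(sum(ref_counts.values()), 1)


print(rouge_n("the cat sat on the mat", "the cat is on the mat", n=1))
```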
**[BERTScore](https://arxiv.org/abs/1904.09675)** is an embedding-based metric that uses cosine similarity to compare each token or n-gram in the generated output with the reference sentence. There are three components to BERTScore:
* Recall: Average cosine similarity between each token in the reference and its closest match in the generated output.
* Precision: Average cosine similarity between each token in the generated output and its nearest match in the reference.
* F1: Harmonic mean of recall and precision.
\\[Recall_{\text{BERT}} = \frac{1}{|r|} \sum_{i \in r} \max_{j \in p} \vec{i}^T \vec{j}, \quad Precision_{\text{BERT}} = \frac{1}{|p|} \sum_{j \in p} \max_{i \in r} \vec{i}^T \vec{j}\\] \\[\text{BERTscore} = F_{\text{BERT}} = \frac{2 \cdot P_{\text{BERT}} \cdot R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}\\]
BERTScore is useful because it can account for synonyms and paraphrasing. Simpler metrics like BLEU and ROUGE can’t do this due to their reliance on exact matches. BERTScore has been shown to have better correlation for tasks such as image captioning and machine translation.
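Given contextual token embeddings from a model like BERT, the greedy matching behind these formulas reduces to a few lines; this sketch assumes `ref_emb` and `out_emb` are matrices of L2-normalized token embeddings (how they are produced is left out).

```python
import numpy as np


def bertscore_f1(ref_emb: np.ndarray, out_emb: np.ndarray) -> float:
    """ref_emb: (|r|, d), out_emb: (|p|, d); rows are L2-normalized token embeddings."""
    sim = ref_emb @ out_emb.T           # pairwise cosine similarities
    recall = sim.max(axis=1).mean()     # each reference token -> best output match
    precision = sim.max(axis=0).mean()  # each output token -> best reference match
    return 2 * precision * recall / (precision + recall)
```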
**[MoverScore](https://arxiv.org/abs/1909.02622)** also uses contextualized embeddings to compute the distance between tokens in the generated output and reference. But unlike BERTScore, which is based on one-to-one matching (or “hard alignment”) of tokens, MoverScore allows for many-to-one matching (or “soft alignment”).

BERTScore (left) vs. MoverScore (right; [source](https://arxiv.org/abs/1909.02622))
MoverScore enables the mapping of semantically related words in one sequence to their counterparts in another sequence. It does this by solving a constrained optimization problem that finds the minimum effort to transform one text into another. The idea is to measure the distance that words would have to move to convert one sequence to another.
However, there are several pitfalls to using these conventional benchmarks and metrics.
First, there’s **poor correlation between these metrics and human judgments.** BLEU, ROUGE, and others have had [negative correlation with how humans evaluate fluency](https://arxiv.org/abs/2008.12009). They also showed moderate to less correlation with human adequacy scores. In particular, BLEU and ROUGE have [low correlation with tasks that require creativity and diversity](https://arxiv.org/abs/2303.16634).
Second, these metrics often have **poor adaptability to a wider variety of tasks**. Adopting a metric proposed for one task for use on another is not always prudent. For example, exact match metrics such as BLEU and ROUGE are a poor fit for tasks like abstractive summarization or dialogue. Since they’re based on n-gram overlap between output and reference, they don’t make sense for a dialogue task where a wide variety of responses are possible. An output can have zero n-gram overlap with the reference and yet still be a good response.
Third, these metrics have **poor reproducibility**. Even for the same metric, [high variance is reported across different studies](https://arxiv.org/abs/2008.12009), possibly due to variations in human judgment collection or metric parameter settings. Another study of [ROUGE scores](https://aclanthology.org/2023.acl-long.107/) across 2,000 studies found that scores were hard to reproduce, difficult to compare, and often incorrect because evals were often conducted with untested, incorrect ROUGE implementations.

Dimensions of model evaluations with ROUGE ([source](https://aclanthology.org/2023.acl-long.107/))
And even with recent benchmarks such as MMLU, **the same model can get significantly different scores based on the eval implementation**. [Huggingface compared the original MMLU implementation](https://huggingface.co/blog/evaluating-mmlu-leaderboard) with the HELM and EleutherAI implementations and found that the same example could have different prompts across various providers.

Different prompts for the same question across MMLU implementations ([source](https://huggingface.co/blog/evaluating-mmlu-leaderboard))
Furthermore, the evaluation approach differed across all three benchmarks:
* Original MMLU: Compares predicted probabilities on the answers only (A, B, C, D).
* HELM: Uses the next token probabilities from the model and picks the token with the highest probability, even if it’s _not_ one of the options.
* EleutherAI: Computes the probability of the full answer sequence (i.e., a letter followed by the answer text) for each answer, then picks the answer with the highest probability.

Different eval for the same question across MMLU implementations ([source](https://huggingface.co/blog/evaluating-mmlu-leaderboard))
As a result, even for the same eval, both absolute scores and model ranking can fluctuate widely depending on the eval implementation. This means that model metrics aren’t truly comparable, even for the same eval, unless the eval’s implementation is identical down to minute details like prompts and tokenization. Similarly, the author of QLoRA found MMLU overly sensitive and concluded: “[do not work with/report or trust MMLU scores](https://twitter.com/Tim_Dettmers/status/1673446047266504704)”.
Beyond conventional evals such as those mentioned above, **an emerging trend is to use a strong LLM as a reference-free metric** to evaluate generations from other LLMs. This means we may not need human judgments or gold references for evaluation.
**[G-Eval](https://arxiv.org/abs/2303.16634) is a framework that applies LLMs** with Chain-of-Thought (CoT) and a form-filling paradigm to **evaluate LLM outputs**. First, they provide a task introduction and evaluation criteria to an LLM and ask it to generate a CoT of evaluation steps. Then, to evaluate coherence in news summarization, they concatenate the prompt, CoT, news article, and summary and ask the LLM to output a score between 1 and 5. Finally, they use the probabilities of the output tokens from the LLM to normalize the score and take their weighted summation as the final result.

Overview of G-Eval ([source](https://arxiv.org/abs/2303.16634))
They found that GPT-4 as an evaluator had a high Spearman correlation with human judgments (0.514), outperforming all previous methods. It also outperformed traditional metrics on aspects such as coherence, consistency, fluency, and relevance. On topical chat, it did better than traditional metrics such as ROUGE-L, BLEU-4, and BERTScore across several criteria such as naturalness, coherence, engagingness, and groundedness.
**The [Vicuna](https://arxiv.org/abs/2306.05685) paper adopted a similar approach.** They start by defining eight categories (writing, roleplay, extraction, reasoning, math, coding, STEM, and humanities/social science) before developing 10 questions for each category. Next, they generated answers from five chatbots: LLaMA, Alpaca, ChatGPT, Bard, and Vicuna. Finally, they asked GPT-4 to rate the quality of the answers based on helpfulness, relevance, accuracy, and detail.
Overall, they found that GPT-4 not only provided consistent scores but could also give detailed explanations for those scores. Under the single answer grading paradigm, GPT-4 had higher agreement with humans (85%) than the humans had amongst themselves (81%). This suggests that GPT-4’s judgment aligns closely with the human evaluators.
**[QLoRA](https://arxiv.org/abs/2305.14314) also used an LLM to evaluate another LLM’s output.** They asked GPT-4 to rate the performance of various models against gpt-3.5-turbo on the Vicuna benchmark. Given the responses from gpt-3.5-turbo and another model, GPT-4 was prompted to score both out of 10 and explain its ratings. They also measured performance via direct comparisons between models, simplifying the task to a three-class rating scheme that included ties.
To validate the automated evaluation, they collected human judgments on the Vicuna benchmark. Using Mechanical Turk, they enlisted two annotators for comparisons to gpt-3.5-turbo, and three annotators for pairwise comparisons. They found that human and GPT-4 ranking of models were largely in agreement, with a Spearman rank correlation of 0.55 at the model level. This provides an additional data point suggesting that LLM-based automated evals could be a cost-effective and reasonable alternative to human evals.
### How to apply evals?
**Building solid evals should be the starting point** for any LLM-based system or product (as well as conventional machine learning systems).
Unfortunately, classical metrics such as BLEU and ROUGE don’t make sense for more complex tasks such as abstractive summarization or dialogue. Furthermore, we’ve seen that benchmarks like MMLU (and metrics like ROUGE) are sensitive to how they’re implemented and measured. And to be candid, unless your LLM system is studying for a school exam, using MMLU as an eval [doesn’t quite make sense](https://twitter.com/Tim_Dettmers/status/1680782418335367169).
Thus, instead of using off-the-shelf benchmarks, we can **start by collecting a set of task-specific evals** (i.e., prompt, context, expected outputs as references). These evals will then guide prompt engineering, model selection, fine-tuning, and so on. And as we update our systems, we can run these evals to quickly measure improvements or regressions. Think of it as Eval Driven Development (EDD).
In addition to the evaluation dataset, we **also need useful metrics**. They help us distill performance changes into a single number that’s comparable across eval runs. And if we can simplify the problem, we can choose metrics that are easier to compute and interpret.
The simplest task is probably classification: If we’re using an LLM for classification-like tasks (e.g., toxicity detection, document categorization) or extractive QA without dialogue, we can rely on standard classification metrics such as recall, precision, PRAUC, etc. If our task has no correct answer but we have references (e.g., machine translation, extractive summarization), we can rely on reference metrics based on matching (BLEU, ROUGE) or semantic similarity (BERTScore, MoverScore).
However, these metrics may not work for more open-ended tasks such as abstractive summarization, dialogue, and others. But collecting human judgments can be slow and expensive. Thus, we may opt to lean on **automated evaluations via a strong LLM**.
Relative to human judgments which are typically noisy (due to differing biases among annotators), LLM judgments tend to be less noisy (as the bias is more systematic) but more biased. Nonetheless, since we’re aware of these biases, we can mitigate them accordingly:
* Position bias: LLMs tend to favor the response in the first position. To mitigate this, we can evaluate the same pair of responses twice while swapping their order. If the same response is preferred in both orders, we mark it as a win; else, it’s a tie.
* Verbosity bias: LLMs tend to favor longer, wordier responses over more concise ones, even if the latter is clearer and of higher quality. A possible solution is to ensure that comparison responses are similar in length.
* Self-enhancement bias: LLMs have a slight bias towards their own answers. [GPT-4 favors itself with a 10% higher win rate while Claude-v1 favors itself with a 25% higher win rate.](https://arxiv.org/abs/2306.05685) To counter this, don’t use the same LLM for evaluation tasks.
Another tip: Rather than asking an LLM for a direct evaluation (via giving a score), try giving it a reference and asking for a comparison. This helps with reducing noise.
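Putting the pairwise-comparison and position-bias mitigation ideas together, a hedged sketch of an LLM-as-judge helper might look like this; it reuses the `llm` defined at the top of the notebook, and the prompt wording is purely illustrative.

```python
JUDGE_PROMPT = """You are judging two answers to the same question.
Question: {question}
Answer A: {a}
Answer B: {b}
Reply with exactly one letter: A if A is better, B if B is better, C for a tie."""


def judge_pair(question: str, answer_1: str, answer_2: str) -> str:
    """Returns 'answer_1', 'answer_2', or 'tie'; scores both orderings to reduce position bias."""
    first = llm.complete(JUDGE_PROMPT.format(question=question, a=answer_1, b=answer_2)).text.strip()
    second = llm.complete(JUDGE_PROMPT.format(question=question, a=answer_2, b=answer_1)).text.strip()
    if first.startswith("A") and second.startswith("B"):
        return "answer_1"  # preferred in both orderings -> win
    if first.startswith("B") and second.startswith("A"):
        return "answer_2"
    return "tie"           # disagreement across orderings counts as a tie
```

Marking disagreement across the two orderings as a tie follows the win/tie rule described in the list above.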
Finally, sometimes the best eval is human eval aka vibe check. (Not to be confused with the poorly named code evaluation benchmark [HumanEval](https://arxiv.org/abs/2107.03374).) As mentioned in the [Latent Space podcast with MosaicML](https://www.latent.space/p/mosaic-mpt-7b#details) (34th minute):
> The vibe-based eval cannot be underrated. … One of our evals was just having a bunch of prompts and watching the answers as the models trained and see if they change. Honestly, I don’t really believe that any of these eval metrics capture what we care about. One of our prompts was “suggest games for a 3-year-old and a 7-year-old to play” and that was a lot more valuable to see how the answer changed during the course of training. — Jonathan Frankle
Also see this [deep dive into evals](/writing/abstractive/) for abstractive summarization. It covers reference, context, and preference-based metrics, and also discusses hallucination detection.
## Retrieval-Augmented Generation: To add knowledge
Retrieval-Augmented Generation (RAG) fetches relevant data from outside the foundation model and enhances the input with this data, providing richer context to improve output.
### Why RAG?
RAG helps reduce hallucination by grounding the model on the retrieved context, thus increasing factuality. In addition, it’s cheaper to keep retrieval indices up-to-date than to continuously pre-train an LLM. This cost efficiency makes it easier to provide LLMs with access to recent data via RAG. Finally, if we need to update or remove data such as biased or toxic documents, it’s more straightforward to update the retrieval index (compared to fine-tuning or prompting an LLM not to generate toxic outputs).
In short, RAG applies mature and simpler ideas from the field of information retrieval to support LLM generation. In a [recent Sequoia survey](https://www.sequoiacap.com/article/llm-stack-perspective/), 88% of respondents believe that retrieval will be a key component of their stack.
### More about RAG
Before diving into RAG, it helps to have a basic understanding of text embeddings. (Feel free to skip this section if you’re familiar with the subject.)
A text embedding is a **compressed, abstract representation of text data** where text of arbitrary length can be represented as a fixed-size vector of numbers. It’s usually learned from a corpus of text such as Wikipedia. Think of them as a universal encoding for text, where **similar items are close to each other while dissimilar items are farther apart**.
A good embedding is one that does well on a downstream task, such as retrieving similar items. Huggingface’s [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard) scores various models on diverse tasks such as classification, clustering, retrieval, summarization, etc.
Quick note: While we mainly discuss text embeddings here, embeddings can take many modalities. For example, [CLIP](https://arxiv.org/abs/2103.00020) is multimodal and embeds images and text in the same space, allowing us to find images most similar to an input text. We can also [embed products based on user behavior](/writing/search-query-matching/#supervised-techniques-improves-modeling-of-our-desired-event) (e.g., clicks, purchases) or [graph relationships](/writing/search-query-matching/#self-supervised-techniques-no-need-for-labels).
**RAG has its roots in open-domain Q&A.** An early [Meta paper](https://arxiv.org/abs/2005.04611) showed that retrieving relevant documents via TF-IDF and providing them as context to a language model (BERT) improved performance on an open-domain QA task. They converted each task into a cloze statement and queried the language model for the missing token.
Following that, **[Dense Passage Retrieval (DPR)](https://arxiv.org/abs/2004.04906)** showed that using dense embeddings (instead of a sparse vector space such as TF-IDF) for document retrieval can outperform strong baselines like Lucene BM25 (65.2% vs. 42.9% for top-5 accuracy). They also showed that higher retrieval precision translates to higher end-to-end QA accuracy, highlighting the importance of upstream retrieval.
To learn the DPR embedding, they fine-tuned two independent BERT-based encoders on existing question-answer pairs. The passage encoder (\\(E_p\\)) embeds text passages into vectors while the query encoder (\\(E_q\\)) embeds questions into vectors. The query embedding is then used to retrieve \\(k\\) passages that are most similar to the question.
They trained the encoders so that the dot-product similarity makes a good ranking function, and optimized the loss function as the negative log-likelihood of the positive passage. The DPR embeddings are optimized for maximum inner product between the question and relevant passage vectors. The goal is to learn a vector space such that pairs of questions and their relevant passages are close together.
For inference, they embed all passages (via \\(E_p\\)) and index them in FAISS offline. Then, given a question at query time, they compute the question embedding (via \\(E_q\\)), retrieve the top \\(k\\) passages via approximate nearest neighbors, and provide them to the language model (BERT) that outputs the answer to the question.
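The offline-index / online-query split described above can be sketched with FAISS inner-product search. Here `passages`, `embed_passage`, and `embed_question` are assumed stand-ins for the corpus and the two DPR encoders.

```python
import faiss
import numpy as np

# Offline: embed all passages with the passage encoder and index them.
passage_vecs = np.stack([embed_passage(p) for p in passages]).astype("float32")
index = faiss.IndexFlatIP(passage_vecs.shape[1])  # inner-product similarity
index.add(passage_vecs)

# Online: embed the question and retrieve the top-k passages.
q = embed_question("Who proposed dense passage retrieval?").astype("float32").reshape(1, -1)
scores, ids = index.search(q, 5)
top_passages = [passages[i] for i in ids[0]]
```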
**[Retrieval Augmented Generation (RAG)](https://arxiv.org/abs/2005.11401)**, from which this pattern gets its name, highlighted the downsides of pre-trained LLMs. These include not being able to expand or revise memory, not providing insights into generated output, and hallucinations.
To address these downsides, they introduced RAG (aka semi-parametric models). Dense vector retrieval serves as the non-parametric component while a pre-trained LLM acts as the parametric component. They reused the DPR encoders to initialize the retriever and build the document index. For the LLM, they used BART, a 400M parameter seq2seq model.

Overview of Retrieval Augmented Generation ([source](https://arxiv.org/abs/2005.11401))
During inference, they concatenate the input with the retrieved document. Then, the LLM generates \\(\text{token}_i\\) based on the original input, the retrieved document, and the previous \\(i-1\\) tokens. For generation, they proposed two approaches that vary in how the retrieved passages are used to generate output.
In the first approach, RAG-Sequence, the model uses the same document to generate the complete sequence. Thus, for \\(k\\) retrieved documents, the generator produces an output for each document. Then, the probability of each output sequence is marginalized (sum the probability of each output sequence in \\(k\\) and weigh it by the probability of each document being retrieved). Finally, the output sequence with the highest probability is selected.
On the other hand, RAG-Token can generate each token based on a _different_ document. Given \\(k\\) retrieved documents, the generator produces a distribution for the next output token for each document before marginalizing (aggregating all the individual token distributions). The process is then repeated for the next token. This means that, for each token generation, it can retrieve a different set of \\(k\\) relevant documents based on the original input _and_ previously generated tokens. Thus, documents can have different retrieval probabilities and contribute differently to the next generated token.
[**Fusion-in-Decoder (FiD)**](https://arxiv.org/abs/2007.01282) also uses retrieval with generative models for open-domain QA. It supports two methods for retrieval, BM25 (Lucene with default parameters) and DPR. FiD is named for how it performs fusion on the retrieved documents in the decoder only.

Overview of Fusion-in-Decoder ([source](https://arxiv.org/abs/2007.01282))
For each retrieved passage, the title and passage are concatenated with the question. These pairs are processed independently in the encoder. They also add special tokens such as `question:`, `title:`, and `context:` before their corresponding sections. The decoder attends over the concatenation of these retrieved passages.
Because it processes passages independently in the encoder, it can scale to a large number of passages as it only needs to do self-attention over one context at a time. Thus, compute grows linearly (instead of quadratically) with the number of retrieved passages, making it more scalable than alternatives such as RAG-Token. Then, during decoding, the decoder processes the encoded passages jointly, allowing it to better aggregate context across multiple retrieved passages.
[**Retrieval-Enhanced Transformer (RETRO)**](https://arxiv.org/abs/2112.04426) adopts a similar pattern where it combines a frozen BERT retriever, a differentiable encoder, and chunked cross-attention to generate output. What’s different is that RETRO does retrieval throughout the entire pre-training stage, and not just during inference. Furthermore, they fetch relevant documents based on chunks of the input. This allows for finer-grained, repeated retrieval during generation instead of only retrieving once per query.
For each input chunk (\\(C_u\\)), the \\(k\\) retrieved chunks \\(RET(C_u)\\) are fed into an encoder. The output is the encoded neighbors \\(E^{j}_{u}\\) where \\(E^{j}_{u} = \text{Encoder}(\text{RET}(C_{u})^{j}, H_{u}) \in \mathbb{R}^{r \times d_{0}}\\). Here, each chunk encoding is conditioned on \\(H_u\\) (the intermediate activations) and the activations of chunk \\(C_u\\) through cross-attention layers. In short, the encoding of the retrieved chunks depends on the attended activation of the input chunk. \\(E^{j}_{u}\\) is then used to condition the generation of the next chunk.

Overview of RETRO ([source](https://arxiv.org/abs/2112.04426))
During retrieval, RETRO splits the input sequence into chunks of 64 tokens. Then, it finds text similar to the _previous_ chunk to provide context to the _current_ chunk. The retrieval index consists of two contiguous chunks of tokens, \\(N\\) and \\(F\\). The former is the neighbor chunk (64 tokens) which is used to compute the key while the latter is the continuation chunk (64 tokens) in the original document.
Retrieval is based on approximate \\(k\\)-nearest neighbors via \\(L_2\\) distance (Euclidean) on BERT embeddings. (Interesting departure from the usual cosine or dot product similarity.) The retrieval index, built on SCaNN, can query a 2T token database in 10ms.
They also demonstrated how to RETRO-fit existing baseline models. By freezing the pre-trained weights and only training the chunked cross-attention and neighbor encoder parameters (< 10% of weights for a 7B model), they can enhance transformers with retrieval while only requiring 6M training sequences (3% of pre-training sequences). RETRO-fitted models were able to surpass the performance of baseline models and achieve performance close to that of RETRO trained from scratch.

Performance from RETRO-fitting a pre-trained model ([source](https://arxiv.org/abs/2112.04426))
**[Internet-augmented LMs](https://arxiv.org/abs/2203.05115)** proposes using a humble “off-the-shelf” search engine to augment LLMs. First, they retrieve a set of relevant documents via Google Search. Since these retrieved documents tend to be long (average length 2,056 words), they chunk them into paragraphs of six sentences each. Finally, they embed the question and paragraphs via TF-IDF and apply cosine similarity to rank the most relevant paragraphs for each query.

Overview of internet-augmented LLMs ([source](https://arxiv.org/abs/2203.05115))
The retrieved paragraphs are used to condition the LLM via few-shot prompting. They adopt the conventional \\(k\\)-shot prompting (\\(k=15\\)) from closed-book QA (only providing question-answer pairs) and extend it with an evidence paragraph, such that each context is an evidence, question, and answer triplet.
For the generator, they used Gopher, a 280B parameter model trained on 300B tokens. For each question, they generated four candidate answers based on each of the 50 retrieved paragraphs. Finally, they select the best answer by estimating the answer probability via several methods including direct inference, RAG, noisy channel inference, and Product-of-Experts (PoE). PoE consistently performed the best.
RAG has also been **applied to non-QA tasks such as code generation**. While **[CodeT5+](https://arxiv.org/abs/2305.07922)** can be used as a standalone generator, when combined with RAG, it significantly outperforms similar models in code generation.
To assess the impact of RAG on code generation, they evaluate the model in three settings:
* Retrieval-based: Fetch the top-1 code sample as the prediction.
* Generative-only: Output code based on the decoder only.
* Retrieval-augmented: Append the top-1 code sample to the encoder input before code generation via the decoder.

Overview of RAG for CodeT5+ ([source](https://arxiv.org/abs/2305.07922))
As a qualitative example, they showed that retrieved code provides crucial context (e.g., use `urllib3` for an HTTP request) and guides the generative process towards more correct predictions. In contrast, the generative-only approach returns incorrect output that only captures the concepts of “download” and “compress”.
**What if we don’t have relevance judgments for query-passage pairs?** Without them, we would not be able to train the bi-encoders that embed the queries and documents in the same embedding space where relevance is represented by the inner product. **[Hypothetical document embeddings (HyDE)](https://arxiv.org/abs/2212.10496)** suggests a solution.

Overview of HyDE ([source](https://arxiv.org/abs/2212.10496))
Given a query, HyDE first prompts an LLM, such as InstructGPT, to generate a hypothetical document. Then, an unsupervised encoder, such as Contriever, encodes the document into an embedding vector. Finally, the inner product is computed between the _hypothetical_ document and the corpus, and the most similar _real_ documents are retrieved.
The expectation is that the encoder’s dense bottleneck serves as a lossy compressor and the extraneous, non-factual details are excluded via the embedding. This reframes the relevance modeling problem from a representation learning task to a generation task.
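A rough sketch of HyDE's generate-then-embed flow, reusing the `llm` from this notebook; `encode`, `doc_index` (an inner-product index over corpus embeddings), and `corpus` are assumptions for illustration.

```python
def hyde_search(query: str, k: int = 5):
    # 1. Ask the LLM to hallucinate a plausible answer document.
    hypothetical = llm.complete(f"Write a short passage that answers: {query}").text
    # 2. Embed the hypothetical document (not the query itself).
    vec = encode(hypothetical).astype("float32").reshape(1, -1)
    # 3. Retrieve the most similar *real* documents from the corpus index.
    scores, ids = doc_index.search(vec, k)
    return [corpus[i] for i in ids[0]]
```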
### How to apply RAG?
From experience with [Obsidian-Copilot](/writing/obsidian-copilot/), I’ve found that hybrid retrieval (traditional search index + embedding-based search) works better than either alone. There, I complemented classical retrieval (BM25 via OpenSearch) with semantic search (`e5-small-v2`).
Why not embedding-based search only? While it’s great in many instances, there are situations where it falls short, such as:
* Searching for a person or object’s name (e.g., Eugene, Kaptir 2.0)
* Searching for an acronym or phrase (e.g., RAG, RLHF)
* Searching for an ID (e.g., `gpt-3.5-turbo`, `titan-xlarge-v1.01`)
But keyword search has its limitations too. It only models simple word frequencies and doesn’t capture semantic or correlation information. Thus, it doesn’t deal well with synonyms or hypernyms (i.e., words that represent a generalization). This is where combining it with semantic search is complementary.
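One simple, commonly used way to combine the two result lists is reciprocal rank fusion (RRF), sketched below over ranked lists of document IDs; this is a generic fusion recipe, not the specific Obsidian-Copilot implementation.

```python
def reciprocal_rank_fusion(keyword_hits, semantic_hits, k: int = 60):
    """Merge two ranked lists of doc IDs; a higher fused score means a better match."""
    scores = {}
    for hits in (keyword_hits, semantic_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


# e.g., fused = reciprocal_rank_fusion(bm25_ids, embedding_ids)
```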
In addition, with a conventional search index, we can use metadata to refine results. For example, we can use date filters to prioritize newer documents or narrow our search to a specific time period. And if the search is related to e-commerce, filters on average rating or categories are helpful. Finally, having metadata is handy for downstream ranking, such as prioritizing documents that are cited more, or boosting products by their sales volume.
**With regard to embeddings**, the seemingly popular approach is to use [`text-embedding-ada-002`](https://openai.com/blog/new-and-improved-embedding-model). Its benefits include ease of use via an API and not having to maintain our own embedding infra or self-host embedding models. Nonetheless, personal experience and anecdotes from others suggest there are better alternatives for retrieval.
The OG embedding approaches include Word2vec and [fastText](https://fasttext.cc). FastText is an open-source, lightweight library that enables users to leverage pre-trained embeddings or train new embedding models. It comes with pre-trained embeddings for 157 languages and is extremely fast, even without a GPU. It’s my go-to for early-stage proof of concepts.
Another good baseline is [sentence-transformers](https://github.com/UKPLab/sentence-transformers). It makes it simple to compute embeddings for sentences, paragraphs, and even images. It’s based on workhorse transformers such as BERT and RoBERTa and is available in more than 100 languages.
More recently, instructor models have shown SOTA performance. During training, these models prepend the task description to the text. Then, when embedding new text, we simply have to describe the task to get task-specific embeddings. (Not that different from instruction tuning for embedding models IMHO.)
An example is the [E5](https://arxiv.org/abs/2212.03533) family of models. For open QA and information retrieval, we simply prepend documents in the index with `passage:`, and prepend queries with `query:`. If the task is symmetric (e.g., semantic similarity, paraphrase retrieval) or if we want to use embeddings as features (e.g., classification, clustering), we just use the `query:` prefix.
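A minimal sketch of the E5 prefix convention with sentence-transformers (the model name and example strings are illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-small-v2")

# E5 expects the task prefix to be part of the text itself.
doc_embs = model.encode(
    ["passage: BLEU is a precision-based metric used in machine translation."],
    normalize_embeddings=True,
)
query_emb = model.encode(
    ["query: which metric is used for machine translation?"],
    normalize_embeddings=True,
)

scores = query_emb @ doc_embs.T  # cosine similarity (embeddings are normalized)
```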
The [Instructor](https://arxiv.org/abs/2212.09741) model takes it a step further, allowing users to customize the prepended prompt: “Represent the `domain` `task_type` for the `task_objective`:” For example, “Represent the Wikipedia document for retrieval:”. (The domain and task objective are optional.) This brings the concept of prompt tuning into the field of text embedding.
Finally, as of Aug 1st, the top embedding model on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) is the [GTE](https://huggingface.co/thenlper/gte-large) family of models by Alibaba DAMO Academy. The top performing model’s size is half of the next best model `e5-large-v2` (0.67GB vs 1.34GB). In 2nd position is `gte-base` with a model size of only 0.22GB and embedding dimension of 768. (H/T [Nirant](https://twitter.com/NirantK).)
To retrieve documents with low latency at scale, we use approximate nearest neighbors (ANN). It optimizes for retrieval speed and returns the approximate (instead of exact) top \\(k\\) most similar neighbors, trading off a little accuracy loss for a large speed up.
ANN embedding indices are data structures that let us do ANN searches efficiently. At a high level, they build partitions over the embedding space so we can quickly zoom in on the specific space where the query vector is. Some popular techniques include:
* [Locality Sensitive Hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) (LSH): The core idea is to create hash functions so that similar items are likely to end up in the same hash bucket. By only needing to check the relevant buckets, we can perform ANN queries efficiently.
* [Facebook AI Similarity Search](https://github.com/facebookresearch/faiss) (FAISS): It uses a combination of quantization and indexing for efficient retrieval, supports both CPU and GPU, and can handle billions of vectors due to its efficient use of memory.
* [Hierarchical Navigable Small Worlds](https://github.com/nmslib/hnswlib) (HNSW): Inspired by “six degrees of separation”, it builds a hierarchical graph structure that embodies the small world phenomenon. Here, most nodes can be reached from any other node via a minimum number of hops. This structure allows HNSW to initiate queries from broader, coarser approximations and progressively narrow the search at lower levels.
* [Scalable Nearest Neighbors](https://github.com/google-research/google-research/tree/master/scann) (ScaNN): It has a two-step process. First, coarse quantization reduces the search space. Then, fine-grained search is done within the reduced set. Best recall/latency trade-off I’ve seen.
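As a small example of the HNSW family, an hnswlib index can be built and queried like this (dimensions, parameters, and the random vectors are illustrative):

```python
import hnswlib
import numpy as np

dim, n = 384, 10_000
vectors = np.random.rand(n, dim).astype("float32")  # stand-in for real embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(vectors, np.arange(n))
index.set_ef(64)  # higher ef = better recall, slower queries

labels, distances = index.knn_query(vectors[:1], k=5)
```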
When evaluating an ANN index, some factors to consider include:
* Recall: How does it fare against exact nearest neighbors?
* Latency/throughput: How many queries can it handle per second?
* Memory footprint: How much RAM is required to serve an index?
* Ease of adding new items: Can new items be added without having to reindex all documents (LSH) or does the index need to be rebuilt (ScaNN)?
No single framework is better than all others in every aspect. Thus, start by defining your functional and non-functional requirements before benchmarking. Personally, I’ve found ScaNN to be outstanding in the recall-latency trade-off (see benchmark graph [here](/writing/real-time-recommendations/#how-to-design-and-implement-an-mvp)).
## Fine-tuning: To get better at specific tasks
Fine-tuning is the process of taking a pre-trained model (that has already been trained with a vast amount of data) and further refining it on a specific task. The intent is to harness the knowledge that the model has already acquired during its pre-training and apply it to a specific task, usually involving a smaller, task-specific dataset.
The term “fine-tuning” is used loosely and can refer to several concepts such as:
* **Continued pre-training**: With domain-specific data, apply the same pre-training regime (next token prediction, masked language modeling) on the base model.
* **Instruction fine-tuning**: The pre-trained (base) model is fine-tuned on examples of instruction-output pairs to follow instructions, answer questions, be waifu, etc.
* **Single-task fine-tuning**: The pre-trained model is honed for a narrow and specific task such as toxicity detection or summarization, similar to BERT and T5.
* **Reinforcement learning with human feedback (RLHF)**: This combines instruction fine-tuning with reinforcement learning. It requires collecting human preferences (e.g., pairwise comparisons) which are then used to train a reward model. The reward model is then used to further fine-tune the instructed LLM via RL techniques such as proximal policy optimization (PPO).
We’ll mainly focus on single-task and instruction fine-tuning here.
### Why fine-tuning?
Fine-tuning an open LLM is becoming an increasingly viable alternative to using a 3rd-party, cloud-based LLM for several reasons.
**Performance & control:** Fine-tuning can improve the performance of an off-the-shelf base model, and may even surpass a 3rd-party LLM. It also provides greater control over LLM behavior, resulting in a more robust system or product. Overall, fine-tuning enables us to build products that are differentiated from simply using 3rd-party or open LLMs.
**Modularization:** Single-task fine-tuning lets us use an army of smaller models that each specialize on their own tasks. Via this setup, a system can be modularized into individual models for tasks like content moderation, extraction, summarization, etc. Also, given that each model only has to focus on a narrow set of tasks, we can get around the alignment tax, where fine-tuning a model on one task reduces performance on other tasks.
**Reduced dependencies:** By fine-tuning and hosting our own models, we can reduce legal concerns about proprietary data (e.g., PII, internal documents and code) being exposed to external APIs. It also gets around constraints that come with 3rd-party LLMs such as rate-limiting, high costs, or overly restrictive safety filters. By fine-tuning and hosting our own LLMs, we can ensure data doesn’t leave our network, and can scale throughput as needed.
### More about fine-tuning
Why do we need to fine-tune a _base_ model? At the risk of oversimplifying, base models are primarily optimized to predict the next word based on the corpus they’re trained on. Hence, they aren’t naturally adept at following instructions or answering questions. When posed a question, they tend to respond with more questions. Thus, we perform instruction fine-tuning so they learn to respond appropriately.
However, fine-tuning isn’t without its challenges. First, we **need a significant volume of demonstration data**. For instance, in the [InstructGPT paper](https://arxiv.org/abs/2203.02155), they used 13k instruction-output samples for supervised fine-tuning, 33k output comparisons for reward modeling, and 31k prompts without human labels as input for RLHF.
Furthermore, fine-tuning comes with an alignment tax: the process can lead to **lower performance on certain critical tasks**. (There’s no free lunch after all.) The same InstructGPT paper found that RLHF led to performance regressions (relative to the GPT-3 base model) on public NLP tasks like SQuAD, HellaSwag, and WMT 2015 French to English. (A workaround is to have several smaller, specialized models that excel at narrow tasks.)
Fine-tuning is similar to the concept of transfer learning. As defined in Wikipedia: “Transfer learning is a technique in machine learning in which knowledge learned from a task is re-used to boost performance on a related task.” Several years ago, transfer learning made it easy for me to apply ResNet models trained on ImageNet to [classify fashion products](/writing/image-categorization-is-now-live/) and [build image search](/writing/image-search-is-now-live/).
**[ULMFit](https://arxiv.org/abs/1801.06146)** is one of the earlier papers to apply transfer learning to text. They established the protocol of self-supervised pre-training (on unlabeled data) followed by fine-tuning (on labeled data). They used AWD-LSTM, an LSTM variant with dropout at various gates.

Overview of ULMFit ([source](https://arxiv.org/abs/1801.06146))
During pre-training (next word prediction), the model is trained on wikitext-103 which contains 28.6k Wikipedia articles and 103M words. Then, during target task fine-tuning, the LM is fine-tuned with data from the domain of the specific task. Finally, during classifier fine-tuning, the model is augmented with two additional linear blocks and fine-tuned on the target classification tasks, which include sentiment analysis, question classification, and topic classification.
Since then, the pre-training followed by fine-tuning paradigm has driven much progress in language modeling. **[Bidirectional Encoder Representations from Transformers (BERT; encoder only)](https://arxiv.org/abs/1810.04805)** was pre-trained on masked language modeling and next sentence prediction on English Wikipedia and BooksCorpus. It was then fine-tuned on task-specific inputs and labels for single-sentence classification, sentence pair classification, single-sentence tagging, and question answering.

Overview of BERT ([source](https://arxiv.org/abs/1810.04805))
**[Generative Pre-trained Transformers (GPT; decoder only)](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)** was first pre-trained on BooksCorpus via next token prediction. This was followed by single-task fine-tuning for tasks such as text classification, textual entailment, similarity, and Q&A. Interestingly, they found that including language modeling as an auxiliary objective helped the model generalize and converge faster during training.

Overview of GPT ([source](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf))
**[Text-to-text Transfer Transformer (T5; encoder-decoder)](https://arxiv.org/abs/1910.10683)** was pre-trained on the Colossal Clean Crawled Corpus (C4), a cleaned version of the Common Crawl from April 2019. It employed the same denoising objective as BERT, namely masked language modeling. It was then fine-tuned on tasks such as text classification, abstractive summarization, Q&A, and machine translation.

Overview of T5 ([source](https://arxiv.org/abs/1910.10683))
But unlike ULMFit, BERT, and GPT which used different classifier heads for downstream tasks, T5 represented downstream tasks as text-to-text only. For example, a translation task would have input text starting with `Translate English to German:`, while a summarization task might start with `Summarize:` or `TL;DR:`. The prefix essentially became a hyperparameter (first instance of prompt engineering?). This design choice allowed them to use a single fine-tuned model across a variety of downstream tasks.
**[InstructGPT](https://arxiv.org/abs/2203.02155)** expanded this idea of single-task fine-tuning to instruction fine-tuning. The base model was GPT-3, pre-trained on internet data including Common Crawl, WebText, Books, and Wikipedia. It then applied supervised fine-tuning on demonstrations of desired behavior (instruction and output). Next, it trained a reward model on the dataset of comparisons. Finally, it optimized the instructed model against the reward model via PPO, with this last stage focusing more on alignment than specific task performance.

Overview of fine-tuning steps in InstructGPT ([source](https://arxiv.org/abs/2203.02155))
Next, let’s move from fine-tuned models to fine-tuning techniques.
**[Soft prompt tuning](https://arxiv.org/abs/2104.08691)** prepends a trainable tensor to the model’s input embeddings, essentially creating a soft prompt. Unlike discrete text prompts, soft prompts can be learned via backpropagation, meaning they can be fine-tuned to incorporate signals from any number of labeled examples.
Next, there’s **[prefix tuning](https://arxiv.org/abs/2101.00190)**. Instead of adding a soft prompt to the model input, it prepends trainable parameters to the hidden states of all transformer blocks. During fine-tuning, the LM’s original parameters are kept frozen while the prefix parameters are updated.

Overview of prefix-tuning ([source](https://arxiv.org/abs/2101.00190))
The paper showed that this achieved performance comparable to full fine-tuning despite requiring updates on just 0.1% of parameters. Moreover, in settings with limited data that involve extrapolation to new topics, it outperformed full fine-tuning. One hypothesis is that training fewer parameters helped reduce overfitting on smaller target datasets.
There’s also the **[adapter](https://arxiv.org/abs/1902.00751)** technique. This method adds fully connected network layers twice to each transformer block, after the attention layer and after the feed-forward network layer. On GLUE, it’s able to achieve within 0.4% of the performance of full fine-tuning by just adding 3.6% parameters per task.

Overview of adapters ([source](https://arxiv.org/abs/1902.00751))
**[Low-Rank Adaptation (LoRA)](https://arxiv.org/abs/2106.09685)** is a technique where adapters are designed to be the product of two low-rank matrices. It was inspired by [Aghajanyan et al.](https://arxiv.org/abs/2012.13255) which showed that, when adapting to a specific task, pre-trained language models have a low intrinsic dimension and can still learn efficiently despite a random projection into a smaller subspace. Thus, LoRA hypothesized that weight updates during adaptation also have low intrinsic rank.

Overview of LoRA ([source](https://arxiv.org/abs/2106.09685))
Similar to prefix tuning, they found that LoRA outperformed several baselines including full fine-tuning. Again, the hypothesis is that LoRA, thanks to its reduced rank, provides implicit regularization. In contrast, full fine-tuning, which updates all weights, could be prone to overfitting.
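For a sense of what this looks like in practice, here is a hedged sketch of attaching LoRA adapters with Hugging Face's `peft` library; the base model and `target_modules` are illustrative and depend on the architecture.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b")  # illustrative base model

lora_config = LoraConfig(
    r=8,                                 # rank of the low-rank update matrices
    lora_alpha=16,                       # scaling factor
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # attention projections to adapt (model-specific)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
```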
**[QLoRA](https://arxiv.org/abs/2305.14314)** builds on the idea of LoRA. But instead of using the full 16-bit model during fine-tuning, it applies a 4-bit quantized model. It introduced several innovations such as 4-bit NormalFloat (to quantize models), double quantization (for additional memory savings), and paged optimizers (that prevent OOM errors by transferring data to CPU RAM when the GPU runs out of memory).

Overview of QLoRA ([source](https://arxiv.org/abs/2305.14314))
As a result, QLoRA reduces the average memory requirements for fine-tuning a 65B model from > 780GB of memory to a more manageable 48GB, without degrading runtime or predictive performance compared to a 16-bit fully fine-tuned baseline.
(Fun fact: During a meetup with Tim Dettmers, an author of QLoRA, he quipped that double quantization was “a bit of a silly idea but works perfectly.” Hey, if it works, it works.)
### How to apply fine-tuning?
The first step is to **collect demonstration data/labels**. These could be for straightforward tasks such as document classification, entity extraction, or summarization, or they could be more complex such as Q&A or dialogue. Some ways to collect this data include:
* **Via experts or crowd-sourced human annotators**: While this is expensive and slow, it usually leads to higher-quality data with [good guidelines](/writing/labeling-guidelines/).
* **Via user feedback**: This can be as simple as asking users to select attributes that describe a product, rating LLM responses with thumbs up or down (e.g., ChatGPT), or logging which images users choose to download (e.g., Midjourney).
* **Query larger open models with permissive licenses**: With prompt engineering, we might be able to elicit reasonable demonstration data from a larger model (Falcon 40B Instruct) that can be used to fine-tune a smaller model.
* **Reuse open-source data**: If your task can be framed as a natural language inference (NLI) task, we could fine-tune a model to perform NLI using [MNLI data](https://cims.nyu.edu/~sbowman/multinli/). Then, we can continue fine-tuning the model on internal data to classify inputs as entailment, neutral, or contradiction.
Note: The terms of some LLMs prevent users from using their output to develop other models.
* [OpenAI Terms of Use](https://openai.com/policies/terms-of-use) (Section 2c, iii): You may not use output from the Services to develop models that compete with OpenAI.
* [LLaMA 2 Community License Agreement](https://ai.meta.com/llama/license/) (Section 1b-v): You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).
The next step is to **define evaluation metrics**. We’ve discussed this in a previous section.
Then, **select a pre-trained model.** There are [several open LLMs with permissive licenses](https://github.com/eugeneyan/open-llms) to choose from. Excluding Llama 2 (since its license isn’t fully permissive for commercial use), Falcon-40B is known to be the best-performing model. Nonetheless, I’ve found it unwieldy to fine-tune and serve in production given how heavy it is.
Instead, I’m inclined to use smaller models like the Falcon-7B. And if we can simplify and frame the task more narrowly, BERT (340M params), RoBERTa (355M params), and BART (406M params) are solid picks for classification and natural language inference tasks. Beyond that, Flan-T5 (770M and 3B variants) is a reliable baseline for translation, abstractive summarization, headline generation, etc.
We may also need to **update the model architecture**, such as when the pre-trained model’s architecture doesn’t align with the task. For example, we might need to update the classification heads on BERT or T5 to match our task. Tip: If the task is a simple binary classification task, NLI models can work out of the box. Entailment is mapped to positive, contradiction is mapped to negative, while the neutral label can indicate uncertainty.
**Then, pick a fine-tuning approach.** LoRA and QLoRA are good places to start. But if your fine-tuning is more intensive, such as continued pre-training on new domain knowledge, you may find full fine-tuning necessary.
**Finally, basic hyperparameter tuning.** Generally, most papers focus on learning rate, batch size, and number of epochs (see LoRA, QLoRA). And if we’re using LoRA, we might want to tune the rank parameter (though the QLoRA paper found that different rank and alpha led to similar results). Other hyperparameters include input sequence length, loss type (contrastive loss vs. token match), and data ratios (like the mix of pre-training or demonstration data, or the ratio of positive to negative examples, among others).
## Caching: To reduce latency and cost
Caching is a technique to store data that has been previously retrieved or computed. This way, future requests for the same data can be served faster. In the space of serving LLM generations, the popularized approach is to cache the LLM response keyed on the embedding of the input request. Then, for each new request, if a semantically similar request is received, we can serve the cached response.
For some practitioners, this sounds like “[a disaster waiting to happen.](https://twitter.com/HanchungLee/status/1681146845186363392)” I’m inclined to agree. Thus, I think the key to adopting this pattern is figuring out how to cache safely, instead of solely depending on semantic similarity.
### Why caching?
Caching can significantly reduce latency for responses that have been served before. In addition, by eliminating the need to compute a response for the same input again and again, we can reduce the number of LLM requests and thus save cost. Also, there are certain use cases that do not support latency on the order of seconds. Thus, pre-computing and caching may be the only way to serve those use cases.
### More about caching
A cache is a high-speed storage layer that stores a subset of data that’s accessed more frequently. This lets us serve these requests faster via the cache instead of the data’s primary storage (e.g., search index, relational database). Overall, caching enables efficient reuse of previously fetched or computed data. (More about [caching](https://aws.amazon.com/caching/) and [best practices](https://aws.amazon.com/caching/best-practices/).)
An example of caching for LLMs is [GPTCache](https://github.com/zilliztech/GPTCache).

Overview of GPTCache ([source](https://github.com/zilliztech/GPTCache))
When a new request is received:
* Embedding generator: This embeds the request via various models such as OpenAI’s `text-embedding-ada-002`, FastText, Sentence Transformers, and more.
* Similarity evaluator: This computes the similarity of the request via the vector store and then provides a distance metric. The vector store can either be local (FAISS, Hnswlib) or cloud-based. It can also compute similarity via a model.
* Cache storage: If the request is similar, the cached response is fetched and served.
* LLM: If the request isn’t similar enough, it gets passed to the LLM which then generates the result. Finally, the response is served and cached for future use. (A minimal sketch of this flow follows the list below.)
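As a rough illustration of this flow (not GPTCache’s actual implementation), a bare-bones semantic cache might look like the sketch below; `embed()` stands in for whatever embedding model is used, and the 0.95 similarity threshold is an arbitrary assumption.

# Minimal semantic-cache sketch (illustrative; not GPTCache's actual implementation).
from typing import Callable, List, Optional
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: swap in a real embedding model or embeddings API call."""
    raise NotImplementedError

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold         # similarity cutoff is an assumption; tune per use case
        self.keys: List[np.ndarray] = []   # cached request embeddings
        self.values: List[str] = []        # cached responses

    def lookup(self, request: str) -> Optional[str]:
        if not self.keys:
            return None
        q = embed(request)
        sims = [
            float(np.dot(q, k) / (np.linalg.norm(q) * np.linalg.norm(k)))
            for k in self.keys
        ]
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def add(self, request: str, response: str) -> None:
        self.keys.append(embed(request))
        self.values.append(response)

def answer(request: str, cache: SemanticCache, call_llm: Callable[[str], str]) -> str:
    cached = cache.lookup(request)
    if cached is not None:
        return cached                      # cache hit: skip the LLM call
    response = call_llm(request)           # cache miss: generate, then store for next time
    cache.add(request, response)
    return response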
Redis also shared a [similar example](https://www.youtube.com/live/9VgpXcfJYvw?feature=share&t=1517), mentioning that some teams go as far as precomputing all the queries they anticipate receiving. Then, they set a similarity threshold on which queries are similar enough to warrant a cached response.
### How to apply caching?
**We should start with having a good understanding of user request patterns**. This allows us to design the cache thoughtfully so it can be applied reliably.
First, let’s consider a non-LLM example. Imagine we’re caching product prices for an e-commerce site. During checkout, is it safe to display the (possibly outdated) cached price? Probably not, since the price the customer sees during checkout should be the same as the final amount they’re charged. Caching isn’t appropriate here as we need to ensure consistency for the customer.
Now, bringing it back to LLM responses. Imagine we get a request for a summary of “Mission Impossible 2” that’s semantically similar enough to “Mission Impossible 3”. If we’re looking up the cache based on semantic similarity, we could serve the wrong response.
We also need to **consider if caching is effective for the usage pattern.** One way to quantify this is via the cache hit rate (percentage of requests served directly from the cache). If the usage pattern is uniformly random, the cache would need frequent updates. Thus, the effort to keep the cache up-to-date could negate any benefit a cache has to offer. On the other hand, if the usage follows a power law where a small proportion of unique requests account for the majority of traffic (e.g., search queries, product views), then caching could be an effective strategy.
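For instance, the hit rate itself can be tracked with a pair of counters; a trivial sketch:

# Trivial hit-rate tracker to quantify whether caching is paying off.
class CacheStats:
    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0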
Beyond semantic similarity, we could also explore caching based on:
* **Item IDs:** This applies when we pre-compute [summaries of product reviews](https://www.cnbc.com/2023/06/12/amazon-is-using-generative-ai-to-summarize-product-reviews.html) or generate a summary for an entire movie trilogy.
* **Pairs of Item IDs:** Such as when we generate comparisons between two movies. While this appears to be \\(O(N^2)\\), in practice, a small number of combinations drive the bulk of traffic, such as comparison between popular movies in a series or genre. (A sketch of caching keyed on item IDs and pairs follows this list.)
* **Constrained input:** Such as variables like movie genre, director, or lead actor. For example, if a user is looking for movies by a specific director, we could execute a structured query and run it through an LLM to frame the response more eloquently. Another example is [generating code based on drop-down options](https://cheatlayer.com)—if the code has been verified to work, we can cache it for reliable reuse.
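Here is a minimal sketch of caching keyed on item IDs and on unordered pairs of item IDs; `generate_summary` and `generate_comparison` are hypothetical stand-ins for LLM-backed functions.

# Sketch of deterministic caching keyed on item IDs instead of request embeddings.
from functools import lru_cache

def generate_summary(item_id: str) -> str:
    """Hypothetical LLM call that summarizes reviews for one item."""
    raise NotImplementedError

def generate_comparison(item_a: str, item_b: str) -> str:
    """Hypothetical LLM call that compares two items."""
    raise NotImplementedError

@lru_cache(maxsize=10_000)
def cached_review_summary(item_id: str) -> str:
    return generate_summary(item_id)

@lru_cache(maxsize=10_000)
def cached_comparison(pair: frozenset) -> str:
    item_a, item_b = sorted(pair)          # order-independent key: compare(A, B) == compare(B, A)
    return generate_comparison(item_a, item_b)

# Usage: cached_comparison(frozenset({"mission-impossible-2", "mission-impossible-3"}))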
Also, **caching doesn’t only have to occur on-the-fly.** As Redis shared, we can pre-compute LLM generations offline or asynchronously before serving them. By serving from a cache, we shift the latency from generation (typically seconds) to cache lookup (milliseconds). Pre-computing in batch can also help reduce cost relative to serving in real-time.
While the approaches listed here may not be as flexible as semantically caching on natural language inputs, I think they provide a good balance between efficiency and reliability.
## Guardrails: To ensure output quality
In the context of LLMs, guardrails validate the output of LLMs, ensuring that the output doesn’t just sound good but is also syntactically correct, factual, and free from harmful content. It also includes guarding against adversarial input.
### Why guardrails?
First, they help ensure that model outputs are reliable and consistent enough to use in production. For example, we may require output to be in a specific JSON schema so that it’s machine-readable, or we need code generated to be executable. Guardrails can help with such syntactic validation.
Second, they provide an additional layer of safety and maintain quality control over an LLM’s output. For example, to verify if the content generated is appropriate for serving, we may want to check that the output isn’t harmful, verify it for factual accuracy, or ensure coherence with the context provided.
### More about guardrails
**One approach is to control the model’s responses via prompts.** For example, Anthropic shared about prompts designed to guide the model toward generating responses that are [helpful, harmless, and honest](https://arxiv.org/abs/2204.05862) (HHH). They found that Python fine-tuning with the HHH prompt led to better performance compared to fine-tuning with RLHF.

Example of HHH prompt ([source](https://arxiv.org/abs/2204.05862))
**A more common approach is to validate the output.** An example is the [Guardrails package](https://github.com/ShreyaR/guardrails). It allows users to add structural, type, and quality requirements on LLM outputs via Pydantic-style validation. And if the check fails, it can trigger corrective action such as filtering on the offending output or regenerating another response.
Most of the validation logic is in [`validators.py`](https://github.com/ShreyaR/guardrails/blob/main/guardrails/validators.py). It’s interesting to see how they’re implemented. Broadly speaking, its validators fall into the following categories:
* Single output value validation: This includes ensuring that the output (i) is one of the predefined choices, (ii) has a length within a certain range, (iii) if numeric, falls within an expected range, and (iv) is a complete sentence. (A rough sketch of this validate-and-retry pattern follows the list below.)
* Syntactic checks: This includes ensuring that generated URLs are valid and reachable, and that Python and SQL code is bug-free.
* Semantic checks: This verifies that the output is aligned with the reference document, or that the extractive summary closely matches the source document. These checks can be done via cosine similarity or fuzzy matching techniques.
* Safety checks: This ensures that the generated output is free of inappropriate language or that the quality of translated text is high.
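As a rough sketch of this validate-and-retry loop (using plain Pydantic for illustration rather than the Guardrails package’s own API), the idea looks something like this; `call_llm` is a hypothetical function that returns the model’s raw JSON string.

# Sketch of structural validation on LLM output with retry-on-failure.
# Plain Pydantic is used here for illustration; the Guardrails package wraps a similar
# idea with richer validators and corrective actions.
import json
from pydantic import BaseModel, Field, ValidationError

class MovieSummary(BaseModel):
    title: str
    rating: float = Field(..., ge=0, le=10)   # numeric output must fall in an expected range
    summary: str = Field(..., max_length=500)

def get_validated_summary(prompt: str, call_llm, max_retries: int = 2) -> MovieSummary:
    """`call_llm` is a hypothetical function that returns the model's raw JSON string."""
    for _ in range(max_retries + 1):
        raw = call_llm(prompt)
        try:
            return MovieSummary(**json.loads(raw))
        except (json.JSONDecodeError, ValidationError) as err:
            # Corrective action: feed the error back and ask the model to fix its output.
            prompt = f"{prompt}\n\nYour previous answer failed validation:\n{err}\nReturn corrected JSON only."
    raise ValueError("LLM output failed validation after retries")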
Nvidia’s [NeMo-Guardrails](https://github.com/NVIDIA/NeMo-Guardrails) follows a similar principle but is designed to guide LLM-based conversational systems. Rather than focusing on syntactic guardrails, it emphasizes semantic ones. This includes ensuring that the assistant steers clear of politically charged topics, provides factually correct information, and can detect jailbreaking attempts.
Thus, NeMo’s approach is somewhat different: Instead of using more deterministic checks like verifying if a value exists in a list or inspecting code for syntax errors, NeMo leans heavily on using another LLM to validate outputs (inspired by [SelfCheckGPT](https://arxiv.org/abs/2303.08896)).
In their example for fact-checking and preventing hallucination, they ask the LLM itself to check whether the most recent output is consistent with the given context. To fact-check, the LLM is queried if the response is true based on the documents retrieved from the knowledge base. To prevent hallucinations, since there isn’t a knowledge base available, they get the LLM to generate multiple alternative completions which serve as the context. The underlying assumption is that if the LLM produces multiple completions that disagree with one another, the original completion is likely a hallucination.
The moderation example follows a similar approach: The response is screened for harmful and unethical content via an LLM. Given the nuance of ethics and harmful content, heuristics and conventional machine learning techniques fall short. Thus, an LLM is required for a deeper understanding of the intent and structure of dialogue.
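A simplified sketch of that self-consistency idea (a simplification of what SelfCheckGPT and NeMo actually do; `sample_completion` and `agreement_score` are hypothetical stand-ins for an LLM sampling call and an NLI or similarity scorer):

# Simplified self-consistency check: sample extra completions and measure how well the
# original answer agrees with them; low agreement suggests a possible hallucination.
from typing import Callable

def self_consistency_score(
    prompt: str,
    answer: str,
    sample_completion: Callable[[str], str],
    agreement_score: Callable[[str, str], float],
    n_samples: int = 5,
) -> float:
    samples = [sample_completion(prompt) for _ in range(n_samples)]
    scores = [agreement_score(answer, sample) for sample in samples]
    return sum(scores) / len(scores)       # closer to 0.0 suggests hallucination

def is_probably_hallucinated(prompt, answer, sample_completion, agreement_score,
                             threshold: float = 0.5) -> bool:
    score = self_consistency_score(prompt, answer, sample_completion, agreement_score)
    return score < threshold               # threshold is an arbitrary assumption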
Apart from using guardrails to verify the output of LLMs, we can also **directly steer the output to adhere to a specific grammar.** An example of this is Microsoft’s [Guidance](https://github.com/microsoft/guidance). Unlike Guardrails which [imposes JSON schema via a prompt](https://github.com/ShreyaR/guardrails/blob/main/guardrails/constants.xml#L14), Guidance enforces the schema by injecting tokens that make up the structure.
We can think of Guidance as a domain-specific language for LLM interactions and output. It draws inspiration from [Handlebars](https://handlebarsjs.com), a popular templating language used in web applications that empowers users to perform variable interpolation and logical control.
However, Guidance sets itself apart from regular templating languages by executing linearly. This means it maintains the order of tokens generated. Thus, by inserting tokens that are part of the structure—instead of relying on the LLM to generate them correctly—Guidance can dictate the specific output format. In their examples, they show how to [generate JSON that’s always valid](https://github.com/microsoft/guidance#guaranteeing-valid-syntax-json-example-notebook), [generate complex output formats](https://github.com/microsoft/guidance#rich-output-structure-example-notebook) with multiple keys, ensure that LLMs [play the right roles](https://github.com/microsoft/guidance#role-based-chat-model-example-notebook), and have [agents interact with each other](https://github.com/microsoft/guidance#agents-notebook).
They also introduced a concept called [token healing](https://github.com/microsoft/guidance#token-healing-notebook), a useful feature that helps avoid subtle bugs that occur due to tokenization. In simple terms, it rewinds the generation by one token before the end of the prompt and then restricts the first generated token to have a prefix matching the last token in the prompt. This eliminates the need to fret about token boundaries when crafting prompts.
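To illustrate the general idea of injecting structure rather than asking the model to emit it, here is a plain-Python sketch (this is not Guidance’s actual API; `gen_value` is a hypothetical constrained-generation call):

# Sketch of "structure injection": the program emits the JSON scaffolding itself and only
# asks the model for field values, so the overall output is valid JSON by construction.
import json
from typing import Callable

def generate_movie_record(title: str, gen_value: Callable[..., str]) -> str:
    record = {
        "title": title,                                        # fixed by the program, not generated
        "genre": gen_value(f"One-word genre for {title}:", max_tokens=4),
        "summary": gen_value(f"One-sentence summary of {title}:", max_tokens=60),
    }
    return json.dumps(record)                                  # always syntactically valid JSON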
### How to apply guardrails?
Though the concept of guardrails for LLMs in industry is still nascent, there are a handful of immediately useful and practical strategies we can consider.
**Structural guidance:** Apply guidance whenever possible. It provides direct control over outputs and offers a more precise method to ensure that output conforms to a specific structure or format.
**Syntactic guardrails:** These include checking if categorical output is within a set of acceptable choices, or if numeric output is within an expected range. Also, if we generate SQL, these can verify it’s free from syntax errors and also ensure that all columns in the query match the schema. Ditto for generating code (e.g., Python, JavaScript).
**Content safety guardrails:** These verify that the output has no harmful or inappropriate content. It can be as simple as checking against the [List of Dirty, Naughty, Obscene, and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) or using [profanity detection](https://pypi.org/project/profanity-check/) models. (It’s [common to run moderation classifiers on output](https://twitter.com/goodside/status/1685023251532320768).) More complex and nuanced output can rely on an LLM evaluator.
**Semantic/factuality guardrails:** These confirm that the output is semantically relevant to the input. Say we’re generating a two-sentence summary of a movie based on its synopsis. We can validate if the produced summary is semantically similar to the input, or have (another) LLM ascertain if the summary accurately represents the provided synopsis.
**Input guardrails:** These limit the types of input the model will respond to, helping to mitigate the risk of the model responding to inappropriate or adversarial prompts which would lead to generating harmful content. For example, you’ll get an error if you ask Midjourney to generate NSFW content. This can be as straightforward as comparing against a list of strings or using a moderation classifier.

An example of an input guardrail on Midjourney
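A minimal sketch of an input guardrail; the blocklist contents and the `moderation_flagged` helper are placeholders for a real word list or moderation classifier.

# Minimal input guardrail: a cheap blocklist check first, then an optional moderation model.
BLOCKLIST = {"some", "disallowed", "terms"}      # placeholder; e.g., populate from the LDNOOBW list

def moderation_flagged(text: str) -> bool:
    """Hypothetical hook for a moderation classifier or API."""
    return False

def allow_prompt(prompt: str) -> bool:
    tokens = {token.strip(".,!?").lower() for token in prompt.split()}
    if tokens & BLOCKLIST:
        return False                             # reject on a simple blocklist match
    if moderation_flagged(prompt):
        return False                             # reject on the classifier's decision
    return True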
## Defensive UX: To anticipate & handle errors gracefully
Defensive UX is a design strategy that acknowledges that bad things, such as inaccuracies or hallucinations, can happen during user interactions with machine learning or LLM-based products. Thus, the intent is to anticipate and manage these in advance, primarily by guiding user behavior, averting misuse, and handling errors gracefully.
### Why defensive UX?
Machine learning and LLMs aren’t perfect—they can produce inaccurate output. Also, they respond differently to the same input over time, such as search engines displaying varying results due to personalization, or LLMs generating diverse output on more creative, higher-temperature settings. This can violate the principle of consistency which advocates for a consistent UI and predictable behaviors.
Defensive UX can help mitigate the above by providing:
* **Increased accessibility**: By helping users understand how ML/LLM features work and their limitations, defensive UX makes it more accessible and user-friendly.
* **Increased trust**: When users see that the feature can handle difficult scenarios gracefully and doesn’t produce harmful output, they’re likely to trust it more.
* **Better UX**: By designing the system and UX to handle ambiguous situations and errors, defensive UX paves the way for a smoother, more enjoyable user experience.
### More about defensive UX
To learn more about defensive UX, we can look at Human-AI guidelines from Microsoft, Google, and Apple.
**Microsoft’s [Guidelines for Human-AI Interaction](https://www.microsoft.com/en-us/research/publication/guidelines-for-human-ai-interaction/)** is based on a survey of 168 potential guidelines. These were collected from internal and external industry sources, academic literature, and public articles. After combining guidelines that were similar, filtering guidelines that were too vague or too specific or not AI-specific, and a round of heuristic evaluation, they narrowed it down to 18 guidelines.

Guidelines for Human-AI interaction across the user journey([source](https://www.microsoft.com/en-us/research/project/guidelines-for-human-ai-interaction/))
These guidelines follow a certain style: Each one is a succinct action rule of 3 - 10 words, beginning with a verb. Each rule is accompanied by a one-liner that addresses potential ambiguities. They are organized based on their likely application during user interaction:
* Initially: Make clear what the system can do (G1), make clear how well the system can do what it can do (G2)
* During interaction: Time services based on context (G3), mitigate social biases (G6)
* When wrong: Support efficient dismissal (G8), support efficient correction (G9)
* Over time: Learn from user behavior (G13), provide global controls (G17)
**Google’s [People + AI Guidebook](https://pair.withgoogle.com/guidebook/)** is rooted in data and insights drawn from Google’s product team and academic research. In contrast to Microsoft’s guidelines which are organized around the user, Google organizes its guidelines into concepts that a developer needs to keep in mind.
There are 23 patterns grouped around common questions that come up during the product development process, including:
* How do I get started with human-centered AI: Determine if the AI adds value, invest early in good data practices (e.g., evals)
* How do I onboard users to new AI features: Make it safe to explore, anchor on familiarity, automate in phases
* How do I help users build trust in my product: Set the right expectations, be transparent, automate more when the risk is low.
**Apple’s [Human Interface Guidelines for Machine Learning](https://developer.apple.com/design/human-interface-guidelines/machine-learning)** differs from the bottom-up approach of academic literature and user studies. Instead, its primary source is practitioner knowledge and experience. Thus, it doesn’t include many references or data points, but instead focuses on Apple’s longstanding design principles. This results in a unique perspective that distinguishes it from the other two guidelines.
The document focuses on how Apple’s design principles can be applied to ML-infused products, emphasizing aspects of UI rather than model functionality. It starts by asking developers to consider the role of ML in their app and work backwards from the user experience. This includes questions such as whether ML is:
* Critical or complementary: For example, Face ID cannot work without ML but the keyboard can still work without QuickType.
* Proactive or reactive: Siri Suggestions are proactive while autocorrect is reactive.
* Dynamic or static: Recommendations are dynamic while object detection in Photos only improves with each iOS release.
It then delves into several patterns, split into inputs and outputs of a system. Inputs focus on explicit feedback, implicit feedback, calibration, and corrections. This section guides the design for how AI products request and process user data and interactions. Outputs focus on mistakes, multiple options, confidence, attribution, and limitations. The intent is to ensure the model’s output is presented in a comprehensible and useful manner.
The differences between the three guidelines are insightful. Google has more emphasis on considerations for training data and model development, likely due to its engineering-driven culture. Microsoft has more focus on mental models, likely an artifact of the HCI academic study. Lastly, Apple’s approach centers around providing a seamless UX, a focus likely influenced by its cultural values and principles.
### How to apply defensive UX?
Here are some patterns based on the guidelines above. (Disclaimer: I’m not a designer.)
**Set the right expectations.** This principle is consistent across all three guidelines:
* Microsoft: Make clear how well the system can do what it can do (help the user understand how often the AI system may make mistakes)
* Google: Set the right expectations (be transparent with your users about what your AI-powered product can and cannot do)
* Apple: Help people establish realistic expectations (describe the limitation in marketing material or within the feature’s context)
This can be as simple as adding a brief disclaimer above AI-generated results, like those of Bard, or highlighting our app’s limitations on its landing page, like how ChatGPT does it.

Example of a disclaimer on Google Bard results (Note: `nrows` is not a valid argument.)
By being transparent about our product’s capabilities and limitations, we help users calibrate their expectations about its functionality and output. While this may cause users to trust it less in the short run, it helps foster trust in the long run—users are less likely to overestimate our product and subsequently face disappointment.
**Enable efficient dismissal.** This is explicitly mentioned as Microsoft’s Guideline 8: Support efficient dismissal (make it easy to dismiss or ignore undesired AI system services).
For example, if a user is navigating our site and a chatbot pops up asking if they need help, it should be easy for the user to dismiss the chatbot. This ensures the chatbot doesn’t get in the way, especially on devices with smaller screens. Similarly, GitHub Copilot allows users to conveniently ignore its code suggestions by simply continuing to type. While this may reduce usage of the AI feature in the short term, it prevents it from becoming a nuisance and potentially reducing customer satisfaction in the long term.
**Provide attribution.** This is listed in all three guidelines:
* Microsoft: Make clear why the system did what it did (enable the user to access an explanation of why the AI system behaved as it did)
* Google: Add context from human sources (help users appraise your recommendations with input from 3rd-party sources)
* Apple: Consider using attributions to help people distinguish among results
Citations are becoming an increasingly common design element. Take BingChat for example. When we make a query, it includes citations, usually from reputable sources, in its responses. This not only shows where the information came from, but also allows users to assess the quality of the sources. Similarly, imagine we’re using an LLM to explain why a user might like a product. Alongside the LLM-generated explanation, we could include a quote from an actual review or mention the product rating.
Context from experts and the community also enhances user trust. For example, if a user is seeking recommendations for a hiking trail, mentioning that a suggested trail comes highly recommended by the relevant community can go a long way. It not only adds value to the recommendation but also helps users calibrate trust through the human connection.

Example of attribution via social proof([source](https://pair.withgoogle.com/guidebook/patterns))
Finally, Apple’s guidelines include popular attributions such as “Because you’ve read non-fiction”, “New books by authors you’ve read”. These descriptors not only personalize the experience but also provide context, enhancing user understanding and trust.
**Anchor on familiarity.** When introducing users to a new AI product or feature, it helps to guide them with familiar UX patterns and features. This makes it easier for users to focus on the main task and start to earn customer trust in our new product. Resist the temptation to showcase new and “magical” features via exotic UI elements.
Along a similar vein, chat-based features are becoming more common due to ChatGPT’s growing popularity. For example, chat with your docs, chat to query your data, chat to buy groceries. However, I [question whether chat is the right UX](/writing/llm-ux/) for most user experiences—it just takes too much effort relative to the familiar UX of clicking on text and images.
Furthermore, increasing user effort leads to higher expectations that are harder to meet. Netflix shared that users have [higher expectations for recommendations](https://slideslive.com/38934788/a-human-perspective-on-algorithmic-similarity?ref=folder-59726) that result from explicit actions such as search. In general, the more effort a user puts in (e.g., chat, search), the higher the expectations they have. Contrast this with lower-effort interactions such as scrolling over recommendations slates or clicking on a product.
Thus, while chat offers more flexibility, it also demands more user effort. Moreover, using a chat box is less intuitive as it lacks signifiers on how users can adjust the output. Overall, I think that sticking with a familiar and constrained UI makes it easier for users to navigate our product; chat should only be considered as a secondary or tertiary option.
## Collect user feedback: To build our data flywheel
Gathering user feedback allows us to learn their preferences. Specific to LLM products, user feedback contributes to building evals, fine-tuning, and guardrails. If we think about it, data—such as corpus for pre-training, expert-crafted demonstrations, human preferences for reward modeling—is one of the few moats for LLM products. Thus, we want to be deliberately thinking about collecting user feedback when designing our UX.
Feedback can be explicit or implicit. Explicit feedback is information users provide in response to a request by our product; implicit feedback is information we learn from user interactions without needing users to deliberately provide feedback.
### Why collect user feedback
User feedback **helps our models improve**. By learning what users like, dislike, or complain about, we can improve our models to better meet their needs. It also allows us to **adapt to individual preferences**. Recommendation systems are a prime example. As users interact with items, we learn what they like and dislike and better cater to their tastes over time.
In addition, the feedback loop helps us **evaluate our system’s overall performance**. While evals can help us measure model/system performance, user feedback offers a concrete measure of user satisfaction and product effectiveness.
### How to collect user feedback
**Make it easy for users to provide feedback.** This is echoed across all three guidelines:
* Microsoft: Encourage granular feedback (enable the user to provide feedback indicating their preferences during regular interaction with the AI system)
* Google: Let users give feedback (give users the opportunity for real-time teaching, feedback, and error correction)
* Apple: Provide actionable information your app can use to improve the content and experience it presents to people
ChatGPT is one such example. Users can indicate thumbs up/down on responses, or choose to regenerate a response if it’s really bad or unhelpful. This is useful feedback on human preferences which can then be used to fine-tune LLMs.
Midjourney is another good example. After images are generated, users can generate a new set of images (negative feedback), tweak an image by asking for a variation (positive feedback), or upscale and download the image (strong positive feedback). This enables Midjourney to gather rich comparison data on the outputs generated.

Example of collecting user feedback as part of the UX
**Consider implicit feedback too.** Implicit feedback is information that arises as users interact with our product. Unlike the specific responses we get from explicit feedback, implicit feedback can provide a wide range of data on user behavior and preferences.
Copilot-like assistants are a prime example. Users indicate whether a suggestion was helpful by either wholly accepting it (strong positive feedback), accepting and making minor tweaks (positive feedback), or ignoring it (neutral/negative feedback). Alternatively, they may update the comment that led to the generated code, suggesting that the initial code generation didn’t meet their needs.
Chatbots, such as ChatGPT and BingChat, are another example. How has daily usage changed over time? If the product is sticky, it suggests that users like it. Also, how long is the average conversation? This can be tricky to interpret: Is a longer conversation better because the conversation was engaging and fruitful? Or is it worse because it took the user longer to get what they needed?
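One way to make both kinds of feedback actionable is to log them as structured events; the schema below is an illustrative assumption, not a standard.

# Illustrative schema for logging explicit and implicit feedback events,
# so they can later feed evals, fine-tuning data, and guardrails.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class FeedbackEvent:
    request_id: str
    kind: str                          # "explicit" (thumbs up/down) or "implicit" (accept/edit/ignore)
    signal: str                        # e.g., "thumbs_up", "regenerate", "accepted", "accepted_with_edits"
    comment: Optional[str] = None      # free-text feedback, if any
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Example: a Copilot-style acceptance logged as implicit positive feedback.
event = FeedbackEvent(request_id="req-123", kind="implicit", signal="accepted")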
## Other patterns common in machine learning
Apart from the seven patterns above, there are other patterns in machine learning that are also relevant to LLM systems and products. They include:
* [Data flywheel](/writing/more-patterns/#data-flywheel-to-continuously-improve--build-a-moat): Continuous data collection improves the model and leads to a better user experience. This, in turn, promotes more usage which provides more data to further evaluate and fine-tune models, creating a virtuous cycle.
* [Cascade](/writing/more-patterns/#cascade-to-split-a-problem-into-smaller-problems): Rather than assigning a single, complex task to the LLM, we can simplify and break it down so it only has to handle tasks it excels at, such as reasoning or communicating eloquently. RAG is an example of this. Instead of relying on the LLM to retrieve and rank items based on its internal knowledge, we can augment LLMs with external knowledge and focus on applying the LLM’s reasoning abilities.
* [Monitoring](/writing/practical-guide-to-maintaining-machine-learning/#monitor-models-for-misbehaviour-when-retraining): This helps demonstrate the value added by the AI system, or the lack of it. Someone shared an anecdote of running an LLM-based customer support solution in prod for two weeks before discontinuing it—an A/B test showed that losses were 12x more when using an LLM as a substitute for their support team!
(Read more about design patterns for [machine learning code](/writing/design-patterns/) and [systems](/writing/more-patterns/).)
Also, here’s what others said:
> Separation of concerns/task decomposition - having distinct prompts for distinct subtasks and chaining them together helps w attention and reliability (hurts latency). We were having trouble specifying a rigid output structure AND variable response content so we split up the tasks — [Erick Enriquez](https://twitter.com/generick_ez/status/1681153738822516736)
> A few others that will be needed: role based access control: who can access what; security: if I’m using a DB with an LLM, how do I ensure that I have the right security guards — [Krishna](https://twitter.com/ntkris/status/16812092400299991050)
> Consistent output format: setting outputs to a standardized format such as JSON; Tool augmentation: offload tasks to more specialised, proven, reliable models — [Paul Tune](https://twitter.com/ptuls/status/1681284873741561857)
> Security: mitigate cache poisoning, input validation, mitigate prompt injection, training data provenance, output with non-vulnerable code, mitigate malicious input aimed at influencing requests used by tools (AI Agent), mitigate denial of service (stress test llm), to name a few :) — [Anderson Darario](https://www.linkedin.com/feed/update/urn:li:activity:7087089908229558272?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7087089908229558272%2C7087224131292684288%29)
> Another ux/ui related: incentivize users to provide feedback on generated answers (implicit or explicit). Implicit could be sth like copilot’s ghost text style, if accepted with TAB, meaning positive feedback etc. — [Wen Yang](https://www.linkedin.com/feed/update/urn:li:activity:7087089908229558272?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7087089908229558272%2C7087149792660750336%29)
> Great list. I would add consistency checks like self-consistency sampling, chaining and decomposition of tasks, and the ensembling of multiple model outputs. Applying each of these almost daily. [Dan White](https://www.threads.net/@dwhitena/post/Cu3BBaJtoyj/?igshid=OGQ5ZDc2ODk2ZA==)
> Guardrails is super relevant for building analytics tools where llm is a translator from natural to programming language — [m_voitko](https://www.threads.net/@m_voitko/post/Cu1b4liNwCS/?igshid=OGQ5ZDc2ODk2ZA==)
## Conclusion
This is the longest post I’ve written by far. If you’re still with me, thank you! I hope you found reading about these patterns helpful, and that the 2x2 below makes sense.

LLM patterns across the axis of data to user, and defensive to offensive.
We’re still so early on the journey towards building LLM-based systems and products. Are there any other key patterns or resources? What have you found useful or not useful? I’d love to hear your experience. **Please [reach out!](https://twitter.com/eugeneyan)**
## References
Hendrycks, Dan, et al. [“Measuring massive multitask languageunderstanding.”](https://arxiv.org/abs/2009.03300) arXiv preprintarXiv:2009.03300 (2020).
Gao, Leo, et al. [“A Framework for Few-Shot Language ModelEvaluation.”](https://github.com/EleutherAI/lm-evaluation-harness) v0.0.1,Zenodo, (2021), doi:10.5281/zenodo.5371628.
Liang, Percy, et al. [“Holistic evaluation of languagemodels.”](https://arxiv.org/abs/2211.09110) arXiv preprint arXiv:2211.09110(2022).
Dubois, Yann, et al. [“AlpacaFarm: A Simulation Framework for Methods ThatLearn from Human Feedback.”](https://github.com/tatsu-lab/alpaca_eval) (2023)
Papineni, Kishore, et al. [“Bleu: a method for automatic evaluation of machinetranslation.”](https://dl.acm.org/doi/10.3115/1073083.1073135) Proceedings ofthe 40th annual meeting of the Association for Computational Linguistics.2002.
Lin, Chin-Yew. [“Rouge: A package for automatic evaluation ofsummaries.”](https://aclanthology.org/W04-1013/) Text summarization branchesout. 2004.
Zhang, Tianyi, et al. [“Bertscore: Evaluating text generation withbert.”](https://arxiv.org/abs/1904.09675) arXiv preprint arXiv:1904.09675(2019).
Zhao, Wei, et al. [“MoverScore: Text generation evaluating with contextualizedembeddings and earth mover distance.”](https://arxiv.org/abs/1909.02622) arXivpreprint arXiv:1909.02622 (2019).
Sai, Ananya B., Akash Kumar Mohankumar, and Mitesh M. Khapra. [“A survey ofevaluation metrics used for NLG systems.”](https://arxiv.org/abs/2008.12009)ACM Computing Surveys (CSUR) 55.2 (2022): 1-39.
Grusky, Max. [“Rogue Scores.”](https://aclanthology.org/2023.acl-long.107/)Proceedings of the 61st Annual Meeting of the Association for ComputationalLinguistics (Volume 1: Long Papers). 2023.
Liu, Yang, et al. [“Gpteval: Nlg evaluation using gpt-4 with better humanalignment.”](https://arxiv.org/abs/2303.16634) arXiv preprint arXiv:2303.16634(2023).
Fourrier, Clémentine, et al. [“What’s going on with the Open LLMLeaderboard?”](https://huggingface.co/blog/evaluating-mmlu-leaderboard#whats-going-on-with-the-open-llm-leaderboard) (2023).
Zheng, Lianmin, et al. [“Judging LLM-as-a-judge with MT-Bench and ChatbotArena.”](https://arxiv.org/abs/2306.05685) arXiv preprint arXiv:2306.05685(2023).
Dettmers, Tim, et al. [“Qlora: Efficient finetuning of quantizedllms.”](https://arxiv.org/abs/2305.14314) arXiv preprint arXiv:2305.14314(2023).
Swyx et al. [MPT-7B and The Beginning ofContext=Infinity](https://www.latent.space/p/mosaic-mpt-7b#details) (2023).
Fradin, Michelle, Reeder, Lauren [“The New Language ModelStack”](https://www.sequoiacap.com/article/llm-stack-perspective/) (2023).
Radford, Alec, et al. [“Learning transferable visual models from naturallanguage supervision.”](https://arxiv.org/abs/2103.00020) Internationalconference on machine learning. PMLR, 2021.
Yan, Ziyou. [“Search: Query Matching via Lexical, Graph, and EmbeddingMethods.”](https://eugeneyan.com/writing/search-query-matching/)eugeneyan.com, (2021).
Petroni, Fabio, et al. [“How context affects language models’ factualpredictions.”](https://arxiv.org/abs/2005.04611) arXiv preprintarXiv:2005.04611 (2020).
Karpukhin, Vladimir, et al. [“Dense passage retrieval for open-domain questionanswering.”](https://arxiv.org/abs/2004.04906) arXiv preprint arXiv:2004.04906(2020).
Lewis, Patrick, et al. [“Retrieval-augmented generation for knowledge-intensive nlp tasks.”](https://arxiv.org/abs/2005.11401) Advances in NeuralInformation Processing Systems 33 (2020): 9459-9474.
Izacard, Gautier, and Edouard Grave. [“Leveraging passage retrieval withgenerative models for open domain questionanswering.”](https://arxiv.org/abs/2007.01282) arXiv preprint arXiv:2007.01282(2020).
Borgeaud, Sebastian, et al. [“Improving language models by retrieving fromtrillions of tokens.”](https://arxiv.org/abs/2112.04426) Internationalconference on machine learning. PMLR, (2022).
Lazaridou, Angeliki, et al. [“Internet-augmented language models through few-shot prompting for open-domain questionanswering.”](https://arxiv.org/abs/2203.05115) arXiv preprint arXiv:2203.05115(2022).
Wang, Yue, et al. [“Codet5+: Open code large language models for codeunderstanding and generation.”](https://arxiv.org/abs/2305.07922) arXivpreprint arXiv:2305.07922 (2023).
Gao, Luyu, et al. [“Precise zero-shot dense retrieval without relevancelabels.”](https://arxiv.org/abs/2212.10496) arXiv preprint arXiv:2212.10496(2022).
Yan, Ziyou. [“Obsidian-Copilot: An Assistant for Writing &Reflecting.”](https://eugeneyan.com/writing/obsidian-copilot/) eugeneyan.com,(2023).
Bojanowski, Piotr, et al. [“Enriching word vectors with subwordinformation.”](https://arxiv.org/abs/1607.04606) Transactions of theassociation for computational linguistics 5 (2017): 135-146.
Reimers, Nils, and Iryna Gurevych. [“Making Monolingual Sentence EmbeddingsMultilingual Using Knowledge Distillation.”](https://arxiv.org/abs/2004.09813)Proceedings of the 2020 Conference on Empirical Methods in Natural LanguageProcessing, Association for Computational Linguistics, (2020).
Wang, Liang, et al. [“Text embeddings by weakly-supervised contrastive pre-training.”](https://arxiv.org/abs/2212.03533) arXiv preprint arXiv:2212.03533(2022).
Su, Hongjin, et al. [“One embedder, any task: Instruction-finetuned textembeddings.”](https://arxiv.org/abs/2212.09741) arXiv preprintarXiv:2212.09741 (2022).
Johnson, Jeff, et al. [“Billion-Scale Similarity Search withGPUs.”](https://arxiv.org/abs/1702.08734) IEEE Transactions on Big Data, vol.7, no. 3, IEEE, 2019, pp. 535–47.
Malkov, Yu A., and Dmitry A. Yashunin. [“Efficient and Robust ApproximateNearest Neighbor Search Using Hierarchical Navigable Small WorldGraphs.”](https://arxiv.org/abs/1603.09320) IEEE Transactions on PatternAnalysis and Machine Intelligence, vol. 42, no. 4, IEEE, 2018, pp. 824–36.
Guo, Ruiqi, et al. [“Accelerating Large-Scale Inference with AnisotropicVector Quantization.”](https://arxiv.org/abs/1908.10396.) InternationalConference on Machine Learning, (2020)
Ouyang, Long, et al. [“Training language models to follow instructions withhuman feedback.”](https://arxiv.org/abs/2203.02155) Advances in NeuralInformation Processing Systems 35 (2022): 27730-27744.
Howard, Jeremy, and Sebastian Ruder. [“Universal language model fine-tuningfor text classification.”](https://arxiv.org/abs/1801.06146) arXiv preprintarXiv:1801.06146 (2018).
Devlin, Jacob, et al. [“Bert: Pre-training of deep bidirectional transformersfor language understanding.”](https://arxiv.org/abs/1810.04805) arXiv preprintarXiv:1810.04805 (2018).
Radford, Alec, et al. [“Improving language understanding with unsupervisedlearning.”](https://openai.com/research/language-unsupervised) (2018).
Raffel, Colin, et al. [“Exploring the limits of transfer learning with aunified text-to-text transformer.”](https://arxiv.org/abs/1910.10683) TheJournal of Machine Learning Research 21.1 (2020): 5485-5551.
Lester, Brian, Rami Al-Rfou, and Noah Constant. [“The power of scale forparameter-efficient prompt tuning.”](https://arxiv.org/abs/2104.08691) arXivpreprint arXiv:2104.08691 (2021).
Li, Xiang Lisa, and Percy Liang. [“Prefix-tuning: Optimizing continuousprompts for generation.”](https://arxiv.org/abs/2101.00190) arXiv preprintarXiv:2101.00190 (2021).
Houlsby, Neil, et al. [“Parameter-efficient transfer learning forNLP.”](https://arxiv.org/abs/1902.00751) International Conference on MachineLearning. PMLR, 2019.
Hu, Edward J., et al. [“Lora: Low-rank adaptation of large languagemodels.”](https://arxiv.org/abs/2106.09685) arXiv preprint arXiv:2106.09685(2021).
Dettmers, Tim, et al. [“Qlora: Efficient finetuning of quantizedllms.”](https://arxiv.org/abs/2305.14314) arXiv preprint arXiv:2305.14314(2023).
Williams, Adina, et al. [“A Broad-Coverage Challenge Corpus for SentenceUnderstanding through Inference.”](https://cims.nyu.edu/~sbowman/multinli/)Proceedings of the 2018 Conference of the North American Chapter of theAssociation for Computational Linguistics: Human Language Technologies, Volume1 (Long Papers), Association for Computational Linguistics, (2018).
[GPTCache](https://github.com/zilliztech/GPTCache) (2023).
Bai, Yuntao, et al. [“Training a helpful and harmless assistant withreinforcement learning from humanfeedback.”](https://arxiv.org/abs/2204.05862) arXiv preprint arXiv:2204.05862(2022).
[Guardrails](https://github.com/ShreyaR/guardrails) (2023)
[NeMo-Guardrails](https://github.com/NVIDIA/NeMo-Guardrails) (2023)
Manakul, Potsawee, Adian Liusie, and Mark JF Gales. [“Selfcheckgpt: Zero-resource black-box hallucination detection for generative large languagemodels.”](https://arxiv.org/abs/2303.08896) arXiv preprint arXiv:2303.08896(2023).
[Guidance](https://github.com/microsoft/guidance) (2023).
Amershi, Saleema, et al. [“Guidelines for human-AIinteraction.”](https://www.microsoft.com/en-us/research/publication/guidelines-for-human-ai-interaction/) Proceedings ofthe 2019 chi conference on human factors in computing systems. 2019.
[People + AI Guidebook](https://pair.withgoogle.com/guidebook/) (2023).
[Human Interface Guidelines for MachineLearning](https://developer.apple.com/design/human-interface-guidelines/machine-learning) (2023).
Schendel, Zachary A., Faraz Farzin, and Siddhi Sundar. [“A Human Perspectiveon Algorithmic Similarity.”](https://slideslive.com/38934788/a-human-perspective-on-algorithmic-similarity?ref=folder-59726) Proceedings of the14th ACM Conference on Recommender Systems. 2020.
If you found this useful, please cite this write-up as:
> Yan, Ziyou. (Jul 2023). Patterns for Building LLM-based Systems & Products. eugeneyan.com. https://eugeneyan.com/writing/llm-patterns/.
or
@article{yan2023llm-patterns,
    title = {Patterns for Building LLM-based Systems & Products},
    author = {Yan, Ziyou},
    journal = {eugeneyan.com},
    year = {2023},
    month = {Jul},
    url = {https://eugeneyan.com/writing/llm-patterns/}
}
Eugene Yan designs, builds, and operates machine learning systems that serve customers at scale. He’s currently a Senior Applied Scientist at Amazon. Previously, he led machine learning at Lazada (acquired by Alibaba) and a Healthtech Series A. He [writes](/writing/) & [speaks](/speaking/) about machine learning, recommenders, LLMs, and engineering at [eugeneyan.com](https://eugeneyan.com/) and [ApplyingML.com](https://applyingml.com/).
© Eugene Yan 2015 - 2024 • [Feedback](/site-feedback/) • [RSS](/rss/)
orig_nodes = node_parser.get_nodes_from_documents(docs)
print(orig_nodes[20:28][3].get_content(metadata_mode="all"))
because evals were often conducted with untested, incorrect ROUGE implementations.

Dimensions of model evaluations with ROUGE ([source](https://aclanthology.org/2023.acl-long.107/))
And even with recent benchmarks such as MMLU, **the same model can get significantly different scores based on the eval implementation**. [Huggingface compared the original MMLU implementation](https://huggingface.co/blog/evaluating-mmlu-leaderboard) with the HELM and EleutherAI implementations and found that the same example could have different prompts across various providers.

Different prompts for the same question across MMLU implementations ([source](https://huggingface.co/blog/evaluating-mmlu-leaderboard))
Furthermore, the evaluation approach differed across all three benchmarks:
* Original MMLU: Compares predicted probabilities on the answers only (A, B, C, D)
* HELM: Uses the next token probabilities from the model and picks the token with the
Question Extractor on Nodes
Section titled “Question Extractor on Nodes”
nodes_1 = node_parser.get_nodes_from_documents(docs)[20:28]
nodes_1 = question_extractor(nodes_1)
100%|██████████| 8/8 [00:03<00:00, 2.04it/s]
print(nodes_1[3].get_content(metadata_mode="all"))
[Excerpt from document]
questions_this_excerpt_can_answer: 1. How do different implementations of the MMLU benchmark affect the scores of the same model?
2. What are the differences in evaluation approaches between the original MMLU benchmark, HELM, and EleutherAI implementations?
3. How do various providers differ in the prompts they use for evaluating models in the MMLU benchmark?
Excerpt:
-----
because evals were often conducted with untested, incorrect ROUGE implementations.

Dimensions of model evaluations with ROUGE ([source](https://aclanthology.org/2023.acl-long.107/))
And even with recent benchmarks such as MMLU, **the same model can get significantly different scores based on the eval implementation**. [Huggingface compared the original MMLU implementation](https://huggingface.co/blog/evaluating-mmlu-leaderboard) with the HELM and EleutherAI implementations and found that the same example could have different prompts across various providers.

Different prompts for the same question across MMLU implementations ([source](https://huggingface.co/blog/evaluating-mmlu-leaderboard))
Furthermore, the evaluation approach differed across all three benchmarks:
* Original MMLU: Compares predicted probabilities on the answers only (A, B, C, D)
* HELM: Uses the next token probabilities from the model and picks the token with the
-----
Build Indices
Section titled “Build Indices”
from llama_index.core import VectorStoreIndex
from llama_index.core.response.notebook_utils import (
    display_source_node,
    display_response,
)
index0 = VectorStoreIndex(orig_nodes)
index1 = VectorStoreIndex(orig_nodes[:20] + nodes_1 + orig_nodes[28:])
Query Engines
Section titled “Query Engines”
query_engine0 = index0.as_query_engine(similarity_top_k=1)
query_engine1 = index1.as_query_engine(similarity_top_k=1)
Querying
Section titled “Querying”
query_str = (
    "Can you describe metrics for evaluating text generation quality, compare"
    " them, and tell me about their downsides"
)
response0 = query_engine0.query(query_str)
response1 = query_engine1.query(query_str)
display_response(
    response0, source_length=1000, show_source=True, show_source_metadata=True
)
Final Response:
Metrics for evaluating text generation quality can be categorized as context-dependent or context-free. Context-dependent metrics consider the context of the task and may need adjustments for different tasks. On the other hand, context-free metrics do not consider task-specific context and are easier to apply across various tasks.
Some commonly used metrics for evaluating text generation quality include BLEU, ROUGE, BERTScore, and MoverScore.
- BLEU (Bilingual Evaluation Understudy) is a precision-based metric that compares n-grams in the generated output with those in the reference.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation) evaluates the overlap between the generated output and reference summaries.
- BERTScore leverages contextual embeddings to measure the similarity between the generated output and reference.
- MoverScore considers the semantic similarity between the generated output and reference using Earth Mover’s Distance.
Each of these metrics has its own strengths and weaknesses. For example, BLEU may not capture the overall fluency and coherence of the generated text, while ROUGE may not consider the semantic meaning adequately. BERTScore and MoverScore, on the other hand, may require pre-trained models and can be computationally expensive. It’s important to consider the specific requirements of the task when selecting an appropriate evaluation metric.
Source Node 1/1
Node ID: 4edc4466-e9ae-47ae-b0ee-8a8ac27a0378
Similarity: 0.8381672789063448
Text: GPT-4) prefers the output of one model over a reference model. Metrics include win rate, bias, latency, price, variance, etc. Validated to have high agreement with 20k human annotations.
We can group metrics into two categories: context-dependent or context-free.
- Context-dependent : These take context into account. They’re often proposed for a specific task; repurposing them for other tasks will require some adjustment.
- Context-free : These aren’t tied to the context when evaluating generated output; they only compare the output with the provided gold references. As they’re task agnostic, they’re easier to apply to a wide variety of tasks.
To get a better sense of these metrics (and their potential shortfalls), we’ll explore a few of the commonly used metrics such as BLEU, ROUGE, BERTScore, and MoverScore.
BLEU (Bilingual Evaluation
Understudy) is a precision-based metric: It counts the number of n-grams in
th…
Metadata: {}
display_response(
    response1, source_length=1000, show_source=True, show_source_metadata=True
)
Final Response:
Metrics for evaluating text generation quality include BLEU and ROUGE. These metrics are commonly used but have limitations. BLEU and ROUGE have shown poor correlation with human judgments in terms of fluency and adequacy. They also exhibit low correlation with tasks that require creativity and diversity in text generation. Additionally, exact match metrics like BLEU and ROUGE are not suitable for tasks such as abstractive summarization or dialogue in text generation due to their reliance on n-gram overlap, which may not capture the nuances of these tasks effectively.
Source Node 1/1
Node ID: 52856a1d-be29-494a-84be-e8db8a736675
Similarity: 0.8459422950143721
Text: finds the minimum effort to transform
one text into another. The idea is to measure the distance that words would
have to move to convert one sequence to another.
However, there are several pitfalls to using these conventional benchmarks and metrics.
First, there’s poor correlation between these metrics and human judgments. BLEU, ROUGE, and others have had negative correlation with how humans evaluate fluency. They also showed moderate to less correlation with human adequacy scores. In particular, BLEU and ROUGE have low correlation with tasks that require creativity and diversity.
Second, these metrics often have poor adaptability to a wider variety of
tasks. Adopting a metric proposed for one task to another is not always
prudent. For example, exact match metrics such as BLEU and ROUGE are a poor
fit for tasks like abstractive summarization or dialogue. Since they’re based
on n-gram overlap between …
Metadata: {‘questions_this_excerpt_can_answer’: ‘1. How do conventional benchmarks and metrics for measuring text transformation performance compare to human judgments in terms of fluency and adequacy?\n2. What is the correlation between metrics like BLEU and ROUGE and tasks that require creativity and diversity in text generation?\n3. Why are exact match metrics like BLEU and ROUGE not suitable for tasks like abstractive summarization or dialogue in text generation?’}
Extract Metadata Using PydanticProgramExtractor
Section titled “Extract Metadata Using PydanticProgramExtractor”
PydanticProgramExtractor enables extracting an entire Pydantic object using an LLM.
This approach allows for extracting multiple entities in a single LLM call, offering an advantage over using a single metadata extractor.
from pydantic import BaseModel, Field
from typing import List
Setup the Pydantic Model
Section titled “Setup the Pydantic Model”
Here we define a basic structured schema that we want to extract. It contains:
Entities: unique entities in a text chunk
Summary: a concise summary of the text chunk
class NodeMetadata(BaseModel):
    """Node metadata."""

    entities: List[str] = Field(
        ..., description="Unique entities in this text chunk."
    )
    summary: str = Field(
        ..., description="A concise summary of this text chunk."
    )
Setup the Extractor
Section titled “Setup the Extractor”
from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.core.extractors import PydanticProgramExtractor
EXTRACT_TEMPLATE_STR = """\
Here is the content of the section:
----------------
{context_str}
----------------
Given the contextual information, extract out a {class_name} object.\
"""
openai_program = OpenAIPydanticProgram.from_defaults(
    output_cls=NodeMetadata,
    prompt_template_str="{input}",
    extract_template_str=EXTRACT_TEMPLATE_STR,
)
metadata_extractor = PydanticProgramExtractor(
    program=openai_program, input_key="input", show_progress=True
)
Extract metadata from the node
Section titled “Extract metadata from the node”
extract_metadata = metadata_extractor.extract(orig_nodes[0:1])
100%|██████████| 1/1 [00:01<00:00, 1.51s/it]
extract_metadata
[{'entities': ['eugeneyan', 'llm', 'engineering', 'production'], 'summary': 'Patterns for Building LLM-based Systems & Products - Discussions on HackerNews, Twitter, and LinkedIn. There is a large class of problems that are easy to imagine and build demos for, but extremely hard to make products out of. For example, self-driving: It’s easy to demo a'}]
metadata_nodes = metadata_extractor.process_nodes(orig_nodes[0:1])
100%|██████████| 1/1 [00:01<00:00, 1.03s/it]
metadata_nodes
[TextNode(id_='2b6a40a8-dd6a-44a8-a005-da32ad98a05c', embedding=None, metadata={'entities': ['eugeneyan', 'llm', 'engineering', 'production'], 'summary': 'Patterns for Building LLM-based Systems & Products - Discussions on HackerNews, Twitter, and LinkedIn. Content includes discussions on self-driving technology and challenges in turning demos into products.'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='https://eugeneyan.com/writing/llm-patterns/', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='9da2827b0860b2f81e51cb3efd93a13227f0e4312355a495e5668669f257cb14'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='d3a86dba-7579-4196-80d7-30affa7052a7', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='993e43bb060cf2f183f894f8dec6708eadcac2b7d2760a94916dc82c24255acc')}, text='# [eugeneyan](/)\n\n * [Start Here](/start-here/ "Start Here")\n * [Writing](/writing/ "Writing")\n * [Speaking](/speaking/ "Speaking")\n * [Prototyping](/prototyping/ "Prototyping")\n * [About](/about/ "About")\n\n# Patterns for Building LLM-based Systems & Products\n\n[ [llm](/tag/llm/) [engineering](/tag/engineering/)\n[production](/tag/production/) [🔥](/tag/🔥/) ] · 66 min read\n\n> Discussions on [HackerNews](https://news.ycombinator.com/item?id=36965993),\n> [Twitter](https://twitter.com/eugeneyan/status/1686531758701899776), and\n> [LinkedIn](https://www.linkedin.com/posts/eugeneyan_patterns-for-building-\n> llm-based-systems-activity-7092300473981927424-_wVo)\n\n“There is a large class of problems that are easy to imagine and build demos\nfor, but extremely hard to make products out of. For example, self-driving:\nIt’s easy to demo a', start_char_idx=0, end_char_idx=838, text_template='[Excerpt from document]\n{metadata_str}\nExcerpt:\n-----\n{content}\n-----\n', metadata_template='{key}: {value}', metadata_seperator='\n')]
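As a follow-up, the node parser and the Pydantic metadata extractor can also be chained in a single ingestion pipeline before indexing. The sketch below assumes the `docs`, `node_parser`, and `metadata_extractor` objects defined earlier in this notebook and the llama_index `IngestionPipeline` API.

# Sketch: run parsing and Pydantic metadata extraction as one pipeline, then index the result.
# Assumes `docs`, `node_parser`, and `metadata_extractor` from earlier cells are in scope.
from llama_index.core import VectorStoreIndex
from llama_index.core.ingestion import IngestionPipeline

pipeline = IngestionPipeline(transformations=[node_parser, metadata_extractor])
pydantic_nodes = pipeline.run(documents=docs)

index2 = VectorStoreIndex(pydantic_nodes)
query_engine2 = index2.as_query_engine(similarity_top_k=1)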