RAG
Large language models (LLMs) are trained on public data and have a training cutoff date, so they aren’t inherently aware of private or up-to-date information. As a result, an LLM may lack the information needed to answer a user’s question, yet answer confidently anyway (i.e., hallucinate). This is a common problem, especially in enterprise settings, where information is proprietary and/or constantly changing. Unfortunately, these are also the environments where plausible but inaccurate answers can have serious consequences.
There are roughly three general approaches to addressing this problem, going from least to most complex:
- System prompt: If the information the model needs to perform well fits in a system prompt (i.e., within the model’s context window), consider that approach first.
- Tool calling: Provide the LLM with tools it can use to retrieve the information it needs. Compared to RAG, this avoids having to pre-fetch and maintain a document database and compute document/query similarities, and it can even be combined with RAG (see the sketch after this list).
- Retrieval-Augmented Generation (RAG): Among these options, this can be the most demanding to implement and maintain, but also offers the most control over when and how exactly the LLM gains access to the information it needs.
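To give a feel for the tool-calling option, here is a minimal sketch using chatlas’s `register_tool()`. The `lookup_docs()` tool and its hard-coded return value are placeholders for whatever retrieval logic fits your setting:

```python
from chatlas import ChatAnthropic

chat_client = ChatAnthropic(model="claude-3-7-sonnet-latest")


def lookup_docs(query: str) -> str:
    """Search the internal knowledge base and return relevant passages."""
    # Placeholder: in practice, call your own search API, database, etc.
    return "Release 2.4 added single sign-on and a new audit log."


# Register the tool so the model can decide when (and whether) to call it.
chat_client.register_tool(lookup_docs)

chat_client.chat("What changed in the latest release?")
```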
If RAG sounds right for your use case, the next section gives a basic example of how to implement it using Shiny and chatlas. The last section provides some tips on how to scale up your RAG implementation.
RAG basics
The core concept of RAG is fairly simple, yet general: given a set of documents and a user query, find the document(s) most similar to the query and supply those documents as additional context to the LLM. This requires choosing a numerical technique to compute similarity, of which there are many, each with its own strengths and weaknesses. The tricky part of doing RAG well is often finding a similarity measure that is both performant and effective for your use case.
To demonstrate, let’s use a basic example derived from chatlas’s article on RAG. The main idea is to implement a function (`get_top_k_similar_documents`) that finds the top-k most similar documents to a user query. Note that similarity depends on two main factors:

- The embedding model (for embedding text into a numerical vector space).
  - Here we use the popular `all-MiniLM-L12-v2` model, which offers a nice balance between performance and speed.
- The similarity metric (for computing similarity between two vectors).
  - Here we use cosine similarity, which is a common choice for text embeddings.
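For reference, the cosine similarity computed below is the standard one: for a query embedding $q$ and a document embedding $d$,

$$
\operatorname{sim}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert},
$$

i.e., the dot product of the two vectors divided by the product of their norms. Values near 1 indicate embeddings pointing in nearly the same direction.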
rag.py

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

# A list of 'documents' (one document per list element)
documents = [
    "The unicorn programming language was created by Horsey McHorseface.",
    "It's known for its magical syntax and rainbow-colored variables.",
    "Unicorn is a dynamically typed language with a focus on readability.",
    "Some other programming languages include Python, Java, and C++.",
    "Some other useless context...",
]

# Compute embeddings for each document (do this once for performance reasons)
embeddings = [embed_model.encode([doc])[0] for doc in documents]


def get_top_k_similar_documents(user_query, top_k=3):
    # Compute embedding for the user query
    query_embedding = embed_model.encode([user_query])[0]

    # Calculate cosine similarity between the query and each document
    similarities = np.dot(embeddings, query_embedding) / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_embedding)
    )

    # Get the top-k most similar documents
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return [documents[i] for i in top_indices]
```
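Before wiring this into an app, you can sanity-check the retrieval step on its own. The exact ordering and scores depend on the embedding model, but the document naming Horsey McHorseface should typically rank first for this query:

```python
from rag import get_top_k_similar_documents

for doc in get_top_k_similar_documents("Who created the unicorn language?", top_k=2):
    print(doc)
```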
app.py

```python
from chatlas import ChatAnthropic
from rag import get_top_k_similar_documents
from shiny.express import ui

chat_client = ChatAnthropic(
    model="claude-3-7-sonnet-latest",
    system_prompt="""
    You are a helpful AI assistant. Using the provided context,
    answer the user's question. If you cannot answer the question based on the
    context, say so.
    """,
)

chat = ui.Chat(
    id="chat",
    messages=["Hello! How can I help you today?"],
)
chat.ui()
chat.update_user_input(value="Who created the unicorn language?")


@chat.on_user_submit
async def _(user_input: str):
    top_docs = get_top_k_similar_documents(user_input, top_k=3)
    prompt = f"Context: {top_docs}\nQuestion: {user_input}"
    response = await chat_client.stream_async(prompt)
    await chat.append_message_stream(response)
```
Scaling up
To scale this basic example up to your use case, you’ll want to consider not only an embedding model and similarity metric that better match your use case, but also a more efficient way to store and retrieve your documents.
Nowadays, there are many options for efficient storage and retrieval of documents (i.e., vector databases). That said, duckdb’s vector search extension comes highly recommended, and here is a great blog post on building a database and retrieving from it with a custom embedding model. Many of these options offer both local and cloud-based solutions, so you can choose the one that best fits your needs. For example, with duckdb, you can leverage MotherDuck for your hosting needs; other options include Pinecone and Weaviate.
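As a rough illustration (not the exact approach from the blog post), the sketch below stores the 384-dimensional `all-MiniLM-L12-v2` embeddings in a duckdb table and ranks rows with `array_cosine_similarity()`. The `docs` table and its columns are made up for the example, and a real deployment would likely add an index (e.g., via duckdb’s vss extension) rather than scanning every row:

```python
import duckdb
from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

documents = [
    "The unicorn programming language was created by Horsey McHorseface.",
    "Some other programming languages include Python, Java, and C++.",
]

con = duckdb.connect("rag.duckdb")

# One row per document; embeddings stored as a fixed-size FLOAT array.
con.execute("CREATE TABLE IF NOT EXISTS docs (text VARCHAR, embedding FLOAT[384])")
con.executemany(
    "INSERT INTO docs VALUES (?, ?::FLOAT[384])",
    [(doc, embed_model.encode([doc])[0].tolist()) for doc in documents],
)


def retrieve(query: str) -> list[str]:
    query_embedding = embed_model.encode([query])[0].tolist()
    rows = con.execute(
        """
        SELECT text
        FROM docs
        ORDER BY array_cosine_similarity(embedding, ?::FLOAT[384]) DESC
        LIMIT 3
        """,
        [query_embedding],
    ).fetchall()
    return [row[0] for row in rows]


print(retrieve("Who created the unicorn language?"))
```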