RAG stands for Retrieval-Augmented Generation. It’s a technique that makes the responses returned by a large language model (LLM) more accurate. LLMs alone can answer many of the questions put to them, but to respond with accurate answers and cite sources, the LLM needs help doing some research. In this post, I’ll show how to do this in real time.
Here is what the overall flow will look like.
Use Case
A popular use case today is an AI chatbot that answers questions about a set of documents provided to it. These documents are supplied to the LLM through a vector store so it can give accurate responses. If the documents change frequently, you’ll need to keep the vector store updated to ensure the responses from the LLM stay fresh and accurate.
If you are unfamiliar with vector stores, read this previous post for an easy introduction.
Data Flow
In Python, I use LangChain to recursively load a site and write the embeddings into Kafka. In this case, I’m loading pages from rtasummit.com.
In the code snippet below, we use a RecursiveUrlLoader from LangChain to load the documents from a URL. We also use BeautifulSoup to parse the HTML and get the text.
from bs4 import BeautifulSoup as Soup
from langchain_community.document_loaders import RecursiveUrlLoader

# Recursively crawl the site and extract the plain text of each page
loader = RecursiveUrlLoader(
    url=my_url,
    max_depth=5,
    extractor=lambda x: Soup(x, "html.parser").text
)
docs = loader.load()
We then loop through the documents generated, convert the text into embeddings, and write them to Kafka.
from openai import OpenAI

client = OpenAI()
for doc in docs:
    message = {
        "source": doc.metadata['source'],
        "content": doc.page_content,
        "metadata": doc.metadata,
        # The OpenAI response holds the vector at .data[0].embedding
        "embedding": client.embeddings.create(
            input=[doc.page_content],
            model=self.model
        ).data[0].embedding
    }
    kafka.write(message)  # kafka wraps a Kafka producer
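The kafka.write call above abstracts the producer details. Below is a minimal sketch of what it could look like using the confluent_kafka client; the topic name, broker address, and the choice of keying messages by source are my assumptions, not code from the post. Keying by the primary key matters because Pinot’s upserts expect all records for a given key to land on the same partition.

import json
from confluent_kafka import Producer

# A sketch of a thin Kafka write helper (topic and broker are assumptions)
producer = Producer({"bootstrap.servers": "localhost:9092"})

def kafka_write(message: dict, topic: str = "documentation"):
    # Key by source so updates to the same page go to the same partition,
    # which Pinot relies on when upserting on the source primary key
    producer.produce(topic, key=message["source"], value=json.dumps(message))
    producer.flush()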
The schema of the message sent to Kafka will be used to define the table in Pinot.
source - This field will be used as a primary key so that we can perform UPSERTS in Pinot. This will ensure that the responses from the LLM are fresh.
content - This is the actual content of the web page.
metadata - A JSON object with fields: description, language, source, and title of each page.
embedding - A float array holding the embedding generated by OpenAI.
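For reference, here is roughly what the corresponding Pinot schema could look like, written as a Python dict that you would POST to the controller’s /schemas endpoint. The schema name and the exact field specs are my assumptions based on the message fields above.

# A sketch of the Pinot schema for the message above (names are assumptions)
schema = {
    "schemaName": "documentation",
    "primaryKeyColumns": ["source"],  # enables upserts keyed on source
    "dimensionFieldSpecs": [
        {"name": "source", "dataType": "STRING"},
        {"name": "content", "dataType": "STRING"},
        {"name": "metadata", "dataType": "JSON"},
        # multi-valued float column holding the OpenAI embedding
        {"name": "embedding", "dataType": "FLOAT", "singleValueField": False}
    ]
}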
Real-Time Analytics Summit Chatbot
To generate AI responses from rtasummit.com, we first need to understand what a prompt is.
A prompt is a piece of text inputted into the LLM to initiate or guide its generation of text. It serves as the LLM’s starting point or instruction, telling it what kind of text to produce. For example, a prompt can be a question, a statement, a sentence to complete, or even a more complex instruction for the model to follow. The AI then generates text that continues from, answers, or elaborates on the prompt, depending on how it's been trained and the specifics of the task it's being used for.
Below, we have a prompt. There are two variables that we’ll use to format the prompt. The {context} variable will be replaced with content from the pages from rtasummit.com. We retrieve (R in RAG) this context from the Apache Pinot vector table using similarity search.
PROMPT_TEMPLATE = """
Answer the question based only on the following context:
{context}
---
Answer the question based on the above context: {question}
"""
The {question} variable is replaced with the question from the user. The LLM will generate (G in RAG) a response to the {question}. The response is augmented (A in RAG) by the {context}. The LLM will provide an accurate answer to a question that is within the context of the documents provided.
Let’s walk through the process:
In the diagram below, a user asks a question related to the loaded site - in our use case, rtasummit.com.
The question is converted into a vector embedding and used to find similar documents in Pinot. Similarity search finds the documents most likely to answer the user’s question.
The content of each returned page is joined together and used as the context in the prompt.
The question itself is added to the prompt.
Once the prompt is assembled with context and the question, it’s submitted to OpenAI to generate a response.
model = ChatOpenAI()
response_text = model.invoke(prompt)
Meanwhile, Pinot is being updated with the latest changes to the site to ensure the answers provided by the LLM are accurate and fresh.
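This freshness relies on upserts being enabled on the Pinot table. Here is a minimal sketch of the relevant fragment of the realtime table config; the table name is an assumption, and the rest of the config (including the stream ingestion settings) is omitted.

# Fragment of a realtime table config with upserts enabled (a sketch)
table_config = {
    "tableName": "documentation",
    "tableType": "REALTIME",
    "upsertConfig": {"mode": "FULL"},  # latest record per primary key wins
    "routing": {"instanceSelectorType": "strictReplicaGroup"}  # required for upserts
}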
Code
Here is the code that does this work. PinotVector is a Python class that wraps the pinotdb Python library.
query_text = input("\nsearch query: ")

# Prepare the DB.
db = PinotVector(host="pinot")

# Search the DB.
results = db.similarity_search(query_text, dist=.5)
if len(results) == 0:
    print("Unable to find matching results.")
else:
    context_text = "\n\n---\n\n".join([doc.page_content for doc in results])
    prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
    prompt = prompt_template.format(context=context_text,
                                    question=query_text)

    model = ChatOpenAI()
    response_text = model.invoke(prompt)

    sources = [doc.metadata.get("source", None) for doc in results]
    print("response:")
    print(f'{response_text.content} \n')
    for source in sources:
        print(f' - {source}')
Let’s Ask a Question
search query: when does Kishore speak at rta summit 2024?
response:
Kishore speaks at the Real-Time Analytics Summit 2024 on May 8th from 8:30 AM to 9:15 AM in the Imperial Ballroom.
- https://www.rtasummit.com/agenda/sessions/ao200
- https://www.rtasummit.com/agenda/sessions/ao188
- https://www.rtasummit.com/agenda/sessions/570389
- https://www.rtasummit.com/agenda/sessions/ao206
If any adjustments are made to the schedule, Pinot’s UPSERT feature ensures that only the latest documents are returned when populating the prompt, and the LLM will pick up the latest schedule.
Similarity Search
In this example, we have full control over how we perform similarity search because we invoke Pinot directly.
When we search, we have to convert the user’s question into a vector embedding using OpenAI. We can only compare embeddings to other embeddings. The SQL below is formatted with these variables:
search_embedding - This is the user’s question converted into an embedding.
limit - How many documents we want to receive. Providing too many documents to the LLM may exceed its context window and cause errors.
dist - Similarity is measured by distance. This variable is the maximum allowed distance between the search_embedding and a document’s embedding. This is where Pinot’s vector index optimizes the search.
search_embedding = get_embedding(query_text)
sql = f"""
SELECT
source,
content,
metadata,
cosine_distance(embedding, ARRAY{search_embedding}) AS cosine
FROM documentation
HAVING cosine < {dist}
ORDER BY cosine ASC
LIMIT {limit}
"""
Scaling Search
The document table in Pinot can have as many additional fields as you need alongside the embedding column. Additional indexes can be applied to these other fields and combined with the vector index. This enables you to mix exact filtering with vector similarity search.
Take for example the question:
Find me all the RTA Summit speakers talking about stream processing from Confluent.
If you capture the company for each speaker in the metadata, you can narrow the search by filtering to speakers from that company before performing the similarity search. In this example, metadata is a JSON field, so we can use Pinot’s JSON index to speed up the search.
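As a sketch, the query could add such a filter ahead of the distance check. The company field and the JSON_MATCH predicate below are assumptions; this mirrors the earlier query rather than code from the repository.

# A sketch combining an exact JSON filter with similarity search
sql = f"""
SELECT
source,
content,
metadata,
cosine_distance(embedding, ARRAY{search_embedding}) AS cosine
FROM documentation
WHERE JSON_MATCH(metadata, '"$.company"=''Confluent''')
HAVING cosine < {dist}
ORDER BY cosine ASC
LIMIT {limit}
"""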
With similarity search, you also need to think about queries per second (QPS) and concurrency (number of users).
Try It Yourself
You can try the RAG use case by cloning this GitHub repository.