At RTA Summit 2024, we will have a booth demonstration showcasing Apache Pinot’s ability to serve analytics and similarity search in real time. There is actually a lot involved in making this successful. Let me unpack it by describing the use case.
Booth Duty Demo
At conferences, you should always have someone at your booth to answer questions, hand out swag, and scan attendees. This tends to require multiple people. This demo will monitor your booth using your camera to identify people assigned to booth duty and measure their activity throughout the day.
Below is a screenshot of what the “Booth Duty” demo can look like.
The timeline chart at the top of the dashboard plots the occurrences of those assigned booth duty. The middle table shows the same dataset in tabular form. The bottom of the dashboard is a GenAI response built from information obtained from the frames: a summary of what has been happening at the booth for the last 15 minutes, along with the frames that captured this information.
So how are we doing this? I can break it down into the following features:
Video frame capturing using computer vision.
Creating image embeddings from video frames.
Image captioning to get an image description.
Similarity search of booth duty assignees.
Real-Time RAG (Retrieval-Augmented Generation).
Computer Vision
To capture a video feed from Python, you can use a computer vision module called OpenCV.
pip install opencv-python
import cv2

video = cv2.VideoCapture(0)  # open the default camera
while True:
    success, frame = video.read()  # success is False if no frame was read
    if not success:
        break
To convert the frame to an embedding, you can use the PIL module to read the frame and pass it to a sentence transformer using the clip-ViT-B-32 image embedding model.
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('clip-ViT-B-32')

# OpenCV frames are BGR; convert to RGB before handing them to PIL.
iframe = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
img_emb = model.encode(iframe).tolist()
The clip-ViT-B-32 model is multimodal in that it can create embeddings for text and images, putting them in the same vector space. This means you can ask the question, “Find me images of a crowd of people,” and the model can convert this question to an embedding and search for image embeddings.
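For example, the text query can be encoded with the same model and compared directly against an image embedding. Here is a minimal sketch using the cosine similarity helper from sentence-transformers:

from sentence_transformers import util

# Encode a text query into the same vector space as the images.
query_emb = model.encode("a crowd of people")

# Higher cosine similarity means the frame matches the query more closely.
score = util.cos_sim(query_emb, model.encode(iframe))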
This model will help us search for images but will not tell us what is happening in the image.
Image Captioning
To get a simple caption for the frames being captured, we can use the Salesforce/blip-image-captioning-base model.
from transformers import pipeline
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner(iframe)[0]['generated_text']
This model will give you a short description of what is happening in an image. It does not know our booth duty assignees, though, so it will not help you count the number of times a booth assignee appears in the video.
Similarity Search in Pinot
First, we need to pre-load Apache Pinot with image embeddings of the booth duty assignees. When frame images are captured, we create embeddings of the frame image and compare them to the embeddings of our assignees.
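Pre-loading can be as simple as encoding a photo of each assignee with the same CLIP model and writing the rows into the people table. Below is a minimal sketch; the names, photo paths, and ingestion path (a batch ingestion job or a Kafka topic feeding the table) are assumptions about your setup.

from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('clip-ViT-B-32')

# Placeholder assignee names and photo paths.
assignees = {"hubert": "photos/hubert.jpg"}

rows = []
for name, path in assignees.items():
    emb = model.encode(Image.open(path)).tolist()
    rows.append({"name": name, "person_embedding": emb})

At query time, the comparison runs in Pinot with the SQL below.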
WITH DIST AS (
  SELECT
    name,
    cosine_distance(person_embedding, ARRAY{frame_embedding}) AS distance
  FROM people
  WHERE VECTOR_SIMILARITY(person_embedding, ARRAY{frame_embedding}, 10)
)
SELECT * FROM DIST
WHERE distance < {threshold}
ORDER BY distance ASC
In the SQL above, Pinot leverages the vector index through the VECTOR_SIMILARITY predicate, which speeds up the vector search. We further filter the results with a threshold, making sure the distance between the two embeddings is within a limit. The threshold is how we tune the search results: the smaller the threshold, the more precise the search, which can result in no assignees being found; the higher the threshold, the more lenient the search, which may produce false positives. In this case, our threshold is set to 0.3.
We could be more precise by picking faces out of the frames and comparing facial features. We chose to keep things simple and compare entire images, but the results are still good.
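Running this search from Python can look like the sketch below, assuming the pinotdb client and a broker on localhost:8099; img_emb is the frame embedding from earlier.

from pinotdb import connect

conn = connect(host='localhost', port=8099, path='/query/sql', scheme='http')
curs = conn.cursor()

# Substitute the frame embedding into the query template.
curs.execute(f"""
WITH DIST AS (
  SELECT
    name,
    cosine_distance(person_embedding, ARRAY{img_emb}) AS distance
  FROM people
  WHERE VECTOR_SIMILARITY(person_embedding, ARRAY{img_emb}, 10)
)
SELECT * FROM DIST
WHERE distance < 0.3
ORDER BY distance ASC
""")
matches = curs.fetchall()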
Frame Rate
Video runs at an average of about 24 frames per second, which comes to about 691,200 frames for 8 hours of booth duty. Monitoring two booths at the same time creates 1,382,400 frames over 8 hours, and monitoring two booths for 16 hours (two days) creates 2,764,800 frames.
If we capture one frame every 5 seconds, we can reduce this to 23,040 frames to support 2 booths for 16 hours.
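Inside the capture loop from earlier, that sampling can be a simple time check. A minimal sketch:

import time
import cv2

video = cv2.VideoCapture(0)
CAPTURE_INTERVAL = 5  # seconds between processed frames
last_capture = 0.0

while True:
    success, frame = video.read()
    if not success:
        break
    if time.time() - last_capture < CAPTURE_INTERVAL:
        continue  # drop frames between samples
    last_capture = time.time()
    # embed, caption, and search this frame here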
For every frame we capture, we perform a similarity search in Pinot against the booth duty assignees table. If we find a person we recognize, we add them to the message below, along with the frame number, the image caption, and the timestamp. If no person is found, we still send the message to Pinot to capture instances where the booth is empty.
{
    "frame": frame_number,
    "person": person,
    "description": caption,
    "embedding": img_emb,
    "ts": ts
}
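One way to get these messages into a real-time Pinot table is through a Kafka topic the table ingests from. The sketch below assumes kafka-python, a local broker, and a topic named video backing the table; the field values come from the earlier steps.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

producer.send('video', {
    "frame": frame_number,
    "person": person,
    "description": caption,
    "embedding": img_emb,
    "ts": ts,
})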
Real-Time GenAI
The dashboard refreshes every 5 seconds. Each refresh sends two queries to Pinot: one to populate the timeline chart and one to pull video descriptions for the past 15 minutes.
SELECT
  frame,
  person,
  description
FROM video
WHERE ts > ago('PT15M')
ORDER BY frame DESC
LIMIT 50
Then we assemble the prompt with the person and description for every frame entry.
frame: [100] - person [hubert]: a man on his phone wearing glasses
frame: [200] - person [hubert]: a man wearing a hoodie
We add this to the context of the prompt below and answer a question:
Summarize what has been happening at the booths in two sentences
PROMPT_TEMPLATE = """
Below are video logs for the last 15 minutes. They contain descriptions
of video frames and the name of a person that was found in the frame if
one was identified.
No logs indicates the video stream has just started.
Answer the question based on this log:
----
{context}
----
Based on the above video frame descriptions, answer this question:
{question}
"""
The LLM will return an answer similar to the one below:
Based on the video frame descriptions, it appears that a man named Hubert has been consistently present at the booths. He is described as wearing glasses and a brown hoodie in most frames, occasionally standing in a room or in front of a ceiling with lights. No other individuals have been identified in the frames.
AI and Real-Time Analytics Together
Pinot’s support for real-time analytics with vector search provides a more compelling real-time view of the business. Real-time analytics combined with real-time GenAI can enable hyper-personalized experiences for users. By analyzing real-time data streams and generating personalized content or recommendations on the fly, businesses can tailor their offerings to individual preferences and behaviors.
Learn more about this demo by trying it out here.