Background
As in most organizations, our support staff spends a lot of time answering the same questions for different users, especially at the start of a quarter. To ease some of the burden on our staff, I wanted to develop a knowledgebase of common issues and questions, which we could then use to answer support tickets with an LLM (yes, I could have made an FAQ, but who reads FAQs?). HubSpot (where our support tickets live) does have this capability, but their agents would cost at least $300 a month even at our smallish scale, and we'd be limited to the models they allow. Instead, I decided to build a custom RAG pipeline on AWS infrastructure we already had -- and ended up with an automated system that matched our human agents' responses 95% of the time.
Components
- Data Storage: Amazon S3 for storing support tickets and the knowledgebase articles we'll create from them. Cost: negligible (if the storage required ever reaches 1 GB, I'd be shocked).
- Vector Database: Postgres with `pgvector` running on a `t4g.micro` RDS instance. Cost: technically free (I used an instance we already had); about $15 a month if we weren't already paying for it.
- Embeddings: OpenAI `text-embedding-3-small`. Cost: $0.02 per 1 million tokens; basically free at our scale.
- Servers: AWS Lambda and Fargate for running the RAG pipeline. Cost: under $1 a month.
- LLM: Claude Sonnet 4.6. With an average of 3,500 input tokens and 400 output tokens per interaction, the cost is under 2 cents per use.
Initial Run
I wanted to start with a decent-sized knowledgebase built from existing tickets, so I downloaded the previous year's resolved support tickets from HubSpot using their API. I saved each ticket as a JSONL file containing the entire conversation related to that ticket, and stripped all PII (phone numbers, email addresses, invoice numbers, etc.) from the conversation's body using a set of regex functions.
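The post doesn't share its actual regexes, so the patterns below are illustrative stand-ins for the kind of PII scrubbing described (the placeholder tokens and the invoice-number format are my own assumptions):

```python
import re

# Illustrative patterns only -- the real pipeline's regexes aren't published.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
    (re.compile(r"\bINV-\d{4,}\b", re.IGNORECASE), "[INVOICE]"),  # assumed invoice format
]

def scrub_pii(text: str) -> str:
    """Replace PII matches with placeholder tokens before the text is stored."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```

Replacing PII with typed placeholders (rather than deleting it) keeps the sentence structure intact, which helps later when the conversations are embedded and summarized.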
Taking this conversation data, I generated embeddings (stored in the Postgres DB) and clustered the tickets by subject matter using HDBSCAN (giving priority to more recent tickets, in case the correct answer had changed over the course of the year). Since we have a large number of mostly-similar question/answer pairs, clustering bunches of similar tickets together is much more efficient than treating each ticket as its own unique entity.
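The real pipeline uses HDBSCAN; as a dependency-free illustration of the clustering idea, here is a greedy cosine-threshold grouping (threshold value, function names, and the newest-first recency trick are all my own simplifications, not the author's implementation):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def greedy_cluster(tickets, threshold=0.85):
    """Group (ticket_id, embedding) pairs whose embeddings are similar.

    Tickets are assumed pre-sorted newest-first, so each cluster's
    representative (its first member) is its most recent ticket -- a
    crude stand-in for the recency weighting described above.
    """
    clusters = []  # each: {"rep": representative embedding, "members": [ids]}
    for ticket_id, emb in tickets:
        for cluster in clusters:
            if cosine(emb, cluster["rep"]) >= threshold:
                cluster["members"].append(ticket_id)
                break
        else:
            clusters.append({"rep": emb, "members": [ticket_id]})
    return clusters
```

Unlike this sketch, HDBSCAN needs no similarity threshold and marks outliers as noise, which is why it suits a pile of tickets where cluster sizes vary wildly.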
From these clusters, I generated a set of knowledgebase articles (one per cluster) using Claude Sonnet 4.6 (running on Bedrock). These articles were then tagged with the cluster ID, and saved in S3 (with the embeddings in Postgres).
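A sketch of the article-generation step, using Bedrock's Converse API -- the prompt wording is my own invention, and the model ID is left as a parameter since the post doesn't specify one:

```python
def build_article_prompt(cluster_tickets):
    """Assemble a prompt asking the model to distill one cluster into an article.

    The wording here is illustrative; the post doesn't share its actual prompt.
    """
    joined = "\n\n---\n\n".join(cluster_tickets)
    return (
        "The following resolved support conversations all concern the same topic. "
        "Write a single knowledgebase article (title + body) that answers the "
        "underlying question, preferring the most recent conversations when "
        "answers conflict.\n\n" + joined
    )

def generate_article(cluster_tickets, model_id):
    """Call Claude on Bedrock. Requires AWS credentials; sketch only."""
    import boto3  # imported here so build_article_prompt works without AWS deps
    client = boto3.client("bedrock-runtime")
    resp = client.converse(
        modelId=model_id,
        messages=[{"role": "user",
                   "content": [{"text": build_article_prompt(cluster_tickets)}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]
```

Generating one article per cluster, with the full cluster in context, is what lets the model reconcile slightly different phrasings of the same question into one canonical answer.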
Pipeline
With the initial knowledgebase populated, I set up a Lambda to run weekly, downloading any new, resolved tickets from HubSpot. The Lambda checks whether the ticket's question is unanswered -- or whether the answer to the question has changed since the knowledgebase was last updated. Detecting changed answers is especially important for us, as policies for different user segments change fairly regularly.
To detect whether an answer has drifted, the drift-checker function uses Sonnet to compare the generated response with the actual human response to the support ticket. It assigns a degree of confidence (high, medium, or low) to the proposition that the correct answer to the question has changed. If the confidence level is medium or high, the cluster is marked for regeneration based on the new information we now have. If the confidence level is low, the drift report is saved in S3 for human review.
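The routing logic above is simple enough to sketch directly. This assumes Sonnet is prompted to reply with JSON like `{"confidence": "high", "reason": "..."}` -- the response format and action names are my assumptions, not details from the post:

```python
import json

def route_drift_verdict(raw_verdict: str) -> str:
    """Map the drift-checker's confidence level to a pipeline action.

    Assumed response shape: {"confidence": "high"|"medium"|"low", ...}.
    Unknown or missing confidence falls through to human review.
    """
    verdict = json.loads(raw_verdict)
    confidence = verdict.get("confidence", "low").lower()
    if confidence in ("high", "medium"):
        return "mark_cluster_for_regeneration"
    return "save_drift_report_for_review"
```

Defaulting the ambiguous cases to human review keeps the automated path conservative: the pipeline only rewrites an article when the model is fairly sure the answer changed.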
When a question/answer pair is new or has drifted, the ticket goes into the processing queue (i.e., an S3 bucket). Once the number of tickets in that bucket reaches a certain threshold, the Lambda triggers a Fargate task to process them. The Fargate task loads all the uncovered tickets, generates embeddings, then clusters them and creates articles as in the initial run (this process is a little too intensive for a Lambda to handle, hence running it as a Fargate task).
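A minimal sketch of the threshold check and Fargate trigger, assuming boto3-style S3 and ECS clients; the bucket prefix, threshold, and ECS cluster/task names are placeholders, not values from the post:

```python
def pending_ticket_count(s3_client, bucket, prefix="pending/"):
    """Count queued tickets sitting in the S3 'processing queue' prefix."""
    resp = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return resp.get("KeyCount", 0)

def maybe_trigger_fargate(s3_client, ecs_client, bucket, threshold=25):
    """Kick off the clustering/article task once enough tickets queue up.

    Returns True if a task was started. All names below are hypothetical.
    """
    if pending_ticket_count(s3_client, bucket) < threshold:
        return False
    ecs_client.run_task(
        cluster="support-kb",          # hypothetical ECS cluster name
        taskDefinition="kb-refresh",   # hypothetical task definition
        launchType="FARGATE",
    )
    return True
```

Batching by threshold like this means the expensive clustering job runs on a meaningful chunk of new tickets rather than firing for every single one.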
Usage
Now that we have a knowledgebase of articles and their related embeddings, we just need a way for users to request answers from it. I added an LLM Support service to our main app, which generates embeddings from the user's question, queries the vector database with those embeddings, and returns articles related to the question.
It then passes these articles (along with a detailed system prompt and information about the requesting user) to Claude Sonnet, which synthesizes a response and returns it to the user. If the LLM's response doesn't work for the user (e.g., if an answer has drifted but we haven't caught it yet), they can click "I need to open a ticket", which submits their original support request to a human agent. We log all interactions with the LLM Support system to monitor how well it's working for users, who can rate and leave feedback on any answer they receive.
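The retrieval step described above boils down to one nearest-neighbor query against `pgvector`. Here's a sketch using its cosine-distance operator (`<=>`); the table and column names are illustrative, not from the post:

```python
def top_articles(cur, question_embedding, k=5):
    """Fetch the k nearest knowledgebase articles for a question embedding.

    `cur` is a DB-API cursor (e.g. psycopg). The embedding is passed as
    pgvector's '[x,y,...]' literal; str() on a Python list produces that
    form (pgvector tolerates the spaces).
    """
    cur.execute(
        "SELECT article_id, s3_key FROM kb_articles "
        "ORDER BY embedding <=> %s::vector LIMIT %s",
        (str(question_embedding), k),
    )
    return cur.fetchall()
```

Returning the S3 key alongside the ID lets the service pull the full article text from S3 while keeping only embeddings and metadata in Postgres.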
Early Observations
To test out the pipeline, we pulled a set of recent resolved tickets from HubSpot, then fed the users' questions/issues into the LLM Support service, and finally compared the auto-generated answer to the actual human support agent's response for each ticket. Although the system is still in its early days, it looks promising in its ability to lift a huge amount of work off the shoulders of our time-crunched support staff.