14 January 2024

Optimizing RAG: Basic to Advanced Strategies

by Shyamal Anadkat

Here are some basic → advanced (where more complexity is acceptable) strategies that I see for optimizing RAG implementations these days:

original tweet

Basic

Using effective prompt engineering, templating, and conditioning. eg: “given the context information and no prior knowledge, answer the query...” etc. ok, we’ve all done some pretty aggressive prompt engineering.
Understand the challenges: don’t overoptimize and first really identify common issues with retrieval, augmentation, and generation. you always want to start simple. simplicity is sexier.
Choose the right chunk size: determine the optimal chunk size for your data to ensure efficient processing and retrieval. chunk overlaps don’t always work; use smaller chunks?
Using summaries for data chunks: apply summarization techniques to data chunks to provide the model with a concise representation of the information
Data, data, and data: carefully managing, scrutinizing, versioning, and cleaning data sources and pipelines. quality > quantity. garbage data, garbage r-a-g
Evaluating retrieval: this can include 1/ assessing retrieval performance by measuring the proportion of retrieved documents that are relevant (precision) and the proportion of all relevant documents that were retrieved (recall) and 2/ integrating human-in-the-loop evals/feedback and basic evaluations. think about use-case-specific evaluation metrics.
Evaluating generation: evaluating faithfulness and answer relevancy using something like ragas or a custom-built eval framework.
The enlightening realization that you don’t always need a vector db, or that simpler options like pgvector can be enough.
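
To make the precision/recall point under "Evaluating retrieval" concrete, here is a minimal sketch of precision@k and recall@k for a single query. The function name and the document ids are hypothetical, and it assumes you already have a set of human-labeled relevant ids per query; it is one simple way to start, not a full eval framework.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision@k, recall@k) for one query.

    retrieved: ranked list of document ids returned by the retriever
    relevant:  set of document ids judged relevant for the query
    """
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    # Precision: share of what we retrieved that is relevant.
    precision = hits / len(top_k) if top_k else 0.0
    # Recall: share of everything relevant that we managed to retrieve.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical labeled example: the ids are made up for illustration.
retrieved_ids = ["d1", "d7", "d3", "d9", "d4"]
relevant_ids = {"d1", "d3", "d5", "d8"}

precision, recall = precision_recall_at_k(retrieved_ids, relevant_ids, k=5)
print(f"precision@5={precision:.2f} recall@5={recall:.2f}")
# prints: precision@5=0.40 recall@5=0.50
```

Averaging these numbers over a small labeled query set is usually enough to see whether a chunking or retrieval change helped or hurt.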

Intermediate

Metadata filtering: adding metadata to the chunks to help filter and process results. remember: similar ≠ relevant. this could also include filtering by relevancy. caution: be careful about metadata.
Managing embeddings: strategies to handle frequently updated or newly added documents; challenges include incremental indexing and dynamic document ranking.
Trustworthiness: using citations/attributions and employing techniques such as confidence estimation, uncertainty quantification, and error analysis to ensure the accuracy and trustworthiness of the generated content; sooner or later, thinking about “answerable probability” + “I don’t know” problems for retrieval.
Leverage hybrid search techniques or other index types: integrate different search techniques, such as keyword-based search (eg: bm25) and semantic search. again, similar ≠ relevant for your use case.
Apply query transformations: modify the user’s query to better match the information needed from the data sources. users don’t always know what they want. query transformations can include strategies like 1/ hypothetical document embeddings, which take a query, generate a hypothetical response, and then use both for embedding lookup, 2/ decomposing the original query into multiple sub-queries or questions, and 3/ iteratively evaluating the query for missing information and generating a response once all information is available.
Trade-offs: considering trade-offs between precision, recall, computation/cost to optimize the retrieval and generation process
Advanced chunking strategies: experiment with different chunking strategies, such as sentence window retrieval and auto-merging retrieval to improve precision and relevance; there’s a lot here?
Re-ranking: re-rank (reorder) the retrieved documents based on their relevance to the user’s query. you can also combine multiple retrieval techniques and reranking strategies to improve the overall performance.
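
As a rough illustration of combining keyword and semantic results (the hybrid search and re-ranking points above), here is a small sketch using reciprocal rank fusion. RRF is my pick for the example, one common way to merge ranked lists and not the only one; the function name and the two hit lists are hypothetical stand-ins for whatever your bm25 and vector retrievers return.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a keyword retriever (eg: bm25) and a vector retriever.
keyword_hits = ["d2", "d1", "d4"]
semantic_hits = ["d1", "d3", "d2"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
# prints: ['d1', 'd2', 'd3', 'd4']
```

Documents that both retrievers agree on float to the top, and the fusion only needs ranks, so you don’t have to normalize bm25 scores against cosine similarities.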

Advanced

Fine-tune the model and/or the embeddings: either continue training the model on a smaller, more specific dataset to optimize performance, or fine-tune the embeddings to better represent the relationships between data points. fine-tuning on domain-specific datasets can sometimes help the generator understand the context the retriever provides.
Customize embeddings using labeled training data: the approach involves creating a matrix that you can use to multiply your embeddings. the product of this multiplication is a ‘custom embedding’ that will better emphasize aspects of the text relevant to your use case.
Query routing: when you have more than one index or tool, route sub-queries to the appropriate index or tool/function call.
Multi-retrieval: combining the results from multiple retrieval (and generator?) agents to improve the overall quality and fidelity.
Contextual compression and filtering: apply compression techniques to reduce the size of the context while preserving its relevance, and use filtering to select the most relevant information for the model
Self-querying: use the model’s output as a query to retrieve more information, which can be combined with the initial response to generate a more truthful answer
Document hierarchies and knowledge graphs: use document hierarchies and knowledge graphs to improve the organization and retrieval of information. this could also include combining the strengths of knowledge graphs and vector dbs. I’ve also seen folks leveraging knowledge graphs to improve the interpretability/explainability.
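
For the "customize embeddings" item above, here is a toy sketch of what multiplying embeddings by a matrix does to similarity. Everything in it is an assumption for illustration: the 3-dimensional vectors, the hand-written matrix (in practice it would be fit on labeled similar/dissimilar pairs), and the helper names.

```python
import math


def apply_matrix(matrix, embedding):
    """Multiply an embedding (length n) by a matrix (m rows of length n)."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in matrix]


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional embeddings; real ones have hundreds or thousands of dims.
query = [1.0, 0.0, 1.0]
doc = [1.0, 1.0, 0.0]

# Hypothetical "learned" matrix: keeps dimension 0, damps dimensions 1 and 2.
# Hand-written here; normally it would come from training on labeled pairs.
custom_matrix = [
    [1.0, 0.0, 0.0],
    [0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1],
]

before = cosine_similarity(query, doc)
after = cosine_similarity(
    apply_matrix(custom_matrix, query),
    apply_matrix(custom_matrix, doc),
)
print(f"before={before:.3f} after={after:.3f}")
# prints: before=0.500 after=0.990
```

The matrix downweights the dimensions where the two vectors disagree, so the pair looks far more similar afterwards; the same multiplication has to be applied to both query and document embeddings at lookup time.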

Let’s go build

tags: RAG - AI - Large Language Models - Applied AI