Retrieval Augmented Generation vs Query Augmented Retrieval. What’s best when trust matters?

A comparison of AI patterns for retrieving text from knowledge sources.

Retrieval Augmented Generation (RAG) has taken off recently as it allows users to question their own information in a GPT-like way. A lesser-known alternative is Query Augmented Retrieval (QAR). This blog compares these two AI patterns for retrieving text from knowledge sources, outlines the advantages, disadvantages, and challenges associated with each pattern, and explains why QAR is a better choice for trust-critical scenarios.

What is Retrieval Augmented Generation?

RAG is an AI generation pattern that lets you use external knowledge sources to enrich the generated text. RAG has become a hugely popular way of getting Generative AI technology like GPT4 to achieve ‘AI with your own data’ or ‘ask a question of my own information’.

The basic idea is to use a retrieval (aka search) module to query a set of documents or facts based on the user’s input (e.g., a question) and then use a generation module (e.g., GPT4) to generate an answer/output using both the input and the retrieved information.

Retrieval augmented generation can be applied to various generation tasks, such as text summarisation and question answering, with the benefit that the generated text is based on the retrieved documents/text. This is also called “grounding” the AI model with the data you give it.
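The retrieve-then-generate flow can be sketched in a few lines. This is a toy illustration, not a production implementation: the keyword scorer stands in for a real search index, and `generate()` stands in for a real LLM call such as GPT4 (it just assembles the prompt a real system would send).

```python
# Minimal sketch of the RAG pattern: retrieve relevant snippets,
# then hand them to a generator along with the question.

def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Score each document by keyword overlap with the question (toy search)."""
    q_words = set(question.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:top_k]

def generate(question: str, context: list[str]) -> str:
    """Stand-in for an LLM call; a real system would send this prompt to GPT4."""
    return ("Answer using only this context:\n"
            + "\n".join(context)
            + f"\nQuestion: {question}")

docs = [
    "Our Auckland store opens at 9am and closes at 6pm.",
    "Returns are accepted within 30 days with a receipt.",
]
question = "What time does the Auckland store close?"
snippets = retrieve(question, docs)
answer = generate(question, snippets)
```

The key property is that generation is grounded: the prompt constrains the model to the retrieved snippets rather than its general training data.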

What are the challenges with RAG?

Getting the RAG pattern right poses some challenges, such as:

  1. How to efficiently and accurately retrieve the most useful information
  2. How to align and fuse the retrieved information with the input
  3. Ensuring the quality and consistency of the generated text
  4. Protecting against prompt engineering attacks and keeping the generation on task
  5. Speed and cost of responses.

In more detail:

  1. Efficient and accurate retrieval of the most useful information. This is effectively a search problem – i.e., how do you find the most appropriate document(s) or text snippets for the question someone asks? There are different ways of indexing source text, from vectorisation (which is chosen by many) to classic text indexing techniques. And different ways of slicing large documents into smaller indexed pieces. Finding the right combination of techniques for the specific scenario is important; for example, different vectorisation algorithms suit different scenarios.
  2. Aligning and fusing the retrieved information with the input. This is the problem of generating text from the found information that includes the answer to the question coherently and accurately. This somewhat depends on how you create the right size text snippets to use as the source material to generate an answer. Given the correct text snippets, technology like GPT4 is very good at synthesising the information into the desired format/framing. Newer LLMs can handle larger amounts of input text, so they are better able to work with bigger sections of documents.
  3. Ensuring quality and consistency of the generated text. The challenge with consistency and accuracy of responses is that Generative AI isn’t deterministic, meaning that what gets generated in response can be different every time. This isn’t necessarily a problem unless getting the answer wrong has consequences. The common workaround is to add a disclaimer saying, “AI-generated answers may include mistakes or incorrect information”, but that is unlikely to be good enough for most businesses.
  4. Protecting against prompt engineering and sticking to task. Any time generative AI is responsible for generating responses, it can be misled or tricked into generating information that is not in the source material – or outside the area of expertise. This is sometimes referred to as jailbreaking and is almost impossible to prevent when using RAG (see, e.g., “Jailbreaking Large Language Models: If You Torture the Model Long Enough, It Will Confess!” by Moshe Sipper, Ph.D., in The Generator on Medium).
  5. Speed and cost of responses. Because the response is generated on the fly every time, heavy computation is involved to retrieve the relevant source data and then process it through generative AI to create the response. AI models are typically charged based on tokens processed, and the cost to serve each query can add up, as a large set of tokens can be sent to the AI model to complete the processing. Unless smart caching is enabled, both speed of response and cost can become a challenge at high volumes.
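The retrieval challenge (point 1 above) often comes down to comparing a query vector against pre-computed snippet vectors. The sketch below uses a toy bag-of-words vectoriser and cosine similarity; the vocabulary and `embed()` function are illustrative stand-ins for a trained embedding model, which is what production vector search actually uses.

```python
# Toy vector retrieval: embed query and snippets, rank by cosine similarity.
import math

VOCAB = ["auckland", "store", "close", "open", "returns", "receipt"]

def embed(text: str) -> list[float]:
    """Bag-of-words embedding over a fixed toy vocabulary."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

snippets = [
    "The Auckland store is open until 6pm.",
    "Returns need a receipt.",
]
query_vec = embed("when does the auckland store close")
best = max(snippets, key=lambda s: cosine(query_vec, embed(s)))
```

The same shape applies at scale: only the embedding model and the index (e.g., an approximate nearest-neighbour structure) change.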

What is Query Augmented Retrieval?

QAR is an alternative pattern to RAG. QAR is a better pattern choice when trust is critical – for instance, in a regulated or high-trust environment. QAR also addresses many of the other challenges RAG faces.

Instead of generating responses on the fly, with the weaknesses inherent in that approach, QAR returns only approved information in response to the query or question.

Generative AI is still used in QAR, but only to better understand or clarify the query. What does this mean? Advanced LLMs like GPT4 are good at keeping the context of a conversation. For example, they can handle follow-up questions on a topic – “where is your Auckland store?” can be followed up with “what time do you close?”, and the query can be augmented with AI to include the additional context: “what time does your Auckland store close?”. The LLM can also be told about its working context via a system prompt.

By augmenting the query, there is a higher chance the right information can be returned to the user.
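In practice the augmentation step is an LLM call that rewrites the follow-up into a standalone query. The sketch below only shows the shape of that call: `build_rewrite_prompt()` is an illustrative helper (not a named library API), and the rewritten query itself would come back from the LLM.

```python
# Sketch of the query-augmentation step in QAR: assemble the prompt an LLM
# would receive to rewrite a follow-up question using conversation history.

def build_rewrite_prompt(history: list[str], followup: str) -> str:
    """Build a rewrite prompt from conversation history plus the latest question."""
    system = ("Rewrite the user's latest question so it is self-contained, "
              "using the conversation history for context.")
    convo = "\n".join(history)
    return f"{system}\n\nHistory:\n{convo}\n\nLatest question: {followup}"

history = [
    "User: Where is your Auckland store?",
    "Bot: Our Auckland store is on Queen Street.",
]
prompt = build_rewrite_prompt(history, "What time do you close?")
# An LLM given this prompt should return something like:
# "What time does your Auckland store close?"
```

Note the LLM here never generates the answer shown to the user – it only reshapes the query before retrieval.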

QAR does not suffer from the speed/performance penalty associated with generating the answer on the fly. It also avoids the bulk of the cost of LLM processing as dealing with the query part is a much smaller parcel of text (and tokens) than processing the document or search results through the LLM. In many cases, the query does not need to be augmented with AI.

This results in much lower operating costs and better performance – especially in scenarios where hundreds to thousands of people use the system.
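The cost argument can be made concrete with some back-of-envelope arithmetic. All numbers below are made-up round figures for illustration only – the price is hypothetical, not a real vendor rate – but they show why sending whole retrieved documents through an LLM on every query (RAG) costs far more than occasionally rewriting a short query (QAR).

```python
# Hypothetical per-query token cost comparison (illustrative numbers only).

PRICE_PER_1K_TOKENS = 0.01    # assumed price, not a real vendor rate

rag_tokens_per_query = 3000   # question + retrieved snippets + generated answer
qar_tokens_per_query = 100    # just the short query rewrite, when needed at all

queries = 10_000
rag_cost = queries * rag_tokens_per_query / 1000 * PRICE_PER_1K_TOKENS
qar_cost = queries * qar_tokens_per_query / 1000 * PRICE_PER_1K_TOKENS
# Under these assumptions: rag_cost = 300.0, qar_cost = 10.0
```

The 30x gap here tracks the token counts directly; in QAR many queries skip the LLM entirely, which widens the gap further.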

What are the challenges with QAR?

Getting the QAR pattern right poses some challenges, such as:

  1. Getting the knowledge/answers in a suitable format to retrieve
  2. Having a good search capability to find the appropriate response to the query
  3. Understanding where knowledge is missing and dealing with escalations.

In more detail:

  1. Using suitable retrieval formats. To make QAR work well, it is sometimes necessary to turn existing knowledge into an alternate format – like taking a long help guide and turning it into a question-and-answer pair format. Thankfully, it is possible to use AI to do this task. (Check out our FAQ Wizard - it can do this in minutes).
  2. Search capability. This is a classic search problem and, in many ways, has been solved: full-text indexing, natural language search, and vector-based search are among the tried and tested techniques available.
  3. Dealing with information gaps. If someone has a query that there is no answer for, then you need to be able to do a couple of things. Firstly, you need a way to fill the gap – perhaps with some web search results – and to report to admins that a gap exists. Secondly, you may want to offer some sort of escalation, like a contact form or live chat.
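The flow described above – match against approved Q&A pairs, or escalate when nothing matches well enough – can be sketched as follows. The overlap scorer and threshold are illustrative; a real system would use the search techniques from point 2, but the trust property is the same: only a curated answer or an escalation is ever returned, never generated text.

```python
# Sketch of QAR answering: return an approved answer or escalate.

FAQ = {
    "what time does the auckland store close":
        "The Auckland store closes at 6pm.",
    "how do i return an item":
        "Returns are accepted within 30 days with a receipt.",
}

def answer(query: str, threshold: float = 0.5) -> str:
    """Match the query against approved Q&A pairs; escalate below threshold."""
    q_words = set(query.lower().split())
    best_q, best_score = None, 0.0
    for question in FAQ:
        f_words = set(question.split())
        score = len(q_words & f_words) / max(len(f_words), 1)
        if score > best_score:
            best_q, best_score = question, score
    if best_q and best_score >= threshold:
        return FAQ[best_q]  # approved answer only; nothing is generated
    return "ESCALATE: no approved answer found - offer contact form or live chat"
```

A query with no approved answer (e.g., "do you sell gift cards") falls through to the escalation path, which is also the signal admins need to fill the gap.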

Self-service tools built on QAR

If trustworthy answers, free from hallucination risks, are important to you, then FAQ Bot is a great fit because it uses QAR. It is also faster and more cost-effective than many other tools as Gen AI is only used when necessary.

FAQ Bot also offers escalations like live chat and contact forms and can integrate with APIs for anything from parcel tracking to product catalogues. If you’d like to test how we use QAR to give the right help at the right time, check out a free 30-day trial here.