From AI Knowledge Base to RAG

When building AI applications, a common problem is that the AI has never seen the data involved in the task. For enterprises, a model cannot know the details of every customer; for individuals, it knows little about personal and private information. However capable a model is (even an ideal world model is no exception), without the data for a specific task it loses the ability to analyze a specific problem on its own terms.

1 What is RAG

Retrieval-Augmented Generation (RAG) improves the accuracy and reliability of generative AI models by retrieving external information. If a large language model (LLM) completing a task is compared to taking an exam, then an LLM with RAG is sitting an open-book exam, while one without RAG is sitting a closed-book exam. In short, RAG is a technique that helps LLMs retrieve information to improve their generated output.

RAG was first proposed by Patrick Lewis and his co-authors in a 2020 paper written at Facebook AI Research. Lewis later joined Cohere, which currently provides API services including well-performing Embedding and Rerank models.

2 Why RAG is needed

RAG emerged to address some shortcomings of large language models in applications. The most prominent is hallucination: a model's output contradicts the facts or simply fabricates an answer. In addition, the data used to train an LLM may be outdated, leaving it ignorant of anything recent.

RAG lets LLMs access up-to-date or customized information and lets users verify the model's sources to check its accuracy. The data RAG retrieves can be public (such as search engines) or private (such as company documents or personal sensitive data), which gives RAG broad application prospects. RAG is already in wide use: Nvidia's NeMo Retriever reads internal company information, and Moonshot AI's Kimi Chat uses search engines to answer questions.

[Figure: Jensen Huang introducing NeMo Retriever at GTC 2024]

3 Knowledge Base Built Around RAG

AI knowledge bases are an important tool for letting AI tailor its answers to the case at hand. A knowledge base that helps AI complete tasks can currently be built in three ways:

  • Prompt Engineering
  • Fine Tuning
  • Embedding

Prompt engineering builds the knowledge base directly in the prompt, putting all of the information into the model's input, as sketched below. This works at small scale, but the token limits of current models generally cannot accommodate a full knowledge base. Even when context windows eventually grow large enough to hold one, a dedicated knowledge base will still have value, because input length affects model performance (at least for today's models); see Needle In A Haystack - Pressure Testing LLMs for details.
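As an illustration, here is a minimal sketch of this approach, assuming an OpenAI-compatible chat API; the facts and the model name are hypothetical placeholders:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A tiny "knowledge base" placed directly in the prompt (hypothetical facts)
facts = [
    "Order #1024 was shipped on 2024-03-01.",
    "The refund window is 30 days after delivery.",
]

prompt = (
    "Answer using only the facts below.\n\n"
    "Facts:\n" + "\n".join(facts) + "\n\n"
    "Question: When was order #1024 shipped?"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)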

Fine-tuning is popular in academia: a pre-trained model is further trained on task-specific data. This approach actually suits building an industry-wide model, such as a legal LLM or a medical LLM. On one hand, fine-tuning requires a non-trivial amount of training data and is costly; on the other hand, it is inflexible, e.g., it cannot be quickly adjusted to reflect one or two new documents. Fine-tuning essentially learns and generalizes from the training data rather than memorizing its content, so it is better at strengthening ability in a field than at recalling specifics.

As a result, the mainstream way to build a knowledge base today is the Embedding approach, and a knowledge base in this form only becomes effective when combined with RAG.

4 Basic Components of RAG

A classic, basic RAG pipeline is shown in the figure below.

[Figure: Basic components of RAG]
The RAG system mainly includes three stages: indexing, retrieval, and generation.

4.1 Embedding

In this stage, users first upload documents, and the system embeds them and stores the results in a vector database. Embedding maps semantically similar texts to vectors that are close together, so this step is commonly called vectorization.
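As a sketch of this stage, the snippet below vectorizes two toy documents with sentence-transformers; the model choice and the documents are illustrative assumptions, not a prescribed setup:

from sentence_transformers import SentenceTransformer

# Load an off-the-shelf embedding model (an illustrative choice)
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "RAG retrieves external documents before generation.",
    "Fine-tuning updates model weights on task data.",
]

# Semantically similar texts map to nearby vectors
doc_vectors = model.encode(docs, normalize_embeddings=True)
print(doc_vectors.shape)  # (2, 384) for this model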

4.2 Retrieval

When a user asks the LLM a question, the question is embedded and matched against the vector database, returning a set of candidate passages. This is the first retrieval stage.
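A minimal first-stage retrieval sketch, assuming FAISS as the vector index (the setup from the embedding sketch is repeated so the snippet runs on its own):

import faiss
from sentence_transformers import SentenceTransformer

# Same illustrative setup as the embedding sketch above
model = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "RAG retrieves external documents before generation.",
    "Fine-tuning updates model weights on task data.",
]
doc_vectors = model.encode(docs, normalize_embeddings=True)

# Inner product over normalized vectors is cosine similarity
index = faiss.IndexFlatIP(doc_vectors.shape[1])
index.add(doc_vectors)

# Embed the question with the same model and query the index
query_vector = model.encode(["How does RAG work?"], normalize_embeddings=True)
scores, ids = index.search(query_vector, 2)  # top-2 candidates
for i, s in zip(ids[0], scores[0]):
    print(docs[i], s)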

4.3 Rerank

The passages returned directly from the vector database are often imperfect and may not match the query well, so a second retrieval stage, Rerank, is needed. In this stage, a Rerank model reorders the candidates from the previous stage by relevance. After reranking, the Top K results are passed on to the generation stage.
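One common way to implement this stage is a cross-encoder; the sketch below uses a public sentence-transformers model as an illustrative choice, not the only option:

from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, candidate) pair jointly,
# which is slower but more accurate than the first-stage vector match
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How does RAG work?"
candidates = [
    "Fine-tuning updates model weights on task data.",
    "RAG retrieves external documents before generation.",
]

scores = reranker.predict([(query, c) for c in candidates])

# Reorder by relevance and keep the Top K for the generation stage
ranked = sorted(zip(scores, candidates), reverse=True)
top_k = [c for _, c in ranked[:1]]
print(top_k)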

5 Implementing RAG in 5 Lines of Code

An assignment statement counts as one line

from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

# Build the retriever (a dummy index keeps the download small for testing)
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq",
    index_name="compressed",   # type of index over the retrieval corpus
    use_dummy_dataset=True,    # use a dummy dataset for testing
    dataset="wiki_dpr",        # dataset used for retrieval
)

# Load the pre-trained tokenizer and model
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

# Tokenize the question
input_ids = tokenizer("What is the capital of France?", return_tensors="pt").input_ids

# Retrieve supporting passages and generate an answer
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

RagTokenizer tokenizes the text, RagTokenForGeneration is the generator part of the RAG model, and RagRetriever handles retrieval. RagTokenizer.from_pretrained("facebook/rag-token-nq") loads a pre-trained tokenizer that converts text into a format the model can understand (i.e., tokenization). RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever) loads a pre-trained RAG model; facebook/rag-token-nq identifies the model and tokenizer, which were pre-trained on the Natural Questions dataset.

6 Open-source RAG Implementations

Dify is an LLM application development platform, with over 100,000 applications built on Dify.AI. It combines the ideas of Backend as a Service and LLMOps and covers the core technology stack needed to build generative-AI-native applications, including a built-in RAG engine. With Dify, you can deploy capabilities similar to the Assistants API and GPTs on top of any model. The project is maintained by a company based in Suzhou, which also offers a SaaS service.

Langchain-Chatchat is an open-source, offline-deployable retrieval-augmented generation (RAG) knowledge-base project built on large language models such as ChatGLM and application frameworks such as Langchain. It initially supported only the ChatGLM model, but has since added support for many open-source and online models.

The functional comparison of the two is shown in the table below:

|                         | Dify-api                                                   | ChatChat                                                      |
| ----------------------- | ---------------------------------------------------------- | ------------------------------------------------------------- |
| Peripheral capabilities | General document reading                                   | General document reading, image OCR                           |
| Data sources            | Document text content, vector database                     | Search engine, vector database                                 |
| Model support           | Online Embedding model, online Rerank model, online LLM    | Online Embedding model, offline Embedding model, offline LLM   |
| Advanced features       | ES hybrid retrieval                                        | None                                                           |
| Advanced RAG            | Not supported                                              | Not supported                                                  |

In fact, there are some features that current open-source projects do not fully cover, such as:

  • Multimodal Capabilities
  • Traditional Relational Database Support
  • Multi-database Joint/Cross-database Information Retrieval
  • Citation Function
  • Advanced RAG
  • Evaluation Metrics

7 References

  1. gkamradt/LLMTest_NeedleInAHaystack: Doing simple retrieval from LLM models at various context lengths to measure accuracy (github.com)
  2. What is retrieval-augmented generation? | IBM Research Blog
  3. Retrieval (langchain.com)
  4. langgenius/dify (github.com)
  5. Langchain-Chatchat (github.com)