From AI Knowledge Base to RAG
When building AI applications, a common problem is that the AI has never seen the data a task depends on. For an enterprise, the model cannot know every customer's information; for an individual, it knows little about personal or private data. However capable a model is (even an ideal world model is no exception), without the data for a specific task it loses the ability to "analyze specific problems specifically."
1 What is RAG
Retrieval-Augmented Generation (RAG) improves the accuracy and reliability of generative AI models by retrieving external information. If the process of a large language model (LLM) completing a task is compared to an exam, an LLM with RAG is taking an open-book exam, while one without RAG is taking a closed-book exam. In short, RAG is a technique that helps LLMs retrieve relevant information to improve their generated results.
RAG was first proposed by Patrick Lewis and others in the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." Lewis now works at Cohere, which currently provides API services including well-performing Embedding and Rerank models.
2 Why RAG is needed
RAG emerged to address several shortcomings of large language models in real applications. The most prominent is hallucination: a model's output contradicts the facts or fabricates answers outright. In addition, the data used to train an LLM may be outdated, so the model knows nothing about newer information.
RAG lets LLMs access the latest or customized information, and lets users verify the sources the model relied on to check its accuracy. The data retrieved by RAG can be public (such as search engines) or private (such as company information or personal sensitive data), which gives RAG broad application prospects. RAG is already widely used: Nvidia's NeMo Retriever reads internal company information, and Moonshot AI's Kimi Chat uses a search engine to answer questions.
Jensen Huang introducing NeMo Retriever at GTC 2024
3 Knowledge Bases Built Around RAG
An AI knowledge base is an important tool that gives AI the context to handle each task on its own terms. Currently, a knowledge base that helps AI complete tasks can be built in the following three ways:
- Prompt Engineering
- Fine Tuning
- Embedding
Prompt engineering means building the knowledge base directly in the prompt: all of the information is placed into the prompt itself, as in the sketch below. This works at a small scale, but the number of tokens current models can accept generally cannot accommodate a full knowledge base. In fact, even when AI's input window one day grows large enough to hold a general knowledge base, building a knowledge base will still have value, because the length of the input affects model performance (at least for today's models); see Needle In A Haystack - Pressure Testing LLMs for details.
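A minimal sketch of this "knowledge base in the prompt" idea, assuming the OpenAI Python SDK as the LLM client; the model name and sample facts are placeholders, not from the article:

```python
# All reference material is concatenated directly into the prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical knowledge base entries stuffed into the prompt.
knowledge = "\n".join([
    "Order 1024 was shipped on 2024-03-01.",
    "The return window is 30 days after delivery.",
])

question = "Can the customer of order 1024 still return it?"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": f"Answer using only this knowledge base:\n{knowledge}"},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```

The obvious limitation is that every fact must fit into (and be paid for in) the context window on every request.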
Fine-tuning is popular in academia: a pre-trained model is further trained on task-specific data. This approach is actually better suited to building an industry-wide large model, such as a legal large model or a medical large model. On one hand, fine-tuning requires a non-trivial amount of training data and is costly; on the other hand, it is not flexible enough for, say, timely adjustments based on one or two documents. Fine-tuning effectively learns and generalizes from the training data rather than memorizing its content; it is more about strengthening ability in a given domain.
So the most mainstream way to build a knowledge base today is the Embedding approach, and a knowledge base in this form only becomes effective when combined with RAG.
4 Basic Components of RAG
A classic, basic RAG composition is shown in the figure below.
The RAG system mainly includes three stages: indexing, retrieval, and generation.
4.1 Embedding
In this stage, users first upload documents; the system embeds them and stores the result in a vector database. Embedding converts semantically similar texts into vectors that are close to each other, so this process is commonly called vectorization.
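A minimal indexing sketch, assuming the sentence-transformers library; the model name and sample chunks are illustrative, and a real system would persist the vectors in a vector database such as FAISS, Milvus, or Chroma:

```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

# Document chunks produced by a splitting step (samples for illustration).
chunks = [
    "RAG retrieves external documents before generation.",
    "Rerank models reorder retrieved passages by relevance.",
    "Vector databases store embeddings for similarity search.",
]

# shape: (num_chunks, embedding_dim); normalized so cosine similarity
# reduces to a dot product at query time.
index = embedder.encode(chunks, normalize_embeddings=True)
```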
4.2 Retrieval
When a user asks the LLM a question, the question is embedded and then matched against the vector database, retrieving a set of candidate content. This is the first stage of retrieval.
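Continuing the indexing sketch above (reusing `embedder`, `chunks`, and `index`), first-stage retrieval can be as simple as a nearest-neighbor search:

```python
# Embed the question with the same model used at indexing time.
query = "What does a rerank model do?"
query_vec = embedder.encode([query], normalize_embeddings=True)[0]

# With normalized vectors, cosine similarity is a dot product.
scores = index @ query_vec
top_k = scores.argsort()[::-1][:2]
candidates = [chunks[i] for i in top_k]
print(candidates)
```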
4.3 Rerank
The content retrieved directly from the vector database may not be ideal, and the results often do not match the query well, so a second retrieval stage is needed: Rerank. In this stage, a Rerank model reorders the candidates from the previous stage and outputs them by relevance. After reranking, the Top K results are passed to the subsequent generation stage.
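A rerank sketch that continues from the retrieval example (reusing `query` and `candidates`), assuming a local cross-encoder from sentence-transformers; the model name is a placeholder, and hosted Rerank APIs (e.g., Cohere's) play the same role:

```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, passage) pair jointly, which is
# usually more accurate than the first-stage vector match.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, passage) for passage in candidates]
rerank_scores = reranker.predict(pairs)

# Keep the Top K passages for the generation stage (Top 1 here).
top_passages = [p for _, p in sorted(zip(rerank_scores, candidates), reverse=True)][:1]
print(top_passages)
```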
5 Implementing RAG in 5 Lines of Code
(An assignment statement counts as one line.)
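The original snippet did not survive formatting; judging from the explanation that follows, it most likely followed the standard Hugging Face transformers example for `facebook/rag-token-nq`, roughly:

```python
from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

# Load the tokenizer, retriever, and generator (dummy index to avoid
# downloading the full Wikipedia index).
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

# Tokenize a question, retrieve-and-generate, then decode the answer.
input_ids = tokenizer("who holds the record in 100m freestyle", return_tensors="pt").input_ids
print(tokenizer.batch_decode(model.generate(input_ids=input_ids), skip_special_tokens=True)[0])
```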
`RagTokenizer` is used for tokenizing text, `RagTokenForGeneration` is the generator part of the RAG model, and `RagRetriever` is responsible for retrieval. `RagTokenizer.from_pretrained("facebook/rag-token-nq")` loads a pre-trained tokenizer that converts text into a format the model can understand (i.e., tokenization). `RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)` loads a pre-trained RAG model. `facebook/rag-token-nq` is the name of the model and tokenizer, which are pre-trained on the Natural Questions dataset.
6 Open-source RAG Implementations
Dify is an LLM application development platform, with over 100,000 applications built on Dify.AI. It combines the ideas of Backend as a Service and LLMOps, covering the core technology stack needed to build generative-AI-native applications, including a built-in RAG engine. With Dify, you can deploy capabilities similar to the Assistants API and GPTs on top of any model. The project is maintained by a company in Suzhou, which also offers a SaaS service.
Langchain-Chatchat is an open-source, offline-deployable retrieval-augmented generation (RAG) knowledge base project built on large language models such as ChatGLM and application frameworks such as Langchain. It initially supported only the ChatGLM model, but later added support for many open-source and online models.
The functional comparison of the two is shown in the table below:
| | Dify-api | ChatChat |
| --- | --- | --- |
| Peripheral Capabilities | General Document Reading | General Document, Image OCR |
| Data Sources | Document Text Content, Vector Database | Search Engine, Vector Database |
| Model Support | Online Embedding Model, Online Rerank Model, Online LLM | Online Embedding Model, Offline Embedding Model, Offline LLM |
| Advanced Features | ES Hybrid Retrieval | None |
| Advanced RAG | Not Supported | Not Supported |
In fact, there are some features that current open-source projects do not fully cover, such as:
- Multimodal Capabilities
- Traditional Relational Database Support
- Multi-database Joint/Cross-database Information Retrieval
- Citation Function
- Advanced RAG
- Evaluation Metrics