Vector Database Comparison: Weaviate, Milvus, and Qdrant
The success of RAG systems largely depends on their ability to efficiently acquire and process large amounts of information, and vector databases play an irreplaceable role here, forming the core of RAG systems. Vector databases are purpose-built to store and manage high-dimensional vector data: embeddings of text, images, audio, and even video can be stored and searched as vectors (this will be elaborated on later in the article). The ultimate effectiveness of a RAG system depends on the performance of this underlying vector database.
Among the many vector databases and vector libraries, each has its own characteristics, and choosing one suitable for your application scenario requires evaluation. This article will delve into the key factors to consider when choosing a vector database for RAG, including open-source availability, CRUD (Create, Read, Update, Delete) support, distributed architecture, replica support, scalability, performance, and continuous maintenance.
Currently, databases specifically designed for vectors, such as Weaviate, Milvus, Qdrant, Vespa, and Pinecone, are highly regarded in the industry, and some earlier vector libraries also offer this functionality. This article will also compare vector libraries such as FAISS, HNSWLib, and ANNOY, as well as SQL databases with vector support, such as pgvector and Supabase.
Image Semantic Search Implemented with Milvus

1 Vector Libraries (FAISS, HNSWLib, ANNOY)
The difference between vector databases and vector libraries is that vector libraries are mainly used for storing static data, where indexed data is immutable. This is because vector libraries store only the vector embeddings, not the associated objects from which those embeddings were generated. Consequently, unlike vector databases, vector libraries do not support CRUD (Create, Read, Update, Delete) operations, which makes adding new documents to an existing index in libraries like FAISS or ANNOY difficult. HNSWLib is an exception: it offers CRUD functionality and uniquely supports concurrent reads and writes. However, it still suffers from the general limitations of vector libraries, such as the lack of a deployment ecosystem, replication, and fault tolerance.
2 Full-Text Search Databases (ElasticSearch, OpenSearch)
Full-text search databases (e.g., ElasticSearch and OpenSearch) can support comprehensive text retrieval and advanced analysis functions. However, when it comes to performing vector similarity searches and handling high-dimensional data, they are not as strong as specialized vector databases. These databases often need to be used in conjunction with other tools to achieve semantic search, as they mainly rely on inverted indexes rather than vector indexes. According to Qdrant’s test results, Elasticsearch lags in performance compared to vector databases like Weaviate, Milvus, and Qdrant.
3 SQL Databases Supporting Vectors (pgvector, Supabase, StarRocks)
SQL databases like pgvector provide a way to integrate vector data into existing data storage systems through their vector support extensions, but they also have some obvious drawbacks compared to dedicated vector databases.
The most obvious drawback is the mismatch between the relational model of traditional SQL databases and the nature of unstructured vector data. This mismatch leads to inefficient operations involving vector similarity searches, and these databases do not perform well in building indexes and handling large amounts of vector data, as detailed in the ANN benchmarks. Additionally, the upper limit of vector dimensions supported by pgvector (2000 dimensions) is lower compared to dedicated vector databases like Weaviate, which can handle up to 65535-dimensional vector data. In terms of scalability and efficiency, dedicated vector databases also have more advantages. SQL database extensions supporting vectors, such as pgvector, are more suitable for scenarios where the amount of vector data is small (fewer than 100,000 vectors) and vector data is only a supplementary function of the application. Conversely, if vector data is the core of the application or if there are high requirements for scalability, dedicated vector databases would be a more suitable choice.
As for StarRocks, it is another system running on the SQL framework, optimized for online analytical processing (OLAP) and online transaction processing (OLTP) scenarios, but not specifically optimized for vector similarity searches.
4 NoSQL Databases Supporting Vectors (Redis, MongoDB)
The newly added vector support features in NoSQL databases are still in the early stages and have not been fully tested and verified. Taking Redis Vector Similarity Search (VSS) as an example: the feature was only released in April 2022, less than two years ago. Although Redis can serve as a multipurpose database, it was not designed and optimized specifically for vector similarity search.
5 Specialized Vector Databases (Pinecone, Milvus, Weaviate, Qdrant, Vald, Chroma, Vespa, Vearch)
Specialized vector databases inherently support various vector operations, such as dot product, cosine similarity, etc. These databases are designed to handle high-dimensional data, capable of handling a large number of query requests, and can quickly complete similarity searches between vectors. To achieve these goals, they employ various indexing strategies, usually based on approximate nearest neighbor (ANN) algorithms. These algorithms need to balance efficiency, storage space usage, and search accuracy. For example, the FLAT index is a vector index that does not use any optimization or approximation techniques, meaning it can achieve 100% recall and accuracy, but it is slower and less efficient than other types of vector indexes; relatively speaking, the IVF_FLAT index sacrifices some accuracy for faster search speed; the HNSW index provides a compromise between accuracy and search speed.
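To make these trade-offs concrete, here is a pure-Python sketch (not any database's actual implementation; the dataset size, dimensionality, and the `nlist`/`nprobe` parameters are illustrative) contrasting an exhaustive FLAT scan, which guarantees 100% recall, with an IVF-style search that probes only a few clusters and may miss some true neighbors:

```python
import math
import random

random.seed(0)

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy dataset: 500 random 8-dimensional vectors.
data = [[random.random() for _ in range(8)] for _ in range(500)]
query = [random.random() for _ in range(8)]

# FLAT: exhaustive scan -- 100% recall, but O(n) distance computations.
def flat_search(query, data, k=5):
    return sorted(range(len(data)), key=lambda i: l2(query, data[i]))[:k]

# IVF-style sketch: partition vectors into nlist buckets around random
# centroids, then scan only the nprobe buckets closest to the query.
def ivf_flat_search(query, data, k=5, nlist=10, nprobe=3):
    centroids = random.sample(data, nlist)
    buckets = [[] for _ in range(nlist)]
    for i, v in enumerate(data):
        buckets[min(range(nlist), key=lambda c: l2(v, centroids[c]))].append(i)
    probe = sorted(range(nlist), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in buckets[c]]
    return sorted(candidates, key=lambda i: l2(query, data[i]))[:k]

exact = flat_search(query, data)
approx = ivf_flat_search(query, data)
# IVF scans only a fraction of the data, so recall may be below 100%.
recall = len(set(exact) & set(approx)) / len(exact)
```

Real IVF implementations learn centroids with k-means rather than sampling them at random; the structure of the trade-off (scan fewer candidates, accept lower recall) is the same.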
Pinecone is a closed-source vector database maintained by a professional team, with limited scalability features in its free version. Chroma is a lightweight, developer-oriented embedding database, but compared to other mainstream vector databases, comprehensive performance benchmark data for it is relatively scarce. Since Chroma's 0.4 version uses SQLite for document storage, it may not be as scalable and efficient as storage engines designed specifically for vector data.
Vearch and Vald integrate poorly with LangChain, which is a significant drawback for development. Compared to competitors like Milvus, their developer communities are smaller and their open-source maintenance is less active.
Therefore, for RAG, Weaviate, Milvus, Qdrant, and Vespa may be the best choices. In theory, the most suitable system should be selected based on performance and scalability benchmarks (see ANN Benchmarks below). However, there are also some system design and feature characteristics that need to be compared. The table below provides a visual comparison from these aspects.
| Database | Qdrant | Weaviate | Milvus |
|---|---|---|---|
| Open-source and self-hostable | ✅ | ✅ | ✅ |
| Open-source license | Apache-2.0 | BSD | Apache-2.0 |
| Development language | Rust | Go | Go, C++ |
| GitHub stars | 17k | 9.2k | 26.2k |
| First release date | 2021 | 2019 | 2019 |
| SDK | Python, JS, Go, Java, .Net, Rust | Python, JS, Java, Go | Python, Java, JS, Go |
| Hosted cloud service | ✅ | ✅ | ✅ |
| Built-in text embedding | ✅ FastEmbed | ✅ | ❌ |
| Hybrid retrieval | ❌ | ✅ RRF + RSF | ✅ in-table multi-vector hybrid |
| Metadata filtering | ✅ | ✅ | ✅ |
| BM25 support | ❌ | ✅ | ✅ |
| Text search | ✅ | ✅ | ❌ |
| Single-point multi-vector | ✅ | ✅ | |
| Tensor search | ❌ | ❌ | ❌ |
| LangChain integration | ✅ | ✅ | ✅ |
| LlamaIndex integration | ✅ | ✅ | ✅ |
| Geospatial search | ✅ | ✅ | ❌ |
| Multi-tenant support | ✅ via collections/metadata | ✅ | |
| Metadata and document size limit | Unlimited | | |
| Maximum dimension | Unlimited | 65535 | 32768 |
| Index types | HNSW | HNSW | ANNOY, FAISS, HNSW, ScaNN, … |
| Streaming index | ❌ | | |
| Sparse vector support | ❌ | ❌ | ❌ |
| Temporary index support (no server required) | ✅ | ❌ | |
| Sharding | | | |
| Price | | | |
| Facets (aggregation with counts) | ❌ | ✅ | |
| Built-in image embedding | | ✅ | |
| Recommendation API | ✅ | | |
| Personalization | | | |
| User events | | | |
| Call built-in LLM for RAG | | ✅ Generative Search | |
| Database | Qdrant | Weaviate | Milvus |
|---|---|---|---|
| Subjective advantages | 1. Can store multiple types of vectors (images, text, etc.) in one collection<br>2. Very low resource usage | 1. Relatively good performance<br>2. Built-in embedding support<br>3. Text search support<br>4. GraphQL API<br>5. S3 backup support | 1. Officially supported visual admin interface<br>2. High search accuracy<br>3. Rich SDKs<br>4. GPU acceleration |
In summary, Qdrant has particularly low overhead, Weaviate supports a combination of vector search, object storage, and inverted index, and Milvus has the strongest performance and the most features.
6 Comparison of Search Methods in Vector Databases
| | Milvus | Weaviate | Qdrant |
|---|---|---|---|
| Unique search methods | Multi-vector search | BM25 keyword search + hybrid search | Keyword filtering applied to vector search |
6.1 Milvus
Milvus supports two types of searches, depending on the number of vector fields in the collection: single-vector search and multi-vector search.
Single-vector search uses the search() method, comparing the query vector with existing vectors in the collection, returning the IDs of the most similar entities and their distances, and optionally returning the vector values and metadata of the results.
Multi-vector search applies to collections with two or more vector fields and is executed through the hybrid_search() method, which issues multiple approximate nearest neighbor (ANN) search requests and re-ranks the combined results to return the most relevant matches. (Supported only in version 2.4.x and later, with at most 10 vector fields per search.)
Multi-vector search is particularly suitable for complex situations requiring high precision, especially when an entity can be represented by multiple different vectors. This applies to the same data (e.g., a sentence) processed by different embedding models or when multimodal information (e.g., a person’s image, fingerprint, and voiceprint) is converted into various vector formats. By performing “multi-path recall” across the table and assigning weights to these vectors, their combined effect can significantly increase recall capability and improve the effectiveness of search results.
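The weighted "multi-path recall" described above can be sketched in plain Python. This is an illustration of the idea, not Milvus's hybrid_search() implementation; the entities, 2-dimensional vectors, and weights are all made up for the example:

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Each entity carries two vector fields (e.g. a text and an image embedding).
entities = {
    "doc_a": {"text": [1.0, 0.0], "image": [0.0, 1.0]},
    "doc_b": {"text": [0.7, 0.7], "image": [0.7, 0.7]},
    "doc_c": {"text": [0.0, 1.0], "image": [1.0, 0.0]},
}

def multi_vector_search(queries, weights, entities, k=2):
    """Run one similarity search per vector field, then merge with weights."""
    scores = {}
    for field, qvec in queries.items():
        for eid, vecs in entities.items():
            scores[eid] = scores.get(eid, 0.0) + weights[field] * cosine_sim(qvec, vecs[field])
    return sorted(scores, key=scores.get, reverse=True)[:k]

top = multi_vector_search(
    queries={"text": [1.0, 0.0], "image": [0.0, 1.0]},
    weights={"text": 0.6, "image": 0.4},
    entities=entities,
)
```

Because "doc_a" matches both the text query and the image query exactly, the weighted combination ranks it first even though neither single field alone would distinguish it from every competitor.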
Other basic search operations:
- Basic searches include single-vector search, batch vector search, partition search, and searches with specified output fields.
- Filtered search refines search results based on filtering conditions of scalar fields.
- Range search finds vectors within a specific distance range from the query vector.
- Grouped search groups search results based on specific fields to ensure diversity in the results.
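The filtered, range, and grouped variants above can be sketched with a brute-force scan (toy data; real engines do this against an ANN index rather than a linear scan):

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

points = [
    {"vec": [0.0, 0.0], "genre": "news"},
    {"vec": [1.0, 0.0], "genre": "blog"},
    {"vec": [0.0, 2.0], "genre": "news"},
    {"vec": [3.0, 3.0], "genre": "blog"},
]
query = [0.0, 0.0]

# Filtered search: apply the scalar predicate, then rank by distance.
filtered = sorted((p for p in points if p["genre"] == "news"),
                  key=lambda p: l2(query, p["vec"]))

# Range search: every point within a given distance of the query.
in_range = [p for p in points if l2(query, p["vec"]) <= 1.5]

# Grouped search: keep only the best hit per group, for result diversity.
best_per_group = {}
for p in sorted(points, key=lambda p: l2(query, p["vec"])):
    best_per_group.setdefault(p["genre"], p)
```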
6.2 Weaviate
- Vector similarity search: Covers a range of approximate search methods, seeking objects most similar to the query vector representation.
- Image search: Uses images as input for similarity search.
- Keyword search: A keyword search using the BM25F algorithm to rank results.
- Hybrid search: Combines BM25 and similarity search to rank results.
- Generative search: Uses search results as prompts for LLM.
- Re-ranking: Re-ranks retrieved search results using a re-ranking module.
- Aggregation: Aggregates data from the result set.
- Filters: Applies conditional filters to searches.
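Hybrid search has to merge two differently-scored result lists. One common fusion method, Reciprocal Rank Fusion (RRF), can be sketched in a few lines (the constant k=60 is the conventional default from the RRF literature, not a Weaviate-specific value, and the document IDs are illustrative):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d5"]    # keyword (BM25) results
vector_ranking = ["d1", "d2", "d3"]  # vector-similarity results
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Documents that rank well in both lists ("d1", "d3") float to the top, while documents found by only one method still survive with a lower score.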
6.3 Qdrant
Supported basic search operations:
- Filtering by relevance score
- Loading multiple search operations in a single request
- Recommendation API
- Grouping operations
Other search methods supported by Qdrant:
Qdrant positions itself primarily as a vector search engine and implements full-text support only where it does not compromise vector-search use cases, in terms of both interfaces and performance.
What Qdrant can do:
- Use full-text filters for search
- Apply full-text filters to vector searches (i.e., perform vector searches within records containing specific words or phrases)
- Perform prefix search and semantic instant search
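The idea of applying a full-text filter *to* a vector search (rather than post-filtering ANN results) can be sketched as follows. This is a toy linear scan, not Qdrant's engine; the documents and vectors are invented for the example:

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

docs = [
    {"id": 1, "text": "qdrant vector search engine", "vec": [0.9, 0.1]},
    {"id": 2, "text": "cooking with cast-iron pans", "vec": [0.8, 0.2]},
    {"id": 3, "text": "vector search in rust", "vec": [0.1, 0.9]},
]

def filtered_vector_search(query_vec, must_contain, docs, k=2):
    # Narrow the candidate set with the full-text condition first, then
    # rank only the survivors by vector similarity.
    candidates = [d for d in docs if must_contain in d["text"]]
    return sorted(candidates,
                  key=lambda d: cosine_sim(query_vec, d["vec"]),
                  reverse=True)[:k]

hits = filtered_vector_search([1.0, 0.0], "vector search", docs)
```

Document 2 is excluded before any similarity computation happens, even though its vector is close to the query, because it fails the full-text condition.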
Features Qdrant plans to introduce in the future:
- Support for sparse vectors, such as those used in SPLADE or similar models
Features Qdrant does not intend to support:
- BM25 or other non-vector-based retrieval or ranking functions
- Built-in ontologies or knowledge graphs
- Query analyzers and other NLP tools
Compared to simple keyword search, BM25 differs in the following respects:
- Relevance Scoring:
- Simple keyword search is usually based on term frequency: if a term appears in a document, then the document is considered relevant. This method may only count the occurrence of keywords, and all keywords are considered equally important.
- BM25 uses a more complex algorithm that considers term frequency, document length, and the inverse document frequency of the term (i.e., its rarity across all documents). This means BM25 can provide a more refined relevance score, better reflecting the match between the query and the document.
- Document Length Handling:
- Simple keyword search may not consider the length of the document. This may lead to longer documents (containing more words) being overly prioritized simply because they have more opportunities to contain the keywords.
- BM25 considers the length of the document through a normalization process within its algorithm, avoiding this bias and ensuring fairness in relevance scoring for both long and short documents.
- Importance of Query Terms:
- In simple keyword search, all keywords may be treated equally, regardless of their commonality.
- BM25 uses inverse document frequency (IDF) to adjust the importance of each query term. This means terms that appear in fewer documents (more unique terms) will have a greater impact on the document’s relevance score.
- Parameter Tuning:
- Simple keyword search usually does not have many configurable parameters to optimize search results.
- BM25 provides parameters (such as k1 and b) that allow fine-tuning of the algorithm’s sensitivity to suit different types of text and search needs.
Compared to simple keyword search, BM25 offers a more complex and refined method for evaluating the relevance between documents and queries, capable of producing more accurate and user-expected search results.
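These behaviors, TF saturation, IDF weighting, and length normalization, can be seen in a compact pure-Python implementation of the classic Okapi BM25 formula (k1=1.5 and b=0.75 are common defaults; the toy corpus is illustrative):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25: TF saturation, document-length normalization, and IDF."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "the cat sat on the mat".split(),
    "the dog".split(),
    "cat cat cat cat cat cat".split(),  # repeating a term has diminishing returns
]
scores = bm25_scores(["cat"], docs)
```

Note that the third document contains "cat" six times but does not score six times higher than the first: the k1 term saturates the contribution of repeated occurrences, which is exactly the refinement over raw term counting described above.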
The open question is whether a single solution can combine the semantic search strengths of vector databases with the precision of traditional keyword search.
7 Appendix
7.1 ANN Benchmarks
Benchmark results are influenced by many factors affecting database performance, such as search type (filtered or regular search), configuration settings, indexing algorithms, data embeddings, and hardware. Beyond raw benchmark performance, the selection of a vector database should also consider distributed capabilities, support for in-memory replicas and caching, the indexing algorithms used, vector similarity search capabilities (including hybrid search, filtering, and multiple similarity metrics), sharding mechanisms, clustering methods, scalability potential, data consistency, and overall system availability.
ANN-Benchmarks is a primary benchmarking platform for evaluating the performance of approximate nearest neighbor search algorithms. In text retrieval, a vector database's performance on angular metrics often matters more than its performance on Euclidean metrics, because angular metrics are more sensitive to the semantic similarity of text documents, while Euclidean metrics are more sensitive to document length and scale. Therefore, in the context of retrieval-augmented generation (RAG), more attention should be paid to a database's performance on angular datasets across different dimensions.
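The distinction can be shown with a toy example (the vectors are invented for illustration): two embeddings pointing in the same direction but with different magnitudes are nearly identical under an angular metric, yet far apart under the Euclidean metric.

```python
import math

def cosine_dist(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def euclidean_dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

short_doc = [1.0, 2.0, 3.0]
long_doc = [10.0, 20.0, 30.0]  # same direction, 10x the magnitude

ang = cosine_dist(short_doc, long_doc)     # ~0: same "topic"
euc = euclidean_dist(short_doc, long_doc)  # large: dominated by length
```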
7.1.1 glove-100-angular
Evidently, Milvus has the highest throughput when the recall value is below 0.95. When the recall value exceeds 0.95, the throughput gap narrows. Vespa has the longest build time. Weaviate and Milvus have comparable build times, with Milvus slightly longer. In terms of index size, Weaviate's index is the smallest; although Milvus's index is the largest, it is still under 1.5 GB (for a dataset of 1.2 million 100-dimensional vectors).
7.1.2 nytimes-256-angular
The results on this dataset are similar to those on glove-100-angular. Weaviate has the longest build time and the smallest index on this dataset. Milvus's index is the largest, but it is only 440 MB (for a dataset of 290,000 256-dimensional vectors).
7.2 Vector Similarity Metrics
Metric | Description | Supported Databases |
---|---|---|
Cosine Distance | Measures the cosine of the angle between two vectors | pgvector, Pinecone, Weaviate, Qdrant, Milvus, Vespa |
Euclidean Distance (L2) | Calculates the straight-line distance between two vectors in multidimensional space | pgvector, Pinecone, Qdrant, Milvus, Vespa |
Inner Product (Dot Product) | Calculates the sum of the products of corresponding vector components | pgvector, Pinecone, Weaviate, Qdrant, Milvus |
L2 Squared Distance | The square of the Euclidean distance between two vectors | Weaviate |
Hamming Distance | Measures the number of differences between vectors in each dimension | Weaviate, Milvus, Vespa |
Manhattan Distance | Measures the distance between vector dimensions along right-angle axes | Weaviate |
Below is a detailed introduction to each metric, including their relative advantages, disadvantages, and suitable use cases.
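For concreteness, all six metrics from the table can be written in a few lines of plain Python (the example vectors are illustrative):

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)          # 0 = same direction

def euclidean_l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def hamming(a, b):
    return sum(1 for x, y in zip(a, b) if x != y)  # differing positions

def manhattan_l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

u, v = [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]
```

On this pair, L2 squared (2.0) is the square of the Euclidean distance (√2), and the Hamming and Manhattan distances coincide (2) only because the vectors are binary-valued; on continuous data they diverge.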
7.2.1 Cosine Distance
Cosine distance measures the cosine of the angle between two vectors, commonly used for handling normalized or convex sets.
- Advantages: Mainly considers the direction of vectors, making it very suitable for high-dimensional spaces, such as text comparison, where document length is less important.
- Disadvantages: Not suitable for scenarios requiring matching vector dimensions, such as comparing image embeddings based on pixel density. If the data does not form a convex set, it may not provide an accurate similarity measure.
Cosine distance is suitable for document classification, semantic search, recommendation systems, and any other tasks involving high-dimensional and standardized data. In information retrieval, cosine distance is often used to measure the similarity between query content and document vectors, ignoring their length but focusing on semantic meaning.
7.2.2 Euclidean Distance L2
Euclidean distance calculates the straight-line distance between two vectors in multidimensional space; it is the distance induced by the L2 norm.
- Advantages: Intuitive, easy to calculate, sensitive to both the size and direction of vectors.
- Disadvantages: May perform poorly in high-dimensional spaces due to the “curse of dimensionality.”
Suitable for image recognition, speech recognition, handwriting analysis, and other scenarios.
7.2.3 Inner Product
Inner product calculates the sum of the products of corresponding vector components; on unit-normalized vectors, it is equivalent to cosine similarity.
- Advantages: Fast calculation, reflects the size and direction of vectors.
- Disadvantages: Sensitive to vector magnitude, so vectors with large norms can dominate the ranking unless the data is normalized.
The classic application of inner product is in the field of recommendation systems. In recommendation systems, the inner product can be used to determine the similarity between user vectors and item vectors, helping predict a user’s interest in an item. Inner product is suitable for recommendation systems, collaborative filtering, and matrix decomposition.
7.2.4 L2 Squared Distance
The square of the Euclidean distance between two vectors.
- Advantages: Penalizes large differences between vector elements, which can be useful in some situations.
- Disadvantages: The square operation may distort distances and is sensitive to outliers.
L2 squared distance is particularly suitable for problems involving differences in individual dimensions, such as comparing the differences between two images in image processing.
7.2.5 Hamming Distance
Counts the number of dimensions in which two vectors differ.
- Advantages: Suitable for comparing binary or categorical data.
- Disadvantages: Not applicable to continuous or numerical data.
The applicable scenarios are also quite specific, such as error detection and correction (categorical data); measuring the genetic distance between two DNA strands.
7.2.6 Manhattan Distance L1
Measures the distance between two vectors along axis-aligned (right-angle) paths; it is the distance induced by the L1 norm.
- Advantages: More resistant to outliers than Euclidean distance.
- Disadvantages: Less intuitive in geometric terms compared to Euclidean distance.
Suitable for calculating chessboard distance and shortest path problems in logistics planning.