Vector Database Comparison: Weaviate, Milvus, and Qdrant

The success of a RAG system hinges on how efficiently it can process massive amounts of information, and the vector database sits at its core: it stores text, images, and other data as vectors, so its quality directly affects the system's final output. Choosing the right vector database means weighing factors such as open-source licensing, CRUD support, and distributed architecture. Weaviate, Milvus, Qdrant, Vespa, and Pinecone are currently the most popular purpose-built vector databases. By contrast, vector libraries such as FAISS, HNSWLib, and Annoy are designed for static data and do not support CRUD operations; full-text search engines such as Elasticsearch are weaker than vector databases at handling high-dimensional data; SQL extensions such as pgvector for PostgreSQL perform poorly at large vector volumes; and vector support in NoSQL databases is still in its early stages.

Purpose-built vector databases support a rich set of vector operations and use approximate nearest neighbor (ANN) algorithms to balance efficiency, storage, and accuracy. Pinecone, Chroma, Vearch, and Vald each have their pros and cons, but Weaviate, Milvus, Qdrant, and Vespa are the strongest choices for RAG: Milvus supports multi-vector search for complex scenarios, Weaviate offers hybrid and generative search, and Qdrant applies keyword filtering directly to vector search. The final choice should rest on performance and scalability benchmarks, together with each system's design and feature set.
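As a concrete illustration of Qdrant's filtered vector search, here is a minimal sketch using the official qdrant-client Python package; the collection name `docs`, the payload field `lang`, and the 384-dimensional query vector are all assumptions made for the example.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")

# ANN search restricted by a keyword (payload) filter: only points whose
# "lang" field equals "en" are considered as candidates.
hits = client.search(
    collection_name="docs",          # assumed collection name
    query_vector=[0.1] * 384,        # stand-in for a real query embedding
    query_filter=Filter(
        must=[FieldCondition(key="lang", match=MatchValue(value="en"))]
    ),
    limit=5,
)
for hit in hits:
    print(hit.id, hit.score)
```

Applying the filter inside the ANN search, rather than post-filtering the results, is what keeps the top-k list full even when the keyword condition is selective.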

From AI Knowledge Base to RAG

In AI application development, models often lack the data they need to handle specific tasks accurately. Retrieval-Augmented Generation (RAG) improves the accuracy and reliability of generative AI models by retrieving external information. RAG emerged to address the hallucination problem of large language models, where outputs contradict the facts or fabricate answers outright. With RAG, a model can access up-to-date or custom information, and users can verify the sources behind an answer. A RAG pipeline has three stages: indexing, retrieval, and generation. First, users upload documents, which the system embeds and stores in a vector database. When a user asks a question, the question is converted into a vector and matched against the database for an initial retrieval. A rerank model then reorders the retrieved results and passes the most relevant ones to the generation stage. Approaches to building an AI knowledge base include prompt engineering, fine-tuning, and embedding; embedding is the mainstream approach, and it only becomes effective when combined with RAG. Open-source RAG implementations such as Dify and Langchain-Chatchat offer different levels of functionality to help developers build efficient AI applications.
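The sketch below walks through the three stages end to end, with a toy bag-of-words embedding standing in for a real embedding model and a print standing in for the generation step; every document, question, and function here is invented for illustration.

```python
import numpy as np

# Toy embedding: hash words into a fixed-size vector. A real system would
# call an embedding model here; this stand-in only illustrates the flow.
def embed(text: str, dim: int = 64) -> np.ndarray:
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Indexing stage: embed documents and stack them into a matrix.
docs = [
    "RAG retrieves external documents before generation.",
    "Vector databases store embeddings for similarity search.",
    "Rerank models reorder retrieved passages by relevance.",
]
index = np.stack([embed(d) for d in docs])

# Retrieval stage: embed the question, rank documents by cosine similarity.
question = "How does RAG reduce hallucination?"
scores = index @ embed(question)
top = np.argsort(scores)[::-1][:2]

# Generation stage: the retrieved passages would be placed into the prompt
# of a language model; printed here instead.
context = "\n".join(docs[i] for i in top)
print(f"Context:\n{context}\n\nQuestion: {question}")
```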

Stderr and Stdout: Understanding Logs and Output

Standard output (stdout) and standard error (stderr) are the two main output streams of a process, carrying normal output and error messages, respectively. In Python, the print function writes to stdout by default, while the logging module writes to stderr by default, which keeps normal output and log messages separable. The tqdm progress-bar library also writes to stderr by default so that it does not interfere with normal output. Both streams can be managed flexibly through shell redirection or Python-level configuration. The nohup command merges stdout and stderr into nohup.out by default, but redirection can keep them apart. The two streams also buffer differently in Python: stdout is line-buffered in interactive mode and block-buffered otherwise, while stderr was block-buffered when redirected to a file before Python 3.9 and has been line-buffered in every mode since. Buffering can be disabled with the python -u option or the PYTHONUNBUFFERED environment variable. In concurrent programs, stdout and stderr writes may interleave, so thread locks or process-level synchronization are needed.

In C++, std::cout is typically line-buffered when attached to a terminal, while std::cerr is unbuffered, making it well suited to error output. Output can be redirected to a file with the freopen function, and in multithreaded code a mutex serializes writes to avoid races. Mastering these mechanics makes program output controllable and predictable, improving both stability and user experience.
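A small self-contained demo of the default stream assignments (the script name and messages are made up); running it as `python demo.py > out.txt 2> err.txt` puts the print output in out.txt and the log records in err.txt:

```python
import logging
import sys

# logging's default handler writes to stderr; print() writes to stdout.
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")

print("result line on stdout")                 # -> stdout
print("explicit error line", file=sys.stderr)  # -> stderr
logging.info("log record")                     # -> stderr
```

Running it as `python -u demo.py` (or with PYTHONUNBUFFERED=1 set) disables the buffering described above, which matters when output is piped or captured under nohup.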

Automatic Segmentation Tool for Long Webpage Screenshots

Long screenshots are a practical format for sharing or analyzing web content, but cutting them up while preserving information integrity and readability has always been a challenge. Web-page-Screenshot-Segmentation, a tool built on OpenCV, was developed to solve this problem. It automatically identifies the natural dividing lines in web content, picks the most suitable segmentation points, and keeps each piece complete and readable. Users only need to provide a long screenshot; the tool analyzes the image, chooses the split points, and produces a series of complete, well-structured images that are easy to share and process further. The project is open source on GitHub with simple installation and usage instructions. From the command line it can report the computed split heights, draw the split lines onto the image, or split the image into parts; the repository also ships example code showing how to call the split_heights and draw_line_from_file functions directly from source.
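The core idea can be sketched in a few lines of OpenCV. This is an independent illustration of the technique, not the project's actual split_heights implementation; the uniformity tolerance and maximum segment height are assumed values.

```python
import cv2
import numpy as np

# Rows whose pixels are nearly uniform are natural dividing lines (blank
# bands between content blocks), so we cut the screenshot at such rows.
def find_split_heights(path: str, max_height: int = 1200,
                       tolerance: float = 1.0) -> list:
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    row_std = gray.std(axis=1)      # per-row pixel spread
    blank = row_std < tolerance     # True where the row is visually uniform
    splits, last = [], 0
    for y in range(len(blank)):
        if y - last >= max_height and blank[y]:
            splits.append(y)
            last = y
    return splits

def split_image(path: str) -> None:
    img = cv2.imread(path)
    bounds = [0] + find_split_heights(path) + [img.shape[0]]
    for i in range(len(bounds) - 1):
        cv2.imwrite(f"part_{i}.png", img[bounds[i]:bounds[i + 1]])

split_image("long_screenshot.png")  # assumed input file
```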

GPT-Driven General Web Crawler

GPT-Web-Crawler is a general-purpose web crawler built on Python and Puppeteer that uses GPT to dramatically simplify extracting information from web pages. Traditional crawlers need bespoke configuration for each website, whereas GPT-Web-Crawler can scrape pages and extract information with just a few lines of code. It is particularly suited to users who are unfamiliar with crawlers but want to pull content out of web pages. Users only need to install the relevant packages and, if AI content extraction is needed, configure an OpenAI API key before starting a crawl. The tool ships four crawler types, NoobSpider, CatSpider, ProSpider, and LionSpider, which provide basic information extraction, screenshot capture, AI content extraction, and image extraction, respectively; CatSpider requires a Puppeteer installation for its screenshot feature. Crawler output can be produced as JSON, which is easy to convert to CSV or import into a database.
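The pattern behind GPT-based extraction can be illustrated in a few lines (this is not GPT-Web-Crawler's actual API; the model name and JSON schema are assumptions for the example): fetch a page, then ask the model to return structured fields.

```python
import json

import requests
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract(url: str) -> dict:
    # Truncate the HTML so it fits in the prompt; a real crawler would
    # render the page (e.g. via Puppeteer) and strip boilerplate first.
    html = requests.get(url, timeout=10).text[:8000]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; substitute your own
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Extract the page title and a one-sentence summary "
                        "as JSON with keys 'title' and 'summary'."},
            {"role": "user", "content": html},
        ],
    )
    return json.loads(resp.choices[0].message.content)

print(extract("https://example.com"))
```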

IoT and Sensor Network Course Review Notes

The Internet of Things (IoT) enables information exchange between objects, and between people and objects, through intelligent sensing devices and transmission networks; its defining features are comprehensive perception, reliable transmission, and intelligent processing. Its conceptual model consists of a perception layer, a network layer, and an application layer, responsible for identification, connectivity, and applications, respectively. Sensor data is massive, polymorphic, correlated, and semantic, and supports a range of wireless sensing approaches, from traditional sensing to intelligent wireless sensing. A Wireless Local Area Network (WLAN) consists of stations, wireless access points, the wireless medium, and a distribution system, and faces the classic hidden-terminal and exposed-terminal problems. CSMA/CD is unsuitable for wireless environments because collisions cannot be detected there; CSMA/CA replaces it, avoiding collisions through acknowledgments, inter-frame priorities, and random backoff.

A Wireless Sensor Network (WSN) consists of sensor nodes, sink nodes, and management nodes, where individual nodes have limited power, computation, and communication capability. Sensor network architectures divide into hierarchical and clustered designs, with data dissemination and collection aimed at optimizing energy consumption and delay. Positioning techniques fall into range-based methods, such as RSS and TOA/TDOA, and range-free methods, such as the centroid algorithm and DV-HOP. Time synchronization is crucial in sensor networks, with NTP, RBS, and TPSN among the common mechanisms.

The Industrial Internet builds on the Internet to upgrade the real economy, advancing manufacturing intelligence together with Industry 4.0. The five-dimensional digital-twin model optimizes physical devices through physical entities, virtual entities, services, network connections, and twin data. IoT, big data, cloud computing, and artificial intelligence reinforce one another, jointly driving technological progress.
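As a worked example of a range-free method, the centroid algorithm estimates an unknown node's position as the mean of the coordinates of the anchor nodes it can hear; the anchor coordinates below are made up for illustration.

```python
# Centroid localization: average the (x, y) coordinates of all anchor
# (beacon) nodes within radio range of the unknown node.
def centroid(anchors_in_range):
    n = len(anchors_in_range)
    x = sum(a[0] for a in anchors_in_range) / n
    y = sum(a[1] for a in anchors_in_range) / n
    return x, y

# Three anchors heard by the node -> estimated position (3.33, 3.33).
print(centroid([(0, 0), (10, 0), (0, 10)]))
```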