
The long long journey...

Multithreading Programming With Pthread

This article introduces multithreaded programming with pthread through two examples. The first computes the value of Pi in parallel using the Leibniz formula: the terms are partitioned so that each thread sums one part of the data, speeding up the computation. To avoid conflicts when multiple threads update the global result, mutexes and semaphores coordinate the threads so that each accumulates its local result into the global total in turn. In the code, a BLOCK_SIZE constant is defined, each thread computes the sum over one block of terms, and updates to the global variable sum are protected by a mutex lock. The second example is a thread pool designed around the producer-consumer model. The pool uses a task queue as the buffer between producers and consumers, where each queue element holds the function to execute and its arguments. The implementation covers initializing the task queue and semaphores, starting the consumer threads, and waiting for all threads to finish when the pool shuts down. Producers add tasks through the thread_pool_enqueue function, with mutexes and condition variables protecting operations on the task queue; consumers take tasks from the queue and execute them in the thread_pool_worker function, blocking when the queue is empty until new tasks arrive. Finally, a simple task that prints its task id and thread id is run in the pool to demonstrate the thread pool in practice. Through these examples, readers can better understand the basic principles and typical application scenarios of multithreaded programming.

Essence of Linear Algebra

This article discusses in detail several core concepts in linear algebra and their applications. First, vectors are described as linear combinations of basis vectors, emphasizing how linearly dependent and independent vectors behave differently in space. Matrices are viewed as representations of linear transformations, with matrix multiplication representing composition of transformations. The geometric meaning of the determinant is the factor by which area changes under the transformation, and a determinant of zero indicates a non-invertible transformation. Inverse matrices are used to solve systems of equations, and rank is the dimension of the transformed space. The duality of the dot product reveals a deep connection between vectors and matrices. Eigenvalues and eigenvectors describe the characteristic behavior of a transformation, illustrated in particular with rotations and shears. Change of basis handles conversion between different coordinate systems. Cramer's rule offers a geometric perspective on solving linear systems via determinants. Overall, the article combines geometric and algebraic viewpoints to help readers better understand the fundamental concepts of linear algebra and their importance in practical applications.
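The eigenvector relation and the determinant-as-area idea can be written compactly. As a worked instance (my own illustration, not taken from the article), a 2D shear preserves area and keeps exactly one direction unchanged:

```latex
A\vec{v} = \lambda\vec{v}, \qquad
\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc
% Shear example: det = 1 (area preserved); (1,0) is an eigenvector with eigenvalue 1
S = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \qquad
\det S = 1, \qquad
S\begin{pmatrix}1\\0\end{pmatrix} = 1\cdot\begin{pmatrix}1\\0\end{pmatrix}
```

A rotation, by contrast, has no real eigenvectors, since no direction in the plane is left unchanged.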

Basic Algorithm Templates

This blog covers a wide range of topics on basic algorithms and data structures, providing detailed code templates and application examples. The sorting algorithm section introduces the implementation of quicksort and merge sort. The binary search section demonstrates integer and floating-point binary search templates. The high precision calculation section includes implementations of addition, subtraction, multiplication, and division. The prefix and difference section explains one-dimensional and two-dimensional prefix sums and differences. The bit manipulation section provides common bit operation methods. The two-pointer algorithm section introduces techniques for maintaining intervals and order. The discretization and interval merging section shows how to handle interval and discretization problems. The linked list and adjacency list section explains the implementation of singly and doubly linked lists. The stack and queue section introduces the implementation of stacks, regular queues, and circular queues. The KMP string matching section provides templates for computing the Next array and matching. The Trie tree section demonstrates the implementation of string insertion and query. The union-find section introduces the naive union-find, union-find maintaining size, and union-find maintaining distance to ancestor nodes. The heap section provides templates for heap sort and heap simulation. The hash section introduces the implementation of general hash and string hash. The search and graph theory section explains algorithms for DFS, BFS, topological sorting, shortest path, minimum spanning tree, and bipartite graph. The mathematics section covers prime numbers, divisors, Euler’s function, fast exponentiation, extended Euclidean algorithm, Gaussian elimination, and combinatorial counting. The game theory section introduces Catalan numbers, NIM games, and directed graph game theory.

SSH Tunnel Port Forwarding

In some cases, a server may expose only its SSH service port, with other ports closed for security reasons. To communicate with those ports, SSH tunneling can be used: it forwards traffic through the SSH connection to reach the restricted ports. The basic command format is: ssh -L local_portX:hostC:hostC_portZ username@hostB, where -L requests local port forwarding. Useful options include -N (do not log in, only forward ports), -f (run the SSH process in the background), -R (reverse forwarding), and -D (dynamic port forwarding). Typical application scenarios:
- Bypassing firewalls: access a port blocked by a firewall by connecting to host B via SSH and forwarding through it.
- Crossing network partitions: when hosts B and C are on the same internal network, an external host A can reach host C through host B.
- Exposing non-public ports: an internal host A connects out to a public host B via SSH with reverse forwarding, allowing B to access A's ports.
- Dynamic port forwarding: the -D option creates a local SOCKS proxy that routes traffic through the SSH tunnel to the remote server, enabling internet access via that server.
These techniques provide flexible solutions for establishing the required communication in restricted network environments.
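Concrete invocations of the three forwarding modes might look like this (hostnames, usernames, and port numbers here are hypothetical placeholders):

```shell
# Local forwarding: connections to localhost:8080 on host A travel through
# host B's sshd and on to hostC:3306.
ssh -N -f -L 8080:hostC:3306 user@hostB

# Reverse forwarding: host B's port 2222 is forwarded back to A's port 22,
# so B (and anyone who can reach B) can SSH into the internal host A.
ssh -N -R 2222:localhost:22 user@hostB

# Dynamic forwarding: a local SOCKS proxy on port 1080 tunnels all
# configured application traffic through B.
ssh -N -D 1080 user@hostB
```

The -N flag keeps the connection alive purely for forwarding, and -f detaches it into the background once authentication completes.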

Data Mining Course Review Notes

Data mining is the process of automatically analyzing and extracting information from data using computer technology, aiming to discover potentially valuable information hidden in the data. Its methods include supervised and unsupervised learning. The data mining process typically includes preparing the data, selecting techniques or algorithms, interpreting and evaluating models, and applying models. Basic techniques include decision trees, association rules, and clustering. Decision trees build models by repeatedly selecting the attribute with the highest gain ratio, while association rule mining uses the Apriori algorithm to generate rules that meet minimum support and confidence thresholds. The K-means algorithm is used for cluster analysis, grouping instances by their computed similarity. Knowledge Discovery in Databases (KDD) is the process of extracting credible and valuable information from datasets, and usually requires data preprocessing such as histogram reduction and data normalization. Evaluation techniques assess the accuracy and error of classification models and numerical-output models. The neural network section introduces the artificial neuron model and the structure and training procedure of BP neural networks, and also explains in detail the convolution and pooling operations of convolutional neural networks. Among statistical techniques, regression analysis and Bayesian analysis are important tools, the former for determining dependency relationships between variables and the latter for parameter estimation. Clustering techniques include agglomerative clustering and the Cobweb hierarchical clustering algorithm, the latter of which can adjust the number of clusters automatically.

Big Data Storage Course Notes

This blog first introduces the background of distributed databases and big data storage, emphasizing the importance of horizontal and vertical scaling and the four characteristics of big data: volume, velocity, variety, and value. The traditional relational model struggles to meet big data storage needs, so a cluster system capable of unified management and scheduling of compute and storage resources is required. Next, the differences between NoSQL and NewSQL are discussed: NoSQL mainly addresses the scalability problems of SQL, while NewSQL combines the massive storage capacity of NoSQL with the ACID properties of relational databases. Within the C/S-based hierarchical structure, the functional changes of the AP and DP are analyzed in detail, revealing three distributed architectures: Partition All, Partition Engine, and Partition Storage, each offering different trade-offs in scalability and compatibility. The component structure and schema structure of a DDBS are introduced in detail, covering the roles of the global external schema, global conceptual schema, fragmentation schema, allocation schema, local conceptual schema, and local internal schema. Regarding data transparency, fragmentation transparency, allocation transparency, and local mapping transparency are defined and explained. In distributed database design, fragmentation, allocation, and replication are the key steps. Fragmentation reduces the amount of data transmitted over the network and improves query efficiency and system reliability; the definitions and roles of horizontal and vertical fragmentation are discussed in detail. HBase, as an important tool for big data storage, is analyzed in depth for its characteristics and Region mechanism: it stores data on HDFS, supports horizontal scaling and automatic data sharding, and provides strong read-write consistency and automatic failover.
Regarding big data index structures, skip lists and LSM trees are introduced as efficient storage-engine structures suited to different workloads. Finally, the consistency of distributed transactions, the CAP and BASE theories, and concurrency control strategies are discussed, with emphasis on transaction isolation and consistency.