avatar

The long long journey...

Review Notes for the Parallel Computing Course

Review OutlineChapter One Amdahl’s Law Understanding of the law (with the task unchanged, the improvement of speed, speed-up ratio), the limits of acceleration Application question 6’*5 Layout of grids and thread blocks, calculating global id Parallelism, concurrency, warp of threads, global id, multi-core CPU vs many-core GPU Program analysis question 10*2 Writing results for given codes, analyze why those results occur Multi-core CPU 10*2 Data partitioning: clarify the data range handled by each part Task parallelism: thread pool experiment CUDA Programming 15*2 Specific problems, design grids and thread blocks, or given thread blocks, just need to design grids; Fixed procedure in the main function; the key lies in writing kernel functions; Parallel Computing Concurrency and ParallelismSerial: single-machine single-core, instructions execute in order. Concurrency: single-machine single-core, instructions execute in parallel over time, occurring within the same time interval. Parallelism: single-machine multi-core, multi-machine single/multi-core, instructions execute in parallel in space, occurring at the same moment. Parallel computing is supercomputing performed on parallel computers or distributed systems and other high-performance computing systems. Parallel computing can reduce the time it takes to solve a single problem, increase the scale and precision of solutions, and improve throughput, among other benefits. Three classifications: Computation modes: time parallelism (pipelining), space parallelism (multiprocessor) Program logic: task parallelism, data parallelism From the application perspective: compute-intensive, data-intensive, network-intensive Flynn’s TaxonomyA method for classifying parallel computing architectures based on the execution mode of instruction streams and data streams. Includes SISD (early serial machines), SIMD (single-core computers), MISD (rarely used), MIMD (multi-core computers, parallel);

Implementing Parallel Programming with CUDA

This article demonstrates how to write CUDA kernel functions and uses five examples, such as matrix transposition, to accelerate the solution of large-scale problems with GPU multi-core. Matrix Transposition Algorithm Process Store the matrix to be transposed in GPU memory. Allocate space on the GPU to store the transposed matrix. Define a CUDA kernel function to implement the matrix transposition. This kernel function should use the concept of thread blocks and thread grids to handle all elements in the matrix. Within each thread block, threads can use shared memory to process data. Finally, use global memory to write the result back to the GPU. Call the CUDA kernel function to perform matrix transposition. Copy the transposed matrix from GPU memory to host memory. Release GPU memory Code ImplementationBased on traditional code, use shared memory to optimize access to global memory and apply padding operation to avoid bank conflicts. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 const int N = 10000; const int TILE_DIM = 32; const int grid_size_x = (N + TILE_DIM - 1) / TILE_DIM; const int grid_size_y = grid_size_x; const dim3 block_size(TILE_DIM, TILE_DIM); const dim3 grid_size(grid_size_x, grid_size_y); __global__ void transpose(const float *A, float *B, const int N){ __shared__ float S[TILE_DIM][TILE_DIM+1]; int bx = blockIdx.x * TILE_DIM; int by = blockIdx.

Understanding Bayes' Theorem from a Machine Learning Perspective

A Fascinating Interpretation of Bayes’ Theorem The Bayesian school believes that nothing is random. If it appears so, then it’s due to a lack of information (Shannon Information Theory); The Bayesian school in statistics has led to the Bayesian theory in machine learning. Bayes’ theorem provides us with the ability to infer based on the probabilities before an event, after the event has occurred. An accidental use of Bayes’ example: A joke—water is deadly poison because everyone who got cancer drank water. An example of being inadvertently misled by Bayes: Diagnostic methods with very high detection rates (99.9% accuracy) have an extremely high misdiagnosis rate (>50%). This is because the prevalence of the disease in the general population is less than 1%. Probability and statistics are indeed like a young girl who is ready to be dressed up by anyone. $$P(c|x) = \frac{P(c)P(x|c)}{P(x)}$$ Understanding Bayes’ Theorem from a Machine Learning PerspectiveThe same formula as above, but in machine learning, it defines a naive Bayes classifier, read as P c given x, where the left side is the posterior probability, $P(c)$ is the prior probability, $P(x|c)$ is the likelihood, which is the main focus of the model’s learning. $P(x)$ is the same for all input samples and is used for normalization (it is expanded using the total probability formula during computation); the estimation of $P(c)P(c|x)$ can use the method of Maximum Likelihood Estimation (see Watermelon Book p148). From a general perspective (which might not be entirely accurate): $P(c)$ is the original probability of an event.

Pthread for Multithreading Programming

This article will use two examples to practice multithreading programming with pthread, mainly covering two parts: Parallel computation of the PI value by dividing data Thread pool development based on the producer-consumer pattern, with the specific business logic simplified to focus on thread management and synchronization Calculating Pi Conceptual OverviewBased on the Leibniz formula, calculate a large number of times through multithreading to approximate $\pi$. Use multithreading for dividing data, i.e., each thread handles a part of the data for acceleration. Due to potential conflicts when multiple threads access a global result, mutexes and semaphores are used to organize threads to orderly add local results to the global outcome. Code Implementation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 #include <pthread.h> #include <stdio.h> #include <stdlib.h> #include <math.h> const int BLOCK_SIZE = 100000; double sum = 0; int num_threads; // Define mutex and condition variables pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; pthread_cond_t cond = PTHREAD_COND_INITIALIZER; void *calculate_block(void *thread_id) { long id = (long)thread_id; int start = id * BLOCK_SIZE; int end = start + BLOCK_SIZE; double block_sum = 0; for (int i = start; i < end; i++) { double term = pow(-1, i) / (2 * i + 1); block_sum += term; } pthread_mutex_lock(&mutex); sum += block_sum; pthread_mutex_unlock(&mutex); pthread_cond_signal(&cond); } int main() { num_threads = 8; pthread_t threads[num_threads]; for (long i = 0; i < num_threads; i++) pthread_create(&threads[i], NULL, calculate_block, (void *)i); for(int i = 0; i < num_threads; i ++) pthread_join(threads[i], NULL); printf("%lf", sum * 4); } Thread Pool Design Thread Pool ImplementationA task queue is used as a buffer area between the producer and the consumer.

Essence of Linear Algebra

Vector The introduction of numbers as coordinates is an act of violence. AND on the flip side, it gives people a language to describe space and the manipulation of space using numbers that can be crunched and run through a computer. Heresy: Linear algebra lets programmers manipulate space. $i$ and $j$ are basis vectors, any vector can be seen as their linear combination. Collinear vectors are linearly dependent, and the space they span is just a line (or the origin); Non-collinear vectors are linearly independent, and the space they span is the entire set of vectors; Matrix Kind of linear transformation Fortunately, linear algebra only involves linear transformations; Matrices can be understood as a kind of transformation; ac represents the position of a basis after transformation, and so does bd, therefore 1001 is equivalent to no transformation; Knowing how the basis is transformed, we know how all vectors are transformed; Orthogonal transformation is a transformation where the basis vectors maintain their length and remain perpendicular to each other (rigid body motion); Matrix Multiply A single matrix is a linear transformation, so matrix multiplication represents a composite transformation. The collective meaning of matrix multiplication lies in making two linear transformations succeed each other (apply one transformation then another). Non-square matrices represent transformations across dimensions; Determinant This is the determinant! 👆 The area of a 1x1 square, after a matrix transformation, equals the value of the corresponding determinant. If the determinant is 0, everything is flattened, indicating a non-invertible transformation, and the matrix is also non-invertible.

Algorithm Basics Template

Graph Theory Dynamic Programming Basic Algorithms Sorting Quick Sort Template—— Template problem AcWing 785. Quick Sort 1 2 3 4 5 6 7 8 9 10 11 12 13 void quick_sort(int q[], int l, int r) { if (l >= r) return; int i = l - 1, j = r + 1, x = q[l + r >> 1]; while (i < j) { do i ++ ; while (q[i] < x); do j -- ; while (q[j] > x); if (i < j) swap(q[i], q[j]); } quick_sort(q, l, j), quick_sort(q, j + 1, r); } Merge Sort Template—— Template problem AcWing 787. Merge Sort 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 void merge_sort(int q[], int l, int r) { if (l >= r) return; int mid = l + r >> 1; merge_sort(q, l, mid); merge_sort(q, mid + 1, r); int k = 0, i = l, j = mid + 1; while (i <= mid && j <= r) if (q[i] <= q[j]) tmp[k++] = q[i++]; else tmp[k++] = q[j++]; while (i <= mid) tmp[k++] = q[i++]; while (j <= r) tmp[k++] = q[j++]; for (i = l, j = 0; i <= r; i++, j++) q[i] = tmp[j]; } Binary Search Integer Binary Search Template—— Template problem AcWing 789. Range of Numbers 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 bool check(int x) {/* .
0%