Parallel Computing Course Review Notes
1 Review Course Outline
Chapter 1 Amdahl's Law: understanding the law (speedup ratio with a fixed workload) and the limit on achievable speedup
Application Questions (6 points × 5)
Layout of grids and thread blocks, calculation of the global id; parallelism vs. concurrency, warps, global id, CPU multi-core vs. GPU many-core
Program Analysis Questions (10 points × 2)
Write the output of the given code and analyze why that output occurs
CPU Multi-core (10 points × 2)
Data partitioning: clearly define the data range processed by each thread. Task parallelism: the thread-pool experiment
CUDA Programming (15 points × 2)
For a specific problem, design the grid and thread blocks (or, given the thread blocks, design only the grid); the main-function flow is fixed; the key is writing the kernel function
2 Parallel Computing
2.1 Concurrency and Parallelism
Serial: Single machine single core, instructions executed in order.
Concurrency: Single machine, single core; instructions are parallel in time (interleaved), occurring within the same time interval.
Parallelism: Single machine multi-core, or multiple machines (single- or multi-core); instructions are parallel in space, occurring at the same instant.
Parallel computing is supercomputing performed on parallel computers or distributed systems and other high-performance computing systems. Parallel computing can reduce the time to solve a single problem, increase the scale and accuracy of solutions, and improve throughput, etc.
Three classifications:
- Computational model: time parallelism (pipeline), space parallelism (multi-processor)
- Program logic: task parallelism, data parallelism
- Application perspective: compute-intensive, data-intensive, network-intensive
2.2 Flynn’s Taxonomy
A method of classifying parallel computer architectures based on the instruction streams and data streams being executed. It includes SISD (early serial machines), SIMD (single instruction acting on multiple data), MISD (rarely used), and MIMD (multi-core computers, parallel).
2.3 Amdahl’s Law
Assuming a fixed workload (a constant number of tasks), it reveals that the non-parallelizable part of a program limits the overall performance improvement, via the speedup ratio $$S=\frac{W_{s}+W_{p}}{W_{s}+W_{p}/p}$$ where $W_{s}$ is the serial workload, $W_{p}$ is the parallelizable workload, $p$ is the number of processors, and $S$ is the speedup ratio. Using the serial fraction $f=W_{s}/W$ (with $W = W_{s}+W_{p}$), dividing numerator and denominator by $W$ gives $$S=\frac{f+(1-f)}{f+\frac{1-f}{p}} =\frac{p}{1+f(p-1)}$$ and $\lim_{p\rightarrow \infty}S=1/f$: as the number of processors grows without bound, the achievable speedup is limited by the serial part of the program.
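As a quick worked example (the same setting as the 2020 exam question later in these notes): with a serial fraction $f = 0.4$, the speedup can never exceed $1/f = 2.5$; for instance, with $p = 8$ processors, $S = \frac{8}{1 + 0.4 \times 7} = \frac{8}{3.8} \approx 2.1$.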
3 CUDA Overview
3.1 Heterogeneous Computing
GPU parallel computing is a form of heterogeneous computing, split into a host (CPU) end and a device (GPU) end; the two are never equal peers, and CUDA even requires explicitly marking where each piece of code runs.
3.2 Differences between CPU and GPU
Intuitively, more resources in CPUs are used for cache and control flow, while GPUs are more for data computation.
- In the GPU environment, the GPU core is responsible for executing all computation tasks, but the work instructions always come from the CPU.
- In the GPU scenario, GPU cores never fetch data themselves; data always comes from the CPU end, and computation results are then transmitted back to the CPU end. Therefore, the GPU acts as a computation accelerator in the background, completing certain outsourced tasks for the CPU.
- This type of architecture is only very effective when there are a large number of parallel processing units, not just 2 or 4.
- The concept of the warp has a significant impact on GPU architecture. Data must be fed to the GPU in blocks of the same size; one block of data is half a warp, i.e., 16 elements.
- The fact that data must be transferred to the GPU cores in half-warp-sized blocks means the storage system feeding the GPU should deliver 16 data elements at a time. This requires a parallel storage subsystem that can transfer 16 numbers at once, which is why the GPU's DRAM is built from GDDR5: it is parallel memory.
- Since GPU cores and CPU cores are completely different processing units, it is foreseeable that they have different ISAs (instruction set architectures). That is, they speak different languages.
GPU threads differ from CPU threads and have extremely low creation overhead. CPUs reduce latency through multi-level caches, while GPUs hide latency with high throughput via pipelining. Their design goals differ: a CPU needs strong generality to handle many data types, and its logical decisions introduce large numbers of branch jumps and interrupt handling; a GPU, by contrast, faces large amounts of uniform, mutually independent data and a pure computing environment that does not need to be interrupted.
3.3 CUDA Thread Organization
Thread: The basic unit of parallelism
Thread Block: A group of threads that cooperate with each other, allowing synchronization and data exchange through fast shared memory, organized in 1D, 2D, or 3D, containing up to 1024 threads
Grid: A group of thread blocks organized in 1D, 2D, or 3D, sharing global variables
Kernel: The core program executed on the GPU; one kernel corresponds to one grid
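As a minimal illustration of the grid/block layout and the global-id calculation (the 2048×1024 size and the 16×16 block are assumptions made for this sketch, not values from the course):

```cpp
#include <cuda_runtime.h>

// Illustrative only: 2D grid/block layout and global-id computation.
__global__ void touch(float *data, int width, int height)
{
    // Per-thread 2D coordinates inside the whole grid.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width) {
        int globalId = row * width + col;   // row-major global id
        data[globalId] += 1.0f;
    }
}

int main()
{
    const int width = 2048, height = 1024;           // assumed sizes
    dim3 block(16, 16);                               // 256 threads per block
    dim3 grid((width  + block.x - 1) / block.x,       // ceil(width / 16)
              (height + block.y - 1) / block.y);      // ceil(height / 16)
    float *d_data;
    cudaMalloc(&d_data, width * height * sizeof(float));
    touch<<<grid, block>>>(d_data, width, height);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

The grid dimensions are rounded up so the whole data set is covered even when its size is not a multiple of the block size, which is why the kernel needs the bounds check.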
3.4 CUDA Host/Device Programming Model
3.4.1 Function Qualifiers
- `__device__`: Executed on the device and callable only from the device; used as a device-side sub-function.
- `__host__`: Executed on the host and callable only from the host, like an ordinary C function. It cannot be combined with `__global__`, but it can be combined with `__device__`, in which case the function is compiled for both the device and the host.
- `__global__`: Marks a kernel function, executed on the device but called (launched) from the host.
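A tiny kernel-only sketch of the three qualifiers (the function names are made up for illustration):

```cpp
// Device-side helper: callable only from device code.
__device__ float square(float x) { return x * x; }

// Compiled for both host and device when the two qualifiers are combined.
__host__ __device__ float twice(float x) { return 2.0f * x; }

// Kernel: runs on the device, launched from the host.
__global__ void scaleAndSquare(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = square(twice(out[i]));
}
```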
3.4.2 CUDA Kernel Function Limitations
- Can only access device memory
- Must return void type
- Does not support variable number of arguments
- Does not support static variables
- Exhibits asynchronous behavior, meaning the host does not wait for the kernel to finish executing before proceeding to the next step
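A small self-contained sketch of the asynchronous launch behavior (the kernel here is a made-up placeholder):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void setFlag(int *flag) { *flag = 1; }

int main()
{
    int *d_flag;
    cudaMalloc(&d_flag, sizeof(int));
    cudaMemset(d_flag, 0, sizeof(int));

    setFlag<<<1, 1>>>(d_flag);         // asynchronous: control returns to the host immediately
    printf("host keeps running while the kernel executes\n");

    cudaDeviceSynchronize();           // host blocks here until the kernel has finished
    int h_flag = 0;
    cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    printf("flag = %d\n", h_flag);     // prints 1
    cudaFree(d_flag);
    return 0;
}
```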
3.5 SIMT Parallel Computing Model
Thread blocks are the basic unit of program launch, while warps are the unit of program execution;
For example, a block size of 256 threads corresponds to 8 warps, since a warp always contains 32 threads. Although the program launches the block with 256 threads, they are not all executed, or completed, at the same time; the GPU's execution hardware runs them as 8 warps.
SIMT belongs to the category of SIMD because it also executes the same instruction on multiple data. However, SIMT allows users to allocate threads, specifically CUDA assigns an identifier (number) to each thread.
A key difference is that SIMD requires all elements in the same vector to be executed in a unified synchronous group, while SIMT allows multiple threads in the same warp to execute independently, and these threads can have different behaviors. Therefore, SIMT allows thread-level concurrency, meaning threads in the same warp can do different things simultaneously.
Three differences:
- Each thread has its own instruction address counter
- Each thread has its own register state
- Each thread can have an independent execution path
3.6 GPU Architecture
Streaming Multiprocessor SM
A thread block can only be scheduled on one SM, but one SM can correspond to multiple thread blocks.
When an SM assigns one or more thread blocks to be computed, these thread blocks are divided into multiple warps, waiting for scheduling.
Threads in a warp execute the same command on different data.
The number of thread blocks an SM can accommodate depends on the shared memory and registers within the SM and the resources occupied by the threads.
All threads in a thread block logically run in parallel, but not all threads can be executed simultaneously at the physical level. (An SM schedules only one warp at a time, with other warps waiting, and switching between different warps is zero-cost because the execution context of the warp is maintained by the SM throughout the warp’s lifecycle)
Example: the NVIDIA GeForce RTX 3090 (compute capability 8.6) has 82 SMs, and each SM allows at most 1536 resident threads. Theoretically, how many threads are executed in parallel at the same time, and how many are executed concurrently? 1536 threads per SM means at most 48 resident warps per SM. Since warps are interleaved by the warp scheduler while the 32 threads inside a warp execute in parallel, roughly 82*32 = 2624 threads execute in parallel at the same instant, and 82*48*32 = 125,952 threads execute concurrently.
3.7 Memory Model
| Memory Type | Location | Cached | Access | Lifetime |
|---|---|---|---|---|
| Register | On-chip | No | Device | Same as the thread (kernel lifetime) |
| Shared memory | On-chip | No | Device | Same as the block |
| Local memory | Off-chip | No | Device | Same as the thread (kernel lifetime) |
| Global memory | Off-chip | No | Device & host | Whole program |
| Texture memory, constant memory | Off-chip | Yes | Device & host | Whole program |
When are variables defined in a CUDA kernel function register variables, and when are they local variables?
The following three cases are local variables, others are register variables
- Arrays whose indices cannot be determined at compile time
- Arrays or structures that occupy a large space
- Many variables are defined in the kernel function, and registers cannot hold them
Registers that overflow spill into local memory, which physically resides in the same storage area as global memory (off-chip).
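A kernel-only sketch of the three cases (the array sizes are arbitrary assumptions):

```cpp
// Kernel-only sketch: which variables end up in registers vs. local memory.
__global__ void localVsRegister(float *out, int k)
{
    float a = 1.0f;                  // scalar: normally kept in a register
    float small4[4] = {1, 2, 3, 4};  // small array, constant indices only: can stay in registers

    float big[2048];                 // large array: placed in local memory (off-chip)
    for (int i = 0; i < 2048; ++i)
        big[i] = a + small4[3];

    // An index unknown at compile time by itself forces the indexed array
    // into local memory.
    out[threadIdx.x] = big[k & 2047];
}
```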
3.8 Memory Access Patterns
Global memory is loaded/stored through cache. All accesses to global memory go through the L2 cache (generally 128 bytes).
Aligned Access
The first address of the transferred segment is an integer multiple of the cache granularity (generally 32 bytes), i.e., it coincides with the start of a cache line
Coalesced Access
All threads in a warp access a contiguous block of memory. Coalesced access means that a memory access request by a warp to global memory results in the least amount of data transfer (coalescence degree = 100%), otherwise it is non-coalesced access.
5 access methods, calculation of coalescence degree??
If both read and write cannot be coalesced, then coalesced writing should be prioritized. Non-coalesced access to read-only data can be cached with the __ldg() function or converted to coalesced access using shared memory.
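A kernel-only sketch contrasting coalesced and strided (non-coalesced) reads; the use of `__ldg()` on the strided read follows the note above, and the stride parameter is an assumption:

```cpp
// Coalesced: consecutive threads in a warp read consecutive floats, so one
// warp request maps to the minimum number of memory transactions.
__global__ void coalescedCopy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Non-coalesced: consecutive threads read elements that are `stride` apart,
// so one warp request is scattered over many memory segments.
__global__ void stridedCopy(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = __ldg(&in[i * stride]);  // __ldg: read-only data cache
}
```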
3.9 Shared Memory and Bank Conflict
Shared memory can be directly manipulated by programmers. Shared memory is divided into many banks.
- All threads in a warp access the same bank at the same address - broadcast
- Different threads in a warp access different addresses in the same bank - bank conflict
- Several (but not all) threads in a warp access the same address in the same bank - multicast
Memory padding resolves bank conflicts
Padding operation: add 1 to the second dimension of sData, i.e., declare it as sData[BS][BS+1]
The padded part cannot be used for data storage, reducing the amount of available shared memory.
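A toy kernel operating on a single BS×BS tile, just to show the padded declaration and the bank arithmetic (BS = 32 is an assumption):

```cpp
#define BS 32

// Toy kernel on a single BS x BS tile, just to show the padded declaration.
__global__ void paddedTileDemo(const float *in, float *out)
{
    // With sData[BS][BS], a column access sData[threadIdx.x][c] touches the
    // addresses threadIdx.x*32 + c, which all map to bank c: a 32-way conflict.
    // With the extra column the addresses become threadIdx.x*33 + c, which
    // spread over 32 different banks: conflict-free.
    __shared__ float sData[BS][BS + 1];

    sData[threadIdx.y][threadIdx.x] = in[threadIdx.y * BS + threadIdx.x];
    __syncthreads();
    out[threadIdx.x * BS + threadIdx.y] = sData[threadIdx.x][threadIdx.y];
}
```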
4 Code
4.1 Image Flipping CPU
Multithreaded image flipping, with a manually maintained buffer acting as a cache.
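The original listing is not preserved in these notes; the following is a hedged C++ sketch of the idea (vertical flip, rows shared among the threads, each thread using its own row buffer as the manually maintained cache; the image layout of `height` rows of `rowBytes` bytes is an assumption):

```cpp
#include <cstring>
#include <thread>
#include <vector>

// Vertical flip: thread `tid` swaps its share of the top rows with the
// mirrored bottom rows, using a private row buffer as the "cache".
void flipRows(unsigned char *img, int rowBytes, int height, int tid, int nThreads)
{
    std::vector<unsigned char> buf(rowBytes);           // per-thread row buffer
    for (int r = tid; r < height / 2; r += nThreads) {  // interleaved partitioning
        unsigned char *top = img + (size_t)r * rowBytes;
        unsigned char *bot = img + (size_t)(height - 1 - r) * rowBytes;
        std::memcpy(buf.data(), top, rowBytes);
        std::memcpy(top, bot, rowBytes);
        std::memcpy(bot, buf.data(), rowBytes);
    }
}

void flipImage(unsigned char *img, int rowBytes, int height, int nThreads)
{
    std::vector<std::thread> workers;
    for (int t = 0; t < nThreads; ++t)
        workers.emplace_back(flipRows, img, rowBytes, height, t, nThreads);
    for (auto &w : workers) w.join();
}
```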
4.2 Array Addition
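The original code is likewise missing; a minimal sketch of the usual array-addition kernel and launch (N is an assumed length):

```cpp
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global id
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    // (host arrays would be filled and copied in with cudaMemcpy here)

    int block = 256;
    int grid  = (N + block - 1) / block;
    vecAdd<<<grid, block>>>(d_a, d_b, d_c, N);
    cudaDeviceSynchronize();

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```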
4.3 Image Flipping
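A hedged kernel-only sketch of a GPU image flip (horizontal flip of a single-channel `width × height` image, one thread per pixel; a 2D grid like the one in 3.3 would launch it):

```cpp
// Each thread copies one pixel to its horizontally mirrored position.
__global__ void flipHorizontal(const unsigned char *in, unsigned char *out,
                               int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        out[row * width + (width - 1 - col)] = in[row * width + col];
}
```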
4.4 Matrix Transposition
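A hedged sketch of the shared-memory transpose with padding (TILE = 32 is an assumption; the matrix is n×n, row-major):

```cpp
#define TILE 32

// Launch with dim3 block(TILE, TILE) and dim3 grid(ceil(n/TILE), ceil(n/TILE)).
__global__ void transpose(const float *in, float *out, int n)
{
    __shared__ float tile[TILE][TILE + 1];          // +1 avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;        // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;        // row in the input
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced read
    __syncthreads();

    // Swap block indices so the write to the output is also coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```

Both the global read and the global write are coalesced, and the padded tile avoids the bank conflict described in 3.9.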
4.5 Square Matrix Multiplication
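A hedged sketch of the tiled (shared-memory) square matrix multiplication; TILE = 16 and the assumption that n is a multiple of TILE keep it short:

```cpp
#define TILE 16

// C = A * B for n x n row-major matrices; assumes n % TILE == 0 for brevity.
// Launch with dim3 block(TILE, TILE) and dim3 grid(n/TILE, n/TILE).
__global__ void matMulTiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        sA[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        sB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```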
4.6 Histogram
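A hedged sketch of a histogram kernel using a grid-stride loop and atomic additions into per-block shared bins (256 bins for 8-bit data is an assumption):

```cpp
#define BINS 256

__global__ void histogram(const unsigned char *data, unsigned int *hist, int n)
{
    __shared__ unsigned int local[BINS];

    // Clear the per-block (shared-memory) histogram.
    for (int b = threadIdx.x; b < BINS; b += blockDim.x) local[b] = 0;
    __syncthreads();

    // Grid-stride loop: each thread walks the array with stride gridDim.x * blockDim.x.
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    // Merge the per-block histograms into the global one.
    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        atomicAdd(&hist[b], local[b]);
}
```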
4.7 Reduction Summation
Reduction summation is similar to TOP K; the code used in class was the official (NVIDIA) sample code.
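The official sample is not reproduced here; instead, a hedged sketch of a basic shared-memory tree reduction (block size assumed to be a power of two; each block writes one partial sum):

```cpp
__global__ void reduceSum(const float *in, float *partial, int n)
{
    extern __shared__ float sdata[];                 // blockDim.x floats

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];    // one partial sum per block
}
```

It would be launched as `reduceSum<<<grid, block, block * sizeof(float)>>>(d_in, d_partial, n);`, and the per-block partial sums are then summed on the CPU or by a second launch.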
4.8 TOP K
The specific implementation process is as follows (a hedged sketch follows the list):
- Copy the data to the GPU memory
float *d_data; cudaMalloc(&d_data, size); cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
- Store the data in tuples
- Perform reduction operation on the tuples to obtain the indices of the top K maximum/minimum values
- Restore the original data on the CPU and sort by index to get the top K maximum/minimum values
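The original tuple/reduction code is missing; below is a hedged sketch of only the K = 1 building block (an argmax reduction over (value, index) tuples). The full course solution kept K candidates per block, which is not reconstructed here; the struct and kernel names are made up.

```cpp
#include <cfloat>   // FLT_MAX

struct Tuple { float val; int idx; };                // value paired with its original index

__device__ Tuple maxTuple(Tuple a, Tuple b) { return (a.val >= b.val) ? a : b; }

// K = 1 building block: each block reduces its tuples to a single
// (max value, index) pair; the per-block winners are merged on the CPU.
__global__ void argmaxReduce(const float *data, Tuple *blockBest, int n)
{
    extern __shared__ Tuple s[];                     // blockDim.x tuples

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    s[tid].val = (i < n) ? data[i] : -FLT_MAX;
    s[tid].idx = (i < n) ? i : -1;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] = maxTuple(s[tid], s[tid + stride]);
        __syncthreads();
    }
    if (tid == 0) blockBest[blockIdx.x] = s[0];
}
```

Launch with `argmaxReduce<<<grid, block, block * sizeof(Tuple)>>>(d_data, d_blockBest, n);`; the per-block winners are then copied back and merged on the CPU, as in the last step above.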
5 Experiment
Experiment One: Three methods for calculating PI, thread pool
Experiment Two: Matrix multiplication, transposition, reduction, TOP K problem
Global memory, shared memory optimization, bank conflict optimization
Write a 3-4 page parallel computing report. CPU part: the three methods of computing PI and the producer-consumer model. GPU part: the main-program flow is described once (it is the same everywhere); only the kernels differ, with the focus on the global-memory and shared-memory implementations of matrix multiplication, the histogram (grid-stride loop), and the reduction that finds the maximum of a 1,000,000-element array. The report is handed in during the exam.
6 2020 Exam Questions
6.1 Short Answer Questions
Amdahl’s Law, n processors, 40% serial, calculate the speedup limit.
Given a 680×480 RGB image divided among 4 threads (how to divide it was not specified), give the pixel range and byte range processed by each thread;
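A possible worked answer (assuming 3 bytes per pixel, no row padding, and a contiguous row-wise split): the image has 680 × 480 = 326,400 pixels, i.e. 979,200 bytes; with 4 threads, each thread handles 120 rows = 81,600 pixels = 244,800 bytes, so thread t processes pixels [81,600·t, 81,600·(t+1) − 1] and bytes [244,800·t, 244,800·(t+1) − 1].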
PPT example question on warp parallel and concurrent quantity;
The NVIDIA GeForce RTX 3090 question (identical to the example worked through in 3.6): how many threads are executed in parallel at the same time, and how many are executed concurrently?
Global id of a specific element in the matrix transposition process;
Ask for the global id of element 3
Can data be copied while transposing (CUDA stream);
6.2 Program Analysis Questions
First question: the original in-class exercise with the launch configuration <<<4,4>>> changed to <<<5,5>>>; the execution process needs to be explained;
Second question, histogram reduction without atomic operation, ask what problems exist;
6.3 CPU Programming
Find the prime numbers in the array a[2, 1000000], using 10 threads that divide the range equally;
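A hedged C++ sketch of one possible answer (trial division, the range split into 10 equal contiguous chunks; all names are illustrative):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <thread>
#include <vector>

static bool isPrime(int x)
{
    if (x < 2) return false;
    for (int d = 2; d <= (int)std::sqrt((double)x); ++d)
        if (x % d == 0) return false;
    return true;
}

int main()
{
    const int LO = 2, HI = 1000000, T = 10;
    std::vector<long long> count(T, 0);            // per-thread counters: no locking needed
    std::vector<std::thread> workers;

    int chunk = (HI - LO + 1 + T - 1) / T;         // ceil division
    for (int t = 0; t < T; ++t)
        workers.emplace_back([&, t] {
            int lo = LO + t * chunk;               // each thread gets a contiguous range
            int hi = std::min(HI, lo + chunk - 1);
            for (int x = lo; x <= hi; ++x)
                if (isPrime(x)) ++count[t];
        });
    for (auto &w : workers) w.join();

    long long total = 0;
    for (long long c : count) total += c;
    std::printf("primes in [2, 1000000]: %lld\n", total);
    return 0;
}
```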
Thread pool pseudocode: a client and a server (email function, export function, traffic statistics, and a pile of other task functions);
6.4 GPU Programming
Matrix multiplication in global memory;
Inner product of vectors a and b, dimension = 1,024,000,000; blockDim.x = blockDim.y = 16 is fixed; design the grid; shared-memory optimization is required, bank conflicts must be resolved, and the partial results are copied back to the CPU for the final merge.
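One possible design (an assumption, not the official answer): blockDim = 16 × 16 = 256 threads per block, so 1,024,000,000 / 256 = 4,000,000 blocks are needed, e.g. dim3 grid(2000, 2000). Each block loads its 256 products into (padded) shared memory, reduces them as in 4.7, writes one partial sum, and the 4,000,000 partial sums are copied back and merged on the CPU.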
Impression: the non-code questions are child's play, but the code questions hit hard; there was no way to finish writing them all.


