Review Notes for the Parallel Computing Course
1 Review Outline
Chapter One, Amdahl’s Law: understanding the law (fixed workload, improvement in speed, speed-up ratio) and the limit of the achievable speed-up.
Application questions (6 points × 5)
Layout of grids and thread blocks, calculating the global id; parallelism, concurrency, warps, global id, multi-core CPU vs. many-core GPU.
Program analysis questions (10 points × 2)
Write the output of the given code and analyze why those results occur.
Multi-core CPU (10 points × 2)
Data partitioning: identify the data range handled by each part. Task parallelism: the thread-pool experiment.
CUDA programming (15 points × 2)
Concrete problems: design the grid and thread blocks, or, given the thread blocks, design only the grid; the main function follows a fixed procedure; the key is writing the kernel functions.
2 Parallel Computing
2.1 Concurrency and Parallelism
Serial: single-machine single-core, instructions execute in order.
Concurrency: single-machine single-core; instructions are interleaved in time, occurring within the same time interval.
Parallelism: single-machine multi-core, or multi-machine single/multi-core; instructions execute in parallel in space, occurring at the same moment.
Parallel computing is supercomputing performed on parallel computers, distributed systems, or other high-performance computing systems. It can reduce the time needed to solve a single problem, increase the scale and precision of solutions, and improve throughput, among other benefits.
Three classifications:
- By computation mode: time parallelism (pipelining), space parallelism (multiprocessors)
- By program logic: task parallelism, data parallelism
- By application: compute-intensive, data-intensive, network-intensive
2.2 Flynn’s Taxonomy
A method for classifying parallel computer architectures according to how instruction streams and data streams execute. It includes SISD (early serial machines), SIMD (single-core computers), MISD (rarely used), and MIMD (multi-core computers, parallel).
2.3 Amdahl’s Law
Assuming a fixed workload, Amdahl’s Law computes the speed-up ratio and reveals that the part of a program that cannot be parallelized limits the overall performance improvement. $$S=\frac{W_{s}+W_{p}}{W_{s}+W_{p}/p}$$ where $W_{s}$ is the serial workload, $W_{p}$ is the parallelizable workload, $p$ is the number of processors, and $S$ is the speed-up ratio. Using the serial fraction $f=W_{s}/W$ (with $W=W_{s}+W_{p}$), dividing numerator and denominator by $W$ gives $$S=\frac{f+(1-f)}{f+\frac{1-f}{p}} =\frac{p}{1+f(p-1)}$$ Since $\lim_{p\rightarrow \infty}S=1/f$, even when the number of processors grows without bound, the achievable speed-up is limited by the serial part of the program.
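For instance, with a serial fraction of 40% (the setting of the Grade-20 short-answer question later in these notes):
$$f=0.4 \quad\Rightarrow\quad \lim_{p\rightarrow\infty}S=\frac{1}{f}=\frac{1}{0.4}=2.5$$
so no matter how many processors are added, the speed-up cannot exceed 2.5.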
3 Overview of CUDA
3.1 Heterogeneous Computing
GPU parallel computing is a form of heterogeneous computing, split between the host side (CPU) and the device side (GPU). Their relationship has never been an equal one; CUDA even requires explicitly specifying where each piece of code runs.
3.2 Differences Between CPU and GPU
Intuitively speaking, the CPU allocates more resources to cache and control flow, whereas the GPU is geared more towards computational data processing.
- In a GPU environment, the cores of the GPU are responsible for executing all computational tasks, but the work instructions always come from the CPU.
- With GPUs, GPU cores never fetch data on their own; data always comes from the CPU side and the computed results are then sent back to the CPU. Thus, the GPU only plays the role of a computational accelerator in the background, undertaking certain outsourced tasks for the CPU.
- This type of architecture is only truly effective when there are a large number of parallel processing units instead of just 2 or 4.
- The concept of thread blocks has a significant impact on GPU architecture. Data must be fed to the GPU in chunks of the same size, a chunk being half a warp, that is, 16 elements.
- The fact that data must be transferred to the GPU cores in half-warp-sized chunks means that the storage system feeding the GPU should deliver 16 pieces of data at a time. This requires a parallel storage subsystem that can transmit 16 values at once, which is why the GPU’s DRAM is GDDR5: it is parallel memory.
- Since GPU cores and CPU cores are completely different processing units, it is to be expected that they have different ISAs (instruction set architectures); in other words, they speak different languages. GPU threads differ from CPU threads and have extremely low creation overhead. CPUs reduce latency through multi-level caches, whereas GPUs hide latency by raising throughput through pipelining. Because of their different design goals, CPUs need strong generality to handle many kinds of data, and the accompanying logic introduces a large amount of branch jumping and interrupt handling; GPUs, in contrast, process large amounts of independent data of a uniform type in a purely computational environment that does not need to be interrupted.
3.3 CUDA Thread Organization
- Thread: the basic unit of parallelism.
- Thread block: a group of cooperating threads; they can synchronize with one another and exchange data through fast shared memory; organized in 1D, 2D, or 3D; contains at most 1024 threads.
- Grid: a group of thread blocks; organized in 1D, 2D, or 3D; shares global variables.
- Kernel: the core program executed on the GPU; one kernel corresponds to one grid.
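A minimal sketch of the usual layout and global-id computation (the kernel name and the grid/block sizes are illustrative assumptions, not from the course code):

```cuda
// Each thread computes its own flattened global id and stores it.
__global__ void writeGlobalIds(int* out) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;       // global x index
    int y = blockIdx.y * blockDim.y + threadIdx.y;       // global y index
    int globalId = y * gridDim.x * blockDim.x + x;       // flattened (row-major) global id
    out[globalId] = globalId;
}

// Host-side launch: a 2-D grid of 2-D blocks, 16 x 16 = 256 threads per block.
// dim3 block(16, 16);
// dim3 grid(4, 4);                         // 4 * 16 = 64 threads in each dimension
// writeGlobalIds<<<grid, block>>>(d_out);  // d_out must hold 64 * 64 = 4096 ints
```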
3.4 CUDA Host/Device Programming Model
3.4.1 Function Specifiers
- `__device__`: executes on the device and can only be called from the device, i.e., used as a device-side subfunction.
- `__host__`: executes on the host and can only be called from the host, exactly like a regular C function. It cannot be combined with `__global__`, but it can be combined with `__device__`; in that case the function is compiled for both the device and the host.
- `__global__`: marks kernel functions, which execute on the device but are called from the host (a combined sketch follows).
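A minimal sketch combining the three specifiers (function names are illustrative, not from the course code):

```cuda
__device__ float square(float x) { return x * x; }             // device-only helper

__host__ __device__ float twice(float x) { return 2.0f * x; }  // compiled for both host and device

__global__ void scaleSquares(float* data, int n) {             // kernel: runs on the device,
    int i = blockIdx.x * blockDim.x + threadIdx.x;             // launched from the host
    if (i < n) data[i] = twice(square(data[i]));
}
```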
3.4.2 Restrictions on CUDA Kernel Functions
- Can only access device memory
- Must return a void type
- Do not support a variable number of arguments
- Do not support static variables
- Exhibit asynchronous behavior, meaning the host does not wait for the kernel execution to finish before proceeding (see the sketch below)
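A small sketch of the asynchronous launch behavior (kernel name and sizes are assumptions):

```cuda
__global__ void doubleAll(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// doubleAll<<<(n + 255) / 256, 256>>>(d_data, n);  // the launch returns immediately
// /* host code here may run while the kernel is still executing */
// cudaDeviceSynchronize();                          // wait before copying back or using results
```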
3.5 SIMT Parallel Computing Model
Thread blocks are the basic units of program initiation, and warps are the units of program execution;
For example, a block size of 256 threads means the thread block corresponds to 8 warps, since a warp always contains 32 threads. This tells us that although the thread block is launched with 256 threads, they will not all be executed or completed at the same moment; instead, the GPU’s execution hardware runs them as 8 warps.
SIMT belongs to the SIMD category because it also executes the same instruction on multiple data. However, SIMT lets the user organize the threads; specifically, CUDA assigns an identifier (number) to each thread. A key difference is that SIMD requires all elements of the same vector to execute together in a unified synchronization group, whereas SIMT allows the threads belonging to the same thread block to execute independently and to behave differently. SIMT therefore allows thread-level concurrency: threads in the same thread block can be doing different things at the same time. Three differences:
- Each thread has its own instruction address counter
- Each thread has its own register state
- Each thread can have an independent execution path
3.6 GPU Architecture
Streaming Multiprocessors (SMs): a thread block can be scheduled on only one SM, but one SM can host multiple thread blocks. When an SM is assigned one or more thread blocks, those blocks are divided into multiple warps that wait to be scheduled. The threads in a warp execute the same instruction on different data. How many thread blocks an SM can accommodate depends on the SM’s shared memory and registers and on the resources each thread occupies. All threads in a thread block run logically in parallel, but not all of them execute simultaneously at the physical level. (An SM schedules one warp at a time while the remaining warps wait; switching between warps is zero-cost because each warp’s execution context is kept on the SM for the warp’s entire lifetime.)
The NVIDIA GeForce RTX 3090 has compute capability 8.6 and contains 82 SMs; each SM can hold at most 1536 resident threads. Question: how many threads can theoretically execute in parallel at the same moment, and how many threads execute concurrently? With 82 SMs and up to 1536 resident threads (i.e., 48 warps) per SM, and since warps are scheduled concurrently by the warp schedulers while the 32 threads within a warp execute in parallel, broadly speaking 82 × 32 = 2624 threads execute in parallel at any given moment, and 82 × 32 × 48 = 125952 threads execute concurrently.
3.7 Memory Model
| Memory | Location | Cached | Access Rights | Lifetime |
|---|---|---|---|---|
| Register | On-chip | No | Device | Same as the thread (kernel function) |
| Shared memory | On-chip | No | Device | Same as the block |
| Local memory | On-board | No | Device | Same as the thread (kernel function) |
| Global memory | On-board | No | Device & Host | Program |
| Texture memory, constant memory | On-board | Yes | Device & Host | Program |
Under what circumstances are variables defined in a CUDA kernel register variables, and when are they local variables?
The following three cases are local variables, others are register variables
- Arrays whose size cannot be determined during the compilation phase
- Arrays or structures that take up a lot of space
- The kernel function defines many variables and there are not enough registers

Variables that spill from registers into local memory physically reside in the same storage area as global memory.
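A sketch of the distinction (kernel and variable names are illustrative):

```cuda
__global__ void registerVsLocal(const int* idx, float* out) {
    float a = 1.0f;            // scalar: a register variable
    float small4[4];           // small array indexed only by constants: can stay in registers
    float big[1000];           // large array: placed in local memory
    small4[0] = a; small4[1] = 2.0f; small4[2] = 3.0f; small4[3] = 4.0f;
    int i = threadIdx.x;
    big[idx[i] % 1000] = a;    // indexing not known at compile time also forces local memory
    out[i] = small4[0] + big[idx[i] % 1000];
}
```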
3.8 Memory Access Patterns
Global memory loads and stores go through caches; all accesses to global memory pass through the L2 cache (cache-line granularity, usually 128 bytes).
- Aligned access: the first address is an even multiple of the cache granularity (usually 32 bytes), i.e., where a cache line starts fetching data.
- Coalesced access: all threads in a warp access a contiguous block of memory. Coalesced access means that a single global-memory request from a warp triggers the minimum amount of data transfer (degree of coalescing = 100%); otherwise the access is non-coalesced.
Five types of access; how is the degree of coalescing calculated?
If neither reading nor writing can be coalesced, priority should be given to ensuring coalesced writes. Non-coalesced access to read-only data can be cached with the __ldg() function, or converted to coalesced access by going through shared memory.
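Two contrasting sketches (kernel names and the stride parameter are illustrative):

```cuda
// Coalesced: thread i touches element i, so a warp's 32 accesses hit consecutive addresses.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Non-coalesced: neighbouring threads are `stride` elements apart, so one warp request
// is split into several memory transactions (lower degree of coalescing).
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```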
3.9 Shared Memory and Bank Conflict
Shared memory can be directly manipulated by programmers. Shared memory is divided into many banks.
- All threads in a warp accessing the same address in the same bank - broadcast
- Different threads in a warp accessing different addresses in the same bank - bank conflict
- Multiple threads accessing the same address in the same bank - multicast
Memory Padding to Solve Bank Conflict
Padding operation: add 1 to the second dimension of sData, i.e., sData[BS][BS+1].
The padded portion cannot be used for data storage, resulting in a reduced amount of available shared memory.
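A minimal single-tile sketch of the padding trick (assuming BS = 32 and one BS × BS thread block):

```cuda
#define BS 32

__global__ void transposeTile(const float* in, float* out) {
    __shared__ float sData[BS][BS + 1];   // +1 column of padding
    int x = threadIdx.x, y = threadIdx.y;
    sData[y][x] = in[y * BS + x];         // row-wise write: conflict-free with or without padding
    __syncthreads();
    out[y * BS + x] = sData[x][y];        // column-wise read: a 32-way bank conflict without
                                          // the padding, conflict-free with it
}
```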
4 Code
4.1 Image Flip on CPU
Flipping an image with multiple threads while manually managing the cache.
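One possible sketch, assuming a vertical flip of an in-memory image: each thread flips a band of rows, and a per-thread row buffer plays the role of the manually managed cache.

```cpp
#include <algorithm>
#include <cstring>
#include <thread>
#include <vector>

// Swap row r with its mirror row for all rows in [rowBegin, rowEnd).
void flipRows(unsigned char* img, int width, int height, int bytesPerPixel,
              int rowBegin, int rowEnd) {
    int rowBytes = width * bytesPerPixel;
    std::vector<unsigned char> rowBuf(rowBytes);          // per-thread "cache" for one row
    for (int r = rowBegin; r < rowEnd; ++r) {
        unsigned char* top = img + (size_t)r * rowBytes;
        unsigned char* bottom = img + (size_t)(height - 1 - r) * rowBytes;
        std::memcpy(rowBuf.data(), top, rowBytes);
        std::memcpy(top, bottom, rowBytes);
        std::memcpy(bottom, rowBuf.data(), rowBytes);
    }
}

// Partition the top half of the rows evenly across nThreads threads.
void flipImage(unsigned char* img, int width, int height, int bytesPerPixel, int nThreads) {
    std::vector<std::thread> workers;
    int half = height / 2;
    int chunk = (half + nThreads - 1) / nThreads;
    for (int t = 0; t < nThreads; ++t) {
        int begin = t * chunk;
        int end = std::min(begin + chunk, half);
        if (begin < end)
            workers.emplace_back(flipRows, img, width, height, bytesPerPixel, begin, end);
    }
    for (auto& w : workers) w.join();
}
```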
4.2 Array Addition
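A minimal sketch of an element-wise array addition kernel (names and launch configuration are illustrative):

```cuda
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global id
    if (i < n) c[i] = a[i] + b[i];                   // one element per thread
}

// Host side:
// dim3 block(256);
// dim3 grid((n + 255) / 256);                       // enough blocks to cover n elements
// vecAdd<<<grid, block>>>(d_a, d_b, d_c, n);
```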
4.3 Image Flip
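A sketch of a horizontal flip on the GPU, assuming a width × height RGB image with 3 bytes per pixel and one thread per pixel:

```cuda
__global__ void flipHorizontal(const unsigned char* in, unsigned char* out,
                               int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    int src = (y * width + x) * 3;
    int dst = (y * width + (width - 1 - x)) * 3;     // mirrored column in the same row
    out[dst]     = in[src];                          // R
    out[dst + 1] = in[src + 1];                      // G
    out[dst + 2] = in[src + 2];                      // B
}

// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// flipHorizontal<<<grid, block>>>(d_in, d_out, width, height);
```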
4.4 Matrix Transpose
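A global-memory sketch with one thread per element (the shared-memory tiled variant with padding is sketched in section 3.9):

```cuda
// out becomes the transpose of the rows x cols row-major matrix in.
__global__ void transpose(const float* in, float* out, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols)
        out[col * rows + row] = in[row * cols + col];
}

// dim3 block(16, 16);
// dim3 grid((cols + 15) / 16, (rows + 15) / 16);
// transpose<<<grid, block>>>(d_in, d_out, rows, cols);
```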
4.5 Square Matrix Multiplication
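A tiled shared-memory sketch (TILE = 16 and N being a multiple of TILE are simplifying assumptions):

```cuda
#define TILE 16

// C = A * B for N x N row-major matrices; each block computes one TILE x TILE tile of C.
__global__ void matMulShared(const float* A, const float* B, float* C, int N) {
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        sA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];   // tile of A
        sB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col]; // tile of B
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

// dim3 block(TILE, TILE);
// dim3 grid(N / TILE, N / TILE);
// matMulShared<<<grid, block>>>(d_A, d_B, d_C, N);
```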
4.6 Histogram
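A sketch using per-block shared-memory histograms, atomic operations, and a grid-stride loop (256 bins for 8-bit data is an assumption):

```cuda
#define BINS 256

__global__ void histogram(const unsigned char* data, int n, unsigned int* hist) {
    __shared__ unsigned int sHist[BINS];
    for (int i = threadIdx.x; i < BINS; i += blockDim.x) sHist[i] = 0;
    __syncthreads();
    // Grid-stride loop: each thread walks the whole array in steps of gridDim.x * blockDim.x.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&sHist[data[i]], 1u);          // atomic: several threads may hit the same bin
    __syncthreads();
    for (int i = threadIdx.x; i < BINS; i += blockDim.x)
        atomicAdd(&hist[i], sHist[i]);           // merge the block histogram into the global one
}

// hist must be zeroed (e.g. with cudaMemset) before the launch.
```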
4.7 Reduction Sum
Reduction sum is similar to TOP K; a sketch of the usual structure follows.
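This sketch (not the official code) uses a grid-stride accumulation, a shared-memory tree reduction per block, and one atomicAdd per block; blockDim.x = 256 is an assumption.

```cuda
__global__ void reduceSum(const float* in, float* result, int n) {
    __shared__ float sdata[256];                  // one slot per thread, blockDim.x == 256
    unsigned int tid = threadIdx.x;
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        sum += in[i];                             // grid-stride accumulation in a register
    sdata[tid] = sum;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction in shared memory
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(result, sdata[0]);    // one atomic per block; *result must start at 0
}
```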
4.8 Shared Memory Summation Stage
This section outlines the process of performing per-block summation using shared memory within a CUDA kernel.
4.9 TOP K
The implementation process is as follows (a combined sketch appears after these steps):
- Copy data to GPU memory:
float *d_data; cudaMalloc(&d_data, size);
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
- Store the data as (value, index) tuples.
- Perform a reduction on the tuples to obtain the indices of the top K maximum/minimum values.
- Retrieve the original data on the CPU and sort it by index to obtain the top K maximum/minimum values.
After completing these steps, we obtain the sorted top K maximum/minimum values.
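A sketch of one possible kernel for the tuple-reduction step (the Tuple struct, the kernel name, and the K ≤ 32 limit are assumptions, not the course’s reference code): each thread keeps a private sorted top-K list while scanning the data with a grid-stride loop, the lists are merged pairwise in shared memory, and each block writes K candidates that the CPU merges and sorts at the end.

```cuda
#include <cfloat>

struct Tuple { float val; int idx; };             // (value, original index)

__global__ void topKPartial(const float* data, int n, Tuple* blockTop, int K) {
    extern __shared__ Tuple s[];                  // blockDim.x * K tuples
    Tuple* mine = &s[threadIdx.x * K];
    for (int j = 0; j < K; ++j) { mine[j].val = -FLT_MAX; mine[j].idx = -1; }
    // Private top-K (largest values), kept sorted in descending order.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x) {
        float v = data[i];
        if (v > mine[K - 1].val) {
            int j = K - 1;
            while (j > 0 && mine[j - 1].val < v) { mine[j] = mine[j - 1]; --j; }
            mine[j].val = v; mine[j].idx = i;
        }
    }
    __syncthreads();
    // Tree reduction: merge pairs of sorted top-K lists.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            Tuple* other = &s[(threadIdx.x + stride) * K];
            Tuple merged[32];                     // assumes K <= 32
            int a = 0, b = 0;
            for (int j = 0; j < K; ++j)
                merged[j] = (mine[a].val >= other[b].val) ? mine[a++] : other[b++];
            for (int j = 0; j < K; ++j) mine[j] = merged[j];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0)
        for (int j = 0; j < K; ++j) blockTop[blockIdx.x * K + j] = mine[j];  // K candidates per block
}

// Launch with blockDim.x * K * sizeof(Tuple) bytes of dynamic shared memory, e.g.
// topKPartial<<<blocks, 256, 256 * K * sizeof(Tuple)>>>(d_data, n, d_blockTop, K);
// The CPU then merges the blocks * K candidates and keeps the K largest.
```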
5 Experiment
Experiment 1: Three methods of calculating π, thread pool.
Experiment 2: Matrix multiplication, transposition, reduction, TOP K problem.
Global memory, shared-memory optimization, and bank-conflict optimization. A 3-4 page handwritten parallel-computing report is required, covering the three methods of calculating π, the producer-consumer pattern, and the GPU part. The main program flow needs to be described clearly at least once; the focus is the global-memory and shared-memory implementations of matrix multiplication, histograms (looping over the grid), and the reduction for the largest value in an array of 1,000,000 elements. The report is submitted during the exam.
6 Real Questions for Grade 20
6.1 Short Answer Questions
Amdahl’s Law: for n processors and a 40% serialization ratio, calculate the limit of the speedup.
Given a 680x480 RGB image distributed over 4 threads (the distribution method is not specified), detail the range of pixels and bytes each thread processes.
A question taken directly from the slides on warp parallelism and the number of concurrent threads:
The NVIDIA GeForce RTX 3090’s compute capability is 8.6; it has 82 SMs, each allowing at most 1536 resident threads, i.e., 48 warps per SM. Since warps are scheduled concurrently by the warp scheduler while the 32 threads within a warp execute in parallel, it can be broadly assumed that 82 × 32 = 2624 threads execute in parallel at any given moment, and 82 × 32 × 48 = 125952 threads execute concurrently.
Global id of a given element during matrix transposition (e.g., given element 3 in the transposition, calculate its global id).
Whether it’s possible to simultaneously copy and transpose data (CUDA streams).
6.2 Program Analysis Questions
The first question is a classroom exercise repeat with a <4,4> grid changed to <5,5>; the process needs to be explained.
The second question is about histogram reduction missing atomic operations; identifying the issue is required.
6.3 CPU Programming
Find the prime numbers in the array a[2,1000000], with the work evenly divided among 10 threads (a sketch follows). Thread-pool pseudocode: client side and server side (email function, export function, traffic statistics, and several other functions).
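A sketch of the even partitioning across 10 threads (the trial-division test and container choices are illustrative):

```cpp
#include <algorithm>
#include <thread>
#include <vector>

bool isPrime(long long x) {
    if (x < 2) return false;
    for (long long d = 2; d * d <= x; ++d)
        if (x % d == 0) return false;
    return true;
}

int main() {
    const long long lo = 2, hi = 1000000;
    const int nThreads = 10;
    const long long chunk = (hi - lo + 1 + nThreads - 1) / nThreads;   // even split of the range
    std::vector<std::vector<long long>> found(nThreads);               // per-thread results, no locking
    std::vector<std::thread> workers;
    for (int t = 0; t < nThreads; ++t) {
        long long begin = lo + t * chunk;
        long long end = std::min(begin + chunk - 1, hi);
        workers.emplace_back([&found, t, begin, end] {
            for (long long x = begin; x <= end; ++x)
                if (isPrime(x)) found[t].push_back(x);
        });
    }
    for (auto& w : workers) w.join();
    return 0;
}
```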
6.4 GPU Programming
Global-memory matrix multiplication; inner product of vectors a and b with dimension = 1024000000; blockDim.x = blockDim.y = 16 is fixed; design the grid; shared-memory optimization is required, bank conflicts must be resolved, and the result is sent back to the CPU for the final merge (a sketch follows).
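A sketch for the inner-product part under the stated constraints (the grid size of 4096 blocks and the kernel name are assumptions; the sequential-addressing tree reduction used here keeps the shared-memory accesses free of bank conflicts):

```cuda
#define BX 16
#define BY 16
#define TPB (BX * BY)   // 256 threads per block, as fixed by the problem

// Each thread accumulates a private partial product over a grid-stride loop; the block
// reduces those 256 values in shared memory and writes one partial sum per block.
__global__ void dotPartial(const float* a, const float* b, float* blockSums, size_t n) {
    __shared__ float sdata[TPB];
    int tid = threadIdx.y * blockDim.x + threadIdx.x;     // flatten the 16 x 16 block
    size_t start = (size_t)blockIdx.x * TPB + tid;
    size_t stride = (size_t)gridDim.x * TPB;
    float sum = 0.0f;
    for (size_t i = start; i < n; i += stride)
        sum += a[i] * b[i];
    sdata[tid] = sum;
    __syncthreads();
    for (int s = TPB / 2; s > 0; s >>= 1) {               // sequential addressing: no bank conflicts
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = sdata[0];
}

// Host side: a fixed 1-D grid is enough because of the grid-stride loop.
// dim3 block(BX, BY);
// dim3 grid(4096);
// dotPartial<<<grid, block>>>(d_a, d_b, d_blockSums, n);
// cudaMemcpy(h_blockSums, d_blockSums, 4096 * sizeof(float), cudaMemcpyDeviceToHost);
// The CPU sums the 4096 partial results to obtain the final inner product.
```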
Reflection: the non-coding questions give away easy points, but the coding questions are heavy hitters and nearly impossible to finish in time.