Review Notes for the Parallel Computing Course

1 Review Outline

Chapter One, Amdahl's Law: understanding the law (with the workload fixed, how the speed-up ratio improves as processors are added) and the upper limit on achievable acceleration

Application questions (6 points × 5)

Layout of grids and thread blocks, calculating the global id; parallelism vs. concurrency, thread warps, global id, multi-core CPU vs. many-core GPU

Program analysis questions (10 points × 2)

Write the output of the given code and analyze why that output occurs

Multi-core CPU programming (10 points × 2)

Data partitioning: specify the data range handled by each part. Task parallelism: the thread-pool experiment

CUDA programming (15 points × 2)

Concrete problems: design the grid and thread blocks (or, given the thread blocks, design only the grid). The main function follows a fixed procedure; the key is writing the kernel functions.

2 Parallel Computing

2.1 Concurrency and Parallelism

Serial: single-machine single-core, instructions execute in order.

Concurrency: single-machine single-core; instructions overlap in time (time-sharing), occurring within the same time interval.

Parallelism: single-machine multi-core, multi-machine single/multi-core, instructions execute in parallel in space, occurring at the same moment.

Parallel computing is supercomputing performed on parallel computers or distributed systems and other high-performance computing systems. Parallel computing can reduce the time it takes to solve a single problem, increase the scale and precision of solutions, and improve throughput, among other benefits.

Three classifications:

  • By computation mode: time parallelism (pipelining), space parallelism (multiprocessors)
  • By program logic: task parallelism, data parallelism
  • By application: compute-intensive, data-intensive, network-intensive

2.2 Flynn’s Taxonomy

A method for classifying parallel computing architectures based on the execution mode of instruction streams and data streams. It includes SISD (early serial machines), SIMD (single-core computers), MISD (rarely used), and MIMD (multi-core computers, true parallelism).

2.3 Amdahl’s Law

Assuming a fixed workload, Amdahl's law computes the speed-up ratio and reveals that the unparallelizable part of a program limits its overall performance improvement:

$$S=\frac{W_{s}+W_{p}}{W_{s}+W_{p}/p}$$

where $W_{s}$ is the serial portion of the work, $W_{p}$ is the parallelizable portion, $p$ is the number of processors, and $S$ is the speed-up ratio. With the serial fraction $f=W_{s}/W$ (where $W=W_{s}+W_{p}$), dividing numerator and denominator by $W$ gives

$$S=\frac{f+(1-f)}{f+\frac{1-f}{p}}=\frac{p}{1+f(p-1)}$$

Since $\lim_{p\rightarrow \infty}S=1/f$, even with an unlimited number of processors the achievable speed-up is bounded by the serial part of the program.

1. In a serial application where 20% of the work must execute serially, how many CPUs are needed to achieve a 3-fold speed-up? How many for a 5-fold speed-up? (A worked sketch follows this list.)
2. A program running in parallel on 5 computers has a 10% parallel portion. What is the speed-up compared with serial execution on a single computer? How many CPUs would be needed to double that speed-up?
3. An application whose unparallelizable part accounts for 5% is to be converted into a parallel program. Two parallel computers are on the market: computer X has 4 CPUs, each able to run the whole application in 1 hour; computer Y has 16 CPUs, each able to run it in 2 hours. To minimize running time, which computer should you buy?
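
A worked sketch of question 1, using the formula above with $f=0.2$:

$$3=\frac{p}{1+0.2(p-1)}\ \Rightarrow\ 3\bigl(1+0.2(p-1)\bigr)=p\ \Rightarrow\ 0.4p=2.4\ \Rightarrow\ p=6.$$

For a 5-fold speed-up, note that $\lim_{p\rightarrow\infty}S=1/f=5$: the target equals the limit itself, so no finite number of CPUs can reach it.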

3 Overview of CUDA

3.1 Heterogeneous Computing

GPU parallel computing is a form of heterogeneous computing split between the host side (CPU) and the device side (GPU). Their relationship is not symmetric: CUDA requires the programmer to state explicitly where each piece of code runs.

3.2 Differences Between CPU and GPU

Intuitively speaking, the CPU allocates more resources to caches and control flow, whereas the GPU devotes more of its resources to data computation.

  1. In a GPU environment, the cores of the GPU are responsible for executing all computational tasks, but the work instructions always come from the CPU.
  2. With GPUs, GPU cores never fetch data on their own; data always comes from the CPU side and the computed results are then sent back to the CPU. Thus, the GPU only plays the role of a computational accelerator in the background, undertaking certain outsourced tasks for the CPU.
  3. This type of architecture is only truly effective when there are a large number of parallel processing units instead of just 2 or 4.
  4. The concept of the thread block has a significant impact on GPU architecture. Data is fed to the GPU cores in fixed-size chunks of half a warp, i.e., 16 elements.
  5. Because data must reach the GPU cores in half-warp granularity, the memory system feeding the GPU should deliver 16 data items at a time. This requires a parallel memory subsystem able to transfer 16 values at once, which is why the GPU's DRAM is built from GDDR5: it is a parallel memory.
  6. Since GPU cores and CPU cores are completely different processing units, it is to be expected that they have different ISAs (instruction set architectures); in other words, they speak different languages. GPU threads also differ from CPU threads in that their creation overhead is extremely low. CPUs reduce latency with multi-level caches, whereas GPUs hide latency by raising throughput through deep pipelining and massive threading. Because of their different design goals, a CPU needs strong generality to handle many kinds of data, and its logic brings a large amount of branching and interrupt handling; a GPU, in contrast, processes large volumes of independent, uniformly typed data in a purely computational setting that does not need to be interrupted.


3.3 CUDA Thread Organization


  • Thread: the basic unit of parallelism.
  • Thread block: a group of cooperating threads; they can synchronize with one another and exchange data through fast shared memory. Organized in 1D, 2D, or 3D; a block can contain up to 1024 threads.
  • Grid: a group of thread blocks, organized in 1D, 2D, or 3D; blocks in a grid share global variables.
  • Kernel: the core program executed on the GPU; one kernel corresponds to one grid (an indexing sketch follows).
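
A minimal kernel sketch (my own illustration, not from the course slides) of how a global thread id is typically computed for a 2D grid of 2D blocks:

```cpp
__global__ void global_id_2d(int *out, int n)
{
    // Flatten the 2D block index, then the 2D thread index within the block.
    int blockId  = blockIdx.y * gridDim.x + blockIdx.x;
    int localId  = threadIdx.y * blockDim.x + threadIdx.x;
    int globalId = blockId * (blockDim.x * blockDim.y) + localId;
    if (globalId < n)
        out[globalId] = globalId;
}
// Example launch: dim3 grid(4, 3), block(16, 16);  global_id_2d<<<grid, block>>>(d_out, n);
```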

3.4 CUDA Host/Device Programming Model

3.4.1 Function Specifiers

  • __device__: executes on the device and can only be called from the device, i.e., it serves as a device-side helper function.
  • __host__: executes on the host and can only be called from the host, identical to an ordinary C function. It cannot be combined with __global__, but it can be combined with __device__; in that case the function is compiled for both the device and the host.
  • __global__: marks a kernel function, executed on the device but called from the host (see the combined sketch below).
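
A minimal sketch (function names are my own, not from the notes) illustrating the three specifiers together:

```cpp
// __device__: runs on the device, callable only from device code (a device-side helper).
__device__ float square(float x) { return x * x; }

// __host__ __device__: compiled for both sides, callable from CPU and GPU code.
__host__ __device__ float twice(float x) { return 2.0f * x; }

// __global__: a kernel, executed on the device but launched from the host.
__global__ void scale_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = twice(square(data[i]));
}
```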

3.4.2 Restrictions on CUDA Kernel Functions

  1. Can only access device memory
  2. Must return a void type
  3. Do not support a variable number of arguments
  4. Do not support static variables
  5. Exhibit asynchronous behavior: the host does not wait for the kernel to finish before continuing (see the sketch after this list)
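
A short sketch of point 5 (illustrative only; some_kernel, d_data, and the launch parameters are assumed names): the host must synchronize explicitly, or rely on an implicitly synchronizing call, before using the kernel's results.

```cpp
some_kernel<<<grid_size, block_size>>>(d_data, n);  // the launch returns immediately on the host
// ... the host can queue more work here while the kernel runs ...
cudaDeviceSynchronize();                            // block the host until the kernel finishes
// A cudaMemcpy from device to host on the default stream would also wait for the
// preceding kernel before copying, which is why later examples read results right after it.
```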

3.5 SIMT Parallel Computing Model

Thread blocks are the basic unit at which a kernel is launched; warps are the unit of execution.

For example, a block size of 256 threads corresponds to 8 warps, since a warp always contains 32 threads. Although each thread block is launched with 256 threads, this does not mean they all execute or finish at the same moment; instead, the GPU's execution hardware runs them as 8 warps.

SIMT belongs to the SIMD category because it also executes the same instruction on multiple data. However, SIMT lets the user manage threads: CUDA assigns an identifier (number) to each thread. A key difference is that SIMD requires all elements of a vector to execute together in a single synchronous group, whereas SIMT allows the threads of a thread block to execute independently and behave differently. SIMT therefore permits thread-level concurrency: threads in the same thread block can be doing different things at the same time. Three differences (a divergence sketch follows the list):

  • Each thread has its own instruction address counter
  • Each thread has its own register state
  • Each thread can have an independent execution path
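
A small sketch (my own example) of the independent execution paths SIMT allows; within one warp the two branches are serialized, which is the usual warp-divergence caveat:

```cpp
__global__ void divergent(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        out[i] = 2 * i;   // even-numbered lanes take this path
    else
        out[i] = i + 1;   // odd-numbered lanes take this path
    // Lanes of the same warp follow different paths; the hardware runs the two
    // paths one after the other with the inactive lanes masked off.
}
```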

3.6 GPU Architecture

Streaming Multiprocessor (SM). A thread block is scheduled on exactly one SM, but one SM can host multiple thread blocks. When an SM is assigned one or more thread blocks, they are divided into warps that wait to be scheduled. The threads of a warp execute the same instruction on different data. How many thread blocks an SM can hold depends on the SM's shared memory and registers and on the resources each thread consumes. All threads in a block run logically in parallel, but not all of them execute simultaneously at the physical level. (An SM schedules one warp at a time while the remaining warps wait; switching between warps costs nothing, because each warp's execution context is kept on the SM for the warp's entire lifetime.)

The NVIDIA GeForce RTX 3090 (compute capability 8.6) has 82 SMs, and each SM can hold at most 1536 resident threads. Question: how many threads can theoretically execute in parallel at one moment, and how many threads can execute concurrently? Each SM allows up to 1536 resident threads, i.e., up to 48 warps. Warps are interleaved by the warp schedulers, and the 32 threads of a warp execute in parallel, so roughly 82 × 32 = 2624 threads execute in parallel at any one moment, while the number of concurrently executing threads is 82 × 32 × 48 = 125952.

3.7 Memory Model

| Memory | Location | Cached | Access rights | Lifetime |
| --- | --- | --- | --- | --- |
| Register | On-chip | No | Device | Same as the thread (kernel function) |
| Shared memory | On-chip | No | Device | Same as the block |
| Local memory | On-board | No | Device | Same as the thread (kernel function) |
| Global memory | On-board | No | Device & host | Program |
| Texture memory, constant memory | On-board | Yes | Device & host | Program |

Under what circumstances are variables defined in a CUDA kernel register variables, and when are they local variables?

The following three cases produce local variables; all other cases are register variables (see the sketch after this list):

  • Arrays whose size cannot be determined during the compilation phase
  • Arrays or structures that take up a lot of space
  • The kernel function defines so many variables that there are not enough registers

Variables that spill from registers into local memory physically reside in the same storage area as global memory.
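
A sketch (the sizes are my own) of where kernel-local variables typically end up:

```cpp
__global__ void variable_homes(float *out, int n)
{
    float x = 1.0f;                 // plain scalar: a register variable
    float tiny[4] = {0, 1, 2, 3};   // small fixed-size array: usually kept in registers
    float big[5000];                // large array: spills to local (off-chip) memory
    for (int i = 0; i < 5000; i++)
        big[i] = x + tiny[i & 3];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) out[gid] = big[gid % 5000];
}
```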


3.8 Memory Access Patterns

Global memory loads and stores go through caches; all accesses to global memory pass through the L2 cache (cache lines are usually 128 bytes).

Aligned access: the first address of a transaction is an even multiple of the cache granularity (usually 32 bytes); this is where a cache line starts fetching data.

Coalesced access: all threads of a warp access a contiguous block of memory. An access is coalesced when a warp's single request to global memory results in the minimum amount of data transferred (coalescence = 100%); otherwise it is non-coalesced.

5 types of access, how to calculate coalescence?

If neither reads nor writes can be coalesced, give priority to making the writes coalesced. Non-coalesced accesses to read-only data can be cached with the __ldg() function, or converted into coalesced accesses by staging the data in shared memory. A contrast sketch follows.
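
A sketch (my own example, with an assumed access layout) contrasting a coalesced read with a strided one, and using __ldg() on the read-only path:

```cpp
// Coalesced: thread k of a warp reads element k; the warp's 32 float reads fall
// into one or two 128-byte segments.
__global__ void read_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Non-coalesced: a stride of 32 floats scatters the warp's reads over many segments.
// __ldg() (compute capability >= 3.5) routes the read-only load through the
// read-only data cache as a partial mitigation.
__global__ void read_strided(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n / 32) out[i] = __ldg(&in[i * 32]);
}
```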

3.9 Shared Memory and Bank Conflict

Shared memory can be directly manipulated by programmers. Shared memory is divided into many banks.

  • All threads in a warp accessing the same address in the same bank - broadcast
  • Different threads in a warp accessing different addresses in the same bank - bank conflict
  • Multiple threads accessing the same address in the same bank - multicast

Memory padding to resolve bank conflicts:

Padding operation: add 1 to the second dimension of sData, i.e., sData[BS][BS+1]. The padded column cannot be used for data storage, so the amount of usable shared memory is slightly reduced.
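
A minimal declaration sketch of this padding (BS is assumed to be the tile size, and the declaration lives inside the kernel):

```cpp
#define BS 32
// Without the +1, threads reading a column of sData would all hit the same bank.
// The extra column shifts successive rows by one bank, so a column access now
// touches BS different banks; the padded column itself never stores data.
__shared__ float sData[BS][BS + 1];
```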

4 Code

4.1 Image Flip on CPU

Flipping an image with multiple threads while manually managing the cache.

void * MTFlipHM(void * tid){
        struct Pixel pix; // temp swap pixel
        int row, col;
        int id = *((int *) tid);
        int start = id * ip.Vpixels/NumThreads;   // first row handled by this thread
        int end = start + ip.Vpixels/NumThreads;  // one past the last row
        unsigned char buffer[16384];              // row buffer ("cache") managed by hand
        for (row = start; row < end; row++)
        {
            memcpy(buffer, TheImage[row], ip.Hbytes);   // bring the row into the buffer
            col = 0;
            while (col < ip.Hpixels * 3 / 2){           // swap pixels up to the middle of the row
                pix.B = buffer[col];
                pix.G = buffer[col+1];
                pix.R = buffer[col+2];
                buffer[col]   = buffer[ip.Hpixels*3 - col - 3];
                buffer[col+1] = buffer[ip.Hpixels*3 - col - 2];
                buffer[col+2] = buffer[ip.Hpixels*3 - col - 1];
                buffer[ip.Hpixels*3 - col - 3] = pix.B;
                buffer[ip.Hpixels*3 - col - 2] = pix.G;
                buffer[ip.Hpixels*3 - col - 1] = pix.R;
                col += 3;
            }
            memcpy(TheImage[row], buffer, ip.Hbytes);   // write the flipped row back, inside the loop
        }
        pthread_exit(NULL);
}

void * MTFlipVM(void * tid){
        struct Pixel pix; //temp swap pixel
        int row, col;
        int id = *((int *) tid);
        int start = id * ip.Vpixels/NumThreads;
        int end = start + ip.Vpixels/NumThreads;
        unsigned char buffer1[16384], buffer2[16384];
        for (row = start; row < end; row ++)
        {
            memcpy(buffer1,TheImage[row],ip.Hbytes);
            int  mirrow = ip.Vpixels - 1 - row;
            memcpy(buffer2,TheImage[mirrow],ip.Hbytes);
            // Swap by misplacing copies
            memcpy(TheImage[row],buffer2,ip.Hbytes);
            memcpy(TheImage[mirrow],buffer1,ip.Hbytes);
        }
        pthread_exit(NULL);
}

for(i=0; i<NumThreads; i++)
    pthread_create(&ThHandle[i], &ThAttr, MTFlipFunc, (void *)&ThParam[i]);
for(i=0; i<NumThreads; i++)
    pthread_join(ThHandle[i], NULL);

4.2 Array Addition

const int a = 1, b = 2, c = 3;   // fill values; expected result: z[i] == a + b == c
__global__ void add(const double *x, const double *y, double *z, const int N){
    const int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n < N)  z[n] = x[n] + y[n];
}
int main(){
    const int N = 1e9;
    const size_t M = sizeof(double) * N;   // size_t: the byte count does not fit in an int
    double *h_x = (double*) malloc(M);
    double *h_y = (double*) malloc(M);
    double *h_z = (double*) malloc(M);
    for (int i = 0; i < N; i++)
    {
        h_x[i] = a;
        h_y[i] = b;
    }
    double *d_x, *d_y, *d_z;
    cudaMalloc((void**) &d_x, M);
    cudaMalloc((void**) &d_y, M);
    cudaMalloc((void**) &d_z, M);
    cudaMemcpy(d_x, h_x, M, cudaMemcpyHostToDevice);   // inputs must be copied to the device
    cudaMemcpy(d_y, h_y, M, cudaMemcpyHostToDevice);

    const int block_size = 128;
    const int grid_size = (N + block_size - 1) / block_size;
    add<<<grid_size, block_size>>>(d_x, d_y, d_z, N);  // N is passed to the kernel
    cudaMemcpy(h_z, d_z, M, cudaMemcpyDeviceToHost);
    free(h_x); free(h_y); free(h_z);
    cudaFree(d_x); cudaFree(d_y); cudaFree(d_z);
}

4.3 Image Flip

__global__ void Vflip(uch *ImgDst, uch *ImgSrc, ui Hpixels, ui Vpixels){
    ui ThrPerBlk = blockDim.x;
    ui MYbid = blockIdx.x;
    ui MYtid = threadIdx.x;
    ui MYgtid = ThrPerBlk * MYbid + MYtid;
    ui BlkPerRow = (Hpixels + ThrPerBlk - 1) / ThrPerBlk; // ceil
    ui RowBytes = (Hpixels * 3 + 3) & (~3);               // rows padded to a multiple of 4 bytes
    ui MYrow = MYbid / BlkPerRow;
    ui MYcol = MYgtid - MYrow * BlkPerRow * ThrPerBlk;
    if (MYcol >= Hpixels) return; // col out of range
    ui MYmirrorrow = Vpixels - 1 - MYrow;
    ui MYsrcOffset = MYrow * RowBytes;
    ui MYdstOffset = MYmirrorrow * RowBytes;
    ui MYsrcIndex = MYsrcOffset + 3 * MYcol;
    ui MYdstIndex = MYdstOffset + 3 * MYcol;
    // copy the pixel's RGB bytes from row MYrow to the mirrored row MYmirrorrow
    ImgDst[MYdstIndex]     = ImgSrc[MYsrcIndex];
    ImgDst[MYdstIndex + 1] = ImgSrc[MYsrcIndex + 1];
    ImgDst[MYdstIndex + 2] = ImgSrc[MYsrcIndex + 2];
}

__global__ void Hflip(unsigned char *ImgDst, unsigned char *ImgSrc, unsigned int Hpixels){
    unsigned int ThrPerBlk = blockDim.x;
    unsigned int MYbid = blockIdx.x;
    unsigned int MYtid = threadIdx.x;
    unsigned int MYgtid = ThrPerBlk * MYbid + MYtid;
    unsigned int BlkPerRow = (Hpixels + ThrPerBlk - 1) / ThrPerBlk; // ceil
    unsigned int RowBytes = (Hpixels * 3 + 3) & (~3);               // rows padded to a multiple of 4 bytes
    unsigned int MYrow = MYbid / BlkPerRow;
    unsigned int MYcol = MYgtid - MYrow * BlkPerRow * ThrPerBlk;
    if (MYcol >= Hpixels) return; // col out of range
    unsigned int MYmirrorcol = Hpixels - 1 - MYcol;
    unsigned int MYoffset = MYrow * RowBytes;
    unsigned int MYsrcIndex = MYoffset + 3 * MYcol;
    unsigned int MYdstIndex = MYoffset + 3 * MYmirrorcol;
    // copy the pixel's RGB bytes from column MYcol to the mirrored column MYmirrorcol
    ImgDst[MYdstIndex]     = ImgSrc[MYsrcIndex];
    ImgDst[MYdstIndex + 1] = ImgSrc[MYsrcIndex + 1];
    ImgDst[MYdstIndex + 2] = ImgSrc[MYsrcIndex + 2];
}

4.4 Matrix Transpose

__global__ void transpose(int a[], int b[], int N){
    //Allocate shared memory
    __shared__ int S[TILE][TILE + 1];
    int bx = blockIdx.x * TILE;
    int by = blockIdx.y * TILE;
    int ix = bx + threadIdx.x;
    int iy = by + threadIdx.y;
    if (ix < N && iy < N)// Read into shared memory
        S[threadIdx.y][threadIdx.x] = a[iy * N + ix];
    __syncthreads();//Synchronization, this is essential
    int ix2 = bx + threadIdx.y;
    int iy2 = by + threadIdx.x;
    if (ix2 < N && iy2 < N)// Write back
        b[ix2 * N + iy2 ] = S[threadIdx.x][threadIdx.y];
}
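
A possible host-side launch for this kernel (a sketch: TILE is assumed to be 32 and defined as a compile-time constant visible to the kernel, and d_a, d_b are device buffers):

```cpp
const int TILE = 32;                                       // must match the TILE used in the kernel
dim3 block(TILE, TILE);                                    // one tile of the matrix per thread block
dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);   // enough tiles to cover the N x N matrix
transpose<<<grid, block>>>(d_a, d_b, N);
```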

4.5 Square Matrix Multiplication

__global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd){ // WIDTH, TILE_WIDTH: compile-time constants
__shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
__shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
int bx = blockIdx.x; int by = blockIdx.y;
int tx = threadIdx.x; int ty = threadIdx.y;
int Row = by * TILE_WIDTH + ty;
int Col = bx * TILE_WIDTH + tx;
float Pvalue = 0;
for(int m = 0; m < WIDTH/TILE_WIDTH; ++m)
    {
    // Each thread loads one element of the current M tile
    Mds[ty][tx] = Md[Row*WIDTH + (m*TILE_WIDTH + tx)];
    // Each thread loads one element of the current N tile
    Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*WIDTH + Col];
    __syncthreads();
    for (int k = 0; k < TILE_WIDTH; ++k)
        Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();
    }
Pd[Row*WIDTH+Col] = Pvalue; // Write the result back to the P matrix
}
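
A possible host-side launch sketch for the kernel above, under the usual assumption that WIDTH is a multiple of TILE_WIDTH:

```cpp
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(WIDTH / TILE_WIDTH, WIDTH / TILE_WIDTH);   // one block per output tile
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd);     // Md, Nd, Pd are device pointers
```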

4.6 Histogram

#define SIZE (100*1024*1024)
//Random byte stream generated by the utility function big_random_block()
unsigned char *buffer =(unsigned char*)big_random_block( SIZE );
unsigned int histo[256];
for (int i = 0; i<256; i++)
    histo[i] = 0;
for (int i = 0; i < SIZE; i++)
    histo[buffer[i]]++;
long histoCount = 0;
for (int i = 0; i<256; i++) {
    histoCount += histo[i]; }

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo){
__shared__ unsigned int temp[256];
temp[threadIdx.x] = 0;
__syncthreads();
int i = threadIdx.x + blockIdx.x * blockDim.x;
int offset = blockDim.x *gridDim.x;
while (i<size){
    atomicAdd(&temp[buffer[i]], 1);
    i += offset;
}
__syncthreads();
atomicAdd(&(histo[threadIdx.x]), temp[threadIdx.x]);
}
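
The kernel above assumes exactly 256 threads per block, one thread per histogram bin. A launch sketch in the spirit of the CUDA by Example original, where the block count is derived from the number of SMs (d_buffer and d_histo are assumed device copies of the byte stream and the 256-bin histogram; the factor of 2 is a heuristic, not a fixed rule):

```cpp
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int blocks = prop.multiProcessorCount * 2;               // heuristic: two blocks per SM
histo_kernel<<<blocks, 256>>>(d_buffer, SIZE, d_histo);  // exactly 256 threads: one per bin
```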

4.7 Reduction Sum

Reduction sum is structured much like TOP K. Below is the reference code.

__global__ void _sum_gpu(int *input, int count, int *output)
{
    __shared__ int sum_per_block[BLOCK_SIZE];

    int temp = 0;
    for (int idx = threadIdx.x + blockDim.x * blockIdx.x;
         idx < count; idx += gridDim.x * blockDim.x
   )
    {// Loop across grids, one thread adds multiple data, for dealing with massive data
        temp += input[idx];
    }

    sum_per_block[threadIdx.x] = temp;  //the per-thread partial sum is temp!
    __syncthreads();

4.8 Shared Memory Summation Stage

This section outlines the process of performing per-block summation using shared memory within a CUDA kernel.

//**********Shared Memory Summation Stage***********
for (int length = BLOCK_SIZE / 2; length >= 1; length /= 2)
{
    int intermediate_sum = -1;
    if (threadIdx.x < length)
    {
        intermediate_sum = sum_per_block[threadIdx.x] + sum_per_block[threadIdx.x + length];
    }
    __syncthreads();  // Why two __syncthreads()? This first barrier ensures every thread has read its pair of partial sums before any of them are overwritten.
    
    if (threadIdx.x < length)
    {
        sum_per_block[threadIdx.x] = intermediate_sum;
    }
    __syncthreads();  // The second barrier ensures all writes have landed before the next, shorter round reads them.
    
} // The partial sum for each block is now sum_per_block[0]

if (blockDim.x * blockIdx.x < count) // To handle edge cases
{
    // Final reduction using atomicAdd()
    if (threadIdx.x == 0) atomicAdd(output, sum_per_block[0]);
}

4.9 TOP K

The implementation process is as follows:

  1. Copy the data to GPU memory (the device buffer must be allocated first): float *d_data; cudaMalloc((void**)&d_data, size); cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
  2. Store the data in (value, index) tuples:
typedef struct { float value; int index; } Tuple;  
Tuple *d_tuples; 
int threadsPerBlock = 256; 
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
initializeTuples<<<blocksPerGrid, threadsPerBlock>>>(d_data, d_tuples, n);
  3. Perform a reduction on the tuples to obtain the indices of the top K maximum/minimum values:
int *d_indices;
cudaMalloc((void**)&d_indices, blocksPerGrid * sizeof(int));   // one winning index per block
// The third launch argument sizes the extern __shared__ array used inside the kernel.
kReduceKernel<<<blocksPerGrid, threadsPerBlock, threadsPerBlock * sizeof(Tuple)>>>(d_tuples, d_indices, n, k);
__global__ void kReduceKernel(Tuple *input, int *output, int n, int k) {
    extern __shared__ Tuple shared[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    shared[tid] = (i < n) ? input[i] : Tuple{0, 0};   // padding for out-of-range threads (assumes non-negative values)
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            shared[tid] = (shared[tid].value > shared[tid + s].value) ? shared[tid] : shared[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        output[blockIdx.x] = shared[0].index;
}
  4. Gather the original values by index on the CPU and sort them to obtain the top K maximum/minimum values:
cudaMemcpy(h_indices, d_indices, blocksPerGrid * sizeof(int), cudaMemcpyDeviceToHost);
for (int i = 0; i < k; ++i) {     
    int index = h_indices[i];     
    h_result[i] = h_data[index]; }  
std::sort(h_result, h_result + k);

After completing these steps, we obtain the sorted top K maximum/minimum values.

5 Experiment

Experiment 1: Three methods of calculating π, thread pool.

Experiment 2: Matrix multiplication, transposition, reduction, TOP K problem.

Global memory, shared-memory optimization, and bank-conflict optimization. A 3-4 page handwritten parallel computing report is required, covering the three methods of calculating π, the producer-consumer pattern, and the GPU part. The main program flow must be described clearly at least once, and the focus is on the global- and shared-memory implementations of matrix multiplication, the histogram (with a grid-stride loop), and the reduction that finds the maximum of a 1,000,000-element array. The report is submitted at the exam.

6 Real Questions for Grade 20

6.1 Short Answer Questions

Amdahl’s Law: for n processors and a 40% serialization ratio, calculate the limit of the speedup.

Given a 680x480 RGB image distributed over 4 threads (the distribution method is not specified), detail the range of pixels and bytes each thread processes.

A direct question from the slides on thread warp parallelism and concurrency quantity:

The NVIDIA GeForce RTX 3090 (compute capability 8.6) has 82 SMs, each allowing up to 1536 resident threads, i.e., up to 48 warps. Warps are interleaved by the warp schedulers and the 32 threads of a warp execute in parallel, so roughly 82 × 32 = 2624 threads execute in parallel at any one moment, and the number of concurrently executing threads is 82 × 32 × 48 = 125952.

Global id of a certain element during matrix transposition;

Given an element 3 in the matrix transposition process, calculate its global ID.

Whether it’s possible to simultaneously copy and transpose data (CUDA streams).

6.2 Program Analysis Questions

The first question repeats a classroom exercise with the <<<4,4>>> launch configuration changed to <<<5,5>>>; the execution process must be explained.

The second question is a histogram reduction with the atomic operations missing; you must identify the problem.

6.3 CPU Programming

Find the primes in the array range [2, 1000000], distributing the work evenly over 10 threads. Thread-pool pseudocode: client side and server side (an email function, an export function, traffic statistics, and several other functions).

6.4 GPU Programming

Matrix multiplication in global memory. Inner product of vectors a and b with dimension 1,024,000,000; blockDim.x = blockDim.y = 16 is fixed and the grid must be designed; optimize with shared memory, avoid bank conflicts, and send the per-block results back to the CPU for the final merge. One possible sketch follows.
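
One possible sketch for the inner-product question (identifiers, the float element type, and the grid shape are my own choices; the 16x16 block is flattened to 256 threads, partial sums are reduced in shared memory with conflict-free sequential addressing, and the per-block sums are merged on the CPU):

```cpp
#define BLOCK 256   // 16 x 16 threads per block, flattened

__global__ void dot_kernel(const float *a, const float *b, float *block_sums, long long n)
{
    __shared__ float cache[BLOCK];
    int tid = threadIdx.y * blockDim.x + threadIdx.x;                 // 0..255 within the block
    long long bid = (long long)blockIdx.y * gridDim.x + blockIdx.x;   // flattened 2D block index
    long long i = bid * BLOCK + tid;
    long long stride = (long long)gridDim.x * gridDim.y * BLOCK;

    float temp = 0.0f;
    for (; i < n; i += stride)          // grid-stride loop over the 1,024,000,000 elements
        temp += a[i] * b[i];
    cache[tid] = temp;
    __syncthreads();

    // Tree reduction with sequential addressing: thread t reads cache[t] and cache[t+s],
    // so consecutive threads touch consecutive 4-byte banks -- no bank conflicts.
    for (int s = BLOCK / 2; s > 0; s >>= 1) {
        if (tid < s) cache[tid] += cache[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        block_sums[bid] = cache[0];     // one partial sum per block, merged later on the CPU
}
// Possible grid design: dim3 block(16, 16), grid(2000, 2000);
// 2000 * 2000 blocks * 256 threads = 1,024,000,000 threads, one element each.
```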

Reflection: the non-coding questions were easy, but the coding questions were heavyweight and nearly impossible to finish in time.
