Parallel Computing Course Review Notes

1 Review Course Outline

Chapter 1 Amdahl’s Law: understanding the law (the speedup ratio with a fixed workload) and the limit on achievable speedup

Application Questions 6*5

Layout of grids and thread blocks, calculation of the global id; parallelism, concurrency, warps, global id, CPU multi-core vs. GPU many-core

Program Analysis Questions 10*2

Write the output of the given code and explain why that output occurs

CPU Multicore 10*2

Data partitioning: clearly define the data range processed by each part. Task parallelism: the thread-pool experiment.

CUDA Programming 15*2

Given a specific problem, design the grid and thread blocks (or, given the thread blocks, design only the grid); the main function follows a fixed pattern; the key is writing the kernel function.

2 Parallel Computing

2.1 Concurrency and Parallelism

Serial: single machine, single core; instructions execute one after another, in order.

Concurrency: single machine, single core; instructions are interleaved in time, overlapping within the same time interval.

Parallelism: single machine multi-core, or multiple machines (single/multi-core); instructions execute in parallel in space, at the same instant.

Parallel computing is supercomputing performed on parallel computers, distributed systems, and other high-performance computing systems. It can reduce the time to solve a single problem, increase the scale and accuracy of solutions, and improve throughput.

Three classifications:

  • Computational model: time parallelism (pipeline), space parallelism (multi-processor)
  • Program logic: task parallelism, data parallelism
  • Application perspective: compute-intensive, data-intensive, network-intensive

2.2 Flynn’s Taxonomy

A method of classifying parallel computer architectures based on the execution of instruction streams and data streams. Includes SISD (early serial machines), SIMD (single-core computers), MISD (rarely used), MIMD (multi-core computers, parallel);

2.3 Amdahl’s Law

Assuming a fixed workload, it reveals, via the speedup ratio, that the non-parallelizable part of a program limits the overall performance improvement. $$S=\frac{W_{s}+W_{p}}{W_{s}+W_{p}/p}$$ where $W_{s}$ is the serial workload, $W_{p}$ is the parallelizable workload, $p$ is the number of processors, and $S$ is the speedup ratio. With the serial fraction $f=W_{s}/W$, dividing numerator and denominator by $W$ gives: $$S=\frac{f+(1-f)}{f+\frac{1-f}{p}} =\frac{p}{1+f(p-1)}$$ Since $\lim_{p\rightarrow \infty}S=1/f$, even with infinitely many processors the achievable speedup is limited by the serial part of the program.

1. In a serial application program, 20% must be executed serially. How many CPUs are needed to achieve a 3x performance improvement? How many are needed for a 5x speedup?
2. A parallel program runs on 5 computers and 10% of it is parallelized. What is the speedup ratio compared to serial execution on one computer? How many CPUs are needed to double that speedup ratio?
3. An application program with a 5% non-parallelizable part is to be converted into a parallel program. There are two parallel computers on the market: Computer X has 4 CPUs, and one of its CPUs can run the application in 1 hour; Computer Y has 16 CPUs, and one of its CPUs can run the application in 2 hours. To minimize runtime, which computer should you buy?
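
As a quick check of Question 1 using the formula above (my own worked arithmetic, not part of the original notes), with serial fraction $f = 0.2$:
$$S=\frac{p}{1+f(p-1)}$$
$$S=3:\quad 3\bigl(1+0.2(p-1)\bigr)=p \;\Rightarrow\; 2.4=0.4p \;\Rightarrow\; p=6 \text{ CPUs}$$
$$S=5:\quad 5\bigl(1+0.2(p-1)\bigr)=p \;\Rightarrow\; 4+p=p,\ \text{no finite solution, since } \lim_{p\to\infty}S=\frac{1}{f}=5$$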

3 CUDA Overview

3.1 Heterogeneous Computing

GPU parallel computing is a form of heterogeneous computing, split between a host (CPU) side and a device (GPU) side; the two are never equal peers. CUDA even requires explicitly marking where each piece of code runs.

3.2 Differences between CPU and GPU

Intuitively, CPUs devote more resources to caches and control flow, while GPUs devote more to data computation.

  1. In the GPU environment, the GPU core is responsible for executing all computation tasks, but the work instructions always come from the CPU.
  2. In the GPU scenario, GPU cores never fetch data themselves; data always comes from the CPU end, and computation results are then transmitted back to the CPU end. Therefore, the GPU acts as a computation accelerator in the background, completing certain outsourced tasks for the CPU.
  3. This type of architecture is only very effective when there are a large number of parallel processing units, not just 2 or 4.
  4. The concept of the warp has a significant impact on GPU architecture. Data must be fed to the GPU in equal-sized chunks; one chunk is half a warp, i.e., 16 elements.
  5. Because data is transferred to the GPU cores in half-warp-sized chunks, the storage system feeding the GPU should deliver 16 values at a time. This requires a parallel storage subsystem capable of transferring 16 numbers at once, which is why the GPU’s DRAM is built from GDDR5: it is parallel memory.
  6. Since GPU cores and CPU cores are completely different processing units, it is foreseeable that they have different ISAs (instruction set architectures). That is, they speak different languages.

GPU threads differ from CPU threads and have extremely low creation overhead. CPUs reduce latency with multi-level caches, while GPUs hide latency by maximizing throughput through deep pipelining. Their design goals differ: CPUs must be general-purpose and handle many data types, and their logic introduces many branch jumps and interrupts; GPUs face large volumes of uniform, mutually independent data and a pure computing environment that rarely needs to be interrupted.


3.3 CUDA Thread Organization


Thread: The basic unit of parallelism

Thread Block: A group of threads that cooperate with each other, allowing synchronization and data exchange through fast shared memory, organized in 1D, 2D, or 3D, containing up to 1024 threads

Grid: A group of thread blocks organized in 1D, 2D, or 3D, sharing global variables

Kernel: the core program executed on the GPU; one kernel corresponds to one grid

3.4 CUDA Host/Device Programming Model

3.4.1 Function Qualifiers

  • __device__: Executed on the device and callable only from the device; used as a device-side helper function.
  • __host__: Executed on the host and callable only from the host, like an ordinary C function. It cannot be combined with __global__, but it can be combined with __device__, in which case the function is compiled for both the host and the device.
  • __global__: The kernel function; executed on the device but called from the host.
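
A minimal illustration of the three qualifiers (my own sketch; the function names are made up):

__device__ float square(float x) { return x * x; }            // device-only helper

__host__ __device__ float twice(float x) { return 2.0f * x; } // compiled for both host and device

__global__ void scale_kernel(float *out, int n) {             // kernel: launched from the host
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = twice(square(out[i]));
}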

3.4.2 CUDA Kernel Function Limitations

  1. Can only access device memory
  2. Must return void type
  3. Does not support variable number of arguments
  4. Does not support static variables
  5. Exhibits asynchronous behavior, meaning the host does not wait for the kernel to finish executing before proceeding to the next step

3.5 SIMT Parallel Computing Model

Thread blocks are the unit at which a program is launched, while warps are the unit at which it is executed.

For example, a block size of 256 threads means the thread block consists of 8 warps (a warp always contains 32 threads). Launching 256 threads per block does not mean all 256 threads execute or finish at the same time; instead, the GPU’s execution hardware runs them as 8 warps, scheduled independently.

SIMT belongs to the category of SIMD because it also executes the same instruction on multiple data. However, SIMT allows users to allocate threads, specifically CUDA assigns an identifier (number) to each thread.

A key difference is that SIMD requires all elements in the same vector to be executed in a unified synchronous group, while SIMT allows multiple threads in the same warp to execute independently, and these threads can have different behaviors. Therefore, SIMT allows thread-level concurrency, meaning threads in the same warp can do different things simultaneously.

Three differences:

  • Each thread has its own instruction address counter
  • Each thread has its own register state
  • Each thread can have an independent execution path
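
A tiny illustration (my own, not from the notes) of warp divergence under SIMT: threads in the same warp take different branches, and the hardware executes the two paths one after the other:

__global__ void divergent(int *out) {
    int tid = threadIdx.x;
    if (tid % 2 == 0)            // even-numbered threads take one path...
        out[tid] = tid * 2;
    else                         // ...odd-numbered threads take the other;
        out[tid] = tid + 100;    // within one warp the two paths are serialized
}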

3.6 GPU Architecture

Streaming Multiprocessor SM

A thread block can only be scheduled on one SM, but one SM can correspond to multiple thread blocks.

When an SM assigns one or more thread blocks to be computed, these thread blocks are divided into multiple warps, waiting for scheduling.

Threads in a warp execute the same command on different data.

The number of thread blocks an SM can accommodate depends on the shared memory and registers within the SM and the resources occupied by the threads.

All threads in a thread block logically run in parallel, but not all threads can be executed simultaneously at the physical level. (An SM schedules only one warp at a time, with other warps waiting, and switching between different warps is zero-cost because the execution context of the warp is maintained by the SM throughout the warp’s lifecycle)

Example: the NVIDIA GeForce RTX 3090 (compute capability 8.6) has 82 SMs, and each SM allows at most 1536 resident threads. Theoretically, how many threads execute in parallel at the same instant, and how many execute concurrently? 1536 threads per SM means up to 48 resident warps per SM. Warps are executed concurrently by the warp schedulers, and the 32 threads within a warp execute in parallel, so roughly 82*32 = 2624 threads execute in parallel at the same instant, while 82*32*48 = 125952 threads execute concurrently.

3.7 Memory Model

| Memory Type | Location | Cached | Access Rights | Lifetime |
| --- | --- | --- | --- | --- |
| Register | On-chip | No | device | Same as thread (kernel function) |
| Shared Memory | On-chip | No | device | Same as block |
| Local Memory | Off-chip | No | device | Same as thread (kernel function) |
| Global Memory | Off-chip | No | device & host | Program |
| Texture Memory, Constant Memory | Off-chip | Yes | device & host | Program |

When are variables defined in a CUDA kernel function register variables, and when are they local variables?

The following three cases are local variables, others are register variables

  • Arrays whose indices cannot be determined at compile time
  • Arrays or structures that occupy a large space
  • Many variables are defined in the kernel function, and registers cannot hold them

Variables that spill from registers into local memory actually reside in the same physical storage area as global memory.
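
A small sketch of where typical declarations end up under these rules (my own example; it assumes the kernel is launched with blockDim.x <= 256):

__global__ void memory_spaces(float *gdata, int n) {   // gdata points into global memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;     // plain scalar: a register variable
    __shared__ float tile[256];                        // shared memory, one copy per thread block
    float big[2048];                                   // large array: placed in (off-chip) local memory
    if (i < n) {
        big[i % 2048] = gdata[i];                      // dynamic indexing also forces local memory
        tile[threadIdx.x] = big[i % 2048];
        gdata[i] = tile[threadIdx.x] + 1.0f;
    }
}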


3.8 Memory Access Patterns

Global memory is loaded/stored through cache. All accesses to global memory go through the L2 cache (generally 128 bytes).

Aligned Access

The first address of the memory transaction is an even multiple of the cache granularity (generally 32 bytes), i.e., it coincides with the starting position of a cache line.

Coalesced Access

All threads in a warp access a contiguous block of memory. Coalesced access means that a warp’s memory request to global memory results in the minimum amount of data transfer (coalescing degree = 100%); otherwise the access is non-coalesced.

The five access patterns and the calculation of the coalescing degree??

If reads and writes cannot both be coalesced, coalesced writes should be prioritized. Non-coalesced reads of read-only data can be cached with the __ldg() function, or converted into coalesced accesses by using shared memory.
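
A small sketch (my own example, not from the notes) contrasting a coalesced and a non-coalesced access pattern:

// Coalesced: consecutive threads in a warp read consecutive addresses.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Non-coalesced: consecutive threads read addresses 'stride' elements apart,
// so one warp's request touches many separate memory segments.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}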

3.9 Shared Memory and Bank Conflict

Shared memory can be directly manipulated by programmers. Shared memory is divided into many banks.

  • All threads in a warp access the same address in the same bank - broadcast
  • Different threads in a warp access different addresses in the same bank - bank conflict
  • Several (but not all) threads in a warp access the same address in the same bank - multicast

Memory padding resolves bank conflicts.

Padding operation: add 1 to the second dimension of sData, i.e., declare sData[BS][BS+1]. The padded column cannot be used to store data, so the amount of usable shared memory is reduced.
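
A sketch of why the extra column helps, assuming 32 banks of 4-byte words, BS = 32, and one warp per block (my own illustration, not from the notes):

#define BS 32
__global__ void column_read(float *out, int c) {   // assume 0 <= c < BS and blockDim.x == 32
    __shared__ float plain[BS][BS];                // element (i, j) lives in bank (i*32 + j) % 32 = j
    __shared__ float padded[BS][BS + 1];           // element (i, j) lives in bank (i*33 + j) % 32 = (i + j) % 32
    int tid = threadIdx.x;
    plain[tid][c] = tid;                           // every thread hits bank c: 32-way bank conflict
    padded[tid][c] = tid;                          // thread tid hits bank (tid + c) % 32: no conflict
    __syncthreads();
    out[tid] = plain[tid][c] + padded[tid][c];
}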

4 Code

4.1 Image Flipping CPU

Multithreaded image flipping that manually stages each row in a local buffer (a software-managed cache).

void * MTFlipHM(void * tid){                        // horizontal flip: mirror each row in place
    struct Pixel pix;                               // temp pixel used for the swap
    int row, col;
    int id = *((int *) tid);
    int start = id * ip.Vpixels / NumThreads;       // first row handled by this thread
    int end = start + ip.Vpixels / NumThreads;      // one past the last row
    unsigned char buffer[16384];                    // row buffer ("manual cache")
    for (row = start; row < end; row++)
    {
        memcpy(buffer, TheImage[row], ip.Hbytes);   // stage the row in the local buffer
        col = 0;
        while (col < ip.Hpixels * 3 / 2){           // swap mirrored pixels up to the middle of the row
            pix.B = buffer[col];
            pix.G = buffer[col+1];
            pix.R = buffer[col+2];
            buffer[col]   = buffer[ip.Hpixels*3 - col - 3];
            buffer[col+1] = buffer[ip.Hpixels*3 - col - 2];
            buffer[col+2] = buffer[ip.Hpixels*3 - col - 1];
            buffer[ip.Hpixels*3 - col - 3] = pix.B;
            buffer[ip.Hpixels*3 - col - 2] = pix.G;
            buffer[ip.Hpixels*3 - col - 1] = pix.R;
            col += 3;
        }
        memcpy(TheImage[row], buffer, ip.Hbytes);   // write the flipped row back (inside the row loop)
    }
    pthread_exit(NULL);
}

void * MTFlipVM(void * tid){                        // vertical flip: swap each row with its mirror row
    int row;
    int id = *((int *) tid);
    int start = id * ip.Vpixels / NumThreads;
    // Only the first half of this thread's stripe is processed, so that every
    // row pair (row, Vpixels-1-row) is swapped exactly once across all threads.
    int end = start + ip.Vpixels / NumThreads / 2;
    unsigned char buffer1[16384], buffer2[16384];   // two row buffers
    for (row = start; row < end; row++)
    {
        memcpy(buffer1, TheImage[row], ip.Hbytes);
        int mirrorrow = ip.Vpixels - 1 - row;
        memcpy(buffer2, TheImage[mirrorrow], ip.Hbytes);
        // Write each buffer back to the other row to complete the swap
        memcpy(TheImage[row], buffer2, ip.Hbytes);
        memcpy(TheImage[mirrorrow], buffer1, ip.Hbytes);
    }
    pthread_exit(NULL);
}

for(i=0; i<NumThreads; i++)
    pthread_create(&ThHandle[i], &ThAttr, MTFlipFunc, (void *)&ThParam[i]);
for(i=0; i<NumThreads; i++)
    pthread_join(ThHandle[i], NULL);

4.2 Array Addition

const int a = 1, b = 2, c = 3;        // expected result: z[i] = a + b = c
const int N = 1e9;                    // number of elements (1e9 doubles is ~8 GB per array; reduce N if memory is short)

__global__ void add(double *x, double *y, double *z){
    const int n = blockIdx.x * blockDim.x + threadIdx.x;   // global thread id
    if (n < N)  z[n] = x[n] + y[n];
}
int main(){
    const size_t M = sizeof(double) * N;   // bytes per array (size_t avoids int overflow)
    double *h_x = (double*) malloc(M);
    double *h_y = (double*) malloc(M);
    double *h_z = (double*) malloc(M);
    for (int i = 0; i < N; i++)
    {
        h_x[i] = a;
        h_y[i] = b;
    }
    double *d_x, *d_y, *d_z;
    cudaMalloc((void**) &d_x, M);
    cudaMalloc((void**) &d_y, M);
    cudaMalloc((void**) &d_z, M);
    // Copy the input arrays to the device (missing in the original listing)
    cudaMemcpy(d_x, h_x, M, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, M, cudaMemcpyHostToDevice);

    const int block_size = 128;
    const int grid_size = (N + block_size - 1) / block_size;   // ceil(N / block_size)
    add<<<grid_size, block_size>>>(d_x, d_y, d_z);
    cudaMemcpy(h_z, d_z, M, cudaMemcpyDeviceToHost);
    free(h_x); free(h_y); free(h_z);
    cudaFree(d_x); cudaFree(d_y); cudaFree(d_z);
    return 0;
}
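
A quick host-side check that could be added before the frees (my own addition; it assumes stdio is included): since a = 1 and b = 2, every element of h_z should equal c = 3.

    bool ok = true;                              // spot-check the result on the host
    for (int i = 0; i < N; i++)
        if (h_z[i] != c) { ok = false; break; }
    printf("%s\n", ok ? "all elements equal 3" : "mismatch found");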

4.3 Image Flipping

typedef unsigned char uch;   // type abbreviations used throughout the course code
typedef unsigned int  ui;

__global__ void Vflip(uch *ImgDst, uch *ImgSrc, ui Hpixels, ui Vpixels){
    ui ThrPerBlk = blockDim.x;
    ui MYbid = blockIdx.x;
    ui MYtid = threadIdx.x;
    ui MYgtid = ThrPerBlk * MYbid + MYtid;                  // global thread id
    ui BlkPerRow = (Hpixels + ThrPerBlk - 1) / ThrPerBlk;   // blocks per image row (ceil)
    ui RowBytes = (Hpixels * 3 + 3) & (~3);                 // row size padded to a multiple of 4 bytes
    ui MYrow = MYbid / BlkPerRow;
    ui MYcol = MYgtid - MYrow * BlkPerRow * ThrPerBlk;
    if (MYcol >= Hpixels) return;                           // column out of range
    ui MYmirrorrow = Vpixels - 1 - MYrow;
    ui MYsrcOffset = MYrow * RowBytes;
    ui MYdstOffset = MYmirrorrow * RowBytes;
    ui MYsrcIndex = MYsrcOffset + 3 * MYcol;
    ui MYdstIndex = MYdstOffset + 3 * MYcol;
    // copy pixel (MYrow, MYcol) to (MYmirrorrow, MYcol)
    ImgDst[MYdstIndex]     = ImgSrc[MYsrcIndex];
    ImgDst[MYdstIndex + 1] = ImgSrc[MYsrcIndex + 1];
    ImgDst[MYdstIndex + 2] = ImgSrc[MYsrcIndex + 2];
}

__global__ void Hflip(uch *ImgDst, uch *ImgSrc, ui Hpixels){
    ui ThrPerBlk = blockDim.x;
    ui MYbid = blockIdx.x;
    ui MYtid = threadIdx.x;
    ui MYgtid = ThrPerBlk * MYbid + MYtid;
    ui BlkPerRow = (Hpixels + ThrPerBlk - 1) / ThrPerBlk;   // ceil
    ui RowBytes = (Hpixels * 3 + 3) & (~3);
    ui MYrow = MYbid / BlkPerRow;
    ui MYcol = MYgtid - MYrow * BlkPerRow * ThrPerBlk;
    if (MYcol >= Hpixels) return;                           // column out of range
    ui MYmirrorcol = Hpixels - 1 - MYcol;
    ui MYoffset = MYrow * RowBytes;
    ui MYsrcIndex = MYoffset + 3 * MYcol;
    ui MYdstIndex = MYoffset + 3 * MYmirrorcol;
    // copy pixel (MYrow, MYcol) to (MYrow, MYmirrorcol)
    ImgDst[MYdstIndex]     = ImgSrc[MYsrcIndex];
    ImgDst[MYdstIndex + 1] = ImgSrc[MYsrcIndex + 1];
    ImgDst[MYdstIndex + 2] = ImgSrc[MYsrcIndex + 2];
}
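
A possible launch for these kernels (my own sketch; GPUImg and GPUCopyImg are assumed device copies of the image, and ip holds the image header as in the CPU code):

ui ThrPerBlk = 256;
ui BlkPerRow = (ip.Hpixels + ThrPerBlk - 1) / ThrPerBlk;   // blocks needed per image row
ui NumBlocks = BlkPerRow * ip.Vpixels;                     // MYrow = MYbid / BlkPerRow needs one row of blocks per image row
Vflip<<<NumBlocks, ThrPerBlk>>>(GPUCopyImg, GPUImg, ip.Hpixels, ip.Vpixels);
// Hflip<<<NumBlocks, ThrPerBlk>>>(GPUCopyImg, GPUImg, ip.Hpixels);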

4.4 Matrix Transposition

__global__ void transpose(int a[], int b[], int N){
    // Allocate shared memory
    __shared__ int S[TILE][TILE + 1];
    int bx = blockIdx.x * TILE;
    int by = blockIdx.y * TILE;
    int ix = bx + threadIdx.x;
    int iy = by + threadIdx.y;
    if (ix < N && iy < N)// Read into shared memory
        S[threadIdx.y][threadIdx.x] = a[iy * N + ix];
    __syncthreads();// Synchronize, this is essential
    int ix2 = bx + threadIdx.y;
    int iy2 = by + threadIdx.x;
    if (ix2 <N && iy2 < N)// Write back
        b[ix2 * N + iy2 ] = S[threadIdx.x][threadIdx.y];
}
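
A possible launch configuration (my own sketch; it assumes TILE is defined, e.g. 32, and d_a / d_b are N x N device matrices):

dim3 block(TILE, TILE);                                    // one thread per tile element
dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);   // enough tiles to cover the N x N matrix
transpose<<<grid, block>>>(d_a, d_b, N);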

4.5 Square Matrix Multiplication

__shared__ float Mds[TILE_WIDTH][TILE_WIDTH];   // tile of M staged in shared memory
__shared__ float Nds[TILE_WIDTH][TILE_WIDTH];   // tile of N staged in shared memory
int bx = blockIdx.x;  int by = blockIdx.y;
int tx = threadIdx.x; int ty = threadIdx.y;
int Row = by * TILE_WIDTH + ty;                 // row of P this thread computes
int Col = bx * TILE_WIDTH + tx;                 // column of P this thread computes
float Pvalue = 0;
for (int m = 0; m < WIDTH / TILE_WIDTH; ++m)
    {
    // Each thread loads one element of the current sub-matrix (tile) of M
    Mds[ty][tx] = Md[Row * WIDTH + (m * TILE_WIDTH + tx)];
    // Each thread loads one element of the current sub-matrix (tile) of N
    Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * WIDTH + Col];
    __syncthreads();                            // wait until the whole tile is loaded
    for (int k = 0; k < TILE_WIDTH; ++k)
        Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();                            // wait before overwriting the tile in the next iteration
    }
Pd[Row*WIDTH+Col] = Pvalue;                     // write the result back to matrix P
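
The fragment above is the kernel body; a possible wrapper and launch, assuming square WIDTH x WIDTH matrices with WIDTH a multiple of TILE_WIDTH (my own sketch; the sizes and the device pointers d_M, d_N, d_P are made up):

#define TILE_WIDTH 16
#define WIDTH 1024                        // hypothetical matrix size

__global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd)
{
    // body exactly as in the fragment above
}

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);    // one thread per element of Pd
dim3 dimGrid(WIDTH / TILE_WIDTH, WIDTH / TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P);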

4.6 Histogram

#define SIZE (100*1024*1024)
// CPU reference version: generate a random byte stream using big_random_block()
unsigned char *buffer = (unsigned char*)big_random_block( SIZE );
unsigned int histo[256];
for (int i = 0; i < 256; i++)
    histo[i] = 0;
for (int i = 0; i < SIZE; i++)
    histo[buffer[i]]++;
long histoCount = 0;                        // total count; should equal SIZE
for (int i = 0; i < 256; i++) {
    histoCount += histo[i]; }

// GPU version: requires 256 threads per block, since threadIdx.x indexes the 256 bins directly
__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo){
    __shared__ unsigned int temp[256];      // per-block histogram in shared memory
    temp[threadIdx.x] = 0;
    __syncthreads();
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;    // grid-stride loop over the whole buffer
    while (i < size){
        atomicAdd(&temp[buffer[i]], 1);     // atomic add into the block-local histogram
        i += offset;
    }
    __syncthreads();
    // Merge this block's histogram into the global one (one bin per thread)
    atomicAdd(&(histo[threadIdx.x]), temp[threadIdx.x]);
}
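
A possible launch for the kernel (my own sketch): the block size must be 256 because threadIdx.x indexes the 256-bin histograms directly, while the grid size is free since the grid-stride loop covers the whole buffer.

unsigned char *d_buffer;
unsigned int *d_histo;
cudaMalloc((void**)&d_buffer, SIZE);
cudaMalloc((void**)&d_histo, 256 * sizeof(unsigned int));
cudaMemcpy(d_buffer, buffer, SIZE, cudaMemcpyHostToDevice);
cudaMemset(d_histo, 0, 256 * sizeof(unsigned int));
histo_kernel<<<1024, 256>>>(d_buffer, SIZE, d_histo);      // 1024 blocks is an arbitrary choice
cudaMemcpy(histo, d_histo, 256 * sizeof(unsigned int), cudaMemcpyDeviceToHost);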

4.7 Reduction Summation

Reduction summation is similar to TOP K; the following is the official course code.

__global__ void _sum_gpu(int *input, int count, int *output)
{
    __shared__ int sum_per_block[BLOCK_SIZE];

    int temp = 0;
    for (int idx = threadIdx.x + blockDim.x * blockIdx.x;
         idx < count; idx += gridDim.x * blockDim.x
    )
    {   // grid-stride loop: one thread accumulates many elements, to cope with massive data
        temp += input[idx];
    }

    sum_per_block[threadIdx.x] = temp;  // the per-thread partial sum is temp!
    __syncthreads();

    //********** shared-memory reduction stage ***********
    for (int length = BLOCK_SIZE / 2; length >= 1; length /= 2)
    {
        int double_kill = -1;
        if (threadIdx.x < length)
        {
            double_kill = sum_per_block[threadIdx.x] + sum_per_block[threadIdx.x + length];
        }
        __syncthreads();  // first barrier: make sure every thread has READ the old values...

        if (threadIdx.x < length)
        {
            sum_per_block[threadIdx.x] = double_kill;
        }
        __syncthreads();  // ...second barrier: make sure every thread has WRITTEN before the next round

    } // the per-block partial sum is now in sum_per_block[0]

    if (blockDim.x * blockIdx.x < count) // in case our users are naughty (blocks beyond the data do not contribute)
    {
        // the final cross-block reduction is performed by atomicAdd()
        if (threadIdx.x == 0) atomicAdd(output, sum_per_block[0]);
    }
}
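
A possible launch (my own sketch; it assumes BLOCK_SIZE is defined, e.g. 256, and d_input / d_output are device buffers):

int grid_size = (count + BLOCK_SIZE - 1) / BLOCK_SIZE;   // the grid-stride loop also tolerates a smaller grid
cudaMemset(d_output, 0, sizeof(int));                    // atomicAdd accumulates into *d_output, so zero it first
_sum_gpu<<<grid_size, BLOCK_SIZE>>>(d_input, count, d_output);
int h_sum = 0;
cudaMemcpy(&h_sum, d_output, sizeof(int), cudaMemcpyDeviceToHost);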

4.8 TOP K

The specific implementation process is as follows:

  1. Copy the data to GPU memory: float *d_data; cudaMalloc((void**)&d_data, size); cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
  2. Store the data as (value, index) tuples
typedef struct {
    float value;
    int index;
} Tuple;
Tuple *d_tuples;
cudaMalloc((void**)&d_tuples, n * sizeof(Tuple));   // allocation was missing in the original listing
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
initializeTuples<<<blocksPerGrid, threadsPerBlock>>>(d_data, d_tuples, n);
  3. Perform a reduction on the tuples to obtain candidate indices of the top K maximum/minimum values
int *d_indices;
cudaMalloc((void**)&d_indices, blocksPerGrid * sizeof(int));   // one winning index per block
// Dynamic shared memory: one Tuple per thread
kReduceKernel<<<blocksPerGrid, threadsPerBlock, threadsPerBlock * sizeof(Tuple)>>>(d_tuples, d_indices, n, k);

__global__ void kReduceKernel(Tuple *input, int *output, int n, int k) {
    extern __shared__ Tuple shared[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    shared[tid] = (i < n) ? input[i] : Tuple{0, 0};   // pad out-of-range threads
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {    // tree reduction: keep the larger tuple
        if (tid < s)
            shared[tid] = (shared[tid].value > shared[tid + s].value) ? shared[tid] : shared[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        output[blockIdx.x] = shared[0].index;         // each block outputs the index of its maximum
}
  4. Copy the indices back to the CPU, gather the corresponding original values, and sort them to obtain the top K maximum/minimum values
cudaMemcpy(h_indices, d_indices, blocksPerGrid * sizeof(int), cudaMemcpyDeviceToHost);
for (int i = 0; i < k; ++i) {          // assumes k <= blocksPerGrid
    int index = h_indices[i];
    h_result[i] = h_data[index];       // gather the original values by index
}
std::sort(h_result, h_result + k);     // needs <algorithm>
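
The helper initializeTuples called in step 2 is not shown in the notes; a minimal sketch of what it might look like (my own assumption):

__global__ void initializeTuples(const float *data, Tuple *tuples, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tuples[i].value = data[i];   // value used for comparison
        tuples[i].index = i;         // remember the original position
    }
}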

5 Experiment

Experiment One: Three methods for calculating PI, thread pool

Experiment Two: Matrix multiplication, transposition, reduction, TOP K problem

Global memory, shared memory optimization, bank conflict optimization

Write a 3-4 page parallel computing report: for the CPU, the three methods of computing PI and the producer-consumer model; for the GPU, describe the main-program flow once (it is essentially fixed, only the kernel functions differ), focusing on the global-memory and shared-memory implementations of matrix multiplication, the histogram (grid-stride loop), and the reduction that finds the maximum of a 1,000,000-element array. The report is submitted during the exam.

6 2020 Exam Questions

6.1 Short Answer Questions

Amdahl’s Law, n processors, 40% serial, calculate the speedup limit.

Given a 680*480 RGB image divided among 4 threads (how to divide is not specified), give the pixel range and byte range processed by each thread;

The PPT example question on the number of warps executing in parallel vs. concurrently;

NVIDIA GeForce RTX 3090 (compute capability 8.6, 82 SMs, at most 1536 threads per SM): how many threads theoretically execute in parallel at the same instant, and how many execute concurrently? (Worked out in section 3.6.)

Global id of a specific element in the matrix transposition process;


Find the global id of element 3 in the given matrix (the figure is not preserved in these notes)

Can data be copied while the transpose is running? (CUDA streams);

6.2 Program Analysis Questions

First question: an original class exercise with <4,4> changed to <5,5>; the process must be explained;

Program Analysis One

Second question: histogram reduction without atomic operations; explain what problems arise;

6.3 CPU Programming

Find the prime numbers in the range a[2, 1000000], dividing the work equally among 10 threads;

Thread pool pseudocode: client, server (email function, export function, traffic statistics, etc. a bunch of functions);

6.4 GPU Programming

Matrix multiplication in global memory;

Inner product of vectors a and b with dimension 1024000000; blockDim.x = blockDim.y = 16 is fixed; design the grid; shared-memory optimization is required, bank conflicts must be resolved, and the partial results are copied back to the CPU for the final merge.

Impressions: the non-code questions are child's play, but the code questions hit hard; there is no way to finish writing them all.
