This article shows how to accelerate large-scale computations on the GPU's many cores by writing CUDA kernel functions, taking matrix transposition and several other classic problems as examples.
0.1 Matrix Transposition
0.1.1 Algorithm Process
Store the matrix to be transposed in GPU memory.
Allocate space on the GPU to store the transposed matrix.
Define a CUDA kernel function that implements the transposition. The kernel uses thread blocks and a grid to cover all elements of the matrix; within each block, threads stage data in shared memory, and the results are finally written back to global memory.
Call the CUDA kernel function to execute matrix transposition.
Copy the transposed matrix from GPU memory to host memory.
Release the GPU memory.
0.1.2 Code Implementation
Compared with a straightforward implementation, shared memory is used so that accesses to global memory are coalesced, and a padding column is added to avoid shared-memory bank conflicts.
#include <cstdlib>

const int N = 10000;
const int TILE_DIM = 32;
const int grid_size_x = (N + TILE_DIM - 1) / TILE_DIM;
const int grid_size_y = grid_size_x;
const dim3 block_size(TILE_DIM, TILE_DIM);
const dim3 grid_size(grid_size_x, grid_size_y);

__global__ void transpose(const float *A, float *B, const int N)
{
    // The extra column of padding avoids shared-memory bank conflicts
    __shared__ float S[TILE_DIM][TILE_DIM + 1];
    int bx = blockIdx.x * TILE_DIM;
    int by = blockIdx.y * TILE_DIM;

    // Sequential (coalesced) read from A into the shared-memory tile
    int nx1 = bx + threadIdx.x;
    int ny1 = by + threadIdx.y;
    if (nx1 < N && ny1 < N)
        S[threadIdx.y][threadIdx.x] = A[ny1 * N + nx1];
    __syncthreads();

    // Sequential (coalesced) write of the transposed tile into B
    int nx2 = bx + threadIdx.y;
    int ny2 = by + threadIdx.x;
    if (nx2 < N && ny2 < N)
        B[nx2 * N + ny2] = S[threadIdx.x][threadIdx.y];
}

int main()
{
    int nn = N * N;
    int nBytes = nn * sizeof(float);
    float a = 1.1;
    float b = 2.2;

    // CPU space allocation
    float *A = (float *)malloc(nBytes);
    float *B = (float *)malloc(nBytes);

    // Matrix initialization
    for (int i = 1; i <= nn; i++)
    {
        if (i % 2 == 0) A[i - 1] = a;
        else            A[i - 1] = b;
    }

    // GPU space allocation (device pointers must be declared first)
    float *d_A, *d_B;
    cudaMalloc((void **)&d_A, nBytes);
    cudaMalloc((void **)&d_B, nBytes);
    cudaMemcpy(d_A, A, nBytes, cudaMemcpyHostToDevice);

    transpose<<<grid_size, block_size>>>(d_A, d_B, N);

    cudaMemcpy(B, d_B, nBytes, cudaMemcpyDeviceToHost);

    free(A);
    free(B);
    cudaFree(d_A);
    cudaFree(d_B);
    return 0;
}
0.2 Matrix Multiplication
0.2.1 Kernel Function Design
Use of Shared Memory: Tiles of the two input matrices are staged in shared memory to reduce accesses to global memory, thereby improving program performance.
Non-square Matrix Handling: When multiplying two non-square matrices, the tiling must account for all three dimensions N, M, and K, and conditional statements are needed to prevent out-of-bounds accesses (a host-side launch sketch follows the kernel below).
__global__ void matrixmul(const float *A, const float *B, float *C,
                          const int N, const int M, const int K)
{
    // C = A * B, where A is N x M and B is M x K
    __shared__ float A_S[TILE_DIM][TILE_DIM];
    __shared__ float B_S[TILE_DIM][TILE_DIM];
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int row = by * TILE_DIM + ty;
    int col = bx * TILE_DIM + tx;
    float value = 0;

    // Walk over the tiles along the shared dimension M
    for (int ph = 0; ph < M / TILE_DIM + 1; ph++)
    {
        // Load one tile of A and one tile of B, zero-padding past the matrix edges
        if (row < N && ph * TILE_DIM + tx < M)
            A_S[ty][tx] = A[row * M + ph * TILE_DIM + tx];
        else
            A_S[ty][tx] = 0.0;
        if (col < K && ph * TILE_DIM + ty < M)
            B_S[ty][tx] = B[(ph * TILE_DIM + ty) * K + col];
        else
            B_S[ty][tx] = 0.0;
        __syncthreads();

        for (int k = 0; k < TILE_DIM; k++)
            value += A_S[ty][k] * B_S[k][tx];
        __syncthreads();
    }

    if (row < N && col < K)
        C[row * K + col] = value;
}
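The article gives only the kernel for this section, so the following host-side launch is a minimal sketch under assumed sizes: the values of N, M, K and the buffer names h_A, d_A and so on are illustrative, and the code assumes TILE_DIM and the matrixmul kernel above are defined in the same file. The key point is that grid.x must cover the K columns of C and grid.y the N rows.

#include <cstdlib>
#include <cuda_runtime.h>

// Assumes TILE_DIM (32) and the matrixmul kernel above are defined in this file.
int main()
{
    // Assumed sizes for a non-square product: C (N x K) = A (N x M) * B (M x K)
    const int N = 1000, M = 500, K = 800;
    size_t bytesA = (size_t)N * M * sizeof(float);
    size_t bytesB = (size_t)M * K * sizeof(float);
    size_t bytesC = (size_t)N * K * sizeof(float);

    float *h_A = (float *)malloc(bytesA);
    float *h_B = (float *)malloc(bytesB);
    float *h_C = (float *)malloc(bytesC);
    for (int i = 0; i < N * M; i++) h_A[i] = 1.0f;
    for (int i = 0; i < M * K; i++) h_B[i] = 2.0f;

    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, bytesA);
    cudaMalloc((void **)&d_B, bytesB);
    cudaMalloc((void **)&d_C, bytesC);
    cudaMemcpy(d_A, h_A, bytesA, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytesB, cudaMemcpyHostToDevice);

    // grid.x covers the K columns of C, grid.y covers the N rows of C
    const dim3 block(TILE_DIM, TILE_DIM);
    const dim3 grid((K + TILE_DIM - 1) / TILE_DIM, (N + TILE_DIM - 1) / TILE_DIM);
    matrixmul<<<grid, block>>>(d_A, d_B, d_C, N, M, K);

    cudaMemcpy(h_C, d_C, bytesC, cudaMemcpyDeviceToHost);

    free(h_A); free(h_B); free(h_C);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}

With this mapping every thread computes exactly one element of C, and the zero-padded tiles make the last partial tile along each dimension safe.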
0.3 Histogram Statistics
Use of Shared Memory: A temporary histogram is created in shared memory so that most updates avoid global memory, thereby improving program performance.
Grid-Stride Looping: For large data volumes, a single thread needs to process multiple elements.
Atomic Operations: Whether the result lives in shared memory or in global memory, atomic operations are needed to avoid write conflicts. A kernel sketch combining these three points follows.
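The article does not provide a listing for this section; the kernel below is a minimal sketch of the three points above under assumptions not stated in the original: a fixed bin count NUM_BINS, non-negative integer input, and a bin mapping of input[idx] % NUM_BINS. The kernel name histogram and the parameter names are likewise illustrative.

#include <cuda_runtime.h>

#define NUM_BINS 256   // assumed number of histogram bins

__global__ void histogram(const int *input, int count, unsigned int *bins)
{
    // Per-block partial histogram in shared memory
    __shared__ unsigned int bins_s[NUM_BINS];
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        bins_s[i] = 0;
    __syncthreads();

    // Grid-stride loop: each thread processes multiple elements;
    // atomic addition avoids conflicts between threads of the same block
    for (int idx = threadIdx.x + blockDim.x * blockIdx.x; idx < count;
         idx += gridDim.x * blockDim.x)
        atomicAdd(&bins_s[input[idx] % NUM_BINS], 1u);
    __syncthreads();

    // Merge the block's partial histogram into the global result;
    // atomic addition avoids conflicts between blocks
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&bins[i], bins_s[i]);
}

On the host side, the bins array would be zeroed (for example with cudaMemset) before the launch, since every block accumulates into it.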
0.4 Reduction (Array Summation)
Grid-Stride Looping: For massive data volumes (more than a million elements), even a GPU cannot launch one thread per element, so a single thread needs to process multiple elements.
Shared Memory Optimization: The partial sum each thread accumulates over its elements is stored in shared memory, reducing accesses to global memory.
Interleaved Pairing: Performing the in-block reduction with the interleaved pairing scheme alleviates some of the warp divergence.
Atomic Operations: Accumulating each block's result into global memory may cause conflicts, so an atomic addition is used. The kernel is followed by a short host-side usage sketch.
// BLOCK_SIZE is the compile-time number of threads per block
__global__ void _sum_gpu(int *input, int count, int *output)
{
    __shared__ int sum_per_block[BLOCK_SIZE];

    // Grid-stride loop: each thread first sums several elements on its own
    int temp = 0;
    for (int idx = threadIdx.x + blockDim.x * blockIdx.x; idx < count;
         idx += gridDim.x * blockDim.x)
        temp += input[idx];
    sum_per_block[threadIdx.x] = temp;
    __syncthreads();

    //********** shared memory summation stage ***********
    for (int length = BLOCK_SIZE / 2; length >= 1; length /= 2)
    {
        int double_kill = -1;
        if (threadIdx.x < length)
            double_kill = sum_per_block[threadIdx.x] + sum_per_block[threadIdx.x + length];
        __syncthreads();   // first barrier: all reads of the pairs are done ...
        if (threadIdx.x < length)
            sum_per_block[threadIdx.x] = double_kill;
        __syncthreads();   // ... before any thread overwrites its slot
    }

    // the per-block partial sum is sum_per_block[0]
    if (threadIdx.x == 0)
        atomicAdd(output, sum_per_block[0]);
}
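Because each block adds its partial sum into *output with atomicAdd, the output value must be zeroed before the launch. The driver below is a sketch under assumptions not in the original: the helper name sum_on_gpu and the grid size are illustrative, and BLOCK_SIZE is taken to be the same compile-time constant used for the kernel's shared array.

#include <cuda_runtime.h>
#include <vector>

// Assumes BLOCK_SIZE and the _sum_gpu kernel above are defined in this file.
int sum_on_gpu(const std::vector<int> &h_input)
{
    const int count = (int)h_input.size();
    const int GRID_SIZE = 1024;   // arbitrary: the grid-stride loop handles any count

    int *d_input, *d_output;
    cudaMalloc((void **)&d_input, count * sizeof(int));
    cudaMalloc((void **)&d_output, sizeof(int));
    cudaMemcpy(d_input, h_input.data(), count * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_output, 0, sizeof(int));   // atomicAdd accumulates into this value

    _sum_gpu<<<GRID_SIZE, BLOCK_SIZE>>>(d_input, count, d_output);

    int h_sum = 0;
    cudaMemcpy(&h_sum, d_output, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_input);
    cudaFree(d_output);
    return h_sum;
}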
0.5 TOP K Problem
The TOP K problem refers to finding the top K largest or smallest elements in a dataset. Using CUDA reduction can efficiently solve the TOP K problem.
The detailed steps to sort and select the top K largest/smallest elements using CUDA reduction are as follows:
Define a pair (tuple) type containing a value and its index, so that the original data can be located again after the reduction
Traverse the original data and pack each element together with its index into a pair
Perform reduction operations on the pairs to obtain the indices of the top K largest/smallest values
Use those indices to recover the corresponding elements of the original data, giving the top K largest/smallest values
Specific implementation process:
Copy data to GPU memory
float *d_data;
cudaMalloc((void **)&d_data, size);   // allocate device memory before copying
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
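The original listing ends here. To illustrate the value-index pair reduction described in the steps above, the kernel below is a minimal sketch for the K = 1 case (finding the single largest value and its position); the names Pair, argmax_pairs, BLOCK, and part are assumptions, not code from the article.

#include <cfloat>
#include <cuda_runtime.h>

#define BLOCK 256   // assumed threads per block

// A value together with its position in the original array
struct Pair { float val; int idx; };

// Each block reduces its elements to the largest (value, index) pair it has
// seen and writes it to part[blockIdx.x]; the host (or a second kernel) then
// takes the maximum over the gridDim.x partial results.
__global__ void argmax_pairs(const float *data, int count, Pair *part)
{
    __shared__ Pair s[BLOCK];
    Pair best = {-FLT_MAX, -1};

    // Grid-stride loop: each thread scans several elements
    for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < count;
         i += gridDim.x * blockDim.x)
        if (data[i] > best.val) { best.val = data[i]; best.idx = i; }

    s[threadIdx.x] = best;
    __syncthreads();

    // Interleaved-pairing reduction in shared memory, as in the summation kernel
    for (int length = BLOCK / 2; length >= 1; length /= 2)
    {
        if (threadIdx.x < length && s[threadIdx.x + length].val > s[threadIdx.x].val)
            s[threadIdx.x] = s[threadIdx.x + length];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        part[blockIdx.x] = s[0];
}

A simple way to extend this to K > 1 is to repeat the reduction K times, each time excluding the indices already selected (for example by overwriting those positions with -FLT_MAX).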