This article demonstrates how to write CUDA kernel functions, using five examples (such as matrix transposition) to show how the GPU's many cores can accelerate the solution of large-scale problems.
0.1 Matrix Transposition
0.1.1 Algorithm Process
Store the matrix to be transposed in GPU memory.
Allocate space on the GPU to store the transposed matrix.
Define a CUDA kernel function that performs the transposition. The kernel uses a grid of thread blocks to cover all elements of the matrix; within each block, threads stage data in shared memory, and the result is finally written back to global memory.
Call the CUDA kernel function to perform matrix transposition.
Copy the transposed matrix from GPU memory to host memory.
Release the GPU memory.
0.1.2 Code Implementation
Compared with a straightforward implementation, this version uses shared memory so that both reads and writes to global memory are coalesced, and it pads the shared-memory tile by one column to avoid bank conflicts.
#include <cuda_runtime.h>
#include <cstdlib>

const int N = 10000;
const int TILE_DIM = 32;
const int grid_size_x = (N + TILE_DIM - 1) / TILE_DIM;
const int grid_size_y = grid_size_x;
const dim3 block_size(TILE_DIM, TILE_DIM);
const dim3 grid_size(grid_size_x, grid_size_y);

__global__ void transpose(const float *A, float *B, const int N)
{
    // Padding the tile by one column avoids shared-memory bank conflicts.
    __shared__ float S[TILE_DIM][TILE_DIM + 1];
    int bx = blockIdx.x * TILE_DIM;
    int by = blockIdx.y * TILE_DIM;

    // Sequential (coalesced) read from A into the shared-memory tile.
    int nx1 = bx + threadIdx.x;
    int ny1 = by + threadIdx.y;
    if (nx1 < N && ny1 < N)
        S[threadIdx.y][threadIdx.x] = A[ny1 * N + nx1];
    __syncthreads();

    // Sequential (coalesced) write of the transposed tile into B.
    int nx2 = bx + threadIdx.y;
    int ny2 = by + threadIdx.x;
    if (nx2 < N && ny2 < N)
        B[nx2 * N + ny2] = S[threadIdx.x][threadIdx.y];
}

int main()
{
    int nn = N * N;
    int nBytes = nn * sizeof(float);
    float a = 1.1;
    float b = 2.2;

    // CPU space allocation and matrix initialization.
    float *A = (float *)malloc(nBytes);
    float *B = (float *)malloc(nBytes);
    for (int i = 1; i <= nn; i++) {
        if (i % 2 == 0) A[i - 1] = a;
        else            A[i - 1] = b;
    }

    // GPU space allocation, copy in, kernel launch, copy out.
    float *d_A, *d_B;
    cudaMalloc((void **)&d_A, nBytes);
    cudaMalloc((void **)&d_B, nBytes);
    cudaMemcpy(d_A, A, nBytes, cudaMemcpyHostToDevice);
    transpose<<<grid_size, block_size>>>(d_A, d_B, N);
    cudaMemcpy(B, d_B, nBytes, cudaMemcpyDeviceToHost);

    // Release host and device memory.
    free(A);
    free(B);
    cudaFree(d_A);
    cudaFree(d_B);
    return 0;
}
0.2 Matrix Multiplication
0.2.1 Kernel Function Design
Use of Shared Memory: Tiles of the two input matrices are loaded into shared memory so that each element of global memory is read far fewer times, improving performance.
Non-square Matrices Handling: When the two matrices are not square (or their sizes are not multiples of the tile size), boundary checks must be added when loading the tiles, padding out-of-range elements with zero to prevent out-of-bounds accesses.
__global__ void matrixmul(const float *A, const float *B, float *C,
                          const int N, const int M, const int K)
{
    // C (N x K) = A (N x M) * B (M x K)
    __shared__ float A_S[TILE_DIM][TILE_DIM];
    __shared__ float B_S[TILE_DIM][TILE_DIM];
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int row = by * TILE_DIM + ty;
    int col = bx * TILE_DIM + tx;
    float value = 0;

    for (int ph = 0; ph < M / TILE_DIM + 1; ph++) {
        // Load one tile of A and one tile of B, padding with zeros
        // outside the matrix boundaries.
        if (row < N && ph * TILE_DIM + tx < M)
            A_S[ty][tx] = A[row * M + ph * TILE_DIM + tx];
        else
            A_S[ty][tx] = 0.0;
        if (col < K && ph * TILE_DIM + ty < M)
            B_S[ty][tx] = B[(ph * TILE_DIM + ty) * K + col];
        else
            B_S[ty][tx] = 0.0;
        __syncthreads();

        // Accumulate the partial dot product for this tile.
        for (int k = 0; k < TILE_DIM; k++)
            value += A_S[ty][k] * B_S[k][tx];
        __syncthreads();
    }
    if (row < N && col < K)
        C[row * K + col] = value;
}
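A minimal host-side launch sketch follows; the matrix dimensions and the device pointers d_A, d_B, d_C are assumptions (allocated and copied as in the transpose example). The grid is sized from the dimensions of the output matrix C, so the same launch works for non-square inputs.

// Hypothetical dimensions: A is N x M, B is M x K, C is N x K.
int N = 1000, M = 800, K = 1200;
dim3 block_size(TILE_DIM, TILE_DIM);
dim3 grid_size((K + TILE_DIM - 1) / TILE_DIM,    // blocks across the columns of C
               (N + TILE_DIM - 1) / TILE_DIM);   // blocks across the rows of C
matrixmul<<<grid_size, block_size>>>(d_A, d_B, d_C, N, M, K);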
0.3 Histogram Statistics
Use of Shared Memory: A temporary array is kept in shared memory so that most updates avoid global memory, improving performance.
Loop Across Grids: When the amount of data exceeds the total number of threads, each thread processes several elements by striding across the grid.
Atomic Operations: Whether results are accumulated in shared memory or in global memory, atomic operations are needed to avoid conflicts between threads (a minimal sketch follows this list).
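As a concrete illustration of these three points, a minimal histogram kernel might look like the sketch below. The bin count NUM_BINS and the assumption that every input value already lies in [0, NUM_BINS) are assumptions made for this sketch, not part of the original article. The reduction kernel that follows applies the same pattern (grid-stride accumulation, shared memory, one atomicAdd per block) to a summation.

#define NUM_BINS 256   // assumed number of histogram bins

__global__ void histogram(const int *input, int count, int *bins)
{
    // Per-block histogram kept in shared memory.
    __shared__ int local_bins[NUM_BINS];
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        local_bins[i] = 0;
    __syncthreads();

    // Grid-stride loop: each thread handles several input elements.
    for (int idx = threadIdx.x + blockDim.x * blockIdx.x; idx < count;
         idx += gridDim.x * blockDim.x)
        atomicAdd(&local_bins[input[idx]], 1);   // values assumed to lie in [0, NUM_BINS)
    __syncthreads();

    // Merge the per-block histogram into the global one.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&bins[i], local_bins[i]);
}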
const int BLOCK_SIZE = 256;   // must match the block size used at launch

__global__ void _sum_gpu(int *input, int count, int *output)
{
    __shared__ int sum_per_block[BLOCK_SIZE];

    // Grid-stride loop: each thread accumulates a private partial sum.
    int temp = 0;
    for (int idx = threadIdx.x + blockDim.x * blockIdx.x; idx < count;
         idx += gridDim.x * blockDim.x)
        temp += input[idx];
    sum_per_block[threadIdx.x] = temp;
    __syncthreads();

    // Shared-memory summation stage: tree reduction within the block.
    for (int length = BLOCK_SIZE / 2; length >= 1; length /= 2) {
        int double_kill = -1;
        if (threadIdx.x < length)
            double_kill = sum_per_block[threadIdx.x] + sum_per_block[threadIdx.x + length];
        __syncthreads();   // first barrier: all reads of the old values must finish
        if (threadIdx.x < length)
            sum_per_block[threadIdx.x] = double_kill;
        __syncthreads();   // second barrier: all writes must finish before the next round
    }

    // The per-block partial sum is sum_per_block[0];
    // merge it into the global result with an atomic operation.
    if (threadIdx.x == 0)
        atomicAdd(output, sum_per_block[0]);
}
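A possible host-side launch for this kernel is sketched below; the grid size, data size, and pointer names are assumptions. The output counter must be zeroed before the launch, because the kernel accumulates into it with atomicAdd.

// Hypothetical host-side usage of _sum_gpu.
int count = 1 << 24;                                    // assumed data size
int *d_input, *d_output;
cudaMalloc((void **)&d_input, count * sizeof(int));
cudaMalloc((void **)&d_output, sizeof(int));
// ... fill d_input with data (omitted) ...
cudaMemset(d_output, 0, sizeof(int));                   // atomicAdd accumulates into this
_sum_gpu<<<128, BLOCK_SIZE>>>(d_input, count, d_output);
int h_sum = 0;
cudaMemcpy(&h_sum, d_output, sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_input);
cudaFree(d_output);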
0.5 TOP K Problem
The TOP K problem is to find the K largest or smallest elements in a data set. With CUDA reduction, the TOP K problem can be solved efficiently.
Below are the detailed steps for using CUDA reduction to select the top K largest/smallest elements:
Define a tuple type containing a value and an index, so that the original position of each element can be recovered after sorting.
Traverse the original data and store each element in a tuple.
Perform a reduction over the tuples to obtain the indices of the top K largest/smallest values.
Using those indices, recover the corresponding elements and sort them to obtain the top K largest/smallest values.
The implementation process is as follows:
Copy the data to GPU memory
float *d_data;
cudaMalloc((void **)&d_data, size);                        // allocate device memory before copying
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);  // h_data holds `size` bytes on the host
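One way to realize the reduction step described above is to let every thread keep a small sorted candidate list and merge the candidates afterwards. The sketch below follows that idea; TOPK, BLOCK, GRID, and all pointer names are assumptions, and the final merge is done by sorting the small candidate set on the host rather than with a second kernel.

const int TOPK  = 8;     // assumed K; small enough for a per-thread list
const int BLOCK = 256;   // assumed block size
const int GRID  = 64;    // assumed grid size

// Each thread walks its grid-stride slice of the data and keeps a sorted
// list of the TOPK largest values with their original indices. The union
// of all per-thread lists is guaranteed to contain the global top K.
__global__ void topk_candidates(const float *data, int n,
                                float *cand_val, int *cand_idx)
{
    float best[TOPK];
    int   bidx[TOPK];
    for (int i = 0; i < TOPK; i++) { best[i] = -1e30f; bidx[i] = -1; }

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = tid; i < n; i += gridDim.x * blockDim.x) {
        float v = data[i];
        if (v > best[TOPK - 1]) {                 // beats the worst kept value
            int j = TOPK - 1;
            while (j > 0 && best[j - 1] < v) {    // insert into the sorted list
                best[j] = best[j - 1];
                bidx[j] = bidx[j - 1];
                j--;
            }
            best[j] = v;
            bidx[j] = i;
        }
    }
    for (int i = 0; i < TOPK; i++) {              // publish this thread's candidates
        cand_val[tid * TOPK + i] = best[i];       // entries with index -1 are padding
        cand_idx[tid * TOPK + i] = bidx[i];
    }
}

// Host side (sketch): launch the kernel, copy the GRID*BLOCK*TOPK candidates
// back, then sort the (value, index) pairs on the host and take the first K.
//   topk_candidates<<<GRID, BLOCK>>>(d_data, n, d_cand_val, d_cand_idx);
//   cudaMemcpy(h_cand_val, d_cand_val, GRID*BLOCK*TOPK*sizeof(float), cudaMemcpyDeviceToHost);
//   cudaMemcpy(h_cand_idx, d_cand_idx, GRID*BLOCK*TOPK*sizeof(int),   cudaMemcpyDeviceToHost);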