Command to get your GPGPU HW spec:
$ /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
Device 0: "Tesla K20Xm" CUDA Driver Version / Runtime Version 5.5 / 5.5 CUDA Capability Major/Minor version number: 3.5 Total amount of global memory: 5760 MBytes (6039339008 bytes) (14) Multiprocessors, (192) CUDA Cores/MP: 2688 CUDA Cores GPU Clock rate: 732 MHz (0.73 GHz) Memory Clock rate: 2600 Mhz Memory Bus Width: 384-bit L2 Cache Size: 1572864 bytes Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 65536 Warp size: 32 Maximum number of threads per multiprocessor: 2048 Maximum number of threads per block: 1024 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
theoretical memory bandwidth: memory clock $\times$ (bus width / 8) $\times$ 2 (DDR) $= 2600 \times 10^{6} \times (384 / 8) \times 2 \div 10^{9} \approx 250$ GB/s
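The same numbers can be read from the runtime API and the formula evaluated programmatically; a minimal sketch (assumes device 0, error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);              // query device 0
    // memoryClockRate is in kHz, memoryBusWidth is in bits; x2 for double data rate
    double bandwidth = 2.0 * prop.memoryClockRate * 1e3
                     * (prop.memoryBusWidth / 8) / 1e9;
    printf("Theoretical memory bandwidth: %.1f GB/s\n", bandwidth);
    return 0;
}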
Official HW Spec details: http://www.nvidia.com/object/tesla-servers.html
There you can find the number of CUDA cores, the maximum block/grid sizes, the shared memory size, and so on.
# compile the source code to an executable file
$ nvcc a.cu -o a.out
The GPU and CPU code are compiled and linked separately.
In an intermediate phase, the nvcc compiler translates CUDA source code into the Parallel Thread Execution (PTX) language.
# keep all intermediate phase files
$ nvcc a.cu -keep
# or
$ nvcc a.cu -save-temps
$ nvcc a.cu -keep
$ ls
a.cpp1.ii  a.cpp4.ii    a.cudafe1.c    a.cudafe1.stub.c  a.cudafe2.stub.c  a.hash       a.out
a.cpp2.i   a.cu         a.cudafe1.cpp  a.cudafe2.c       a.fatbin          a.module_id  a.ptx
a.cpp3.i   a.cu.cpp.ii  a.cudafe1.gpu  a.cudafe2.gpu     a.fatbin.c        a.o          a.sm_10.cubin
# clean all intermediate phase files
$ nvcc a.cu -keep -clean
PTX provides a stable programming model and instruction set for general purpose parallel programming. It is designed to be efficient on NVIDIA GPUs.
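If you only want to inspect the generated PTX, nvcc can also emit it directly; for example (file name and architecture here are only illustrations):

# generate a.ptx only, without building an executable
$ nvcc -ptx a.cu
# target a specific GPU architecture, e.g. compute capability 3.5 (Tesla K20X)
$ nvcc -arch=sm_35 a.cu -o a.out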
Print code generation statistics:
$ nvcc -Xptxas -v reduce.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6reducePiS_' for 'sm_10'
ptxas info    : Used 6 registers, 32 bytes smem, 4 bytes cmem[1]
-Xptxas, --ptxas-options: specify options to pass directly to ptxas, the PTX optimizing assembler.
cuda-memcheck checks your program for the following memory errors, and it also reports hardware exceptions encountered by the GPU. These errors may not crash the program, but they can lead to unexpected behavior and incorrect memory usage.
Table. Memcheck reported error types

Name                     | Description                                                                                                       | Location | Precision
Memory access error      | Errors due to out of bounds or misaligned accesses to memory by a global, local, shared or global atomic access. | Device   | Precise
Hardware exception       | Errors that are reported by the hardware error reporting mechanism.                                              | Device   | Imprecise
Malloc/Free errors       | Errors that occur due to incorrect use of malloc()/free() in CUDA kernels.                                       | Device   | Precise
CUDA API errors          | Reported when a CUDA API call in the application returns a failure.                                              | Host     | Precise
cudaMalloc memory leaks  | Allocations of device memory using cudaMalloc() that have not been freed by the application.                     | Host     | Precise
Device heap memory leaks | Allocations of device memory using malloc() in device code that have not been freed by the application.          | Device   | Imprecise

Program with a double-free fault:
int main(int argc, char *argv[])
{
    const int elemNum = 1024;
    int h_data[elemNum];
    int *d_data;
    initArray(h_data);
    int arraySize = elemNum * sizeof(int);
    cudaMalloc((void **) &d_data, arraySize);
    incrOneForAll<<< 1, 1024 >>>(d_data);
    cudaMemcpy(h_data, d_data, arraySize, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_data);       // fault: d_data has already been freed
    printArray(h_data);
    return 0;
}
$ nvcc -g -G example.cu
$ cuda-memcheck ./a.out
========= CUDA-MEMCHECK
========= Program hit error 17 on CUDA API call to cudaFree
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/libcuda.so [0x26d660]
=========     Host Frame:./a.out [0x42af6]
=========     Host Frame:./a.out [0x2a29]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1ecdd]
=========     Host Frame:./a.out [0x2769]
=========
No error is shown when the program is run directly, but cuda-memcheck can detect the error.
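You can also catch such API failures inside the program by checking the return value of every CUDA runtime call; a minimal sketch (the CHECK macro is only an illustration, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err = (call);                                        \
        if (err != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                 \
                    cudaGetErrorString(err), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

int main()
{
    int *d_data;
    CHECK(cudaMalloc((void **) &d_data, 1024 * sizeof(int)));
    CHECK(cudaFree(d_data));
    CHECK(cudaFree(d_data));   // reported immediately: invalid device pointer
    return 0;
}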
Purpose of nvidia-smi: query and modify the state of the GPU devices.
$ nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 5.319.37   Driver Version: 319.37         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20Xm         On   | 0000:0B:00.0     Off |                    0 |
| N/A   35C    P0    60W / 235W |       84MB /  5759MB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20Xm         On   | 0000:85:00.0     Off |                    0 |
| N/A   39C    P0    60W / 235W |       14MB /  5759MB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0     33736  ./RS                                                   69MB |
+-----------------------------------------------------------------------------+
You can query more specific information on temperature, memory, power, etc.
$ nvidia-smi -q -d [TEMPERATURE|MEMORY|POWER|CLOCK|...]
For example:
$ nvidia-smi -q -d POWER

==============NVSMI LOG==============

Timestamp                           :
Driver Version                      : 319.37

Attached GPUs                       : 2

GPU 0000:0B:00.0
    Power Readings
        Power Management            : Supported
        Power Draw                  : 60.71 W
        Power Limit                 : 235.00 W
        Default Power Limit         : 235.00 W
        Enforced Power Limit        : 235.00 W
        Min Power Limit             : 150.00 W
        Max Power Limit             : 235.00 W

GPU 0000:85:00.0
    Power Readings
        Power Management            : Supported
        Power Draw                  : 31.38 W
        Power Limit                 : 235.00 W
        Default Power Limit         : 235.00 W
        Enforced Power Limit        : 235.00 W
        Min Power Limit             : 150.00 W
        Max Power Limit             : 235.00 W
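For scripting or logging, nvidia-smi can also print selected fields in CSV form or refresh periodically (a sketch; the available field names depend on your driver version):

# print selected fields as CSV
$ nvidia-smi --query-gpu=index,temperature.gpu,memory.used,power.draw --format=csv
# refresh the summary every second
$ nvidia-smi -l 1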
Program-#1: increase each element in an array by one. (You are required to rewrite a CPU program into a CUDA one.)
Program-#2: use parallel reduction to calculate the sum of all the elements in an array. (You are required to fill in the blanks of a template CUDA program, and report your GPU bandwidth to the TA after you finish each assignment.)
- SUM CUDA programming with "multi-kernel and shared memory"
- SUM CUDA programming with "interleaved addressing"
- SUM CUDA programming with "sequential addressing"
- SUM CUDA programming with "first add during load"
Each task is worth 0.2 points.
Rewrite the following CPU function into a CUDA kernel function and complete the main function by yourself:
// increase one for all the elements
void incrOneForAll(int *array, const int elemNum)
{
    int i;
    for (i = 0; i < elemNum; ++i) {
        array[i]++;
    }
}
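For orientation, one possible shape of the CUDA version (a sketch only, not the required solution; the block size of 256 is an arbitrary choice):

// each thread increments one array element
__global__ void incrOneForAll(int *array, const int elemNum)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < elemNum) {
        array[i]++;
    }
}

// launch: enough 256-thread blocks to cover elemNum elements
// incrOneForAll<<< (elemNum + 255) / 256, 256 >>>(d_data, elemNum);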
__global__ void reduce(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];

    // TODO: load the content of global memory to shared memory
    // NOTE: synchronize all the threads after this step

    // TODO: sum calculation
    // NOTE: synchronize all the threads after each iteration

    // TODO: write back the result into the corresponding entry of global memory
    // NOTE: only one thread is enough to do the job
}
// parameters for the first kernel
// TODO: set grid and block size
// threadNum = ?
// blockNum = ?
int sMemSize = 1024 * sizeof(int);
reduce<<< blockNum, threadNum, sMemSize >>>(d_idata, d_odata);
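For reference, here is one way the blanks could be filled in, using the sequential-addressing scheme (a sketch that assumes blockDim.x is a power of two; the other assigned variants differ mainly in the indexing):

__global__ void reduce(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];

    // load one element per thread from global memory into shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    // sequential addressing: halve the number of active threads each iteration
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // thread 0 writes this block's partial sum back to global memory
    if (tid == 0) {
        g_odata[blockIdx.x] = sdata[0];
    }
}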
mykernel <<< gridSize, blockSize, sMemSize, streamID >>> (args);
cudaError_t cudaMemcpy ( void *dst, const void *src, size_t count, enum cudaMemcpyKind kind )
Enumerator (cudaMemcpyKind): cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, cudaMemcpyDefault.
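A typical host-to-device / device-to-host round trip with cudaMemcpy (a minimal sketch; error checking omitted):

#include <cuda_runtime.h>

int main()
{
    const int elemNum = 1024;
    int h_data[elemNum] = {0};
    int *d_data;
    size_t arraySize = elemNum * sizeof(int);

    cudaMalloc((void **) &d_data, arraySize);
    cudaMemcpy(d_data, h_data, arraySize, cudaMemcpyHostToDevice);   // host -> device
    // ... launch kernels that read/write d_data here ...
    cudaMemcpy(h_data, d_data, arraySize, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d_data);
    return 0;
}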
kernel1 <<< gridSize, blockSize >>> (args);
cudaDeviceSynchronize();
kernel2 <<< gridSize, blockSize >>> (args);
Methods for timing a kernel, e.g. with CUDA events:
cudaEvent_t start, stop;
float time;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel<<< grid, threads >>>(d_idata, d_odata);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

cudaEventElapsedTime(&time, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);
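Since you are asked to report bandwidth, the measured time can then be converted into an effective bandwidth; a sketch that continues the snippet above (the element count and the one-read-plus-one-write assumption depend on your kernel):

// cudaEventElapsedTime() reports milliseconds
// effective bandwidth = (bytes read + bytes written) / elapsed time
const int elemNum = 1024 * 1024;                      // example problem size
double bytes = 2.0 * elemNum * sizeof(int);           // assume one read and one write per element
double bandwidth = bytes / (time / 1000.0) / 1e9;     // GB/s
printf("Effective bandwidth: %.2f GB/s\n", bandwidth);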