CUDA-lab

Slides for the CUDA lab of the Parallel Programming course (GitHub: yszheda/CUDA-lab)

CUDA Lab

LSALab

Overview

  • Programming Environment
  • Compile & Run CUDA program
  • CUDA Tools
  • Lab Tasks
  • CUDA Programming Tips
  • References

GPU Server

  • Intel Xeon E5-2670 v2 10-core CPU × 2
  • NVIDIA Tesla K20X GPGPU card × 2

Command to get your GPGPU HW spec:

$ /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery

Device 0: "Tesla K20Xm"
  CUDA Driver Version / Runtime Version          5.5 / 5.5
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 5760 MBytes (6039339008 bytes)
  (14) Multiprocessors, (192) CUDA Cores/MP:     2688 CUDA Cores
  GPU Clock rate:                                732 MHz (0.73 GHz)
  Memory Clock rate:                             2600 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)

Theoretical memory bandwidth: $2600 \times 10^{6} \times (384 / 8) \times 2 \div 10^{9} \approx 250$ GB/s (memory clock × bus width in bytes × 2 for double data rate)

Official HW Spec details: http://www.nvidia.com/object/tesla-servers.html

Note in particular the CUDA core count, the maximum block/grid sizes, and the shared memory size reported above.

Compile & Run CUDA

  • Directly compile to executable code
# compile the source code to executable file
$ nvcc a.cu -o a.out
GPU and CPU code are compiled and linked separately
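For reference, a minimal a.cu that could be built with the command above might look like the following sketch (the kernel fillIndices and the sizes are only illustrative, not part of the lab code):

#include <cstdio>

// illustrative kernel: each thread writes its global index into the array
__global__ void fillIndices(int *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = i;
}

int main()
{
    const int n = 256;
    int h_data[n];
    int *d_data;
    cudaMalloc((void **) &d_data, n * sizeof(int));
    fillIndices<<< n / 64, 64 >>>(d_data);              // 4 blocks of 64 threads
    cudaMemcpy(h_data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    printf("h_data[%d] = %d\n", n - 1, h_data[n - 1]);  // expect 255
    return 0;
}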

Compile & Run CUDA

The nvcc compiler will translate CUDA source code into Parallel Thread Execution (PTX) language in the intermediate phase.

# keep all intermediate phase files
$ nvcc a.cu -keep
# or
$ nvcc a.cu -save-temps
$ nvcc a.cu -keep
$ ls
a.cpp1.ii  a.cpp4.ii    a.cudafe1.c    a.cudafe1.stub.c  a.cudafe2.stub.c  a.hash       a.out
a.cpp2.i   a.cu         a.cudafe1.cpp  a.cudafe2.c       a.fatbin          a.module_id  a.ptx
a.cpp3.i   a.cu.cpp.ii  a.cudafe1.gpu  a.cudafe2.gpu     a.fatbin.c        a.o          a.sm_10.cubin
# clean all intermediate phase files
$ nvcc a.cu -keep -clean
PTX provides a stable programming model and instruction set for general purpose parallel programming. It is designed to be efficient on NVIDIA GPUs.

Useful NVCC Usage

Print code generation statistics:

$ nvcc -Xptxas -v reduce.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6reducePiS_' for 'sm_10'
ptxas info    : Used 6 registers, 32 bytes smem, 4 bytes cmem[1]
-Xptxas
--ptxas-options
	 Specify options directly to the ptx optimizing assembler.

  • register number: should be less than the number of available registers; otherwise the excess registers spill into (off-chip) local memory.
  • smem stands for shared memory.
  • cmem stands for constant memory. Here constant bank 1 holds 4 bytes of constant variables.

As shown in the example above, the amounts of local and shared memory are each listed by two numbers: the first is the total size of all variables declared in that memory segment, and the second is the amount of system-allocated data. The amount and location of system-allocated data, as well as the allocation of constant variables to constant banks, is profile specific. For constant memory, the total space allocated in each bank is shown.

CUDA Tools

  • cuda-memcheck: functional correctness checking suite.
  • nvidia-smi: NVIDIA System Management Interface

cuda-memcheck

This tool checks the following memory errors in your program and also reports hardware exceptions encountered by the GPU. These errors may not crash the program, but they can lead to unexpected behavior and memory misuse.

Table: Memcheck reported error types

  • Memory access error: out-of-bounds or misaligned access to memory by a global, local, shared, or global atomic access. (Location: Device, Precision: Precise)
  • Hardware exception: errors reported by the hardware error reporting mechanism. (Device, Imprecise)
  • Malloc/Free errors: incorrect use of malloc()/free() in CUDA kernels. (Device, Precise)
  • CUDA API errors: reported when a CUDA API call in the application returns a failure. (Host, Precise)
  • cudaMalloc memory leaks: device memory allocated with cudaMalloc() that has not been freed by the application. (Host, Precise)
  • Device heap memory leaks: device memory allocated with malloc() in device code that has not been freed by the application. (Device, Imprecise)

cuda-memcheck

Example

Program with double free fault

int main(int argc, char *argv[])
{
	const int elemNum = 1024;
	int h_data[elemNum];
	int *d_data;
	initArray(h_data);
	int arraySize = elemNum * sizeof(int);
	cudaMalloc((void **) &d_data, arraySize);
	incrOneForAll<<< 1, 1024 >>>(d_data);
	cudaMemcpy(h_data, d_data, arraySize, cudaMemcpyDeviceToHost);
	cudaFree(d_data);
	cudaFree(d_data);	// fault
	printArray(h_data);
	return 0;
}

cuda-memcheck

Example

$ nvcc -g -G example.cu
$ cuda-memcheck ./a.out
========= CUDA-MEMCHECK
========= Program hit error 17 on CUDA API call to cudaFree
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/libcuda.so [0x26d660]
=========     Host Frame:./a.out [0x42af6]
=========     Host Frame:./a.out [0x2a29]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1ecdd]
=========     Host Frame:./a.out [0x2769]
=========

No error is shown when the program is run directly, but cuda-memcheck detects the double free.
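One reason the plain run stays silent is that the example never inspects the return values of its CUDA API calls. A minimal checking sketch (the macro name checkCuda is an illustrative choice, not part of the lab code) could look like this:

#include <cstdio>

// report any CUDA API failure with file and line information
#define checkCuda(call)                                              \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                    __FILE__, __LINE__, cudaGetErrorString(err));    \
        }                                                            \
    } while (0)

// usage: checkCuda(cudaFree(d_data));
// wrapping both cudaFree() calls this way would report the second, faulty free immediately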

NVIDIA System Management Interface (NVIDIA-SMI)

Purpose: Query and modify GPU devices' state.

$ nvidia-smi       
+------------------------------------------------------+                       
| NVIDIA-SMI 5.319.37   Driver Version: 319.37         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20Xm         On   | 0000:0B:00.0     Off |                    0 |
| N/A   35C    P0    60W / 235W |       84MB /  5759MB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20Xm         On   | 0000:85:00.0     Off |                    0 |
| N/A   39C    P0    60W / 235W |       14MB /  5759MB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0     33736  ./RS                                                  69MB  |
+-----------------------------------------------------------------------------+

nvidia-smi

You can query more specific information on temperature, memory, power, etc.

$ nvidia-smi -q -d [TEMPERATURE|MEMORY|POWER|CLOCK|...]

For example:

$ nvidia-smi -q -d POWER
==============NVSMI LOG==============
Timestamp                           : 
Driver Version                      : 319.37
Attached GPUs                       : 2
GPU 0000:0B:00.0
    Power Readings
        Power Management            : Supported
        Power Draw                  : 60.71 W
        Power Limit                 : 235.00 W
        Default Power Limit         : 235.00 W
        Enforced Power Limit        : 235.00 W
        Min Power Limit             : 150.00 W
        Max Power Limit             : 235.00 W

GPU 0000:85:00.0
    Power Readings
        Power Management            : Supported
        Power Draw                  : 31.38 W
        Power Limit                 : 235.00 W
        Default Power Limit         : 235.00 W
        Enforced Power Limit        : 235.00 W
        Min Power Limit             : 150.00 W
        Max Power Limit             : 235.00 W

Lab Assignments

Program #1: increase each element in an array by one. (You are required to rewrite a CPU program into a CUDA one.)

Program #2: use parallel reduction to calculate the sum of all the elements in an array. (You are required to fill in the blanks of a template CUDA program, and report your GPU bandwidth to the TA after you finish each assignment.)

  • #2-1: SUM CUDA programming with "multi-kernel and shared memory"
  • #2-2: SUM CUDA programming with "interleaved addressing"
  • #2-3: SUM CUDA programming with "sequential addressing"
  • #2-4: SUM CUDA programming with "first add during load"

Each task is worth 0.2 points.

Lab Assignment #1

Rewrite the following CPU function into a CUDA kernel function and complete the main function by yourself:

// increase one for all the elements
void incrOneForAll(int *array, const int elemNum)
{
    int i;
    for (i = 0; i < elemNum; ++i)
    {
        array[i] ++;
    }
}

Lab Assignment #2

  • Fill in the CUDA kernel function:
    __global__ void reduce(int *g_idata, int *g_odata)
    {
    	extern __shared__ int sdata[];
    
    	// TODO: load the content of global memory to shared memory
    	// NOTE: synchronize all the threads after this step
    
    	// TODO: sum calculation
    	// NOTE: synchronize all the threads after each iteration
    
    	// TODO: write back the result into the corresponding entry of global memory
    	// NOTE: only one thread is enough to do the job
    }
  • Part of the main function is given; you are required to fill in the blanks according to the comments:
    // parameters for the first kernel
    // TODO: set grid and block size
    // threadNum = ?
    // blockNum = ?
    int sMemSize = 1024 * sizeof(int);
    reduce<<< blockNum, threadNum, sMemSize >>>(d_idata, d_odata);

Lab Assignment #2

  • Given $2^{22}$ ints, and a maximum block size of $2^{10}$ threads per block,
  • how can 3 kernels be used to synchronize between reduction iterations?
Hint: for the "first add during global load" optimization (Assignment #2-4), the third kernel is unnecessary.

Lab Assignment #2-1

  • Implement the naïve data-parallel version (multi-kernel with shared memory).

Lab Assignment #2-2

  • Reduce the number of active warps in your program (interleaved addressing).

Lab Assignment #2-3

  • Prevent shared memory bank conflicts (sequential addressing).

Lab Assignment #2-4

  • Reduce the number of blocks in each kernel (first add during load).
  • Notice:
    • Only 2 kernels are needed in this case because each kernel can now process twice as much data as before.
    • Global memory should be accessed in a sequential addressing way.

CUDA Programming Tips

Kernel Launch

mykernel <<< gridSize, blockSize, sMemSize, streamID >>> (args);
  • gridSize: number of blocks per grid
  • blockSize: number of threads per block
  • sMemSize[optional]: shared memory size (in bytes)
  • streamID[optional]: stream ID, default is 0

Built-in Variables for Indexing in a Kernel Function

  • blockIdx.x, blockIdx.y, blockIdx.z: block index
  • threadIdx.x, threadIdx.y, threadIdx.z: thread index
  • gridDim.x, gridDim.y, gridDim.z: grid size (number of blocks per grid) per dimension
  • blockDim.x, blockDim.y, blockDim.z: block size (number of threads per block) per dimension
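A minimal sketch of how these variables are usually combined into a one-dimensional global index (the kernel name doubleAll and the 256-thread block size are illustrative, not part of the lab code):

// each thread handles one element; its global index combines the block
// index, the block size, and the thread index within the block
__global__ void doubleAll(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)       // guard against the last, partially filled block
        data[i] *= 2;
}

// launch with enough 256-thread blocks to cover all n elements:
// doubleAll<<< (n + 255) / 256, 256 >>>(d_data, n);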

cudaMemcpy

cudaError_t cudaMemcpy(void *dst,
                       const void *src,
                       size_t count,
                       enum cudaMemcpyKind kind)

Enumerator:

  • cudaMemcpyHostToHost: Host -> Host
  • cudaMemcpyHostToDevice: Host -> Device
  • cudaMemcpyDeviceToHost: Device -> Host
  • cudaMemcpyDeviceToDevice: Device -> Device
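A short sketch of a host-to-device and device-to-host round trip (the buffer name d_buf and the size are illustrative):

const int n = 1024;
int h_in[n], h_out[n];
int *d_buf;

cudaMalloc((void **) &d_buf, n * sizeof(int));
cudaMemcpy(d_buf, h_in, n * sizeof(int), cudaMemcpyHostToDevice);   // Host -> Device
// ... launch kernels that operate on d_buf ...
cudaMemcpy(h_out, d_buf, n * sizeof(int), cudaMemcpyDeviceToHost);  // Device -> Host
cudaFree(d_buf);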

Synchronization

  • __syncthreads(): synchronizes all threads in a block (used inside the kernel function; see the sketch after this list).
  • cudaDeviceSynchronize(): blocks until the device has completed all preceding requested tasks (used between two kernel launches).
    kernel1 <<< gridSize, blockSize >>> (args);
    cudaDeviceSynchronize();
    kernel2 <<< gridSize, blockSize >>> (args);
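For comparison, a minimal __syncthreads() sketch inside a kernel; the in-block reversal here is purely illustrative and assumes a launch with a single 256-thread block:

__global__ void reverseInBlock(int *data)
{
    __shared__ int tile[256];
    int t = threadIdx.x;
    tile[t] = data[t];
    __syncthreads();                       // wait until every thread has loaded its element
    data[t] = tile[blockDim.x - 1 - t];    // now safe to read elements loaded by other threads
}

// reverseInBlock<<< 1, 256 >>>(d_data);   // d_data must hold at least 256 ints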

How to Measure Kernel Execution Time Using CUDA GPU Timers

Methods:

  • cudaEventCreate(): init timer
  • cudaEventDestroy(): destroy timer
  • cudaEventRecord(): set timer
  • cudaEventSynchronize(): sync timer after each kernel call
  • cudaEventElapsedTime(): returns the elapsed time in milliseconds

How to Measure Kernel Execution Time Using CUDA GPU Timers

Example:
cudaEvent_t start, stop;
float time;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord( start, 0 );
kernel<<< grid,threads >>> (d_idata, d_odata);
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );

cudaEventElapsedTime( &time, start, stop );
cudaEventDestroy( start );
cudaEventDestroy( stop );
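Since the lab asks you to report your GPU bandwidth, the elapsed time above can be turned into an effective bandwidth figure. In the sketch below, which continues the example, bytesRead and bytesWritten are placeholders you compute for your own kernel:

// time is in milliseconds; divide by 1000 for seconds and by 1e9 for GB/s
double seconds   = time / 1000.0;
double bytes     = (double) bytesRead + (double) bytesWritten;  // fill in for your kernel
double bandwidth = bytes / seconds / 1e9;
printf("Effective bandwidth: %.2f GB/s\n", bandwidth);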

References

  • NVIDIA CUDA Runtime API :: CUDA Toolkit Documentation
  • Programming Guide :: CUDA Toolkit Documentation
  • Best Practices Guide :: CUDA Toolkit Documentation
  • NVCC :: CUDA Toolkit Documentation
  • CUDA-MEMCHECK :: CUDA Toolkit Documentation
  • nvidia-smi documentation
  • CUDA error types

THE END

Enjoy CUDA & Happy New Year!