GPGPU – General-Purpose computing on Graphics Processor Units – Why should I bother with GPUs?



GPGPU_presentation

Presentation for my lightning talk at the Warsaw Hadoop Users Group (10.12.2014)

On Github janisz / GPGPU_presentation

GPGPU

General-Purpose computing on Graphics Processor Units

Tomek Janiszewski / @janiszt

1. Who has ever played with a GPU? Some custom shader, OpenCL, CUDA...? My experience: I'm part of a team that is building an in-memory time series database, like OpenTSDB but stored on GPUs. This talk will be a really short introduction to GPGPU, just to show that it exists.

Why should I bother with GPUs?

2. OK, why should we consider moving our computations from "normal" CPUs to GPUs?

Moore is no more

Moore's law says that "the number of transistors in a dense integrated circuit doubles approximately every two years". Now we have reached the limit and can't put more transistors in one chip. Instead, we put more cores in one processor. But what if we had thousands of not-so-powerful cores instead of a dozen mighty ones?

Real parallel computation

With data analysis we need to process thousands of data items in the same manner. That is Single Instruction, Multiple Data (SIMD), the opposite of the traditional model where each core runs its own instruction stream, and now we can do it with thousands of cores.

Available for normal people

Modern GPUs are available to ordinary people. They cost less than $50. Of course, there are high-end cards that cost thousands of dollars and don't even have a video output.

Applications

  • Graphics
  • Simulations
  • AI
  • Data Analysis
3. Where can we gain performance by using a GPU? Graphics, of course. Graphics cards are made to compute graphics transformations, so that's no surprise. But maybe you don't know that modern browsers accelerate page rendering using GPUs. Have you ever heard about Havok or PhysX? They are physics engines that use GPUs to provide better simulations. But that's not the only application of GPUs in simulations: we also use GPUs to simulate protein structures. Artificial intelligence is a fairly new application of GPUs; three months ago NVIDIA released cuDNN, a deep neural network accelerator. Data analysis is, I think, the biggest group. In fact we could put the applications I mentioned before into this group, for example image edge detection is data analysis. It's mostly bioinformatics, with text algorithms to analyse DNA, but there are more and more data analysis accelerators for Matlab or R, and databases that use the GPU, although those are still mostly academic research.

“Talk is cheap. Show me the code.”

Linus Torvalds

4. Time for a basic example. We will sum two vectors. Here is the formal definition.

Add the corresponding locations of A and B, and store the result in C.

Plain Old C

							void vecadd( int *A , int *B , int *C)
							{
							    for (int i = 0; i < L; i++) {  /* L is the vector length, defined elsewhere */
							        C[i] = A[i] + B[i];
							    }
							}
						
I hope everybody understands this code, and probably some of you write it in a slightly changed form in your daily work. How can we make it faster? By using all of our cores.

OpenMP

							void vecadd( int *A , int *B , int *C)
							{
							    int chunk = CHUNKSIZE;  /* CHUNKSIZE defined elsewhere */
							    #pragma omp parallel shared(A,B,C,chunk)
							    {
							        #pragma omp for schedule(dynamic,chunk) nowait
							        for (int i = 0; i < L; i++) {
							            C[i] = A[i] + B[i];
							        }
							    }
							}
							
I think that as long as we want to stay in C and keep the code readable, OpenMP fits perfectly.

GLSL

							#version 110

							uniform sampler2D texture1;
							uniform sampler2D texture2;

							void main() {
							    vec4 A = texture2D(texture1, gl_TexCoord[0].st);
							    vec4 B = texture2D(texture2, gl_TexCoord[0].st);
							    gl_FragColor = A + B;
							}
							
In 2001 we got the first cards with shader support. Our task is similar to blending two textures.

OpenCL

								__kernel
								void vecadd(__global int *A,
								            __global int *B,
								            __global int *C)
								{
								    int id = get_global_id(0);
								    C[id] = A[id] + B[id];
								}
							
The next step in moving computation to the GPU was, or rather is, OpenCL. It was designed as one standard to rule all platforms. It was created by Apple, but now it is led by the Khronos Group with support from many companies. What is going on here? We get something like a thread ID, but in the GPU world, and add the corresponding values.

CUDA

								__global__
								void vecadd( int *A , int *B , int *C)
								{
								    int id = blockIdx.x*blockDim.x+threadIdx.x;
								    C[id] = A[id] + B[id] ;
								}
							
In 2007 NVIDIA presented CUDA. In concept it's similar to OpenCL, but it works only with NVIDIA cards; in return we get an Eclipse-based IDE with a debugger and profiler.

Not so fast

  • Copy data to GPU
  • Launch kernel
  • Check for errors
  • Copy output back to RAM

But wait. So far I have only shown you the code responsible for the computation. We still need to put the data on the GPU somehow, launch our function, and get the results back; see the sketch below.
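
To make those four steps concrete, here is a minimal sketch (my illustration, not taken from the talk) of the host-side code around the vecadd CUDA kernel shown earlier. The array size N and the fill values are arbitrary assumptions, and error handling is reduced to a single check after the launch.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define N (1 << 20)   /* element count; a multiple of the block size, so no bounds check needed */

    __global__ void vecadd(int *A, int *B, int *C)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        C[id] = A[id] + B[id];
    }

    int main(void)
    {
        size_t bytes = N * sizeof(int);
        int *hA = (int *)malloc(bytes), *hB = (int *)malloc(bytes), *hC = (int *)malloc(bytes);
        for (int i = 0; i < N; i++) { hA[i] = i; hB[i] = 2 * i; }

        int *dA, *dB, *dC;
        cudaMalloc(&dA, bytes);
        cudaMalloc(&dB, bytes);
        cudaMalloc(&dC, bytes);

        /* 1. Copy data to GPU */
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

        /* 2. Launch kernel: N/256 blocks of 256 threads */
        vecadd<<<N / 256, 256>>>(dA, dB, dC);

        /* 3. Check for errors */
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess)
            printf("kernel launch failed: %s\n", cudaGetErrorString(err));

        /* 4. Copy output back to RAM */
        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
        printf("C[42] = %d\n", hC[42]);

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(hA); free(hB); free(hC);
        return 0;
    }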

Results

http://hpclab.blogspot.com/2011/09/is-gpu-good-for-large-vector-addition.html

And the results are... miserable. The comparison is from 2011, but the trend is still valid: the CPU is much faster. Let's take another try and, instead of adding vectors, multiply matrices.

but for harder problems

http://hpclab.blogspot.com/2011/09/is-gpu-good-for-large-vector-addition.html

This is nice. The GPU overtakes the CPU. But why didn't it work with vectors?

How does it work inside?

5. To understand this we must go deeper, below the C, and touch the raw metal.

This is a conceptual diagram of a GPU. Our main bottleneck is PCIe, where we are limited to about 16 GB/s, which is comparable with DDR3. That's why vector addition is slower than on a regular CPU: the data transfer overwhelms the computation. In matrix multiplication we don't see it, because the computation takes much longer than copying the data. Next we have two types of read-only memory: constant memory, where constants and kernel arguments are stored, and texture memory, optimised for 2D access. Global memory is readable and writable, and requires sequential, aligned (16-byte) reads and writes to be fast. The other types of memory are faster: shared memory is shared across the threads in one block, while local memory and registers are private to a thread. Threads are packed into blocks, which are packed into grids, but I won't talk about that here.
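
As an illustration only (this kernel is hypothetical, not part of the slides), the CUDA qualifiers below show where those memory spaces appear in code. The __constant__ value would be filled from the host with cudaMemcpyToSymbol, and the shared-memory tile exists purely to demonstrate the qualifier.

    #define BLOCK 256

    __constant__ int scale;                      /* constant memory: read-only on the device */

    /* launch with BLOCK threads per block */
    __global__ void scaled_copy(const int *in, int *out, int n)
    {
        __shared__ int tile[BLOCK];              /* shared memory: one copy per thread block */

        int gid = blockIdx.x * blockDim.x + threadIdx.x;   /* plain locals live in registers */

        tile[threadIdx.x] = (gid < n) ? in[gid] : 0;       /* coalesced (sequential, aligned) global read */
        __syncthreads();                                   /* every thread in the block hits the barrier */

        if (gid < n)
            out[gid] = tile[threadIdx.x] * scale;          /* global memory write */
    }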

Rules Of Thumb

6. Some best practices.

RTFM

“Life is too short for man pages, and occasionally much too short without them.”

Randall Munroe (xkcd.com)

The documentation is really friendly and is the best way to avoid errors and get maximum performance from the hardware, so I strongly recommend reading it.

Think parallel

Vector addition is the "Hello World" example, and GPGPU is more than running a loop over a thousand threads. We need to change our algorithms, our data structures, and probably our data layout to work at full speed; a small example follows.
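
For example, a sequential sum has to be restructured before it maps well onto thousands of threads. The sketch below is my illustration, not from the talk: the classic tree reduction in shared memory, where each block of 256 threads reduces its tile to one partial sum and the host (or a second kernel) adds the partials.

    /* launch with exactly 256 threads per block */
    __global__ void block_sum(const int *in, int *partial, int n)
    {
        __shared__ int tile[256];
        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;

        tile[tid] = (gid < n) ? in[gid] : 0;
        __syncthreads();

        /* halve the number of active threads at every step */
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                tile[tid] += tile[tid + stride];
            __syncthreads();
        }

        if (tid == 0)
            partial[blockIdx.x] = tile[0];       /* one partial sum per block */
    }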

SIMD

Branches and synchronisation often kill performance. Sometimes it is faster to replace a branch with arithmetic, for example a sine and a multiplication; a small example follows.
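
A toy illustration (mine, not from the slides): both CUDA kernels below compute the same ReLU-style result, but the second replaces the branch with fmaxf, so every thread in a warp executes the same instruction. In practice the compiler often predicates such tiny branches anyway; the point is the habit of thinking in arithmetic instead of control flow.

    __global__ void relu_branch(const float *in, float *out, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) {
            if (in[id] > 0.0f)                   /* threads in one warp may diverge here */
                out[id] = in[id];
            else
                out[id] = 0.0f;
        }
    }

    __global__ void relu_branchless(const float *in, float *out, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n)
            out[id] = fmaxf(in[id], 0.0f);       /* same result, no divergent branch */
    }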

Problems

7. You have seen that the GPU doesn't solve all your problems, and when you use it, it will generate some of its own.

Problem: Development is hard

Solution: Always have spare GPU in your computer

What do you think you will see when your program enters an infinite loop on the GPU? It will eat all the resources and the display will hang. So a good practice is to have a spare GPU for development. If you have a laptop, you probably already have two cards, just not cheap ones like this.

Problem: Debugging is impossible

Solution: Write tests and run them!

We are working close to the hardware, so every bit matters and can change performance. Test often, because debugging 10k threads is hard.

Problem: Copying data to/from GPU is slow

Solution: Use streams and compute while data is being loaded

We can load the next chunk of data while the card is busy working on the data we loaded before; a sketch with CUDA streams follows.
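
A minimal sketch of the idea with CUDA streams, under assumptions of mine: hA, hB and hC are pinned host buffers (allocated with cudaMallocHost), dA, dB and dC are device buffers of the same size, and vecadd_n is a bounds-checked variant of the earlier kernel. Copies and kernels queued on different streams can overlap in time.

    #include <cuda_runtime.h>

    __global__ void vecadd_n(const int *A, const int *B, int *C, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) C[id] = A[id] + B[id];
    }

    void vecadd_streamed(const int *hA, const int *hB, int *hC,
                         int *dA, int *dB, int *dC, int n)
    {
        const int CHUNK = 1 << 20;               /* elements per chunk (an arbitrary choice) */
        const int NSTREAMS = 2;
        cudaStream_t streams[NSTREAMS];
        for (int s = 0; s < NSTREAMS; s++)
            cudaStreamCreate(&streams[s]);

        for (int off = 0, c = 0; off < n; off += CHUNK, c++) {
            cudaStream_t s = streams[c % NSTREAMS];
            int len = (n - off < CHUNK) ? (n - off) : CHUNK;
            size_t bytes = (size_t)len * sizeof(int);

            /* copy-in, kernel and copy-out are queued on one stream; the other
               stream can be transferring or computing at the same time */
            cudaMemcpyAsync(dA + off, hA + off, bytes, cudaMemcpyHostToDevice, s);
            cudaMemcpyAsync(dB + off, hB + off, bytes, cudaMemcpyHostToDevice, s);
            vecadd_n<<<(len + 255) / 256, 256, 0, s>>>(dA + off, dB + off, dC + off, len);
            cudaMemcpyAsync(hC + off, dC + off, bytes, cudaMemcpyDeviceToHost, s);
        }

        cudaDeviceSynchronize();
        for (int s = 0; s < NSTREAMS; s++)
            cudaStreamDestroy(streams[s]);
    }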

Problem: GPUs don't like 64-bit computation

Solution: Wait for next release

Right now we can't do much about it.

Problem: I don't want to code a lot

Solution: Use libs

  • ArrayFire
  • Thrust (STL for CUDA)
  • cuBLAS (Basic Linear Algebra Subprograms)
  • cuFFT
  • cuDNN (GPU-accelerated library of primitives for deep neural networks)
ArrayFire is a commercial product that was open-sourced a month ago. It integrates cuBLAS, cuFFT, and some other statistical functions into a multi-backend library (CUDA, OpenCL, CPU). The other libraries are CUDA-only.

Check them before you code your custom solution; a short Thrust example follows.
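
As a taste of how much boilerplate a library hides, here is a sketch (my example, not from the talk) of the same vector addition written with Thrust: the device_vector constructors do the allocation and the host-to-device copies, and thrust::transform runs the addition on the GPU.

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/functional.h>

    int main()
    {
        const int N = 1 << 20;
        thrust::host_vector<int> hA(N, 1), hB(N, 2);

        thrust::device_vector<int> dA = hA;      /* implicit copy to the GPU */
        thrust::device_vector<int> dB = hB;
        thrust::device_vector<int> dC(N);

        /* C[i] = A[i] + B[i], executed on the device */
        thrust::transform(dA.begin(), dA.end(), dB.begin(), dC.begin(),
                          thrust::plus<int>());

        thrust::host_vector<int> hC = dC;        /* copy the result back */
        return 0;
    }

Compiled with nvcc, this does the same allocation, transfer, launch, and copy-back as the explicit CUDA example, in a dozen lines.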

8. Let's look at a data-related example.

PG-Strom

							postgres=# SELECT COUNT(*) FROM t1 WHERE sqrt((x-25.6)^2 + (y-12.8)^2) < 15;
							count
							-------
							6718
							(1 row)

							Time: 7019.855 ms
						
							postgres=# SELECT COUNT(*) FROM t2 WHERE sqrt((x-25.6)^2 + (y-12.8)^2) < 15;
							count
							-------
							6718
							(1 row)

							Time: 176.301 ms
						

t1 and t2 contain the same contents, 10 million records each, but t1 is a regular table and t2 is a foreign table managed by PG-Strom.

PG-Strom is a foreign data wrapper for PostgreSQL. This is an example from their documentation, and you must admit the results are spectacular. There are also some MapReduce frameworks that accelerate Hadoop; some researchers claim their jobs run 200 times faster, but I'm not sure whether they tested it in production or at home.

Questions?

That's all. Thank you for listening!