[Photo: Linus Torvalds]
Time for a basic example. We will sum two vectors. Here is the formal definition: add the corresponding elements of A and B and store the result in C.
void vecadd(int *A, int *B, int *C) {
    for (int i = 0; i < L; i++) {
        C[i] = A[i] + B[i];
    }
}

I hope everybody understands this code, and some of you probably write it, in slightly changed form, in your daily work. How can we make it faster? By using all of our cores.
void vecadd(int *A, int *B, int *C) {
    int i, chunk = CHUNKSIZE;
    #pragma omp parallel shared(A,B,C,chunk) private(i)
    {
        #pragma omp for schedule(dynamic,chunk) nowait
        for (i = 0; i < L; i++) {
            C[i] = A[i] + B[i];
        }
    }
}

I think that as long as we want to stay in C and keep the code readable, OpenMP fits perfectly.
#version 110
uniform sampler2D texture1;
uniform sampler2D texture2;

void main() {
    vec4 A = texture2D(texture1, gl_TexCoord[0].st);
    vec4 B = texture2D(texture2, gl_TexCoord[0].st);
    gl_FragColor = A + B;
}

In 2001 we got the first cards with shader support. Our task is similar to blending two textures.
__kernel void vecadd(__global int *A, __global int *B, __global int *C) {
    int id = get_global_id(0);
    C[id] = A[id] + B[id];
}

The next step in moving computation to the GPU was OpenCL. It was designed as one standard to rule all platforms. It was created by Apple, but it is now led by the Khronos Group with support from many companies. What's going on here? We get something like a thread ID, but in the GPU world, and add the corresponding values.
__global__ void vecadd(int *A, int *B, int *C) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    C[id] = A[id] + B[id];
}

In 2007 NVIDIA presented CUDA. Conceptually it is similar to OpenCL, but it works only with NVIDIA cards; in return we get an Eclipse-based IDE with a debugger and a profiler.
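The kernel alone is not the whole story; the host code still has to move the data and launch it. Roughly, a minimal host-side sketch looks like this (run_vecadd, h_A, h_B, h_C and N are my names, not from the slides, and the kernel above would normally get an "if (id < N)" guard when N is not a multiple of the block size):

#include <cuda_runtime.h>

void run_vecadd(int *h_A, int *h_B, int *h_C, int N) {
    int *d_A, *d_B, *d_C;
    size_t bytes = N * sizeof(int);

    /* allocate device memory */
    cudaMalloc((void **)&d_A, bytes);
    cudaMalloc((void **)&d_B, bytes);
    cudaMalloc((void **)&d_C, bytes);

    /* copy the inputs to the GPU */
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    /* one thread per element: enough 256-thread blocks to cover N
       (assumes N is a multiple of 256, or a guarded kernel) */
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    vecadd<<<blocks, threads>>>(d_A, d_B, d_C);

    /* copy the result back and clean up */
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}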
http://hpclab.blogspot.com/2011/09/is-gpu-good-for-large-vector-addition.html
And the results are... miserable. The comparison is from 2011, but the trend still holds: the CPU is much faster. Let's take another try and, instead of adding vectors, multiply matrices.
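The exact kernel from that benchmark is not shown here, but a naive CUDA matrix multiplication looks roughly like this (a sketch assuming square, row-major matrices of size N):

__global__ void matmul(const float *A, const float *B, float *C, int N) {
    /* one thread computes one element of C */
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

One thread per output element is the simplest possible mapping; real implementations tile the matrices through shared memory, but even this naive version does N multiply-adds per thread, far more arithmetic per byte transferred than vector addition.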
This is nice. The GPU overtakes the CPU, but why wasn't it working with vectors? Because vector addition does almost no arithmetic per element, the time spent copying data over the bus dominates; matrix multiplication does much more work per byte transferred, so the GPU can shine.
(Comic: Randall Munroe, xkcd.com)
The documentation is really friendly and it is the best way to avoid errors and get maximum performance from the hardware, so I strongly recommend reading it.
Problem: Development is hard
Solution: Always have a spare GPU in your computer
What do you think you will see when your program starts an infinite loop on the GPU? It will eat all the resources and the display will hang. So it is good practice to have another GPU just for development. If you have a laptop, you probably already have two cards, and not cheap ones like this.
Problem: Debugging is impossible
Solution: Write tests and run them!
We are working close to the hardware, so every bit matters and can change performance, so test often. And debugging 10k threads is hard.
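A minimal sketch of the kind of test I mean, reusing the host wrapper sketched earlier (check_vecadd and the input pattern are my inventions, not from the slides): run the kernel, recompute the result on the CPU, and compare element by element.

#include <stdlib.h>

int check_vecadd(int N) {
    int *A = (int *)malloc(N * sizeof(int));
    int *B = (int *)malloc(N * sizeof(int));
    int *C = (int *)malloc(N * sizeof(int));
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2 * i; }

    run_vecadd(A, B, C, N);           /* GPU result via the host wrapper above */

    int failures = 0;
    for (int i = 0; i < N; i++) {     /* CPU reference */
        if (C[i] != A[i] + B[i]) failures++;
    }
    free(A); free(B); free(C);
    return failures;                  /* 0 means the GPU agrees with the CPU */
}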
Problem: Copying data to/from the GPU is slow
Solution: Use streams and compute while data is loading
We can load data while our card is busy working on the data we loaded before.
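A rough sketch of what that looks like with CUDA streams (the names, the two-stream split and the chunking are illustrative; the host buffers would need to be allocated with cudaMallocHost for the asynchronous copies to really overlap, and the same caveat about a guarded kernel applies):

void vecadd_streamed(int *h_A, int *h_B, int *h_C,
                     int *d_A, int *d_B, int *d_C, int N, int chunk) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int off = 0, i = 0; off < N; off += chunk, i++) {
        cudaStream_t st = s[i % 2];   /* alternate between the two streams */
        int n = (off + chunk < N) ? chunk : N - off;
        size_t bytes = n * sizeof(int);
        /* while one stream computes its chunk, the other can be copying */
        cudaMemcpyAsync(d_A + off, h_A + off, bytes, cudaMemcpyHostToDevice, st);
        cudaMemcpyAsync(d_B + off, h_B + off, bytes, cudaMemcpyHostToDevice, st);
        vecadd<<<(n + 255) / 256, 256, 0, st>>>(d_A + off, d_B + off, d_C + off);
        cudaMemcpyAsync(h_C + off, d_C + off, bytes, cudaMemcpyDeviceToHost, st);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}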
Problem: GPUs don't like 64-bit computation
Solution: Wait for the next release
Right now we can't do much about it.
Problem: I don't want to code a lot
Solution: Use libs
Check what already exists before you code your custom solution.
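For example, with the Thrust library that ships with the CUDA toolkit, the whole vector addition is a single call and there is no hand-written kernel at all (a sketch):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/copy.h>

void vecadd_thrust(const thrust::host_vector<int> &A,
                   const thrust::host_vector<int> &B,
                   thrust::host_vector<int> &C) {
    thrust::device_vector<int> dA = A;                  /* copies to the GPU */
    thrust::device_vector<int> dB = B;
    thrust::device_vector<int> dC(A.size());
    thrust::transform(dA.begin(), dA.end(), dB.begin(),
                      dC.begin(), thrust::plus<int>()); /* runs on the GPU */
    thrust::copy(dC.begin(), dC.end(), C.begin());      /* copies back */
}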
Let's see a data-related example.

postgres=# SELECT COUNT(*) FROM t1 WHERE sqrt((x-25.6)^2 + (y-12.8)^2) < 15;
 count
-------
  6718
(1 row)
Time: 7019.855 ms
postgres=# SELECT COUNT(*) FROM t2 WHERE sqrt((x-25.6)^2 + (y-12.8)^2) < 15;
 count
-------
  6718
(1 row)
Time: 176.301 ms
t1 and t2 contain the same contents, 10 million records each, but t1 is a regular table and t2 is a foreign table managed by PG-Strom.
PG-Strom is a foreign data wrapper for PostgreSQL. This is an example from their documentation, and you must admit that the results are spectacular. There are also some MapReduce frameworks that accelerate Hadoop. Some researchers claim that their jobs run 200 times faster, but I'm not sure whether they tested that in production or at home.