The hardware is abstracted as a grid of thread blocks; blocks and threads are indexed from 0.
Blocks map to streaming multiprocessors (SMPs).
Threads map to CUDA cores, each of which executes the kernel.
Kernel calls on Host:
myKernel<<<blocksPerGrid, threadsPerBlock>>>(arguments);
cudaDeviceSynchronize();
Kernel functions on Device:
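To tie the host-side launch to the device-side kernel, here is a minimal sketch of an element-wise vector addition. The names `vecAdd` and `runVecAdd`, the block size of 256 and the array length are assumptions chosen for illustration; they do not come from the notes above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: runs on the device, one thread per element.
__global__ void vecAdd(const float *x, const float *y, float *out, int n) {
    // Global index: block offset plus thread offset, both counted from 0.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the last block may contain spare threads
        out[i] = x[i] + y[i];
    }
}

// Hypothetical host helper: adds two host arrays on the GPU.
// Returns true if the launch and the kernel execution reported no error.
bool runVecAdd(const float *h_x, const float *h_y, float *h_out, int n) {
    size_t size = n * sizeof(float);
    float *d_x = nullptr, *d_y = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_x, size);
    cudaMalloc((void **)&d_y, size);
    cudaMalloc((void **)&d_out, size);
    cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;  // assumed block size
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_x, d_y, d_out, n);

    cudaError_t err = cudaGetLastError();                   // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // wait for the kernel

    cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_out);
    return err == cudaSuccess;
}

int main(void) {
    const int n = 1000;
    float x[n], y[n], out[n];
    for (int i = 0; i < n; ++i) {
        x[i] = (float)i;
        y[i] = 2.0f * (float)i;
    }
    bool ok = runVecAdd(x, y, out, n);
    printf("ok=%d out[10]=%.1f\n", ok, out[10]);
    return ok ? 0 : 1;
}
```

The kernel launch is asynchronous, which is why the host calls `cudaDeviceSynchronize()` before relying on the result. The bounds check `i < n` matters whenever `n` is not a multiple of the block size.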
Memory Access by abstraction:
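Dynamic global memory:
#include <cstdlib>
#include <cuda_runtime.h>

#define N 1024  // element count (assumed; the original leaves N undefined)

int main(void) {
    float *a, *d_a;  // a: host pointer, d_a: device pointer
    size_t size = N * sizeof(float);
    // allocate on the CPU
    a = (float *)malloc(size);
    // allocate on the GPU
    cudaMalloc((void **)&d_a, size);
    // copying: cudaMemcpy(destination, source, size, direction)
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);  // CPU to GPU
    cudaMemcpy(a, d_a, size, cudaMemcpyDeviceToHost);  // GPU to CPU
    // free on the CPU and the GPU
    free(a);
    cudaFree(d_a);
    return 0;
}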
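Static global memory:
#include <cuda_runtime.h>

#define N 1024  // element count (assumed; the original leaves N undefined)

__device__ float d_a[N];  // statically allocated in device global memory

int main(void) {
    float a[N] = {0};  // host buffer
    // copying: cudaMemcpyToSymbol(symbol, source, size)
    cudaMemcpyToSymbol(d_a, a, N * sizeof(float));    // CPU to GPU
    // copying: cudaMemcpyFromSymbol(destination, symbol, size)
    cudaMemcpyFromSymbol(a, d_a, N * sizeof(float));  // GPU to CPU
    return 0;
}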
Constant memory is set at runtime by the host and read through the per-SM constant cache. When used correctly, it generates only 1/16 of the traffic of global loads:
It suits small amounts of read-only data.
Values are broadcast to the threads of a half warp (a group of 16 threads).
Reads are very fast on a cache hit and very slow on a cache miss.
Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store; thus, the more requests that can be served from the cache, the faster the system performs.
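#include <cuda_runtime.h>

__constant__ int my_const[16];

__global__ void add() {
    int i = blockIdx.x;
    // every thread in a block reads the same element, so the value is broadcast
    int value = my_const[i % 16];
    (void)value;  // unused in this sketch
}

int main(void) {
    int h_const[16] = {0};  // host values (placeholder contents)
    int blocksPerGrid = 4;  // assumed; the original leaves it undefined
    cudaMemcpyToSymbol(my_const, h_const, sizeof(h_const));  // set at runtime
    add<<<blocksPerGrid, 16>>>();
    cudaDeviceSynchronize();
    return 0;
}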
Read-only memory and texture memory are unified after Kepler. There are two methods for utilising read-only memory / texture memory.
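The notes do not name the two methods. The sketch below assumes they are the two commonly described ones: marking pointers `const __restrict__` so the compiler may route loads through the read-only cache, and calling the `__ldg()` intrinsic to request such a load explicitly (available from compute capability 3.5). The kernel and helper names are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Method 1 (assumed): const + __restrict__ tell the compiler the input is
// read-only and not aliased, so it may use the read-only data cache.
__global__ void scaleRestrict(const float *__restrict__ in,
                              float *__restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Method 2 (assumed): __ldg() explicitly loads through the read-only cache.
__global__ void scaleLdg(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * __ldg(&in[i]);
}

// Hypothetical host helper: doubles every element using one of the kernels.
bool runScale(bool useLdg, const float *h_in, float *h_out, int n) {
    size_t size = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_in, size);
    cudaMalloc((void **)&d_out, size);
    cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);

    int threads = 128;  // assumed block size
    int blocks = (n + threads - 1) / threads;
    if (useLdg) {
        scaleLdg<<<blocks, threads>>>(d_in, d_out, n);
    } else {
        scaleRestrict<<<blocks, threads>>>(d_in, d_out, n);
    }
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}

int main(void) {
    float in[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float outA[4], outB[4];
    bool okA = runScale(false, in, outA, 4);
    bool okB = runScale(true, in, outB, 4);
    printf("restrict ok=%d, ldg ok=%d\n", okA, okB);
    return (okA && okB) ? 0 : 1;
}
```

Both kernels compute the same result; they differ only in how the load of `in[i]` is expressed. With `const __restrict__` the use of the read-only cache is left to the compiler, whereas `__ldg()` requests it directly.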
Memory bandwidth is the rate at which data can be read from or stored into a semiconductor memory by a processor. Memory bandwidth is usually expressed in units of bytes/second, though this can vary for systems with natural data sizes that are not a multiple of the commonly used 8-bit bytes.
Shared memory is only accessible from within device functions.
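As an illustration, the following sketch reverses a small array inside a single block using a statically sized `__shared__` buffer. The names `reverseBlock` and `runReverse` and the block size of 64 are assumptions made for this example. The host never touches the shared buffer; it only sees the global array.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 64  // threads per block and shared buffer size (assumed)

// Reverses the first n elements of d (n <= BLOCK) within one block.
__global__ void reverseBlock(int *d, int n) {
    __shared__ int tile[BLOCK];  // one copy per block, visible to all its threads
    int t = threadIdx.x;
    if (t < n) tile[t] = d[t];   // stage the data in shared memory
    __syncthreads();             // all writes must finish before any read
    if (t < n) d[t] = tile[n - 1 - t];
}

// Hypothetical host helper: reverses h in place on the GPU.
bool runReverse(int *h, int n) {
    if (n < 0 || n > BLOCK) return false;  // this sketch handles one block only
    size_t size = n * sizeof(int);
    int *d = nullptr;
    cudaMalloc((void **)&d, BLOCK * sizeof(int));
    cudaMemcpy(d, h, size, cudaMemcpyHostToDevice);

    reverseBlock<<<1, BLOCK>>>(d, n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    cudaMemcpy(h, d, size, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess;
}

int main(void) {
    int data[5] = {1, 2, 3, 4, 5};
    bool ok = runReverse(data, 5);
    printf("ok=%d first=%d last=%d\n", ok, data[0], data[4]);
    return ok ? 0 : 1;
}
```

The `__syncthreads()` call is placed outside the `if` so that every thread in the block reaches it. Without the barrier, a thread could read a `tile` element that another thread has not yet written.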