Writing LOW-level (GPUIterator Only) programs¶
Here we provide a step-by-step guide for utilizing the GPUIterator
module using a simple Chapel program (vector copy
) in single-locale and multiple-locale scenarios.
Single-locale version¶
In this single-locale scenario, you are supposed to create and edit one Chapel source file and one CUDA source file: vc.hybrid.chpl
and vc.cu
.
Import the GPUIterator module
First, import the module using the
use
keyword:use GPUIterator;
Declare Chapel arrays
Then, create two Chapel arrays,
A
andB
, which will be used for the copy operation:use GPUIterator; config const n = 32: int; var A: [1..n] real(32); var B: [1..n] real(32);
Tip
It’s wise to define
n
as a configurable constant, which can be overridden on the command line (e.g.,./vc --n=1024
).Import your GPU program
Write a GPU program
It is worth noting that the design and implementation of the GPU program is completely your choice. Please also see Guide to Write GPU programs. Here is one working vector copy example with CUDA:
__global__ void vc(float *dA, float *dB, int N) { int id = blockIdx.x * blockDim.x + threadIdx.x; if (id < N) { dA[id] = dB[id]; } } extern "C" { void vcCUDA(float* A, float *B, int start, int end, int GPUN) { float *dA, *dB; cudaMalloc(&dA, sizeof(float) * GPUN); cudaMalloc(&dB, sizeof(float) * GPUN); cudaMemcpy(dB, B + start, sizeof(float) * GPUN, cudaMemcpyHostToDevice); vc<<<ceil(((float)GPUN)/1024), 1024>>>(dA, dB, GPUN); cudaDeviceSynchronize(); cudaMemcpy(A + start, dA, sizeof(float) * GPUN, cudaMemcpyDeviceToHost); cudaFree(dA); cudaFree(dB); } }
Note
For the presentation purposes, any error checking is omitted. A complete program can be found in
apps
directory.Declare it as an external function
Use Chapel’s C interoperability feature to declare
vcCUDA()
as an external function.
use GPUIterator; config const n = 32: int; var A: [1..n] real(32); var B: [1..n] real(32); extern proc vcCUDA(A: [] real(32), B: [] real(32), lo: int, hi: int, N: int);
Note
More details on the C interoperability feature can be found here.
Write a GPU callback function
The GPU callback function is supposed to be invoked from the GPUIterator with an automatically computed subrange (
lo
andhi
). In this example, we call the external functionvcCUDA
with the two global arrays (A
andB
), the subrange (lo
andhi
), plus the number of elements (N = size(lo..hi)
).use GPUIterator; config const n = 32: int; var A: [1..n] real(32); var B: [1..n] real(32); extern proc vcCUDA(A: [] real(32), B: [] real(32), lo: int, hi: int, N: int); // lo, hi, and N are automatically computed by the GPUIterator proc GPUCallBack(lo: int, hi: int, N: int) { vcCUDA(A, B, lo, hi, N); }
It is worth noting that there will be multiple calls to
GPUCallBack()
when the number of GPUs is greater than one. Internally, the GPUIterator detects the number of GPUs within a locale, then automatically computes a subrange for each GPU, and creates a separate task that is responsible for each GPU. This design keeps the callback function simple and independent from GPU ID. The table below illustrates howGPUCallBack()
is called whenn=1024, nLocales=1, nGPUs=2
:
¶ Locales[0]
CPUs
GPU0
GPU1
lo..hi
1..512
512..767
768..1024
GPUCallBack(lo,hi,N);
N/A
GPUCallBack(512,767,256);
GPUCallBack(768,1024,256);
Tip
The number of GPUs can be overridden by giving the –nGPUs=n option (two dashes) on the command line
Note
Writing GPU ID dependent code in a callback function can be also done using the
GetDevice
function of the GPUAPI :use GPUAPI; proc GPUCallBack(lo: int, hi:int, N:int) { var id; GetDevice(id); if (id == 0) { ... } else if ... }
While the use of a lambda function would be more productive and elegant, we’d recommend writing a Chapel function for the callback since the lambda support in Chapel is still early.
var GPUCallBack = lambda(lo: int, hi:int, N:int) { vcCUDA(A, B, lo, hi, N); }; forall i in GPU(1..n, GPUCallback) { ... }If the this lambda version does not work, try this workaround:
record Lambda { proc this(lo:int, hi:int, N:int) { vcCUDA(A, B, lo, hi, N); } } var GPUCallBack = new Lambda(); forall i in GPU(1..n, GPUCallback) { ... }
Invoke the
GPU()
iterator in aforall
loopWhen writing a
forall
loop, simply wrap the iteration space (1..n
) inGPU()
and give the callback function (GPUCallBack
). Here is a complete program with output verification:use GPUIterator; config const n = 32: int; var A: [1..n] real(32); var B: [1..n] real(32); extern proc vcCUDA(A: [] real(32), B: [] real(32), lo: int, hi: int, N: int); proc GPUCallBack(lo: int, hi: int, N: int) { vcCUDA(A, B, lo, hi, N); } B = 1; forall i in GPU(1..n, GPUCallBack) { // CPU Version A(i) = B(i); } if (A.equals(B)) { writeln("Verified"); } else { writeln("Not Verified"); }
Compile and Run
Multi-locale version¶
In the multi-locale scenario, you are supposed to update vc.hybrid.chpl
slightly, but you can keep the GPU program (vc.cu
) unchanged.
Copy
vc.hybrid.chpl
tovc.hybrid.dist.chpl
Add
BlockDist
module and replace the range with a block-distributed domainuse GPUIterator use BlockDist;
Then, declare two Chapel arrays with a block-distributed domain
D
.config const n = 32: int; var D: domain(1) dmapped blockDist(boundingBox = {1..n}) = {1..n}; var A: [D] real(32); var B: [D] real(32); // var A: [1..n] real(32); /* single locale version */ // var B: [1..n] real(32); /* single locale version */
Update
GPUCallBack
// lo, hi, and N are automatically computed by the GPUIterator proc GPUCallBack(lo: int, hi: int, N: int) { // the first element of lA is lA(lo), which corresponds to A[0] in the vcCUDA part. ref lA = A.localSlice(lo..hi); // the first element of lB is lB(lo), which corresponds to B[0] in the vcCUDA part. ref lB = B.localSlice(lo..hi); vcCUDA(lA, lB, 0, hi-lo, N); //vcCUDA(A, B, lo, hi, N); /* single locale version */ }
While the code looks pretty much similar to the single-locale version, since the two arrays are distributed, the following two additional things need to be done:
Using
localSlice()
API// for GPU X on locale Y, (locale- and device-neutral) ref lA = A.localSlice(lo..hi);
Similar to the single-locale + multiple GPUs case discussed above, multiple instances of
GPUCallBack()
will be invoked for each GPU on different locales. However, you can still write the callback in a way that is locale and GPU ID independent by utilizing Chapel’slocalSlice(d: domain)
API (link). Essentially, feeding the automatically computed subrange (lo..hi
) to the API returns a proper slice of a distributed array in a specific instance ofGPUCallBack()
.Updating the arguments to
vcCUDA()
// call to the external GPU program vcCUDA(lA, lB, 0, hi-lo, N);
Let us first explain how the local reference (say
lA
) can be accessed in the GPU program (vcCUDA
). To give you a concrete example, supposen=2048, nLocales=2, CPUPercent=50
, in whichA(1..1024)
resides on Locale 0, andA(1025..2048)
resides on Locale 1. The table below summarizes howlA
corresponds to the C array (A
) in each instance of the callback:¶ Locales[0]
Locales[1]
CPUs
GPU0
GPU1
CPUs
GPU0
GPU1
lo..hi
1..512
513..768
769..1024
1025..1536
1537..1792
1793..2048
GPUCallBack(lo,hi,N);
N/A
GPUCallBack(513,768,256);
GPUCallBack(769,1024,256);
N/A
GPUCallBack(1537,1792,256);
GPUCallBack(1793,2048,256);
lA = A.localSlice(lo..hi)
N/A
A.localSlice(513..768);
A.localSlice(769..1024);
N/A
A.localSlice(1537..1792);
A.localSlice(1793..2048);
A[0]
invcCUDA
corresponds toN/A
lA(513)
lA(769)
N/A
lA(1537)
lA(1793)
Notice that
A[0]
invcCUDA(float *A, ...)
corresponds to the first element of the local slice, which is why the third argument is zero (=start
) and thr fourth argument ishi-lo
(=end
).
Update
GPU()
Finally, give the distributed domain (
D
) toGPU()
:forall i in GPU(D, GPUCallBack) { //forall i in GPU(1..n, GPUCallBack) { // CPU Version A(i) = B(i); }
Compile and Run