GPUIterator
Overview
A primary goal of this module is to provide an appropriate interface between Chapel and accelerator programs such that expert accelerator programmers can explore different variants in a portable way (i.e., CPU-only, GPU-only, or X% for CPU + Y% for GPU, on one or more CPU+GPU nodes). To address this challenge, we introduce a Chapel module, GPUIterator, which facilitates invoking a user-written GPU program from Chapel. Since Chapel's data-parallel loops (forall) fit well with GPU execution, the GPUIterator is designed to be invoked in a forall loop. Consider the following STREAM code:
forall i in 1..n {
  A(i) = B(i) + alpha * C(i);
}
Assuming a GPU version of STREAM is available (streamCUDA below), the user can wrap the original iteration space in GPU() with two additional arguments: GPUCallBack, a callback function that is invoked once the module has computed the subrange assigned to the GPU, and CPUPercent, which specifies the percentage of the iteration space to run on the CPU:
// A GPUIterator version
extern proc streamCUDA(A: [] real(32), B: [] real(32), C: [] real(32),
                       alpha: real(32), lo: int, hi: int, N: int);

var GPUCallBack = lambda(lo: int, hi: int, N: int) {
  // call the GPU program with a range of lo..hi
  streamCUDA(A, B, C, alpha, lo, hi, N);
};
CPUPercent = 50; // CPU 50% + GPU 50% in this case
forall i in GPU(1..n, GPUCallBack, CPUPercent) {
  A(i) = B(i) + alpha * C(i);
}
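Because the workload split is expressed as a single number, switching among the variants mentioned in the overview (CPU-only, GPU-only, or a hybrid split) requires changing only CPUPercent. Following the CPU X% + GPU (100-X)% convention above, a minimal sketch:

CPUPercent = 100; // CPU-only: CPU 100% + GPU 0%
CPUPercent = 0;   // GPU-only: CPU 0%   + GPU 100%
CPUPercent = 25;  // hybrid:   CPU 25%  + GPU 75%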
It is worth noting that the GPUIterator supports multi-GPU and multi-locale execution. For multi-GPU execution, the module automatically detects the number of GPUs per node (or accepts a user-specified number) and invokes the callback function for each GPU, which requires no modification to the code above. For multi-locale execution, the iterator accepts a block-distributed domain, which allows the user to run the code above on multiple CPU+GPU nodes with minimal modifications.
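For example, a multi-locale variant of the STREAM code might look like the following minimal sketch. It assumes Chapel's standard BlockDist module (the dmapped syntax shown here is the classic form and may differ across Chapel versions) and mirrors the lambda-based callback style used above; localSlice passes each locale's portion of the arrays to the native function:

use GPUIterator;
use BlockDist;

config const n = 1024;

extern proc streamCUDA(A: [] real(32), B: [] real(32), C: [] real(32),
                       alpha: real(32), lo: int, hi: int, N: int);

var D: domain(1) dmapped Block(boundingBox = {1..n}) = {1..n};
var A: [D] real(32);
var B: [D] real(32);
var C: [D] real(32);
var alpha: real(32) = 3.0;

var GPUCallBack = lambda(lo: int, hi: int, N: int) {
  // each locale hands only its local portion of the arrays to the GPU code
  streamCUDA(A.localSlice(lo..hi), B.localSlice(lo..hi),
             C.localSlice(lo..hi), alpha, lo, hi, N);
};
const CPUPercent = 50; // CPU 50% + GPU 50% on every locale
forall i in GPU(D, GPUCallBack, CPUPercent) {
  A(i) = B(i) + alpha * C(i);
}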
Why GPUIterator?
Chapel offers the C interoperability feature, which allows the user to invoke C/C++ functions from their Chapel programs. In the context of GPU programming in Chapel, the user typically prepares a GPU version of a forall loop written in CUDA/HIP/OpenCL and invokes it using the interoperability feature. For example, consider the following baseline forall implementation that performs STREAM:
// Chapel file
var A: [1..n] real(32);
var B: [1..n] real(32);
var C: [1..n] real(32);
var alpha: real(32) = 3.0;
forall i in 1..n {
  A(i) = B(i) + alpha * C(i);
}
Assuming streamCUDA(), a full CUDA/HIP/OpenCL implementation of the forall, is available, here is what the GPU version looks like:
// Chapel file
// Declare an external C/C++ function which performs STREAM on GPUs
extern proc streamCUDA(A: [] real(32), B: [] real(32), C: [] real(32),
                       alpha: real(32), lo: int, hi: int, N: int);

var A: [1..n] real(32);
var B: [1..n] real(32);
var C: [1..n] real(32);
var alpha: real(32) = 3.0;
streamCUDA(A, B, C, alpha, 1, n, n);
// Separate C file
void streamCUDA(float *A, float *B, float *C,
                float alpha, int start, int end, int size) {
  // A full GPU implementation of STREAM (CUDA/HIP/OpenCL):
  // 1. device memory allocations
  // 2. host-to-device data transfers
  // 3. GPU kernel compilations (if needed)
  // 4. GPU kernel invocations
  // 5. device-to-host data transfers
  // 6. clean up
  // Note: A[0] and B[0] here correspond to
  // A(1) and B(1) in the Chapel part respectively
}
The key difference is that the original forall loop is replaced with a call to the native function, which performs the typical host and device operations: device memory (de)allocation, data transfers, and kernel invocation.
Unfortunately, the resulting source code is not very portable, particularly when the user wants to explore different configurations to get higher performance. One scenario is that, since GPUs are not always faster than CPUs (and vice versa), the user has to juggle forall with streamCUDA() depending on the data size and the complexity of the computation (e.g., by commenting each version in and out).
One intuitive workaround would be to put an if statement that decides which version (CPU or GPU) to use:
if (cond) {
  forall i in 1..n { /* STREAM */ }
} else {
  streamCUDA(...);
}
However, this raises another problem: it is still not very portable when the user wants 1) multi-locale CPU+GPU execution, and 2) advanced workload distributions such as hybrid execution of the CPU and GPU versions. Specifically, WITHOUT the module, the user has to write the following code:
// WITHOUT the GPUIterator module (no hybrid execution)
// suppose D is a block-distributed domain
if (cond) {
  forall i in D { ... }
} else {
  coforall loc in Locales {
    on loc {
      coforall GPUID in 0..#nGPUs {
        var lo = ...; // needs to be computed manually
        var hi = ...; // needs to be computed manually
        var localA = A.localSlice(lo..hi);
        ...
        // GPUID needs to be manually set before streamCUDA() is called
        streamCUDA(localA, ...);
      }
    }
  }
}
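To make the manual index computation concrete, below is one hypothetical way to fill in the lo and hi lines above. It assumes D is block-distributed and that each locale splits its local chunk evenly across its GPUs, with the last GPU absorbing any remainder:

// inside the coforall over GPUID on each locale:
const localSub = D.localSubdomain();  // this locale's portion of D
const chunk = localSub.size / nGPUs;  // even split across the GPUs
const lo = localSub.low + GPUID * chunk;
const hi = if GPUID == nGPUs - 1 then localSub.high  // last GPU takes the rest
           else lo + chunk - 1;

On top of this index bookkeeping, the active device (GPUID) must be selected in the native code before each streamCUDA() call; the module performs all of this internally.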
WITH the module, again, the code is much simpler and more portable:
// WITH the GPUIterator module
// suppose D is a block-distributed domain
var GPUCallBack = lambda(lo: int, hi: int, N: int) {
  // call the GPU program with a range of lo..hi
  // lo..hi is automatically computed;
  // the module internally and automatically sets GPUID
  streamCUDA(A.localSlice(lo..hi), ...);
};
CPUPercent = 50; // CPU 50% + GPU 50% in this case
forall i in GPU(D, GPUCallBack, CPUPercent) {
  A(i) = B(i) + alpha * C(i);
}