Writing LOW-level (GPUIterator Only) programs¶

Here we provide a step-by-step guide for utilizing the GPUIterator module using a simple Chapel program (vector copy) in single-locale and multiple-locale scenarios.

Single-locale version¶

In this single-locale scenario, you are supposed to create and edit one Chapel source file and one CUDA source file: vc.hybrid.chpl and vc.cu.

Import the GPUIterator module

First, import the module using the use keyword:
vc.hybrid.chpl¶
```
use GPUIterator;
```
Declare Chapel arrays

Then, create two Chapel arrays, A and B, which will be used for the copy operation:
vc.hybrid.chpl¶
```
use GPUIterator;

config const n = 32: int;
var A: [1..n] real(32);
var B: [1..n] real(32);
```
Tip

It’s wise to define n as a configurable constant, which can be overridden on the command line (e.g., ./vc --n=1024).

Import your GPU program

Write a GPU program

It is worth noting that the design and implementation of the GPU program is completely your choice. Please also see Guide to Write GPU programs. Here is one working vector copy example with CUDA:

vc.cu¶

__global__ void vc(float *dA, float *dB, int N) {
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  if (id < N) {
        dA[id] = dB[id];
  }
}

extern "C" {
  void vcCUDA(float* A, float *B, int start, int end, int GPUN) {
    float *dA, *dB;
    cudaMalloc(&dA, sizeof(float) * GPUN);
    cudaMalloc(&dB, sizeof(float) * GPUN);
    cudaMemcpy(dB, B + start, sizeof(float) * GPUN, cudaMemcpyHostToDevice);
    vc<<<ceil(((float)GPUN)/1024), 1024>>>(dA, dB, GPUN);
    cudaDeviceSynchronize();
    cudaMemcpy(A + start, dA, sizeof(float) * GPUN, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
  }
}

Note

For the presentation purposes, any error checking is omitted. A complete program can be found in apps directory.

Declare it as an external function

Use Chapel’s C interoperability feature to declare vcCUDA() as an external function.

vc.hybrid.chpl¶

use GPUIterator;

config const n = 32: int;
var A: [1..n] real(32);
var B: [1..n] real(32);

extern proc vcCUDA(A: [] real(32), B: [] real(32), lo: int, hi: int, N: int);

Note

More details on the C interoperability feature can be found here.

Write a GPU callback function

The GPU callback function is supposed to be invoked from the GPUIterator with an automatically computed subrange (lo and hi). In this example, we call the external function vcCUDA with the two global arrays (A and B), the subrange (lo and hi), plus the number of elements (N = size(lo..hi)).

vc.hybrid.chpl¶

use GPUIterator;

config const n = 32: int;
var A: [1..n] real(32);
var B: [1..n] real(32);

extern proc vcCUDA(A: [] real(32), B: [] real(32), lo: int, hi: int, N: int);

// lo, hi, and N are automatically computed by the GPUIterator
proc GPUCallBack(lo: int, hi: int, N: int) {
  vcCUDA(A, B, lo, hi, N);
}

It is worth noting that there will be multiple calls to GPUCallBack() when the number of GPUs is greater than one. Internally, the GPUIterator detects the number of GPUs within a locale, then automatically computes a subrange for each GPU, and creates a separate task that is responsible for each GPU. This design keeps the callback function simple and independent from GPU ID. The table below illustrates how GPUCallBack() is called when n=1024, nLocales=1, nGPUs=2:

n=1024, nLocales=1, nGPUs=2¶

Locales[0]

CPUs

GPU0

GPU1

lo..hi

1..512

512..767

768..1024

GPUCallBack(lo,hi,N);

N/A

GPUCallBack(512,767,256);

GPUCallBack(768,1024,256);

Tip

The number of GPUs can be overridden by giving the –nGPUs=n option (two dashes) on the command line
Note

Writing GPU ID dependent code in a callback function can be also done using the GetDevice function of the GPUAPI :
use GPUAPI;
proc GPUCallBack(lo: int, hi:int, N:int) {
  var id;
  GetDevice(id);
  if (id == 0) { ... }
  else if ...
}
While the use of a lambda function would be more productive and elegant, we’d recommend writing a Chapel function for the callback since the lambda support in Chapel is still early.
var GPUCallBack = lambda(lo: int, hi:int, N:int) { vcCUDA(A, B, lo, hi, N); };
forall i in GPU(1..n, GPUCallback) { ... }
If the this lambda version does not work, try this workaround:
record Lambda {
  proc this(lo:int, hi:int, N:int) { vcCUDA(A, B, lo, hi, N); }
}
var GPUCallBack = new Lambda();
forall i in GPU(1..n, GPUCallback) { ... }

Invoke the GPU() iterator in a forall loop

When writing a forall loop, simply wrap the iteration space (1..n) in GPU() and give the callback function (GPUCallBack). Here is a complete program with output verification:

vc.hybrid.chpl¶

use GPUIterator;

config const n = 32: int;
var A: [1..n] real(32);
var B: [1..n] real(32);

extern proc vcCUDA(A: [] real(32), B: [] real(32), lo: int, hi: int, N: int);

proc GPUCallBack(lo: int, hi: int, N: int) {
  vcCUDA(A, B, lo, hi, N);
}

B = 1;

forall i in GPU(1..n, GPUCallBack) {
  // CPU Version
  A(i) = B(i);
}

if (A.equals(B)) {
  writeln("Verified");
} else {
  writeln("Not Verified");
}

Compile and Run

See Compiling and running

Multi-locale version¶

In the multi-locale scenario, you are supposed to update vc.hybrid.chpl slightly, but you can keep the GPU program (vc.cu) unchanged.

Copy vc.hybrid.chpl to vc.hybrid.dist.chpl

Add BlockDist module and replace the range with a block-distributed domain

vc.hybrid.dist.chpl¶

use GPUIterator
use BlockDist;

Then, declare two Chapel arrays with a block-distributed domain D.

vc.hybrid.dist.chpl¶

config const n = 32: int;
var D: domain(1) dmapped blockDist(boundingBox = {1..n}) = {1..n};
var A: [D] real(32);
var B: [D] real(32);
// var A: [1..n] real(32); /* single locale version */
// var B: [1..n] real(32); /* single locale version */

Update GPUCallBack

vc.hybrid.dist.chpl¶

// lo, hi, and N are automatically computed by the GPUIterator
proc GPUCallBack(lo: int, hi: int, N: int) {
  // the first element of lA is lA(lo), which corresponds to A[0] in the vcCUDA part.
  ref lA = A.localSlice(lo..hi);
  // the first element of lB is lB(lo), which corresponds to B[0] in the vcCUDA part.
  ref lB = B.localSlice(lo..hi);
  vcCUDA(lA, lB, 0, hi-lo, N);
  //vcCUDA(A, B, lo, hi, N); /* single locale version */
}

While the code looks pretty much similar to the single-locale version, since the two arrays are distributed, the following two additional things need to be done:

Using localSlice() API
```
// for GPU X on locale Y, (locale- and device-neutral)
ref lA = A.localSlice(lo..hi);
```
Similar to the single-locale + multiple GPUs case discussed above, multiple instances of GPUCallBack() will be invoked for each GPU on different locales. However, you can still write the callback in a way that is locale and GPU ID independent by utilizing Chapel’s localSlice(d: domain) API (link). Essentially, feeding the automatically computed subrange (lo..hi) to the API returns a proper slice of a distributed array in a specific instance of GPUCallBack().

Updating the arguments to vcCUDA()

// call to the external GPU program
vcCUDA(lA, lB, 0, hi-lo, N);

Let us first explain how the local reference (say lA) can be accessed in the GPU program (vcCUDA). To give you a concrete example, suppose n=2048, nLocales=2, CPUPercent=50, in which A(1..1024) resides on Locale 0, and A(1025..2048) resides on Locale 1. The table below summarizes how lA corresponds to the C array (A) in each instance of the callback:

n=2048, nLocales=2, nGPUs=2¶
	Locales[0]			Locales[1]
	CPUs	GPU0	GPU1	CPUs	GPU0	GPU1
`lo..hi`	`1..512`	`513..768`	`769..1024`	`1025..1536`	`1537..1792`	`1793..2048`
`GPUCallBack(lo,hi,N);`	N/A	`GPUCallBack(513,768,256);`	`GPUCallBack(769,1024,256);`	N/A	`GPUCallBack(1537,1792,256);`	`GPUCallBack(1793,2048,256);`
`lA = A.localSlice(lo..hi)`	N/A	`A.localSlice(513..768);`	`A.localSlice(769..1024);`	N/A	`A.localSlice(1537..1792);`	`A.localSlice(1793..2048);`
`A[0]` in `vcCUDA` corresponds to	N/A	`lA(513)`	`lA(769)`	N/A	`lA(1537)`	`lA(1793)`

Notice that A[0] in vcCUDA(float *A, ...) corresponds to the first element of the local slice, which is why the third argument is zero (= start) and thr fourth argument is hi-lo (= end).

Update GPU()

Finally, give the distributed domain (D) to GPU():

vc.hybrid.dist.chpl¶

forall i in GPU(D, GPUCallBack) {
//forall i in GPU(1..n, GPUCallBack) {
  // CPU Version
  A(i) = B(i);
}

Compile and Run

See Compiling and running