Writing LOW-level (GPUIterator Only) programs

Here we provide a step-by-step guide to using the GPUIterator module with a simple Chapel program (vector copy) in single-locale and multi-locale scenarios.

Single-locale version

In this single-locale scenario, you will create and edit one Chapel source file and one CUDA source file: vc.hybrid.chpl and vc.cu.

  1. Import the GPUIterator module

    First, import the module using the use keyword:

    vc.hybrid.chpl
    use GPUIterator;
    
  2. Declare Chapel arrays

    Then, create two Chapel arrays, A and B, which will be used for the copy operation:

    vc.hybrid.chpl
    use GPUIterator;
    
    config const n = 32: int;
    var A: [1..n] real(32);
    var B: [1..n] real(32);
    

    Tip

    It’s wise to define n as a configurable constant, which can be overridden on the command line (e.g., ./vc --n=1024).

  3. Import your GPU program

    1. Write a GPU program

      It is worth noting that the design and implementation of the GPU program are entirely up to you. Please also see Guide to Write GPU programs. Here is a working vector-copy example in CUDA:

    vc.cu
    // GPU kernel: each thread copies one element
    __global__ void vc(float *dA, float *dB, int N) {
      int id = blockIdx.x * blockDim.x + threadIdx.x;
      if (id < N) {
        dA[id] = dB[id];
      }
    }
    
    extern "C" {
      // Host wrapper called from Chapel: stages B[start..start+GPUN-1]
      // through the device and writes the result back to A[start..start+GPUN-1]
      void vcCUDA(float* A, float *B, int start, int end, int GPUN) {
        float *dA, *dB;
        cudaMalloc(&dA, sizeof(float) * GPUN);
        cudaMalloc(&dB, sizeof(float) * GPUN);
        cudaMemcpy(dB, B + start, sizeof(float) * GPUN, cudaMemcpyHostToDevice);
        vc<<<(int)ceil(((float)GPUN)/1024), 1024>>>(dA, dB, GPUN);
        cudaDeviceSynchronize();
        cudaMemcpy(A + start, dA, sizeof(float) * GPUN, cudaMemcpyDeviceToHost);
        cudaFree(dA);
        cudaFree(dB);
      }
    }
    

    Note

    For presentation purposes, error checking is omitted (a minimal error-checking sketch appears at the end of this step). The complete program can be found in the apps directory.

    2. Declare it as an external function

      Use Chapel’s C interoperability feature to declare vcCUDA() as an external function.

    vc.hybrid.chpl
    use GPUIterator;
    
    config const n = 32: int;
    var A: [1..n] real(32);
    var B: [1..n] real(32);
    
    extern proc vcCUDA(A: [] real(32), B: [] real(32), lo: int, hi: int, N: int);
    

    Note

    More details on the C interoperability feature can be found here.
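
    As noted earlier, vc.cu omits error checking. For reference, here is a minimal sketch of a common CUDA error-checking idiom (illustrative only; the complete program in the apps directory may structure this differently):

    #include <cstdio>
    #include <cstdlib>
    
    // Minimal sketch: abort with a message if a CUDA runtime call fails
    #define CUDA_CHECK(call)                                    \
      do {                                                      \
        cudaError_t err = (call);                               \
        if (err != cudaSuccess) {                               \
          fprintf(stderr, "CUDA error: %s at %s:%d\n",          \
                  cudaGetErrorString(err), __FILE__, __LINE__); \
          exit(EXIT_FAILURE);                                   \
        }                                                       \
      } while (0)
    
    // Usage: CUDA_CHECK(cudaMalloc(&dA, sizeof(float) * GPUN));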

  4. Write a GPU callback function

    The GPU callback function is invoked by the GPUIterator with an automatically computed subrange (lo and hi). In this example, it calls the external function vcCUDA with the two global arrays (A and B), the subrange bounds (lo and hi), and the number of elements (N, i.e., the size of lo..hi).

    vc.hybrid.chpl
    use GPUIterator;
    
    config const n = 32: int;
    var A: [1..n] real(32);
    var B: [1..n] real(32);
    
    extern proc vcCUDA(A: [] real(32), B: [] real(32), lo: int, hi: int, N: int);
    
    // lo, hi, and N are automatically computed by the GPUIterator
    proc GPUCallBack(lo: int, hi: int, N: int) {
      vcCUDA(A, B, lo, hi, N);
    }
    

It is worth noting that GPUCallBack() is called multiple times when the number of GPUs is greater than one. Internally, the GPUIterator detects the number of GPUs within a locale, automatically computes a subrange for each GPU, and creates a separate task per GPU. This design keeps the callback function simple and independent of the GPU ID. The table below illustrates how GPUCallBack() is called when n=1024, nLocales=1, nGPUs=2 (assuming the CPUs take 50% of the iteration space):

n=1024, nLocales=1, nGPUs=2

|                      | Locales[0] CPUs | Locales[0] GPU0          | Locales[0] GPU1           |
|----------------------|-----------------|--------------------------|---------------------------|
| lo..hi               | 1..512          | 513..768                 | 769..1024                 |
| GPUCallBack(lo,hi,N) | N/A             | GPUCallBack(513,768,256) | GPUCallBack(769,1024,256) |
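
The CPU/GPU split in the table depends on how much of the iteration space the CPUs are given. A minimal sketch, assuming the GPU() iterator's optional CPUPercent argument with the 50% split shown above:

forall i in GPU(1..n, GPUCallBack, CPUPercent=50) {
  // executed only for the CPU portion (1..512 when n=1024)
  A(i) = B(i);
}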

Tip

The number of GPUs can be overridden by giving the --nGPUs=n option (two dashes) on the command line.

Note

  1. Writing GPU-ID-dependent code in a callback function can also be done using the GetDevice function of the GPUAPI module:

use GPUAPI;
proc GPUCallBack(lo: int, hi: int, N: int) {
  // GetDevice takes its argument by reference (assumed int(32) here)
  var id: int(32);
  GetDevice(id);
  if (id == 0) { ... }
  else if ...
}
  2. While the use of a lambda function would be more productive and elegant, we recommend writing a named Chapel function for the callback since lambda support in Chapel is still preliminary.

var GPUCallBack = lambda(lo: int, hi: int, N: int) { vcCUDA(A, B, lo, hi, N); };
forall i in GPU(1..n, GPUCallBack) { ... }

If this lambda version does not work, try this workaround:

record Lambda {
  proc this(lo: int, hi: int, N: int) { vcCUDA(A, B, lo, hi, N); }
}
var GPUCallBack = new Lambda();
forall i in GPU(1..n, GPUCallBack) { ... }
  5. Invoke the GPU() iterator in a forall loop

    When writing a forall loop, simply wrap the iteration space (1..n) in GPU() and pass the callback function (GPUCallBack). Here is a complete program with output verification:

    vc.hybrid.chpl
    use GPUIterator;
    
    config const n = 32: int;
    var A: [1..n] real(32);
    var B: [1..n] real(32);
    
    extern proc vcCUDA(A: [] real(32), B: [] real(32), lo: int, hi: int, N: int);
    
    proc GPUCallBack(lo: int, hi: int, N: int) {
      vcCUDA(A, B, lo, hi, N);
    }
    
    B = 1;
    
    forall i in GPU(1..n, GPUCallBack) {
      // CPU Version
      A(i) = B(i);
    }
    
    if (A.equals(B)) {
      writeln("Verified");
    } else {
      writeln("Not Verified");
    }
    
  6. Compile and Run

    See Compiling and running

Multi-locale version

In the multi-locale scenario, you only need to update vc.hybrid.chpl slightly; the GPU program (vc.cu) can be kept unchanged.

  1. Copy vc.hybrid.chpl to vc.hybrid.dist.chpl

  2. Add the BlockDist module and replace the range with a block-distributed domain

    vc.hybrid.dist.chpl
    use GPUIterator;
    use BlockDist;
    

    Then, declare two Chapel arrays with a block-distributed domain D.

    vc.hybrid.dist.chpl
    config const n = 32: int;
    var D: domain(1) dmapped blockDist(boundingBox = {1..n}) = {1..n};
    var A: [D] real(32);
    var B: [D] real(32);
    // var A: [1..n] real(32); /* single locale version */
    // var B: [1..n] real(32); /* single locale version */
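
    If you want to verify how D is partitioned across locales, a small sketch like the following prints each locale's portion (localSubdomain() is a standard query on distributed domains; with n=32 and two locales, expect {1..16} and {17..32}):

    // Illustrative: show which indices of D each locale owns
    for loc in Locales do
      on loc do
        writeln(loc, " owns ", D.localSubdomain());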
    
  3. Update GPUCallBack

    vc.hybrid.dist.chpl
    // lo, hi, and N are automatically computed by the GPUIterator
    proc GPUCallBack(lo: int, hi: int, N: int) {
      // the first element of lA is lA(lo), which corresponds to A[0] in the vcCUDA part.
      ref lA = A.localSlice(lo..hi);
      // the first element of lB is lB(lo), which corresponds to B[0] in the vcCUDA part.
      ref lB = B.localSlice(lo..hi);
      vcCUDA(lA, lB, 0, hi-lo, N);
      //vcCUDA(A, B, lo, hi, N); /* single locale version */
    }
    

    While the code looks quite similar to the single-locale version, the two arrays are now distributed, so the following two additional things need to be done:

    1. Using localSlice() API

      // for GPU X on locale Y (locale- and device-neutral)
      ref lA = A.localSlice(lo..hi);
      

      As in the single-locale, multiple-GPU case discussed above, a separate instance of GPUCallBack() is invoked for each GPU on each locale. However, you can still write the callback in a locale- and GPU-ID-independent way by utilizing Chapel's localSlice(d: domain) API. Essentially, feeding the automatically computed subrange (lo..hi) to this API returns the proper slice of the distributed array for a specific instance of GPUCallBack().

    2. Updating the arguments to vcCUDA()

      // call to the external GPU program
      vcCUDA(lA, lB, 0, hi-lo, N);
      

      Let us first explain how the local reference (say lA) can be accessed in the GPU program (vcCUDA). To give a concrete example, suppose n=2048, nLocales=2, nGPUs=2, and CPUPercent=50, in which case A(1..1024) resides on Locale 0 and A(1025..2048) resides on Locale 1. The table below summarizes how lA corresponds to the C array (A) in each instance of the callback:

      n=2048, nLocales=2, nGPUs=2

      |                               | Locales[0] CPUs | Locales[0] GPU0          | Locales[0] GPU1           | Locales[1] CPUs | Locales[1] GPU0            | Locales[1] GPU1            |
      |-------------------------------|-----------------|--------------------------|---------------------------|-----------------|----------------------------|----------------------------|
      | lo..hi                        | 1..512          | 513..768                 | 769..1024                 | 1025..1536      | 1537..1792                 | 1793..2048                 |
      | GPUCallBack(lo,hi,N)          | N/A             | GPUCallBack(513,768,256) | GPUCallBack(769,1024,256) | N/A             | GPUCallBack(1537,1792,256) | GPUCallBack(1793,2048,256) |
      | lA = A.localSlice(lo..hi)     | N/A             | A.localSlice(513..768)   | A.localSlice(769..1024)   | N/A             | A.localSlice(1537..1792)   | A.localSlice(1793..2048)   |
      | A[0] in vcCUDA corresponds to | N/A             | lA(513)                  | lA(769)                   | N/A             | lA(1537)                   | lA(1793)                   |

      Notice that A[0] in vcCUDA(float *A, ...) corresponds to the first element of the local slice, which is why the third argument is zero (= start) and the fourth argument is hi-lo (= end).
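
      To make the translation concrete, here is what one callback instance effectively does, using the Locale 0 / GPU0 values from the table above:

      // Inside GPUCallBack with lo=513, hi=768, N=256 (Locale 0, GPU0)
      ref lA = A.localSlice(513..768);  // lA keeps the global indices {513..768}
      ref lB = B.localSlice(513..768);
      vcCUDA(lA, lB, 0, 768-513, 256);  // in vcCUDA, A[0] aliases lA(513)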

  4. Update GPU()

    Finally, pass the distributed domain (D) to GPU():

    vc.hybrid.dist.chpl
    forall i in GPU(D, GPUCallBack) {
    //forall i in GPU(1..n, GPUCallBack) {
      // CPU Version
      A(i) = B(i);
    }
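
    For convenience, here is the complete multi-locale program assembled from the snippets above, with the same verification as the single-locale version:

    vc.hybrid.dist.chpl
    use GPUIterator;
    use BlockDist;
    
    config const n = 32: int;
    var D: domain(1) dmapped blockDist(boundingBox = {1..n}) = {1..n};
    var A: [D] real(32);
    var B: [D] real(32);
    
    extern proc vcCUDA(A: [] real(32), B: [] real(32), lo: int, hi: int, N: int);
    
    // lo, hi, and N are automatically computed by the GPUIterator
    proc GPUCallBack(lo: int, hi: int, N: int) {
      ref lA = A.localSlice(lo..hi);
      ref lB = B.localSlice(lo..hi);
      vcCUDA(lA, lB, 0, hi-lo, N);
    }
    
    B = 1;
    
    forall i in GPU(D, GPUCallBack) {
      // CPU Version
      A(i) = B(i);
    }
    
    if (A.equals(B)) {
      writeln("Verified");
    } else {
      writeln("Not Verified");
    }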
    
  5. Compile and Run

    See Compiling and running