Writing MID-level programs

MID-level API

To reiterate, the biggest motivation for introducing the MID-LOW and MID-level GPU APIs is to move some of the low-level GPU operations to the Chapel level. Consider the following GPU callback function and C function:

vc.hybrid.chpl
// lo, hi, and N are automatically computed by the GPUIterator
proc GPUCallBack(lo: int, hi: int, N: int) {
  vcCUDA(A, B, lo, hi, N);
}
vc.cu
extern "C" {
  void vcCUDA(float* A, float *B, int start, int end, int GPUN) {
    float *dA, *dB;
    cudaMalloc(&dA, sizeof(float) * GPUN);
    cudaMalloc(&dB, sizeof(float) * GPUN);
    cudaMemcpy(dB, B + start, sizeof(float) * GPUN, cudaMemcpyHostToDevice);
    vc<<<ceil(((float)GPUN)/1024), 1024>>>(dA, dB, GPUN);
    cudaDeviceSynchronize();
    cudaMemcpy(A + start, dA, sizeof(float) * GPUN, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
  }
}
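
For vcCUDA to be callable from the Chapel callback above, it also needs an extern declaration on the Chapel side. The following is a minimal sketch of what that declaration could look like; the exact parameter types are an assumption and must match the C signature of vcCUDA:

// Sketch: extern declaration for the external C function.
// Chapel passes the arrays as pointers to their elements (float* on the C side).
// Note: C's int is typically 32-bit; c_int could be used for a stricter match.
extern proc vcCUDA(A: [] real(32), B: [] real(32), lo: int, hi: int, N: int);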

At the MID level, most of the CUDA/HIP/OpenCL-level operations, namely 1) device memory allocation, 2) device synchronization, and 3) data transfer, can be written in Chapel. Also, unlike the MID-LOW-level API, the MID-level API is friendlier to Chapel programmers: you can allocate GPU memory with the new keyword and no longer need to manipulate C types directly. Here is an example program written with the MID-level API:

vc.hybrid.chpl
proc GPUCallBack(lo: int, hi: int, N: int) {
  // n * sizeof(A/B.eltType) bytes are automatically allocated on the device
  var dA = new GPUArray(A);
  var dB = new GPUArray(B);
  // Host-to-device data transfer
  dB.toDevice();
  // Kernel launch via the extern LaunchVC function
  LaunchVC(dA.dPtr(), dB.dPtr(), N: c_size_t);
  // Device-to-host data transfer
  dA.fromDevice();
  // Device memory deallocation
  free(dA, dB);
}
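
For context, GPUCallBack is not called directly by the application; it is passed to the GPUIterator, which computes lo, hi, and N for the GPU portion of the iteration space and then invokes the callback. The following is a minimal usage sketch; the problem size n, the array declarations, and the two-argument form of GPU() are illustrative assumptions:

use GPUIterator;

config const n = 1024;
var A: [1..n] real(32);
var B: [1..n] real(32);

// The GPUIterator splits 1..n between CPU and GPU; for the GPU portion it
// calls GPUCallBack(lo, hi, N), while the loop body below runs the CPU portion.
forall i in GPU(1..n, GPUCallBack) {
  A(i) = B(i);
}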

Tip

The MID-level API can interoperate with the MID-LOW-level API.
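
For example, MID-LOW-level routines can be mixed into a callback that otherwise uses the MID-level GPUArray. Below is a minimal sketch that adds an explicit MID-LOW-level DeviceSynchronize() call after the kernel launch in the callback from above; the synchronization call is an illustrative addition, not part of the original example:

proc GPUCallBack(lo: int, hi: int, N: int) {
  var dA = new GPUArray(A);                     // MID-level: device allocation
  var dB = new GPUArray(B);
  dB.toDevice();                                // MID-level: host-to-device transfer
  LaunchVC(dA.dPtr(), dB.dPtr(), N: c_size_t);
  DeviceSynchronize();                          // MID-LOW-level: explicit device synchronization
  dA.fromDevice();                              // MID-level: device-to-host transfer
  free(dA, dB);                                 // MID-level: device memory deallocation
}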