.. default-domain:: chpl ============================================= Writing LOW-level (GPUIterator Only) programs ============================================= Here we provide a step-by-step guide for utilizing the ``GPUIterator`` module using a simple Chapel program (``vector copy``) in single-locale and multiple-locale scenarios. Single-locale version ###################### In this single-locale scenario, you are supposed to create and edit one Chapel source file and one CUDA source file: ``vc.hybrid.chpl`` and ``vc.cu``. 1. Import the GPUIterator module First, import the module using the ``use`` keyword: .. code-block:: chapel :caption: vc.hybrid.chpl use GPUIterator; 2. Declare Chapel arrays Then, create two Chapel arrays, ``A`` and ``B``, which will be used for the copy operation: .. code-block:: chapel :caption: vc.hybrid.chpl :emphasize-lines: 3,4,5 use GPUIterator; config const n = 32: int; var A: [1..n] real(32); var B: [1..n] real(32); .. tip:: It's wise to define ``n`` as `a configurable constant `_, which can be overridden on the command line (e.g., ``./vc --n=1024``). 3. Import your GPU program a. Write a GPU program It is worth noting that the design and implementation of the GPU program is completely your choice. Please also see :ref:`Guide to Write GPU programs`. Here is one working vector copy example with CUDA: .. code-block:: c :caption: vc.cu __global__ void vc(float *dA, float *dB, int N) { int id = blockIdx.x * blockDim.x + threadIdx.x; if (id < N) { dA[id] = dB[id]; } } extern "C" { void vcCUDA(float* A, float *B, int start, int end, int GPUN) { float *dA, *dB; cudaMalloc(&dA, sizeof(float) * GPUN); cudaMalloc(&dB, sizeof(float) * GPUN); cudaMemcpy(dB, B + start, sizeof(float) * GPUN, cudaMemcpyHostToDevice); vc<<>>(dA, dB, GPUN); cudaDeviceSynchronize(); cudaMemcpy(A + start, dA, sizeof(float) * GPUN, cudaMemcpyDeviceToHost); cudaFree(dA); cudaFree(dB); } } .. note:: For the presentation purposes, any error checking is omitted. A complete program can be found in ``apps`` directory. b. Declare it as an external function Use `Chapel's C interoperability feature `_ to declare ``vcCUDA()`` as an external function. .. code-block:: chapel :caption: vc.hybrid.chpl :emphasize-lines: 7 use GPUIterator; config const n = 32: int; var A: [1..n] real(32); var B: [1..n] real(32); extern proc vcCUDA(A: [] real(32), B: [] real(32), lo: int, hi: int, N: int); .. note:: More details on the C interoperability feature can be found `here `_. 4. Write a GPU callback function The GPU callback function is supposed to be invoked from the GPUIterator with an automatically computed subrange (``lo`` and ``hi``). In this example, we call the external function ``vcCUDA`` with the two global arrays (``A`` and ``B``), the subrange (``lo`` and ``hi``), plus the number of elements (``N = size(lo..hi)``). .. code-block:: chapel :caption: vc.hybrid.chpl :emphasize-lines: 9,10,11,12 use GPUIterator; config const n = 32: int; var A: [1..n] real(32); var B: [1..n] real(32); extern proc vcCUDA(A: [] real(32), B: [] real(32), lo: int, hi: int, N: int); // lo, hi, and N are automatically computed by the GPUIterator proc GPUCallBack(lo: int, hi: int, N: int) { vcCUDA(A, B, lo, hi, N); } .. _callback: It is worth noting that there will be multiple calls to ``GPUCallBack()`` when the number of GPUs is greater than one. Internally, the GPUIterator detects the number of GPUs within a locale, then automatically computes a subrange for each GPU, and creates a separate task that is responsible for each GPU. This design keeps the callback function simple and independent from GPU ID. The table below illustrates how ``GPUCallBack()`` is called when ``n=1024, nLocales=1, nGPUs=2``: .. list-table:: n=1024, nLocales=1, nGPUs=2 :widths: 15 15 15 15 :header-rows: 1 * - - Locales[0] - - * - - CPUs - GPU0 - GPU1 * - ``lo..hi`` - ``1..512`` - ``512..767`` - ``768..1024`` * - ``GPUCallBack(lo,hi,N);`` - N/A - ``GPUCallBack(512,767,256);`` - ``GPUCallBack(768,1024,256);`` .. tip:: The number of GPUs can be overridden by giving the `--nGPUs=n` option (two dashes) on the command line .. note:: 1. Writing GPU ID dependent code in a callback function can be also done using the ``GetDevice`` function of the GPUAPI : .. code-block:: chapel use GPUAPI; proc GPUCallBack(lo: int, hi:int, N:int) { var id; GetDevice(id); if (id == 0) { ... } else if ... } 2. While the use of a lambda function would be more productive and elegant, we'd recommend writing a Chapel function for the callback since the lambda support in Chapel is still early. .. code-block:: chapel var GPUCallBack = lambda(lo: int, hi:int, N:int) { vcCUDA(A, B, lo, hi, N); }; forall i in GPU(1..n, GPUCallback) { ... } If the this lambda version does not work, try `this workaround `_: .. code-block:: chapel record Lambda { proc this(lo:int, hi:int, N:int) { vcCUDA(A, B, lo, hi, N); } } var GPUCallBack = new Lambda(); forall i in GPU(1..n, GPUCallback) { ... } 5. Invoke the ``GPU()`` iterator in a ``forall`` loop When writing a ``forall`` loop, simply wrap the iteration space (``1..n``) in ``GPU()`` and give the callback function (``GPUCallBack``). Here is a complete program with output verification: .. code-block:: chapel :caption: vc.hybrid.chpl :emphasize-lines: 15-18 use GPUIterator; config const n = 32: int; var A: [1..n] real(32); var B: [1..n] real(32); extern proc vcCUDA(A: [] real(32), B: [] real(32), lo: int, hi: int, N: int); proc GPUCallBack(lo: int, hi: int, N: int) { vcCUDA(A, B, lo, hi, N); } B = 1; forall i in GPU(1..n, GPUCallBack) { // CPU Version A(i) = B(i); } if (A.equals(B)) { writeln("Verified"); } else { writeln("Not Verified"); } 6. Compile and Run See :doc:`Compiling and running ` Multi-locale version ###################### In the multi-locale scenario, you are supposed to update ``vc.hybrid.chpl`` slightly, but you can keep the GPU program (``vc.cu``) unchanged. 0. Copy ``vc.hybrid.chpl`` to ``vc.hybrid.dist.chpl`` 1. Add ``BlockDist`` module and replace the range with a block-distributed domain .. code-block:: chapel :caption: vc.hybrid.dist.chpl :emphasize-lines: 2 use GPUIterator use BlockDist; Then, declare two Chapel arrays with a block-distributed domain ``D``. .. code-block:: chapel :caption: vc.hybrid.dist.chpl :emphasize-lines: 2-4 config const n = 32: int; var D: domain(1) dmapped blockDist(boundingBox = {1..n}) = {1..n}; var A: [D] real(32); var B: [D] real(32); // var A: [1..n] real(32); /* single locale version */ // var B: [1..n] real(32); /* single locale version */ 2. Update ``GPUCallBack`` .. code-block:: chapel :caption: vc.hybrid.dist.chpl :emphasize-lines: 3-7 // lo, hi, and N are automatically computed by the GPUIterator proc GPUCallBack(lo: int, hi: int, N: int) { // the first element of lA is lA(lo), which corresponds to A[0] in the vcCUDA part. ref lA = A.localSlice(lo..hi); // the first element of lB is lB(lo), which corresponds to B[0] in the vcCUDA part. ref lB = B.localSlice(lo..hi); vcCUDA(lA, lB, 0, hi-lo, N); //vcCUDA(A, B, lo, hi, N); /* single locale version */ } While the code looks pretty much similar to the single-locale version, since the two arrays are distributed, the following two additional things need to be done: a. Using ``localSlice()`` API .. code-block:: chapel // for GPU X on locale Y, (locale- and device-neutral) ref lA = A.localSlice(lo..hi); Similar to the single-locale + multiple GPUs case discussed `above `_, multiple instances of ``GPUCallBack()`` will be invoked for each GPU on different locales. However, you can still write the callback in a way that is locale and GPU ID independent by utilizing Chapel's ``localSlice(d: domain)`` API (`link `_). Essentially, feeding the automatically computed subrange (``lo..hi``) to the API returns a proper slice of a distributed array in a specific instance of ``GPUCallBack()``. b. Updating the arguments to ``vcCUDA()`` .. code-block:: chapel // call to the external GPU program vcCUDA(lA, lB, 0, hi-lo, N); Let us first explain how the local reference (say ``lA``) can be accessed in the GPU program (``vcCUDA``). To give you a concrete example, suppose ``n=2048, nLocales=2, CPUPercent=50``, in which ``A(1..1024)`` resides on `Locale 0`, and ``A(1025..2048)`` resides on `Locale 1`. The table below summarizes how ``lA`` corresponds to the C array (``A``) in each instance of the callback: .. list-table:: n=2048, nLocales=2, nGPUs=2 :widths: 15 15 15 15 15 15 15 :header-rows: 1 * - - Locales[0] - - - Locales[1] - - * - - CPUs - GPU0 - GPU1 - CPUs - GPU0 - GPU1 * - ``lo..hi`` - ``1..512`` - ``513..768`` - ``769..1024`` - ``1025..1536`` - ``1537..1792`` - ``1793..2048`` * - ``GPUCallBack(lo,hi,N);`` - N/A - ``GPUCallBack(513,768,256);`` - ``GPUCallBack(769,1024,256);`` - N/A - ``GPUCallBack(1537,1792,256);`` - ``GPUCallBack(1793,2048,256);`` * - ``lA = A.localSlice(lo..hi)`` - N/A - ``A.localSlice(513..768);`` - ``A.localSlice(769..1024);`` - N/A - ``A.localSlice(1537..1792);`` - ``A.localSlice(1793..2048);`` * - ``A[0]`` in ``vcCUDA`` corresponds to - N/A - ``lA(513)`` - ``lA(769)`` - N/A - ``lA(1537)`` - ``lA(1793)`` Notice that ``A[0]`` in ``vcCUDA(float *A, ...)`` corresponds to the first element of the local slice, which is why the third argument is zero (= ``start``) and thr fourth argument is ``hi-lo`` (= ``end``). 3. Update ``GPU()`` Finally, give the distributed domain (``D``) to ``GPU()``: .. code-block:: chapel :caption: vc.hybrid.dist.chpl forall i in GPU(D, GPUCallBack) { //forall i in GPU(1..n, GPUCallBack) { // CPU Version A(i) = B(i); } 4. Compile and Run See :doc:`Compiling and running `