GPUAPI

MID-level API Reference

class GPUArray

proc init(ref arr, pitched=false)

Allocates memory on the device. The module computes the allocation size automatically as (arr.size: c_size_t) * c_sizeof(arr.eltType), so the index space is linearized when arr is multi-dimensional. If arr is 2D and pitched=true, a pitched allocation is performed, and the host and device pitches can be obtained through obj.hpitch and obj.dpitch. The allocated memory is reclaimed automatically when the object is deleted.

Arguments

arr – A reference to the non-distributed Chapel array that will be mapped onto the device.
pitched – Whether to perform a pitched allocation (default is false).
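The size computation described above can be reproduced on the host without touching the device. The following is a minimal sketch, assuming a 2D array of Chapel's default 64-bit int; the array shape and the name nbytes are illustrative only and are not part of the GPUAPI module.

```chapel
use CTypes;

// Illustrative 2D array: 4 rows x 8 columns of default (64-bit) int.
var A: [1..4, 1..8] int;

// Same formula that GPUArray is documented to use: element count times
// element size, so the multi-dimensional index space is linearized.
const nbytes = (A.size: c_size_t) * c_sizeof(A.eltType);

writeln(nbytes); // 4 * 8 elements * 8 bytes each = 256
```

Because only the total element count enters the formula, a 4x8 array and a 32-element 1D array of the same element type request the same number of bytes.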

// Example 1: Non-distributed array
var A: [1..n] int;

proc GPUCallBack(lo: int, hi: int, N: int) {
  // n * sizeof(int) will be allocated onto the device
  var dA = new GPUArray(A);
  ...
}

// GPUIterator
forall i in GPU(1..n, GPUCallBack) { A(i) = ...; }

// Example 2: Distributed array
use BlockDist;
var D: domain(1) dmapped blockDist(boundingBox = {1..n}) = {1..n};
var A: [D] int;
proc GPUCallBack(lo: int, hi: int, n: int) {
  // get the local portion of the distributed array
  var localA = A.localSlice(lo..hi);
  // n * sizeof(int) will be allocated onto the device
  var dA = new GPUArray(localA);
  ...
}

// GPUIterator
forall i in GPU(D, GPUCallBack) { A(i) = ...; }

Note

The allocated memory resides on the current device. The GPUIterator sets the current device automatically. Without the GPUIterator, it is the user's responsibility to set the current device (e.g., by calling the SetDevice API below); otherwise, the default device (usually the first GPU) is used.

Note

With distributed arrays, you must use the Chapel array's localSlice API to get the local portion of the distributed array. With the GPUIterator, the bounds of the local portion are already computed and passed as the first two arguments of the callback (lo and hi).

proc toDevice()

Transfers the contents of the Chapel array to the device.

proc GPUCallBack(lo: int, hi: int, n:int) {
  var dA = new GPUArray(A);
  dA.toDevice();
}

proc fromDevice()

Transfers the contents of the device array back to the Chapel array.

proc GPUCallBack(lo: int, hi: int, n:int) {
  var dA = new GPUArray(A);
  dA.fromDevice();
}

proc free()

Frees memory on the device.

proc GPUCallBack(lo: int, hi: int, n:int) {
  var dA = new GPUArray(A);
  dA.free();
}

proc dPtr(): c_ptr(void)

Returns a pointer to the allocated device memory.

Returns: pointer to the allocated device memory
Return type: c_ptr(void)

proc hPtr(): c_ptr(void)

Returns a pointer to the head of the Chapel array.

Returns: pointer to the head of the Chapel array
Return type: c_ptr(void)

proc toDevice(args: GPUArray ...?n): Utility function that takes a variable number of GPUArray objects and performs the toDevice operation on each.

proc fromDevice(args: GPUArray ...?n): Utility function that takes a variable number of GPUArray objects and performs the fromDevice operation on each.

proc free(args: GPUArray ...?n): Utility function that takes a variable number of GPUArray objects and performs the free operation on each.

var dA = new GPUArray(A);
var dB = new GPUArray(B);
var dC = new GPUArray(C);

toDevice(dA, dB);
...
fromDevice(dC);
// GPU memory is automatically deallocated when dA, dB, and dC are deleted.

class GPUJaggedArray

proc init(ref arr1, ref arr2, ...): Allocates a jagged array on the device. It takes a set of Chapel arrays and creates an array of arrays on the device.

Note

A working example can be found here.

MID-LOW-level API Reference

proc Malloc(ref devPtr: c_ptr(void), size: c_size_t)

Allocates memory on the device.

Arguments

devPtr : c_ptr(void) – Pointer to the allocated device array
size : c_size_t – Allocation size in bytes

// Example 1: Non-distributed array
var A: [1..n] int;

proc GPUCallBack(lo: int, hi: int, N: int) {
  var dA: c_ptr(void);
  Malloc(dA, (A.size: c_size_t) * c_sizeof(A.eltType));
  ...
}

// GPUIterator
forall i in GPU(1..n, GPUCallBack) { A(i) = ...; }

// Example 2: Distributed array
use BlockDist;
var D: domain(1) dmapped blockDist(boundingBox = {1..n}) = {1..n};
var A: [D] int;
proc GPUCallBack(lo: int, hi: int, n: int) {
  var dA: c_ptr(void);
  // get the local portion of the distributed array
  var localA = A.localSlice(lo..hi);
  Malloc(dA, (localA.size: c_size_t) * c_sizeof(localA.eltType));
  ...
}

// GPUIterator
forall i in GPU(D, GPUCallBack) { A(i) = ...; }

Note

c_sizeof(A.eltType) returns the size in bytes of an element of the Chapel array A. For more details, please refer to this.

proc MallocPitch(ref devPtr: c_ptr(void), ref pitch: c_size_t, width: c_size_t, height: c_size_t)

Allocates pitched 2D memory on the device.

Arguments

devPtr : c_ptr(void) – Pointer to the allocated pitched 2D device array
pitch : c_size_t – Pitch for allocation on the device, which is set by the runtime
width : c_size_t – The width of the original Chapel array (in bytes)
height : c_size_t – The number of rows (height)

Note

A working example can be found here. A detailed description of the underlying CUDA API can be found here.

proc Memcpy(dst: c_ptr(void), src: c_ptr(void), count: c_size_t, kind: int)

Transfers data between the host and the device.

Arguments

dst : c_ptr(void) – the destination address
src : c_ptr(void) – the source address
count : c_size_t – size in bytes to be transferred
kind : int – type of transfer (0: host-to-device, 1: device-to-host)

// Non-distributed array
var A: [1..n] int;

proc GPUCallBack(lo: int, hi: int, N: int) {
  var dA: c_ptr(void);
  // transfer size in bytes
  const size = (A.size: c_size_t) * c_sizeof(A.eltType);
  Malloc(dA, size);
  // host-to-device
  Memcpy(dA, c_ptrTo(A), size, 0);
  // device-to-host
  Memcpy(c_ptrTo(A), dA, size, 1);
}

Note

c_ptrTo(A) returns a pointer to the Chapel rectangular array A. For more details, see this document.

proc Memcpy2D(dst: c_ptr(void), dpitch: c_size_t, src: c_ptr(void), spitch: c_size_t, width: c_size_t, height:c_size_t, kind: int)

Transfers a pitched 2D array between the host and the device.

Arguments

dst : c_ptr(void) – the destination address
dpitch : c_size_t – the pitch of destination memory
src : c_ptr(void) – the source address
spitch : c_size_t – the pitch of source memory
width : c_size_t – the width of 2D array to be transferred (in bytes)
height : c_size_t – the height of 2D array to be transferred (# of rows)
kind : int – type of transfer (0: host-to-device, 1: device-to-host)

Note

A working example can be found here. A detailed description of the underlying CUDA API can be found here.
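To show how MallocPitch and Memcpy2D fit together, here is a minimal sketch that round-trips a 2D array through pitched device memory. It relies only on the signatures documented above; the array shape, the element type, and the names hA, hpitch, and dpitch are illustrative assumptions, and the explicit cast to c_ptr(void) is a defensive choice rather than a documented requirement. It also assumes the host array is stored contiguously in row-major order, so the host pitch equals the row width in bytes.

```chapel
use GPUAPI;
use CTypes;

// Illustrative shape: 4 rows x 8 columns of 32-bit reals.
const nRows = 4, nCols = 8;
var A: [1..nRows, 1..nCols] real(32);

// Host rows are assumed contiguous, so the host pitch is the row width in bytes.
const hpitch = (nCols: c_size_t) * c_sizeof(A.eltType);
const hA = c_ptrTo(A): c_ptr(void);

var dA: c_ptr(void);
var dpitch: c_size_t; // set by the runtime, may exceed hpitch due to padding

// Allocate nRows rows of at least hpitch bytes each on the device.
MallocPitch(dA, dpitch, hpitch, nRows: c_size_t);

// Host-to-device (kind = 0): destination uses dpitch, source uses hpitch.
Memcpy2D(dA, dpitch, hA, hpitch, hpitch, nRows: c_size_t, 0);

// ... a kernel would operate on dA here, indexing rows by dpitch ...

// Device-to-host (kind = 1): the pitches swap roles.
Memcpy2D(hA, hpitch, dA, dpitch, hpitch, nRows: c_size_t, 1);

Free(dA);
```

The width argument is always the number of useful bytes per row, while each pitch is the stride between consecutive rows in its own buffer, which is why width and hpitch coincide in this sketch but dpitch may differ.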

proc Free(devPtr: c_ptr(void))

Frees memory on the device.

Arguments: devPtr : c_ptr(void) – Device pointer to memory to be freed.

proc GetDeviceCount(ref count: int(32))

Returns the number of GPU devices on the current locale.

Arguments: count : int(32) – the number of GPU devices

var nGPUs: int(32);
GetDeviceCount(nGPUs);
writeln(nGPUs);

proc GetDevice(ref id: int(32))

Returns the device ID currently being used.

Arguments: id : int(32) – the device ID currently being used

proc SetDevice(device: int(32))

Sets the device ID to be used.

Arguments: device : int(32) – the device ID to be used. It must be greater than or equal to zero and less than the number of GPU devices.
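When the GPUIterator is not used, the device has to be selected by hand. The sketch below combines GetDeviceCount, SetDevice, and GetDevice as documented above; choosing the last device is an arbitrary choice for illustration, and the names nGPUs, target, and current are not part of the module.

```chapel
use GPUAPI;

var nGPUs: int(32);
var current: int(32);

// Query how many GPUs the current locale has.
GetDeviceCount(nGPUs);

if nGPUs > 0 {
  // Valid IDs are 0 .. nGPUs-1; pick the last one for illustration.
  const target = nGPUs - (1: int(32));
  SetDevice(target);

  // Confirm which device subsequent allocations will land on.
  GetDevice(current);
  writeln("using device ", current, " of ", nGPUs);
}
```

Any GPUArray or Malloc call made after SetDevice allocates on the selected device, as described in the note on the current device in the GPUArray section.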

proc ProfilerStart(): NVIDIA GPUs only. Starts profiling with nvprof.

proc ProfilerStop()

NVIDIA GPUs only. Stops profiling with nvprof.

proc GPUCallBack(lo: int, hi: int, N: int) {
  ProfilerStart();
  ...
  ProfilerStop();
}

proc DeviceSynchronize(): Waits for the device to finish.