How to Execute Gradients Generated by Clad on a CUDA GPU
Tutorial level: Intro
Clad provides automatic differentiation (AD) for C/C++ and works without modifying the original (legacy) code. Since typical AD application problems are characterised by high computational requirements, they can benefit greatly from parallel implementations on graphics processing units (GPUs).
This tutorial shows how to use Clad to obtain the gradient of a function and how to schedule the gradient's execution on the GPU.
Definition of the custom function
For the purpose of this tutorial, the custom function chosen is the Gauss function. We include Clad and define the function as follows:
#include "clad/Differentiator/Differentiator.h"
extern "C" int printf(const char*,...);
#define N 100
__device__ __host__ double gauss(double* x, double* p, double sigma, int dim) {
double t = 0;
for (int i = 0; i < dim; i++)
t += (x[i] - p[i]) * (x[i] - p[i]);
t = -t / (2*sigma*sigma);
return std::pow(2*M_PI, -dim/2.0) * std::pow(sigma, -0.5) * std::exp(t);
}
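For reference, the function implemented above evaluates

    (2π)^(-dim/2) · σ^(-1/2) · exp( -Σᵢ (xᵢ - pᵢ)² / (2σ²) ),

and it is the derivatives of this expression with respect to the array parameters x and p that Clad is asked to generate in the next step.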
Definition of the Clad gradient
With our custom function declared, we call clad::gradient on it and set up a device function pointer to be used for the GPU execution:
auto gauss_g = clad::gradient(gauss, "x,p");

// Device function pointer to the generated gradient. Its signature is:
// void (double* x, double* p, double sigma, int dim,
//       clad::array_ref<double> _d_x, clad::array_ref<double> _d_p);
auto p_gauss = gauss_g.getFunctionPtr();
using func = decltype(p_gauss);
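Before moving to the GPU, the same CladFunction can also be exercised directly on the host through its execute() method. The following is a minimal sketch with arbitrary illustrative values, assuming the array_ref-based signature shown in the comment above:

// Minimal host-side sanity check of the generated gradient (illustrative values).
double x0[1] = {1.0}, p0[1] = {0.5};
double dx0[1] = {0.0}, dp0[1] = {0.0};
gauss_g.execute(x0, p0, /*sigma=*/2.0, /*dim=*/1,
                clad::array_ref<double>(dx0, 1),
                clad::array_ref<double>(dp0, 1));
// dx0[0] and dp0[0] now hold d gauss / d x[0] and d gauss / d p[0].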
Definition of the CUDA kernel
CUDA's language extensions allow a function to be executed on the GPU by many threads in parallel. The kernel that evaluates the gradient for every input element is therefore defined as:
__global__ void compute(func op, double* d_x, double* d_y, double sigma, int n,
                        double* result_dx, double* result_dy) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    (*op)(&d_x[i], &d_y[i], sigma, /*dim=*/1, &result_dx[i], &result_dy[i],
          nullptr, nullptr);
  }
}
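Since n is in general not a multiple of the block size, the launch configuration used later rounds the number of blocks up, and the if (i < n) guard keeps the surplus threads of the last block from touching out-of-range memory. As an aside (the tutorial code below sticks to N/256+1), the block count is often written with the usual round-up division:

// Equivalent way of computing the block count (aside, not part of the tutorial code).
const int threadsPerBlock = 256;
const int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // ceil(N / 256)
// compute<<<numBlocks, threadsPerBlock>>>(h_gauss, d_x, d_y, sigma, N, dx_result, dy_result);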
Defining variables and starting execution
The next and final step is to define the host variables, copy them to the device, run the kernel, and copy the device results back to the host.
The main function defines our input variables, passes them to the device for the kernel computation, and retrieves the results from the device. This happens in the following order. First, we allocate memory on the host and define the input variables (in our case the two arrays x and y, initialised with random values). We then allocate the device arrays (d_x and d_y) and copy the data from the host to the device. Next, we launch the kernel ("compute") with the execution configuration <<<N/256+1, 256>>>: the first argument is the number of thread blocks in the grid and the second (256) is the number of threads per thread block, so the kernel is executed by these threads in parallel. The last step is to copy the results computed on the device back to the host.
int main() {
  // x and y point to the host arrays, allocated with malloc in the typical
  // fashion, while d_x and d_y point to device arrays allocated with
  // cudaMalloc from the CUDA runtime API.
  double *x, *d_x;
  double *y, *d_y;
  double sigma = 50.;
  x = (double*)malloc(N * sizeof(double));
  y = (double*)malloc(N * sizeof(double));

  // The host code initialises the host arrays.
  for (int i = 0; i < N; i++) {
    x[i] = rand() % 100;
    y[i] = rand() % 100;
  }

  // To initialise the device arrays, we simply copy the data from x and y to
  // the corresponding device arrays d_x and d_y using cudaMemcpy.
  cudaMalloc(&d_x, N * sizeof(double));
  cudaMemcpy(d_x, x, N * sizeof(double), cudaMemcpyHostToDevice);
  cudaMalloc(&d_y, N * sizeof(double));
  cudaMemcpy(d_y, y, N * sizeof(double), cudaMemcpyHostToDevice);

  // Similar to the x, y arrays, we employ host and device result arrays so
  // that the computed values can be copied from the device back to the host.
  double *result_x, *result_y;
  result_x = (double*)malloc(N * sizeof(double));
  result_y = (double*)malloc(N * sizeof(double));
  double *dx_result, *dy_result;
  cudaMalloc(&dx_result, N * sizeof(double));
  cudaMalloc(&dy_result, N * sizeof(double));

  // Retrieve the device function pointer to the gradient on the host so it
  // can be passed to the kernel as an argument.
  func h_gauss;
  cudaMemcpyFromSymbol(&h_gauss, p_gauss, sizeof(func));

  // Launch the computation kernel.
  compute<<<N / 256 + 1, 256>>>(h_gauss, d_x, d_y, sigma, N, dx_result, dy_result);
  cudaDeviceSynchronize();

  // After the computation, copy the results hosted on the device back to the host.
  cudaMemcpy(result_x, dx_result, N * sizeof(double), cudaMemcpyDeviceToHost);
  cudaMemcpy(result_y, dy_result, N * sizeof(double), cudaMemcpyDeviceToHost);

  // Print a sample of the inputs and of the computed gradients.
  printf("sigma=%f\n", sigma);
  for (int i = 0; i < N; i += 10) {
    printf("x[%d]='%f';y[%d]='%f'\n", i, x[i], i, y[i]);
    printf("grad_x[%d]='%.10f';grad_y[%d]='%.10f'\n", i, result_x[i], i, result_y[i]);
  }

  // Release the device and host allocations.
  cudaFree(d_x);
  cudaFree(d_y);
  cudaFree(dx_result);
  cudaFree(dy_result);
  free(x);
  free(y);
  free(result_x);
  free(result_y);
  return 0;
}
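For brevity, the CUDA runtime calls above are not checked for errors. In practice one would typically wrap them in a small helper such as the following sketch (CUDA_CHECK is a hypothetical name, not part of the tutorial code):

#include <cstdio>
#include <cstdlib>

// Hypothetical helper: print a readable message and bail out if a CUDA
// runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                                      \
  do {                                                                        \
    cudaError_t err_ = (call);                                                \
    if (err_ != cudaSuccess) {                                                \
      std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",                      \
                   cudaGetErrorString(err_), __FILE__, __LINE__);             \
      std::exit(EXIT_FAILURE);                                                \
    }                                                                         \
  } while (0)

// Usage, e.g.:
//   CUDA_CHECK(cudaMalloc(&d_x, N * sizeof(double)));
//   CUDA_CHECK(cudaMemcpy(d_x, x, N * sizeof(double), cudaMemcpyHostToDevice));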
Benchmarking GPU-Accelerated Clad Gradients
Following the implementation pattern described in this tutorial brings an increase in performance. The GPU implementation becomes increasingly advantageous as the number of dimensions grows, first in the case of the above Gauss function and second in the case of a product function gradient:
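The product function mentioned above could, for instance, have the following shape; this is an assumption for illustration only, as the exact benchmark code is not reproduced in this tutorial:

// Illustrative sketch of a product function whose gradient can be generated
// and launched on the GPU in exactly the same way as gauss above
// (assumed form; not the actual benchmark source).
__device__ __host__ double product(double* p, int dim) {
  double prod = 1;
  for (int i = 0; i < dim; i++)
    prod *= p[i];
  return prod;
}

// auto product_g = clad::gradient(product, "p");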