Getting started
This first part is devoted to learning the basics of GPU programming. For those not familiar with the Julia language, we highly recommend reading this introduction to Julia.
GPU programming using CUDA.jl
Julia has excellent support for GPU programming through the JuliaGPU organization.
In this tutorial, we will focus on CUDA.jl, the Julia interface to NVIDIA's CUDA libraries. We recommend reading this tutorial to understand the basic concepts of GPU programming with CUDA. Once installed, you can import CUDA.jl as:
using CUDA
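Before going further, you can check that a CUDA-capable GPU is detected (an optional sanity check):
CUDA.functional()   # returns true if CUDA.jl can use a GPU on this machine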
By default, Julia allocates arrays on the CPU; you can allocate a new vector using
x_cpu = zeros(10)
10-element Vector{Float64}:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
The allocation of a new vector on the GPU has to be made explicit:
x_gpu = CUDA.zeros(10)
10-element CUDA.CuArray{Float32, 1, CUDA.DeviceMemory}:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
By default, CUDA.jl allocates a vector of single-precision floats (Float32). In scientific computing, it is often recommended to work in double precision, encoded by the type Float64 in Julia. This has to be specified explicitly:
x_gpu = CUDA.zeros(Float64, 10)
10-element CUDA.CuArray{Float64, 1, CUDA.DeviceMemory}:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
The array can be manipulated using the broadcast operator (with a syntax similar to MATLAB's). Incrementing all the elements of x_gpu by 1 amounts to
x_gpu .+= 1.0
10-element CUDA.CuArray{Float64, 1, CUDA.DeviceMemory}:
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
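Data can be copied explicitly between the host and the device, one array at a time (the variable names below are just illustrative):
x_host = Array(x_gpu)     # copy the GPU array back to the CPU
x_dev = CuArray(x_host)   # copy a CPU array onto the GPU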
On the GPU, accessing an element of an array by its index (e.g. by calling x_gpu[1]) is disallowed by default, and should be avoided at all costs. The whole point of using a GPU is to evaluate operations in parallel, so it usually makes little sense to access an array element by element. If you have to implement non-trivial operations with complicated indexing, it is recommended to write your own GPU kernels.
In general,
- We recommend using the broadcast (dot) operator as much as you can, as it automatically generates the GPU kernels needed to implement the operation (see the sketch after this list).
- More complicated operations can be implemented with map or mapreduce, where Julia will generate the appropriate GPU kernels for you.
- Often you can use the BLAS functions provided by CUDA.jl, which are optimized for the GPU.
- If you really have to, you can implement your own GPU kernel using CUDA.jl or an abstraction layer like KernelAbstractions.jl.
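As a brief sketch of these recommendations (the variable names, the kernel increment_kernel!, and the launch configuration below are ours, for illustration only):
using LinearAlgebra

y_gpu = CUDA.ones(Float64, 10)
z_gpu = x_gpu .* y_gpu .+ 2.0        # broadcast: the operations are fused into a single GPU kernel
w_gpu = map(abs2, z_gpu)             # map: the element-wise kernel is generated automatically
s = mapreduce(abs2, +, z_gpu)        # mapreduce: a parallel reduction performed on the GPU
nrm = norm(z_gpu)                    # linear-algebra routines dispatch to the GPU libraries shipped with CUDA.jl

# A minimal custom kernel incrementing each entry in parallel, one thread per element.
function increment_kernel!(x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds x[i] += 1.0
    end
    return nothing
end

@cuda threads=256 blocks=cld(length(x_gpu), 256) increment_kernel!(x_gpu)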
Now comes the question of evaluating complicated expressions on the GPU.
Modeling with ExaModels.jl
In optimization, it is recommended to use a modeler to implement your model. The modeler acts as a domain-specific language providing the syntax needed to write your optimization problem, and it converts the resulting problem into a form suitable for an optimization solver. AMPL, JuMP.jl and Pyomo are among the most popular modelers, but none of them support GPUs. ExaModels.jl is an attempt to fill this gap: it is designed to run on a variety of GPU architectures (NVIDIA, AMD, Intel) as well as on multi-threaded CPUs. As with CUDA.jl, we recommend this introductory course on ExaModels.
You can import ExaModels.jl simply as
using ExaModels
and instantiate a new model with
core = ExaCore()
An ExaCore
Float type: ...................... Float64
Array type: ...................... Vector{Float64}
Backend: ......................... Nothing
number of objective patterns: .... 0
number of constraint patterns: ... 0
Note that ExaModels supports multi-precision by default. Adding new variables to the model is very similar to other modelers. For example, adding 10 lower-bounded variables $x ≥ 0$ amounts to
x = variable(core, 10; lvar=0.0)
Variable
x ∈ R^{10}
The keyword lvar is used to pass the lower bounds on the variable x. Similarly, we can pass the upper bounds using the keyword uvar, and a starting point using the keyword start.
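For illustration, bounded variables with a starting point could be declared as follows (a small sketch on a separate core, with hypothetical names, so as not to modify the model built in this tutorial):
core_demo = ExaCore()
y = variable(core_demo, 10; lvar=0.0, uvar=1.0, start=0.5)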
Once the variables are defined, ExaModels relies on a powerful SIMD abstraction to automatically identify the potential for parallelism in the expression tree. ExaModels implements the expression trees using iterator objects. As a consequence, we should define all the expressions in iterator format, and avoid accessing the variable x by its index outside a generator.
As a demonstration, we show how to generate the constraints $10 × sin(x_i) + i ≥ 0$ for each index $i$. We start by building a Julia generator encoding the expression:
gen = (10.0 * sin(x[i]) + i for i in 1:10)
Base.Generator{UnitRange{Int64}, Main.var"Main".var"#1#2"}(Main.var"Main".var"#1#2"(), 1:10)
We can pass the generator gen
to ExaModels to build our inequality constraints:
cons = constraint(core, gen; lcon=0.0)
Constraint
s.t. (...)
g♭ ≤ [g(x,p)]_{p ∈ P} ≤ g♯
where |P| = 10
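An objective is declared from a generator in the same way, using the function objective. As a brief sketch reusing the demo core and variable y introduced above (so that the running model stays untouched):
objective(core_demo, y[i]^2 for i in 1:10)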
Do not write the for loop explicitly outside of the generator. For instance, the code below:
# Don't do this!
for i in 1:10
    constraint(core, 10.0 * sin(x[i]) + i; lcon=0.0)
end
can severely impact performance by generating a new kernel with each loop iteration, leading to increased compilation time. When working with ExaModels, always construct constraints using generators to clearly communicate the structure of the expressions.
Note that the generator just provides a way to encode the expressions. The evaluation happens separately, by creating an ExaModel instance that takes as input the structure core storing all the generators required to build the model:
nlp = ExaModel(core)
An ExaModel{Float64, Vector{Float64}, ...}
Problem name: Generic
All variables: ████████████████████ 10 All constraints: ████████████████████ 10
free: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 free: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
lower: ████████████████████ 10 lower: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
upper: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 upper: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
low/upp: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 low/upp: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
fixed: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 fixed: ████████████████████ 10
infeas: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 infeas: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
nnzh: ( 81.82% sparsity) 10 linear: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
nonlinear: ████████████████████ 10
nnzj: ( 90.00% sparsity) 10
The constructor ExaModel generates an AbstractNLPModel, which comes with a proper API to evaluate the model in a syntax appropriate for numerical computing. The API can be found in this documentation. As a consequence, evaluating the constraints implemented by the generator we defined before just translates to:
using NLPModels
x = ones(10) # get an initial point
c = NLPModels.cons(nlp, x) # return the results as a vector
10-element Vector{Float64}:
9.414709848078965
10.414709848078965
11.414709848078965
12.414709848078965
13.414709848078965
14.414709848078965
15.414709848078965
16.414709848078964
17.414709848078964
18.414709848078964
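Other functions from the NLPModels API follow the same pattern. For instance (a brief sketch using standard NLPModels functions):
nvar = NLPModels.get_nvar(nlp)   # number of variables
ncon = NLPModels.get_ncon(nlp)   # number of constraints
c_buf = zeros(ncon)
NLPModels.cons!(nlp, x, c_buf)   # in-place evaluation of the constraints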
As ExaModels just manipulates expressions, it is very easy to offload the evaluation of the model to the GPU. ExaModels automatically builds the appropriate kernels to evaluate the expressions implemented in the generators. You can generate a new model on the GPU simply by specifying a new backend to ExaModels:
core = ExaCore(; backend=CUDABackend())
An ExaCore
Float type: ...................... Float64
Array type: ...................... CUDA.CuArray{Float64, 1, CUDA.DeviceMemory}
Backend: ......................... CUDA.CUDAKernels.CUDABackend
number of objective patterns: .... 0
number of constraint patterns: ... 0
The generation of the model on the GPU follows the same syntax:
x = variable(core, 10; lvar=0.0)
cons = constraint(core, 10.0 * sin(x[i]) + i for i in 1:10; lcon=0.0)
Constraint
s.t. (...)
g♭ ≤ [g(x,p)]_{p ∈ P} ≤ g♯
where |P| = 10
as well as the model's evaluation:
nlp = ExaModel(core)
x_gpu = CUDA.ones(Float64, 10) # get an initial point
c_gpu = NLPModels.cons(nlp, x_gpu) # return the results as a vector
10-element CUDA.CuArray{Float64, 1, CUDA.DeviceMemory}:
9.414709848078965
10.414709848078965
11.414709848078965
12.414709848078965
13.414709848078965
14.414709848078965
15.414709848078965
16.414709848078964
17.414709848078964
18.414709848078964
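As a quick sanity check, the values computed on the GPU can be copied back to the host and compared with the CPU evaluation obtained earlier:
Array(c_gpu) ≈ c   # should return true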
As we will see in the next tutorial, ExaModels is a powerful tool to evaluate the model's derivatives using automatic differentiation. This will prove to be particularly useful for solving the power flow equations.
This page was generated using Literate.jl.