Tutorial 2: solving the power-flow equations in batch on the GPU
In the previous tutorial, we saw how to solve the power flow equations using ExaModels. Now, we want to fully leverage the capabilities of ExaModels to solve the power flow equations in batch on the GPU.
We start by importing the usual packages:
using LinearAlgebra
using SparseArrays
using NLPModels
using ExaModels
using JLD2
include("utils.jl")
We load the case9ieee instance again:
DATA_DIR = joinpath(splitdir(Base.active_project())[1], "instances")
data = JLD2.load(joinpath(DATA_DIR, "case9.jld2"))["data"]
nbus = length(data.bus)
ngen = length(data.gen)
nlines = length(data.branch);
Batched power flow with ExaModels
The power flow equations are parameterized by the active and reactive power loads $p_d$ and $q_d$ at each bus, among other quantities. This gives a total of 2*nbus parameters.
In this tutorial, we want to solve the power flow equations in batch for $N$ different load scenarios $\{ p_d^n, q_d^n \}_{n=1,⋯,N}$, with $N$ playing the role of the batch size. Each scenario $(p_d^n, q_d^n)$ is associated with a block, and the number of blocks equals the batch size $N$.
As a demonstration, we set the batch size to 100:
N = 100;
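For instance, the load scenarios could be drawn by perturbing the nominal loads by up to ±10%. This is a sketch only: the field names pd and qd on data.bus are assumptions, and these scenarios are not used in the rest of the tutorial.
pd_nom = [b.pd for b in data.bus]   # nominal active loads (assumed field name)
qd_nom = [b.qd for b in data.bus]   # nominal reactive loads (assumed field name)
pd_scenarios = pd_nom .* (1.0 .+ 0.1 .* (2.0 .* rand(nbus, N) .- 1.0))   # ±10% perturbation
qd_scenarios = qd_nom .* (1.0 .+ 0.1 .* (2.0 .* rand(nbus, N) .- 1.0));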
Each parameter $(p_d^n, q_d^n)$ is associated with a solution $(v_m^n, v_a^n, p_g^n, q_g^n)$ of the power flow equations. We will compute all the solutions in parallel using Newton's method.
Using ExaModels, we can define the corresponding batched model by augmenting each variable with a second dimension indexed by the batch size $N$. This amounts to defining the following variables:
core = ExaCore()
va = variable(core, nbus, 1:N)
vm = variable(core, nbus, 1:N; start = repeat(data.vm0, N))
pg = variable(core, ngen, 1:N; start=repeat(data.pg0, N))
qg = variable(core, ngen, 1:N; start=repeat(data.qg0, N))
p = variable(core, 2*nlines, 1:N)
q = variable(core, 2*nlines, 1:N);
Note that we have to duplicate the starting points for $v_m$, $p_g$ and $q_g$ $N$ times. We also have to evaluate the power flow constraints in batch. As a consequence, the iterator used to generate each constraint has to be modified using the iterator product:
c2 = constraint(
core,
p[b.f_idx, k]
- b.c5 * vm[b.f_bus, k]^2 -
b.c3 * (vm[b.f_bus, k] * vm[b.t_bus, k] * cos(va[b.f_bus, k] - va[b.t_bus, k])) -
b.c4 * (vm[b.f_bus, k] * vm[b.t_bus, k] * sin(va[b.f_bus, k] - va[b.t_bus, k])) for
(b, k) in product(data.branch, 1:N)
)
Constraint
s.t. (...)
g♭ ≤ [g(x,p)]_{p ∈ P} ≤ g♯
where |P| = 900
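The cardinality |P| = 900 corresponds to the 9 branches of case9 replicated over the N = 100 scenarios.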
To avoid redefining the whole model by hand, we provide a utility function to generate the batched power flow model using ExaModels:
include("powerflow.jl")
nlp = batched_power_flow_model(data, N)
An ExaModel{Float64, Vector{Float64}, ...}
Problem name: Generic
All variables: ████████████████████ 6000 All constraints: ████████████████████ 6000
free: ████████████████████ 6000 free: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
lower: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 lower: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
upper: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 upper: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
low/upp: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 low/upp: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
fixed: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 fixed: ████████████████████ 6000
infeas: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 infeas: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
nnzh: ( 99.79% sparsity) 37800 linear: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
nonlinear: ████████████████████ 6000
nnzj: ( 99.93% sparsity) 24600
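As a consistency check, the reported number of variables matches the per-scenario dimensions multiplied by the batch size:
# 2*nbus voltage variables, 2*ngen generator variables and 4*nlines flow variables per scenario
@assert NLPModels.get_nvar(nlp) == (2*nbus + 2*ngen + 4*nlines) * N   # 60 * 100 = 6000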
The power flow model can be solved on the CPU using the function solve_power_flow we implemented in the previous tutorial:
results = solve_power_flow(nlp, N)
nothing
[ Info: It: 1 residual: 28.281211961300386
[ Info: It: 2 residual: 1.6641541672388813
[ Info: It: 3 residual: 0.03123124093240232
[ Info: It: 4 residual: 7.791517832863897e-6
[ Info: It: 5 residual: 6.342121166249363e-13
We recover the solution in matrix format using:
vm = reshape(results[nbus*N+1:2*nbus*N], nbus, N)
9×100 Matrix{Float64}:
0.975472 0.975472 0.975472 0.975472 … 0.975472 0.975472 0.975472
0.987007 0.987007 0.987007 0.987007 0.987007 0.987007 0.987007
1.00338 1.00338 1.00338 1.00338 1.00338 1.00338 1.00338
0.985645 0.985645 0.985645 0.985645 0.985645 0.985645 0.985645
1.0 1.0 1.0 1.0 1.0 1.0 1.0
0.957621 0.957621 0.957621 0.957621 … 0.957621 0.957621 0.957621
0.996185 0.996185 0.996185 0.996185 0.996185 0.996185 0.996185
1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0
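The voltage angles can be recovered in the same way. Assuming the utility model declares the variables in the same order as above, va occupies the first nbus*N entries of the solution vector:
va = reshape(results[1:nbus*N], nbus, N);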
Solving the power flow equations in batch on the GPU
Note that the algorithm does not exploit the fact that the $N$ blocks are independent. ExaModels detects the repeated data structure automatically and can evaluate the resulting model in parallel on the GPU. That is the core benefit of the SIMD abstraction used by ExaModels. To evaluate the model on the GPU, you just have to pass the correct backend to the function batched_power_flow_model we used before:
using CUDA
nlp_gpu = batched_power_flow_model(data, N; backend=CUDABackend())
n = NLPModels.get_nvar(nlp_gpu)
m = NLPModels.get_ncon(nlp_gpu)
nnzj = NLPModels.get_nnzj(nlp_gpu);
Evaluating the model on the GPU simply amounts to
x0 = NLPModels.get_x0(nlp_gpu)
c = similar(x0, m)
NLPModels.cons!(nlp_gpu, x0, c)
6000-element CUDA.CuArray{Float64, 1, CUDA.DeviceMemory}:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
⋮
0.3
0.0
0.0
0.35
-0.0654
0.5
0.0
0.10949999999999999
-0.2703
for the power flow residual, and for the Jacobian:
Jx = similar(x0, nnzj)
NLPModels.jac_coord!(nlp_gpu, x0, Jx)
24600-element CUDA.CuArray{Float64, 1, CUDA.DeviceMemory}:
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
⋮
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
-1.0
We can benchmark the time spent in the evaluation of the derivatives using the macro @time (or CUDA.@time if we also want to include the synchronization time in CUDA):
CUDA.@time NLPModels.cons!(nlp_gpu, x0, c)
CUDA.@time NLPModels.jac_coord!(nlp_gpu, x0, Jx)
nothing
0.000345 seconds (933 CPU allocations: 42.953 KiB)
0.000336 seconds (929 CPU allocations: 42.906 KiB)
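For more stable measurements, one could also use BenchmarkTools (not loaded elsewhere in this tutorial) together with CUDA.@sync:
using BenchmarkTools
# Time the Jacobian evaluation on the GPU, including kernel synchronization
@btime CUDA.@sync NLPModels.jac_coord!($nlp_gpu, $x0, $Jx);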
We observe that the evaluation of the Jacobian takes about 0.3 ms in this case. The function analyse_sparsity provides a sparse routine that extracts from the Jacobian J the submatrix corresponding to the power flow equations.
On the GPU, the default format for sparse matrices is CSR, as it leads to better parallelism when computing sparse matrix-vector products.
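As a small self-contained illustration (independent of the tutorial's data), a sparse matrix stored in the usual CSC format on the CPU can be transferred to the GPU in CSR format with CUDA's CUSPARSE wrappers:
using CUDA.CUSPARSE
A_cpu = sparse([1, 2, 2, 3], [1, 1, 2, 3], [4.0, 1.0, 3.0, 2.0], 3, 3)   # CSC storage on the CPU
A_gpu = CuSparseMatrixCSR(A_cpu)                                         # CSR storage on the GPU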
We can assemble the submatrix G using this new function:
ind_dof = get_index_dof(data, N)
m_fixed = length(ind_dof)
ind_dep = setdiff(1:n, ind_dof)
nx = length(ind_dep)
Ji = similar(x0, Int, nnzj)
Jj = similar(x0, Int, nnzj)
NLPModels.jac_structure!(nlp_gpu, Ji, Jj)
G, coo_to_csr = analyse_sparsity(Ji, Jj, Jx, m, n, m_fixed, ind_dep);
Now that the Jacobian is evaluated, we have to compute its LU factorization on the GPU, if possible in sparse format. The solver cuDSS does exactly that. To use cuDSS in Julia, you have to import the package CUDSS:
using CUDSS
We update the values in the Jacobian of the original model and transfer them to G, using the same procedure as in Tutorial 1:
NLPModels.jac_coord!(nlp_gpu, x0, Jx)
nonzeros(G) .= Jx[coo_to_csr];
The symbolic factorization in cuDSS proceeds as follows:
d_gpu = CUDA.zeros(Float64, nx)
b_gpu = CUDA.zeros(Float64, nx)
solver = CudssSolver(G, "G", 'F') # "G": general (non-symmetric) matrix, 'F': full storage
cudss_set(solver, "reordering_alg", "algo2") # we have to change the ordering to get valid results
cudss("analysis", solver, d_gpu, b_gpu)
Hence, we are now able to replace KLU with cuDSS in the Newton solver we implemented in the previous tutorial. We initialize the Newton algorithm by evaluating the model at the initial point:
ind_dep = CuVector{Int}(ind_dep)
x = copy(x0)
c = similar(x0, m)
residual = view(c, m_fixed+1:m) # get the subvector associated with the power flow residual
NLPModels.cons!(nlp_gpu, x, c)
cudss("factorization", solver, d_gpu, b_gpu)
max_iter = 10
tol = 1e-8
@info "Solving the power flow equations with Newton"
i = 1
for i in 1:max_iter
@info "It: $(i) residual: $(norm(residual))"
if norm(residual) <= tol
break
end
NLPModels.jac_coord!(nlp_gpu, x, Jx) # Update values in Jacobian
nonzeros(G) .= Jx[coo_to_csr]
cudss_set(solver, G) # Update numerical factorization
cudss("refactorization", solver, d_gpu, b_gpu)
b_gpu .= residual
cudss("solve", solver, d_gpu, b_gpu)
x[ind_dep] .-= d_gpu
NLPModels.cons!(nlp_gpu, x, c)
end
[ Info: Solving the power flow equations with Newton
[ Info: It: 1 residual: 28.281211961300375
[ Info: It: 2 residual: 1.6641541672388802
[ Info: It: 3 residual: 0.031231240932401514
[ Info: It: 4 residual: 7.791517847008339e-6
[ Info: It: 5 residual: 6.254789866464209e-13
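A quick way to check that the GPU run reproduces the CPU solution is to compare the voltage magnitudes with the matrix vm computed earlier:
vm_gpu = reshape(Array(x[nbus*N+1:2*nbus*N]), nbus, N)   # move the GPU solution back to the CPU
norm(vm_gpu .- vm)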
We observe exactly the same convergence as on the CPU. However, the time to solution is significantly higher on the GPU: it turns out that KLU is much more efficient than cuDSS on this particular example.
This page was generated using Literate.jl.