RAJA/CHAI Execution Patterns

Purpose

This document describes how Spheral uses RAJA and CHAI in its current device-capable physics loops, especially the RAJA SPH implementations. It is the execution companion to two other pages: "Value/View and Device Execution Model" and "Current RAJA-Captured Object Families".

The object-model details live in those pages. This page focuses on launch-time mechanics:

selecting the RAJA execution policy;
creating views from host objects;
moving or touching CHAI-backed storage;
capturing views and managed pointers in RAJA lambdas;
using atomics in pair loops;
returning written data to CPU consumers.

Launch-Path Source Map

This page lists the source files that define the common launch configuration and the primary RAJA hydro launch sites. For the source files behind each RAJA-captured object family, use Source Map by Family.

Configuration and helpers:

src/config.hh.in
src/Utilities/GPUUtils.hh

Primary RAJA hydro implementations:

src/SPH/SPH_RAJA.hh and src/SPH/SPH_RAJA.cc
src/SPH/SolidSPH_RAJA.hh and src/SPH/SolidSPH_RAJA.cc

Build-Time Execution Model

src/config.hh.in defines the small set of macros and aliases that most device-capable code uses:

SPHERAL_HOST_DEVICE

Expands to RAJA_HOST_DEVICE.

SPHERAL_HOST and SPHERAL_DEVICE

Expand to RAJA host/device annotations.

GPU_BLOCK_SIZE

Currently set to 256.

EXEC_POLICY

Selects the default RAJA::forall policy:

HIP builds use RAJA::hip_exec<GPU_BLOCK_SIZE>;
CUDA builds use RAJA::cuda_exec<GPU_BLOCK_SIZE>;
OpenMP builds use RAJA::omp_parallel_for_exec;
otherwise Spheral uses RAJA::seq_exec.

TRS_UINT

Alias for RAJA::TypedRangeSegment<size_t>.

Most device-capable loops are written once:

RAJA::forall<EXEC_POLICY>(TRS_UINT(0u, n),
  [=] SPHERAL_HOST_DEVICE(size_t i) {
    ...
  });

The selected backend is a build configuration decision, not a physics-package decision.
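The compile-time selection described above can be illustrated without RAJA. The following is only a sketch under stated assumptions: SeqPolicy, OmpPolicy, GpuPolicy, ExecPolicy, forall, and the DEMO_USE_GPU / DEMO_USE_OPENMP guards are hypothetical stand-ins, not the names used in src/config.hh.in, and only the sequential overload is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Stand-in policy tags. In Spheral these roles are played by RAJA policy types.
struct SeqPolicy {};
struct OmpPolicy {};
template <int BlockSize> struct GpuPolicy {};

constexpr int kBlockSize = 256;  // mirrors the documented GPU_BLOCK_SIZE value

// DEMO_USE_GPU and DEMO_USE_OPENMP are hypothetical guard names for this sketch.
#if defined(DEMO_USE_GPU)
using ExecPolicy = GpuPolicy<kBlockSize>;
#elif defined(DEMO_USE_OPENMP)
using ExecPolicy = OmpPolicy;
#else
using ExecPolicy = SeqPolicy;
#endif

// Only the sequential overload exists here; a real backend library would
// supply one implementation per policy.
template <typename Body>
void forall(SeqPolicy, std::size_t begin, std::size_t end, Body body) {
  for (std::size_t i = begin; i < end; ++i) body(i);
}

// The loop body is written once against ExecPolicy and never names a backend.
std::vector<double> scale(const std::vector<double>& in, double a) {
  std::vector<double> out(in.size());
  const double* src = in.data();  // capture raw pointers by value, like views
  double* dst = out.data();
  forall(ExecPolicy{}, 0u, in.size(),
         [=](std::size_t i) { dst[i] = a * src[i]; });
  return out;
}
```

With neither guard defined, ExecPolicy resolves to the sequential tag, so scale({1.0, 2.0, 3.0}, 2.0) runs as an ordinary loop. The calling code would not change if a different tag were selected.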

CHAI’s Role

CHAI provides two related services in Spheral:

chai::ManagedArray: Wraps storage that can be moved between CPU and GPU execution spaces. In non-UVM builds, many Spheral views contain a ManagedArray.
chai::managed_ptr: Holds managed objects whose methods may be called on device. Spheral uses this for artificial-viscosity view objects where a RAJA loop needs polymorphic QPiij behavior.

Spheral wraps common CHAI operations in GPUUtils.hh:

initMAView creates or refreshes a managed-array view over host-object storage;
freeMAView frees the managed-array view;
move and touch hide the difference between UVM and non-UVM builds;
atomic operation wrappers call RAJA atomics with RAJA::auto_atomic.

In unified-memory builds, the same wrapper APIs compile to span-like access or to no-ops where movement is unnecessary. This keeps most container code independent of the memory model.

RAJA Launch Setup Pattern

The current RAJA hydro code follows a repeated setup pattern:

Read durable host objects from the package, DataBase, State, and StateDerivatives.
Create view objects with view() or managed view pointer accessors.
Move or touch captured objects when the backend requires explicit movement.
Launch one or more RAJA::forall loops.
Move written views back to CPU if later host code will consume them.

A simplified version of the pattern in SPH_RAJA::evaluateDerivativesImpl is:

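// Build kernel views from the host-side TableKernel objects.
auto W_view = W.view();
auto WQ_view = WQ.view();

// Keep a reference to the host NodePairList; the loop captures only its view.
const auto& pairs_owner = connectivityMap.nodePairList();
const auto pairs = pairs_owner.view();

// Look up state and derivative field lists by key on the host.
auto mass_v = state.fields(HydroFieldNames::mass, 0.0);
auto DvDt_v = derivs.fields(HydroFieldNames::hydroAcceleration,
                            Vector::zero());

// Convert the field lists to views that the lambda can capture by value.
auto mass = mass_v.view();
auto DvDt = DvDt_v.view();

// One iteration per interacting pair.
RAJA::forall<EXEC_POLICY>(TRS_UINT(0u, pairs.size()),
  [=] SPHERAL_HOST_DEVICE(size_t kk) {
    const auto pair = pairs[kk];
    const auto mi = mass(pair.i_list, pair.i_node);
    ...
    // Shared per-node output, so the update is atomic.
    DvDt(pair.i_list, pair.i_node).atomicSub(...);
  });

// Return the written derivative to the host for later host-side consumers.
DvDt.move(chai::CPU);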

The actual implementation has many more fields, but the structure is the same: host objects stay on the host side of the launch boundary; views and managed view pointers are captured by the RAJA lambda.

SPH_RAJA Derivative Flow

SPH_RAJA::evaluateDerivatives first dispatches on the artificial-viscosity return type:

scalar viscosity uses chai::managed_ptr<ArtificialViscosityView<Dimension, Scalar>>;
tensor viscosity uses chai::managed_ptr<ArtificialViscosityView<Dimension, Tensor>>.

The templated evaluateDerivativesImpl then performs three broad phases.

Initial setup: TableKernelView objects are created from TableKernel objects. The active ConnectivityMap supplies a NodePairList host object, whose view is captured by the pair loop. State and derivative field lists are looked up by key and converted to FieldListView objects. The code checks that field-list sizes match the number of node lists.
Pair loop: A RAJA loop over npairs computes pairwise SPH contributions. Each loop iteration reads one NodePairIdxType containing i_node, i_list, j_node, and j_list. It reads state for both nodes, evaluates SPH interpolation kernels, calls Q->QPiij(...), and atomically accumulates pair contributions into per-node derivative fields.
Per-node finalization: A second set of RAJA loops walks internal nodes per node list. These loops add self-contributions, finish velocity gradients, compute continuity derivatives, finish XSPH position evolution, and apply other node-local finalization.

After the loops, derivative views are moved back to chai::CPU. This is necessary because the integrator and later package hooks may run host-side code that reads the derivative fields.

SolidSPH_RAJA follows the same broad shape, with more state and derivative fields for solid mechanics. In HIP builds it also calls move(chai::GPU) explicitly on many views before launching RAJA loops.

Pair Loops and Atomics

The SPH pair loop updates both nodes in an interacting pair. Different pairs can contribute to the same destination node concurrently, so the RAJA loop must use atomic accumulation for shared per-node outputs.

Spheral uses two forms of atomic update:

scalar wrappers such as GPUUtils::AtomicAddOp and GPUUtils::AtomicMaxOp;
data-type methods such as Vector::atomicAdd and Tensor::atomicSub where geometry types provide component-wise atomics.

The pair loop therefore has a deliberate asymmetry:

each pair is visited once through NodePairListView;
contributions for both endpoints are accumulated in the same loop iteration;
output fields that receive many pair contributions must be atomic.

This design avoids separate gather/scatter passes but makes atomic placement and data-race analysis part of RAJA loop development.
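The single-pass pair accumulation described above can be sketched in plain C++. This is a stand-alone illustration, not Spheral code: Pair, atomicAdd, and accumulatePairs are hypothetical names, std::atomic stands in for the RAJA atomics, a single node list is assumed (so there are no i_list/j_list indices), and the loop runs sequentially here. The atomic update is what would keep the result correct if iterations were distributed across threads.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pair record for one node list.
struct Pair {
  std::size_t i_node;
  std::size_t j_node;
};

// Compare-and-swap atomic add for double, usable before C++20.
inline void atomicAdd(std::atomic<double>& target, double value) {
  double expected = target.load(std::memory_order_relaxed);
  while (!target.compare_exchange_weak(expected, expected + value,
                                       std::memory_order_relaxed)) {
    // On failure, 'expected' holds the latest stored value; retry with it.
  }
}

// Visit each pair once and apply equal and opposite contributions
// to both endpoints in the same iteration.
std::vector<double> accumulatePairs(std::size_t numNodes,
                                    const std::vector<Pair>& pairs,
                                    const std::vector<double>& pairValue) {
  assert(pairs.size() == pairValue.size());
  std::vector<std::atomic<double>> out(numNodes);
  for (auto& x : out) x.store(0.0);

  for (std::size_t kk = 0; kk < pairs.size(); ++kk) {
    // Different pairs may share a node, so both writes are atomic.
    atomicAdd(out[pairs[kk].i_node], pairValue[kk]);
    atomicAdd(out[pairs[kk].j_node], -pairValue[kk]);
  }

  std::vector<double> result(numNodes);
  for (std::size_t n = 0; n < numNodes; ++n) result[n] = out[n].load();
  return result;
}
```

In this sketch every node that appears in more than one pair is a shared destination, which is why neither endpoint update can be a plain read-modify-write once the loop is parallel.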

Managed Dispatch in the Launch Path

Artificial viscosity is where the current RAJA launch path captures a managed pointer to a polymorphic view. The object-model contract is described in Value/View and Device Execution Model; the launch path is:

SPH_RAJA::evaluateDerivatives asks the viscosity host object for its QPiTypeIndex.
Host code selects the scalar or tensor templated path.
The host object returns a chai::managed_ptr<ArtificialViscosityView<Dimension, QPiType>> through getScalarView() or getTensorView().
The RAJA pair-loop lambda captures that managed base pointer by value.
The RAJA loop calls Q->QPiij(...) through the device-valid view object.

The host object remains the durable holder of viscosity parameters and restart state. The managed view is reconstructed when host-side parameters change, so RAJA setup should reacquire the managed pointer instead of caching it across configuration changes.
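The reacquire-instead-of-cache advice can be illustrated with a host-only sketch. Everything here is an assumption made for illustration: ViscosityView, LinearScalarView, ViscosityHost, typeIndex, and the Cl * mu formula are hypothetical, std::shared_ptr stands in for chai::managed_ptr, the tensor path is omitted, and this mock rebuilds the view on every call rather than only when parameters change.

```cpp
#include <cassert>
#include <memory>

// Hypothetical type tag reported by the host viscosity object.
enum class QPiTypeIndex { Scalar, Tensor };

// Hypothetical polymorphic view interface, templated on the return type.
template <typename QPiType>
struct ViscosityView {
  virtual ~ViscosityView() = default;
  virtual QPiType QPiij(double mu) const = 0;
};

// One concrete scalar model (illustrative formula only).
struct LinearScalarView : ViscosityView<double> {
  explicit LinearScalarView(double cl) : Cl(cl) {}
  double QPiij(double mu) const override { return Cl * mu; }
  double Cl;
};

// Host object: durable owner of the parameters.
struct ViscosityHost {
  double Cl = 1.0;
  QPiTypeIndex typeIndex() const { return QPiTypeIndex::Scalar; }
  std::shared_ptr<ViscosityView<double>> getScalarView() const {
    return std::make_shared<LinearScalarView>(Cl);  // built from current Cl
  }
};

// Templated path: the loop body captures the base pointer by value
// and calls through the virtual interface.
template <typename QPiType>
QPiType evaluateImpl(std::shared_ptr<ViscosityView<QPiType>> Q, double mu) {
  auto body = [=](double x) { return Q->QPiij(x); };
  return body(mu);
}

// Host-side dispatch: reacquire the view on every call.
double evaluate(const ViscosityHost& host, double mu) {
  assert(host.typeIndex() == QPiTypeIndex::Scalar);  // tensor path omitted
  return evaluateImpl<double>(host.getScalarView(), mu);
}
```

In this sketch a view obtained before a parameter change keeps the old coefficient, while evaluate picks up the new one because it asks the host object for a fresh view at setup time.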

Captured View Movement

Movement policy is distributed by captured object type:

FieldView and FieldListView: Move or touch field data. FieldListView movement may be recursive because the outer view contains nested field views.
TableKernelView: Moves nested interpolator views.
NodePairListView: Moves or touches the pair array.
PairwiseFieldView: Moves or touches the pairwise-value array.
chai::managed_ptr views: Managed object construction and lifetime are controlled by the host object, such as an artificial-viscosity model.

The caller that launches a RAJA loop is responsible for ensuring all captured views are valid in the target execution space. In some paths the first access through CHAI may trigger movement; in others, especially HIP-focused code, explicit movement is used before the launch.

Device-to-host movement is a consumer-boundary decision, not an automatic post-loop step. Written views only need to be moved back to chai::CPU when the next consumer is host-only code that will read those values. If the next consumer is another RAJA/device-capable path using views, the data can remain in the active execution space until a host-only boundary is reached.
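The consumer-boundary rule can be made concrete by counting transfers in a host-only mock. This is a sketch under stated assumptions: Space, TrackedView, deviceLoop, hostSum, and runPipeline are hypothetical names, nothing actually leaves host memory, and move only records where the data is considered valid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Space { CPU, GPU };  // stand-in for chai::CPU / chai::GPU

// Hypothetical view that records where its data is valid and counts transfers.
struct TrackedView {
  explicit TrackedView(std::vector<double>* s) : storage(s) {}
  void move(Space target) {
    if (target != where) {  // moving to the current space costs nothing
      where = target;
      ++transfers;
    }
  }
  std::vector<double>* storage;
  Space where = Space::CPU;
  int transfers = 0;
};

// Simulated device-capable loop: needs the data in the GPU space.
void deviceLoop(TrackedView& v, double increment) {
  v.move(Space::GPU);
  for (double& x : *v.storage) x += increment;
}

// Host-only consumer: this is the boundary that requires the CPU space.
double hostSum(TrackedView& v) {
  v.move(Space::CPU);
  double sum = 0.0;
  for (double x : *v.storage) sum += x;
  return sum;
}

struct Result {
  double sum;
  int transfers;
};

// Run two device loops, then one host read, with or without an
// unconditional move back to the CPU after each loop.
Result runPipeline(bool moveBackAfterEveryLoop) {
  std::vector<double> data(4, 1.0);
  TrackedView v(&data);
  deviceLoop(v, 1.0);
  if (moveBackAfterEveryLoop) v.move(Space::CPU);
  deviceLoop(v, 2.0);
  if (moveBackAfterEveryLoop) v.move(Space::CPU);
  const double sum = hostSum(v);
  return Result{sum, v.transfers};
}
```

Both variants produce the same sum, but the variant that moves back only at the host boundary performs two transfers instead of four, since the second loop finds the data already in its execution space.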

CPU, OpenMP, HIP, and CUDA Behavior

Because Spheral uses EXEC_POLICY, the same source code can execute under several backends:

sequential CPU builds use RAJA::seq_exec;
OpenMP builds use RAJA::omp_parallel_for_exec;
HIP builds use RAJA::hip_exec;
CUDA builds use RAJA::cuda_exec.

That portability does not make all code backend-neutral automatically. RAJA lambda bodies still need to obey device restrictions:

capture simple view objects by value;
avoid host-only APIs;
avoid unannotated helper functions in RAJA lambda code;
use atomics for shared output locations;
ensure managed data has been moved or touched correctly for the backend;
keep virtual device calls rare and carefully managed.

Extension Guidance

When adding a new RAJA loop:

gather all host objects before the launch;
convert fields, SPH kernel objects, pair lists, and scratch data to views before the launch;
move or touch the views needed by the target backend;
capture views by value in the RAJA lambda;
call only SPHERAL_HOST_DEVICE-annotated helpers inside the lambda;
avoid STL containers, host iterators, FieldBase, and DataBase access inside the lambda;
use RAJA/Spheral atomics for shared outputs;
move written data back to CPU before host code reads it;
reacquire views after any storage resize or connectivity rebuild.
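The last item, reacquiring views after a storage change, can be shown with a minimal non-owning view over a std::vector. This is a stand-alone sketch: SpanView, HostField, sumThroughView, and resizeAndSum are hypothetical names and do not reproduce the Spheral Field or FieldView interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view: a pointer and a length, cheap to copy.
struct SpanView {
  double* data;
  std::size_t size;
};

// Hypothetical host object that owns the storage.
struct HostField {
  std::vector<double> values;
  SpanView view() { return SpanView{values.data(), values.size()}; }
};

// Loop-body style consumer that sees only the view.
double sumThroughView(SpanView v) {
  double sum = 0.0;
  for (std::size_t i = 0; i < v.size; ++i) sum += v.data[i];
  return sum;
}

// Resize the host storage, then reacquire the view before using it.
double resizeAndSum(HostField& field, std::size_t newSize) {
  const SpanView before = field.view();
  const std::size_t oldSize = before.size;  // copy the length only

  field.values.resize(newSize, 1.0);  // may reallocate; 'before.data' can dangle

  const SpanView after = field.view();  // rebuild from the current storage
  assert(after.size == newSize);
  assert(newSize == oldSize || after.size != before.size);
  return sumThroughView(after);
}
```

The old view is never dereferenced after the resize; only its recorded length is compared, which is enough to show that it no longer describes the host storage.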

Common Failure Modes

Missing host/device annotation: A helper used in a RAJA lambda is host-only. Mark small loop-body helpers SPHERAL_HOST_DEVICE or keep them outside the RAJA lambda.
Hidden host object capture: Capturing this or a rich host object can pull host-only state into a RAJA lambda. Capture views and primitive values instead.
Non-atomic shared writes: Pair loops write many contributions into per-node fields. Any shared destination must use an atomic update.
Stale CHAI view: Host-object storage changed after the view was created. Rebuild the view by calling view() again after the storage or layout change.
Host reads GPU-written data: A derivative field was written in a device RAJA loop but not moved back to chai::CPU before host code read it. Move written views back at the host-consumer boundary.
Device virtual dispatch instability: Managed polymorphic objects must be constructed in a way that preserves the device vtable. Follow the artificial-viscosity managed-view pattern.