Spheral GPU Port Design Patterns

Status and Audience

This page is a developer design guide for porting Spheral components toward GPU execution. It is intended for maintainers who need to modify C++ classes, RAJA kernels, CHAI-managed data, PYB11 bindings, and Python construction helpers without losing the existing host behavior.

The examples are drawn from existing Spheral code and from the RAJA evaluateDerivatives work merged in commit 548becc88a4ed60e58873326456a9559fa1db62b. The guidance here documents patterns used by those examples. It does not settle project-specific design questions for future Neighbor, TreeNeighbor, NodeList, or ConnectivityMap work.

This page is the applied porting guide. For subsystem reference material, see Value/View and Device Execution Model for the value/view and managed view pointer execution model, Current RAJA-Captured Object Families for the current device-facing object families, RAJA/CHAI Execution Patterns for RAJA and CHAI execution details, and Connectivity Data Structures for connectivity and pair-list structure.

Evidence and Limits

This document distinguishes between three kinds of statements:

Observed examples: Descriptions of how existing code such as FieldView, NodePairListView, ArtificialViscosityView, SPH_RAJA, and SolidSPH_RAJA currently work.
Design guidance: Recommendations for future ports based on those examples and general maintainability concerns.
Open design choices: Decisions that require additional analysis of the target subsystem. This page should not present those choices as already resolved.

When the document discusses future ConnectivityMap or Neighbor work, read it as guidance to investigate, not as a claim that the current examples already prove the final design.

Design Goals

For maintainability, GPU-porting changes should aim to:

Preserve the existing host-facing APIs unless there is a deliberate design decision to change them.
Keep owning objects responsible for lifetime, invariants, restart, Python exposure, and compatibility.
Expose compact device-facing views that can be copied into RAJA kernels.
Make CHAI data movement visible and reviewable.
Keep host-only discovery, allocation, and polymorphic selection out of device loop bodies.
Add tests at the abstraction boundary being changed, not only at the final application level.

Non-goals

This page is not a tutorial for RAJA, CHAI, PYB11, or CMake. It also does not require every future port to use the exact implementation shape used by the examples below. In particular, inheritance from a view class is an existing Spheral pattern, not a universal requirement.

Terminology

Owning object: The full host-facing C++ object. It owns storage, invariants, compatibility behavior, restart state, Python-facing semantics, and any host-only containers or lookup structures.
View: A lightweight object containing the subset of data and operations needed by device code. A view should be cheap to copy and safe to capture by value in a RAJA lambda.
Managed view: A view allocated with chai::managed_ptr so that device code can call methods through a pointer. This is used for device-side virtual dispatch.
Device-callable method: A method marked with SPHERAL_HOST_DEVICE and limited to operations that are valid in host and device execution spaces.

Pattern Summary

The GPU-ready examples documented on this page use this general shape:

Host-facing object
  owns data, invariants, restart, Python semantics
  |
  | view()
  v
Device-facing view
  compact data access, SPHERAL_HOST_DEVICE methods
  |
  | captured by value
  v
RAJA kernel
  simple range, explicit inputs, atomics where required

The documented managed-dispatch example uses this shape:

Concrete host object
  owns chai::managed_ptr<ConcreteView>
  |
  | dynamic_pointer_cast
  v
chai::managed_ptr<BaseView>
  |
  | captured in RAJA kernel
  v
virtual device-callable method

Worked Example: `Field` and `FieldView`

Field Problem Addressed

Field is one of the central data containers in Spheral. Existing host code expects a rich object that knows its name, its NodeList, how to resize, how to pack values for communication, how to serialize, and how to participate in restart and output.

The documented kernels need much less. In those kernels, the common operation is indexed access to field values, together with a small amount of metadata such as the number of internal and ghost elements.

Field Design

Field remains the owning object. It owns the host std::vector storage and the full host API. It derives from FieldBase and participates in Spheral’s field, restart, communication, and output infrastructure.

FieldView is the device-facing access object. It contains:

A CHAI ManagedArray or span over the field values.
The internal and ghost element counts.
The nodes-per-smoothing-scale value.
SPHERAL_HOST_DEVICE element access and simple arithmetic operations.
move and touch methods for CHAI-managed data.

The current implementation has Field inherit from FieldView. That keeps the view operations directly available on Field and lets Field::view() return a lightweight copy of the FieldView subobject:

auto positionFields = state.fields(HydroFieldNames::position,
                                   Vector::zero());
auto position = positionFields.view();

RAJA::forall<EXEC_POLICY>(TRS_UINT(0u, n),
  [=] SPHERAL_HOST_DEVICE (size_t i) {
    const auto& ri = position(nodeListi, i);
    // Device-callable work using ri.
  });

Field Important Consequences

The kernel captures position, not the owning State, FieldList, or Field. This is the important design boundary. Host code performs lookup and policy decisions before the loop. Device code receives a compact access object.

Field also has to keep the host vector and CHAI view synchronized. When the host storage changes, the managed span is refreshed so that the view still points at the correct data. That synchronization belongs in the owning object, not in arbitrary kernels.

Review Checklist

When applying this pattern to another class, check:

Does the owning object retain all host-only behavior?
Does the view contain only data needed by device code?
Are all view methods used in kernels marked SPHERAL_HOST_DEVICE?
Is CHAI movement explicit and close to the algorithm that needs it?
Is there a clear rule for invalidating or refreshing views when host storage changes?

Worked Example: `FieldList` and `FieldListView`

FieldList Problem Addressed

Many Spheral algorithms operate on one field per NodeList. Host code needs the full FieldList abstraction, but kernels need fast indexed access by (fieldIndex, nodeIndex).

FieldList Design

FieldListView packages a collection of FieldView objects into a device-usable container. Device code can write:

auto massFields = state.fields(HydroFieldNames::mass, 0.0);
auto mass = massFields.view();

const auto mi = mass(nodeListi, i);

The view hides the host FieldList machinery while preserving the indexing model used throughout the C++ code.

FieldList Important Consequences

The RAJA hydro code first extracts all state and derivative FieldList objects, converts them to views, and only then enters the pair or node loops. That makes kernel dependencies obvious: the loop body uses captured views rather than reaching back into State or StateDerivatives.

Worked Example: `NodePairList` and `NodePairListView`

NodePairList Problem Addressed

Pairwise hydro loops need a compact schedule of node-node interactions. Host code also needs richer operations: sorting, insertion, pair-to-index lookup, and domain-decomposition ordering.

NodePairList Design

NodePairList is the host-side owner. It stores a std::vector<NodePairIdxType> and maintains host-side conveniences such as iterators and a lazy pair-to-index lookup map.

NodePairListView is intentionally much smaller. It stores only a CHAI ManagedArray or span of NodePairIdxType values and exposes:

operator[] for indexed access.
size() and data().
move() and touch().

This is enough for the RAJA pair loops shown in the current SPH examples:

const auto& pairsValue = connectivityMap.nodePairList();
const auto  pairs = pairsValue.view();
const auto  npairs = pairs.size();

RAJA::forall<EXEC_POLICY>(TRS_UINT(0u, npairs),
  [=] SPHERAL_HOST_DEVICE (size_t kk) {
    const auto i = pairs[kk].i_node;
    const auto j = pairs[kk].j_node;
    const auto nodeListi = pairs[kk].i_list;
    const auto nodeListj = pairs[kk].j_list;

    // Compute this pair interaction.
  });

NodePairList Important Consequences

The view does not expose the pair-to-index lookup map. That omission is a design feature, not a missing API, for loops that operate on an already-built pair schedule. In that case, device code can consume the flattened pair list without carrying the host lookup structure into the kernel.

This is a useful precedent, but not by itself a complete answer for future connectivity work. NodePairListView shows that an already-built pair schedule can be consumed on device through a flattened representation. It does not prove that the algorithms that build or patch connectivity can avoid richer device-side data structures. Future connectivity work should determine which parts can use flattened inputs and which parts require additional staging or new device-ready structures.

Worked Example: `ArtificialViscosity` and Managed Views

ArtificialViscosity Problem Addressed

ArtificialViscosity is not just a data container. It is a Physics package with state registration, derivative registration, boundary behavior, restart state, and several concrete implementations. The hydro algorithm needs to call the pairwise QPiij operation on device, but the concrete viscosity type is selected through the host object model.

ArtificialViscosity Design

The host-facing ArtificialViscosity object owns the full physics-package behavior. It also exposes getScalarView() and getTensorView() so callers can obtain the appropriate device-facing polymorphic view.

ArtificialViscosityView is the base device-facing interface. It carries the common viscosity parameters and defines a virtual SPHERAL_HOST_DEVICE QPiij method. Concrete views, such as MonaghanGingoldViscosityView and LimitedMonaghanGingoldViscosityView, implement that method.

Concrete host viscosity classes allocate the concrete view with chai::make_managed and return it through a base managed pointer:

chai::managed_ptr<ArtViscView> getScalarView() override {
  initView();
  return chai::dynamic_pointer_cast<ArtViscView, ViewType>(m_viewPtr);
}

void initView() {
  if (!m_viewPtr) {
    m_viewPtr = chai::make_managed<ViewType>(mClinear,
                                             mCquadratic,
                                             mLinearInExpansion,
                                             mQuadraticInExpansion);
  }
}

The RAJA hydro path chooses scalar or tensor dispatch before entering the main pair loop:

auto& Qhandle = this->artificialViscosity();
if (Qhandle.QPiTypeIndex() == std::type_index(typeid(Scalar))) {
  chai::managed_ptr<ArtificialViscosityView<Dimension, Scalar>> Q =
    Qhandle.getScalarView();
  this->evaluateDerivativesImpl(time, dt, dataBase, state, derivatives, Q);
}

Inside the pair loop, the managed base pointer is captured and used through the virtual interface:

Q->QPiij(QPiij, QPiji, Qi, Qj,
         nodeListi, i, nodeListj, j,
         ri, Hi, etai, vi, rhoi, ci,
         rj, Hj, etaj, vj, rhoj, cj,
         fClQView, fCqQView, DvDxQView);

ArtificialViscosity Important Consequences

This pattern is best reserved for cases where the device path really needs polymorphism. A managed virtual hierarchy has stricter lifetime rules than a plain data view.

In the viscosity implementation, when host-side parameters change, the concrete host class frees and reconstructs the managed view rather than modifying it in place. This is intended to preserve the validity of the device object and its vtable.

RAJA Algorithm Structure

The RAJA SPH_RAJA and SolidSPH_RAJA implementations show how the building blocks compose into a larger algorithm.

The RAJA SPH examples use these stages. Future ports can use this sequence as a checklist, but should still verify which stages apply to the target algorithm:

Gather host-side state from State, StateDerivatives, kernels, viscosity objects, and connectivity.
Convert owning objects to views or managed pointers.
Move views to the GPU when needed.
Launch simple range-based pair loops over NodePairListView.
Use atomics for shared per-node outputs.
Launch separate final per-node loops when reduction/finalization differs from pair interaction.
Move modified views back to CPU before host code reads them.

A representative loop looks like:

RAJA::forall<EXEC_POLICY>(TRS_UINT(0u, npairs),
  [=] SPHERAL_HOST_DEVICE (size_t kk) {
    const auto i = pairs[kk].i_node;
    const auto j = pairs[kk].j_node;
    const auto nodeListi = pairs[kk].i_list;
    const auto nodeListj = pairs[kk].j_list;

    // Read state through views.
    // Compute pair contribution.
    // Accumulate results through atomics where needed.
  });
GPU_ERROR_CHECK

In this staged pattern, keep the loop body from allocating storage, searching host containers, or selecting concrete implementations. Those steps are easier to review when they happen in host setup before the kernel.

Race Management

The documented pair loops update both members of a pair. Different pair iterations can therefore update the same node concurrently. The RAJA SPH path handles this with atomics where shared outputs are unavoidable.

Examples include:

GPUUtils::AtomicAddOp::apply for scalar accumulation.
GPUUtils::AtomicMaxOp::apply for maxima.
Value-type methods such as atomicAdd and atomicSub on vector and tensor data.

Before porting a loop, identify each output and classify it:

Private per iteration: No atomic is needed.
One writer per element: No atomic is needed if the one-writer property is guaranteed by the loop structure.
Multiple writers per element: Use atomics, restructure the algorithm, or split the operation into staged reductions.

Python and PYB11 Integration

Spheral GPU ports often need Python-facing plumbing, not just C++ kernels.

The RAJA SPH work added C++ classes such as SPH_RAJA and SolidSPH_RAJA rather than replacing SPH and SolidSPH in place. Those classes were then exposed through PYB11 and selected from the Python SPHHydros.py helper when the caller passes RAJA=True.

This staged approach has several advantages:

Existing Python scripts keep their default behavior.
New behavior is opt-in and easy to compare against the original path.
Tests can select the RAJA path explicitly.
C++ and Python review can happen together, so the public construction API is not an afterthought.

When a future port adds a new C++ implementation class, check the full exposure path:

Is the class added to the package CMake instantiation lists?
Is there a PYB11 wrapper for the new class?
Is the wrapper included and instantiated for the enabled dimensions?
Does the Python helper expose an explicit option for selecting the new path?
Do functional tests exercise that Python option?

Testing Strategy

GPU-porting tests should start at the smallest boundary that can fail.

View tests: Verify construction, copy behavior, indexing, movement, and touch semantics.
Managed pointer tests: Verify chai::managed_ptr capture, dynamic cast, and virtual calls inside RAJA loops.
Loop tests: Verify simple pair-list traversal and atomic updates under the enabled execution policies.
Functional tests: Verify that Python-facing options select the intended implementation and that physics behavior remains acceptable.
Performance tests: Isolate the ported kernel or method so timing changes can be interpreted.

The RAJA evaluateDerivatives work used all of these categories: low-level view tests, managed pointer tests, loop tests, functional Noh tests, and an isolated evaluateDerivatives performance driver.

Design Checklist for Future Ports

Before implementing a new GPU-ready class or algorithm, write down:

The owning object and the behavior it must preserve.
The minimal data needed by device code.
The proposed view type and its lifetime.
Which methods must be SPHERAL_HOST_DEVICE.
Whether polymorphism is required on device.
How CHAI movement will be triggered and audited.
Which writes can race and how races are handled.
How the C++ implementation is exposed to Python, if applicable.
Which unit, loop, functional, and performance tests will validate the change.

Reference Implementation Map

The following files are useful starting points when applying these patterns:

src/Field/Field.hh and src/Field/FieldView.hh: owning field object and compact field view.
src/Field/FieldList.hh and src/Field/FieldListView.hh: host collection of fields and device-usable collection of views.
src/Neighbor/NodePairList.hh and src/Neighbor/NodePairListView.hh: host pair-list owner and device pair-list view.
src/ArtificialViscosity/ArtificialViscosity.hh and src/ArtificialViscosity/ArtificialViscosityView.hh: host physics package interface and managed device-view interface.
src/SPH/SPH_RAJA.cc and src/SPH/SolidSPH_RAJA.cc: full RAJA algorithm examples.
src/PYB11/SPH/SPH_RAJA.py and src/SPH/SPHHydros.py: Python exposure and opt-in construction path.
tests/cpp/Field, tests/cpp/Neighbor, tests/cpp/Loops, and tests/cpp/Utilities/managedptr_tests.cc: tests for the building blocks used by GPU-ready code.