Spheral GPU Port Design Patterns ################################ Status and Audience =================== This page is a developer design guide for porting Spheral components toward GPU execution. It is intended for maintainers who need to modify C++ classes, RAJA kernels, CHAI-managed data, PYB11 bindings, and Python construction helpers without losing the existing host behavior. The examples are drawn from existing Spheral code and from the RAJA ``evaluateDerivatives`` work merged in commit ``548becc88a4ed60e58873326456a9559fa1db62b``. The guidance here documents patterns used by those examples. It does not settle project-specific design questions for future ``Neighbor``, ``TreeNeighbor``, ``NodeList``, or ``ConnectivityMap`` work. This page is the applied porting guide. For subsystem reference material, see :doc:`../design/value_view_and_device_execution_model` for the value/view and managed view pointer execution model, :doc:`../design/value_view_conversion_case_studies` for the current device-facing object families, :doc:`../design/raja_chai_execution_patterns` for RAJA and CHAI execution details, and :doc:`../design/connectivity_data_structures` for connectivity and pair-list structure. Evidence and Limits =================== This document distinguishes between three kinds of statements: Observed examples Descriptions of how existing code such as ``FieldView``, ``NodePairListView``, ``ArtificialViscosityView``, ``SPH_RAJA``, and ``SolidSPH_RAJA`` currently work. Design guidance Recommendations for future ports based on those examples and general maintainability concerns. Open design choices Decisions that require additional analysis of the target subsystem. This page should not present those choices as already resolved. When the document discusses future ``ConnectivityMap`` or ``Neighbor`` work, read it as guidance to investigate, not as a claim that the current examples already prove the final design. Design Goals ============ For maintainability, GPU-porting changes should aim to: * Preserve the existing host-facing APIs unless there is a deliberate design decision to change them. * Keep owning objects responsible for lifetime, invariants, restart, Python exposure, and compatibility. * Expose compact device-facing views that can be copied into RAJA kernels. * Make CHAI data movement visible and reviewable. * Keep host-only discovery, allocation, and polymorphic selection out of device loop bodies. * Add tests at the abstraction boundary being changed, not only at the final application level. Non-goals ========= This page is not a tutorial for RAJA, CHAI, PYB11, or CMake. It also does not require every future port to use the exact implementation shape used by the examples below. In particular, inheritance from a view class is an existing Spheral pattern, not a universal requirement. Terminology =========== Owning object The full host-facing C++ object. It owns storage, invariants, compatibility behavior, restart state, Python-facing semantics, and any host-only containers or lookup structures. View A lightweight object containing the subset of data and operations needed by device code. A view should be cheap to copy and safe to capture by value in a RAJA lambda. Managed view A view allocated with ``chai::managed_ptr`` so that device code can call methods through a pointer. This is used for device-side virtual dispatch. Device-callable method A method marked with ``SPHERAL_HOST_DEVICE`` and limited to operations that are valid in host and device execution spaces. Pattern Summary =============== The GPU-ready examples documented on this page use this general shape: :: Host-facing object owns data, invariants, restart, Python semantics | | view() v Device-facing view compact data access, SPHERAL_HOST_DEVICE methods | | captured by value v RAJA kernel simple range, explicit inputs, atomics where required The documented managed-dispatch example uses this shape: :: Concrete host object owns chai::managed_ptr | | dynamic_pointer_cast v chai::managed_ptr | | captured in RAJA kernel v virtual device-callable method Worked Example: ``Field`` and ``FieldView`` =========================================== Field Problem Addressed ----------------------- ``Field`` is one of the central data containers in Spheral. Existing host code expects a rich object that knows its name, its ``NodeList``, how to resize, how to pack values for communication, how to serialize, and how to participate in restart and output. The documented kernels need much less. In those kernels, the common operation is indexed access to field values, together with a small amount of metadata such as the number of internal and ghost elements. Field Design ------------ ``Field`` remains the owning object. It owns the host ``std::vector`` storage and the full host API. It derives from ``FieldBase`` and participates in Spheral's field, restart, communication, and output infrastructure. ``FieldView`` is the device-facing access object. It contains: * A CHAI ``ManagedArray`` or span over the field values. * The internal and ghost element counts. * The nodes-per-smoothing-scale value. * ``SPHERAL_HOST_DEVICE`` element access and simple arithmetic operations. * ``move`` and ``touch`` methods for CHAI-managed data. The current implementation has ``Field`` inherit from ``FieldView``. That keeps the view operations directly available on ``Field`` and lets ``Field::view()`` return a lightweight copy of the ``FieldView`` subobject: .. code-block:: c++ auto positionFields = state.fields(HydroFieldNames::position, Vector::zero()); auto position = positionFields.view(); RAJA::forall(TRS_UINT(0u, n), [=] SPHERAL_HOST_DEVICE (size_t i) { const auto& ri = position(nodeListi, i); // Device-callable work using ri. }); Field Important Consequences ---------------------------- The kernel captures ``position``, not the owning ``State``, ``FieldList``, or ``Field``. This is the important design boundary. Host code performs lookup and policy decisions before the loop. Device code receives a compact access object. ``Field`` also has to keep the host vector and CHAI view synchronized. When the host storage changes, the managed span is refreshed so that the view still points at the correct data. That synchronization belongs in the owning object, not in arbitrary kernels. Review Checklist ---------------- When applying this pattern to another class, check: * Does the owning object retain all host-only behavior? * Does the view contain only data needed by device code? * Are all view methods used in kernels marked ``SPHERAL_HOST_DEVICE``? * Is CHAI movement explicit and close to the algorithm that needs it? * Is there a clear rule for invalidating or refreshing views when host storage changes? Worked Example: ``FieldList`` and ``FieldListView`` =================================================== FieldList Problem Addressed --------------------------- Many Spheral algorithms operate on one field per ``NodeList``. Host code needs the full ``FieldList`` abstraction, but kernels need fast indexed access by ``(fieldIndex, nodeIndex)``. FieldList Design ---------------- ``FieldListView`` packages a collection of ``FieldView`` objects into a device-usable container. Device code can write: .. code-block:: c++ auto massFields = state.fields(HydroFieldNames::mass, 0.0); auto mass = massFields.view(); const auto mi = mass(nodeListi, i); The view hides the host ``FieldList`` machinery while preserving the indexing model used throughout the C++ code. FieldList Important Consequences -------------------------------- The RAJA hydro code first extracts all state and derivative ``FieldList`` objects, converts them to views, and only then enters the pair or node loops. That makes kernel dependencies obvious: the loop body uses captured views rather than reaching back into ``State`` or ``StateDerivatives``. Worked Example: ``NodePairList`` and ``NodePairListView`` ========================================================= NodePairList Problem Addressed ------------------------------ Pairwise hydro loops need a compact schedule of node-node interactions. Host code also needs richer operations: sorting, insertion, pair-to-index lookup, and domain-decomposition ordering. NodePairList Design ------------------- ``NodePairList`` is the host-side owner. It stores a ``std::vector`` and maintains host-side conveniences such as iterators and a lazy pair-to-index lookup map. ``NodePairListView`` is intentionally much smaller. It stores only a CHAI ``ManagedArray`` or span of ``NodePairIdxType`` values and exposes: * ``operator[]`` for indexed access. * ``size()`` and ``data()``. * ``move()`` and ``touch()``. This is enough for the RAJA pair loops shown in the current SPH examples: .. code-block:: c++ const auto& pairsValue = connectivityMap.nodePairList(); const auto pairs = pairsValue.view(); const auto npairs = pairs.size(); RAJA::forall(TRS_UINT(0u, npairs), [=] SPHERAL_HOST_DEVICE (size_t kk) { const auto i = pairs[kk].i_node; const auto j = pairs[kk].j_node; const auto nodeListi = pairs[kk].i_list; const auto nodeListj = pairs[kk].j_list; // Compute this pair interaction. }); NodePairList Important Consequences ----------------------------------- The view does not expose the pair-to-index lookup map. That omission is a design feature, not a missing API, for loops that operate on an already-built pair schedule. In that case, device code can consume the flattened pair list without carrying the host lookup structure into the kernel. This is a useful precedent, but not by itself a complete answer for future connectivity work. ``NodePairListView`` shows that an already-built pair schedule can be consumed on device through a flattened representation. It does not prove that the algorithms that build or patch connectivity can avoid richer device-side data structures. Future connectivity work should determine which parts can use flattened inputs and which parts require additional staging or new device-ready structures. Worked Example: ``ArtificialViscosity`` and Managed Views ========================================================= ArtificialViscosity Problem Addressed ------------------------------------- ``ArtificialViscosity`` is not just a data container. It is a ``Physics`` package with state registration, derivative registration, boundary behavior, restart state, and several concrete implementations. The hydro algorithm needs to call the pairwise ``QPiij`` operation on device, but the concrete viscosity type is selected through the host object model. ArtificialViscosity Design -------------------------- The host-facing ``ArtificialViscosity`` object owns the full physics-package behavior. It also exposes ``getScalarView()`` and ``getTensorView()`` so callers can obtain the appropriate device-facing polymorphic view. ``ArtificialViscosityView`` is the base device-facing interface. It carries the common viscosity parameters and defines a virtual ``SPHERAL_HOST_DEVICE`` ``QPiij`` method. Concrete views, such as ``MonaghanGingoldViscosityView`` and ``LimitedMonaghanGingoldViscosityView``, implement that method. Concrete host viscosity classes allocate the concrete view with ``chai::make_managed`` and return it through a base managed pointer: .. code-block:: c++ chai::managed_ptr getScalarView() override { initView(); return chai::dynamic_pointer_cast(m_viewPtr); } void initView() { if (!m_viewPtr) { m_viewPtr = chai::make_managed(mClinear, mCquadratic, mLinearInExpansion, mQuadraticInExpansion); } } The RAJA hydro path chooses scalar or tensor dispatch before entering the main pair loop: .. code-block:: c++ auto& Qhandle = this->artificialViscosity(); if (Qhandle.QPiTypeIndex() == std::type_index(typeid(Scalar))) { chai::managed_ptr> Q = Qhandle.getScalarView(); this->evaluateDerivativesImpl(time, dt, dataBase, state, derivatives, Q); } Inside the pair loop, the managed base pointer is captured and used through the virtual interface: .. code-block:: c++ Q->QPiij(QPiij, QPiji, Qi, Qj, nodeListi, i, nodeListj, j, ri, Hi, etai, vi, rhoi, ci, rj, Hj, etaj, vj, rhoj, cj, fClQView, fCqQView, DvDxQView); ArtificialViscosity Important Consequences ------------------------------------------ This pattern is best reserved for cases where the device path really needs polymorphism. A managed virtual hierarchy has stricter lifetime rules than a plain data view. In the viscosity implementation, when host-side parameters change, the concrete host class frees and reconstructs the managed view rather than modifying it in place. This is intended to preserve the validity of the device object and its vtable. RAJA Algorithm Structure ======================== The RAJA ``SPH_RAJA`` and ``SolidSPH_RAJA`` implementations show how the building blocks compose into a larger algorithm. The RAJA SPH examples use these stages. Future ports can use this sequence as a checklist, but should still verify which stages apply to the target algorithm: 1. Gather host-side state from ``State``, ``StateDerivatives``, kernels, viscosity objects, and connectivity. 2. Convert owning objects to views or managed pointers. 3. Move views to the GPU when needed. 4. Launch simple range-based pair loops over ``NodePairListView``. 5. Use atomics for shared per-node outputs. 6. Launch separate final per-node loops when reduction/finalization differs from pair interaction. 7. Move modified views back to CPU before host code reads them. A representative loop looks like: .. code-block:: c++ RAJA::forall(TRS_UINT(0u, npairs), [=] SPHERAL_HOST_DEVICE (size_t kk) { const auto i = pairs[kk].i_node; const auto j = pairs[kk].j_node; const auto nodeListi = pairs[kk].i_list; const auto nodeListj = pairs[kk].j_list; // Read state through views. // Compute pair contribution. // Accumulate results through atomics where needed. }); GPU_ERROR_CHECK In this staged pattern, keep the loop body from allocating storage, searching host containers, or selecting concrete implementations. Those steps are easier to review when they happen in host setup before the kernel. Race Management =============== The documented pair loops update both members of a pair. Different pair iterations can therefore update the same node concurrently. The RAJA SPH path handles this with atomics where shared outputs are unavoidable. Examples include: * ``GPUUtils::AtomicAddOp::apply`` for scalar accumulation. * ``GPUUtils::AtomicMaxOp::apply`` for maxima. * Value-type methods such as ``atomicAdd`` and ``atomicSub`` on vector and tensor data. Before porting a loop, identify each output and classify it: Private per iteration No atomic is needed. One writer per element No atomic is needed if the one-writer property is guaranteed by the loop structure. Multiple writers per element Use atomics, restructure the algorithm, or split the operation into staged reductions. Python and PYB11 Integration ============================ Spheral GPU ports often need Python-facing plumbing, not just C++ kernels. The RAJA SPH work added C++ classes such as ``SPH_RAJA`` and ``SolidSPH_RAJA`` rather than replacing ``SPH`` and ``SolidSPH`` in place. Those classes were then exposed through PYB11 and selected from the Python ``SPHHydros.py`` helper when the caller passes ``RAJA=True``. This staged approach has several advantages: * Existing Python scripts keep their default behavior. * New behavior is opt-in and easy to compare against the original path. * Tests can select the RAJA path explicitly. * C++ and Python review can happen together, so the public construction API is not an afterthought. When a future port adds a new C++ implementation class, check the full exposure path: * Is the class added to the package CMake instantiation lists? * Is there a PYB11 wrapper for the new class? * Is the wrapper included and instantiated for the enabled dimensions? * Does the Python helper expose an explicit option for selecting the new path? * Do functional tests exercise that Python option? Testing Strategy ================ GPU-porting tests should start at the smallest boundary that can fail. View tests Verify construction, copy behavior, indexing, movement, and touch semantics. Managed pointer tests Verify ``chai::managed_ptr`` capture, dynamic cast, and virtual calls inside RAJA loops. Loop tests Verify simple pair-list traversal and atomic updates under the enabled execution policies. Functional tests Verify that Python-facing options select the intended implementation and that physics behavior remains acceptable. Performance tests Isolate the ported kernel or method so timing changes can be interpreted. The RAJA ``evaluateDerivatives`` work used all of these categories: low-level view tests, managed pointer tests, loop tests, functional Noh tests, and an isolated ``evaluateDerivatives`` performance driver. Design Checklist for Future Ports ================================= Before implementing a new GPU-ready class or algorithm, write down: * The owning object and the behavior it must preserve. * The minimal data needed by device code. * The proposed view type and its lifetime. * Which methods must be ``SPHERAL_HOST_DEVICE``. * Whether polymorphism is required on device. * How CHAI movement will be triggered and audited. * Which writes can race and how races are handled. * How the C++ implementation is exposed to Python, if applicable. * Which unit, loop, functional, and performance tests will validate the change. Reference Implementation Map ============================ The following files are useful starting points when applying these patterns: * ``src/Field/Field.hh`` and ``src/Field/FieldView.hh``: owning field object and compact field view. * ``src/Field/FieldList.hh`` and ``src/Field/FieldListView.hh``: host collection of fields and device-usable collection of views. * ``src/Neighbor/NodePairList.hh`` and ``src/Neighbor/NodePairListView.hh``: host pair-list owner and device pair-list view. * ``src/ArtificialViscosity/ArtificialViscosity.hh`` and ``src/ArtificialViscosity/ArtificialViscosityView.hh``: host physics package interface and managed device-view interface. * ``src/SPH/SPH_RAJA.cc`` and ``src/SPH/SolidSPH_RAJA.cc``: full RAJA algorithm examples. * ``src/PYB11/SPH/SPH_RAJA.py`` and ``src/SPH/SPHHydros.py``: Python exposure and opt-in construction path. * ``tests/cpp/Field``, ``tests/cpp/Neighbor``, ``tests/cpp/Loops``, and ``tests/cpp/Utilities/managedptr_tests.cc``: tests for the building blocks used by GPU-ready code.