Stabilized asynchronous fast adaptive composite multigrid using additive damping

Multigrid solvers face multiple challenges on parallel computers. Two fundamental ones read as follows: Multiplicative solvers issue coarse grid solves which exhibit low concurrency, and many multigrid implementations suffer from an expensive coarse grid identification phase plus adaptive mesh refinement overhead. We propose a new additive multigrid variant for spacetrees, that is, meshes as they are constructed from octrees and quadtrees: It is an additive scheme, that is, all multigrid resolution levels are updated concurrently. This ensures a high concurrency level, while the transfer operators between the mesh levels can still be constructed algebraically. The novel flavor of the additive scheme is an augmentation of the solver with an additive, auxiliary damping parameter per grid level per vertex that is in turn constructed through the next coarser level, an idea which utilizes smoothed aggregation principles or the motivation behind AFACx: Per level, we solve an additional equation whose purpose is to damp overly aggressive solution updates per vertex which would otherwise, in combination with all the other levels, yield an overcorrection and, eventually, oscillations. This additional equation is constructed additively as well, that is, it is once more solved concurrently to all other equations. This yields improved stability, closer to what is seen with multiplicative schemes, while pipelining techniques help us to write down the additive solver with single-touch semantics for dynamically adaptive meshes.

Equation (1) serves as a building block in many applications. Examples are chemical dispersion in subsurface reservoirs, the heat distribution in buildings, or the diffusion of oxygen in tissue. It is also the starting point for constructing more complex differential operators. Solving this PDE quickly is important yet not trivial. One reason is buried within the operator: any local modification of the solution propagates through the whole computational domain, though this effect can be damped by large ε variations. The operator exhibits multiscale behavior. A successful family of iterative techniques to solve (1) hence is multigrid. It relies on representations of the operator's behavior on multiple scales and builds this multiscale behavior into the algorithm.
There are numerical, algorithmic, and implementational hurdles that must be tackled when we write multigrid codes. In this article, we focus on three algorithmic/implementation challenges which should be addressed before we scale up multigrid. (i) State-of-the-art multigrid codes have to support dynamically adaptive mesh refinement (AMR) without significant overhead. While constructing coarser (geometric) representations from regular grids is straightforward, it is nontrivial for adaptive meshes. Furthermore, setup phases become costly if the ε distribution is complex,1 or if the mesh changes frequently.2 (ii) If an algorithm solves problems on cascades of coarser and coarser, that is, smaller and smaller, problems, the smallest problems eventually do not exhibit enough computational work to scale across larger core counts. For adaptive meshes, such low-concurrency phases can, depending on the implementation, arise for fine grid solves, too, if we run through the multigrid levels resolution by resolution. (iii) If an algorithm projects a problem onto multiple resolutions and then constructs a solution from these resolutions, its implementation tends to read and write data multiple times using indirect or scattered data accesses. Repeated data access however is poisonous on today's hardware, which suffers from a widening gap between what cores could compute and what throughput the memory can provide.
Numerous concepts have been proposed to tackle this triad of challenges. We refrain from a comprehensive overview but sketch some particularly popular code design decisions: Many successful multigrid codes use a cascade of geometric grids which are embedded into each other, and then run through the resolutions level by level, that is, embedding by embedding. This simplifies the coarse grid identification and, given a sufficiently homogeneous refinement pattern, implies that a fine grid decomposition induces a fair partitioning of coarser resolutions. Many codes actually start from a coarse resolution and make the refinement yield the finer mesh levels:4 Combining this with algebraic coarse grid identification once the grid is reasonably coarse adds additional flexibility to a geometric scheme. It allows codes to treat massive ε variations (or complex geometries) through a coarse mesh,[5][6][7] whereas the geometric multigrid component handles the bulk of the compute work and exploits structured grids. Many successful codes furthermore use classic rediscretization on finer grids and employ expensive algebraic operator computation only on coarser meshes. They avoid the algebraic overhead of assembling accurate fine grid operators.8 Many large-scale codes do not use the recursive multigrid paradigm across all possible mesh resolutions, but switch to alternative solvers such as Krylov schemes for systems that are still relatively big.
Such hybrids6,9 avoid low-concurrency phases arising from very small multigrid subproblem solves or exact inversions of very small systems. Finally, our own work2,10 has studied rewrites of multigrid with single-touch semantics, that is, each unknown is, amortized, fetched from main memory into the caches once per grid sweep or additive cycle, respectively. Similar to other strategies such as pipelining or smoother and stencil optimization,[11][12][13] our single-touch rewrites reduce the data movement of solvers as well as indirect and scattered memory accesses. If meshes exhibit steep adaptivity, that is, refine specific subregions to be particularly detailed, or if problems have very strongly varying coefficients, all approaches will run into issues. These take the form of scalability challenges (from low-concurrency phases on the coarse mesh or situations where only small parts of the mesh are updated as we lack a global, uniform coarsening scheme), load balancing challenges (a geometric fine grid splitting does not map to coarser resolutions anymore and there is no inclusion property of multiscale domain partitions), or materialize in the memory overhead and performance penalty of algebraic multigrid.8 The majority of multigrid papers focus on its multiplicative form, as it exhibits superior convergence rates compared with additive alternatives; consequently, such formulations are more often used as preconditioners. Additive multigrid however remains an interesting solver alternative to multiplicative multigrid in an era of massive concurrency growth.3 In the present article, we propose an additive solver-implementation combination which tackles the introductory challenges by combining three key concepts. The first concept is a purely geometric construction through the spacetree paradigm, a generalization of the classic octree/quadtree idea.14,15 Spacetrees yield adaptive Cartesian grids which are nested within each other.
Adaptivity2,10,14,15 decreases the cost of a solve by reducing the number of degrees of freedom without adversely affecting the accuracy. Additional computational effort is invested where it improves the solution significantly. With complex boundary conditions or a nontrivial ε, or even ε(u), which renders (1) nonlinear, the regions to refine may not be known a priori, might be expensive to determine, or may depend on the right-hand side (if the PDE is employed within a time-stepping scheme, e.g.). Schemes that allow for dynamic mesh refinement are therefore key for many applications. On top of this, the spacetree idea yields a natural multiresolution cascade well-suited for multigrid (cf. (i) above). The second concept is the increase of asynchronicity and concurrency through additive multigrid (cf. (ii) above). Our final concept is the application of mature implementation patterns to our algorithms such that we obtain a single-touch multigrid solver with low memory footprint (cf. (iii) above): We rely on the triad of fast adaptive composite (FAC), hierarchical transformation multigrid (HTMG),16 and full approximation storage (FAS).17 These three techniques allow us to elegantly realize a multigrid scheme which works straightforwardly for dynamically adaptive meshes. We merge it with quasi-matrix-free multigrid relying on algebraic BoxMG intergrid transfer operators.
We also utilize pipelining combined with recursive elementwise grid traversals.2,18,19 We run through the spacetree depth-first, which yields excellent cache hit rates14 and simple recursive implementations. The approach equals a multiscale elementwise grid traversal. Going from coarse levels to fine levels and backtracking however does not fit straightforwardly to FAS with additive multigrid, where we restrict the residual from fine to coarse, prolong the correction from coarse to fine, and inject the solution from fine to coarse again. Yet, we know that some additional auxiliary variables allow us to write additive solvers as single-touch:10 Each unknown is read into the chip's caches once per cycle.
None of the listed ingredients or their implementation flavors is new. Our novel contribution is a modification of the additive formulation plus the demonstration that this modification still fits with the other presented ideas: Plain additive approaches face a severe problem. They are less robust than their multiplicative counterparts. Naïvely restricting residuals to multiple levels and eliminating errors concurrently tends to make the iterative scheme overshoot.20,21 Multiple strategies exist to improve the stability without compromising on the additivity. In the simplest case, additive multigrid is employed as a preconditioner and a more robust solver is used on top of it. Our work goes down the "multigrid as a solver" route: A well-known approach to mitigate overshooting in the solver is to damp levels more aggressively the coarser they are. This reduces their impact and improves stability but decreases the rate of convergence.10 We refrain from such resolution-parameterized damping and follow up on the idea behind AFACx:[22][23][24][25] By introducing an additional correction component per level, our approach predicts additive overshooting from coarser levels. Different from AFACx, we however do not make the additional auxiliary solves preprocessing steps. We phrase them in a completely parallel (additive) way to the actual correction's solve. To make the auxiliary contributions meaningful nevertheless, we tweak them through ideas resembling smoothed aggregation,[26][27][28] which approximate the smoothing steps of multiplicative multigrid.29 We end up with an additively damped Asynchronous FAC (adAFAC).
Our adAFAC implementation merges the levels of the multigrid scheme plus their traversal into each other, and thus provides a single-touch implementation. Through this, we eliminate synchronization between the solves on different resolution levels and anticipate that FAC yields multigrid grid sequences where work nonmonotonically grows and shrinks upon each resolution transition. Additive literature usually emphasizes the advantage of "additivity" in that the individual levels can be processed independently. Recent work on a further decoupling of both the individual levels' solves as well as the solves within a level21 shows great upscaling potential. Our strategy allows us to head in the other direction: We vertically integrate solves,30 that is, we partition the finest mesh where the residual is computed, apply this decomposition vertically, to all mesh resolutions, and then merge the traversal of multiple meshes per subpartition. The multigrid challenge to balance not one mesh but a cascade of meshes becomes a challenge of balancing a single set of jobs again.
We reiterate which algorithmic ingredients we use in Section 2 before we introduce our new additive solver adAFAC in Section 3, the main new contribution of the present text. Section 4 then translates adAFAC into a single-touch algorithm blueprint. Some numerical results outline the solver's potential (Section 5); an extensive comparison of solver variants, upscaling studies, or the application to real-world problems are out of scope. Yet, adAFAC adds an interesting novel solver variant to the known suite of multigrid techniques available to scientists and engineers. We close the discussion with a brief summary and a sketch of future work.

Spacetrees
Our meshing relies upon a spacetree14,15 (Figure 1): The computational domain is embedded into a square (d = 2) or cube (d = 3), which yields a (degenerate) Cartesian mesh with one cell and 2^d vertices. We use cell as a generic synonym for square or cube. Each refined cell is subdivided into k parts per coordinate axis, that is, k^d children; the construction continues recursively while we decide per cell individually whether to refine further or not. The process creates a cascade of Cartesian grids Ω_{ℓ=0}, Ω_{ℓ=1}, Ω_{ℓ=2}, …. We count levels the other way round compared with most multigrid literature,17,31 which assigns the finest grid the level ℓ = 0. Our level grids might be ragged: Ω_ℓ is a regular grid covering the whole domain if and only if all cells on all levels l < ℓ are refined. We use k = 3; choosing three-partitioning is due to References 14,15 acting as the implementation baseline. All of our concepts however apply to bipartitioning, too.
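To make the recursive construction concrete, the following sketch is illustrative only (a hypothetical 1D analogue, not the paper's implementation): it builds a spacetree with three-partitioning and collects the leaf cells, once for uniform and once for adaptive refinement.

```python
# Minimal 1D spacetree sketch with three-partitioning (k = 3).
# Each cell is either a leaf or is split into k children; the
# refinement criterion is evaluated per cell individually, so
# the resulting cascade of level grids may be ragged.
def build(cell, level, refine):
    x0, x1 = cell
    if not refine(x0, x1, level):
        return [(x0, x1, level)]        # keep as a leaf cell
    h = (x1 - x0) / 3.0
    leaves = []
    for i in range(3):
        leaves += build((x0 + i * h, x0 + (i + 1) * h), level + 1, refine)
    return leaves

# Uniform refinement to depth 2 yields a regular grid of 3^2 = 9 cells.
uniform = build((0.0, 1.0), 0, lambda x0, x1, l: l < 2)

# Adaptive refinement towards x = 0 yields a ragged cascade of level grids.
adaptive = build((0.0, 1.0), 0, lambda x0, x1, l: l < 4 and x0 == 0.0)

print(len(uniform), len(adaptive))
```

The adaptive variant keeps only two leaf cells per intermediate level near x = 0, illustrating how steep adaptivity concentrates cells on few, deep levels.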
Our code discretizes (1) with d-linear finite elements. Each vertex on a level ℓ that is surrounded by 2^d cells on level ℓ carries one "pagoda," that is, a bi- or trilinear shape function. The remaining vertices are hanging vertices. Testing shape functions against other functions from the same level yields compact 3^d stencils. For this, we make hanging and boundary vertices carry truncated shape functions but no test functions. A discussion of Neumann conditions is out of scope. We therefore may assume that the scaling of the truncated shapes along the boundary is known. Due to the spacetree's construction pattern, stencils act on a nodal generating system over an adaptive Cartesian grid Ω_h = ∪_ℓ Ω_ℓ. If we study (1) only over the vertices from all levels that carry a shape function and do not spatially coincide with any other vertex of the grid from finer levels, we obtain a nodal shape space over the adaptive Cartesian grid Ω_h.
Let ℓ_max identify the finest mesh, that is, the maximum level, while ℓ_min ≥ 1 is the coarsest level which holds degrees of freedom; ℓ_max ≥ ℓ_min. For our benchmarking, ℓ_min = 1 is appropriate. However, a bigger ℓ_min might be reasonable if a problem's solution cannot be represented accurately on the coarsest meshes anymore or performance arguments imply that it is not reasonable to continue to use multigrid.7,9 We neglect such solver hybrids and emphasize that our new additive solver allows us to use a rather small ℓ_min. Our subsequent discussion introduces the linear algebra ingredients for a regular grid corresponding to ℓ_max. The elegant handling of the adaptive grid is subject of a separate subsection, where we exploit the transition from a generating system into a basis. Without loss of generality, (1) is thus discretized into
A_{ℓ_max} u_{ℓ_max} = b_{ℓ_max}.

Additive and multiplicative multigrid
Additive multigrid reads as

u^{(n+1)}_{ℓ_max} = u^{(n)}_{ℓ_max} + Σ_{ℓ=ℓ_min}^{ℓ_max} ω_add(ℓ) P^{ℓ_max−ℓ} M^{−1}_ℓ R^{ℓ_max−ℓ} (b_{ℓ_max} − A_{ℓ_max} u^{(n)}_{ℓ_max}),   (2)

where M_ℓ is an approximation to A_ℓ. We use the Jacobi smoother M^{−1}_ℓ = diag^{−1}(A_ℓ) on all grid levels ℓ. No alternative (direct) solver or update scheme is employed on any level. The generic prolongation symbol P accepts a solution on a particular level ℓ − 1 and projects it onto the next finer level ℓ. The exponent indicates repeated application of this intergrid transfer operator. Restriction works the other way round, that is, projects from finer onto coarser meshes. Ritz-Galerkin multigrid17 finally yields A_ℓ = R A_{ℓ+1} P for ℓ < ℓ_max.
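The structure of such an additive cycle can be sketched numerically. The following is an illustrative toy setup only (a 1D Poisson problem with bisection instead of the paper's three-partitioning; all variable names are hypothetical): two levels, Jacobi on both, a Ritz-Galerkin coarse operator, and one shared residual feeding both level updates.

```python
import numpy as np

def poisson_1d(n):
    # 1D Laplacian stencil [-1, 2, -1] on n interior points
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def linear_prolongation(nc):
    # d-linear interpolation for bisection; nf = 2*nc + 1 fine points
    nf = 2 * nc + 1
    P = np.zeros((nf, nc))
    for j in range(nc):
        P[2 * j + 1, j] = 1.0      # spatially coinciding vertex
        P[2 * j, j] += 0.5         # interpolated neighbours
        P[2 * j + 2, j] += 0.5
    return P

nc = 3
nf = 2 * nc + 1
A = poisson_1d(nf)
P = linear_prolongation(nc)
R = P.T                            # restriction as transpose of P
Ac = R @ A @ P                     # Ritz-Galerkin coarse operator
Minv = np.diag(1.0 / np.diag(A))   # Jacobi smoother, fine level
Mcinv = np.diag(1.0 / np.diag(Ac)) # Jacobi smoother, coarse level

omega = 0.2                        # constant damping omega_add
b = np.ones(nf)
u = np.zeros(nf)
res0 = np.linalg.norm(b - A @ u)
for _ in range(300):
    r = b - A @ u                  # ONE residual feeds ALL levels
    u = u + omega * (Minv @ r + P @ (Mcinv @ (R @ r)))
res = np.linalg.norm(b - A @ u)
print(res < 0.5 * res0)
```

Both level contributions consume the same residual, so they could be evaluated concurrently; the conservative constant damping already hints at the stability issue discussed next.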
For an ℓ-independent, constant ω_add(ℓ) ∈ ]0, 1], additive multigrid tends to become unstable once ℓ_max − ℓ_min becomes large:10,20,32 If the fine grid residual b_{ℓ_max} − A_{ℓ_max} u_{ℓ_max} is homogeneously distributed, the restricted residuals all produce similar corrections, effectively attempting to reduce the same error multiple times. Summation of all level contributions then moves the solution too aggressively in this direction. A straightforward fix is exponential damping ω_add(ℓ) = ω_add^{ℓ_max − ℓ} with a fixed ω_add ∈ ]0, 1[. If an adaptive mesh is used, ℓ_max − ℓ is ill-suited as there is no global ℓ_max hosting the solution. We introduce an appropriate, adaptive damping in Reference 10, where we make ℓ_max a per-vertex property. It is derived from a tree grammar.33 Such exponential damping, while robust, struggles to track global solution effects efficiently once many mesh levels are used: The coarsest levels make close to no contribution to the solution.
Multiplicative multigrid is more robust than additive multigrid by construction. It does not feed one residual into all level updates in one rush, but updates the levels one after another. It starts with the finest level. Before it transitions from a fine level to the next coarser level, it runs some approximate solves (smoothing steps) on the current level to yield a new residual. We may assume that the error represented by this residual is smooth. Yet, the representation becomes rough again on the next level, where we become able to smooth it efficiently again. Cascades of smoothers act on cascades of frequency bands. Multiplicative methods are characterized by the number of pre- and postsmoothing steps ν_pre and ν_post, that is, the number of relaxation steps before we move to the next coarser level (pre) or next finer level (post), respectively. The multiplicative multigrid solve closest to the additive scheme is a V(0, ν_post)-cycle, that is, a scheme without any presmoothing and with ν_post postsmoothing steps. Different from real additive multigrid, the effect of smoothing on a level ℓ here does feed into the subsequent smoothing on ℓ + 1. Since ν_pre = 0 yields no classic multiplicative scheme (the resulting solver does not smooth prior to the coarsening), we conclude that the V(ν_pre = 1, 0)-cycle is the (robust) multiplicative scheme most similar to an additive scheme. The multiplicative two-grid scheme with exact coarse grid solve reads
u^{(n+1/2)}_{ℓ_max} = u^{(n)}_{ℓ_max} + M^{−1}_{ℓ_max} (b_{ℓ_max} − A_{ℓ_max} u^{(n)}_{ℓ_max}),
u^{(n+1)}_{ℓ_max,mult} = u^{(n+1/2)}_{ℓ_max} + P A^{−1}_{ℓ_max−1} R (b_{ℓ_max} − A_{ℓ_max} u^{(n+1/2)}_{ℓ_max}).   (3)

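The V(1,0) cycle can be sketched on the same kind of toy problem as before. The following is an illustrative 1D setup only (bisection, damped Jacobi with a conventional choice ω = 2/3; all names are hypothetical, not the paper's code): smooth once, recompute the residual, then correct with an exact coarse solve.

```python
import numpy as np

n_c = 3
n_f = 2 * n_c + 1
A = 2.0 * np.eye(n_f) - np.eye(n_f, k=1) - np.eye(n_f, k=-1)  # 1D Laplacian
P = np.zeros((n_f, n_c))                                      # linear interpolation
for j in range(n_c):
    P[2 * j + 1, j] = 1.0
    P[2 * j, j] += 0.5
    P[2 * j + 2, j] += 0.5
R = P.T
Ac = R @ A @ P                                # Ritz-Galerkin coarse operator
Minv = (2.0 / 3.0) * np.diag(1.0 / np.diag(A))  # damped Jacobi presmoother

b = np.ones(n_f)
u = np.zeros(n_f)
res0 = np.linalg.norm(b - A @ u)
for _ in range(50):
    u = u + Minv @ (b - A @ u)                 # nu_pre = 1 smoothing step
    r = b - A @ u                              # residual AFTER smoothing ...
    u = u + P @ np.linalg.solve(Ac, R @ r)     # ... feeds the coarse solve
print(np.linalg.norm(b - A @ u) / res0)
```

The crucial difference to the additive sketch is the sequencing: the coarse level sees the residual of the already smoothed iterate, which is exactly what makes the scheme robust but serializes the levels.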
Multigrid on hierarchical generating systems
Early work on locally adaptive multigrid (see References 22,34 as well as the historical notes in Reference 24) already relies on block-regular Cartesian grids35 and nests the geometric grid resolutions into each other. The coarse grid vertices spatially coincide with finer vertices where the domain is refined. This yields a hierarchical generating system rather than a basis. The fast adaptive composite (FAC) method32,36 describes a multiplicative multigrid scheme over this hierarchical system: We start from the finest grid, determine the residual equation there, smooth, recompute the residual, and restrict it to the next coarser level. This continues recursively. As we compute corrections using residual equations, this is a multigrid scheme. As we sequentially smooth and then recompute the residual, it is a multiplicative scheme. Early FAC papers operate on the assumption of a small set of reasonable fine grids and leave it open to the implementation which iterative scheme to use. Some explicitly speak of FAC-MG if a multigrid cycle is used as the iterative smoother per level. We may refrain from such details and consider fast adaptive composite grid a multiplicative scheme overall which can be equipped with simple single-level smoothers.
The first fast adaptive composite grid papers36 acknowledge difficulties for operators along the resolution transitions. While we discuss an elegant handling of these difficulties in Section 2.4, fast adaptive composite grid traditionally addresses them through a top-down traversal:32 The cycle starts with the coarsest grid, and then uses the updated solution to impose Dirichlet boundary conditions on hanging nodes on the next finer level. This inversion of the grid-level order continues to yield a multiplicative scheme, as updates on coarser levels immediately propagate down and as all steps are phrased as residual update equations.
FAC relies on spatial discretizations that are conceptually close to our spacetrees. Both approaches thus benefit from structural simplicity: As the grid segments per level are regular, solvers (smoothers) for regular Cartesian grids can be (re)used. As the grid resolutions are aligned with each other, hanging nodes can be assigned interpolated values from the next coarser grid with a geometrically inspired prolongation. As all grid entities are cubes, squares, or lines, all operators exhibit tensor-product structure. FAC's hierarchical basis differs from textbook multigrid17 for adaptive meshes: The fine grid smoothers do not address the real fine grid, but only separate segments that have the same resolution. The transition from fine to coarse grid does not imply that the number of degrees of freedom decreases. Rather, the number of degrees of freedom can increase if the finer grid accommodates a very localized AMR region. It is obvious that this poses challenges for parallelization.
We can mechanically rewrite multiplicative FAC into an additive version. The hierarchical generating system renders this endeavor straightforward. However, plain additive multigrid on a FAC data structure again yields a nonrobust solver that tends to overcorrect.10,32 There are multiple approaches to tackle this: Additive multigrid with exponential damping removes oscillations from the solution at the cost of multigrid convergence behavior. Bramble, Pasciak, and Xu's scheme (BPX)31 is the most popular variant, where we accept the nonrobustness and use the additive scheme solely as a preconditioner. To make this preconditioner cheap, BPX traditionally neglects (Ritz-Galerkin) coarse grid operators. Instead, it replaces the M^{−1}_ℓ in (2) with a diagonal matrix for the correction equations, where the diagonal matrix is scaled such that it mimics the Laplacian. The hierarchical basis approach starts from the observation that the instabilities within the generating system are induced by spatially coinciding vertices. It therefore drops all vertices (and their shape functions) on one level that coincide with coarser vertices. The asynchronous fast adaptive composite grid (AFAC) solver family finally modifies the operators to anticipate overshooting. We may read BPX as a particular modification of additive multigrid and AFAC as a generalization of BPX.37

HTMG and FAS on spacetrees
Although the implementation of multigrid on adaptive meshes is, in principle, straightforward, implementational complexity arises along resolution transitions. Weights associated with the vertices change semantics once we compare vertices on a level ℓ which are surrounded by refined spacetree cells to vertices on that level which belong to the fine grid: The latter carry a nodal solution representation, that is, a scaling of the finite element shape functions, while the former carry correction weights. In classic multigrid starting from a fine grid and then traversing correction levels, it is not straightforward how to handle the vertices on the border between a fine grid region and a refined region within one level. They have to be separated carefully.2,10 One elegant solution to address this ambiguity relies on full approximation storage (FAS):17 Every vertex holds a nodal solution representation. If two vertices v_ℓ and v_{ℓ+1} from two levels spatially coincide, the coarser vertex holds a copy of the finer vertex: In areas where two grids overlap, the coarse grid additionally holds the injection u_ℓ = I u_{ℓ+1} of the fine grid. This definition exploits the regular construction pattern of spacetrees. Vertices in refined areas now carry a correction equation plus the injected solution rather than a sole correction. The injection couples the fine grid problem with its coarsened representation and makes this representation consistent with the fine grid problem on adjacent meshes which have not been refined further. In the present article, we use FAS exclusively to resolve the semantic ambiguity that arises for vertices at the boundary between the fine grid and a correction region on one level; further potential such as τ-extrapolation38 or the application to nonlinear PDEs, that is, ε = ε(u) in (1), is not exploited.
Our code relies on hierarchical transformation multigrid (HTMG)16 for the implementation of the full approximation storage scheme. It also relies on the assumption/approximation that all of our operators can be approximated by Ritz-Galerkin multigrid, R A_{ℓ+1} P = A_ℓ. Injection u_ℓ = I u_{ℓ+1} allows us to rewrite each and every nodal representation into its hierarchical representation û = (id − P I) u. A hierarchical residual r̂ is defined in the expected way. This elegantly yields the modified multigrid equation (4) when we switch from the sole correction equation to per-level equations formulated in the full approximation u_ℓ = I u_{ℓ+1} + c_ℓ. To the smoother, the u_ℓ resulting from the injection serves as the initial guess. Subsequently, it determines a correction c_ℓ. This correction feeds into the multigrid prolongation.
Equation (4) clarifies that the right-hand side of this full approximation storage scheme does not require a complicated calculation: We "simply" have to determine the hierarchical representation û on the finer level, compute a hierarchical residual r̂ on this level (which uses the smoother's operator), and restrict this value to the coarse grid's right-hand side.
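The hierarchical transform itself is a purely algebraic operation. The following sketch is illustrative only (1D, bisection, hypothetical names): it applies injection and forms the surplus û = (id − P I) u, showing that the surplus vanishes at every vertex that coincides with a coarse vertex, and vanishes everywhere for a coarse-representable solution.

```python
import numpy as np

n_c = 3
n_f = 2 * n_c + 1
P = np.zeros((n_f, n_c))        # linear interpolation between the levels
for j in range(n_c):
    P[2 * j + 1, j] = 1.0       # fine vertex 2j+1 coincides with coarse j
    P[2 * j, j] += 0.5
    P[2 * j + 2, j] += 0.5

def inject(u_f):
    # FAS injection: a coarse vertex copies the value of the
    # spatially coinciding fine vertex
    return u_f[1::2]

rng = np.random.default_rng(42)
u_f = rng.standard_normal(n_f)
u_c = inject(u_f)               # coarse level holds I u
u_hat = u_f - P @ u_c           # hierarchical surplus (id - P I) u

# the surplus vanishes at every vertex coinciding with a coarse vertex
print(np.max(np.abs(u_hat[1::2])))

# a coarse-representable fine solution has zero surplus everywhere
v = P @ rng.standard_normal(n_c)
print(np.max(np.abs(v - P @ inject(v))))
```

In other words, only the genuinely fine-level part of the solution contributes to the hierarchical residual, which is what keeps the coarse right-hand side cheap to assemble.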

BoxMG and algebraic-geometric multigrid hybrids
BoxMG is a geometrically inspired algebraic technique18,19,39 to determine intergrid transfer operators on meshes that are embedded into each other. We assume that the prolongation from a coarse vertex maps onto the nullspace of the fine grid operator. However, BoxMG does not examine the "real" operator. Instead, it studies an operator which is collapsed along the coarse grid cell boundaries.
All fine grid points are classified into c-points (coinciding spatially with coarse grid points of the next coarser level), γ-points, which coincide with the faces of the next coarser level, and f-points. Prolongation and restriction are defined as the identity on c-points. Along γ-points, we collapse the stencil: If stencil members reside on a face with normal n, the stencil is accumulated (lumped) along the n direction. The result contains only entries along the non-n directions. Higher-dimensional collapsing can be constructed iteratively. We solve Ã P e = 0|_γ, where Ã stems from the collapsed operators, along these γ-points; e is the characteristic vector for a vertex on the coarse grid, that is, it holds one entry 1 and zeroes everywhere else. Finally, we solve A P e = 0|_f for the remaining points. No two f-points separated by a coarse grid line are coupled to each other anymore.
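In 1D there are no face points, so the construction reduces to enforcing A P e = 0 at the f-points. The sketch below is an illustrative BoxMG-style construction under simplifying assumptions (1D, bisection instead of three-partitioning, tridiagonal operators; all names are hypothetical): identity on c-points, then row-wise elimination at f-points.

```python
import numpy as np

def operator_dependent_P(A, n_c):
    # BoxMG-style prolongation in 1D for a tridiagonal operator A:
    # identity on c-points, then enforce (A P e)|_f = 0 row by row.
    n_f = 2 * n_c + 1
    P = np.zeros((n_f, n_c))
    for j in range(n_c):
        P[2 * j + 1, j] = 1.0                   # c-points: identity
    for i in range(0, n_f, 2):                  # f-points (even fine indices)
        if i > 0:
            P[i, :] += (-A[i, i - 1] / A[i, i]) * P[i - 1, :]
        if i < n_f - 1:
            P[i, :] += (-A[i, i + 1] / A[i, i]) * P[i + 1, :]
    return P

def fem_1d(eps):
    # 1D FEM matrix for -d/dx(eps du/dx); eps holds one value per interval
    n = len(eps) - 1                            # interior vertices
    A = np.zeros((n, n))
    for i in range(n):
        A[i, i] = eps[i] + eps[i + 1]
        if i + 1 < n:
            A[i, i + 1] = -eps[i + 1]
            A[i + 1, i] = -eps[i + 1]
    return A

# for the plain Laplacian the construction recovers linear interpolation
A1 = fem_1d([1.0] * 8)
P1 = operator_dependent_P(A1, 3)
print(P1[0, 0], P1[2, 0], P1[2, 1])

# for a jumping coefficient, the f-point rows of A @ P vanish by construction
Aj = fem_1d([1.0, 1.0, 1.0, 1.0, 10.0, 10.0, 10.0, 10.0])
Pj = operator_dependent_P(Aj, 3)
print(np.max(np.abs((Aj @ Pj)[0::2, :])))
```

For constant coefficients the operator-dependent weights collapse to plain linear interpolation; with a coefficient jump they automatically adapt to the operator.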
In our previous work,2 we detail how to store BoxMG's operators as well as all Ritz-Galerkin operators which typically supplement BoxMG within the spacetree. This yields a hybrid scheme in two ways: On the one hand, BoxMG itself is a geometrically inspired way to construct algebraic intergrid transfer operators. Storing the entries within the spacetree, on the other hand, allows for a "matrix-free" implementation where no explicit matrix structure is held but all matrix entries are embedded into the mesh. With an on-the-fly compression of entries relative to rediscretized values,2 we effectively obtain the total memory footprint of a matrix-free scheme.

ADDITIVELY DAMPED AFAC SOLVERS WITH FAS AND HTMG
With our ingredients and observations at hand, our research agenda reads as follows: We first introduce our additive multigrid scheme which avoids oscillations without compromising on convergence speed. Second, we discuss two operators suited to realize our scheme. Finally, we contextualize this idea and show that the new solver actually belongs to the family of AFAC solvers.

An additively damped additive multigrid solver
Both additive and multiplicative multigrid sum up all the levels' corrections. Multiplicative multigrid is more stable than additive multigrid (it does not overshoot), as each level eliminates error modes tied to its resolution. In practice, we cannot totally separate error modes, and we cannot assume that a correction on level ℓ does not introduce a new error on level ℓ + 1. Multigrid solvers thus often use postsmoothing. Once we ignore this multiplicative lesson, the simplest class of multiplicative solvers is V(ν_pre = 1, 0). We start with our recast of the multiplicative V(1, 0) two-grid cycle (3) into the additive formulation (2). Our objective is to quantify additive multigrid's overcorrection relative to its multiplicative cousin. For this, we compare the multiplicative two-grid scheme u_{ℓ_max,mult} from (3) to the two-level additive scheme with an exact solve on the coarse level,

u^{(n+1)}_{ℓ_max,add} = u^{(n)}_{ℓ_max} + M^{−1}_{ℓ_max} r + P A^{−1}_{ℓ_max−1} R r with r = b_{ℓ_max} − A_{ℓ_max} u^{(n)}_{ℓ_max}.

The difference is

u^{(n+1)}_{ℓ_max,mult} − u^{(n+1)}_{ℓ_max,add} = −P A^{−1}_{ℓ_max−1} R A_{ℓ_max} M^{−1}_{ℓ_max} (b_{ℓ_max} − A_{ℓ_max} u^{(n)}_{ℓ_max}).   (5)

The superscripts (n) and (n + 1) denote old and new iterates of a vector, respectively. We continue to omit them from here on where possible.
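The difference between the two updates can be checked numerically. The following sketch is an illustrative 1D model problem only (bisection, Jacobi, hypothetical names): it performs one additive and one multiplicative V(1,0) two-grid update from the same iterate and compares their difference against the closed-form term.

```python
import numpy as np

rng = np.random.default_rng(7)
n_c = 3
n_f = 2 * n_c + 1
A = 2.0 * np.eye(n_f) - np.eye(n_f, k=1) - np.eye(n_f, k=-1)
P = np.zeros((n_f, n_c))
for j in range(n_c):
    P[2 * j + 1, j] = 1.0
    P[2 * j, j] += 0.5
    P[2 * j + 2, j] += 0.5
R = P.T
Ac = R @ A @ P
Minv = np.diag(1.0 / np.diag(A))          # Jacobi smoother

u = rng.standard_normal(n_f)
b = rng.standard_normal(n_f)
r = b - A @ u

# additive two-grid with exact coarse solve: both terms use the same r
u_add = u + Minv @ r + P @ np.linalg.solve(Ac, R @ r)

# multiplicative V(1,0): smooth first, THEN restrict the new residual
u_s = u + Minv @ r
u_mult = u_s + P @ np.linalg.solve(Ac, R @ (b - A @ u_s))

# predicted difference: -P Ac^{-1} R A M^{-1} r
diff = -P @ np.linalg.solve(Ac, R @ (A @ (Minv @ r)))
print(np.max(np.abs((u_mult - u_add) - diff)))
```

The printed mismatch is at machine-precision level: the entire robustness gap of the additive update is concentrated in this one term, which is what the damping construction targets.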
Starting from the additive rewrite of the V(1, 0) multiplicative two-level scheme, we intend to express multiplicative multigrid as an additive scheme. This is a popular endeavor, as additive multigrid tends to more readily show improved performance in large-scale parallel implementations. There is no close-to-serial coarse grid solve. There is no coarse grid bottleneck in an Amdahl sense. Multiplicative multigrid however tends to converge faster and is more robust. Different from popular approaches such as Mult-additive,21,29 our approach does not aim to achieve the exact convergence rate of multiplicative multigrid. Instead, we aim to mimic the robustness of multiplicative multigrid in an additive regime, that is, allow additive multigrid to converge successfully across a wider range of setups. Our hypothesis is that any gain in concurrency will eventually outperform efficiency improvements on future machines. A few ideas guide our agenda:

Idea 1. We add an additional one-level term to our additive scheme which compensates for additive multigrid's overly aggressive updates compared with multiplicative V(1, 0) multigrid.

This idea describes the rationale behind (5), where we stick to a two-grid formalism. Our strategy next is to find an approximation to the difference term from (5) such that we obtain a modified additive two-grid scheme which, on the one hand, mimics multiplicative stability and, on the other hand, is cheap. For this, we read the difference term as an auxiliary solve.
Idea 2. We approximate the auxiliary term (6) with a single smoothing step.
The approach yields the per-level correction

−P M̃^{−1}_{ℓ−1} R A_ℓ M^{−1}_ℓ (b_ℓ − A_ℓ u_ℓ).   (7)

We use the tilde to denote the auxiliary solves. Following Idea 1, this is a per-level correction: When we regeneralize the scheme from two grids to multigrid (by a recursive expansion of A^{−1}_{ℓ_max−1} within the original additive formulation), we do not further expand the correction (6) or (7). This implies another error which we accept in return for a simplistic correction term without additional synchronization or data flow between levels.

Idea 3. The damping runs asynchronously to the actual solve. It is another additive term computed concurrently to each correction equation.
The smoothing step M^{−1}_{ℓ_max} adds a sequential ingredient to the damping term: A fine grid solve must be finished before its result can enter the auxiliary equation. This reduces concurrency. Therefore, we propose to merge this preamble smoothing step into the restriction. Smoothed aggregation typically uses a simple aggregation/restriction operator and then improves it by applying a smoother. Our approach is also similar to Mult-additive,29 which constructs intergrid transfer operators that pick up multiplicative pre- or postsmoothing behavior. We apply the smoothed operator concept to the restriction, R̃ = R A M^{−1}, and end up with a wholly additive correction term (8).

Idea 4. We geometrically identify the auxiliary coarse grid levels with the actual multilevel grid hierarchy. All resolution levels integrate into the spacetree.
M̃ and Ã are auxiliary operators but act on mesh levels which we hold anyway. With the spacetree at hand, we finally unfold the two-grid scheme into the full multilevel formulation, where we set, without loss of generality, M̃^{−1}_{ℓ_min−1} = 0. This assumes that no level coarser than ℓ_min hosts any degree of freedom.
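As a consistency check of the construction, consider the two-grid case: with exact coarse and auxiliary solves, the damped additive update collapses to the multiplicative V(1,0) update, while the practical variant replaces both inverses by single Jacobi steps. The sketch below is illustrative only (1D, bisection, hypothetical names; the paper's auxiliary operators M̃ and Ã are simply taken as the coarse Jacobi operator here).

```python
import numpy as np

rng = np.random.default_rng(3)
n_c = 3
n_f = 2 * n_c + 1
A = 2.0 * np.eye(n_f) - np.eye(n_f, k=1) - np.eye(n_f, k=-1)
P = np.zeros((n_f, n_c))
for j in range(n_c):
    P[2 * j + 1, j] = 1.0
    P[2 * j, j] += 0.5
    P[2 * j + 2, j] += 0.5
R = P.T
Ac = R @ A @ P
Minv = np.diag(1.0 / np.diag(A))
Mcinv = np.diag(1.0 / np.diag(Ac))
R_smoothed = R @ A @ Minv            # smoothed restriction R~ = R A M^{-1}

u = rng.standard_normal(n_f)
b = rng.standard_normal(n_f)
r = b - A @ u

# adAFAC-like two-grid update: smoother + coarse correction + additive
# damping; all three terms consume the SAME residual and can run concurrently
u_jac = u + Minv @ r + P @ (Mcinv @ (R @ r)) - P @ (Mcinv @ (R_smoothed @ r))

# sanity check: with exact (auxiliary) coarse solves, the damped additive
# update collapses to the multiplicative V(1,0) two-grid update
u_exact = (u + Minv @ r
             + P @ np.linalg.solve(Ac, R @ r)
             - P @ np.linalg.solve(Ac, R_smoothed @ r))
u_s = u + Minv @ r
u_mult = u_s + P @ np.linalg.solve(Ac, R @ (b - A @ u_s))
print(np.max(np.abs(u_exact - u_mult)))
```

The design point is visible in the first update line: the damping term is just one more additive contribution built from the same residual, so no level has to wait for another.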
Algorithms in standard AFAC literature present all levels as correction levels: A global residual is computed on the composite grid and then restricted to construct the right-hand side of error equations on all grid resolutions. This includes the finest grid level. Here, we instead use the standard multigrid convention and directly smooth the finest grid level (Algorithms 1 and 2). We only restrict the residual to coarse grid levels.
The four ideas align with the three key concepts from the introduction: We stick to a geometric grid hierarchy and then also reuse this hierarchy for additional equation terms. We stick to an additive paradigm and then also make additional equation terms additive. We stick to a geometric-algebraic mindset.

Algorithm 1. Blueprint of one cycle of our adAFAC-Jac without AMR. R i or P i denote the recursive application of the restriction or prolongation, respectively.
R i applies R (i − 1) times, followed by an application of one smoothed operator R̃.
⊳ Additional restriction of the residual into the auxiliary grid space
end for
for all ℓmin < ℓ < ℓmax do
cℓ ← 0; c̃ℓ−1 ← 0 ⊳ Initial "guess" of correction and damping
⊳ Damping of the finest grid correction (one level coarser)

Two damping operator choices
The effectiveness of the approach obviously depends on a proper construction of (8). We propose two variants.27,28 Simple geometric transfer operators fail to capture complex solution behavior40-42 for nontrivial ε choices in (1).
Let ε in (1) be one. We observe that a smoothed operator M⁻¹Aε=1P, derived from bilinear interpolation P using a Jacobi smoother M⁻¹ = diag(A)⁻¹ for three-partitioning, corresponds to a stencil that is a restructured row of the full operator. Assuming ε = 1 is reasonable, as the term AεM⁻¹ or M̃⁻¹RAε, respectively, enters the auxiliary restriction. Such an expression removes the impact of ε (it yields the Laplacian) on all elements with nonvariable ε. Assuming ε is reasonably smooth, we neglect only small perturbations in the off-diagonals of the system matrix.
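A small 1D sketch (our own, assuming ε = 1 and a pure Laplacian, so the concrete stencil values are illustrative only) reproduces the effect described above: smoothing the full-weighting restriction for three-partitioning with a Jacobi step widens the stencil support, and the entries by which it grows are small and negative:

```python
import numpy as np

def poisson_1d(n):
    """1D Laplacian (3-point stencil), n interior points on the unit interval."""
    h = 1.0 / (n + 1)
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

def transfer_3part(n_coarse):
    """Linear interpolation P for factor-3 coarsening (three-partitioning) and
    full weighting R = (1/3) P^T; coarse vertex j coincides with fine vertex 3j + 2."""
    n_fine = 3 * (n_coarse + 1) - 1
    P = np.zeros((n_fine, n_coarse))
    for j in range(n_coarse):
        for off, w in ((-2, 1/3), (-1, 2/3), (0, 1.0), (1, 2/3), (2, 1/3)):
            P[3 * j + 2 + off, j] = w
    return P, P.T / 3.0

def span(row, tol=1e-12):
    """Width of a stencil row: distance between first and last nonzero entry."""
    idx = np.where(np.abs(row) > tol)[0]
    return int(idx[-1] - idx[0]) + 1

P, R = transfer_3part(3)
A = poisson_1d(P.shape[0])
M_inv = np.diag(1.0 / np.diag(A))    # Jacobi smoother M^{-1} = diag(A)^{-1}
R_s = R @ A @ M_inv                  # smoothed restriction R~ = R A M^{-1}

# Truncation: the support grows only by small negative outer entries,
# which we throw away again.
R_trunc = np.where(R_s < 0.0, 0.0, R_s)
```

For the pure Laplacian the smoothed row spans two extra fine vertices on each side, carried by small negative entries, which matches the truncation argument in the text.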
FIGURE 2 Data flow overview of adAFAC-Jac. Solid red lines denote traditional subspaces within additive correction equations, dashed blue lines correspond to auxiliary equations that damp the existing correction equations

Algorithm 2. Blueprint of our adAFAC-PI without AMR. R i or P i denote the recursive application of the single level restriction or prolongation, R or P, respectively. I is the injection operator

This motivates us to introduce two modified, that is, smoothed, restriction operators R̃:

1. Jacobi smoothing, where we take the transpose of the full operator above and truncate the support, that is, throw away the small negative entries by which the stencil support grows. Furthermore, we approximate M̃⁻¹ = M⁻¹, that is, we reuse multigrid's correction operator within the damping term. For this choice, memory requirements are slightly increased (we have to track one more "unknown") and two solves on all grid levels besides the finest mesh are required (Algorithm 1). The flow of data between grids can be seen in Figure 2.
2. Sole injection, where we collapse M̃⁻¹ĨÃε into the identity. The overall damping reduces to −PIM⁻¹ applied to the residual: We evaluate the original additive solution update. While we perform this update, we identify updates within c-points, that is, for vertices spatially coinciding with the next coarser mesh, inject these, immediately prolongate them down again, and damp the overall solution with the result. The damping equation degenerates to PI (Algorithm 2). A schematic representation is shown in Figure 3.

Both choices are motivated through empirical observations. Our results study them for jumping coefficients in complicated domains, while our previous work demonstrates their suitability for Helmholtz-type setups.10 Though the outcome of both approaches is promising for our tests, we hypothesize that more complicated setups such as convection-dominated phenomena require more care in the choice of R̃.
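The sole-injection choice can be sketched in a few lines (1D, factor-3 coarsening; the operator construction is our own simplification): the additive per-vertex update s is injected at the c-points, immediately prolongated again, and the result is subtracted:

```python
import numpy as np

def prolongation_3part(n_coarse):
    """Linear interpolation P for factor-3 coarsening; coarse vertex j
    coincides with fine vertex 3j + 2 (1D sketch of three-partitioning)."""
    n_fine = 3 * (n_coarse + 1) - 1
    P = np.zeros((n_fine, n_coarse))
    for j in range(n_coarse):
        for off, w in ((-2, 1/3), (-1, 2/3), (0, 1.0), (1, 2/3), (2, 1/3)):
            P[3 * j + 2 + off, j] = w
    return P

def injection(n_fine):
    """Injection I: read off fine-grid values at vertices coinciding with
    coarse-grid vertices (c-points)."""
    n_coarse = (n_fine + 1) // 3 - 1
    I_op = np.zeros((n_coarse, n_fine))
    for j in range(n_coarse):
        I_op[j, 3 * j + 2] = 1.0
    return I_op

n_c = 8
n_f = 3 * (n_c + 1) - 1
P, I_op = prolongation_3part(n_c), injection(n_f)

x = np.linspace(0.0, 1.0, n_f + 2)[1:-1]
s = np.sin(np.pi * x)           # a smooth per-vertex solution update
damped = s - P @ (I_op @ s)     # PI damping: subtract the re-prolongated injection
```

At c-points the damped update vanishes identically, and for a smooth update the subtracted copy reproduces s almost everywhere, so exactly those components which the next coarser level corrects anyway are held back.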
FIGURE 3 Data flow overview of adAFAC-PI. Solid red lines denote traditional subspaces within additive correction equations, dashed blue lines correspond to auxiliary equations that damp the existing correction equations

Both approaches can be combined with multigrid with geometric transfer operators, where P is d-linear everywhere, or with algebraic approaches, where P stems from BoxMG. Both approaches inherit Ritz-Galerkin operators if these are used in the baseline additive scheme. Otherwise, they exploit rediscretization.

The AFAC solver family and further related approaches
It is not a new idea to damp the additive formulation of fast adaptive composite grids (FAC) such that additive multigrid remains stable. Among the earliest endeavors is FAC's asynchronous variant, referred to as asynchronous fast adaptive composite grids (AFAC),23,24 which decouples the individual grid levels to yield higher concurrency, too. To remove oscillations, AFAC is traditionally served in two variants:32 AFACc simultaneously determines the right-hand side for all grid levels. Before it restricts the fine grid residual to a particular level ℓ, any residuals on vertices spatially coinciding with vertices on level ℓ are set to zero. They are masked out on the fine grid. This effectively damps the correction equation's right-hand side. If we applied this residual masking recursively, that is, if we constructed the masking restriction recursively over the levels instead of in one rush (a discussion explicitly not found in the original AFACc paper, where only the points are masked which coincide with the target grid), then AFACc would become a hybrid solver between additive multigrid and the hierarchical basis approach.32 AFACf goes down a different route: The individual levels are treated independently from each other, but each level's right-hand side is damped by an additional coarse grid contribution. This coarse grid contribution is an approximate solve of the correction term for the particular grid. AFACf solves all meshes in parallel and sums up their contributions, but each mesh has its contribution reduced by the local additional coarse grid cycle. The resulting scheme is similar to the combination technique as introduced for sparse grids:43 We determine all solution updates additively but remove the intersection of their coarser meshes.
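The AFACc masking can be stated compactly (1D, factor-3 full weighting as a stand-in for the actual operators):

```python
import numpy as np

def restriction_3part(n_coarse):
    """Full-weighting restriction for factor-3 coarsening, R = (1/3) P^T,
    with coarse vertex j coinciding with fine vertex 3j + 2 (1D sketch)."""
    n_fine = 3 * (n_coarse + 1) - 1
    P = np.zeros((n_fine, n_coarse))
    for j in range(n_coarse):
        for off, w in ((-2, 1/3), (-1, 2/3), (0, 1.0), (1, 2/3), (2, 1/3)):
            P[3 * j + 2 + off, j] = w
    return P.T / 3.0

def afacc_restrict(R, r_fine, c_points):
    """AFACc-style masked restriction: residuals at vertices spatially
    coinciding with the target level are zeroed before restricting."""
    r = r_fine.copy()
    r[c_points] = 0.0
    return R @ r

n_c = 4
R = restriction_3part(n_c)
c_points = [3 * j + 2 for j in range(n_c)]

# A residual living solely on c-points never reaches the coarse level:
r = np.zeros(R.shape[1])
r[c_points] = 1.0
```

This is the damping of the correction equation's right-hand side described above: the unmasked restriction of r is nonzero, the masked one vanishes.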
Since multiplicative methods are superior to additive ones in terms of stability and simplicity, the transition from AFAC into AFACx24 is seminal: Its inventors retain one auxiliary coarser level for each multigrid level and split the additive scheme's solve into two phases (Figure 4): A first phase determines per level which modes might be eliminated by coarser grid solves. For this, they employ the auxiliary helper level. Each level keeps its additive right-hand side in the second phase, but it starts with a projection from this auxiliary level as an initial guess. The projection damps the correction after smoothing. Only the resultant damped corrections derived in the second phase are eventually propagated to finer grids.
AFAC and FAC solvers traditionally remain vague about which solvers are to be used for particular substeps. They are meta algorithms and describe a family of solvers. AFACx publications allow a free, independent choice of multigrid hierarchy and auxiliary levels. Our approach is different. We stick to the spacetree construction paradigm. As a result, real and auxiliary grid levels coincide. Furthermore, we do not follow AFACx's multiplicative per-level update (anticipate first the corrections made on coarser grids and then determine the own grid's contribution). Instead, we run two computations in parallel (additively). One is the classic additive correction computation. The other term imitates the effective reduction of this update compared with multiplicative multigrid. This additional, auxiliary term is subject to a single smoothing step on one single auxiliary level, which is the same as the next additive resolution.

FIGURE 4 Schematic overview of AFACx (left) and our adAFAC (right). Black markers denote smoothing steps on the auxiliary equations, white markers correspond to traditional additive multigrid

Our approach shares ideas with the Mult-additive approach29 where smoothed transfer operators are used in the approximation of a V(1, 1) cycle. Mult-additive yields faster convergence as it effectively yields stronger smoothers. We stick to the simple presmoothing approach and solely hijack the additional term to circumvent overshooting, while the asynchronicity of the individual levels is preserved.
We finally observe that our solver variant with a PI-term exhibits some similarity with BPX. BPX builds up its correction solely through intergrid transfer operators, while the actual fine grid system matrix does not directly enter the correction equations. Though not delivering an explanation of why the solver converges, the introduction of the PI-scheme in Reference 10 thus refers to this solver as BPX-like.

Idea 5. As our solver variants are close to AFAC, we call them adaptively damped AFAC and use the postfix PI or Jac to identify which damping equations we employ. Our manuscript thus introduces adAFAC-PI and adAFAC-Jac.

AN ELEMENTWISE, SINGLE-TOUCH IMPLEMENTATION
adAFAC works seamlessly with our algorithmic building blocks. It solves up to three equations of the same type per level. We distinguish the unknowns of these equations as follows: u is the solution in a full approximation storage (FAS) sense, while û is the hierarchical solution. adAFAC solves a correction equation, but no solution equation in the FAS sense. We do not need to store an additional adAFAC unknown ũ. A complicated multiscale representation along resolution boundaries is thus not required for the auxiliary damping equation: No semantic distinction between solution and correction areas is required. Let d and d̃ encode the iterative updates of the unknowns through the additive full approximation storage equation or the auxiliary adAFAC equation, respectively.

Operator storage
To make adAFAC stable and efficient for nontrivial ε, each vertex stores its elementwise operator parts from A. Vertices hold the stencils. For vertex members of the finest grid, the stiffness matrix entries result from the discretization of (1). If we use d-linear intergrid transfer operators, this storage scheme is applied to all levels. Otherwise, we augment each vertex by further stencils for P and R and proceed as follows: For a vertex on a particular level which overlaps with finer resolutions, this vertex belongs to a correction equation. Its stencil results from the Ritz-Galerkin coarse grid operator definition, whereas the intergrid transfer operators P and R result from Dendy's BoxMG.18 BoxMG is well-suited for three-partitioning.2,19,39 We refer to Reference 2 for remarks on how to make the scheme effectively matrix-free, that is, memory saving, nevertheless. Each coarse grid vertex carries its prolongation and restriction operator plus its stencil. We are also required to store the auxiliary R̃ for adAFAC-Jac. All further adAFAC terms use operators already held.
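As a data-structure sketch (the field names and placeholder values are ours, not the paper's), the storage scheme amounts to a per-vertex record:

```python
import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class Vertex:
    """Per-vertex operator storage: every vertex holds its row of A as a
    stencil; vertices on correction levels additionally hold P, R and, for
    adAFAC-Jac, the auxiliary smoothed restriction R~."""
    stencil: np.ndarray                            # assembled row of A
    prolongation: Optional[np.ndarray] = None      # BoxMG P stencil (coarse only)
    restriction: Optional[np.ndarray] = None       # BoxMG R stencil (coarse only)
    aux_restriction: Optional[np.ndarray] = None   # R~ for adAFAC-Jac (coarse only)

# 2D 5-point Laplacian stencil; transfer stencils carry placeholder entries.
five_point = np.array([[0.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 0.0]])
fine = Vertex(stencil=five_point)                  # finest grid: stencil only
coarse = Vertex(stencil=five_point,
                prolongation=np.ones((3, 3)) / 9.0,
                restriction=np.ones((3, 3)) / 9.0,
                aux_restriction=five_point)
```

Only the coarse vertex pays for the extra stencils, which mirrors the statement that all further adAFAC terms reuse operators already held.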

Grid traversal
For the realization of the (dynamically) adaptive scheme, we follow2,10,15,44 and propose to run through the spacetree in a depth-first (DFS) manner, while each level's cells are organized along a space-filling curve.14 We write the code as a recursive function where each cell has access to its 2ᵈ adjacent vertices, its parent cell, and the parent cell's 2ᵈ adjacent vertices. The latter ingredients are implicitly stored on the call stack of the recursive function.
As we combine DFS with space-filling curves, our tree traversal is weakly single-touch w.r.t. the vertices: Vertices are loaded when an adjacent cell from the spacetree is first entered. They are "touched" for the last time once all 2ᵈ adjacent cells within the spacetree have been left due to recursion backtracking. In-between, they reside either on the call stack or can be temporarily stored in stacks.14 The call stack is bounded by the depth of the spacetree (it is small) while all temporary stacks are bounded by the time in-between the traversal of two face-connected cells. The latter is short due to the Hölder continuity of the underlying space-filling curve. Hanging vertices per grid level, that is, vertices surrounded by less than 2ᵈ cells, are created on-demand on-the-fly. They are not held persistently. We may assume that all data remains in the caches.14,15,44 As we extract elementwise operators for A, P, R from the stencils stored within the vertices or hard-code these elementwise operators, we end up with a strict elementwise traversal in a multiscale sense. All matrix-vector products (mat-vecs) are accumulated. The realization of the elementwise mat-vecs reads as follows: Once we have loaded the vertices adjacent to a cell, we can derive the elementwise stiffness matrix or intergrid transfer operator for the cell. To evaluate r = Au, we set one variable r per vertex to zero, and then accumulate the mat-vec contributions in each of the vertex's adjacent cells. Since the hierarchical û can be determined on-the-fly while running DFS from coarse grids into fine grids, the evaluation of the hierarchical residual follows exactly r's pattern. So does the realization of the auxiliary residual. adAFAC's mat-vecs can be realized within a single spacetree traversal. The mat-vecs are single-touch.
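The accumulated mat-vec can be illustrated in 1D (our own sketch; the paper works with 2ᵈ-vertex cells in d dimensions):

```python
import numpy as np

def elementwise_matvec(u, h):
    """Accumulated elementwise mat-vec: r = A u is never formed from an
    assembled matrix; instead every cell scatters the product of its 2x2
    element stiffness matrix with its two adjacent vertex values."""
    Ke = np.array([[1.0, -1.0], [-1.0, 1.0]]) / h   # 1D Laplacian element matrix
    r = np.zeros_like(u)                            # "set one variable r per vertex to zero"
    for c in range(u.size - 1):                     # one visit per cell
        r[c:c + 2] += Ke @ u[c:c + 2]               # accumulate into the adjacent vertices
    return r
```

Each vertex is written only by its adjacent cells, so the accumulation fits naturally into the single-touch traversal: once the last adjacent cell has been left, the vertex's r value is final.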

Logical iterate shifts and pipelining
A full approximation storage sweep cannot straightforwardly be realized within a single DFS grid sweep:10 The residual computation propagates information bottom-up, the corrections propagate information top-down, and the final injection propagates information bottom-up again. This yields a cycle of causal dependencies. We thus offset the additive cycle's smoothing steps by half a grid sweep: Each grid sweep, that is, DFS traversal, evaluates all three mat-vecs (of FAS, of the hierarchical transformation multigrid, of adAFAC) but does not perform the actual updates. Instead, the correction quantities of the three equations are bookmarked as additional attributes within the vertices while the grid traversal backtracks, that is, returns from the fine grids to the coarser ones. Their impact is added to the solution throughout the downstepping of the subsequent tree sweep. Here, we also evaluate the prolongation. Restriction of the residual to the auxiliary right-hand side and hierarchical residual continue to be the last action on the vertices at the end of the sweep when variables are last written/accessed. As we plug into the recursive function's backtracking, we know that all right-hand sides are accumulated from finer grid levels when we touch a vertex for the last time throughout a multiscale grid traversal. We can thus compute the unknown updates though we do not directly apply them.
As we use helper variables to store intermediate results throughout the solve and apply them the next time, we need one tree traversal per V-cycle plus one kick-off traversal. Our helper variables pick up ideas behind pipelining and are a direct translation of optimization techniques proposed in Reference 10 to our scheme. Per traversal, each unknown is read into memory/caches only once. We obtain a single-touch implementation. adAFAC's auxiliary equations do not harm its suitability for architectures with a widening gap between memory and compute facilities.
Dynamic mesh refinement integrates seamlessly into the single-touch traversal: We rely on a top-down tree traversal which adds additional tree levels on-demand throughout the steps down in the grid hierarchy. The top-down traversal's backtracking drops parts of the tree if a coarsening criterion demands so, that is, it erases mesh parts. Though the erasing feature is not required for the present test cases, both refinement and coarsening integrate into Algorithm 3. We inherit FAC's straightforward handling of dynamic adaptivity, simplify the treatment of resolution transitions through full approximation storage, and provide an implementation which reads/writes each unknown only once from the main memory.
Algorithm 3. Outline of single-touch adAFAC-Jac. A tilde identifies variables related to the auxiliary adAFAC grid. We invoke the cycle passing in the coarsest grid ℓmin, that is, all helper variables are initially set to zero.
function adAFAC-Jac()

end function
In a parallel environment, the logical offset of the computations by half a grid sweep allows us to send out the partial (elementwise) residuals along the domain boundaries nonblockingly and to accumulate them on the receiver side. As the restriction is a linear operator, residuals along the domain boundaries can be restricted partially, too, before they are exchanged on the next level. Prior to the subsequent multiscale mesh traversal, all residuals and restricted right-hand sides are received and can be merged into the local residual contributions. If a rank is responsible for a cell, it also holds a copy of its parent cell. As we apply this definition recursively, a rank can restrict its partial residuals all the way through to the coarsest mesh, and all ranks hold the spacetree's root. The two ingredients imply that parallel adAFAC can be realized with one spacetree traversal per cycle. They also imply that we do not run the individual grid levels in parallel even though we work with additive multigrid. Instead, we vertically integrate the levels, that is, if a tree traversal is responsible for a certain fine grid partition of the mesh, it is also responsible for its coarser representations, and the traversal of all of the grid resolutions is collapsed (integrated) into one mesh run-through.
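The partial restriction rests solely on the linearity of R, which a two-rank toy example (synthetic stand-in operator and residuals) makes explicit:

```python
import numpy as np

# Because the restriction R is linear, each rank may restrict its partial
# boundary residual independently and exchange only coarse-level data; the
# merged result equals restricting the merged residual.
rng = np.random.default_rng(7)
R = rng.normal(size=(4, 13))                         # stand-in restriction operator
r_rank0, r_rank1 = rng.normal(size=13), rng.normal(size=13)

restrict_then_merge = R @ r_rank0 + R @ r_rank1      # each rank restricts, then sums
merge_then_restrict = R @ (r_rank0 + r_rank1)        # the serial reference
```

This is why the coarse-level data can travel nonblockingly alongside the fine-level exchange instead of waiting for the merged fine residual.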

NUMERICAL RESULTS
To assess the potential of adAFAC, we study four test setups of type (1) on the unit square. They are simplistic yet already challenging for multigrid. All setups use f = 0 and set u|∂Ω = 1 for x₂ = 0 and u|∂Ω = 0 otherwise. This discontinuity in the boundary conditions and the reduced global smoothness are beyond the scope here. However, they stress the adaptivity criteria and pose a challenge for the multigrid corrections. The criteria have to refine toward the corners, while a sole geometric prolongation close to the corners introduces errors. Similar effects are well known among lid-driven cavity setups, for example. A first test is the sole Poisson equation with a homogeneous material parameter. The three other setups (Figure 5) use regions with ε = 1 and regions with ε = 10⁻ᵏ. Per run, the respective k ∈ {1, 2, … , 5} is fixed. The second setup splits up the parameter domain into two equally sized sections. We emphasize that the split is axis-aligned but does not coincide with the mesh, as we employ three-partitioning of the unit square. The third setup penetrates the area with a thin protruding line of width 0.02. This line is parameterized with ε. It extends from the x₁ axis ((x₁, x₂)ᵀ ∈ ℝ² being the coordinates) and terminates half-way into the domain. Such small material inhomogeneities cannot be represented explicitly on coarse meshes. The last setup makes the lines x₂ = 5x₁ − 2.5 and x₂ = 0.2x₁ + 0.5 separate domains which hold different ε in a checkerboard fashion. No parameter split is axis-aligned. If tests are labeled as regular grid runs, each grid level is regular and we consequently end up with a mesh holding (3⁷ − 1)ᵈ = 4,778,596 degrees of freedom for ℓmax = 7. If not labeled as regular grid runs, our tests rely on dynamic mesh refinement.
Otherwise, our experiments focus on d = 2 and start with a two-grid algorithm (ℓmax = 2) where the coarser level has (3 − 1)ᵈ = 4 degrees of freedom and the finer level hosts (3² − 1)ᵈ = 64 vertices carrying degrees of freedom. From here on, we add further grid levels and build up to an eight-grid scheme (ℓmax = 8).
In every other cycle, our code manually refines the cells along the bottom boundary, that is, the cells where one face carries u|∂Ω ≠ 0. We stop with this refinement when the grid meets ℓmax. Our manual mesh construction ensures that we kick off with a low total vertex count, while the solver does not suffer from pollution effects: The scheme kickstarts further feature-based refinement. Parallel to the manual refinement along the boundary, our implementation measures the absolute second derivatives of the solution along both coordinate axes in every single unknown. A bin sorting algorithm is used to identify the vertices carrying the (approximately) 10% biggest directional derivatives. These are refined unless they already meet ℓmax. The overall approach is similar to full multigrid, where coarse grid solutions serve as initial guesses for subsequent cycles on finer meshes, though our implementation lacks higher order operators. All interpolation from coarse to fine meshes, both for hanging vertices and for newly created vertices, is d-linear.
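A 1D sketch of the criterion (our own simplification: plain argsort instead of bin sorting, second differences as curvature indicator) reads:

```python
import numpy as np

def mark_for_refinement(u, h, fraction=0.1):
    """Feature-based refinement sketch: use the absolute second difference as
    curvature indicator and mark the top `fraction` of the interior vertices.
    The paper uses bin sorting; argsort is a plain stand-in here."""
    d2 = np.abs(u[:-2] - 2.0 * u[1:-1] + u[2:]) / h**2
    k = max(1, int(np.ceil(fraction * d2.size)))
    return np.argsort(d2)[-k:] + 1      # interior vertex indices of the k largest

# A kink in the solution is picked up immediately:
x = np.linspace(0.0, 1.0, 21)
marked = mark_for_refinement(np.abs(x - 0.5), h=x[1] - x[0])
```

The marked vertices are then refined unless they already sit on the finest admissible level.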
Our runs employ a damped Jacobi smoother with damping ω = 0.6 and report the normalized residuals (10), with n being the cycle count, rᵢ⁽ⁿ⁾ the residual in vertex i, and hᵢ the local mesh spacing around vertex i. Dynamic mesh refinement inserts additional vertices and thus might increase the residual vectors between two subsequent iterations. As a consequence, residuals under a Euclidean norm may temporarily grow due to mesh expansion. This effect is amplified by the lack of higher order interpolation for new vertices. The normalized residual (10) enables us to quantify how much the residual has decreased compared with the residual fed into the very first cycle.
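One plausible concrete form of the normalized residual (10) (the exact weighting is our assumption) is:

```python
import numpy as np

def normalized_residual(r, h, r0, h0):
    """Mesh-size-weighted Euclidean norm of the cycle-n residual, normalized
    by the residual fed into the very first cycle. Weighting each r_i with its
    local h_i is an assumption mirroring the text's description of (10)."""
    norm = lambda v, hv: np.sqrt(np.sum((hv * v) ** 2))
    return norm(r, h) / norm(r0, h0)
```

Weighting each rᵢ⁽ⁿ⁾ with its local hᵢ keeps the metric comparable when dynamic refinement inserts many new vertices: the additional small-hᵢ entries contribute little to the weighted sum.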
Where appropriate, we also display the normalized maximum residual. This metric identifies localized errors, while (10) weights errors with the mesh size.

Consistency study: The Poisson equation
Our first set of experiments focuses on the Poisson equation, that is, ε = 1 everywhere. Multigrid is expected to yield a perfect solver for this setup: Each cycle (multiscale grid sweep) has to reduce the residual by a constant factor which is independent of the number of degrees of freedom, that is, the number of vertices. Ritz-Galerkin multigrid yields the same operators as rediscretization, since BoxMG gives bilinear intergrid transfer operators. The setup is a natural choice to validate the consistency and correctness of the adAFAC ingredients. All grids are regular. Our experiments (Figure 6) confirm that additive multigrid is insignificantly faster than the other alternatives if it is stable. The more grid levels are added, the more we overshoot per multilevel relaxation. When we start to add a seventh level, this suddenly makes the plain additive code's performance deteriorate. With an eighth level added, the solver would diverge (not shown). Exponentially damped multigrid does not suffer from the instability for many levels, but its damping of the coarse grid influence leads to the situation that long-range solution updates do not propagate quickly. The convergence speed suffers from additional degrees of freedom. Both of our adAFAC variants are stable, and they do not suffer from a speed deterioration. Their localized damping makes both schemes effective and stable. We see similar rates of convergence in our damped implementations to that of undamped additive multigrid. adAFAC-PI and adAFAC-Jac are almost indistinguishable.
Despite the instability of plain additive multigrid, we continue to benchmark against the undamped additive scheme, as exponential damping is not competitive. All experiments from here on are reasonably irregular/coarse to circumvent the instabilities. The feature-based dynamic refinement criterion makes the mesh spread out from the bottom edge where u|∂Ω = 1 (Figure 7). To assess its impact on cost, we count the number of required degree-of-freedom updates plus the updates on coarser levels. These degree-of-freedom updates correlate directly to runtime. We do not neglect the coarse grid costs.
One smoothing step on a regular mesh of level eight yields 4.3·10⁷ updates plus the updates on the correction levels. If the solver terminated in 40 cycles, we would have to invest more than 10⁹ updates. Dynamic mesh unfolding reduces the cost to reduce the residual by up to three orders of magnitude. For Poisson, this saving applies to both our adAFAC variants and plain additive multigrid, as long as the latter remains stable.
If run with BoxMG, our codebase uses Ritz-Galerkin coarse operator construction for both the correction terms and the auxiliary adAFAC operators in adAFAC-Jac. We validated that the algebraic intergrid transfer operators and the geometric operators yield exactly the same outcome. This is correct for Poisson, as BoxMG yields geometric operators here, and Ritz-Galerkin coarse operator construction for the correction terms thus yields the same result as rediscretization.
Observation 1. Our code is consistent. For very simple, homogeneous setups, it however makes only limited sense to use adAFAC, unless there are many grid levels. If adAFAC is to be used, adAFAC-PI is sufficient. There is no need to really solve an additional auxiliary equation.
We conclude with the observation that all of our solvers, if stable, converge to the same solution. They are real solvers, not mere preconditioners that only yield approximate solutions.

One material jump
We next study a setup where the material "jumps" in the middle of the domain. The stronger the material transition, the more important it is to pick up the ε changes in the intergrid transfer operators. Otherwise, a prolongation of coarse grid corrections yields errors close to x₁ = 0.5. As no grid in the present setup has degrees of freedom exactly on the material transition, the intergrid transfer operators are never able to mirror the material transition exactly. Without dynamic adaptivity, multigrid runs the risk of deteriorating in the multiplicative case and becoming unstable in the additive case. To document this phenomenon, we monitor the solution in one sample point coinciding with the real degree of freedom next to x = (0.5, 0.5)ᵀ, and employ a jump in ε of seven orders of magnitude. A regular grid corresponding to ℓ = 6 is used. We start from a single grid algorithm and add an increasing number of correction levels. Not all grid level setups are shown herein. Without algebraic intergrid transfer operators, oscillations arise if we do not use our additional damping parameter (Figure 8). The oscillations increase with the number of coarse grid levels used. Our damping parameter eliminates these oscillations and does not harm the rate of convergence. Algebraic intergrid transfer operators eliminate these oscillations, too. The results show why codes without algebraic operators and without damping usually require a reasonably coarse mesh to align with ε transitions.
We continue with dynamically adaptive meshes. All experiments use the AMR/FMG setup, that is, start from a coarse mesh and then dynamically adapt the grid. We observe that the hard-coded grid refinement refines along the stimulus boundary at the bottom, while the dynamic refinement criterion unfolds it along the material transition (Figure 8).
Starting from reasonably small changes in ε (Figure 9), additive multigrid with geometric intergrid transfer operators again fails to converge. Once BoxMG is used, it becomes stable. The residual plot in the maximum norm validates our statement that large errors arise along the material transition when we insert new degrees of freedom. We need an algebraic interpolation routine. Our adAFAC variants, in contrast, all converge. The absence of a higher order interpolation for new degrees of freedom hurts, but it does not destroy the overall stability. Once the dynamic AMR stops inserting new vertices (this happens after around 10⁶ degrees of freedom have been processed) the residual drops under both norms.
The picture changes when we increase the variation in ε. adAFAC-Jac with bilinear transfer operators converges for all ε = 10⁻ᵏ values tested, whereas additive multigrid and adAFAC-PI diverge without BoxMG if the ε-transition is too harsh (Figure 10). The geometric intergrid transfer approach suffers from oscillations around the material transition. All stable solvers play in the same league.

FIGURE 9 The left plot shows the normalized residual and the right shows the normalized L∞-norm of the residual. ℓmax = 8. ε ∈ {1, 10⁻¹}, that is, the material parameter changes by one order of magnitude. We present only data for converging solver flavors

FIGURE 10 Setup of Figure 9 but with a five orders of magnitude jump in the material parameter. We present only data for converging runs and observe that fewer solver ingredient combinations converge
Observation 2. If we face reasonably small material jumps, adAFAC-PI is superior to plain additive multigrid, adAFAC-Jac, or any algebraic-geometric extension, as it is both stable and simple to compute. Once the jump grows, adAFAC-Jac becomes the method of choice. Its auxiliary damping equations compensate for the lack of algebraic intergrid transfer operators, which are typically not cheap to compute.

A material inclusion
Tiny, localized variations in ε are notoriously difficult to handle for multigrid. The spike setup from our test suite yields a problem where diffusive behavior is "broken" along the inclusion. The adaptivity criterion thus immediately refines along the tiny material spike (Figure 11), since the solution's curvature and gradient there are very high. We see diffusive behavior around this refined area, but we know that there is no long-range, smooth solution component overlapping the ε changes. Again, a reasonably small variation in ε does not pose major difficulties to either of our damped adAFAC solvers. The strong localization of the adaptivity ensures that the material transition is reasonably handled, such that a sole geometric choice of intergrid transfer operators is totally sufficient. However, this setup is challenging for additive multigrid, which fails to converge even with BoxMG.
Once we increase the material change by three orders of magnitude, we need an explicit elimination of oscillations arising along the ε changes. Solely employing algebraic BoxMG operators is insufficient. They can mirror the solution behavior to some degree, but they are incapable of compensating for the poor choice of our coarse grid points. The present setup would require algebraic coarse grid identification where the coarse grid aligns with the inclusion.
While adAFAC-PI with algebraic operators manages to obtain reasonable convergence for a material variation of one order of magnitude nevertheless, it is unable to converge for a three orders of magnitude change even with algebraic intergrid transfer operators. adAFAC-Jac is able to handle the sharp, localized transition, which can also be read as an extreme case of an anisotropic ε choice in (1). We see convergence for both its geometric variant and its algebraic extension, though now the BoxMG variant is superior to its geometric counterpart.

Observation 3. adAFAC-Jac equips the geometric-algebraic BoxMG method with the opportunity to compensate, to some degree, for the lack of support of anisotropic refinement.

FIGURE 11 Typical adaptive mesh for the setup with a tiny, needle-like inclusion once the refinement criterion has stopped adding further elements (top left). The material inclusion either holds an ε which is bigger than its surrounding by a factor of ten (top row) or even by a factor of 1,000 (bottom row)

FIGURE 12 Typical adaptive mesh for a setup where the regions with different material parameter ε are not axis-aligned. One order of magnitude differences in the material parameter (top) versus three orders of magnitude (bottom)

Non-axis-aligned subdomains
We move on to our experimental setup with a deformed checkerboard (Figure 12), where the dynamic adaptivity criterion unfolds the mesh along the material transitions. The solution behavior within the four subregions itself is smooth, that is, diffusive, and the adaptivity around the material transitions thus is wider, more balanced, than the hard-coded adaptivity directly at the bottom of the domain. With smallish variations in ε, this setup does not pose a challenge to any of our solvers, irrespective of whether they work with algebraic or geometric intergrid transfer operators. With increasing differences in ε, we however observe that additive multigrid starts to diverge. The smooth regions are still sufficiently dominant, and we suffer from overcorrection. adAFAC-PI performs better yet requires algebraic operators to remain robust up to ε variations of three orders of magnitude. adAFAC-Jac with geometric operators remains stable for all studied setups, up to and including the five orders of magnitude jump. adAFAC-Jac with algebraic operators outperforms its geometric cousin. BoxMG's accurate handling of material transitions decouples the subdomains from each other on the coarse correction levels. Updates in one domain thus do not pollute the solution in a neighboring domain.
Observation 4. While the auxiliary equations can substitute for algebraic operators in some cases, they fail to tackle material transitions that are not grid-aligned.
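The overcorrection mechanism mentioned above can be illustrated with toy numbers (all factors below are invented for the sake of the example and do not stem from the experiments):

```python
# Toy illustration of the overcorrection phenomenon (all numbers invented for
# the example): on a smooth error component, several levels each compute a
# correction for essentially the same error, so adding them up overshoots.

error = 1.0                       # smooth error component at one vertex

fine_correction = 0.9 * error     # the fine level removes most of the error...
coarse_correction = 0.9 * error   # ...and so, independently, does the coarse level

# Plain additive update: both corrections are applied at once and overshoot.
additive = error - (fine_correction + coarse_correction)

# Damped additive update: a damping factor (a fixed 0.5 here, standing in for
# the per-vertex, per-level factor adAFAC computes) scales the summed update.
damped = error - 0.5 * (fine_correction + coarse_correction)

# additive has flipped sign (oscillation); damped has merely shrunk the error.
```

A fixed global factor of 0.5 already avoids the sign flip in this two-level toy case; adAFAC's point is that the factor is computed per vertex and per level through an auxiliary equation instead of being hard-coded.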

Parallel adAFAC
We close our experiments with a scalability exercise. All data stem from a cluster hosting Intel E5-2650V4 (Broadwell) nodes with 24 cores per node, connected via Omni-Path. We use Intel's Threading Building Blocks (TBB) for the shared memory parallelization and Intel MPI for the distributed memory realization.
Both the shared and the distributed memory parallelization of our code use a multilevel space-filling curve approach. The fine grid cells are arranged along the Peano space-filling curve and cut into curve segments of roughly the same number of cells. We use a nonoverlapping domain decomposition on the finest mesh. Logically, our code does not distinguish between the shared and the distributed memory strategy: both decompose the data in the same way; the distributed memory variant merely replaces memory copies along the boundary with MPI calls. All timings rely on runtimes for one cycle on a stationary mesh, that is, load ill-balances and overhead induced by adaptive mesh refinement are omitted.
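The cutting step described above reduces to a one-dimensional problem once the cells are ordered along the curve. The following sketch illustrates this; `cut_into_segments` and the equal-count heuristic are illustrative simplifications, not part of the actual code base:

```python
# Sketch of the partitioning step: cells are already ordered along the
# space-filling curve, so a nonoverlapping decomposition reduces to cutting
# this 1D ordering into contiguous segments of nearly equal size.
# cut_into_segments is an illustrative helper, not the real load balancer.

def cut_into_segments(num_cells: int, num_ranks: int) -> list:
    """One contiguous index range per rank; segment sizes differ by at most 1."""
    base, extra = divmod(num_cells, num_ranks)
    segments, start = [], 0
    for rank in range(num_ranks):
        size = base + (1 if rank < extra else 0)  # spread the remainder evenly
        segments.append(range(start, start + size))
        start += size
    return segments

# 10 curve-ordered cells over 3 ranks yield segment sizes 4, 3, 3.
sizes = [len(s) for s in cut_into_segments(10, 3)]
```

Because the curve preserves spatial locality, contiguous index segments correspond to geometrically compact partitions, which keeps the boundary data exchanged per cycle small.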
For all experiments, we start adAFAC and wait until the dynamic adaptivity has unfolded the grid completely such that it meets our prescribed maximum mesh size h. We furthermore hardcode the domain decomposition such that the partitioning is close to optimal: We manually eliminate geometric ill-balancing, and we focus on the most computationally demanding cycles of a solve. Cycles before that, where the grid is not yet fully unfolded, yield performance similar to experiments with a bigger h.
Our shared memory experiments (Figure 13) show reasonable scalability up to eight cores if the mesh is detailed. The curves are characteristic for both adAFAC-PI and adAFAC-Jac, that is, we have not been able to distinguish the runtime behavior of the two approaches. If the mesh is too small, we see strong runtime variations; otherwise, the curves are reasonably smooth. Overall, the shared memory efficiency remains below 70% even if we make the mesh more detailed.
Our code employs a very low-order discretization and thus exhibits a low arithmetic intensity. Both adAFAC-PI and adAFAC-Jac increase this intensity, but the increase is lost behind other effects such as data transfer or the management of adaptive mesh refinement. The reason for the performance stagnation is not clear, but we may assume that NUMA effects play a role and that communication overhead affects the runtime, too. With a distributed memory parallelization, we can place two ranks onto each node. NUMA then has no further knock-on effects, and we obtain smooth curves until we run into too small partitions per node. With a low-order discretization, our code is communication-bound, in line with most multigrid codes, yet it primarily suffers from a strong synchronization between cycles: Due to the nonoverlapping domain decomposition on the finest grid, all traversals through the individual grid partitions are synchronized with each other. Our adAFAC implementation merges the coarse grid updates into the fine grid smoother, but each smoothing step requires a core to synchronize with all other cores. We eliminate strong scaling bottlenecks due to small system solves, but we have not yet eliminated scaling bottlenecks stemming from a tight synchronization of the (fine grid) smoothing steps.

Observation 5. Despite adAFAC's slight increase of the arithmetic intensity, it seems to be mandatory to switch to higher order methods 8 or approaches with higher asynchronicity 21 to obtain better scalability.

F I G U R E 13
Left: Shared memory experiments with adAFAC. All solver variants rely on the same code base, that is, they exchange only operators, such that they all share the same performance characteristics. Right: Some distributed memory runtime results with the time for one multiscale grid sweep. This corresponds to one additive cycle as we realize single-touch semantics. We study three different mesh sizes given via upper bounds on the mesh size h. Two ranks per node, that is, one rank per socket, are used. This is in line with other research. 8,21

CONCLUSION AND OUTLOOK
We introduce two additive multigrid variants which are conceptually close to asynchronous fast adaptive composite grid solvers and Mult-additive. An auxiliary term in the equation ensures that the overshooting of plain additive multigrid is eliminated. Our results validate that we obtain reasonable multigrid performance and stability. They confirm that adAFAC aligns with the three key concepts from the introduction: It relies solely on the geometric grid hierarchy, it sticks to the asynchronous additive paradigm, and all new ideas can be used in combination with advanced implementation patterns such as single-touch formulations or quasi-matrix-free matrix storage. Beyond that, the results uncover further insights: adAFAC delivers reasonable robustness when solely using geometric intergrid transfer operators. The construction of good intergrid transfer operators for nontrivial material parameters is neither conceptually trivial nor computationally cheap. It is thus an interesting idea to give up on the notion of a good operator and instead to eliminate the oscillations resulting from poor operators within the correction equation. We show that this is a valid strategy for some setups. In this context, adAFAC can be read as an antagonist to BPX: BPX omits the system operator from the correction equations and "solely" relies on proper intergrid transfer operators, whereas our geometric adAFAC variants work without algebraic operators but with a problem-dependent auxiliary smoother, that is, a problem-dependent operator. It is notoriously difficult to integrate multigrid ideas into existing solvers, as multigrid builds upon several sophisticated building blocks and needs mature, advanced data structures. On the implementation side, an interesting contribution of our work is therefore the simplification and integration of the novel adAFAC idea into well-established concepts. The fusion of three different solves (the real solution, the hierarchical solution required for the hierarchical transformation multigrid (HTMG) implementation variant, and the damping equations) does not introduce any additional implementational complexity compared with standard relaxation strategies. It does, however, increase the arithmetic intensity. adAFAC can be implemented as a single-touch solver on dynamically adaptive grids. This renders it an interesting approach for high performance codes relying on dynamic, flexible meshes.
Studies from a high-performance computing viewpoint are among our next steps. Interest in additive solvers has recently increased as they promise to become a seedcorn for asynchronous algorithms. 21 Our algorithmic sketches integrate all levels' updates into one grid sweep and thus fall into the class of vertically integrated solvers. 2,30 It will be interesting to study how desynchronization interplays with the present solver and single-touch ideas. Furthermore, we have to apply the scheme to more realistic and more challenging scenarios. Nonlinear equations are particularly attractive here, as our adAFAC implementation already offers a full approximation storage data representation. The multigrid community has a long tradition of fusing different ingredients: geometric multigrid on very fine levels, direct solvers on very coarse levels, and algebraic techniques in between, for example. adAFAC is yet another solver variant within this array of options, and it will be interesting to see where and how it can work in conjunction with other multilevel solvers. On the numerical method side, we expect further payoffs from improving individual components of the solver, such as tailoring the smoother to our modified restriction or modifying the prolongation in tandem. Notably, ideas following 29 which mimic a V(1, 1)-cycle or even a V-cycle with more smoothing steps are worth investigating.

F I G U R E 1
Left: A 3 × 3 mesh (level 1; top layer) serves as first refinement level. From here, we construct subsequent refinement levels by subdividing cells into 3 × 3 patches. This yields a spacetree: multiple Cartesian meshes are embedded into each other. Middle: Conductivity (material) parameter setup as used for a stationary heat equation solve where the right bottom side of the unit square is heated up. A high conductivity in two domain subregions makes the solution (right) asymmetric. Let the bounding cell have level 0. It is equidistantly cut into k parts along each coordinate axis. We obtain k^d child cells of level 1.
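The level-wise cell counts implied by this construction can be checked with a few lines; `cells_on_level` is a hypothetical helper, and a fully refined tree is assumed:

```python
# Quick sanity check of the spacetree construction: cutting each cell into k
# parts along each of the d coordinate axes yields k**d children per cell, so
# a fully refined spacetree holds k**(d*l) cells on level l. cells_on_level is
# a hypothetical helper for this back-of-the-envelope count.

def cells_on_level(k: int, d: int, level: int) -> int:
    """Cell count of a fully refined k-spacetree on the given level."""
    return (k ** d) ** level

# Tripartitioning (k = 3) of the unit square (d = 2), as in the figure:
counts = [cells_on_level(3, 2, level) for level in range(4)]
# level 1 is the 3 x 3 mesh with 9 cells, level 2 holds 81 cells, and so on.
```

The geometric growth of the cell count per level is what makes the coarse levels cheap relative to the fine grid, and hence what makes additive updates of all levels per sweep affordable.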

F I G U R E 5
The three material parameter distributions studied throughout the tests. The blue area (left: inside of the inclusion; middle and right: top left and bottom right, respectively) holds the value 1. The remaining domain holds 10^{-k}, k ∈ {1, 2, … , 5}

F I G U R E 6
Solves of the Poisson equation on regular grids of different levels. We compare plain additive multigrid (top, left), multigrid using exponential damping (top, right), adAFAC-PI (bottom, left), and adAFAC-Jac (bottom, right)

F I G U R E 7
Left: Typical adaptive mesh for pure Poisson (constant material parameter) once the refinement criterion has stopped adding further elements. Right: We compare different solvers on the pure Poisson equation using a hybrid FMG-AMR approach starting at a two-grid scheme and stopping at an eight-grid scheme, that is, a maximum level of 8

F I G U R E 8
The domain material is split into two halves with a material parameter jump from 1 to 10^{-7}. Typical adaptive mesh for the single-discontinuity setup once the refinement criterion has stopped adding further elements (left). Solution development in a sample point next to the discontinuity, normalized by the true solution value at that point, that is, one means the correct value. We compare d-linear intergrid transfer (middle) to BoxMG operators (right)

    ⊳ Restrict the fine grid residual to coarser levels; b_{ℓmax} remains untouched
end for
for all ℓmin < ℓ < ℓmax do
    c_ℓ ← 0; ĉ_ℓ ← 0  ⊳ Initial "guesses" for corrections
    c_ℓ ← jacobi(A_ℓ c_ℓ = b_ℓ, ω)  ⊳ Iterate of correction equation stored in c_ℓ
    ĉ_ℓ ← P I c_ℓ  ⊳ Computation of localised damping for c_ℓ
end for
c_{ℓmin} ← 0; ĉ_{ℓmin} ← 0
c_{ℓmin} ← jacobi(A_{ℓmin} c_{ℓmin} = b_{ℓmin}, ω)  ⊳ No auxiliary damping for coarsest level
c_{ℓmax} ← 0; ĉ_{ℓmax} ← 0  ⊳ Initial "guess" for correction on finest grid

sc holds the coarse grid corrections and accumulates contributions from both grids:

sc_ℓ ← sl_ℓ − ŝl_ℓ + P sc_{ℓ−1} − P ŝl_{ℓ−1}  ⊳ i.e., sc is recursively prolongated
u_ℓ ← u_ℓ + sf_ℓ − ŝf_ℓ  ⊳ Anticipate fine grid smoothing effects deposited in sf
d_ℓ ← jacobi(A_ℓ u_ℓ = b_ℓ, ω)  ⊳ Determine update through Jacobi step
d̂_ℓ ← jacobi(Ã_ℓ 0 = b̂_ℓ, ω)  ⊳ Auxiliary smooth with initial guess 0
r_ℓ ← residual(A_ℓ u_ℓ = b_ℓ)  ⊳ Bookmark residual from Jacobi update
r̂_ℓ ← residual(Ã_ℓ u_ℓ = b̂_ℓ)  ⊳ Compute hierarchical residual
sl_ℓ ← d_ℓ; u_ℓ ← u_ℓ + d_ℓ; ŝl_ℓ ← d̂_ℓ  ⊳ Bookmark updates in sl and apply them, too
b_{ℓ−1} ← R r_ℓ; b̂_{ℓ−1} ← R r̂_ℓ  ⊳ Restrict right-hand sides to coarse equation systems
sf_{ℓ−1} ← I (sf_ℓ + sl_ℓ)  ⊳ Inform coarser levels about updates, but do not
ŝf_{ℓ−1} ← I (ŝf_ℓ + ŝl_ℓ)  ⊳ apply them there. Information propagates bottom-up.
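A jacobi(A_ℓ c_ℓ = b_ℓ, ω) call as used in the listing fragments above boils down to one damped Jacobi sweep. The following sketch illustrates this building block on a 1D Poisson stencil; the system size and the weight ω = 0.6 are illustrative choices, not the paper's setup:

```python
# Minimal damped Jacobi sweep, the building block written as jacobi(., omega)
# in the listings above. The 1D Poisson stencil [-1, 2, -1] is an illustrative
# stand-in for the per-level operators A_l; omega = 0.6 is an assumed
# relaxation weight, not the paper's choice.

def jacobi_sweep(u, b, omega):
    """One damped Jacobi iteration for the 1D Poisson system (diagonal 2)."""
    n = len(u)
    new_u = u[:]
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0         # homogeneous Dirichlet ends
        right = u[i + 1] if i < n - 1 else 0.0
        residual = b[i] - (2.0 * u[i] - left - right)
        new_u[i] = u[i] + omega * residual / 2.0  # relax by omega / diagonal
    return new_u

b = [1.0] * 7          # small test system
u = [0.0] * 7
for _ in range(200):
    u = jacobi_sweep(u, b, 0.6)

# The residual shrinks towards zero as the sweeps converge.
max_res = max(abs(b[i] - (2.0 * u[i]
                          - (u[i - 1] if i > 0 else 0.0)
                          - (u[i + 1] if i < len(u) - 1 else 0.0)))
              for i in range(len(u)))
```

On a single level such a sweep only reduces the oscillatory error components quickly; the smooth components are exactly what the concurrently running coarse-level corrections in the listings above take care of.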