SWIFT: Using Task-Based Parallelism, Fully Asynchronous Communication, and Graph Partition-Based Domain Decomposition for Strong Scaling on more than 100,000 Cores

We present a new open-source cosmological code, called SWIFT, designed to solve the equations of hydrodynamics using a particle-based approach (Smooth Particle Hydrodynamics) on hybrid shared / distributed-memory architectures. SWIFT was designed from the bottom up to provide excellent strong scaling on both commodity clusters (Tier-2 systems) and Top100-supercomputers (Tier-0 systems), without relying on architecture-specific features or specialized accelerator hardware. This performance is due to three main computational approaches: • Task-based parallelism for shared-memory parallelism, which provides fine-grained load balancing and thus strong scaling on large numbers of cores. • Graph-based domain decomposition, which uses the task graph to decompose the simulation domain such that the work, as opposed to just the data, as is the case with most partitioning schemes, is equally distributed across all nodes. • Fully dynamic and asynchronous communication, in which communication is modelled as just another task in the task-based scheme, sending data whenever it is ready and deferring on tasks that rely on data from other nodes until it arrives. In order to use these approaches, the code had to be re-written from scratch, and the algorithms therein adapted to the task-based paradigm. As a result, we can show upwards of 60% parallel efficiency for moderate-sized problems when increasing the number of cores 512-fold, on both x86-based and Power8-based architectures.


INTRODUCTION
For the past decade physical limitations have kept the speed of individual processor cores constrained, so instead of getting faster, computers are getting more parallel.Systems containing up to 64 general-purpose cores are becoming commonplace, and the number of cores can be expected to continue growing exponentially, e.g.following Moore's law, much in the same way processor speeds were up until a few years ago.
As a consequence, in order to get faster, computer programs need to rely on better exploitation of this massive parallelism, i.e. they need to exhibit strong scaling, the ability to run roughly N times faster when executed on N times as many processors.Without strong scaling, massive parallelism can still be used to tackle ever larger problems, but not make fixed-size problems faster.
Although this switch from growth in speed to growth in parallelism has been anticipated and observed for quite some time, very little has changed in terms of how we design and implement parallel computations.Branch-and-bound synchronous parallelism using OpenMP [9] and MPI [23], as well as domain decompositions based on geometry or space-filling curves [27] are still commonplace, despite both the architectures and problem scales having changed dramatically since their introduction.
The design and implementation of SWIFT 1  [15,25,13], a largescale cosmological simulation code built from scratch, provided the perfect opportunity to test some newer approaches, i.e. taskbased parallelism, fully asynchronous communication, and graph partition-based domain decompositions.
This paper describes these techniques, which are not exclusive to cosmological simulations or any specific architecture, as well as the results obtained with them.While [15,13] already describes the underlying algorithms in more detail, in this paper we focus on the parallelization strategy and the results obtained on larger Tier-0 systems.

PARTICLE-BASED HYDRODYNAMICS
Smoothed Particle Hydrodynamics [12,20] (SPH) uses particles to represent fluids.Each particle pi has a position xi, velocity vi, internal energy ui, mass mi, and a smoothing length hi.The particles are used to interpolate any quantity Q at any point in space as a weighted sum over the particles: where Qi is the quantity at the ith particle, h is the smoothing length, i.e. the radius of the sphere within which data will be considered for the interpolation, and W (r, h) is the smoothing kernel or smoothing function.Several different forms for W (r, h) exist, each with their own specific benefits and drawbacks.
The particle density ρi used in ( 1) is itself computed similarly: where rij = r i − r j is the Euclidean distance between particles pi and pj.In compressible simulations, the smoothing length hi of each particle is chosen such that the number of neighbours with which it interacts is kept more or less constant, and can result in smoothing lengths spanning several orders of magnitudes within the same simulation.
Once the densities ρi have been computed, the time derivatives of the velocity and internal energy, which require ρi, are computed as follows: where ĥij = max{hi, hj}, and the particle pressure Pi = ρiui(γ− 1) and correction term Ωi = 1 + h i 3ρ i ∂ρ ∂h are computed on the fly.The computations in (2), (3), and (4) involve finding all pairs of particles within range of each other.Any particle pj is within range of a particle pi if the distance between pi and pj is smaller or equal to the smoothing distance hi of pi, e.g. as is done in (2).Note that since particle smoothing lengths may vary between particles, this association is not symmetric, i.e. pj may be in range of pi, but pi not in range of pj.If rij < max{hi, hj}, as is required in (3), then particles pi and pj are within range of each other. 1 SWIFT is an open-source software project and the latest version of the source code, along with all the data needed to run the test cased presented in this paper, can be downloaded at www.swiftsim.com.
The computation thus proceeds in two distinct stages that are evaluated separately: 1. Density computation: For each particle pi, loop over all particles pj within range of pi and evaluate (2).
2. Force computation: For each particle pi, loop over all particles pj within range of each other and evaluate (3) and (4).
Finding the interacting neighbours for each particle constitutes the bulk of the computation.Many codes, e.g. in Astrophysics simulations [12], rely on spatial trees for neighbour finding [12,16,24,26], i.e. k-d trees [6] or octrees [19] are used to decompose the simulation space.In Astrophysics in particular, spatial trees are also a somewhat natural choice as they are used to compute long-range gravitational interactions via a Barnes-Hut [5] or Fast Multipole [8] method.The particle interactions are then computed by traversing the list of particles and searching for their neighbours in the tree.In current state-of-the-art simulations (e.g.[22]), the gravity calculation corresponds to roughly one quarter of the calculation time whilst the hydrodynamics scheme takes approximately half of the total time.The remainder is spent in the astrophysics modules, which contain interactions of the same kind as the SPH sums presented here.Gravity could, however, be vastly sped-up compared to commonly used software by employing more modern algorithms such as the Fast Multipole Method [8].
Although such tree traversals are trivial to parallelize, they have several disadvantages, e.g. with regards to computational efficiency, cache efficiency, and exploiting symmetries in the computation (see [13] for a more detailed analysis).

PARALLELIZATION STRATEGY
One of the main concerns when developing SWIFT was to break with the branch-and-bound type parallelism inherent to parallel codes using OpenMP and MPI, and the constant synchronization between computational steps it results in.
If synchronization is the main problem, then asynchronicity is the obvious solution.We therefore opted for a task-based approach for maximum single-node, or shared-memory, performance.This approach not only provides excellent load-balancing on a single node, it also provides a powerful model of the computation that can be used to distribute the work equitably over a set of distributedmemory nodes using general-purpose graph partitioning algorithms.Finally, the necessary communication between nodes can itself be modelled in a task-based way, interleaving communication seamlessly with the rest of the computation.

Task-based parallelism
Task-based parallelism is a shared-memory parallel programming paradigm in which a computation is broken-down in to a set of tasks which can be executed concurrently.In order to ensure that the tasks are executed in the right order, e.g. that data needed by one task is only used once it has been produced by another task, dependencies between tasks are specified and strictly enforced by a task scheduler.Additionally, if two tasks require exclusive access to the same resource, yet in no particular order, they are treated as conflicts and the scheduler ensures that they are not executed concurrently.Computations described in this way then parallelize trivially: each processor repeatedly grabs a task for which all dependencies have been satisfied and executes it until there are no tasks left.
The main advantages of using a task-based approach are • The order in which the tasks are processed, and how they are assigned to each processor is completely dynamic and adapts automatically to load imbalances.
• If the dependencies and conflicts are specified correctly, there is no need for expensive explicit locking, synchronization, or atomic operations to deal with most concurrency problems.
• Each task has exclusive access to the data it is working on, thus improving cache locality and efficiency.
Since none of these existing taks-based libraries provided the flexibility required to experiment with different scheduling and communication techniques, (SWIFT is an interdisciplinary effort between Computer Science and Astrophysics to study not only cosmological phenomena, but also novel simulation algorithms and parallel computing techniques) we chose to implement our own task scheduler in SWIFT, which has since been back-ported as the general-purpose QUICKSCHED task scheduler [14].In QUICKSCHED and SWIFT, task dependencies are specified explicitly, as opposed to being implicitly derived from data dependencies, allowing us to more easily build complex task hierarchies.This also allowed us to extend the scheduler with the concept of task conflicts and integrate the asynchronous communication scheme described further on.
Despite its advantages, and the variety of implementations, taskbased parallelism is rarely used in practice (notable exceptions include the PLASMA project [2] which uses QUARK/StarPU, and the deal.II project [4] which uses Intel's TBB).The main problem is that to effectively use task-based parallelism, most computations need to be completely redesigned to fit the paradigm, which is usually not an option for large and complex codebases.
Since we were re-implementing SWIFT from scratch, this was not an issue.The tree-based neighbour-finding described above was replaced with a more task-friendly approach as described in [13], in which the domain is first decomposed into a grid of cells of edge length larger or equal to the largest particle radius.An initial set of interaction tasks is then defined over all cells and pairs of neighbouring cells, such that if two particles are close enough to interact, they are either in the same cell or they span a pair of neighbouring cells.These initial interaction tasks are then refined by recursively splitting cells that contain more than a certain number of particles and replacing tasks that span a pair of split cells with tasks spanning the neighboring sub-cells.The resulting refined set of tasks contains all the cells and pairs of cells over which particle interactions must be computed.
The dependencies between the tasks are set following equations ( 2), (3), and (4), i.e. such that for any cell, all the tasks computing the particle densities therein must have completed before the particle forces can be computed, and all the force computations must have completed before the particle velocities may be updated.The task hierarchy is shown in Fig. 1, where the particles in each cell are first sorted (round tasks) before the particle densities are computed (first layer of square tasks).Ghost tasks (triangles) are used to ensure that all density computations on a cell of particles have completed before the force evaluation tasks (second layer of square tasks) execute.Once all the force tasks on a cell of particles have completed, the integrator tasks (inverted triangles) update the particle positions and velocities.The decomposition was computed such that each cell contains ∼ 100 particles, which leads to tasks of up to a few milliseconds each.
Due to the cache-friendly nature of the task-based computations, and their ability to exploit symmetries in the particle interactions, Although the data for the yellow cell resides on Node 2, it is required for some tasks on Node 1, and thus needs to be copied over during the computation using send/recv tasks (diamond-shaped).Figure adapted from [13].
the task-based approach is already more efficient than the treebased neighbour search on a single core, and scales efficiently to all cores of a shared-memory machine [13].

Task-based domain decomposition
Given a task-based description of a computation, partitioning it over a fixed number of ranks (using the MPI terminology) is relatively straight-forward: we create a cell hypergraph in which: • Each node represents a single cell of particles, and, • each edge represents the tasks, connecting the cells.
Since in the particular case of SWIFT each task references at most two cells, the cell hypergraph is just a regular cell graph.
Any partition of the cell graph represents a partition of the computation, i.e. the nodes belonging to each partition each belong to a rank, and the data belonging to each cell resides on the partition/rank to which it has been assigned.Any task spanning cells that belong to the same partition needs only to be evaluated on that rank/partition, and tasks spanning more than one partition need to be evaluated on both ranks/partitions.
If we then weight each edge with the computational cost associated with the tasks, then finding a good cell distribution reduces to finding a partition of the cell graph such that the maximum sum of the weight of all edges within and spanning in a partition is minimal (see Fig. 2).Since the sum of the weights is directly proportional to the amount of computation per rank/partition, minimizing the maximum sum corresponds to minimizing the time spent on the slowest rank.Computing such a partition is a standard graph prob-Figure 2: Illustration of the task-based domain decomposition in which the tasks (circles) are hyperedges that connect one or more resources (rectangles).The resources are partitioned along the thick dotted line.The blue and orange tasks are executed on the respective partitions, whereas the green tasks/hyperedges along the cut line are executed on both.The cost of this partition is the sum of the green tasks, which are computed twice, as well as the cost imbalance of the tasks executed in each partition.lem and several software libraries which provide good solutions 2 , e.g.METIS [18] and Zoltan [10], exist.
In SWIFT, the graph partitioning is computed using the METIS library.The cost of each task is initially approximated via the asymptotic cost of the task type and the number of particles involved.After a task has been executed, it's effective computational cost is computed and used.
Note that this approach does not explicitly consider any geometric constraints, or strive to partition the amount of data equitably.The only criteria is the computational cost of each partition, for which the task decomposition provides a convenient model.We are therefore partitioning the computation, as opposed to just the data.
Note also that the proposed partitioning scheme takes neither the task hierarchy, nor the size of the data that needs to be exchanged between partitions/ranks into account.This approach is therefore only reasonable in situations in which the task hierarchy is wider than flat, i.e. the length of the critical path in the task graph is much smaller than the sum of all tasks, and in which communication latencies are negligible.

Asynchronous communications
Although each particle cell resides on a specific rank, the particle data will still need to be sent to any neighbouring ranks that have tasks that depend on this data, e.g. the the green tasks in Fig. 2.This communication must happen twice at each time-step: once to send the particle positions for the density computation, and then again once the densities have been aggregated locally for the force 2 Computing the optimal partition for more than two nodes is considered NP-hard.

computation.
Most distributed-memory codes based on MPI [23] separate computation and communication into distinct steps, i.e. all the ranks first exchange data, and only when the data exchange is complete does computation start.Further data exchanges only happen once computation has finished, and so on.This approach, although conceptually simple and easy to implement, has three major drawbacks: • The frequent synchronization points between communication and computation exacerbate load imbalances, • the communication phase consists mainly of waiting on latencies, during which the node's CPUs usually run idle, and • during the computation phase, the communication network is left completely unused, whereas during the communication phase, all ranks attempt to use it at the same time.
It is for these reasons that in SWIFT we opted for a fully dynamic and asynchronous communication model in which local data is sent whenever it is ready, data from other ranks is only acted on once it has arrived, and there is no separation into communication and computational phases.In practice this means that no rank will sit idle waiting on communication if there is any computation that can be done.
This fits in quite naturally within the task-based framework by modelling communication as just another task type, i.e. adding tasks that send and receive particle data between ranks.For every task that uses data that resides on a different rank, send and recv tasks are generated automatically on the source and destination ranks respectively.At the destination, the task is made dependent of the recv task, i.e. the task can only execute once the data has actually been received.This is illustrated in Fig. 1, where data is exchanged across two ranks for the density and force computations and the extra dependencies are shown in red.
The communication itself is implemented using the non-blocking MPI_Isend and MPI_Irecv primitives to initiate communication, and MPI_Test to check if the communication was successful and resolve the communication task's dependencies.In the taskbased scheme, strictly local tasks which do not rely on communication tasks are executed first.As data from other ranks arrive, the corresponding non-local tasks are unlocked and are executed whenever a thread picks them up.One direct consequence of this approach is that instead of a single send/recv call between each pair of neighbouring ranks, one such pair is generated for each particle cell.This type of communication, i.e. several small messages instead of one large message, is usually strongly discouraged since the sum of the latencies for the small messages is usually much larger than the latency of the single large message.This, however, is of no concern in SWIFT since nobody is actively waiting to receive the messages in order, and the communication latencies are covered by local computations.A nice side-effect of this approach is that communication no longer happens in bursts involving all the ranks at the same time, but is more or less evenly spread over the entire computation, and is therefore less demanding of the communication infrastructure.

SCALING TESTS
In this section we present some strong scaling tests of the SWIFT code on different architectures for a representative cosmology problem.In order to provide a realistic setup, the initial particle distributions used in our tests were extracted by resampling low-redshift outputs of the EAGLE project [22], a large suite of state-of-the-art cosmological simulations.By selecting outputs at late times (redshift z = 0.5), we constructed a simulation setup which is representative of the most expensive part of these simulations, i.e. when the particles are highly-clustered and no longer uniformly distributed.This distribution of particles is shown on Fig. 3.In order to fit our simulation setup into the limited memory of some of the systems tested, we have randomly down-sampled the particle count of the output to 8003 = 5.12 × 10 8 , 600 3 = 2.16 × 10 8 and 376 3 = 5.1 × 107 particles with periodic boundary conditions respectively.Scripts to generate these initial conditions are provided with the source code.We then run the SWIFT code for 100 time-steps and average the wall clock time of these time-steps after having removed the first and last ones, where disk I/O occurs.

Simulation setup
The initial load balancing between nodes is computed in the first steps and re-computed every 100 time steps, and is therefore not included in the timings.Particles were exchanged between nodes whenever they strayed too far beyond the cells in which they originally resided.
On all the machines, the code was compiled out of the box, without any tuning, explicit vectorization, or exploiting any other specific features of the underlying hardware.

x86 architecture: COSMA-5
For our first test, we ran SWIFT on the COSMA-5 DiRAC2 Data Centric System 3 located at the University of Durham.The system consists of 420 nodes with 2 Intel Sandy Bridge-EP Xeon E5-26704  This system is similar to many Tier-2 systems available in most universities or computing facilities.Demonstrating strong scaling on such a machine is essential to show that the code can be efficiently used even on the type of commodity hardware available to most researchers.
The code was compiled with the Intel compiler version 2016.0.1 and linked to the Intel MPI library version 5.1.2.150 and METIS library version 5.1.0.
The simulation setup with 376 3 particles was run on that system using 1 to 256 threads on 1 to 16 nodes, the results of which are shown on Fig. 5.For this strong scaling test, we used one MPI rank per node and 16 threads per node (i.e. one thread per physical core).We also ran on one single node using up to 32 threads, i.e. up to one thread per physical and virtual core.Fig. 4 shows the domain decomposition obtained via the task-graph decomposition algorithm described above for the 16 node run.

x86 architecture: SuperMUC
For our next test, we ran SWIFT on the SuperMUC x86 phase 1 thin nodes5 located at the Leibniz Supercomputing Center in Garching near Munich, currently ranked 23rd in the Top500 list 6 .This system consists of 9 216 nodes with 2 Intel Sandy Bridge-EP Xeon E5-2680 8C 7   Using 16 threads per node (no use of hyper-threading) with one MPI rank per node, a good parallel efficiency (60%) is achieved when increasing the thread count from 1 (1 node) to 128 (8 nodes) even on this relatively small test case.The dashed line indicates the efficiency when running on one single node but using all the physical and virtual cores (hyper-threading).As these CPUs only have one FPU per core, the benefit from hyper-threading is limited to a 20% improvement when going from 16 cores to 32 hyperthreads.which communications are handled via an Infiniband FDR10 nonblocking Tree.These islands are then connected using a 4:1 Pruned Tree.
This system is similar in nature to the COSMA-5 system used in the previous set of tests but is much larger, allowing us to test scalability at a much larger scale.
The code was compiled with the Intel compiler version 2015.5.223 and linked to the Intel MPI library version 5.1.2.150 and METIS library version 5.0.2.
The simulation setup with 800 3 particles was run using 16 to 2048 nodes (4 islands) and the results of this strong scaling test are shown in Fig. 6.For this test, we used one MPI rank per node and 16 threads per node, i.e. one thread per physical core.

Blue Gene architecture: JUQUEEN
For our last set of tests, we ran SWIFT on the JUQUEEN IBM Blue Gene/Q system8 located at the Jülich Supercomputing Center, currently ranked 11th in the Top500 list.This system consists of 28 672 nodes with an IBM PowerPC A2 processor running at 1.6 GHz and 16 GByte of RAM each.Of notable interest is the presence of two floating units per compute core.The system is composed of 28 racks containing each 1 024 nodes.The network uses a 5D torus to link all the racks.
This system is larger than the SuperMUC supercomputer described above and uses a completely different processor and instruction set.We use it here to demonstrate that our results are not dependant on the hardware being used.
The code was compiled with the IBM XL compiler version 30.73.0.13 and linked to the corresponding MPI and METIS library versions 4.0.2.
The simulation setup with 600 3 particles was first run using 512 nodes with one MPI rank per node and varying only the number of threads per node.The results of this test are shown in Fig. 7.
We later repeated the test, this time varying the number of nodes from 32 to 8192 (8 racks).For this test, we used one MPI rank per node and 32 threads per node, i.e. two threads per physical core.The results of this strong scaling test are shown in Fig. 8.

DISCUSSION & CONCLUSIONS
The strong scaling results presented in the previous sections on three different machines demonstrate the ability of our framework to scale on both small commodity machines, thanks to the use of task-based parallelism at the node level, and on the largest machines (Tier-0 systems) currently available, thanks to the task-based domain distribution and asynchronous communication schemes.We would like to emphasize that these results were obtained for a realistic test case without any micro-level optimization or explicit vectorization.
Excellent strong scaling is also achieved when increasing the number of threads per node (i.e. per MPI rank, see fig.7), demonstrating that the description of MPI (asynchronous) communications as tasks within our framework is not a bottleneck.One common conception in HPC is that the number of MPI communications between nodes should be kept to a minimum to optimize the efficiency of the calculation.Our approach does exactly the opposite with large number of point-to-point communications between pairs of nodes occurring over the course of a time-step.For example, on the SuperMUC machine with 32 nodes (512 cores), each MPI rank contains approximately 1.6 × 10 7 particles in 2.5 × 10 5 cells.SWIFT will generate around 58 000 point-to-point asynchronous MPI communications (a pair of send and recv tasks) per node and per timestep.Each one of these communications involves, on average, no more than 6 kB of data.Such an insane number of small messages is discouraged by most practitioners, but seems to works well in practice.Dispatching communications over the course of the calculation and not in short bursts, as is commonly done, may also help lower the load on the network.
One time-step on   63 ms of wall-clock time.All the loading of the tasks, communications and running of the tasks takes place in that short amount of time.Our framework can therefore load-balance a calculation over 2.6 × 10 5 threads with remarkable efficiency.
We emphasize, as was previously demonstrated in [13], that SWIFT is also much faster than the GADGET-2 code [24], the de-facto standard in the field of particle-based cosmological simulations.The simulation setup that was run on the COSMA-5 system takes 2.9 s of wall-clock time per time-step on 256 cores using SWIFT whilst the default GADGET-2 code on exactly the same setup with the same number of cores requires 32 s.The excellent scaling performance of SWIFT allows us to push this number further by simply increasing the number of cores, whilst GADGET-2 reaches its peak speed (for this problem) at around 300 cores and stops scaling beyond that.This unprecedented scaling ability combined with future work on vectorization of the calculations within each task will hopefully make SWIFT an important tool for future simulations in cosmology and help push the entire field to a new level.
These results, which are in no way restricted to astrophysical simulations, provide a compelling argument for moving away from the traditional branch-and-bound paradigm of both shared and distributed memory programming using synchronous MPI and OpenMP.Although fully asynchronous methods, due to their somewhat anarchic nature, may seem more difficult to control, they are conceptually simple and easy to implement 9 .The real cost of using a task-based approach comes from having to rethink the entire computation to fit the task-based setting.This may, as in our specific

Figure 1 :
Figure 1: Task hierarchy for the SPH computations in SWIFT, including communication tasks.Arrows indicate dependencies, i.e. an arrow from task A to task B indicates that A depends on B.The task color corresponds to the cell or cells it operates on, e.g. the density and force tasks work on individual cells or pairs of cells.The task hierarchy is shown for three cells: blue, yellow, and purple.Although the data for the yellow cell resides on Node 2, it is required for some tasks on Node 1, and thus needs to be copied over during the computation using send/recv tasks (diamond-shaped).Figure adapted from[13].

Figure 3 :
Figure 3: The initial density field computed from the initial particle distribution used for our tests.The density ρi of the particles spans 8 orders of magnitude, requiring smoothing lengths hi changing by a factor of almost 1000 across the simulation volume.

Figure 4 :
Figure 4: The particles for the initial conditions shown on Fig. 3 colored according to the node they belong to after a loadbalancing call on 32 nodes.As can be seen, the domain decomposition follows the cells in the mesh but is not made of regular cuts.Domains have different shapes and sizes.
at 2.7 GHz CPUS.Each 16-core node has 32 GByte of RAM.The nodes are split in 18 "islands" of 512 nodes within With hyper − threading on a single node On multiple nodes SWIFT Strong scaling on Cosma-5 with 51M particles from 1 to 256 threads

Figure 5 :
Figure 5: Strong scaling test on the COSMA-5 machine (see text for hardware description).Left panel: Code Speed-up.Right panel: Corresponding parallel efficiency.Using 16 threads per node (no use of hyper-threading) with one MPI rank per node, a good parallel efficiency (60%) is achieved when increasing the thread count from 1 (1 node) to 128 (8 nodes) even on this relatively small test case.The dashed line indicates the efficiency when running on one single node but using all the physical and virtual cores (hyper-threading).As these CPUs only have one FPU per core, the benefit from hyper-threading is limited to a 20% improvement when going from 16 cores to 32 hyperthreads.

Figure 7 :
Figure 7: Strong scaling test of the hybrid component of the code.The same calculation is performed on 512 node of the JUQUEEN Blue Gene supercomputer (see text for hardware description) using a single MPI rank per node and varying only the number of threads per node.The code displays excellent scaling even when all the cores and hardware multi-threads are in use.
SWIFT Strong scaling on SuperMUC with 512M particles from 16 to 2048 nodes and 16 threads per node Figure 6: Strong scaling test on the SuperMUC phase 1 machine (see text for hardware description).Left panel: Code Speed-up.Right panel: Corresponding parallel efficiency.Using 16 threads per node (no use of hyper-threading) with one MPI rank per node, an almost perfect parallel efficiency is achieved when increasing the node count from 16 (256 cores) to 2 048 (32 768 cores).Figure 8: Strong scaling test on the JUQUEEN Blue Gene machine (see text for hardware description).Left panel: Code Speed-up.Right panel: Corresponding parallel efficiency.Using 32 threads per node (2 per physical core) with one MPI rank per node, a parallel efficiency of more than 60% is achieved when increasing the node count from 32 (512 cores) to 8 192 (131 072 cores).On 8 192 nodes there are fewer than 27 000 particles per node and only a few hundred tasks, making the whole problem extremely hard to load-balance effectively.