Randomized diffusion for indivisible loads

We present a new randomized diffusion-based algorithm for balancing indivisible tasks (tokens) on a network. Our aim is to minimize the discrepancy, i.e., the difference between the maximum and the minimum load. The algorithm works as follows. Every vertex distributes its tokens as evenly as possible among its neighbors and itself. If this is not possible without splitting some tokens, the vertex redistributes its excess tokens among itself and its neighbors randomly (without replacement).
In this paper we prove several upper bounds on the load discrepancy for general networks. These bounds depend on expansion properties of the network, namely the second largest eigenvalue (in absolute value) of the diffusion matrix, and on a novel measure which we refer to as the refined local divergence. We then apply these general bounds to obtain results for specific networks. For constant-degree expanders and torus graphs, they yield exponential improvements on the discrepancy bounds compared to the algorithm of Rabani, Sinclair, and Wanka [14]. For hypercubes we obtain a polynomial improvement.
In contrast to previous papers, our algorithm is vertex-based rather than edge-based. This means excess tokens are assigned to vertices instead of to edges, and each vertex reallocates all of its excess tokens by itself. This approach avoids nodes having "negative loads" (as in [8, 10]), but introduces additional dependencies into the analysis.


Introduction
In recent years, large parallel networks have become widely available to industrial and academic users. An important prerequisite for their efficient usage is to balance the workload efficiently. Load balancing also has applications in scheduling, routing, numerical computation, and finite element computations.
In this paper we analyze a very simple neighborhood-based load balancing algorithm. We assume that the processors are connected by an arbitrary d-regular network. In the beginning, every vertex has a certain number of tokens (load). The goal is to distribute the tokens as evenly as possible. More precisely, we aim at minimizing the difference between the minimum load and the maximum load, which we call the discrepancy.
In general, neighborhood-based load balancing algorithms operate in parallel steps (sometimes also called rounds). In each step, every processor is allowed to probe the load of all of its neighbors (diffusion load balancing) or to probe the load of one neighbor (dimension exchange). Then each processor decides how much load to forward to its neighbors. In this paper, we consider a very natural diffusion-based approach. In the continuous diffusion model, where tokens can be split arbitrarily, the method works as follows. Along each edge, a load of (load difference)/(d + 1) is sent from the vertex with the higher load to the vertex with the lower load. Note that this method balances the load perfectly if the number of steps is sufficiently large. Here we consider the (arguably more realistic [15]) case of discrete diffusion where tokens are indivisible. Quantifying by how much the integrality assumption decreases the efficiency of load balancing is an interesting question and has been posed by many authors (e.g., [8, 11, 12, 13, 14, 15]).
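As an illustration (our own sketch, not taken from the paper), one step of the continuous first-order scheme can be written in a few lines; the graph, initial loads, and step count below are made-up example values.

```python
# Sketch of one step of continuous diffusion on a d-regular graph:
# along each edge, a load of (load difference)/(d + 1) flows from the
# heavier endpoint to the lighter one. Equivalently, the load vector
# is multiplied by the diffusion matrix P.
def diffusion_step(load, neighbors, d):
    new_load = list(load)
    for i in range(len(load)):
        for j in neighbors[i]:
            if j > i:  # handle each undirected edge exactly once
                flow = (load[i] - load[j]) / (d + 1)
                new_load[i] -= flow
                new_load[j] += flow
    return new_load

# Example: a 4-cycle (d = 2) with all load initially on vertex 0.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
load = [12.0, 0.0, 0.0, 0.0]
for _ in range(200):
    load = diffusion_step(load, neighbors, d=2)
# The total load is conserved and the discrepancy shrinks toward 0.
```

On this small cycle the load converges quickly to the uniform vector; on general graphs the convergence rate is governed by λ_max, as discussed below.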
Most results known so far ([9, 10, 14]) employ an edge-oriented view where each edge decides between forwarding either ⌈(load difference)/(d + 1)⌉ or ⌊(load difference)/(d + 1)⌋ tokens (referred to as rounding up or rounding down). Rounding up results in a load balancing algorithm that keeps sending tokens back and forth between processors with a small load difference. Another disadvantage is that the approach can generate "negative loads" for vertices with only a few tokens. On the other hand, always rounding down results in a discrepancy of up to d·diam(G), where diam(G) denotes the diameter of the underlying graph G. To overcome these problems, we adopt a vertex-oriented view in this paper where the vertices (not the edges) decide randomly how many tokens they send.
1.1 Related Work. Due to the vast amount of literature on load balancing, we consider only previous work dealing with diffusion load balancing or randomized algorithms for neighborhood-based load balancing. We do not consider literature on the dimension exchange model in general, or literature on the token distribution model.
Continuous Diffusion. The diffusion model was first studied by Cybenko [3] and, independently, Boillat [1]. Cybenko [3] (see also [13, 15]) shows a tight connection between the convergence rate of the diffusion algorithm and the absolute value of the second largest eigenvalue λ_max of the diffusion matrix P with P_{i,j} = 1/(d + 1) if {i, j} ∈ E. Subramanian and Scherson [15] observe similar relations between the convergence time and certain properties of the underlying network such as electrical and fluid conductance.
Muthukrishnan et al. [13] refer to the above diffusion model as the first order scheme and generalize it to the so-called second order scheme. Here the load transferred over an edge (i, j) in step t does not only depend on the load difference of i and j, but also on the amount of load transferred over the edge in step t − 1. Diekmann, Frommer, and Monien [4] extend the idea of [13] and propose a general framework for analyzing the convergence behavior of a wide range of diffusion-type methods.
Discrete Diffusion. Rabani et al. [14] consider the diffusion algorithm that always rounds down (called the RSW algorithm in the following). They approximate the idealized, continuous process by this process with indivisible load. To quantify the deviation of the discrete load process from the idealized process, they propose a natural measure, the local divergence Ψ_1. The local divergence measures the sum of load differences across all edges in the network, aggregated over time. They give a general bound on Ψ_1 in terms of λ_max (the second largest eigenvalue in absolute value of the diffusion matrix). By a more careful analysis, they also obtain improved bounds on Ψ_1 for torus graphs, resulting in tight bounds on the discrepancy achieved by their algorithm.
Discrete Load Balancing via Random Walks.Elsässer et al. [7,8] propose an algorithm based on random walks.
They show that after O(log(Kn)/(1 − λ_max)) steps, the maximum load is at most the average load plus a constant [7]. In comparison to our algorithm, their algorithm is more complicated and not a simple diffusion-type algorithm. For example, vertices require an estimate of n and they have to compute the average load during the balancing procedure. Moreover, the final stage uses concurrent random walks (representing tokens) to reduce the maximum load. In this stage, the load transfer along an edge may be much smaller (or larger) than (load difference)/(d + 1).
Discrete Neighborhood Load Balancing with Randomization.
In [9], the authors consider a dimension-exchange algorithm using randomly or deterministically generated matchings. Their algorithm randomly decides whether to round up or down. For detailed results see Table 1. Note that an algorithm in the dimension-exchange model is typically much easier to analyze than diffusion algorithms, since every node exchanges load with at most one neighbor. In [10], the authors analyze a deterministic modification of the standard diffusion algorithm for hypercubes and constant-dimensional tori. The idea is that each edge keeps track of its own rounding errors. In each step, an edge's decision to round up or down is made in a way that minimizes the sum of its rounding errors. Again, the detailed results can be found in Table 1. The authors of [10] also consider a randomized version of the diffusion algorithm. Their approach is edge-based; more precisely, edges decide independently at random whether to round up or down. They present a general upper bound for their approach in terms of λ_max. Note that both algorithms in [10] may generate negative load due to the edge-based rounding.
Source of Inspiration. We wish to point out that our work was inspired by recent combinatorial results regarding so-called rotor-router walks [2, 5]. Unlike in a random walk, in a rotor-router walk each vertex serves its neighbors in a fixed order. The resulting (completely deterministic) walk nevertheless closely resembles a random walk in several respects. Similarly, one can say that in each round of our load-balancing algorithm a vertex chooses a random order of its neighbors (and itself) and sends around all its tokens in this order in a round-robin fashion.
1.2 Our Contribution. Algorithm. We consider a vertex-based randomized diffusion algorithm for the discrete model with indivisible tokens. Let d be the degree of the (regular) network and let X_i be the load of vertex i. Our algorithm works as follows. First, vertex i sends ⌊X_i/(d + 1)⌋ tokens to each neighbor and keeps the same number of tokens for itself. Then the remaining X_i − (d + 1)·⌊X_i/(d + 1)⌋ tokens (called excess tokens) are randomly distributed (without replacement) among vertex i and its d neighbors.
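A minimal sketch of one round of this algorithm (our own illustration; the example graph and initial loads are made-up values):

```python
import random

# One round of the vertex-based randomized diffusion described above:
# vertex i first sends floor(X_i/(d+1)) tokens to each neighbor (and
# keeps the same amount), then assigns its X_i - (d+1)*floor(X_i/(d+1))
# excess tokens to destinations drawn uniformly at random, without
# replacement, from {i} union N(i).
def randomized_diffusion_round(load, neighbors, d, rng=random):
    new_load = [0] * len(load)
    for i, x_i in enumerate(load):
        share = x_i // (d + 1)
        new_load[i] += share
        for j in neighbors[i]:
            new_load[j] += share
        excess = x_i - (d + 1) * share  # between 0 and d
        for dest in rng.sample([i] + neighbors[i], excess):
            new_load[dest] += 1
    return new_load

# Example: a 4-cycle (d = 2) with 16 tokens in total.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
load = [10, 0, 5, 1]
load = randomized_diffusion_round(load, neighbors, d=2)
# The total number of tokens is conserved and no vertex goes negative.
```

Note how the rounding is done per vertex, not per edge: a vertex never sends more tokens than it holds, which is exactly why negative loads cannot occur.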
Results. To state our results formally, let τ(G, K) = O(log(Kn)/(1 − λ_max)) be the number of steps after which the continuous process achieves a constant discrepancy for any initial load distribution with discrepancy K (cf. Theorem 2.2, [14]). All our bounds on the discrepancy are independent of the initial load vector and hold with high probability (w.h.p.), i.e., with probability at least 1 − n^{−Ω(1)}.
Theorem 1.1. Let G be an arbitrary d-regular graph and let K be the initial discrepancy. Then, w.h.p., the discrepancy after τ(G, K) steps is bounded in terms of the refined local divergence Υ_2(G) (Definition 2.2). Note that Υ_2(G) is in general much smaller than Ψ_1(G) for any graph G. The improvement is due to the more balanced reallocation of the excess tokens achieved by our randomized approach. The next theorem provides more specific bounds on the discrepancy. It is derived by first bounding Υ_2(G) and then applying Theorem 1.1. Let us compare our results to the RSW algorithm (see [14]), since that algorithm is also a very natural diffusion algorithm that avoids negative loads. For d-regular expanders, [14] proves a discrepancy bound of O(d log n) after τ(G, K) rounds. This is almost tight, as d·diam(G) is a simple lower bound for the RSW algorithm. Hence for small d, we obtain an exponential improvement in terms of the discrepancy.
For the r-dimensional torus graph, [14, Theorem 8] proves a bound of O(n^{1/r}) on the discrepancy after τ(G, K) rounds. This is tight due to the lower bound of diam(G). Again, our new algorithm achieves an exponential improvement. For the hypercube with n vertices, [14, Theorem 4] implies a discrepancy bound of O(log^3 n) after τ(G, K) rounds.
The techniques used to analyze our new algorithm can also be used to prove a tight bound of Θ(log^2 n) on the discrepancy of the RSW algorithm. For our new algorithm we obtain a smaller discrepancy bound of O(log n).
Techniques. The key ingredient of the analysis in [9, 10, 14] is "an appropriate edge-oriented view of the rounding errors in each balancing step, which allows them to be handled independently" (as stated by Rabani et al. [14]). The problem with vertex-oriented algorithms is the dependencies between the rounding results for edges incident to the same vertex. To deal with these dependencies, we use a different analysis compared to [9, 10]. Our analysis is based on martingale tail estimates. The other main technical contribution is the use of the new parameter Υ_2(G) (Definition 2.2) in place of the local divergence Ψ_1(G) as used in [14].

Algorithms and Notation
We use standard graph-theoretical notation. Let G = (V, E) be a connected, undirected, d-regular, simple graph with vertex set [n] := {1, 2, …, n}. The neighborhood of a vertex i is denoted by N(i). For a pair of vertices i, j ∈ V(G), let dist(i, j) be the length of a shortest path between i and j, and diam(G) the diameter of G. [i : j] refers to an edge {i, j} ∈ E with i < j. Every vertex in the graph holds a certain number of load items (tokens). We assume that the load is indivisible and that each token has unit size.
We denote by P the transition matrix, i.e., P_{i,j} = 1/(d + 1) if {i, j} ∈ E or i = j, and P_{i,j} = 0 otherwise. We will often use P^t, the t-th power of the matrix P. Note that P^t_{i,j} can also be seen as the probability that a random walk started at vertex i is located at vertex j after t steps.
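This random-walk interpretation can be sanity-checked numerically (our own sketch; the 4-cycle and step count are arbitrary example choices):

```python
import numpy as np

# P^t_{i,j} is the probability that a lazy walk (move to each neighbor
# or stay put, each with probability 1/(d+1)) started at i sits at j
# after t steps. We verify this on the 4-cycle by evolving the walk's
# distribution step by step and comparing against row i of P^t.
n, d = 4, 2
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
P = np.zeros((n, n))
for i in range(n):
    for j in [i] + neighbors[i]:
        P[i, j] = 1.0 / (d + 1)

walk = np.zeros(n)
walk[0] = 1.0            # the walk starts at vertex 0
for _ in range(5):
    walk = walk @ P      # one step of the distribution
Pt = np.linalg.matrix_power(P, 5)
# walk coincides with row 0 of P^5, and every row of P^t sums to 1.
```

Since P is doubly stochastic, each row of P^t is itself a probability distribution, which is what the final comment asserts.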
For estimating the convergence of our processes, the absolute value of the second largest eigenvalue of P plays a crucial role. Let us denote the eigenvalues of P by 1 = λ_1 ≥ λ_2 ≥ λ_3 ≥ … ≥ λ_n > −1 and define λ_max := max{λ_2, |λ_n|}.
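As a concrete illustration (our own sketch, using the cycle C_n as an assumed example graph), λ_max can be computed directly from the spectrum of P:

```python
import numpy as np

# Build the diffusion matrix P (entries 1/(d+1) on the diagonal and on
# edges) for the cycle C_n, where d = 2, and compute
# lambda_max = max(lambda_2, |lambda_n|) from its eigenvalues.
def lambda_max_of_cycle(n):
    P = np.zeros((n, n))
    for i in range(n):
        for j in (i, (i + 1) % n, (i - 1) % n):
            P[i, j] = 1.0 / 3.0
    eig = np.sort(np.linalg.eigvalsh(P))  # P is symmetric
    return max(eig[-2], abs(eig[0]))      # second largest vs |smallest|

# For C_n the eigenvalues are (1 + 2*cos(2*pi*k/n))/3, so lambda_max
# approaches 1 as n grows, reflecting slow convergence on the ring.
print(round(lambda_max_of_cycle(8), 4))  # (1 + sqrt(2))/3 ~ 0.8047
```

The gap 1 − λ_max is exactly the quantity appearing in the convergence time τ(G, K) discussed later.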
We show the following general bound; the proof can be found in the full version.
For bounding the deviation between the discrete and the continuous process, we adapt a definition from [9] that generalizes the original definition of the local divergence from [14] for p = 1.

Table 1 compares the discrepancy bounds and properties of the algorithms FS [9], RSW [14], FGS [10] (deterministic), FGS [10] (randomized), and our algorithm.
As pointed out in [14], "Ψ_1(G) is a natural quantity that measures the sum of load differences across all edges in the network, aggregated over time (and suitably normalized), which may be of independent interest". Here, we will mainly consider a natural extension of Ψ_1(G) to the ℓ_2-norm, Ψ_2(G), as well as Υ_2(G), which is defined below. For our probabilistic analysis, we use the following concentration result for martingales, commonly known as the "method of average bounded differences".
Theorem 2.1. ([6, p. 83]) Let Y_1, …, Y_n be an arbitrary set of random variables and let f be a function of these random variables satisfying the property that for each ℓ ∈ [n] there is a non-negative c_ℓ such that
|E[f | Y_ℓ, …, Y_1] − E[f | Y_{ℓ−1}, …, Y_1]| ≤ c_ℓ.
Then for any δ > 0,
Pr[|f − E[f]| > δ] ≤ 2·exp(−δ^2 / (2·∑_{ℓ=1}^{n} c_ℓ^2)).
2.1 Our Discrete Process. Our balancing procedure proceeds in rounds 1, 2, …. Fix a vertex i at some round and let X_i be the current load of this vertex.
Then i sends ⌊X_i/(d + 1)⌋ tokens to each of its neighbors and keeps ⌊X_i/(d + 1)⌋ tokens for itself. The remaining excess tokens are distributed randomly (without replacement) among i and its d neighbors.
To describe our processes more formally, we first present our notation, which is based on [14]. For any round t, let X^(t) be the n-dimensional load vector at (the end of) step t (load vectors are always regarded as column vectors here). The discrepancy of the load vector X^(t) at step t is defined as max_{i,j∈V} (X^(t)_i − X^(t)_j). For each edge {i, j} ∈ E we define a random variable Z^(t)_{i,j} with Z^(t)_{i,j} = 1 if i sends an excess token to j at step t, and Z^(t)_{i,j} = 0 otherwise. Similarly, let Z^(t)_{i,i} be one if i keeps an excess token for itself, and zero otherwise. Let e^(t)_i := X^(t−1)_i − (d + 1)·⌊X^(t−1)_i/(d + 1)⌋ denote the number of excess tokens of vertex i in step t. Then each Z^(t)_{i,j} with j ∈ N(i) ∪ {i} is a Bernoulli random variable with Pr[Z^(t)_{i,j} = 1] = e^(t)_i/(d + 1). Additionally, the number of excess tokens sent out by i satisfies ∑_{j∈N(i)∪{i}} Z^(t)_{i,j} = e^(t)_i. Note that Z_{i,j} and Z_{j,i} are independent for i ≠ j. Now we can describe the discrete process as follows:
X^(t)_j = ∑_{i∈N(j)∪{j}} ( ⌊X^(t−1)_i/(d + 1)⌋ + Z^(t)_{i,j} ).
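The marginal behavior of the indicator variables Z^(t)_{i,j} can be checked empirically (our own sketch; the degree, number of excess tokens, and trial count are arbitrary example values):

```python
import random
from collections import Counter

# For one vertex in one round: with e excess tokens and d+1 possible
# destinations (the vertex itself and its d neighbors), sampling the
# destinations without replacement means every destination j receives
# an excess token with probability e/(d+1), and exactly e indicators
# equal 1.
def excess_indicators(d, e, rng):
    dests = rng.sample(range(d + 1), e)  # without replacement
    return [1 if j in dests else 0 for j in range(d + 1)]

rng = random.Random(1)
d, e, trials = 4, 3, 20000
hits = Counter()
for _ in range(trials):
    z = excess_indicators(d, e, rng)
    assert sum(z) == e           # exactly e excess tokens leave
    for j, zj in enumerate(z):
        hits[j] += zj
# Empirically, Pr[Z_j = 1] is close to e/(d+1) = 0.6 for every j.
```

The sum constraint (exactly e of the d+1 indicators are 1) is precisely the dependence between the Z variables of a single vertex that distinguishes this vertex-based scheme from independent edge-based rounding.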

The Continuous Process.
Here we consider the continuous process where the load is arbitrarily divisible. The load vector in round t of this process is denoted by ξ^(t). To analyze X^(t) we bound its deviation from ξ^(t). Here we use the fact that the evolution of ξ^(t) is well understood in the area of Markov chain theory, since ξ^(t) = ξ^(t−1) P, which results in ξ^(t) = ξ^(0) P^t. Alternatively, we can write ξ^(t)_j = ∑_{i∈V} ξ^(0)_i · P^t_{i,j}. We define the average load as ξ̄ := ∑_{i=1}^{n} ξ^(0)_i / n. The following result bounds the difference between the loads of the vertices and the average load in step t of the continuous process.
Lemma 2.2. ([13, Lem. 1]) Let G = (V, E) be an arbitrary connected graph. Then for any initial vector ξ^(0) and time step t ≥ 0,
‖ξ^(t) − ξ̄·1‖_2 ≤ λ_max^t · ‖ξ^(0) − ξ̄·1‖_2.
We will use the following immediate consequence of this lemma.
Corollary 2.1. Let G = (V, E) be an arbitrary connected graph. Then for any time step t ≥ 0 and any vertex k ∈ V,
|ξ^(t)_k − ξ̄| ≤ λ_max^t · ‖ξ^(0) − ξ̄·1‖_2.
The following well-known result bounds the convergence time of the continuous process.
Theorem 2.2. ([14, Thm. 1]) Let G be a regular graph with n vertices. For the continuous process, the discrepancy is reduced to ε > 0 after O(log(Kn/ε)/(1 − λ_max)) steps, where K is the discrepancy of the initial load vector.
By τ(G, K) we denote the number of steps required for the continuous process to achieve a discrepancy of 1 for any initial load vector with discrepancy K. Theorem 2.2 implies that τ(G, K) = O(log(Kn)/(1 − λ_max)).

Difference between Continuous Process and Discrete Process.
To obtain results for the discrete process, we upper bound the deviation between the discrete and the continuous process at step t, assuming that both processes are initialized with the same load vector. Here, t is chosen large enough so that after t steps the continuous process has achieved a discrepancy of at most 1 for every load vector with initial discrepancy K (cf. Theorem 2.2). Hence, the discrepancy of the discrete process is upper bounded by the deviation between the discrete and the continuous process (plus 1). Similar to [9, 10, 14], we first express the deviation between the discrete and the idealized process as a sum of weighted rounding errors (equation (2.7)). In this sum, the rounding errors are weighted by powers of the transition probabilities. In contrast to [9, 10, 14], the rounding errors (of the same time step) are not independent for all edges. This is due to our vertex-based approach and complicates the analysis.
To obtain a recursion for the discrete process similar to equation (2.3) for the continuous process, we plug equation (2.1) into equation (2.2) and obtain
(2.4) X^(t)_j = ∑_{i∈N(j)∪{j}} X^(t−1)_i/(d + 1) + ∑_{i∈N(j)∪{j}} ( Z^(t)_{i,j} − e^(t)_i/(d + 1) ),
where e^(t)_i denotes the number of excess tokens of vertex i in step t. Comparing equation (2.4) to equation (2.3) motivates the definition of the random variable
∆^(t)_{i,j} := Z^(t)_{i,j} − e^(t)_i/(d + 1),
which captures the rounding error made by vertex i on the edge {i, j} in step t.
This allows us to write
(2.6) X^(t)_j = (X^(t−1) P)_j + ∑_{i∈N(j)∪{j}} ∆^(t)_{i,j}.
Now we state some basic properties of the rounding errors. The proofs can be found in the full version of the paper.
Lemma 2.3. Let G = (V, E) be an arbitrary connected graph.
(1) For every {i, j} ∈ E and time step t, E[∆^(t)_{i,j} | X^(t−1)] = 0 and |∆^(t)_{i,j}| < 1.
(2) Consider two vertex-disjoint edges {i, j}, {k, ℓ} ∈ E and assume that X^(t−1) is fixed. Then ∆^(t)_{i,j} and ∆^(t)_{k,ℓ} are independent.
We now return to equation (2.6). For any vertex i ∈ V and step t, let us define an error vector ∆^(t) with ∆^(t)_i := ∑_{j∈N(i)∪{i}} ∆^(t)_{j,i}. With this notation we have X^(t) = X^(t−1) P + ∆^(t). Solving this recursion (see [14]) and setting ξ^(0) = X^(0) results in
X^(t) − ξ^(t) = ∑_{s=1}^{t} ∆^(s) P^{t−s},
where P^0 is the n × n identity matrix. Hence, for any vertex k ∈ V,
(2.7) X^(t)_k − ξ^(t)_k = ∑_{s=1}^{t} ∑_{i∈V} ∆^(s)_i · P^{t−s}_{i,k} = ∑_{s=1}^{t} ∑_{i∈V} ∑_{j∈N(i)∪{i}} ∆^(s)_{i,j} · P^{t−s}_{j,k},
where the last equality uses the definition of ∆^(s)_i and exchanges the order of summation.

Proof of Theorem 1.1
We now bound the discrepancy of our discrete process in terms of the refined local divergence Υ_2(G). We do this by upper bounding the deviation between the discrete and the continuous process. A similar approach was used by Rabani et al. [14], who bounded this deviation in terms of Ψ_1(G). They showed that reducing the initial discrepancy from K to O(Ψ_1(G)) can be achieved within O(log(Kn)/(1 − λ_max)) steps for any initial load vector. However, it turns out that our randomized process can be bounded in terms of Υ_2(G). Note that Υ_2(G) is in general much smaller than Υ_1(G) (or Ψ_1(G)) (cf. the remarks after Definition 2.2).
Proof. [Proof of Theorem 1.1] We start with the proof of the first statement. Let us fix a vertex k ∈ V and a time step t. Recall from equation (2.7) that
X^(t)_k − ξ^(t)_k = ∑_{s=1}^{t} ∑_{i∈V} ∑_{j∈N(i)∪{i}} ∆^(s)_{i,j} · P^{t−s}_{j,k}.
Our goal is to apply the martingale tail estimate from Theorem 2.1 to
(3.8) f_k := ∑_{s=1}^{t} ∑_{i∈V} ∑_{j∈N(i)∪{i}} ( Z^(s)_{i,j} − e^(s)_i/(d + 1) ) · P^{t−s}_{j,k},
where the equality with (2.7) follows from the definition of ∆^(s)_{i,j}. We observe that for a fixed load vector X^(0) the function f_k depends only on the randomly chosen destinations of the excess tokens. There are t steps, n nodes, and at most d excess tokens per node per step. We describe these random choices by a sequence of tnd random variables Y_1, …, Y_{tnd}, where the index ℓ of Y_ℓ corresponds to a triple (s, i, r) in the lexicographic ordering; Y_ℓ then refers to the destination of the r-th excess token of vertex i at step s (if there is one). More precisely, Y_ℓ = j if vertex i has at least r excess tokens at step s and the r-th excess token of vertex i at step s is sent to j, and Y_ℓ = 0 otherwise. In order to apply Theorem 2.1, we have to upper bound
(3.9) |E[f_k | Y_ℓ, …, Y_1] − E[f_k | Y_{ℓ−1}, …, Y_1]|.
Consider now a fixed ℓ that corresponds to (s_1, i_1, r_1) in the lexicographic ordering. To bound equation (3.9), we use equation (3.8) and linearity of expectation. In the remainder of the proof we split the sum over s into the three parts 1 ≤ s < s_1, s = s_1, and s_1 < s ≤ t.
We prove that the parts s < s_1 and s > s_1 both equal zero, while the part s = s_1 is upper bounded by 2·max_{j∈N(i_1)∪{i_1}} P^{t−s_1}_{j,k}. To simplify the notation, we write
Λ^(s)_{i,j} := E[∆^(s)_{i,j} | Y_ℓ, …, Y_1] − E[∆^(s)_{i,j} | Y_{ℓ−1}, …, Y_1].
s < s_1: In this case ∆^(s)_{i,j} is already determined by Y_{ℓ−1}, …, Y_1. Hence
(3.11) Λ^(s)_{i,j} = 0,
which follows as Y_{ℓ−1}, …, Y_1 determine the load vector X^(s−1). To bound the part s = s_1 (equation (3.12)), we consider ∑_{j∈N(i)} Λ^(s_1)_{i,j} for i = i_1 and i ≠ i_1 separately.
Case 2: Let i ≠ i_1. Then ∆^(s_1)_{i,j} is independent of Y_ℓ, which is the choice of the r_1-th excess token of vertex i_1 at step s_1. Hence ∑_{j∈N(i)} Λ^(s_1)_{i,j} = 0. Combining Case 1 and Case 2, we obtain
(3.12) ≤ 2·max_{j∈N(i_1)∪{i_1}} P^{t−s_1}_{j,k}.
s > s_1: Let ℓ' be the largest integer that corresponds to time step s − 1. Since s > s_1, we have s − 1 ≥ s_1 and therefore ℓ' ≥ ℓ. By the choice of ℓ', the variables Y_{ℓ'}, …, Y_1 determine the load vector at the end of step s − 1, X^(s−1). By Lemma 2.3, we obtain E[∆^(s)_{i,j} | Y_{ℓ'}, …, Y_1] = 0, and by the chain rule of expectations, Λ^(s)_{i,j} = 0.

Discussion
We presented a new diffusion-based load-balancing scheme which is very simple and avoids negative loads. We show bounds on the discrepancy for general graphs depending on the local (or refined local) divergence and the eigenvalue gap of the graph. For (constant-degree) expander graphs we prove a discrepancy of O(log log n), for hypercubes of O(log n), and for r-dimensional torus graphs of O(√(log n)).

Definition 2.2.
For any p ∈ N_{>0}, the refined local p-divergence of a graph G = (V, E) is defined as
Υ_p(G) := max_{k∈V} ( ∑_{t=0}^{∞} ∑_{i∈V} ∑_{j∈N(i)} |P^t_{i,k} − P^t_{j,k}|^p )^{1/p}.
Note that Υ_p(G) ≤ 2^{1/p}·Ψ_p(G), since for each {i, j} ∈ E(G) the term |P^t_{i,k} − P^t_{j,k}|^p appears once in Ψ_p(G) and at most twice in Υ_p(G).
s = s_1: This is the most involved case, due to the dependencies among {∆^(s)_{i,j} : {i, j} ∈ E}.

Case 1:
Let i = i_1. Assume first that Y_ℓ = 0. This means that node i_1 has fewer than r_1 excess tokens at step s_1. Hence Λ^(s_1)_{i,j} = 0. Now assume that Y_ℓ ≠ 0. This means that node i_1 has at least r_1 excess tokens at step s_1. Let b ≥ r_1 be the number of excess tokens of i_1 at step s_1. Clearly, b and the destinations of the excess tokens considered previously, Y_{ℓ−r_1+1}, …, Y_{ℓ−1}, are already determined by Y_{ℓ−1}, …, Y_1 (note that if r_1 = 1 then this set is empty). The remaining Y_{ℓ+1}, …, Y_{ℓ+b−r_1} are chosen uniformly at random among (N(i_1) ∪ {i_1}) \ {Y_{ℓ−r_1+1}, …, Y_ℓ} =: N'(i_1), without replacement. Let w ∈ N'(i_1) be the destination of the r_1-th excess token of i_1 at step s_1, that is, Y_ℓ = w and consequently Z^(s_1)_{i_1,w} = 1. Clearly, 0 < Λ^(s_1)_{i_1,w} ≤ 1; the last statement holds since ∑_{j∈N(i_1)∪{i_1}} Z^(s_1)_{i_1,j} = b and b is determined by Y_{ℓ−1}, …, Y_1.

Theorem 4.4. If G is the hypercube with n vertices, then Ψ_1(G) = O(log^2 n).
As the discrepancy of the RSW algorithm is at most Ψ_1(G) after τ(G, K) rounds [14, Cor. 3], we obtain:
Corollary 4.1. The discrepancy of the RSW algorithm [14] is at most O(log^2 n) after τ(G, K) = O(log(Kn) · log n) time steps.
Note that the best possible result from [14, Theorem 4] yields only a weaker bound of O(log^3 n). Our result is tight, since d · diam(G) = (log_2 n)^2 is a lower bound.

Table 1:
Discrepancy of neighborhood load balancing after τ(G, K) rounds.