NeurIPS 2020

Throughput-Optimal Topology Design for Cross-Silo Federated Learning


Review 1

Summary and Contributions: In this paper, the authors propose a new approach to designing the topology of the network underlying DPASGD so as to reduce the cycle time. The authors define the minimal cycle time problem and solve it via approximation algorithms. Theoretical analysis and empirical results are provided.

Strengths: The authors define the minimal cycle time problem and solve it via approximation algorithms. Theoretical analysis is provided. Empirical results show the proposed algorithm can reduce the training time.

Weaknesses: Please correct me if I'm wrong. My major concern is that, according to the theoretical analysis of DPASGD and other decentralized optimization algorithms, the convergence depends on the eigenvalues of the connectivity graph. However, the proposed algorithms do not take the eigenvalues of the graph into account. Although the empirical results show good convergence, I believe the eigenvalue/graph Laplacian is a missing piece in the design of the MCT optimization problem and the theoretical analysis.
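To make this concern concrete (this is my sketch, with notation and constants that vary across papers, e.g., Lian et al., 2017): analyses of decentralized SGD typically bound the error as

\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\,\|\nabla f(\bar{x}_t)\|^2 \lesssim \frac{1}{\sqrt{nT}} + \frac{C(\rho)}{T}, \qquad \rho = \max\{|\lambda_2(W)|, |\lambda_n(W)|\},

where W is the mixing matrix of the topology and C(\rho) blows up as the spectral gap 1 - \rho shrinks (e.g., like (1-\rho)^{-2} in several analyses). The MCT formulation optimizes only the per-iteration cycle time and leaves \rho unconstrained, which is why I see the Laplacian/eigenvalue aspect as a missing piece.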

Correctness: The paper is technically sound.

Clarity: The paper is well written and easy to follow.

Relation to Prior Work: The difference and improvement compared to the previous work is clearly discussed.

Reproducibility: Yes

Additional Feedback: How does the topology affect the convergence of the decentralized optimization algorithm? Is there a corresponding theoretical analysis? Is convergence taken into account in the proposed algorithm?

----------------------- After authors' feedback: After reading the other reviews and the authors' feedback, I agree that convergence does not seem to be significantly affected in these cases, so the model in this paper makes sense. However, it would be very interesting if the authors could also consider training convergence in future work.


Review 2

Summary and Contributions: This paper studies the optimal topology design problem in the setting of cross-silo federated learning. It nicely models the runtime per iteration of decentralized SGD using max-plus algebra, and it proposes several algorithms to obtain the overlay network in different scenarios.
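To sketch the modeling as I understand it (my paraphrase and notation, not necessarily the paper's): let x_i(k) be the time at which silo i finishes iteration k, and let A_{ij} collect the compute and communication delays on edge (j, i). The dynamics then form a max-plus linear system,

x_i(k+1) = \max_j \big( A_{ij} + x_j(k) \big), \quad \text{i.e.,} \quad x(k+1) = A \otimes x(k).

For a strongly connected topology, max-plus spectral theory gives \lim_{k \to \infty} x_i(k)/k = \lambda, where \lambda is the max-plus eigenvalue of A, equal to the maximum cycle mean \max_{\zeta} w(\zeta)/|\zeta| over cycles \zeta of the graph; this asymptotic time per iteration is the cycle time that the paper's algorithms minimize.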

Strengths:
- The paper is well written and easy to follow.
- The runtime modeling part is novel and valuable for the decentralized optimization community.

Weaknesses:
- Some claims are a bit questionable; I state this in the "Correctness" section.
- The authors seem to misunderstand the MATCHA algorithm; I state this in the "Relation to Prior Work" section.

Correctness: The claim that the convergence speed of decentralized SGD is weakly sensitive to the density of the overlay is questionable. I suspect this conclusion relies heavily on the datasets and neural network models used in the experiments. For example, in [1] it is reported that the final accuracy of a denser graph is consistently higher than that of a sparser one. Otherwise, there would be no need to design the topology: why not directly use the ring topology, which is nearly the fastest and sparsest one? [1] Stochastic Gradient Push for Distributed Deep Learning. Assran et al., ICML 2019.

Clarity: Overall the paper is well written.

Relation to Prior Work: After checking the MATCHA paper, I feel the authors somewhat misunderstand the MATCHA algorithm. It not only models the communication time in decentralized SGD but can also be used to identify, and communicate more frequently over, the important links in any fixed graph (this is why it optimizes the algebraic connectivity). So one could also build MATCHA on top of the overlay that is the output of the MCT algorithm in this paper. Besides, the communication budget C_b in MATCHA can be tuned to balance throughput and final accuracy, so it is unfair to fix C_b = 0.5 in all experiments. The authors should make the above two points clear in the paper. It would be great to add more experiments that (1) build MATCHA on top of the overlay and (2) vary C_b in MATCHA.
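For reference, my reading of the MATCHA formulation (notation approximate): the base topology is decomposed into m matchings with Laplacians L_1, ..., L_m; matching j is activated at each iteration independently with probability p_j; and the probabilities are chosen to maximize the algebraic connectivity of the expected topology under an average communication budget, roughly

\max_{p} \; \lambda_2\Big( \sum_{j=1}^{m} p_j L_j \Big) \quad \text{s.t.} \quad \frac{1}{m} \sum_{j=1}^{m} p_j \le C_b, \quad 0 \le p_j \le 1.

This is why MATCHA is more than a communication-time model: the budget C_b directly trades communication cost against connectivity, and hence against convergence speed.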

Reproducibility: Yes

Additional Feedback: After rebuttal:

(1) I have read the other reviewers' comments and the authors' feedback. I agree that although this paper focuses only on the system aspect, the authors executed it very well. The communication-time model developed in this paper may be valuable for future researchers, so I would like to recommend acceptance.

(2) I notice that the authors constrain the topology to be a strong digraph; hence, it implicitly puts some requirements on the spectral properties of the graph (I expand on this at the end of this review).

(3) I feel the title 'optimal topology design' is too general, given that the authors did not explicitly discuss the convergence properties of the algorithm. The authors should change the title to something like 'Cycle-time optimal topology design' or 'Communication-time optimal topology design'.

(4) It would be better to define edge-capacitated and node-capacitated networks analytically.

----------------------------- This paper is overall good and addresses an important problem, but the experimental results are somewhat contradictory to what the authors propose. From Figure 2, one can easily conclude that the ring topology is nearly the fastest. If the topology does not strongly influence the convergence rate, why do we need the complicated topology design algorithm proposed in this paper to find denser topologies?
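To expand on point (2), a standard argument: a mixing matrix W supported on a strongly connected digraph is irreducible, and if it is additionally aperiodic (e.g., every node keeps a positive self-weight, as is typical in decentralized SGD), Perron-Frobenius theory gives a single eigenvalue of modulus 1 and |\lambda_i(W)| < 1 for all other eigenvalues, i.e., a strictly positive spectral gap 1 - \max_{i \ge 2} |\lambda_i(W)| > 0. So the strong-connectivity constraint does guarantee mixing, even though the paper does not optimize the gap itself.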


Review 3

Summary and Contributions: This work considers the setup of cross-silo federated learning and the question of how to design the connection topology to minimize total wall-clock time. Prior to reading this work, I was skeptical about the value of this line of work. This work, well motivated and bringing insights from areas I would not normally follow, certainly challenges my intuition. Overall, I would like to see this work accepted, but I am not sure how big a problem the issue I highlight regarding the experimental evaluation is, and my overall score could change either way after the rebuttal.

Strengths: I appreciate that the paper is clear about what kind of federated learning setup this work applies to, as well as when it is not relevant. The whole work is very well motivated and grounded in realistic considerations. In my view, this work raises the bar for how to realistically set up simulated experiments relevant to the problem studied. It takes a stab at a relatively clearly stated question from prior work that has received little attention. It is relevant to a relatively new setup of increasing interest, reported by multiple companies.

Weaknesses: While the experimental setup is very good on the simulation of system properties, I feel it is weak on the algorithmic side. See below.

Correctness: I believe yes.

Clarity: Clearly written, easy to follow.

Relation to Prior Work: Very well grounded in recent works spanning multiple areas.

Reproducibility: Yes

Additional Feedback: I have read the response and the other reviewers' comments. I am glad to hear that optimizing for the sum of local functions yields the same final accuracy independently of the network. I strongly recommend redoing all of the experiments with this objective, and I feel confident it should work. This will make the experiments more conclusive by making the results comparable across different networks through a shared objective. If the final convergence accuracies are very close, it also strongly addresses some other concerns, including mine, about how the topology impacts algorithmic convergence properties. Overall, I feel this work exceeds the quality of a few other related and accepted works; I would really like to see it accepted and have increased my rating.

----------------------------- Initial review content:

Abstract: "master-slave" - I think there is a general trend in CS to move away from this language; you may want to consider "server-client", "parent-child", or something similar.

In general the paper reads well; I have only minor comments regarding clarity of presentation. I do appreciate that the paper is upfront about what kinds of scenarios this work applies to and what kinds it does not.

Sec 1, I think, grounds the rest of the work very well in prior research. In particular, in contrast to some related recent works, I feel this thorough grounding helps the authors ask better questions, and in this sense I feel this work could help inspire further research.

L65: I am not sure how compression plays into the preference for synchronous algorithms. It is relevant, though, including for the precise topic studied here. Perhaps a missing reference is Caldas et al., "Expanding the Reach of Federated Learning by Reducing Client Resource Requirements", which does both model and update compression as in the text.

Sec 2.4: Here or earlier, I recommend spelling out that changing the overlay changes the algorithm executed, and thus potentially the convergence properties. The impact of this is studied in some previous works, which you do cite and mention later on. I do not think this is a major problem; it is OK to focus on the system properties only, especially since this was not properly studied previously.

Sec 3 formulates practical approaches to finding the solution of the problem in the previous section, or an approximation of it, based on whether communication plays a major role or not.

Sec 4: I both like and do not like the experiments. I like that the setup is very well motivated and that a lot of effort was clearly spent to create a novel simulated setup that reflects realistic problems as much as possible, including using real topologies of internet infrastructure. This sets the bar higher for the whole field and will help follow-up research support more persuasive conclusions. I think this is very valuable for the research community and appreciated.

What I do *not* like stems from footnote 5. The model learned should not be impacted by the data partitioning and the connectivity network, so measuring the speed of reaching different thresholds seems wrong to me. What I am not sure about (and would like answered in the rebuttal) is what is causing this. Two possible explanations I can imagine:

a) Eq (1) is a sum of averages. This way, a client with little data has a disproportionate impact, possibly with a negative effect on the accuracy of the trained model. By default, I would expect an overall average over data points, usually structured as a weighted average of local averages; that way, the data partitioning is irrelevant in terms of the global optimization objective. But maybe the outer average is forgotten in the equation? (I spell out this distinction at the end of my comments.)

b) The default hyperparameters (say, for the STAR topology) are not set well. It may be that different topologies require different learning rates in order to converge, or to converge reasonably fast.

Minor: In Fig 2, I would much prefer to see a notion of test accuracy on the y-axis, not training loss. As is, it is not clear whether the differences visualized translate to any difference in the task of interest. These plots are in the appendix, so this is a minor point.

Minor: L235-236: I did not like the description of the data partitioning (why is half of the data split randomly?), but the expanded reasoning in the appendix makes me feel OK about this. Consider expanding this aspect of the setup in the main body.

L257: which table?
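To spell out the distinction in point (a) in my own notation (the paper's Eq (1) may differ): with n_i samples d_{i1}, ..., d_{in_i} at silo i, a loss \ell, and n = \sum_i n_i, compare

F_{sum}(x) = \sum_i \frac{1}{n_i} \sum_{j=1}^{n_i} \ell(x; d_{ij}) \qquad \text{vs.} \qquad F_{avg}(x) = \sum_i \frac{n_i}{n} \cdot \frac{1}{n_i} \sum_{j=1}^{n_i} \ell(x; d_{ij}) = \frac{1}{n} \sum_{i,j} \ell(x; d_{ij}).

Only F_{avg} is invariant to how the data is partitioned across silos; under F_{sum}, a silo holding 10 samples carries the same weight as one holding 10,000, so the learned model depends on the partition.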