NeurIPS 2020

### Review 1

Summary and Contributions: The paper proposes an interesting framework for multi-organization learning, in which multiple collaborators can aid supervised learning without sharing private algorithms or data. The approach is implemented via iterative transmission of task-specific statistics (e.g., model residuals) between the collaborators, which leverages side information from alternative models to improve learning. Theoretical analysis shows that the method can achieve lossless learning performance in linear regression models. The approach is validated in synthetic and real data experiments, where the proposed method is compared to the oracle performance achieved by models trained with all the combined data.

Strengths: The proposed approach is interesting, and the paper is very well written and organized. The paper does a very good job motivating the need for the approach and describes, with concrete examples, how it can be used to fulfill this need. The paper does a good job describing its novel contributions relative to previous related work in the literature. The paper also provides theoretical analyses showing that the approach is lossless for linear regression models. The paper illustrates the application of the method on both synthetic and real data sets using linear models (linear and ridge regression), tree-based models (decision trees, gradient boosting, and random forests), and neural networks.

Weaknesses: The paper states that model selection or model averaging approaches will not significantly improve over the best of the models (Alice’s or Bob’s) used in the assisted learning procedure because they fail to utilize the full data (the union of Alice’s and Bob’s features). However, ensemble techniques such as stacked regression (Breiman 1996) are often used successfully to improve predictive performance by combining not only different models trained on the same set of features, but also different models trained on different subsets of features. In all experiments in the paper, assisted learning is compared only against the oracle model. The paper would be considerably stronger if it showed that assisted learning compares favorably against, for instance, a stacked model built from the predictions of the different models on modules M_1, …, M_m (trained with the original public responses). Note that under the paper's assumption that the labels/responses (as well as some identifier needed to collate the labels/responses with the features) are publicly available, a simpler ensemble approach such as stacking could also be used directly to improve learning without sharing the private feature data. In other words, such an approach might serve the same goals as assisted learning without requiring the iterative transmission of model residuals (as described in Procedure 1) or of predictions from the first layer of neural nets (as described in Procedure 2). While the paper is very interesting, it would be considerably more appealing if it provided comparisons and empirical evidence that assisted learning can outperform these simpler approaches. Also, the paper does not describe how the predictions generated by the different modules are combined to form the final predictions. How exactly are the predictions combined in step 13 of Procedure 1? Is simple unweighted averaging used, or a more sophisticated approach?
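To make the stacking baseline suggested above concrete, here is a minimal sketch, assuming two parties (Alice and Bob) with disjoint feature blocks over the same rows and a publicly shared response. It is not taken from the paper: the function names, the ordinary-least-squares base learners, and the five-fold scheme are all illustrative assumptions. Each party shares only a vector of out-of-fold predictions, never its features.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients, with an intercept prepended."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def out_of_fold_predictions(X, y, n_folds=5):
    """Predict each row with a model that was not trained on that row."""
    folds = np.arange(len(y)) % n_folds
    oof = np.empty(len(y))
    for k in range(n_folds):
        held_out = folds == k
        coef = fit_ols(X[~held_out], y[~held_out])
        oof[held_out] = predict_ols(coef, X[held_out])
    return oof

def stacked_fit(X_alice, X_bob, y):
    # Each party computes out-of-fold predictions locally (hypothetical
    # protocol); only these two vectors are shared with the stacker.
    Z = np.column_stack([out_of_fold_predictions(X_alice, y),
                         out_of_fold_predictions(X_bob, y)])
    stack_coef = fit_ols(Z, y)  # second-level linear combiner
    return fit_ols(X_alice, y), fit_ols(X_bob, y), stack_coef

def stacked_predict(models, X_alice, X_bob):
    coef_a, coef_b, stack_coef = models
    Z = np.column_stack([predict_ols(coef_a, X_alice),
                         predict_ols(coef_b, X_bob)])
    return predict_ols(stack_coef, Z)

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def demo(seed=0, n=400):
    """Synthetic comparison: single-party models versus the stacked model."""
    rng = np.random.default_rng(seed)
    X_a = rng.normal(size=(2 * n, 3))
    X_b = rng.normal(size=(2 * n, 3))
    y = (X_a @ np.array([1.0, -2.0, 0.5])
         + X_b @ np.array([1.5, 0.5, -1.0])
         + 0.1 * rng.normal(size=2 * n))
    train, test = slice(0, n), slice(n, 2 * n)
    models = stacked_fit(X_a[train], X_b[train], y[train])
    return {
        "alice_only": mse(y[test], predict_ols(models[0], X_a[test])),
        "bob_only": mse(y[test], predict_ols(models[1], X_b[test])),
        "stacked": mse(y[test], stacked_predict(models, X_a[test], X_b[test])),
    }

if __name__ == "__main__":
    print(demo())
```

On this synthetic, additive, independent-feature setup the stacked model should do much better than either single-party model, which is exactly why a comparison against such a baseline on the paper's own data sets would be informative. The sketch says nothing about how stacking would fare against assisted learning when features interact or overlap.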

Correctness: The approach appears to be technically sound.

Clarity: The paper is very well written and easy to follow. The approach is very well motivated and, most of the time, described in detail with plenty of illustrative examples (although some important details, such as how the predictions are combined into a final prediction, are missing).

Relation to Prior Work: The paper provides a good overview of related work, clearly discussing the relationships between the proposed method and previous related work.

Reproducibility: Yes

### Review 2

Summary and Contributions: The paper introduces assisted learning, a framework that enables the use of sensitive data from other entities/organizations without sharing raw data. The gist of the approach is that entities learn models for a specific target variable based on their own set of features and share only their error residuals with other entities. The authors therefore propose a novel learning protocol in which the involved entities sequentially exchange new models that are fitted to the error residuals. The framework is evaluated on synthetic and medical data (from the MIMIC3 benchmark ‘length of stay’), where the authors show that adequate prediction errors can be achieved in the novel, decentralized learning setting.

Strengths:
* Novel, potentially impactful learning framework for multi-organization ML. The approach is sufficiently novel and relevant.
* Adequate evaluation showing the feasibility of the approach with distributed feature subsets.
* The presentation of the work is widely understandable, providing a sound introduction to technicalities.

Weaknesses:
* For the approach to work, data needs to be aligned, i.e., the approach is constrained to organizations that hold data about the same instances.
* While adequate, the evaluation is small with respect to the number of data sets tested.

Correctness: The proposed method is sound, as the stated claims are backed by proofs and the empirical evaluation follows standard practice.

Clarity: The paper, in general, is well written in that it thoroughly motivates the novel framework, gives illustrative example scenarios, and provides formalisms and proofs only when required. Section 4 is, however, a tad hard to read, although this could be easily remedied if Procedure 1 were shown earlier. The empirical evaluation is rather small, but covers realistic datasets (e.g., MIMIC3).

Relation to Prior Work: The related work comprises vertical federated learning, secure multi-party computation and multimodal data fusion. Albeit high-level, it provides a sufficient overview of related frameworks, such that the reader can understand the different interaction scheme and learning goal.

Reproducibility: Yes

Additional Feedback: The learning approach employed within the decentralized protocol seems strongly related to gradient boosting, if I am not mistaken. It would be interesting to show the relationship.

Minor comments:
* The set $S$ is not introduced before it is used (in Procedure 1).
* There might be an indexing error: a module is defined as $M_j$, then you use round $k$ and finally refer to module $k$ (page 5).

Update after author response: I want to thank the authors for their answers. After having read the rebuttal as well as the other reviews, I remain with my positive score for the paper, as all questions of the reviewers have been thoroughly addressed. I find the proposed assisted learning framework to be original and well-executed.
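The suspected connection to gradient boosting raised in this review's Additional Feedback can be illustrated with a small sketch. It is not the paper's Procedure 1: the function names, the least-squares learners, and the additive combination of fitted values are assumptions made here for illustration. For squared loss, the residual $y - f$ is the negative gradient of $\frac{1}{2}(y - f)^2$ with respect to $f$, so a party that fits the current residual takes a boosting step with learning rate 1, where the base learner alternates between feature blocks.

```python
import numpy as np

def ols_fitted_values(X, target):
    """In-sample least-squares fit of `target` on X (with intercept)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return A @ coef

def residual_exchange(X_alice, X_bob, y, n_rounds=20):
    """Alternate residual fitting between two parties.

    Only the residual vector crosses the party boundary; each party
    keeps its own features private. Returns the final residual and the
    training MSE after each full round.
    """
    residual = np.asarray(y, dtype=float).copy()
    mse_per_round = []
    for _ in range(n_rounds):
        for X_party in (X_alice, X_bob):
            # Boosting-style step: fit the residual, subtract the fit.
            residual = residual - ols_fitted_values(X_party, residual)
        mse_per_round.append(float(np.mean(residual ** 2)))
    return residual, mse_per_round

def oracle_mse(X_alice, X_bob, y):
    """Training MSE of one least-squares fit on the pooled features."""
    X_all = np.column_stack([X_alice, X_bob])
    return float(np.mean((y - ols_fitted_values(X_all, y)) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 300
    X_alice = rng.normal(size=(n, 3))
    X_bob = 0.5 * X_alice + rng.normal(size=(n, 3))  # correlated blocks
    y = (X_alice @ np.array([1.0, -2.0, 0.5])
         + X_bob @ np.array([1.5, 0.5, -1.0])
         + 0.1 * rng.normal(size=n))
    _, history = residual_exchange(X_alice, X_bob, y)
    print("first round:", history[0])
    print("last round: ", history[-1])
    print("oracle:     ", oracle_mse(X_alice, X_bob, y))
```

With linear learners this alternating scheme is the classical backfitting (alternating projection) iteration, and its training error approaches that of the pooled least-squares fit. That is consistent with the lossless claim for linear regression summarized above, although the sketch only checks training error on synthetic data.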

### Review 3

Summary and Contributions: This paper proposes the Assisted Learning framework to tackle the tension between data privacy and learning efficiency. The authors develop three algorithms: linear regression, gradient boosting, and two-layer NNs. Experimental results indicate the efficiency of the proposed method.

Strengths: The paper introduces the notion of Assisted Learning to tackle the tension between data privacy and learning efficiency without sharing data. Moreover, in the context of Assisted Learning, the authors develop two concrete protocols so that a service provider can assist others by improving their predictive performance. The paper is overall well presented and easy to understand.

Weaknesses: First, in some respects, the idea is similar to residual networks, but the authors did not cite the related work. Second, how can the correctness of the stopping criterion be evaluated? Third, the experiments are insufficient. The authors should compare with more related algorithms.

Correctness: Yes

Clarity: Good

Relation to Prior Work: Yes

Reproducibility: Yes

### Review 4

Summary and Contributions: The paper presents a method of assisted learning where multiple parties share relevant metrics iteratively in order to train one particular party at a certain task. The contributions are as follows: 1. This work introduces the problem of Assisted Learning. 2. The work discusses two methods of solving this problem.

Strengths: Soundness of the claims: The claims in the paper have been demonstrated by the experiments. Where possible, claims have also been proven theoretically. Significance and novelty: The posed problem certainly appears to be novel. This work can be quite significant when multiple agencies want to improve their predictions without compromising their data or proprietary methods. Relevance: The work is quite relevant to NeurIPS and the ML community in general.

Weaknesses: Soundness of the claims: Although the results were demonstrated on both simulated and real data sets, it was not clear how features were split among the various modules. Was the split random, or was it curated by the authors to achieve the best possible results?

Correctness: Yes, the claims and method are correct. The empirical methodology could be improved by considering datasets with a large number of features. Real-world datasets usually have hundreds of features, and many agencies might share many of them. For example, financial firms most often use credit bureau data, so two or more financial firms are likely to use very similar data. It isn't clear how performance will be affected as the number of similar or linearly dependent features across different modules increases. Perhaps future work should address that.

Clarity: Yes, the paper is well written and easy to follow.

Relation to Prior Work: The authors clearly distinguish their work from prior work and related fields, like vertical federated learning.

Reproducibility: No

Additional Feedback: The authors don't discuss how features were split across the modules in the experiments. Reporting which feature was used for which module would help make the results reproducible. It might help if the authors could discuss some (imaginary but) pathological scenarios like:
1. Alice and Bob have the same copy of the data and use the same ML algorithms. In this case, Oracle and Alice should always be on par.
2. What if Bob uses a random label generator as a predictor?
3. What if the data that Bob has is completely irrelevant to the predictive task that Alice has?

I have read the authors' feedback and stand by my review.
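Scenarios 1 and 3 above can be probed with a small numerical check in the simplest setting. The sketch below assumes least-squares learners and a single residual hand-off from Alice to Bob; the names and the synthetic data are hypothetical and not from the paper. It measures how much Bob's fit reduces Alice's training residual when Bob holds an identical copy of Alice's features, and when Bob's features are independent of the response. Scenario 2 (a random label generator) is not covered.

```python
import numpy as np

def ols_fitted_values(X, target):
    """In-sample least-squares fit of `target` on X (with intercept)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return A @ coef

def relative_gain(X_bob, residual):
    """Fraction of Alice's residual MSE removed by Bob's fit (training data)."""
    step = ols_fitted_values(X_bob, residual)
    return 1.0 - float(np.mean((residual - step) ** 2) / np.mean(residual ** 2))

rng = np.random.default_rng(0)
n = 500
X_alice = rng.normal(size=(n, 3))
y = X_alice @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=n)

# Alice fits first and sends her residual to Bob.
residual = y - ols_fitted_values(X_alice, y)

# Scenario 1: Bob holds the same data and uses the same learner. Alice's
# least-squares residual is orthogonal to her features, so Bob's fit of
# that residual is numerically zero.
gain_same = relative_gain(X_alice.copy(), residual)

# Scenario 3: Bob's features are independent of the response. Any gain
# on training data is chance overfitting.
X_irrelevant = rng.normal(size=(n, 3))
gain_irrelevant = relative_gain(X_irrelevant, residual)

print("gain with identical data: ", gain_same)
print("gain with irrelevant data:", gain_irrelevant)
```

Both gains are measured on training data, so the second one is slightly positive by construction; a held-out evaluation would be needed to see whether irrelevant features actually hurt. Reporting such checks for the nonlinear learners used in the paper would address the scenarios more directly.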