Review for NeurIPS paper: Theoretical Insights Into Multiclass Classification: A High-dimensional Asymptotic View

NeurIPS 2020

Theoretical Insights Into Multiclass Classification: A High-dimensional Asymptotic View

Review 1

Summary and Contributions: The paper studies the test accuracy in linear multiclass classification in the high-dimensional regime. Specifically, they studied two models -- Gaussian mixture model and multinomial logit model -- and three algorithms -- least-squares, weighted LS, and per class averaging -- under the setting where the sample size and the number of features tend to infinity proportionally. Their main contribution is the formulae of the limits of certain quantities (intercepts and correlation matrices in the linear classifier) which can be used to characterize the limiting classification error. The theory is validated by simulation experiments.

Strengths: The paper precisely characterizes the asymptotic accuracy in multiclass classification under Gaussian mixture model and multinomial logit model trained by least squares and per class averaging. The theoretical results agree with the numerical experiments quite well.

Weaknesses: The algorithms do not seem natural. This paper studies two simple parametric models for classification: Gaussian mixture model (with identity covariance) and multinomial logit model, which bring with them some natural algorithms, for instance, maximum likelihood and Bayesian methods. However, the algorithms analyzed in the paper, (weighted) least-squares and per class averaging, do not seem a good fit for the models. What is the rational for doing linear regression on a categorical response, or setting each coefficient as the sample mean of the corresponding class? I could not find any justification for this. The results are only asymptotic and only for linear classifier.

Correctness: The theorem and proofs look correct to me.

Clarity: Generally the paper is well written. Though the main body is mostly a summary of results without much intuition. For instance, why is the averaging estimator better for GMM while the LS is better for MLM? Moreover, no intuition is given regarding why the result should hold or how the proof goes. The proof techniques in the supplementary seem interesting and I think if they were sketched in the main paper, the readers could have a better understanding of the theory.

Relation to Prior Work: The paper has a good literature review and their difference from prior works is clear.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: The paper considers the multi-class classification problem using the linear model, in a case of high dimension and large sample size. In the double asymptotic regime where n, d \to \infty, basical statistics of linear classifiers are asymptotically calculated and performance of the proposed schime is numerically verified.

Strengths: In the asymptotic assumption w.r.t. dimension and sample size, explicit forms of statistics of linear classifiers are shown.

Weaknesses: Calculations in the double asymptotic regime are not quite new. Settings of numerical experiments seems to be arbitrary.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: I donot think so. There are few mentions about asymptotic theory of the case (n,d\to \infty).

Reproducibility: Yes

Additional Feedback: Calculations of limits of statistics themselves do not bring any insights for us and theoretical evaluations of generalization performance of the linear classifiers are required. How do the linear classifiers perform on the double asymptotic situation, in the sense of generalization error?

Review 3

Summary and Contributions: This article deals with multi-category pattern recognition. The authors characterize the asymptotic behaviour of four linear classifiers applied to data generated according to two models. The four classifiers differ according to their loss function: least-squares, class averaging, weighted least-squares and cross-entropy. The data are obtained through a Gaussian mixture or a multinomial logit model. The main results are convergences in probability of the parameters (intercepts and "correlation" matrices). The total and class-wise accuracies are also characterized. Experimental results (obtained on artificial data following the aforementioned models) are also provided in Section 5.

Strengths: This contribution fully characterizes the asymptotic behaviour of popular multivariate linear classifiers.

Weaknesses: The limitations directly spring from the very restrictive theoretical framework : linear classifiers and data generated according to two very simple models. It is unlikely that the results obtained extend nicely to more general settings.

Correctness: I was unable to check the technical sanity. On the other hand, I could not find any flaw either.

Clarity: The paper is clearly written. There are only a few typos. Three examples: -line 35: "the number of classes are large?" -> "the number of classes is large?" -line 82: "none of these prior works have" -> "none of these prior works has" -line 89 "correleations" -> "correlations".

Relation to Prior Work: This seems to be the case, although I am not familiar with the state of the art.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: The paper studies the asymptotic probabilities of error of simple linear classifiers under certain model assumptions: Gaussian mixture model and a multiclass logit model. This refers to providing precise asymptotic distributions of the classification error and not just order-wise bounds. As a key step the paper points out that for both models, the error depends on two correlation matrices, the one of the class-dependent weight vectors with each other and the weight vectors with the class mean feature vectors.

Strengths: The theoretical analysis of multiclass classification is an open problem at the core of machine learning / statistical modelling. While the the specific setting considered seem limited, they are insightful and likely an important stepping stone in the full analysis of multi-class classification. For instance, it is very interesting that the simple approach of findings weight vectors (per class) through per class averaging is Bayes optimal for the Gaussian mixture model (under balanced classes). Such results are a very good lead for future investigations of more general settings. The theory accurately predicts the outcomes of numerical experiments.

Weaknesses: I don’t see any major weaknesses in the work.

Correctness: While the basic mathematical setup including the importance of the correlation matrices are straightforward to follow, the main results are technically involved and I did not check the proofs in the appendix. However, the paper makes a 100 percent rigorous impression and the theoretical results are recovered in the numerical experiments. Therefore I have little concerns about correctness.

Clarity: The paper is written extremely densely and challenging to follow. However, I believe this is due to the intrinsic complexity of the problem, and the authors did a great job in providing an accessible introduction and formal setup in Section 2. Given the space limits there is not much that can be done to improve the presentation. Potentially, a little more room could be found for the interpretation of some results.

Relation to Prior Work: The authors give thorough pointers to related work. On question is whether existing results for binary classification could have been lifted in a straightforward way to compare their predictions with the novel more specific results in the numerical experiments. This might give a quantitative sense for the improvement in understanding of the multilclass problem through this work.

Reproducibility: Yes

Additional Feedback: UPDATE: I thank the authors for the useful and specific comments regarding my inquiry about the differences/relation to binary classification results. In general, the author response was very convincing and further corroborated my high opinion of the paper.

Review 5

Summary and Contributions: The submission discusses a high-dimensional asymptotic analysis of the linear/weighted-linear and plug-in estimators for the multiclass classification problem in the setting where the number of training samples and the number of input features grows to infinity at a proportional rate, while the number of classes is O(1). While a lot of works based on CGMT have recently looked at the binary classification problem, this paper looks at the important extension to multi-class classification. Such theory is welcome at a venue like NeurIPS. The experiments corroborate the theoretical predictions made by the authors.

Strengths: The paper is very clearly written. The problem is well motivated and is of significance to the broad ML community at large. Important aspects of the problem are discussed and the main ingredients to look out for in multi-class classification problem are brought out. Overall, the results are very informative and very interesting to the NeurIPS as well as the broad scientific community at large.

Weaknesses: The paper only considers simple estimators for the problem (for which closed form solutions of the estimator are known in terms of the training data), in the absence of any regularization for the weights. Hence it could be that the results could perhaps have been derived from I think the regularized estimators could have been considered by the authors but has been ignored (perhaps for the sake of brevity).

Correctness: The claims and methods seem correct

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: