Paper ID: 1459
Title: BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper describes an approach to compute the L1-regularized Gaussian maximum likelihood estimator for the sparse inverse covariance estimation problem. The focus of the paper is to scale the previous QUIC algorithm to problems involving millions of variables. The authors describe three innovations brought about by this new approach: inexact Hessians, better computation of the log-determinant function, and careful selection of the blocks updated in their block coordinate scheme via graph clustering. The numerical results test the new method against the previous QUIC algorithm, GLASSO and ALM, showing improved performance on a few select problems.

Clarity: this paper is well written and expresses its ideas in a straightforward manner, though readers will struggle if they have not read the original QUIC paper. Furthermore, it would be enlightening to see the effect of the CG solves, T_{outer}, and T_{inner} on the performance of BIGQUIC. How are you selecting these parameters?

Originality: the strategies presented in this paper seem novel, and the presence of the graph clustering algorithm is intriguing.

Significance: QUIC has been shown to perform quite well for the sparse inverse covariance problem presented in this paper, relying mostly on the special structure exhibited by the problem formulation. As a result, any strategies that increase the applicability of this approach could prove very useful for those interested in solving these types of problems.

Quality: the paper expresses its ideas very well and presents smart strategies for scaling up QUIC to solve much larger problems than in the previous paper.

However, the global convergence proof seems to be inconclusive, with two glaring problems that need to be addressed:

1. The proof does not cover the effect of the fixed-free set selection at the beginning of each iteration. Rather, the proof, which borrows many principles from the proof of the original QUIC algorithm, proves the result for J being the set of all variables, which is not the case for this algorithm.

2. It is unclear from the theoretical analysis how T_{outer} and T_{inner} affect the proof of convergence or the numerical results. The proof from QUIC seemed to require the exact solution of the sub-problems at each iteration. However, it is not apparent why the two-loop coordinate descent procedure of Algorithm 1 would be guaranteed to exactly solve that inexact model (with an inexact Hessian), particularly in T_{inner} inner sweeps and T_{outer} outer sweeps. If you are indeed solving the inner/outer blocks inexactly, then please highlight the effect on the convergence proof and the numerical results.
Q2: Please summarize your review in 1-2 sentences
I believe this paper is interesting, and provides some advancement over the original QUIC algorithm, but the authors must provide more precise arguments supporting the claim of global convergence, in light of the active set identification phase, and the effect of inexactly solving the inner sub-problems.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper is an extension of the QUIC method proposed for sparse inverse covariance estimation problems a couple of years ago and published at NIPS. The new method is heavily based on QUIC, which employs an active set selection strategy, a second-order approximation of the smooth part of the objective, and a coordinate descent scheme to optimize the subproblems. The new method uses clever and carefully executed schemes that exploit the structure of the sparse inverse covariance estimation problem to scale each operation (specifically coordinate descent) and to avoid storage of the dense inverse matrix, enabling the solution of large-scale problems. The resulting method indeed scales well beyond any other method for sparse inverse covariance problems.

The key improvement is based on the idea of identifying graph clusters in the current working set to decompose the matrix into blocks of columns (nodes) over which to perform repeated passes of coordinate descent. This minimizes the computations and updates needed for the rest of the matrix. An existing and presumably effective graph clustering algorithm is employed.

Other clever techniques are used, such as computing blocks of the inverse of the parameter matrix using a CG method with a variable degree of accuracy. For the gradient computation the inverse is computed more accurately than for the Hessian estimation. Convergence theory is provided to show under which conditions the "inaccurate" method retains its convergence properties. The theory is not very surprising, as it basically requires second-order quality of the approximation to obtain quadratic convergence, just as one would require for an exact Taylor expansion.

The computational results are quite impressive. They show scalability of the method to very large problems and also in a multi-core setting.

Some minor comments:

Line 209: "number" is missing a "b".
Line 268-269 "embarassingly" parallelized sounds somewhat awkward, although it is clear what is meant.

Line 270-278: There are step numbers of Algorithm 1 referenced here, but no step numbers are given in Algorithm 1.
Line 289 - there is an extra "of"

Q2: Please summarize your review in 1-2 sentences
This paper is an extension of the QUIC method proposed for sparse inverse covariance estimation problems a couple of years ago and published at NIPS. The new method uses clever and carefully executed schemes to further exploit the structure of the problem and scale each operation to very large problems. The resulting method is quite a substantial improvement over the state-of-the-art.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary:
This paper presents several extensions to the QUIC algorithm for sparse inverse covariance estimation, which together enable the method to scale to much larger domains than previously possible, with a maximum demonstrated size of a 1 million by 1 million sparse inverse covariance matrix. The main innovations are: 1) a block coordinate descent approach that first clusters the inverse covariance matrix into near-diagonal blocks and then performs an l1-regularized Newton method on the largest-size blocks that fit into memory; 2) an incremental approach to computing the log-det term for step-size computation; and 3) a method that computes certain Hessian terms only approximately (this and the gradient computations both use conjugate gradient methods, but the Hessian terms can use fewer iterations). The authors apply the method to two large synthetic data sets and a real fMRI data set, and show that the method is able to learn these large models on a single multi-core machine.

Comments:
Overall, I think that this is a very good paper. Since this is really an algorithmic paper, there may be some concern that it represents an incremental advance over the QUIC and DC-QUIC papers themselves, but given that the orders of magnitude here are quite significant (20,000^2 -> 1,000,000^2), I think that the methods described are definitely worthy of publication on their own.

A few comments/questions:
- Can the authors elaborate on how they went about choosing the regularization parameter for domains this large? Normally this is fairly straightforward via cross-validation, but given that a single run here can take many hours, I'm not quite clear how they went about actually finding the choices of lambda mentioned in Section 5. Especially since methods like QUIC tend to have many more non-zero parameters during the optimization than at the final solution, this seems worth discussing.

- Similar to the above, would it be possible to show or highlight the number of non-zeros during the optimization phase, and the amount of time this ends up taking?

- Is the choice of lambda to achieve degree 20 for the fMRI data actually reasonable? Can we verify that this is a reasonable choice via held-out log-likelihood, or some similar metric? Given that this real data set was by far the most impressive demonstration in my mind, I'm curious whether the proposed approach is learning a good model here, or just solving a hard optimization problem. Right now the only analysis is qualitative, which is nice but ultimately not that convincing, especially given that the results are just described as "similar" to those discovered by ICA, even though that comparison requires first reducing the dimension.

- Given that these methods are starting to become quite complex and difficult to independently implement, I hope the authors consider releasing a public version of this code.

- Did the authors consider using multifrontal LLT factorizations, etc.? Given that they're already using PARMETIS, there are a number of similarly "embarrassingly" parallel Cholesky factorizations, and for the sparsity levels considered it doesn't seem completely implausible that the factorizations might fit into memory (though of course not the inverse itself). This is a fairly minor point, since I'm sure the authors have put a lot of engineering into these efforts, but I'm genuinely curious whether there are interesting tradeoffs involved here. In my experience, CG methods can be widely variable in the number of iterations needed to obtain an accurate inverse, even with preconditioning.
Q2: Please summarize your review in 1-2 sentences
The authors propose modifications to the QUIC algorithm that let it scale to 1 million by 1 million covariance matrices. Given the substantial increase in size over past work, this is a very impressive piece of work.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their kind comments and suggestions.

1. Reviewer 1: "it would be enlightening to see the effect of the CG solves, T_{outer} and T_{inner} on the performance of BIGQUIC. How are you selecting these parameters?"

The stopping tolerance of CG is ||r^t||<\epsilon*||r^0||, where r^t is the residual at the t-th iteration.
Since the gradient has to be computed accurately, we set \epsilon=1e-10.
For the Hessian computations, we follow Theorem 3 and set \epsilon=||grad^S(X_t)||.
According to the arXiv version of the QUIC paper, T_{outer} should grow with the number of iterations. We use exactly the same setting as QUIC (T_{outer}=1+iter/3).
We set T_{inner}=3 in all our experiments.
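
For concreteness, the relative-residual stopping rule above can be sketched with a hand-rolled conjugate gradient as follows (illustrative Python/NumPy only; the function name and interface are not our actual implementation):

import numpy as np

def cg_relative_residual(A_mv, b, eps, max_iter=1000):
    # Conjugate gradient for A x = b, stopping when ||r^t|| < eps * ||r^0||.
    # A_mv: function computing the matrix-vector product A @ v (A assumed SPD).
    # eps:  relative tolerance, e.g. 1e-10 for the gradient computation, or a
    #       looser value such as ||grad^S(X_t)|| for the Hessian computations.
    x = np.zeros_like(b, dtype=float)
    r = b - A_mv(x)                       # r^0 = b since x starts at 0
    p = r.copy()
    r0_norm = np.linalg.norm(r)
    rs_old = r @ r
    for _ in range(max_iter):
        if np.linalg.norm(r) < eps * r0_norm:
            break
        Ap = A_mv(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Example: one column of X^{-1} is obtained by solving X w = e_i,
# without ever forming the dense inverse:
#   w = cg_relative_residual(lambda v: X @ v, e_i, eps=1e-10)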

The following are the training times needed to reach a 1e-5 accurate solution on the fMRI dataset with p=20000, using different values of T_{outer} and T_{inner}:

T_{inner}   T_{outer}    cputime
1           1+iter       2415s
1           1+iter/3     1941s
1           1+iter/5     2054s
3           1+iter       2745s
3           1+iter/3     1582s
3           1+iter/5     1891s
5           1+iter       2581s
5           1+iter/3     1458s
5           1+iter/5     1315s

So generally T_{outer} cannot be too large or too small.
If T_{outer} is too large, QUIC takes too much time to solve each quadratic approximation;
if T_{outer} is too small, the convergence is slow.
However, T_{inner} can be large because each inner iteration is cheap (only taking O(p/k) time).
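
As a heavily simplified illustration of how the two loops nest (the block partition, the quadratic model, and the single-coordinate update are all placeholders here, not our actual code):

def solve_quadratic_subproblem(blocks, T_outer, T_inner, coordinate_update):
    # Schematic two-loop coordinate descent over variable blocks.
    # blocks:            a partition of the p variables (from graph clustering)
    # T_outer:           sweeps over all blocks per Newton subproblem
    # T_inner:           coordinate sweeps within a block before moving on
    # coordinate_update: placeholder for the single-coordinate update of the
    #                    regularized quadratic model
    for outer_sweep in range(T_outer):          # too large: subproblem over-solved; too small: slow convergence
        for block in blocks:                    # block sizes chosen so the needed columns of X^{-1} fit in memory
            for inner_sweep in range(T_inner):  # cheap: reuses the columns already cached for this block
                for coordinate in block:
                    coordinate_update(coordinate)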

2. Reviewer 1 "The proof does not cover the effect of the fixed-free set selection at the beginning of each iteration. Rather, the proof, which borrows many principles from the proof of the original QUIC algorithm, proves the result for J being the set of all variables, which is not the case for this algorithm."

We thank the reviewer for paying close attention to our technical details. Our proof does cover the effect of the fixed/free set selection, in the following manner: we first prove that the general block coordinate descent algorithm (Algorithm 2 in the appendix) converges, and then argue that under the fixed/free set selection, each Newton iteration is equivalent to the following two steps:

(1) perform an update on the fixed set, which does not change any variable (this can be proved from our fixed set selection criterion), and then
(2) perform an update on the free set.

Therefore our algorithm is a special case of Algorithm 2 and converges to the global optimum. Please see the detailed discussion in Appendix 6.2.
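
For readers unfamiliar with this QUIC-style selection, the following is a minimal NumPy sketch of the fixed/free split (illustrative only; the function name and dense-array interface are not our exact implementation):

import numpy as np

def fixed_free_split(X, S, W, lam):
    # X:   current iterate
    # S:   sample covariance matrix
    # W:   current approximation of X^{-1} (its columns are computed by CG)
    # lam: L1 regularization parameter
    # A variable (i, j) goes to the fixed set if X_ij = 0 and |S_ij - W_ij| <= lam,
    # i.e. the coordinate-wise optimality condition already holds at zero, so a
    # single-coordinate update would leave it unchanged.
    grad = S - W                  # gradient of the smooth part -logdet(X) + tr(SX)
    fixed = (X == 0) & (np.abs(grad) <= lam)
    free = ~fixed
    return fixed, free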

3. Reviewer 1: "It is unclear from the theoretical analysis how T_{outer} and T_{inner} are affecting the proof of convergence or the numerical results ... If you are indeed solving the inner/outer blocks inexactly, then please highlight its effect on the convergence proof and the numerical results."

Yes, in the theorems we assume each subproblem is solved exactly. We will highlight this assumption in the theorems.

4. Reviewer 3: "Can the authors elaborate on how they went about choosing the regularization parameter for domains this large?"

For synthetic data, the ground truth sparsity is known, so we start from a large lambda and gradually decrease it until the sparsity of *the final solution* is close to the ground truth sparsity. For the fMRI dataset, we tested our algorithm with a large lambda and then gradually decreased lambda until the solution had the desired sparsity. Since our algorithm is fast for large lambda, this selection process is not too time consuming.
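
A minimal sketch of this selection loop, assuming a hypothetical solver function solve(S, lam) and a target nonzero count (both illustrative, not part of our released code):

def select_lambda(S, lambdas_decreasing, target_nnz, solve):
    # S:                   input data needed by the solver (e.g. sample covariance)
    # lambdas_decreasing:  candidate regularization values, largest first
    # target_nnz:          desired number of nonzeros in the final solution
    # solve:               hypothetical solver, solve(S, lam) -> sparse estimate
    X = None
    for lam in lambdas_decreasing:        # large lambda first: these runs are fast
        X = solve(S, lam)
        nnz = int((X != 0).sum())
        if nnz >= target_nnz:             # solution has reached the desired sparsity
            return lam, X
    return lambdas_decreasing[-1], X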

5. Reviewer 3: "would it be possible to show or highlight the number of non-zeros during the optimization phase, and the amount of time this ends up taking?"

We list the number of nonzeros during the optimization phase for three test cases with p=20000:

               step1    step2    step3    step4    step5    step6    step7    step8    step9    step10
random graph   214156   196811   186681   186681   186681
chain graph    61558    59878    59876    59876    59876
fmri           315462   206680   206680   206680   137794   137404   144178   144064   144070   144062

So in our test cases, the number of nonzeros empirically does not grow too much in the intermediate steps.

6. Reviewer 3: "Is the choice of lambda to achieve degree 20 for the fMRI data actually reasonable? Can we verify that this is a reasonable choice via held-out log-likelihood, or some similar metric?"

The choice of degree 20 was based on a suggestion by an expert, but we agree that it would be interesting to analyze the network estimated by setting the regularization parameter lambda via cross-validation.

7. Reviewer 3: "I hope the authors consider releasing a public version of this code."

Yes, we will release the code soon.

8. Reviewer 3: "Did the authors consider using multifrontal LLT factorizations, etc?"

This is a great point. We tried sparse LLT factorization before using CG, but we found that in practice the number of nonzeros in L grows very quickly as p increases, even when the average degree of the matrix is fixed. For example, when we generate a random p-by-p matrix X with average degree about 10, the number of nonzeros in L (computed by a state-of-the-art sparse Cholesky factorization) grows roughly quadratically with the dimension, so sparse Cholesky factorization becomes infeasible when p > 100,000, even though the average degree is only 10.

p        nnz(X)    nnz(L)       time(sec)
100      888       893          0.01
500      5094      17494        0.01
1000     9992      57547        0.02
5000     19960     1327992      3.15
10000    99948     5388053      30.00
50000    500304    130377362    3245.00

Therefore, we use CG to compute the inverse and another approach to compute the log-determinant of a large sparse matrix.
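
For reference, a fill-in measurement of this kind can be reproduced with a short script like the following, assuming the scikit-sparse CHOLMOD bindings are available (the random-graph generator and the diagonal shift are illustrative choices, not our exact experimental setup):

import numpy as np
import scipy.sparse as sp
from sksparse.cholmod import cholesky    # assumes scikit-sparse is installed

def cholesky_fill_in(p, avg_degree=10, seed=0):
    # Build a random sparse symmetric matrix with roughly avg_degree nonzeros
    # per row, shift the diagonal to make it positive definite, and report the
    # number of nonzeros in the sparse Cholesky factor L.
    rng = np.random.default_rng(seed)
    m = p * avg_degree // 2                              # number of off-diagonal entries (upper part)
    rows = rng.integers(0, p, size=m)
    cols = rng.integers(0, p, size=m)
    vals = rng.standard_normal(m)
    A = sp.coo_matrix((vals, (rows, cols)), shape=(p, p))
    A = A + A.T                                          # symmetrize
    shift = abs(A).sum(axis=1).max() + 1.0               # enough for strict diagonal dominance
    A = (A + shift * sp.identity(p)).tocsc()
    factor = cholesky(A)                                 # CHOLMOD picks a fill-reducing ordering
    return A.nnz, factor.L().nnz

# Example: nnz_X, nnz_L = cholesky_fill_in(10000)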