Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a so-called unsupervised feature
selection method based on QR decomposition of the data matrix,
where the selected features are the pivots chosen by QR. The
definition here differs from standard feature selection as in [A][B].
The authors also discuss several randomized algorithms and sampling
methods as related work. Indeed, the problem in this paper relates more
closely to matrix factorization than to unsupervised feature
selection.
The complexity of the IQRP is discussed in Sections 2.3
and 4; however, the flow of the analysis is unclear and confusing. The
authors should provide more details.
One of the key advantages of this
work is its low complexity. How to select features in very high
dimensions in the unsupervised setting is an interesting topic. There are
many unsupervised feature selection methods; the authors should clarify
why choosing features for matrix factorization is important.
The
related work discussed in Section 2 does not cover common unsupervised feature
selection methods. The methods are neither filter methods, such as mRMR, ReliefF,
and FCBF, nor wrapper-based methods, such as RFE using clustering. Moreover, feature
selection using random strategies is not very meaningful, especially for
high-dimensional data. Let n = 1000000 and k = 10; it is then almost impossible
to select the k informative features (the probability that a single draw
chooses one of the k informative features is 0.00001).
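The arithmetic behind this claim can be checked in a few lines (n and k are the hypothetical values from the comment above):

```python
# Hypothetical numbers from the comment above.
n, k = 1_000_000, 10

# Probability that a single uniformly random draw hits an informative feature.
p_hit = k / n
print(p_hit)  # 1e-05

# Probability that k independent uniform draws miss ALL informative features.
p_all_miss = (1 - p_hit) ** k
print(p_all_miss > 0.9999)  # True: a random strategy almost surely finds none
```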
I am not convinced by the
experimental results. As discussed in [C], redundancy rate is an important
criterion for measuring the quality of the selected features. The authors
should report it for comparison. Even for unsupervised feature
selection, one can use clustering error or the reconstruction error of the
data matrix to evaluate the quality for a given number of selected
features. From Figure 7, although the proposed algorithm is faster at selecting
k features, I cannot assess the quality of the chosen features. One can
randomly choose k features in a very short time, but the quality of the
chosen features can be very poor. There are other efficient matrix
factorization methods [18,19] besides QRP; the authors should compare
the quality of the chosen features given the same amount of run time.
One suggestion is that the authors could use toy data with
known informative features to demonstrate how many informative features
each algorithm can recover. Overall, the experiments cannot demonstrate
the validity of the proposed feature selection method.
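The reconstruction-error evaluation suggested above can be sketched in a few lines; the toy data and the selected indices here are hypothetical illustrations, not results from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: 3 informative features plus 17 features that are
# (noisy) linear combinations of them, so the true structure is known.
m = 200
informative = rng.standard_normal((m, 3))
redundant = informative @ rng.standard_normal((3, 17)) \
    + 0.01 * rng.standard_normal((m, 17))
A = np.hstack([informative, redundant])

def rel_reconstruction_error(A, cols):
    """Relative Frobenius error of projecting A onto the span of the
    selected columns: ||A - C C^+ A||_F / ||A||_F."""
    C = A[:, cols]
    P = C @ np.linalg.pinv(C)            # orthogonal projector onto range(C)
    return float(np.linalg.norm(A - P @ A) / np.linalg.norm(A))

good = rel_reconstruction_error(A, [0, 1, 2])  # the informative features
bad = rel_reconstruction_error(A, [3])         # a single redundant feature
print(good < bad)  # True: good features reconstruct the data far better
```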
Moreover, the
authors should compare with other state-of-the-art unsupervised feature
selection methods.
The flow of the writing is not easy to follow, and the
organization can be greatly improved.
[A] Guyon and Elisseeff. An introduction to variable and feature selection. JMLR, 3:1157-1182, 2003.
[B] Hall. Correlation-based Feature Selection for Machine Learning. PhD thesis, University of Waikato, 1999.
[C] Zhao, Wang, Liu, and Ye. On similarity preserving feature selection. IEEE TKDE, 2013.
Q2: Please summarize
your review in 1-2 sentences
The idea of choosing columns for matrix factorization
may be interesting, but it is quite different from traditional
unsupervised feature selection methods. The authors should clarify the
problem setting.
The experiments report only timing; there is no quality evaluation
of the chosen features, so they cannot demonstrate the validity of the
proposed feature selection methods.
Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
The authors propose a way to accelerate the pivoted QR
algorithm by Businger and Golub by selecting multiple columns per
iteration instead of one column per iteration as in the classical
algorithm. They argue that the selected columns are the same as those
selected by the original pivoted QR algorithm, but the algorithm requires
far fewer passes through the data.
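For reference, the classical one-column-per-iteration selection that the authors accelerate can be sketched as follows (a minimal greedy Businger-Golub implementation on toy data; this is not the authors' IQRP):

```python
import numpy as np

def qrp_select(A, k):
    """Classical Businger-Golub column pivoting, sketched greedily:
    repeatedly pick the column with the largest residual norm, then
    project it out of all remaining columns (one column per iteration)."""
    R = np.array(A, dtype=float)
    pivots = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=0)
        if pivots:
            norms[pivots] = -1.0        # exclude already-selected columns
        j = int(np.argmax(norms))
        pivots.append(j)
        q = R[:, j] / np.linalg.norm(R[:, j])
        R = R - np.outer(q, q @ R)      # orthogonalization step
    return pivots

# Toy matrix: column 1 dominates, column 0 is parallel to it,
# columns 2 and 3 are independent directions.
A = np.array([[1.0, 10.0, 0.0, 0.0],
              [0.0,  0.0, 2.0, 0.0],
              [0.0,  0.0, 0.0, 1.0]])

print(qrp_select(A, 2))  # [1, 2]: the pivots, i.e. the selected features
```

Note how the duplicate column 0 is never chosen: after column 1 is selected and projected out, column 0's residual norm drops to zero.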
*Originality* I am not an
expert in numerical linear algebra, but the authors' idea of selecting
several features at once seems natural to me. However, as far as I can
tell after having superficially browsed related literature, the approach
is novel and properly placed in the context of existing work.
*Quality* The paper seems to be technically sound, but its
quality does not fully measure up to NIPS standards in my opinion. The
authors do argue in the conclusion that accelerated pivoted QR is
competitive with the state of the art, but the claim is not accompanied by
empirical results.
*Significance* The experimental results
seem impressive in terms of acceleration with respect to classical pivoted QR.
However, I may have missed something, but it is not clear to me how IQRP
compares to other state-of-the-art algorithms. The authors state in
Section 6 that IQRP is "competitive" with them in terms of run time but it
may be less accurate -- not having more information than this, I am
tempted to assume that it is actually not very competitive. Could the
authors comment on this?
*Clarity* The structure of the paper
is clear and makes sense, although the writing is a bit dry and lacks
flow, and the paper could be more self-contained. E.g., in Section 2.2.2,
there is no description whatsoever of "leverage"; the authors could at
least have given an intuition of what it represents. The experimental
section is not well written (e.g., "the real run time on a laptop" is not
very specific). If this were a paper submitted to a journal, I would ask
for a revision to allow the authors to improve the presentation.
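For readers in the same position: a common definition of rank-k column leverage scores, sketched here as an assumption about what the paper means by "leverage", is the squared per-column norms of the top-k right singular vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 30))   # m samples x n features
k = 5

# Rank-k leverage score of feature j: the squared norm of the j-th column
# of the top-k right singular vectors. It measures how strongly feature j
# participates in the best rank-k subspace; the n scores always sum to k.
_, _, Vt = np.linalg.svd(A, full_matrices=False)
leverage = np.sum(Vt[:k] ** 2, axis=0)   # one score per feature

print(round(float(leverage.sum()), 6))  # 5.0
```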
*Typos, comments on style and other details*
- Figure 1: "indexes" should be "indices".
- Figure captions should start with a capital letter.
- Page 4, line 210: "see, e.g., [21]" should be in parentheses.
- Section 4.1: use punctuation at the end of equations.
- When citing books, such as [9] or [21], the chapter or page number should be indicated.
- Figure 1: after step 1.1 you should add "$\tilde{x}_i = x_i$ for $i \in \{1, \ldots, n\}$"; otherwise the $\tilde{x}_i$ are never defined in Figure 2.
- Section 2.3: the middle paragraph is the second half of the last sentence of the first paragraph, so it should be concatenated to the first one, start with a lowercase letter, and end with a full stop. In the following paragraph, there is an extra full stop before "by a large factor".
- Section 5: "There is still significant savings" should be "There are ...", "Most interesting the observation" should be "Most interesting is the observation", and "We change k" should be "We vary k".
- Bibliography, [17]: CUR should be put in curly braces in the bib file; same for [13] and [14], and for Monte-Carlo.
- Page 8: "we observe that The IQRP" should be "... the ...".
- "memory efficient" and "time efficient" (resp. "memory optimized", ...) should be "memory-efficient" and "time-efficient" (resp. "memory-optimized", ...).
- Figure 4: "utilizing" -> "using"?
- "Gram Schmidt" should be "Gram-Schmidt".
Q2: Please summarize your review in
1-2 sentences
The authors consider the problem of unsupervised
feature selection and propose a data-dependent acceleration of the pivoted
QR algorithm by Businger and Golub. The paper is well structured and the
contribution may be significant; however, experimental comparison with
state-of-the-art algorithms is missing, and the writing lacks
quality.
Submitted by
Assigned_Reviewer_8
Q1: Comments to author(s).
This paper proposes improvements to the classical
pivoted QR algorithm ("QRP") which lead to time and space advantages
useful for feature selection in the large p setting (many variables). The
paper is well-written for the most part, and the presentation is clear
enough. The algorithm proposed in the paper follows from what might be
called low-hanging (or even "obvious") improvements to QRP, in the sense
that anyone working with large datasets in practice would begin to think
about these and related ideas. But, as in many practical settings, the
devil is in the details, and *details matter* particularly in large-scale
implementations. This is where the submission shines. The authors have
been careful about their CPU and memory bookkeeping at every step, and
additionally give a few nice low-level suggestions (e.g., efficient data
structures to consider; stability of orthogonalization subroutines is
discussed, etc.). On the whole this paper provides a detailed, thorough
picture for practitioners working with large datasets, and this is perhaps
its most salient selling point: one can go from the paper directly to
implementation. I believe this is enough of an advantage to overcome
concerns about novelty and the other complaints outlined below.
Other comments:
- The interpolative decomposition (ID) is closely related, but it is not
cited, let alone reviewed or compared. This is perhaps a major omission.
Yes, the ID is also related to CUR decompositions, which are mentioned,
but let's not assume the reader knows this or, by extension, how it
ultimately relates to the IQRP algorithm proposed in the paper. See:
  - work by Mark Tygert (NYU);
  - N. Halko, P. G. Martinsson, and J. A. Tropp, "Finding Structure with
    Randomness: Probabilistic Algorithms for Constructing Approximate
    Matrix Decompositions," SIAM Review, 2011.
- The ID and other randomized
techniques are reasonably efficient. So, if we are going to compare
randomized algorithms to IQRP, the efficiency advantage of the latter
seems to hinge critically on c being less than k. Why should c be
substantially less than k? There is no discussion really exploring this
critical, even key, point. The experiments suggest it may be true,
but the empirical evaluation is limited, which makes it hard to form a
concrete opinion as to whether one can expect c << k in a given
practical setting. Note that this is apparently what underpins the
choice between IQRP and a randomized ID, for instance.
- Experiments: major improvements are needed. For what
is mainly an applied/practical paper, the empirical evaluation is a
bit underwhelming. Could you at least show QRP on the same plots? Or
add comparisons to other, more competitive baselines one would
actually consider when faced with this design decision?
- What
is meant by "randomization algorithms can produce better features"? This
statement is never made precise. You also say "IQRP may be less
accurate" -- in terms of what? It would be interesting to see the accuracy
on the regression/classification problems themselves, using the features
selected by the various algorithms. Ultimately, this is what we are after,
at least for the datasets considered in the paper.
- There are
other, related applications of the pivoted QR decomposition that may
help your case, for example finding nearly dependent columns in forward
stepwise regression (and computing related statistics: F-stats,
t-stats, etc.).
Q2: Please summarize your review in
1-2 sentences
Although the fundamental improvements proposed here
are incremental, and the empirical evaluation is lacking for a mostly
applied submission, the paper does provide a clear, thorough enough
treatment to make the submission a useful, self-contained reference.
Q1: Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers for constructive comments.
We begin with general comments and then address individual reviewers.
GENERAL COMMENTS
Novelty - Some reviewers
considered the result incremental. We believe this increment to be
very significant. The QRP has stood unchallenged (in terms of passes and
speed) for nearly 50 years. As a robust QR, it is closely related
to 2 of the 10 most important algorithms of the 20th century [SIAM
News v33 n4].
Our idea is not simply selecting multiple vectors
(cf. Reviewer 6), but discovering an efficient method to identify SOME
selections that are PROVABLY identical to the QRP selections. To the
best of our knowledge, this was not previously known or even suspected.
Being able to run the QRP on big data may enable, in addition to
feature selection, many other data manipulations that use it as a
subroutine.
Comparison with randomized algorithms - Most
available results on the randomized algorithms we reviewed are
analyses of asymptotic behavior. Implementing these algorithms is
nontrivial. We are not aware of any experimental evaluation of
feature/column selection with sub-kmn performance. Reported experiments
(e.g., [2]) were performed with a deterministic SVD, which takes at
least kmn time. The fast leverage algorithms achieve their superior
asymptotic complexity under the conditions log n << k and m^2 << kn.
It is unclear at the moment how well they perform for datasets and k
values that do not satisfy these conditions.
With regard to feature quality: since
the IQRP produces the same features as the QRP, we did not think it
essential to report the quality of the selected features. We provided
references that discuss the quality of the QRP, including the recent
[12], which directly compares the QRP's quality with several
state-of-the-art algorithms.
Experimental evaluation -
We tested many more datasets than are reported in the paper.
(We also computed several other performance metrics.) The results
are very similar to those shown; we selected the largest well-known
datasets from UCI. The constraint we faced was the limited
amount of space.
We agree that in their current form the experiments
look "underwhelming", and we will try to improve them
significantly.
REVIEWER-5
Trying to understand this very negative review,
we checked references [A,B] provided by the reviewer. [A] is a
classic paper that we know almost by heart. It deals almost
exclusively with the supervised case; there are only two paragraphs on
unsupervised selection, and they are consistent with our approach.
(One of them discusses matrix factorization.) [B] is a PhD thesis from
Waikato that does not discuss unsupervised feature selection at all;
the purpose of giving it as a reference is not clear to us. Perhaps
our paper is not clear enough about the distinction between the
supervised and unsupervised cases.
Regarding the need to
evaluate feature redundancy as in reference [C] given by the reviewer:
the features selected by QRP/IQRP are non-redundant by design.
Regarding the comments on the inappropriateness of randomized methods
for big data: a very large and active research community, companies
such as Yahoo, Google, and Amazon, and grant-awarding agencies would
all strongly disagree.
REVIEWER-6
The reviewer's main concern appears to
be the comparison to the current state of the art, in particular our
claim of being competitive. As explained above, we are not aware of
implementations of leverage-based feature selection that use the
randomized leverage estimation. (We are aware of some work in
progress.) Our comparison was based on the asymptotic complexity
reported in publications. In terms of run time, the only potentially
sub-kmn algorithm requires fast leverage, with run time O(mn log n +
m^3). We used a very conservative estimate, taking the O()
notation constant to be 1. The datasets in our experiments satisfy mn
log n + m^3 > kmn for most k values, while the IQRP runs in sub-kmn
time; therefore in these instances the IQRP is faster. We believe
this justifies our claim of being competitive in terms of run time.
As for passes, we intend to change the text to be less ambiguous.
The IQRP is competitive in terms of the number of passes, and
typically outperforms the randomized algorithms in terms of effective
passes. Here again the competition is with the fast leverage
algorithms. In Section 2.2.2 we pointed out that one may need to run the
randomized leverage procedure several times: running it once requires
4 passes, and running it twice requires 6 passes. The results reported
in Fig. 7 are competitive, since only thrombin takes more passes,
and only for high k values. In our experiments the number of effective
passes of the IQRP was almost always below 2, which is better than the
number of passes of the randomized algorithms we discussed.
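This run-time comparison can be sanity-checked numerically; the dimensions below are hypothetical (not from the paper), the O() constant is taken as 1 as in the conservative estimate above, and the log base is an arbitrary assumption:

```python
import math

# Hypothetical dimensions (not from the paper): m samples, n features.
m, n = 1_000, 100_000

# Fast-leverage run time with the O() constant taken as 1, as in the
# conservative estimate above (base-2 log is an arbitrary assumption).
leverage_cost = m * n * math.log2(n) + m ** 3

# Compare against k*m*n, the cost a deterministic selection would pay.
for k in (5, 20, 100):
    print(k, k * m * n < leverage_cost)  # True for the smaller k values
```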
We will define leverage and accuracy, and remove the vague
"laptop run time" phrasing.
REVIEWER-8
Yes, the technique detailed in Halko et al.
for computing the QRP using the ID should have been discussed. It is a
major omission, but it is easily fixed. In a nutshell, their
technique may be faster than ours when the matrix rank is
approximately k; otherwise their row extraction step introduces
unacceptable inaccuracies, which led Halko et al. to recommend an
alternative O(kmn) technique (see their Remark 6.1). Our experience
indicates that in typical datasets this condition does not hold;
it would imply that the entire dataset can be very accurately estimated
by the k selected features.
With regard to run time guarantees, we
will add a proof that the IQRP is no slower than the QRP when l ~ k.
Anything beyond that is based on experimental data.
We intend
to improve the presentation and experiments, and to properly define
accuracy and leverage.