Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The paper presents bounds on the search performance of
a simple, tree-based nearest neighbor search algorithm. The bounds
depend on the vector quantization performance on the tree. It is
argued that this result implies that trees with good vector
quantization performance are advantageous for nearest neighbor search.
The statement is extended to large margin splits.
The title of
the paper asks "which space partitioning tree to use for search?" It
would be more accurate to ask "which tree comes with the strongest
performance guarantees?" The paper says almost nothing about practical
performance, mostly because of the choice of an artificially
simplified search procedure. More often than not, a better guarantee
is an artifact of a particular flavor of analysis or proof technique,
since we are only talking about upper bounds. If the bounds are not
tight, then they say little about which tree to "use" (in
practice!). This paper makes the common mistake of confusing a better
performance guarantee with a guarantee of better performance. This
happens at several spots, e.g., the first sentence of Section 4.
Algorithm 1 is analyzed in depth. However, I am unsure how
relevant this algorithm is. It descends the tree without backtracking,
and at the target depth l it performs an exhaustive search over the node
it has reached. Although this is not taken into account in the analysis,
this final search could itself be performed with an efficient exact tree
search, this time with backtracking.
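For concreteness, here is a minimal sketch of the search procedure as I understand it (the node structure and names are my own illustration, not the authors'):

```python
import numpy as np

def defeatist_search(root, query, target_depth):
    """Descend a binary space-partitioning tree without backtracking and
    return the best point stored under the node reached at target_depth.
    Each node is assumed to expose: is_leaf, depth, a hyperplane split
    (direction w, offset b), children left/right, and its points."""
    node = root
    # Greedy descent: always follow the side of the split the query falls on.
    while not node.is_leaf and node.depth < target_depth:
        node = node.left if np.dot(node.w, query) <= node.b else node.right
    # Exhaustive scan of the points under the node that was reached.
    points = np.asarray(node.points)
    dists = np.linalg.norm(points - query, axis=1)
    return points[np.argmin(dists)]
```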
This algorithm does not find the exact nearest
neighbor. The obvious alternative is to apply a heuristic for skipping
some of the branches on the fly. The decisive difference is that in
Algorithm 1 the decision of which branches to traverse is not made in a
data-dependent manner, but instead based on a pre-specified parameter.
This is why I personally would never use this algorithm. Since all
results are restricted to this algorithm, I question the relevance of
this paper.
I see that analyzing the computational complexity of a
method with backtracking is surely much harder. Still, I would argue
that doing so is a prerequisite for understanding the behavior of
realistic algorithms. I cannot shake the impression that this analysis
was conducted simply because it is possible, and not because it is
relevant.
The logic of the conclusion is as follows: for Algorithm
1, good VQ performance => good search performance. But Algorithm 1
is not a good search algorithm by itself, and when using more elaborate
tree-search procedures there remains little to nothing to conclude
from the present analysis. However, the title as well as other
statements in the paper (e.g., top of page 5) suggest that the
conclusion is rather general. I want to emphasize that this is not the
case, and thus this paper does NOT answer the question of which search
tree to use in practice.
I would appreciate the
result if it could help the analysis of more realistic search
procedures. However, I am not an expert on the analysis of tree search
and thus cannot judge the present paper from this perspective. Also,
this paper does not claim to provide new methods of analysis; it is
all about the theorems. And this makes me question its value.
The empirical evaluation is weak. Only four data sets are used,
and they are even non-standard. E.g., why is MNIST sub-sampled to 6000
training and 1000 test samples? This is arbitrary, and no reason is
given. This does not help my trust in the evaluation. With low numbers
of training points there is no real need for tree search at all. I
see that the empirical results are nicely in line with the analysis.
However, what about computation time? E.g., a single split decision in a
kd-tree is cheaper than in a PA-tree by a factor of O(dim(X)). The
added computation time could be used for backtracking, which could
well give the kd-tree an advantage. So once more, this analysis says
nothing about which tree to use for search with a better algorithm.
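To make the cost gap concrete, a small sketch of the per-node split test in the two trees (my own illustration, not from the paper): a kd-tree inspects a single coordinate, while a PA-tree projects the query onto a full-dimensional split direction.

```python
import numpy as np

def kd_tree_decision(query, split_dim, threshold):
    # O(1): compare one coordinate of the query against the threshold.
    return query[split_dim] <= threshold

def pa_tree_decision(query, split_direction, threshold):
    # O(dim(X)): project the query onto the split direction
    # (e.g. a principal axis) before comparing against the threshold.
    return np.dot(split_direction, query) <= threshold
```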
Q2: Please summarize your review in 1-2
sentences
I do not trust that this simplified analysis will
actually answer the question posed in the title for practical
purposes, because an overly simplified search algorithm is
considered. This reduces the relevance of the analysis to nearly zero.
I have just read the author feedback. I find my points addressed
very well and convincingly. I have changed my decision accordingly.
Thanks for the good feedback!
Submitted by
Assigned_Reviewer_7
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
Paper Summary: The authors formalize and prove the
long-standing assumption that partition trees with better quantization
rates also have better nearest neighbor search performance. Inspired by
this, they propose the ‘max margin’ partition tree and relate its search
performance to the margin. Experiments with some real-world data
validate that the ‘max margin’ tree and partition trees with good
quantization error yield good nearest neighbors.
Review: I really like
that the authors were able to formalize the long-standing assumption that
partition trees with better quantization rates tend to have better nearest
neighbor search performance. The formalism and conditions proposed in
Theorem 3.1 are intuitive, and the performance is nicely related to the
‘expansion constant’ (a formalization of the intrinsic dimension of the
input sample). That said, the bound provided in Eq. 4 can be made tighter
(see below).
I do have a few suggestions that the authors should
consider adding to the current text:
- while using the expansion
dimension is intuitive, it requires a strong assumption such as
condition C1 to hold over _every_ convex set in the underlying space. I
believe (I haven't really checked it) that if the authors consider the
doubling dimension instead, they don't have to explicitly assume C1,
which would improve the overall readability of the text (perhaps make a
note that bounded doubling dimension also implies the result? see the
definitions sketched after this list).
- perhaps the authors may want to add a short discussion of what happens
if the input S comes from an underlying distribution.
- why the notation
‘\tilde c’ for the expansion constant? Consider changing it to ‘c’.
(Theorem 3.1)
- I believe that the bound provided in Eq. 4 can be made
tighter. Shouldn't the ratio in lines 531--532 always be less than 1?
This could significantly improve Eq. 14.
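For reference, the two notions of intrinsic dimension referred to above, in their standard forms (the paper's exact conditions may be stated slightly differently):

```latex
% Standard definitions, as in the Karger--Ruhl / cover-tree literature.
\begin{itemize}
  \item Expansion constant: the smallest $c \ge 2$ such that
        $|B(p, 2r) \cap S| \le c \, |B(p, r) \cap S|$
        for every $p \in S$ and every radius $r > 0$.
  \item Doubling dimension: the smallest $d$ such that every ball of
        radius $r$ can be covered by at most $2^{d}$ balls of radius $r/2$.
\end{itemize}
```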
Quality: The paper
systematically explores the connection between vector quantization and
nearest neighbor search, providing both a sound theory and convincing
experiments. The bound seems to be looser than expected.
Clarity:
The paper is written clearly.
Originality: The work is
original.
Significance: I believe this work is very
significant as it formalizes the long standing belief that partition trees
with better quantization rates tend to have better nearest neighbor search
performance. This work can encourage formal analysis of other fast nearest
neighbor schemes.
Update after the author response: Thanks
for the clarifications. In the past week, I came across the following
COLT 2013 paper that is highly relevant to this submission: "Randomized
Partition Trees for Exact Nearest Neighbor Search". The authors should
update their discussion and compare their results with this paper as
well.
Q2: Please summarize your review in 1-2
sentences
The authors have done a good job of formalizing the long-standing
intuition that partition trees with better quantization rates
tend to have better nearest neighbor search performance. The authors
could do a tighter analysis to improve the bound. Experiments with real
data corroborate that the theoretical intuition also works well in
practice.
Submitted by
Assigned_Reviewer_8
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The authors study space-partitioning trees for
nearest-neighbor search. More specifically, they study the error
caused by a DFS-style search (without backtracking) and show a
connection between this error and the quantization error of the tree.
The authors'
work is interesting; however, there seem to be some weaknesses.
The authors demonstrate the connection between the search error and the
quantization error by bounding the search error with a (rather
complex) bound whose parameters are derived from the quantization
error. However, this connection is implied only if the bound is tight,
which does not seem to be the case:
Consider the expansion
coefficient c in Theorem 3.1. This coefficient depends on q. Suppose we
select q to be far enough from S: say that min_{x \in S} |x
- q| is larger than the distance between any two points in S. This implies
that the expansion coefficient is n, which forces L in C2 to be 0, making
Theorem 3.1 unusable for this case. The problem here is that
Theorem 3.1 analyzes only the upstairs (numerator) of the distance error.
I think a more solid result could be obtained if the downstairs
(denominator) term were also taken into account simultaneously.
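To spell out the arithmetic behind this example (my restatement, using the standard expansion condition with balls centered at the query):

```latex
Let $r = \min_{x \in S} \|x - q\|$ with $r > \operatorname{diam}(S)$.
Then $B(q, r) \cap S$ contains only the nearest point(s) of $q$, while
$B(q, 2r) \supseteq S$, since every $x \in S$ satisfies
\[
  \|x - q\| \le \operatorname{diam}(S) + r < 2r .
\]
Hence any $c$ with $|B(q, 2r) \cap S| \le c\,|B(q, r) \cap S|$ must be on
the order of $n$.
```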
The fact that c depends on q
implies that Theorem 3.1 cannot be used in practice to obtain
actual error bounds while doing the actual search, since computing the
expansion coefficient seems to be very computationally expensive.
Theorem 3.1 can only be used as theoretical evidence for the
connection between the search error and the quantization error.
Let us now assume that q is actually "within" S, so that
the expansion coefficient does not change much when we add q to
S. Consider the following case: copy S and translate the copy away from S
such that the two copies are fully separated. The expansion
coefficient should stay about the same, as should \omega (since any
reasonable tree would first separate the copies from each other,
giving \omega = 0 for the first level). While the quantization
improvement rate will be excellent at the first level, it will be as
bad at the subsequent levels as with the original S. Consequently, \beta
will stay the same. As we move the copies further away from each other,
\psi will get larger, and the bound will become loose. I think a better
approach would be not to consider \psi and \beta globally, but instead
to work at a "local" level.
While these cases may not happen in
practice, I would like to see the authors demonstrate how tight the
bound is, for example empirically, similarly to Figure 2.
Technical
writing could be improved. The authors tend to use accents and indices
in cases where they are not really needed. For example,
\tilde{c}
-> c
B_{l_2} -> B
Use less exotic calligraphy for the
intrinsic dimension d and the full dimension D.
027: - -> ---
Definition 2.1 is a bit messy. It would be better to just say
that the quantization error improvement rate equals \beta = V_S(A)
/ (V_S(A) - V_S(A_l, A_r)) (see the small numerical sketch after this list).
C2: what do you mean by "complete"?
C4: in Theorem 3.1, are all nodes in T included or only leaves?
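As a quick illustration of the suggested restatement, a sketch computing \beta for a single split; here V_S is taken as the mean squared distance to the node centroid, and the error after the split as the size-weighted average over the children (the paper's exact weighting may differ):

```python
import numpy as np

def vq_error(points):
    """Mean squared distance of the points to their centroid
    (the usual vector-quantization error of a node)."""
    centroid = points.mean(axis=0)
    return np.mean(np.sum((points - centroid) ** 2, axis=1))

def improvement_rate(A, A_left, A_right):
    """beta = V_S(A) / (V_S(A) - V_S(A_l, A_r)): how much one split
    reduces the quantization error of node A."""
    v_parent = vq_error(A)
    v_children = (len(A_left) * vq_error(A_left)
                  + len(A_right) * vq_error(A_right)) / (len(A_left) + len(A_right))
    return v_parent / (v_parent - v_children)
```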
Q2: Please summarize your review in 1-2
sentences
The authors bound the search error of a non-backtracking
DFS over a space-partitioning tree (for NN search) in terms of the
quantization error. This is an interesting paper, and while the
connection makes sense, I am not convinced that the provided bound is
tight, which dilutes the impact of the paper.
Q1: Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We would like to thank the reviewers for their time
and feedback. We hope that the ensuing discussion is able to provide some
clarification:
Regarding the claims of the paper
* We have tried to qualify throughout the paper that better VQ and larger
margins imply better search performance *guarantees*.
* The empirical results show that the actual search performance aligns
with the proven performance guarantees.
* MNIST (& Tiny Images) are subsampled because search-error analysis
requires brute-force search, which is very expensive on large datasets in
terms of computation and memory.
* The goal of this paper was to identify factors that affect search
performance guarantees. No single method is best on all datasets. Knowing
the potential factors affecting search allows us to probe the dataset in
question to decide which method to use. For example, the results in this
paper indicate that, for a given dataset, trees with better VQ performance
and large-margin splits on that dataset might be a better choice for
search performance.
Regarding the simplicity of the algorithm analyzed
* Algorithm 1 is used for clarity of presentation. For a tree of depth l,
this is the first part of the usual backtracking search algorithm, in
which the query reaches its first leaf. The backtracking algorithm
continues from here, going back up the tree to look for better candidates.
* The result in this paper can be used to obtain search-time guarantees
for the complete backtracking algorithm in the following way: Thm 3.1
gives the worst-case distance (say d) of the query to the candidate
neighbor in the first leaf visited. Lemma 3.1 (implicitly) bounds the
diameter (say D) of any leaf node. Then the backtracking search algorithm
will only consider leaves completely contained in the ball of radius d+D
around the query. The only (non-trivial) remaining step is to bound the
number of points that might lie in this ball (this would probably depend
on the intrinsic dimension). Now better VQ implies better bounds on both
d and D (from our results). Hence better VQ implies a smaller worst-case
ball around the query, and hence a better worst-case runtime for the
backtracking algorithm.
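A compact restatement of the second point above (same notation; this is a sketch of the argument, not a result from the paper):

```latex
Let $x_1$ be the candidate returned by Algorithm~1 for query $q$, with
$\|q - x_1\| \le d$ (Thm.~3.1), and let every leaf have diameter at most
$D$ (Lemma~3.1).  A leaf can contain a point closer to $q$ than $x_1$
only if it intersects $B(q, d)$, and every point of such a leaf lies
within $d + D$ of $q$:
\[
  \text{leaf} \cap B(q, d) \neq \emptyset
  \;\Longrightarrow\;
  \text{leaf} \subseteq B(q, d + D).
\]
Hence the backtracking search examines at most $|B(q, d + D) \cap S|$
points, and better VQ (smaller $d$ and $D$) shrinks this ball.
```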
Regarding the comparison of kd-trees to other BSP-trees
* The superiority of the PA-tree/2M-tree/etc. over the kd-tree in search
time has been empirically exhibited in the cited papers. While the
reviewer's observation (that kd-trees require O(1) computation to make
split decisions while all other trees require O(dim(X)) time) is correct
and valid, in practice the search time is generally dominated by the
number of points encountered at the leaves and the total number of leaves
visited, not by the node split decisions. This is why the
PA-tree/2M-tree/etc. have been shown to perform better than the kd-tree
on medium- to high-dimensional datasets. For the kd-tree, the
(backtracking) search generally ends up traversing all tree leaves.
Regarding the choice of expansion constant over doubling dimension
* The expansion constant is known to be fragile (as indicated by the
reviewer's examples). The doubling dimension is a more robust notion of
intrinsic dimension, but this robustness makes it hard to work with: no
search-time guarantees are known with the doubling dimension; the only
known results are with the expansion constant. Ideally we would like
results with the doubling dimension, but we do not have any yet.
* With a distributional assumption on S and q, it can be shown that the
expansion constant is very close to the doubling (or Federer) measure of
the distribution, which is more robust than the expansion constant.
Regarding the tightness of the results
* The reviewer's counterexample demonstrates the looseness resulting from
considering a single beta* for the whole tree and suggests working at a
local level. The goal of this analysis was to present bounds that compare
different trees. If psi changes, it changes for all trees; hence the
relative performance of the trees does not change, even though the
current bound might get looser.
* We considered a single beta* throughout the tree to reduce notation in
the result. If we keep track of beta(n) for every node n in the tree,
then our current proof still works and the exp(-l/beta*) term in Eq. 4 is
replaced by exp(-\sum_{n \in N} 1/beta(n)), where N is the set of nodes
seen by Algorithm 1 for the query. This is a tighter result.
* The ratio in lines 531-532 is not necessarily less than 1. An example
is a node containing some (say half the) points that are very closely
clustered, while the rest are spread far apart from this cluster and from
each other. If a split separates the clustered points from the rest, the
ratio can be greater than 1.
* We do not yet have matching lower bounds to complement the presented
upper bounds on the search error.
Regarding clarification questions
* In condition C2, a complete tree of depth l means a binary tree with
all 2^l nodes present at depth l.
* Condition C4 holds for all nodes at depth l.