Submitted by
Assigned_Reviewer_4
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The authors consider the natural problem of bounding
the number of control questions needed to evaluate workers’ performance in
a crowdsourcing task. They posit two methods of leveraging control
questions. One is a two-stage estimator that produces two estimates, one
for the biases of the workers and another for the bias-neutralized
estimate for the true label. For this method they show that, to minimize the MSE of the estimator, one needs O(sqrt(L)) control questions per worker, where L is the number of labels provided by each worker.
The other method they consider is a joint-estimator of the biases
and true labels. This model turns out to be more complex to analyze and
the authors solve the problem by connecting the MSE to the eigenvalues of
the adjacency matrix of the bipartite graph connecting items to workers.
Here the bound on the number of control items turns out to be
O(L/sqrt(n)), where n is the total number of items labeled. Since the number of labels given to each worker satisfies L <= n, we have L/sqrt(n) <= L/sqrt(L) = sqrt(L), so this bound is at worst O(sqrt(L)) and is much better than the two-stage estimator, in terms of the number of control questions needed, as n --> \infty. The joint-estimator crucially
relies on the structure of the assignment graph of items to workers. In
particular, the bound mentioned above holds if the assignment graph is an
expander. Happily, a random graph is an expander almost surely and a
random L-regular assignment scheme works.
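To make the two-stage scheme concrete, here is a minimal NumPy sketch of the setup as this reviewer reads it (the Gaussian fixed-bias model, the shared noise level, and all variable names are assumptions for illustration, not the authors' code): each worker's bias is estimated from the control items alone and then subtracted before averaging the target labels across workers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, n_targets, k, sigma = 30, 200, 10, 0.5   # k = control items per worker (assumed)

bias = rng.normal(0.0, 1.0, n_workers)        # fixed per-worker bias
mu_c = rng.normal(0.0, 2.0, k)                # control items: ground truth known
mu_t = rng.normal(0.0, 2.0, n_targets)        # target items: ground truth unknown

# Every worker labels every item (uniform assignment, for simplicity).
y_c = mu_c + bias[:, None] + rng.normal(0.0, sigma, (n_workers, k))
y_t = mu_t + bias[:, None] + rng.normal(0.0, sigma, (n_workers, n_targets))

# Stage 1: estimate each worker's bias from the control items alone.
bias_hat = (y_c - mu_c).mean(axis=1)

# Stage 2: bias-neutralized average over workers for each target item.
mu_hat = (y_t - bias_hat[:, None]).mean(axis=0)

print("two-stage MSE:", np.mean((mu_hat - mu_t) ** 2))
print("naive average MSE:", np.mean((y_t.mean(axis=0) - mu_t) ** 2))
```

Note that this sketch ignores the labeling budget; in the paper's setting the control labels come out of each worker's L labels, which is where the trade-off behind the O(sqrt(L)) rule arises.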
The paper concludes with some realistic experiments which show that the empirical performance follows the theoretical results. I found it interesting (and appreciated) that the authors also investigated what happens if some of the assumptions made in the derivations do not hold; they discover that the joint-estimator method, though better in its utilization of control questions, is not as robust as the two-stage method when the model is mis-specified. The paper is well written, and though the proofs are a little too concise for my taste, the reviewer understands that this might be because of the page limits.
Minor typo: In the conclusion section, line #371, O(L) should be O(sqrt(L)).
Q2: Please summarize your review in 1-2 sentences
The reviewer feels that this paper contains important
and interesting results and recommends the paper for
acceptance.
Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The paper considers various methods of achieving
consensus labeling in a crowdsourcing setting, specifically the special
case where some real-valued quantity has to be estimated by, e.g., averaging estimates from multiple users. If individuals have a fixed
bias, and some truth values are available, the bias could be estimated
using only the true values, or using all labels provided by the user.
The paper provides theoretical results under this specific data
model for these two schemes, in an effort to estimate how many true values
are needed.
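To illustrate the second option in the same spirit (again a hedged sketch under an assumed Gaussian fixed-bias model, not the authors' estimator), the worker biases and the unknown item values can be fit jointly by least squares on control and target labels together, with the known control values pinning down the biases; a simple alternating update is shown below.

```python
import numpy as np

rng = np.random.default_rng(1)
n_workers, n_targets, k, sigma = 30, 200, 10, 0.5

bias = rng.normal(0.0, 1.0, n_workers)
mu_c = rng.normal(0.0, 2.0, k)               # control items: truth known
mu_t = rng.normal(0.0, 2.0, n_targets)       # target items: truth unknown
y_c = mu_c + bias[:, None] + rng.normal(0.0, sigma, (n_workers, k))
y_t = mu_t + bias[:, None] + rng.normal(0.0, sigma, (n_workers, n_targets))

# Alternating least squares on sum_ij (y_ij - mu_j - b_i)^2,
# with mu_j fixed to the known truth on the control items.
b_hat = np.zeros(n_workers)
for _ in range(50):
    mu_hat = (y_t - b_hat[:, None]).mean(axis=0)      # update target values
    resid = np.hstack([y_c - mu_c, y_t - mu_hat])     # residuals over all items
    b_hat = resid.mean(axis=1)                        # update worker biases
mu_hat = (y_t - b_hat[:, None]).mean(axis=0)

print("joint MSE:", np.mean((mu_hat - mu_t) ** 2))
```

Intuitively, the bias update here also draws on the target residuals, which is why the joint scheme can make do with fewer control items.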
The theoretical work seems solid, and matches up
fairly well with empirical data in the simulations. In particular, as one
might intuitively expect, the joint estimation scheme is asymptotically
much preferred as more questions need to be answered. Overall, it's a
nice combination of a theoretical contribution and empirical evaluation on
a current topic. Many questions remain on the model itself, and perhaps
the authors could discuss some of these details in the paper.
Some general questions on the model:
1. Relevance: are there many such "real-valued-estimation" problems that could in fact benefit from crowdsourcing? The authors mention forecasting as a possible application. Would this bias- or bias-variance model be empirically appropriate for those settings?
2. Model structure: consider an extremely simplified model where all workers share a bias, i.e., the crowd-average is always off by the same quantity irrespective of the crowd. Then these estimation schemes are inappropriate/can be vastly simplified. How would such a model work in practice? E.g., the football dataset suggests that a "variance-only" model may in fact work out better.
3. In general, model appropriateness is a challenge - what would the authors suggest for figuring out the appropriate model? A larger control experiment, or other strategies?
Q2: Please summarize your review in 1-2 sentences
The paper provides theoretical results and empirical
evaluation of two specific models of consensus labeling, addressing the
question of how many "pre-labeled" items are needed to achieve robust
consensus labeling. The paper is a nice combination of theory and
evaluation on a current, relevant problem.
Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper examines the problem of determining
what fraction of control questions (with known answers) vs. target
answers (unknown) to use when using crowdsourcing to estimate
continuous quantities (i.e., not categorical judgments). The authors
describe two models for using control questions to estimate worker parameters (bias and variance relative to the true answers): a two-stage estimator which estimates worker parameters from the control items alone, and a joint estimator which produces an ML estimate using both control and target items. The authors derive expressions for the optimal number of target items (k) for each case, beginning with a clear statement of the results and then going through detailed but clear derivations. They then demonstrate their results empirically on both synthetic and real data, showing how the estimates align with the true optimal k in cases where the model is a perfect match to the data vs. misspecified (for the synthetic data), and then show the practical effects of misspecification when dealing with real data. They close with
valuable recommendations for practitioners in terms of choosing
between the models.
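The synthetic experiments described here can be mimicked in a few lines: fix a labeling budget per worker, sweep how many of those labels are spent on control items, and locate the MSE-minimizing split empirically. The sketch below does this for a two-stage estimator under an assumed Gaussian fixed-bias model; the budget L, the noise level, and the random assignment scheme are invented for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n_workers, n_items, L, sigma = 50, 300, 60, 0.5   # L = labels per worker (assumed budget)
bias = rng.normal(0.0, 1.0, n_workers)
mu = rng.normal(0.0, 2.0, n_items)                # unknown target values

def two_stage_mse(k):
    """Spend k of each worker's L labels on control items, the rest on random targets."""
    mu_c = rng.normal(0.0, 2.0, k)
    y_c = mu_c + bias[:, None] + rng.normal(0.0, sigma, (n_workers, k))
    b_hat = (y_c - mu_c).mean(axis=1)
    num, cnt = np.zeros(n_items), np.zeros(n_items)
    for i in range(n_workers):
        items = rng.choice(n_items, size=L - k, replace=False)   # random target assignment
        y = mu[items] + bias[i] + rng.normal(0.0, sigma, L - k)
        np.add.at(num, items, y - b_hat[i])
        np.add.at(cnt, items, 1)
    seen = cnt > 0
    return np.mean((num[seen] / cnt[seen] - mu[seen]) ** 2)

ks = list(range(2, 40, 2))
mses = [np.mean([two_stage_mse(k) for _ in range(10)]) for k in ks]
print("empirically best k:", ks[int(np.argmin(mses))])           # compare with the derived rule
```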
First of all, this is a very important and
highly practical setting - as someone who has run many crowdsourced
tasks as well as read/heard many accounts from others, using control
questions is a tried and true method of estimating bias and variance
in judges; much of the past theoretical work has ignored this and
assumed no such control questions are available. While control
questions could be used in these other methods in principle, I
know of no previous paper that has examined the *value* of control
items and how many should be used in a given setting.
I have often
wondered about this question in my own experiments, and have
considered working on the problem myself; as such I was delighted to
read this thorough treatment by the authors. This paper is excellent
in so many ways: it is unusually clear in its writing, from the
motivation and setup to the explanation of the strategy and
purpose of their approach before diving into the derivations, to the
setup and explanation of the experiments and their implications. The
past literature is well-covered, the figures are clear, the notation
and development are easy to follow. The estimation algorithms and
the optimal k are clear, and the discussion of the effects of
model mismatch and recommendations for real settings/applications
are insightful and valuable. I would recommend this paper not only to
my colleagues who work on the analysis of crowdsourcing, but also to
many others who are users of crowdsourcing: an excellent paper all
around, that I expect will be well-regarded at the conference and
well-cited in future years.
Q2: Please summarize
your review in 1-2 sentences
This excellent paper examines the relative value
of a given number/fraction of control items (i.e., with known
answers) to estimate worker parameters when estimating continuous
quantities via crowdsourcing. This is a novel and extremely practical
investigation, as control items are widely used in an ad-hoc manner in
practice. The paper is exceedingly clear and well-structured, and
well-supported by careful experiments on synthetic and real datasets
showing the practical performance of the derived estimates.
Submitted by
Assigned_Reviewer_7
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper considers the recently popular problem of
aggregating low-quality answers collected from crowds to obtain more
accurate results. The central question here is how many control
examples (whose ground truth labels are known) are required to obtain the
most accurate results. Using a simple Gaussian model with worker
ability parameters, the authors evaluate expected errors for two
estimation strategies: two-stage estimation and joint estimation, from
which the optimal numbers of control items are derived.
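For readers unfamiliar with the setup, the "simple Gaussian model with worker ability parameters" can be read as the generative process sketched below (the parameter names and distributions are guesses for illustration, not the paper's exact specification): each worker has a bias and a noise level, and each reported label is the true value shifted by the bias plus worker-specific Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(3)
n_workers, n_items = 20, 100

mu = rng.normal(0.0, 2.0, n_items)           # true item values; a few serve as control items
bias = rng.normal(0.0, 1.0, n_workers)       # per-worker bias ("ability" offset)
sigma = rng.uniform(0.2, 1.0, n_workers)     # per-worker noise level ("ability" spread)

# Label from worker i on item j: y[i, j] = mu[j] + bias[i] + eps,  eps ~ N(0, sigma[i]^2)
y = mu[None, :] + bias[:, None] + sigma[:, None] * rng.normal(0.0, 1.0, (n_workers, n_items))
```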
Although I
found no apparent flaw in the analysis and the experiments support the
claims as long as several assumptions hold, the main concern is the
assumption of uniform task assignments to workers. In most
crowdsourcing situations, the assumption is not so realistic; some workers
complete many tasks, but most workers do only a few. Whether or not
the proposed method is robust to such situations is not evaluated in the
experiments since all of the datasets used in the experiments follow the
assumption.
It would be nice if an extension to discrete values were discussed. Also, the authors should mention several existing works incorporating control items into statistical quality control, such as Tang & Lease (CIR11) and Kajino & Kashima (HCOMP12).
Q2: Please summarize your review in 1-2
sentences
The problem is interesting, but the assumption of
random task assignments might limit the applicability of the proposed
method.
Q1: Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank all the reviewers for their helpful comments
and suggestions. The main concern of AR_5 and AR_7 is the validity of the model assumptions, to which we argue that (1) all models are in some sense “wrong”, but we show that our models are *useful* in the sense of providing significantly better predictions on the real datasets, and (2) the model assumptions are made in part to illustrate the theoretical analysis of the control items, but our results can be extended to more complicated models in the asymptotic regime.
Some more detailed
responses to the questions by the reviewers:
Assigned_Reviewer_5:
[Are there many such "real-valued-estimation" problems that could in
fact benefit from crowdsourcing?]: There are enormously important
“real-valued estimation” problems that benefit from crowdsourcing,
including the forecasting of event probabilities and point spreads that
we mention in the paper. Another important area is the forecasting of
economic indicators, such as GDP growth and inflation, e.g., the Wall
Street Journal reports forecasts of economic indicators made by a crowd of
around 50 macroeconomists every six months.
[Would the bias- or
bias-variance model be empirically appropriate for those (forecasting)
settings?]: Our models are empirically *useful* in the sense that they
provide significantly better predictions than the baseline uniform averaging methods, as shown on our real datasets. Of course, there may exist
better (possibly more complicated) models for these data.
[Consider an extremely simplified model where all workers share a
bias, i.e., crowd-average is always off by the same quantity irrespective
of the crowd. Then these estimation schemes are inappropriate/can be
vastly simplified.]: Again, for specific problems, other models may be
more appropriate, but some of the same issues likely apply. As we mention
in the paper, at least one control item must be used to fix the
un-identifiability issue in this case.
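As a small concrete check of this identifiability point (a hypothetical single-worker sketch, not taken from the paper): if every label is y_j = mu_j + b + noise with one shared bias b, then shifting all item values by a constant c and the bias by -c explains the data equally well, so at least one item with known truth is needed to pin b down.

```python
import numpy as np

rng = np.random.default_rng(4)
mu = rng.normal(0.0, 2.0, 50)            # hypothetical true values
b = 0.7                                  # single bias shared by the whole crowd
y = mu + b + rng.normal(0.0, 0.3, 50)    # observed labels

# (mu + c, b - c) leaves the residuals, and hence the fit, unchanged for any shift c:
for c in (0.0, 1.0, -3.5):
    print(c, np.allclose(y - (mu + c) - (b - c), y - mu - b))   # True for every c
```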
[Model appropriateness is a
challenge - what would the authors suggest for figuring out the
appropriate model? a larger control experiment, or other strategies?]:
Model selection is an important issue on its own, which we would like to
pursue as one of our future directions.
Assigned_Reviewer_7:
[The main concern is the assumption of uniform task assignments to
workers]: The assumption of uniform assignment is again made mainly for
simplicity of analysis and notation. Our results should be readily
extensible to other cases in which the real datasets have non-uniform degree distributions.
[It would be nice if extension to discrete values
were discussed.]: The results on discrete models are much more
complicated, and require very different tools to analyze, which we are
planning to study in future work. But we expect that our general scaling
rules remain correct on discrete models (e.g., our recent results have
shown that the two-stage estimator for a standard model for discrete
values requires an optimal number of C*sqrt(\ell) control items, where C is a constant that depends on the model parameters).
[The authors should
mention several existing work incorporating control items into statistical
quality control]: We would be happy to learn of and incorporate any
specific suggestions if the reviewer can provide them.