{"title": "Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing", "book": "Advances in Neural Information Processing Systems", "page_first": 1, "page_last": 9, "abstract": "Crowdsourcing has gained immense popularity in machine learning applications for obtaining large amounts of labeled data. Crowdsourcing is cheap and fast, but suffers from the problem of low-quality data. To address this fundamental challenge in crowdsourcing, we propose a simple payment mechanism to incentivize workers to answer only the questions that they are sure of and skip the rest. We show that surprisingly, under a mild and natural no-free-lunch requirement, this mechanism is the one and only incentive-compatible payment mechanism possible. We also show that among all possible incentive-compatible  mechanisms (that may or may not satisfy no-free-lunch), our mechanism makes the smallest possible payment to spammers.  Interestingly, this unique mechanism takes a multiplicative form. The simplicity of the mechanism is an added benefit.  In preliminary experiments involving over several hundred workers, we observe a significant reduction in the error rates under our unique mechanism for the same or lower monetary expenditure.", "full_text": "Double or Nothing: Multiplicative\n\nIncentive Mechanisms for Crowdsourcing\n\nNihar B. Shah\n\nUniversity of California, Berkeley\nnihar@eecs.berkeley.edu\n\nDengyong Zhou\nMicrosoft Research\n\ndengyong.zhou@microsoft.com\n\nAbstract\n\nCrowdsourcing has gained immense popularity in machine learning applications\nfor obtaining large amounts of labeled data. Crowdsourcing is cheap and fast, but\nsuffers from the problem of low-quality data. To address this fundamental chal-\nlenge in crowdsourcing, we propose a simple payment mechanism to incentivize\nworkers to answer only the questions that they are sure of and skip the rest. 
We\nshow that surprisingly, under a mild and natural \u201cno-free-lunch\u201d requirement, this\nmechanism is the one and only incentive-compatible payment mechanism pos-\nsible. We also show that among all possible incentive-compatible mechanisms\n(that may or may not satisfy no-free-lunch), our mechanism makes the small-\nest possible payment to spammers. Interestingly, this unique mechanism takes a\n\u201cmultiplicative\u201d form. The simplicity of the mechanism is an added benefit. In\npreliminary experiments involving over several hundred workers, we observe a\nsignificant reduction in the error rates under our unique mechanism for the same\nor lower monetary expenditure.\n\n1 Introduction\n\nComplex machine learning tools such as deep learning are gaining increasing popularity and are\nbeing applied to a wide variety of problems. These tools, however, require large amounts of labeled\ndata [HDY+12, RYZ+10, DDS+09, CBW+10]. These large labeling tasks are being performed by\ncoordinating crowds of semi-skilled workers through the Internet. This is known as crowdsourcing.\nCrowdsourcing as a means of collecting labeled training data has now become indispensable to the\nengineering of intelligent systems.\nMost workers in crowdsourcing are not experts. As a consequence, labels obtained from crowd-\nsourcing typically have a significant amount of error [KKKMF11, VdVE11, WLC+10]. Recent\nefforts have focused on developing statistical techniques to post-process the noisy labels in order\nto improve their quality (e.g., [RYZ+10, ZLP+15, KOS11, IPSW14]). However, when the inputs to\nthese algorithms are erroneous, it is difficult to guarantee that the processed labels will be reliable\nenough for subsequent use by machine learning or other applications. 
In order to avoid \u201cgarbage in,\ngarbage out\u201d, we take a complementary approach to this problem: cleaning the data at the time of\ncollection.\nWe consider crowdsourcing settings where the workers are paid for their services, such as in the\npopular crowdsourcing platforms of Amazon Mechanical Turk and others. These commercial plat-\nforms have gained substantial popularity due to their support for a diverse range of tasks for machine\nlearning labeling, varying from image annotation and text recognition to speech captioning and ma-\nchine translation. We consider problems that are objective in nature, that is, have a definite answer.\nFigure 1a depicts an example of such a question where the worker is shown a set of images, and for\neach image, the worker is required to identify if the image depicts the Golden Gate Bridge.\n\n[Figure 1: Different interfaces in a crowdsourcing setup: (a) the conventional interface (\u201cIs this the Golden Gate Bridge?\u201d with options Yes/No), and (b) with an option to skip (Yes/No/I\u2019m not sure).]\n\nOur approach builds on the simple insight that in typical crowdsourcing setups, workers are simply\npaid in proportion to the number of tasks they complete. As a result, workers attempt to answer\nquestions that they are not sure of, thereby increasing the error rate of the labels. For the questions\nthat a worker is not sure of, her answers could be very unreliable [WLC+10, KKKMF11, VdVE11,\nJSV14]. To ensure acquisition of only high-quality labels, we wish to encourage the worker to\nskip the questions about which she is unsure, for instance, by providing an explicit \u201cI\u2019m not sure\u201d\noption for every question (see Figure 1b). Our goal is to develop payment mechanisms to encourage\nthe worker to select this option when she is unsure. 
We will term any payment mechanism that\nincentivizes the worker to do so as \u201cincentive compatible\u201d.\nIn addition to incentive compatibility, preventing spammers is another desirable requirement from\nincentive mechanisms in crowdsourcing. Spammers are workers who answer randomly without\nregard to the question being asked, in the hope of earning some free money, and are known to exist\nin large numbers on crowdsourcing platforms [WLC+10, Boh11, KKKMF11, VdVE11]. It is thus\nof interest to deter spammers by paying them as low as possible. An intuitive objective, to this end,\nis to ensure a zero expenditure on spammers who answer randomly. In this paper, however, we\nimpose a strictly and signi\ufb01cantly weaker condition, and then show that there is one and only one\nincentive-compatible mechanism that can satisfy this weak condition. Our requirement, referred to\nas the \u201cno-free-lunch\u201d axiom, says that if all the questions attempted by the worker are answered\nincorrectly, then the payment must be zero.\nWe propose a payment mechanism for the aforementioned setting (\u201cincentive compatibility\u201d plus\n\u201cno-free-lunch\u201d), and show that surprisingly, this is the only possible mechanism. We also show that\nadditionally, our mechanism makes the smallest possible payment to spammers among all possible\nincentive compatible mechanisms that may or may not satisfy the no-free-lunch axiom. Our payment\nmechanism takes a multiplicative form: the evaluation of the worker\u2019s response to each question is\na certain score, and the \ufb01nal payment is a product of these scores. This mechanism has additional\nappealing features in that it is simple to compute, and is also simple to explain to the workers. 
Our\nmechanism is applicable to any type of objective questions, including multiple choice annotation\nquestions, transcription tasks, etc.\nIn order to test whether our mechanism is practical, and to assess the quality of the final labels\nobtained, we conducted experiments on the Amazon Mechanical Turk crowdsourcing platform. In\nour preliminary experiments that involved over several hundred workers, we found that the quality\nof data improved by two-fold under our unique mechanism, with the total monetary expenditure\nbeing the same or lower as compared to the conventional baseline.\n\n2 Problem Setting\n\nIn the crowdsourcing setting that we consider, one or more workers perform a task, where a task\nconsists of multiple questions. The questions are objective, by which we mean, each question has\nprecisely one correct answer. Examples of objective questions include multiple-choice classification\nquestions such as Figure 1, questions on transcribing text from audio or images, etc.\nFor any possible answer to any question, we define the worker\u2019s confidence about an answer as the\nprobability, according to her belief, of this answer being correct. In other words, one can assume\nthat the worker has (in her mind) a probability distribution over all possible answers to a question,\nand the confidence for an answer is the probability of that answer being correct. As a shorthand, we\nalso define the confidence about a question as the confidence for the answer that the worker is most\nconfident about for that question. We assume that the worker\u2019s confidences for different questions\nare independent. Our goal is that for every question, the worker should be incentivized to:\n\n1. skip if the confidence is below a certain pre-defined threshold, otherwise:\n2. select the answer that she is most confident about.\n\nMore formally, let T \u2208 (0, 1) be a predefined value. 
The goal is to design payment mechanisms that\nincentivize the worker to skip the questions for which her confidence is lower than T, and attempt\nthose for which her confidence is higher than T.1 Moreover, for the questions that she attempts to\nanswer, she must be incentivized to select the answer that she believes is most likely to be correct.\nThe threshold T may be chosen based on various factors of the problem at hand, for example, on\nthe downstream machine learning algorithms using the crowdsourced data, or the knowledge of the\nstatistics of worker abilities, etc. In this paper we assume that the threshold T is given to us.\nLet N denote the total number of questions in the task. Among these, we assume the existence of\nsome \u201cgold standard\u201d questions, that is, a set of questions whose answers are known to the requester.\nLet G (1 \u2264 G \u2264 N) denote the number of gold standard questions. The G gold standard questions\nare assumed to be distributed uniformly at random in the pool of N questions (of course, the worker\ndoes not know which G of the N questions form the gold standard). The payment to a worker for\na task is computed after receiving her responses to all the questions in the task. The payment is\nbased on the worker\u2019s performance on the gold standard questions. Since the payment is based on\nknown answers, the payments to different workers do not depend on each other, thereby allowing us\nto consider the presence of only one worker without any loss in generality.\nWe will employ the following standard notation. For any positive integer K, the set {1, . . . , K} is\ndenoted by [K]. The indicator function is denoted by 1, i.e., 1{z} = 1 if z is true, and 0 otherwise.\nThe notation R+ denotes the set of all non-negative real numbers.\nLet x1, . . . , xG \u2208 {-1, 0, +1} denote the evaluations of the answers that the worker gives to the G\ngold standard questions. 
Here, \u201c0\u201d denotes that the worker skipped the question, \u201c-1\u201d denotes that\nthe worker attempted to answer the question and that answer was incorrect, and \u201c+1\u201d denotes that\nthe worker attempted to answer the question and that answer was correct. Let f : {-1, 0, +1}^G \u2192 R+ denote the payment function, namely, a function that determines the payment to the worker\nbased on these evaluations x1, . . . , xG. Note that the crowdsourcing platforms of today mandate the\npayments to be non-negative. We will let \u00b5 (> 0) denote the budget, i.e., the maximum amount that\ncan be paid to any individual worker for this task:\n\nmax_{x1,...,xG} f(x1, . . . , xG) = \u00b5.\n\nThe amount \u00b5 is thus the amount of compensation paid to a perfect agent for her work. We will\nassume this budget condition of \u00b5 throughout the rest of the paper.\nWe assume that the worker attempts to maximize her overall expected payment. In what follows, the\nexpression \u2018the worker\u2019s expected payment\u2019 will refer to the expected payment from the worker\u2019s\npoint of view, and the expectation will be taken with respect to the worker\u2019s confidences about her\nanswers and the uniformly random choice of the G gold standard questions among the N questions\nin the task. For any question i \u2208 [N], let yi = 1 if the worker attempts question i, and set yi = 0\notherwise. Further, for every question i \u2208 [N] such that yi \u2260 0, let pi be the confidence of the\nworker for the answer she has selected for question i, and for every question i \u2208 [N] such that\nyi = 0, let pi \u2208 (0, 1) be any arbitrary value. Let E = (\u03b51, . . . , \u03b5G) \u2208 {-1, 1}^G. Then from the\nworker\u2019s perspective, the expected payment for the selected answers and confidence-levels is\n\n(1 / binom(N, G)) \u2211_{(j1,...,jG) \u2286 {1,...,N}} \u2211_{E \u2208 {-1,1}^G} f(\u03b51 y_{j1}, . . . , \u03b5G y_{jG}) \u220f_{i=1}^{G} (p_{ji})^{(1+\u03b5i)/2} (1 - p_{ji})^{(1-\u03b5i)/2}.\n\nIn the expression above, the outermost summation corresponds to the expectation with respect to the\nrandomness arising from the unknown choice of the gold standard questions. The inner summation\ncorresponds to the expectation with respect to the worker\u2019s beliefs about the correctness of her\nresponses.\n\n1In the event that the confidence about a question is exactly equal to T, the worker may be equally incen-\ntivized to answer or skip.\n\nWe will call any payment function f an incentive-compatible mechanism if the expected payment\nof the worker under this payment function is strictly maximized when the worker responds in the\nmanner desired.2\n\n3 Main results: Incentive-compatible mechanism and guarantees\n\nIn this section, we present the main results of the paper, namely, the design of incentive-compatible\nmechanisms with practically useful properties. To this end, we impose the following natural re-\nquirement on the payment function f that is motivated by the practical considerations of budget\nconstraints and discouraging spammers and miscreants [Boh11, KKKMF11, VdVE11, WLC+10].\nWe term this requirement the \u201cno-free-lunch axiom\u201d:\nAxiom 1 (No-free-lunch axiom). If all the answers attempted by the worker in the gold standard are\nwrong, then the payment is zero. More formally, for every set of evaluations (x1, . . . , xG) that satisfies\n0 < \u2211_{i=1}^{G} 1{xi \u2260 0} = \u2211_{i=1}^{G} 1{xi = -1}, we require the payment to satisfy f(x1, . . . , xG) = 0.\nObserve that no-free-lunch is an extremely mild requirement. In fact, it is significantly weaker than\nimposing a zero payment on workers who answer randomly. For instance, if the questions are of\nbinary-choice format, then randomly choosing among the two options for each question would result\nin 50% of the answers being correct in expectation, while the no-free-lunch axiom is applicable only\nwhen none of them turns out to be correct.\n\n3.1 Proposed \u201cMultiplicative\u201d Mechanism\n\nWe now present our proposed payment mechanism in Algorithm 1.\n\nAlgorithm 1 \u201cMultiplicative\u201d incentive-compatible mechanism\n\u2022 Inputs: Threshold T, Budget \u00b5, Evaluations (x1, . . . , xG) \u2208 {-1, 0, +1}^G of the worker\u2019s an-\nswers to the G gold standard questions\n\u2022 Let C = \u2211_{i=1}^{G} 1{xi = +1} and W = \u2211_{i=1}^{G} 1{xi = -1}\n\u2022 The payment is\n\nf(x1, . . . , xG) = \u00b5 T^{G-C} 1{W = 0}.\n\nThe proposed mechanism has a multiplicative form: each answer in the gold standard is given a\nscore based on whether it was correct (score = 1/T), incorrect (score = 0) or skipped (score = 1),\nand the final payment is simply a product of these scores (scaled by \u00b5 T^G). The mechanism is easy to\ndescribe to workers: For instance, if T = 1/2, G = 3 and \u00b5 = 80 cents, then the description reads:\n\n\u201cThe reward starts at 10 cents. For every correct answer in the 3 gold standard questions,\nthe reward will double. However, if any of these questions are answered incorrectly, then\nthe reward will become zero. So please use the \u2018I\u2019m not sure\u2019 option wisely.\u201d\n\nObserve how this payment rule is similar to the popular \u2018double or nothing\u2019 paradigm [Dou14].\nThe algorithm makes a zero payment if one or more attempted answers in the gold standard are\nwrong. Note that this property is significantly stronger than the property of no-free-lunch which\nwe originally required, where we wanted a zero payment only when all attempted answers were\nwrong. 
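The payment rule of Algorithm 1 can be sketched in a few lines of Python (a minimal illustration; the function and variable names are ours, the rule itself is as stated above):

```python
def multiplicative_payment(evals, T, budget):
    """Payment of the multiplicative mechanism: evals[i] in {-1, 0, +1} is the
    evaluation of the worker's answer to the i-th gold standard question
    (-1 = wrong, 0 = skipped, +1 = correct)."""
    G = len(evals)
    C = sum(1 for x in evals if x == +1)  # number of correct answers
    W = sum(1 for x in evals if x == -1)  # number of wrong answers
    # mu * T^(G - C) if no attempted answer is wrong, and zero otherwise
    return budget * T ** (G - C) if W == 0 else 0.0

# "Double or nothing" example from the text: T = 1/2, G = 3, mu = 80 cents.
# The reward starts at mu * T^G = 10 cents and doubles per correct answer.
print(multiplicative_payment([+1, +1, +1], 0.5, 80))  # all correct: 80.0
print(multiplicative_payment([+1, 0, +1], 0.5, 80))   # one skip: 40.0
print(multiplicative_payment([+1, -1, +1], 0.5, 80))  # one wrong: 0.0
print(multiplicative_payment([0, 0, 0], 0.5, 80))     # all skipped: 10.0
```

Note how a single wrong attempted answer zeroes the payment, which is exactly the strengthening of no-free-lunch discussed here.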
Surprisingly, as we prove shortly, Algorithm 1 is the only incentive-compatible mechanism\nthat satisfies no-free-lunch.\nThe following theorem shows that the proposed payment mechanism indeed incentivizes a worker\nto skip the questions for which her confidence is below T, while answering those for which her\nconfidence is greater than T. In the latter case, the worker is incentivized to select the answer which\nshe thinks is most likely to be correct.\nTheorem 1. The payment mechanism of Algorithm 1 is incentive-compatible and satisfies the no-\nfree-lunch condition.\n\n2Such a payment function that is based on gold standard questions is also called a \u201cstrictly proper scoring\nrule\u201d [GR07].\n\nThe proof of Theorem 1 is presented in Appendix A. It is easy to see that the mechanism satisfies no-\nfree-lunch. The proof of incentive compatibility is also not hard: We consider any arbitrary worker\n(with arbitrary belief distributions), and compute the expected payment for that worker for the case\nwhen her choices in the task follow the requirements. We then show that any other choice leads to a\nstrictly smaller expected payment.\nWhile we started out with a very weak condition of no-free-lunch of making a zero payment when\nall attempted answers are wrong, the mechanism proposed in Algorithm 1 is significantly more\nstrict and makes a zero payment when any of the attempted answers is wrong. A natural question\nthat arises is: can we design an alternative mechanism satisfying incentive compatibility and no-\nfree-lunch that operates somewhere in between?\n\n3.2 Uniqueness of the Mechanism\n\nIn the previous section we showed that our proposed multiplicative mechanism is incentive compat-\nible and satisfies the intuitive requirement of no-free-lunch. It turns out, perhaps surprisingly, that\nthis mechanism is unique in this respect.\nTheorem 2. 
The payment mechanism of Algorithm 1 is the only incentive-compatible mechanism\nthat satisfies the no-free-lunch condition.\n\nTheorem 2 gives a strong result despite imposing very weak requirements. To see this, recall our ear-\nlier discussion on deterring spammers, that is, incurring a low expenditure on workers who answer\nrandomly. For instance, when the task comprises binary-choice questions, one may wish to design\nmechanisms which make a zero payment when the responses to 50% or more of the questions in the\ngold standard are incorrect. The no-free-lunch axiom is a much weaker requirement, and the only\nmechanism that can satisfy this requirement is the mechanism of Algorithm 1.\nThe proof of Theorem 2 is available in Appendix B. The proof relies on the following key lemma\nthat establishes a condition that any incentive-compatible mechanism must necessarily satisfy. The\nlemma applies to any incentive-compatible mechanism and not just to those satisfying no-free-lunch.\nLemma. Any incentive-compatible payment mechanism f must satisfy, for every i \u2208 {1, . . . , G}\nand every (y1, . . . , y_{i-1}, y_{i+1}, . . . , yG) \u2208 {-1, 0, 1}^{G-1},\n\nT f(y1, . . . , y_{i-1}, +1, y_{i+1}, . . . , yG) + (1 - T) f(y1, . . . , y_{i-1}, -1, y_{i+1}, . . . , yG)\n= f(y1, . . . , y_{i-1}, 0, y_{i+1}, . . . , yG).\n\nThe proof of this lemma is provided in Appendix C. Given this lemma, the proof of Theorem 2 is\nthen completed via an induction on the number of skipped questions.\n\n3.3 Optimality against Spamming Behavior\n\nAs discussed earlier, crowdsourcing tasks, especially those with multiple choice questions, often\nencounter spammers who answer randomly without heed to the question being asked. For instance,\nunder a binary-choice setup, a spammer will choose one of the two options uniformly at random for\nevery question. A highly desirable objective in crowdsourcing settings is to deter spammers. 
To this\nend, one may wish to impose a condition of zero payment when the responses to 50% or more of\nthe attempted questions in the gold standard are incorrect. A second desirable metric could be to\nminimize the expenditure on a worker who simply skips all questions. While the aforementioned\nrequirements were deterministic functions of the worker\u2019s responses, one may alternatively wish to\nimpose requirements that depend on the distribution of the worker\u2019s answering process. For instance,\na third desirable feature would be to minimize the expected payment to a worker who answers all\nquestions uniformly at random. We now show that interestingly, our unique multiplicative payment\nmechanism simultaneously satisfies all these requirements. The result is stated assuming a multiple-\nchoice setup, but extends trivially to non-multiple-choice settings.\nTheorem 3.A (Distributional). Consider any value A \u2208 {0, . . . , G}. Among all incentive-\ncompatible mechanisms (that may or may not satisfy no-free-lunch), Algorithm 1 strictly minimizes\nthe expenditure on a worker who skips some A of the questions in the gold standard, and chooses\nanswers to the remaining (G - A) questions uniformly at random.\n\nTheorem 3.B (Deterministic). Consider any value B \u2208 (0, 1]. Among all incentive-compatible\nmechanisms (that may or may not satisfy no-free-lunch), Algorithm 1 strictly minimizes the expen-\nditure on a worker who gives incorrect answers to a fraction B or more of the questions attempted\nin the gold standard.\n\nThe proof of Theorem 3 is presented in Appendix D. We see from this result that the multiplicative\npayment mechanism of Algorithm 1 thus possesses very useful properties geared to deter spammers,\nwhile ensuring that a good worker will be paid a high enough amount.\nTo illustrate this point, let us compare the mechanism of Algorithm 1 with the popular additive class\nof payment mechanisms.\nExample 1. 
Consider the popular class of \u201cadditive\u201d mechanisms, where the payments to a worker\nare added across the gold standard questions. This additive payment mechanism offers a reward of\n\u00b5/G for every correct answer in the gold standard, \u00b5T/G for every question skipped, and 0 for every\nincorrect answer. Importantly, the final payment to the worker is the sum of the rewards across the\nG gold standard questions. One can verify that this additive mechanism is incentive compatible.\nOne can also see that, as guaranteed by our theory, this additive payment mechanism does not\nsatisfy the no-free-lunch axiom.\nSuppose each question involves choosing from two options. Let us compute the expenditure that\nthese two mechanisms make under a spamming behavior of choosing the answer randomly to each\nquestion. Given the 50% likelihood of each question being correct, one can compute that the additive\nmechanism makes a payment of \u00b5/2 in expectation. On the other hand, our mechanism pays an\nexpected amount of only \u00b5 \u00b7 2^{-G}. The payment to spammers thus reduces exponentially with the\nnumber of gold standard questions under our mechanism, whereas it does not reduce at all in the\nadditive mechanism.\nNow, consider a different means of exploiting the mechanism(s) where the worker simply skips all\nquestions. To this end, observe that if a worker skips all the questions then the additive payment\nmechanism will incur an expenditure of \u00b5T. On the other hand, the proposed payment mechanism\nof Algorithm 1 pays an exponentially smaller amount of \u00b5 T^G (recall that T < 1).\n\n4 Simulations and Experiments\n\nIn this section, we present synthetic simulations and real-world experiments to evaluate the effects\nof our setting and our mechanism on the final label quality.\n\n4.1 Synthetic Simulations\n\nWe employ synthetic simulations to understand the effects of various kinds of labeling errors in\ncrowdsourcing. 
We consider binary-choice questions in this set of simulations. Whenever a worker\nanswers a question, her confidence for the correct answer is drawn from a distribution P independent\nof all else. We investigate the effects of the following five choices of the distribution P:\n\n\u2022 The uniform distribution on the support [0.5, 1].\n\u2022 A triangular distribution with lower end-point 0.2, upper end-point 1 and a mode of 0.6.\n\u2022 A beta distribution with parameter values \u03b1 = 5 and \u03b2 = 1.\n\u2022 The hammer-spammer distribution [KOS11], that is, uniform on the discrete set {0.5, 1}.\n\u2022 A truncated Gaussian distribution: a truncation of N(0.75, 0.5) to the interval [0, 1].\n\nWhen a worker has a confidence p (drawn from the distribution P) and attempts the question, the\nprobability of making an error equals (1 - p).\nWe compare (a) the setting where workers attempt every question, with (b) the setting where workers\nskip questions for which their confidence is below a certain threshold T. In this set of simulations,\nwe set T = 0.75. In either setting, we aggregate the labels obtained from the workers for each\nquestion via a majority vote on the two classes. Ties are broken by choosing one of the two options\nuniformly at random.\n\n[Figure 2: Error under different interfaces for synthetic simulations of five distributions of the workers\u2019 error probabilities.]\n\nFigure 2 depicts the results from these simulations. Each bar represents the fraction of questions that\nare labeled incorrectly, and is an average across 50,000 trials. (The standard error of the mean is too\nsmall to be visible.) We see that the skip-based setting consistently outperforms the conventional\nsetting, and the gains obtained are moderate to high depending on the underlying distribution of the\nworkers\u2019 errors. 
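This comparison can be reproduced with a small Monte Carlo sketch; the version below assumes only the uniform confidence distribution on [0.5, 1] (the simulations here cover four further distributions), and all names are ours:

```python
import random

def simulate(num_questions=20000, num_workers=11, T=0.75, skip_based=True):
    """Fraction of binary questions mislabeled by a majority vote.
    Each worker's confidence p is drawn uniformly from [0.5, 1]; an
    attempting worker errs with probability (1 - p). In the skip-based
    setting she answers only when p >= T; ties are broken at random."""
    rng = random.Random(0)
    errors = 0
    for _ in range(num_questions):
        votes = 0  # +1 for the correct label, -1 for the incorrect one
        for _ in range(num_workers):
            p = rng.uniform(0.5, 1.0)
            if skip_based and p < T:
                continue  # worker skips this question
            votes += 1 if rng.random() < p else -1
        if votes < 0 or (votes == 0 and rng.random() < 0.5):
            errors += 1
    return errors / num_questions

conventional = simulate(skip_based=False)
skipping = simulate(skip_based=True)
print(conventional, skipping)  # the skip-based error rate is lower
```

Under this confidence model the skip-based setting filters out the low-confidence votes, so its majority-vote error rate comes out clearly below the conventional one, consistent with the trend in Figure 2.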
In particular, the gains are quite striking under the hammer-spammer model: this\nresult is not surprising since the mechanism (ideally) screens the spammers out and leaves only the\nhammers who answer perfectly.\n\n4.2 Experiments on Amazon Mechanical Turk\n\nWe conducted preliminary experiments on the Amazon Mechanical Turk commercial crowdsourcing\nplatform (mturk.com) to evaluate our proposed scheme in real-world scenarios. The complete\ndata, including the interface presented to the workers in each of the tasks, the results obtained from\nthe workers, and the ground truth solutions, are available on the website of the \ufb01rst author.\nGoal. Before delving into details, we \ufb01rst note certain caveats relating to such a study of mech-\nanism design on crowdsourcing platforms. When a worker encounters a mechanism for only a\nsmall amount of time (a handful of tasks in typical research experiments) and for a small amount of\nmoney (at most a few dollars in typical crowdsourcing tasks), we cannot expect the worker to com-\npletely understand the mechanism and act precisely as required. For instance, we wouldn\u2019t expect\nour experimental results to change signi\ufb01cantly even upon moderate modi\ufb01cations in the promised\namounts, and furthermore, we do expect the outcomes to be noisy. Incentive compatibility kicks\nin when the worker encounters a mechanism across a longer term, for example, when a proposed\nmechanism is adopted as a standard for a platform, or when higher amounts are involved. This is\nwhen we would expect workers or others (e.g., bloggers or researchers) to design strategies that can\ngame the mechanism. The theoretical guarantee of incentive compatibility or strict properness then\nprevents such gaming in the long run.\nWe thus regard these experiments as preliminary. 
Our\nintentions towards this experimental exercise\nwere (a) to evaluate the potential of our algorithms to work in practice, and (b) to investigate the\neffect of the proposed algorithms on the net error in the collected labelled data.\nExperimental setup. We conducted the following five experiments (\u201ctasks\u201d) on Amazon Mechan-\nical Turk: (a) identifying the Golden Gate Bridge from pictures, (b) identifying the breeds of dogs\nfrom pictures, (c) identifying heads of countries, (d) identifying continents to which flags belong,\nand (e) identifying the textures in displayed images. Each of these tasks comprised 20 to 126 multi-\nple choice questions.3 For each experiment, we compared (i) a baseline setting (Figure 1a) with an\nadditive payment mechanism that pays a fixed amount per correct answer, and (ii) our skip-based\nsetting (Figure 1b) with the multiplicative mechanism of Algorithm 1. For each experiment, and for\neach of the two settings, we had 35 workers independently perform the task.\nUpon completion of the tasks on Amazon Mechanical Turk, we aggregated the data in the following\nmanner. For each mechanism in each experiment, we subsampled 3, 5, 7, 9 and 11 workers, and\ntook a majority vote of their responses. We averaged the accuracy across all questions and across\n1,000 iterations of this subsample-and-aggregate procedure.\n\n[Figure 3: Error under different interfaces and mechanisms for five experiments conducted on Mechanical Turk.]\n\nResults. Figure 3 reports the error in the aggregate data in the five experiments. We see that in\nmost cases, our skip-based setting results in higher quality data, and in many of the instances, the\nreduction is two-fold or higher. All in all, in the experiments, we observed a substantial reduction in\nthe amount of error in the labelled data while expending the same or lower amounts and receiving\nno negative comments from the workers. 
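The subsample-and-aggregate procedure described above can be sketched as follows; the response matrix below is a toy synthetic example standing in for the 35 workers\u2019 answers, and all names are ours:

```python
import random

def subsample_and_aggregate(responses, truth, k, iters=1000, seed=0):
    """Average accuracy of a majority vote over random k-worker subsamples.
    responses[w][q] is worker w's answer to question q (None = skipped);
    truth[q] is the gold answer. Ties are broken uniformly at random."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(iters):
        subset = rng.sample(responses, k)
        correct = 0
        for q, gold in enumerate(truth):
            counts = {}
            for worker in subset:
                ans = worker[q]
                if ans is not None:  # skipped answers cast no vote
                    counts[ans] = counts.get(ans, 0) + 1
            if counts:
                best = max(counts.values())
                winners = [a for a, c in counts.items() if c == best]
                if rng.choice(winners) == gold:
                    correct += 1
        total += correct / len(truth)
    return total / iters

# Toy illustration with 5 synthetic workers and 4 questions:
truth = ["A", "B", "A", "B"]
responses = [["A", "B", "A", "B"], ["A", "B", None, "B"],
             ["B", "B", "A", "A"], ["A", None, "A", "B"], ["A", "B", "B", "B"]]
acc = subsample_and_aggregate(responses, truth, k=3)
print(round(acc, 3))
```

In the experiments this routine would be run for k = 3, 5, 7, 9 and 11 on each mechanism\u2019s response matrix, and one minus the returned accuracy gives the error plotted in Figure 3.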
These observations suggest that our proposed skip-based\nsetting coupled with our multiplicative payment mechanisms has potential to work in practice; the\nunderlying fundamental theory ensures that the system cannot be gamed in the long run.\n\n5 Discussion and Conclusions\n\nIn an extended version of this paper [SZ14], we generalize the \u201cskip-based\u201d setting considered here\nto one where we also elicit the workers\u2019 confidence about their answers. Moreover, in a companion\npaper [SZP15], we construct mechanisms to elicit the support of workers\u2019 beliefs.\nOur mechanism offers some additional benefits. The pattern of skips of the workers provides a rea-\nsonable estimate of the difficulty of each question. In practice, the questions that are estimated to\nbe more difficult may now be delegated to an expert or to additional non-expert workers. Secondly,\nthe theoretical guarantees of our mechanism may allow for better post-processing of the data, in-\ncorporating the confidence information and improving the overall accuracy. Developing statistical\naggregation algorithms or augmenting existing ones (e.g., [RYZ+10, KOS11, LPI12, ZLP+15]) for\nthis purpose is a useful direction of research. Thirdly, the simplicity of our mechanisms may fa-\ncilitate an easier adoption among the workers. In conclusion, given the uniqueness and optimality\nin theory, simplicity, and good performance observed in practice, we envisage our multiplicative\npayment mechanisms to be of interest to practitioners as well as researchers who employ crowd-\nsourcing.\n\n3See the extended version of this paper [SZ14] for additional experiments involving free-form responses,\nsuch as text transcription.\n\nReferences\n[Boh11] John Bohannon. Social science for pennies. Science, 334(6054):307\u2013307, 2011.\n[CBW+10] Andrew Carlson, Justin Betteridge, Richard C Wang, Estevam R Hruschka Jr, and Tom M Mitchell. Coupled semi-supervised learning for information extraction. In ACM WSDM, pages 101\u2013110, 2010.\n[DDS+09] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248\u2013255, 2009.\n[Dou14] Double or Nothing. http://wikipedia.org/wiki/Double_or_nothing, 2014. Last accessed: July 31, 2014.\n[GR07] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359\u2013378, 2007.\n[HDY+12] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82\u201397, 2012.\n[IPSW14] Panagiotis G Ipeirotis, Foster Provost, Victor S Sheng, and Jing Wang. Repeated labeling using multiple noisy labelers. Data Mining and Knowledge Discovery, 28(2):402\u2013441, 2014.\n[JSV14] Srikanth Jagabathula, Lakshminarayanan Subramanian, and Ashwin Venkataraman. Reputation-based worker filtering in crowdsourcing. In Advances in Neural Information Processing Systems 27, pages 2492\u20132500, 2014.\n[KKKMF11] Gabriella Kazai, Jaap Kamps, Marijn Koolen, and Natasa Milic-Frayling. Crowdsourcing for book search evaluation: impact of HIT design on comparative system ranking. In ACM SIGIR, pages 205\u2013214, 2011.\n[KOS11] David R Karger, Sewoong Oh, and Devavrat Shah. Iterative learning for reliable crowdsourcing systems. In Advances in Neural Information Processing Systems, pages 1953\u20131961, 2011.\n[LPI12] Qiang Liu, Jian Peng, and Alexander T Ihler. Variational inference for crowdsourcing. In NIPS, pages 701\u2013709, 2012.\n[RYZ+10] Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. The Journal of Machine Learning Research, 11:1297\u20131322, 2010.\n[SZ14] Nihar B Shah and Dengyong Zhou. Double or nothing: Multiplicative incentive mechanisms for crowdsourcing. arXiv:1408.1387, 2014.\n[SZP15] Nihar B Shah, Dengyong Zhou, and Yuval Peres. Approval voting and incentives in crowdsourcing. In International Conference on Machine Learning (ICML), 2015.\n[VdVE11] Jeroen Vuurens, Arjen P de Vries, and Carsten Eickhoff. How much spam can you take? An analysis of crowdsourcing results to increase accuracy. In ACM SIGIR Workshop on Crowdsourcing for Information Retrieval, pages 21\u201326, 2011.\n[WLC+10] Paul Wais, Shivaram Lingamneni, Duncan Cook, Jason Fennell, Benjamin Goldenberg, Daniel Lubarov, David Marin, and Hari Simons. Towards building a high-quality workforce with Mechanical Turk. NIPS workshop on computational social science and the wisdom of crowds, 2010.\n[ZLP+15] Dengyong Zhou, Qiang Liu, John C Platt, Christopher Meek, and Nihar B Shah. Regularized minimax conditional entropy for crowdsourcing. arXiv preprint arXiv:1503.07240, 2015.\n", "award": [], "sourceid": 2, "authors": [{"given_name": "Nihar Bhadresh", "family_name": "Shah", "institution": "UC Berkeley"}, {"given_name": "Dengyong", "family_name": "Zhou", "institution": "MSR"}]}