{"title": "Variational Inference for Crowdsourcing", "book": "Advances in Neural Information Processing Systems", "page_first": 692, "page_last": 700, "abstract": "Crowdsourcing has become a popular paradigm for labeling large datasets. However, it has given rise to the computational task of aggregating the crowdsourced labels provided by a collection of unreliable annotators. We approach this problem by transforming it into a standard inference problem in graphical models, and applying approximate variational methods, including belief propagation (BP) and mean field (MF). We show that our BP algorithm generalizes both majority voting and a recent algorithm by Karger et al, while our MF method is closely related to a commonly used EM algorithm. In both cases, we find that the performance of the algorithms critically depends on the choice of a prior distribution on the workers' reliability; by choosing the prior properly, both BP and MF (and EM) perform surprisingly well on both simulated and real-world datasets, competitive with state-of-the-art algorithms based on more complicated modeling assumptions.", "full_text": "Variational Inference for Crowdsourcing\n\nQiang Liu\n\nICS, UC Irvine\n\nqliu1@ics.uci.edu\n\nJian Peng\n\nTTI-C & CSAIL, MIT\n\njpeng@csail.mit.edu\n\nAlexander Ihler\nICS, UC Irvine\n\nihler@ics.uci.edu\n\nAbstract\n\nCrowdsourcing has become a popular paradigm for labeling large datasets. How-\never, it has given rise to the computational task of aggregating the crowdsourced\nlabels provided by a collection of unreliable annotators. We approach this prob-\nlem by transforming it into a standard inference problem in graphical models,\nand applying approximate variational methods, including belief propagation (BP)\nand mean \ufb01eld (MF). We show that our BP algorithm generalizes both major-\nity voting and a recent algorithm by Karger et al. [1], while our MF method is\nclosely related to a commonly used EM algorithm. 
In both cases, we find that the performance of the algorithms critically depends on the choice of a prior distribution on the workers\u2019 reliability; by choosing the prior properly, both BP and MF (and EM) perform surprisingly well on both simulated and real-world datasets, competitive with state-of-the-art algorithms based on more complicated modeling assumptions.\n\n1 Introduction\n\nCrowdsourcing has become an efficient and inexpensive way to label large datasets in many application domains, including computer vision and natural language processing. Resources such as Amazon Mechanical Turk provide markets where requestors can post tasks known as HITs (Human Intelligence Tasks) and collect large numbers of labels from hundreds of online workers (or annotators) in a short time and at relatively low cost.\nA major problem of crowdsourcing is that the quality of the labels is often unreliable and diverse, mainly because it is difficult to monitor the performance of a large collection of workers. In the extreme, there may exist \u201cspammers\u201d, who submit random answers rather than good-faith attempts to label, or even \u201cadversaries\u201d, who may deliberately give wrong answers, either due to malice or to a misinterpretation of the task. A common strategy to improve reliability is to add redundancy, such as assigning each task to multiple workers, and aggregate the workers\u2019 labels. The baseline majority voting heuristic, which simply assigns the label returned by the majority of the workers, is known to be error-prone, because it counts all the annotators equally. In general, efficient aggregation methods should take into account the differences in the workers\u2019 labeling abilities.\nA principled way to address this problem is to build generative probabilistic models for the annotation processes, and assign labels using standard inference tools. 
A line of early work builds simple models characterizing the annotators using confusion matrices, and infers the labels using the EM algorithm [e.g., 2, 3, 4]. Recently, however, significant efforts have been made to improve performance by incorporating more complicated generative models [e.g., 5, 6, 7, 8, 9]. However, EM is widely criticized for having local optimality issues [e.g., 1]; this raises a potential tradeoff between more dedicated exploitation of the simpler models, either by introducing new inference tools or fixing local optimality issues in EM, and the exploration of a larger model space, usually with increased computational cost and possibly the risk of over-fitting.\nOn the other hand, variational approaches, including the popular belief propagation (BP) and mean field (MF) methods, provide powerful inference tools for probabilistic graphical models [10, 11].\nThese algorithms are efficient, and often have provably strong local optimality properties or even globally optimal guarantees [e.g., 12]. To our knowledge, no previous attempts have taken advantage of variational tools for the crowdsourcing problem. A closely related approach is a message-passing-style algorithm in Karger et al. [1] (referred to as KOS in the sequel), which the authors asserted to be motivated by, but not equivalent to, standard belief propagation. KOS was shown to have strong theoretical guarantees on (locally tree-like) random assignment graphs, but does not have an obvious interpretation as a standard inference method on a generative probabilistic model. As one consequence, the lack of a generative model interpretation makes it difficult to either extend KOS to more complicated models or adapt it to improve its performance on real-world datasets.\nContribution. In this work, we approach the crowdsourcing problem using tools and concepts from variational inference methods for graphical models. 
First, we present a belief-propagation-based method, which we show includes both KOS and majority voting as special cases, in which particular prior distributions are assumed on the workers\u2019 abilities. However, unlike KOS, our method is derived using generative principles, and can be easily extended to more complicated models. On the other side, we propose a mean field method which we show closely connects to, and provides an important perspective on, EM. For both our BP and MF algorithms (and consequently for EM as well), we show that performance can be significantly improved by using more carefully chosen priors. We test our algorithms on both simulated and real-world datasets, and show that both BP and MF (or EM), with carefully chosen priors, are able to perform competitively with state-of-the-art algorithms that are based on far more complicated models.\n\n2 Background\nAssume there are M workers and N tasks with binary labels {\u00b11}. Denote by zi \u2208 {\u00b11}, i \u2208 [N], the true label of task i, where [N] represents the set of the first N integers; Nj is the set of tasks labeled by worker j, and Mi the set of workers labeling task i. The task assignment scheme can be represented by a bipartite graph where an edge (i, j) denotes that task i is labeled by worker j. The labeling results form a matrix L \u2208 {0, \u00b11}^{N\u00d7M}, where Lij \u2208 {\u00b11} denotes the answer if worker j labels task i, and Lij = 0 otherwise. The goal is to find an optimal estimator \u02c6z of the true labels z given the observation L, minimizing the average bit-wise error rate (1/N) \u2211_{i\u2208[N]} prob[\u02c6zi \u2260 zi].\nWe assume that all the tasks have the same level of difficulty, but that workers may have different predictive abilities. Following Karger et al. 
[1], we initially assume that the ability of worker j is measured by a single parameter qj, which corresponds to their probability of correctness: qj = prob[Lij = zi]. More generally, the workers\u2019 abilities can be measured by a confusion matrix, to which our method can be easily extended (see Section 3.1.2).\nThe values of qj reflect the abilities of the workers: qj \u2248 1 corresponds to experts that provide reliable answers; qj \u2248 1/2 corresponds to spammers that give random labels independent of the questions; and qj < 1/2 corresponds to adversaries that tend to provide opposite answers. Conceptually, the spammers and adversaries should be treated differently: the spammers provide no useful information and only degrade the results, while the adversaries actually carry useful information, and can be exploited to improve the results if the algorithm can identify them and flip their labels. We assume the qj of all workers are drawn independently from a common prior p(qj|\u03b8), where \u03b8 are the hyper-parameters. To avoid the cases when adversaries and/or spammers overwhelm the system, it is reasonable to require that E[qj|\u03b8] > 1/2. Typical priors include the Beta prior p(qj|\u03b8) \u221d qj^{\u03b1\u22121}(1 \u2212 qj)^{\u03b2\u22121} and discrete priors, e.g., the spammer-hammer model, where qj \u2248 0.5 or qj \u2248 1 with equal probability.\nMajority Voting. The majority voting (MV) method aggregates the workers\u2019 labels by\n\n\u02c6zi^{majority} = sign[\u2211_{j\u2208Mi} Lij].\n\nThe limitation of MV is that it weights all the workers equally, and performs poorly when the qualities of the workers are diverse, especially when adversarial workers exist.\nExpectation Maximization. Weighting the workers properly requires estimating their abilities qj, usually via a maximum a posteriori estimator, \u02c6q = arg max_q log p(q|L, \u03b8), where p(q|L, \u03b8) \u221d \u2211_z p(q, z|L, \u03b8). This is commonly solved using an EM algorithm treating the z as hidden variables [e.g., 2, 3, 4]. Assuming a Beta(\u03b1, \u03b2) prior on qj, EM is formulated as\n\nE-step: \u00b5i(zi) \u221d \u220f_{j\u2208Mi} \u02c6qj^{\u03b4ij}(1 \u2212 \u02c6qj)^{1\u2212\u03b4ij},  M-step: \u02c6qj = (\u2211_{i\u2208Nj} \u00b5i(Lij) + \u03b1 \u2212 1) / (|Nj| + \u03b1 + \u03b2 \u2212 2),  (1)\n\nwhere \u03b4ij = I[Lij = zi]; the \u02c6zi are then estimated via \u02c6zi = arg max_{zi} \u00b5i(zi). Many approaches have been proposed to improve this simple EM approach, mainly by building more complicated models.\nMessage Passing. A rather different algorithm in a message-passing style is proposed by Karger, Oh and Shah [1] (referred to as KOS in the sequel). Let xi\u2192j and yj\u2192i be real-valued messages from tasks to workers and from workers to tasks, respectively. Initializing y^0_{j\u2192i} randomly from Normal(1, 1) or deterministically by y^0_{j\u2192i} = 1, KOS updates the messages at the t-th iteration via\n\nx^{t+1}_{i\u2192j} = \u2211_{j'\u2208Mi\\j} Lij' y^t_{j'\u2192i},  y^{t+1}_{j\u2192i} = \u2211_{i'\u2208Nj\\i} Li'j x^{t+1}_{i'\u2192j},  (2)\n\nand the labels are estimated via \u02c6z^t_i = sign[\u02c6x^t_i], where \u02c6x^t_i = \u2211_{j\u2208Mi} Lij y^t_{j\u2192i}. Note that the 0-th iteration of KOS reduces to majority voting when initialized with y^0_{j\u2192i} = 1. KOS has surprisingly nice theoretical properties on locally tree-like assignment graphs: its error rate is shown to scale in the same manner as an oracle lower bound that assumes the true qj are known. 
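The KOS update (2) is only a few lines of linear algebra. Below is a minimal NumPy sketch (function and variable names are ours, not from the paper), using the deterministic initialization y^0_{j\u2192i} = 1 so that the 0-th iteration coincides with majority voting:

```python
import numpy as np

def kos(L, n_iter=10):
    """Sketch of the KOS message-passing update (2).

    L is an N x M matrix with entries in {0, +1, -1}; L[i, j] = 0 means
    worker j did not label task i. Messages x[i, j] (task -> worker) and
    y[j, i] (worker -> task) live on the edges of the assignment graph.
    """
    N, M = L.shape
    y = (L != 0).astype(float).T                 # y[j, i], initialized to 1 on each edge
    for _ in range(n_iter):
        # x_{i->j} = sum_{j' != j} L_{ij'} y_{j'->i}
        total_x = (L * y.T).sum(axis=1, keepdims=True)   # sum over all workers of task i
        x = (total_x - L * y.T) * (L != 0)               # exclude worker j itself
        # y_{j->i} = sum_{i' != i} L_{i'j} x_{i'->j}
        total_y = (L * x).sum(axis=0, keepdims=True)     # sum over all tasks of worker j
        y = ((total_y - L * x) * (L != 0)).T
    # decode: z_i = sign( sum_j L_{ij} y_{j->i} )
    return np.sign((L * y.T).sum(axis=1))
```

Because the updates are linear in the messages, their magnitudes grow with iteration; only the signs of the decoded quantities matter.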
Unfortunately, KOS is not derived using a generative model approach under either Bayesian or maximum likelihood principles, and hence is difficult to extend to more general cases, such as more sophisticated worker-error models (Section 3.1.2) or other features and side information (see appendix). Given that the assumptions made in Karger et al. [1] are restrictive in practice, it is unclear whether the theoretical performance guarantees of KOS hold on real-world datasets. Additionally, an interesting phase transition phenomenon was observed in Karger et al. [1] \u2013 the performance of KOS was shown to degenerate, sometimes performing even worse than majority voting, when the degrees of the assignment graph (corresponding to the number of annotators per task) are small.\n\n3 Crowdsourcing as Inference in a Graphical Model\n\nWe present our main framework in this section, transforming the label aggregation problem into a standard inference problem on a graphical model, and proposing a set of efficient variational methods, including a belief propagation method that includes KOS and majority voting as special cases, and a mean field method, which connects closely to the commonly used EM approach.\nTo start, the joint posterior distribution of the workers\u2019 abilities q = {qj : j \u2208 [M]} and the true labels z = {zi : i \u2208 [N]}, conditional on the observed labels L and hyper-parameter \u03b8, is\n\np(z, q|L, \u03b8) \u221d \u220f_{j\u2208[M]} p(qj|\u03b8) \u220f_{i\u2208Nj} p(Lij|zi, qj) = \u220f_{j\u2208[M]} p(qj|\u03b8) qj^{cj}(1 \u2212 qj)^{\u03b3j\u2212cj},\n\nwhere \u03b3j = |Nj| is the number of predictions made by worker j and cj := \u2211_{i\u2208Nj} I[Lij = zi] is the number of j\u2019s predictions that are correct. By standard Bayesian arguments, one can show that the optimal estimator of z to minimize the bit-wise error rate is given by\n\n\u02c6zi = arg max_{zi} p(zi|L, \u03b8), where p(zi|L, \u03b8) = \u2211_{z[N]\\i} \u222b_q p(z, q|L, \u03b8) dq.  (3)\n\nNote that the EM algorithm (1), which maximizes rather than marginalizes qj, is not equivalent to the Bayesian estimator (3), and hence is expected to be suboptimal in terms of error rate. However, calculating the marginal p(zi|L, \u03b8) in (3) requires integrating over all q and summing over all the other zi, a challenging computational task. In this work we use belief propagation and mean field to address this problem, and highlight their connections to KOS, majority voting and EM.\n\n3.1 Belief Propagation, KOS and Majority Voting\nIt is difficult to directly apply belief propagation to the joint distribution p(z, q|L, \u03b8), since it is a mixed distribution of discrete variables z and continuous variables q. We bypass this issue by directly integrating over qj, yielding a marginal posterior distribution over the discrete variables z,\n\np(z|L, \u03b8) = \u222b p(z, q|L, \u03b8) dq = \u220f_{j\u2208[M]} \u222b_0^1 p(qj|\u03b8) qj^{cj}(1 \u2212 qj)^{\u03b3j\u2212cj} dqj  (def)=  \u220f_{j\u2208[M]} \u03c8j(zNj),  (4)\n\nwhere \u03c8j(zNj) is the local factor contributed by worker j due to eliminating qj, which couples all the tasks zNj labeled by j; here we suppress the dependency of \u03c8j on \u03b8 and L for notational simplicity. A key perspective is that we can treat p(z|L, \u03b8) as a discrete Markov random field, and re-interpret the bipartite assignment graph as a factor graph [13], with the tasks mapping to variable nodes and the workers to factor nodes. 
This interpretation motivates us to use a standard sum-product belief propagation method, approximating p(zi|L, \u03b8) with \u201cbeliefs\u201d bi(zi) using messages mi\u2192j and mj\u2192i between the variable nodes (tasks) and factor nodes (workers):\n\nFrom tasks to workers:  m^{t+1}_{i\u2192j}(zi) \u221d \u220f_{j'\u2208Mi\\j} m^t_{j'\u2192i}(zi),  (5)\nFrom workers to tasks:  m^{t+1}_{j\u2192i}(zi) \u221d \u2211_{zNj\\i} \u03c8j(zNj) \u220f_{i'\u2208Nj\\i} m^{t+1}_{i'\u2192j}(zi'),  (6)\nCalculating the beliefs:  b^{t+1}_i(zi) \u221d \u220f_{j\u2208Mi} m^{t+1}_{j\u2192i}(zi).  (7)\n\nAt the end of T iterations, the labels are estimated via \u02c6z^t_i = arg max_{zi} b^t_i(zi). One immediate difference between BP (5)-(7) and the KOS message passing (2) is that the messages and beliefs in (5)-(7) are probability tables on zi, i.e., mi\u2192j = [mi\u2192j(+1), mi\u2192j(\u22121)], while the messages in (2) are real values. For binary labels, we will connect the two by rewriting the updates (5)-(7) in terms of their (real-valued) log-odds, a standard transformation used in error-correcting codes.\nThe BP updates above appear computationally challenging, since step (6) requires eliminating a high-order potential \u03c8j(zNj), costing O(2^{\u03b3j}) in general. However, note that \u03c8j(zNj) in (4) depends on zNj only through cj, so that (with a slight abuse of notation) it can be rewritten as \u03c8(cj, \u03b3j). 
This structure enables us to rewrite the BP updates in a more efficient form (in terms of the log-odds):\nTheorem 3.1. Let\n\n\u02c6xi = log [bi(+1)/bi(\u22121)],  xi\u2192j = log [mi\u2192j(+1)/mi\u2192j(\u22121)],  and  yj\u2192i = Lij log [mj\u2192i(+1)/mj\u2192i(\u22121)].\n\nThen, sum-product BP (5)-(7) can be expressed as\n\nx^{t+1}_{i\u2192j} = \u2211_{j'\u2208Mi\\j} Lij' y^t_{j'\u2192i},  y^{t+1}_{j\u2192i} = log [ \u2211_{k=0}^{\u03b3j\u22121} \u03c8(k + 1, \u03b3j) e^{t+1}_k / \u2211_{k=0}^{\u03b3j\u22121} \u03c8(k, \u03b3j) e^{t+1}_k ],  (8)\n\nand \u02c6x^{t+1}_i = \u2211_{j\u2208Mi} Lij y^{t+1}_{j\u2192i}, where the terms ek for k = 0, . . . , \u03b3j \u2212 1 are the elementary symmetric polynomials in the variables {exp(Li'j xi'\u2192j)}_{i'\u2208Nj\\i}, that is, ek = \u2211_{s : |s|=k} \u220f_{i'\u2208s} exp(Li'j xi'\u2192j). In the end, the true labels are decoded as \u02c6z^t_i = sign[\u02c6x^t_i].\nThe terms ek can be efficiently calculated by divide & conquer and the fast Fourier transform in O(\u03b3j(log \u03b3j)^2) time (see appendix), making (8) much more efficient than (6) initially appears.\nSimilar to sum-product, one can also derive a max-product BP to find the joint maximum a posteriori configuration, \u02c6z = arg max_z p(z|L, \u03b8), which minimizes the block-wise error rate prob[\u2203i : zi \u2260 \u02c6zi] instead of the bit-wise error rate. Max-product BP can be organized similarly to (8), with the slightly lower computational cost of O(\u03b3j log \u03b3j); see the appendix for details and Tarlow et al. [14] for a general discussion of efficient max-product BP with structured high-order potentials. 
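The elementary symmetric polynomials ek in Theorem 3.1 can be accumulated by repeated polynomial multiplication, a simple O(\u03b3j^2) alternative to the O(\u03b3j(log \u03b3j)^2) divide-and-conquer/FFT scheme mentioned above. A sketch with our own names, taking \u03c8(\u00b7, \u03b3j) as a precomputed table:

```python
import numpy as np

def elementary_symmetric(vals):
    """Coefficients e_0..e_n of prod_k (1 + vals[k] * t), i.e. the
    elementary symmetric polynomials of `vals`, by repeated convolution."""
    e = np.array([1.0])
    for v in vals:
        e = np.convolve(e, [1.0, v])
    return e

def bp_worker_to_task(psi, incoming):
    """One worker-to-task update y_{j->i} from Theorem 3.1, eq. (8).

    psi[c] is the integrated potential psi(c, gamma_j) for c = 0..gamma_j,
    and `incoming` holds exp(L_{i'j} x_{i'->j}) for the other tasks i' in
    N_j \\ i (so gamma_j = len(incoming) + 1)."""
    e = elementary_symmetric(incoming)              # e_0 .. e_{gamma_j - 1}
    num = sum(psi[k + 1] * e[k] for k in range(len(e)))
    den = sum(psi[k] * e[k] for k in range(len(e)))
    return np.log(num / den)
```

With the Haldane table psi = [1, 0, ..., 0, 1], the ratio collapses to e_{\u03b3j\u22121}/e_0 and the update reduces to \u2211_{i'} Li'j xi'\u2192j, i.e., the linear KOS update (2).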
In this work, we focus on sum-product since the bit-wise error rate is more commonly used in practice.\n\n3.1.1 The Choice of Algorithmic Priors and Connection to KOS and Majority Voting\n\nBefore further discussion, we should be careful to distinguish between the prior on qj used in our algorithm (the algorithmic prior) and, assuming the model is correct, the true distribution of the qj in the data generating process (the data prior); the algorithmic and data priors often do not match. In this section, we discuss the form of \u03c8(cj, \u03b3j) for different choices of algorithmic priors, and in particular show that KOS and majority voting can be treated as special cases of our belief propagation (8) with the most \u201cuninformative\u201d and most \u201cinformative\u201d algorithmic priors, respectively. For more general priors that may not yield a closed form for \u03c8(cj, \u03b3j), one can calculate \u03c8(cj, \u03b3j) by numerical integration and store the values in a (\u03b3 + 1) \u00d7 \u03b3 table for later use, where \u03b3 = max_{j\u2208[M]} \u03b3j.\nBeta Priors. If p(qj|\u03b8) \u221d qj^{\u03b1\u22121}(1 \u2212 qj)^{\u03b2\u22121}, we have \u03c8(cj, \u03b3j) \u221d B(\u03b1 + cj, \u03b2 + \u03b3j \u2212 cj), where B(\u00b7,\u00b7) is the Beta function. Note that \u03c8(cj, \u03b3j) in this case equals (up to a constant) the likelihood of a Beta-binomial distribution.\nDiscrete Priors. If p(qj|\u03b8) has non-zero probability mass on only finitely many points, that is, prob(qj = \u02dcqk) = pk for k \u2208 [K], where 0 \u2264 \u02dcqk \u2264 1, 0 \u2264 pk \u2264 1 and \u2211_k pk = 1, then we have \u03c8(cj, \u03b3j) = \u2211_k pk \u02dcqk^{cj}(1 \u2212 \u02dcqk)^{\u03b3j\u2212cj}. One can show that log \u03c8(cj, \u03b3j) in this case is a log-sum-exp function.\nHaldane Prior. The Haldane prior [15] is a special discrete prior under which qj equals either 0 or 1 with equal probability, that is, prob[qj = 0] = prob[qj = 1] = 1/2. 
One can show that in this case we have \u03c8(0, \u03b3j) = \u03c8(\u03b3j, \u03b3j) = 1 and \u03c8(cj, \u03b3j) = 0 otherwise.\nClaim 3.2. The BP update (8) with the Haldane prior is equivalent to the KOS update (2).\nProof. Just substitute the \u03c8(cj, \u03b3j) of the Haldane prior shown above into the BP update (8).\nThe Haldane prior can also be treated as a Beta(\u03b5, \u03b5) prior with \u03b5 \u2192 0+, or equivalently as the improper prior p(qj) \u221d qj^{\u22121}(1 \u2212 qj)^{\u22121}, whose normalization constant is infinite. One can show that the Haldane prior is equivalent to putting a flat prior on the log-odds log[qj/(1 \u2212 qj)]; also, it has the largest variance (and hence is \u201cmost uninformative\u201d) among all possible distributions of qj. Therefore, although appearing to be extremely dichotomous, it is well known in Bayesian statistics as an uninformative prior for binomial distributions. Other choices of objective priors include the uniform prior Beta(1, 1) and Jeffreys\u2019 prior Beta(1/2, 1/2) [16], but these do not yield the same simple linear message passing form as the Haldane prior.\nUnfortunately, the use of the Haldane prior in our problem suffers from an important symmetry breaking issue: if the prior is symmetric, i.e., p(qj|\u03b8) = p(1 \u2212 qj|\u03b8), the true marginal posterior distribution of zi is also symmetric, i.e., p(zi|L, \u03b8) = [1/2; 1/2], because jointly flipping the sign of any configuration does not change its likelihood. This makes it impossible to break the ties when decoding zi. Indeed, it is not hard to observe that xi\u2192j = yj\u2192i = 0 (corresponding to symmetric probabilities) is a fixed point of the KOS update (2). The mechanism of KOS for breaking the symmetry seems to rely solely on initializing to points that bias towards majority voting, and the hope that the symmetric distribution is an unstable fixed point. 
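For algorithmic priors discussed above, the table \u03c8(cj, \u03b3j) can be precomputed once. For a Beta(\u03b1, \u03b2) prior the entries are just Beta-function values, best kept in log space for numerical stability; a small sketch (function names are ours):

```python
from math import lgamma, exp

def log_beta(a, b):
    """Log of the Beta function B(a, b) = Gamma(a)Gamma(b)/Gamma(a+b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def psi_table(alpha, beta, gamma):
    """log psi(c, gamma) = log B(alpha + c, beta + gamma - c), c = 0..gamma,
    for a Beta(alpha, beta) algorithmic prior (up to an additive constant)."""
    return [log_beta(alpha + c, beta + gamma - c) for c in range(gamma + 1)]
```

A symmetric prior such as Beta(1, 1) yields a table symmetric under c \u2194 \u03b3 \u2212 c, which is exactly the symmetry-breaking problem just described; an asymmetric prior such as Beta(2, 1) tilts the table toward large correct-counts c.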
In experiments, we find that the use of symmetric priors usually leads to degraded performance when the degree of the assignment graph is low, corresponding to the phase transition phenomenon discussed in Karger et al. [1]. This suggests that it is beneficial to use asymmetric priors with E[qj|\u03b8] > 1/2, to incorporate the prior knowledge that the majority of workers are non-adversarial. Interestingly, it turns out that majority voting uses such an asymmetric prior, but unfortunately one corresponding to another unrealistic extreme.\nDeterministic Priors. A deterministic prior is a special discrete distribution that equals a single point deterministically, i.e., prob[qj = \u02dcq|\u03b8] = 1, where 0 \u2264 \u02dcq \u2264 1. One can show that log \u03c8 in this case is a linear function, that is, log \u03c8(cj, \u03b3j) = cj logit(\u02dcq) + const.\nClaim 3.3. The BP update (8) with a deterministic prior satisfying \u02dcq > 1/2 terminates at the first iteration and finds the same solution as majority voting.\nProof. Just note that log \u03c8(cj, \u03b3j) = cj logit(\u02dcq) + const, and logit(\u02dcq) > 0 in this case.\nThe deterministic priors above have the opposite properties to the Haldane prior: they can also be treated as Beta(\u03b1, \u03b2) priors, but with \u03b1 \u2192 +\u221e and \u03b1 > \u03b2; these priors have the smallest variance (equal to zero) among all possible qj priors.\nIn this work, we propose to use priors that balance between KOS and majority voting. One reasonable choice is a Beta(\u03b1, 1) prior with \u03b1 > 1 [17]. In experiments, we find that the typical choice of Beta(2, 1) performs surprisingly well even when it is far from the true prior.\n\n3.1.2 The Two-Coin Models and Further Extensions\n\nWe previously assumed that workers\u2019 abilities are parametrized by a single parameter qj. 
This is likely to be restrictive in practice, since the error rate may depend on the true label value: false positive and false negative rates are often not equal. Here we consider the more general case, where the ability of worker j is specified by two parameters, the sensitivity sj and specificity tj [2, 4],\n\nsj = prob[Lij = +1|zi = +1],  tj = prob[Lij = \u22121|zi = \u22121].\n\nA typical prior on sj and tj is a pair of independent Beta distributions. One can show that \u03c8(zNj) in this case equals a product of two Beta functions, and depends on zNj only through two integers, the true positive and true negative counts. An efficient BP algorithm similar to (8) can be derived for this general case, by exploiting the special structure of \u03c8(zNj). See the appendix for details.\nOne may also try to derive a two-coin version of KOS, by assigning two independent Haldane priors on sj and tj; it turns out that the extended version is exactly the same as the standard KOS in (2). In this sense, the Haldane prior is too restrictive for the more general case. Several further extensions of the BP algorithm that are not obvious for KOS, for example the case when known features of the tasks or other side information are available, are discussed in the appendix due to space limitations.\n\n3.2 Mean Field Method and Connection to EM\nWe next present a mean field method for computing the marginal p(zi|L, \u03b8) in (3), and show its close connection to EM. In contrast to the derivation of BP, here we work directly on the mixed joint posterior p(z, q|L, \u03b8). Let us approximate p(z, q|L, \u03b8) with a fully factorized distribution b(z, q) = \u220f_{i\u2208[N]} \u00b5i(zi) \u220f_{j\u2208[M]} \u03bdj(qj). The best b(z, q) should minimize the KL divergence\n\nKL[b(z, q) || p(z, q|L, \u03b8)] = \u2212E_b[log p(z, q|L, \u03b8)] \u2212 \u2211_{i\u2208[N]} H(\u00b5i) \u2212 \u2211_{j\u2208[M]} H(\u03bdj),\n\nwhere E_b[\u00b7] denotes the expectation w.r.t. b(z, q), and H(\u00b7) the entropy functional. Assuming the algorithmic prior Beta(\u03b1, \u03b2), one crucial property of the KL objective in this case is that the optimal {\u03bd*_j(qj)} are guaranteed to be Beta distributions as well. Using a block coordinate descent method that alternately optimizes {\u00b5i(zi)} and {\u03bdj(qj)}, the mean field (MF) update is\n\nUpdating \u00b5i: \u00b5i(zi) \u221d \u220f_{j\u2208Mi} aj^{\u03b4ij} bj^{1\u2212\u03b4ij},  (9)\nUpdating \u03bdj: \u03bdj(qj) \u223c Beta(\u2211_{i\u2208Nj} \u00b5i(Lij) + \u03b1, \u2211_{i\u2208Nj} \u00b5i(\u2212Lij) + \u03b2),  (10)\n\nwhere aj = exp(E_{\u03bdj}[ln qj]) and bj = exp(E_{\u03bdj}[ln(1 \u2212 qj)]). The aj and bj can be calculated exactly by noting that E[ln x] = Digamma(\u03b1) \u2212 Digamma(\u03b1 + \u03b2) if x \u223c Beta(\u03b1, \u03b2). One can also instead calculate a first-order approximation of aj and bj: by Taylor expansion, one has ln(1 + x) \u2248 x; taking x = (qj \u2212 \u00afqj)/\u00afqj, where \u00afqj = E_{\u03bdj}[qj], and substituting it into the definitions of aj and bj, one gets aj \u2248 \u00afqj and bj \u2248 1 \u2212 \u00afqj; this gives an approximate MF (AMF) update,\n\nUpdating \u00b5i: \u00b5i(zi) \u221d \u220f_{j\u2208Mi} \u00afqj^{\u03b4ij}(1 \u2212 \u00afqj)^{1\u2212\u03b4ij},  Updating \u03bdj: \u00afqj = (\u2211_{i\u2208Nj} \u00b5i(Lij) + \u03b1) / (|Nj| + \u03b1 + \u03b2).  (11)\n\nThe update (11) differs from EM (1) only in replacing \u03b1 \u2212 1 and \u03b2 \u2212 1 with \u03b1 and \u03b2, corresponding to replacing the posterior mode of the Beta distribution with its posterior mean. 
This simple (perhaps trivial) difference plays the role of a Laplacian smoothing, and provides insight for improving the performance of EM. For example, note that the \u02c6qj in the M-step of EM can be updated to 0 or 1 if \u03b1 = 1 or \u03b2 = 1, and once this happens, \u02c6qj is locked at its current value, causing EM to be trapped at a local maximum. Update (11) prevents \u00afqj from becoming 0 or 1, avoiding this degenerate case. One can of course interpret (11) as EM with prior parameters \u03b1' = \u03b1 + 1 and \u03b2' = \u03b2 + 1; under this interpretation, it is advisable to choose priors with \u03b1' > 1 and \u03b2' > 1 (corresponding to a less common or intuitive \u201cinformative\u201d prior).\nWe should point out that it is widely known that EM can be interpreted as coordinate descent on variational objectives [18, 11]; our derivation differs in that we marginalize, instead of maximize, over qj. Our first-order approximation scheme is also similar to the method by Asuncion [19]. One can also extend this derivation to two-coin models with independent Beta priors, yielding the EM update in Dawid and Skene [2]. On the other hand, discrete priors do not seem to lead to interesting algorithms in this case.\n\n4 Experiments\n\nIn this section, we present numerical experiments on both simulated and real-world Amazon Mechanical Turk datasets. We implement majority voting (MV), KOS in (2), BP in (8), EM in (1) and its variant AMF in (11). The exact MF (9)-(10) was implemented, but is not reported because its performance is mostly similar to that of AMF (11) in terms of error rates. We initialize BP (including KOS) with yj\u2192i = 1 and EM with \u00b5i(zi) = \u2211_{j\u2208Mi} I[Lij = zi]/|Mi|, both of which reduce to majority voting at the 0-th iteration; for KOS, we also implemented another version that exactly follows the setting of Karger et al. [1], which initializes yj\u2192i by Normal(1, 1) and terminates at the 10-th iteration; the better performance of the two versions is reported. For EM with algorithmic prior Beta(\u03b1, \u03b2), we add a small constant (0.001) to \u03b1 and \u03b2 to avoid possible numerical NaN values. We also implemented a max-product version of BP, but found it performed similarly to sum-product BP in terms of error rates. We terminate all the iterative algorithms at a maximum of 100 iterations or at a 10^{-6} message convergence tolerance. All results are averaged over 100 random trials.\n\nFigure 1: The performance of the algorithms as the degrees of the assignment graph vary; the left degree \u2113 denotes the number of workers per task, and the right degree \u03b3 denotes the number of tasks per worker. The true data prior is prob[qj = 0.5] = prob[qj = 0.9] = 1/2.\n\nFigure 2: The performance on data generated with different qj priors on (9,9)-regular random graphs. (a) Beta prior with fixed \u03b1/(\u03b1 + \u03b2) = 0.6. (b) Beta prior with fixed \u03b1 + \u03b2 = 1. (c) Spammer-hammer prior, prob[qj = 0.5] = 1 \u2212 prob[qj = 0.9] = p0, with varying p0. (d) Adversary-spammer-hammer prior, prob[qj = 0.1] = p0, prob[qj = 0.5] = prob[qj = 0.9] = (1 \u2212 p0)/2, with varying p0.\n\nSimulated Data. 
We generate simulated data by drawing the abilities qj from Beta priors or from adversary-spammer-hammer priors, under which qj equals 0.1, 0.5, or 0.9 with certain probabilities; the assignment graphs are drawn randomly from the set of (\u2113, \u03b3)-regular bipartite graphs with 1000 task nodes using the configuration method [20]. For the simulated datasets, we also calculated the oracle lower bound of Karger et al. [1] that assumes the true qj are known, as well as a BP equipped with an algorithmic prior equal to the true prior used to generate the data, which sets a tighter (perhaps approximate) \u201cBayesian oracle\u201d lower bound for all the algorithms that do not know qj. We find that BP and AMF with the typical asymmetric prior Beta(2, 1) perform mostly as well as the \u201cBayesian oracle\u201d bound, eliminating the necessity to search for more accurate algorithmic priors.\nIn Fig. 1, we show that the error rates of the algorithms generally decay exponentially w.r.t. the degree \u2113 and log(\u03b3) of the assignment graph on a spammer-hammer model. 
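A spammer-hammer dataset of the kind used in Fig. 1 can be simulated in a few lines. The sketch below (our own names) uses a fully connected assignment graph for simplicity, rather than the (\u2113, \u03b3)-regular graphs drawn by the configuration method in the paper:

```python
import numpy as np

def simulate_spammer_hammer(N=1000, M=30, p_hammer=0.5, q_hammer=0.9, seed=0):
    """Draw true labels z in {+1, -1}, worker abilities q_j in
    {0.5, q_hammer}, and the label matrix L on a fully connected
    assignment graph (every worker labels every task)."""
    rng = np.random.default_rng(seed)
    z = rng.choice([-1, 1], size=N)
    q = np.where(rng.random(M) < p_hammer, q_hammer, 0.5)   # hammer or spammer
    correct = rng.random((N, M)) < q[None, :]               # does worker answer correctly?
    L = np.where(correct, z[:, None], -z[:, None])
    return L, z, q
```

Comparing an estimator's output against z then gives the empirical bit-wise error rate.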
Perhaps surprisingly, we find that BP, EM and AMF with the asymmetric algorithmic prior Beta(2, 1) scale similarly to KOS, which has been theoretically shown to be order-optimal compared to the oracle lower bound. On the other hand, BP with symmetric algorithmic priors, such as the Haldane prior Beta(0+, 0+) of KOS and the uniform prior Beta(1, 1), often results in degraded performance when the degrees of the assignment graphs are low, supporting our discussion in Section 3.1.1. Indeed, BP with symmetric algorithmic priors often fails to converge in the low-degree setting.

Fig. 2 shows the performance of the algorithms when varying the true priors of the data. We find in Fig. 2(b) and (d) that the performance of EM with Beta(2, 1) tends to degrade as the fraction of adversaries increases, probably because the estimate q̂j is more likely to be incorrectly updated to, and become stuck at, 0 or 1 in these cases; see the discussion in Section 3.2.

Figure 3: The results on Amazon Mechanical Turk datasets: (a) the bluebird dataset, (b) the rte dataset, (c) the temp dataset. Averaged over 100 random subsamples.
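The majority-voting initialization and the Beta(α, β) algorithmic prior described in the experimental setup can be illustrated with a minimal one-coin EM sketch. This is a simplified stand-in, not the exact AMF/EM implementation used in the experiments; the 1e-3 constant mirrors the 0.001 safeguard mentioned above, and we assume α, β ≥ 1 and that every task receives at least one label.

```python
import numpy as np

def em_one_coin(tasks, workers, labels, alpha=2.0, beta=1.0,
                max_iter=100, tol=1e-6):
    """One-coin EM: mu_i = P(z_i = +1) initialized at majority voting,
    worker reliabilities q_j given a Beta(alpha, beta) algorithmic prior.
    labels are in {-1, +1}; (tasks[k], workers[k], labels[k]) is one edge."""
    alpha, beta = alpha + 1e-3, beta + 1e-3  # keep q_j strictly inside (0, 1)
    n_tasks, n_workers = tasks.max() + 1, workers.max() + 1
    # init: mu_i(z) = sum_{j in M_i} I[L_ij = z] / |M_i|  (majority voting)
    mu = (np.bincount(tasks, weights=(labels == 1), minlength=n_tasks)
          / np.bincount(tasks, minlength=n_tasks))
    for _ in range(max_iter):
        # M-step: MAP estimate of q_j under the Beta prior
        p_correct = np.where(labels == 1, mu[tasks], 1 - mu[tasks])
        hits = np.bincount(workers, weights=p_correct, minlength=n_workers)
        deg = np.bincount(workers, minlength=n_workers)
        q = (hits + alpha - 1) / (deg + alpha + beta - 2)
        # E-step: task posteriors from the workers' log-odds
        logit = np.log(q) - np.log1p(-q)  # log(q_j / (1 - q_j))
        score = np.bincount(tasks, weights=labels * logit[workers],
                            minlength=n_tasks)
        mu_new = 1.0 / (1.0 + np.exp(-score))
        done = np.max(np.abs(mu_new - mu)) < tol
        mu = mu_new
        if done:
            break
    return mu, q
```

Without the small additive constant, a worker whose weighted accuracy reaches its extreme drives q̂j to exactly 0 or 1, producing infinite log-odds and the NaN issue noted above; with a symmetric prior such as Beta(1, 1), the MAP update degenerates further, which is one way to see why the asymmetric Beta(2, 1) choice is more stable.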
In all cases, we find that BP and AMF (and MF) perform mostly equally well, perhaps indicating that both the Bethe and mean-field approximations are reasonably good on the joint distribution p(z, q|L, θ) in terms of error rates. Our implementation of EM (on both the simulated data and the real data below) seems to perform better than previously reported results, probably due to our careful choice of prior and initialization.

Real Data. We tested our methods on three publicly available Amazon Mechanical Turk datasets. The symmetric assumption qj = sj = tj is likely to be violated in practice, especially on vision datasets where a human's perception decides whether some object is present. Therefore we also implemented the two-coin versions of BP and AMF(EM), with the algorithmic priors of sj and tj taken as two independent Beta(2, 1) distributions (referred to as BP-Beta2(2,1), and similarly for the others).

We first tested on the bluebird dataset of Welinder et al. [6], comprising 108 tasks and 39 workers on a fully connected bipartite assignment graph, where the workers are asked whether the presented images contain Indigo Bunting or Blue Grosbeak. Fig. 3(a) shows the performance when fixed numbers of annotators are subsampled for each task. On this dataset, all methods, including KOS, BP and AMF(EM), work poorly under the symmetric assumption, while the two-coin versions of BP and AMF(EM) are significantly better, achieving performance equivalent to the algorithm of Welinder et al. [6] based on an advanced high-dimensional model. This suggests that the symmetric assumption is badly violated on this dataset, probably because non-expert workers with high sensitivities but low specificities have trouble distinguishing Indigo Bunting from Blue Grosbeak.

We then tested on two natural language processing datasets in [21], the rte dataset with 800 tasks and 164 workers, and the temp dataset with 462 tasks and 76 workers. As seen in Fig. 
3(b)-(c), both the symmetric and the two-coin versions of BP and AMF(EM) perform equally well, all achieving almost the same performance as the SpEM algorithm reported in [4]. The KOS algorithm does surprisingly poorly here, probably because the assignment graphs are not locally tree-like.

5 Conclusion

We have presented a spectrum of inference algorithms, in particular a novel and efficient BP algorithm, for crowdsourcing problems, and clarified their connections to existing methods. Our exploration provides new insights into the existing KOS, MV and EM algorithms and, more importantly, separates the modeling factors from the algorithmic factors in crowdsourcing problems, providing guidance both for implementing the current algorithms and for designing even more efficient algorithms in the future. Numerical experiments show that BP, EM, AMF and exact MF, when implemented carefully, all perform impressively in terms of their error rate scaling. Further directions include applying our methodology to more advanced models, e.g., incorporating variation in task difficulties, and theoretical analysis of the error rates of BP, EM and MF that matches the empirical behavior in Fig. 1.

Acknowledgements. Work supported in part by NSF IIS-1065618 and two Microsoft Research Fellowships. We thank P. Welinder and S. Belongie for providing the data and code.

References

[1] D.R. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. In Neural Information Processing Systems (NIPS), 2011.

[2] A.P. Dawid and A.M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. 
Applied Statistics, pages 20–28, 1979.

[3] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi. Inferring ground truth from subjective labelling of Venus images. In Advances in Neural Information Processing Systems, pages 1085–1092, 1995.

[4] V.C. Raykar, S. Yu, L.H. Zhao, G.H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, 2010.

[5] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems, pages 2035–2043, 2009.

[6] P. Welinder, S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds. In Neural Information Processing Systems (NIPS), 2010.

[7] V.C. Raykar and S. Yu. Eliminating spammers and ranking annotators for crowdsourced labeling tasks. Journal of Machine Learning Research, 13:491–518, 2012.

[8] F.L. Wauthier and M.I. Jordan. Bayesian bias mitigation for crowdsourcing. In Advances in Neural Information Processing Systems 24, pages 1800–1808, 2011.

[9] B. Carpenter. Multilevel Bayesian models of categorical data annotation. Unpublished manuscript, 2008.

[10] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[11] M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

[12] Y. Weiss and W.T. Freeman. On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs. IEEE Transactions on Information Theory, 47(2):736–744, 2001.

[13] F.R. Kschischang, B.J. Frey, and H.A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.

[14] D. Tarlow, I.E. 
Givoni, and R.S. Zemel. HOP-MAP: Efficient message passing with high order potentials. In Proceedings of AISTATS, 2010.

[15] A. Zellner. An Introduction to Bayesian Inference in Econometrics, volume 17. John Wiley and Sons, 1971.

[16] R.E. Kass and L. Wasserman. The selection of prior distributions by formal rules. Journal of the American Statistical Association, pages 1343–1370, 1996.

[17] F. Tuyl, R. Gerlach, and K. Mengersen. A comparison of Bayes-Laplace, Jeffreys, and other priors. The American Statistician, 62(1):40–44, 2008.

[18] R. Neal and G.E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer, 1998.

[19] A. Asuncion. Approximate mean field for Dirichlet-based models. In ICML Workshop on Topic Models, 2010.

[20] B. Bollobás. Random Graphs, volume 73. Cambridge University Press, 2001.

[21] R. Snow, B. O'Connor, D. Jurafsky, and A.Y. Ng. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263. Association for Computational Linguistics, 2008.