{"title": "A Collaborative Mechanism for Crowdsourcing Prediction Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 2600, "page_last": 2608, "abstract": "Machine Learning competitions such as the Netflix Prize have proven reasonably successful as a method of \u201ccrowdsourcing\u201d prediction tasks. But these compe- titions have a number of weaknesses, particularly in the incentive structure they create for the participants. We propose a new approach, called a Crowdsourced Learning Mechanism, in which participants collaboratively \u201clearn\u201d a hypothesis for a given prediction task. The approach draws heavily from the concept of a prediction market, where traders bet on the likelihood of a future event. In our framework, the mechanism continues to publish the current hypothesis, and par- ticipants can modify this hypothesis by wagering on an update. The critical in- centive property is that a participant will profit an amount that scales according to how much her update improves performance on a released test set.", "full_text": "A Collaborative Mechanism for Crowdsourcing\n\nPrediction Problems\n\nJacob Abernethy\n\nDivision of Computer Science\n\nUniversity of California at Berkeley\n\njake@cs.berkeley.edu\n\nRafael M. Frongillo\n\nDivision of Computer Science\n\nUniversity of California at Berkeley\n\nraf@cs.berkeley.edu\n\nAbstract\n\nMachine Learning competitions such as the Net\ufb02ix Prize have proven reasonably\nsuccessful as a method of \u201ccrowdsourcing\u201d prediction tasks. But these compe-\ntitions have a number of weaknesses, particularly in the incentive structure they\ncreate for the participants. We propose a new approach, called a Crowdsourced\nLearning Mechanism, in which participants collaboratively \u201clearn\u201d a hypothesis\nfor a given prediction task. The approach draws heavily from the concept of a\nprediction market, where traders bet on the likelihood of a future event. 
In our framework, the mechanism continues to publish the current hypothesis, and participants can modify this hypothesis by wagering on an update. The critical incentive property is that a participant will profit an amount that scales according to how much her update improves performance on a released test set.\n\n1 Introduction\n\nThe last several years have revealed a new trend in Machine Learning: prediction and learning problems rolled into prize-driven competitions. One of the first, and certainly the most well-known, was the Netflix Prize, released in the Fall of 2006. Netflix, aiming to improve the algorithm used to predict users' preferences on its database of films, released a dataset of 100M ratings to the public and asked competing teams to submit a list of predictions on a test set withheld from the public. Netflix offered $1,000,000 to the first team achieving prediction accuracy exceeding a given threshold, a goal that was eventually met. This competitive model for solving a prediction task has been used for a range of similar competitions since, and there is even a new company (kaggle.com) that creates and hosts such competitions. Such prediction competitions have proven quite valuable for a couple of important reasons: (a) they leverage the abilities and knowledge of the public at large, a practice commonly known as “crowdsourcing”, and (b) they provide an incentivized mechanism for an individual or team to apply their own knowledge and techniques, which could be particularly beneficial to the problem at hand. This type of prediction competition provides a nice tool for companies and institutions that need help with a given prediction task yet cannot afford to hire an expert.
The potential leverage can be quite high: the Netflix Prize winners apparently spent more than $1,000,000 in effort on their algorithm alone.\n\nDespite the extent of its popularity, is the Netflix competition model the ideal way to “crowdsource” a learning problem? We note several weaknesses:\n\nIt is anti-collaborative. Competitors are strongly incentivized to keep their techniques private. This is in stark contrast to many other projects that rely on crowdsourcing – Wikipedia being a prime example, where participants must build off the work of others. Indeed, in the case of the Netflix Prize, not only do leading participants lack incentives to share, but the work of non-winning competitors is effectively wasted.\n\nThe incentives are skewed and misaligned. The winner-take-all prize structure means that second place is as good as having not competed at all. This ultimately leads to an equilibrium where only a few teams are actually competing, and where potential new teams never form since catching up seems so unlikely. In addition, the fixed achievement benchmark, set by Netflix as a 10% improvement in prediction RMSE over a baseline, leads to misaligned incentives. Effectively, the prize structure implies that an improvement of 9.9% is worth nothing to Netflix, whereas a 20% improvement is still only worth $1,000,000 to Netflix. This is clearly not optimal.\n\nThe nature of the competition precludes the use of proprietary methods. By requiring that the winner reveal the winning algorithm, potential competitors utilizing non-open software or proprietary techniques will be unwilling to compete. By participating in the competition, a user must effectively give away his intellectual property.\n\nIn this paper we describe a new and very general mechanism to crowdsource prediction/learning problems.
Our mechanism requires participants to place bets, yet the space they are betting over is the set of hypotheses for the learning task at hand. At any given time the mechanism publishes the current hypothesis w, and participants can wager on a modification of w to w′, upon which the modified w′ is posted. Eventually the wagering period finishes, a set of test data is revealed, and each participant receives a payout according to their bets. The critical property is that every trader's profit scales according to how well their modification improved the solution on the test data.\n\nThe framework we propose has many qualities similar to that of an information or prediction market, and many of the ideas derive from recent research on the design of automated market makers [7, 8, 3, 4, 1]. Many information markets already exist; at sites like Intrade.com and Betfair.com, individuals can bet on everything ranging from election outcomes to geopolitical events. There has been a burst of interest in such markets in recent years, not least of which is due to their potential for combining large amounts of information from a range of sources. In the words of Hanson et al. [9]: “Rational expectations theory predicts that, in equilibrium, asset prices will reflect all of the information held by market participants. This theorized information aggregation property of prices has led economists to become increasingly interested in using securities markets to predict future events.” In practice, prediction markets have proven impressively accurate as a forecasting tool [11, 2, 12].\n\nThe central contribution of the present paper is to take the framework of a prediction market as a tool for information aggregation and to apply this tool for the purpose of “aggregating” a hypothesis (classifier, predictor, etc.) for a given learning problem.
The crowd of ML researchers, practitioners, and domain experts represents a highly diverse range of expertise and algorithmic tools. In contrast to the Netflix Prize, which pitted teams of participants against each other, the mechanism we propose allows for everyone to contribute whatever knowledge they may have available towards the final solution. In a sense, this approach decentralizes the process of solving the task, as individual experts can potentially apply their expertise to a subset of the problem on which they have an advantage. Whereas a market price can be thought of as representing a consensus estimate of the value of an asset, our goal is to construct a consensus hypothesis reflecting all the knowledge and capabilities about a particular learning problem¹.\n\nLayout: We begin in Section 2.1 by introducing the simple notion of a generalized scoring rule L(·, ·) representing the “loss function” of the learning task at hand. In Section 2.2 we describe our proposed Crowdsourced Learning Mechanism (CLM) in detail, and discuss how to structure a CLM for a particular scoring function L so that traders are given incentives to minimize L. In Section 3 we give an example based on the design of Huffman codes. In Section 4 we discuss previous work on the design of prediction markets using an automated prediction market maker (APMM). In Section 5 we finish by considering two learning settings (e.g. linear regression) and construct a CLM for each. The proofs have been omitted throughout, but they are available in the full version of the present paper.\n\nNotation: Given a smooth strictly convex function R : Rd → R, and points x, y ∈ dom(R), we define the Bregman divergence DR(x, y) as the quantity R(x) − R(y) − ∇R(y) · (x − y).
For any convex function R, we let R∗ denote the convex conjugate of R, that is R∗(y) := sup_{x∈dom(R)} y · x − R(x). We shall use Δ(S) to refer to the set of integrable probability distributions over the set S, and Δn to refer to the set of probability vectors p ∈ Rn. The function H : Δn → R shall denote the entropy function, that is H(p) := −∑_{i=1}^n p(i) log p(i). We use the notation KL(p; q) to describe the relative entropy or Kullback-Leibler divergence between distributions p, q ∈ Δn, that is KL(p; q) := ∑_{i=1}^n p(i) log(p(i)/q(i)). We will also use ei ∈ Rn to denote the ith standard basis vector, having a 1 in the ith coordinate and 0's elsewhere.\n\n¹It is worth noting that Barbu and Lay utilized concepts from prediction markets to design algorithms for classifier aggregation [10], although their approach was unrelated to crowdsourcing.\n\n2 Scoring Rules and Crowdsourced Learning Mechanisms\n\n2.1 Generalized Scoring Rules\n\nFor the remainder of this section, we shall let H denote some set of hypotheses, which we will assume is a convex subset of Rn. We let O be some arbitrary set of outcomes. We use the symbol X to refer to either an element of O, or a random variable taking values in O.\n\nWe recall the notion of a scoring rule, a concept that arises frequently in economics and statistics [6].\n\nDefinition 1. Let P ⊆ Δ(O) be some convex set of distributions on an outcome space O. A scoring rule is a function S : P × O → R where, for all P ∈ P, P ∈ argmax_{Q∈P} E_{X∼P} S(Q, X).\n\nIn other words, if you are paid S(P, X) upon stating belief P ∈ P and outcome X occurring, then you maximize your expected utility by stating your true belief. We offer a much weaker notion:\n\nDefinition 2.
Given a convex hypothesis space H ⊂ Rn and an outcome space O, let L : H × O → R be a continuous function. Given any P ∈ Δ(O), let WL(P) := argmin_{w∈H} E_{X∼P}[L(w; X)]. Then we say that L is a Generalized Scoring Rule (GSR) if WL(P) is a nonempty convex set for every P ∈ Δ(O).\n\nThe generalized scoring rule shall represent the “loss function” for the learning problem at hand, and in Section 2.2 we will see how L is utilized in the mechanism. The hypothesis w shall represent the advice we receive from the crowd, X shall represent the test data to be revealed at the close of the mechanism, and L(w; X) shall represent the loss of the advised w on the data X. Notice that we do not define L to be convex in its first argument, as this does not hold for many important cases. Instead, we require the weaker condition that EX[L(w; X)] is minimized on a convex set for any distribution on X.\n\nOur scoring rule differs from traditional scoring rules in an important way. Instead of starting with the desire to know the true value of X, and then designing a scoring rule which incentivizes participants to reveal their belief P ∈ P, our objective is precisely to minimize our scoring rule. In other words, traditional scoring rules were a means to an end (eliciting P) but our generalized scoring rule is the end itself. One can recover the traditional scoring rule definition by setting H = P and imposing the constraint that P ∈ WL(P).\n\nA useful class of GSRs L are those based on a Bregman divergence.\n\nDefinition 3.
We say that a GSR L : H × O → R is divergence-based if there exists an alternative hypothesis space H′ ⊂ Rm, for some m, where we can write\n\nL(w; X) ≡ DR(ρ(X), ψ(w)) + f(X)   (1)\n\nfor arbitrary maps ρ : O → H′, f : O → R, and ψ : H → H′, and any closed strictly convex R : H′ → R whose convex conjugate R∗ is finite on all of Rm.\n\nThis property allows us to think of L(w; X) as a kind of distance between ρ(X) and ψ(w). Clearly then, the minimum value of L for a given X will be attained when ψ(w) = ρ(X), given that DR(x, x) = 0 for any Bregman divergence. In fact, as the following proposition shows, we can even think of the expected value E[L(w; X)] as a distance between E[ρ(X)] and ψ(w).\n\nProposition 1. Given a divergence-based GSR L(w; X) = DR(ρ(X), ψ(w)) + f(X) and a belief distribution P on O, we have WL(P) = ψ−1(E_{X∼P}[ρ(X)]).\n\nWe now can see that the divergence-based property greatly simplifies the task of minimizing L; instead of worrying about E[L(·; X)] one can simply base the hypothesis directly on the expectation E[ρ(X)]. As we will see in Section 4, this also leads to efficient prediction markets and crowdsourcing mechanisms.\n\n2.2 The Crowdsourced Learning Mechanism\n\nWe will now define our actual mechanism rigorously.\n\nDefinition 4. A Crowdsourced Learning Mechanism (CLM) is the procedure in Algorithm 1 as defined by the tuple (H, O, Cost, Payout). The function Cost : H × H → R sets the cost charged to a participant that makes a modification to the posted hypothesis.
The function Payout : H × H × O → R determines the amount paid to each participant when the outcome is revealed to be X.\n\nAlgorithm 1 Crowdsourced Learning Mechanism for (H, O, Cost, Payout)\n1: Mechanism sets initial hypothesis to some w0 ∈ H\n2: for rounds t = 0, 1, 2, . . . do\n3: Mechanism posts current hypothesis wt ∈ H\n4: Some participant places a bid on the update wt ↦ w′\n5: Mechanism charges participant Cost(wt, w′)\n6: Mechanism updates hypothesis wt+1 ← w′\n7: end for\n8: Market closes after T rounds and the outcome (test data) X ∈ O is revealed\n9: for each t do\n10: Participant responsible for the update wt ↦ wt+1 receives Payout(wt, wt+1; X)\n11: end for\n\nThe above procedure describes the process by which participants can provide advice to the mechanism to select a good w, and the profit they earn by doing so. Of course, this profit will precisely determine the incentives of our mechanism, and hence a key question is: how can we design Cost and Payout so that participants are incentivized to provide good hypotheses? The answer is that we shall structure the incentives around a GSR L(w; X) chosen by the mechanism designer.\n\nDefinition 5. For a CLM A = (H, O, Cost, Payout), denote the ex-post profit for the bid (w ↦ w′) when the outcome is X ∈ O by Profit(w, w′; X) := Payout(w, w′; X) − Cost(w, w′). We say that A implements a GSR L : H′ × O → R if there exists a surjective map ϕ : H → H′ such that for all w1, w2 ∈ H and X ∈ O,\n\nProfit(w1, w2; X) = L(ϕ(w1); X) − L(ϕ(w2); X).   (2)\n\nIf additionally H′ = H and ϕ = idH, we call A an L-CLM and say that A is L-incentivized.\n\nWhen a CLM implements a given L, the incentives are structured in order that the participants will work to minimize L(w; X).
Of course, the input X is unknown to the participants, yet we can assume that the mechanism has provided a public “training set” to use in a learning algorithm. The participants are thus asked not only to propose a “good” hypothesis wt but to wager on whether the update wt−1 ↦ wt improves generalization error. It is worth making clear that knowledge of the true distribution on X provides a straightforward optimal strategy.\n\nProposition 2. Given a GSR L : H × O → R and an L-CLM (Cost, Payout), any participant who knows the true distribution P ∈ P over X will maximize expected profit by modifying the hypothesis to any w ∈ WL(P).\n\nCost of operating a CLM. It is clear that the agent operating the mechanism must pay the participants at the close of the competition, and is thus at risk of losing money (in fact, it is possible he may gain). How much money is lost depends on the bets (wt ↦ wt+1) made by the participants, and of course the final outcome X. The agent has a clear interest in knowing precisely the potential cost – fortunately this cost is easy to compute. The loss to the agent is clearly the total ex-post profit earned by the participants, and by construction this sum telescopes: ∑_{t=0}^{T−1} Profit(wt, wt+1; X) = L(w0; X) − L(wT; X). This is a simple yet appealing property of the CLM: the agent pays only as much in reward to the participants as it benefits from the improvement of wT over the initial w0. It is worth noting that this value could be negative when wT is actually “worse” than w0; in this case, as we shall see in Section 3, the CLM can act as an insurance policy with respect to the mistakes of the participants. A more typical scenario, of course, is where the participants provide an improved hypothesis, in which case the CLM will run at a cost.
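The telescoping identity above is easy to check numerically. The following is a minimal sketch (not from the paper; the squared-error loss and all names are illustrative choices) of an L-CLM's total payout collapsing to the improvement of the final hypothesis over the initial one:

```python
# Sketch: the operator's total ex-post payout in an L-CLM telescopes to
# L(w0; X) - L(wT; X).  The loss choice (squared error over scalar
# hypotheses) is illustrative, not the paper's.

def loss(w, x):
    """A simple generalized scoring rule L(w; X) = (w - x)^2."""
    return (w - x) ** 2

def profit(w_old, w_new, x):
    """Ex-post profit of the bid (w_old -> w_new), per Definition 5."""
    return loss(w_old, x) - loss(w_new, x)

# A sequence of hypotheses posted during trading, then the outcome X.
trajectory = [0.0, 0.8, 0.5, 0.61, 0.6]
x = 0.6

total = sum(profit(w, w_next, x)
            for w, w_next in zip(trajectory, trajectory[1:]))

# The sum telescopes: the operator pays exactly the improvement of wT over w0.
assert abs(total - (loss(trajectory[0], x) - loss(trajectory[-1], x))) < 1e-12
```

Note that intermediate hypotheses cancel out entirely, so a participant who worsens the hypothesis and is later corrected effectively subsidizes the correction.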
We can compute WorstCaseLoss(L-CLM) := max_{w∈H, X∈O} (L(w0; X) − L(w; X)). Given a budget of size $B, the mechanism can always rescale L in order that WorstCaseLoss(L-CLM) = B. This requires, of course, that the WorstCaseLoss is finite.\n\nComputational efficiency of operating a CLM. We shall say that a CLM has the efficient computation (EC) property if both Cost and Payout are efficiently computable functions. We shall say a CLM has the tractable trading (TT) property if, given a current hypothesis w, a belief P ∈ Δ(O) and a budget B, one can efficiently compute an element of the set\n\nargmax_{w′∈H} { E_{X∼P}[Profit(w, w′; X)] : Cost(w, w′) ≤ B }.\n\nThe EC property ensures that the mechanism operator can run the CLM efficiently. The TT property says that participants can compute the optimal hypothesis to bet on given a belief on the outcome and a budget. This is absolutely essential for the CLM to successfully aggregate the knowledge and expertise of the crowd – without it, despite their motivation to lower L(·; ·), the participants would not be able to compute the optimal bet.\n\nSuitable collateral requirements. We say that a CLM has the escrow (ES) property if the Cost and Payout functions are structured in order that, given any wager (w ↦ w′), we have that Payout(w, w′; X) ≥ 0 for all X ∈ O. It is clear that, when designing an L-CLM for a particular L, the Payout function is fully specified once Cost is fixed, since we have the relation Payout(w, w′; X) = L(w; X) − L(w′; X) + Cost(w, w′) for every w, w′ ∈ H and X ∈ O. A curious reader might ask, why not simply set Cost(w, w′) ≡ 0 and Payout ≡ Profit?
The problem with this approach is that potentially Payout(w, w′; X) < 0, which implies that the participant who wagered on (w ↦ w′) can be indebted to the mechanism and could default on this obligation. Thus the Cost function should be set in order to require every participant to deposit at least enough collateral in escrow to cover any possible losses.\n\nSubsidizing with a voucher pool. One practical weakness of a wagering-based mechanism is that individuals may be hesitant to participate when it requires depositing actual money into the system. This can be allayed to a reasonable degree by including a voucher pool where each of the first m participants may receive a voucher in the amount of $C. These candidates need not pay to participate, yet have the opportunity to win. Of course, these vouchers must be paid for by the agent running the mechanism, and hence a value of mC is added to the total operational cost.\n\n3 A Warm-up: Compressing an Unfamiliar Data Stream\n\nLet us now introduce a particular setting motivated by a well-known problem in information theory. Imagine a firm is looking to do compression on an unfamiliar channel, and from this channel the firm will receive a stream of m characters from an n-sized alphabet which we shall index by [n]. The goal is to select a binary encoding of this alphabet in a way that minimizes the total bits required to store the data, as a cost of $1 is required for each bit.\n\nA first-order approach to encoding such a stream is to assign a probability distribution q ∈ Δn to the alphabet, and to select an encoding of character i with a binary word of length log(1/q(i)) (we ignore round-off for simplicity). This can be achieved using Huffman Codes, for example, and we refer the reader to Cover and Thomas ([5], Chapter 5) for more details. Thus, given a distribution q, the firm pays L(q; i) = − log q(i) for each character i.
It is easy to see that if the characters are sampled from some “true” distribution p, then the expected cost is L(q; p) := E_{i∼p}[L(q; i)] = KL(p; q) + H(p), which is minimized at q = p. Not knowing the true distribution p, the firm is thus interested in finding a q with a low expected cost L(q; p).\n\nAn attractive option available to the firm is to crowdsource the task of lowering this cost L(·; ·) by setting up an L-CLM. It is reasonably likely that outside individuals have private information about the behavior of the channel and, in particular, may be able to provide a better estimate q of the true distribution of the characters in the channel. As just discussed, the better the estimate the cheaper the compression.\n\nWe set H = Δn and O = [n], where a hypothesis q represents the proposed distribution over the n characters, and X is some character sampled uniformly from the stream after it has been observed. We define Cost and Payout as\n\nCost(q, q′) := max_{i∈[n]} log(q(i)/q′(i)),   Payout(q, q′; i) := log(q′(i)/q(i)) + Cost(q, q′),\n\nwhich is clearly an L-CLM for the loss defined above. It is worth noting that L is a divergence-based GSR if we take R(q) = −H(q), ρ(i) = ei, f ≡ 0, ψ ≡ id_Δn, using the convention 0 log 0 = 0 (in fact, L is the LMSR). Finally, the firm will initially set q0 to be its best guess of p, which we will assume to be uniform (but need not be).\n\nWe have devised this payout scheme according to the selection of a single character i, and it is worth noting that because this character is sampled uniformly at random from the stream (with private randomness), the participants cannot know which character will be released. This forces the participants to wager on the empirical distribution p̂ of the characters from the stream.
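As a sketch of how this mechanism could be simulated (the distributions and all variable names are illustrative, not from the paper), the Cost and Payout above can be implemented directly, and one can check both the escrow property (Payout is never negative) and the L-incentivized property (Profit equals the drop in loss):

```python
import math

# Sketch of the Section 3 mechanism: L(q; i) = -log q(i),
#   Cost(q, q')      = max_i log(q(i)/q'(i)),
#   Payout(q, q'; i) = log(q'(i)/q(i)) + Cost(q, q').
# Distributions below are illustrative.

def cost(q, q2):
    return max(math.log(qi / q2i) for qi, q2i in zip(q, q2))

def payout(q, q2, i):
    return math.log(q2[i] / q[i]) + cost(q, q2)

def loss(q, i):
    return -math.log(q[i])

q  = [0.25, 0.25, 0.25, 0.25]   # initial (uniform) estimate q0
q2 = [0.40, 0.30, 0.20, 0.10]   # a participant's proposed update

for i in range(4):
    # Escrow (ES): the up-front Cost covers any possible loss, so the
    # payout is never negative.
    assert payout(q, q2, i) >= 0
    # L-incentivized: Profit = L(q; i) - L(q'; i)  (Definition 5).
    prof = payout(q, q2, i) - cost(q, q2)
    assert abs(prof - (loss(q, i) - loss(q2, i))) < 1e-12
```

The worst payout (exactly zero) occurs on the character whose probability the participant lowered the most, which is precisely the event the escrow deposit insures against.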
A reasonable alternative, and one which lowers the payment variance, is to pay out according to L(q; p̂), which is also equal to the average of L(q; i) when i is chosen uniformly from the stream.\n\nThe obvious question to ask is: how does this CLM benefit the firm that wants to design the encoding? More precisely, if the firm uses the final estimate qT from the mechanism, instead of the initial guess q0, what is the trade-off between the money paid to participants and the money gained by using the crowdsourced hypothesis? At first glance, it appears that this trade-off can be arbitrarily bad: the worst case cost of encoding the stream using the final estimate qT is sup_{i,qT} − log(qT(i)) = ∞. Amazingly, however, by virtue of the aligned incentives, the firm has very strong control of its total cost (the CLM cost plus the encoding cost). Suppose the firm scales L by a parameter α, to separate the scale of the CLM from the scale of the encoding cost (which we assumed to be $1 per bit). Then given any initial estimate q0 and final estimate qT, the expected total cost over p is\n\nTotal expected cost = [H(p) + KL(p; qT)] + [α(KL(p; q0) − KL(p; qT))] = H(p) + (1 − α)KL(p; qT) + αKL(p; q0),\n\nwhere the first bracketed term is the encoding cost of using qT given p, and the second is the mechanism's cost of obtaining the advice qT.\n\nLet us spend a moment to analyze the above expression. Imagine that the firm set α = 1. Then the total cost of the firm would be H(p) + KL(p; q0), which is bounded by log n for q0 uniform. Notice that this expression does not depend on qT – in fact, this cost precisely corresponds to the scenario where the firm had not set up a CLM and instead used the initial estimate q0 to encode.
In other words, for α = 1, the firm is entirely neutral to the quality of the estimate qT; even if the CLM provided an estimate qT which performed worse than q0, the cost increase due to the bad choice of q is recouped from the payments of the ill-informed participants.\n\nThe firm may not want to be neutral to the estimate of the crowd, however, and under the reasonable assumption that the final estimate qT will improve upon q0, the firm should set 0 < α < 1 (of course, positivity is needed for nonzero payouts). In this case, the firm will strictly gain by using the CLM when KL(p; qT) < KL(p; q0), but still has some insurance policy if the estimate qT is poor.\n\n4 Prediction Markets as a Special Case\n\nLet us briefly review the literature for the type of prediction markets relevant to the present work. In such a prediction market, we imagine a future event to reveal one of n uncertain outcomes. Hanson [7, 8] proposed a framework in which traders make “reports” to the market about their internal belief in the form of a distribution p ∈ Δn. Each trader would receive a reward (or loss) based on a function of their proposed belief and the belief of the previous trader, and the function suggested by Hanson was the Logarithmic Market Scoring Rule (LMSR). It was shown later that the LMSR-based market is equivalent to what is known as a cost-function-based automated market maker, proposed by Chen and Pennock [3]. More recently a much broader equivalence was established by Chen and Wortman Vaughan [4] between markets based on cost functions and those based on scoring rules.\n\nThe market framework proposed by Chen and Pennock allows traders to buy and sell Arrow-Debreu securities (equivalently: shares, contracts), where an Arrow-Debreu security corresponding to outcome i pays out $1 if and only if i is realized.
All shares are bought and sold through an automated market maker, which is the entity managing the market and setting prices. At any time period, traders can purchase bundles of contracts r ∈ Rn, where r(i) represents the number of shares purchased on outcome i. The price of a bundle r is set as C(s + r) − C(s), where C is some differentiable convex cost function and s ∈ Rn is the “quantity vector” representing the total number of outstanding shares. The LMSR cost function is C(s) := (1/η) log(∑_{i=1}^n exp(η s(i))).\n\nThis cost function framework was extended by Abernethy et al. [1] to deal with prohibitively large outcome spaces. When the set of potential outcomes O is of exponential size or even infinite, the market designer can offer a restricted number of contracts, say n (≪ |O|), rather than offer an Arrow-Debreu contract for each member of O. To determine the payout structure, the market designer chooses a function ρ : O → Rn, where contract i returns a payout of ρi(X) and, thus, a contract bundle r pays ρ(X) · r. As with the framework of Chen and Pennock, the contract prices are set according to a cost function C, so that a bundle r has a price of C(s + r) − C(s). The design of the function C is addressed at length in Abernethy et al., to which we refer the reader.\n\nFor the remainder of this section we shall discuss the prediction market template of Abernethy et al. as it provides the most general model; we shall refer to such a market as an Automated Prediction Market Maker. We now precisely state the ingredients of this framework.\n\nDefinition 6.
An Automated Prediction Market Maker (APMM) is defined by a tuple (S, O, ρ, C) where S is the share space of the market, which we will assume to be the linear space Rn; O is the set of outcomes; C : S → R is a smooth and convex cost function with ∇C(S) = relint(∇C(S)) (here, we use ∇C(S) := {∇C(s) | s ∈ S} to denote the derivative space of C); and ρ : O → ∇C(S) is a payoff function².\n\nFortunately, we need not provide a full description of the procedure of the APMM mechanism: the APMM is precisely a special case of a CLM! Indeed, the APMM framework can be described as a CLM (H, O, Cost, Payout) where\n\nH = S (= Rn),   Cost(s, s′) = C(s′) − C(s),   Payout(s, s′; X) = ρ(X) · (s′ − s).   (3)\n\nHence we can think of APMM prediction markets in terms of our learning mechanism. Markets of this form are an important special class of CLMs – in particular, we can guarantee that they are efficient to work with, as we show in the following proposition.\n\nProposition 3. An APMM (S, O, ρ, C) with efficiently computable C satisfies EC and TT.\n\nWe now ask, what is the learning problem that the participants of an APMM are trying to solve? More precisely, when we think of an APMM as a CLM, does it implement a particular L?\n\nTheorem 1. Given an APMM A := (S, O, ρ, C), A implements L : ∇C(S) × O → R defined by\n\nL(w; X) = DC∗(ρ(X), w),   (4)\n\nwhere C∗ is the conjugate dual of the function C.\n\nThere is another more subtle benefit to APMMs – and, in fact, to most prediction market mechanisms in practice – which is that participants make bets via the purchase of shares or share bundles. When a trader makes a bet, she purchases a contract bundle r, is charged C(s + r) − C(s) (when the current quantity vector is s), and shall receive payout ρ(X) · r if and when X is realized.
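A minimal sketch (illustrative, not from the paper) of an APMM viewed as a CLM, instantiating equation (3) with the LMSR cost function from Section 4. The final assertion checks the standard LMSR identity that a trader's profit on outcome i equals (1/η) times the log of the instantaneous price ratio, which is the log-loss improvement underlying Theorem 1; the liquidity value η and the share vectors are assumed for the example:

```python
import math

ETA = 1.0  # LMSR liquidity parameter eta; value chosen for illustration

def C(s):
    """LMSR cost function C(s) = (1/eta) * log(sum_i exp(eta * s_i))."""
    m = max(s)  # stabilize the log-sum-exp
    return m + math.log(sum(math.exp(ETA * (si - m)) for si in s)) / ETA

def price(s):
    """Instantaneous prices, i.e. the gradient of C: a probability vector."""
    z = sum(math.exp(ETA * si) for si in s)
    return [math.exp(ETA * si) / z for si in s]

# The APMM as a CLM, per equation (3):
def cost(s, s2):
    return C(s2) - C(s)

def payout(s, s2, i):
    return s2[i] - s[i]   # rho(i) = e_i: Arrow-Debreu payoff on outcome i

s, s2 = [0.0, 0.0, 0.0], [1.0, 0.2, -0.5]   # quantity vectors before/after a trade
p, p2 = price(s), price(s2)

assert abs(sum(p) - 1) < 1e-12              # prices form a distribution
for i in range(3):
    prof = payout(s, s2, i) - cost(s, s2)
    # LMSR identity: profit on outcome i = (1/eta) * log(p2_i / p_i)
    assert abs(prof - math.log(p2[i] / p[i]) / ETA) < 1e-9
```

This is the share-based "dual" view: the participant trades in s-space, while the implemented loss lives in price space ∇C(S).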
But at any point before X is observed and trading is open, the trader can sell off this bundle, to the APMM or another trader, and hence neutralize her risk. In this sense bets made in an APMM are stateless, whereas for an arbitrary CLM this may not be the case: the wager defined by (wt ↦ wt+1) cannot necessarily be sold back to the mechanism, as the posted hypothesis may no longer remain at wt+1.\n\nGiven a learning problem defined by the GSR L : H × O → R, it is natural to ask whether we can design a CLM which implements this L and has this “share-based property” of APMMs. More precisely, under what conditions is it possible to implement L with an APMM?\n\nTheorem 2. For any divergence-based GSR L(w; X) = DR(ρ(X), ψ(w)) + f(X), with ψ : H → H′ one-to-one, H′ = relint(H′), and ρ(O) ⊆ ψ(H), there exists an APMM which implements L.\n\nWe point out, as a corollary, that if an APMM implements some arbitrary L, then we must be able to write L as a divergence function. This fully specifies the class of problems solvable using APMMs.\n\n²The conditions that ρ(O) ⊆ ∇C(S) and ∇C(S) = relint(∇C(S)) are technical but important, and we do not address these details in the present extended abstract although they will be considered in the full version. More relevant discussion can also be found in Abernethy et al. [1].\n\nCorollary 1. If an APMM (S, O, ρ, C) implements a GSR L : H × O → R, then L is divergence-based.\n\nTheorem 1 establishes a strong connection between prediction markets and a natural class of GSRs. One interpretation of this result is that any GSR based on a Bregman divergence has a “dual” characterization as a share-based market, where participants buy and sell shares rather than directly altering the share prices (the hypothesis).
This has many advantages for prediction markets, not least of which is that shares are often easier to think about than the underlying hypothesis space. Our notion of a CLM offers another interpretation. In light of Proposition 3, any machine learning problem whose hypotheses can be evaluated in terms of a divergence leads to a tractable crowdsourcing mechanism, as was the case in Section 3. Moreover, this theorem does not preclude efficient yet non-divergence-based loss functions, as we see in the next section.

5 Example CLMs for Typical Machine Learning Tasks

Regression. We now construct a CLM for a typical regression problem. We let H be the ℓ2-norm ball of radius 1 in R^d, and we shall let an outcome be a batch of data, that is, X := {(x_1, y_1), . . . , (x_n, y_n)}, where for each i we have x_i ∈ R^d, y_i ∈ [−1, 1], and we assume ‖x_i‖_2 ≤ 1. We construct a GSR according to the mean squared error, L(w; {(x_i, y_i)}_{i=1}^n) = (α/2n) Σ_{i=1}^n (w · x_i − y_i)^2, for some parameter α > 0. It is worth noting that L is not divergence-based.

In order to satisfy the escrow property (ES), we can set Cost(w, w′) := 2α‖w − w′‖_2, because the function L(w; X) is 2α-Lipschitz with respect to w for any X. To ensure that the CLM is L-incentivized, we must set Payout(w, w′; X) := Cost(w, w′) + L(w; X) − L(w′; X). If we set the initial hypothesis w_0 = 0, it is easy to check that WorstCaseLoss = α/2. It remains to check whether this CLM is tractable. Clearly we can efficiently compute Cost and Payout, hence the EC property holds. Given how Cost is defined, the set {w′ : Cost(w, w′) ≤ B} is just an ℓ2-norm ball.
Also, since L is convex in w for each X, so is the function E_{X∼P}[Profit(w, w′, X)] for every P. A budget-constrained profit-maximizing participant must therefore simply solve a convex optimization problem, and hence the TT property holds.

Betting Directly on the Labels. Let us return to the Netflix Prize model discussed in the Introduction. In this style of competition, a host releases a dataset for a given prediction task. The host then asks participants to provide predictions on a specified set of instances for which it holds the correct labels. For every submission the host computes an error measure, say the MSE, and reports it to the participants. Of course, the correct labels are withheld throughout.

Our CLM framework is general enough to apply to this setting as well. Define H = O = K^m, where K ⊆ R is a bounded set of valid labels and m is the number of requested test-set predictions. For w ∈ H and y ∈ O, w(k) specifies the kth predicted label and y(k) the true label. A natural scoring function is the total squared loss, L(w; y) := Σ_{k=1}^m (w(k) − y(k))^2.

Of course, this approach is quite different from the Netflix Prize model in two key respects: (a) the participants have to wager on their predictions, and (b) by participating in the mechanism they are required to reveal their modification to all of the other players. Hence, while we have structured a competitive process, the participants are de facto forced to collaborate on the solution.

A reasonable critique of this collaborative-mechanism approach to a Netflix-style competition is that it does not provide the instant feedback of the "leaderboard," where individuals observe performance improvements in real time. However, we can make our mechanism online with a very simple modification of the CLM protocol, which we sketch here. Rather than making payouts in one large batch at the end, the competition designer could perform a mini-payout at the end of each of a sequence of time intervals.
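What makes per-interval settlement possible is that the total squared loss is a sum of per-label terms, so the improvement a wager brings splits exactly across labels. A small sketch of this decomposition (our own, with hypothetical function names):

```python
def sq_loss(w, y):
    """Total squared loss L(w; y) = sum_k (w[k] - y[k])^2."""
    return sum((wk - yk) ** 2 for wk, yk in zip(w, y))

def label_improvement(w_old, w_new, y, k):
    """Per-label contribution of a wager (w_old -> w_new) to L(w_old; y) - L(w_new; y)."""
    return (w_old[k] - y[k]) ** 2 - (w_new[k] - y[k]) ** 2

w0 = [0.0, 0.5, 1.0]   # posted predictions before the wager
w1 = [0.2, 0.4, 0.9]   # predictions after the wager
y  = [0.1, 0.6, 0.8]   # true labels
total = sq_loss(w0, y) - sq_loss(w1, y)
# The improvement splits exactly across labels, so payouts can be settled
# on any frozen subset of labels without waiting for the rest.
assert abs(total - sum(label_improvement(w0, w1, y, k) for k in range(3))) < 1e-12
```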
At each interval, the designer could select a (potentially random) subset S of user/movie pairs in the remaining test set, freeze updates on the predictions w(k) for all k ∈ S, and perform payouts to the participants on only these labels. What makes this possible, of course, is that the generalized scoring rule we chose decomposes as a sum over the individual labels.

Acknowledgments. We gratefully acknowledge the support of the NSF under award DMS-0830410, a Google University Research Award, and the National Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a.

References
[1] J. Abernethy, Y. Chen, and J. Wortman Vaughan. An optimization-based framework for automated market-making. In Proceedings of the 12th ACM Conference on Electronic Commerce, 2011.
[2] J. E. Berg, R. Forsythe, F. D. Nelson, and T. A. Rietz. Results from a dozen years of election futures markets research. In C. A. Plott and V. Smith, editors, Handbook of Experimental Economic Results. 2001.
[3] Y. Chen and D. M. Pennock. A utility framework for bounded-loss market makers. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, 2007.
[4] Y. Chen and J. Wortman Vaughan. A new understanding of prediction markets via no-regret learning. In Proceedings of the 11th ACM Conference on Electronic Commerce, 2010.
[5] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.
[6] T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
[7] R. Hanson. Combinatorial information market design. Information Systems Frontiers, 5(1):105–119, 2003.
[8] R. Hanson.
Logarithmic market scoring rules for modular combinatorial information aggregation. Journal of Prediction Markets, 1(1):3–15, 2007.
[9] R. Hanson, R. Oprea, and D. Porter. Information aggregation and manipulation in an experimental market. Journal of Economic Behavior & Organization, 60(4):449–459, 2006.
[10] N. Lay and A. Barbu. Supervised aggregation of classifiers using artificial prediction markets. In ICML, pages 591–598, 2010.
[11] J. Ledyard, R. Hanson, and T. Ishikida. An experimental test of combinatorial information markets. Journal of Economic Behavior and Organization, 69:182–189, 2009.
[12] J. Wolfers and E. Zitzewitz. Prediction markets. Journal of Economic Perspectives, 18(2):107–126, 2004.