{"title": "A Bandit Framework for Strategic Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 1821, "page_last": 1829, "abstract": "We consider a learner's problem of acquiring data dynamically for training a regression model, where the training data are collected from strategic data sources. A fundamental challenge is to incentivize data holders to exert effort to improve the quality of their reported data, even though the quality is not directly verifiable by the learner. In this work, we study a dynamic data acquisition process where data holders can contribute multiple times. Using a bandit framework, we leverage the long-term incentive of future job opportunities to incentivize high-quality contributions. We propose a Strategic Regression-Upper Confidence Bound (SR-UCB) framework, a UCB-style index combined with a simple payment rule, where the index of a worker approximates the quality of his past contributions and is used by the learner to determine whether the worker receives future work. For linear regression and a certain family of non-linear regression problems, we show that SR-UCB enables an $O(\\sqrt{\\log T/T})$-Bayesian Nash Equilibrium (BNE) in which each worker exerts a target effort level chosen by the learner, with $T$ being the number of data acquisition stages. The SR-UCB framework also has some other desirable properties: (1) The indexes can be updated in an online fashion (hence computationally light).
(2) A slight variant, namely Private SR-UCB (PSR-UCB), is able to preserve $(O(\\log^{-1} T), O(\\log^{-1} T))$-differential privacy for workers' data, with only a small compromise on incentives (achieving $O(\\log^{6} T/\\sqrt{T})$-BNE).", "full_text": "A Bandit Framework for Strategic Regression\n\nYang Liu and Yiling Chen\n\nSchool of Engineering and Applied Science, Harvard University\n\n{yangl,yiling}@seas.harvard.edu\n\nAbstract\n\nWe consider a learner's problem of acquiring data dynamically for training a regression model, where the training data are collected from strategic data sources. A fundamental challenge is to incentivize data holders to exert effort to improve the quality of their reported data, even though the quality is not directly verifiable by the learner. In this work, we study a dynamic data acquisition process where data holders can contribute multiple times. Using a bandit framework, we leverage the long-term incentive of future job opportunities to incentivize high-quality contributions. We propose a Strategic Regression-Upper Confidence Bound (SR-UCB) framework, a UCB-style index combined with a simple payment rule, where the index of a worker approximates the quality of his past contributions and is used by the learner to determine whether the worker receives future work. For linear regression and a certain family of non-linear regression problems, we show that SR-UCB enables an $O(\\sqrt{\\log T/T})$-Bayesian Nash Equilibrium (BNE) in which each worker exerts a target effort level chosen by the learner, with T being the number of data acquisition stages. The SR-UCB framework also has some other desirable properties: (1) The indexes can be updated in an online fashion (hence computation is light). (2) A slight variant, namely Private SR-UCB (PSR-UCB), is able to preserve $(O(\\log^{-1} T), O(\\log^{-1} T))$-differential privacy for workers' data, with only a small compromise on incentives (each worker exerting a target effort level is an $O(\\log^{6} T/\\sqrt{T})$-BNE).\n\n1 Introduction\n\nMore and more data for machine learning nowadays are acquired from distributed, unmonitored and strategic data sources, and the quality of these collected data is often unverifiable. For example, in a crowdsourcing market, a data requester can pay crowd workers to label samples. While this approach has been widely adopted, crowdsourced labels have been shown to degrade learning performance significantly, see e.g., [19], due to the low quality of the data. How to incentivize workers to contribute high-quality data is hence a fundamental question that is crucial to the long-term viability of this approach.\n\nRecent works [2,4,10] have considered incentivizing data contributions for the purpose of estimating a regression model. For example, Cai et al. [2] design payment rules so that workers are incentivized to exert effort to improve the quality of their contributed data, while Cummings et al. [4] design mechanisms to compensate privacy-sensitive workers for their privacy loss when contributing their data. These studies focus on a static data acquisition process, only considering one-time data acquisition from each worker. Hence, the incentives rely completely on the payment rule. However, in stable crowdsourcing markets, workers return to receive additional work. Future job opportunities are thus another dimension of incentives that can be leveraged to motivate high-quality data contributions. 
In this paper, we study dynamic data acquisition from strategic agents for regression problems and explore the use of future job opportunities to incentivize effort exertion.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn our setting, a learner has access to a pool of workers and in each round decides which workers to ask for data. We propose a Multi-armed Bandit (MAB) framework, called Strategic Regression-Upper Confidence Bound (SR-UCB), that combines a UCB-style index rule with a simple per-round payment rule to align the incentives of data acquisition with the learning objective. Intuitively, each worker is an arm and has an index associated with him that measures the quality of his past contributions. The indexes are used by the learner to select workers in the next round. While the MAB framework is natural for modeling selection problems with data contributors of potentially varying qualities, our setting has two challenges that are distinct from classical bandit settings. First, after a worker contributes his data, there is no ground-truth observation to evaluate how well the worker performs (the reward, as it is commonly referred to in a MAB setting). Second, a worker's performance is a result of his strategic decision (e.g., how much effort he exerts), instead of being purely exogenously determined. Our SR-UCB framework overcomes the first challenge by evaluating the quality of an agent's contributed data against an estimator trained on data provided by all other agents to obtain an unbiased estimate of the quality, an idea inspired by the peer prediction literature [11, 16]. To address the second challenge, our SR-UCB framework enables a game-theoretic equilibrium with workers exerting target effort levels chosen by the learner. 
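To make the peer-evaluation idea concrete, here is a minimal, hypothetical sketch (not from the paper: the data sizes, noise levels, and the helper name brier_like_score are all our own illustrative assumptions). It fits an ordinary-least-squares model on the other workers' data and scores worker i's labels with a Brier-style rule, so that higher effort (lower label noise) earns a higher score on average:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200
theta = rng.normal(size=d)  # unknown linear model the labels follow

def worker_labels(X, noise_sd):
    # Effort-sensitive labels: higher effort -> smaller noise_sd.
    return X @ theta + rng.normal(scale=noise_sd, size=len(X))

X_i = rng.normal(size=(n, d))            # tasks assigned to worker i
X_others = rng.normal(size=(5 * n, d))   # tasks labeled by everyone else
y_others = worker_labels(X_others, noise_sd=0.1)

# Estimator trained without worker i's data (ordinary least squares).
theta_minus_i, *_ = np.linalg.lstsq(X_others, y_others, rcond=None)

def brier_like_score(y_reported, y_predicted, a=1.0, b=0.05):
    # Brier-style score a - b * (prediction - report)^2, averaged over tasks.
    return np.mean(a - b * (y_predicted - y_reported) ** 2)

pred = X_i @ theta_minus_i
high_effort = brier_like_score(worker_labels(X_i, noise_sd=0.1), pred)
low_effort = brier_like_score(worker_labels(X_i, noise_sd=1.0), pred)
assert high_effort > low_effort  # more effort -> higher peer-consistency score
```

In SR-UCB such scores are averaged into a per-worker index, so the gap between the two scores is what future selection, and hence payment, keys on.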
More specifically, in addition to proposing the SR-UCB framework, our contributions include:\n• We show that SR-UCB helps simplify the design of payment, and successfully incentivizes effort exertion for acquiring data for linear regression. Every worker exerting a targeted effort level (for labeling and reporting the data) is an $O(\\sqrt{\\log T/T})$-Bayesian Nash Equilibrium (BNE). We can also extend the above results to a certain family of non-linear regression problems.\n• SR-UCB indexes can be maintained in an online fashion, hence are computationally light.\n• We extend SR-UCB to Private SR-UCB (PSR-UCB) to further provide privacy guarantees, with a small compromise on incentives. PSR-UCB is $(O(\\log^{-1} T), O(\\log^{-1} T))$-differentially private, and every worker exerting the targeted effort level is an $O(\\log^{6} T/\\sqrt{T})$-BNE.\n\n2 Related work\n\nRecent works have formulated various strategic learning settings under different objectives [2,4,10,20]. Among these, payment-based solutions are proposed for regression problems when data come from workers who are either effort sensitive [2] or privacy sensitive [4]. These solutions induce game-theoretic equilibria where high-quality data are contributed. The basic idea for designing the payment rules is inspired by the more mature literature on proper scoring rules [8] and peer prediction [16]. Both [2] and [4] consider a static data acquisition procedure, while our work focuses on a dynamic data acquisition process. Leveraging the long-term incentive of future job opportunities, our work has a much simpler payment rule than those of [2] and [4] and relaxes some of the restrictions on the learning objectives (e.g., well-behavedness in [2]), at the cost of a weaker equilibrium concept (approximate BNE in this work vs. dominant strategy in [2]).\n\nMulti-armed Bandit (MAB) is a sequential decision making and learning framework that has been extensively studied; it is nearly impossible to survey the entire bandit literature. The seminal work by Lai et al. [13] derived lower and upper bounds on the asymptotic regret of bandit selection. More recently, finite-time algorithms have been developed for i.i.d. bandits [1]. Different from the classical settings, this work needs to deal with challenges such as no ground-truth observations for bandits and bandits' rewards being strategically determined. A few recent works [7, 15] also considered bandit settings with strategic arms. Our work differs from these in that we consider a regression learning setting without ground-truth observations, and we consider long-term workers whose decisions on reporting data can change over time.\n\nOur work and motivations have some resemblance to online contract design problems for a principal-agent model [9]. But unlike the online contract design problems, our learner cannot verify the quality of finished work after each task assignment. In addition, instead of focusing on learning the optimal contract, we use bandits mainly to maintain a long-term incentive for inducing high-quality data.\n\n3 Formulation\n\nThe learner observes a set of feature data X for training. To make our analysis tractable, we assume each $x \\in X$ is sampled uniformly from a unit ball with dimension d: $x \\in \\mathbb{R}^d$ s.t. $||x||_2 \\le 1$. Each x is associated with a ground-truth response (or label) y(x), which cannot be observed directly by the learner. Suppose x and y(x) are related through a function $f: \\mathbb{R}^d \\to \\mathbb{R}$ such that $y(x) = f(x) + z$, where z is a zero-mean noise with variance $\\sigma_z$, independent of x. For example, for linear regression $f(x) = \\theta^T x$ for some $\\theta \\in \\mathbb{R}^d$. The learner would like to learn a good estimate $\\tilde{f}$ of f. 
For the purpose of training, the learner needs to figure out y(x) for different $x \\in X$. To obtain an estimate $\\tilde{y}(x)$ of y(x), the learner assigns each x to a selected worker to obtain a label.\n\nAgent model: Suppose we have a set of workers $U = \\{1,2,...,N\\}$ with $N \\ge 2$. After receiving the labeling task, each worker will decide on the effort level e he wants to exert to generate an outcome: higher effort leads to a better outcome, but is also associated with a higher cost. We assume e has bounded support $[0, \\bar{e}]$ for every worker $i \\in U$. When deciding on an effort level, a worker wants to maximize his expected payment minus his cost for effort exertion. The resulting label $\\tilde{y}(x)$ will be given back to the learner. Denote by $\\tilde{y}_i(x,e)$ the label returned by worker i for data instance x (if assigned) with chosen effort level e. We consider the following effort-sensitive agent model: $\\tilde{y}_i(x,e) = f(x) + z + z_i(e)$, where $z_i(e)$ is a zero-mean noise with variance $\\sigma_i(e)$. $\\sigma_i(e)$ can be different for different workers, and $\\sigma_i(e)$ decreases in e, $\\forall i$. The z and $z_i$'s have bounded support such that $|z|, |z_i| \\le Z$, $\\forall i$. Without loss of generality, we assume that the cost for exerting effort e is simply e for every worker.\n\nLearner's objective: Suppose the learner wants to learn f with the set of samples X. Then the learner finds effort levels $e^*$ for the data points in X such that\n\n$e^* \\in \\arg\\min_{\\{e(x)\\}_{x \\in X}} \\mathrm{ERROR}(\\tilde{f}(\\{x, \\tilde{y}(x,e(x))\\}_{x \\in X})) + \\lambda \\cdot \\mathrm{PAYMENT}(\\{e(x)\\}_{x \\in X})$,\n\nwhere e(x) is the effort level for sample x, and $\\{\\tilde{y}(x,e(x))\\}_{x \\in X}$ is the set of labeled responses for the training data X. $\\tilde{f}(\\cdot)$ is the regression model trained over these data. The learner assigns the data and pays appropriately to induce the corresponding effort level $e^*$. This formulation resembles the one presented in [2]. 
The ERROR term captures the expected error of the model trained on the collected data (e.g., measured in squared loss), while the PAYMENT term captures the total expected budget that the learner spends to receive the labels. This payment quantity depends on the mechanism that the learner chooses to use and is the expected payment of the mechanism to induce the selected effort level for each data point $\\{e(x)\\}_{x \\in X}$. $\\lambda > 0$ is a constant weighting factor. It is clear that the objective function depends on the $\\sigma_i$'s. We assume for now that the learner knows the $\\sigma_i(\\cdot)$'s,1 and the optimal $e^*$ can be computed.\n\n4 Strategic Regression-UCB (SR-UCB): A general template\n\nWe propose SR-UCB for solving the dynamic data acquisition problem. SR-UCB adopts a bandit setting, where we borrow the idea of the classical UCB algorithm [1], which maintains an index for each arm (worker in our setting), balancing exploration and exploitation. While a bandit framework is not necessarily the best solution for our dynamic data acquisition problem, it is a promising option for the following reasons. First, as utility maximizers, workers would like to be assigned tasks as long as the marginal gain for taking a task is positive. A bandit algorithm can help execute the assignment process. Second, carefully designed indexes can potentially reflect the amount of effort exerted by the agents. 
Third, because the arm selection (of bandit algorithms) is based on the indexes of workers, it introduces competition among workers for improving their indexes.\n\nSR-UCB contains the following two critical components:\n\nPer-round payment: For each worker i, once selected to label a sample x, we will assign a base payment $p_i = e_i + \\gamma$,2 after he reports the labeling outcome, where $e_i$ is the desired effort level that we would like to induce from worker i (for simplicity we have assumed that the cost for exerting effort $e_i$ equals the effort level), and $\\gamma > 0$ is a small quantity. The design of this base payment is to ensure that once selected, a worker's base cost will be covered. Note the above payment depends on neither the assigned data instance x nor the reported outcome $\\tilde{y}$. Therefore such a payment procedure can be pre-defined after the learner sets a target effort level.\n\n1This assumption can be relaxed. See our supplementary materials for the case with homogeneous $\\sigma$.\n2We assume workers have knowledge of how the mechanism sets up this $\\gamma$.\n\nAssignment: The learner assigns multiple tasks $\\{x_i(t)\\}_{i \\in d(t)}$ at time t, with d(t) denoting the set of workers selected at t. Denote by $e_i(t)$ the effort level worker i exerts for $x_i(t)$, if $i \\in d(t)$. Note all $\\{x_i(t)\\}_{i \\in d(t)}$ are different tasks, and each of them is assigned to exactly one worker. The selection of workers will depend on the notion of indexes. Details are given in Algorithm 1.\n\nAlgorithm 1 SR-UCB: Worker index & selection\n\nStep 1. For each worker i, first train the estimator $\\tilde{f}_{-i,t}$ using data $\\{x_j(n) : 1 \\le n \\le t-1, j \\in d(n), j \\ne i\\}$, that is, using the data collected from workers $j \\ne i$ up to time $t-1$. When t = 1, we initialize by sampling each worker at least once such that $\\tilde{f}_{-i,t}$ can be computed.\n\nStep 2. Then compute the following index for worker i at time t:\n\n$I_i(t) = \\frac{1}{n_i(t)} \\sum_{n=1}^{t} \\mathbb{1}(i \\in d(n)) \\left[ a - b \\left( \\tilde{f}_{-i,t}(x_i(n)) - \\tilde{y}_i(n, e_i(n)) \\right)^2 \\right] + c \\sqrt{\\frac{\\log t}{n_i(t)}}$,\n\nwhere $n_i(t)$ is the number of times worker i has been selected up to time t, a and b are two positive constants for \"scoring\", and c is a normalization constant. $\\tilde{y}_i(n, e_i(n))$ is the corresponding label for task $x_i(n)$ with effort level $e_i(n)$, if $i \\in d(n)$.\n\nStep 3. Based on the above index, we select d(t) at time t such that $d(t) := \\{j : I_j(t) \\ge \\max_i I_i(t) - \\tau(t)\\}$, where $\\tau(t)$ is a perturbation term decreasing in t.\n\nSome remarks on SR-UCB: (1) Different from the classical bandit setting, when calculating the indexes there is no ground-truth observation for evaluating the performance of each worker. Therefore we adopt the notion of a scoring rule [8]. In particular, the one we use above is the well-known Brier scoring rule: $B(p,q) = a - b(p-q)^2$. (2) The scoring-rule-based index looks similar to the payment rules studied in [2, 4]. But as we will show later, under our framework the selection of a, b is much less sensitive to different problem settings, as with an index policy only the relative values (ranking) matter. This is another benefit of separating payment from selection. (3) Instead of only selecting the best worker with the highest index, we select workers whose indexes are within a certain range of the maximum one (a confidence region). This is because workers may have competing expertise levels, and hence selecting only one of them would de-incentivize workers' effort exertion.\n\n4.1 Solution concept\n\nDenote by $e(n) := \\{e_1(n), ..., e_N(n)\\}$, and $e_{-i}(n) = \\{e_j(n)\\}_{j \\ne i}$. We define approximate Bayesian Nash Equilibrium as our solution concept:\n\nDefinition 1. Suppose SR-UCB runs for T stages. 
$\\{e_i(t)\\}_{i=1,t=1}^{N,T}$ is a $\\pi$-BNE if $\\forall i, \\{\\tilde{e}_i(t)\\}_{t=1}^{T}$:\n\n$\\frac{1}{T} \\sum_{t=1}^{T} E\\left[(p_i - e_i(t)) \\mathbb{1}(i \\in d(t)) \\mid \\{e(n)\\}_{n \\le t}\\right] \\ge \\frac{1}{T} \\sum_{t=1}^{T} E\\left[(p_i - \\tilde{e}_i(t)) \\mathbb{1}(i \\in d(t)) \\mid \\{\\tilde{e}_i(n), e_{-i}(n)\\}_{n \\le t}\\right] - \\pi.$\n\nThis is to say that by deviating, each worker will gain no more than $\\pi$ net payment per round. We will establish our main results in terms of $\\pi$-BNE. The reason we adopt such a notion is that in a sequential setting it is generally hard to achieve a strict BNE or other stronger notions, as any one-step deviation may not affect a long-term evaluation by much.3 Approximate BNE is likely the best solution concept we can hope for.\n\n3Certainly, we can run mechanisms that induce BNE or dominant-strategy equilibrium for a one-shot setting, e.g., [2], at every time step. But such a solution does not incorporate long-term incentives.\n\n5 Linear regression\n\n5.1 Settings and a warm-up scenario\n\nIn this section we present our results for a simple linear regression task where the feature x and observation y are linearly related via an unknown $\\theta$: $y(x) = \\theta^T x + z$, $\\forall x \\in X$. Let's start by assuming all workers are statistically identical, such that $\\sigma_1 = \\sigma_2 = ... = \\sigma_N$. This is an easier case that serves as a warm-up. It is known that given training data, we can find an estimate $\\tilde{\\theta}$ that minimizes a non-regularized empirical risk function: $\\tilde{\\theta} = \\arg\\min_{\\hat{\\theta} \\in \\mathbb{R}^d} \\sum_{x \\in X} (y(x) - \\hat{\\theta}^T x)^2$ (linear least squares). To put this model into SR-UCB, denote by $\\tilde{\\theta}_{-i}(t)$ the linear least squares estimator trained using data from workers $j \\ne i$ up to time $t-1$. 
The index is then $I_i(t) := S_i(t) + c\\sqrt{\\log t / n_i(t)}$, with\n\n$S_i(t) := \\frac{1}{n_i(t)} \\sum_{n=1}^{t-1} \\mathbb{1}(i \\in d(n)) \\left[ a - b \\left( \\tilde{\\theta}_{-i}^T(t) x_i(n) - \\tilde{y}_i(n, e_i(n)) \\right)^2 \\right].$ (5.1)\n\nSuppose $||\\theta||_2 \\le M$. Given $||x||_2 \\le 1$ and $|z|, |z_i| \\le Z$, we can then prove that $\\forall t, n, i$, $(\\tilde{\\theta}_{-i}^T(t) x_i(n) - \\tilde{y}_i(n, e_i(n)))^2 \\le 8M^2 + 2Z^2$. Choosing a, b such that $a - (8M^2 + 2Z^2)b \\ge 0$, we have $0 \\le S_i(t) \\le a$, $\\forall i, t$. For the perturbation term, we set $\\tau(t) := O(\\sqrt{\\log t / t})$. The intuition is that with t samples, the uncertainties in the indexes, coming from both the score calculation and the bias term, can be upper bounded at the order of $O(\\sqrt{\\log t / t})$. Thus, to not miss a competitive worker, we set the tolerance to be at the same order.\n\nWe now develop the formal equilibrium result of SR-UCB for linear least squares. Our analysis requires the following assumption on the smoothness of $\\sigma$.\n\nAssumption 1. We assume $\\sigma(e)$ is convex on $e \\in [0, \\bar{e}]$, with the gradient $\\sigma'(e)$ both upper bounded and lower bounded away from 0, i.e., $\\bar{L} \\ge |\\sigma'(e)| \\ge \\underline{L} > 0$, $\\forall e$.\n\nThe learner wants to learn f with a total of NT ($= |X|$, or $\\lceil NT \\rceil = |X|$) samples. Since workers are statistically equivalent, ideally the learner would like to run SR-UCB for T steps and collect a label for a unique sample from each worker at each step. 
Hence, the learner would like to elicit a single target effort level $e^*$ from all workers and for all samples:\n\n$e^* \\in \\arg\\min_e E_{x,y,\\tilde{y}} \\left[ \\tilde{\\theta}^T(\\{x_i(n), \\tilde{y}_i(n,e)\\}_{i=1,n=1}^{N,T}) \\cdot x - y \\right]^2 + \\lambda \\cdot (e + \\gamma) NT.$ (5.2)\n\nDue to the uncertainty in worker selection, it is highly likely that after step T there will be tasks left unlabelled. We can let the mechanism run for extra steps to complete the labelling of these tasks. But due to the bounded number of missed selections, as we will show later, stopping at step T won't affect the accuracy of the trained model.\n\nTheorem 1. Under SR-UCB for linear least squares, set the fixed payment $p_i = e^* + \\gamma$ for all i, where $\\gamma = \\Omega(\\sqrt{\\log T/T})$, choose c to be a large enough constant, $c \\ge \\mathrm{Const.}(M,Z,N,b)$, and let $\\tau(t) := O(\\sqrt{\\log t/t})$. Workers have full knowledge of the mechanism and the values of the parameters. Then at an $O(\\sqrt{\\log T/T})$-BNE, workers, whenever selected, exert effort $e_i(t) \\equiv e^*$ for all i and t.\n\nThe net payment (payment minus the cost of effort) per task can be made arbitrarily small by setting $\\gamma$ exactly on the order of $O(\\sqrt{\\log T/T})$, so that $p_i - e^* = \\gamma = O(\\sqrt{\\log T/T}) \\to 0$ as $T \\to \\infty$.\n\nOur solution relies heavily on forming a race among workers. By establishing the convergence of the bandit indexes to a function of effort (via $\\sigma(\\cdot)$), we show that when the other workers $j \\ne i$ follow the equilibrium strategy, worker i will be selected w.h.p. at each round if he also puts in the same amount of effort. On the other hand, if worker i shirks from doing so by as much as $O(\\sqrt{\\log T/T})$, his number of selections will go down in order. This establishes the $\\pi$-BNE. As long as there exists one competitive worker, all others will be incentivized to exert effort. Though, as will be shown in the next section, all workers shirking from exerting effort is also an $O(\\sqrt{\\log T/T})$-BNE. This equilibrium can be removed by adding some uncertainty on top of the bandit selection procedure. When there are $\\ge 2$ workers selected in SR-UCB, each of them will be assigned a task with a certain probability $0 < p_s < 1$, while when there is a single selected worker, the worker is assigned a task w.p. 1. Set $p_s := 1 - O(\\sqrt{\\log T/T}/\\gamma)$. So with probability $1 - p_s = O(\\sqrt{\\log T/T}/\\gamma)$, even the \"winning\" workers will miss the selection. With this change, exerting $e^*$ still forms an $O(\\sqrt{\\log T/T})$-BNE, while every worker exerting any effort level that is $\\Delta e > O(\\gamma)$ lower than the target effort level is not a $\\pi$-BNE with $\\pi \\le O(\\sqrt{\\log T/T})$.\n\n5.2 Linear regression with different $\\sigma$\n\nNow we consider the more realistic case where different workers have different noise-effort functions $\\sigma$. W.l.o.g., we assume $\\sigma_1(e) < \\sigma_2(e) < ... < \\sigma_N(e)$, $\\forall e$.4 In such a setting, ideally we would always like to collect data from worker 1, since he has the best expertise level (lowest variance in labeling noise). Suppose we are targeting an effort level $e_1^*$ from data source 1 (the best data source). We first argue that we also need to incentivize worker 2 to exert a competitive effort level $e_2^*$ such that $\\sigma_1(e_1^*) = \\sigma_2(e_2^*)$, and we assume such an $e_2^*$ exists.5 This also naturally implies that $e_2^* > e_1^*$, as worker 1 contributes data with less variance in noise at the same effort level. The reason is similar to the homogeneous setting: over time, workers form a competition on $\\sigma_i(e_i)$. 
Having a competitive peer will motivate a worker to exert as much effort as he can (up to the payment). Therefore the goal for such a learner (with 2T samples to assign) is to find an effort level $e^*$ such that6\n\n$e^* \\in \\arg\\min_{e_2: \\sigma_1(e_1) = \\sigma_2(e_2)} E_{x,y,\\tilde{y}} \\left[ \\tilde{\\theta}^T(\\{x_i(n), \\tilde{y}_i(n, e_i)\\}_{i=1,n=1}^{2,T}) x - y \\right]^2 + \\lambda \\cdot (e_2 + \\gamma) 2T.$\n\nSet the one-step payment to be $p_i = e^* + \\gamma$, $\\forall i$. Let $e_1^*$ be the solution to $\\sigma_1(e_1^*) = \\sigma_2(e^*)$ and let $e_i^* = e^*$ for $i \\ge 2$. Note for $i > 2$ we have $\\sigma_i(e_i^*) - \\sigma_1(e_1^*) > 0$. While we have argued the necessity of choosing the top two most competitive workers, we have not mentioned the optimality of doing so. In fact, selecting the top two is the best we can do. Suppose, on the contrary, that the optimal solution is to select the top $k > 2$ workers, at effort level $e_k$. According to our solution, we target the effort level that leads to noise variance $\\sigma_k(e_k)$ (so the least competitive worker will be incentivized). Then we can simply target the same effort level $e_k$, but migrate the task load to only the top two workers; this keeps the payment the same, but the noise variance now becomes $\\sigma_2(e_k) < \\sigma_k(e_k)$, which leads to better performance. Denote $\\Delta_1 := \\sigma_3(e^*) - \\sigma_1(e_1^*) > 0$ and assume Assumption 1 applies to all $\\sigma_i$'s. We prove:\n\nTheorem 2. Under SR-UCB for linear least squares, set $c \\ge \\mathrm{Const.}(M, Z, b, \\Delta_1)$, $\\Omega(\\sqrt{\\log T/T}) = \\gamma \\le \\frac{\\Delta_1}{2\\bar{L}}$, and $\\tau(t) := O(\\sqrt{\\log t/t})$. Then each worker i exerting effort $e_i^*$ once selected forms an $O(\\sqrt{\\log T/T})$-BNE.\n\nPerformance with acquired data: If workers follow the $\\pi$-BNE, the contributed data from the top two workers (who have been selected the most number of times) will have the same variance $\\sigma_1(e_1^*)$. Then, following results in [4], w.h.p. the performance of the trained model is bounded by $O(\\sigma_1(e_1^*)/(\\sum_{i=1,2} n_i(T))^2)$. Ideally we want to have $\\sum_{i=1,2} n_i(T) = 2T$, such that an upper bound of $O(\\sigma_1(e_1^*)/(2T)^2)$ can be achieved. Compared to the bound $O(\\sigma_1(e_1^*)/(2T)^2)$, SR-UCB's expected performance loss (due to missed sampling and wrong selection, which is bounded at the order of $O(\\log T)$) is bounded by $E[\\sigma_1(e_1^*)/(\\sum_{i=1,2} n_i(T))^2 - \\sigma_1(e_1^*)/(2T)^2] \\le O(\\sigma_1(e_1^*) \\log T/T^3)$ w.h.p.\n\nRegularized linear regression: The ridge estimator has been widely adopted for solving linear regression. The objective is to find a linear model $\\tilde{\\theta}$ that minimizes the following regularized empirical risk: $\\tilde{\\theta} = \\arg\\min_{\\hat{\\theta} \\in \\mathbb{R}^d} \\sum_{x \\in X} (y(x) - \\hat{\\theta}^T x)^2 + \\rho ||\\hat{\\theta}||_2^2$, with $\\rho > 0$ being the regularization parameter. We claim that by simply changing $\\tilde{f}_{-i,t}(\\cdot)$ in SR-UCB to the output of the above ridge regression, the $O(\\sqrt{\\log T/T})$-BNE for inducing an effort level $e^*$ still holds. Different from the non-regularized case, the introduction of the regularization term adds bias to $\\tilde{\\theta}_{-i}(t)$, which gives a biased evaluation of the indexes. However, we prove the convergence of $\\tilde{\\theta}_{-i}(t)$ (so again the indexes converge properly) in the following lemma, which enables an easy adaptation of our previous results for the non-regularized case to ridge regression:\n\nLemma 1. With n i.i.d. samples, w.p. $\\ge 1 - e^{-Kn}$ (K > 0 is a constant), $||\\tilde{\\theta}_{-i}(t) - \\theta||_2^2 \\le O(\\frac{1}{n^2})$.\n\nNon-linear regression: The basic idea for extending the results to non-linear regression is inspired by the consistency results on M-estimators [14], when the error of the training data has zero mean. Similar to the reasoning for Lemma 1, if $(\\tilde{f}_{-i,t}(x) - f(x))^2 \\to 0$, we can hope for an easy adaptation of our previous results. Suppose the non-linear regression model can be characterized by a parameter family $\\Theta$, where f is characterized by parameter $\\theta$, and $\\tilde{f}_{-i,t}$ by $\\tilde{\\theta}_i(t)$. Due to the consistency of the M-estimator we will have $||\\tilde{\\theta}_i(t) - \\theta||_2 \\to 0$. More specifically, according to the results from [18], for the non-linear regression model we can establish an $O(1/\\sqrt{n})$ convergence rate with n training samples. When f is Lipschitz in parameter space, i.e., there exists a constant $L_N > 0$ such that $|\\tilde{f}_{-i,t}(x) - f(x)| \\le L_N ||\\tilde{\\theta}_i(t) - \\theta||_2$, by the dominated convergence theorem we also have $(\\tilde{f}_{-i,t}(x) - f(x))^2 \\to 0$, and $(\\tilde{f}_{-i,t}(x) - f(x))^2 \\le O(1/t)$. The rest of the proof can then follow.\n\n4Combining with the results for homogeneous workers, we can again easily extend our results to the case where there is a mixture of homogeneous and heterogeneous workers.\n5It exists when the supports of $\\sigma_1(\\cdot), \\sigma_2(\\cdot)$ overlap over a large range.\n6Since we only target the top two workers, we can limit the number of acquisitions at each stage to be no more than two, so the number of queries does not go beyond 2T.\n\nExample 1. 
Logistic function $f(x) = \\frac{1}{1+e^{-\\theta^T x}}$ satisfies the Lipschitz condition with $L_N = 1/4$.\n\n6 Computational issues\n\nIn order to update the indexes and select workers adaptively, we face a few computational challenges. First, in order to update the index for each worker at any time t, a new estimator $\\tilde{\\theta}_{-i}(t)$ (using data from all other workers $j \\ne i$ up to time $t-1$) needs to be re-computed. Second, we need to re-apply $\\tilde{\\theta}_{-i}(t)$ to every sample collected from worker i, $\\{(x_i(n), \\tilde{y}_i(n, e_i(n))) : i \\in d(n), n = 1, 2, ..., t-1\\}$, from previous rounds. We propose online variants of SR-UCB to address these challenges.\n\nOnline update of $\\tilde{\\theta}_{-i}(\\cdot)$: Inspired by the online learning literature, instead of re-computing $\\tilde{\\theta}_{-i}(t)$ at each step, which involves re-calculating the inverse of a covariance matrix (e.g., $(\\rho I + X^T X)^{-1}$ for ridge regression) whenever a new sample point arrives, we can update $\\tilde{\\theta}_{-i}(t)$ in an online fashion, which is computationally much more efficient. We demonstrate our results with ridge linear regression. Start with an initial model $\\tilde{\\theta}^{online}_{-i}(1)$. Denote by $(x_{-i}(t), \\tilde{y}_{-i}(t))$ any newly arrived sample at time t from a worker $j \\ne i$. Update $\\tilde{\\theta}^{online}_{-i}(t+1)$ (for computing $I_i(t+1)$) as [17]:\n\n$\\tilde{\\theta}^{online}_{-i}(t+1) := \\tilde{\\theta}^{online}_{-i}(t) - \\eta_t \\cdot \\nabla_{\\theta} \\left[ (\\theta^T x_{-i}(t) - \\tilde{y}_{-i}(t))^2 + \\rho ||\\theta||_2^2 \\right] \\Big|_{\\theta = \\tilde{\\theta}^{online}_{-i}(t)}.$\n\nNotice there could be multiple such data points arriving at each time, in which case we update sequentially in an arbitrary order. It is also possible that no sample point arrives from workers other than i at a time t, in which case we simply do not perform an update. Name this online-updating SR-UCB as OSR1-UCB. 
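As a rough, self-contained illustration of this online update (the step-size schedule, dimension, noise level, and ridge weight below are all our own assumptions, not taken from the paper or [17]), a stochastic-gradient ridge update of this form drives the online iterate toward the underlying linear model without ever inverting a covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, rho = 3, 5000, 0.01          # dimension, rounds, ridge weight (assumed)
theta_true = rng.normal(size=d)    # unknown model the workers' labels follow

theta_online = np.zeros(d)         # initial model, theta_online(1)
for t in range(1, T + 1):
    # One newly arrived sample (x_{-i}(t), y_{-i}(t)) from some worker j != i.
    x = rng.normal(size=d) / np.sqrt(d)
    y = theta_true @ x + rng.normal(scale=0.1)
    eta = 0.5 / np.sqrt(t)         # assumed decaying step size
    # Gradient of (theta^T x - y)^2 + rho * ||theta||_2^2 at the current iterate.
    grad = 2 * (theta_online @ x - y) * x + 2 * rho * theta_online
    theta_online -= eta * grad     # one-sample online update

# The iterate approaches theta_true, up to the small ridge-induced bias
# and residual gradient noise.
assert np.linalg.norm(theta_online - theta_true) < 0.5
```

No matrix inversion is performed at any step, which is exactly the computational saving OSR1-UCB is after; each round costs O(d) instead of O(d^3).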
With online updating, the accuracy of the trained model $\tilde{\theta}^{\text{online}}_{-i}(t+1)$ converges more slowly, and so does the accuracy of the index characterizing a worker's performance. Nevertheless, we prove that exerting the targeted effort level $e^*$ is an $O(\sqrt{\log T / T})$-BNE under OSR1-UCB for ridge regression, using the convergence results for $\tilde{\theta}^{\text{online}}_{-i}(t)$ proved in [17].

Online score update   Online updating can also help compute $S_i(t)$ (in $I_i(t)$) efficiently. Instead of repeatedly re-calculating the score for each data point (in $S_i(t)$), we only score the newly assigned samples that have not been evaluated yet, by replacing $\tilde{\theta}^{\text{online}}_{-i}(t)$ with $\tilde{\theta}^{\text{online}}_{-i}(n)$ in $S_i(t)$:
$$S^{\text{online}}_i(t) := \frac{1}{n_i(t)} \sum_{n=1}^{t} \mathbb{1}(i \in d(n))\big[a - b\big((\tilde{\theta}^{\text{online}}_{-i}(n))^T x_i(n) - \tilde{y}_i(n, e_i(n))\big)^2\big]. \qquad (6.1)$$
With this less aggressive update, the index's accuracy again converges more slowly than before, because older data are scored using an older (and less accurate) version of $\tilde{\theta}^{\text{online}}_{-i}$ and are never re-scored. We propose OSR2-UCB, where we change the SR-UCB index to $S^{\text{online}}_i(t) + c\sqrt{(\log t)^2 / n_i(t)}$ to accommodate the slower convergence. We establish an $O(\log T / \sqrt{T})$-BNE for workers exerting the target effort; the change is due to the change of the bias term.

7 Privacy preserving SR-UCB

In a repeated data acquisition setting, workers' privacy may leak repeatedly through their data. In this section we study an extension of SR-UCB that preserves the privacy of each individual worker's contributed data. Denote the collected training data by $D := \{\tilde{y}_i(t, e_i(t))\}_{i \in d(t), t}$. We quantify privacy using differential privacy [5], and we adopt $(\varepsilon, \delta)$-differential privacy (DP) [6], which for our setting is defined below:

Definition 2.
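The incremental computation of $S^{\text{online}}_i(t)$ amounts to a small accumulator that scores each sample once, with whatever online model is current when the sample arrives, and never revisits it. The sketch below illustrates this; the class name and the values of the scoring constants $a, b$ and the index constant $c$ are placeholders:

```python
import math

class OnlineScore:
    """Running average of per-sample scores a - b * (prediction - report)^2,
    as in Eqn. (6.1): each sample is scored once on arrival, never re-scored."""

    def __init__(self, a, b):
        self.a, self.b = a, b
        self.total = 0.0   # running sum of scores
        self.count = 0     # n_i(t): number of samples scored so far

    def score_new_sample(self, theta_online, x, y_report):
        # Score (x, y_report) using the online model available *now*.
        pred = sum(w * v for w, v in zip(theta_online, x))
        self.total += self.a - self.b * (pred - y_report) ** 2
        self.count += 1

    def value(self):
        return self.total / self.count if self.count else 0.0

    def osr2_index(self, t, c):
        # OSR2-UCB index: score plus the enlarged bias c * sqrt((log t)^2 / n_i(t)).
        return self.value() + c * math.sqrt(math.log(t) ** 2 / self.count)

s = OnlineScore(a=1.0, b=1.0)
s.score_new_sample([1.0, 0.0], [1.0, 0.0], 1.0)   # perfect report: score 1.0
s.score_new_sample([1.0, 0.0], [0.0, 1.0], 0.5)   # residual 0.5: score 0.75
```

Updating the index for a worker is thus $O(d)$ per newly assigned sample, at the price of older samples being scored with older models, which is exactly what the enlarged bias term compensates for.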
A mechanism $M : (X \times \mathbb{R})^{|D|} \to O$ is $(\varepsilon, \delta)$-differentially private if for any $i \in d(t)$, any $t$, any two distinct $\tilde{y}_i(t, e_i(t)), \tilde{y}'_i(t, e'_i(t))$, and every subset of possible outputs $S \subseteq O$, $\Pr[M(D) \in S] \leq \exp(\varepsilon) \Pr[M(D \setminus \{\tilde{y}_i(t, e_i(t))\}, \tilde{y}'_i(t, e'_i(t))) \in S] + \delta$.

An outcome $o \in O$ of a mechanism contains two parts, both of which can contribute to privacy leakage. (1) The learned regression model $\tilde{\theta}(T)$, which is trained using all data collected after $T$ rounds. Suppose that after the regression model $\tilde{\theta}(T)$ is learned, it is released for public use or monitoring; it contains each individual worker's private information. Note this is a one-shot privacy leak (published at the end of training, step $T$). (2) The indexes can reveal private information. Each worker $i$'s data is used in calculating other workers' indexes $I_j(t)$, $j \neq i$, as well as his own $I_i(t)$, which will be published.7 Note this type of leakage occurs at every step. The lemma below allows us to focus on the privacy losses in $S_j(t)$, instead of $I_j(t)$, as both $I_j(t)$ and $n_i(t)$ are functions of $\{S_j(n)\}_{n \leq t}$.

Lemma 2. At any time $t$, $\forall i$, $n_i(t)$ can be written as a function of $\{S_j(n), n < t\}_j$.

Preserving privacy in $\tilde{\theta}(T)$   To protect privacy in $\tilde{\theta}(T)$, following the standard method [6], we add a Laplacian noise vector $v_\theta$ to it: $\tilde{\theta}_p(T) = \tilde{\theta}(T) + v_\theta$, where $\Pr(v_\theta) \propto \exp(-\varepsilon_\theta \|v_\theta\|_2)$ and $\varepsilon_\theta > 0$ is a parameter controlling the noise level.

Lemma 3. Set $\varepsilon_\theta = 2\sqrt{T}$; then the output $\tilde{\theta}_p(T)$ of SR-UCB for linear regression preserves $(O(T^{-1/2}), \exp(-O(T)))$-DP. Further, w.p.
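A noise vector with density proportional to $\exp(-\varepsilon_\theta \|v\|_2)$ can be sampled by drawing a uniformly random direction and a Gamma-distributed radius (shape $d$, scale $1/\varepsilon_\theta$). This is one standard construction for such noise, not a procedure stated in the paper, and the dimension and model values below are made-up placeholders:

```python
import numpy as np

def sample_l2_laplace(d, eps, rng):
    # Density p(v) proportional to exp(-eps * ||v||_2) on R^d: in spherical
    # coordinates the radius has density proportional to r^(d-1) * exp(-eps * r),
    # i.e., Gamma(shape=d, scale=1/eps); the direction is uniform on the sphere.
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    radius = rng.gamma(shape=d, scale=1.0 / eps)
    return radius * direction

rng = np.random.default_rng(0)
T = 10_000
eps_theta = 2.0 * np.sqrt(T)            # noise level eps_theta = 2 * sqrt(T), as in Lemma 3
theta_hat = np.array([0.5, -1.2, 0.3])  # a learned model (placeholder values)
theta_private = theta_hat + sample_l2_laplace(3, eps_theta, rng)
```

With $\varepsilon_\theta = 2\sqrt{T}$ the expected noise norm is $d/\varepsilon_\theta = O(1/\sqrt{T})$, consistent with the accuracy statement of Lemma 3.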
\u2265 1\u2212 1/T 2, ||\u02dc\u03b8p(T )\u2212 \u02dc\u03b8(T )||2 = ||v\u03b8||2 \u2264 logT /\n\n\u02dc\u03b8p(T ) of SR-UCB for linear regression preserves\n\u221a\nT .\nPreserving privacy in {Ii(t)}i,t: a continual privacy preserving model For indexes {Ii(t)}i, it\nis also tempting to add vi(t) to each index, i.e. Ii(t) := Ii(t) + vi(t), where again vi(t) is a zero-\nmean Laplacian noise. However releasing {Ii(t)}i at each step will release a noisy version of each\n\u02dcyi(n,ei(n)),i \u2208 d(n),\u2200n < t. The composition theory in differential privacy [12] implies that the\npreserved privacy level will grow in time t, unless we add signi\ufb01cant noise on each stage, which\nwill completely destroy the informativeness of our index policy. We borrow the partial sum idea for\ncontinual observations [3]. The idea is when releasing continual data, instead of inserting noise at\nevery step, the current to-be-released data will be decoupled into sum of partial sums, and we only\nadd noise to each partial sum and this noisy version of the partial sums can be re-used repeatedly.\nWe consider adding noise to a modi\ufb01ed version of the online indexes {Sonline\n(t)}i,t as de\ufb01ned in\n\u02dc\u03b8\u2212i(n)/t, where \u02dc\u03b8\u2212i(n) is the regression model we\nEqn. (6.1), with \u02dc\u03b8online\u2212i\nestimated using all data from worker j (cid:54)= i up to time n. For each worker i, his contributed data\nappear in both {Sonline\n(t), j (cid:54)= i, we want to preserve privacy\nin \u2211t\n\u02dc\u03b8\u2212 j(n)/t. Write down t as a binary string and \ufb01nd the\nWe \ufb01rst apply the partial sums idea to \u2211t\nrightmost digit that is a 1, then \ufb02ip that digit to 0: convert is back to decimal gives q(t). Take the\n\u02dc\u03b8\u2212 j(n) as one partial sum. 
Repeat the above for $q(t)$ to get $q(q(t))$ and the second partial sum $\sum_{n=q(q(t))+1}^{q(t)} \tilde{\theta}_{-j}(n)$, and so on until we reach $q(\cdot) = 0$. So
$$\sum_{n=1}^{t} \tilde{\theta}_{-j}(n)/t = \frac{1}{t}\Big(\sum_{n=q(t)+1}^{t} \tilde{\theta}_{-j}(n) + \sum_{n=q(q(t))+1}^{q(t)} \tilde{\theta}_{-j}(n) + \ldots\Big). \qquad (7.1)$$
Add noise $v_{\tilde{\theta}}$ with $\Pr(v_{\tilde{\theta}}) \propto e^{-\varepsilon \|v_{\tilde{\theta}}\|_2}$ to each partial sum. The number of noise terms at time $t$ is bounded by $\lceil \log t \rceil$; so is the number of appearances of each private datum in the partial sums [3]. Denote the noisy version of this average at time $n$ by $\tilde{\tilde{\theta}}^{\text{online}}_{-i}(n)$; each $S^{\text{online}}_i(t)$ is then computed using $\tilde{\tilde{\theta}}^{\text{online}}_{-i}(n)$.

For $S^{\text{online}}_i(t)$, we also want to preserve privacy in $\tilde{y}_i(n, e_i(n))$. Clearly $S^{\text{online}}_i(t)$ can be written as a sum of partial sums of terms involving $\tilde{y}_i(n, e_i(n))$: write $S^{\text{online}}_i(t)$ as a summation $\sum_{n=1}^{n_i(t)} dS(n)/n_i(t)$, short-handing $dS(n) := a - b\big((\tilde{\tilde{\theta}}^{\text{online}}_{-i}(t(n)))^T x_i(t(n)) - \tilde{y}_i(t(n), e_i(t(n)))\big)^2$, where $t(n)$ denotes the time at which worker $i$ is sampled for the $n$-th time. Decouple $S^{\text{online}}_i(t)$ into partial sums using the same technique. For each partial sum, add a noise $v_S$ with distribution $\Pr(v_S) \propto e^{-\varepsilon |v_S|}$.

We then show that with the above two noise-injection procedures, our index policy SR-UCB does not lose its value in incentivizing effort.
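The binary decomposition behind Eqn. (7.1) is easy to state in code: $q(t)$ clears the rightmost 1-bit of $t$, and iterating $q$ yields one partial-sum range per 1-bit of $t$, so each release touches at most $\lceil \log_2 t \rceil + 1$ noisy blocks. A small sketch of this bookkeeping, with hypothetical function names:

```python
def q(t):
    # Flip the rightmost 1 in t's binary representation to 0.
    return t & (t - 1)

def partial_sum_ranges(t):
    """Ranges (q(t)+1, t), (q(q(t))+1, q(t)), ... down to q(.) = 0, as in Eqn. (7.1).
    The sum over [1, t] equals the sum over these disjoint dyadic blocks."""
    ranges = []
    while t > 0:
        ranges.append((q(t) + 1, t))
        t = q(t)
    return ranges

# t = 13 = 1101b: blocks [13, 13], [9, 12], [1, 8] -- one per 1-bit of 13.
print(partial_sum_ranges(13))  # [(13, 13), (9, 12), (1, 8)]
```

Because each round $n$ falls into at most $\lceil \log t \rceil$ blocks across all releases, adding one noise draw per block suffices, instead of fresh noise at every step [3].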
In order to prove similar convergence results, we need to modify SR-UCB by changing the index to the following format:
$$I_i(t) = \hat{S}^{\text{online}}_i(t) + c(\log^3 t \log^3 T)/\sqrt{n_i(t)}, \qquad \tau(t) = O\big((\log^3 t \log^3 T)/\sqrt{t}\big),$$
where $\hat{S}^{\text{online}}_i(t)$ denotes the noisy version of $S^{\text{online}}_i(t)$ with the added noises ($v_S$, $v_{\tilde{\theta}}$, etc.). The change of the bias term is mainly to incorporate the increased uncertainty level due to the added privacy-preserving noise. Denote this mechanism by PSR-UCB; we have:

Theorem 3. Set $\varepsilon := 1/\log^3 T$ for the added noises (both $v_S$ and $v_{\tilde{\theta}}$); then PSR-UCB preserves $(O(\log^{-1} T), O(\log^{-1} T))$-DP for linear regression.

With homogeneous workers, we can similarly prove that exerting efforts $\{e^*_i\}_i$ (the optimal effort levels) is an $O(\log^6 T / \sqrt{T})$-BNE. We can see that, in order to protect privacy in the bandit setting, the approximation term of the BNE is worse than before.

7 It is debatable whether the indexes should be published or not. But revealing decisions on worker selection will also reveal information about the indexes. We consider the more direct scenario: indexes are published.

Acknowledgement: We acknowledge the support of NSF grant CCF-1301976.

References

[1] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[2] Yang Cai, Constantinos Daskalakis, and Christos H. Papadimitriou. Optimum statistical estimation with strategic data sources. arXiv preprint arXiv:1408.2539, 2014.
[3] T-H. Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics.
ACM Transactions on Information and System Security (TISSEC), 14(3):26, 2011.
[4] Rachel Cummings, Stratis Ioannidis, and Katrina Ligett. Truthful linear regression. In Proceedings of the 28th Conference on Learning Theory, COLT 2015, pages 448–483, 2015.
[5] Cynthia Dwork. Differential privacy. In Automata, Languages and Programming (ICALP), 2006.
[6] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
[7] Arpita Ghosh and Patrick Hummel. Learning and incentives in user-generated content: Multi-armed bandits with endogenous arms. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, pages 233–246. ACM, 2013.
[8] Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
[9] Chien-Ju Ho, Aleksandrs Slivkins, and Jennifer Wortman Vaughan. Adaptive contract design for crowdsourcing markets: Bandit algorithms for repeated principal-agent problems. In Proceedings of the Fifteenth ACM Conference on Economics and Computation (EC), pages 359–376. ACM, 2014.
[10] Stratis Ioannidis and Patrick Loiseau. Linear regression as a non-cooperative game. In Web and Internet Economics, pages 277–290. Springer, 2013.
[11] Radu Jurca and Boi Faltings. Collusion-resistant, incentive-compatible feedback payments. In Proceedings of the 8th ACM Conference on Electronic Commerce, pages 200–209. ACM, 2007.
[12] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. arXiv preprint arXiv:1311.0776, 2013.
[13] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[14] Guy Lebanon. M-estimators and Z-estimators. Lecture notes.
[15] Yishay Mansour, Aleksandrs Slivkins, and Vasilis Syrgkanis. Bayesian incentive-compatible bandit exploration.
In Proceedings of the Sixteenth ACM Conference on Economics and Computation (EC), pages 565–582. ACM, 2015.
[16] Nolan Miller, Paul Resnick, and Richard Zeckhauser. Eliciting informative feedback: The peer-prediction method. Management Science, 51(9):1359–1373, 2005.
[17] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.
[18] B. L. S. Prakasa Rao. The rate of convergence of the least squares estimator in a non-linear regression model with dependent errors. Journal of Multivariate Analysis, 1984.
[19] Victor S. Sheng, Foster Provost, and Panagiotis G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008.
[20] Panos Toulis, David C. Parkes, Elery Pfeffer, and James Zou. Incentive-compatible experimental design. In Proceedings of the Sixteenth ACM Conference on Economics and Computation (EC), pages 285–302, 2015.