{"title": "Distinguishing Distributions When Samples Are Strategically Transformed", "book": "Advances in Neural Information Processing Systems", "page_first": 3193, "page_last": 3201, "abstract": "Often, a principal must make a decision based on data provided by an agent.  Moreover, typically, that agent has an interest in the decision that is not perfectly aligned with that of the principal.  Thus, the agent may have an incentive to select from or modify the samples he obtains before sending them to the principal.  In other settings, the principal may not even be able to observe samples directly; instead, she must rely on signals that the agent is able to send based on the samples that he obtains, and he will choose these signals strategically.\n\nIn this paper, we give necessary and sufficient conditions for when the principal can distinguish between agents of ``good'' and ``bad'' types, when the type affects the distribution of samples that the agent has access to.  We also study the computational complexity of checking these conditions.  Finally, we study how many samples are needed.", "full_text": "Distinguishing Distributions\n\nWhen Samples Are Strategically Transformed\n\nHanrui Zhang\nDuke University\n\nDurham, NC 27708\n\nhrzhang@cs.duke.edu\n\nYu Cheng\n\nDuke University\n\nDurham, NC 27708\n\nyucheng@cs.duke.edu\n\nVincent Conitzer\nDuke University\n\nDurham, NC 27708\n\nconitzer@cs.duke.edu\n\nAbstract\n\nOften, a principal must make a decision based on data provided by an agent.\nMoreover, typically, that agent has an interest in the decision that is not perfectly\naligned with that of the principal. Thus, the agent may have an incentive to select\nfrom or modify the samples he obtains before sending them to the principal. 
In\nother settings, the principal may not even be able to observe samples directly;\ninstead, she must rely on signals that the agent is able to send based on the samples\nthat he obtains, and he will choose these signals strategically.\nIn this paper, we give necessary and sufficient conditions for when the principal can\ndistinguish between agents of \u201cgood\u201d and \u201cbad\u201d types, when the type affects the\ndistribution of samples that the agent has access to. We also study the computational\ncomplexity of checking these conditions. Finally, we study how many samples are\nneeded.\n\n1 Introduction\n\nAnyone can have a bad day. Or a lucky one. Thus, in general, to determine with reasonable confidence\nwho are the highly capable agents\u2014whether they be people, companies, or anything else\u2014we need\nto observe their output over an extended period of time. Moreover, capability is generally not\none-dimensional, and who should be considered highly capable depends on what it is that we are\nlooking for. Finally, the policy that we set to evaluate agents\u2019 output will in general affect how they\nstrategically try to shape that output. Thus, we must choose our policy to enable the agents that are\nhighly capable (according to our definition) to distinguish themselves from others.\nExample. Suppose that there are researchers of different types. Specifically, suppose we have the\nfollowing set of types:\n\n\u0398 = {TML-H, TML-L, AML-H, AML-L}\n\nwhere \u201cTML\u201d stands for \u201ctheoretical machine learning,\u201d \u201cAML\u201d for \u201capplied machine learning,\u201d and\n\u201cL\u201d and \u201cH\u201d for \u201clow quality\u201d and \u201chigh quality,\u201d respectively. 
Each researcher generates high-quality\nideas (which we will in this paper refer to as samples) according to some probabilistic process.\nSuppose here the sample space is\n\nS = {T, A, B}\n\nwhere \u201cT\u201d stands for a purely theoretical idea without immediate applied significance, \u201cA\u201d for an\napplied idea without immediate theoretical significance, and \u201cB\u201d for an idea that has both theoretical\nand applied significance. Finally, suppose there are only 3 conferences: COLT, KDD, and NeurIPS\n(we will in this paper refer to papers published in these conferences as \u201csignals\u201d).\n\n\u03a3 = {COLT, KDD, NeurIPS}\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Illustration of the example (types TML-H, TML-L, AML-H, AML-L; samples T, A, B; signals COLT, KDD, NeurIPS).\n\nA T or a B idea (sample) can be turned into a COLT paper (signal);1 an A or a B idea can be turned\ninto a KDD paper; and a T, A, or B idea can be turned into a NeurIPS paper.2 Each idea, of course,\ncan be published in only one conference.\nSuppose a university would like to hire an AML-H researcher (but none of the other types). The\nfaculty recruiting committee, unfortunately, is excessively lazy and only looks at the publication\ncounts in the various venues. While the candidate researchers of course are committed to improving\nthis terrible process once they get the job, for now their only concern is getting the job. In particular,\neveryone will attempt to pretend to be an AML-H researcher by sending their papers to the appropriate\nvenues. But what exactly does this mean?\nSuppose an AML-H researcher generates ideas at the following rates: 0.5 B, 0.4 A, 0.1 T. Moreover,\nsuppose that a TML-H researcher generates ideas at the following rates: 0.5 B, 0.1 A, 0.4 T. 
If the\nAML-H researcher sends all her papers to NeurIPS, then, even in the long run, she cannot distinguish\nherself from the TML-H researcher, who could do the same. On the other hand, if she sends strictly\nmore than 0.6 of her ideas to KDD, then in the long run she will be able to distinguish herself from\nthe TML-H researcher, because 0.4 of the latter\u2019s ideas cannot go to KDD.\nNow consider the AML-L researcher. First, an easy case: suppose he generates ideas at the following\nrates: 0.4 B, 0.3 A, 0 T. (These numbers do not sum to 1, but this is not necessary, since they are rates.\nEquivalently, we can suppose him to have \u201cthe empty idea\u201d \u2205 with the remaining probability 0.3,\nwhich can be sent only to \u201cthe empty conference\u201d where anything can be sent. This \u201cempty signal\u201d\ncan also be used to model that the researchers sometimes only have ideas that they do not consider\nworth publishing, i.e., that they strategically select only a subset of their samples to pursue.) Clearly,\nthe AML-H researcher will in the long run distinguish herself from the AML-L researcher simply by\nthe overall number of papers published (as long as the AML-H researcher does not unnecessarily send\npapers to the empty conference!). Alternatively, suppose the AML-L researcher generates ideas at\nthe following rates: 0.4 B, 0.5 A, 0.1 T (so that the only weakness of the AML-L researcher relative\nto the AML-H researcher is that fewer of his ideas have both theoretical and applied significance). In\nthis case, the AML-H researcher can, in the long run, distinguish herself from the AML-L researcher\nby sending strictly more than 0.5 of her ideas to COLT. Of course, this conflicts with what she needs\nto do to distinguish herself from the TML-H researcher. 
Still, she can distinguish herself from both the\nTML-H and the AML-L researcher in the long run by, in odd-numbered years, sending strictly more\nthan 0.6 of her ideas to KDD, and, in even-numbered years, sending strictly more than 0.5 of her\nideas to COLT.\n\n1 Of course, having the basic idea is generally only a small part of the work that needs to be done for a\nconference paper; but for our purposes here, we may imagine that the idea incorporates all the work that needs\nto be done.\n\n2 We use the names of actual conferences strictly for amusement value, and while we think our example\nroughly aligns with the focus of these conferences, we do not mean to imply anything about their selectivity\n(all these ideas are high-quality) or open-mindedness. We also do not mean to imply anything about other\nconferences\u2014e.g., ICML could just as well have been used instead of NeurIPS\u2014or (in what follows) about\ndifferent types of researchers or the priorities and effort levels of actual hiring committees.\n\nIn the long run we are all dead. \u2014John Maynard Keynes\n\nIn reality, the candidates will have only finite time to prove themselves. Still, the lazy committee may\nhope to distinguish them with high probability. How many years suffice for this (and, therefore, should\nbe the length of a typical Ph.D. program, potentially extended with a postdoctoral appointment)?\nWhile this example is a bit tongue-in-cheek, it is not hard to see that this basic phenomenon\nfrequently occurs in society. People select from their opportunities and craft them to fit what they\nthink will appeal to future employers. A start-up company may select from its opportunities and craft\nthem to fit what it thinks will impress future backers.\nIn this paper, we introduce a general model\nthat captures all these and other cases. 
Within this model, we characterize conditions under which\nagents of certain types can distinguish themselves from others, as well as how many samples are\nneeded for this.\n\n1.1 Related Work\n\nZhang et al. [19] study a related problem in which an agent draws samples and has to submit a subset\nof size k of these samples to a principal, where k is exogenous. In that paper, the motivation is that the\nprincipal can inspect only so many samples. In contrast, in this paper there is no such constraint, but\nsamples can be modi\ufb01ed or turned into signals according to a given (arbitrary) graph. This paper also\nallows for uncertainty about how many samples an agent has available, via the \u201cempty sample/signal\u201d\ntrick illustrated in the introductory example.\nOur setting is related to mechanism design with partial veri\ufb01cation [8, 18], where an agent\u2019s type\nrestricts which signals he can send. This can be thought of as corresponding to the special case of\nour model in which an agent only has a single sample which is fully determined by his type. More\ngenerally, our setting is related to the literature in economics on signaling (along the lines of [16]).\nHowever, our model does not involve the agents taking any costly actions. There is other work that\ngeneralizes the partial veri\ufb01cation setting to allow costly signaling [12, 13], motivated in part by\nstrategic classi\ufb01cation settings where agents are being classi\ufb01ed but they can strategically change\ntheir features at some cost (as also studied in [9]).3 In contrast to this line of work, in this paper we\nconsider settings where a single agent with a single type repeatedly generates samples according\nto a distribution (which are then strategically transformed into signals). 
This allows us to study the\nquestion of how many samples are needed to, with high confidence, distinguish types from each other.\nOur results can be viewed as generalizations of classical results in efficient statistics, and in particular,\nresults for learning and testing discrete distributions, to strategic settings. One of our main results,\nTheorem 6, relies on a subroutine which generalizes the folklore result of estimating discrete\ndistributions. Another main result, Theorem 7, uses as a building block the sample-optimal identity\ntesting algorithm for discrete distributions [6, 17]. Theorem 7 generalizes their algorithm into an\nenvironment where samples can be strategically modified according to a partial order.\n\n2 Preliminaries\n\nFor a set S, we use \u2206(S) to denote the set of probability distributions over S. Given a distribution\nx \u2208 \u2206(S), we use x(i) to denote the probability mass on the element i \u2208 S, and x(A) to denote the\ntotal probability mass on the set A \u2286 S. We are generally interested in distinguishing one or more\ngood distributions from one or more bad distributions (where good and bad are determined by what\nwe are looking for). We use g to denote the good distribution, and b to denote the bad distribution.\n(We use (gi)i and (bi)i when there are multiple good/bad distributions.) The agent, depending on his\ntype being either good or bad, draws n samples i.i.d. from either g or b. How samples can be turned\ninto signals is represented by a bipartite graph G = (S \u222a \u03a3, E) between the (discrete) sample space\nS and the (discrete) signal space \u03a3. An agent must convert each sample into a signal and then submit\nall n signals to the principal. E specifies which signals are valid for each sample: a sample s \u2208 S\ncan be converted into a signal \u03c3 \u2208 \u03a3 iff (s, \u03c3) \u2208 E.\nNote that our model generalizes each of the following models:\n1. The agent can choose to omit samples. We can add an \u201cempty signal\u201d to \u03a3, where converting a\nsample s to the empty signal corresponds to not reporting s.\n\n2. The agent may or may not receive a sample in each round. E.g., in the example where samples\ncorrespond to ideas and signals correspond to papers, in some rounds the agent may not have any\n(worthwhile) idea. We can add an \u201cempty sample\u201d in S which can only be converted to the empty\nsignal.\n\n3. The signal space is the same as the sample space: S = \u03a3. In this case it is more natural to replace\nthe bipartite graph by one that has only one copy of each sample/signal, is no longer bipartite,\nand that represents the possibility of changing sample/signal u to sample/signal v by a directed\nedge (u, v).\n\n3 Other work that models strategic agents manipulating the data that they submit [5, 15] concerns aggregating\nthe data of multiple agents into a single outcome that all these agents care about; as such, this is less related to\nour model here, as here we are interested in determining a given single agent\u2019s type rather than choosing a single\noutcome that affects multiple agents.\n\nWe will be interested in the probability of accepting good or bad types after T rounds (i.e., after the\nagent draws T samples). We call the T signals submitted a report R \u2208 \u03a3^T. The principal gets to\nchoose an acceptance function (or policy, which could be randomized) f : R \u2192 {0, 1} that maps\nthe report into a binary decision. Her goal is to accept the good agent and reject the bad agent. The\nagent wants to be accepted regardless of his type. The principal can thus make two types of mistakes:\nfalse-positive (or type 1 error) when she accepts a bad agent, and false-negative (type 2 error) when\nshe rejects a good agent. 
The principal wants to minimize the maximum probability of making either\ntype of mistake.\nWe recall the following definition of the total variation distance:\nDefinition 1 (Total Variation Distance). The total variation distance between two distributions\nx, y \u2208 \u2206(\u03a3) over support \u03a3 is defined to be\n\ndTV(x, y) = (1/2) \u2016x \u2212 y\u2016_1 = (1/2) \u2211_{\u03c3\u2208\u03a3} |x(\u03c3) \u2212 y(\u03c3)| = max_{A\u2286\u03a3} (x(A) \u2212 y(A)).\n\nIn our setting, the total variation distance provides a good way to measure the closeness between two\nsignal distributions, which are observable by the principal. We will generalize this definition to our\nstrategic setting, to measure how close two distributions over the sample space are to each other.\n\n3 Basic Structural Results\n\nIn this section, we define a notion that we term \u201cdirected total variation distance\u201d dDTV. For two\ndistributions x and y over samples, dDTV(x, y) measures how well x can distinguish itself from y\nin our strategic setting. As we will see in the later sections, dDTV is a central notion in this paper,\nand often dictates the number of samples we need to distinguish the two distributions under strategic\nreporting.\nWe first give the formal definitions of reporting strategies and the directed total variation distance\ndDTV(x, y). Then we define another notion MaxSep(x, y) that measures how well x can distinguish\nitself from y from the principal\u2019s perspective, using separating sets instead of reporting strategies.\nGiven these definitions, we present one of our key structural results (Proposition 1), which shows that\nthe two notions are equivalent.\nBefore investigating distinguishing distributions under strategic reporting, we first generalize the\nclassical measure of how close two distributions are, dTV, to our strategic setting. 
We first give a\nformal definition of the reporting strategy used by the agents.\nDefinition 2 ((Single-Round) Reporting Strategy). Given x \u2208 \u2206(S), \u03b1 \u2208 \u2206(\u03a3), we say x can report\n\u03b1 (x \u2192 \u03b1), if there exists a reporting strategy R = {rs,\u03c3}(s,\u03c3)\u2208E satisfying:\n\u2022 rs,\u03c3 \u2265 0 for all (s, \u03c3) \u2208 E.\n\u2022 For each s \u2208 S, \u2211_{\u03c3:(s,\u03c3)\u2208E} rs,\u03c3 = 1.\n\u2022 For each \u03c3 \u2208 \u03a3, \u2211_{s:(s,\u03c3)\u2208E} x(s) \u00b7 rs,\u03c3 = \u03b1(\u03c3).\n\nWe say x reports \u03b1 by strategy R (x \u2192R \u03b1).\nIn other words, when each sample s \u2208 S is drawn from the distribution x and, given this sample,\nthe agent reports \u03c3 \u2208 \u03a3 with probability rs,\u03c3, the resulting distribution over the signal space is\nexactly \u03b1. For a fixed sample or a random variable s, we use R(s) \u2208 \u2206(\u03a3) to denote the random\nvariable whose distribution over the signal space is induced by {rs,\u03c3}\u03c3\u2208\u03a3.\nGiven the definition of reporting strategies, we are ready to generalize dTV to our setting. Intuitively,\nx chooses a report first, and then y chooses a report in response; they play a zero-sum game where\nx wants the reports to be as far away from each other as possible. dDTV(x, y) is the value of this\ntwo-player game when x must choose a report (i.e., a pure strategy) first, which measures how far x\ncan stay away from y.\n\nDefinition 3 (Directed Total Variation Distance). 
Given (S, \u03a3, E), the directed total variation distance\nbetween two distributions x, y \u2208 \u2206(S) over the sample space S is defined to be\n\ndDTV(x, y) = max_{\u03b1:x\u2192\u03b1} min_{\u03b2:y\u2192\u03b2} dTV(\u03b1, \u03b2).\n\nDirected total variation distance nicely characterizes the distance between two distributions from the\nagent\u2019s perspective, but it is not immediately clear how that might help the principal. In particular,\nare two distributions easily separable by setting an appropriate policy if they have large directed\ntotal variation distance? To study this, we introduce several concepts to model the problem from the\nprincipal\u2019s perspective.\nDefinition 4 (Preimage of Signals). For any set of signals A \u2286 \u03a3, the preimage pre(A) of A is\ndefined to be the set of samples which can be mapped to a signal in A. That is,\n\npre(A) = {s \u2208 S | \u2203\u03c3 \u2208 A, s.t. (s, \u03c3) \u2208 E}.\n\nThe principal could label a set A of signals as \u201cgood\u201d signals and simply measure how many good\nsignals the agent is able to send. Ideally, this A is chosen so that a good agent can send (significantly)\nmore signals in A than a bad agent. This inspires the following definitions.\nDefinition 5 (Separation). For any A \u2286 \u03a3, if x(pre(A)) \u2212 y(pre(A)) = \u03b5 > 0, then we say A\nseparates x from y by a margin of \u03b5.\nDefinition 6 (Max Separation). The max separation of x \u2208 \u2206(S) from y \u2208 \u2206(S) over the sample\nspace S is defined to be MaxSep(x, y) = max_{A\u2286\u03a3} (x(pre(A)) \u2212 y(pre(A))).\nWe now draw the connection between the agent\u2019s and the principal\u2019s perspectives. The following\nproposition can be viewed as a generalization of the classic Hall\u2019s Marriage Theorem. 
Proposition 1\nstates that g can distinguish itself from b under strategic reporting iff there exists a subset A\u2217 of\nsignals so that g can generate more signals in A\u2217 than b. Equivalently, the best reporting strategy\nfor g is to focus on a subset A\u2217 of the signal space, and try to convert samples into signals in A\u2217\nwhenever possible.\nProposition 1. For any x, y \u2208 \u2206(S), dDTV(x, y) = MaxSep(x, y).\nThe proof of the proposition, as well as all other proofs, is deferred to the appendix. This equivalence\nbetween dDTV and MaxSep not only is a nice structural result; Proposition 1 plays a substantial part\nin our main algorithmic results.\nIt is worth noting that dDTV(x, y) in general is not equal to dDTV(y, x). However, the triangle\ninequality still holds for dDTV, which also enables some of our main results.\nProposition 2. For any x, y, z \u2208 \u2206(S), dDTV(x, y) + dDTV(y, z) \u2265 dDTV(x, z).\n\n4 Structural and Computational Results in the General Case\n\nIn this section, we de\ufb01ne adaptive and non-adaptive reporting strategies (De\ufb01nition 7), and the\naccepting probabilities of the optimal reporting strategies after T rounds (De\ufb01nition 8). At a high\nlevel, we give a tight characterization result on when there exists a policy that can distinguish g from\nb under strategic reporting, and provide an asymptotically tight bound on the sample complexity\nof the optimal policy. Moreover, we show that while our structural result is clean and tight, it is\ncomputationally hard to check if the condition holds. That is, in the general case, it is NP-hard to\ndetermine whether there is a policy that can distinguish g from b.\nMore speci\ufb01cally, we \ufb01rst show that there exists a policy that can distinguish g from b in the limit\n(when T \u2192 \u221e) iff dDTV(g, b) > 0 (Theorem 1). 
Next, we give an asymptotically tight sample\ncomplexity bound of T = \u0398(1/\u03b5^2) when dDTV(g, b) = \u03b5 and we want to distinguish g from b with\nhigh constant probability (Theorem 3). We then extend the existence result to more general settings\nwhen there are multiple good and bad distributions (Theorem 4). Finally, we show that it is NP-hard\nto decide if we are in the case where dDTV(g, b) = 0 or dDTV(g, b) > 1/poly(m, n) (Theorem 2).\nWe start with the definition of adaptive reporting strategies.\nDefinition 7 (Adaptive Reporting Strategy). An adaptive reporting strategy R = (R1, . . . , RT) is\na sequence of (different) reporting strategies. The signal \u03c3i at time i is obtained by applying Ri to\nthe sample si at time i. Ri = Ri(\u03c31, . . . , \u03c3i\u22121) may depend on all past signals. A reporting strategy\nis non-adaptive if Ri = R1 for any i and (\u03c31, . . . , \u03c3i\u22121), and adaptive otherwise. For an adaptive\nreporting strategy R = (R1, . . . , RT), we interchangeably write \u03c3i = Ri(si | \u03c31, . . . , \u03c3i\u22121) to indicate the\ndependence of Ri on \u03c31, . . . , \u03c3i\u22121.\n\nWhen we analyze the quality of a fixed T-round policy f, we are interested in the probability that f\naccepts g or b after T rounds, when the agent (of either type) best-responds to f.\nDefinition 8 (Acceptance Probabilities of the Best Reporting Strategies). Given x \u2208 \u2206(S), T \u2208\nN, and the principal\u2019s policy f, let the acceptance rate under adaptive / non-adaptive reporting\nrespectively be\n\npada(f, x, T) = max_{R=(R1,...,RT)} E[f((Ri(si))i\u2208[T])],\n\npnon(f, x, T) = max_{R=(R,...,R)} E[f((Ri(si))i\u2208[T])],\n\nwhere the expectations are taken over T i.i.d. samples (si)i drawn from x. Observe that\npada(f, x, T) \u2265 pnon(f, x, T) for any f, x, and T.\nIntuitively, if dDTV(g, b) = 0, then the bad distribution can mimic the good distribution perfectly in\nthe signal space, no matter what reporting strategy g uses. Therefore, it is impossible to distinguish\ng from b. The next theorem formalizes this intuition. In particular, even if g reports adaptively, b\ncan still mimic g\u2019s conditional reporting strategy in every situation (i.e., for every combination of\npreviously reported signals).\nTheorem 1 (Separability in the Limit). Given good and bad distributions g and b:\n(i) If dDTV(g, b) > 0, then there exists a policy f such that\n\nlim_{T\u2192\u221e} (pnon(f, g, T) \u2212 pada(f, b, T)) = 1.\n\nThat is, f accepts g and rejects b with probability 1 in the limit.\n\n(ii) If dDTV(g, b) = 0, then for any policy f and any T,\n\npada(f, g, T) \u2264 pada(f, b, T), pnon(f, g, T) \u2264 pnon(f, b, T).\n\nThat is, no policy can separate g from b, regardless of whether the setting is adaptive.\n\nThe next theorem states that while our characterization result (Theorem 1) is clean and tight (we can\ndistinguish iff dDTV(g, b) > 0), it is in fact computationally hard to check if this condition holds.\nIntuitively, Theorem 2 constructs an instance where the good distribution needs to focus on as few\nsignals as possible. The parameters are chosen carefully so that it is crucial that g finds a subset of\nsignals A \u2286 \u03a3 with minimum cardinality that covers the support of g.\nTheorem 2 (Hardness of Checking Separability). 
Given x, y \u2208 \u2206(S), it is NP-hard to distinguish\nbetween the following two cases: (1) dDTV(x, y) = 0 and (2) dDTV(x, y) \u2265 1/poly(m, n), or equivalently,\nto determine the existence of a set A \u2286 \u03a3 such that x(pre(A)) \u2212 y(pre(A)) \u2265 1/poly(m, n).\n\nNote that the hardness of checking the existence of separating sets implies the hardness of finding any\nseparating set given that dDTV(x, y) > 0. This is because given an algorithm for the latter problem,\none could run that algorithm without knowing whether dDTV(x, y) > 0 and see if it succeeds. Either\nthe algorithm returns a separating set, or we know it must be the case that dDTV(x, y) = 0 and no\nseparating set exists.\nNext, we focus on the case when there are finitely many samples. Theorem 3 is more refined\nthan Theorem 1, in that it gives a tight sample complexity bound instead of only talking about\ndistinguishing g and b in the limit.\nTheorem 3 (Sample Complexity with Two Distributions). For any g and b such that dDTV(g, b) \u2265 \u03b5:\n\u2022 There is a policy f such that for any \u03b4 > 0 and T \u2265 2 ln(1/\u03b4)/\u03b5^2, pnon(f, g, T) \u2265 1 \u2212 \u03b4 and\npada(f, b, T) \u2264 \u03b4.\n\u2022 When dDTV(g, b) = \u03b5 and T = o(1/\u03b5^2), for any f, pnon(f, g, T) \u2212 pnon(f, b, T) < 1/3.\n\nTheorem 3 can be generalized to the case where there are multiple good and bad distributions. First,\nsuppose there is one good distribution and multiple bad distributions. As long as dDTV(g, bj) \u2265 \u03b5\nfor every bad distribution bj, we can use the testing algorithm in Theorem 3 to distinguish them in\nT = O(1/\u03b5^2) rounds (with high constant probability). We potentially need to do so separately for\nevery bad distribution, paying an extra factor of \u2126(\u2113) in the sample complexity if there are \u2113 bad\ndistributions. 
If there are k good distributions, then we can run the k testers in parallel, paying an\nadditional factor of log(k) in the sample complexity to boost the success probability so that we can\ntake a union bound.\nTheorem 4 (Multiple Good and Bad Distributions, the General Case). For any g1, . . . , gk and\nb1, . . . , b\u2113 such that dDTV(gi, bj) \u2265 \u03b5 for any i \u2208 [k] and j \u2208 [\u2113], there is a policy f such that: For\nany \u03b4 > 0 and T \u2265 2\u2113 ln(k\u2113/\u03b4)/\u03b5^2, pada(f, gi, T) \u2265 1 \u2212 \u03b4 for any i \u2208 [k], and pada(f, bj, T) \u2264 \u03b4\nfor any j \u2208 [\u2113].\nWe note that the policy in Theorem 4 requires the good distribution to report in different ways,\nwhich is not possible with a non-adaptive strategy according to our definition. In particular, the\ngood distribution must know which bad distribution it is up against in each phase, and report\naccordingly. As our introductory example shows, this is in fact necessary when there are multiple\nbad distributions.\n\n5 When Signals Are Partially Ordered\n\nIn many real-world situations, the sample and signal spaces are structured. For example, when a\nband is recruiting new members, applicants may be asked to submit video recordings of themselves\nplaying. An applicant would probably videotape herself playing for an entire event as a sample, and\nthen crop the recording to create a signal that demonstrates only her best performance. This cropping\nprocedure is irreversible: the complete recording may be cropped to keep a part, but from a part,\nit is impossible to recover the full recording. The signal space in this scenario is partially ordered\nby the cropping procedure\u2014the samples/signals can be transformed in one direction (shortening),\nbut never the other. Also, there is a \u201cdefault\u201d signal for each sample, which is simply to submit the\ncomplete recording without cropping. 
The default signal can be transformed into any signal that can\nbe reported from this sample. In this section, we consider the following abstraction of such scenarios:\n\u2022 S = \u03a3,\n\u2022 (s, s) \u2208 E for any s \u2208 S,\n\u2022 (s, t) \u2208 E and (t, u) \u2208 E =\u21d2 (s, u) \u2208 E, and\n\u2022 E is acyclic except for self-cycles.\nThis abstraction also covers, for example, scenarios where the agent can choose to hide certain\nsamples\u2014any sample can be transformed into a non-sample, but not the reverse. Note that given the\nabove conditions, the sample/signal space is essentially a partially ordered set, where a sample can\nonly be transformed according to this partial order. Let n = |S| be the cardinality of the sample/signal\nspace.\nWe first show some useful structural results in the partially ordered case. The following proposition\ndemonstrates that the revelation principle holds in this case.\nProposition 3 (Revelation Principle). For any policy f:\n\u2022 There exists a policy f\u2032 such that for any x \u2208 \u2206(S), T \u2208 N,\n\npnon(f, x, T) = pnon(f\u2032, x, T) = E[f\u2032((si)i)].\n\n\u2022 There exists a policy f\u2032\u2032 such that for any x \u2208 \u2206(S), T \u2208 N,\n\npada(f, x, T) = pada(f\u2032\u2032, x, T) = pnon(f\u2032\u2032, x, T) = E[f\u2032\u2032((si)i)].\n\nIn other, non-learning contexts in mechanism design, whether the revelation principle holds is often\nan aspect that determines whether the computational problems therein are tractable. 
We will see that\nthis is also the case for our problem\u2014the revelation principle enables efficient computation of the\nmax separation, and therefore efficient policies in a quite natural way.\nThe next proposition simplifies the definition of dDTV in the partially ordered case, based on the\ninsight that, per the revelation principle, the best way for x to avoid being mimicked by y is to always\nreport the unmodified samples.\nProposition 4 (dDTV Simplified). In the transitive case, dDTV(x, y) = min_{y\u2192y\u2032} dTV(x, y\u2032).\nThis also gives us an efficient algorithm for finding the set that supports the max separation\nMaxSep(x, y) of x from y:\nCorollary 1 (Efficient Computation of Max Separation). Given any x, y \u2208 \u2206(S), there is a poly-time\nalgorithm which computes a set A\u2217 satisfying x(pre(A\u2217)) \u2212 y(pre(A\u2217)) = MaxSep(x, y).\nWe show in Theorem 5 that in the partially ordered case we can separate multiple good distributions\nfrom multiple bad ones with much smaller overhead. The proof of Theorem 5 is similar to that of\nTheorem 4. The only difference is that, because of the revelation principle, we no longer require good\ndistributions to report adaptively.\nTheorem 5 (Multiple Good and Bad Distributions: The Partially Ordered Case). For any g1, . . . , gk\nand b1, . . . , b\u2113 where dDTV(gi, bj) \u2265 \u03b5 for any i \u2208 [k], j \u2208 [\u2113], there is a policy f such that: For any\n\u03b4 > 0 and T \u2265 2 ln(k\u2113/\u03b4)/\u03b5^2, pnon(f, gi, T) \u2265 1 \u2212 \u03b4 for any i \u2208 [k], and pada(f, bj, T) \u2264 \u03b4 for\nany j \u2208 [\u2113].\nIn the partially ordered case, we can not only deal with multiple good and bad distributions much\nmore efficiently, but also deal with any bad distribution using a single sample-efficient policy. 
Before\nstating the result, recall the following de\ufb01nition of the width of a partially ordered set.\nDe\ufb01nition 9 (Width of Partially Ordered Sets). The width \u03c1(G) of a partially ordered set represented\nas graph G = (S, E) is de\ufb01ned to be \u03c1(G) = max{|A| | A \u2286 S, \u2200 distinct s1, s2 \u2208 A, (s1, s2) \u2209 E}. In\nother words, the width is the maximum size of a set A \u2286 S where any two distinct elements in A are not\ncomparable. Such a set A is called an anti-chain.\n\nWe now provide our generic policy, whose sample complexity, quite surprisingly, depends roughly\nlinearly on the width of the sample space.\nTheorem 6 (Ef\ufb01cient Policy against Any Bad Distribution). For any g \u2208 \u2206(S), there is a policy f\nsuch that for any \u03b4 > 0 and any T \u2265 2\u03c1 ln(1 + n/\u03c1) ln(1/\u03b4)/\u03b5\u00b2: (1) pnon(f, g, T) \u2265 1 \u2212 \u03b4, and (2) for any b\nsuch that dDTV(g, b) \u2265 \u03b5, pada(f, b, T) \u2264 \u03b4. Moreover, the outcome of the policy can be computed\nin polynomial time.\n\nThe above policy is able to detect any bad distribution with adaptive reporting. For bad distributions\nwithout adaptive reporting, when \u03c1 = \u2126(\u221an / log n), the following policy achieves even better\nsample complexity.\nTheorem 7 (Ef\ufb01cient Policy against Non-adaptive Bad Distributions). For any g \u2208 \u2206(S), there is a\npolicy f such that for any \u03b4 > 0, with T = O(\u221an ln(1/\u03b4)/\u03b5\u00b2) samples: (1) pnon(f, g, T) \u2265 1 \u2212 \u03b4, and\n(2) for any b such that dDTV(g, b) \u2265 \u03b5, pnon(f, b, T) \u2264 \u03b4. Moreover, the outcome of the policy can\nbe computed in polynomial time.\n\n6 Future research\n\nIn this paper, we have focused on distinguishing good and bad types with near certainty. In reality,\nthe number of available samples may not always be suf\ufb01cient for this. 
If so, it may be worthwhile to\nmove beyond simple acceptance and rejection decisions to a more general mechanism design setup.\nFor example, when the signals we receive from an agent are not decisive one way or another, perhaps\nan intermediate outcome between rejection and acceptance allows us to improve our objective, by\navoiding the damage of either accepting a bad type or rejecting a good type. One may also consider\nsettings in which signaling is costly (or at least sending high-quality signals comes at an effort cost,\nin line with traditional signaling models [16]) or in which agents can in fact improve their actual\ntypes via some investment cost. Any of these directions would further enrich the speci\ufb01c connections\nbetween mechanism design and learning theory that we have begun to explore in this paper (and\nthat in turn complement other fascinating connections between these topics that have earlier been\nestablished by others [1, 11, 2, 10, 3, 4, 14, 7]).\n\nAcknowledgements. We are thankful for support from NSF under awards IIS-1814056 and IIS-\n1527434. We also thank anonymous reviewers for helpful comments.\n\nReferences\n[1] Pranjal Awasthi, Avrim Blum, Nika Haghtalab, and Yishay Mansour. Ef\ufb01cient PAC Learning from the\nCrowd. In Conference on Learning Theory, pages 127\u2013150, 2017.\n\n[2] Avrim Blum, Nika Haghtalab, Ariel D Procaccia, and Mingda Qiao. Collaborative PAC Learning. In\nAdvances in Neural Information Processing Systems, pages 2392\u20132401, 2017.\n\n[3] Yiling Chen, Nicole Immorlica, Brendan Lucier, Vasilis Syrgkanis, and Juba Ziani. Optimal data acquisition\nfor statistical estimation. In Proceedings of the 2018 ACM Conference on Economics and Computation,\npages 27\u201344. ACM, 2018.\n\n[4] Yiling Chen, Chara Podimata, Ariel D Procaccia, and Nisarg Shah. Strategyproof linear regression in high\ndimensions. 
In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 9\u201326.\nACM, 2018.\n\n[5] Ofer Dekel, Felix Fischer, and Ariel D. Procaccia. Incentive compatible regression learning. In Proceedings\nof the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 884\u2013893, Philadelphia, PA,\nUSA, 2008. Society for Industrial and Applied Mathematics.\n\n[6] Ilias Diakonikolas, Daniel M Kane, and Vladimir Nikishkin. Testing identity of structured distributions. In\nProceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms, pages 1841\u20131854.\nSociety for Industrial and Applied Mathematics, 2015.\n\n[7] Jinshuo Dong, Aaron Roth, Zachary Schutzman, Bo Waggoner, and Zhiwei Steven Wu. Strategic\nclassi\ufb01cation from revealed preferences. In Proceedings of the 2018 ACM Conference on Economics and\nComputation, pages 55\u201370. ACM, 2018.\n\n[8] Jerry Green and Jean-Jacques Laffont. Partially veri\ufb01able information and mechanism design. Review of\nEconomic Studies, 53:447\u2013456, 1986.\n\n[9] Moritz Hardt, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. Strategic classi\ufb01cation. In\nInnovations in Theoretical Computer Science (ITCS), Cambridge, MA, USA, 2016.\n\n[10] Lily Hu, Nicole Immorlica, and Jennifer Wortman Vaughan. The disparate effects of strategic manipulation.\nIn Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 259\u2013268. ACM,\n2019.\n\n[11] Shahin Jabbari, Ryan M Rogers, Aaron Roth, and Steven Z Wu. Learning from rational behavior: Predicting\nsolutions to unknown linear programs. In Advances in Neural Information Processing Systems, pages\n1570\u20131578, 2016.\n\n[12] Andrew Kephart and Vincent Conitzer. 
Complexity of mechanism design with signaling costs. In\nProceedings of the Fourteenth International Conference on Autonomous Agents and Multi-Agent Systems\n(AAMAS), pages 357\u2013365, Istanbul, Turkey, 2015.\n\n[13] Andrew Kephart and Vincent Conitzer. The revelation principle for mechanism design with reporting costs.\nIn Proceedings of the Seventeenth ACM Conference on Economics and Computation (EC), pages 85\u2013102,\nMaastricht, the Netherlands, 2016.\n\n[14] Annie Liang, Xiaosheng Mu, and Vasilis Syrgkanis. Optimal and myopic information acquisition. In\nProceedings of the 2018 ACM Conference on Economics and Computation, pages 45\u201346. ACM, 2018.\n\n[15] Reshef Meir, Ariel D. Procaccia, and Jeffrey S. Rosenschein. Algorithms for strategyproof classi\ufb01cation.\nArti\ufb01cial Intelligence, 186:123\u2013156, 2012.\n\n[16] Michael Spence. Job market signaling. Quarterly Journal of Economics, 87(3):355\u2013374, 1973.\n\n[17] Gregory Valiant and Paul Valiant. An automatic inequality prover and instance optimal identity testing.\nSIAM Journal on Computing, 46(1):429\u2013455, 2017.\n\n[18] Lan Yu. Mechanism design with partial veri\ufb01cation and revelation principle. Autonomous Agents and\nMulti-Agent Systems, 22(1):217\u2013223, 2011.\n\n[19] Hanrui Zhang, Yu Cheng, and Vincent Conitzer. When samples are strategically selected. In Thirty-sixth\nInternational Conference on Machine Learning, 2019.", "award": [], "sourceid": 1798, "authors": [{"given_name": "Hanrui", "family_name": "Zhang", "institution": "Duke University"}, {"given_name": "Yu", "family_name": "Cheng", "institution": "Duke University"}, {"given_name": "Vincent", "family_name": "Conitzer", "institution": "Duke University"}]}