{"title": "Learning to Screen", "book": "Advances in Neural Information Processing Systems", "page_first": 8615, "page_last": 8624, "abstract": "Imagine a large firm with multiple departments that plans a large recruitment. \nCandidates arrive one-by-one, and for each candidate\nthe firm decides, based on her data (CV, skills, experience, etc), \nwhether to summon her for an interview.\nThe firm wants to recruit the best candidates while minimizing the number of interviews.\nWe model such scenarios as an assignment problem between items (candidates) and categories (departments):\nthe items arrive one-by-one in an online manner,\nand upon processing each item the algorithm decides, \nbased on its value and the categories it can be matched with,\nwhether to retain or discard it (this decision is irrevocable).\nThe goal is to retain as few items as possible while guaranteeing \nthat the set of retained items contains an optimal matching. \n \nWe consider two variants of this problem:\n(i) in the first variant it is assumed that the $n$ items are drawn independently \nfrom an unknown distribution $D$.\n(ii) In the second variant it is assumed that before the process starts, \nthe algorithm has an access to a training set of $n$ items drawn independently \nfrom the same unknown distribution (e.g.\\ data of candidates from previous recruitment seasons).\nWe give tight bounds on the minimum possible number of retained items in each of these variants.\nThese results demonstrate that one can retain exponentially less items in the second variant (with the training set).\n\nOur algorithms and analysis utilize ideas and techniques from statistical learning theory and from discrete algorithms.", "full_text": "Learning to Screen\n\nAlon Cohen\u2217 Avinatan Hassidim\u2020 Haim Kaplan\u2021 Yishay Mansour\u00a7\n\nShay Moran\u00b6\n\nAbstract\n\nImagine a large \ufb01rm with multiple departments that plans a large recruitment.\nCandidates arrive one-by-one, and for each candidate 
the firm decides, based on her data (CV, skills, experience, etc.), whether to summon her for an interview. The firm wants to recruit the best candidates while minimizing the number of interviews. We model such scenarios as an assignment problem between items (candidates) and categories (departments): the items arrive one-by-one in an online manner, and upon processing each item the algorithm decides, based on its value and the categories it can be matched with, whether to retain or discard it (this decision is irrevocable). The goal is to retain as few items as possible while guaranteeing that the set of retained items contains an optimal matching.
We consider two variants of this problem: (i) in the first variant it is assumed that the n items are drawn independently from an unknown distribution D; (ii) in the second variant it is assumed that before the process starts, the algorithm has access to a training set of n items drawn independently from the same unknown distribution (e.g., data of candidates from previous recruitment seasons). We give near-optimal bounds on the best-possible number of retained items in each of these variants. These results demonstrate that one can retain exponentially fewer items in the second variant (with the training set).
Our algorithms and analysis utilize ideas and techniques from statistical learning theory and from discrete algorithms.

1 Introduction

Matching is the bread-and-butter of many real-life problems from the fields of computer science, operations research, game theory, and economics. Some examples include job scheduling, where we assign jobs to machines; economic markets, where we allocate products to buyers; online advertising, where we assign advertisers to ad slots; assigning medical interns to hospitals; and many more.
One particular example that motivates this work comes from labor markets. Imagine a firm that is planning a large recruitment.
Candidates arrive one-by-one and the HR department immediately decides whether to summon them for an interview. Moreover, the firm has multiple departments, each requiring different skills and having a different target number of hires. Different employees have different subsets of the required skills, and thus fit only certain departments, and with a certain quality. The firm's HR department, following the interviews, decides which candidates to recruit and to which departments to assign them. The HR department has to maximize the total quality of the hired employees such that each department gets its required number of hires with the required skills. In addition, HR uses data from the previous recruitment season in order to minimize the number of interviews while not compromising the quality of the solution.

*Technion—Israel Inst. of Technology and Google Research. aloncohen@technion.ac.il
†Bar-Ilan University and Google Research. avinatanh@gmail.com
‡Tel-Aviv University and Google Research. haimk@tau.ac.il
§Tel-Aviv University and Google Research. mansour.yishay@gmail.com
¶Princeton University. shaymoran1@gmail.com. This work was done while the author was working at Google Research.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We study the following formulation of the problem above. We receive n items (candidates), where each item has a subset of d properties (departments) denoted by P1, . . . , Pd. We select k items out of the n, subject to d constraints of the form: exactly ki of the selected items must satisfy property Pi, where ∑_{i=1}^d ki = k, and we assume that d ≪ k ≪ n. Furthermore, if item c possesses property Pi, then it has a value vi(c) associated with this property.
Our goal is to compute a matching of maximum value that associates k items to the d properties subject to the constraints above.
We consider matching algorithms in the following online setting. The algorithm receives n items online, drawn independently from D, and either rejects or retains each item. Then, the algorithm utilizes the retained items and outputs an (approximately-)optimal feasible solution. We present a naive greedy algorithm that returns the optimal solution with probability at least 1 – δ and retains O(k log(k/δ)) items in expectation. We prove that no other algorithm with the same guarantee can retain fewer items in expectation.
Thus, to further reduce the number of retained items, we add an initial preprocessing phase in which the algorithm learns an online policy from a training set. The training set is a single problem instance that consists of n items drawn independently from the same unknown distribution D. We address the statistical aspects of this problem and develop efficient learning algorithms. In particular, we define a class of thresholds-policies. Each thresholds-policy is a simple rule for deciding whether to retain an item. We present uniform convergence rates for both the number of items retained by a thresholds-policy and the value of the resulting solution. We show that these quantities deviate from their expected value by order of √k (rather than an easier √n bound; recall that we assume k ≪ n), which we prove using concentration inequalities and tools from VC-theory. Using these concentration inequalities, we analyze an efficient online algorithm that returns the optimal offline solution with probability at least 1 – δ, and retains a near-optimal O(k log log(1/δ)) number of items in expectation (compare with the O(k log(k/δ)) number of retained items when no training set is given).

Related work.
Our model is related to the online secretary problem, in which one needs to select the best secretary in an online manner (see Ferguson, 1989). Our setting differs from this classical model due to the two-stage process and the complex feasibility constraints. Nonetheless, we remark that there are a few works on the secretary model that allow delayed selection (see Vardi, 2015, Ezra et al., 2018) as well as matroid constraints [Babaioff et al., 2007]. These works differ from ours in the way the decision is made, in the feasibility constraints, and in the learning aspect of receiving a single problem instance as a training example. Correa et al. [2018] consider a distributional setting for the single-choice prophet inequality problem. Similarly to the setting considered here, they assume that the data is drawn independently from an unknown distribution and that the algorithm has access to a training set sampled from the same distribution. However, the objective is quite different from ours: the goal is to pick a stopping time τ such that the τ'th sample approximately maximizes the value among all samples (including those that were not seen yet).
Another related line of work in algorithmic economics studies the statistical learnability of pricing schemes (see e.g., Morgenstern and Roughgarden, 2015, 2016, Hsu et al., 2016, Balcan et al., 2018). The main difference of these works from ours is that our training set consists of a single "example" (namely, the set of items used for training), whereas in their setting (as well as in most typical statistical learning settings) the training set consists of many i.i.d. examples. This difference also affects the technical tools used for obtaining generalization bounds. For example, some of our bounds exploit Talagrand's concentration inequality rather than the more standard Chernoff/McDiarmid/Bernstein inequalities.
We note that Talagrand's inequality and other advanced inequalities were applied in machine learning in the context of learning combinatorial functions [Vondrák, 2010, Blum et al., 2017]. See also the survey by Bousquet et al. [2004] or the book by Boucheron et al. [2013] for a more thorough review of concentration inequalities.
Furthermore, there is a large body of work on online matching in which the vertices arrive in various models (see Mehta et al., 2013, Gupta and Molinaro, 2016). We differ from this line of research by allowing a two-stage algorithm, and by requiring the optimal matching to be output in the second stage.
Celis et al. [2017, 2018] study similar problems of ranking and voting with fairness constraints. In fact, the optimization problem that they consider allows more general constraints, and the value of a candidate is determined from votes/comparisons. The main difference with our framework is that they do not consider a statistical setting (i.e., there is no distribution over the items and no training set for preprocessing) and focus mostly on approximation algorithms for the optimization problem.

2 Model and Results

Let X be a domain of items, where each item c ∈ X can possess any subset of d properties denoted by P1, . . . , Pd (we view Pi ⊆ X as the set of items having property Pi). Each item c has a value vi(c) ∈ [0, 1] associated with each property Pi such that c ∈ Pi.
We are given a set C ⊆ X of n items as well as counts k1, . . . , kd such that ∑_{i=1}^d ki = k. Our goal is to select exactly k items in total, subject to selecting exactly ki items with property Pi. We assume that these constraints are exclusive, in the sense that each item in C can be used to satisfy at most one of the constraints. Formally, a feasible solution is a subset S ⊆ C such that |S| = k and there is a partition of S into d disjoint subsets S1, . . . , Sd such that Si ⊆ Pi and |Si| = ki. We aim to compute a feasible subset S that maximizes ∑_{i=1}^d ∑_{c∈Si} vi(c).
Furthermore, we assume that d ≪ k ≪ n. Namely, the number of constraints is much smaller than the number of items that we have to select, which is much smaller than the total number of items in C. In order to avoid feasibility issues we assume that there is a set Cdummy that contains k dummy 0-value items with all the d properties (we assume that the algorithm always has access to Cdummy, and we do not view these items as part of C).

Formulation as bipartite matching. We first discuss the offline versions of these allocation problems. That is, we assume that C and the capacities ki are all given as input before the algorithm starts. We are interested in an algorithm for computing an optimal set S, that is, a set of items of maximum total value that satisfies the constraints. This problem is equivalent to a maximum matching problem in a bipartite graph (L, R, E, w) defined as follows.

• L is the set of vertices on one side of the bipartite graph. It contains k vertices, where each constraint i is represented by ki of these vertices.
• R is the set of vertices on the other side of the bipartite graph. It contains a vertex for each item c ∈ C and for each dummy item c′ ∈ Cdummy.
• E is the set of edges. Each vertex in R is connected to each vertex of each of the constraints that it satisfies.
• The weight w(l, r) of edge (l, r) ∈ E is vl(r): the value of item r associated with property Pl.

There is a natural correspondence between saturated matchings in this graph, that is, matchings in which every l ∈ L is matched, and feasible solutions (i.e., solutions that satisfy the constraints) to the allocation problem. Thus, a saturated matching of maximum value corresponds to an optimal solution.
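For intuition, the offline problem can be solved on a toy instance by brute force over all assignments of items to the k capacity "slots" of the bipartite formulation. The sketch below is not from the paper: items are represented (by our own convention) as dicts mapping each property index to the item's value for it, and a real implementation would instead run a polynomial-time max-weight bipartite matching algorithm (e.g. the Hungarian algorithm).

```python
from itertools import permutations

def optimal_assignment(items, capacities):
    """Brute-force the offline problem of Section 2: choose exactly
    capacities[i] items for property i (each item used at most once),
    maximizing the total value.  `items` is a list of dicts mapping a
    property index i to the value v_i(item).  Returns (value, assignment),
    where assignment maps item index -> property index."""
    # One L-side "slot" per unit of capacity, as in the bipartite formulation.
    slots = [i for i, ki in enumerate(capacities) for _ in range(ki)]
    k = len(slots)
    best_value, best = float("-inf"), None
    for chosen in permutations(range(len(items)), k):
        # chosen[j] is matched to slot j; the match is feasible only if
        # the item actually possesses the slot's property.
        if all(slots[j] in items[c] for j, c in enumerate(chosen)):
            value = sum(items[c][slots[j]] for j, c in enumerate(chosen))
            if value > best_value:
                best_value = value
                best = {c: slots[j] for j, c in enumerate(chosen)}
    return best_value, best

# Toy instance: d = 2 properties with capacities k_1 = k_2 = 1.
items = [{0: 0.9}, {0: 0.5, 1: 0.4}, {1: 0.8}, {0: 0.3}]
value, assignment = optimal_assignment(items, [1, 1])
```

On this instance the optimum assigns item 0 to property 0 and item 2 to property 1, for total value 1.7. Padding with 0-value dummy items (the set Cdummy above) would guarantee feasibility in general.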
It is well known that the problem of finding such a maximum weight bipartite matching can be solved in polynomial time (see e.g., Lawler, 2001).

Problem definition. In this work we consider the following online learning model. We assume that n items are sequentially drawn i.i.d. from an unknown distribution D over X. Upon receiving each item, we decide whether to retain it or reject it irrevocably (the first stage of the algorithm). Thereafter, we select a feasible solution6 consisting only of retained items (the second stage of the algorithm). Most importantly, before accessing the online sequence and taking irrevocable online decisions of which items to reject, we have access to a training set Ctrain consisting of n independent draws from D.

6 In addition to the retained items, the algorithm has access to Cdummy, and therefore a feasible solution always exists.

2.1 Results

2.1.1 Oblivious online screening

We begin by studying a greedy algorithm that does not require a training set. In the online phase, this algorithm acts greedily by keeping an item if it participates in the best solution thus far. Then, the algorithm computes an optimal matching among the retained items. The particular details of the algorithm are given in the supplementary material. We have the following guarantee for this greedy algorithm, proven in the supplementary material.
Theorem 1. Let δ ∈ (0, 1). The greedy algorithm outputs the optimal solution with probability at least 1 – δ and retains O(k log(min{k/δ, n/k})) items in expectation.

As we shall see in the next section, learning from the training set allows one to retain exponentially fewer items than is implied by the theorem above.7 It is then natural to ask to what extent the training phase is essential in order to accommodate such an improvement. We answer this question
We answer this question\nin The supplementary material by proving a lower bound on the number of retained items for any\nalgorithm that does not use a training phase. This lower bound already applies in the simple setting\nwhere d = 1: here, each item consists only of a value v \u2208 [0, 1], and the goal of the algorithm is to\nretain as few items as possible while guaranteeing with high probability that the top k maximal values\nare retained.\nTheorem 2. Let \u03b4 \u2208 (0, 1). For every algorithm A which retains the maximal k elements with\nprobability at least 1 \u2013 \u03b4, there exists a distribution \u00b5 such that the expected number of retained\nelements for input sequences v1 . . . vn \u223c \u00b5n is at least \u2126(k log(min{k/\u03b4, n/k})).\nThus, the above theorem implies that \u0398(k log(n/k)) can not be improved even if we allow failure\nprobability \u03b4 = \u0398(k2/n) (see Theorem 1).\n\n2.1.2 Online screening with learning\n\nWe now design online algorithms that, before the online screening process begins, use Ctrain to learn a\nthresholds-policy T \u2208 T such that with high probability: (i) the number of items that are retained in\nthe online phase is small, and (ii) there is a feasible solution consisting of k retained items whose\nvalue is optimal (or close to optimal). Thresholds-policies are studied in Section 3 and are de\ufb01ned as\nfollows.\nDe\ufb01nition 3 (Thresholds-policies). A threshold-policy is parametrized by a vector T = (t1, . . . , td)\nof thresholds, where ti corresponds to property Pi for 1 \u2264 i \u2264 d. The semantics of T is as follows:\ngiven a sample C of n items, each item c \u2208 C is retained if and only if there exists a property Pi\nsatis\ufb01ed by c, such that its value vi(c) passes the threshold ti. More formally, c is retained if and only\nif \u2203i \u2208 {1, . . . 
, d} such that c ∈ Pi and vi(c) ≥ ti.
Having proven uniform convergence results for thresholds-policies (see Section 3.1), we show the following in Section 4.
Theorem 4. There exists an algorithm that learns a thresholds-policy T from a single training sample Ctrain ∼ Dⁿ, such that after processing the ("real-time") input sample C ∼ Dⁿ using T:

• It outputs an optimal solution with probability at least 1 – δ.
• The expected number of retained items in the first phase is O(k(log d + log log(n/k) + log log(1/δ))).

Thus, with the additional information given by the training set, the algorithm presented in Theorem 4 improves the number of retained items from k log(k/δ) to k log log(1/δ). This demonstrates a significant improvement over Theorem 1.
Finally, in the supplementary material we prove that the algorithm from Theorem 4 is nearly optimal, in the sense that it is impossible to significantly improve the number of retained items even if we allow the algorithm to fully know the distribution over input items (so, in a sense, having access to n i.i.d. samples from the distribution is the same as knowing it completely).
Theorem 5. Consider the case where k = d and k1 = ··· = kd = 1. There exists a universe X and a fixed distribution D over X such that for C ∼ Dⁿ the following holds: any online learning algorithm (which possibly "knows" D) that retains a subset S ⊆ C of items that contains an optimal solution with probability at least 1 – δ must satisfy Ex[|S|] = Ω(k log log(1/δ)).

7 That is, the expected number of retained items is reduced from order of log n to log log n.

3 Thresholds-policies

We next discuss a framework to design algorithms that exploit the training set to learn policies that are applied in the first phase of the matching process.
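The retain rule of Definition 3 is a one-line predicate. A minimal sketch, using our own convention (not the paper's) of representing an item as a dict that maps each property it possesses to its value:

```python
def retains(T, item):
    """Definition 3: an item is retained iff it passes the threshold of
    at least one property it possesses.  `T[i]` is the threshold t_i and
    `item` maps property index i -> value v_i(item)."""
    return any(v >= T[i] for i, v in item.items())

T = [0.7, 0.5]                            # thresholds t_1, t_2
assert retains(T, {0: 0.9})               # passes t_1
assert retains(T, {0: 0.6, 1: 0.5})       # fails t_1 but passes t_2
assert not retains(T, {0: 0.6, 1: 0.4})   # passes neither threshold
```

In the online phase this predicate is evaluated once per arriving item, so screening with a learned thresholds-policy costs O(d) time per item.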
We would like to frame this in standard ML formalism by phrasing this problem as learning a class H of policies such that:

• H is not too small: the policies in H should yield solutions with high values (optimal, or near-optimal).
• H is not too large: H should satisfy some uniform convergence properties; i.e., the performance of each policy in H on the training set is close, with high probability, to its expected real-time performance on the sampled items during the online selection process.

Indeed, as we now show, these demands are met by the class T of thresholds-policies (Definition 3). We first show that the class of thresholds-policies contains an optimal policy, and in the sequel we show that it satisfies attractive uniform convergence properties.

An assumption (values are unique). We assume that for each constraint Pi, the marginal distribution over the value of c ∼ D conditioned on c ∈ Pi is atomless; namely, Pr_{c∼D}[vi(c) = v | c ∈ Pi] = 0 for every v ∈ [0, 1]. This assumption can be removed by adding artificial tie-breaking rules, but making it will simplify some of the technical statements.
Theorem 6 (There is a thresholds-policy that retains an optimal solution). For any set of items C, there exists a thresholds vector T ∈ T that retains exactly k items that form an optimal solution for C.
Proof. Let S denote the set of k items in an optimal solution for C, and let Si ⊆ S ∩ Pi be the subset of S that is assigned to the constraint Pi. Define ti = min_{c∈Si} vi(c) for 1 ≤ i ≤ d. Clearly, T retains all the items in S. Assume towards contradiction that T retains an item cj ∉ S, and let Pi be a constraint such that cj ∈ Pi and vi(cj) ≥ ti. Since by our assumption on D all the values vi(cj) are distinct, it follows that vi(cj) > ti. Thus, we can modify S by replacing cj with the item of minimum value in Si and increase the total value.
This contradicts the optimality of S.

We next establish generalization bounds for the class of thresholds-policies.

3.1 Uniform convergence of the number of retained items

For a sample C ∼ Dⁿ and a thresholds-policy T ∈ T, we denote by RT_i(C) = {c : c ∈ Pi and vi(c) ≥ ti} the set of items that are retained by the threshold ti, and we denote its expected size by ρT_i = Ex_{C∼Dⁿ}[|RT_i(C)|]. Similarly, we denote by RT(C) = ∪_i RT_i(C) the items retained by T, and by ρT its expectation. We prove that the sizes of RT_i(C) and RT(C) are concentrated around their expectations uniformly for all thresholds-policies.
The following theorems establish uniform convergence results for the number of retained items. Namely, with high probability we have RT_i ≈ ρT_i and RT ≈ ρT simultaneously for all T ∈ T and i ≤ d.
Theorem 7 (Uniform convergence of the number of retained items). With probability at least 1 – δ over C ∼ Dⁿ, the following holds for all policies T ∈ T simultaneously:

1. If ρT ≥ k, then (1 – ε)ρT ≤ |RT(C)| ≤ (1 + ε)ρT, and
2. if ρT < k, then ρT – εk ≤ |RT(C)| ≤ ρT + εk,

where ε = O(√((d log(d) log(n/k) + log(1/δ))/k)).

Theorem 8 (Uniform convergence of the number of retained items per constraint). With probability at least 1 – δ over C ∼ Dⁿ, the following holds for all policies T ∈ T and all i ≤ d + 1 simultaneously:

1. If ρT_i ≥ k, then (1 – ε)ρT_i ≤ |RT_i(C)| ≤ (1 + ε)ρT_i, and
2. if ρT_i < k, then ρT_i – εk ≤ |RT_i(C)| ≤ ρT_i + εk,

where ε = O(√((log(d) log(n/k) + log(1/δ))/k)).

The proofs of Theorems 7 and 8 are based on standard VC-based uniform convergence results, and technically the proof boils down to bounding the VC-dimension of the families

R = {RT : T ∈ T} and Q = {RT_i : T ∈ T, i ≤ d}.

Indeed, in the supplementary material we prove the following.
Lemma 9. VC(R) = O(d log d).
Lemma 10. VC(Q) = O(log d).
Using Lemmas 9 and 10, we can now apply standard uniform convergence results from VC-theory to derive Theorems 7 and 8.
Definition 11 (Relative (p, ε)-approximation; Har-Peled and Sharir, 2011). Let F be a family of subsets over a domain X, and let μ be a distribution on X. A set Z ⊆ X is a relative (p, ε)-approximation for F if for each f ∈ F we have:

1. If μ(f) ≥ p, then (1 – ε)μ(f) ≤ μ̂(f) ≤ (1 + ε)μ(f), and
2. if μ(f) < p, then μ(f) – εp ≤ μ̂(f) ≤ μ(f) + εp,

where μ̂(f) = |Z ∩ f|/|Z| is the ("empirical") measure of f with respect to Z.

The proofs of Theorems 7 and 8 now follow by plugging p = k/n into Har-Peled and Sharir [2011, Theorem 2.11], which we state in the next proposition.
Proposition 12 (Har-Peled and Sharir, 2011). Let F and μ be as in Definition 11. Suppose F has VC dimension m.
Then, with probability at least 1 – δ, a random sample of size

Ω((m log(1/p) + log(1/δ)) / (ε²p))

is a relative (p, ε)-approximation for F.

3.2 Uniform convergence of values

We now prove a concentration result for the value of an optimal solution among the retained items. Unlike the number of retained items, the value of an optimal solution corresponds to a more complex random variable, and analyzing the concentration of its empirical estimate requires more advanced techniques.
We denote by VT(C) the value of the optimal solution among the items retained by the thresholds-policy T, and we denote its expectation by νT = Ex_{C∼Dⁿ}[VT(C)]. We show that VT(C) is concentrated uniformly for all thresholds-policies.
Theorem 13 (Uniform convergence of values). With probability at least 1 – δ over C ∼ Dⁿ, the following holds for all policies T ∈ T simultaneously:

|νT – VT(C)| ≤ εk, where ε = O(√((d log k + log(1/δ))/k)).

Note that unlike most uniform convergence results that guarantee simultaneous convergence of empirical averages to expectations, here VT(C) is not an average of the n samples, but rather a more complicated function of them. We also note that a bound of Õ(√n) (rather than Õ(√k)) on the additive deviation of VT(C) from its expectation can be derived using McDiarmid's inequality [McDiarmid, 1989]. However, this bound is meaningless when √n > k (because k upper bounds the value of the optimal solution). We use Talagrand's concentration inequality [Talagrand, 1995] to derive the O(√k) upper bound on the additive deviation.
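The stability property that makes Talagrand's inequality effective here can be checked directly in the special case d = 1, where OPT(C) is just the sum of the top-k values. This is a toy illustration of ours, not code from the paper:

```python
def opt_top_k(values, k):
    """OPT(C) in the special case d = 1: the sum of the k largest values."""
    return sum(sorted(values, reverse=True)[:k])

values = [0.9, 0.8, 0.3, 0.2, 0.1]
k = 2
base = opt_top_k(values, k)
# Changing an item outside the optimal solution leaves OPT unchanged,
# i.e., g_i(C) = 0 for items not in S:
outside = values.copy()
outside[4] = 0.0
assert opt_top_k(outside, k) == base
# Changing an item inside the solution moves OPT by at most its value
# (values lie in [0, 1]), i.e., g_i(C) <= 1 for items in S:
inside = values.copy()
inside[0] = 0.0
assert base - opt_top_k(inside, k) <= values[0]
```

Only the k items of the solution contribute to the sum ∑ g_i², which is what yields the exp(–α²/2k) tail rather than a bound degrading with n.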
Talagrand's concentration inequality allows us to utilize the fact that an optimal solution uses only k ≪ n items, and therefore replacing an item that does not participate in the solution does not affect its value.
To prove the theorem we need the following concentration inequality for the value of the optimal selection in hindsight. Note that by Theorem 6 this value equals VT(C) for some T.

Lemma 14. Let OPT(C) denote the value of the optimal solution for a sample C. We have that

Pr_{C∼Dⁿ}[|OPT(C) – Ex[OPT(C)]| ≥ α] ≤ 2 exp(–α²/2k).

So, for example, |OPT(C) – Ex[OPT(C)]| ≤ √(2k log(2/δ)) with probability at least 1 – δ.
To prove this lemma we use the following version of Talagrand's inequality (that appears for example in lecture notes by van Handel [2014]).
Proposition 15 (Talagrand's Concentration Inequality). Let f : Rⁿ → R be a function, and suppose that there exist g1, . . . , gn : Rⁿ → R such that for any x, y ∈ Rⁿ

f(x) – f(y) ≤ ∑_{i=1}^n g_i(x) 1[x_i ≠ y_i].   (1)

Then, for independent random variables X = (X1, . . . , Xn) we have

Pr[|f(X) – Ex[f(X)]| > α] ≤ 2 exp(–α² / (2 sup_x ∑_{i=1}^n g_i(x)²)).

Proof of Lemma 14. We apply Talagrand's concentration inequality to the random variable OPT(C). Our Xi's are the items c1, . . . , cn in the order that they are given. We show that Eq. (1) holds for g_i(C) = 1[c_i ∈ S], where S = S(C) is a fixed optimal solution for C (we use some arbitrary tie breaking among optimal solutions). We then have ∑_{i=1}^n g_i(C)² = |S| = k, thus completing the proof.

Now, let C, C′ be two samples of n items.
Recall that we need to show that

OPT(C) – OPT(C′) ≤ ∑_{i=1}^n g_i(C) 1[c_i ≠ c′_i].

We use S to construct a solution S′ for C′ as follows. Let Sj ⊆ S be the subset of S matched to Pj. For each i, if c_i ∈ Sj for some j, and c_i = c′_i, then we add i to S′_j. Otherwise, we add a dummy item from C′_dummy to S′_j (with value zero). Let V(S′) denote the value of S′. Note that the difference between the values of S and S′ is the total value of all items i ∈ S such that c_i ≠ c′_i. Since the item values are bounded in [0, 1] we get that

OPT(C) – V(S′) = ∑_{j=1}^d ∑_{c_i∈Sj} v_j(c_i) 1[c_i ≠ c′_i] ≤ ∑_{j=1}^d ∑_{c_i∈Sj} 1[c_i ≠ c′_i] = ∑_{i=1}^n g_i(C) 1[c_i ≠ c′_i].

The proof is complete by noticing that OPT(C′) ≥ V(S′).
We also require the following construction of a bracketing of T, which is formally presented in the supplementary material.
Lemma 16. There exists a collection N of thresholds-policies with |N| ≤ k^{O(d)}, such that for every thresholds-policy T ∈ T there are T+, T– ∈ N such that

1. VT–(C) ≤ VT(C) ≤ VT+(C) for every sample of items C; note that by taking expectations this implies that νT– ≤ νT ≤ νT+, and
2. νT+ – νT– ≤ 10.

Proof of Theorem 13. The items in C that are retained by T are independent samples from a distribution D′ that is sampled as follows: (i) sample c ∼ D, and (ii) if c is retained by T then keep it, and otherwise discard it. This means that VT(C) is in fact the optimal solution of C with respect to D′.
Since Lemma 14 applies to every distribution D, we can apply it to D′ and get that for any fixed T ∈ T,

Pr_{C∼Dⁿ}[|νT – VT(C)| ≥ α] ≤ 2 exp(–α²/2k).

Now, by the union bound, for N as in Lemma 16 we get that the probability that there is T ∈ N such that |νT – VT(C)| ≥ α is at most |N| · 2 exp(–α²/2k). Thus, since |N| ≤ k^{O(d)}, it follows that with probability at least 1 – δ,

(∀T ∈ N) : |νT – VT(C)| ≤ O(√(k(d log k + log(1/δ)))).   (2)

We now show why uniform convergence for N implies uniform convergence for T. Combining Lemma 16 with Equation (2), we get that with probability at least 1 – δ, every T ∈ T satisfies:

|νT – VT(C)| ≤ max{|νT+ – VT–(C)|, |νT– – VT+(C)|}   (by Item 1 of Lemma 16)
≤ max{|νT– – VT–(C)|, |νT+ – VT+(C)|} + 10   (by Item 2 of Lemma 16)
≤ 10 + O(√(k(d log k + log(1/δ)))).   (by Eq.
(2))\n\nHere the \ufb01rst inequality follows from Item 1 by noticing that if [a, b], [c, d] are intervals on the real\nline and x \u2208 [a, b], y \u2208 [c, d] then |x \u2013 y| \u2264 max{|b \u2013 c|, |d \u2013 a|}, and plugging in x = \u03bdT, y = VT(C), a =\n\u03bdT\u2013, b = \u03bdT+, c = VT\u2013(C), d = VT+(C).\n\nThis \ufb01nishes the proof, by setting \u03b5 such that \u03b5 \u00b7 k = O(cid:0)(cid:112)k(d log k + log(1/\u03b4))(cid:1).\n\n4 Algorithms based on learning thresholds-policies\n\nWe next exemplify how one can use the above properties of thresholds-policies to design algorithms.\nA natural algorithm would be to use the training set to learn a threshold-policy T that retains an\noptimal solution with k items from the training set as speci\ufb01ed in Theorem 6, and then use this online\npolicy to retain a subset of the n items in the \ufb01rst phase. Theorem 7 and Theorem 13 imply that with\n\nprobability 1 \u2013 \u03b4, the number of retained items is at most m = k + O(cid:0)(cid:112)kd log(d) log(n/k) + k log(1/\u03b4)(cid:1)\nand that the value of the resulting solution is at least OPT \u2013 O(cid:0)(cid:112)kd log k + k log(1/\u03b4)(cid:1).\n\n(cid:16)\n\n(cid:17)\n\nm\nk\n\nWe can improve this algorithm by combining it with the greedy algorithm of Theorem 1 described in\nthe supplementary material. During the \ufb01rst phase, we retain an item c only if (i) c is retained by T,\nand (ii) c participates in the optimal solution among the items that were retained thus far. Theorem 1\nthen implies that out of these m items greedy keeps a subset of\n\n(cid:18)\n\n(cid:18)\n\n(cid:16) n\n\n(cid:17)\n\n(cid:18)1\n\n(cid:19)(cid:19)(cid:19)\n\n.\n\nk\n\nk\n\n\u03b4\n\nO\n\n= O\n\nk log\n\nlog log\n\n+ log log\n\nWe can further improve the value of the solution and guarantee that it will be optimal (with respect\nto all n items) with probability 1 \u2013 \u03b4. 
The improvement is based on the observation that if the set of retained items contains the top k items of each property P_i, then it also contains an optimal solution. Thus, we can compute a thresholds-policy T that retains the top k + O(√(k log(d) log(n/k) + k log(1/δ))) items of each property from the training set (if the training set does not have this many items with some property, then set the corresponding threshold to 0). Then it follows from Theorem 8 that, with probability 1 − δ, T will retain the top k items of each property in the first online phase, and will therefore retain an optimal solution. Moreover, Theorem 8 implies that with probability 1 − δ, the total number of items that are retained by T in real-time is at most m = dk + O(d√(k log(d) log(n/k) + k log(1/δ))). By filtering the retained elements with the greedy algorithm of Theorem 1 as before, it follows that the total number of retained items is at most

k + k log(m/k) = O(k(log d + log log(n/k) + log log(1/δ)))

with probability 1 − δ. This proves Theorem 4.

Acknowledgements

We thank an anonymous reviewer for their remarks regarding a previous version of this manuscript. Their remarks and questions eventually led us to proving Theorem 2.