{"title": "On preserving non-discrimination when combining expert advice", "book": "Advances in Neural Information Processing Systems", "page_first": 8376, "page_last": 8387, "abstract": "We study the interplay between sequential decision making and avoiding discrimination against protected groups, when examples arrive online and do not follow distributional assumptions. We consider the most basic extension of classical online learning: Given a class of predictors that are individually non-discriminatory with respect to a particular metric, how can we combine them to perform as well as the best predictor, while preserving non-discrimination? Surprisingly we show that this task is unachievable for the prevalent notion of \"equalized odds\" that requires equal false negative rates and equal false positive rates across groups. On the positive side, for another notion of non-discrimination, \"equalized error rates\", we show that running separate instances of the classical multiplicative weights algorithm for each group achieves this guarantee. Interestingly, even for this notion, we show that algorithms with stronger performance guarantees than  multiplicative weights cannot preserve non-discrimination.", "full_text": "On preserving non-discrimination when combining\n\nexpert advice\n\nAvrim Blum\nTTI-Chicago\n\navrim@ttic.edu\n\nSuriya Gunasekar\n\nTTI-Chicago\n\nsuriya@ttic.edu\n\nThodoris Lykouris\nCornell University\n\nteddlyk@cs.cornell.edu\n\nNathan Srebro\nTTI-Chicago\n\nnati@ttic.edu\n\nAbstract\n\nWe study the interplay between sequential decision making and avoiding discrimi-\nnation against protected groups, when examples arrive online and do not follow\ndistributional assumptions. We consider the most basic extension of classical online\nlearning: Given a class of predictors that are individually non-discriminatory with\nrespect to a particular metric, how can we combine them to perform as well as the\nbest predictor, while preserving non-discrimination? 
Surprisingly, we show that this\ntask is unachievable for the prevalent notion of equalized odds that requires equal\nfalse negative rates and equal false positive rates across groups. On the positive\nside, for another notion of non-discrimination, equalized error rates, we show that\nrunning separate instances of the classical multiplicative weights algorithm for\neach group achieves this guarantee. Interestingly, even for this notion, we show\nthat algorithms with stronger performance guarantees than multiplicative weights\ncannot preserve non-discrimination.\n\n1 Introduction\n\nThe emergence of machine learning in the last decade has given rise to an important debate regarding\nthe ethical and societal responsibility of its offspring. Machine learning has provided a universal\ntoolbox enhancing the decision making in many disciplines from advertising and recommender\nsystems to education and criminal justice. Unfortunately, both the data and their processing can be\nbiased against specific population groups (even inadvertently) in every single step of the process [4].\nThis has generated societal and policy interest in understanding the sources of this discrimination and\ninterdisciplinary research has attempted to mitigate its shortcomings.\nDiscrimination is commonly an issue in applications where decisions need to be made sequentially.\nThe most prominent such application is online advertising where platforms need to sequentially\nselect which ad to display in response to particular query searches. This process can introduce\ndiscrimination against protected groups in many ways such as filtering particular alternatives [12, 2]\nand reinforcing existing stereotypes through search results [38, 25]. Another canonical example\nof sequential decision making is medical trials where underexploration on female groups often\nleads to significantly worse treatments for them [31]. 
Similar issues occur in image classi\ufb01cation\nas stressed by \u201cgender shades\u201d [7]. The reverse (overexploration in minority populations) can also\ncause concerns especially if conducted in a non-transparent fashion [5].\nIn these sequential settings, the assumption that data are i.i.d. is often violated. Online advertising,\nrecommender systems, medical trials, image classi\ufb01cation, loan decisions, criminal recidivism all\nrequire decisions to be made sequentially. The corresponding labels are not identical across time and\ncan be affected by the economy, recent events, etc. Similarly labels are also not independent across\nrounds \u2013 if a bank offers a loan then this decision can affect whether the loanee or their environment\nwill be able to repay future loans thereby affecting future labels as discussed by Liu et al. [32]. As a\nresult, it is important to understand the effect of this adaptivity on non-discrimination.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fThe classical way to model settings that are not i.i.d. is via adversarial online learning [30, 17],\nwhich poses the question: Given a class F of predictors, how can we make online predictions that\nperform as well as the best predictor from F in hindsight? The most basic online learning question\n(answered via the celebrated \u201cmultiplicative weights\u201d algorithm) concerns competing with a \ufb01nite set\nof predictors. 
The class F is typically referred to as \u201cexperts\u201d and can be thought of as \u201cfeatures\u201d of the\nexample where we want to make online predictions that compete with the best 1-sparse predictor.\nIn this work, we wish to understand the interplay between adaptivity and non-discrimination and\ntherefore consider the most basic extension of the classical online learning question:\n\nGiven a class of individually non-discriminatory predictors, how can we combine\nthem to perform as well as the best predictor, while preserving non-discrimination?\n\nThe assumption that predictors are individually non-discriminatory is a strong assumption on the\npredictors and makes the task trivial in the batch setting where the algorithm is given labeled examples\nand wishes to perform well on unseen examples drawn from the same distribution. This happens\nbecause the algorithm can learn the best predictor from the labeled examples and then follow it (since\nthis predictor is individually non-discriminatory, the algorithm does not exhibit discrimination). This\nenables us to understand the potential overhead that adaptivity is causing and significantly strengthens\nany impossibility result. Moreover, we can assume that predictors have been individually vetted\nto satisfy the non-discrimination desiderata \u2013 we therefore wish to understand how to efficiently\ncompose these non-discriminatory predictors while preserving non-discrimination.\n\n1.1 Our contribution\n\nOur impossibility results for equalized odds. Surprisingly, we show that for a prevalent notion\nof non-discrimination, equalized odds, it is impossible to preserve non-discrimination while also\ncompeting comparably to the best predictor in hindsight (no-regret property). Equalized odds,\nsuggested by Hardt et al. 
[20] in the batch setting, restricts the set of allowed predictors requiring that,\nwhen examples come from different groups, the prediction is independent of the group conditioned on\nthe label. In binary classification, this means that the false negative rate (fraction of positive examples\npredicted negative) is equal across groups and the same holds for the false positive rate (defined\nanalogously). This notion was popularized by a recent debate on potential bias of machine learning\nrisk tools for criminal recidivism [1, 10, 28, 16].\nOur impossibility results demonstrate that the order in which examples arrive significantly complicates\nthe task of achieving desired efficiency while preserving non-discrimination with respect\nto equalized odds. In particular, we show that any algorithm agnostic to the group identity either\ncannot achieve performance comparable to the best predictor or exhibits discrimination in some\ninstances (Theorem 1). This occurs in phenomenally simple settings with only two individually\nnon-discriminatory predictors, two groups, and perfectly balanced instances: groups are of equal\nsize and each receives an equal number of positive and negative labels. The only imbalance exists in\nthe order in which the labels arrive which is also relatively well behaved \u2013 labels are generated from\ntwo i.i.d. distributions, one in the first half of the instance and one in the second half. Although\nin many settings we cannot actively use the group identity of the examples due to legal reasons\n(e.g., in hiring), one may wonder whether these impossibility results disappear if we can actively\nuse the group information to compensate for past mistakes. We show that this is also not the case\n(Theorem 2). Although our groups are not perfectly balanced, the construction is again very simple\nand consists only of two groups and two predictors: one always predicting positive and one always\npredicting negative. 
The simplicity of the settings, combined with the very strong assumption on\nthe predictors being individually non-discriminatory speaks to the trade-off between adaptivity and\nnon-discrimination with respect to equalized odds.\n\nOur results for equalized error rates. The strong impossibility results with respect to equalized\nodds invite the natural question of whether there exists some alternative fairness notion that, given\naccess to non-discriminatory predictors, achieves efficiency while preserving non-discrimination. We\nanswer the above positively by suggesting the notion of equalized error rates, which requires that the\naverage expected loss (regardless of whether it stems from false positives or false negatives) encountered\nby each group should be the same. This notion makes sense in settings where performance and\nfairness are measured with respect to the same objective. Consider a medical application where people\nfrom different subpopulations wish to receive appropriate treatment and any error in treatment costs\nequally both towards performance and towards fairness.1 It is morally objectionable to discriminate\nagainst one group, e.g. based on race, using it as experimentation to enhance the quality of service of\nthe other, and it is reasonable to require that all subpopulations receive the same quality of service.\nFor this notion, we show that, if all predictors are individually non-discriminatory with respect to\nequalized error rates, running separate multiplicative weights algorithms, one for each subpopulation,\npreserves this non-discrimination without decay in the efficiency (Theorem 3). 
The key property we\nuse is that the multiplicative weights algorithm guarantees to perform not only no worse than the best\npredictor in hindsight but also no better; this property holds for a broader class of algorithms [14].\nOur result applies to general loss functions beyond binary predictions and only requires predictors to\nsatisfy the weakened assumption of being approximately non-discriminatory.\nFinally, we examine whether the decisions of running separate algorithms and running this particular\nnot so ef\ufb01cient algorithm were important for the result. We \ufb01rst give evidence that running separate\nalgorithms is essential for the result; if we run a single instance of \u201cmultiplicative weights\u201d or \u201cfollow\nthe perturbed leader\u201d, we cannot guarantee non-discrimination with respect to equalized error rates\n(Theorem 4). We then suggest that the property of not performing better than the best predictor is\nalso crucial; in particular, better algorithms that satisfy the stronger guarantee of low shifting regret\n[21, 6, 34] are also not able to guarantee this non-discrimination (Theorem 5). These algorithms\nare considered superior to classical no-regret algorithms as they can better adapt to changes in the\nenvironment, which has nice implications in game-theoretic settings [35]. Our latter impossibility\nresult is a \ufb01rst application where having these strong guarantees against changing benchmarks is not\nnecessarily desired and therefore is of independent learning-theoretic interest.\n\n1.2 Related work\n\nThere is a large line of work on fairness and non-discrimination in machine learning (see [36, 8,\n13, 41, 22, 20, 10, 28, 26] for a non-exhaustive list). 
We elaborate on works that either study group\nnotions of fairness or fairness in online learning.\nThe last decade has seen a lot of work on group notions of fairness, mostly in classi\ufb01cation setting.\nExamples include notions that compare the percentage of members predicted positive such as\ndemographic parity [8], disparate impact [15], equalized odds [20] and calibration across groups\n[10, 28]. There is no consensus on a universal fairness notion; rather the speci\ufb01c notion considered is\nlargely task-speci\ufb01c. In fact, previous works identi\ufb01ed that these notions are often not compatible to\neach other [10, 28], posed concerns that they may introduce unintentional discrimination [11], and\nsuggested the need to go beyond such observational criteria via causal reasoning [27, 29]. Prior to our\nwork, group fairness notions have been studied primarily in the batch learning setting with the goal of\noptimizing a loss function subject to a fairness constraint either in a post-hoc correction framework\nas proposed by Hardt et al. [20] or more directly during training from batch data [41, 19, 39, 40, 3]\nwhich requires care due to the predictors being discriminatory with respect to the particular metric\nof interest. The setting we focus on in this paper does not have the challenges of the above since\nall predictors are non-discriminatory; however, we obtain surprising impossibility results due to the\nordering in which labels arrive.\nRecently fairness in online learning has also started receiving attention. One line of work focuses\non imposing a particular fairness guarantee at all times for bandits and contextual bandits, either for\nindividual fairness [22, 23] or for group fairness [9]. Another line of work points to counterintuitive\nexternalities of using contextual bandit algorithms agnostic to the group identity and suggest that\nheterogeneity in data can replace the need for exploration [37, 24]. 
Moreover, following a seminal\npaper by Dwork et al. [13], a line of work aims to treat similar people similarly in online settings\n[33, 18]. Our work distinguishes itself from these directions mainly in the objective, since we require\nthe non-discrimination to happen in the long-term instead of at any time; this extends the classical\nbatch definitions of non-discrimination to the online setting. In particular, we only focus on situations\nwhere there are enough samples from each population of interest and we do not penalize the algorithm\nfor a few wrong decisions, which would lead it to be overly pessimistic. Another difference is that previous\nwork focuses either on individual notions of fairness or on i.i.d. inputs, while our work is about\nnon-i.i.d. inputs in group notions of fairness.\n\n1 In contrast, in equalized odds, a misprediction only counts against the false negative metric if the label is positive.\n\n2 Model\n\nOnline learning protocol with group context. We consider the classical online learning setting\nof prediction with expert advice, where a learner needs to make sequential decisions for T rounds by\ncombining the predictions of a finite set F of d hypotheses (also referred to as experts). We denote\nthe outcome space by Y; in binary classification, this corresponds to Y = {+,\u2212}. Additionally, we\nintroduce a set of disjoint groups G, which identifies subsets of the population based on a protected\nattribute (such as gender, ethnicity, or income).\nThe online learning protocol with group context proceeds in T rounds. Each round t is associated\nwith a group context g(t) \u2208 G and an outcome y(t) \u2208 Y. We denote the resulting T-length\ntime-group-outcome sequence tuple by \u03c3 = {(t, g(t), y(t)) \u2208 N \u00d7 G \u00d7 Y}_{t=1}^{T}. This is a random variable\nthat can depend on the randomness in the generation of the groups and the outcomes. 
We use the\nshorthand \u03c3_{1:\u03c4} = {(t, g(t), y(t)) \u2208 N \u00d7 G \u00d7 Y}_{t=1}^{\u03c4} to denote the subsequence until round \u03c4. The\nexact protocol for generating these sequences is described below. At round t = 1, 2, . . . , T:\n\n1. An example with group context g(t) \u2208 G arrives stochastically or is adversarially selected.\n2. The learning algorithm or learner L commits to a probability distribution p^t \u2208 \u2206(d) across\nexperts where p^t_f denotes the probability that she follows the advice of expert f \u2208 F at\nround t. This distribution p^t can be a function of the sequence \u03c3_{1:t\u22121}. We call the learner\ngroup-unaware if she ignores the group context g(\u03c4) for all \u03c4 \u2264 t when selecting p^t.\n3. An adversary A then selects an outcome y(t) \u2208 Y. The adversary is called adaptive if the\ngroups/outcomes at round t = \u03c4 + 1 are a function of the realization of \u03c3_{1:\u03c4}; otherwise\nshe is called oblivious. The adversary always has access to the learning algorithm, but an\nadaptive adversary additionally has access to the realized \u03c3_{1:t\u22121} and hence also knows p^t.\nSimultaneously, each expert f \u2208 F makes a prediction \u0177^t_f \u2208 \u0176, where \u0176 is a generic\nprediction space; for example, in binary classification, the prediction space could simply be\nthe positive or negative labels: \u0176 = {+,\u2212}, or the probabilistic score: \u0176 = [0, 1] with \u0177^t_f\ninterpreted as the probability the expert f \u2208 F assigns to the positive label in round t, or\neven an uncalibrated score like the output of a support vector machine: \u0176 = R.\nLet \u2113 : \u0176 \u00d7 Y \u2192 [0, 1] be the loss function between predictions and outcomes. 
This leads to\na corresponding loss vector \u2113^t \u2208 [0, 1]^d where \u2113^t_f = \u2113(\u0177^t_f, y(t)) denotes the loss the learner\nincurs if she follows expert f \u2208 F.\n4. The learner then observes the entire loss vector \u2113^t (full information feedback) and incurs\nexpected loss \u2211_{f\u2208F} p^t_f \u2113^t_f. For classification, this feedback is obtained by observing y(t).\nIn this paper, we consider a setting where all the experts f \u2208 F are fair in isolation (formalized\nbelow). Regarding the group contexts, our main impossibility results (Theorems 1 and 2) assume\nthat the group contexts g(t) arrive stochastically from a fixed distribution, while our positive result\n(Theorem 3) holds even when they are adversarially selected.\nFor simplicity of notation, we assume throughout the presentation that the learner\u2019s algorithm is\nproducing the distribution p^t of round t = \u03c4 + 1 deterministically based on \u03c3_{1:\u03c4} and therefore all our\nexpectations are taken only over \u03c3 which is the case in most algorithms. Our results extend when the\nalgorithm uses extra randomness to select the distribution.\n\nGroup fairness in online learning. We now define non-discrimination (or fairness) with respect\nto a particular evaluation metric M, e.g. in classification, the false negative rate metric (FNR) is the\nfraction of examples with positive outcome predicted negative incorrectly. For any realization of the\ntime-group-outcome sequence \u03c3 and any group g \u2208 G, metric M induces a subset of the population\nS^\u03c3_g(M) that is relevant to it. For example, in classification, S^\u03c3_g(FNR) = {t : g(t) = g, y(t) = +}\nis the set of positive examples of group g. The performance of expert f \u2208 F on the subpopulation\nS^\u03c3_g(M) is denoted by M^\u03c3_f(g) = (1/|S^\u03c3_g(M)|) \u2211_{t\u2208S^\u03c3_g(M)} \u2113^t_f.\nDefinition 1. An expert f \u2208 F is called fair in isolation with respect to metric M if, for every\nsequence \u03c3, her performance with respect to M is the same across groups, i.e. M^\u03c3_f(g) = M^\u03c3_f(g')\nfor all g, g' \u2208 G.\n\nThe learner\u2019s performance on this subpopulation is M^\u03c3_L(g) = (1/|S^\u03c3_g(M)|) \u2211_{t\u2208S^\u03c3_g(M)} \u2211_{f\u2208F} p^t_f \u2113^t_f. To\nformalize our non-discrimination desiderata, we require the algorithm to have similar expected\nperformance across groups, when given access to fair in isolation predictors. We make the following\nassumptions to avoid trivial impossibility results due to low-probability events or underrepresented\npopulations. First, we take expectation over sequences generated by the adversary A (that has access\nto the learning algorithm L). Second, we require the relevant subpopulations to be, in expectation,\nlarge enough. Our positive results do not depend on either of these assumptions. More formally:\nDefinition 2. Consider a set of experts F such that each expert is fair in isolation with respect to\nmetric M. Learner L is called \u03b1-fair in composition with respect to metric M if, for all adversaries\nthat produce E_\u03c3[min(|S^\u03c3_g(M)|, |S^\u03c3_{g'}(M)|)] = \u2126(T) for all g, g', it holds that:\n\n|E_\u03c3[M^\u03c3_L(g)] \u2212 E_\u03c3[M^\u03c3_L(g')]| \u2264 \u03b1.\n\nWe note that, in many settings, we wish to have non-discrimination with respect to multiple metrics\nsimultaneously. For instance, equalized odds requires fairness in composition both with respect to\nfalse negative rate and with respect to false positive rate (defined analogously). Since we provide an\nimpossibility result for equalized odds, focusing on only one metric makes the result even stronger.\n\nRegret notions. 
The typical way to evaluate the performance of an algorithm in online learning is\nvia the notion of regret. Regret compares the performance of the algorithm to the performance of\nthe best expert in hindsight on the realized sequence \u03c3 as defined below:\n\nReg_T = \u2211_{t=1}^{T} \u2211_{f\u2208F} p^t_f \u2113^t_f \u2212 min_{f*\u2208F} \u2211_{t=1}^{T} \u2113^t_{f*}.\n\nIn the above definition, regret is a random variable depending on the sequence \u03c3, and therefore\non the randomness in groups/outcomes.\nAn algorithm satisfies the no-regret property (or Hannan consistency) in our setting if for any losses\nrealizable by the above protocol, the regret is sublinear in the time horizon T, i.e. Reg_T = o(T). This\nproperty ensures that, as time goes by, the average regret vanishes. Many online learning algorithms,\nsuch as multiplicative weights updates, satisfy this property with Reg_T = O(\u221a(T log(d))).\n\nWe focus on the notion of approximate regret, which is a relaxation of regret that gives a small\nmultiplicative slack to the algorithm. More formally, \u03b5-approximate regret with respect to expert\nf* \u2208 F is defined as:\n\nApxReg_{\u03b5,T}(f*) = \u2211_{t=1}^{T} \u2211_{f\u2208F} p^t_f \u2113^t_f \u2212 (1 + \u03b5) \u2211_{t=1}^{T} \u2113^t_{f*}.\n\nWe note that typical algorithms guarantee ApxReg_{\u03b5,T}(f*) = O(ln(d)/\u03b5) simultaneously for all experts\nf* \u2208 F. When the time-horizon is known in advance, by setting \u03b5 = \u221a(ln(d)/T), such a bound implies\nthe aforementioned regret guarantee. 
In the case when the time horizon is not known, one can also\nobtain a similar guarantee by adjusting the learning rate of the algorithm appropriately.\nOur goal is to develop online learning algorithms that combine fair in isolation experts in order to\nachieve both vanishing average expected \u03b5-approximate regret, i.e. for any fixed \u03b5 > 0 and f* \u2208 F,\nE_\u03c3[ApxReg_{\u03b5,T}(f*)] = o(T), and also non-discrimination with respect to fairness metrics of interest.\n\n3 Impossibility results for equalized odds\n\nIn this section, we study a popular group fairness notion, equalized odds, in the context of online\nlearning. A natural extension of equalized odds for online settings would require that the false\nnegative rate, i.e. percentage of positive examples predicted incorrectly, is the same across all groups\nand the same also holds for the false positive rate. We assume that our experts are fair in isolation\nwith respect to both false negative as well as false positive rate. A weaker notion of equalized odds is\nequality of opportunity where the non-discrimination condition is required to be satisfied only for\nthe false negative rate. We first study whether it is possible to achieve the vanishing regret property\nwhile guaranteeing \u03b1-fairness in composition with respect to false negative rate for arbitrarily small\n\u03b1. When the input is i.i.d., this is trivial as we can learn the best expert in O(log d) rounds and then\nfollow its advice; since the expert is fair in isolation, this will guarantee vanishing discrimination.\nIn contrast, we show that, in a non-i.i.d. online setting, this goal is unachievable. We demonstrate this\nin phenomenally benign settings where there are just two groups G = {A, B} that come from a fixed\ndistribution and just two experts that are fair in isolation (with respect to false negative rate) even\nper round \u2013 not only ex post. 
Our first construction (Theorem 1) shows that any no-regret learning\nalgorithm that is group-unaware cannot guarantee fairness in composition, even in instances that\nare perfectly balanced (each pair of label and group gets 1/4 of the examples) \u2013 the only adversarial\ncomponent is the order in which these examples arrive. This is surprising because such a task is\nstraightforward in the stochastic setting as all hypotheses are non-discriminatory. We then study\nwhether actively using the group identity can correct the aforementioned unfairness, similarly to how it enables\ncorrection against discriminatory predictors [20]. The answer is negative even in this scenario\n(Theorem 2): if the population is sufficiently unbalanced, any no-regret learning algorithm will be\nunfair in composition with respect to false negative rate even if it is group-aware.\n\nGroup-unaware algorithms. We first present the impossibility result for group-unaware algorithms.\nIn our construction, the adversary is oblivious, there is perfect balance in groups (half of the\npopulation corresponds to each group), and perfect balance within group (half of the labels of each\ngroup are positive and half negative).\nTheorem 1. For all \u03b1 < 3/8, there exists \u03b5 > 0 such that any group-unaware algorithm that satisfies\nE_\u03c3[ApxReg_{\u03b5,T}(f)] = o(T) for all f \u2208 F is \u03b1-unfair in composition with respect to false negative\nrate even for perfectly balanced sequences.\nProof sketch. Consider an instance that consists of two groups G = {A, B}, two experts F =\n{h_n, h_u}, and two phases: Phase I and Phase II. Group A is the group we end up discriminating\nagainst while group B is boosted by the discrimination with respect to false negative rate. 
At each\nround t the groups arrive stochastically with probability 1/2 each, independent of \u03c3_{1:t\u22121}.\nThe experts output a score value in \u0176 = [0, 1], where score \u0177^t_f \u2208 \u0176 can be interpreted as the\nprobability that expert f assigns to the label being positive in round t, i.e. y(t) = +. The loss function is\nthe expected probability of error given by \u2113(\u0177, y) = \u0177 \u00b7 1{y = \u2212} + (1 \u2212 \u0177) \u00b7 1{y = +}. The two\nexperts are very simple: h_n always predicts negative, i.e. \u0177^t_{h_n} = 0 for all t, and h_u is an unbiased\nexpert who, irrespective of the group or the label, makes an inaccurate prediction with probability\n\u03b2 = 1/4 + \u221a\u03b5, i.e. \u0177^t_{h_u} = \u03b2 \u00b7 1{y(t) = \u2212} + (1 \u2212 \u03b2) \u00b7 1{y(t) = +} for all t. Both experts are fair\nin isolation with respect to both false negative and false positive rates: FNR is 100% for h_n and \u03b2 for\nh_u regardless of the group, and FPR is 0% for h_n and \u03b2 for h_u, again independent of the group. The\ninstance proceeds in two phases:\n\n1. Phase I lasts for T/2 rounds. The adversary assigns negative labels on examples with group\ncontext B and assigns a label uniformly at random to examples from group A.\n\n2. In Phase II, there are two plausible worlds:\n\n(a) if the expected probability the algorithm assigns to expert h_u in Phase I is large, i.e.\nE_\u03c3[\u2211_{t=1}^{T/2} p^t_{h_u}] > \u221a\u03b5 \u00b7 T, then the adversary assigns negative labels for both groups\n(b) else the adversary assigns positive labels to examples with group context B while\nexamples from group A keep receiving positive and negative labels with probability\nequal to half.\n\nWe will show that for any algorithm with vanishing approximate regret property, i.e. 
with\nApxReg_{\u03b5,T}(f) = o(T), the condition for the first world is never triggered and hence the\nabove sequence is indeed balanced.\n\nWe now show why this instance is unfair in composition with respect to false negative rate. The proof\ninvolves showing the following two claims, whose proofs we defer to the supplementary material.\n\n1. In Phase I, any \u03b5-approximate regret algorithm needs to select the negative expert h_n most\nof the time to ensure small approximate regret with respect to h_n. This means that, in Phase\nI (where we encounter half of the positive examples from group A and none from group B),\nthe false negative rate of the algorithm is close to 1.\n\n2. In Phase II, any \u03b5-approximate regret algorithm should quickly catch up to ensure small\napproximate regret with respect to h_u and hence the false negative rate of the algorithm\nis closer to \u03b2. Since the algorithm is group-unaware, this creates a mismatch between the\nfalse negative rate of B (that only receives false negatives in this phase) and A (that has also\nreceived many false negatives before).\n\nGroup-aware algorithms. We now turn our attention to group-aware algorithms, which can use\nthe group context of the example to select the probability of each expert, and provide a similar\nimpossibility result. There are three changes compared to the impossibility result we provided for\ngroup-unaware algorithms. First, the adversary is not oblivious but instead is adaptive. Second, we\ndo not have perfect balance across populations but instead require that the minority population arrives\nwith probability b < 0.49, while the majority population arrives with probability 1 \u2212 b. 
Third, the\nlabels are not equally distributed across positive and negative for each population but instead positive\nlabels for one group are at least a c fraction of the total examples of the group for a small c > 0.\nAlthough the upper bounds on b and c are not optimized, our impossibility result cannot extend to\nb = c = 1/2. Understanding whether one can achieve fairness in composition for some values of b\nand c is an interesting open question. Our impossibility guarantee is the following:\nTheorem 2. For any group imbalance b < 0.49 and 0 < \u03b1 < (0.49 \u2212 0.99b)/(1 \u2212 b), there exists \u03b5_0 > 0 such\nthat for all 0 < \u03b5 < \u03b5_0 any algorithm that satisfies E_\u03c3[ApxReg_{\u03b5,T}(f)] = o(T) for all f \u2208 F, is\n\u03b1-unfair in composition.\nProof sketch. The instance has two groups: G = {A, B}. Examples with group context A are\ndiscriminated against and arrive randomly with probability b < 1/2 while examples with group\ncontext B are boosted by the discrimination and arrive with the remaining probability 1 \u2212 b. There\nare again two experts F = {h_n, h_p}, which output score values in \u0176 = [0, 1], where \u0177^t_f can be\ninterpreted as the probability that expert f assigns to the label being + in round t. We use the earlier loss\nfunction \u2113(\u0177, y) = \u0177 \u00b7 1{y = \u2212} + (1 \u2212 \u0177) \u00b7 1{y = +}. The first expert h_n is again pessimistic\nand always predicts negative, i.e. \u0177^t_{h_n} = 0, while the other expert h_p is optimistic and always predicts\npositive, i.e. \u0177^t_{h_p} = 1. These satisfy fairness in isolation with respect to equalized odds (false negative\nrate and false positive rate). Let c = 1/101^2 denote the fraction of the input that is about positive\nexamples for A, ensuring that |S^\u03c3_g(FNR)| = \u2126(T). The instance proceeds in two phases.\n\n1. Phase I lasts \u0398 \u00b7 T rounds for \u0398 = 101c. 
The adversary assigns negative labels on examples\nwith group context B. For examples with group context A, the adversary acts as follows:\n\n\u2022 if the algorithm assigns probability on the negative expert below \u03b3(\u03b5) = (99 \u2212 2\u03b5)/100, i.e.\np^t_{hn}(\u03c3^{1:t\u22121}) < \u03b3(\u03b5), then the adversary assigns a negative label.\n\u2022 otherwise, the adversary assigns a positive label.\n\n2. In Phase II, there are two plausible worlds:\n\n(a) the adversary assigns negative labels to both groups if the expected number of times that\nthe algorithm selected the negative expert with probability higher than \u03b3(\u03b5) on members\nof group A is less than c\u00b7b\u00b7T, i.e. E_\u03c3[1{t \u2264 \u0398 \u00b7 T : g(t) = A, p^t_{hn} \u2265 \u03b3(\u03b5)}] < c\u00b7b\u00b7T.\n\n(b) otherwise she assigns positive labels to examples with group context B and negative\nlabels to examples with group context A.\n\nNote that, as before, the condition for the \ufb01rst world will never be triggered by any no-regret\nlearning algorithm (we elaborate on that below) which ensures that E_\u03c3 |S^\u03c3_A(FNR)| \u2265 c\u00b7b\u00b7T.\nThe proof is based on the following claims, whose proofs are deferred to the supplementary material.\n1. In Phase I, any vanishing approximate regret algorithm enters the second world of Phase II.\n2. This implies a lower bound on the false negative rate on A, i.e. FNR(A) \u2265 \u03b3(\u03b5) = (99 \u2212 2\u03b5)/100.\n3. In Phase II, any \u03b5-approximate regret algorithm assigns large enough probability to expert\nhp for group B, implying an upper bound on the false negative rate on B, i.e. FNR(B) \u2264\n1/(2(1\u2212b)). 
Therefore this provides a gap in the false negative rates of at least \u03b1.\n\n4 Fairness in composition with respect to an alternative metric\n\nThe negative results of the previous section give rise to a natural question of whether fairness in\ncomposition can be achieved for some other fairness metric in an online setting.\nWe answer this question positively by suggesting the equalized error rates metric EER which\ncaptures the average loss over the total number of examples (independent of whether this loss comes\nfrom false negative or false positive examples). The relevant subset induced by this metric, S^\u03c3_g(EER),\nis the set of all examples coming from group g \u2208 G. We again assume that experts are fair in isolation\nwith respect to equalized error rate and show that a simple scheme where we run separately one\ninstance of multiplicative weights for each group achieves fairness in composition (Theorem 3). The\nresult holds for general loss functions (beyond pure classi\ufb01cation) and is robust to the experts only\nbeing approximately fair in isolation. A crucial property we use is that multiplicative weights not\nonly does not perform worse than the best expert; it also does not perform better. In fact, this property\nholds more generally for online learning algorithms with optimal regret guarantees [14].\nInterestingly, not all algorithms can achieve fairness in composition even with respect to this re\ufb01ned\nnotion. We provide two algorithm classes where this is unachievable. First, we show that any\nalgorithm (subject to a technical condition satis\ufb01ed by algorithms such as multiplicative weights and\nfollow the perturbed leader) that ignores the group identity can be unboundedly unfair with respect to\nequalized error rates (Theorem 4). This suggests that the algorithm needs to actively discriminate\nbased on the groups to achieve fairness with respect to equalized error rates. 
Second, we show\na similar negative statement for any algorithm that satis\ufb01es the more involved guarantee of small\nshifting regret, therefore outperforming the best expert (Theorem 5). This suggests that the algorithm\nused should be good but not too good. This result is, to the best of our knowledge, a \ufb01rst application\nwhere shifting regret may not be desirable, which may be of independent interest.\n\nThe positive result. We run separate instances of multiplicative weights with a \ufb01xed learning rate\n\u03b7, one for each group. More formally, for each pair of expert f \u2208 F and group g \u2208 G, we initialize\nweights w^1_{f,g} = 1. At round t \u2208 {1, 2, . . . , T}, an example with group context g(t) arrives and the\nlearner selects a probability distribution based on the corresponding weights: p^t_f = w^t_{f,g(t)} / \u2211_{j\u2208F} w^t_{j,g(t)}. Then\nthe weights corresponding to group g(t) are updated exponentially: w^{t+1}_{f,g} = w^t_{f,g} \u00b7 (1 \u2212 \u03b7)^{\u2113^t_f \u00b7 1{g(t)=g}}.\nTheorem 3. For any \u03b1 > 0 and any \u03b5 < \u03b1, running separate instances of multiplicative\nweights for each group with learning rate \u03b7 = min(\u03b5, \u03b1/6) guarantees \u03b1-fairness in composition and\n\u03b5-approximate regret of at most O(|G| log(d)/\u03b5).\n\nProof sketch. The proof is based on the property that multiplicative weights performs not only no\nworse than the best expert in hindsight but also no better. Therefore the average performance of\nmultiplicative weights at each group is approximately equal to the average performance of the best\nexpert in that group. Since the experts are fair in isolation, the average performance of the best expert\nin all groups is the same which guarantees the equalized error rates desideratum. We make these\narguments formal in the supplementary material.\n\nRemark 1. If the instance is instead only approximately fair in isolation with respect to equalized\nerror rates, i.e. 
the error rates of the two experts are not exactly equal but within some constant \u03ba,\nthe same analysis implies (\u03b1 + \u03ba)-fairness in composition with respect to equalized error rates.\n\nImpossibility results for group-unaware algorithms.\nIn the previous algorithm, it was crucial\nthat the examples of one group do not interfere with the decisions of the algorithm on the other\ngroup. We show that, had we run one multiplicative weights algorithm in a group-unaware way, we\nwould not have accomplished fairness in composition. In fact, this impossibility result holds for any\nalgorithm with vanishing \u03b5-approximate regret where the learning dynamic (p^t at each round t) is a\ndeterministic function of the difference between the cumulative losses of the experts (without taking\ninto consideration their identity). This is satis\ufb01ed, for instance, by multiplicative weights and follow\nthe perturbed leader with a constant learning rate. Unlike the previous section, the impossibility\nresults for equalized error rates require groups to arrive adversarially (which also occurs in the above\npositive result). The proof of the following theorem is provided in the supplementary material.\n\nTheorem 4. For any \u03b1 > 0 and for any \u03b5 > 0, running a single algorithm from the above class in a\ngroup-unaware way is \u03b1-unfair in composition with respect to equalized error rate.\n\nImpossibility results for shifting algorithms. The reader may also be wondering whether it\nsuf\ufb01ces to just run separate learning algorithms in the two groups or whether multiplicative weights\nhas a special property. In the following theorem, we show that the latter is the case. In particular,\nmultiplicative weights has the property of not doing better than the best expert in hindsight. 
The\nmain representatives of algorithms that do not have such a property are the algorithms that achieve\nlow approximate regret compared to a shifting benchmark (tracking the best expert). More formally,\napproximate regret against a shifting comparator f\u22c6 = (f\u22c6(1), . . . , f\u22c6(T)) is de\ufb01ned as:\n\nApxReg_{\u03b5,T}(f\u22c6) = \u2211_t \u2211_{f\u2208F} p^t_f \u2113^t_f \u2212 (1 + \u03b5) \u2211_t \u2113^t_{f\u22c6(t)},\n\nand typical guarantees are E[ApxReg(f\u22c6)] = O(K(f\u22c6) \u00b7 ln(dT)/\u03b5) where K(f\u22c6) = \u2211_{t=2}^T 1{f\u22c6(t) \u2260\nf\u22c6(t \u2212 1)} is the number of switches in the comparator. We show that any algorithm that can achieve\nsuch a guarantee even when K(f\u22c6) = 2 does not satisfy fairness in composition with respect to\nequalized error rate. This indicates that, for the purpose of fairness with respect to equalized error rates, the\nalgorithm not being too good is essential. This is established in the following theorem whose proof is\ndeferred to the supplementary material.\nTheorem 5. For any \u03b1 < 1/2 and \u03b5 > 0, and any algorithm that can achieve the vanishing approximate\nregret property against shifting comparators f of length K(f) = 2, running separate instances of the\nalgorithm for each group is \u03b1-unfair in composition with respect to equalized error rate.\n\n5 Discussion\n\nIn this paper, we introduce the study of avoiding discrimination towards protected groups in online\nsettings with non-i.i.d. examples. Our impossibility results for equalized odds consist of only two\nphases, which highlights the challenge in correcting for historical biases in online decision making.\nOur work also opens up a quest towards de\ufb01nitions that are relevant and tractable in non-i.i.d. online\nsettings for speci\ufb01c tasks. 
We introduce the notion of equalized error rates that can be a useful metric\nfor non-discrimination in settings where all examples that contribute towards the performance also\ncontribute towards fairness. This is the case in settings where all mistakes are similarly costly, as in\nspeech recognition, recommender systems, or online advertising. However, we do not\nclaim that its applicability is universal. For instance, consider college admission with two perfectly\nbalanced groups that correspond to ethnicity (equal size of the two groups and equal number of\npositives and negatives within any group). A racist program organizer can choose to admit all students\nof one group and decline the students of the other, while satisfying equalized error rates \u2013 this\ndoes not satisfy equalized odds. Given the impossibility result we established for equalized odds, it is\ninteresting to identify de\ufb01nitions that work well for different tasks one encounters in online non-i.i.d.\nsettings. Moreover, although our positive results extend to the case where predictors are vetted to\nbe approximately non-discriminatory, they do not say anything about the case where the predictors\ndo not satisfy this property. We therefore view our work only as a \ufb01rst step towards understanding\nnon-discrimination in non-i.i.d. online settings.\n\nAcknowledgements\n\nThe authors would like to thank Manish Raghavan for useful discussions that improved the presentation\nof the paper. This work was supported by the NSF grants CCF-1800317 and CCF-1563714, as\nwell as a Google Ph.D. Fellowship.\n\nReferences\n[1] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: There\u2019s software used across the country to predict future criminals. And it\u2019s biased against blacks. ProPublica, 2016.\n\n[2] Julia Angwin and Terry Parris Jr. Facebook lets advertisers exclude users by race. 
ProPublica blog, 28, 2016.\n\n[3] Maria-Florina Balcan, Travis Dick, Ritesh Noothigattu, and Ariel Procaccia. Envy-free classi\ufb01cation, 2018.\n\n[4] Solon Barocas and Andrew D. Selbst. Big Data\u2019s Disparate Impact. California Law Review, 2016.\n\n[5] Sarah Bird, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI. In Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT-ML), 2016.\n\n[6] Avrim Blum and Yishay Mansour. From external to internal regret. In Proceedings of the 18th Annual Conference on Learning Theory (COLT), 2005.\n\n[7] Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classi\ufb01cation. In Conference on Fairness, Accountability and Transparency, 2018.\n\n[8] Toon Calders, Faisal Kamiran, and Mykola Pechenizkiy. Building classi\ufb01ers with independency constraints. In IEEE International Conference on Data Mining (ICDM), 2009.\n\n[9] L. Elisa Celis and Nisheeth K. Vishnoi. Fair personalization. In Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT-ML), 2017.\n\n[10] Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data, 5(2):153\u2013163, 2017.\n\n[11] Sam Corbett-Davies and Sharad Goel. The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint arXiv:1808.00023, 2018.\n\n[12] Amit Datta, Michael Carl Tschantz, and Anupam Datta. Automated experiments on ad privacy settings. Proceedings on Privacy Enhancing Technologies, 2015.\n\n[13] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. 
In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS), 2012.\n\n[14] Eyal Even-Dar, Michael Kearns, Yishay Mansour, and Jennifer Wortman. Regret to the best vs. regret to the average. Journal of Machine Learning Research (JMLR), 2008.\n\n[15] Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2015.\n\n[16] Avi Feller, Emma Pierson, Sam Corbett-Davies, and Sharad Goel. A computer program used for bail and sentencing decisions was labeled biased against blacks. It\u2019s actually not that clear. The Washington Post, 2016.\n\n[17] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 1997.\n\n[18] Stephen Gillen, Christopher Jung, Michael Kearns, and Aaron Roth. Online learning with an unknown fairness metric. In Advances in Neural Information Processing Systems (NIPS), 2018.\n\n[19] Gabriel Goh, Andrew Cotter, Maya Gupta, and Michael P Friedlander. Satisfying real-world goals with dataset constraints. In Advances in Neural Information Processing Systems (NIPS), 2016.\n\n[20] Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems (NIPS), 2016.\n\n[21] Mark Herbster and Manfred K. Warmuth. Tracking the best linear predictor. Journal of Machine Learning Research, 2001.\n\n[22] Matthew Joseph, Michael Kearns, Jamie H Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. In Advances in Neural Information Processing Systems (NIPS), 2016.\n\n[23] Sampath Kannan, Michael Kearns, Jamie Morgenstern, Mallesh Pai, Aaron Roth, Rakesh Vohra, and Zhiwei Steven Wu. Fairness incentives for myopic agents. 
In Proceedings of the 2017 ACM Conference on Economics and Computation (EC), 2017.\n\n[24] Sampath Kannan, Jamie Morgenstern, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. In Advances in Neural Information Processing Systems (NIPS), 2018.\n\n[25] Matthew Kay, Cynthia Matuszek, and Sean A Munson. Unequal representation and gender stereotypes in image search results for occupations. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 2015.\n\n[26] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.\n\n[27] Niki Kilbertus, Mateo Rojas Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, and Bernhard Sch\u00f6lkopf. Avoiding discrimination through causal reasoning. In Advances in Neural Information Processing Systems (NIPS), 2017.\n\n[28] Jon M. Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In Innovations in Theoretical Computer Science (ITCS), 2017.\n\n[29] Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems (NIPS), 2017.\n\n[30] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Inf. Comput., 108(2):212\u2013261, February 1994.\n\n[31] Katherine A Liu and Natalie A Dipietro Mager. Women\u2019s involvement in clinical trials: historical perspective and future implications. Pharmacy Practice, 2016.\n\n[32] Lydia T. Liu, Sarah Dean, Esther Rolf, Max Simchowitz, and Moritz Hardt. Delayed impact of fair machine learning. 
35th International Conference on Machine Learning (ICML), 2018.\n\n[33] Yang Liu, Goran Radanovic, Christos Dimitrakakis, Debmalya Mandal, and David C. Parkes. Calibrated fairness in bandits. Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT-ML), 2017.\n\n[34] Haipeng Luo and Robert E. Schapire. Achieving all with no parameters: Adanormalhedge. In Proceedings of The 28th Conference on Learning Theory (COLT), 2015.\n\n[35] Thodoris Lykouris, Vasilis Syrgkanis, and \u00c9va Tardos. Learning and ef\ufb01ciency in games with dynamic population. In Proceedings of the Twenty-seventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2016.\n\n[36] Dino Pedreshi, Salvatore Ruggieri, and Franco Turini. Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2008.\n\n[37] Manish Raghavan, Aleksandrs Slivkins, Jennifer Vaughan Wortman, and Zhiwei Steven Wu. The externalities of exploration and how data diversity helps exploitation. In Proceedings of the 31st Conference On Learning Theory (COLT), 2018.\n\n[38] Latanya Sweeney. Discrimination in online ad delivery. Commun. ACM, 56(5):44\u201354, May 2013.\n\n[39] Blake Woodworth, Suriya Gunasekar, Mesrob I Ohannessian, and Nathan Srebro. Learning non-discriminatory predictors. In Conference on Learning Theory (COLT), 2017.\n\n[40] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Learning fair classi\ufb01ers. Proceedings of 30th Neural Information Processing Systems, 2017.\n\n[41] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. 
In International Conference on Machine Learning (ICML), 2013.\n", "award": [], "sourceid": 5077, "authors": [{"given_name": "Avrim", "family_name": "Blum", "institution": "Toyota Technological Institute at Chicago"}, {"given_name": "Suriya", "family_name": "Gunasekar", "institution": "TTI Chicago"}, {"given_name": "Thodoris", "family_name": "Lykouris", "institution": "Cornell University"}, {"given_name": "Nati", "family_name": "Srebro", "institution": "TTI-Chicago"}]}