{"title": "Leveraging Labeled and Unlabeled Data for Consistent Fair Binary Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 12760, "page_last": 12770, "abstract": "We study the problem of fair binary classification using the notion of Equal Opportunity.\nIt requires the true positive rate to distribute equally across the sensitive groups.\nWithin this setting we show that the fair optimal classifier is obtained by recalibrating the Bayes classifier by a group-dependent threshold. We provide a constructive expression for the threshold.\nThis result motivates us to devise a plug-in classification procedure based on both unlabeled and labeled datasets.\nWhile the latter is used to learn the output conditional probability, the former is used for calibration.\nThe overall procedure can be computed in polynomial time and it is shown to be statistically consistent both in terms of the classification error and fairness measure. Finally, we present numerical experiments which indicate that our method is often superior or competitive with the state-of-the-art methods on benchmark datasets.", "full_text": "Leveraging Labeled and Unlabeled Data for\n\nConsistent Fair Binary Classi\ufb01cation\n\nEvgenii Chzhen1,2, Christophe Denis1, Mohamed Hebiri1,\n\nLuca Oneto3, Massimiliano Pontil4,5\n\n1Universit\u00e9 Paris-Est, 2Universit\u00e9 Paris-Sud, 3University of Pisa,\n\n4Istituto Italiano di Tecnologia, 5University College London\n\nevgenii.chzhen@math.u-psud.fr, {mohamed.hebiri,christophe.denis}@u-pem.fr,\n\nluca.oneto@unipi.it, massimiliano.pontil@iit.it\n\nAbstract\n\nWe study the problem of fair binary classi\ufb01cation using the notion of Equal Op-\nportunity. It requires the true positive rate to distribute equally across the sensitive\ngroups. Within this setting we show that the fair optimal classi\ufb01er is obtained\nby recalibrating the Bayes classi\ufb01er by a group-dependent threshold. 
We provide a constructive expression for the threshold. This result motivates us to devise a plug-in classification procedure based on both unlabeled and labeled datasets. While the latter is used to learn the output conditional probability, the former is used for calibration. The overall procedure can be computed in polynomial time and it is shown to be statistically consistent both in terms of the classification error and the fairness measure. Finally, we present numerical experiments which indicate that our method is often superior to or competitive with the state-of-the-art methods on benchmark datasets.

1 Introduction

As machine learning becomes more and more widespread in society, the potential risk of using algorithms that behave unfairly is rising. As a result, there is growing interest in designing learning methods that meet "fairness" requirements, see [5, 9, 10, 17, 19, 22–24, 28, 31, 33, 47, 48, 50, 52] and references therein. A central goal is to make sure that sensitive information does not "unfairly" influence the outcomes of learning methods. For instance, if we wish to predict whether a university student applicant should be offered a scholarship based on curriculum, we would like our model not to unfairly use additional sensitive information such as gender or race.
Several measures of fairness of a classifier have been studied in the literature [49], ranging from Demographic Parity [8], Equal Odds and Equal Opportunity [22], to Disparate Treatment, Impact, and Mistreatment [48], among others. In this paper, we study the problem of learning a binary classifier which satisfies the Equal Opportunity fairness constraint. It requires that the true positive rate of the classifier is the same across the sensitive groups. 
This notion has been used extensively in the literature, either as a post-processing step [22] applied to a learned classifier or directly during training; see for example [17] and references therein.
We address the important problem of devising statistically consistent and computationally efficient learning procedures that meet the fairness constraint. Specifically, we make four contributions. First, we derive in Proposition 2.3 the expression for the optimal equal opportunity classifier, obtained via thresholding of the Bayes regressor. Second, inspired by the above result, we propose a semi-supervised plug-in type method, which first estimates the regression function on labeled data and then estimates the unknown threshold using unlabeled data. Third, we establish in Theorem 4.5 that the proposed procedure is consistent, that is, it asymptotically satisfies the equal opportunity constraint and its risk converges to the risk of the optimal equal opportunity classifier. Finally, we present numerical experiments which indicate that our method is often superior to or competitive with the state-of-the-art on benchmark datasets.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We highlight that the proposed learning algorithm can be applied on top of any off-the-shelf method which consistently estimates the regression function (class conditional probability), under mild additional assumptions which we discuss in the paper. Furthermore, our calibration procedure is based on solving a simple univariate problem. Hence generality, statistical consistency and computational efficiency are strengths of our approach.
The paper is organized in the following manner. In Section 2, we introduce the problem and derive a form of the optimal equal opportunity classifier. 
Section 3 is devoted to the description of our method. In Section 4 we introduce the assumptions used throughout this work and establish that the proposed learning algorithm is consistent. Finally, Section 5 presents numerical experiments with our method.

1.1 Related work

In this section we review previous contributions on the subject. Works on algorithmic fairness can be divided into three families. Our algorithm falls within the first family, which modifies a pre-trained classifier in order to improve its fairness properties while maintaining the classification performance as much as possible; see [6, 20, 22, 38] and references therein. Importantly, for our approach the post-processing step requires only unlabeled data, which is often easier to collect than its labeled counterpart. Methods in the second family enforce fairness directly during the training step, e.g. [2, 12, 17, 37]. The third family of methods implements fairness by modifying the data representation and then employs standard machine learning methods; see e.g. [1, 9, 17, 25–27, 50] as representative examples.
To the best of our knowledge, the formula for the optimal fair classifier presented here is novel. In [22] the authors note that the optimal equalized odds or equal opportunity classifier can be derived from the Bayes optimal regressor; however, no explicit expression for this threshold is provided. The idea of recalibrating the Bayes classifier is also discussed in a number of papers, see for example [35, 38] and references therein. More importantly, the problem of deriving efficient and consistent estimators under fairness constraints has received limited attention in the literature. In [17], the authors present consistency results under restrictive assumptions on the model class. 
Furthermore, they only consider convex approximations of the risk and fairness constraint, and it is not clear how to relate their results to the original problem with the misclassification risk. In [2], the authors reduce the problem of fair classification to a sequence of cost-sensitive problems by leveraging a saddle point formulation. They show that their algorithm is consistent in terms of both the risk and the fairness constraint. However, similarly to [17], the authors of [2] assume that the family of possible classifiers admits a bounded Rademacher complexity.
Plug-in methods in classification problems are well established and well studied from a statistical perspective, see [4, 16, 46] and references therein; in particular, it is known that one can build a plug-in type classifier which is optimal in the minimax sense [4, 46]. Until very recently, theoretical studies of such methods reduced to an efficient estimation of the regression function. Indeed, in standard settings of classification the threshold is always known beforehand; thus, all the information about the optimal classifier is wrapped into the distribution of the label conditionally on the features.
More recently, classification problems with a distribution-dependent threshold have emerged. Prominent examples include classification with non-decomposable measures [30, 45, 51], classification with reject option [15, 32], and the confidence set setup of multi-class classification [11, 14, 40], among others. A typical estimation algorithm in these scenarios is based on the plug-in strategy, which uses extra data to estimate the unknown threshold. 
Interestingly, in some setups a practitioner does not need to have access to two labeled samples, and optimal estimation can be efficiently performed in a semi-supervised manner [11, 14].

2 Optimal Equal Opportunity classifier

Let (X, S, Y) be a tuple on R^d × {0, 1} × {0, 1} having a joint distribution P. Here the vector X ∈ R^d is seen as the vector of features, S ∈ {0, 1} is a binary sensitive variable, and Y ∈ {0, 1} is a binary output label that we wish to predict from the pair (X, S). We also assume that the distribution is non-degenerate in Y and S, that is, P(S = 1) ∈ (0, 1) and P(Y = 1) ∈ (0, 1). A classifier g is a measurable function from R^d × {0, 1} to {0, 1}, and the set of all such functions is denoted by G. In words, each classifier receives a pair (x, s) ∈ R^d × {0, 1} and outputs a binary prediction g(x, s) ∈ {0, 1}. For any classifier g we introduce its associated misclassification risk as

    R(g) := P(g(X, S) ≠ Y) .    (1)

A fair optimal classifier is formally defined as

    g* ∈ argmin_{g ∈ G} {R(g) : g is fair} .

There are various definitions of fairness available in the literature, each having its critics and its supporters. In this work, we employ the following definition introduced in [22]. We refer the reader to this work as well as [2, 17, 35] for a discussion and motivation of this definition, and a comparison to other fairness definitions.
Definition 2.1 (Equal Opportunity [22]). A classifier (x, s) ↦ g(x, s) ∈ {0, 1} is called fair if

    P(g(X, S) = 1 | S = 1, Y = 1) = P(g(X, S) = 1 | S = 0, Y = 1) .

The set of all fair classifiers is denoted by F(P).
Note that the definition of fairness depends on the underlying distribution P, and hence the whole class F(P) of fair classifiers has to be estimated. 
Further, notice that the class F(P) is non-empty, as it always contains the classifier g(x, s) ≡ 0.
Using this notion of fairness, we define an optimal equal opportunity classifier as a solution of the optimization problem

    min_{g ∈ G} {R(g) : P(g(X, S) = 1 | Y = 1, S = 1) = P(g(X, S) = 1 | Y = 1, S = 0)} .    (2)

We now introduce an assumption on the regression function that plays an important role in establishing the form of the optimal fair classifier.
Assumption 2.2. For each s ∈ {0, 1} we require the mapping t ↦ P(η(X, S) ≤ t | S = s) to be continuous on (0, 1), where for all (x, s) ∈ R^d × {0, 1} we let the regression function be

    η(x, s) := P(Y = 1 | X = x, S = s) = E[Y | X = x, S = s] .

Moreover, for every s ∈ {0, 1}, we assume that P(η(X, s) ≥ 1/2 | S = s) > 0.
The first part of Assumption 2.2 is satisfied by many distributions and has been introduced in various contexts, see e.g. [11, 15, 32, 40, 45] and references therein. It says that, for every s ∈ {0, 1}, the random variable η(X, s) does not have atoms, that is, the event {η(X, s) = t} has probability zero. The second part of the assumption states that the regression function η(X, s) must surpass the level 1/2 on a set of non-zero measure. Informally, returning to the scholarship example mentioned in the introduction, this assumption means that there are individuals from both groups who are more likely to be offered a scholarship based on their curriculum.
In the following result we establish that the optimal equal opportunity classifier is obtained by recalibrating the Bayes classifier.
Proposition 2.3 (Optimal Rule). 
Under Assumption 2.2, an optimal classifier g* can be obtained for all (x, s) ∈ R^d × {0, 1} as

    g*(x, 1) = 1{1 ≤ η(x, 1)(2 − θ*/P(Y = 1, S = 1))},
    g*(x, 0) = 1{1 ≤ η(x, 0)(2 + θ*/P(Y = 1, S = 0))},    (3)

where θ* ∈ R is determined from the equation

    E_{X|S=1}[η(X, 1) 1{1 ≤ η(X, 1)(2 − θ*/P(Y = 1, S = 1))}] / P(Y = 1 | S = 1)
        = E_{X|S=0}[η(X, 0) 1{1 ≤ η(X, 0)(2 + θ*/P(Y = 1, S = 0))}] / P(Y = 1 | S = 0) .

Furthermore, it holds that |θ*| ≤ 2.

Proof sketch. The proof relies on weak duality. The first step of the proof is to write the minimization problem for g* using a "min-max" problem formulation. We consider the corresponding dual "max-min" problem and show that it can be solved analytically. Then, the continuity part of Assumption 2.2 allows us to demonstrate that the solution of the "max-min" problem gives a solution of the "min-max" problem. The second part of Assumption 2.2 is used to prove that |θ*| ≤ 2.

Before proceeding further, let us define a notion of unfairness, which plays a key role in our statistical analysis; it is sometimes referred to as the difference of equal opportunity (DEO) in the literature [see e.g. 17].
Definition 2.4 (Unfairness). 
For any classifier g we define its unfairness as

    Δ(g, P) = |P(g(X, S) = 1 | S = 1, Y = 1) − P(g(X, S) = 1 | S = 0, Y = 1)| .

A principal goal of this paper is to construct a classification algorithm ĝ which satisfies

    E[Δ(ĝ, P)] → 0   (asymptotically fair)   and   E[R(ĝ)] → R(g*)   (asymptotically optimal),

where the expectations are taken with respect to the distribution of the data samples. As we shall see, our estimator is built from independent sets of labeled and unlabeled samples. Hence the convergence above is meant to hold as both sample sizes grow to infinity.

3 Proposed procedure

In this section, we present the proposed plug-in algorithm and begin to study its theoretical properties. We assume that we have at our disposal two datasets, a labeled one Dn and an unlabeled one DN, defined as

    Dn = {(Xi, Si, Yi)}_{i=1}^{n} i.i.d. ∼ P,   and   DN = {(Xi, Si)}_{i=n+1}^{n+N} i.i.d. ∼ P_{(X,S)} ,

where P_{(X,S)} is the marginal distribution of the vector (X, S). We additionally assume that the estimator η̂ of the regression function is constructed based on Dn, independently of DN. Let us denote by Ê_{X|S=1}, Ê_{X|S=0} the expectations taken w.r.t. the empirical distributions induced by DN, that is,

    P̂_{X|S=s} = (1 / |{(X, S) ∈ DN : S = s}|) Σ_{(X,S) ∈ DN : S = s} δ_X ,

for all s ∈ {0, 1}, and by Ê_S the expectation taken w.r.t. the empirical measure of S, that is, P̂_S = (1/N) Σ_{(X,S) ∈ DN} δ_S.
Remark 3.1. In theory, the empirical distributions might not be well defined, since they are only valid if the unlabeled dataset DN is composed of features from both groups. We show how to bypass this problem theoretically in the supplementary material. 
Nevertheless, this remark has little to no impact in practice, and in most situations these quantities are well defined.
Based on the estimator η̂ and the unlabeled sample DN, let us introduce the following estimators, for each s ∈ {0, 1},

    P̂(Y = 1, S = s) := Ê_{X|S=s}[η̂(X, s)] P̂_S(S = s) .

Using the above estimators, a straightforward procedure to mimic the optimal classifier g* provided by Proposition 2.3 is to employ a plug-in rule ĝ, obtained by replacing all the unknown quantities by either their empirical versions or their estimates. Specifically, we let ĝ at (x, s) ∈ R^d × {0, 1} be

    ĝ(x, 1) = 1{1 ≤ η̂(x, 1)(2 − θ̂/P̂(Y = 1, S = 1))},
    ĝ(x, 0) = 1{1 ≤ η̂(x, 0)(2 + θ̂/P̂(Y = 1, S = 0))} .    (4)

It remains to define the value of θ̂; clearly, it is desirable to mimic the condition that is satisfied by θ* in Proposition 2.3. To this end, we make use of the unlabeled data DN and of the estimator η̂ previously built from the labeled dataset Dn. Consequently, we define a data-driven version of the unfairness Δ(g, P), which allows us to construct an approximation θ̂ of the true value θ*.
Definition 3.2 (Empirical unfairness). For any classifier g, an estimator η̂ based on Dn, and an unlabeled sample DN, the empirical unfairness is defined as

    Δ̂(g, P) = | Ê_{X|S=1}[η̂(X, 1) g(X, 1)] / Ê_{X|S=1}[η̂(X, 1)] − Ê_{X|S=0}[η̂(X, 0) g(X, 0)] / Ê_{X|S=0}[η̂(X, 0)] | .

Notice that the empirical unfairness Δ̂(g, P) is data-driven, that is, it does not involve unknown quantities. 
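To make these quantities concrete, here is a minimal NumPy sketch of the plug-in rule (4) and of the empirical unfairness of Definition 3.2, computed from an estimate of the regression function evaluated on the unlabeled sample; the array and function names are ours, purely illustrative, and not part of the original method or its released code.

```python
import numpy as np

def joint_prob_estimates(eta, s):
    """P_hat(Y=1, S=v) = E_hat_{X|S=v}[eta_hat] * P_hat(S=v), from unlabeled data."""
    return {v: eta[s == v].mean() * np.mean(s == v) for v in (0, 1)}

def plugin_predict(eta, s, theta, p_hat):
    """Plug-in rule (4): group-dependent recalibration of the Bayes rule.

    eta   : estimated regression values eta_hat(X_i, S_i) on the sample
    s     : binary group labels S_i
    theta : scalar threshold parameter in [-2, 2]
    p_hat : dict {v: P_hat(Y=1, S=v)} as returned by joint_prob_estimates
    """
    # the sign of the theta-correction differs between the two groups
    factor = np.where(s == 1, 2 - theta / p_hat[1], 2 + theta / p_hat[0])
    return (1 <= eta * factor).astype(int)

def empirical_unfairness(eta, s, g):
    """Definition 3.2: data-driven version of the unfairness of a classifier g."""
    tpr = {v: (eta[s == v] * g[s == v]).sum() / eta[s == v].sum() for v in (0, 1)}
    return abs(tpr[1] - tpr[0])
```

With `p_hat = joint_prob_estimates(eta, s)`, one can evaluate `empirical_unfairness(eta, s, plugin_predict(eta, s, theta, p_hat))` for any candidate value of θ; the selection of θ̂ by minimizing this quantity over [−2, 2] is the calibration step of the procedure.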
One might wonder in what sense it is an empirical version of the quantity Δ(g, P) in Definition 2.4, and why we introduce it. The definition reveals itself when we rewrite the population unfairness Δ(g, P) using the identity¹

    P(g(X, S) = 1 | S = s, Y = 1) = P(g(X, S) = 1, Y = 1 | S = s) / P(Y = 1 | S = s)
                                  = E_{X|S=s}[η(X, s) g(X, s)] / E_{X|S=s}[η(X, s)] .

Using the above expression we can rewrite

    Δ(g, P) = | E_{X|S=1}[η(X, 1) g(X, 1)] / E_{X|S=1}[η(X, 1)] − E_{X|S=0}[η(X, 0) g(X, 0)] / E_{X|S=0}[η(X, 0)] | .

Hence, the passage from the population unfairness to its empirical version in Definition 3.2 formally reduces to putting "hats" on all the unknown quantities.
Using Definition 3.2, a natural estimator θ̂ of θ* can be obtained as

    θ̂ ∈ argmin_{θ ∈ [−2, 2]} Δ̂(ĝ_θ, P) ,

where, for all θ ∈ [−2, 2], ĝ_θ is defined at (x, s) ∈ R^d × {0, 1} as

    ĝ_θ(x, 1) = 1{1 ≤ η̂(x, 1)(2 − θ/P̂(Y = 1, S = 1))},
    ĝ_θ(x, 0) = 1{1 ≤ η̂(x, 0)(2 + θ/P̂(Y = 1, S = 0))} .    (5)

In this case, the algorithm ĝ that we propose is ĝ ≡ ĝ_{θ̂}. It is crucial to mention that, since the quantity Δ̂(ĝ_θ, P) is empirical, there might be no θ which delivers zero empirical unfairness. This is exactly the reason we perform a minimization of this quantity.
Remark 3.3. Even though we believe that the introduction of the unlabeled sample is one of the strong points of our approach, such a sample may not be available for some benchmark datasets. 
In this case, we can simply split the data randomly into two parts, disregarding the labels in one of them, or alternatively we can use the same sample twice. The second option is not directly justified by our theoretical results; yet, let us suggest the following intuitive explanation for this approach. In the first and second steps, our procedure approximates two independent parts of the distribution P of the random tuple (X, S, Y). Indeed, following the factorization P = P_{Y|X,S} ⊗ P_{(X,S)}, the first step of our procedure approximates P_{Y|X,S}, whereas the second step is aimed at P_{(X,S)}, which is independent of P_{Y|X,S}. In our experiments, reported in Section 5, we exploited the same set of data for both Dn and DN, since no unlabeled sample was available and splitting the dataset would have reduced the quality of the trained model because the datasets have a small sample size.

4 Consistency

In this section we establish that the proposed procedure is consistent. To present our theoretical results we impose two assumptions on the estimator η̂ and demonstrate how to satisfy them in practice.
Assumption 4.1. The estimator η̂, which is constructed on Dn, satisfies for all s ∈ {0, 1}:

(i) E_{Dn} E_{X|S=s} |η(X, S) − η̂(X, S)| → 0 as n → ∞;
(ii) there exists a sequence c_{n,N} > 0 satisfying 1/(c_{n,N} √N) = o_{n,N}(1) and c_{n,N} = o_{n,N}(1) such that E_{X|S=s}[η̂(X, S)] ≥ c_{n,N} almost surely.

Remark 4.2. There are two parts in Assumption 4.1; the first one requires a consistent estimator in ℓ1 norm. 
This first assumption is rather weak, since many consistent estimators of the regression function are available in the literature, including the maximum likelihood estimator [45] for the Gaussian generative model, the local polynomial estimator [4] for β-Hölder smooth regression functions η(·, s), regularized logistic regression [42] for generalized linear models, the k-nearest neighbors estimator [16] for Lipschitz regression functions η(·, s), and random forest type estimators in various settings [3, 7, 21, 41].
The second part of Assumption 4.1 means that E_{X|S=s}[η̂(X, s)] is lower bounded by a positive term vanishing as N, n grow to infinity. This condition can be introduced artificially for any predefined estimator. Indeed, assume that we have a consistent estimator η̃ and let η̂(x, s) = max{η̃(x, s), c_{n,N}}; then the second item of the assumption is satisfied in an even stronger form. Moreover, this estimator η̂ remains consistent since, using the triangle inequality and the fact that |η̂(x, s) − η̃(x, s)| ≤ c_{n,N} for all x ∈ R^d, we have

    E_{Dn} E_{X|S=s} |η(X, s) − η̂(X, s)| ≤ E_{Dn} E_{X|S=s} |η(X, s) − η̃(X, s)| + c_{n,N} → 0 .

¹Note additionally that for all s ∈ {0, 1} we can write 1{Y = 1, g(X, s) = 1} ≡ Y g(X, s), since both Y and g are binary.

Additionally, we impose one more condition on the estimator η̂ that was already successfully used in the context of confidence set classification [11, 15].
Assumption 4.3. 
The estimator η̂ is such that, for all s ∈ {0, 1}, the mapping

    t ↦ P(η̂(X, s) ≤ t | S = s)

is continuous on (0, 1) almost surely.
In our setting this assumption allows us to show that the value of Δ̂(ĝ, P) cannot be large, that is, the empirical unfairness of the proposed procedure is small or zero. As we shall see, a control on the empirical unfairness Δ̂(ĝ, P) of Definition 3.2 is crucial in proving that the proposed procedure ĝ achieves both asymptotic fairness and risk consistency.
Remark 4.4. Assumption 4.3 is equivalent to saying that there are no atoms in the estimated regression function. It can be fulfilled by a simple modification of any preliminary estimator, by adding a small deterministic "noise", the amplitude of which must decrease with n, N in order to preserve statistical consistency.

Our remarks suggest that both Assumptions 4.1 and 4.3 can be easily satisfied in a variety of practical settings, and that the most demanding part of these assumptions is the consistency of η̂.
The next result establishes the statistical consistency of the proposed algorithm.
Theorem 4.5 (Asymptotic properties). Under Assumptions 2.2, 4.1, and 4.3, the proposed algorithm satisfies

    lim_{n,N→∞} E_{(Dn,DN)}[Δ(ĝ, P)] = 0   and   lim_{n,N→∞} E_{(Dn,DN)}[R(ĝ)] ≤ R(g*) .

Proof sketch. 
In order to establish the statistical consistency of the proposed procedure, we follow the strategy of [11, 15]; that is, we first introduce an intermediate pseudo-estimator g̃ as follows:

    g̃(x, 1) = 1{1 ≤ η̂(x, 1)(2 − θ̃/(E_{X|S=1}[η̂(X, 1)] P(S = 1)))},
    g̃(x, 0) = 1{1 ≤ η̂(x, 0)(2 + θ̃/(E_{X|S=0}[η̂(X, 0)] P(S = 0)))},    (6)

where θ̃ is chosen such that

    E_{X|S=1}[η̂(X, 1) g̃(X, 1)] / E_{X|S=1}[η̂(X, 1)] = E_{X|S=0}[η̂(X, 0) g̃(X, 0)] / E_{X|S=0}[η̂(X, 0)] .    (7)

Note that by Assumption 4.3 such a value θ̃ always exists. Intuitively, the classifier g̃ "knows" the marginal distribution of (X, S), that is, it knows both P_{X|S} and P_S. It is seen as an idealized version of ĝ, where the uncertainty is only induced by the lack of knowledge of the regression function η.
We express the excess risk as a sum of two terms, E_{Dn}[R(g̃)] − R(g*) + E_{(Dn,DN)}[R(ĝ) − R(g̃)]. We show that the first can be bounded by the ℓ1 distance between η̂ and η, and thanks to the consistency of η̂ it converges to zero. The handling of the second term is more involved, but we are able to show that it reduces to a study of suprema of empirical processes conditionally on the labeled sample Dn.
To demonstrate that the proposed algorithm is asymptotically fair, we first show that

    E_{(Dn,DN)}[Δ(ĝ, P)] ≤ E_{(Dn,DN)}[Δ̂(ĝ, P)] + o_{n,N}(1) .

At last, the continuity Assumption 4.3, together with tools from the theory of empirical processes, allows us to demonstrate that the term E_{(Dn,DN)}[Δ̂(ĝ, P)] converges to zero as N grows.
Remark 4.6. 
Let us mention that it is possible to present our result in a finite sample regime, since our proof of consistency is based on the non-asymptotic theory of empirical processes. However, the actual rate of convergence depends on the rate of ℓ1-norm estimation of the regression function η, which can vary significantly from one setup to another. That is why we decided to present our result in the asymptotic sense.

Method          COMPAS                Adult                 German       Drug                  Arrhythmia
                ACC       DEO         ACC       DEO         ACC   DEO    ACC       DEO         ACC       DEO
Lin.SVM         0.78±0.07 0.13±0.04   0.75±0.01 0.15±0.02   0.80  0.13   0.69±0.04 0.11±0.10   0.81±0.02 0.41±0.06
Lin.LR          0.79±0.06 0.13±0.05   0.76±0.02 0.16±0.02   0.81  0.12   0.67±0.05 0.12±0.11   0.80±0.01 0.42±0.05
Lin.SVM+Hardt   0.74±0.06 0.07±0.04   0.67±0.03 0.21±0.09   0.80  0.10   0.61±0.15 0.15±0.13   0.77±0.02 0.22±0.09
Lin.LR+Hardt    0.75±0.04 0.08±0.05   0.67±0.02 0.18±0.07   0.81  0.09   0.62±0.05 0.13±0.09   0.76±0.01 0.18±0.04
Zafar           0.71±0.03 0.03±0.02   0.69±0.02 0.10±0.06   0.78  0.05   0.62±0.09 0.13±0.11   0.69±0.03 0.02±0.07
Lin.Donini      0.79±0.07 0.04±0.03   0.76±0.01 0.04±0.03   0.77  0.01   0.69±0.04 0.05±0.03   0.79±0.02 0.05±0.03
Lin.SVM+Ours    0.75±0.08 0.04±0.04   0.73±0.01 0.05±0.02   0.79  0.03   0.68±0.04 0.04±0.03   0.78±0.02 0.01±0.02
Lin.LR+Ours     0.75±0.06 0.04±0.05   0.74±0.02 0.06±0.02   0.80  0.03   0.67±0.05 0.04±0.03   0.77±0.03 0.02±0.02
SVM             0.78±0.06 0.13±0.04   0.73±0.01 0.14±0.02   0.82  0.14   0.74±0.03 0.10±0.06   0.81±0.04 0.38±0.03
LR              0.79±0.05 0.12±0.04   0.74±0.01 0.14±0.02   0.81  0.15   0.75±0.03 0.11±0.06   0.82±0.01 0.37±0.03
RF              0.83±0.03 0.09±0.02   0.77±0.02 0.11±0.02   0.86  0.12   0.78±0.02 0.09±0.04   0.86±0.01 0.29±0.02
SVM+Hardt       0.74±0.06 0.07±0.04   0.71±0.02 0.08±0.02   0.82  0.11   0.71±0.03 0.11±0.18   0.75±0.11 0.14±0.08
LR+Hardt        0.73±0.05 0.10±0.04   0.70±0.02 0.09±0.02   0.80  0.12   0.72±0.04 0.09±0.06   0.77±0.03 0.11±0.04
RF+Hardt        0.79±0.03 0.07±0.01   0.76±0.01 0.07±0.02   0.83  0.05   0.76±0.02 0.06±0.04   0.82±0.01 0.09±0.02
Donini          0.79±0.09 0.03±0.02   0.73±0.01 0.05±0.03   0.81  0.01   0.73±0.04 0.05±0.03   0.80±0.03 0.07±0.05
SVM+Ours        0.77±0.07 0.04±0.02   0.72±0.02 0.06±0.02   0.80  0.02   0.73±0.03 0.04±0.06   0.79±0.02 0.05±0.01
LR+Ours         0.77±0.06 0.04±0.02   0.73±0.01 0.06±0.02   0.80  0.02   0.73±0.02 0.04±0.06   0.80±0.01 0.05±0.02
RF+Ours         0.81±0.04 0.03±0.01   0.76±0.02 0.04±0.02   0.85  0.03   0.77±0.02 0.02±0.02   0.83±0.01 0.04±0.02

Table 1: Results (average ± standard deviation, when a fixed test set is not provided) for all the datasets, concerning ACC and DEO.

Figure 1: Results of Table 1 for linear (left) and nonlinear (right) methods, when the error and the DEO are normalized in [0, 1] column-wise. Different colors and symbols refer to different datasets and methods, respectively. The closer a point is to the origin, the better the result.

5 Experimental results

In this section, we present numerical experiments with the proposed method. The source code we used to perform the experiments can be found at https://github.com/lucaoneto/NIPS2019_Fairness.
We follow the protocol outlined in [17]. 
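The two statistics used throughout the evaluation, the classification accuracy and the difference of equal opportunity (DEO) of Definition 2.4, can be computed from test-set predictions as in the following illustrative NumPy sketch; the function and variable names are ours, not taken from the released code.

```python
import numpy as np

def accuracy(y_true, y_pred):
    # ACC: empirical probability of correctly classifying a test sample
    return np.mean(y_true == y_pred)

def deo(y_true, y_pred, s):
    # DEO: absolute difference of true positive rates across the two sensitive groups,
    # i.e. the empirical counterpart of the unfairness in Definition 2.4
    tpr = {v: np.mean(y_pred[(s == v) & (y_true == 1)] == 1) for v in (0, 1)}
    return abs(tpr[1] - tpr[0])
```

Both quantities are reported in Tables 1 and 2 below; lower DEO at comparable ACC indicates a fairer classifier.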
We consider the following datasets: Arrhythmia, COMPAS, Adult, German, and Drug (see footnote 2), and compare the following algorithms: Linear Support Vector Machines (Lin.SVM), Support Vector Machines with the Gaussian kernel (SVM), Linear Logistic Regression (Lin.LR), Logistic Regression with the Gaussian kernel (LR), the method of Hardt [22] applied to each approach (Hardt), the method of Zafar [48] implemented with the code provided by the authors for the linear case (see footnote 3), the linear (Lin.Donini) and nonlinear (Donini) methods proposed in [17] and freely available (see footnote 4), and Random Forests (RF). Since Lin.SVM, SVM, Lin.LR, LR, and RF can also output a probability together with the classification, we applied our method on top of each of them.

In all experiments, we collect statistics concerning the classification accuracy (ACC), namely the probability of correctly classifying a sample, and the Difference of Equal Opportunity (DEO) of Definition 2.1. For the Arrhythmia, COMPAS, German, and Drug datasets we split the data in two parts (70% train and 30% test); this procedure is repeated 30 times, and we report the average performance on the test set alongside its standard deviation. For the Adult dataset, we used the provided train/test split. Unless otherwise stated, we employ the two-step 10-fold CV procedure proposed in [17] to select the best hyperparameters on the training set (see footnote 5). In the first step, the hyperparameter values with the highest accuracy are identified. In the second step, we shortlist all the hyperparameter values whose accuracy is close to the best one (in our case, above 90% of the best accuracy). Finally, from this list, we select the hyperparameter values with the lowest DEO.

Footnote 2: For more information about these datasets, please refer to [17].
Footnote 3: Python code for [48]: https://github.com/mbilalzafar/fair-classification
Footnote 4: Python code for [17]: https://github.com/jmikko/fair_ERM

RF+Ours               COMPAS                    Adult
                      ACC          DEO          ACC          DEO
Dn=1/10               0.68 ± 0.03  0.07 ± 0.02  0.79 ± 0.02  0.06 ± 0.02
Dn=1/10, DN=1/10      0.68 ± 0.03  0.07 ± 0.02  0.79 ± 0.02  0.06 ± 0.02
Dn=1/10, DN=2/10      0.68 ± 0.03  0.07 ± 0.02  0.79 ± 0.02  0.06 ± 0.02
Dn=1/10, DN=4/10      0.70 ± 0.02  0.06 ± 0.02  0.79 ± 0.02  0.05 ± 0.01
Dn=1/10, DN=8/10      0.71 ± 0.02  0.05 ± 0.01  0.80 ± 0.02  0.04 ± 0.01

Table 2: Impact of the size of the unlabeled dataset on ACC and DEO. The size of the labeled sample Dn is fixed to 1/10 of the original dataset. The unlabeled sample DN is initially empty (as in the previous experiments of Table 1) and then increases from 1/10 to 8/10 of the original dataset.

We also present in Figure 1 the results of Table 1 for linear (left) and nonlinear (right) methods, when the error (one minus ACC) and the DEO are normalized in [0, 1] column-wise. In the figure, different colors and symbols refer to different datasets and methods, respectively. The closer a point is to the origin, the better the result.

From Table 1 and Figure 1 we observe that the proposed method outperforms all competitors except that of [17], for which we obtain comparable performance. Nevertheless, note that our method is more general than the one of [17], since it can be applied on top of any algorithm that returns a probability estimator (preferably a consistent one, since this yields an approach that is also fully consistent from the fairness point of view). In fact, on these datasets RF, which cannot be made fair in any straightforward way with the approach proposed in [17], outperforms all the available methods.

Note that the results reported in Table 1 differ from those reported in [17], since the proposed method requires knowledge of the sensitive variable at classification time, and Table 1 reports only this case.
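The two-step hyperparameter selection described above can be sketched as follows; the candidate grid and the accuracy/DEO scores are hypothetical stand-ins for values that would come from 10-fold cross-validation:

```python
# Hypothetical cross-validated scores for a grid of candidates; in the real
# protocol these come from the 10-fold CV procedure of [17].
candidates = [
    {"C": 0.1,   "acc": 0.78, "deo": 0.09},
    {"C": 1.0,   "acc": 0.81, "deo": 0.07},
    {"C": 10.0,  "acc": 0.80, "deo": 0.03},
    {"C": 100.0, "acc": 0.70, "deo": 0.02},
]

# Step 1: identify the best cross-validated accuracy.
best_acc = max(c["acc"] for c in candidates)

# Step 2: shortlist candidates whose accuracy is above 90% of the best,
# then pick the shortlisted candidate with the lowest DEO.
shortlist = [c for c in candidates if c["acc"] >= 0.9 * best_acc]
chosen = min(shortlist, key=lambda c: c["deo"])
print(chosen)  # {'C': 10.0, 'acc': 0.8, 'deo': 0.03}
```

Note that neither the most accurate candidate (C=1.0) nor the fairest one (C=100.0) is selected: the latter falls below the 90% accuracy threshold, and among the remaining candidates the one with the lowest DEO wins.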
In this setting, the functional form of the model explicitly depends on the sensitive variable s ∈ {0, 1}. Many authors point out that this may not be permitted in several practical scenarios (see e.g. [19, 39] and references therein). Yet, removing the sensitive variable from the functional form of the model does not ensure that the sensitive variable is ignored by the model itself; we refer to [36] for an in-depth discussion of this issue. Further, the method in [22] explicitly requires knowledge of the sensitive variable for its thresholding procedure. In Appendix E we show how to modify our method in order to derive a fair optimal classifier without the sensitive variable s in the functional form of the model. Moreover, we propose a modification of our approach which does not use s at decision time and perform an additional numerical comparison in this context; we arrive at similar conclusions about the performance of our method. Consistency results, however, are not available for this modification and are left for future investigation.

In Table 2 we demonstrate the impact of the unlabeled data size on the performance of the proposed algorithm. Since the benchmark datasets above do not come with additional unlabeled data, we deploy the following data generation procedure: we randomly select 1/10 of the observations in each dataset and assign them to the labeled sample Dn; the size of the unlabeled sample DN then increases from 0 to 8/10, drawn from the samples that were not assigned to Dn. This procedure is applied to the COMPAS and Adult datasets. Finally, we apply our method on top of the random forest algorithm, using the cross-validation scheme employed in the previous experiments. The above pipeline is repeated 30 times, and the average and standard deviation of the results are reported in Table 2. We can see that both DEO and ACC improve with N, highlighting the importance of the unlabeled data.
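The data generation procedure above can be sketched as follows (an illustrative sketch with a hypothetical dataset size; the actual experiments subsample the COMPAS and Adult datasets):

```python
import random

def split_labeled_unlabeled(n_total, frac_unlabeled, seed=0):
    """Assign 1/10 of the indices to the labeled sample Dn, and a growing
    fraction of the remaining indices to the unlabeled sample DN, as in the
    Table 2 experiments (illustrative sketch)."""
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    n_labeled = n_total // 10                    # Dn is fixed to 1/10
    labeled = indices[:n_labeled]
    n_unlabeled = int(n_total * frac_unlabeled)  # DN grows from 0 to 8/10
    unlabeled = indices[n_labeled:n_labeled + n_unlabeled]
    return labeled, unlabeled

# One repetition with DN = 8/10 of a hypothetical 1000-sample dataset.
labeled, unlabeled = split_labeled_unlabeled(1000, frac_unlabeled=0.8)
print(len(labeled), len(unlabeled))  # 100 800
```

In the experiments this split is redrawn 30 times, refitting the model on each draw.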
We believe that the improvement could have been even more significant if additional unlabeled data were provided initially.

Footnote 5: The regularization parameter (for all methods) and the RBF kernel hyperparameter are searched over 30 values, equally spaced on a logarithmic scale between 10^-4 and 10^4. For RF, the number of trees is set to 1000 and the size of the subset of features optimized at each node is searched in {d, ⌈d^{15/16}⌉, ⌈d^{7/8}⌉, ⌈d^{3/4}⌉, ⌈d^{1/2}⌉, ⌈d^{1/4}⌉, ⌈d^{1/8}⌉, ⌈d^{1/16}⌉, 1}, where d is the number of features in the dataset.

6 Conclusion

Using the notion of equal opportunity, we have derived a form of the fair optimal classifier based on a group-dependent threshold. Relying on this result, we have proposed a semi-supervised plug-in method which enjoys strong theoretical guarantees under mild assumptions. Importantly, our algorithm can be implemented on top of any base classifier which outputs conditional probabilities. We have conducted an extensive numerical evaluation comparing our procedure against state-of-the-art approaches and have demonstrated that it performs well in practice. In future work we would like to extend our analysis to other fairness measures, as well as provide consistency results for the algorithm which does not use the sensitive feature at decision time. Finally, we note that our consistency result is constructive and could be used to derive non-asymptotic rates of convergence for the proposed method, relying upon available rates for the regression function estimator.

Acknowledgments

This work was supported in part by SAP SE, by the Amazon AWS Machine Learning Research Award, by CISCO, and by the Labex Bézout of Université Paris-Est.

References

[1] J. Adebayo and L. Kagal. Iterative orthogonal feature projection for diagnosing bias in black-box models.
In Conference on Fairness, Accountability, and Transparency in Machine Learning, 2016.

[2] A. Agarwal, A. Beygelzimer, M. Dudík, J. Langford, and H. Wallach. A reductions approach to fair classification. arXiv preprint arXiv:1803.02453, 2018.

[3] S. Arlot and R. Genuer. Analysis of purely random forests bias. arXiv preprint arXiv:1407.3939, 2014.

[4] J. Y. Audibert and A. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007.

[5] S. Barocas, M. Hardt, and A. Narayanan. Fairness and Machine Learning. fairmlbook.org, 2018.

[6] A. Beutel, J. Chen, Z. Zhao, and E. H. Chi. Data decisions and theoretical implications when adversarially learning fair representations. In Conference on Fairness, Accountability, and Transparency in Machine Learning, 2017.

[7] L. Breiman. Consistency for a simple model of random forests. Technical report, Statistics Department, University of California at Berkeley, 2004.

[8] T. Calders, F. Kamiran, and M. Pechenizkiy. Building classifiers with independency constraints. In IEEE International Conference on Data Mining, 2009.

[9] F. Calmon, D. Wei, B. Vinzamuri, K. N. Ramamurthy, and K. R. Varshney. Optimized pre-processing for discrimination prevention. In Neural Information Processing Systems, 2017.

[10] F. Chierichetti, R. Kumar, S. Lattanzi, and S. Vassilvitskii. Fair clustering through fairlets. In Neural Information Processing Systems, 2017.

[11] E. Chzhen, C. Denis, and M. Hebiri. Minimax semi-supervised confidence sets for multi-class classification. arXiv preprint arXiv:1904.12527, 2019.

[12] A. Cotter, M. Gupta, H. Jiang, N. Srebro, K. Sridharan, S. Wang, B. Woodworth, and S. You. Training well-generalizing classifiers for fairness metrics and other data-dependent constraints. arXiv preprint arXiv:1807.00028, 2018.

[13] F. Cribari-Neto, N. Garcia, and K. Vasconcellos.
A note on inverse moments of binomial variates. Brazilian Review of Econometrics, 20(2):269–277, 2000.

[14] C. Denis and M. Hebiri. Confidence sets with expected sizes for multiclass classification. Journal of Machine Learning Research, 18(1):3571–3598, 2017.

[15] C. Denis and M. Hebiri. Consistency of plug-in confidence sets for classification in semi-supervised learning. Journal of Nonparametric Statistics, 0(0):1–31, 2019.

[16] L. Devroye. The uniform convergence of nearest neighbor regression function estimators and their application in optimization. IEEE Transactions on Information Theory, 24(2):142–151, 1978.

[17] M. Donini, L. Oneto, S. Ben-David, J. S. Shawe-Taylor, and M. Pontil. Empirical risk minimization under fairness constraints. In Neural Information Processing Systems, 2018.

[18] A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, 27(3):642–669, 1956.

[19] C. Dwork, N. Immorlica, A. T. Kalai, and M. D. M. Leiserson. Decoupled classifiers for group-fair and efficient machine learning. In Conference on Fairness, Accountability and Transparency, 2018.

[20] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian. Certifying and removing disparate impact. In International Conference on Knowledge Discovery and Data Mining, 2015.

[21] R. Genuer. Variance reduction in purely random forests. Journal of Nonparametric Statistics, 24(3):543–562, 2012.

[22] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In Neural Information Processing Systems, 2016.

[23] S. Jabbari, M. Joseph, M. Kearns, J. Morgenstern, and A. Roth.
Fair learning in Markovian environments. In Conference on Fairness, Accountability, and Transparency in Machine Learning, 2016.

[24] M. Joseph, M. Kearns, J. H. Morgenstern, and A. Roth. Fairness in learning: Classic and contextual bandits. In Neural Information Processing Systems, 2016.

[25] F. Kamiran and T. Calders. Classifying without discriminating. In International Conference on Computer, Control and Communication, 2009.

[26] F. Kamiran and T. Calders. Classification with no discrimination by preferential sampling. In Machine Learning Conference, 2010.

[27] F. Kamiran and T. Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1–33, 2012.

[28] N. Kilbertus, M. Rojas-Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf. Avoiding discrimination through causal reasoning. In Neural Information Processing Systems, 2017.

[29] V. Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: Ecole d'Eté de Probabilités de Saint-Flour XXXVIII-2008, volume 2033. Springer Science & Business Media, 2011.

[30] O. Koyejo, N. Natarajan, P. Ravikumar, and I. Dhillon. Consistent multilabel classification. In Neural Information Processing Systems, 2015.

[31] M. J. Kusner, J. Loftus, C. Russell, and R. Silva. Counterfactual fairness. In Neural Information Processing Systems, 2017.

[32] J. Lei. Classification with confidence. Biometrika, 101(4):755–769, 2014.

[33] K. Lum and J. Johndrow. A statistical framework for fair predictive algorithms. arXiv preprint arXiv:1610.08077, 2016.

[34] P. Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, pages 1269–1283, 1990.

[35] A. K. Menon and R. C. Williamson. The cost of fairness in binary classification.
In Conference on Fairness, Accountability and Transparency, 2018.

[36] L. Oneto, M. Donini, A. Elders, and M. Pontil. Taking advantage of multitask learning for fair classification. In AAAI/ACM Conference on AI, Ethics, and Society, 2019.

[37] L. Oneto, M. Donini, and M. Pontil. General fair empirical risk minimization. arXiv preprint arXiv:1901.10080, 2019.

[38] G. Pleiss, M. Raghavan, F. Wu, J. Kleinberg, and K. Weinberger. On fairness and calibration. In Neural Information Processing Systems, 2017.

[39] J. E. Roemer and A. Trannoy. Equality of opportunity. In Handbook of Income Distribution, 2015.

[40] M. Sadinle, J. Lei, and L. Wasserman. Least ambiguous set-valued classifiers with bounded error levels. Journal of the American Statistical Association, pages 1–12, 2018.

[41] E. Scornet, G. Biau, and J.-P. Vert. Consistency of random forests. The Annals of Statistics, 43(4):1716–1741, 2015.

[42] S. Van de Geer. High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2):614–645, 2008.

[43] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of Complexity, 2015.

[44] J. Wellner. Empirical processes: Theory and applications. Technical report, Delft University of Technology, 2005.

[45] B. Yan, S. Koyejo, K. Zhong, and P. Ravikumar. Binary classification with karmic, threshold-quasi-concave metrics. In International Conference on Machine Learning, 2018.

[46] Y. Yang. Minimax nonparametric classification: Rates of convergence. IEEE Transactions on Information Theory, 45(7):2271–2284, 1999.

[47] S. Yao and B. Huang. Beyond parity: Fairness objectives for collaborative filtering. In Neural Information Processing Systems, 2017.

[48] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi.
Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In International Conference on World Wide Web, 2017.

[49] M. B. Zafar, I. Valera, M. Gomez-Rodriguez, and K. P. Gummadi. Fairness constraints: A flexible approach for fair classification. Journal of Machine Learning Research, 20(75):1–42, 2019.

[50] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In International Conference on Machine Learning, 2013.

[51] M. J. Zhao, N. Edakunni, A. Pocock, and G. Brown. Beyond Fano's inequality: bounds on the optimal F-score, BER, and cost-sensitive risk and their implications. Journal of Machine Learning Research, 14:1033–1090, 2013.

[52] I. Zliobaite. On the relation between accuracy and fairness in binary classification. arXiv preprint arXiv:1505.05723, 2015.