{"title": "Empirical Risk Minimization Under Fairness Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 2791, "page_last": 2801, "abstract": "We address the problem of algorithmic fairness: ensuring that sensitive information does not unfairly influence the outcome of a classifier. We present an approach based on empirical risk minimization, which incorporates a fairness constraint into the learning problem. It encourages the conditional risk of the learned classifier to be approximately constant with respect to the sensitive variable. We derive both risk and fairness bounds that support the statistical consistency of our methodology. We specify our approach to kernel methods and observe that the fairness requirement implies an orthogonality constraint which can be easily added to these methods. We further observe that for linear models the constraint translates into a simple data preprocessing step. Experiments indicate that the method is empirically effective and performs favorably against state-of-the-art approaches.", "full_text": "Empirical Risk Minimization Under Fairness Constraints

Michele Donini(1), Luca Oneto(2), Shai Ben-David(3), John Shawe-Taylor(4), Massimiliano Pontil(1,4)
(1) Istituto Italiano di Tecnologia (Italy), (2) University of Genoa (Italy), (3) University of Waterloo (Canada), (4) University College London (UK)

Abstract

We address the problem of algorithmic fairness: ensuring that sensitive information does not unfairly influence the outcome of a classifier. We present an approach based on empirical risk minimization, which incorporates a fairness constraint into the learning problem. It encourages the conditional risk of the learned classifier to be approximately constant with respect to the sensitive variable. We derive both risk and fairness bounds that support the statistical consistency of our methodology.
We specify our approach to kernel methods and observe that the fairness requirement implies an orthogonality constraint which can be easily added to these methods. We further observe that for linear models the constraint translates into a simple data preprocessing step. Experiments indicate that the method is empirically effective and performs favorably against state-of-the-art approaches.

1 Introduction

In recent years there has been a lot of interest in algorithmic fairness in machine learning; see, e.g., [1–13] and references therein. The central question is how to enhance supervised learning algorithms with fairness requirements, namely ensuring that sensitive information (e.g. knowledge about the ethnic group of an individual) does not 'unfairly' influence the outcome of a learning algorithm. For example, if the learning problem is to decide whether a person should be offered a loan based on her previous credit card scores, we would like to build a model which does not unfairly use additional sensitive information such as race or sex. Several notions of fairness and associated learning methods have been introduced in machine learning in the past few years, including Demographic Parity [14], Equal Odds and Equal Opportunities [2], and Disparate Treatment, Impact, and Mistreatment [3]. The underlying idea behind such notions is to balance the decisions of a classifier among the different sensitive groups and label sets.

In this paper, we build upon the notion of Equal Opportunity (EO), which defines fairness as the requirement that the true positive rate of the classifier is the same across the sensitive groups. In Section 2 we introduce a generalization of this notion of fairness which constrains the conditional risk of a classifier, associated with the positively labeled samples of a group, to be approximately constant with respect to group membership.
The risk is measured according to a prescribed loss function and an approximation parameter ε. When the loss is the misclassification error and ε = 0, we recover the notion of EO above. We study the problem of minimizing the expected risk within a prescribed class of functions subject to the fairness constraint. As a natural estimator associated with this problem, we consider a modified version of Empirical Risk Minimization (ERM) which we call Fair ERM (FERM). We derive both risk and fairness bounds, which support that FERM is statistically consistent, in a certain sense which we explain in Section 2.2. Since the FERM approach is impractical due to the non-convex nature of the constraint, in the same section we propose a surrogate convex FERM problem which relates, under a natural condition, to the original goal of minimizing the misclassification error subject to a relaxed EO constraint.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We further observe that our condition can be empirically verified to judge the quality of the approximation in practice. As a concrete example of the framework, in Section 3 we describe how kernel methods such as support vector machines (SVMs) can be enhanced to satisfy the fairness constraint. We observe that a particular instance of the fairness constraint for ε = 0 reduces to an orthogonality constraint. Moreover, in the linear case, the constraint translates into a preprocessing step that implicitly imposes the fairness requirement on the data, making any linear model learned with them fair. We report numerical experiments using both linear and nonlinear kernels, which indicate that our method improves on the state-of-the-art in four out of five datasets and is competitive on the fifth dataset.
Additional technical steps and experiments are presented in the supplementary material.

In summary, the contributions of this paper are twofold. First, we outline a general framework for empirical risk minimization under fairness constraints. The framework can be used as a starting point to develop specific algorithms for learning under fairness constraints. As a second contribution, we show how a linear fairness constraint arises naturally in the framework and allows us to develop a novel convex learning method that is supported by consistency properties both in terms of EO and risk of the selected model, performing favorably against state-of-the-art alternatives on a series of benchmark datasets.

Related Work. Work on algorithmic fairness can be divided into three families. Methods in the first family modify a pretrained classifier in order to increase its fairness properties while maintaining as much as possible the classification performance: [2, 15–17] are examples of these methods; however, neither consistency properties nor comparisons with state-of-the-art proposals are provided. Methods in the second family enforce fairness directly during the training step, e.g. [3, 18–26]. However, they either provide non-convex approaches to the solution of the problem, or derive consistency results for the non-convex formulation and then resort to a convex approach which is not theoretically grounded. The third family of methods implements fairness by modifying the data representation and then employs standard machine learning methods: [4, 7, 27–30] are examples of these methods. Again, neither consistency properties nor comparisons with state-of-the-art proposals are provided. Our method belongs to the second family, in that it directly optimizes a fairness constraint related to the notion of EO discussed above.
Furthermore, in the case of linear models, our method translates into an efficient preprocessing of the input data, as in methods of the third family. Finally, our method can be extended to deal with other frameworks, such as multitask learning [31]. As we shall see, our approach is statistically consistent and performs favorably against the state-of-the-art. We are aware that other convex methods exist, e.g. [32–34], which, however, do not compare with other state-of-the-art solutions and do not provide consistency. In this sense, an exception is [1] which, contrarily to our proposal, does not enforce a fairness constraint directly in the learning phase. We note that a more detailed comparison between our proposal and the state-of-the-art is reported in the supplementary material, Section C.

2 Fair Empirical Risk Minimization

In this section, we present our approach to learning with fairness. We begin by introducing our notation. We let D = {(x_1, s_1, y_1), ..., (x_n, s_n, y_n)} be a sequence of n samples drawn independently from an unknown probability distribution μ over X × S × Y, where Y = {−1, +1} is the set of binary output labels, S = {a, b} represents group membership between two groups¹ (e.g. 'female' or 'male'), and X is the input space. We note that the input x ∈ X may or may not contain the sensitive feature s ∈ S. We also denote D^{+,g} = {(x_i, s_i, y_i) : y_i = 1, s_i = g} for g ∈ {a, b}, and n^{+,g} = |D^{+,g}|. Let us consider a function (or model) f : X → R chosen from a set F of possible models. The error (risk) of f in approximating μ is measured by a prescribed loss function ℓ : R × Y → R. The risk of f is defined as L(f) = E[ℓ(f(x), y)]. When necessary, we will indicate with a subscript the particular loss function used, i.e. L_p(f) = E[ℓ_p(f(x), y)]. The purpose of a learning procedure is to find a model that minimizes the risk.
Since the probability measure μ is usually unknown, the risk cannot be computed; however, we can compute the empirical risk L̂(f) = Ê[ℓ(f(x), y)], where Ê denotes the empirical expectation. A natural learning strategy, called Empirical Risk Minimization (ERM), is then to minimize the empirical risk within a prescribed set of functions.

¹ The extension to multiple groups (e.g. ethnic groups) is briefly discussed in the supplementary material, Section I.

2.1 Fairness Definitions

In the literature there are different definitions of fairness of a model or learning algorithm [1–3], but there is not yet a consensus on which definition is most appropriate. In this paper, we introduce a general notion of fairness which encompasses some previously used notions and allows new ones to be introduced by specifying the loss function used below.

Definition 1. Let L^{+,g}(f) = E[ℓ(f(x), y) | y = 1, s = g] be the risk of the positively labeled samples in the g-th group, and let ε ∈ [0, 1]. We say that a function f is ε-fair if |L^{+,a}(f) − L^{+,b}(f)| ≤ ε.

This definition says that a model is fair if it commits approximately the same error on the positive class independently of the group membership. That is, the conditional risk L^{+,g} is approximately constant across the two groups. Note that if ε = 0 and we use the hard loss function ℓ_h(f(x), y) = 1_{y f(x) ≤ 0}, then Definition 1 is equivalent to the definition of EO proposed by [2], namely

P{f(x) > 0 | y = 1, s = a} = P{f(x) > 0 | y = 1, s = b}.   (1)

This equation means that the true positive rate is the same across the two groups.
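As a concrete illustration, the gap between the two group-conditional true positive rates in Eq. (1) (the quantity that Section 2.2 calls the DEO) can be estimated from a finite sample as follows. This is an illustrative sketch, not the authors' released code; the function name `deo` and the 'a'/'b' group encoding are our own choices.

```python
import numpy as np

def deo(scores, y_true, s):
    """Empirical gap between the group-conditional true positive rates of
    Eq. (1): |P{f(x) > 0 | y=1, s=a} - P{f(x) > 0 | y=1, s=b}|.
    `scores` holds the real-valued outputs f(x); a positive prediction
    means f(x) > 0."""
    scores, y_true, s = map(np.asarray, (scores, y_true, s))
    groups = np.unique(s)
    assert len(groups) == 2, "exactly two sensitive groups are assumed"
    # true positive rate within each group: P{f(x) > 0 | y = 1, s = g}
    tpr = [np.mean(scores[(y_true == 1) & (s == g)] > 0) for g in groups]
    return abs(tpr[0] - tpr[1])

# toy check: every positive of group 'a' is accepted, half of group 'b'
y_true = np.array([1, 1, 1, 1, -1, -1])
s      = np.array(['a', 'a', 'b', 'b', 'a', 'b'])
scores = np.array([0.9, 0.8, 0.7, -0.2, -0.5, -0.1])
print(deo(scores, y_true, s))  # -> 0.5
```

A model is 0-fair under the hard loss exactly when this quantity is zero.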
Furthermore, if we use the linear loss function ℓ_l(f(x), y) = (1 − y f(x))/2 and set ε = 0, then Definition 1 gives

E[f(x) | y = 1, s = a] = E[f(x) | y = 1, s = b].   (2)

By reformulating this expression we obtain the notion of fairness introduced in [1]:

Σ_{g∈{a,b}} |E[f(x) | y = 1, s = g] − E[f(x) | y = 1]| = 0.

Yet another implication of Eq. (2) is that the output of the model is uncorrelated with the group membership, conditioned on the label being positive [35]; that is, for every g ∈ {a, b}, we have

E[f(x) 1_{s=g} | y = 1] = E[f(x) | y = 1] E[1_{s=g} | y = 1].

Finally, we observe that our approach naturally generalizes to other fairness measures that are based on conditional probabilities, e.g. equal odds [2] (see the supplementary material, Section A). Specifically, we would require in Definition 1 that |L^{y,a}(f) − L^{y,b}(f)| ≤ ε for both y ∈ {−1, 1}.

2.2 Fair Empirical Risk Minimization

In this paper, we aim at minimizing the risk subject to a fairness constraint. Specifically, we consider the problem

min{ L(f) : f ∈ F, |L^{+,a}(f) − L^{+,b}(f)| ≤ ε },   (3)

where ε ∈ [0, 1] is the amount of unfairness that we are willing to bear. Since the measure μ is unknown, we replace the deterministic quantities with their empirical counterparts. That is, we replace Problem (3) with

min{ L̂(f) : f ∈ F, |L̂^{+,a}(f) − L̂^{+,b}(f)| ≤ ε̂ },   (4)

where ε̂ ∈ [0, 1]. We will refer to Problem (4) as FERM.

We denote by f* a solution of Problem (3), and by f̂ a solution of Problem (4). In this section we will show that these solutions are linked to one another. In particular, if the parameter ε̂ is chosen appropriately, we will show that, in a certain sense, the estimator f̂ is consistent.
In order to present our observations, we require that, with probability at least 1 − δ, it holds that

sup_{f∈F} |L(f) − L̂(f)| ≤ B(δ, n, F),   (5)

where the bound B(δ, n, F) goes to zero as n grows to infinity if the class F is learnable with respect to the loss [see e.g. 36, and references therein]. For example, if F is a compact subset of linear separators in a Reproducing Kernel Hilbert Space (RKHS), and the loss is Lipschitz in its first argument, then B(δ, n, F) can be obtained via Rademacher bounds [see e.g. 37]. In this case B(δ, n, F) goes to zero at least as √(1/n) as n grows to infinity, where n = |D|.

We are ready to state the first result of this section (the proof is reported in the supplementary material, Section B).

Theorem 1. Let F be a learnable set of functions with respect to the loss function ℓ : R × Y → R, let f* be a solution of Problem (3) and let f̂ be a solution of Problem (4) with

ε̂ = ε + Σ_{g∈{a,b}} B(δ, n^{+,g}, F).   (6)

With probability at least 1 − 6δ it holds simultaneously that

L(f̂) − L(f*) ≤ 2 B(δ, n, F)   and   |L^{+,a}(f̂) − L^{+,b}(f̂)| ≤ ε + 2 Σ_{g∈{a,b}} B(δ, n^{+,g}, F).

A consequence of the first statement of Theorem 1 is that, as n tends to infinity, L(f̂) tends to a value which is not larger than L(f*); that is, FERM is consistent with respect to the risk of the selected model. The second statement of Theorem 1, instead, implies that as n tends to infinity f̂ tends to be ε-fair.
In other words, FERM is consistent with respect to the fairness of the selected model.

Thanks to Theorem 1 we can state that f̂ is close to f* both in terms of its risk and its fairness. Nevertheless, our final goal is to find an f*_h which solves the following problem

min{ L_h(f) : f ∈ F, |L_h^{+,a}(f) − L_h^{+,b}(f)| ≤ ε }.   (7)

Note that the objective function in Problem (7) is the misclassification error of the classifier f, whereas the fairness constraint is a relaxation of the EO constraint in Eq. (1). Indeed, the quantity |L_h^{+,a}(f) − L_h^{+,b}(f)| is equal to

|P{f(x) > 0 | y = 1, s = a} − P{f(x) > 0 | y = 1, s = b}|.   (8)

We refer to this quantity as the difference of EO (DEO).

Although Problem (7) cannot be solved, by exploiting Theorem 1 we can safely search for a solution f̂_h of its empirical counterpart

min{ L̂_h(f) : f ∈ F, |L̂_h^{+,a}(f) − L̂_h^{+,b}(f)| ≤ ε̂ }.   (9)

Unfortunately Problem (9) is a difficult nonconvex nonsmooth problem, and for this reason it is more convenient to solve a convex relaxation. That is, we replace the hard loss in the risk with a convex loss function ℓ_c (e.g. the Hinge loss ℓ_c(f(x), y) = max{0, 1 − y f(x)}) and the hard loss in the constraint with the linear loss ℓ_l. In this way, we look for a solution f̂_c of the convex FERM problem

min{ L̂_c(f) : f ∈ F, |L̂_l^{+,a}(f) − L̂_l^{+,b}(f)| ≤ ε̂ }.   (10)

Note that this approximation of the EO constraint corresponds to matching the first order moments. Other works try to match the second order moments [20] or potentially infinitely many moments [38], but these approaches result in non-convex problems.

The questions that arise here are whether and how close f̂_c is to f̂_h, and under which assumptions. The following proposition sheds some light on these issues (the proof is reported in the supplementary material, Section B).

Proposition 1.
If ℓ_c is the Hinge loss then L̂_h(f) ≤ L̂_c(f). Moreover, if for f : X → R the following condition is true

(1/2) Σ_{g∈{a,b}} |Ê[sign(f(x)) − f(x) | y = 1, s = g]| ≤ Δ̂,   (11)

then it also holds that

|L̂_h^{+,a}(f) − L̂_h^{+,b}(f)| ≤ |L̂_l^{+,a}(f) − L̂_l^{+,b}(f)| + Δ̂.

The first statement of Proposition 1 tells us that exploiting ℓ_c instead of ℓ_h is a good approximation if L̂_c(f̂_c) is small. The second statement of Proposition 1, instead, tells us that if the hypothesis of inequality (11) holds, then the linear loss based fairness is close to the EO. Obviously, the smaller Δ̂ is, the closer they are. Inequality (11) says that the functions sign(f(x)) and f(x) distribute, on average, in a similar way. This condition is quite natural, and it has been exploited in previous work [see e.g. 39]. Moreover, in Section 4 we present experiments showing that Δ̂ is small.

The bound in Proposition 1 may be tightened by using different nonlinear approximations of the EO [see e.g. 7]. However, the linear approximation proposed in this work gives a convex problem and, as we shall see in Section 4, works well in practice.

In summary, the combination of Theorem 1 and Proposition 1 provides conditions under which a solution f̂_c of Problem (10), which is convex, is close, both in terms of classification accuracy and fairness, to a solution f*_h of Problem (7), which is our final goal.

3 Fair Learning with Kernels

In this section, we specify the FERM framework to the case in which the underlying space of models is a reproducing kernel Hilbert space (RKHS) [see e.g. 40, 41, and references therein]. We let κ : X × X → R be a positive definite kernel and let φ : X →
H be an induced feature mapping such that κ(x, x′) = ⟨φ(x), φ(x′)⟩ for all x, x′ ∈ X, where H is the Hilbert space of square summable sequences. Functions in the RKHS can be parametrized as

f(x) = ⟨w, φ(x)⟩,   x ∈ X,   (12)

for some vector of parameters w ∈ H. In practice a bias term (threshold) can be added to f, but to ease our presentation we do not include it here.

We solve Problem (10) with F a ball in the RKHS, and employ a convex loss function ℓ_c. As for the fairness constraint, we use the linear loss function, which implies that the constraint is convex. Let u_g be the barycenter in the feature space of the positively labelled points in the group g ∈ {a, b}, that is

u_g = (1/n^{+,g}) Σ_{i∈I^{+,g}} φ(x_i),   (13)

where I^{+,g} = {i : y_i = 1, s_i = g}. Then, using Eq. (12), the constraint in Problem (10) takes the form |⟨w, u_a − u_b⟩| ≤ ε. In practice, we solve the Tikhonov regularization problem

min_{w∈H} Σ_{i=1}^n ℓ(⟨w, φ(x_i)⟩, y_i) + λ‖w‖²   s.t. |⟨w, u⟩| ≤ ε,   (14)

where u = u_a − u_b and λ is a positive parameter which controls model complexity. In particular, if ε = 0 the constraint in Problem (14) reduces to an orthogonality constraint that has a simple geometric interpretation. Specifically, the vector w is required to be orthogonal to the vector formed by the difference between the barycenters of the positively labelled input samples in the two groups.

By the representer theorem [42], the solution to Problem (14) is a linear combination of the feature vectors φ(x_1), ..., φ(x_n) and the vector u. However, in our case u is itself a linear combination of the feature vectors (in fact, only those corresponding to the subset of positively labeled points), hence w is a linear combination of the input points, that is, w = Σ_{i=1}^n α_i φ(x_i). The corresponding function used to make predictions is then given by f(x) = Σ_{i=1}^n α_i κ(x_i, x).
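In the linear-kernel case (φ the identity), the barycenters of Eq. (13) and the fairness constraint of Problem (14) are straightforward to compute. The following sketch is illustrative only; the function names, the feasibility tolerance, and the toy data are our own, not part of the paper's released code.

```python
import numpy as np

def barycenter_gap(X, y, s, group_a='a', group_b='b'):
    """u = u_a - u_b: difference between the barycenters (Eq. (13)) of the
    positively labeled points of the two groups, with phi = identity."""
    u_a = X[(y == 1) & (s == group_a)].mean(axis=0)
    u_b = X[(y == 1) & (s == group_b)].mean(axis=0)
    return u_a - u_b

def is_feasible(w, u, eps=0.0, tol=1e-9):
    """Fairness constraint of Problem (14): |<w, u>| <= eps (up to a tolerance)."""
    return abs(np.dot(w, u)) <= eps + tol

# hypothetical toy data: three positives (two in group a, one in group b)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
y = np.array([1, 1, 1, -1])
s = np.array(['a', 'a', 'b', 'a'])
u = barycenter_gap(X, y, s)     # (0.5, 0.5) - (1, 1) = (-0.5, -0.5)
w_fair = np.array([1.0, -1.0])  # orthogonal to u, so <w, u> = 0
print(is_feasible(w_fair, u, eps=0.0))  # -> True
```

For ε = 0 this is exactly the orthogonality constraint described above: any w with ⟨w, u⟩ = 0 is feasible.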
Let K be the Gram matrix, with entries K_ij = κ(x_i, x_j). The vector of coefficients α can then be found by solving

min_{α∈R^n} Σ_{i=1}^n ℓ( Σ_{j=1}^n K_ij α_j, y_i ) + λ Σ_{i,j=1}^n α_i α_j K_ij
s.t. |Σ_{i=1}^n α_i ( (1/n^{+,a}) Σ_{j∈I^{+,a}} K_ij − (1/n^{+,b}) Σ_{j∈I^{+,b}} K_ij )| ≤ ε.

In our experiments below we consider this particular case of Problem (14) and furthermore choose the loss function ℓ_c to be the Hinge loss. The resulting method is an extension of SVM. The fairness constraint and, in particular, the orthogonality constraint when ε = 0, can be easily added within standard SVM solvers².

It is instructive to consider Problem (14) when φ is the identity mapping (i.e. κ is the linear kernel on R^d) and ε = 0. In this special case we can solve the orthogonality constraint ⟨w, u⟩ = 0 for w_i, where the index i is such that |u_i| = ‖u‖_∞, obtaining w_i = −Σ_{j≠i} w_j u_j / u_i. Consequently the linear model rewrites as Σ_{j=1}^d w_j x_j = Σ_{j≠i} w_j (x_j − x_i u_j / u_i). In this way, we then see that the fairness constraint is implicitly enforced by making the change of representation x ↦ x̃ ∈ R^{d−1}, with

x̃_j = x_j − x_i u_j / u_i,   j ∈ {1, ..., i−1, i+1, ..., d}.   (15)

In other words, we are able to obtain a fair linear model without any additional constraint, by using a representation that has one feature fewer than the original one³.

² In the supplementary material we derive the dual of Problem (14) when ℓ_c is the Hinge loss.

4 Experiments

In this section, we present numerical experiments with the proposed method on one synthetic and five real datasets. The aim of the experiments is threefold.
First, we show that our approach is effective in selecting a fair model, incurring only a moderate loss in accuracy. Second, we provide an empirical study of the properties of the method, which supports our theoretical observations in Section 2. Third, we highlight the generality of our approach by showing that it can be used effectively within other linear models, such as the Lasso, for classification.

We use our approach with ε = 0 in order to simplify the hyperparameter selection procedure. For the sake of completeness, a set of results for different values of ε is presented in the supplementary material, and we briefly comment on these below. In all the experiments, we collect statistics concerning the classification accuracy and DEO of the selected model. We recall that the DEO is defined in Eq. (8) and is the absolute difference of the true positive rates of the classifier applied to the two groups. In all experiments, we performed a 10-fold cross validation (CV) to select the best hyperparameters⁴. For the Arrhythmia, COMPAS, German and Drug datasets, this procedure is repeated 10 times, and we report the average performance on the test set alongside its standard deviation. For the Adult dataset, we used the provided split of train and test sets. Unless otherwise stated, we employ two steps in the 10-fold CV procedure. In the first step, the value of the hyperparameters with highest accuracy is identified. In the second step, we shortlist all the hyperparameters with accuracy close to the best one (in our case, above 90% of the best accuracy). Finally, from this list, we select the hyperparameters with the lowest DEO. This novel validation procedure, which we will call NVP, is a sanity check to ensure that fairness cannot be achieved by a simple modification of the hyperparameter selection procedure. The code of our method is available at: https://github.com/jmikko/fair_ERM.

Synthetic Experiment.
The aim of this experiment is to study the behavior of our method, in terms of both DEO and classification accuracy, in comparison to standard SVM (with our novel validation procedure). To this end, we generated a synthetic binary classification dataset with two sensitive groups in the following manner. For each group in the class −1, and for the group a in the class +1, we generated 1000 examples for training and the same amount for testing. For the group b in the class +1, we generated 200 examples for training and the same number for testing. Each set of examples is sampled from a 2-dimensional isotropic Gaussian distribution with different mean μ and variance σ²: (i) group a, label +1: μ = (−1, −1), σ² = 0.8; (ii) group a, label −1: μ = (1, 1), σ² = 0.8; (iii) group b, label +1: μ = (0.5, −0.5), σ² = 0.5; (iv) group b, label −1: μ = (0.5, 0.5), σ² = 0.5. When a standard machine learning method is applied to this toy dataset, the generated model is unfair with respect to the group b, in that the classifier tends to negatively classify the examples in this group.

We trained different models, varying the value of the hyperparameter C, using the standard linear SVM and our linear method. Figure 1 (Left) shows the performance of the various generated models with respect to the classification error and DEO on the test set. Note that our method generates models that have a higher level of fairness, while maintaining a good level of accuracy. The grid in the plots emphasizes the fact that both the error and the DEO have to be considered simultaneously when evaluating a method. Figure 1 (Center and Right) depicts the histogram of the values of ⟨w, x⟩ (where w is the generated model) for test examples with true label equal to +1, for each of the two groups. The results are reported both for our method (Right) and standard SVM (Center).
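The data-generation protocol just described can be sketched as follows, assuming (as stated above) isotropic Gaussians with the listed means and variances; the function name and the random seed are our own choices, not the authors'.

```python
import numpy as np

def make_toy_dataset(rng=np.random.default_rng(0)):
    """Sample the synthetic training set described above: 2-d isotropic
    Gaussians, 1000 points per (group, label) pair except group b / label +1,
    which only gets 200 points (the under-represented positive group)."""
    spec = [  # (group, label, mean, variance, n)
        ('a', +1, (-1.0, -1.0), 0.8, 1000),
        ('a', -1, ( 1.0,  1.0), 0.8, 1000),
        ('b', +1, ( 0.5, -0.5), 0.5,  200),
        ('b', -1, ( 0.5,  0.5), 0.5, 1000),
    ]
    X, y, s = [], [], []
    for g, lab, mu, var, n in spec:
        # isotropic Gaussian: covariance var * I, i.e. std = sqrt(var) per axis
        X.append(rng.normal(loc=mu, scale=np.sqrt(var), size=(n, 2)))
        y += [lab] * n
        s += [g] * n
    return np.vstack(X), np.array(y), np.array(s)

X, y, s = make_toy_dataset()
print(X.shape, (y == 1).sum(), (s == 'b').sum())  # -> (3200, 2) 1200 1200
```

The imbalance between the two positive groups (1000 vs. 200) is what makes a plain accuracy-driven classifier unfair towards group b.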
Note that our method generates a model with a similar true positive rate across the two groups (i.e. the areas where the horizontal axis is greater than zero are similar for groups a and b). Moreover, due to the simplicity of the toy test, the distribution with respect to the two different groups is also very similar when our model is used.

³ In the supplementary material we report the generalization of this argument to kernels for SVM.
⁴ The regularization parameter C (for both SVM and our method) takes 30 values, equally spaced in logarithmic scale between 10⁻⁴ and 10⁴; we used both the linear and the RBF kernel (i.e., for two examples x and z, the RBF kernel is e^{−γ‖x−z‖²}) with γ ∈ {0.001, 0.01, 0.1, 1}. In our case, C = 1/(2λ) in Eq. (14).

Figure 1: Left: test classification error and DEO for different values of the hyperparameter C for standard linear SVM (green circles) and our modified linear SVM (magenta stars). Center and Right: histograms of the distribution of the values ⟨w, x⟩ for the two groups (a in blue and b in light orange) for test examples with label equal to +1. The results are collected by using the optimal validated model for the classical linear SVM (Center) and for our linear method (Right).

s not included in x:

Method         | Arrhythmia ACC / DEO  | COMPAS ACC / DEO      | Adult ACC / DEO | German ACC / DEO      | Drug ACC / DEO
Naïve Lin. SVM | 0.75±0.04 / 0.11±0.03 | 0.73±0.01 / 0.13±0.02 | 0.78 / 0.10     | 0.71±0.06 / 0.16±0.04 | 0.79±0.02 / 0.25±0.03
Lin. SVM       | 0.71±0.05 / 0.10±0.03 | 0.72±0.01 / 0.12±0.02 | 0.78 / 0.09     | 0.69±0.04 / 0.11±0.10 | 0.79±0.02 / 0.25±0.04
Hardt          | -                     | -                     | -               | -                     | -
Zafar          | 0.67±0.03 / 0.05±0.02 | 0.69±0.01 / 0.10±0.08 | 0.76 / 0.05     | 0.62±0.09 / 0.13±0.10 | 0.66±0.03 / 0.06±0.06
Lin. Ours      | 0.75±0.05 / 0.05±0.02 | 0.73±0.01 / 0.07±0.02 | 0.75 / 0.01     | 0.69±0.04 / 0.06±0.03 | 0.79±0.02 / 0.10±0.06
Naïve SVM      | 0.75±0.04 / 0.11±0.03 | 0.72±0.01 / 0.14±0.02 | 0.80 / 0.09     | 0.74±0.05 / 0.12±0.05 | 0.81±0.02 / 0.22±0.04
SVM            | 0.71±0.05 / 0.10±0.03 | 0.73±0.01 / 0.11±0.02 | 0.79 / 0.08     | 0.74±0.03 / 0.10±0.06 | 0.81±0.02 / 0.22±0.03
Hardt          | -                     | -                     | -               | -                     | -
Ours           | 0.75±0.05 / 0.05±0.02 | 0.72±0.01 / 0.08±0.02 | 0.77 / 0.01     | 0.73±0.04 / 0.05±0.03 | 0.79±0.03 / 0.10±0.05

s included in x:

Method         | Arrhythmia ACC / DEO  | COMPAS ACC / DEO      | Adult ACC / DEO | German ACC / DEO      | Drug ACC / DEO
Naïve Lin. SVM | 0.79±0.06 / 0.14±0.03 | 0.76±0.01 / 0.17±0.02 | 0.81 / 0.14     | 0.71±0.06 / 0.17±0.05 | 0.81±0.02 / 0.44±0.03
Lin. SVM       | 0.78±0.07 / 0.13±0.04 | 0.75±0.01 / 0.15±0.02 | 0.80 / 0.13     | 0.69±0.04 / 0.11±0.10 | 0.81±0.02 / 0.41±0.06
Hardt          | 0.74±0.06 / 0.07±0.04 | 0.67±0.03 / 0.21±0.09 | 0.80 / 0.10     | 0.61±0.15 / 0.15±0.13 | 0.77±0.02 / 0.22±0.09
Zafar          | 0.71±0.03 / 0.03±0.02 | 0.69±0.02 / 0.10±0.06 | 0.78 / 0.05     | 0.62±0.09 / 0.13±0.11 | 0.69±0.03 / 0.02±0.07
Lin. Ours      | 0.79±0.07 / 0.04±0.03 | 0.76±0.01 / 0.04±0.03 | 0.77 / 0.01     | 0.69±0.04 / 0.05±0.03 | 0.79±0.02 / 0.05±0.03
Naïve SVM      | 0.79±0.06 / 0.14±0.04 | 0.76±0.01 / 0.18±0.02 | 0.84 / 0.18     | 0.74±0.05 / 0.12±0.05 | 0.82±0.02 / 0.45±0.04
SVM            | 0.78±0.06 / 0.13±0.04 | 0.73±0.01 / 0.14±0.02 | 0.82 / 0.14     | 0.74±0.03 / 0.10±0.06 | 0.81±0.02 / 0.38±0.03
Hardt          | 0.74±0.06 / 0.07±0.04 | 0.71±0.01 / 0.08±0.01 | 0.82 / 0.11     | 0.71±0.03 / 0.11±0.18 | 0.75±0.11 / 0.14±0.08
Ours           | 0.79±0.09 / 0.03±0.02 | 0.73±0.01 / 0.05±0.03 | 0.81 / 0.01     | 0.73±0.04 / 0.05±0.03 | 0.80±0.03 / 0.07±0.05

Table 1: Results (average ± standard deviation, when a fixed test set is not provided) for all the datasets, concerning accuracy (ACC) and DEO.

Real Data Experiments. We next compare the performance of our model to a set of different methods on 5 publicly available datasets: Arrhythmia, COMPAS, Adult, German, and Drug. A description of the datasets is provided in the supplementary material. These datasets have been selected from the standard databases of datasets (UCI, mldata and Fairness-Measures⁵). We considered only datasets with a DEO higher than 0.1 when the model is generated by an SVM validated with the NVP. For this reason, some of the commonly used datasets have been discarded (e.g. Diabetes, Heart, SAT, PSU-Chile, and SOEP). We compared our method, both in the linear and nonlinear case, against: (i) Naïve SVM, validated with a standard nested 10-fold CV procedure; this method ignores fairness in the validation procedure, simply trying to optimize accuracy; (ii) SVM with the NVP.
As noted above, this baseline is the simplest way to inject fairness into the model; (iii) the method of Hardt [2], applied to the best SVM; (iv) the method of Zafar [3], implemented with the code provided by the authors for the linear case⁶. Concerning our method, in the linear case it exploits the preprocessing presented in Section 3.

⁵ Fairness-Measures website: fairness-measures.org
⁶ Python code for [3]: https://github.com/mbilalzafar/fair-classification

Figure 2: Results of Table 1 for linear (left) and nonlinear (right) methods, when the error and the DEO are normalized to [0, 1] column-wise and s is included in x. Different symbols and colors refer to different datasets and methods, respectively. The closer a point is to the origin, the better the result.

Table 1 shows our experimental results for all the datasets and methods, both when s is included in x and when it is not. The results suggest that our method performs favorably over the competitors, in that it decreases the DEO substantially with only a moderate loss in accuracy. Moreover, having s included in x increases accuracy but, for the methods that are not specifically designed to produce fair models, decreases fairness. On the other hand, having s included in x gives our method the ability to improve fairness by exploiting the value of s also in the prediction phase. This is to be expected, since knowing the group membership increases our information, but it also enables behaviours that influence the fairness of the predictive model. In order to quantify this effect, we present in Figure 2 the results of Table 1 for linear (left) and nonlinear (right) methods, with the error (one minus accuracy) and the DEO normalized to [0, 1] column-wise, both when s is included in x and when it is not. In the figure, different symbols and colors refer to different datasets and methods, respectively. The closer a point is to the origin, the better the result is.
The best accuracy is, in general, reached by the Naïve SVM (in red), in both the linear and nonlinear case. This behavior is expected, given the absence of any fairness constraint. On the other hand, the Naïve SVM has unsatisfactory levels of fairness. The methods of Hardt [2] (in blue) and Zafar [3] (in cyan, for the linear case) are able to obtain a good level of fairness, but the price of this fairness is a strong decrease in accuracy. Our method (in magenta) obtains similar or better results for the DEO while preserving accuracy. In particular, in the nonlinear case our method reaches the lowest DEO among all methods. For the sake of completeness, in the nonlinear (right) part of Figure 2 we also show our method when the parameter ε is set to 0.1 (in brown) instead of 0 (in magenta). As expected, the generated models are less fair, with a (small) improvement in accuracy. An in-depth analysis of the role of ε is presented in the supplementary material.

Application to Lasso. Since the proposed methodology is generic, we can in principle apply our method to any learning algorithm. In particular, when the algorithm generates a linear model, we can exploit the data preprocessing of Eq. (15) to directly impose fairness on the model. Here, we show how a sparse and fair model can be obtained by exploiting the standard Lasso algorithm in synergy with this preprocessing step. For this purpose, we selected the Arrhythmia dataset, as the Lasso works well in the high-dimensional / small-sample setting. We performed the same experiment described above, using the Lasso algorithm in place of the SVM. In this case, by Naïve Lasso we refer to the Lasso validated with a standard nested 10-fold CV procedure, whereas by Lasso we refer to the standard Lasso with the NVP outlined above. The method of [2] has been applied to the best Lasso model.
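The linear-model preprocessing invoked here can be sketched in a few lines. The following is a hedged reconstruction rather than the paper's exact Eq. (15): we assume u is the difference between the barycenters of the positively labeled examples of the two groups, and we eliminate the feature carrying the largest component of u, so that any linear model trained on the new representation corresponds to a full-dimensional weight vector orthogonal to u:

```python
import numpy as np

def fair_preprocess(X, y, s):
    """Sketch of an orthogonality-inducing preprocessing for linear models
    (assumed reconstruction, not the paper's verbatim Eq. (15)).
    u = difference between the mean positive example of the two groups;
    the pivot feature k = argmax |u_j| is eliminated, and for j != k
    the features become x_j - x_k * u_j / u_k, so that any linear model
    on the reduced representation is implicitly orthogonal to u.
    Assumes u_k != 0 (i.e. the group barycenters differ)."""
    X, y, s = np.asarray(X, float), np.asarray(y), np.asarray(s)
    u = (X[(y == 1) & (s == 0)].mean(axis=0)
         - X[(y == 1) & (s == 1)].mean(axis=0))
    k = int(np.argmax(np.abs(u)))             # pivot feature to eliminate
    keep = np.arange(X.shape[1]) != k
    X_new = X[:, keep] - np.outer(X[:, k], u[keep] / u[k])
    return X_new, u, k
```

After this step, the positive-class barycenters of the two groups coincide, and any off-the-shelf linear algorithm (e.g. a standard Lasso solver) can be trained on the transformed data unchanged, which is what makes the combination with the Lasso described above possible.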
Moreover, we report the results obtained using the Naïve Linear SVM and the Linear SVM. We also repeated the experiment with a training set reduced by 50%, in order to highlight the effect of sparsity. Table 2 shows the results in the case when s is included in x. Note that, as the training set shrinks, the generated models become less fair (i.e. the DEO increases). Using our method, we are able to maintain a fair model while reaching satisfactory accuracy.

Arrhythmia dataset
Method           Accuracy      DEO           Selected Features
Naïve Lin. SVM   0.79 ± 0.06   0.14 ± 0.03   -
Linear SVM       0.78 ± 0.07   0.13 ± 0.04   -
Naïve Lasso      0.79 ± 0.07   0.11 ± 0.04   22.7 ± 9.1
Lasso            0.74 ± 0.04   0.07 ± 0.04   5.2 ± 3.7
Hardt            0.71 ± 0.05   0.04 ± 0.06   5.2 ± 3.7
Our Lasso        0.77 ± 0.02   0.03 ± 0.02   7.5 ± 2.0

Arrhythmia dataset - Training set reduced by 50%
Method           Accuracy      DEO           Selected Features
Naïve Lin. SVM   0.69 ± 0.03   0.16 ± 0.03   -
Linear SVM       0.68 ± 0.03   0.15 ± 0.03   -
Naïve Lasso      0.73 ± 0.04   0.15 ± 0.06   14.1 ± 6.6
Lasso            0.70 ± 0.04   0.09 ± 0.05   7.9 ± 8.0
Hardt            0.67 ± 0.06   0.08 ± 0.07   7.9 ± 8.0
Our Lasso        0.71 ± 0.04   0.03 ± 0.04   9.0 ± 7.3

Table 2: Results (average ± standard deviation) when the model is the Lasso, reporting accuracy, DEO, and the number of features with weight larger than 10⁻⁸ out of the 279 original features. The experiment has also been repeated with a reduced training set. In both cases s is included in x.

The Value of Δ̂. Finally, we present experimental results showing that the hypotheses of Proposition 1 (Section 2.2) are reasonable in real cases. We know that, if the hypotheses of inequality (11) are satisfied, the linear-loss-based fairness is close to the EO; specifically, the two quantities are closer when Δ̂ is small. We evaluated Δ̂ for the benchmark and toy datasets. The obtained results are reported in Table 3, where Δ̂ has order of magnitude 10⁻² for all the datasets. Consequently, our method is able to obtain a good approximation of the DEO.

Dataset         Δ̂
Toytest         0.03
Toytest Lasso   0.02
Arrhythmia      0.03
COMPAS          0.04
Adult           0.06
German          0.05
Drug            0.03

Table 3: The value of Δ̂ for the exploited datasets. A smaller Δ̂ means a better approximation of the DEO in our method.

5 Conclusion and Future Work

We have presented a generalized notion of fairness, which encompasses previously introduced notions and can be used to constrain ERM in order to learn fair classifiers. The framework is appealing both theoretically and practically. Our theoretical observations provide a statistical justification for this approach, and our algorithmic observations suggest a way to implement it efficiently in the setting of kernel methods. Experimental results suggest that our approach is promising for applications, generating models with improved fairness properties while maintaining classification accuracy. We close by mentioning directions of future research. On the algorithmic side, it would be interesting to study whether our method can be improved by other relaxations of the fairness constraint beyond the linear loss used here. Applications of the fairness constraint to multi-class classification or to regression tasks would also be valuable.
On the theory side, it would be interesting to study how the choice of the parameter ε affects the statistical performance of our method, and to derive the optimal accuracy-fairness trade-off as a function of this parameter.

Acknowledgments
We wish to thank Amon Elders, Theodoros Evgeniou and Andreas Maurer for useful comments. This work was supported in part by SAP SE and the EPSRC.

References
[1] C. Dwork, N. Immorlica, A. T. Kalai, and M. D. M. Leiserson. Decoupled classifiers for group-fair and efficient machine learning. In Conference on Fairness, Accountability and Transparency, 2018.
[2] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, 2016.
[3] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In International Conference on World Wide Web, 2017.
[4] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In International Conference on Machine Learning, 2013.
[5] N. Kilbertus, M. Rojas-Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf. Avoiding discrimination through causal reasoning. In Advances in Neural Information Processing Systems, 2017.
[6] M. J. Kusner, J. Loftus, C. Russell, and R. Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, 2017.
[7] F. Calmon, D. Wei, B. Vinzamuri, K. Natesan Ramamurthy, and K. R. Varshney. Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems, 2017.
[8] M. Joseph, M. Kearns, J. H. Morgenstern, and A. Roth. Fairness in learning: Classic and contextual bandits. In Advances in Neural Information Processing Systems, 2016.
[9] F. Chierichetti, R. Kumar, S. Lattanzi, and S. Vassilvitskii.
Fair clustering through fairlets. In Advances in Neural Information Processing Systems, 2017.
[10] S. Jabbari, M. Joseph, M. Kearns, J. Morgenstern, and A. Roth. Fair learning in Markovian environments. In Conference on Fairness, Accountability, and Transparency in Machine Learning, 2016.
[11] S. Yao and B. Huang. Beyond parity: Fairness objectives for collaborative filtering. In Advances in Neural Information Processing Systems, 2017.
[12] K. Lum and J. Johndrow. A statistical framework for fair predictive algorithms. arXiv preprint arXiv:1610.08077, 2016.
[13] I. Zliobaite. On the relation between accuracy and fairness in binary classification. arXiv preprint arXiv:1505.05723, 2015.
[14] T. Calders, F. Kamiran, and M. Pechenizkiy. Building classifiers with independency constraints. In IEEE International Conference on Data Mining, 2009.
[15] G. Pleiss, M. Raghavan, F. Wu, J. Kleinberg, and K. Q. Weinberger. On fairness and calibration. In Advances in Neural Information Processing Systems, 2017.
[16] A. Beutel, J. Chen, Z. Zhao, and E. H. Chi. Data decisions and theoretical implications when adversarially learning fair representations. In Conference on Fairness, Accountability, and Transparency in Machine Learning, 2017.
[17] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian. Certifying and removing disparate impact. In International Conference on Knowledge Discovery and Data Mining, 2015.
[18] A. Agarwal, A. Beygelzimer, M. Dudík, and J. Langford. A reductions approach to fair classification. In Conference on Fairness, Accountability, and Transparency in Machine Learning, 2017.
[19] A. Agarwal, A. Beygelzimer, M. Dudík, J. Langford, and H. Wallach. A reductions approach to fair classification. arXiv preprint arXiv:1803.02453, 2018.
[20] B. Woodworth, S. Gunasekar, M. I. Ohannessian, and N. Srebro.
Learning non-discriminatory predictors. In Computational Learning Theory, 2017.
[21] A. K. Menon and R. C. Williamson. The cost of fairness in binary classification. In Conference on Fairness, Accountability and Transparency, 2018.
[22] M. B. Zafar, I. Valera, M. Rodriguez, K. Gummadi, and A. Weller. From parity to preference-based notions of fairness in classification. In Advances in Neural Information Processing Systems, 2017.
[23] Y. Bechavod and K. Ligett. Penalizing unfairness in binary classification. arXiv preprint arXiv:1707.00044v3, 2018.
[24] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi. Fairness constraints: Mechanisms for fair classification. In International Conference on Artificial Intelligence and Statistics, 2017.
[25] T. Kamishima, S. Akaho, and J. Sakuma. Fairness-aware learning through regularization approach. In International Conference on Data Mining Workshops, 2011.
[26] M. Kearns, S. Neel, A. Roth, and Z. S. Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. arXiv preprint arXiv:1711.05144, 2017.
[27] J. Adebayo and L. Kagal. Iterative orthogonal feature projection for diagnosing bias in black-box models. In Conference on Fairness, Accountability, and Transparency in Machine Learning, 2016.
[28] F. Kamiran and T. Calders. Classifying without discriminating. In International Conference on Computer, Control and Communication, 2009.
[29] F. Kamiran and T. Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1-33, 2012.
[30] F. Kamiran and T. Calders. Classification with no discrimination by preferential sampling. In Machine Learning Conference, 2010.
[31] L. Oneto, M. Donini, A. Elders, and M. Pontil. Taking advantage of multitask learning for fair classification.
In AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2019.
[32] A. Pérez-Suay, V. Laparra, G. Mateo-García, J. Muñoz-Marí, L. Gómez-Chova, and G. Camps-Valls. Fair kernel learning. In Machine Learning and Knowledge Discovery in Databases, 2017.
[33] R. Berk, H. Heidari, S. Jabbari, M. Joseph, M. Kearns, J. Morgenstern, S. Neel, and A. Roth. A convex framework for fair regression. arXiv preprint arXiv:1706.02409, 2017.
[34] D. Alabi, N. Immorlica, and A. T. Kalai. When optimizing nonlinear objectives is no harder than linear objectives. arXiv preprint arXiv:1804.04503, 2018.
[35] M. Donini, S. Ben-David, M. Pontil, and J. Shawe-Taylor. An efficient method to impose fairness in linear models. In NIPS Workshop on Prioritising Online Content, 2017.
[36] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[37] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.
[38] N. Quadrianto and V. Sharmanska. Recycling privileged learning and distribution matching for fairness. In Advances in Neural Information Processing Systems, 2017.
[39] A. Maurer. A note on the PAC-Bayesian theorem. arXiv preprint cs/0411099, 2004.
[40] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[41] A. J. Smola and B. Schölkopf. Learning with Kernels. MIT Press, 2001.
[42] B. Schölkopf, R. Herbrich, and A. Smola. A generalized representer theorem. In Computational Learning Theory, 2001.
[43] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[44] R. T. Rockafellar. Convex Analysis.
Princeton University Press, 1970.