{"title": "Optimization over Continuous and Multi-dimensional Decisions with Observational Data", "book": "Advances in Neural Information Processing Systems", "page_first": 2962, "page_last": 2970, "abstract": "We consider the optimization of an uncertain objective over continuous and multi-dimensional decision spaces in problems in which we are only provided with observational data. We propose a novel algorithmic framework that is tractable, asymptotically consistent, and superior to comparable methods on example problems. Our approach leverages predictive machine learning methods and incorporates information on the uncertainty of the predicted outcomes for the purpose of prescribing decisions. We demonstrate the efficacy of our method on examples involving both synthetic and real data sets.", "full_text": "Optimization over Continuous and Multi-dimensional\n\nDecisions with Observational Data\n\nDimitris Bertsimas\n\nSloan School of Management\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02142\ndbertsim@mit.edu\n\nChristopher McCord\n\nOperations Research Center\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02142\n\nmccord@mit.edu\n\nAbstract\n\nWe consider the optimization of an uncertain objective over continuous and multi-\ndimensional decision spaces in problems in which we are only provided with\nobservational data. We propose a novel algorithmic framework that is tractable,\nasymptotically consistent, and superior to comparable methods on example prob-\nlems. Our approach leverages predictive machine learning methods and incorpo-\nrates information on the uncertainty of the predicted outcomes for the purpose of\nprescribing decisions. We demonstrate the ef\ufb01cacy of our method on examples\ninvolving both synthetic and real data sets.\n\n1\n\nIntroduction\n\nWe study the general problem in which a decision maker seeks to optimize a known objective function\nthat depends on an uncertain quantity. 
The uncertain quantity has an unknown distribution, which may be affected by the action chosen by the decision maker. Many important problems across a variety of fields fit into this framework. In healthcare, for example, a doctor aims to prescribe drugs in specific dosages to regulate a patient's vital signs. In revenue management, a store owner must decide how to price various products in order to maximize profit. In online retail, companies decide which products to display for a user to maximize sales. The general problem we study is characterized by the following components:

• Decision variable: z ∈ Z ⊂ R^p,
• Outcome: Y(z) ∈ Y (We adopt the potential outcomes framework [20], in which Y(z) denotes the (random) quantity that would have been observed had decision z been chosen.),
• Auxiliary covariates (also called side-information or context): x ∈ X ⊂ R^d,
• Cost function: c(z; y) : Z × Y → R. (This function is known a priori.)

We allow the auxiliary covariates, decision variable, and outcome to take values on multi-dimensional, continuous sets. A decision-maker seeks to determine the action that minimizes the conditional expected cost:

min_{z∈Z} E[c(z; Y(z)) | X = x].    (1)

Of course, the distribution of Y(z) is unknown, so it is not possible to solve this problem exactly. However, we assume that we have access to observational data, consisting of n independent and identically distributed observations, (Xi, Zi, Yi) for i = 1, . . . , n. Each of these observations consists of an auxiliary covariate vector, a decision, and an observed outcome. This type of data presents two challenges that differentiate our problem from a predictive machine learning problem. First, it is incomplete. We only observe Yi := Yi(Zi), the outcome associated with the applied decision.
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We do not observe what the outcome would have been under a different decision. Second, the decisions were not necessarily chosen independently of the outcomes, as they would have been in a randomized experiment, and we do not know how the decisions were assigned. Following common practice in the causal inference literature, we make the ignorability assumption of Hirano and Imbens [13].

Assumption 1 (Ignorability). Y(z) ⊥⊥ Z | X ∀z ∈ Z.

In other words, we assume that historically the decision Z has been chosen as a function of the auxiliary covariates X. There were no unmeasured confounding variables that affected both the choice of decision and the outcome. Under this assumption, we are able to rewrite the objective of (1) as

E[c(z; Y) | X = x, Z = z].

This form of the objective is easier to learn because it depends only on the observed outcome, not on the counterfactual outcomes. A direct approach to solve this problem is to use a regression method to predict the cost as a function of x and z and then choose z to minimize this predicted cost. If the selected regression method is uniformly consistent in z, then the action chosen by this method will be asymptotically optimal under certain conditions. (We will formalize this later.) However, this requires choosing a regression method that ensures the optimization problem is tractable. For this work, we restrict our attention to linear and tree-based methods, such as CART [7] and random forests [6], as they are both effective and tractable for many practical problems.

A key issue with the direct approach is that it tries to learn too much. It tries to learn the expected outcome under every possible decision, and the level of uncertainty associated with the predicted expected cost can vary between different decisions.
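The direct approach just described can be sketched in a few lines. The following is an illustrative sketch on synthetic data, not the authors' implementation; the cost function, historical policy, and candidate grid are our own assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic observational data: covariates X, decisions Z, observed costs Y.
# The data-generating process below is an illustrative assumption.
n = 2000
X = rng.uniform(0, 1, size=(n, 2))
Z = rng.uniform(0, 1, size=(n, 1))
Y = (Z[:, 0] - X[:, 0]) ** 2 + 0.1 * rng.standard_normal(n)  # cost minimized at z = x_0

# Direct approach: regress observed cost on the concatenated features (x, z).
model = RandomForestRegressor(n_estimators=200, min_samples_leaf=25, random_state=0)
model.fit(np.hstack([X, Z]), Y)

def direct_decision(x, candidates):
    """Choose the candidate decision with the smallest predicted cost at x."""
    feats = np.hstack([np.tile(x, (len(candidates), 1)), candidates])
    return candidates[np.argmin(model.predict(feats))]

x_new = np.array([0.3, 0.5])
z_grid = np.linspace(0, 1, 101).reshape(-1, 1)
z_hat = direct_decision(x_new, z_grid)  # should land near the true minimizer z = 0.3
```

The grid search over candidate decisions stands in for whatever optimization routine is appropriate for the chosen predictive model.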
This method can lead us to select a decision which has a small point estimate of the cost, but a large uncertainty interval.

1.1 Notation

Throughout the paper, we use capital letters to refer to random quantities and lower case letters to refer to deterministic quantities. Thus, we use Z to refer to the decision randomly assigned by the (unknown) historical policy and z to refer to a specific action. For a given auxiliary covariate vector, x, and a proposed decision, z, the conditional expectation E[c(z; Y)|X = x, Z = z] means the expectation of the cost function c(z; Y) under the conditional measure in which X is fixed as x and Z is fixed as z. We ignore details of measurability throughout and assume this conditional expectation is well defined. Throughout, all norms are ℓ2 norms unless otherwise specified. We use (X, Z) to denote vector concatenation.

1.2 Related Work

Recent years have seen tremendous interest in the area of data-driven optimization. Much of this work combines ideas from the statistics and machine learning literature with techniques from mathematical optimization. Bertsimas and Kallus [4] developed a framework that uses nonparametric machine learning methods to solve data-driven optimization problems in the presence of auxiliary covariates. They take advantage of the fact that for many machine learning algorithms, the predictions are given by a linear combination of the training samples' target variables. Kao et al. [17] and Elmachtoub and Grigas [11] developed algorithms that make predictions tailored for use in specific optimization problems. However, they all deal with the setting in which the decision does not affect the outcome. This is insufficient for many applications, such as pricing, in which the demand for a product is clearly affected by the price. Bertsimas and Kallus [5] later studied the limitations of predictive approaches to pricing problems.
In particular, they demonstrated that confounding in the data between the decision and outcome can lead to large optimality gaps if ignored. They proposed a kernel-based method for data-driven optimization in this setting, but it does not scale well with the dimension of the decision space. Misic [19] developed an efficient mixed integer optimization formulation for problems in which the predicted cost is given by a tree ensemble model. This approach scales fairly well with the dimension of the decision space but does not consider the need for uncertainty penalization.

Another relevant area of research is causal inference (see Rosenbaum [20] for an overview), which concerns the study of causal effects from observational data. Much of the work in this area has focused on determining whether a treatment has a significant effect on the population as a whole. However, a growing body of work has focused on learning optimal, personalized treatments from observational data. Athey and Wager [1] proposed an algorithm that achieves optimal (up to a constant factor) regret bounds in learning a treatment policy when there are two potential treatments. Kallus [14] proposed an algorithm to efficiently learn a treatment policy when there is a finite set of potential treatments. Bertsimas et al. [3] developed a tree-based algorithm that learns to personalize treatment assignments from observational data. It is based on the optimal trees machine learning method [2] and has performed well in experiments. Considerably less attention has been paid to problems with a continuous decision space. Hirano and Imbens [13] introduced the problem of inference with a continuous treatment, and Flores [12] studied the problem of learning an optimal policy in this setting. Recently, Kallus and Zhou [16] developed an approach to policy learning with a continuous decision variable that generalizes the idea of inverse propensity score weighting.
Our approach differs in\nthat we focus on regression-based methods, which we believe scale better with the dimension of the\ndecision space and avoid the need for density estimation.\nThe idea of uncertainty penalization has been explored as an alternative to empirical risk minimization\nin statistical learning, starting with Maurer and Pontil [18]. Swaminathan and Joachims [21] applied\nuncertainty penalization to the of\ufb02ine bandit setting. Their setting is similar to the one we study. An\nagent seeks to minimize the prediction error of his/her decision, but only observes the loss associated\nwith the selected decision. They assumed that the policy used in the training data is known, which\nallowed them to use inverse propensity weighting methods. In contrast, we assume ignorability, but\nnot knowledge of the historical policy, and we allow for more complex decision spaces.\nWe note that our approach bears a super\ufb01cial resemblance to the upper con\ufb01dence bound (UCB)\nalgorithms for multi-armed bandits (cf. Bubeck et al. [8]). These algorithms choose the action with\nthe highest upper con\ufb01dence bound on its predicted expected reward. Our approach, in contrast,\nchooses the action with the highest lower con\ufb01dence bound on its predicted expected reward (or\nlowest upper con\ufb01dence bound on predicted expected cost). The difference is that UCB algorithms\nchoose actions with high upside to balance exploration and exploitation in the online bandit setting,\nwhereas we work in the of\ufb02ine setting with a focus on solely exploitation.\n\n1.3 Contributions\n\nOur primary contribution is an algorithmic framework for observational data driven optimization that\nallows the decision variable to take values on continuous and multidimensional sets. 
We consider applications in personalized medicine, in which the decision is the dose of Warfarin to prescribe to a patient, and in pricing, in which the action is the list of prices for several products in a store.

2 Approach

In this section, we introduce the uncertainty penalization approach for optimization with observational data. Recall that the observational data consists of n i.i.d. observations, (X1, Z1, Y1), . . . , (Xn, Zn, Yn). For observation i, Xi represents the pertinent auxiliary covariates, Zi is the decision that was applied, and Yi is the observed response. The first step of the approach is to train a predictive machine learning model to estimate E[c(z; Y)|X = x, Z = z]. When training the predictive model, the feature space is the Cartesian product of the auxiliary covariate space and the decision space, X × Z. We have several options for how to train the predictive model. We can train the model to predict Y, the cost c(Z; Y), or a combination of these two responses. In general, we denote the prediction of the ML algorithm as a linear combination of the cost function evaluated at the training examples,

μ̂(x, z) := Σ_{i=1}^n wi(x, z) c(z; Yi).

We require the predictive model to satisfy a generalization of the honesty property of Wager and Athey [23].

Assumption 2 (Honesty). The model trained on (X1, Z1, Y1), . . . , (Xn, Zn, Yn) is honest, i.e., the weights, wi(x, z), are determined independently of the outcomes, Y1, . . . , Yn.

This honesty assumption reduces the bias of the predictions of the cost. We also enforce several restrictions on the weight functions.

Assumption 3 (Weights). For all (x, z) ∈ X × Z, Σ_{i=1}^n wi(x, z) = 1 and for all i, wi(x, z) ∈ [0, 1/γn].
In addition, X × Z can be partitioned into Γn regions such that if (x, z) and (x, z′) are in the same region, ||w(x, z) − w(x, z′)||_1 ≤ α||z − z′||_2.

The direct approach to solving (1) amounts to choosing z ∈ Z that minimizes μ̂(x, z), for each new instance of auxiliary covariates, x. However, the variance of the predicted cost, μ̂(x, z), can vary with the decision variable, z. Especially with a small training sample size, the direct approach, minimizing μ̂(x, z), can give a decision with a small, but highly uncertain, predicted cost. We can reduce the expected regret of our action by adding a penalty term for the variance of the selected decision. If Assumption 2 holds, the conditional variance of μ̂(x, z) given (X1, Z1), . . . , (Xn, Zn) is given by

V(x, z) := Σ_{i=1}^n wi(x, z)² Var(c(z; Yi)|Xi, Zi).

In addition, μ̂(x, z) may not be an unbiased predictor, so we also introduce a term that penalizes the conditional bias of the predicted cost given (X1, Z1), . . . , (Xn, Zn). Since the true cost is unknown, it is not possible to exactly compute this bias. Instead, we compute an upper bound under a Lipschitz assumption (details in Section 3):

B(x, z) := Σ_{i=1}^n wi(x, z) ||(Xi, Zi) − (x, z)||_2.

Overall, given a new vector of auxiliary covariates, x ∈ X, our approach makes a decision by solving

min_{z∈Z} μ̂(x, z) + λ1 √(V(x, z)) + λ2 B(x, z),    (2)

where λ1 and λ2 are tuning parameters.

As a concrete example, we can use the CART algorithm of Breiman et al. [7] or the optimal regression tree algorithm of Bertsimas and Dunn [2] as the predictive method. These algorithms work by partitioning the training examples into clusters, i.e., the leaves of the tree.
For a new observation, a prediction of the response variable is made by averaging the responses of the training examples that are contained in the same leaf:

wi(x, z) = 1/N(x, z) if (Xi, Zi) ∈ l(x, z), and 0 otherwise,

where l(x, z) denotes the set of training examples that are contained in the same leaf of the tree as (x, z), and N(x, z) = |l(x, z)|. The variance term will be small when the leaf has a large number of training examples, and the bias term will be small when the diameter of the leaf is small. Assumption 2 can be satisfied by ignoring the outcomes when selecting the splits or by dividing the training data into two sets, one for making splits and one for making predictions. Assumption 3 is satisfied with α = 0 if the minimum number of training samples in each leaf is γn and the maximum number of leaves in the tree is Γn.

2.1 Parameter Tuning

Before proceeding, we note that the variance terms, Var(c(z; Yi) | Xi, Zi), are often unknown in practice. In the absence of further knowledge, we assume homoscedasticity, i.e., Var(Yi|Xi, Zi) is constant. It is possible to estimate this value by training a machine learning model to predict Yi as a function of (Xi, Zi) and computing the mean squared error on the training set. However, it may be advantageous to include this value with the tuning parameter λ1.

We have several options for tuning parameters λ1 and λ2 (and whatever other parameters are associated with the predictive model). Because the counterfactual outcomes are unknown, it is not possible to use the standard approach of holding out a validation set during training and evaluating the error of the model on that validation set for each combination of possible parameters. One option is to tune the predictive model's parameters using cross validation to maximize predictive accuracy and then select λ1 and λ2 using the theory we present in Section 3.
Another option is to split the data into a training and validation set and train a predictive model on the validation data to impute the counterfactual outcomes. We then select the model that minimizes the predicted cost on the validation set. For the examples in Section 4, we use a combination of these two ideas. We train a random forest model on the validation set (in order to impute counterfactual outcomes), and we then select the model that minimizes the sum of the mean squared error and the predicted cost on the validation data. In the supplementary materials, we include computations that demonstrate, for the Warfarin example of Section 4.2, the method is not too sensitive to the choice of λ1 and λ2.

3 Theory

In this section, we describe the theoretical motivation for our approach and provide finite-sample generalization and regret bounds. For notational convenience, we define

μ(x, z) := E[c(z; Y(z))|X = x] = E[c(z; Y)|X = x, Z = z],

where the second equality follows from the ignorability assumption. Before presenting the results, we first present a few additional assumptions.

Assumption 4 (Regularity). The set X × Z is nonempty, closed, and bounded with diameter D.

Assumption 5 (Objective Conditions). The objective function satisfies the following properties:

1. |c(z; y)| ≤ 1 ∀z, y.
2. For all y ∈ Y, c(·; y) is L-Lipschitz.
3. For any x, x′ ∈ X and any z, z′ ∈ Z, |μ(x, z) − μ(x′, z′)| ≤ L||(x, z) − (x′, z′)||.

These assumptions provide some conditions under which the generalization and regret bounds hold, but similar results hold under alternative sets of assumptions (e.g., if c(z; Y)|Z is subexponential instead of bounded). With these additional assumptions, we have the following generalization bound. All proofs are contained in the supplementary materials.

Theorem 1.
Suppose assumptions 1-5 hold. Then, with probability at least 1 − δ,

μ(x, z) − μ̂(x, z) ≤ (4/(3γn)) ln(Kn/δ) + 2 √(V(x, z) ln(Kn/δ)) + L · B(x, z)  ∀z ∈ Z,

where Kn = Γn (9Dγn (α(LD + 1 + √2) + L(√2 + 3)))^p.

This result uniformly bounds, with high probability, the true cost of action z by the predicted cost, μ̂(x, z), a term depending on the uncertainty of that predicted cost, V(x, z), and a term proportional to the bias associated with that predicted cost, B(x, z). It is easy to see how this result motivates the approach described in (2). One can also verify that the generalization bound still holds if (X1, Z1), . . . , (Xn, Zn) are chosen deterministically, as long as Y1, . . . , Yn are still independent. Using Theorem 1, we are able to derive a finite-sample regret bound.

Theorem 2. Suppose assumptions 1-5 hold. Define

z* ∈ arg min_z μ(x, z),
ẑ ∈ arg min_z μ̂(x, z) + λ1 √(V(x, z)) + λ2 B(x, z).

If λ1 = 2 √(ln(2Kn/δ)) and λ2 = L, then with probability at least 1 − δ,

μ(x, ẑ) − μ(x, z*) ≤ (2/γn) ln(2Kn/δ) + 4 √(V(x, z*) ln(2Kn/δ)) + 2L · B(x, z*),

where Kn = Γn (9Dγn (α(LD + 1 + √2) + L(√2 + 3)))^p.

By this result, the regret of the approach defined in (2) depends only on the variance and bias terms of the optimal action, z*. Because the predicted cost is penalized by V(x, z) and B(x, z), it does not matter how poor the prediction of cost is at suboptimal actions.
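To make the objective in (2) concrete, the following sketch computes μ̂, V, and B from CART-style weights and minimizes the penalized objective over a grid of candidate decisions. The synthetic data, the parameter values λ1 and λ2, and the grid-search shortcut are our own assumptions, with homoscedastic noise assumed as in Section 2.1.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Synthetic data: scalar covariate, scalar decision, cost (z - x)^2 plus noise.
n = 2000
X = rng.uniform(0, 1, n)
Z = rng.uniform(0, 1, n)
Y = (Z - X) ** 2 + 0.1 * rng.standard_normal(n)

F = np.column_stack([X, Z])                        # feature space is X x Z
tree = DecisionTreeRegressor(min_samples_leaf=50, random_state=0).fit(F, Y)
sigma2 = np.mean((Y - tree.predict(F)) ** 2)       # homoscedastic variance estimate
leaf_of_train = tree.apply(F)

def penalized_decision(x, z_grid, lam1=2.0, lam2=0.5):
    """Minimize mu_hat + lam1*sqrt(V) + lam2*B over a grid of candidate decisions."""
    feats = np.column_stack([np.full_like(z_grid, x), z_grid])
    leaves = tree.apply(feats)
    obj = np.empty_like(z_grid)
    for k, leaf in enumerate(leaves):
        idx = np.where(leaf_of_train == leaf)[0]
        w = np.full(len(idx), 1.0 / len(idx))      # w_i = 1/N(x, z) inside the leaf
        mu = w @ Y[idx]                            # mu_hat(x, z)
        V = np.sum(w ** 2) * sigma2                # variance penalty
        B = w @ np.linalg.norm(F[idx] - feats[k], axis=1)  # bias (Lipschitz) penalty
        obj[k] = mu + lam1 * np.sqrt(V) + lam2 * B
    return z_grid[np.argmin(obj)]

z_hat = penalized_decision(0.4, np.linspace(0, 1, 101))
```

Setting lam1 and lam2 to zero recovers the direct predicted-cost minimization, so the same routine can be used to compare both decision rules.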
Theorem 2 immediately implies the following asymptotic result, assuming the auxiliary feature space and decision space are fixed as the training sample size grows to infinity.

Corollary 1. In the setting of Theorem 2, if γn = Ω(n^β) for some β > 0, Γn = O(n), and B(x, z*) →p 0 as n → ∞, then

μ(x, ẑ) →p μ(x, z*)

as n → ∞.

The assumptions can be satisfied, for example, with CART or random forest as the learning algorithm with parameters set in accordance with Lemma 2 of Wager and Athey [23]. This next example demonstrates that there exist problems for which the regret of the uncertainty penalized method is strictly better, asymptotically, than the regret of predicted cost minimization.

Example 1. Suppose there are m + 1 different actions and two possible, equally probable states of the world. In one state, action 0 has a cost that is deterministically 1, and all other actions have a random cost that is drawn from a N(0, 1) distribution. In the other state, action 0 has a cost that is deterministically 0, and all other actions have a random cost, drawn from a N(1, 1) distribution. Suppose the training data consists of m trials of each action. If μ̂(j) is the empirical average cost of action j, then the predicted cost minimization algorithm selects the action that minimizes μ̂(j). The uncertainty penalization algorithm adds a penalty of the form suggested by Theorem 2, λ √(σj² ln(m)/m). If λ ≥ √2, the (Bayesian) expected regret of the uncertainty penalization algorithm is asymptotically strictly less than the expected regret of the predicted cost minimization algorithm, ER_UP = o(ER_PCM), where the expectations are taken over both the training data and the unknown state of the world.

This example is simple but demonstrates that there exist settings in which predicted cost minimization is asymptotically suboptimal to the method we have described. In addition, the proof illustrates how one can construct tighter regret bounds than the one in Theorem 2 for problems with specific structure.

3.1 Tractability

The tractability of (2) depends on the algorithm that is used as the predictive model. For many kernel-based methods, the resulting optimization problems are highly nonlinear and do not scale well when the dimension of the decision space is more than 2 or 3. For this reason, we advocate using tree-based and linear models as the predictive model. Tree-based models partition the space X × Z into Γn leaves, so there are only Γn possible values of w(x, z). Therefore, we can solve (2) separately for each leaf. For j = 1, . . . , Γn, we solve

min  μ̂(x, z) + λ1 √(V(x, z)) + λ2 B(x, z)
s.t. z ∈ Z
     (x, z) ∈ Lj,    (3)

where Lj denotes the subset of X × Z that makes up leaf j of the tree. Because each split in the tree is a hyperplane, Lj is defined by an intersection of hyperplanes and thus is a polyhedral set. Clearly, B(x, z) is a convex function in z, as it is a nonnegative linear combination of convex functions. If we assume homoscedasticity, then V(x, z) is constant for all (x, z) ∈ Lj. If c(z; y) is convex in z and Z is a convex set, (3) is a convex optimization problem and can be solved by convex optimization techniques. Furthermore, since the Γn instances of (3) are all independent, we can solve them in parallel.
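The per-leaf decomposition of (3) can be sketched for a single tree as follows. This is a simplified illustration with a scalar decision: it recovers each leaf's polyhedral (here, box) region from the fitted tree, restricts attention to leaves consistent with the query covariates, and uses a grid in place of a convex solver; the synthetic data and penalty parameters are our own assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)

n = 2000
X = rng.uniform(0, 1, n)                           # covariate
Z = rng.uniform(0, 1, n)                           # decision
Y = (Z - X) ** 2 + 0.1 * rng.standard_normal(n)    # observed cost

F = np.column_stack([X, Z])
tree = DecisionTreeRegressor(min_samples_leaf=100, random_state=0).fit(F, Y)
t = tree.tree_
sigma2 = np.mean((Y - tree.predict(F)) ** 2)       # homoscedasticity assumed
leaf_of_train = tree.apply(F)

def leaf_boxes(node=0, lo=(-np.inf, -np.inf), hi=(np.inf, np.inf)):
    """Collect (leaf_id, box) pairs; each split is an axis-aligned hyperplane."""
    if t.children_left[node] == -1:                # reached a leaf
        return [(node, (lo, hi))]
    f, thr = t.feature[node], t.threshold[node]
    hi_left, lo_right = list(hi), list(lo)
    hi_left[f] = min(hi[f], thr)                   # left child: feature f <= thr
    lo_right[f] = max(lo[f], thr)                  # right child: feature f > thr
    return (leaf_boxes(t.children_left[node], lo, tuple(hi_left)) +
            leaf_boxes(t.children_right[node], tuple(lo_right), hi))

def solve_per_leaf(x, lam1=2.0, lam2=0.5, grid=200):
    """Solve the penalized problem restricted to each leaf; keep the best solution."""
    best_val, best_z = np.inf, None
    for leaf, (lo, hi) in leaf_boxes():
        if not (lo[0] < x <= hi[0]):               # leaf inconsistent with covariates x
            continue
        idx = np.where(leaf_of_train == leaf)[0]
        mu = Y[idx].mean()                         # mu_hat is constant on the leaf
        pen_v = lam1 * np.sqrt(sigma2 / len(idx))  # V is constant on the leaf
        for z in np.linspace(max(lo[1], 0.0), min(hi[1], 1.0), grid):
            B = np.linalg.norm(F[idx] - np.array([x, z]), axis=1).mean()
            val = mu + pen_v + lam2 * B            # grid stands in for a convex solver
            if val < best_val:
                best_val, best_z = val, z
    return best_z

z_hat = solve_per_leaf(0.6)
```

Because μ̂ and (under homoscedasticity) V are constant within a leaf, only the convex bias term B varies inside each subproblem, which is what makes the per-leaf problems easy.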
Once (3) has been solved for all leaves, we select the solution from the leaf with the overall minimal objective value.

For tree ensemble methods, such as random forest [6] or xgboost [9], optimization is more difficult. We compute optimal decisions using a coordinate descent heuristic. From a random starting action, we cycle through holding all decision variables fixed except for one and optimize that decision using discretization. We repeat this until convergence from several different random starting decisions. For linear predictive models, the resulting problem is often a second-order conic optimization problem, which can be handled by off-the-shelf solvers (details given in the supplementary materials).

4 Results

In this section, we demonstrate the effectiveness of our approach with two examples. In the first, we consider a pricing problem with synthetic data, while in the second, we use real patient data for personalized Warfarin dosing.

Figure 1: (a) Pricing example (expected revenue vs. number of training examples). (b) Warfarin example (MSE vs. number of training examples).

4.1 Pricing

In this example, the decision variable, z ∈ R^5, is a vector of prices for a collection of products. The outcome, Y, is a vector of demands for those products. The auxiliary covariates may contain data on the weather and other exogenous factors that may affect demand. The objective is to select prices to maximize revenue for a given vector of auxiliary covariates. The demand for a single product is affected by the auxiliary covariates, the price of that product, and the price of one or more of the other products, but the mapping is unknown to the algorithm.
The details on the data generation process\ncan be found in the supplementary materials.\nIn Figure 1a, we compare the expected revenues of the strategies produced by several algorithms.\nCART, RF, and Lasso refer to the direct methods of training, respectively, a decision tree, a random\nforest, and a lasso regression [22] to predict revenue, as a function of the auxiliary covariates and\nprices, and choosing prices, for each vector of auxiliary covariates in the test set, that maximize\npredicted revenue. (Note that the revenues for CART and Lasso were too small to be displayed on\nthe plot. Unsurprisingly, the linear model performs poorly because revenue does not vary linearly\nwith price. We restrict all prices to be at most 50 to ensure the optimization problems are bounded.)\nUP-CART, UP-RF, and UP-Lasso refer to the uncertainty penalized analogues in which the variance\nand bias terms are included in the objective. For each training sample size, n, we average our results\nover one hundred separate training sets of size n. At a training size of 2000, the uncertainty penalized\nrandom forest method improves expected revenue by an average of $270 compared to the direct RF\nmethod. This improvement is statistically signi\ufb01cant at the 0.05 signi\ufb01cance level by the Wilcoxon\nsigned-rank test (p-value 4.4 \u00d7 10\u221218, testing the null hypothesis that mean improvement is 0 across\n100 different training sets).\n\n4.2 Warfarin Dosing\n\nWarfarin is a commonly prescribed anticoagulant that is used to treat patients who have had blood clots\nor who have a high risk of stroke. Determining the optimal maintenance dose of Warfarin presents a\nchallenge as the appropriate dose varies signi\ufb01cantly from patient to patient and is potentially affected\nby many factors including age, gender, weight, health history, and genetics. 
However, this is a crucial task because a dose that is too low or too high can put the patient at risk for clotting or bleeding. The effect of a Warfarin dose on a patient is measured by the International Normalized Ratio (INR). Physicians typically aim for patients to have an INR in a target range of 2-3.

In this example, we test the efficacy of our approach in learning optimal Warfarin dosing with data from Consortium et al. [10]. This publicly available data set contains the optimal stable dose, found by experimentation, for a diverse set of 5410 patients. In addition, the data set contains a variety of covariates for each patient, including demographic information, reason for treatment, medical history, current medications, and the genotype variant at CYP2C9 and VKORC1. It is unique because it contains the optimal dose for each patient, permitting the use of off-the-shelf machine learning methods to predict this optimal dose as a function of patient covariates. We instead use this data to construct a problem with observational data, which resembles the common problem practitioners face. Our access to the true optimal dose for each patient allows us to evaluate the performance of our method out-of-sample. This is a commonly used technique, and the resulting data set is sometimes called semi-synthetic. Several researchers have used the Warfarin data for developing personalized approaches to medical treatments. In particular, Kallus [15] and Bertsimas et al. [3] tested algorithms that learned to treat patients from semi-synthetic observational data.
However, they both discretized\nthe dosage into three categories, whereas we treat the dosage as a continuous decision variable.\nTo begin, we split the data into a training set of 4000 patients and a test set of 1410 patients. We\nkeep this split \ufb01xed throughout all of our experiments to prevent cheating by using insights gained\nby visualization and exploration on the training set. Similar to Kallus [15], we assume physicians\nprescribe Warfarin as a function of BMI. We assume the response that the physicians observe is\nrelated to the difference between the dose a patient was given and the true optimal dose for that\npatient. It is a noisy observation, but it, on average, gives directional information (whether the dose\nwas too high or too low) and information on the magnitude of the distance from the optimal dose.\nThe precise details of how we generate the data are given in the supplementary materials. For all\nmethods, we repeat our work across 100 randomizations of assigned training doses and responses. To\nmeasure the performance of our methods, we compute, on the test set, the mean squared error (MSE)\nof the prescribed doses relative to the true optimal doses. Using the notation described in Section\n1, Xi \u2208 R99 represents the auxiliary covariates for patient i. We work in normalized units so the\ncovariates all contribute equally to the bias penalty term. Zi \u2208 R represents the assigned dose for\npatient i, and Yi \u2208 R represents the observed response for patient i. The objective in this problem is\nto minimize (E[Y (z)|X = x])2 with respect to the dose, z.1\nFigure 1b displays the results of several algorithms as a function of the number of training examples.\nWe compare CART, without any penalization, to CART with uncertainty penalization (UP-CART),\nand we see that uncertainty penalization offers a consistent improvement. This improvement is\ngreatest when the training sample size is smallest. 
(Note: for CART with no penalization, when multiple doses give the same optimal predicted response, we select the mean.) Similarly, when we compare the random forest and Lasso methods with their uncertainty-penalizing analogues, we again see consistent improvements in MSE. The “Constant” line in the plot measures the performance of a baseline heuristic that assigns a fixed dose of 35 mg/week to all patients. The “LB” line provides an unattainable lower bound on the performance of all methods that use the observational data. For this method, we train a random forest to predict the optimal dose as a function of the patient covariates. We also compare our methods with the Counterfactual Risk Minimization (CRM) method of Swaminathan and Joachims [21]. We allow their method access to the true propensity scores that generated the data and optimize over all regularized linear policies for which the proposed dose is a linear function of the auxiliary covariates. We tried multiple combinations of tuning parameters, but the method always performed poorly out-of-sample. We suspect this is due to the size of the policy space. Our lasso-based method works best on this data set when the number of training samples is large, but the random-forest-based method is best for smaller sample sizes. With the maximal training set size of 4000, the improvements of the CART, random forest, and lasso uncertainty-penalized methods over their unpenalized analogues (2.2%, 8.6%, and 0.5%, respectively) are all statistically significant at the 0.05 family-wise error rate level by the Wilcoxon signed-rank test with Bonferroni correction (adjusted p-values 2.1 × 10^−4, 4.3 × 10^−16, and 1.2 × 10^−6, respectively).

5 Conclusions

In this paper, we introduced a data-driven framework that combines ideas from predictive machine learning and causal inference to optimize an uncertain objective using observational data.
Unlike most existing algorithms, our approach handles continuous and multi-dimensional decision variables by introducing terms that penalize the uncertainty associated with the predicted costs. We proved finite sample generalization and regret bounds and provided a sufficient set of conditions under which the resulting decisions are asymptotically optimal. We demonstrated, both theoretically and with real-world examples, the tractability of the approach and its benefit over unpenalized predicted cost minimization.

¹This objective differs slightly from the setting described in Section 3, in which the objective was to minimize the conditional expectation of a cost function. However, it is straightforward to modify the results to obtain the same regret bound (save a few constant factors) when minimizing g(E[c(z; Y(z))|X = x]) for a Lipschitz function g.

References

[1] Athey, Susan, Stefan Wager. 2017. Efficient policy learning. arXiv preprint arXiv:1702.02896.

[2] Bertsimas, Dimitris, Jack Dunn. 2017. Optimal classification trees. Machine Learning 1–44.

[3] Bertsimas, Dimitris, Jack Dunn, Nishanth Mundru. 2018. Optimal prescriptive trees. Under review.

[4] Bertsimas, Dimitris, Nathan Kallus. 2014. From predictive to prescriptive analytics. arXiv preprint arXiv:1402.5481.

[5] Bertsimas, Dimitris, Nathan Kallus. 2017. The power and limits of predictive approaches to observational-data-driven optimization.

[6] Breiman, Leo. 2001. Random forests. Machine Learning 45(1) 5–32.

[7] Breiman, Leo, Jerome Friedman, Charles J Stone, Richard A Olshen. 1984. Classification and Regression Trees. CRC Press.

[8] Bubeck, Sébastien, Nicolò Cesa-Bianchi, et al. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5(1) 1–122.

[9] Chen, Tianqi, Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 785–794.

[10] Consortium, International Warfarin Pharmacogenetics, et al. 2009. Estimation of the warfarin dose with clinical and pharmacogenetic data. N Engl J Med 360 753–764.

[11] Elmachtoub, Adam N, Paul Grigas. 2017. Smart “predict, then optimize”. arXiv preprint arXiv:1710.08005.

[12] Flores, Carlos A. 2005. Estimation of dose-response functions and optimal doses with a continuous treatment.

[13] Hirano, Keisuke, Guido W Imbens. 2004. The propensity score with continuous treatments. Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives 73–84.

[14] Kallus, Nathan. 2017. Balanced policy evaluation and learning. arXiv preprint arXiv:1705.07384.

[15] Kallus, Nathan. 2017. Recursive partitioning for personalization using observational data. International Conference on Machine Learning. 1789–1798.

[16] Kallus, Nathan, Angela Zhou. 2018. Policy evaluation and optimization with continuous treatments. arXiv preprint arXiv:1802.06037.

[17] Kao, Yi-hao, Benjamin V Roy, Xiang Yan. 2009. Directed regression. Advances in Neural Information Processing Systems. 889–897.

[18] Maurer, Andreas, Massimiliano Pontil. 2009. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740.

[19] Misic, Velibor V. 2017. Optimization of tree ensembles. arXiv preprint arXiv:1705.10883.

[20] Rosenbaum, Paul R. 2002. Observational Studies. Springer.

[21] Swaminathan, Adith, Thorsten Joachims. 2015. Counterfactual risk minimization. Proceedings of the 24th International Conference on World Wide Web. ACM, 939–941.

[22] Tibshirani, Robert. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological) 267–288.

[23] Wager, Stefan, Susan Athey. 2017. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association (just-accepted).