{"title": "Learning Decision Theoretic Utilities through Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1061, "page_last": 1067, "abstract": null, "full_text": "Learning Decision Theoretic Utilities Through \n\nReinforcement Learning \n\nMagnus Stensmo \n\nComputer Science Division \n\nUniversity of California \n\nBerkeley, CA 94720, U.S.A. \n\nmagnus@cs.berkeley.edu \n\nTerrence J. Sejnowski \n\nHoward Hughes Medical Institute \n\nThe Salk Institute \n\n10010 North Torrey Pines Road \n\nLa Jolla, CA 92037, U.S.A. \n\nterry@salk.edu \n\nAbstract \n\nProbability models can be used to predict outcomes and compensate for missing data, but even a perfect model cannot be used to make decisions unless the utility of the outcomes, or preferences between them, is also provided. This arises in many real-world problems, such as medical diagnosis, where the cost of a test as well as the expected improvement in the outcome must be considered. Relatively little work has been done on learning the utilities of outcomes for optimal decision making. In this paper, we show how temporal-difference reinforcement learning (TD(λ)) can be used to determine decision theoretic utilities within the context of a mixture model, and apply this new approach to a problem in medical diagnosis. TD(λ) learning of utilities reduces the number of tests that have to be done to achieve the same level of performance compared with the probability model alone, which results in significant cost savings and increased efficiency. \n\n1 INTRODUCTION \n\nDecision theory is normative or prescriptive and can tell us how to be rational and behave optimally in a situation [French, 1988]. Optimal here means to maximize the value of the expected future outcome. This has been formalized as the maximum expected utility principle by [von Neumann and Morgenstern, 1947]. 
Decision theory can be used to make optimal choices based on probabilities and utilities. Probability theory tells us how probable different future states are, and how to reason with and represent uncertain information. Utility theory provides values for these states so that they can be compared with each other. A simple form of a utility function is a loss function. Decision theory is a combination of probability and utility theory through expectation. \n\nThere has previously been a lot of work on learning probability models (neural networks, mixture models, probabilistic networks, etc.) but relatively little on representing and reasoning about preferences and on learning utility models. This paper demonstrates how both linear utility functions (i.e., loss functions) and non-linear ones can be learned as an alternative to specifying them manually. \n\nAutomated fault or medical diagnosis is an interesting and important application for decision theory. It is a sequential decision problem that includes complex decisions (What is the optimal test to do in a situation? When is it no longer effective to do more tests?), and other important problems such as missing data (both during diagnosis, i.e., tests not yet done, and in the database from which learning is done). We demonstrate the power of the new approach by applying it to a real-world problem, learning a utility function to improve automated diagnosis of heart disease. \n\n2 PROBABILITY, UTILITY AND DECISION THEORY MODELS \n\nThe system has separate probability and decision theory models. The probability model is used to predict the probabilities for the different outcomes that can occur. By modeling the joint probabilities, these predictions are available no matter how many or few of the input variables are available at any instant. 
Diagnosis is a missing data problem because of the question-and-answer cycle that results from the sequential decision making process. \n\nOur decision theoretic automated diagnosis system is based on hypotheses and deductions according to the following steps: \n\n1. Any number of observations are made. This means that the values of one or several observation variables of the probability model are determined. \n\n2. The system finds probabilities for the different possible outcomes, using the joint probability model to calculate the conditional probability of each possible outcome given the current observations. \n\n3. Search for the next observation that is expected to be most useful for improving the diagnosis according to the maximum expected utility principle. Each possible next variable is considered. The expected value of the system prediction with this variable observed, minus the current maximum value before making the additional observation and minus the cost of the observation, is computed and defined as the net value of information for this variable [Howard, 1966]. The variable with the maximum over all of these is the best next observation to make. \n\n4. Steps 1-3 above are repeated until further improvement is not possible. This happens when none of the net value of information values in step 3 is positive. They can be negative since a positive cost has been subtracted. \n\nNote that we only look ahead one step (called a myopic approximation [Gorry and Barnett, 1967]). This is in principle suboptimal; however, the reinforcement learning procedure described below can compensate for this. The optimal solution is to consider all possible sequences, but the search tree grows exponentially in the number of unobserved variables. 
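The myopic one-step lookahead in steps 3-4 can be sketched as follows. This is a minimal illustration, not the authors' implementation; the `model` interface (`posterior`, `value_distribution`, `variables`) and all other names are assumptions made for the sketch:

```python
import numpy as np

def net_value_of_information(model, observed, utilities, costs):
    """Return the unobserved variable with the highest positive net value
    of information, or None when no variable has positive net value.

    Assumed (hypothetical) interface:
      model.posterior(observed)             -> outcome probabilities
      model.value_distribution(observed, v) -> (values, probs) for variable v
    """
    p = model.posterior(observed)
    current_best = np.max(utilities @ p)     # max expected utility right now
    best_var, best_nvoi = None, 0.0
    for v in model.variables:
        if v in observed:
            continue
        values, probs = model.value_distribution(observed, v)
        # Expected best utility after observing v, averaged over its values.
        expected = sum(
            pv * np.max(utilities @ model.posterior({**observed, v: val}))
            for val, pv in zip(values, probs))
        nvoi = expected - current_best - costs[v]   # net value of information
        if nvoi > best_nvoi:
            best_var, best_nvoi = v, nvoi
    return best_var
```

Repeatedly calling this until it returns None reproduces the stopping rule of step 4: questioning ends as soon as no observation is worth its cost.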
Joint probabilities are modeled using mixture models [McLachlan and Basford, 1988]. Such models can be efficiently trained using the Expectation-Maximization (EM) algorithm [Dempster et al., 1977], which has the additional benefit that missing variable values in the training data can also be handled correctly. This is important since most real-world data sets are incomplete. More detail on the probability model can be found in [Stensmo and Sejnowski, 1995; Stensmo, 1995]. This paper is concerned with the utility function part of the decision theoretic model. \n\nThe utilities are values assigned to different states so that their usefulness can be compared, and actions are chosen to maximize the expected future utility. Utilities are represented as preferences for when a certain disease has been classified but the patient in reality has another one [Howard, 1980; Heckerman et al., 1992]. For each pair of diseases there is a utility value between 0 and 1, where 0 means maximally bad and 1 means maximally good. This is a d x d matrix for d diseases, and the matrix can be interpreted as a kind of loss function. The notation is natural and helps in acquiring the values, which is a non-trivial problem. Preferences are subjective, in contrast to probabilities, which are objective (for the purposes of this paper). For example, a doctor, a patient and the insurance company may have different preferences, but the probabilities for the outcomes are the same. \n\nMethods have been devised to convert perceived risk to monetary values [Howard, 1980]. Subjects were asked to answer questions such as: \"How much would you have to be paid to accept a one-in-a-million chance of instant painless death?\" The answers are recorded for various low levels of risk. 
It has been empirically found that people are relatively consistent and that perceived risk is linear for low levels of probability. Howard defined the unit micromort (mmt) to mean a one-in-a-million chance of instant painless death, and [Heckerman et al., 1992] found that one subject valued 1 micromort at $20 (in 1988 US dollars), linearly to within a factor of two. We use this to convert utilities in [0,1] units to dollar values and vice versa. \n\nPrevious systems asked experts to supply the utility values, which can be very complicated, or used some simple approximation. [Heckerman et al., 1992] used a utility value of 1 as the misclassification penalty when both diseases are malign or both are benign, and 0 otherwise (see Figure 4, left). They claim that it worked in their system, but this approximation should reduce accuracy. We show how to adapt and learn utilities to find better ones. \n\n3 REINFORCEMENT LEARNING OF UTILITIES \n\nUtilities are adapted using a type of reinforcement learning, specifically the method of temporal differences [Sutton, 1988]. This method is capable of adjusting the utility values correctly even though a reinforcement signal is only received after each full sequence of questions leading to a diagnosis. \n\nThe temporal difference algorithm (TD(λ)) learns how to predict future values from past experience. A sequence of observations is used; in our case they are the results of the medical tests that have been done. We used TD(λ) to learn how to predict the expected utility of the final diagnosis. \n\nUsing the notation of Sutton, the function P_t predicts the expected utility at time t. P_t is a vector of expected utilities, one for each outcome. In the linear form described above, P_t = P(x_t, w_t) = w_t x_t, where w_t is a matrix of utility values and x_t is the vector of probabilities of the outcomes, our state description. 
The objective is to learn the utility matrix w_t. \n\nWe use an intra-sequence version of the TD(λ) algorithm so that learning can occur during normal operation of the system [Sutton, 1988]. The update equation is \n\nw_{t+1} = w_t + α [P(x_{t+1}, w_t) - P(x_t, w_t)] Σ_{k=1}^{t} λ^{t-k} ∇_w P(x_k, w_t),   (1) \n\nwhere α is the learning rate and λ is a discount factor. With P_k = P(x_k, w_t) = w_t x_k and e_t = Σ_{k=1}^{t} λ^{t-k} ∇_w P(x_k, w_t) = Σ_{k=1}^{t} λ^{t-k} x_k, (1) becomes the two equations \n\nw_{t+1} = w_t + α [P(x_{t+1}, w_t) - P(x_t, w_t)] e_t, \ne_{t+1} = x_{t+1} + λ e_t, \n\nstarting with e_1 = x_1. These update equations were used after each question was answered. When the diagnosis was done, the reinforcement signal z (considered to be the observation P_{t+1}) was obtained and the weights were updated: w_{t+1} = w_t + α [z - P(x_t, w_t)] e_t. A final update of e_t was not necessary. Note that this method allows for the use of any differentiable utility function, specifically a neural network, in place of P(x_k, w_t). \n\nPreference is subjective. In this paper we investigated two examples of reinforcement. One was simply to give the highest reinforcement (z = 1) on a correct diagnosis and the lowest (z = 0) for errors. This yielded a linear utility function, or loss function, that was the unit matrix, which confirmed that the method works. When applied to a non-linear utility function the result is non-trivial. \n\nIn the second example, the reinforcement signal was modified by a penalty for using a large number of questions, by multiplying each z above by (max_q - q)/(max_q - min_q), where q is the number of questions used for the diagnostic sequence, and the minimum and maximum numbers of questions are min_q and max_q, respectively. The results presented in the next section used this reinforcement signal. \n\n4 RESULTS \n\nThe publicly available Cleveland heart-disease database was used to test the method. 
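The intra-sequence TD(λ) updates of Section 3 can be sketched for the linear case as follows. This is an illustrative reconstruction under the paper's notation, not the authors' code; the function name and the way a sequence is passed in are assumptions:

```python
import numpy as np

def td_lambda_sequence(W, xs, z, alpha=0.0005, lam=0.1):
    """Apply the intra-sequence TD(lambda) updates over one diagnostic
    sequence and return the updated utility matrix.

    W  : d x d utility matrix (w_t in the text)
    xs : list of outcome-probability vectors x_1 ... x_T (the states)
    z  : reinforcement vector observed after the final diagnosis
    """
    e = xs[0].copy()                          # e_1 = x_1
    for t in range(len(xs) - 1):
        x_t, x_next = xs[t], xs[t + 1]
        delta = W @ x_next - W @ x_t          # P(x_{t+1}, W) - P(x_t, W)
        W = W + alpha * np.outer(delta, e)    # eq. (1) with trace e_t
        e = x_next + lam * e                  # e_{t+1} = x_{t+1} + lam * e_t
    # Final update once the reinforcement signal z is obtained:
    W = W + alpha * np.outer(z - W @ xs[-1], e)
    return W
```

Since P is linear in W, the gradient in eq. (1) is just x_k, so the eligibility trace is a vector and each update is a rank-one outer product; a non-linear P would replace `np.outer(delta, e)` with a trace over its gradient.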
It consists of 303 cases where the disorder is one of four types of heart disease or its absence. There are fourteen variables, as shown in Figure 1. Continuous variables were converted into a 1-of-N binary code based on their distributions among the cases in the database. Nominal and categorical variables were coded with one unit per value. In total, 96 binary variables coded the 14 original variables. \n\nTo find the parameter values for the mixture model that was used for probability estimation, the EM algorithm was run until convergence [Stensmo and Sejnowski, 1995; Stensmo, 1995]. The classification error was 16.2%. To get this result, all of the observation variables were set to their correct values for each case. Note that all this information might not be available in a real situation, and that the decision theory model was not needed in this case. \n\nTo evaluate how well the complete sequential decision process system does, we went through each case in the database and answered the questions that came up according to the correct values for the case. When the system completed the diagnosis sequence, the result was compared to the actual disease recorded in the database. The number of questions answered for each case was also recorded (q above). After all of the cases had been processed in this way, the average number of questions needed, its standard deviation, and the number of errors were calculated. If the system had several best answers, one was selected randomly. \n\n #  Observ.   Description                                         Cost (mmt)  Cost ($)  Values \n 1  age       Age in years                                        0           0         continuous \n 2  sex       Sex of subject                                      0           0         male/female \n 3  cp        Chest pain                                          20          400       four types \n 4  trestbps  Resting blood pressure                              40          800       continuous \n 5  chol      Serum cholesterol                                   100         2000      continuous \n 6  fbs       Fasting blood sugar                                 100         2000      < or > 120 mg/dl \n 7  restecg   Resting electrocardiographic result                 100         2000      five values \n 8  thalach   Maximum heart rate achieved                         100         2000      continuous \n 9  exang     Exercise induced angina                             100         2000      yes/no \n10  oldpeak   ST depression induced by exercise relative to rest  100         2000      continuous \n11  slope     Slope of peak exercise ST segment                   100         2000      up/flat/down \n12  ca        Number of major vessels colored by fluoroscopy      100         2000      0-3 \n13  thal      Defect type                                         100         2000      normal/fixed/reversible \n\n    Disorder  Description                                                               Values \n14  num       Heart disease                                                             no disease/four types \n\nFigure 1: The Cleveland Heart Disease database. The database consists of 303 cases described by 14 variables. Observation costs are somewhat arbitrarily assigned and are given both in dollars and converted to micromorts (mmt) in [0,1] units based on $20 per micromort (a one-in-a-million chance of instant painless death). \n\nObservation costs were assigned to the different variables according to Figure 1. Using the full utility/decision model and the 0/1 approximation for the utility function (left part of Figure 4), there were 29.4% errors. The results are summarized in Figure 2. Over the whole data set an average of 4.42 questions was used, with a standard deviation of 2.02. Asking about 4-5 questions instead of 13 is much quicker but unfortunately less accurate. 
This was before the utilities were adapted. \n\nWith TD(λ) learning (Figure 3), the number of errors decreased to 16.2% after 85 repeated presentations of all of the cases in random order. We varied λ from 0 to 1 in increments of 0.1, and α over several orders of magnitude, to find the reported results. The resulting average number of questions was 6.05, with a standard deviation of 2.08. The utility matrix after 85 iterations is shown in Figure 4, with α=0.0005 and λ=0.1. \n\nThe price paid for increased robustness was an increase in the average number of questions from 4.42 to 6.05, but the same accuracy was achieved using fewer than half of the questions on average. Many people intuitively think that half of the questions should be enough. There is, however, no reason for this; furthermore, there is no procedure for deciding when to stop asking questions if observations are chosen randomly. \n\nIn this paper a simple state description has been used, namely the predicted probabilities of the outcomes. We have also tried other representations, by including the test results in the state description. On this data set similar results were obtained. \n\nModel                                  Errors  # Questions  St. Dev. \nProbability model only                 16.2%   13           -        \n0/1 approximation                      29.4%   4.42         2.02     \nAfter 85 iterations of TD(λ) learning  16.2%   6.05         2.08     \n\nFigure 2: Results on the Cleveland Heart Disease Database. The three methods are described in the text. The first method does not use a utility model. The 0/1 approximation uses the matrix in Figure 4, left. The utility matrix learned by TD(λ) is shown in Figure 4, right. 
[Two learning curves: average number of questions vs. iteration, and error rate (%) vs. iteration.] \n\nFigure 3: Learning graphs with discount-rate parameter λ=0.1 and learning rate α=0.0005 for the TD(λ) algorithm. One iteration is a presentation of all of the cases in random order. \n\n5 SUMMARY AND DISCUSSION \n\nWe have shown how utilities or preferences can be learned for different expected outcomes in a complex system for sequential decision making based on decision theory. Temporal-difference reinforcement learning was efficient and effective. \n\nThis method can be extended in several directions. Utilities are usually modeled linearly in decision theory (as with the misclassification utility matrix), since manual specification and interpretation of the utility values is then quite straightforward. There are advantages to non-linear utility functions and, as indicated above, our method can be used with any utility function that is differentiable. \n\nInitial: \n\n1 0 0 0 0 \n0 1 1 1 1 \n0 1 1 1 1 \n0 1 1 1 1 \n0 1 1 1 1 \n\nAfter 85 iterations: \n\n0.8179 0.0698 0.0610 0.0435 0.0505 \n0.0579 0.6397 0.2954 0.3331 0.6308 \n0.0215 0.1799 0.6305 0.3269 0.6353 \n0.0164 0.1430 0.2789 0.7210 0.6090 \n0.0058 0.1352 0.2183 0.2742 0.8105 \n\nFigure 4: Misclassification utility matrices. The disorder no disease is listed in the first row and column, followed by the four types of heart disease. Left: initial utility matrix. Right: after TD learning with discount-rate parameter λ=0.1 and learning rate α=0.0005. Element U_ij (row i, column j) is the utility when outcome i has been chosen but the actual outcome is j. Maximally good has value 1, and maximally bad has value 0. 
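Given such a misclassification matrix, the diagnosis reported at any point is the outcome with maximum expected utility under the current outcome probabilities. A minimal sketch; the 3x3 matrix and probabilities below are made-up illustrations, not the values from Figure 4:

```python
import numpy as np

# U[i, j] is the utility of reporting outcome i when the truth is j,
# as in Figure 4: 1 is maximally good, 0 maximally bad.

def best_diagnosis(U, p):
    """Return the outcome index maximizing expected utility sum_j U[i,j] p[j]."""
    return int(np.argmax(U @ p))

# Toy 0/1-style matrix: no-disease vs. two disease types (illustrative only).
U01 = np.array([[1, 0, 0],
                [0, 1, 1],
                [0, 1, 1]])
p = np.array([0.4, 0.35, 0.25])
# Expected utilities are [0.4, 0.6, 0.6]: although "no disease" is the single
# most probable outcome, a disease outcome is reported, because the 0/1 matrix
# treats any disease-for-disease confusion as harmless.
```

A learned matrix like the right-hand one in Figure 4 changes these expected utilities smoothly, so the decision can shift when probabilities are close.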
An alternative to learning the utility or value function is to directly learn the optimal action to take in each state, as in Q-learning [Watkins and Dayan, 1992]. This would require one to learn which question to ask in each situation instead of the utility values, but it would not be directly analyzable in terms of maximum expected utility. \n\nAcknowledgements \n\nFinancial support for M.S. was provided by the Wenner-Gren Foundations and the Foundation Blanceflor Boncompagni-Ludovisi, née Bildt. The heart-disease database is from the University of California, Irvine Repository of Machine Learning Databases and originates from R. Detrano, Cleveland Clinic Foundation. Stuart Russell is thanked for discussions. \n\nReferences \n\nDempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38. \n\nFrench, S. (1988). Decision Theory: An Introduction to the Mathematics of Rationality. Ellis Horwood, Chichester, UK. \n\nGorry, G. A. and Barnett, G. O. (1967). Experience with a model of sequential diagnosis. Computers and Biomedical Research, 1, 490-507. \n\nHeckerman, D. E., Horvitz, E. J. and Nathwani, B. N. (1992). Toward normative expert systems: Part I. The Pathfinder project. Methods of Information in Medicine, 31, 90-105. \n\nHoward, R. A. (1966). Information value theory. IEEE Transactions on Systems Science and Cybernetics, SSC-2, 22-26. \n\nHoward, R. A. (1980). On making life and death decisions. In Schwing, R. C. and Albers, Jr., W. A., editors, Societal Risk Assessment: How Safe is Safe Enough? Plenum Press, New York, NY. \n\nMcLachlan, G. J. and Basford, K. E. (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker, Inc., New York, NY. \n\nStensmo, M. (1995). 
Adaptive Automated Diagnosis. PhD thesis, Royal Institute of Technology (Kungliga Tekniska Högskolan), Stockholm, Sweden. \n\nStensmo, M. and Sejnowski, T. J. (1995). A mixture model system for medical and machine diagnosis. In Tesauro, G., Touretzky, D. S. and Leen, T. K., editors, Advances in Neural Information Processing Systems, vol. 7, pp. 1077-1084. MIT Press, Cambridge, MA. \n\nSutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3, 9-44. \n\nvon Neumann, J. and Morgenstern, O. (1947). Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ. \n\nWatkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning, 8, 279-292. \n", "award": [], "sourceid": 1185, "authors": [{"given_name": "Magnus", "family_name": "Stensmo", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}