{"title": "Improving Existing Fault Recovery Policies", "book": "Advances in Neural Information Processing Systems", "page_first": 1642, "page_last": 1650, "abstract": "Automated recovery from failures is a key component in the management of large data centers. Such systems typically employ a hand-made controller created by an expert. While such controllers capture many important aspects of the recovery process, they are often not systematically optimized to reduce costs such as server downtime. In this paper we explain how to use data gathered from the interactions of the hand-made controller with the system, to create an optimized controller. We suggest learning an indefinite horizon Partially Observable Markov Decision Process, a model for decision making under uncertainty, and solve it using a point-based algorithm. We describe the complete process, starting with data gathering, model learning, model checking procedures, and computing a policy. While our paper focuses on a specific domain, our method is applicable to other systems that use a hand-coded, imperfect controllers.", "full_text": "Improving Existing Fault Recovery Policies\n\nGuy Shani\n\nDepartment of Information Systems Engineering\n\nBen Gurion University, Beer-Sheva, Israel\n\nshanigu@bgu.ac.il\n\nChristopher Meek\nMicrosoft Research\n\nOne Microsoft Way, Redmond, WA\n\nmeek@microsoft.com\n\nAbstract\n\nAn automated recovery system is a key component in a large data center. Such\na system typically employs a hand-made controller created by an expert. While\nsuch controllers capture many important aspects of the recovery process, they are\noften not systematically optimized to reduce costs such as server downtime. In\nthis paper we describe a passive policy learning approach for improving existing\nrecovery policies without exploration. We explain how to use data gathered from\nthe interactions of the hand-made controller with the system, to create an improved\ncontroller. 
We suggest learning an indefinite horizon Partially Observable Markov Decision Process, a model for decision making under uncertainty, and solving it using a point-based algorithm. We describe the complete process: data gathering, model learning, model checking, and policy computation.\n\n1 Introduction\n\nMany companies that provide large scale online services, such as banking services, e-mail services, or search engines, use large server farms, often containing tens of thousands of computers, in order to support fast computation with low latency. Occasionally, these computers may experience failures due to software or hardware problems. Often, these errors can be fixed automatically through actions such as rebooting or re-imaging the computer [6]. In such large systems it is prohibitively costly to have a technician decide on a repair action for each observed problem. Therefore, these systems often use an automatic repair policy, or controller, to choose appropriate repair actions. These repair policies typically receive failure messages from the system. For example, Isard [6] suggests using a set of watchdogs \u2014 computers that probe other computers to test some attribute. Messages from the watchdogs are then typically aggregated into a small set of notifications, such as \u201cSoftware Error\u201d or \u201cHardware Error\u201d. The repair policy receives notifications and decides which actions can fix the observed problems. In many cases such policies are created by human experts based on their experience and knowledge of the process. While human-made controllers often exhibit reasonable performance, they are not automatically optimized to reduce costs. 
Thus, in many cases, it is possible to create a better controller that would improve the performance of the system.\nA natural choice for modeling such systems is to model each machine as a Partially Observable Markov Decision Process (POMDP) [8] \u2014 a well known model for decision making under uncertainty [12]. Given the POMDP parameters, we can compute a policy that optimizes repair costs, but learning the POMDP parameters may be difficult. Most researchers that use POMDPs therefore assume that the parameters are known. Alternatively, Reinforcement Learning (RL) [14] offers a wide range of techniques for learning optimized controllers through interactions with the environment, often avoiding the need for an explicit model. These techniques are typically used in an online learning setting, and require the agent to explore all possible state-action pairs.\nIn the case of the management of large data centers, where inappropriate actions may result in considerably increased costs, it is unlikely that the learning process would be allowed to try every combination of state and action. It is therefore unclear how standard RL techniques can be used in this setting. On the other hand, many systems log the interactions of the existing hand-made controller with the environment, accumulating significant data. Typically, the controller will not be designed to perform exploration, and we cannot expect such logs to contain sufficient data to train standard RL techniques.\nIn this paper we introduce a passive policy learning approach that uses only available information, without exploration, to improve an existing repair policy. We adopt the indefinite-horizon POMDP formalization [4], and use the existing controller\u2019s logs to learn the unknown model parameters using an EM algorithm (an adapted Baum-Welch [1, 2, 15] algorithm). 
We suggest a model-checking\nphase, providing supporting evidence for the quality of the learned model, which may be crucial\nto help the system administrators decide whether the learned model is appropriate. We proceed\nto compute a policy for our learned model, that can then be used in the data center instead of the\noriginal hand-made controller.\nWe experiment with a synthetic, yet realistic, simulation of machine failures, showing how the\npolicy of the learned POMDP performs close to optimal, and outperforms a set of simpler techniques\nthat learn a policy directly in history space. We discuss the limitations of our method, mainly the\ndependency on a reasonable hand-made controller in order to learn good models.\nMany other real world applications, such as assembly lines, medical diagnosis systems, and failure\ndetection and recovery systems, are also controlled by hand-made controllers. While in this paper\nwe focus on recovery from failures, our approach may be applicable to other similar domains.\n\n2 Properties of the Error Recovery Problem\n\nIn this section we describe aspects of the error recovery problem and a POMDP model for the\nproblem. Key aspects of the problem include the nature of repair actions and costs, machine failure,\nfailure detection, and control policies.\nKey aspects of repair actions include: (1) actions may succeed or fail stochastically. (2) These ac-\ntions often provide an escalating behavior. We label actions using increasing levels, where problems\n\ufb01xed by an action at level i, are also \ufb01xed by any action of level j > i. Probabilistically, this would\nmean that if j > i then pr(healthy|aj, e) \u2265 pr(healthy|ai, e) for any error e. (3) Action costs are\ntypically escalating, where lower level actions that \ufb01x minor problems are relatively cheap, while\nhigher level actions are more expensive. In many real world systems this escalation is exponen-\ntial. 
For example, restarting a service takes 5 seconds, rebooting a machine takes approximately 10 minutes, while re-imaging the machine takes about 2 hours.\nAnother stochastic feature of this problem is the inexact failure detection. It is not uncommon for a watchdog to report an error for a machine that is fully operational, or to report a \u201chealthy\u201d status for a machine that experiences a failure.\nIn this domain, machines are identical and independent. Typically, computers in service farms share the same configuration and execute independent programs, attempting, for example, to answer independent queries to a search engine. It is therefore unlikely, if not impossible, for errors to propagate from one machine to another.\nIn view of the escalating nature of actions and costs, a natural choice for a policy is an escalation policy. Such policies choose a starting level based on the first observation, and execute an action at that level. In many cases, due to the non-deterministic success of repair actions, each action is tried several times. After the controller decides that the action at the current level cannot fix the problem, the controller escalates to the next action level. Such policies have several hand-tuned parameters, for example, the number of retries of an action before an escalation occurs, and the entry level given an observation. We can hope that these parameters, at least, could be optimized by a learning algorithm.\nSystem administrators typically collect logs of the hand-made controller execution for maintenance purposes. These logs represent a valuable source of data about the system behavior that can be used to learn a policy. We would like to use this knowledge to construct an improved policy that will perform better than the original policy. Formally, we assume that we receive as input a log L of repair sessions. 
Each repair session is a sequence l = o_0, a_1, o_1, ..., o_{n_l}, starting with an error notification, followed by repair actions and observations until the problem is fixed. In some cases, sessions end with the machine declared as \u201cdead\u201d, but in practice a technician is called for these machines, repairing or replacing them. Therefore, we can assume that all sessions end successfully in the healthy state.\n\n2.1 A POMDP for Error Recovery\n\nGiven the problem features above, a natural choice is to model each machine independently as a partially observable Markov decision process (POMDP) with common parameters. We define a cost-based POMDP through a tuple <S, A, tr, C, \u2126, O> where S is a set of states. In our case, we adopt a factored representation, where s = <e_0, ..., e_n> and e_i \u2208 {0, 1} indicates whether error i exists. That is, states are sets of failures, or errors, of a machine, such as a software error or a hardware failure. We also add a special state s_H = <0, ..., 0> \u2014 the healthy state.\nA is a set of actions, such as rebooting a machine or re-imaging it. tr(s, a, s\u2032) is a state transition function, specifying the probabilities of moving between states. We restrict our transition function such that tr(s, a, s\u2032) > 0 only if for all i, s_i = 0 implies s\u2032_i = 0. That is, an action may only fix an error, not generate new errors. C(s, a) is a cost function, assigning a cost to each state-action pair. Often, costs can be measured as the time (in minutes) for executing the action. For example, a reboot may take 15 minutes, while re-imaging takes 2 hours.\n\u2126 is a set of possible observations. For us, observations are messages from the watchdogs, such as a notification of a hard disk failure, or a service reporting an error, and notifications about the success or failure of an action. 
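As a concrete illustration of the factored representation and of the restriction that actions only fix errors, consider the following sketch. It is our own illustration, not the paper's code; the helper names are hypothetical. It enumerates states as bit vectors and tests whether a transition may have positive probability:

```python
from itertools import product

def make_states(n_errors):
    """Factored states s = <e_0, ..., e_{n-1}>, one bit per error.
    The all-zeros state is the healthy state s_H."""
    return list(product((0, 1), repeat=n_errors))

def may_transition(s, s_next):
    """tr(s, a, s') may be positive only if no new error appears:
    for all i, s_i = 0 implies s'_i = 0."""
    return all(e_next == 0 for e, e_next in zip(s, s_next) if e == 0)
```

With two errors, for example, a repair action may move <1, 0> to the healthy state <0, 0>, but no action may move <0, 1> to <1, 1>.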
O(a, s\u2032, o) is an observation function, assigning a probability pr(o|a, s\u2032) to each observation.\nIn a POMDP the true state is not directly observable, and we thus maintain a belief state b \u2208 B \u2014 a probability distribution over states, where b(s) is the probability that the system is at state s. We assume that every repair session starts with an error observation, typically provided by one of the watchdogs. We therefore define b_0^o \u2014 the prior distribution over states given an initial observation o. We will also maintain a probability distribution pr_0(o) over initial observations. While this probability distribution is not used in model learning, it is useful for evaluating the quality of policies through trials.\nIt is convenient to define a policy for a POMDP as a mapping from belief states to actions, \u03c0 : B \u2192 A. Our goal is to find an optimal policy that brings the machine to the healthy state with minimal cost. One method for computing a policy is through a value function, V, assigning a value to each belief state b. Such a value function can be expressed as a set of |S|-dimensional vectors known as \u03b1-vectors, i.e., V = {\u03b1_1, ..., \u03b1_n}. Then, \u03b1_b = argmin_{\u03b1 \u2208 V} \u03b1 \u00b7 b is the optimal \u03b1-vector for belief state b, and V(b) = \u03b1_b \u00b7 b is the value that the value function V assigns to b, where \u03b1 \u00b7 b = \u2211_i \u03b1_i b_i is the standard vector inner product. By associating an action a(\u03b1) with each vector, a policy \u03c0 : B \u2192 A can be defined through \u03c0(b) = a(\u03b1_b).\nWhile exact value iteration, through complete updates of the belief space, does not scale beyond small toy examples, Pineau et al. [10] suggest updating the value function by creating a single \u03b1-vector that is optimized for a specific belief state. Such methods, known as point-based value iteration, compute a value function over a finite set of belief states, resulting in a finite-size value function. 
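The alpha-vector policy lookup pi(b) = a(alpha_b), together with the standard POMDP belief update it operates on, can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions about data layout (tr[a] is an |S| x |S| matrix, obs[a] an |S| x |Omega| matrix), not the paper's implementation; since the model is cost-based, the policy takes the minimizing vector:

```python
import numpy as np

def policy_action(value_fn, b):
    """pi(b) = a(alpha_b), where alpha_b = argmin_{alpha in V} alpha . b
    (costs are minimized, so we take the minimizing vector)."""
    alphas, actions = value_fn  # parallel lists: alpha-vectors and their actions
    best = min(range(len(alphas)), key=lambda i: float(np.dot(alphas[i], b)))
    return actions[best]

def belief_update(b, a, o, tr, obs):
    """Standard belief update: b'(s') is proportional to
    O(a, s', o) * sum_s tr(s, a, s') b(s), renormalized."""
    b_next = obs[a][:, o] * (b @ tr[a])
    return b_next / b_next.sum()
```

For instance, with a two-state model (healthy, faulty), a vector whose action is the terminating action would dominate on beliefs concentrated on the healthy state, while a repair vector dominates elsewhere.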
Perseus [13] is an especially fast point-based solver that incrementally updates a value function over a randomly chosen set of belief points, ensuring that at each iteration the value for each belief state is improved, while maintaining a compact value function representation.\nWe adopt here the indefinite horizon POMDP framework [4], which we consider to be most appropriate for failure recovery. In this framework the POMDP has a single special action a_T, available in any state, that terminates the repair session. In our case, the action is to call a technician, deterministically repairing the machine, but at a huge cost. For example, Isard [6] estimates that a technician will fix a computer within 2 weeks. Executing a_T in s_H incurs no cost. Using the indefinite horizon framework it is easy to define a lower bound on the value function using a_T, and execute any point-based algorithm, such as the Perseus algorithm that we use.\n\n3 Learning Policies from System Logs\n\nIn this section we propose two alternatives for computing a recovery policy given the logs. We begin with a simple, model-free, history-based policy computation. Then, we suggest a more sophisticated method that learns the POMDP model parameters, and then uses the POMDP to compute a policy.\n\n3.1 Model-Free Learning of Q-values\n\nThe optimal policy for a POMDP can be expressed as a mapping from action-observation histories to actions. Histories are directly observable, allowing us to use the standard Q-function terminology, where Q(h, a) is the expected cost of executing action a with history h and continuing the session until it terminates. This approach is known as model-free, because 
the parameters of a POMDP are never learned, and has some attractive properties: histories are directly observable, and require no assumptions about the unobserved state space.\nAs opposed to standard Q-learning, where the Q-function is learned while interacting with the environment, we use the system log L to compute Q:\n\nCost(l_i) = \u2211_{j=i+1}^{|l|} C(a_j)    (1)\n\nQ(h, a) = (\u2211_{l \u2208 L} \u03b4(h + a, l) Cost(l_{|h|})) / (\u2211_{l \u2208 L} \u03b4(h + a, l))    (2)\n\nwhere l_i is the suffix of l starting at action a_i, C(a) is the cost of action a, h + a is the history h with the action a appended at its end, and \u03b4(h, l) = 1 if h is a prefix of l and 0 otherwise. The Q-function is hence the average cost until repair of executing the action a in history h, under the policy that generated L. Learning a Q-function is much faster than learning the POMDP parameters, requiring only a single pass over the training sequences in the system log.\nGiven the learned Q-function, we can define the following policy:\n\n\u03c0_Q(h) = argmin_a Q(h, a)    (3)\n\nOne obvious problem of learning a direct mapping from histories to actions is that such policies do not generalize \u2014 if a history sequence was not observed in the logs, then we cannot evaluate the expected cost until the error is repaired. An approach that generalizes better is to use a finite history window of size k, discarding all the observations and actions occurring more than k steps ago. 
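Equations (1)-(3) need only a single pass over the log. The sketch below is illustrative rather than the paper's implementation: sessions are encoded as alternating observation/action lists, the cost-to-go includes the cost of the action being evaluated, and unseen histories simply have no Q estimate, which is exactly the lack of generalization discussed above. The optional window k implements the finite-history variant.

```python
from collections import defaultdict

def learn_q(log, action_cost, k=None):
    """Estimate Q(h, a) as the average cost-to-go over logged sessions.

    log: list of sessions [o0, a1, o1, a2, o2, ...] ending when healthy.
    action_cost: dict mapping each action to its cost C(a).
    k: optional finite history window (k = 1 gives the reactive Q(o, a))."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for session in log:
        actions = session[1::2]  # a1, a2, ...
        for i, a in enumerate(actions):
            h = tuple(session[:2 * i + 1])  # o0, a1, o1, ..., o_i
            if k is not None:
                h = h[-(2 * k - 1):]  # keep only the last k steps
            # cost from this action until the session terminates
            totals[(h, a)] += sum(action_cost[b] for b in actions[i:])
            counts[(h, a)] += 1
    return {ha: totals[ha] / counts[ha] for ha in counts}

def pi_q(q, h, actions):
    """Eq. 3: choose the action with minimal estimated cost; histories
    never seen in the log have no estimate at all."""
    seen = [a for a in actions if (h, a) in q]
    return min(seen, key=lambda a: q[(h, a)]) if seen else None
```

On a tiny log where restarting (cost 1) sometimes fails but rebooting (cost 4) always succeeds, the learned Q prefers rebooting immediately.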
For example, when k = 1 the result is a completely reactive Q-function, computing Q(o, a) using the last observation only.\n\n3.2 Model-Based Policy Learning\n\nWhile we assume that the behavior of a machine can be captured perfectly using a POMDP as described above, in practice we cannot expect the parameters of the POMDP to be known a priori. The only parameters that are known are the set of possible repair actions and the set of possible observations; even the number of possible errors is not initially known, let alone the repair or observation probabilities.\nGiven the log of repair sessions, we can use a learning algorithm to learn the parameters of the POMDP. In this paper we choose to use an adapted Baum-Welch algorithm [1, 2, 15], an EM algorithm originally developed for computing the parameters of Hidden Markov Models (HMMs). The Baum-Welch algorithm takes as input the number of states (the number of possible errors) and a set of training sequences. Then, using the forward-backward procedure, the parameters of the POMDP are computed, attempting to maximize the likelihood of the data (the observation sequences). After the POMDP parameters have been learned, we execute Perseus [13] to compute a policy.\nWhile training the model parameters, it is important to test likelihood on a held-out set of sequences that are not used in training, in order to ensure that the resulting model does not over-fit the data. We hence split the input sequences into a train set (80%) and a test set (20%). We check the likelihood of the test set after each forward-backward iteration, and stop the training when the likelihood of the test set does not improve.\n\n3.2.1 Model Checking\n\nWhen employing automatic learning methods to create an improved policy, it is important to provide evidence for the quality of the learned models. 
Such evidence can help the system administrators decide whether to replace the existing policy with a new policy.\nUsing an imperfect learner such as Baum-Welch does not guarantee that the resulting model indeed maximizes the likelihood of the observations given the policy, even for the same policy that was used to generate the training data. Also, the loss function used for learning the model ignores action costs, thus ignoring an important aspect of the problem. For these reasons, it is possible that the resulting model will describe the domain poorly. After the model has been learned, however, we can use the average cost to provide evidence for the validity of the model. Such a process can help us determine whether these shortcomings of the learning process have indeed resulted in an inappropriate model. This phase is usually known as model checking (see, e.g., [3]).\nAs opposed to the Q-learning approach, learning a generative model (the POMDP) allows us to check how similar the learned model is to the original model. We say that two POMDPs M_1 = <S_1, A, tr_1, C_1, \u2126, O_1> and M_2 = <S_2, A, tr_2, C_2, \u2126, O_2> are indistinguishable if for each policy \u03c0 : H \u2192 A, E[\u2211_t C_t | M_1, \u03c0] = E[\u2211_t C_t | M_2, \u03c0]. That is, the models are indistinguishable if any policy has the same expected accumulated cost when executed in both models.\nMany policies cannot be evaluated on the real system, because we cannot tolerate damaging policies. We can, however, compare the performance of the original, hand-made policy on the system and on the learned POMDP model. We hence focus the model checking phase on comparing the expected cost of the hand-made policy predicted by the learned model to the true expected cost on the real system. To estimate the expected cost in the real system, we use the average cost of the sessions in the logs. 
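Concretely, the model-side estimate can be obtained by rolling out simulated repair sessions on the learned POMDP and averaging their costs, and the two estimates can then be compared. The sketch below assumes a hypothetical model interface (sample_initial, step, is_terminal) that the paper does not specify:

```python
import random

def simulate_cost(model, policy, n_trials=10000, max_steps=100, seed=0):
    """Estimate E[sum_t C_t | M, pi] by simulating repair sessions on a
    learned model M; policy maps an observation/action history to an action."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        s, o = model.sample_initial(rng)
        history = (o,)
        for _ in range(max_steps):
            if model.is_terminal(s):
                break
            a = policy(history)
            s, o, cost = model.step(s, a, rng)
            total += cost
            history += (a, o)
    return total / n_trials

def model_check(session_costs, predicted_cost, tolerance=0.2):
    """Flag the learned model as suspect when its predicted expected cost
    deviates from the log average by more than the tolerance (e.g. 20%)."""
    log_avg = sum(session_costs) / len(session_costs)
    return abs(predicted_cost - log_avg) / log_avg <= tolerance
```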
To estimate the expected cost of the policy on the learned POMDP, we execute a set of trials, each simulating a repair session, using the learned parameters of the POMDP to govern the trial advancement (observation emissions given the history and action).\nWe can then use the two expected cost estimates as a measure of closeness. For example, if the predicted cost of the policy over the learned POMDP is more than 20% away from the true expected cost, we may deduce that the learned model does not properly capture the system dynamics. While checking the models under a single policy cannot ensure that the models are identical, it can detect whether the model is defective. If the learned model produces a substantially different expectation over the cost of a policy than the real system, we know that the model is corrupted prior to executing its optimal policy on the real system.\nAfter ensuring that the original policy performs similarly on the real system and on the learned model, we can also evaluate the performance of the computed policy on the learned model. Thus, we can compare the quality of the new policy to the existing one, helping us to understand the potential cost reduction of the new policy.\n\n4 Empirical Evaluation\n\nIn this section we provide an empirical evaluation to demonstrate that our methods can improve an existing policy. We created a simulator of recovery sessions. In the simulator we assume that a machine can be in one of n error states or in the healthy state. We also assume n possible repair actions and m possible observations. We assume that each action was designed to fix a single error state, and set the number of errors to be the number of repair actions.\nWe set pr(s_H|e_i, a_j) = 0.7 + 0.3 \u00b7 (j \u2212 i)/n if j \u2265 i and 0 otherwise, simulating the escalation power of repair actions. 
We set C(s, a_i) = 4^i and C(s_H, a_T) = 0, simulating the exponential growth of costs in the real AutoPilot system [6], and the zero downtime caused by terminating the session in the healthy state. For observations, we compute the relative severity of an error e_i in the observation space, \u00b5_i = i \u00b7 m / n, and then set pr(o_j|e_i, a) = \u03ba e^{\u2212(j \u2212 \u00b5_i)^2 / 2} / \u221a(2\u03c0) for j \u2208 [\u00b5_i \u2212 1, \u00b5_i + 1], where \u03ba is a normalizing factor.\nWe execute a hand-made escalation policy with 3 retries (see Section 2) over the simulator and gather a log of repair sequences. Each repair sequence begins with selecting an error uniformly, and executing the policy until the error is fixed. Then, we use the logs in order to learn a Q-function over the complete history, finite history window Q-functions with k = 1, 3, 5, and a POMDP model. For the POMDP model, we initialize the number of states to the number of repair actions, initialize the transition function uniformly and the observation function randomly, and execute the Baum-Welch algorithm.\n\nTable 1: Average cost of recovery policies in simulation, with increasing model size. Results are averaged over 10 executions, and the worst standard error across all recovery policies is reported in the last column.\n|E| |O| |L| | \u03c0_E,S | \u03c0_M*,S | \u03c0_M,S | Q,S | Q1,S | Q3,S | Q5,S | SE\n2 2 10000 | 21.6 | 17.3 | 17.3 | 17.3 | 18.0 | 17.4 | 17.3 | < 0.2\n4 2 10000 | 220.3 | 167.7 | 172.3 | 193.6 | 174.6 | 179.6 | 190.8 | < 3\n4 4 10000 | 221.6 | 136.8 | 141.5 | 197.8 | 239.5 | 163.6 | 178.5 | < 2.5\n8 4 50000 | 29070 | 15047 | 20592 | 52636 | 29611 | 24611 | 27951 | < 250\n8 8 50000 | 28978 | 15693 | 18303 | 54585 | 61071 | 26808 | 27038 | < 275\n\n
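The simulator quantities can be written down directly from the formulas above. This sketch is our reading of them (in particular, we take the observation distribution to be a Gaussian bump centered on mu_i = i*m/n, truncated to [mu_i - 1, mu_i + 1] and renormalized by kappa); the function names are illustrative:

```python
import math

def repair_success_prob(i, j, n):
    """pr(s_H | e_i, a_j) = 0.7 + 0.3 * (j - i) / n for j >= i, else 0:
    higher-level actions fix lower-level errors with higher probability."""
    return 0.7 + 0.3 * (j - i) / n if j >= i else 0.0

def action_cost(i):
    """C(s, a_i) = 4**i: exponentially escalating repair costs."""
    return 4 ** i

def observation_probs(i, n, m):
    """pr(o_j | e_i, a): truncated Gaussian around mu_i = i * m / n."""
    mu = i * m / n
    window = [j for j in range(m) if mu - 1 <= j <= mu + 1]
    weight = {j: math.exp(-(j - mu) ** 2 / 2) / math.sqrt(2 * math.pi)
              for j in window}
    kappa = 1.0 / sum(weight.values())  # normalizing factor
    return {j: kappa * w for j, w in weight.items()}
```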
We also constructed a maximum-likelihood POMDP model, by initializing the state space, transition, and observation function using the true state labels (the simulated errors), and executing Baum-Welch afterwards. This initialization simulates the result of a \u201cperfect learner\u201d that does not suffer from the local maximum problems of Baum-Welch.\nIn the tables below we use S for the simulator, M for the learned model, and M* for the model initialized by the true error labels. For policies, we use \u03c0_E for the escalation policy, \u03c0_M for the policy computed by Perseus on the learned model, and \u03c0_M* for the Perseus policy over the \u2018perfect learner\u2019 model. For the history-based Q-functions, we use Q to denote the function computed for the complete history, and Q_i denotes a policy over history suffixes of length i. A column header \u03c0, S denotes the estimated cost of executing \u03c0 on the simulator, and \u03c0, M denotes the estimated cost of executing \u03c0 on the model M. We also report the standard error of the estimates.\n\n4.1 Results\n\nWe begin by showing the improvement of the policies of the learned models over the original escalation policy. As Table 1 demonstrates, learning a POMDP model and computing its policy always results in a substantial reduction in costs. The M* model, initialized using the true error labels, provides an upper bound on the best performance gain that can be achieved using our approach. We can see that in many cases, the result of the learned model is very close to this upper bound.\nThe Q-functions over histories did well on the smallest domains, but not on larger domains. The worst performance is that of the reactive Q-function (Q1) over the latest observation. 
In the smaller domains Q-learning, especially with a history window of 3 (Q3), does fairly well, but in the larger domains none of the history-based policies perform well.\nWe now take a look at the results of the model checking technique. As we explained above, a model checking phase, comparing the expected cost of a policy on both the real system and the learned model, can provide evidence as to the validity of the learned model. Indeed, as we see in Table 2, the learned models predict an expected cost that is within 3% of the real expected cost.\nTo further validate our learned models, we also compare the expected cost of the policies computed from the models (M and M*) over the model and the simulator. Again, we can see that the predicted costs are very close to the real costs of these policies. As expected, the M* predicted costs are within measurement error of the true costs of the policy on the real system.\n\nTable 2: Comparing expected cost of policies on the learned model and the simulator for model checking. Results are averaged over 10 executions, and the worst standard error across all recovery policies is reported in the last column.\n|E| |O| |L| | \u03c0_E,M | \u03c0_E,S | \u03c0_E,M* | \u03c0_M,M | \u03c0_M,S | \u03c0_M*,M* | \u03c0_M*,S | SE\n2 2 10000 | 22.3 | 21.6 | 21.6 | 17.6 | 17.3 | 17.3 | 17.3 | < 0.2\n4 2 10000 | 227.2 | 220.4 | 219.5 | 173.5 | 172.4 | 165.5 | 167.7 | < 3\n4 4 10000 | 225.4 | 221.6 | 221.1 | 138.7 | 141.5 | 137.4 | 136.8 | < 2.5\n8 4 50000 | 29152 | 29070 | 28985 | 21104 | 20592 | 16461 | 15047 | < 250\n8 8 50000 | 29104 | 28978 | 28870 | 17052 | 18303 | 15630 | 15693 | < 275\n\n4.2 Discussion\n\nThe experimental results provide a few interesting insights. First, when the observation space is rich enough to capture all the errors, the learned model is very close to the optimal one. When we use fewer observations, the quality of the learned model is reduced (further from M*), but the policy that is learned still significantly outperforms the original escalation policy.\nIt is important to note that the hand-made escalation policy that we use is very natural for domains with actions that have escalating costs and effects. Also, the number of actions and errors that we use is similar to those used by current controllers of repair services [6]. As such, the improvements that we achieve over this hand-made policy hint that similar gains can be made over the real system. Our results show an improvement of 20%\u201340%, increasing with the size of the domain. As driving down the costs of data centers is crucial for the success of such systems, increasing performance by 20% is a substantial contribution.\nThe history-based approaches did well only on the smallest domains. This is because for a history-based value function, we have to evaluate every action at every history. As the input policy does not explore, the set of resulting histories does not provide enough coverage of the history space. For example, if the current repair policy only escalates, the history-based approach will never observe a higher-level action followed by a lower-level action, and cannot evaluate its expected cost.\nFinite history windows increase coverage by reducing the set of possible histories to a finite size, and thus provide some generalization power. Indeed, the finite history window methods (except for the reactive policy) improve upon the original escalation policy in some cases. 
We note, though, that we used a very simple finite history model, and more complicated approaches, such as variable length history windows [9, 11], may provide better results.\nOur model checking technique indicates that none of the models that were learned, even when the number of observations was smaller than the number of errors, was defective. This is not particularly surprising, as the input data originated from a simulator that operated under the assumptions of the model. This is unlikely to happen in the real system, where any modeling assumptions compromise some aspect of the real world. Still, in the real world this technique will allow us to test, before changing the existing policy, whether the assumptions are close to the truth, and whether the model is reasonable. This is a crucial step in making the decision to replace an existing, imperfect yet operative policy with a new one. It is unclear how to run a similar model checking phase over the history-based approach.\nWhen improving non-exploring policies, we must assume that many possible histories will not be observed. However, a complete model definition must set a value for every possibility, so it is important to set default values for these unknown model parameters. In our case, it is best to be pessimistic about these parameters, that is, to overestimate the cost of repair: we assume that action a will not fix error e if we never observed a fixing e in the logs, except for the terminating action a_T.\n\n5 Related Work\n\nUsing decision-theoretic techniques for troubleshooting and recovery dates back to Heckerman et al. [5], who employed Bayesian networks for troubleshooting, and a myopic approximation for recovery. Heckerman et al. assume that the parameters of the Bayesian network are given as input, and training it using the unlabeled data that the logs contain is difficult. 
This Bayesian network\napproach is also not designed for sequential data.\nPartially Observable Markov Decision Processes were previously suggested for modeling automated\nrecovery from errors. Most notably, Littman et al. [8] suggests the CSFR model which is similar\nto our POMDP formalization, except for a deterministic observation function, the escalation of\nactions, and the terminating action. They then proceed to de\ufb01ne a belief state in this model, which\nis a set of possible error states, and a Q-function Q(b, a) over beliefs. The Q-function is computed\nusing standard value iteration. As these assumptions reduce the partial observability, the resulting\n\n7\n\n\fQ function can produce good policies. Littman et al. assume that the model is either given, or that\nQ-learning can be executed online, using an exploration strategy, both of which are not applicable in\nour case. Also, as we argue above, in our case a Q function produces substantially inferior policies\nbecause of its lack of generalization power in partially observable domains.\nAnother, more recent, example of a recovery approach based on POMDPs was suggested by Joshi\net al. [7]. Similar to Littman et al., Joshi et al. focus on the problem of fault recovery in networks,\nwhich adds a layer of dif\ufb01culty because we can no longer assume that machines are independent,\nas often faults cascade through the network. Joshi et al. also assume that the parameters of the\nmodel, such as the probability that a watchdog will detect each failure, and the effects of actions on\nfailures, are known a-priori. They then suggest a one step lookahead repair strategy, and a multi-step\nlookahead, that uses a value function over a belief space similar to the Littman et al. belief space.\nBayer-Zubek and Dietterich [16] use a set of examples, similar to our logs, to learn a policy for\ndisease diagnosis. 
They formalize the problem as an MDP, assuming that test results are discrete and exact, and use AO∗ search, computing the needed probabilities from the example set. They did not address the problem of missing data in the example set that arises from a non-exploring policy. Indeed, in the medical diagnosis case, one may argue that trying an action sequence that was never tried by a human doctor poses an unreasonable risk of harm to the patient, and that the system should therefore not consider such policies.

6 Conclusion

We have presented an approach to improving imperfect repair policies by learning a POMDP model of the problem. Our method takes as input a log of the existing controller's interactions with the system, learns a POMDP model, and computes a policy for the POMDP that can be used in the real system. The advantage of our method is that it does not require the existing controller to actively explore the effects of actions in all conditions, which may incur unacceptable costs in the real system. On the other hand, our approach may not converge to an optimal policy. We experiment with a synthetic, yet realistic, example of a hand-made escalation policy, where actions are ordered by increasing cost and each action is repeated a number of times. We show that the policy of the learned model significantly improves on the original escalation policy.
In the future we intend to use the improved policies to manage repairs in a real data center within the AutoPilot system [6]. The first step would be to “flight” candidate policies to evaluate their performance in the real system. Our current method is a single-shot improvement, and an interesting next step is to create an incremental improvement process, where new policies constantly improve the existing one.
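For concreteness, the hand-made escalation baseline just described, with actions ordered by increasing cost and each action retried a fixed number of times before escalating, can be sketched as follows. Action names and repeat counts here are illustrative assumptions, not the configuration of any real data center:

```python
class EscalationPolicy:
    """Baseline controller: try each repair action, cheapest first,
    a fixed number of times; escalate to the next action on failure."""

    def __init__(self, actions, repeats):
        # `actions` is ordered by increasing cost, e.g. ["reboot", "reimage"];
        # `repeats[i]` is how many times actions[i] is tried before escalating.
        self.actions = actions
        self.repeats = repeats
        self.reset()

    def reset(self):
        """Start a new repair episode."""
        self.level = 0
        self.tries = 0

    def next_action(self):
        """Return the next repair action, or terminate past the last level."""
        if self.level >= len(self.actions):
            return "terminate"  # hand the machine over to a technician
        action = self.actions[self.level]
        self.tries += 1
        if self.tries >= self.repeats[self.level]:
            self.level += 1  # exhausted this level; escalate next time
            self.tries = 0
        return action
```

Such a fixed schedule ignores the observed notifications entirely, which is precisely the slack an improvement process, one-shot or incremental, can exploit.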
In this setting, it would be interesting to explore bounded exploration, an exploration technique that puts a bound on the risk of the strategy.
There are a number of interesting theoretical questions about our passive policy learning method, and about passive policy learning in general. First, for what families of initial policies and system dynamics would a passive policy learning method be expected to yield an improvement in expected costs? Second, for what families of initial policies and system dynamics would a passive policy learning method be expected to yield the optimal policy? Third, how would one characterize when iteratively applying a passive policy learning method would yield improvements in expected costs?
Finally, while this paper focuses on the important failure recovery problem, our methods may be applicable to a wide range of similar systems, such as assembly line management and medical diagnosis systems, that currently employ hand-made imperfect controllers.

References

[1] Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164–171, 1970.

[2] Lonnie Chrisman. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 183–188. AAAI Press, 1992.

[3] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis. Chapman and Hall, 1996.

[4] Eric A. Hansen. Indefinite-horizon POMDPs with action-based termination. In AAAI, pages 1237–1242, 2007.

[5] David Heckerman, John S. Breese, and Koos Rommelse. Decision-theoretic troubleshooting. Commun. ACM, 38(3):49–57, 1995.

[6] Michael Isard. Autopilot: automatic data center management.
Operating Systems Review, 41(2):60–67, 2007.

[7] Kaustubh R. Joshi, William H. Sanders, Matti A. Hiltunen, and Richard D. Schlichting. Automatic model-driven recovery in distributed systems. In SRDS, pages 25–38, 2005.

[8] Michael L. Littman and Nishkam Ravi. An instance-based state representation for network repair. In Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI), pages 287–292, 2004.

[9] Andrew Kachites McCallum. Reinforcement learning with selective perception and hidden state. PhD thesis, 1996. Supervisor: Dana Ballard.

[10] Joelle Pineau, Geoffrey Gordon, and Sebastian Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1025–1032, August 2003.

[11] Guy Shani and Ronen I. Brafman. Resolving perceptual aliasing in the presence of noisy sensors. In NIPS, 2004.

[12] R. D. Smallwood and E. J. Sondik. The optimal control of partially observable Markov decision processes over a finite horizon. Operations Research, 21:1071–1098, 1973.

[13] Matthijs T. J. Spaan and Nikos Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24:195–220, 2005.

[14] Richard S. Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[15] Daan Wierstra and Marco Wiering. Utile distinction hidden Markov models. In ICML ’04: Proceedings of the Twenty-First International Conference on Machine Learning, page 108, New York, NY, USA, 2004. ACM.

[16] Valentina Bayer Zubek and Thomas G. Dietterich. Integrating learning from examples into the search for diagnostic policies. J. Artif. Intell. Res.
(JAIR), 24:263–303, 2005.