{"title": "Transferring Expectations in Model-based Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2555, "page_last": 2563, "abstract": "We study how to automatically select and adapt multiple abstractions or representations of the world to support model-based reinforcement learning. We address the challenges of transfer learning in heterogeneous environments with varying tasks. We present an efficient, online framework that, through a sequence of tasks, learns a set of relevant representations to be used in future tasks. Without pre-defined mapping strategies, we introduce a general approach to support transfer learning across different state spaces. We demonstrate the potential impact of our system through improved jumpstart and faster convergence to near optimum policy in two benchmark domains.", "full_text": "Transferring Expectations in Model-based\n\nReinforcement Learning\n\nTrung Thanh Nguyen, Tomi Silander, Tze-Yun Leong\n\nSchool of Computing\n\nNational University of Singapore\n\n{nttrung, silander, leongty}@comp.nus.edu.sg\n\nSingapore, 117417\n\nAbstract\n\nWe study how to automatically select and adapt multiple abstractions or represen-\ntations of the world to support model-based reinforcement learning. We address\nthe challenges of transfer learning in heterogeneous environments with varying\ntasks. We present an ef\ufb01cient, online framework that, through a sequence of tasks,\nlearns a set of relevant representations to be used in future tasks. Without pre-\nde\ufb01ned mapping strategies, we introduce a general approach to support transfer\nlearning across different state spaces. We demonstrate the potential impact of\nour system through improved jumpstart and faster convergence to near optimum\npolicy in two benchmark domains.\n\n1\n\nIntroduction\n\nIn reinforcement learning (RL), an agent autonomously learns how to make optimal sequential de-\ncisions by interacting with the world. 
The agent\u2019s learned knowledge, however, is task and environ-\nment speci\ufb01c. A small change in the task or the environment may render the agent\u2019s accumulated\nknowledge useless; costly re-learning from scratch is often needed.\nTransfer learning addresses this shortcoming by accumulating knowledge in forms that can be reused\nin new situations. Many existing techniques assume the same state space or state representation in\ndifferent tasks. While recent efforts have addressed inter-task transfer in different action or state\nspaces, speci\ufb01c mapping criteria have to be established through policy reuse [7], action correlation\n[14], state abstraction [22], inter-space relation [16], or other methods. Such mappings are hard\nto de\ufb01ne when the agent operates in complex environments with large state spaces and multiple\ngoal states, with possibly different state feature distributions and world dynamics. To ef\ufb01ciently\naccomplish varying tasks in heterogeneous environments, the agent has to learn to focus attention\non the crucial features of each environment.\nWe propose a system that tries to transfer old knowledge, but at the same time evaluates new op-\ntions to see if they work better. The agent gathers experience during its lifetime and enters a new\nenvironment equipped with expectations on how different aspects of the world affect the outcomes\nof the agent\u2019s actions. The main idea is to allow an agent to collect a library of world models or\nrepresentations, called views, that it can consult to focus its attention in a new task. In this paper, we\nconcentrate on approximating the transition model. The reward model library can be learned in an\nanalogous fashion. Effective utilization of the library of world models allows the agent to capture\nthe transition dynamics of the new environment quickly; this should lead to a jumpstart in learning\nand faster convergence to a near optimal policy. 
A main challenge is learning to select a proper view for a new task in a new environment, without any predefined mapping strategies. We will next formalize the problem and describe the method of collecting views into a library. We will then present an efficient implementation of the proposed transfer learning technique. After discussing related work, we will demonstrate the efficacy of our system through a set of experiments in two different benchmark domains.\n\n2 Method\n\nIn RL, a task environment is typically modeled as a Markov decision process (MDP) defined by a tuple (S, A, T, R), where S is a set of states; A is a set of actions; T : S × A × S → [0, 1] is the transition function, such that T(s, a, s') = P(s'|s, a) indicates the probability of transitioning to a state s' upon taking an action a in state s; and R : S × A → R is a reward function indicating the immediate expected reward after an action a is taken in state s. The goal is then to find a policy π that specifies an action to perform in each state so that the expected accumulated future reward (possibly giving higher weights to more immediate rewards) for each state is maximized [18]. In model-based RL, the optimal policy is calculated based on estimates of the transition model T and the reward model R, which are obtained by interacting with the environment.\nA key idea of this work is that the agent can represent the world dynamics from its sensory state space in different ways. Such different views correspond to the agent's decisions to focus attention on only some features of the state in order to quickly approximate the state transition function.\n\n2.1 Decomposition of transition model\n\nTo allow knowledge transfer from one state space to another, we assume that each state s in all the state spaces can be characterized by a d-dimensional feature vector f(s) ∈ R^d.
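The model-based loop described above plugs the estimated T and R into a standard planner. As a minimal illustration (the array layout, function name, and discount factor gamma are assumptions of this sketch, not part of the paper):

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """T[s, a, s2]: estimated transition probabilities; R[s, a]: expected rewards.
    Returns a greedy policy and the state values it is greedy with respect to."""
    V = np.zeros(T.shape[0])
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] * V[s']
        Q = R + gamma * (T @ V)
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return Q.argmax(axis=1), V_new
        V = V_new
```

Any dynamic-programming or approximate planner could be substituted here; the transfer machinery below only concerns how T is estimated.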
The states themselves may or may not be factored. We use the idea of situation calculus [11] to decompose the transition model T in accordance with the possible action effects. In the RL context, an action will stochastically create an effect that determines how the current state changes to the next one [2, 10, 14]. For example, an attempt to move left in a grid world may cause the agent to move one step left or one step forward, with small probabilities. The relative changes in states, “moved left” and “moved forward”, are called effects of the action.\nFormally, let us call an MDP with a decomposed transition model a CMDP (situation Calculus MDP). A CMDP is defined by a tuple (S, A, E, τ, η, f, R) in which the transition model T has been replaced by the terms E, τ, η, f, where E is an effect set and f is a function from states to their feature vectors. τ : S × A × E → [0, 1] is an action model such that τ(s, a, e) = P(e | f(s), a) indicates the probability of achieving effect e upon performing action a at state s. Notice that the probability of effect e depends on state s only through the features f(s). While the agent needs to learn the effects of the action, it is usually assumed to understand the meaning of the effects, i.e., how the effects turn each state into a next state. This knowledge is captured in a deterministic function η : S × E → S. Different effects e will change a state s to a different next state s' = η(s, e).
The MDP transition model T can be reconstructed from the CMDP by the equation:\n\nT(s, a, s'; τ) = P(s' | f(s), a) = τ(s, a, e),   (1)\n\nwhere e is the effect of action a that takes s to s', if such an e exists; otherwise T(s, a, s'; τ) = 0.\nThe benefit of this decomposition is that while there may be a large number of states, there is usually a limited number of definable effects of actions, and those are assumed to depend only on some features of the states and not on the actual states themselves. We can therefore turn the learning of the transition model into a supervised online classification problem that can be solved by any standard online classification method. More specifically, the classification task is to predict the effect e of an action a in a state s with features f(s).\n\n2.2 A multi-view transfer framework\n\nIn our framework, the knowledge gathered and transferred by the agent is collected into a library T of online effect predictors or views.\nA view consists of a structure component ¯f that picks the features which should be focused on, and a quantitative component Θ that defines how these features should be combined to approximate the distribution of action effects. Formally, a view is defined as τ = (¯f, Θ), such that P(E|S, a; τ) = P(E|¯f(S), a; Θ) = τ(S, a, E), in which ¯f is an orthogonal projection of f(s) onto some subspace of R^d. Each view τ is specialized in predicting the effects of one action a(τ) ∈ A, and it yields a probability distribution for the effects of the action a in any state. This prediction is based on the features of the state and the parameters Θ(τ) of the view that may be adjusted based on the actual effects observed in the task environment.\nWe denote the subset of views that specify the effects for action a by T^a ⊂ T.
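As a concrete illustration of Equation 1, a view's effect distribution together with the effect function η recovers the transition probabilities. The grid-world effect names and helper signatures below are hypothetical, not from the paper:

```python
def transition_prob(s, s_next, effect_dist, eta):
    """Equation 1: T(s, a, s') is the total probability of effects e
    with eta(s, e) = s' (zero if no such effect exists)."""
    return sum(p for e, p in effect_dist.items() if eta(s, e) == s_next)

def eta(s, e):
    """Deterministic effect semantics on a grid with a wall at x = 0:
    an effect that would cross the wall leaves the state unchanged."""
    x, y = s
    dx, dy = {'moved_left': (-1, 0), 'moved_forward': (0, 1), 'stayed': (0, 0)}[e]
    return (x, y) if x + dx < 0 else (x + dx, y + dy)
```

Note that next to a wall, two effects may land on the same next state; summing their probabilities keeps T well defined in that case.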
The main challenge is to build and maintain a comprehensive set of views that can be used in new environments likely resembling the old ones, but at the same time allow adaptation to new tasks with completely new transition dynamics and feature distributions.\nAt the beginning of every new task, the existing library is copied into a working library which is also augmented with fresh, uninformed views, one for each action, that are ready to be adapted to new tasks. We then select, for each action, a view with a good track record. This view is used to estimate the optimal policy based on the transition model specified in Equation 1, and the policy is used to pick the first action a. The action effect is then used to score all the views in the working library and to adjust their parameters. In each round the selection of views is repeated based on their scores, and the new optimal policy is calculated based on the new selections. At the end of the task, the actual library is updated by possibly recruiting the views that have “performed well” and retiring those that have not. A more rigorous version of the procedure is described in Algorithm 1.\n\nAlgorithm 1 TES: Transferring Expectations using a library of views\nInput: T = {τ1, τ2, ...}: view library; CMDP_j: a new jth task; Φ: view goodness evaluator\nLet T_0 be a set of fresh views, one for each action\nT_tmp ← T ∪ T_0 /* THE WORKING LIBRARY FOR THE TASK */\nfor all a ∈ A do\n  ˆT[a] ← argmax_{τ ∈ T^a_tmp} Φ(τ, j) /* SELECTING VIEWS */\nend for\nfor t = 0, 1, 2, ... do\n  a_t ← ˆπ(s_t), where ˆπ is obtained by solving the MDP using transition model ˆT\n  Perform action a_t and observe effect e_t\n  for all τ ∈ T^{a_t}_tmp ∪ T^{a_t} do Score[τ] ← Score[τ] + log τ(s_t, a_t, e_t) end for\n  for all τ ∈ T^{a_t}_tmp do Update view τ based on (f(s_t), a_t, e_t) end for\n  ˆT[a_t] ← argmax_{τ ∈ T^{a_t}_tmp} Score[τ] /* SELECTING VIEWS */\nend for\nfor all a ∈ A do\n  τ* ← argmax_{τ ∈ T^a_tmp} Score[τ]\n  T^a ← growLibrary(T^a, τ*, Score, j) /* UPDATING LIBRARY */\nend for\nif |T| > M then\n  T ← T − {argmin_{τ ∈ T} Φ(τ, j)} /* PRUNING LIBRARY */\nend if\n\n2.2.1 Scoring the views\n\nTo assess the quality of a view τ, we measure its predictive performance by a cumulative log-score. This is a proper score [12] that can be effectively calculated online.\nGiven a sequence D^a = (d_1, d_2, ..., d_N) of observations d_i = (s_i, a, e_i) in which action a has resulted in effect e_i in state s_i, the score for an a-specialized τ is\n\nS(τ, D^a) = Σ_{i=1}^{N} log τ(s_i, a, e_i; θ_{:i}(τ)),\n\nwhere τ(s_i, a, e_i; θ_{:i}(τ)) is the probability of event e_i given by the event predictor τ based on the features of state s_i and the parameters θ_{:i}(τ) that may have been adjusted using the previous data (d_1, d_2, ..., d_{i−1}).\n\n2.2.2 Growing the library\n\nAfter completing a task, the highest scoring new views for each action are considered for recruiting into the actual library. The winning “newbies” are automatically accepted.
In this case, the data has most probably come from a distribution that is far from any of the current models; otherwise one of the current models would have had an advantage to adapt and win.\nThe winners τ* that are adjusted versions of old views ¯τ are accepted as new members if they score significantly higher than their original versions, based on the logarithm of the prequential likelihood ratio [5], Λ(τ*, ¯τ) = S(τ*, D^a) − S(¯τ, D^a). Otherwise, the original versions ¯τ get their parameters updated to the new values. This procedure is just a heuristic, and other inclusion and updating criteria may well be considered. The policy is detailed in Algorithm 2.\n\nAlgorithm 2 Grow sub-library T^a\nInput: T^a; τ*; Score; j: task index; c: constant; H_{τ*} = {}: empty history record\nOutput: updated library subset T^a and winning histories H_{τ*}\ncase τ* ∈ T^a_0 do T^a ← T^a ∪ {τ*} /* ADD NEWBIE TO LIBRARY */\notherwise do\n  Let ¯τ ∈ T be the original, not adapted version of τ*\n  case Score[τ*] − Score[¯τ] > c do T^a ← T^a ∪ {τ*}\n  otherwise do T^a ← T^a ∪ {τ*} − {¯τ}; H_{τ*} ← H_{¯τ} /* INHERIT HISTORY */\nH_{τ*} ← H_{τ*} ∪ {j}\n\n2.2.3 Pruning the library\n\nTo keep the library relatively compact, a plausible policy is to remove views that have not performed well for a long time, possibly because there are better predictors or they have become obsolete in the new tasks or environments. To implement such a retiring scheme, each view τ maintains a list H_τ of task indices that indicates the tasks for which the view has been the best scoring predictor for its specialty action a(τ).
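These task-index histories H_τ are what the recency-weighted goodness score Φ is computed from; a minimal sketch (the decay rate mu is an illustrative choice):

```python
import math

def recency_score(history, current_task, mu=0.5):
    """Exponentially decayed count of past wins: a task won recently
    contributes close to 1, an old win contributes close to 0."""
    return sum(math.exp(-mu * (current_task - t)) for t in history)
```

Pruning then simply keeps the top-scoring views, so a view that once won many tasks is eventually retired if it stops winning.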
We can then calculate the recency weighted track record for each view. In practice, we have adopted the procedure by Zhu et al. [27] that introduces the recency weighted score at time T as\n\nΦ(τ, T) = Σ_{t ∈ H_τ} e^{−μ(T−t)},\n\nwhere μ controls the speed of decay of past success. Other decay functions could naturally also be used. The pruning can then be done by introducing a threshold for the recency weighted score or by always maintaining the top M views.\n\n3 A view learning algorithm\n\nIn TES, a view can be implemented by any probabilistic classification model that can be quickly learned online. A popular choice for representing the transition model in factored domains is the dynamic Bayesian network (DBN), but learning DBNs is computationally very expensive. Recent studies [24, 25] have shown encouraging results in learning the structure of logistic regression models that can serve as local structures of DBNs. While these models cannot capture all the conditional distributions, their simplicity allows fast online learning in very high dimensional spaces.\nWe introduce an online sparse multinomial logistic regression algorithm to incrementally learn a view. The proposed algorithm is similar to the so-called group lasso [26], which has been recently suggested for feature selection among a very large set of features [25].¹\nAssuming K classes of vectors x ∈ R^d, each class k is represented with a d-dimensional prototype vector W_k. Classification of an input vector x in logistic regression is based on how “similar” it is to the prototype vectors, as measured by the inner product ⟨W_k, x⟩ = Σ_{i=1}^{d} W_{ki} x_i. The log probability of a class y is defined by log P(y = k|x; W_k) ∝ ⟨W_k, x⟩.
The classifier can then be parametrized by stacking the W_k vectors as rows into a matrix W = (W_1, ..., W_K)^T.\nAn online learning system usually optimizes its probabilistic classification performance by minimizing a total loss function through updating its parameters over time. A typical item-wise loss function of a multinomial logistic regression classifier is l(W) = −log P(y|x; W), where (y, x) denotes the data item observed at time t. To achieve a parsimonious model in a feature-rich domain, we express our a priori belief that most features are superfluous by introducing a regularization term\n\nΨ(W) = λ Σ_{i=1}^{d} √K ||W_{·i}||_2,\n\nwhere ||W_{·i}||_2 denotes the 2-norm of the ith column of W, and λ is a positive constant. This regularization is similar to that of the group lasso [26]. It communicates the idea that it is likely that a whole column of W has zero values (especially for large λ). A column of all zeros suggests that the corresponding feature is irrelevant for classification.\nThe objective function can now be written as Σ_{t=1}^{T} [l(W^t, d_t) + Ψ(W^t)], where W^t is the coefficient matrix learned using the t − 1 previously observed data items. Inspired by the efficient dual averaging method [24] for solving lasso and group lasso [25] logistic regression, we extend the results to the multinomial case.\n¹ We report here the details of the method that should allow its replication. A more comprehensive description is available as a separate report in the supplementary material.
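The column-wise sparsity that this regularizer induces shows up as whole-column truncation in the dual-averaging step; a minimal numpy sketch of one such group-thresholded update (an illustrative reconstruction under the shapes and parameter names used in the text, not the paper's exact code):

```python
import numpy as np

def rda_update(G_bar, t, lam, alpha):
    """One dual-averaging step for the K x d parameter matrix:
    columns of the average gradient whose 2-norm is below lam * sqrt(K)
    are truncated to zero, deselecting the corresponding feature."""
    K, d = G_bar.shape
    W = np.zeros((K, d))
    thresh = lam * np.sqrt(K)
    for i in range(d):
        norm = np.linalg.norm(G_bar[:, i])
        if norm > thresh:  # keep feature i, shrink toward zero
            W[:, i] = (np.sqrt(t) / alpha) * (thresh / norm - 1.0) * G_bar[:, i]
    return W
```

Columns of all zeros in the returned matrix mark features the view has learned to ignore.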
Specifically, the loss minimizing sequence of parameter matrices W^t can be achieved by the following online update scheme.\nLet G^t_{ki} be the derivative of the function l_t(W) with respect to W_{ki}, i.e., G^j_{ki} = −x^j_i (I(y^j = k) − P(k|x^j; W^{j−1})), and let ¯G^t be the matrix of average partial derivatives, ¯G^t_{ki} = (1/t) Σ_{j=1}^{t} G^j_{ki}.\nGiven a K × d average gradient matrix ¯G^t and a regularization parameter λ > 0, the ith column of the new parameter matrix W^{t+1} can be obtained as follows:\n\nW^{t+1}_{·i} = 0 if ||¯G^t_{·i}||_2 ≤ λ√K, and W^{t+1}_{·i} = (√t / α) (λ√K / ||¯G^t_{·i}||_2 − 1) ¯G^t_{·i} otherwise,   (2)\n\nwhere α > 0 is a constant. The update rule (2) dictates that when the length of the average gradient matrix column is small enough, the corresponding parameter column should be truncated to zero. This introduces feature selection into the model.\n\n4 Related work\n\nThe survey by Taylor and Stone [20] offers a comprehensive exposition of recent methods to transfer various forms of knowledge in RL. Not much research, however, has focused on transferring transition models. For example, while superficially similar to our framework, the case-based reasoning approaches [4, 13] focus on collecting good decisions instead of building models of world dynamics. Taylor proposes TIMBREL [19] to transfer observations in a source to a target task via manually tailored inter-task mapping. Fernández et al. [7] transfer a library of policies learned in previous tasks to bias exploration in new tasks. The method assumes a constant inter-task state space; otherwise a state mapping strategy is needed.\nHester and Stone [8] describe a method to learn a decision tree for predicting state relative changes, which are similar to our action effects.
They learn decision trees online by repeatedly applying batch learning. Such a sequence of classifiers forms an effect predictor that could be used as a member of our view library. This work, however, does not directly focus on transfer learning.\nMultiple models have previously been used to guide behavior in non-stationary environments [6, 15]. Unlike our work, these studies usually assume a common concrete state space. In representation selection, Konidaris and Barto [9] focus on selecting the best abstraction to assist the agent's skill learning, and Van et al. [21] study using multiple representations together to solve an RL problem. None of these studies, however, solves the problem of transferring knowledge in heterogeneous environments.\nAtkeson and Santamaria introduce a locally weighted transfer learning technique called LWT to adapt previously learned transition models to a new situation [1]. This study is among the very few that actually consider transferring the transition model to a new task [20]. While their work is conducted in a continuous state space using a fixed state similarity measure, it can be adapted to a discrete case. Doing so corresponds to adopting a fixed single view. We will compare our work with this approach in our experiments. This approach could also be extended to be compatible with our work by learning a library of state similarity measures and developing a method to choose among those similarities for each task.\nWilson et al. [23] also address the problem of transfer in heterogeneous environments. They formalize the problem as learning a generative Dirichlet process for MDPs and suggest an approximate solution using Gibbs sampling. Our method can be seen as a structure-learning enhanced alternative implementation of this generative model. Our online method is computationally more efficient, but the MCMC estimation should eventually yield more accurate estimates.
Both models can also be adjusted to deal with non-stationary task sources. The work by Wilson et al. demonstrates the method for reward models, and it is unclear how to extend the approach for transferring transition models. We will also compare our work with this hierarchical Bayes approach in our experiments.\n\n5 Experiments\n\nWe examine the performance of our expectation transfer algorithm TES that transfers views to speed up the learning process across different environments in two benchmark domains. We show that TES can efficiently: a) learn the appropriate views online, b) select views using the proposed scoring metric, c) achieve a good jumpstart, and d) perform well in the long run.\nTo better compare with some related work, we evaluate the performance of TES for transferring both transition models and reward models in RL. TES can be adapted to transfer reward models as follows: assuming that the rewards follow a Gaussian distribution, a view of the expected reward model can be learned similarly as shown in Section 3. We use an online sparse linear regression model instead of the multinomial logistic regression. Simply replacing the matrix W by a vector w and using a squared loss function, the coefficient update can be derived similarly to that in Equation 2 [24]. When studying reward models, the transition models are assumed to be known.\n\n5.1 Learning views for effective transfer\n\nIn the first experiment, we compare TES with the locally weighted LWT approach by Atkeson et al. [1] and the non-parametric hierarchical Bayesian approach HB by Wilson et al. [23] in transferring reward models. We adopt the same domain as described in Wilson et al.'s HB paper, but augment each state with 200 random binary features. The objective is to find the optimal route to a known goal state in a color maze.
Assuming a deterministic transition model, the highest cumulative reward, determined by the colors around each cell/state, can be achieved on the optimal route.\nExperiment set-up: Five different reward models are generated by normal Gaussian distributions, each depending on different sets of features. The start state is random. We run experiments on 15 tasks repeatedly 20 times, and conduct a leave-one-task-out test. The maximum size M of the view library, initially empty, is set to 20; the threshold c for growing the library is set to log 300. The parameters for view learning are λ = 0.05 and α = 2.5.\n\nTable 1: Transfer of reward models: cumulative reward in the first episodes for each of the 15 tasks, and the time (in minutes) to solve all 15 tasks, each run for 200 episodes. Map sizes vary from 20 × 20 to 30 × 30.\nTask      HB       LWT      TES\n1      -108.01   -79.41   -45.01\n2       -85.26  -114.28   -78.23\n3       -67.46   -83.31   -62.15\n4       -90.17   -46.70   -54.46\n5      -130.11  -245.11  -119.76\n6       -95.42  -156.23  -115.77\n7       -46.23   -47.05   -37.15\n8       -77.10   -49.52   -58.09\n9       -83.91  -105.24  -167.13\n10      -51.01   -88.19   -59.11\n11     -131.44  -174.15  -102.46\n12      -97.05   -85.10   -45.99\n13      -90.11   -55.45   -86.12\n14      -48.91  -101.24   -67.23\n15      -92.31   -86.01   -81.39\nTime      77.2     28.6     31.5\n\nAs seen in Table 1, TES on average wins over HB in 11 and over LWT in 12 out of 15 tasks. In the 15 × 20 = 300 runs, TES wins over HB 239 times and over LWT 279 times, both yielding binomial test p-values less than 0.05. This demonstrates that TES can successfully learn the views and utilize them in novel tasks. Moreover, TES runs much faster than HB, and only slightly slower than LWT. Since HB does not learn the relevant features for model representation, it may overfit, and the knowledge learned cannot be easily generalized. It also needs a costly sampling method. Similarly, the strategy
Similarly, the strategy\nfor LWT that tries to learn one common model for transfer in various tasks often does not work well.\n\n5.2 Multi-view transfer in complex environments\n\nIn the second experiment, we evaluate TES in a more challenging domain, transferring transition\nmodels. We consider a grid-based robot navigation problem in which each grid-cell has the surface\nof either sand, soil, water, brick, or \ufb01re. In addition, there may be walls between cells. The surfaces\nand walls determine the stochastic dynamics of the world. However, the agent also observes numer-\nous other features in the environment. The agent has to learn to focus on the relevant features to\nquickly achieve its goal. The goal is to reach any exit door in the world consuming as little energy\nas possible.\n\n6\n\n\fExperiment set-up: The agent can perform four actions (move up, down, left, right) which will lead\nit to one of the four states around it, or leave it to its current state if it bumps into a wall. The agent\nwill spend 0.01 units of energy to perform an action. It loses 1 unit if falling into a \ufb01re, but gains 1\nunit when reaching an exit door. A task ends when the agent reaches any exit door or \ufb01re.\nWe design \ufb01fteen tasks with grid sizes ranging from 20\u00d7 20 to 30\u00d7 30. Each task has a different\nstate space and different terminal states. Each state (cell) also has 200 irrelevant random binary\nfeatures, besides its surface materials and the walls around it. The tasks may have different dynamics\nas well as different distributions of the surface materials.\nIn our experiments, the environment\ntransition dynamics is generated using three different sets of multinomial logistic regression models\nso that every combination of cell surfaces and walls around the cell will lead to a different transition\ndynamics at the cell. 
The probability of going through a wall is rounded to zero, and the freed probability mass is evenly distributed to the other effects. The agent's starting position is randomly picked in each episode.\nWe represent five effects of the actions: moved up, left, down, right, and did not move. The maximum size M of the view library, initially empty, is set to 20; the threshold c = log 300. In a new environment, the TES agent mainly relies on its transferred knowledge. However, we allow some ε-greedy exploration with ε = 0.05. The parameters for the view learning algorithm are λ = 0.05 and α = 1.5. We conduct a leave-one-out cross-validation experiment with fifteen different tasks. In each scenario the agent is first allowed to experience fourteen tasks, over 100 episodes in each, and it is then tested on the remaining task. No recency weighting is used to calculate the goodness of the views in the library. We next discuss experimental results averaged over 20 runs, showing 95% confidence intervals (when practical) for some representative tasks.\nTransferring expectations between homogeneous tasks. To check that TES is capable of basic model transfer, we first evaluate it on a simple task to verify that the learning algorithm in Section 3 works. We train and test TES on two environments which have the same dynamics and 200 irrelevant binary features that challenge the agent's ability to learn a compact model for transfer. Figure 1a shows how much the other methods lose to TES in terms of accumulated reward in the test task. loreRL is an implementation of TES equipped with the view learning algorithm but without knowledge transfer. fRmax is the factored Rmax [3] in which the network structures of the transition models are provided by an oracle [17]; its parameter m is set to 10 in all the experiments.
fEpsG is a heuristic in which the optimistic Rmax exploration of fRmax is replaced by an ε-greedy strategy (ε = 0.1). The results show that these oracle methods still have to spend time to learn the model parameters, so they gain less accumulated reward than TES. This also suggests that the transferred view of TES is likely not only compact but also accurate. Figure 1a further shows that loreRL and fEpsG are more effective than fRmax in early episodes.\nView selection vs. random views. Figure 1b shows how different views lead to different policies and accumulated rewards over the first 50 episodes in a given task. The Rands curves show the accumulated reward difference to TES when the agent follows some random combinations of views from the library. For clarity we show only 5 such random combinations. For all of these, the difference turns negative fast in the beginning, indicating less reward in early episodes. We conclude that our view selection criterion outperforms random selection.\n\nFigure 1: Performance difference to TES in early trials in a) homogeneous and b) heterogeneous environments. c) Convergence.\n\nTable 2: Cumulative reward after the first episodes.
For example, in Task 1 TES can save (0.616 − 0.113)/0.01 = 50.3 actions compared to LWT.\n\nTask   loreRL     LWT     TES\n1      -0.681   0.113   0.616\n2      -0.826  -0.966  -0.369\n3      -0.814  -0.300   0.230\n4      -1.068   0.024  -0.044\n5      -0.575  -1.205  -0.541\n6      -0.810  -0.345  -0.784\n7      -0.529  -1.104  -0.265\n8      -0.398  -1.98    0.255\n9      -0.653  -0.057   0.001\n10     -0.518  -0.664  -0.298\n11     -0.528  -0.230  -1.184\n12     -0.244  -1.228  -0.077\n13     -0.173   0.034   0.209\n14     -1.176   0.244   0.389\n15     -0.692  -0.564  -0.407\n\nMultiple views vs. single view, and non-transfer. We compare the multi-view learning TES agent with a non-transfer agent loreRL, and an LWT agent that tries to learn only one good model for transfer. We also compare with the oracle method fEpsG. As seen in Figure 1b, TES outperforms LWT which, due to differences in the tasks, also performs worse than loreRL. When the earlier training tasks are similar to the test task, the LWT agent performs well. However, the TES agent also quickly picks the correct views; thus we never lose much but often gain a lot. We also notice that TES achieves a higher accumulated reward than loreRL and fEpsG, which are bound to make uninformed decisions in the beginning.\nTable 2 shows the average cumulative reward after the first episode (the jumpstart effect) for each test task in the leave-one-out cross-validation. We observe that TES usually outperforms both the non-transfer and the LWT approach. In all 15 × 20 = 300 runs, TES wins over LWT 247 times and over loreRL 263 times, yielding p-values smaller than 0.05.\nWe also notice that due to its fast capability of capturing the world dynamics, TES's running time is only slightly longer than LWT's and loreRL's, which do not perform extra work for view switching but need more time and data to learn the dynamics models.\nConvergence.
To study the asymptotic performance of TES, we compare it with the oracle method fRmax, which is known to converge to a (near) optimal policy. Note that in this feature-rich domain, fRmax without the pre-defined DBN structure behaves essentially like Rmax; therefore, we also compare with Rmax. For Rmax, the number of visits to a state before it is considered "known" is set to 5, and the exploration probability ε for known states decreases from an initial value of 0.1.
Figure 1c shows the accumulated rewards and their statistical dispersion over episodes. Average performance is reflected in the slopes of the curves. TES achieves a (near) optimal policy very quickly and sustains its good performance over the long run; it is only gradually caught up by fRmax and Rmax. This suggests that TES can successfully learn a good library of views in heterogeneous environments and efficiently utilize those views in novel tasks.

6 Conclusions

We have presented a framework for learning and transferring multiple expectations, or views, about world dynamics in heterogeneous environments. When the environments differ, the combination of learning multiple views and dynamically selecting the most promising ones yields a system that learns a good policy faster and gains higher accumulated reward, compared to the common strategy of learning a single good model and using it on all occasions.
Utilizing and maintaining multiple models requires additional computation and memory. We have shown that, by a clever decomposition of the transition function, model selection and model updating can be accomplished efficiently using online algorithms. 
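Such online model selection can be illustrated with a minimal prequential-scoring sketch: each candidate view is scored by its accumulated predictive log-likelihood on the observed outcomes, and the best scorer is selected. The `View` and `ViewLibrary` classes and the Laplace-smoothed categorical predictor below are illustrative assumptions, not the exact TES machinery:

```python
import math

class View:
    """A toy view: a categorical predictor over outcomes with Laplace smoothing."""
    def __init__(self, n_outcomes):
        self.counts = [1] * n_outcomes  # start from uniform pseudo-counts

    def prob(self, outcome):
        return self.counts[outcome] / sum(self.counts)

    def update(self, outcome):
        self.counts[outcome] += 1

class ViewLibrary:
    """Ranks views by their prequential score: the sum of predictive
    log-probabilities assigned to each outcome *before* seeing it."""
    def __init__(self, views):
        self.views = views
        self.log_scores = [0.0] * len(views)

    def observe(self, outcome):
        # Score each view on the new outcome first, then let it adapt.
        for i, view in enumerate(self.views):
            self.log_scores[i] += math.log(view.prob(outcome))
            view.update(outcome)

    def best_view(self):
        return max(range(len(self.views)), key=lambda i: self.log_scores[i])
```

Because the score is updated one observation at a time, selection costs time linear in the library size per step and needs no replay of past data, which is what keeps multi-model bookkeeping cheap.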
Our experiments demonstrate that performance improvements in multi-dimensional heterogeneous environments can be achieved at a small computational cost.
The current work addresses the question of learning good models, but the problem of learning good policies in large state spaces remains. Our model learning method is independent of the policy learning task, so it can readily be coupled with any scalable approximate policy learning algorithm.

Acknowledgments

This research is supported by Academic Research Grants MOE2010-T2-2-071 and T1 251RES1005 from the Ministry of Education in Singapore.