{"title": "Schema Learning: Experience-Based Construction of Predictive Action Models", "book": "Advances in Neural Information Processing Systems", "page_first": 585, "page_last": 592, "abstract": null, "full_text": "Schema Learning: Experience-Based\n\nConstruction of Predictive Action Models\n\nMichael P. Holmes\nCollege of Computing\n\nCharles Lee Isbell, Jr.\nCollege of Computing\n\nGeorgia Institute of Technology\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332-0280\nmph@cc.gatech.edu\n\nAtlanta, GA 30332-0280\n\nisbell@cc.gatech.edu\n\nAbstract\n\nSchema learning is a way to discover probabilistic, constructivist, pre-\ndictive action models (schemas) from experience.\nIt includes meth-\nods for \ufb01nding and using hidden state to make predictions more accu-\nrate. We extend the original schema mechanism [1] to handle arbitrary\ndiscrete-valued sensors, improve the original learning criteria to handle\nPOMDP domains, and better maintain hidden state by using schema pre-\ndictions. These extensions show large improvement over the original\nschema mechanism in several rewardless POMDPs, and achieve very low\nprediction error in a dif\ufb01cult speech modeling task. Further, we compare\nextended schema learning to the recently introduced predictive state rep-\nresentations [2], and \ufb01nd their predictions of next-step action effects to\nbe approximately equal in accuracy. This work lays the foundation for a\nschema-based system of integrated learning and planning.\n\n1\n\nIntroduction\n\nSchema learning1 is a data-driven, constructivist approach for discovering probabilistic ac-\ntion models in dynamic controlled systems. Schemas, as described by Drescher [1], are\nprobabilistic units of cause and effect reminiscent of STRIPS operators [3]. A schema pre-\ndicts how speci\ufb01c sensor values will change as different actions are executed from within\nparticular sensory contexts. 
The learning mechanism also discovers hidden state features in order to make schema predictions more accurate.

In this work we have generalized and extended Drescher's original mechanism to learn more accurate predictions by using improved criteria both for discovery and refinement of schemas as well as for creation and maintenance of hidden state. While Drescher's work included mechanisms for action selection, here we focus exclusively on the problem of learning schemas and hidden state to accurately model the world. In several benchmark POMDPs, we show that our extended schema learner produces significantly better action models than the original. We also show that the extended learner performs well on a complex, noisy speech modeling task, and that its prediction accuracy is approximately equal to that of predictive state representations [2] on a set of POMDPs, with faster convergence.

^1 This use of the term schema derives from Piaget's usage in the 1950s; it bears no relation to database schemas or other uses of the term.

2 Schema Learning

Schema learning is a process of constructing probabilistic action models of the environment so that the effects of agent actions can be predicted. Formally, a schema learner is fitted with a set of sensors S = {s1, s2, . . .} and a set of actions A = {a1, a2, . . .} through which it can perceive and manipulate the environment. Sensor values are discrete: s_i^j means that s_i has value j. As it observes the effects of its actions on the environment, the learner constructs predictive units of sensorimotor cause and effect called schemas. A schema C --ai--> R essentially says, "If I take action ai in situation C, I will see result R." Schemas thus have three components: (1) the context C = {c1, c2, . . .
, cn}, which is a set of sensor conditions ci ≡ s_k^j that must hold for the schema to be applicable, (2) the action that is taken, and (3) the result, which is a set of sensor conditions R = {r1, r2, . . . , rm} predicted to follow the action. A schema is said to be applicable if its context conditions are satisfied, activated if it is applicable and its action is taken, and to succeed if it is activated and the predicted result is observed. Schema quality is measured by reliability, which is the probability that activation culminates in success: Rel(C --ai--> R) = prob(R_{t+1} | C_t, a_i(t)). Note that schemas are not rules telling an agent what to do; rather, they are descriptions of what will happen if the agent takes a particular action in a specific circumstance. Also note that schema learning has no predefined states such as those found in a POMDP or HMM; the set of sensor readings is the state. Because one schema's result can set up another schema's context, schemas fit naturally into a planning paradigm in which they are chained from the current situation to reach sensor-defined goals.

2.1 Discovery and Refinement

Schema learning comprises two basic phases: discovery, in which context-free action/result schemas are found, and refinement, in which context is added to increase reliability.
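The schema components and reliability statistic defined above can be sketched as a small data structure. This is a minimal illustration under our own naming conventions; the class and field names are not part of the original mechanism.

```python
from dataclasses import dataclass

@dataclass
class Schema:
    """A schema C --a--> R: in context C, action a is predicted to yield result R.

    Context and result are frozensets of (sensor_index, value) conditions; an
    empty context means the schema is applicable in any situation.
    """
    context: frozenset
    action: int
    result: frozenset
    activations: int = 0
    successes: int = 0

    def applicable(self, obs):
        """True if every context condition holds in obs (sensor_index -> value)."""
        return all(obs.get(s) == v for s, v in self.context)

    def record(self, activated, succeeded):
        """Update success statistics after one timestep."""
        if activated:
            self.activations += 1
            if succeeded:
                self.successes += 1

    @property
    def reliability(self):
        """Rel(C --a--> R): empirical prob(success | activation)."""
        return self.successes / self.activations if self.activations else 0.0
```

A schema with context {(0, 1)} is applicable only when sensor 0 reads value 1; its reliability is simply the fraction of activations that culminated in the predicted result.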
In discovery, statistics track the influence of each action ai on each sensor condition s_r^j. Drescher's original schema mechanism accommodated only binary-valued sensors, but we have generalized it to allow a heterogeneous set of sensors that take on arbitrary discrete values. In the present work, we assume that the effects of actions are observed on the subsequent timestep, which leads to the following criterion for discovering action effects:

    count(a_t, s_r^j(t+1)) > θ_d,    (1)

where θ_d is a noise-filtering threshold. If this criterion is met, the learner constructs a schema ∅ --ai--> s_r^j, where the empty set, ∅, means that the schema is applicable in any situation. This works in a POMDP because it means that executing ai in some state has caused sensor s_r to give observation j, implying that such a transition exists in the underlying (but unknown) system model. The presumption is that we can later learn what sensory context makes this transition reliable. Drescher's original discovery criterion generalizes in the non-binary case to:

    prob(s_r^j(t+1) | a_t) / prob(s_r^j(t+1) | ā_t) > θ_od,    (2)

where θ_od > 1 and ā_t means a was not taken at time t. Experiments in worlds of known structure show that this criterion misses many true action effects.

When a schema is discovered, it has no context, so its reliability may be low if the effect occurs only in particular situations.
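The counting criterion of Eq. (1) can be sketched as a simple co-occurrence tracker that proposes a context-free schema the first time a count clears the noise threshold θ_d. This is a hypothetical sketch; the class name, threshold value, and schema encoding as (context, action, result) tuples are our own.

```python
from collections import Counter

class DiscoveryTracker:
    """Tracks counts of (action at t, sensor condition at t+1) pairs and
    proposes a context-free schema once a count exceeds theta_d, as in Eq. (1)."""

    def __init__(self, theta_d=10):
        self.theta_d = theta_d
        self.counts = Counter()   # (action, sensor, value) -> co-occurrence count
        self.discovered = set()   # keys for which a schema was already proposed

    def observe(self, action, next_obs):
        """next_obs maps sensor index -> value seen on the step after `action`.

        Returns any newly discovered context-free schemas as
        (empty context, action, result) tuples."""
        new_schemas = []
        for sensor, value in next_obs.items():
            key = (action, sensor, value)
            self.counts[key] += 1
            if self.counts[key] > self.theta_d and key not in self.discovered:
                self.discovered.add(key)
                # Empty context: the schema is applicable in any situation.
                new_schemas.append(
                    (frozenset(), action, frozenset({(sensor, value)})))
        return new_schemas
```

Because `discovered` remembers which effects already spawned a schema, a persistent action effect yields exactly one context-free schema no matter how often it recurs; context refinement is then responsible for making that schema reliable.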
Schemas therefore begin to look for context conditions that increase reliability. The criterion for adding s_c^j to the context of C --ai--> R is:

    Rel(C ∪ {s_c^j} --ai--> R) / Rel(C --ai--> R) > θ_c,    (3)

where θ_c > 1. In practice we have found it necessary to anneal θ_c to avoid adding spurious context. Once the criterion is met, a child schema C′ --ai--> R is formed, where C′ = C ∪ {s_c^j}.

Criterion: Extended Schema Learner vs. Original Schema Learner
  Discovery. Extended: count(a_t, s_r^j(t+1)) > θ_d. Original: prob(s_r^j(t+1) | a_t) / prob(s_r^j(t+1) | ā_t) > θ_od; binary sensors only.
  Refinement. Extended: Rel(C ∪ {s_c^j} --ai--> R) / Rel(C --ai--> R) > θ, annealed threshold. Original: the same ratio with a static threshold; binary sensors only.
  Synthetic item creation. Extended: 0 < Rel(C --ai--> R) < θ and no context refinement possible. Original: 0 < Rel(C --ai--> R) < θ and schema is locally consistent.
  Synthetic item maintenance. Extended: predicted by other schemas. Original: average duration.

Table 1: Comparison of extended and original schema learners.

2.2 Synthetic Items

In addition to basic discovery and refinement of schemas, a schema learner also discovers hidden state. Consider the case where no context conditions are found to make a schema reliable. There must be unperceived environmental factors on which the schema's reliability depends (see [4]). The schema learner therefore creates a new binary-valued virtual sensor, called a synthetic item, to represent the presence of conditions in the environment that allow the schema to succeed.
This addresses the state aliasing problem by splitting the state space into two parts, one where the schema succeeds, and one where it does not. Synthetic items are said to reify the host schemas whose success conditions they represent; they have value 1 if the host schema would succeed if activated, and value 0 otherwise. Upon creation, a synthetic item begins to act as a normal sensor, with one exception: the agent has no way of directly perceiving its value. Creation and state maintenance criteria thus emerge as the main problems associated with synthetic items.

Drescher originally posited two conditions for the creation of a synthetic item: (1) a schema must be unreliable, and (2) the schema must be locally consistent, meaning that if it succeeds once, it has a high probability of succeeding again if activated soon afterward. The second of these conditions formalizes the assumption that a well-behaved environment has persistence and does not tend to radically change from moment to moment. This was motivated by the desire to capture Piagetian "conservation phenomena." While well-motivated, we have found that the second condition is simply too restrictive. Our criterion for creating synthetic items is 0 < Rel(C --ai--> R) < θ_r, subject to the constraint that the statistics governing possible additional context conditions have converged. When this criterion is met, a synthetic item is created and is thenceforth treated as a normal sensor, able to be incorporated into the contexts and results of other schemas.

A newly created synthetic item is grounded: it represents whatever conditions in the world allow the host schema to succeed when activated. Thus, upon activation of the host schema, we retroactively know the state of the synthetic item at the time of activation (1 if the schema succeeded, 0 otherwise). Because the synthetic item is treated as a sensor, we can

Figure 1: Benchmark problems.
(left) The flip system. All transitions are deterministic. (right) The float/reset system. Dashed lines represent float transitions that happen with probability 0.5, while solid lines represent deterministic reset transitions.

discover which previous actions led to each synthetic item state, and the synthetic item can come to be included as a result condition in new schemas. Once we have reliable schemas that predict the state of a synthetic item, we can begin to know its state non-retroactively, without having to activate the host schema. The synthetic item's state can potentially be known just as well as that of the regular sensors, and its addition expands the state representation in just such a way as to make sensory predictions more reliable. Predicted synthetic item state implicitly summarizes the relevant preceding history: it indicates that one of the schemas that predicts it was just activated. If the predicting schema also has a synthetic item in its context, an additional step of history is implied. Such chaining allows synthetic items to summarize arbitrary amounts of history without explicitly remembering any of it. This use of schemas to predict synthetic item state is in contrast to [1], which relied on the average duration of synthetic item states in order to predict them. Table 1 compares our extended schema learning criteria with Drescher's original criteria.

3 Empirical Evaluation

In order to test the advantages of the extended learning criteria, we compared four versions of schema learning. The first two were basic learners that made no use of synthetic items, but discovered and refined schemas using our extended criteria in one case, and the direct generalizations of Drescher's original criteria in the other.
The second pair added the extended and original synthetic item mechanisms, respectively, to the first pair.

Our first experimental domains are based on those used in [5]. They have a mixture of transient and persistent hidden state and, though small, are non-trivial.^2 The flip system is shown on the left in Figure 1; it features deterministic transitions, hidden state, and a null action that confounds simplistic history approaches to handling hidden state. The float/reset system is illustrated on the right side of Figure 1; it features both deterministic and stochastic transitions, as well as a more complicated hidden state structure. Finally, we use a modified float/reset system in which the f action from the two right-most states leads deterministically to their left neighbor; this reveals more about the hidden state structure.

To test predictive power, each schema learner, upon taking an action, uses the most reliable of all activated schemas to predict what the next value of each sensor will be. If there is no activation of a reliable schema to predict the value of a particular sensor, its value is predicted to stay constant. Error is measured as the fraction of incorrect predictions.

In these experiments, actions were chosen uniformly at random, and learning was allowed to continue throughout.^3 No learning parameters are changed over time; schemas stop being created when discovery and refinement criteria cease to generate them. Figure 2 shows the performance in each domain, while Table 2 summarizes the average error.

^2 E.g.
[5] showed that flip is non-trivial because it cannot be modeled exactly by k-Markov models, and its EM-trained POMDP representations require far more than the minimum number of states.

^3 Note that because a prediction is made before each observation, the observation does not contribute to the learning upon which its predicted value is based.

[Figure 2: four panels (flip, float/reset, modified float/reset, speech modeling), each plotting prediction error against timesteps from 0 to 10,000. The first three panels compare the extended and original learners with their baselines; the speech modeling panel compares the 2-context and 3-context schema learners with a weather predictor.]

Figure 2: Prediction error in several domains. The x-axis represents timesteps and the y-axis represents error. Each point represents average error over 100 timesteps.
In the speech modeling graph, learning is stopped after approximately 4300 timesteps (shown by the vertical line), after which no schemas are added, though reliabilities continue to be updated.

Learner              flip    float/reset   modified f/r
Extended             0.020   0.136         0.00716
Extended baseline    0.331   0.136         0.128
Original             0.426   0.140         0.299
Original baseline    0.399   0.139         0.315

Table 2: Average error. Calculated over 10 independent runs of 10,000 timesteps each.

3.1 Speech Modeling

The Japanese vowel dataset [6] contains time-series recordings of nine Japanese speakers uttering the ae vowel combination 54-118 times. Each data point consists of 12 continuous-valued cepstral coefficients, which we transform into 12 sensors with five discrete values each. The data is noisy and the dynamics are non-stationary between speakers. Each utterance is divided in half, with the first half treated as the action of speaking a and the latter half as e. In order to more quickly adapt to discontinuity resulting from changes in speaker, reliability was calculated using an exponential weighting of more recent observations; each relevant probability p was updated according to:

    p_{t+1} = α p_t + (1 − α) · {1 if event occurred at time t; 0 otherwise}.    (4)

The parameter α is set equal to the current prediction accuracy so that decreased accuracy causes faster adaptation. Several modifications were necessary for tractability: (1) schemas whose reliability fell below a threshold of their parents' reliability were removed, (2) context sizes were, on separate experimental runs, restricted to two and three items, and (3) the synthetic item mechanisms were deactivated. Figure 2 displays results for this learner compared to a baseline weather predictor.^4

3.2 Analysis

In each benchmark problem, the learners drop to minimum error after no more than 1000 timesteps.
Large divergence in the curves corresponds to the creation of synthetic items and the discovery of schemas that predict synthetic item state. Small divergence corresponds to differences in discovery and refinement criteria. In flip and modified float/reset, the extended schema learner reaches zero error, having a complete model of the hidden state, and outperforms all other learners, while the extended basic version outperforms both original learners. In float/reset, all learners perform approximately equally, reflecting the fact that, given the hidden stochasticity of this system, the best schema for action r is one that, without reference to synthetic items, gives a prediction of 1. Surprisingly, the original learner never significantly outperformed its baseline, and even performed worse than the baseline in flip. This is accounted for by the duration-based maintenance of synthetic items, which causes the original learner to maintain transient synthetic item state longer than it should. Prediction-based synthetic item maintenance overcomes this limitation.

The speech modeling results show that schema learning can induce high-quality action models in a complex, noisy domain. With a maximum of three context conditions, it averaged only 1.2% error while learning, and 1.6% after learning stopped, a large improvement over the 30.3% error of the baseline weather predictor. Note that allowing three instead of two context conditions dropped the error from 4.6% to 1.2% and from 9.0% to 1.6% in the training and testing phases, respectively, demonstrating the importance of incremental specialization of schemas through context refinement.

All together, these results show that our extended schema learner produces better action models than the original, and can handle more complex domains.
Synthetic items are seen to effectively model hidden state, and prediction-based maintenance of synthetic item state is shown to be more accurate than duration-based maintenance in POMDPs. Discovery of schemas is improved by our criterion, missing fewer legitimate schemas, and therefore producing more accurate predictions. Refinement using the annealed generalization of the original criterion performs correctly with a lower false positive rate.

4 Comparison to Predictive State Representations

Predictive state representations (PSRs; [2]), like schema learning, are based on grounded, sensorimotor predictions that uncover hidden state. Instead of schemas, PSRs rely on the notion of tests. A test q is a series of alternating actions and observations a_0 o_0 a_1 o_1 . . . a_n o_n. In a PSR, the environment state is represented as the probabilities that each of a set of core tests would yield its observations if its actions were executed. These probabilities are updated at each timestep by combining the current state with the new action/observation pair. In this way, the PSR implicitly contains a sufficient history-based statistic for prediction, and should overcome aliasing relative to immediate observations. [2] shows that linear PSRs are at least as compact and general as POMDPs, while [5] shows that PSRs can learn to accurately maintain their state in several POMDP problems.

A schema is similar to a one-step PSR test, and schema reliability roughly corresponds to the probability of a PSR test. Schemas differ, however, in that they only specify context and result incrementally, incorporating incremental history via synthetic items, while PSR tests incorporate the complete history and full observations (i.e.
all sensor readings at once) into a test probability.

^4 A weather predictor always predicts that values will stay the same as they are presently.

Problem       PSR       Schema Learner   Difference   Schema Learning Steps
flip          0         0                0            10,000
float/reset   0.11496   0.13369          0.01873      10,000
network       0.04693   0.06457          0.01764      10,000
paint         0.20152   0.21051          0.00899      30,000

Table 3: Prediction error for PSRs and schema learning on several POMDPs. Error is averaged over 10 epochs of 10,000 timesteps each. Performance differs by less than 2% in every case.

A multi-step test can say more about the current state than a schema, but is not as useful for regression planning because there is no way to extract the probability that a particular one of its observations will be obtained. Thus, PSRs are more useful as Markovian state for reinforcement learning, while schemas are useful for explicit planning. Note that synthetic items and PSR core test probabilities both attempt to capture a sufficient history statistic without explicitly maintaining history. This suggests a deeper connection between the two approaches, but the relationship has yet to be formalized.

We compared the predictive performance of PSRs with that of schema learning on some of the POMDPs from [5]. One-step PSR core tests can be used to predict observations: as an action is taken, the probability of each observation is the probability of the one-step core test that uses the current action and terminates in that observation. We choose the most probable observation as the PSR prediction. This allows us to evaluate PSR predictions using the same error measure (fraction of incorrect predictions) as in schema learning.^5 In our experiments, the extended schema learner was first allowed to learn until it reached an asymptotic minimum error (no longer than 30,000 steps).
Learning was then deactivated, and the schema learner and PSR each made predictions over a series of randomly chosen actions. Table 3 presents the average performance for each approach.

Learning PSR parameters required 1-10 million timesteps [5], while schema learning used no more than 30,000 steps. Also, learning PSR parameters required access to the underlying POMDP [5], whereas schema learning relies solely on sensorimotor information.

5 Related Work

Aside from PSRs, schema learning is also similar to older work in learning planning operators, most notably that of Wang [7], Gil [8], and Shen [9]. These approaches use observations to learn classical, deterministic STRIPS-like operators in predicate logic environments. Unlike schema learning, they make the strong assumption that the environment does not produce noisy observations. Wang and Gil further assume no perceptual aliasing.

Other work in this area has attempted to handle noise, but only in the problem of context refinement. Benson [10] gives his learner prior knowledge about action effects, and the learner finds conditions to make the effects reliable with some tolerance for noise. One advantage of Benson's formalism is that his operators are durational, rather than atomic over a single timestep. Balac et al. [11] use regression trees to find regions of noisy, continuous sensor space that cause a specified action to vary in the degree of its effect.

Finally, Shen [9] and McCallum [12] have mechanisms for handling state aliasing. Shen uses differences in successful and failed predictions to identify pieces of history that reveal hidden state. His approach, however, is completely noise intolerant. McCallum's UTree algorithm selectively adds pieces of history in order to maximize prediction of reward.

^5 Unfortunately, not all the POMDPs from [5] had one-step core tests to cover the probability of every observation given every action.
We restricted our comparisons to the four systems that had at least two actions for which the probability of all next-step observations could be determined.

This bears a strong resemblance to the history represented by chains of synthetic items, a connection that should be explored more fully. Synthetic items, however, are for general sensor prediction, which contrasts with UTree's task-specific focus on reward prediction. Schema learning, PSRs, and the UTree algorithm are all highly related in this sense of selectively tracking history information to improve predictive performance.

6 Discussion and Future Work

We have shown that our extended schema learner produces accurate action models for a variety of POMDP systems and for a complex speech modeling task. The extended schema learner performs substantially better than the original, and compares favorably in predictive power to PSRs while appearing to learn much faster. Building probabilistic goal-regression planning on top of the schemas is a logical next step; however, to succeed with real-world planning problems, we believe that we need to extend the learning mechanism in several ways. For example, the schema learner must explicitly handle actions whose effects occur over an extended duration instead of after one timestep. The learner should also be able to directly handle continuous-valued sensors. Finally, the current mechanism has no means of abstracting similar schemas, e.g., to reduce x_1^1 --a--> x_1^2 and x_1^2 --a--> x_1^3 to x_1^p --a--> x_1^(p+1).

Acknowledgements

Thanks to Satinder Singh and Michael R. James for providing POMDP PSR parameters.

References

[1] G. Drescher. Made-up minds: a constructivist approach to artificial intelligence. MIT Press, 1991.

[2] M. L. Littman, R. S. Sutton, and S. Singh. Predictive representations of state.
In Advances in Neural Information Processing Systems, pages 1555-1561. MIT Press, 2002.

[3] R. E. Fikes and N. J. Nilsson. STRIPS: a new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2:189-208, 1971.

[4] C. T. Morrison, T. Oates, and G. King. Grounding the unobservable in the observable: the role and representation of hidden state in concept formation and refinement. In AAAI Spring Symposium on Learning Grounded Representations, pages 45-49. AAAI Press, 2001.

[5] S. Singh, M. L. Littman, N. K. Jong, D. Pardoe, and P. Stone. Learning predictive state representations. In International Conference on Machine Learning, pages 712-719. AAAI Press, 2003.

[6] M. Kudo, J. Toyama, and M. Shimbo. Multidimensional curve classification using passing-through regions. Pattern Recognition Letters, 20(11-13):1103-1111, 1999.

[7] X. Wang. Learning by observation and practice: An incremental approach for planning operator acquisition. In International Conference on Machine Learning, pages 549-557. AAAI Press, 1995.

[8] Y. Gil. Learning by experimentation: Incremental refinement of incomplete planning domains. In International Conference on Machine Learning, pages 87-95. AAAI Press, 1994.

[9] W.-M. Shen. Discovery as autonomous learning from the environment. Machine Learning, 12:143-165, 1993.

[10] S. Benson. Inductive learning of reactive action models. In International Conference on Machine Learning, pages 47-54. AAAI Press, 1995.

[11] N. Balac, D. M. Gaines, and D. Fisher. Using regression trees to learn action models. In IEEE Systems, Man and Cybernetics Conference, 2000.

[12] A. W. McCallum. Reinforcement Learning with Selective Perception and Hidden State.
PhD thesis, University of Rochester, 1995.
", "award": [], "sourceid": 2592, "authors": [{"given_name": "Michael", "family_name": "Holmes", "institution": null}, {"given_name": "Charles", "family_name": "Isbell", "institution": null}]}