{"title": "Autonomous Learning of Action Models for Planning", "book": "Advances in Neural Information Processing Systems", "page_first": 2465, "page_last": 2473, "abstract": "This paper introduces two new frameworks for learning action models for planning.  In the mistake-bounded planning framework, the learner has access to a planner for the given model representation, a simulator, and a planning problem generator, and aims to learn a model with at most a polynomial number of faulty plans.  In the planned exploration framework, the learner does not have access to a problem generator and must instead design its own problems, plan for them, and converge with at most a polynomial number of planning attempts.  The paper reduces learning in these frameworks to concept learning with one-sided error and provides algorithms for successful learning in both frameworks.  A specific family of hypothesis spaces is shown to be efficiently learnable in both the frameworks.", "full_text": "Autonomous Learning of Action Models for Planning\n\nNeville Mehta\n\nPrasad Tadepalli\n\nAlan Fern\n\nSchool of Electrical Engineering and Computer Science\n\nOregon State University, Corvallis, OR 97331, USA.\n{mehtane,tadepall,afern}@eecs.oregonstate.edu\n\nAbstract\n\nThis paper introduces two new frameworks for learning action models for planning.\nIn the mistake-bounded planning framework, the learner has access to a\nplanner for the given model representation, a simulator, and a planning problem\ngenerator, and aims to learn a model with at most a polynomial number of faulty\nplans. In the planned exploration framework, the learner does not have access to a\nproblem generator and must instead design its own problems, plan for them, and\nconverge with at most a polynomial number of planning attempts. The paper reduces\nlearning in these frameworks to concept learning with one-sided error and\nprovides algorithms for successful learning in both frameworks. 
A specific family\nof hypothesis spaces is shown to be efficiently learnable in both frameworks.\n\n1 Introduction\n\nPlanning research typically assumes that the planning system is provided with complete and correct models\nof the actions. However, truly autonomous agents must learn these models. Moreover, model\nlearning, planning, and plan execution must be interleaved, because agents need to plan long before\nperfect models are learned. This paper formulates and analyzes the learning of deterministic action\nmodels used in planning for goal achievement. It has been shown that deterministic STRIPS actions\nwith a constant number of preconditions can be learned from raw experience with at most a polynomial\nnumber of plan prediction mistakes [8]. In spite of this positive result, compact action models\nfor fully observable, deterministic domains are not always efficiently learnable. For example,\naction models represented as arbitrary Boolean functions are not efficiently learnable under standard\ncryptographic assumptions such as the hardness of factoring [2].\nLearning action models for planning is different from learning an arbitrary function from states\nand actions to next states, because one can ignore modeling the effects of some actions in certain\ncontexts. For example, most people who drive do not ever learn a complete model of the dynamics\nof their vehicles; while they might accurately know the stopping distance or turning radius, they\ncould be oblivious to many aspects that an expert auto mechanic is comfortable with. To capture\nthis intuition, we introduce the concept of an adequate model, that is, a model that is sound and\nsufficiently complete for planning for a given class of goals. 
When navigating a city, any spanning\ntree of the transportation network connecting the places of interest would be an adequate model.\nWe define two distinct frameworks for learning adequate models for planning and then characterize\nsufficient conditions for success in these frameworks. In the mistake-bounded planning (MBP)\nframework, the goal is to continually solve user-generated planning problems while learning action\nmodels and guarantee at most a polynomial number of faulty plans or mistakes. We assume that\nin addition to the problem generator, the learner has access to a sound and complete planner and\na simulator (or the real world). We also introduce a more demanding planned exploration (PLEX)\nframework, where the learner needs to generate its own problems to refine its action model. This\nrequirement translates to an experiment-design problem, where the learner needs to design problems\nin a goal language to refine the action models.\nThe MBP and PLEX frameworks can be reduced to over-general query learning, concept learning\nwith strictly one-sided error, where the learner is only allowed to make false positive mistakes [7].\nThis is ideally suited for the autonomous learning setting in which there is no oracle who can provide\npositive examples of plans or demonstrations, but negative examples are observed when the agent\u2019s\nplans fail to achieve their goals. We introduce mistake-bounded and exact learning versions of\nthis learning framework and show that they are strictly more powerful than the recently introduced\nKWIK framework [4]. We view an action model as a set of state-action-state transitions and ensure\nthat the learner always maintains a hypothesis which includes all transitions in some adequate model.\nThus, a sound plan is always in the learner\u2019s search space, while it may not always be generated. 
As\nthe learner gains more experience in generating plans, executing them on the simulator, and receiving\nobservations, the hypothesis is incrementally refined until an adequate model is discovered. To\nground our analysis, we show that a general family of hypothesis spaces is learnable in polynomial\ntime in the two frameworks given appropriate goal languages. This family includes a generalization\nof propositional STRIPS operators with conditional effects.\n\n2 Over-General Query Learning\n\nWe first introduce a variant of a concept-learning framework that serves as the formal underpinning\nof our model-learning frameworks. This variant is motivated by the principle of \u201coptimism under\nuncertainty\u201d, which is at the root of several related algorithms in reinforcement learning [1, 3].\nA concept is a set of instances. A hypothesis space H is a set of strings or hypotheses, each of\nwhich represents a concept. The size of a concept is the length of the smallest hypothesis that\nrepresents it. Without loss of generality, H can be structured as a (directed acyclic) generalization\ngraph, where the nodes correspond to sets of equivalent hypotheses representing a concept and there\nis a directed edge from node n1 to node n2 if and only if the concept at n1 is strictly more general\nthan (a strict superset of) that at n2.\nDefinition 2.1. The height of H is a function of n and is the length of the longest path from a root\nnode to any node representing concepts of size n in the generalization graph of H.\nDefinition 2.2. A hypothesis h is consistent with a set of negative examples Z if h \u2229 Z = \u2205. Given a\nset of negative examples Z consistent with a target hypothesis h, the version space of action models\nis the subset of all hypotheses in H that are consistent with Z and is denoted as M(Z).\nDefinition 2.3. 
H is well-structured if, for any negative example set Z which has a consistent target\nhypothesis in H, the version space M(Z) contains a most general hypothesis mgh(Z). Further,\nH is efficiently well-structured if there exists an algorithm that can compute mgh(Z \u222a {z}) from\nmgh(Z) and a new example z in time polynomial in the size of mgh(Z) and z.\nLemma 2.1. Any finite hypothesis space H is well-structured if and only if it is closed under union.\nProof. (If) Let Z be a set of negative examples and let H0 = \u22c3_{h\u2208M(Z)} h represent the unique union\nof all concepts represented by hypotheses in M(Z). Because H is closed under union and finite, H0\nmust be in H. If \u2203z \u2208 H0 \u2229 Z, then z \u2208 h \u2229 Z for some h \u2208 M(Z). This is a contradiction, because\nall h \u2208 M(Z) are consistent with Z. Consequently, H0 is consistent with Z, and is in M(Z). It is\nmore general than (is a superset of) every other hypothesis in M(Z) because it is their union.\n(Only if) Let h1, h2 be any two hypotheses in H and Z be the set of all instances not included in\neither h1 or h2. Both h1 and h2 are consistent with examples in Z. As H is well-structured,\nmgh(Z) must also be in the version space M(Z), and consequently in H. However, mgh(Z) =\nh1 \u222a h2 because it cannot include any element outside h1 \u222a h2 and must include all elements within.\nHence, h1 \u222a h2 is in H, which makes it closed under union.\nIn the over-general query (OGQ) framework, the teacher selects a target concept c \u2208 H. The\nlearner outputs a query in the form of a hypothesis h \u2208 H, where h must be at least as general\nas c. The teacher responds with yes if h \u2261 c and the episode ends; otherwise, the teacher gives a\ncounterexample x \u2208 h \u2212 c. The learner then outputs a new query, and the cycle repeats.\nDefinition 2.4. 
A hypothesis space is OGQ-learnable if there exists a learning algorithm for the\nOGQ framework that identifies the target c with a number of queries and a total running time that\nare polynomial in the size of c and the size of the largest counterexample.\nTheorem 1. H is learnable in the OGQ framework if and only if H is efficiently well-structured and\nits height is a polynomial function.\nProof. (If) If H is efficiently well-structured, then the OGQ learner can always output the mgh,\nguaranteed to be more general than the target concept, in polynomial time. Because the maximum\nnumber of hypothesis refinements is bounded by the polynomial height of H, it is learnable in the\nOGQ framework.\n(Only if) If H is not well-structured, then \u2203h1, h2 \u2208 H, h1 \u222a h2 \u2209 H. The teacher can delay picking\nits target concept, but always provide counterexamples from outside both h1 and h2. At some point,\nthese counterexamples will force the learner to choose between h1 and h2, because their union is not in\nthe hypothesis space. Once the learner makes its choice, the teacher can choose the other hypothesis\nas its target concept c, resulting in the learner\u2019s hypothesis not being more general than c. If H is\nnot efficiently well-structured, then there exist Z and z such that computing mgh(Z \u222a {z}) from\nmgh(Z) and a new example z cannot be done in polynomial time. If the teacher picks mgh(Z \u222a {z})\nas the target concept and only provides counterexamples from Z \u222a {z}, then the learner cannot have\npolynomial running time. Finally, the teacher can always provide counterexamples that force the\nlearner to take the longest path in H\u2019s generalization graph. 
Thus, if H does not have polynomial\nheight, then the number of queries will not be polynomial.\n\n2.1 A Comparison of Learning Frameworks\n\nIn order to compare the OGQ framework to other learning frameworks, we first define the over-general\nmistake-bounded (OGMB) learning framework, in which the teacher selects a target concept\nc from H and presents an arbitrary instance x from the instance space to the learner for a prediction.\nAn inclusion mistake is made when the learner predicts x \u2208 c although x \u2209 c; an exclusion mistake\nis made when the learner predicts x \u2209 c although x \u2208 c. The teacher presents the true label to the\nlearner if a mistake is made, and then presents the next instance to the learner, and so on.\nDefinition 2.5. A hypothesis space is OGMB-learnable if there exists a learning algorithm for the\nOGMB framework that never makes any exclusion mistakes and whose number of inclusion mistakes\nand running time on each instance are both bounded by polynomial functions of the size of the\ntarget concept and the size of the largest instance seen by the learner.\n\nIn the following analysis, we let the name of a framework denote the set of hypothesis spaces learnable\nin that framework.\nTheorem 2. OGQ \u228a OGMB.\nProof. We can construct an OGMB learner from the OGQ learner as follows. When the OGQ\nlearner makes a query h, we use h to make predictions for the OGMB learner. As h is guaranteed\nto be over-general, it never makes an exclusion mistake. Any instance x on which it makes an\ninclusion mistake must be in h \u2212 c and this is returned to the OGQ learner. The cycle repeats with\nthe OGQ learner providing a new query. Because the OGQ learner makes only a polynomial number\nof queries and takes polynomial time for query generation, the simulated OGMB learner makes only\na polynomial number of mistakes and runs in at most polynomial time per instance. 
The converse\ndoes not hold in general because the queries of the OGQ learner are restricted to be \u201cproper\u201d, that is,\nthey must belong to the given hypothesis space. While the OGMB learner can maintain the version\nspace of all consistent hypotheses of a polynomially-sized hypothesis space, the OGQ learner can\nonly query with a single hypothesis and there may not be any hypothesis that is guaranteed to be\nmore general than the target concept.\nIf the learner is allowed to ask queries outside H, such as queries of the form h1 \u222a \u00b7\u00b7\u00b7 \u222a hn for all\nhi in the version space, then over-general learning is possible. In general, if the learner is allowed\nto ask about any polynomially-sized, polynomial-time computable hypothesis, then it is as powerful\nas OGMB, because it can encode the computation of the OGMB learner inside a polynomial-sized\ncircuit and query with that as the hypothesis. We call this the OGQ+ framework and claim the\nfollowing theorem (the proof is straightforward).\nTheorem 3. OGQ+ = OGMB.\n\nThe Knows-What-It-Knows (KWIK) learning framework [4] is similar to the OGMB framework\nwith one key difference: it does not allow the learner to make any prediction when it does not know\nthe correct answer. In other words, the learner either makes a correct prediction or simply abstains\nfrom making a prediction and gets the true label from the teacher. The number of abstentions is\nbounded by a polynomial in the target size and the largest instance size. The set of hypothesis\nspaces learnable in the mistake-bound (MB) framework is a strict subset of that learnable in the\nprobably-approximately-correct (PAC) framework [5], leading to the following result.\nTheorem 4. KWIK \u228a OGMB \u228a MB \u228a PAC.\nProof. OGMB \u228a MB: Every hypothesis space that is OGMB-learnable is MB-learnable because the\nOGMB learner is additionally constrained to not make an exclusion mistake. 
However, not every MB-learnable\nhypothesis space is OGMB-learnable. Consider the hypothesis space of conjunctions\nof n Boolean literals (positive or negative). A single exclusion mistake is sufficient for an MB learner\nto learn this hypothesis space. In contrast, after making an inclusion mistake, the OGMB learner\ncan only exclude that example from the candidate set. As there is exactly one positive example, this\ncould force the OGMB learner to make an exponential number of mistakes (similar to guessing an\nunknown password).\nKWIK \u228a OGMB: If a concept class is KWIK-learnable, it is also OGMB-learnable: when the\nKWIK learner does not know the true label, the OGMB learner simply predicts that the instance is\npositive and gets corrected if it is wrong. However, not every OGMB-learnable hypothesis space is\nKWIK-learnable. Consider the hypothesis space of disjunctions of n Boolean literals. The OGMB\nlearner begins with a disjunction over all possible literals (both positive and negative) and hence\npredicts all instances as positive. A single inclusion mistake is sufficient for the OGMB learner\nto learn this hypothesis space. On the other hand, the teacher can supply the KWIK learner with\nan exponential number of positive examples, because the KWIK learner cannot ever know that the\ntarget does not include all possible instances; this implies that the number of abstentions is not\npolynomially bounded.\n\nThis theorem demonstrates that KWIK is too conservative a framework for model learning: any\nprediction that might be a mistake is disallowed. This makes it impossible to learn even simple\nconcept classes such as pure disjunctions.\n\n3 Planning Components\nA factored planning domain P is a tuple (V, D, A, T), where V = {v1, ..., vn} is the set of variables,\nD is the domain of the variables in V, and A is the set of actions. 
S = D^n represents the state\nspace and T \u2282 S \u00d7 A \u00d7 S is the transition relation, where (s, a, s\u2032) \u2208 T signifies that taking action a\nin state s results in state s\u2032. As we only consider learning deterministic action models, the transition\nrelation is in fact a function, although the learner\u2019s hypothesis space may include nondeterministic\nmodels. The domain parameters, n, |D|, and |A|, characterize the size of P and are implicit in all\nclaims of complexity in the rest of this paper.\nDefinition 3.1. An action model is a relation M \u2286 S \u00d7 A \u00d7 S.\nA planning problem is a pair (s0, g), where s0 \u2208 S and the goal condition g is an expression chosen\nfrom a goal language G and represents a set of states in which it evaluates to true. A state s satisfies\na goal g if and only if g is true in s. Given a planning problem (s0, g), a plan is a sequence of states\nand actions s0, a1, ..., ap, sp, where the state sp satisfies the goal g. The plan is sound with respect\nto (M, g) if (s_{i\u22121}, a_i, s_i) \u2208 M for 1 \u2264 i \u2264 p.\nDefinition 3.2. A planner for the hypothesis space and goal language pair (H,G) is an algorithm\nthat takes M \u2208 H and (s0, g \u2208 G) as inputs and outputs a plan or signals failure. It is sound with\nrespect to (H,G) if, given any M and (s0, g), it produces a sound plan with respect to (M, g) or\nsignals failure. It is complete with respect to (H,G) if, given any M and (s0, g), it produces a sound\nplan whenever one exists with respect to (M, g).\n\nWe generalize the definition of soundness from its standard usage in the literature in order to apply to\nnondeterministic action models, where the nondeterminism is \u201cangelic\u201d: the planner can control\nthe outcome of actions when multiple outcomes are possible according to its model [6]. 
One way to\nimplement such a planner is to do forward search through all possible action and outcome sequences\nand return an action sequence if it leads to a goal under some outcome choices. Our analysis is\nagnostic to plan quality or plan length and applies equally well to suboptimal planners. This is\nmotivated by the fact that optimal planning is hard for most domains, but suboptimal planning such\nas hierarchical planning can be quite efficient and practical.\nDefinition 3.3. A planning mistake occurs if either the planner signals failure when a sound plan\nexists with respect to the transition function T, or the plan output by the planner is not sound\nwith respect to T.\nDefinition 3.4. Let P be a planning domain and G be a goal language. An action model M is\nadequate for G in P if M \u2286 T and the existence of a sound plan with respect to (T, g \u2208 G) implies\nthe existence of a sound plan with respect to (M, g). H is adequate for G if \u2203M \u2208 H such that M\nis adequate for G.\nAn adequate model may be partial or incomplete in that it may not include every possible transition\nin the transition function T. However, the model is sufficient to produce a sound plan with respect to\n(T, g) for every goal g in the desired language. Thus, the more limited the goal language, the more\nincomplete the adequate model can be. In the example of a city map, if the goal language excludes\ncertain locations, then the spanning tree could exclude them as well, although not necessarily so.\nDefinition 3.5. A simulator of the domain is always situated in the current state s. 
It takes an action\na as input, transitions to the state s\u2032 resulting from executing a in s, and returns the current state s\u2032.\nGiven a goal language G, a problem generator generates an arbitrary problem (s0, g \u2208 G) and sets\nthe state of the simulator to s0.\n\n4 Mistake-Bounded Planning Framework\n\nThis section constructs the MBP framework that allows learning and planning to be interleaved for\nuser-generated problems. It actualizes the teacher of the OGQ framework by combining a problem\ngenerator, a planner, and a simulator, and interfaces with the OGQ learner to learn action models as\nhypotheses over the space of possible state transitions for each action. It turns out that the one-sided\nmistake property is needed for autonomous learning because the learner can only learn by generating\nplans and observing the results; if the learner ever makes an exclusion error, there is no guarantee of\nfinding a sound plan even when one exists and the learner cannot recover from such mistakes.\nDefinition 4.1. Let G be a goal language such that H is adequate for it. H is learnable in the MBP\nframework if there exists an algorithm A that interacts with a problem generator over G, a sound\nand complete planner with respect to (H,G), and a simulator of the planning domain P, and outputs\na plan or signals failure for each planning problem while guaranteeing at most a polynomial number\nof planning mistakes. Further, A must respond in time polynomial in the domain parameters and the\nlength of the longest plan generated by the planner, assuming that a call to the planner, simulator,\nor problem generator takes O(1) time.\n\nThe goal language is picked such that the hypothesis space is adequate for it. We cannot bound the\ntime for the convergence of A, because there is no limit on when the mistakes are made.\nTheorem 5. H is learnable in the MBP framework if H is OGQ-learnable.\nProof. 
Algorithm 1 is a general schema for action model learning in the MBP framework. The model M begins with the initial query from OGQ-LEARNER. PROBLEMGENERATOR provides a planning problem and initializes the current state of SIMULATOR. Given M and the planning problem, PLANNER always outputs a plan if one exists because H is adequate for G (it contains a \u201ctarget\u201d adequate model) and M is at least as general as every adequate model. If PLANNER signals failure, then no plan exists for the problem. Otherwise, the plan is executed through SIMULATOR until an observed transition conflicts with the predicted transition. If such a transition is found, it is supplied to OGQ-LEARNER and M is updated with the next query; otherwise, the plan is output. If H is OGQ-learnable, then OGQ-LEARNER will only be called a polynomial number of times, every call taking polynomial time. As the number of planning mistakes is polynomial and every response of Algorithm 1 is polynomial in the runtime of OGQ-LEARNER and the length of the longest plan, H is learnable in the MBP framework.\n\nAlgorithm 1 MBP LEARNING SCHEMA\nInput: Goal language G\nM \u2190 OGQ-LEARNER()  // Initial query\nloop\n  (s, g) \u2190 PROBLEMGENERATOR(G)\n  plan \u2190 PLANNER(M, (s, g))\n  if plan \u2260 false then\n    for (\u015d, a, \u015d\u2032) in plan do\n      s\u2032 \u2190 SIMULATOR(a)\n      if s\u2032 \u2260 \u015d\u2032 then\n        M \u2190 OGQ-LEARNER((s, a, \u015d\u2032))\n        print mistake\n        break\n      s \u2190 s\u2032\n    if no mistake then\n      print plan\n\nThe above result generalizes the work on learning STRIPS operator models from raw experience\n(without a teacher) in [8] to arbitrary hypothesis spaces by identifying sufficient conditions. (A\nfamily of hypothesis spaces considered later in this paper subsumes propositional STRIPS by capturing\nconditional effects.) 
It also clarifies the notion of an adequate model, which can be much\nsimpler than the true transition model, and the influence of the goal language on the complexity of\nlearning action models.\n\n5 Planned Exploration Framework\n\nThe MBP framework is appropriate when mistakes are permissible on user-given problems as long\nas their total number is limited, but not for cases where no mistakes are permitted after the training\nperiod. In the planned exploration (PLEX) framework, the agent seeks to learn an action model for\nthe domain without an external problem generator by generating planning problems for itself. The\nkey issue here is to generate a reasonably small number of planning problems such that solving them\nwould identify a deterministic action model. Learning a model in the PLEX framework involves\nknowing where it is deficient and then planning to reach states that are informative, which entails\nformulating planning problems in a goal language. This framework provides a polynomial sample\nconvergence guarantee, which is stronger than the polynomial mistake bound of the MBP framework.\nWithout a problem generator that can change the simulator\u2019s state, it is impossible for the simulator\nto transition freely between strongly connected components (SCCs) of the transition graph. Hence,\nwe make the assumption that the transition graph is a disconnected union of SCCs and require only\nthat the agent learn the model for a single SCC that contains the initial state of the simulator.\nDefinition 5.1. Let P be a planning domain whose transition graph is a union of SCCs. (H,G)\nis learnable in the PLEX framework if there exists an algorithm A that interacts with a sound and\ncomplete planner with respect to (H,G) and the simulator for P and outputs a model M \u2208 H that\nis adequate for G within the SCC that contains the initial state s0 of the simulator after a polynomial\nnumber of planning attempts. 
Further, A must run in polynomial time in the domain parameters and\nthe length of the longest plan output by the planner, assuming that every call to the planner and the\nsimulator takes O(1) time.\n\nA key step in planned exploration is designing appropriate planning problems. We call these experiments\nbecause the goal of solving these problems is to disambiguate nondeterministic action\nmodels. In particular, the agent tries to reach an informative state where the current model is nondeterministic.\nDefinition 5.2. Given a model M, the set of informative states is I(M) = {s : (s, a, s\u2032), (s, a, s\u2033) \u2208\nM \u2227 s\u2032 \u2260 s\u2033}, where a is said to be informative in s.\n\nDefinition 5.3. A set of goals G is a cover of a set of states R if \u22c3_{g\u2208G} {s : s satisfies g} = R.\n\nGiven the goal language \ud835\udca2 (written G in the preceding sections) and a model M, the problem of experiment design is to find a set of goals\nG \u2286 \ud835\udca2 such that the sets of states that satisfy the goals in G collectively cover all informative states\nI(M). If it is possible to plan to achieve one of these goals, then either the plan passes through a\nstate where the model is nondeterministic or it executes successfully and the agent reaches the final\ngoal state; in either case, an informative action can be executed and the observed transition is used\nto refine the model. If none of the goals in G can be successfully planned for, then no informative\nstates for that action are reachable. We formalize these intuitions below.\nDefinition 5.4. The width of (H,\ud835\udca2) is defined as max_{M\u2208H} min_{G\u2286\ud835\udca2 : G is a cover of I(M)} |G|, where min_G |G| =\n\u221e if there is no G \u2286 \ud835\udca2 to cover a nonempty I(M).\nDefinition 5.5. 
(H,\ud835\udca2) permits efficient experiment design if, for any M \u2208 H, (1) there exists an\nalgorithm (EXPERIMENTDESIGN) that takes M and \ud835\udca2 as input and outputs a polynomial-sized\ncover of I(M) in polynomial time and (2) there exists an algorithm (INFOACTIONSTATES) that\ntakes M and a state s as input and outputs an informative action and two (distinct) predicted next\nstates according to M in polynomial time.\nIf (H,\ud835\udca2) permits efficient experiment design, then it has polynomial width because no algorithm\ncan always guarantee to output a polynomial-sized cover otherwise.\nTheorem 6. (H,\ud835\udca2) is learnable in the PLEX framework if it permits efficient experiment design,\nand H is adequate for \ud835\udca2 and is OGQ-learnable.\n\nAlgorithm 2 PLEX LEARNING SCHEMA\nInput: Initial state s, goal language \ud835\udca2\nOutput: Model M\nM \u2190 OGQ-LEARNER()  // Initial query\nloop\n  G \u2190 EXPERIMENTDESIGN(M, \ud835\udca2)\n  if G = \u2205 then\n    return M\n  for g \u2208 G do\n    plan \u2190 PLANNER(M, (s, g))\n    if plan \u2260 false then\n      break\n  if plan = false then\n    return M\n  for (\u015d, a, \u015d\u2032) in plan do\n    s\u2032 \u2190 SIMULATOR(a)\n    s \u2190 s\u2032\n    if s\u2032 \u2260 \u015d\u2032 then\n      M \u2190 OGQ-LEARNER((s, a, \u015d\u2032))\n      break\n  if M has not been updated then\n    (a, \u015c\u2032) \u2190 INFOACTIONSTATES(M, s)\n    s\u2032 \u2190 SIMULATOR(a)\n    M \u2190 OGQ-LEARNER((s, a, \u015d\u2032 \u2208 \u015c\u2032 \u2212 {s\u2032}))\n    s \u2190 s\u2032\nreturn M\n\nProof. Algorithm 2 is a general schema for action model learning in the PLEX framework. The model M begins with the initial query from OGQ-LEARNER. Given M and \ud835\udca2, EXPERIMENTDESIGN computes a polynomial-sized cover G. 
If G is empty, then the model cannot be refined further; otherwise, given M and a goal g \u2208 G, PLANNER may signal failure if either no state satisfies g or states satisfying g are not reachable from the current state of the simulator. If PLANNER signals failure on all of the goals, then none of the informative states are reachable and M cannot be refined further. If PLANNER does output a plan, then the plan is executed through SIMULATOR until an observed transition conflicts with the predicted transition. If such a transition is found, it is supplied to OGQ-LEARNER and M is updated with the next query. If the plan executes successfully, then INFOACTIONSTATES provides an informative action with the corresponding set of two resultant states according to M; OGQ-LEARNER is supplied with the transition of the goal state, the informative action, and the incorrectly predicted next state, and M is updated with the new query. A new cover is computed every time M is updated, and the process continues until all experiments are exhausted. If (H,\ud835\udca2) permits efficient experiment design, then every cover can be computed in polynomial time and INFOACTIONSTATES is efficient. If H is OGQ-learnable, then OGQ-LEARNER will only be called a polynomial number of times and it can output a new query in polynomial time. As the number of failures per successful plan is bounded by a polynomial in the width w of (H,\ud835\udca2), the total number of calls to PLANNER is polynomial. Further, as the innermost loop of Algorithm 2 is bounded by the longest length l of a plan, its running time is a polynomial in the domain parameters and l. Thus, (H,\ud835\udca2) is learnable in the PLEX framework.\n\n6 A Hypothesis Family for Action Modeling\n\nThis section proves the learnability of a hypothesis-space family for action modeling in the MBP and\nPLEX frameworks. Let U = {u1, u2, ...} be a polynomial-sized set of polynomially computable\nbasis hypotheses (polynomial in the relevant parameters), where each ui represents a deterministic set of\ntransition tuples. 
Let Power(U) = {\u22c3_{u\u2208H} u : H \u2286 U}, where H here ranges over subsets of U, and Pairs(U) = {u1 \u222a u2 : u1, u2 \u2208 U}.\nLemma 6.1. Power(U) is OGQ-learnable.\nProof. Power(U) is efficiently well-structured, because it is closed under union by definition and\nthe new mgh can be computed by removing any basis hypotheses that are not consistent with the\ncounterexample; this takes polynomial time as U is of polynomial size. At the root of the generalization\ngraph of Power(U) is the hypothesis \u22c3_{u\u2208U} u and at the leaf is the empty hypothesis. Because\nU is of polynomial size and the longest path from the root to the leaf involves removing a single\ncomponent at a time, the height of Power(U) is polynomial.\nLemma 6.2. Power(U) is learnable in the MBP framework.\nProof. This follows from Lemma 6.1 and Theorem 5.\nLemma 6.3. For any goal language \ud835\udca2, (Power(U), \ud835\udca2) permits efficient experiment design if\n(Pairs(U), \ud835\udca2) permits efficient experiment design.\nProof. Any informative state for a hypothesis in Power(U) is an informative state for some hypothesis\nin Pairs(U), and vice versa. Hence, a cover for (Pairs(U), \ud835\udca2) would be a cover for (Power(U), \ud835\udca2).\nConsequently, if (Pairs(U), \ud835\udca2) permits efficient experiment design, then the efficient algorithms EXPERIMENTDESIGN\nand INFOACTIONSTATES are directly applicable to (Power(U), \ud835\udca2).\nLemma 6.4. For any goal language \ud835\udca2, (Power(U), \ud835\udca2) is learnable in the PLEX framework if\n(Pairs(U), \ud835\udca2) permits efficient experiment design and Power(U) is adequate for \ud835\udca2.\nProof. This follows from Lemmas 6.1 and 6.3, and Theorem 6.\n\nWe now define a hypothesis space that is a concrete member of the family. 
Let an action production r be defined as \u201cact : pre \u2192 post\u201d, where act(r) is an action and the precondition pre(r) and postcondition post(r) are conjunctions of \u201cvariable = value\u201d literals.\nDefinition 6.1. A production r is triggered by a transition (s, a, s\u2032) if s satisfies the precondition pre(r) and a = act(r). A production r is consistent with (s, a, s\u2032) if either (1) r is not triggered by (s, a, s\u2032) or (2) s\u2032 satisfies post(r) and all variables not mentioned in post(r) have the same values in both s and s\u2032.\n\nA production represents the set of all consistent transitions that trigger it. All the variables in pre(r) must take their specified values in a state to trigger r; when r is triggered, post(r) defines the values in the next state. An example of an action production is \u201cDo : v1 = 0, v2 = 1 \u2192 v1 = 2, v3 = 1\u201d. It is triggered only when the Do action is executed in a state in which v1 = 0 and v2 = 1, and defines the value of v1 to be 2 and v3 to be 1 in the next state, with all other variables staying unchanged.\nLet k-SAP be the hypothesis space of models represented by a set of action productions (SAP) with no more than k variables per production. If U is the set of productions, then |U| = O(|A| \u2211_{i=1}^{k} (n choose i) (|D| + 1)^{2i}) = O(|A| n^k |D|^{2k}), because a production can have one of |A| actions, up to k relevant variables figuring on either side of the production, and each variable set to a value in its domain. As U is of polynomial size, k-SAP is an instance of the family of basis action models. Moreover, if Conj is the goal language consisting of all goals that can be expressed as conjunctions of \u201cvariable = value\u201d literals, then (Pairs(k-SAP), Conj) permits efficient experiment design.\nLemma 6.5. 
(k-SAP, Conj) is learnable in the PLEX framework if k-SAP is adequate for Conj.\n\n7 Conclusion\n\nThe main contributions of the paper are the development of the MBP and PLEX frameworks for learning action models and the characterization of sufficient conditions for efficient learning in these frameworks. It also provides results on learning a family of hypothesis spaces that is, in some ways, more general than standard action modeling languages. For example, unlike propositional STRIPS operators, k-SAP captures the conditional effects of actions.\nWhile STRIPS-like languages served us well in planning research by creating a common useful platform, they are not designed from the point of view of learnability or planning efficiency. Many domains such as robotics and real-time strategy games are not amenable to such clean and simple action specification languages. This suggests an approach in which the learner considers increasingly complex models as dictated by its planning needs. For example, the model learner might start with small values of k in k-SAP and then incrementally increase k until a value is found that is adequate for the goals encountered. In general, this motivates a more comprehensive framework in which planning and learning are tightly integrated, which is the premise of this paper. Another direction is to investigate better exploration methods that go beyond using optimistic models to include Bayesian and utility-guided optimal exploration.\n\n8 Acknowledgments\n\nWe thank the reviewers for their helpful feedback. This research is supported by the Army Research Office under grant number W911NF-09-1-0153.\n\nReferences\n[1] R. Brafman and M. Tennenholtz. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. Journal of Machine Learning Research, 3:213\u2013231, 2002.\n\n[2] M. Kearns and L. Valiant. 
Cryptographic Limitations on Learning Boolean Formulae and Finite Automata. In Annual ACM Symposium on Theory of Computing, 1989.\n\n[3] L. Li. A Unifying Framework for Computational Reinforcement Learning Theory. PhD thesis, Rutgers University, 2009.\n\n[4] L. Li, M. Littman, and T. Walsh. Knows What It Knows: A Framework for Self-Aware Learning. In ICML, 2008.\n\n[5] N. Littlestone. Mistake Bounds and Logarithmic Linear-Threshold Learning Algorithms. PhD thesis, U.C. Santa Cruz, 1989.\n\n[6] B. Marthi, S. Russell, and J. Wolfe. Angelic Semantics for High-Level Actions. In ICAPS, 2007.\n\n[7] B. K. Natarajan. On Learning Boolean Functions. In Annual ACM Symposium on Theory of Computing, 1987.\n\n[8] T. Walsh and M. Littman. Efficient Learning of Action Schemas and Web-Service Descriptions. In AAAI, 2008.\n", "award": [], "sourceid": 1323, "authors": [{"given_name": "Neville", "family_name": "Mehta", "institution": null}, {"given_name": "Prasad", "family_name": "Tadepalli", "institution": null}, {"given_name": "Alan", "family_name": "Fern", "institution": null}]}