{"title": "Skill Characterization Based on Betweenness", "book": "Advances in Neural Information Processing Systems", "page_first": 1497, "page_last": 1504, "abstract": "We present a characterization of a useful class of skills based on a graphical representation of an agent's interaction with its environment. Our characterization uses betweenness, a measure of centrality on graphs. It may be used directly to form a set of skills suitable for a given environment. More importantly, it serves as a useful guide for developing online, incremental skill discovery algorithms that do not rely on knowing or representing the environment graph in its entirety.", "full_text": "Skill characterization based on betweenness\n\n\u00a8Ozg\u00a8ur S\u00b8ims\u00b8ek\u2217\nAndrew G. Barto\n\nAmherst, MA 01003\n\n{ozgur|barto}@cs.umass.edu\n\nDepartment of Computer Science\n\nUniversity of Massachusetts\n\nAbstract\n\nWe present a characterization of a useful class of skills based on a graphical repre-\nsentation of an agent\u2019s interaction with its environment. Our characterization uses\nbetweenness, a measure of centrality on graphs. It captures and generalizes (at\nleast intuitively) the bottleneck concept, which has inspired many of the existing\nskill-discovery algorithms. Our characterization may be used directly to form a\nset of skills suitable for a given task. More importantly, it serves as a useful guide\nfor developing incremental skill-discovery algorithms that do not rely on knowing\nor representing the interaction graph in its entirety.\n\n1 Introduction\n\nThe broad problem we consider is how to equip arti\ufb01cial agents with the ability to form useful\nhigh-level behaviors, or skills, from available primitives. For example, for a robot performing tasks\nthat require manipulating objects, grasping is a useful skill that employs lower-level sensory and\nmotor primitives. 
In approaching this problem, we distinguish between two related questions: What constitutes a useful skill? And how can an agent identify such skills autonomously? Here, we address the former question with the objective of guiding research on the latter.\n\nOur main contribution is a characterization of a useful class of skills based on a graphical representation of the agent\u2019s interaction with its environment. Specifically, we use betweenness, a measure of centrality on graphs [1, 2], to define a set of skills that allows efficient navigation on the interaction graph. In the game of Tic-Tac-Toe, these skills translate into setting up a fork, creating an opportunity to win the game. In the Towers of Hanoi puzzle, they include clearing the stack above the largest disk and clearing one peg entirely, making it possible to move the largest disk.\n\nOur characterization may be used directly to form a set of skills suitable for a given task if the interaction graph is readily available. More importantly, this characterization is a useful guide for developing low-cost, incremental algorithms for skill discovery that do not rely on a complete representation of the interaction graph. We present one such algorithm here and perform a preliminary analysis.\n\nOur characterization captures and generalizes (at least intuitively) the bottleneck concept, which has inspired many of the existing skill-discovery algorithms [3, 4, 5, 6, 7, 8, 9]. Bottlenecks have been described as regions that the agent tends to visit frequently on successful trajectories but not on unsuccessful ones [3], border states of strongly connected areas [6], and states that allow transitions to a different part of the environment [7]. 
The canonical example is a doorway connecting two rooms. We hope that our explicit and concrete description of what makes a useful skill will lead to further development of these existing algorithms and inspire alternative methods.\n\n\u2217Now at the Max Planck Institute for Human Development, Center for Adaptive Behavior and Cognition, Berlin, Germany.\n\nFigure 1: A visual representation of betweenness on two sample graphs.\n\n2 Skill Definition\n\nWe assume that the agent\u2019s interaction with its environment may be represented as a Markov Decision Process (MDP). The interaction graph is a directed graph in which the vertices represent the states of the MDP and the edges represent possible state transitions brought about by available actions. Specifically, the edge u \u2192 v is present in the graph if and only if the corresponding state transition has a strictly positive probability through the execution of at least one action. The weight on each edge is the expected cost of the transition, or expected negative reward.\n\nOur claim is that states that play a pivotal role in efficiently navigating the interaction graph are useful subgoals to reach, and that a useful measure for evaluating how pivotal a vertex v is\n\n\\sum_{s \\neq t \\neq v} \\frac{\\sigma_{st}(v)}{\\sigma_{st}} w_{st},\n\nwhere \\sigma_{st} is the number of shortest paths from vertex s to vertex t, \\sigma_{st}(v) is the number of such paths that pass through vertex v, and w_{st} is the weight assigned to paths from vertex s to vertex t.\n\nWith uniform path weights, the above expression equals betweenness, a measure of centrality on graphs [1, 2]. It gives the fraction of shortest paths on the graph (between all possible sources and destinations) that pass through the vertex of interest. If there are multiple shortest paths from a given source to a given destination, they are given equal weights that sum to one. 
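The weighted sum above can be checked on a toy graph by direct enumeration. The following is a minimal brute-force sketch, not the fast Brandes algorithm the paper cites; the example graph, the function names, and the uniform default path weights are illustrative assumptions:

```python
from collections import deque, defaultdict

def bfs_counts(adj, s):
    """From source s: hop distances and the number of shortest paths to each node."""
    dist, sigma = {s: 0}, defaultdict(int)
    sigma[s] = 1
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma

def weighted_betweenness(adj, w=lambda s, t: 1.0):
    """Sum over s != t != v of w(s, t) * sigma_st(v) / sigma_st (unweighted edges)."""
    info = {u: bfs_counts(adj, u) for u in adj}
    score = {v: 0.0 for v in adj}
    for s in adj:
        dist_s, sig_s = info[s]
        for t in adj:
            if t == s or t not in dist_s:
                continue
            for v in adj:
                if v == s or v == t or v not in dist_s:
                    continue
                dist_v, sig_v = info[v]
                # v lies on a shortest s->t path iff the distances add up exactly
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    score[v] += w(s, t) * sig_s[v] * sig_v[t] / sig_s[t]
    return score

def local_maxima(adj, score):
    """Nodes whose betweenness is >= that of every one-hop (in- or out-) neighbour."""
    nbrs = defaultdict(set)
    for u in adj:
        for v in adj[u]:
            nbrs[u].add(v)
            nbrs[v].add(u)
    return [v for v in adj if all(score[v] >= score[u] for u in nbrs[v])]
```

On a barbell-shaped graph (two triangles joined through a middle node), the middle node receives the highest score and is the unique local maximum, mirroring the doorway example.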
Betweenness may be computed in O(nm) time and O(n + m) space on unweighted graphs with n nodes and m edges [10]. On weighted graphs, the space requirement remains the same, but the time requirement increases to O(nm + n^2 log n).\n\nIn our use of betweenness, we include path weights to take into account the reward function. Depending on the reward function\u2014or a probability distribution over possible reward functions\u2014some parts of the interaction graph may be given more weight than others, depending on how well they serve the agent\u2019s needs.\n\nWe define as subgoals those states that correspond to local maxima of betweenness on the interaction graph, in other words, states that have a higher betweenness than other states in their neighborhood. Here, we use a simple definition of neighborhood, including in it only the states that are one hop away; this definition may be revised in the future. Skills for efficiently reaching the local maxima of betweenness represent a set of behaviors that may be combined in different ways to efficiently reach different regions, serving as useful building blocks for navigating the graph.\n\nFigure 1 is a visual representation of betweenness on two sample graphs, computed using uniform edge and path weights. The gray-scale shading on the vertices corresponds to the relative values of betweenness, with black representing the highest betweenness on the graph and white representing the lowest. The graph on the left corresponds to a gridworld in which a doorway connects two rooms. The graph on the right has a doorway of a different type: an edge connecting two otherwise distant nodes. In both graphs, states that are local maxima of betweenness correspond to our intuitive choice of subgoals.\n\nFigure 2: Betweenness in Taxi, Playroom, and Tic-Tac-Toe (from left to right). 
Edge directions are omitted in the figure.\n\n3 Examples\n\nWe applied the skill definition of Section 2 to various domains from the literature: Taxi [11], Playroom [12, 13], and the game of Tic-Tac-Toe. Interaction graphs of these domains, displaying betweenness values as gray-scale shading on the vertices, are shown in Figure 2. In Taxi and Playroom, graph layouts were determined by a force-directed algorithm that models the edges as springs and minimizes the total force on the system. We considered a node to be a local maximum if its betweenness was higher than or equal to those of its immediate neighbors, taking into account both incoming and outgoing edges. Unless stated otherwise, actions had uniform cost and betweenness was computed using uniform path weights.\n\nTaxi This domain includes a taxi and a passenger on the 5 \u00d7 5 grid shown in Figure 4. At each grid location, the taxi has six primitive actions: north, east, south, west, pick-up, and put-down. The navigation actions succeed in moving the taxi in the intended direction with probability 0.8; with probability 0.2, the action takes the taxi to the right or left of the intended direction. If the direction of movement is blocked, the taxi remains in the same location. Pick-up places the passenger in the taxi if the taxi is at the passenger location; otherwise it has no effect. Similarly, put-down delivers the passenger if the passenger is inside the taxi and the taxi is at the destination; otherwise it has no effect. The source and destination of all passengers are chosen uniformly at random from among the grid squares R, G, B, and Y. We used a continuing version of this problem in which a new passenger appears after each successful delivery.\n\nThe highest local maxima of betweenness are at the four regions of the graph that correspond to passenger delivery. 
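The navigation dynamics described above can be sketched as follows. This is a simplified stand-in, not the authors' implementation: it models only the outer boundary of the 5 × 5 grid (the internal walls of the Taxi domain are omitted), and it assumes the 0.2 slip probability is split evenly between the two perpendicular directions:

```python
import random

# Intended displacement of each navigation action, and the action that lies
# to its right / left (used for the slip outcomes).
MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}
RIGHT = {"north": "east", "east": "south", "south": "west", "west": "north"}
LEFT = {v: k for k, v in RIGHT.items()}

def nav_step(pos, action, size=5, rng=random):
    """Move with prob. 0.8 in the intended direction, 0.1 to each side.
    A blocked move (off the grid) leaves the taxi where it is."""
    r = rng.random()
    direction = action if r < 0.8 else (RIGHT[action] if r < 0.9 else LEFT[action])
    dx, dy = MOVES[direction]
    x, y = pos[0] + dx, pos[1] + dy
    return (x, y) if 1 <= x <= size and 1 <= y <= size else pos
```

Sampling nav_step repeatedly from every cell yields the state-transition edges that would go into the Taxi interaction graph.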
Other local maxima belong to one of the following categories: (1) the taxi is at the passenger location1, (2) the taxi is at one of the passenger wait locations with the passenger in the taxi2, (3) the taxi and passenger are both at the destination, (4) the taxi is at x = 2, y = 3, a navigational bottleneck on the grid, and (5) the taxi is at x = 3, y = 3, another navigational bottleneck. The corresponding skills are (approximately) those that take the taxi to the passenger location, to the destination (having picked up the passenger), or to a navigational bottleneck. These skills closely resemble those that are hand-coded for this domain in the literature.\n\n1Except when the passenger is waiting at Y, in which case the taxi is at x = 1, y = 3.\n2For wait location Y, the corresponding subgoal has the taxi at x = 1, y = 3, having picked up the passenger.\n\nFigure 3: Learning performance in Rooms, Shortcut, and Playroom.\n\nPlayroom We created a Markov version of this domain in which an agent interacts with a number of objects in its surroundings: a light switch, a ball, a bell, a button for turning music on and off, and a toy monkey. The agent has an eye, a hand, and a marker it can place on objects. Its actions consist of looking at a randomly selected object, looking at the object in its hand, holding the object it is looking at, looking at the object that the marker is placed on, placing the marker on the object it is looking at, moving the object in its hand to the location it is looking at, flipping the light switch, pressing the music button, and hitting the ball towards the marker. The first two actions succeed with probability 1, while the remaining actions succeed with probability 0.75, producing no change in the environment if they fail. In order to operate on an object, the agent must be looking at the object and holding it in its hand. 
To be able to press the music button successfully, the light should be on. The toy monkey starts to make frightened sounds if the bell is rung while the music is playing; it stops only when the music is turned off. If the ball hits the bell, the bell rings for one decision stage.\n\nThe MDP state consists of the object that the agent is looking at, the object that the agent is holding, the object that the marker is placed on, music (on/off), light (on/off), monkey (frightened/not), and bell (ringing/not). The six different clusters of the interaction graph in Figure 2 emerge naturally from the force-directed layout algorithm and correspond to the different settings of the music, light, and monkey variables. There are only six such clusters because not all variable combinations are possible. Betweenness peaks at regions that immediately connect neighboring clusters, corresponding to skills that change the setting of the music, light, or monkey variables.\n\nTic-Tac-Toe In the interaction graph, the central node is the empty board, with other board configurations forming rings around it according to their distance from this initial configuration. The innermost ring shows states in which both players have played a single turn. The agent played first. The opponent followed a policy that, with decreasing priority, (1) placed the third mark in a row whenever possible, winning the game, (2) blocked the agent from completing a row, and (3) placed its mark on a random empty square. Our state representation was invariant with respect to rotational and reflective symmetries of the board. We assigned a weight of +1 to paths that terminate at a win for the agent and 0 to all other paths. The state with the highest betweenness is the one shown in Figure 4. The agent is the X player and will go next. 
This state gives the agent two possibilities for setting up a fork (board locations marked with *), creating an opportunity to win on the next turn. There were nine other local maxima that similarly allowed the agent to immediately create a fork. In addition, there were a number of \u201ctrivial\u201d local maxima that allowed the agent to immediately win the game.\n\n4 Empirical Performance\n\nWe evaluated the impact of our skills on the agent\u2019s learning performance in Taxi, Playroom, Tic-Tac-Toe, and two additional domains, called Rooms and Shortcut, whose interaction graphs are those presented in Figure 1. Rooms is a gridworld in which a doorway connects two rooms. At each state, the available actions are north, south, east, and west. They move the agent in the intended direction with probability 0.8 and in a uniform random direction with probability 0.2.\n\nFigure 4: Learning performance in Taxi and Tic-Tac-Toe. (The figure also shows the 5 \u00d7 5 Taxi grid with landmarks R, G, B, Y, and the Tic-Tac-Toe board state of highest betweenness with fork locations marked *.)\n\nThe local maxima of betweenness are the two states that surround the doorway, which have a slightly higher betweenness than the doorway itself. 
The transition dynamics of Shortcut are identical, except that there is one additional long-range action connecting two particular states, which are the local maxima of betweenness in this domain.\n\nWe represented skills using the options framework [14, 15]. The initiation set was restricted to include a certain number of states, namely those states with the least distance to the subgoal on the interaction graph. The skills terminated with probability one outside the initiation set and at the subgoal, and with probability zero at all other states. The skill policy was the optimal policy for reaching the subgoal. We compared three agents: one that used only the primitive actions of the domain, one that used primitives and our skills, and one that used primitives and a control group of skills whose subgoals were selected randomly. The number of subgoals used and the size of the initiation sets were identical in the two skill conditions. The agent used Q-learning with \u03b5-greedy exploration with \u03b5 = 0.05. When using skills, it performed both intra-option and macro-Q updates [16]. The learning rate (\u03b1) was kept constant at 0.1. Initial Q-values were 0. The discount rate \u03b3 was set to 1 in episodic tasks and to 0.99 in continuing tasks.\n\nFigure 3 shows performance results in Rooms, Shortcut, and Playroom, where we had the agent perform 100 different episodic tasks, choosing a single goal state uniformly at random in each task. The reward was \u22120.001 for each transition and an additional +1 for transitions into the goal state. The initial state was selected randomly. The labels in the figure indicate the size of the initiation sets; if no number is present, the skills were made available everywhere in the domain. The availability of our skills\u2014those identified using local maxima of betweenness\u2014yielded a large improvement compared to using primitive actions only. 
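The macro-Q update used here treats a completed option as a k-step SMDP transition. A minimal sketch of that backup follows; the function name and the tabular Q layout are illustrative assumptions, and the paper's agent applies this alongside intra-option updates:

```python
from collections import defaultdict

def macro_q_update(Q, s, o, rewards, s_next, choices, alpha=0.1, gamma=0.99):
    """After option o runs k steps from s to s_next, back up the k-step
    discounted return plus gamma^k times the best value available at s_next.
    `choices` is the set of primitives and options available at s_next."""
    k = len(rewards)
    ret = sum(gamma ** i * r for i, r in enumerate(rewards))
    target = ret + gamma ** k * max(Q[(s_next, c)] for c in choices)
    Q[(s, o)] += alpha * (target - Q[(s, o)])
    return Q[(s, o)]
```

With a one-step option and alpha = 1, this reduces to the ordinary Q-learning backup.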
In some cases, random skills improved performance as well, but this improvement was much smaller than that obtained with our skills.\n\nFigure 4 shows similar results in Taxi and Tic-Tac-Toe. The figure shows mean performance over 100 trials. In Taxi, we examined performance on the single continuing task that rewarded the agent for delivering passengers. The reward was \u22121 for each action, an additional +50 for passenger delivery, and an additional \u221210 for an unsuccessful pick-up or put-down. In Tic-Tac-Toe, the agent received a reward of \u22120.001 for each action, an additional +1 for winning the game, and an additional \u22121 for losing. Creating an individual skill for reaching each of the identified subgoals (which is what we have done in other domains) generates skills that are not of much use in Tic-Tac-Toe because reaching any particular board configuration is usually not possible. Instead, we defined a single skill with multiple subgoals\u2014the ten local maxima of betweenness that allow the agent to set up a fork. We set the initial Q-value of this skill to 1 at the start state to ensure that the skill was executed frequently enough. 
It is not clear what this single skill can be meaningfully compared to, so we do not provide a control condition with randomly selected subgoals.\n\nOur analysis shows that, in a diverse set of domains, the skill definition of Section 2 gives rise to skills that are consistent with common sense, are similar to the skills people handcraft for these domains, and improve learning performance. 
The improvements in performance are greater than those observed when using a control group of randomly generated skills, suggesting that they should not be attributed to the presence of skills alone but to the presence of the specific skills that are formed based on betweenness.\n\n5 Related Work\n\nA graphical approach to forming high-level behavioral units was first suggested by Amarel in his classic analysis of the missionaries and cannibals problem [17]. Amarel advocated representing action consequences in the environment as a graph and forming skills that correspond to navigating this graph by exploiting its structural regularities. He did not, however, propose any general mechanism that can be used for this purpose.\n\nOur skill definition captures the \u201cbottleneck\u201d concept, which has inspired many of the existing skill discovery algorithms [3, 6, 4, 5, 7, 8, 9]. There is clearly an overlap between our skills and the skills that are generated by these algorithms. Here, we review these algorithms, with a focus on the extent of this overlap and on sample efficiency.\n\nMcGovern & Barto [3] examine past trajectories to identify states that are common in successful trajectories but not in unsuccessful ones. An important concern with their method is its need for excessive exploration of the environment. It can be applied only after the agent has successfully performed the task at least once, and typically it requires many additional successful trajectories. Furthermore, a fundamental property of this algorithm prevents it from identifying a large portion of our subgoals: it examines different paths that reach the same destination, while we look for the most efficient ways of navigating between different source and destination pairs. 
Bottlenecks that are not on the path to the goal state would not be identified by this algorithm, while we consider such states to be useful subgoals.\n\nStolle & Precup [4] and Stolle [5] address this last concern by obtaining their trajectories from multiple tasks that start and terminate at different states. As the number of tasks increases, the subgoals identified by their algorithms become more similar to ours. Unfortunately, however, sample efficiency is an even larger concern with these algorithms, because they require the agent to have already identified the optimal policy\u2014not only for a single task, but for many different tasks in the domain.\n\nMenache et al. [6] and Mannor et al. [8] take a graphical approach and use the MDP state-transition graph to identify subgoals. They apply a clustering algorithm to partition the graph into blocks and create skills that efficiently take the agent to states that connect the different blocks. The objective is to identify blocks that are highly connected within themselves but weakly connected to each other. Different clustering techniques and cut metrics may be used towards this end. Rooms and Playroom are examples of domains where these algorithms can succeed; Tic-Tac-Toe and Shortcut are examples of domains where they fail.\n\n\u015eim\u015fek, Wolfe & Barto [9] address certain shortcomings of global graph partitioning by constructing their graphs from short trajectories. \u015eim\u015fek & Barto [7] take a different approach and search for states that introduce short-term novelty. Although their algorithm does not explicitly use the connectivity structure of the domain, it shares some of the limitations of graph partitioning, as we discuss more fully in the next section. 
We claim that the more fundamental property that makes a doorway a useful subgoal is that it is between many source-destination pairs, and that graph partitioning cannot directly exploit this property, although it can sometimes do so indirectly.\n\n6 An Incremental Discovery Algorithm\n\nOur skill definition may be used directly to form a set of skills suitable for a given environment. Because of its reliance on complete knowledge of the interaction graph and the computational cost of betweenness, the use of our approach as a skill-discovery method is limited, although there are conditions under which it would be useful. An important research question is whether approximate methods may be developed that do not require a complete representation of the interaction graph.\n\nAlthough the betweenness of a given vertex is a global graph property that cannot be estimated reliably without knowledge of the entire graph, it should be possible to reliably determine the local maxima of betweenness using limited information. Here, we investigate this possibility by combining the descriptive contributions of the present paper with the algorithmic insights of earlier work. In particular, we apply the statistical approach of \u015eim\u015fek & Barto [7] and \u015eim\u015fek, Wolfe & Barto [9] using the skill description in the present paper.\n\nThe resulting algorithm is founded on the premise that local maxima of betweenness of the interaction graph are likely to be local maxima on its subgraphs. While any single subgraph would not be particularly useful for identifying such vertices, a collection of subgraphs may allow us to identify them correctly. The algorithm proceeds as follows. The agent uses short trajectories to construct subgraphs of the interaction graph and identifies the local maxima of betweenness on these subgraphs. From each subgraph, it obtains a new observation for every state represented on it. 
This is a positive observation if the state is a local maximum, and a negative observation otherwise. We use the decision rule from \u015eim\u015fek, Wolfe & Barto [9], making a particular state a subgoal if there are at least n_o observations of this state and if the proportion of positive observations is at least p+. The agent continues this incremental process indefinitely.\n\nFigure 5 shows the results of applying this algorithm to two domains. The first is a gridworld with six rooms. The second is also a gridworld, but its grid squares are of two types with different rewards: the lightly colored squares produce a reward of \u22120.001 for actions that originate on them, while the darker squares produce \u22120.1. The reward structure creates two local maxima of betweenness on the graph. These are the regions that look like doorways in the figure\u2014they are useful subgoals for the same reasons that doorways are. Graph partitioning does not succeed in identifying these states because the structure is not created through node connectivity. Similarly, the algorithms of \u015eim\u015fek & Barto [7] and \u015eim\u015fek, Wolfe & Barto [9] are also not suitable for this domain: we applied them and found that they identified very few subgoals (<0.05 per trial), randomly distributed in the domain.\n\nIn both domains, we had the agent perform a random walk of 40,000 steps. Every 1000 transitions, the agent created a new interaction graph using the last 1000 transitions. Figure 5 shows the number of times each state was identified as a subgoal in 100 trials, using n_o = 10, p+ = 0.2. The individual graphs had on average 156 nodes in the six-room gridworld and 224 nodes in the other one.\n\nWe present this algorithm here as a proof of concept, to demonstrate the feasibility of incremental algorithms. 
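The tally-and-threshold decision rule above can be sketched as follows; the class and method names are illustrative assumptions, and the local-maxima computation on each subgraph is assumed to happen elsewhere:

```python
from collections import defaultdict

class SubgoalTallies:
    """Accumulate, over trajectory subgraphs, how often each state is observed
    and how often it is a local maximum of betweenness; flag it as a subgoal
    once it has at least n_o observations with a positive fraction >= p_plus."""

    def __init__(self, n_o=10, p_plus=0.2):
        self.n_o, self.p_plus = n_o, p_plus
        self.pos = defaultdict(int)    # times observed as a local maximum
        self.total = defaultdict(int)  # times observed at all

    def observe_subgraph(self, states, maxima):
        for s in states:
            self.total[s] += 1
            if s in maxima:
                self.pos[s] += 1

    def subgoals(self):
        return {s for s, n in self.total.items()
                if n >= self.n_o and self.pos[s] / n >= self.p_plus}
```

With the settings used in the experiments (n_o = 10, p+ = 0.2), a state seen on ten subgraphs and flagged as a local maximum on at least two of them becomes a subgoal.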
An interesting direction is to develop algorithms that actively explore to discover local maxima of betweenness rather than only passively mining available trajectories.\n\nFigure 5: Subgoal frequency in 100 trials using the incremental discovery algorithm.\n\nAcknowledgments\n\nThis work is supported in part by the National Science Foundation under grant CNS-0619337 and by the Air Force Office of Scientific Research under grant FA9550-08-1-0418. Any opinions, findings, conclusions or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsors.\n\nReferences\n\n[1] L. C. Freeman. A set of measures of centrality based upon betweenness. Sociometry, 40:35\u201341, 1977.\n\n[2] L. C. Freeman. Centrality in social networks: Conceptual clarification. Social Networks, 1:215\u2013239, 1979.\n\n[3] A. McGovern and A. G. Barto. Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the Eighteenth International Conference on Machine Learning, 2001.\n\n[4] M. Stolle and D. Precup. Learning options in reinforcement learning. Lecture Notes in Computer Science, 2371:212\u2013223, 2002.\n\n[5] M. Stolle. Automated discovery of options in reinforcement learning. Master\u2019s thesis, McGill University, 2004.\n\n[6] I. Menache, S. Mannor, and N. Shimkin. Q-Cut - Dynamic discovery of sub-goals in reinforcement learning. In Proceedings of the Thirteenth European Conference on Machine Learning, 2002.\n\n[7] \u00d6. \u015eim\u015fek and A. G. Barto. Using relative novelty to identify useful temporal abstractions in reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, 2004.\n\n[8] S. Mannor, I. Menache, A. Hoze, and U. Klein. Dynamic abstraction in reinforcement learning via clustering. 
In Proceedings of the Twenty-First International Conference on Machine Learning, 2004.\n\n[9] \u00d6. \u015eim\u015fek, A. P. Wolfe, and A. G. Barto. Identifying useful subgoals in reinforcement learning by local graph partitioning. In Proceedings of the Twenty-Second International Conference on Machine Learning, 2005.\n\n[10] U. Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2):163\u2013177, 2001.\n\n[11] T. G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227\u2013303, 2000.\n\n[12] A. G. Barto, S. Singh, and N. Chentanez. Intrinsically motivated learning of hierarchical collections of skills. In Proceedings of the Third International Conference on Developmental Learning, 2004.\n\n[13] S. Singh, A. G. Barto, and N. Chentanez. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, 2005.\n\n[14] R. S. Sutton, D. Precup, and S. P. Singh. Between MDPs and Semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181\u2013211, 1999.\n\n[15] D. Precup. Temporal abstraction in reinforcement learning. PhD thesis, University of Massachusetts Amherst, 2000.\n\n[16] A. McGovern, R. S. Sutton, and A. H. Fagg. Roles of macro-actions in accelerating reinforcement learning. In Grace Hopper Celebration of Women in Computing, 1997.\n\n[17] S. Amarel. On representations of problems of reasoning about actions. In Machine Intelligence 3, pages 131\u2013171. Edinburgh University Press, 1968.\n", "award": [], "sourceid": 994, "authors": [{"given_name": "\u00d6zg\u00fcr", "family_name": "\u015eim\u015fek", "institution": null}, {"given_name": "Andrew", "family_name": "Barto", "institution": ""}]}