{"title": "Learn What Not to Learn: Action Elimination with Deep Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3562, "page_last": 3573, "abstract": "Learning how to act when there are many available actions in each state is a challenging task for Reinforcement Learning (RL) agents, especially when many of the actions are redundant or irrelevant. In such cases, it is easier to learn which actions not to take. In this work, we propose the Action-Elimination Deep Q-Network (AE-DQN) architecture that combines a Deep RL algorithm with an Action Elimination Network (AEN) that eliminates sub-optimal actions. The AEN is trained to predict invalid actions, supervised by an external elimination signal provided by the environment. Simulations demonstrate a considerable speedup and added robustness over vanilla DQN in text-based games with over a thousand discrete actions.", "full_text": "Learn What Not to Learn: Action Elimination with\n\nDeep Reinforcement Learning\n\nTom Zahavy\u22171,2, Matan Haroush\u22171, Nadav Merlis\u22171, Daniel J. Mankowitz3, Shie Mannor1\n\n1The Technion - Israel Institute of Technology, 2 Google research, 3 Deepmind\n\nCorresponding to {tomzahavy,matan.h,merlis}@campus.technion.ac.il\n\n* Equal contribution\n\nAbstract\n\nLearning how to act when there are many available actions in each state is a\nchallenging task for Reinforcement Learning (RL) agents, especially when many of\nthe actions are redundant or irrelevant. In such cases, it is sometimes easier to learn\nwhich actions not to take. In this work, we propose the Action-Elimination Deep\nQ-Network (AE-DQN) architecture that combines a Deep RL algorithm with an\nAction Elimination Network (AEN) that eliminates sub-optimal actions. The AEN\nis trained to predict invalid actions, supervised by an external elimination signal\nprovided by the environment. 
Simulations demonstrate a considerable speedup and added robustness over vanilla DQN in text-based games with over a thousand discrete actions.

1 Introduction

Learning control policies for sequential decision-making tasks where both the state space and the action space are vast is critical when applying Reinforcement Learning (RL) to real-world problems. This is because there is an exponential growth of computational requirements as the problem size increases, known as the curse of dimensionality (Bertsekas and Tsitsiklis, 1995). Deep RL (DRL) tackles the curse of dimensionality due to large state spaces by utilizing a Deep Neural Network (DNN) to approximate the value function and/or the policy. This enables the agent to generalize across states without domain-specific knowledge (Tesauro, 1995; Mnih et al., 2015).
Despite the great success of DRL methods, deploying them in real-world applications is still limited. One of the main challenges towards that goal is dealing with large action spaces, especially when many of the actions are redundant or irrelevant (for many states). While humans can usually detect the subset of feasible actions in a given situation from the context, RL agents may attempt irrelevant actions or actions that are inferior, thus wasting computation time. Control systems for large industrial processes like power grids (Wen, O'Neill, and Maei, 2015; Glavic, Fonteneau, and Ernst, 2017; Dalal, Gilboa, and Mannor, 2016) and traffic control (Mannion, Duggan, and Howley, 2016; Van der Pol and Oliehoek, 2016) may have millions of possible actions that can be applied at every time step. Other domains utilize natural language to represent the actions. These action spaces are typically composed of all possible sequences of words from a fixed size dictionary, resulting in considerably large action spaces.
Common examples of systems that use this action space representation include conversational agents such as personal assistants (Dhingra et al., 2016; Li et al., 2017; Su et al., 2016; Lipton et al., 2016b; Liu et al., 2017; Zhao and Eskenazi, 2016; Wu et al., 2016), travel planners (Peng et al., 2017), restaurant/hotel bookers (Budzianowski et al., 2017), chat-bots (Serban et al., 2017; Li et al., 2016) and text-based game agents (Narasimhan, Kulkarni, and Barzilay, 2015; He et al., 2015; Zelinka, 2018; Yuan et al., 2018; Côté et al., 2018). RL is currently being applied in all of these domains, facing new challenges in function approximation and exploration due to the larger action space.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: (a) Zork interface. The state in the game (observation) and the player actions are described in natural language. (b) The action elimination framework. Upon taking action a_t, the agent observes a reward r_t, the next state s_{t+1} and an elimination signal e_t. The agent uses this information to learn two function approximation deep networks: a DQN and an AEN. The AEN provides an admissible actions set A' to the DQN, which uses this set to decide how to act and learn.

In this work, we propose a new approach for dealing with large action spaces that is based on action elimination; that is, restricting the available actions in each state to a subset of the most likely ones (Figure 1(b)). We propose a method that eliminates actions by utilizing an elimination signal; a specific form of an auxiliary reward (Jaderberg et al., 2016), which incorporates domain-specific knowledge in text games. Specifically, it provides the agent with immediate feedback regarding taken actions that are not optimal. In many domains, creating an elimination signal can be done using rule-based systems.
For example, in parser-based text games, the parser gives feedback regarding irrelevant actions after the action is played (e.g., Player: \"Climb the tree.\" Parser: \"There are no trees to climb\"). Given such a signal, we can train a machine learning model to predict it and then use it to generalize to unseen states. Since the elimination signal provides immediate feedback, it is faster to learn which actions to eliminate (e.g., with a contextual bandit using the elimination signal) than to learn the optimal actions using only the reward (due to long-term consequences). Therefore, we can design an algorithm that enjoys better performance by exploring invalid actions less frequently.
More specifically, we propose a system that learns an approximation of the Q-function and concurrently learns to eliminate actions. We focus on tasks where natural language characterizes both the states and the actions. In particular, the actions correspond to fixed-length sentences defined over a finite dictionary (of words). In this case, the action space is of combinatorial size (in the length of the sentence and the size of the dictionary) and irrelevant actions must be eliminated to learn. We introduce a novel DRL approach with two DNNs, a DQN and an Action Elimination Network (AEN), both designed using a Convolutional Neural Network (CNN) that is suited to NLP tasks (Kim, 2014). Using the last layer activations of the AEN, we design a linear contextual bandit model that eliminates irrelevant actions with high probability, balancing exploration/exploitation, and allowing the DQN to explore and learn Q-values only for valid actions.
We tested our method in a text-based game called \"Zork\". This game takes place in a virtual world in which the player interacts with the world through a text-based interface (see Figure 1(a)). The player can type in any command, corresponding to the in-game action.
Since the input is text-based, this yields more than a thousand possible actions in each state (e.g., \"open door\", \"open mailbox\", etc.). We demonstrate the agent's ability to advance in the game faster than the baseline agents by eliminating irrelevant actions.

2 Related Work

Text-Based Games (TBG): Video games, via interactive learning environments like the Arcade Learning Environment (ALE) (Bellemare et al., 2013), have been fundamental to the development of DRL algorithms. Before the ubiquitousness of graphical displays, TBG like Zork were popular in the adventure gaming and role-playing communities. TBG present complex, interactive simulations which use simple language to describe the state of the environment, as well as reporting the effects of player actions (see Figure 1(a)). Players interact with the environment through text commands that respect a predefined grammar, which must be discovered in each game.

TBG provide a testbed for research at the intersection of RL and NLP, presenting a broad spectrum of challenges for learning algorithms (Côté et al., 2018)¹. In addition to language understanding, successful play generally requires long-term memory, planning, exploration (Yuan et al., 2018), affordance extraction (Fulda et al., 2017), and common sense. Text games also highlight major open challenges for RL: the action space (text) is combinatorial and compositional, while game states are partially observable since text is often ambiguous or under-specified. Also, TBG often introduce stochastic dynamics, which are currently missing in standard benchmarks (Machado et al., 2017). For example, in Zork, there is a random probability of a troll killing the player. A thief can appear (also randomly) in each room.
Representations for TBG: To learn control policies from high-dimensional complex data such as text, good word representations are necessary.
Kim (2014) designed a shallow word-level CNN and demonstrated state-of-the-art results on a variety of classification tasks by using word embeddings. For classification tasks with millions of labeled examples, random embeddings were shown to outperform state-of-the-art techniques (Zahavy et al., 2018). On smaller data sets, using word2vec (Mikolov et al., 2013) yields good performance (Kim, 2014).
Previous work on TBG used pre-trained embeddings directly for control (Kostka et al., 2017; Fulda et al., 2017). Other works combined pre-trained embeddings with neural networks. For example, He et al. (2015) proposed to use Bag Of Words features as an input to a neural network, learned separate embeddings for states and actions, and then computed the Q function from autocorrelations between these embeddings. Narasimhan et al. (2015) suggested to use a word-level Long Short-Term Memory (Hochreiter and Schmidhuber, 1997) to learn a representation end-to-end, and Zelinka et al. (2018) combined these approaches.
DRL with linear function approximation: DRL methods such as the DQN have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This success is mainly attributed to the power of deep neural networks to learn rich domain representations for approximating the value function or policy (Mnih et al., 2015; Zahavy, Ben-Zrihem, and Mannor, 2016; Zrihem, Zahavy, and Mannor, 2016). Batch reinforcement learning methods with linear representations, on the other hand, are more stable and enjoy accurate uncertainty estimates. Yet, substantial feature engineering is necessary to achieve good results. A natural attempt at getting the best of both worlds is to learn a (linear) control policy on top of the representation of the last layer of a DNN. This approach was shown to refine the performance of DQNs (Levine et al., 2017) and improve exploration (Azizzadenesheli, Brunskill, and Anandkumar, 2018).
Similarly, for contextual linear bandits, Riquelme et al. showed that a neuro-linear Thompson sampling approach outperformed deep (and linear) bandit algorithms in practice (Riquelme, Tucker, and Snoek, 2018).
RL in Large Action Spaces: Being able to reason in an environment with a large number of discrete actions is essential to bringing reinforcement learning to a larger class of problems. Most of the prior work concentrated on factorizing the action space into binary subspaces (Pazis and Parr, 2011; Dulac-Arnold et al., 2012; Lagoudakis and Parr, 2003). Other works proposed to embed the discrete actions into a continuous space, use a continuous-action policy gradient to find optimal actions in the continuous space, and finally, choose the nearest discrete action (Dulac-Arnold et al., 2015; Van Hasselt and Wiering, 2009). He et al. (2015) extended DQN to unbounded (natural language) action spaces. Their algorithm learns representations for the states and actions with two different DNNs and then models the Q-values as an inner product between these representation vectors. While this approach can generalize to large action spaces, in practice, they only considered a small number of available actions (4) in each state.
Learning to eliminate actions was first mentioned by Even-Dar, Mannor, and Mansour (2003), who studied elimination in multi-armed bandits and tabular MDPs. They proposed to learn confidence intervals around the value function in each state and then use them to eliminate actions that are not optimal with high probability. Lipton et al. (2016a) studied a related problem where an agent wants to avoid catastrophic forgetting of dangerous states. They proposed to learn a classifier that detects hazardous states and then use it to shape the reward of a DQN agent. Fulda et al.
(2017) studied affordances, the set of behaviors enabled by a situation, and presented a method for affordance extraction via inner products of pre-trained word embeddings.

¹See also The CIG Competition for General Text-Based Adventure Game-Playing Agents

3 Action Elimination

We now describe a learning algorithm for MDPs with an elimination signal. Our approach builds on the standard RL formulation (Sutton and Barto, 1998). At each time step t, the agent observes a state s_t and chooses a discrete action a_t ∈ {1, .., |A|}. After executing the action, the agent obtains a reward r_t(s_t, a_t) and observes the next state s_{t+1} according to a transition kernel P(s_{t+1}|s_t, a_t). The goal of the algorithm is to learn a policy π(a|s) that maximizes the discounted cumulative return V^π(s) = E^π[Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s], where 0 < γ < 1 is the discount factor and V is the value function. The optimal value function is given by V*(s) = max_π V^π(s) and the optimal policy by π*(s) = arg max_π V^π(s). The Q-function Q^π(s, a) = E^π[Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s, a_0 = a] corresponds to the value of taking action a in state s and continuing according to policy π. The optimal Q-function Q*(s, a) = Q^{π*}(s, a) can be found using the Q-learning algorithm (Watkins and Dayan, 1992), and the optimal policy is given by π*(s) = arg max_a Q*(s, a).
After executing an action, the agent also observes a binary elimination signal e(s, a), which equals 1 if action a may be eliminated in state s; that is, any optimal policy in state s will never choose action a (and 0 otherwise). The elimination signal can help the agent determine which actions not to take, thus aiding in mitigating the problem of large discrete action spaces. We start with the following definitions:
Definition 1.
Valid state-action pairs with respect to an elimination signal are state-action pairs which the elimination process should not eliminate.

As stated before, we assume that the set of valid state-action pairs contains all of the state-action pairs that are a part of some optimal policy, i.e., only strictly suboptimal state-actions can be invalid.
Definition 2. Admissible state-action pairs with respect to an elimination algorithm are state-action pairs which the elimination algorithm does not eliminate.

In the following section, we present the main advantages of action elimination in MDPs with large action spaces. Afterward, we show that under the framework of linear contextual bandits (Chu et al., 2011), probability concentration results (Abbasi-Yadkori, Pal, and Szepesvari, 2011) can be adapted to guarantee that action elimination is correct with high probability. Finally, we prove that Q-learning coupled with action elimination converges.

3.1 Advantages in action elimination

Action elimination allows the agent to overcome some of the main difficulties in large action spaces, namely: Function Approximation and Sample Complexity.
Function Approximation: It is well known that errors in the Q-function estimates may cause the learning algorithm to converge to a suboptimal policy, a phenomenon that becomes more noticeable in environments with large action spaces (Thrun and Schwartz, 1993). Action elimination may mitigate this effect by taking the max operator only over valid actions, thus reducing potential overestimation errors. Another advantage of action elimination is that the Q-estimates need only be accurate for valid actions. The gain is two-fold: first, there is no need to sample invalid actions for the function approximation to converge.
Second, the function approximation can learn a simpler mapping (i.e., only the Q-values of the valid state-action pairs), and therefore may converge faster and to a better solution by ignoring errors from states that are not explored by the Q-learning policy (Hester et al., 2018).
Sample Complexity: The sample complexity of the MDP measures the number of steps, during learning, in which the policy is not ε-optimal (Kakade and others, 2003). Assume that there are A' actions that should be eliminated and are ε-optimal, i.e., their value is at least V*(s) − ε. According to lower bounds by (Lattimore and Hutter, 2012), we need at least ε^{−2}(1 − γ)^{−3} log(1/δ) samples per state-action pair to converge with probability 1 − δ. If, for example, the eliminated action returns no reward and doesn't change the state, the action gap is ε = (1 − γ)V*(s), which translates to V*(s)^{−2}(1 − γ)^{−5} log(1/δ) 'wasted' samples for learning each invalid state-action pair. For large γ, this can lead to a tremendous number of samples (e.g., for γ = 0.99, (1 − γ)^{−5} = 10^{10}). Practically, elimination algorithms can eliminate these actions substantially faster, and can therefore speed up the learning process approximately by a factor of A/A' (such that learning is effectively performed on the valid state-action pairs).

Embedding the elimination signal into the MDP is not trivial. One option is to shape the original reward by adding an elimination penalty, that is, decreasing the rewards when selecting the wrong actions. Reward shaping, however, is tricky to tune, may slow the convergence of the function approximation, and is not sample efficient (irrelevant actions are explored).
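As a quick numerical check of the sample-complexity estimate above, the following sketch reproduces the scaling of 'wasted' samples per invalid state-action pair (V*(s) is normalized to 1 here; the function name and values are purely illustrative):

```python
import math

def wasted_samples(gamma: float, delta: float, v_star: float = 1.0) -> float:
    """Samples 'wasted' per invalid state-action pair, per the text:
    eps^-2 * (1 - gamma)^-3 * log(1/delta), with action gap eps = (1 - gamma) * V*(s)."""
    eps = (1.0 - gamma) * v_star
    return eps ** -2 * (1.0 - gamma) ** -3 * math.log(1.0 / delta)

# The (1 - gamma)^-5 factor dominates: for gamma = 0.99 it alone is ~1e10.
print(f"{(1 - 0.99) ** -5:.3g}")
print(f"{wasted_samples(0.99, 0.1):.3g}")
```

For γ = 0.99 and δ = 0.1 this yields on the order of 10^10 samples per eliminated action, matching the figure quoted in the text.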
Another option is to design a policy that is optimized by interleaved policy gradient updates on the two signals, maximizing the reward and minimizing the elimination signal error. The main difficulty with this approach is that both models are strongly coupled, and each model affects the observations of the other model, such that the convergence of either model is not trivial.
Next, we present a method that decouples the elimination signal from the MDP by using contextual multi-armed bandits. The contextual bandit learns a mapping from states (represented by context vectors x(s)) to the elimination signal e(s, a) that estimates which actions should be eliminated. We start by introducing theoretical results on linear contextual bandits, and most importantly, concentration bounds for contextual bandits that require almost no assumptions on the context distribution. We will later show that under this model we can decouple the action elimination from the learning process in the MDP, allowing us to learn using standard Q-learning while eliminating actions correctly.

3.2 Action elimination with contextual bandits

Let x(s_t) ∈ R^d be the feature representation of state s_t. We assume (realizability) that under this representation there exists a set of parameters θ*_a ∈ R^d such that the elimination signal in state s_t is e_t(s_t, a) = θ*_a^T x(s_t) + η_t, where ‖θ*_a‖_2 ≤ S. η_t is an R-subgaussian random variable with zero mean that models additive noise to the elimination signal. When there is no noise in the elimination signal, then R = 0. Otherwise, as the elimination signal is bounded in [0, 1], it holds that R ≤ 1. We'll also relax our previous assumptions and allow the elimination signal to have values 0 ≤ E[e_t(s_t, a)] ≤ ℓ for any valid action and u ≤ E[e_t(s_t, a)] ≤ 1 for any invalid action, with ℓ < u.
Next, we denote by X_{t,a} (E_{t,a}) the matrix (vector) whose rows (elements) are the observed state representation vectors (elimination signals) in which action a was chosen, up to time t. For example, the i-th row in X_{t,a} is the representation vector of the i-th state on which the action a was chosen. Denote the solution to the regularized linear regression ‖X_{t,a}θ_{t,a} − E_{t,a}‖_2^2 + λ‖θ_{t,a}‖_2^2 (for some λ > 0) by θ̂_{t,a} = V̄^{−1}_{t,a} X^T_{t,a} E_{t,a}, where V̄_{t,a} = λI + X^T_{t,a} X_{t,a}.
Similarly to Theorem 2 in (Abbasi-Yadkori, Pal, and Szepesvari, 2011)², for any state history and with probability of at least 1 − δ, it holds for all t > 0 that

|θ̂^T_{t−1,a} x(s_t) − θ*_a^T x(s_t)| ≤ √(β_t(δ) x(s_t)^T V̄^{−1}_{t−1,a} x(s_t)),

where √β_t(δ) = R√(2 log(det(V̄_{t,a})^{1/2} det(λI)^{−1/2}/δ)) + λ^{1/2}S. If ∀s, ‖x(s)‖_2 ≤ L, then β_t can be bounded by √β_t(δ) ≤ R√(d log((1 + tL²/λ)/δ)) + λ^{1/2}S. Next, we define δ̃ = δ/k and bound this probability for all the actions, i.e.,

Pr{ ∀a, t > 0 : |θ̂^T_{t−1,a} x(s_t) − θ*_a^T x(s_t)| ≤ √(β_t(δ̃) x(s_t)^T V̄^{−1}_{t−1,a} x(s_t)) } ≥ 1 − δ.   (1)

Recall that any valid action a at state s satisfies E[e_t(s, a)] = θ*_a^T x(s_t) ≤ ℓ. Thus, we can eliminate action a at state s_t if

θ̂^T_{t−1,a} x(s_t) − √(β_{t−1}(δ̃) x(s_t)^T V̄^{−1}_{t−1,a} x(s_t)) > ℓ.   (2)

This ensures that with probability 1 − δ we never eliminate any valid action. We emphasize that only the expectation of the elimination signal is linear in the context. The expectation does not have to be binary (while the signal itself is). For example, in conversational agents, if a sentence is not understood by 90% of the humans who hear it, it is still desirable to avoid saying it. We also note that we assume ℓ is known, but in most practical cases, choosing ℓ ≈ 0.5 should suffice. In the current formulation, knowing u is not necessary, though its value will affect the overall performance.

²Our theoretical analysis builds on results from (Abbasi-Yadkori, Pal, and Szepesvari, 2011), which can be extended to include Generalized Linear Models (GLMs). We focus on linear contextual bandits as they enjoy easier implementation and tighter confidence intervals in comparison to GLMs. We will later combine the bandit with feature approximation, which will approximately allow the realizability assumption even for linear bandits.

3.3 Concurrent Learning

We now show how the Q-learning and contextual bandit algorithms can learn simultaneously, resulting in the convergence of both algorithms, i.e., finding an optimal policy and a minimal valid action space. The challenge here is that each learning process affects the state-action distribution of the other. We first define Action Elimination Q-learning.
Definition 3. Action Elimination Q-learning is a Q-learning algorithm which updates only admissible state-action pairs and chooses the best action in the next state from its admissible actions.
We allow the base Q-learning algorithm to be any algorithm that converges to Q* with probability 1 after observing each state-action pair infinitely often.

If the elimination is done based on the concentration bounds of the linear contextual bandits, we can ensure that Action Elimination Q-learning converges, as can be seen in Proposition 1 (see Appendix A for a full proof).
Proposition 1. Assume that all state-action pairs (s, a) are visited infinitely often unless eliminated according to θ̂^T_{t−1,a} x(s) − √(β_{t−1}(δ̃) x(s)^T V̄^{−1}_{t−1,a} x(s)) > ℓ. Then, with a probability of at least 1 − δ, action elimination Q-learning converges to the optimal Q-function for all valid state-action pairs. In addition, actions which should be eliminated are visited at most T_{s,a}(t) ≤ 4β_t/(u − ℓ)² + 1 times.
Notice that when there is no noise in the elimination signal (R = 0), we correctly eliminate actions with probability 1, and invalid actions will be sampled a finite number of times. Otherwise, under very mild assumptions, invalid actions will be sampled a logarithmic number of times.

4 Method

Using raw features like word2vec directly for control results in exhaustive computations. Moreover, raw features are typically not realizable, i.e., the assumption that e_t(s_t, a) = θ*_a^T x(s_t) + η_t does not hold. Thus, we propose learning a set of features φ(s_t) that are realizable, i.e., e(s_t, a) = θ*_a^T φ(s_t), using neural networks (using the activations of the last layer as features). A practical challenge here is that the features must be fixed over time when used by the contextual bandit, while the activations change during optimization.
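The scheme can be pictured with a toy stand-in for the AEN: freeze the hidden layer, treat its activations as φ(s), refit a per-action ridge model on them, and eliminate with the lower confidence bound of Eq. (2). Everything below (the one-layer ReLU "network", the fixed β, the function names) is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x, W1):
    """Stand-in for LastLayerActivations(E(s)): a frozen one-hidden-layer ReLU.
    W1 plays the role of the trained AEN weights between batch refits."""
    return np.maximum(x @ W1, 0.0)

def aen_update(features, signals, lam=1.0):
    """Closed-form ridge solve for one action on the frozen features:
    V_a = lam*I + Phi_a^T Phi_a, b_a = Phi_a^T e_a; returns (theta_hat_a, V_inv_a)."""
    d = features.shape[1]
    V_inv = np.linalg.inv(lam * np.eye(d) + features.T @ features)
    return V_inv @ (features.T @ signals), V_inv

def admissible(f, models, beta=0.5, ell=0.6):
    """A' = {a : theta_hat_a . f - sqrt(beta * f^T V_inv_a f) < ell}:
    keep an action unless its lower confidence bound clears the threshold."""
    keep = []
    for a, (theta, V_inv) in enumerate(models):
        if theta @ f - np.sqrt(beta * f @ V_inv @ f) < ell:
            keep.append(a)
    return keep
```

Feeding it contexts where one action always returns e = 1 quickly shrinks the admissible set to the remaining actions, while an always-valid action (e = 0) is never eliminated.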
We therefore follow a batch-updates framework (Levine et al., 2017; Riquelme, Tucker, and Snoek, 2018), where every few steps we learn a new contextual bandit model that uses the last layer activations of the AEN as features.

Algorithm 1 deep Q-learning with action elimination
Input: ε, β, ℓ, λ, C, L, N
Initialize AEN and DQN with random weights ω, θ respectively, and set target networks E⁻, Q⁻ with a copy of ω, θ
Define φ(s) ← LastLayerActivations(E(s))
Initialize Replay Memory D to capacity N
for t = 1, 2, . . . do
  a_t = ACT(s_t, Q, E⁻, V⁻¹, ε, β, ℓ)
  Execute action a_t and observe {r_t, e_t, s_{t+1}}
  Store transition {s_t, a_t, r_t, e_t, s_{t+1}} in D
  Sample transitions {s_j, a_j, r_j, e_j, s_{j+1}}_{j=1}^m ∈ D
  y_j = Targets(s_{j+1}, r_j, γ, Q⁻, E⁻, V⁻¹, β, ℓ)
  θ = θ − ∇_θ Σ_j (y_j − Q(s_j, a_j; θ))²
  ω = ω − ∇_ω Σ_j (e_j − E(s_j, a_j; ω))²
  If (t mod C) = 0: Q⁻ ← Q
  If (t mod L) = 0: E⁻, V⁻¹ ← AENUpdate(E, λ, D)
end for

function ACT(s, Q, E, V⁻¹, ε, β, ℓ)
  A' ← {a : E(s)_a − √(β φ(s)^T V⁻¹_a φ(s)) < ℓ}
  With probability ε, return Uniform(A')
  Otherwise, return arg max_{a∈A'} Q(s, a)
end function

function TARGETS(s, r, γ, Q, E, V⁻¹, β, ℓ)
  if s is a terminal state then return r end if
  A' ← {a : E(s)_a − √(β φ(s)^T V⁻¹_a φ(s)) < ℓ}
  return r + γ max_{a∈A'} Q(s, a)
end function

function AENUPDATE(E⁻, λ, D)
  for a ∈ A do
    V⁻¹_a = (Σ_{j:a_j=a} φ(s_j)φ(s_j)^T + λI)⁻¹
    b_a = Σ_{j:a_j=a} φ(s_j)^T e_j
    Set LastLayer(E⁻_a) ← V⁻¹_a b_a
  end for
  return E⁻, V⁻¹
end function

Our algorithm presents a hybrid approach for DRL with Action Elimination (AE), by incorporating AE into the well-known DQN algorithm to yield our AE-DQN (Algorithm 1 and Figure 1(b)). AE-DQN trains two networks: a DQN denoted by Q and an AEN denoted by E. The algorithm uses E, and creates a linear contextual bandit model from it every L iterations with the procedure AENUpdate(). This procedure uses the activations of the last hidden layer of E as features, φ(s) ← LastLayerActivations(E(s)), which are then used to create a contextual linear bandit model (V_a = λI + Σ_{j:a_j=a} φ(s_j)φ(s_j)^T, b_a = Σ_{j:a_j=a} φ(s_j)^T e_j). AENUpdate() proceeds by solving this model, and plugging the solution into the target AEN (LastLayer(E⁻_a) ← V⁻¹_a b_a). The contextual linear bandit model (E⁻, V) is then used to eliminate actions (with high probability) via the ACT() and Targets() functions. ACT() follows an ε-greedy mechanism on the admissible actions set A' = {a : E(s)_a − √(β φ(s)^T V⁻¹_a φ(s)) < ℓ}.
If it decides to exploit, then it selects the action with the highest Q-value by taking an arg max on Q-values among A', and if it chooses to explore, then it selects an action uniformly from A'. Targets() estimates the value function by taking the max over Q-values only among admissible actions, hence reducing function approximation errors.
Architectures: The agent uses an Experience Replay (Lin, 1992) to store information about states, transitions, actions, and rewards. In addition, our agent also stores the elimination signal, provided by the environment (Figure 1(b)). The architecture for both the AEN and DQN is an NLP CNN, based on (Kim, 2014). We represent the state as a sequence of words, composed of the game descriptor (Figure 1(a), \"Observation\") and the player's inventory. These are truncated or zero-padded (for simplicity) to a length of 50 (descriptor) + 15 (inventory) words and each word is embedded into continuous vectors using word2vec in R^300. The features of the last four states are then concatenated together such that our final state representations s are in R^78,000. The AEN is trained to minimize the MSE loss, using the elimination signal as a label. We used 100 (500 for DQN) convolutional filters, with three different 1D kernels of length (1, 2, 3) such that the last hidden layer size is 300.³

5 Experimental Results

Grid World Domain: We start with an evaluation of action elimination in a small grid world domain with 9 rooms, where we can carefully analyze the effect of action elimination. In this domain, the agent starts at the center of the grid and needs to navigate to its upper-left corner. On every step, the agent suffers a penalty of (−1), with a terminal reward of 0. Prior to the game, the states are randomly divided into K categories. The environment has 4K navigation actions, 4 for each category, each with a probability to move in a random direction.
If the chosen action belongs to the same category as the state, the action is performed correctly with probability p^T_c = 0.75. Otherwise, it will be performed correctly with probability p^F_c = 0.5. If the action does not fit the state category, the elimination signal equals 1, and if the action and state belong to the same category, then it equals 0. An optimal policy only uses the navigation actions of the same type as the state, as the other actions are clearly suboptimal. We experimented with a vanilla Q-learning agent without action elimination and a tabular version of action elimination Q-learning. Our simulations show that action elimination dramatically improves the results in large action spaces. In addition, we observed that the gain from action elimination increases as the number of categories grows, and as the grid size grows, since the elimination allows the agent to reach the goal earlier. We have also experimented with a random elimination signal and other modifications of the domain. Due to space constraints, we refer the reader to the appendix for figures and a visualization of the domain.
Zork domain: \"This is an open field west of a white house, with a boarded front door. There is a small mailbox here. A rubber mat saying 'Welcome to Zork!' lies by the door\". This is an excerpt from the opening scene provided to a player in \"Zork I: The Great Underground Empire\"; one of the first interactive fiction computer games, created by members of the MIT Dynamic Modeling Group in the late 70s. By exploring the world via interactive text-based dialogue, the players progress in the game. The world of Zork presents a rich environment with a large state and action space (Figure 2). Zork players describe their actions using natural language instructions. For example, in the opening excerpt, an action might be 'open the mailbox' (Figure 1(a)).
Once the player describes his/her action, it is processed by a sophisticated natural language parser. Based on the parser's results, the game presents the outcome of the action. The ultimate goal of Zork is to collect the Twenty Treasures of Zork and install them in the trophy case.

³Our code, the Zork domain, and the implementation of the elimination signal can be found at: https://github.com/TomZahavy/CB_AE_DQN

Figure 2: Left: the world of Zork. Right: subdomains of Zork; the Troll (green) and Egg (blue) Quests. Credit: S. Meretzky, The Strong National Museum of Play. Larger versions in Appendix B.

Finding the treasures requires solving a variety of puzzles, such as navigating complex mazes and performing intricate action sequences. During the game, the player is awarded points for performing deeds that bring them closer to the game's goal (e.g., solving puzzles). Placing all of the treasures into the trophy case generates a total score of 350 points for the player. Points that are generated from the game's scoring system are given to the agent as a reward. Zork presents multiple challenges to the player, like building plans to achieve long-term goals; dealing with random events like troll attacks; and remembering implicit clues, as well as learning the interactions between objects in the game and specific actions. The elimination signal in Zork is given by the Zork environment in two forms: a "wrong parse" flag, and text feedback (e.g., "you cannot take that"). We group these two signals into a single binary signal which we then provide to our learning algorithm. Before we started experimenting in the "Open Zork" domain, i.e., playing in Zork without any modifications to the domain, we evaluated the performance on two subdomains of Zork. These subdomains are inspired by the Zork plot and are referred to as the Egg Quest and the Troll Quest (Figure 2, right, and Appendix B).
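The grouping of the two Zork signals (the "wrong parse" flag and the text feedback) into a single binary elimination label can be sketched as below. The failure phrases are illustrative assumptions, not the exact strings our implementation matches; the real set depends on the Zork parser's responses.

```python
# Illustrative failure phrases; the actual strings come from the Zork parser.
INVALID_PHRASES = ("you can't", "you cannot", "i don't understand")

def elimination_signal(wrong_parse, feedback):
    """Collapse the 'wrong parse' flag and the text feedback into one
    binary label: 1 if the action should be eliminated, 0 otherwise."""
    text = feedback.lower()
    return int(wrong_parse or any(p in text for p in INVALID_PHRASES))
```

This label is exactly what supervises the AEN's MSE loss, so any environment that can flag invalid actions in this way can drive action elimination.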
For these subdomains, we introduced an additional reward signal (in addition to the reward provided by the environment) to guide the agent towards solving specific tasks and to make the results more visible. In addition, a reward of −1 is applied at every time step to encourage the agent to favor short paths. When solving "Open Zork" we only use the environment reward. The optimal time that it takes to solve each quest is 6 in-game timesteps for the Egg Quest, 11 for the Troll Quest, and 350 for "Open Zork". The agent's goal in each subdomain is to maximize its cumulative reward. Each trajectory terminates upon completing the quest or after T steps are taken. We set the discount factor during training to γ = 0.8 but use γ = 1 during evaluation.⁴ We used β = 0.5, ℓ = 0.6 in all the experiments. The results are averaged over 5 random seeds and shown alongside error bars (std/3).

Figure 3: Performance of agents in the Egg Quest. (a) A1, T=100; (b) A2, T=100; (c) A2, T=200.

The Egg Quest: In this quest, the agent's goal is to find and open the jewel-encrusted egg, hidden in a tree in the forest. The agent is awarded 100 points upon successful completion of this task. We experimented with the AE-DQN (blue) agent and a vanilla DQN agent (green) in this quest (Figure 3). The action set in this quest is composed of two subsets: a fixed subset of 9 actions that allow the agent to complete the Egg Quest, such as navigation (south, east, etc.), opening an item, and fighting; and a second subset of N_take "take" actions for possible objects in the game. The "take" actions correspond to taking a single object and include objects that need to be collected to complete quests, as well as

⁴We adopted a common evaluation scheme that is used in the ALE. During training we use γ < 1, but evaluation is performed with γ = 1.
Intuitively, during learning, choosing γ < 1 helps the agent learn, while during evaluation, the sum of cumulative returns (γ = 1) is more interpretable (it is the score in the game).

other irrelevant objects from the game dictionary. We used two versions of this action set: A1 with N_take = 200 and A2 with N_take = 300. Robustness to hyperparameter tuning: We can see that for A1 with T=100 (Figure 3a) and for A2 with T=200 (Figure 3c), both agents solve the task well. However, for A2 with T=100 (Figure 3b), the AE-DQN agent learns considerably faster, implying that action elimination is more robust to hyperparameter settings when there are many actions.

The Troll Quest: In this quest, the agent must find a way to enter the house, grab a lantern and light it, expose the hidden entrance to the underworld, and then find the troll, earning 100 points. The Troll Quest presents a larger problem than the Egg Quest, but smaller than the full Zork domain; it is large enough to gain a useful understanding of our agents' performance. The AE-DQN (blue) and DQN (green) agents use an action set similar to A1, with 200 take actions and 15 necessary actions (215 in total). For comparison, we also included an "optimal elimination" baseline (red) that consists of only 35 actions (15 essential, and 20 relevant take actions). We can see in Figure 4 that AE-DQN significantly outperforms DQN, achieving performance comparable to the "optimal elimination" baseline. In addition, we can see that the improvement of AE-DQN over DQN is more significant in the Troll Quest than in the Egg Quest. This observation is consistent with our tabular experiments.

"Open Zork": Next, we evaluated our agent in the "Open Zork" domain (without hand-crafting reward and termination signals).
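The quest action sets above (a small fixed subset plus N_take "take" actions) can be assembled as in the following sketch. The fixed actions and object names are placeholders standing in for the game-dictionary entries used in the paper, not the actual lists.

```python
# Placeholder fixed actions standing in for the small set needed to solve a quest.
FIXED_ACTIONS = [
    "go north", "go south", "go east", "go west",
    "go up", "go down", "open egg", "climb tree", "fight",
]

def build_action_set(objects, n_take):
    """Fixed subset plus one 'take' action per object, in the style of
    A1 (n_take=200) and A2 (n_take=300); most 'take' targets are
    irrelevant objects, which is what makes elimination useful."""
    return FIXED_ACTIONS + ["take " + obj for obj in objects[:n_take]]
```

Growing n_take enlarges only the irrelevant part of the action set, which is why the gap between AE-DQN and vanilla DQN widens from A1 to A2.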
To compare our results with previous work, we trained our agent for 1M steps: each trajectory terminates after T = 200 steps, and a total of 5000 trajectories were executed.⁵ We used two action sets. A3, the "Minimal Zork" action set, is the minimal set of 131 actions required to solve the game (comparable with the action set used by Kostka et al. (2017)); the actions are taken from a tutorial for solving the game. A4, the "Open Zork" action set, includes 1227 actions (comparable with Fulda et al. (2017)). This set is created from action "templates", composed of {Verb, Object} tuples for all the verbs (19) and objects (62) in the game (e.g., open mailbox). In addition, we include a fixed set of 49 actions of varying length (but not of length 2) that are required to solve the game. Table 1 presents the maximal reward obtained by our AE-DQN agent in each seed, averaged over seeds, while using action sets A3 and A4, showing that our agent achieves state-of-the-art results, outperforming all previous work. In the appendix, we show the learning curves for both AE-DQN and DQN agents. Again, we can see that AE-DQN outperforms DQN, learning faster and achieving more reward.

Figure 4: Results in the Troll Quest.

Table 1: Experimental results in Zork

Method               |A|    Cumulative reward
Kostka et al. 2017   ≈150   13.5
Ours, A3             131    39
Ours, A3, 2M steps   131    44
Fulda et al. 2017    ≈500   8.8
Ours, A4             1227   16
Ours, A4, 2M steps   1227   16

6 Summary

In this work, we proposed AE-DQN, a DRL approach for eliminating actions while performing Q-learning, for solving MDPs with large state and action spaces. We tested our approach on the text-based game Zork, showing that by eliminating actions the size of the action space is reduced, exploration is more effective, and learning is improved.
We provided theoretical guarantees on the convergence of our approach using linear contextual bandits. In future work, we plan to investigate more sophisticated architectures, as well as learning shared representations for elimination and control, which may boost performance on both tasks. In addition, we aim to investigate other mechanisms for action elimination, e.g., eliminating actions based on low Q-values (Even-Dar, Mannor, and Mansour, 2003). Another direction is to generate elimination signals in real-world domains. This can be done by designing a rule-based system for actions that should be eliminated, and then training an AEN to generalize these rules to states that they do not cover. Finally, elimination signals may be provided implicitly, e.g., by human demonstrations of actions that should not be taken.

⁵The same number of steps that was used in previous work on Zork (Fulda et al., 2017; Kostka et al., 2017). For completeness, we also report results for AE-DQN with 2M steps, where learning seemed to converge.

References

Abbasi-Yadkori, Y.; Pal, D.; and Szepesvari, C. 2011. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2312–2320.

Azizzadenesheli, K.; Brunskill, E.; and Anandkumar, A. 2018. Efficient exploration through bayesian deep q-networks. arXiv.

Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research 47:253–279.

Bertsekas, D. P., and Tsitsiklis, J. N. 1995. Neuro-dynamic programming: an overview. In Proceedings of the 34th IEEE Conference on Decision and Control, 560–564. IEEE.

Budzianowski, P.; Ultes, S.; Su, P.-H.; Mrksic, N.; Wen, T.-H.; Casanueva, I.; Rojas-Barahona, L.; and Gasic, M. 2017.
Sub-domain modelling for dialogue management with hierarchical reinforcement\nlearning. arXiv preprint.\n\nChu, W.; Li, L.; Reyzin, L.; and Schapire, R. 2011. Contextual bandits with linear payoff functions.\nIn Proceedings of the Fourteenth International Conference on Arti\ufb01cial Intelligence and Statistics.\n\nC\u00f4t\u00e9, M.-A.; K\u00e1d\u00e1r, \u00c1.; Yuan, X.; Kybartas, B.; Barnes, T.; Fine, E.; Moore, J.; Hausknecht, M.; Asri,\n\nL. E.; Adada, M.; et al. 2018. Textworld: A learning environment for text-based games. arXiv.\n\nDalal, G.; Gilboa, E.; and Mannor, S. 2016. Hierarchical decision making in electricity grid\n\nmanagement. In International Conference on Machine Learning, 2197\u20132206.\n\nDhingra, B.; Li, L.; Li, X.; Gao, J.; Chen, Y.-N.; Ahmed, F.; and Deng, L. 2016. End-to-end\nreinforcement learning of dialogue agents for information access. In Proceedings of the 55th\nAnnual Meeting of the Association for Computational Linguistics.\n\nDulac-Arnold, G.; Denoyer, L.; Preux, P.; and Gallinari, P. 2012. Fast reinforcement learning with\nlarge action sets using error-correcting output codes for mdp factorization. In Joint European\nConference on Machine Learning and Knowledge Discovery in Databases, 180\u2013194. Springer.\n\nDulac-Arnold, G.; Evans, R.; van Hasselt, H.; Sunehag, P.; Lillicrap, T.; Hunt, J.; Mann, T.; Weber,\nT.; Degris, T.; and Coppin, B. 2015. Deep reinforcement learning in large discrete action spaces.\narXiv.\n\nEven-Dar, E.; Mannor, S.; and Mansour, Y. 2003. Action elimination and stopping conditions for\nreinforcement learning. In Proceedings of the 20th International Conference on Machine Learning.\n\nFulda, N.; Ricks, D.; Murdoch, B.; and Wingate, D. 2017. What can you do with a rock? affordance\n\nextraction via word embeddings. arXiv.\n\nGlavic, M.; Fonteneau, R.; and Ernst, D. 2017. Reinforcement learning for electric power system\n\ndecision and control: Past considerations and perspectives. 
IFAC-PapersOnLine 6918–6927.

He, J.; Chen, J.; He, X.; Gao, J.; Li, L.; Deng, L.; and Ostendorf, M. 2015. Deep reinforcement learning with an unbounded action space. CoRR abs/1511.04636.

Hester, T.; Vecerik, M.; Pietquin, O.; Lanctot, M.; Schaul, T.; Piot, B.; Sendonaris, A.; Dulac-Arnold, G.; Osband, I.; Agapiou, J.; et al. 2018. Learning from demonstrations for real world reinforcement learning. AAAI.

Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.

Jaderberg, M.; Mnih, V.; Czarnecki, W. M.; Schaul, T.; Leibo, J. Z.; Silver, D.; and Kavukcuoglu, K. 2016. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397.

Kakade, S. M., et al. 2003. On the sample complexity of reinforcement learning. Ph.D. Dissertation, University of London, England.

Kim, Y. 2014. Convolutional neural networks for sentence classification. arXiv preprint.

Kostka, B.; Kwiecieli, J.; Kowalski, J.; and Rychlikowski, P. 2017. Text-based adventures of the golovin ai agent. In Computational Intelligence and Games (CIG), 2017 IEEE Conference on. IEEE.

Lagoudakis, M. G., and Parr, R. 2003. Reinforcement learning as classification: Leveraging modern classifiers. In Proceedings of the 20th International Conference on Machine Learning (ICML-03).

Lattimore, T., and Hutter, M. 2012. Pac bounds for discounted mdps. In International Conference on Algorithmic Learning Theory.

Levine, N.; Zahavy, T.; Mankowitz, D. J.; Tamar, A.; and Mannor, S. 2017. Shallow updates for deep reinforcement learning. In Advances in Neural Information Processing Systems, 3138–3148.

Li, J.; Monroe, W.; Ritter, A.; Galley, M.; Gao, J.; and Jurafsky, D. 2016. Deep reinforcement learning for dialogue generation. arXiv.

Li, X.; Chen, Y.-N.; Li, L.; and Gao, J. 2017. End-to-end task-completion neural dialogue systems. arXiv preprint.

Lin, L.-J. 1992.
Self-improving reactive agents based on reinforcement learning, planning and\n\nteaching. Machine learning 8(3-4).\n\nLipton, Z. C.; Gao, J.; Li, L.; Chen, J.; and Deng, L. 2016a. Combating reinforcement learning\u2019s\n\nsisyphean curse with intrinsic fear. arXiv.\n\nLipton, Z. C.; Gao, J.; Li, L.; Li, X.; Ahmed, F.; and Deng, L. 2016b. Ef\ufb01cient exploration for\n\ndialogue policy learning with bbq networks and replay buffer spiking. arXiv.\n\nLiu, B.; Tur, G.; Hakkani-Tur, D.; Shah, P.; and Heck, L. 2017. End-to-end optimization of\n\ntask-oriented dialogue model with deep reinforcement learning. arXiv preprint.\n\nMachado, M. C.; Bellemare, M. G.; Talvitie, E.; Veness, J.; Hausknecht, M.; and Bowling, M. 2017.\nRevisiting the arcade learning environment: Evaluation protocols and open problems for general\nagents. arXiv.\n\nMannion, P.; Duggan, J.; and Howley, E. 2016. An experimental review of reinforcement learning\nalgorithms for adaptive traf\ufb01c signal control. In Autonomic Road Transport Support Systems.\nSpringer. 47\u201366.\n\nMikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations\nof words and phrases and their compositionality. In Advances in neural information processing\nsystems.\n\nMnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.;\nRiedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep\nreinforcement learning. Nature 518(7540):529\u2013533.\n\nNarasimhan, K.; Kulkarni, T. D.; and Barzilay, R. 2015. Language understanding for text-based\n\ngames using deep reinforcement learning. CoRR abs/1506.08941.\n\nPazis, J., and Parr, R. 2011. Generalized value functions for large action sets. In Proceedings of the\n\n28th International Conference on Machine Learning (ICML-11), 1185\u20131192.\n\nPeng, B.; Li, X.; Li, L.; Gao, J.; Celikyilmaz, A.; Lee, S.; and Wong, K.-F. 2017. 
Composite task-completion dialogue system via hierarchical deep reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

Riquelme, C.; Tucker, G.; and Snoek, J. 2018. Deep bayesian bandits showdown. International Conference on Learning Representations.

Serban, I. V.; Sankar, C.; Germain, M.; Zhang, S.; Lin, Z.; Subramanian, S.; Kim, T.; Pieper, M.; Chandar, S.; Ke, N. R.; et al. 2017. A deep reinforcement learning chatbot. arXiv preprint.

Su, P.-H.; Gasic, M.; Mrksic, N.; Rojas-Barahona, L.; Ultes, S.; Vandyke, D.; Wen, T.-H.; and Young, S. 2016. Continuously learning neural dialogue management. arXiv preprint.

Sutton, R. S., and Barto, A. G. 1998. Reinforcement learning: An introduction. MIT press Cambridge.

Tesauro, G. 1995. Temporal difference learning and TD-Gammon. Communications of the ACM 58–68.

Thrun, S., and Schwartz, A. 1993. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School.

Van der Pol, E., and Oliehoek, F. A. 2016. Coordinated deep reinforcement learners for traffic light control. In Proceedings of NIPS.

Van Hasselt, H., and Wiering, M. A. 2009. Using continuous action spaces to solve discrete problems. In Neural Networks, 2009. IJCNN 2009. International Joint Conference on, 1149–1156. IEEE.

Watkins, C. J., and Dayan, P. 1992. Q-learning. Machine learning 8(3-4):279–292.

Wen, Z.; O'Neill, D.; and Maei, H. 2015. Optimal demand response using device-based reinforcement learning. IEEE Transactions on Smart Grid 2312–2324.

Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation.
arXiv preprint.

Yuan, X.; Côté, M.-A.; Sordoni, A.; Laroche, R.; Combes, R. T. d.; Hausknecht, M.; and Trischler, A. 2018. Counting to explore and generalize in text-based games. arXiv preprint arXiv:1806.11525.

Zahavy, T.; Ben-Zrihem, N.; and Mannor, S. 2016. Graying the black box: Understanding dqns. In International Conference on Machine Learning, 1899–1908.

Zahavy, T.; Magnani, A.; Krishnan, A.; and Mannor, S. 2018. Is a picture worth a thousand words? a deep multi-modal fusion architecture for product classification in e-commerce. The Thirtieth Conference on Innovative Applications of Artificial Intelligence (IAAI).

Zelinka, M. 2018. Using reinforcement learning to learn how to play text-based games. arXiv preprint.

Zhao, T., and Eskenazi, M. 2016. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. arXiv preprint.

Zrihem, N. B.; Zahavy, T.; and Mannor, S. 2016. Visualizing dynamics: from t-sne to semi-mdps. arXiv preprint arXiv:1606.07112.