{"title": "Augmented Rescorla-Wagner and Maximum Likelihood Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 1561, "page_last": 1568, "abstract": null, "full_text": "Augmented Rescorla-Wagner and Maximum Likelihood estimation.\nAlan Yuille Department of Statistics University of California at Los Angeles Los Angeles, CA 90095 yuille@stat.ucla.edu\n\nAbstract\nWe show that linear generalizations of Rescorla-Wagner can perform Maximum Likelihood estimation of the parameters of all generative models for causal reasoning. Our approach involves augmenting variables to deal with conjunctions of causes, similar to the augmented model of Rescorla. Our results involve genericity assumptions on the distributions of causes. If these assumptions are violated, for example for the Cheng causal power theory, then we show that a linear Rescorla-Wagner can estimate the parameters of the model up to a nonlinear transformation. Moreover, a nonlinear Rescorla-Wagner is able to estimate the parameters directly to within arbitrary accuracy. Previous results can be used to determine convergence and to estimate convergence rates.\n\n1\n\nIntroduction\n\nIt is important to understand the relationship between the Rescorla-Wagner (RW) algorithm [1,2] and theories of learning based on maximum likelihood (ML) estimation of the parameters of generative models [3,4,5]. The Rescorla-Wagner algorithm has been shown to account for many experimental findings. But maximum likelihood offers the promise of a sound statistical basis, including the ability to learn sophisticated probabilistic models for causal learning [6,7,8]. Previous work, summarized in section (2), showed a direct relationship between the basic Rescorla-Wagner algorithm and maximum likelihood for the $\\Delta P$ model of causal learning [4,9]. More recently, a generalization of Rescorla-Wagner was shown to perform maximum likelihood estimation for both the $\\Delta P$ and the noisy-or models [10]. 
Throughout the paper, we follow the common practice of studying the convergence of the expected value of the weights and ignoring the fluctuations. The size of these fluctuations can be calculated analytically and the convergence quantified precisely [10]. In this paper, we greatly extend the connections between Rescorla-Wagner and ML estimation. We show that two classes of generalized Rescorla-Wagner algorithms can perform ML estimation for all generative models provided genericity assumptions on the causes are satisfied. These generalizations include augmenting the set of variables to represent conjunctive causes and are related to the augmented Rescorla-Wagner algorithm [2].\n\nWe also analyze the case where the genericity assumption breaks down and pay particular attention to Cheng's causal power model [4,5]. We demonstrate that Rescorla-Wagner can perform ML estimation for this model up to a nonlinear transformation of the model parameters (i.e. Rescorla-Wagner does ML but in a different coordinate system). We sketch how a nonlinear Rescorla-Wagner can estimate the parameters directly. Convergence analysis from previous work [10] can be directly applied to these new Rescorla-Wagner algorithms. This gives convergence conditions and puts bounds on the convergence rate. The analysis assumes that the data consists of i.i.d. samples from the (unknown) causal distribution. But the results can also be applied in the piecewise i.i.d. case (such as forward and backward blocking [11]).\n\n2\n\nSummary of Previous Work\n\nWe summarize previous work relating maximum likelihood estimation of generative models with the Rescorla-Wagner algorithm [4,9,10]. This work assumes that there is a binary-valued event $E$ which can be caused by one or more of two binary-valued causes $C_1, C_2$. 
The $\\Delta P$ and noisy-or theories use generative models of the form:\n\n$$P_{\\Delta P}(E = 1|C_1, C_2, \\omega_1, \\omega_2) = \\omega_1 C_1 + \\omega_2 C_2, \\qquad (1)$$\n$$P_{\\text{noisy-or}}(E = 1|C_1, C_2, \\omega_1, \\omega_2) = \\omega_1 C_1 + \\omega_2 C_2 - \\omega_1 \\omega_2 C_1 C_2, \\qquad (2)$$\n\nwhere $\\{\\omega_1, \\omega_2\\}$ are the model parameters.\n\nThe training data consists of examples $\\{E^\\mu, C_1^\\mu, C_2^\\mu\\}$. The parameters $\\{\\omega_1, \\omega_2\\}$ are estimated by Maximum Likelihood\n\n$$\\{\\omega_1^*, \\omega_2^*\\} = \\arg\\max_{\\{\\omega_1, \\omega_2\\}} \\prod_\\mu P(E^\\mu|C_1^\\mu, C_2^\\mu; \\omega_1, \\omega_2) P(C_1^\\mu, C_2^\\mu), \\qquad (3)$$\n\nwhere $P(C_1, C_2)$ is the distribution on the causes. It is independent of $\\{\\omega_1, \\omega_2\\}$ and does not affect the Maximum Likelihood estimation, except for some non-generic cases to be discussed in section (5).\n\nAn alternative approach to learning causal models is the Rescorla-Wagner algorithm, which updates weights $V_1, V_2$ as follows:\n\n$$V_1^{t+1} = V_1^t + \\Delta V_1^t, \\quad V_2^{t+1} = V_2^t + \\Delta V_2^t, \\qquad (4)$$\n\nwhere the update rule $\\Delta V$ can take forms like:\n\n$$\\Delta V_1 = \\gamma_1 C_1 (E - C_1 V_1 - C_2 V_2), \\quad \\Delta V_2 = \\gamma_2 C_2 (E - C_1 V_1 - C_2 V_2), \\qquad \\text{basic rule} \\quad (5)$$\n$$\\Delta V_1 = \\gamma_1 C_1 (1 - C_2)(E - V_1), \\quad \\Delta V_2 = \\gamma_2 C_2 (1 - C_1)(E - V_2), \\qquad \\text{variant rule} \\quad (6)$$\n\nwith learning-rate coefficients $\\gamma_1, \\gamma_2$.\n\nIt is known that if the basic update rule (5) is used then the weights converge to the ML estimates of the parameters $\\{\\omega_1, \\omega_2\\}$ provided the data is generated by the $\\Delta P$ model (1) [4,9] (but not for the noisy-or model). If the variant update rule (6) is used, then the weights converge to the parameters $\\{\\omega_1, \\omega_2\\}$ of the $\\Delta P$ model or the noisy-or model (2) depending on which model generates the data [10].\n\n3\n\nBasic Ingredients\n\nThis section describes three basic ingredients of this work: (i) the generative models, (ii) maximum likelihood, and (iii) the generalized Rescorla-Wagner algorithms.\n\nRepresenting the generative models. We represent the distribution $P(E|\\vec{C}; \\vec{\\alpha})$ by the function:\n\n$$P(E = 1|\\vec{C}; \\vec{\\alpha}) = \\sum_i \\alpha_i h_i(\\vec{C}), \\qquad (7)$$\n\nwhere the $\\{h_i(\\vec{C})\\}$ are a set of basis functions and the $\\{\\alpha_i\\}$ are parameters. If the dimension of $\\vec{C}$ is $n$, then the number of basis functions is $2^n$. All distributions of binary variables can be represented in this form. 
For example, if $n = 2$ we can use the basis:\n\n$$h_1(\\vec{C}) = 1, \\quad h_2(\\vec{C}) = C_1, \\quad h_3(\\vec{C}) = C_2, \\quad h_4(\\vec{C}) = C_1 C_2. \\qquad (8)$$\n\nThen the noisy-or model $P(E = 1|C_1, C_2) = \\omega_1 C_1 + \\omega_2 C_2 - \\omega_1 \\omega_2 C_1 C_2$ corresponds to setting $\\alpha_1 = 0$, $\\alpha_2 = \\omega_1$, $\\alpha_3 = \\omega_2$, $\\alpha_4 = -\\omega_1 \\omega_2$.\n\nData Generation Assumption and Maximum Likelihood. We assume that the observed data $\\{E^\\mu, \\vec{C}^\\mu\\}$, indexed by $\\mu$, are i.i.d. samples from $P(E|\\vec{C})P(\\vec{C})$. It is possible to adapt our results to cases where the data is piecewise i.i.d., such as blocking experiments, but we have no space to describe this here. Maximum Likelihood (ML) estimates the $\\vec{\\alpha}$ by solving:\n\n$$\\vec{\\alpha}^* = \\arg\\min_{\\vec{\\alpha}} - \\sum_\\mu \\log\\{P(E^\\mu|\\vec{C}^\\mu; \\vec{\\alpha})P(\\vec{C}^\\mu)\\} = \\arg\\min_{\\vec{\\alpha}} - \\sum_\\mu \\log P(E^\\mu|\\vec{C}^\\mu; \\vec{\\alpha}). \\qquad (9)$$\n\nObserve that the estimate of $\\vec{\\alpha}$ is independent of $P(\\vec{C})$ provided the distribution is generic. Important non-generic cases are treated in section (5).\n\nGeneralized Rescorla-Wagner. The Rescorla-Wagner (RW) algorithm updates weights $\\{V_i : i = 1, ..., n\\}$ by a discrete iterative algorithm:\n\n$$V_i^{t+1} = V_i^t + \\Delta V_i^t, \\quad i = 1, ..., n. \\qquad (10)$$\n\nWe assume a generalized form:\n\n$$\\Delta V_i = \\sum_j V_j f_{ij}(\\vec{C}) + E g_i(\\vec{C}), \\quad i, j = 1, ..., n \\qquad (11)$$\n\nfor functions $\\{f_{ij}(\\vec{C})\\}$, $\\{g_i(\\vec{C})\\}$. It is easy to see that equations (5,6) are special cases.\n\n4\n\nTheoretical Results\n\nWe now give sufficient conditions which ensure that the only fixed points of generalized Rescorla-Wagner correspond to ML estimates of the parameters of generative models $P(E|\\vec{C}; \\vec{\\alpha})$. We then obtain two classes of generalized Rescorla-Wagner which satisfy these conditions. For one class, convergence to the fixed points follows directly. For the other class we need to adapt results from [10] to guarantee convergence to the fixed points. Our results assume genericity conditions on the distribution $P(\\vec{C})$ of causes. We relax these conditions in section (5). The number of weights $\\{V_i\\}$ used by the Rescorla-Wagner algorithm is equal to the number of parameters $\\{\\alpha_i\\}$ that specify the model. But many weights will remain zero unless conjunctions of causes occur, see section (6).\n\nTheorem 1. 
A sufficient condition for generalized Rescorla-Wagner (11) to have a unique fixed point at the maximum likelihood estimates of the parameters of a generative model $P(E|\\vec{C}; \\vec{\\alpha})$ (7) is that $\\langle f_{ij}(\\vec{C}) \\rangle_{P(\\vec{C})} = -\\langle g_i(\\vec{C}) h_j(\\vec{C}) \\rangle_{P(\\vec{C})}$ $\\forall i, j$ and the matrix $\\langle f_{ij}(\\vec{C}) \\rangle_{P(\\vec{C})}$ is invertible.\n\nWe use notation that $\\langle \\cdot \\rangle_{P(\\vec{C})}$ is the expectation with respect to the probability distribution $P(\\vec{C})$ on the causes. For example, $\\langle f_{ij}(\\vec{C}) \\rangle_{P(\\vec{C})} = \\sum_{\\vec{C}} P(\\vec{C}) f_{ij}(\\vec{C})$.\n\nProof. We calculate the expectation $\\langle \\Delta V_i \\rangle_{P(E|\\vec{C})P(\\vec{C})}$. This is zero if, and only if, $\\sum_j V_j \\langle f_{ij}(\\vec{C}) \\rangle_{P(\\vec{C})} + \\sum_j \\alpha_j \\langle g_i(\\vec{C}) h_j(\\vec{C}) \\rangle_{P(\\vec{C})} = 0$. The result follows.\n\nThe requirement that the matrix $\\langle f_{ij}(\\vec{C}) \\rangle_{P(\\vec{C})}$ is invertible usually requires that $P(\\vec{C})$ is generic. See examples in sections (4.1,4.2). Convergence may still occur if the matrix $\\langle f_{ij}(\\vec{C}) \\rangle_{P(\\vec{C})}$ is non-invertible. Linear combinations of the weights will remain fixed (in the directions of the zero eigenvectors of the matrix) and the remaining linear combinations will converge. Additional conditions to ensure convergence to the fixed point, and to determine the convergence rate, can be found using Theorems 3,4,5 in [10].\n\n4.1 Generalized RW class I\n\nWe now prove a corollary of Theorem 1 which will enable us to obtain our first class of generalized RW algorithms.\n\nCorollary 1. A sufficient condition for generalized RW to have fixed points at ML estimates of the model parameters is $f_{ij}(\\vec{C}) = -h_i(\\vec{C}) h_j(\\vec{C})$, $g_i(\\vec{C}) = h_i(\\vec{C})$ $\\forall i, j$ and the matrix $\\langle h_i(\\vec{C}) h_j(\\vec{C}) \\rangle_{P(\\vec{C})}$ is invertible. Moreover, convergence to the fixed point is guaranteed.\n\nProof. Direct verification. Convergence to the fixed point follows from the gradient descent nature of the algorithm, see equation (12). 
These conditions define generalized RW class I (GRW-I), which is a natural extension of basic Rescorla-Wagner (5):\n\n$$\\Delta V_i = h_i(\\vec{C})\\{E - \\sum_j h_j(\\vec{C}) V_j\\} = -\\frac{1}{2} \\frac{\\partial}{\\partial V_i} (E - \\sum_j h_j(\\vec{C}) V_j)^2, \\quad i = 1, ..., n. \\qquad (12)$$\n\nThis GRW-I algorithm is guaranteed to converge to the fixed point because it performs stochastic steepest descent. This is essentially the Widrow-Hoff algorithm [12,13].\n\nTo illustrate Corollary 1, we show the relationships between GRW-I and ML for three different generative models: (i) the $\\Delta P$ model, (ii) the noisy-or model, and (iii) the most general form of $P(E|\\vec{C})$ for two causes. It is important to realize that these generative models form a hierarchy and GRW-I algorithms for the later models will also perform ML on the simpler ones.\n\n1. The $\\Delta P$ model. Set $n = 2$, $h_1(\\vec{C}) = C_1$ and $h_2(\\vec{C}) = C_2$. Then equation (12) reduces to the basic RW algorithm (5) with two weights $V_1, V_2$. By Corollary 1, we see that it performs ML estimation for the $\\Delta P$ model (1). This rederives the known relationship between basic RW, ML, and the $\\Delta P$ model [4,9]. Observe that Corollary 1 requires that the matrix\n\n$$\\begin{pmatrix} \\langle C_1 \\rangle_{P(\\vec{C})} & \\langle C_1 C_2 \\rangle_{P(\\vec{C})} \\\\\\\\ \\langle C_1 C_2 \\rangle_{P(\\vec{C})} & \\langle C_2 \\rangle_{P(\\vec{C})} \\end{pmatrix}$$\n\nbe invertible. This is equivalent to the genericity condition $\\langle C_1 C_2 \\rangle^2_{P(\\vec{C})} \\neq \\langle C_1 \\rangle_{P(\\vec{C})} \\langle C_2 \\rangle_{P(\\vec{C})}$.\n\n2. The Noisy-Or model. Set $n = 3$ with $h_1(\\vec{C}) = C_1$, $h_2(\\vec{C}) = C_2$, $h_3(\\vec{C}) = C_1 C_2$. Then Corollary 1 proves that the following algorithm will converge to estimate $V_1 = \\omega_1$, $V_2 = \\omega_2$ and $V_3 = -\\omega_1 \\omega_2$ for the noisy-or model:\n\n$$\\Delta V_1 = C_1 (E - C_1 V_1 - C_2 V_2 - C_1 C_2 V_3) = C_1 (E - V_1 - C_2 V_2 - C_2 V_3),$$\n$$\\Delta V_2 = C_2 (E - C_1 V_1 - C_2 V_2 - C_1 C_2 V_3) = C_2 (E - C_1 V_1 - V_2 - C_1 V_3),$$\n$$\\Delta V_3 = C_1 C_2 (E - C_1 V_1 - C_2 V_2 - C_1 C_2 V_3) = C_1 C_2 (E - V_1 - V_2 - V_3). \\qquad (13)$$\n\nThis algorithm is a minor variant of basic RW. Observe that this has more weights ($n = 3$) than the total number of causes. The first two weights $V_1$ and $V_2$ yield $\\omega_1, \\omega_2$ while the third weight $V_3$ gives a (redundant) estimate of $-\\omega_1 \\omega_2$. 
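The noisy-or case above can be checked numerically. The following is a minimal Python sketch, not the author's code: it iterates the expected value of the GRW-I update (13) over an assumed distribution of causes, in line with the paper's practice of tracking the expected weights and ignoring fluctuations. The cause probabilities, the values w1 = 0.6 and w2 = 0.3, the step size, and all function names are illustrative assumptions.

```python
def noisy_or(c1, c2, w1, w2):
    # P(E = 1 | C1, C2) for the noisy-or model, equation (2)
    return w1 * c1 + w2 * c2 - w1 * w2 * c1 * c2

def expected_grw1(p_causes, p_effect, basis, rate=0.5, steps=5000):
    # Iterate the expected GRW-I update E[h_i (E - sum_j h_j V_j)]
    weights = [0.0] * len(basis)
    for _ in range(steps):
        delta = [0.0] * len(basis)
        for (c1, c2), p_c in p_causes.items():
            h = [f(c1, c2) for f in basis]
            error = p_effect(c1, c2) - sum(hj * vj for hj, vj in zip(h, weights))
            for i, hi in enumerate(h):
                delta[i] += p_c * hi * error
        weights = [v + rate * d for v, d in zip(weights, delta)]
    return weights

# Hypothetical generic distribution over (C1, C2): every combination occurs
p_causes = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.3, (1, 1): 0.3}
# Basis h1 = C1, h2 = C2, h3 = C1 C2 used for the noisy-or model
basis = [lambda c1, c2: c1, lambda c1, c2: c2, lambda c1, c2: c1 * c2]
w1, w2 = 0.6, 0.3
v = expected_grw1(p_causes, lambda c1, c2: noisy_or(c1, c2, w1, w2), basis)
print([round(x, 4) for x in v])
```

Under these assumptions the three weights should approach w1, w2 and -w1 w2, so the third weight carries the redundant conjunction term.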
The matrix $\\langle h_i(\\vec{C}) h_j(\\vec{C}) \\rangle_{P(\\vec{C})}$ has determinant $(\\langle C_1 C_2 \\rangle - \\langle C_1 \\rangle)(\\langle C_1 C_2 \\rangle - \\langle C_2 \\rangle) \\langle C_1 C_2 \\rangle$ and is invertible provided $\\langle C_1 C_2 \\rangle \\neq 0$, $\\langle C_1 C_2 \\rangle \\neq \\langle C_1 \\rangle$ and $\\langle C_1 C_2 \\rangle \\neq \\langle C_2 \\rangle$. This rules out the special case in Cheng's experiments [4,5] where $C_1 = 1$ always, see discussion in section (5). It is known that basic RW is unable to do ML estimation for the noisy-or model if there are only two weights [4,5,9,10]. The difference here is that three weights are used.\n\n3. The general two-cause model. Thirdly, we consider the most general model $P(E|\\vec{C})$ for two causes. This can be written in the form:\n\n$$P(E = 1|C_1, C_2) = \\alpha_1 + \\alpha_2 C_1 + \\alpha_3 C_2 + \\alpha_4 C_1 C_2. \\qquad (14)$$\n\nThis corresponds to $h_1(\\vec{C}) = 1$, $h_2(\\vec{C}) = C_1$, $h_3(\\vec{C}) = C_2$, $h_4(\\vec{C}) = C_1 C_2$. Corollary 1 gives us the most general algorithm:\n\n$$\\Delta V_1 = (E - V_1 - C_1 V_2 - C_2 V_3 - C_1 C_2 V_4),$$\n$$\\Delta V_2 = C_1 (E - V_1 - C_1 V_2 - C_2 V_3 - C_1 C_2 V_4) = C_1 (E - V_1 - V_2 - C_2 V_3 - C_2 V_4),$$\n$$\\Delta V_3 = C_2 (E - V_1 - C_1 V_2 - C_2 V_3 - C_1 C_2 V_4) = C_2 (E - V_1 - C_1 V_2 - V_3 - C_1 V_4),$$\n$$\\Delta V_4 = C_1 C_2 (E - V_1 - C_1 V_2 - C_2 V_3 - C_1 C_2 V_4) = C_1 C_2 (E - V_1 - V_2 - V_3 - V_4).$$\n\nBy Corollary 1, this algorithm will converge to $V_1 = \\alpha_1$, $V_2 = \\alpha_2$, $V_3 = \\alpha_3$, $V_4 = \\alpha_4$, provided the matrix is invertible. The determinant of the matrix $\\langle h_i(\\vec{C}) h_j(\\vec{C}) \\rangle_{P(\\vec{C})}$ is $\\langle C_1 C_2 \\rangle (\\langle C_1 C_2 \\rangle - \\langle C_1 \\rangle)(\\langle C_1 C_2 \\rangle - \\langle C_2 \\rangle)(1 - \\langle C_1 \\rangle - \\langle C_2 \\rangle + \\langle C_1 C_2 \\rangle)$. This will be zero for special cases, for example if $C_1 = 1$ always.\n\nIt is important to realize that the most general GRW-I algorithm will converge if $P(E|\\vec{C})$ is the $\\Delta P$ or the noisy-or model. For $\\Delta P$ it will converge to $V_1 = 0$, $V_2 = \\omega_1$, $V_3 = \\omega_2$, $V_4 = 0$. For noisy-or, it converges to $V_1 = 0$, $V_2 = \\omega_1$, $V_3 = \\omega_2$, $V_4 = -\\omega_1 \\omega_2$. The learning system which implements the GRW-I algorithm will not know a priori whether the data is generated by $\\Delta P$, noisy-or, or the general model for $P(E|C_1, C_2)$. It is therefore better to implement the most general algorithm because this works whatever model generated the data. 
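The hierarchy claim and the determinant formula for the general two-cause model can be illustrated together. The sketch below is an illustration under assumed values, not part of the original analysis: it builds the matrix of expectations of h_i h_j for a hypothetical P(C), compares its determinant with the closed form quoted above, and iterates the expected four-weight update on data from the Delta-P and noisy-or models. All names and numbers are assumptions.

```python
def basis(c1, c2):
    # h1 = 1, h2 = C1, h3 = C2, h4 = C1 C2, the basis of equation (8)
    return [1.0, float(c1), float(c2), float(c1 * c2)]

def gram_matrix(p_causes):
    # Matrix of expectations E[h_i h_j] under P(C)
    gram = [[0.0] * 4 for _ in range(4)]
    for cause, p_c in p_causes.items():
        h = basis(*cause)
        for i in range(4):
            for j in range(4):
                gram[i][j] += p_c * h[i] * h[j]
    return gram

def det(m):
    # Laplace expansion along the first row (adequate for a 4 x 4 matrix)
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def expected_fixed_point(p_causes, p_effect, rate=0.5, steps=20000):
    # Iterate the expected update E[h_i (E - sum_j h_j V_j)]
    v = [0.0] * 4
    for _ in range(steps):
        delta = [0.0] * 4
        for cause, p_c in p_causes.items():
            h = basis(*cause)
            error = p_effect(*cause) - sum(hj * vj for hj, vj in zip(h, v))
            for i in range(4):
                delta[i] += p_c * h[i] * error
        v = [vi + rate * di for vi, di in zip(v, delta)]
    return v

# Hypothetical generic P(C) and model parameters
p_causes = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.4}
w1, w2 = 0.6, 0.3

def delta_p(c1, c2):
    return w1 * c1 + w2 * c2

def noisy_or(c1, c2):
    return w1 * c1 + w2 * c2 - w1 * w2 * c1 * c2

a, b, c = 0.7, 0.6, 0.4  # E[C1], E[C2], E[C1 C2] for p_causes above
det_formula = c * (c - a) * (c - b) * (1 - a - b + c)
det_direct = det(gram_matrix(p_causes))
v_delta_p = expected_fixed_point(p_causes, delta_p)
v_noisy_or = expected_fixed_point(p_causes, noisy_or)
print(det_direct, det_formula)
print([round(x, 4) for x in v_delta_p])
print([round(x, 4) for x in v_noisy_or])
```

With these assumed values the same four-weight learner should recover (0, w1, w2, 0) for Delta-P data and (0, w1, w2, -w1 w2) for noisy-or data, which is the sense in which the general algorithm covers the simpler models.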
Note: other functions $\\{h_i(\\vec{C})\\}$ will lead to different ways to parameterize the probability distribution $P(E|\\vec{C})$. They will correspond to different RW algorithms. But their basic properties will be similar to those discussed in this section.\n\n4.2\n\nGeneralized RW Class II\n\nWe can obtain a second class of generalized RW algorithms which perform ML estimation.\n\nCorollary 2. A sufficient condition for RW to have a unique fixed point at the ML estimate of the generative model $P(E|\\vec{C})$ is that $f_{ij}(\\vec{C}) = -g_i(\\vec{C}) h_j(\\vec{C})$, provided the matrix $\\langle g_i(\\vec{C}) h_j(\\vec{C}) \\rangle_{P(\\vec{C})}$ is invertible.\n\nProof. Direct verification.\n\nCorollary 2 defines GRW-II to be of the form:\n\n$$\\Delta V_i = g_i(\\vec{C})\\{E - \\sum_j h_j(\\vec{C}) V_j\\}. \\qquad (15)$$\n\nWe illustrate GRW-II by applying it to the noisy-or model (2). It gives an algorithm very similar to equation (6). Set $h_1(\\vec{C}) = C_1$, $h_2(\\vec{C}) = C_2$, $h_3(\\vec{C}) = C_1 C_2$ and $g_1(\\vec{C}) = C_1 (1 - C_2)$, $g_2(\\vec{C}) = C_2 (1 - C_1)$, $g_3(\\vec{C}) = C_1 C_2$. Corollary 2 yields the update rule:\n\n$$\\Delta V_1 = C_1 (1 - C_2)\\{E - C_1 V_1 - C_2 V_2 - C_1 C_2 V_3\\} = C_1 (1 - C_2)\\{E - V_1\\},$$\n$$\\Delta V_2 = C_2 (1 - C_1)\\{E - C_1 V_1 - C_2 V_2 - C_1 C_2 V_3\\} = C_2 (1 - C_1)\\{E - V_2\\},$$\n$$\\Delta V_3 = C_1 C_2 \\{E - C_1 V_1 - C_2 V_2 - C_1 C_2 V_3\\} = C_1 C_2 \\{E - V_1 - V_2 - V_3\\}. \\qquad (16)$$\n\nThe matrix $\\langle g_i(\\vec{C}) h_j(\\vec{C}) \\rangle_{P(\\vec{C})}$ has determinant $\\langle C_1 C_2 \\rangle (\\langle C_1 \\rangle - \\langle C_1 C_2 \\rangle)(\\langle C_2 \\rangle - \\langle C_1 C_2 \\rangle)$ and so is invertible for generic $P(\\vec{C})$. The algorithm will converge to weights $V_1 = \\omega_1$, $V_2 = \\omega_2$, $V_3 = -\\omega_1 \\omega_2$. If we change the model to $\\Delta P$, then we get convergence to $V_1 = \\omega_1$, $V_2 = \\omega_2$, $V_3 = 0$.\n\nObserve that the equations (16) are largely decoupled. In particular, the updates for $V_1$ and $V_2$ do not depend on the third weight $V_3$. It is possible to remove the update equation for $V_3$ by setting $g_3(\\vec{C}) = 0$. The remaining update equations for $V_1$ and $V_2$ will converge to $\\omega_1, \\omega_2$ for both the noisy-or and the $\\Delta P$ model. These reduced update equations are identical to those given by equation (6), which were proven to converge to $\\omega_1, \\omega_2$ [10]. 
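The contrast between the reduced GRW-II rule (6) and the basic rule (5) with two weights can be sketched in the same expected-update style. This is an illustrative sketch under assumed values, not the author's implementation: a single assumed step size replaces the two learning-rate coefficients, and the cause distribution, parameters and names are hypothetical.

```python
def expected_weights(p_causes, p_effect, gates, rate=0.5, steps=5000):
    # Iterate the expected update E[g_i(C) (E - C1 V1 - C2 V2)] for two weights
    v1 = v2 = 0.0
    for _ in range(steps):
        d1 = d2 = 0.0
        for (c1, c2), p_c in p_causes.items():
            error = p_effect(c1, c2) - c1 * v1 - c2 * v2
            g1, g2 = gates(c1, c2)
            d1 += p_c * g1 * error
            d2 += p_c * g2 * error
        v1, v2 = v1 + rate * d1, v2 + rate * d2
    return v1, v2

def basic_gates(c1, c2):
    # Basic rule (5): g1 = C1, g2 = C2
    return c1, c2

def variant_gates(c1, c2):
    # Variant rule (6): each weight learns only when its cause is active alone
    return c1 * (1 - c2), c2 * (1 - c1)

# Hypothetical generic P(C) and model parameters
p_causes = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.3, (1, 1): 0.3}
w1, w2 = 0.6, 0.3

def delta_p(c1, c2):
    return w1 * c1 + w2 * c2

def noisy_or(c1, c2):
    return w1 * c1 + w2 * c2 - w1 * w2 * c1 * c2

variant_on_noisy_or = expected_weights(p_causes, noisy_or, variant_gates)
variant_on_delta_p = expected_weights(p_causes, delta_p, variant_gates)
basic_on_delta_p = expected_weights(p_causes, delta_p, basic_gates)
basic_on_noisy_or = expected_weights(p_causes, noisy_or, basic_gates)
print(variant_on_noisy_or, variant_on_delta_p)
print(basic_on_delta_p, basic_on_noisy_or)
```

For this assumed P(C), the variant rule should reach (w1, w2) under both models, while the basic rule reaches (w1, w2) only for Delta-P data and settles at different values for noisy-or data.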
We note that the matrix $\\langle g_i(\\vec{C}) h_j(\\vec{C}) \\rangle_{P(\\vec{C})}$ now has a zero eigenvalue (because $g_3(\\vec{C}) = 0$) but this does not matter because it corresponds to the third weight $V_3$. The matrix remains invertible if we restrict it to $i, j = 1, 2$. A limitation of the GRW-II algorithm of equation (16) is that it only updates the weights $V_1$ and $V_2$ when exactly one cause is active. So it would fail to explain effects such as blocking where both causes are on for part of the stimuli (Dayan, personal communication).\n\n5\n\nNon-generic, coordinate transformations, and non-linear RW\n\nOur results have assumed genericity constraints on the distribution $P(\\vec{C})$ of causes. Violations usually correspond to cases where one cause is always present. We now briefly discuss what happens when these constraints are violated. For simplicity, we concentrate on an important special case. Cheng's PC theory [4,5] uses the noisy-or model for generating the data but cause $C_1$ is a background cause which is on all the time (i.e. $C_1 = 1$ always). This implies that $\\langle C_2 \\rangle = \\langle C_1 C_2 \\rangle$ and so we cannot apply RW algorithms (13), the most general algorithm, or (16) because the matrix determinant will be zero in all three cases. Since $C_1 = 1$ we can drop it as a variable and re-express the noisy-or model as:\n\n$$P(E = 1|\\vec{C}) = \\omega_1 + \\omega_2 (1 - \\omega_1) C_2. \\qquad (17)$$\n\nTheorem 1 shows that we can define generalized RW algorithms to find ML estimates of $\\omega_1$ and $\\omega_2 (1 - \\omega_1)$ (assuming $\\omega_1 \\neq 1$). But, conversely, it is impossible to estimate $\\omega_2$ directly by any linear generalized RW. The problem is simply a matter of different coordinate systems. RW estimates the parameters of the generative model in a different coordinate system than the one used to specify the model. There is a non-linear transformation between the coordinate systems relating $\\{\\omega_1, \\omega_2\\}$ to $\\{\\omega_1, \\omega_2 (1 - \\omega_1)\\}$. So RW can estimate the ML parameters provided we allow for an additional non-linear transformation. 
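The coordinate change for Cheng's setting can be made concrete with a short sketch. It is a hedged illustration, not the paper's algorithm: it assumes the basis h1 = 1, h2 = C2 for the reduced model (17), iterates the expected linear GRW-I update, and then applies the non-linear map V2 / (1 - V1) afterwards. The probability q of the candidate cause, the parameter values and the function name are assumptions.

```python
def expected_linear_rw(q, w1, w2, rate=0.5, steps=5000):
    # Cheng's setting: C1 = 1 always, C2 = 1 with probability q.
    # Basis h1 = 1, h2 = C2; the effect follows equation (17).
    v1 = v2 = 0.0
    for _ in range(steps):
        d1 = d2 = 0.0
        for c2, p_c in ((0, 1.0 - q), (1, q)):
            p_e = w1 + w2 * (1.0 - w1) * c2
            error = p_e - v1 - c2 * v2
            d1 += p_c * error
            d2 += p_c * c2 * error
        v1, v2 = v1 + rate * d1, v2 + rate * d2
    return v1, v2

# Hypothetical background strength w1 and causal power w2
v1, v2 = expected_linear_rw(q=0.5, w1=0.2, w2=0.5)
# Non-linear transformation back to the original coordinates
causal_power = v2 / (1.0 - v1)
print(round(v1, 4), round(v2, 4), round(causal_power, 4))
```

Under these assumptions the linear weights should settle at w1 and w2 (1 - w1), and the transformed quantity should recover w2, provided w1 is not equal to 1.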
From this perspective, the inability of RW to perform ML estimation for Cheng's model is merely an artifact. If we reparameterize the generative model to be $P(E = 1|\\vec{C}) = \\omega_1 + \\hat{\\omega}_2 C_2$, where $\\hat{\\omega}_2 = \\omega_2 (1 - \\omega_1)$, then we can design an RW to estimate $\\{\\omega_1, \\hat{\\omega}_2\\}$.\n\nThe non-linear transformation breaks down if $\\omega_1 = 1$. In this case, the generative model $P(E|\\vec{C})$ becomes independent of $\\omega_2$ and so it is impossible to estimate it.\n\nBut suppose we really want to estimate $\\omega_1$ and $\\omega_2$ directly (for Cheng's model, the value of $\\omega_2$ is the causal power and hence is a meaningful quantity [4,5]). To do this we first define a linear RW to estimate $\\omega_1$ and $\\hat{\\omega}_2 = \\omega_2 (1 - \\omega_1)$. The equations are:\n\n$$V_1^{t+1} = V_1^t + \\gamma_1 \\Delta V_1^t, \\quad V_2^{t+1} = V_2^t + \\gamma_2 \\Delta V_2^t, \\qquad (18)$$\n\nwith $\\langle V_1 \\rangle \\to \\omega_1$ and $\\langle V_2 \\rangle \\to \\hat{\\omega}_2$ for large $t$. The fluctuations (variances) are scaled by the parameters $\\gamma_1, \\gamma_2$ and hence can be made arbitrarily small, see [10]. To estimate $\\omega_2$, we replace the variable $V_2$ by a new variable $V_3 = V_2 / (1 - V_1)$ which is updated by a nonlinear equation ($V_1$ is updated as before):\n\n$$V_3^{t+1} = V_3^t + \\gamma_1 \\frac{V_3^t \\Delta V_1^t}{1 - V_1^t} + \\gamma_2 \\frac{\\Delta V_2^t}{1 - V_1^t}, \\qquad (19)$$\n\nwhere we use $V_3 = V_2 / (1 - V_1)$ to re-express $\\Delta V_1$ and $\\Delta V_2$ in terms of functions of $V_1$ and $V_3$. Provided the fluctuations are small, by controlling the size of the $\\gamma$'s, we can ensure that $V_3$ converges arbitrarily close to $\\hat{\\omega}_2 / (1 - \\omega_1) = \\omega_2$.\n\n6\n\nConclusion\n\nThis paper shows that we can obtain linear generalizations of the Rescorla-Wagner algorithm which can learn the parameters of generative models by Maximum Likelihood. For one class of RW generalizations we have only shown that the fixed points are unique and correspond to ML estimates of the parameters of the generative models. But Theorems 3, 4 and 5 of Yuille (2004) [10] can be applied to determine convergence conditions. Convergence rates can be determined by these theorems provided that the data is generated as i.i.d. samples from the generative model. These theorems can also be used to obtain convergence results for piecewise i.i.d. 
samples, as occur in forward and backward blocking experiments. These generalizations of Rescorla-Wagner require augmenting the number of weight variables. This was already proposed, on experimental grounds, so that new weights get created if causes occur in conjunction [2]. Note that this happens naturally in the algorithms presented ((13), the most general algorithm, and (16)): the conjunction weights remain at zero until we get an event $C_1 C_2 = 1$. It is straightforward to extend the analysis to models with conjunctions of many causes. We conjecture that these generalizations converge to a good approximation to ML estimates if we truncate the conjunction of causes at a fixed order. Finally, many of our results have involved a genericity assumption on the distribution of causes $P(\\vec{C})$. We have argued that when these assumptions are violated, for example in Cheng's experiments, then generalized RW still performs ML estimation, but with a nonlinear transform. Alternatively we have shown how to define a nonlinear RW that estimates the parameters directly.\n\nAcknowledgement\nI acknowledge helpful conversations with Peter Dayan, Rich Shiffrin, and Josh Tenenbaum. I thank Aaron Courville for describing augmented Rescorla-Wagner. I thank the W.M. Keck Foundation for support and NSF grant 0413214.\n\nReferences\n[1] R.A. Rescorla and A.R. Wagner. \"A Theory of Pavlovian Conditioning\". In A.H. Black and W.F. Prokasy, eds. Classical Conditioning II: Current Research and Theory. New York. Appleton-Century-Crofts, pp 64-99. 1972.\n[2] R.A. Rescorla. Journal of Comparative and Physiological Psychology. 79, 307. 1972.\n[3] B. A. Spellman. \"Conditioning Causality\". In D.R. Shanks, K.J. Holyoak, and D.L. Medin, (eds). Causal Learning: The Psychology of Learning and Motivation, Vol. 34. San Diego, California. Academic Press. pp 167-206. 1996.\n[4] P. Cheng. \"From Covariance to Causation: A Causal Power Theory\". Psychological Review, 104, pp 367-405. 1997.\n[5] M. Buehner and P. Cheng. 
\"Causal Induction: The power PC theory versus the Rescorla-Wagner theory\". In Proceedings of the 19th Annual Conference of the Cognitive Science Society. 1997.\n[6] J.B. Tenenbaum and T.L. Griffiths. \"Structure Learning in Human Causal Induction\". Advances in Neural Information Processing Systems 12. MIT Press. 2001.\n[7] D. Danks, T.L. Griffiths, J.B. Tenenbaum. \"Dynamical Causal Learning\". Advances in Neural Information Processing Systems 14. 2003.\n[8] A.C. Courville, N.D. Daw, and D.S. Touretzky. \"Similarity and discrimination in classical conditioning\". NIPS. 2004.\n[9] D. Danks. \"Equilibria of the Rescorla-Wagner Model\". Journal of Mathematical Psychology. Vol. 47, pp 109-121. 2003.\n[10] A.L. Yuille. \"The Rescorla-Wagner algorithm and Maximum Likelihood estimation of causal parameters\". NIPS. 2004.\n[11] P. Dayan and S. Kakade. \"Explaining away in weight space\". In Advances in Neural Information Processing Systems 13. 2001.\n[12] B. Widrow and M.E. Hoff. \"Adaptive Switching Circuits\". 1960 IRE WESCON Conv. Record., Part 4, pp 96-104. 1960.\n[13] A.G. Barto and R.S. Sutton. \"Time-derivative Models of Pavlovian Conditioning\". In Learning and Computational Neuroscience: Foundations of Adaptive Networks. M. Gabriel and J. Moore (eds.). pp 497-537. MIT Press. Cambridge, MA. 1990.\n", "award": [], "sourceid": 2809, "authors": [{"given_name": "Alan", "family_name": "Yuille", "institution": null}]}