\lambda > d/2', or '\lambda = d/2 and m = 1'.

(Outline of the Proof) (1) In order to examine the poles of J(z), we divide the parameter space into a union of neighborhoods. Since H(w) is an analytic function, in an arbitrary neighborhood of w_0 that satisfies H(w_0) = 0, we can find a positive definite quadratic form which is smaller than H(w). The positive definite quadratic form satisfies \lambda = d/2 and m = 1. By using Lemma 1 (1), we obtain the first half.

(2) Because Jeffreys' prior is coordinate free, we can study the problem on the parameter space U instead of W in eq. (2). Hence, there exists an analytic function t(x, u) such that, in each local coordinate,

L(x, u) = L(x, g(u)) = t(x, u) u_1^{s_1} \cdots u_d^{s_d}.

For simplicity, we assume that s_i > 0 (i = 1, 2, ..., d). Then

\frac{\partial L}{\partial u_i} = \Bigl( \frac{\partial t}{\partial u_i} u_i + s_i t \Bigr) u_1^{s_1} \cdots u_i^{s_i - 1} \cdots u_d^{s_d}.

By using the blowing-ups u_i = v_1 v_2 \cdots v_i (i = 1, 2, ..., d) and the notation \sigma_p = s_p + s_{p+1} + \cdots + s_d, it is easy to show

\det I(v) \le \prod_{p=1}^{d} v_p^{2d\sigma_p + 2p - 2d - 2},  du = \Bigl( \prod_{p=1}^{d} |v_p|^{d-p} \Bigr) dv.   (3)

By using H(g(u))^z = \prod_{p=1}^{d} v_p^{2\sigma_p z} (up to a positive analytic factor) and Lemma 1 (1), in order to prove the latter half of the theorem, it is sufficient to prove that J(z) has a pole at z = -d/2 of order m = 1. Direct calculation of the integrals in J(z) completes the theorem. (Q.E.D.)

4 Three-Layer Perceptron

In this section, we study the cases when the learner is a three-layer perceptron and the true distribution either is or is not contained in the model. We define the three-layer perceptron p(x, v|w) with M input units, K hidden units, and N output units, where x is an input, v is an output, and w is a parameter.
p(x, v|w) = r(x) \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\Bigl( -\frac{1}{2\sigma^2} \| v - f_K(x, w) \|^2 \Bigr),

f_K(x, w) = \sum_{k=1}^{K} a_k \sigma(b_k \cdot x + c_k),

where w = {(a_k, b_k, c_k); a_k \in R^N, b_k \in R^M, c_k \in R^1}, r(x) is the probability density on the input, and \sigma^2 is the variance of the output (neither r(x) nor \sigma is estimated).

Theorem 3 If the true distribution is represented by the three-layer perceptron with K_0 \le K hidden units, and if a positive prior is employed, then

\lambda \le \frac{1}{2} \{ K_0 (M + N + 1) + (K - K_0) \min(M + 1, N) \}.   (4)

(Outline of Proof) Firstly, we consider the case g(x) = 0. Then,

H(w) = \frac{1}{2\sigma^2} \int \| f_K(x, w) \|^2 r(x) \, dx.

Let a_k = (a_{k1}, ..., a_{kN}) and b_k = (b_{k1}, ..., b_{kM}). Let us consider a blowing-up,

a_{11} = \alpha,  a_{kj} = \alpha a'_{kj}  ((k, j) \ne (1, 1)),  b_k = b'_k,  c_k = c'_k.

Then da db dc = \alpha^{KN-1} d\alpha \, da' db' dc', and there exists an analytic function H_1(a', b', c') such that

H(a, b, c) = \alpha^2 H_1(a', b', c').   (5)

Therefore J(z) has a pole at z = -KN/2. Also, by using another blowing-up,

b_{11} = \alpha,  b_{kj} = \alpha b''_{kj}  ((k, j) \ne (1, 1)),  c_k = \alpha c''_k,  a_k = a''_k,

we have da db dc = \alpha^{(M+1)K-1} d\alpha \, da'' db'' dc'', and there exists an analytic function H_2(a'', b'', c'') such that H(a, b, c) = \alpha^2 H_2(a'', b'', c''), which shows that J(z) has a pole at z = -K(M+1)/2. By combining both results, we obtain \lambda \le (K/2) \min(M + 1, N). Secondly, we prove the general case, 0 < K_0 \le K. Then, by dividing the K hidden units into K_0 units that realize the true regression function and K - K_0 redundant units, and by combining Lemma 1 (2) with the above result, we obtain the theorem. (Q.E.D.)

If the true regression function g(x) is not contained in the learning model, we assume that, for each 0 \le k \le K, there exists a parameter w_0^{(k)} \in W that minimizes the square error

\int \| g(x) - f_k(x, w) \|^2 r(x) \, dx.   (6)

We use the notations E(k) for the minimum value of eq. (6) and \lambda(k) = (1/2)\{ k(M + N + 1) + (K - k) \min(M + 1, N) \}.

Theorem 4 If the true regression function is not contained in the learning model and a positive prior is applied, then

F(n) \le \min_{0 \le k \le K} \Bigl[ \frac{n E(k)}{2\sigma^2} + \lambda(k) \log n \Bigr] + O(1).

(Outline of Proof) This theorem can be shown by the same procedure as in the preceding theorem, applied to eq. (6). (Q.E.D.)
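As a numerical illustration of Theorems 3 and 4, the sketch below tabulates \lambda(k) and minimizes the bound of Theorem 4 over k. All concrete values here (M, N, K, \sigma^2, and the decreasing error sequence E(k)) are hypothetical choices for illustration only; they are not taken from the paper.

```python
import math

# Hypothetical sizes and noise level, chosen only for illustration.
M, N, K = 4, 1, 10     # input, output, and hidden units
sigma2 = 1.0           # output variance sigma^2

def lam(k):
    # lambda(k) = (1/2){k(M+N+1) + (K-k) min(M+1, N)}  (Theorem 4)
    return 0.5 * (k * (M + N + 1) + (K - k) * min(M + 1, N))

# Hypothetical approximation errors E(k), strictly decreasing in k
# (the true regression function is not contained in the model).
E = [2.0 ** (-k) for k in range(K + 1)]

def best_k(n):
    # k minimizing  n E(k) / (2 sigma^2) + lambda(k) log n  (Theorem 4 bound)
    return min(range(K + 1),
               key=lambda k: n * E[k] / (2 * sigma2) + lam(k) * math.log(n))

# lambda(K) recovers the regular-model value d/2 with d = K(M+N+1),
# while lambda(k) < d/2 for every k < K.
print([lam(k) for k in (0, K)])                          # [5.0, 30.0]
# The minimizing k grows with the sample size n.
print([best_k(n) for n in (10, 100, 10_000, 1_000_000)]) # [0, 2, 7, 10]
```

The second line of output shows the behavior discussed after Theorem 4: for small n the bound is minimized by a small effective model, and the optimal k increases toward K as n grows.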
If G(n) has an asymptotic expansion G(n) = \sum_{q=1}^{\infty} a_q f_q(n), where f_q(n) is a decreasing function of n that satisfies f_{q+1}(n) = o(f_q(n)) and f_Q(n) = 1/n, then

G(n) \le \min_{0 \le k \le K} \Bigl[ \frac{E(k)}{2\sigma^2} + \frac{\lambda(k)}{n} \Bigr],

which shows that the generalization error of the layered network is smaller than that of regular statistical models even when the true distribution is not contained in the learning model. It should be emphasized that the optimal k that minimizes G(n) is smaller than the size of the learning model when n is not so large, and that it becomes larger as n increases. This fact shows that a positive prior is useful for generalization but not appropriate for model selection. Under the condition that the true distribution is contained in the parametric model, Jeffreys' prior may enable us to find the true model with higher probability.

Theorem 5 If the true regression function is contained in the three-layer perceptron and Jeffreys' prior is applied, then \lambda = d/2 and m = 1, even if the Fisher metric is degenerate at the true parameter.

(Outline of Proof) For simplicity, we prove the theorem for the case g(x) = 0. The general cases can be proven by the same method. By direct calculation of the Fisher information matrix, there exists an analytic function D(b, c) \ge 0 such that

\det I(w) = \prod_{k=1}^{K} \Bigl( \sum_{p=1}^{N} a_{kp}^2 \Bigr)^{M+1} D(b, c).

By using the blowing-up

a_{11} = \alpha,  a_{kj} = \alpha a'_{kj}  ((k, j) \ne (1, 1)),  b_k = b'_k,  c_k = c'_k,

we obtain H(w) = \alpha^2 H_1(a', b', c') in the same way as eq. (5), \det I(w) \propto \alpha^{2(M+1)K}, and da db dc = \alpha^{NK-1} d\alpha \, da' db' dc'. The integral

J(z) = \int_{|\alpha| < \varepsilon} \alpha^{2z} \alpha^{(M+1)K + NK - 1} \, d\alpha

has a pole at z = -(M + N + 1)K/2. By combining this result with Theorem 3, we obtain Theorem 5. (Q.E.D.)

5 Discussion

In many applications of neural networks, machines that are rather complex compared with the number of training samples are employed.
In such cases, the set of optimal parameters is not one point but an analytic set with singularities, and the set of almost optimal parameters {w ; H(w) < \epsilon} is not an 'ellipsoid'. Hence the Kullback distance cannot be approximated by any quadratic form, nor can the saddle point approximation be used in integration over the parameter space. The zeta function of the Kullback distance clarifies the behavior of the stochastic complexity, and resolution of singularities enables us to calculate the learning efficiency.

6 Conclusion

The relation between algebraic geometry and learning theory is clarified, and two different facts are proven.
(1) If the true distribution is not contained in a hierarchical learning model, then by using a positive prior, the generalization error is made smaller than that of regular statistical models.
(2) If the true distribution is contained in the learning model and if Jeffreys' prior is used, then the average Bayesian factor has the same form as BIC.

Acknowledgments

This research was partially supported by the Ministry of Education, Science, Sports and Culture in Japan, Grant-in-Aid for Scientific Research 12680370.