{"title": "A Convergence Analysis of Log-Linear Training", "book": "Advances in Neural Information Processing Systems", "page_first": 657, "page_last": 665, "abstract": "Log-linear models are widely used probability models for statistical pattern recognition. Typically, log-linear models are trained according to a convex criterion. In recent years, the interest in log-linear models has greatly increased. The optimization of log-linear model parameters is costly and therefore an important topic, in particular for large-scale applications. Different optimization algorithms have been evaluated empirically in many papers. In this work, we analyze the optimization problem analytically and show that the training of log-linear models can be highly ill-conditioned. We verify our findings on two handwriting tasks. By making use of our convergence analysis, we obtain good results on a large-scale continuous handwriting recognition task with a simple and generic approach.", "full_text": "A Convergence Analysis of Log-Linear Training\n\nSimon Wiesler\nComputer Science Department\nRWTH Aachen University\n52056 Aachen, Germany\nwiesler@cs.rwth-aachen.de\n\nHermann Ney\nComputer Science Department\nRWTH Aachen University\n52056 Aachen, Germany\nney@cs.rwth-aachen.de\n\nAbstract\n\nLog-linear models are widely used probability models for statistical pattern recognition. Typically, log-linear models are trained according to a convex criterion. In recent years, the interest in log-linear models has greatly increased. The optimization of log-linear model parameters is costly and therefore an important topic, in particular for large-scale applications. Different optimization algorithms have been evaluated empirically in many papers. In this work, we analyze the optimization problem analytically and show that the training of log-linear models can be highly ill-conditioned. We verify our findings on two handwriting tasks. 
By making use of our convergence analysis, we obtain good results on a large-scale continuous handwriting recognition task with a simple and generic approach.\n\n1 Introduction\n\nLog-linear models, also known as maximum entropy models or multiclass logistic regression, have found a wide range of applications in machine learning. Special cases of log-linear models include logistic regression for binary class problems and conditional random fields [10] for structured data, in particular sequential data. In recent years, the interest in log-linear models has increased greatly. Different models of log-linear form have been applied to natural language processing tasks, e.g. for segmentation [10], parsing [21], and information extraction [16], and many other tasks.\nThe most frequently mentioned advantages of log-linear models are, first, their discriminative nature, and second, the possibility of using arbitrary and correlated features. Furthermore, the conventional training of log-linear models is a strictly convex optimization problem. Thus, the global optimum of the training criterion is unique and no other local optima exist. Steepest descent and other gradient-based optimization algorithms are guaranteed to converge to the unique global optimum from any initialization. The probabilistic approach of log-linear models is beneficial in many practical applications. For example, log-linear models are directly defined as multiclass models and can be integrated into more complex classifiers.\nFor large datasets, the costs of training log-linear models are very high and limit their application range. Therefore, the efficient optimization of log-linear models is of great interest. The most widely used algorithms for this problem can be divided into three categories. Bound optimization algorithms, such as generalized iterative scaling (GIS) [4] and variants of GIS, have been used in earlier works. 
It was later found by several authors [17, 14, 21] that these algorithms converge very slowly and are inferior to gradient-based optimization algorithms. First-order optimization algorithms require the evaluation of the gradient of the objective function. The simplest algorithm of this category is steepest descent. The more sophisticated conjugate gradient (CG) and L-BFGS methods are now the standard choices for the training of log-linear models. Newton's method converges rapidly in a neighborhood of the optimum. For large-scale problems it is in general not applicable, because it requires the evaluation and storage of the Hessian matrix.\nSo far, a rigorous mathematical analysis of the optimization problem encountered in the training of log-linear models has been missing. From optimization theory it is known that the convergence rate of first-order optimization algorithms depends on the condition number of the Hessian matrix at the optimum.1 The dependence of the convergence behavior on the condition number is strongest for steepest descent. For high condition numbers, steepest descent is useless in practice [3, Chapter 9.3]. It can be shown that more sophisticated gradient-based optimization algorithms such as CG and L-BFGS depend on the condition number as well [18, Chapters 5.1 and 9.1]. In theory, apart from numerical considerations, the convergence behavior of Newton's method is completely independent of the condition number. In practice, it is not, because computing Newton's search direction requires solving a system of linear equations, which is more difficult for problems with a high condition number [3, Chapter 9.5].\nIn this paper, we derive an estimate for the condition number of the objective function used for the training of log-linear models. Our analysis shows that convergence can be accelerated by feature transformations. We verify our analytic results on two classification tasks. 
One is a small digit recognition task, the other a large-scale continuous handwriting recognition task with real-life data. The experiments show that in extreme cases, log-linear training can be so ill-conditioned that a usable model can only be found from a reasonable initialization. On the other hand, when care is taken, we obtain good results with a conceptually simple and generic approach.\nThe remaining paper is structured as follows: In the next section, we introduce the log-linear model and the training criterion. In Section 3, we give an overview of related work. Our novel convergence analysis is presented in Section 4. Experimental results are reported in Section 5. In the last section, we discuss our results.\n\n2 Model Definition and Training Criterion\n\nIn this section, the log-linear model is defined and the necessary notation is introduced. Let X ⊂ Rd denote the observation space and C = {1, . . . , C} a finite set of classes. A log-linear model with parameters Λ = (λ1, . . . , λC) ∈ Rd×C is a model for class-posterior probabilities of the form\n\npΛ(c|x) = exp(λc^T x) / Σ_{c'∈C} exp(λc'^T x) . (1)\n\nA log-linear model induces a decision rule via\n\nr : X → C, x ↦ argmax_{c∈C} pΛ(c|x) = argmax_{c∈C} λc^T x . (2)\n\nThe decision boundaries of log-linear models are linear. Non-linear decision boundaries can be achieved by embedding observations into a higher-dimensional space. The penalized maximum likelihood criterion is regarded as the natural training criterion for log-linear models. Let (xn, cn)_{n=1,...,N} denote the training sample. 
Then the training criterion of log-linear models is an unconstrained optimization problem of the form\n\nΛ̂ = argmin_{Λ∈Rd×C} F(Λ), with F : Rd×C → R, Λ ↦ −(1/N) Σ_{n=1}^N log pΛ(cn|xn) + (α/2) ‖Λ‖₂² . (3)\n\nHere, F is the objective function, and α ≥ 0 the regularization constant. In the following, we refer to the optimization of the parameters of log-linear models as log-linear training.\nThe first and second partial derivatives of the objective function for 1 ≤ c, c̄ ≤ C and 1 ≤ j, j̄ ≤ d are:\n\n∂F/∂λc,j (Λ) = (1/N) Σ_{n=1}^N (pΛ(c|xn) − δ(c, cn)) xn,j + α λc,j , (4)\n\n∂²F/(∂λc,j ∂λc̄,j̄) (Λ) = (1/N) Σ_{n=1}^N pΛ(c|xn)(δ(c, c̄) − pΛ(c̄|xn)) xn,j xn,j̄ + α δ(c, c̄) δ(j, j̄) . (5)\n\nHere, δ denotes the Kronecker delta. It can be shown that the Hessian matrix of F is positive semidefinite, and strictly positive definite for α > 0. Thus, the optimization problem (3) is convex, respectively strictly convex (see e.g. [22]).\n\n1 Recall that the condition number of a positive definite matrix A is the ratio of its largest and its smallest eigenvalues: κ(A) = λmax(A)/λmin(A).\n\n3 Related Work\n\nIn earlier works, e.g. [16, 10], the optimization problem (3) has been solved with generalized iterative scaling (GIS) [4] or improved iterative scaling [10]. Since then, it has been shown in several works that gradient-based optimization algorithms are far superior to iterative scaling methods. Minka [17] showed for logistic regression that iterative scaling methods perform poorly in comparison to conjugate gradient (CG). 
Although Minka performed his experiments only on artificial data with quite low-dimensional features and a small number of observations, other authors came to similar findings. Malouf [14] performed experiments with (multiclass) log-linear models on typical natural language processing tasks. Like Minka, he found that CG outperforms iterative scaling methods. Furthermore, he obtained best results with L-BFGS [12], which today is considered the best algorithm for log-linear training. One of the first applications of CRFs to large-scale problems is by Sha and Pereira [21]. They confirmed again that L-BFGS is superior to CG and far superior to GIS.\nAll of the above-mentioned papers concentrated on the empirical comparison of the performance of various optimization algorithms. The theoretical analysis of the optimization problem is very limited. Salakhutdinov [20] derived a convergence analysis for bound optimization algorithms such as GIS and showed that GIS converges extremely slowly when features are highly correlated and are far from the origin. The disadvantage of Salakhutdinov's analysis is that, for log-linear models, it concerns only GIS, which is now known to perform very badly in practice. The effect of correlation on the difficulty of the optimization problem has been noted by several authors, though not analyzed in detail, e.g. by Minka [17].\nAn interesting connection is the convergence analysis by LeCun et al. for neural network training [11]. Their analysis differs in a number of aspects from our analysis. Interestingly, we come to similar conclusions for the convergence behavior of log-linear training as LeCun et al. for neural network training. A comparison to their work is given in the discussion.\n\n4 Convergence Analysis of Log-Linear Model Training\n\nThis section contains our theoretical result. 
We derive an estimate of the eigenvalues of the Hessian of log-linear training, which determine the convergence behavior of gradient-based optimization algorithms. First, we express the eigenvalues of the Hessian in terms of the eigenvalues of the uncentered covariance matrix. Our new Theorems 1 and 2 give lower and upper bounds for the condition number of the uncentered covariance matrix. The analysis of the case with regularization is based on the analysis of the unregularized case.\n\n4.1 The Unregularized Case\n\nLet Λ* be the limit of the optimization algorithm applied to problem (3) without regularization (α = 0). The Hessian matrix of the objective function at the optimum depends on the posterior probabilities pΛ*(c|x), which are of course unknown. In the following, we consider a simpler problem. We derive the eigenvalues of the Hessian at Λ0 = 0. If the quadratic approximation of F at Λ0 is good, the Hessian does not change strongly from Λ0 to Λ*, and the eigenvalues of HF(Λ0) are close to those of HF(Λ*). This enables us to draw conclusions about the convergence behavior of gradient-based optimization algorithms. The experiments in Section 5 justify our assumption. All experimental results are in accordance with the theoretical results.\nFor Λ0 = 0, the posterior probabilities are uniform, i.e. pΛ0(c|x) = C⁻¹. Hence,\n\n∂²F/(∂λc,j ∂λc̄,j̄) (Λ0) = C⁻¹ (δ(c, c̄) − C⁻¹) (1/N) Σ_{n=1}^N xn,j xn,j̄ . (6)\n\nThe Hessian matrix can be written as a Kronecker product (see e.g. [8]): HF(Λ0) = S ⊗ X. Here, S ∈ RC×C is defined by S = C⁻¹(IC − C⁻¹ 1C), where IC ∈ RC×C is the identity matrix, and 1C ∈ RC×C denotes the matrix where all entries are equal to one. The matrix X ∈ Rd×d is the uncentered covariance matrix: X = (1/N) Σ_{n=1}^N xn xn^T. The eigenvalues of S can be computed easily:\n\nμ1(S) = 0, μ2(S) = . . . = μC(S) = C⁻¹ . (7)\n\nLet 0 ≤ μ1(X) ≤ . . . ≤ μd(X) denote the eigenvalues of X. The eigenvalues of the Kronecker product S ⊗ X are of the form μi(S)μj(X) (see [8, Theorem 4.2.12]). Therefore, the spectrum of the Hessian is determined by the eigenvalues of X:\n\nσ(HF(Λ0)) = {0} ∪ {C⁻¹μ1(X), . . . , C⁻¹μd(X)} . (8)\n\nA difficulty in the analysis of the unregularized case is that the objective function is only convex, but not strictly convex. This is caused by the invariances of log-linear models. For instance, shifting all parameter vectors by a constant does not change the posterior probabilities. In addition, singularities appear as a result of linear dependencies in the features. Thus, one of the eigenvalues of the Hessian at the optimum is zero and the condition number is not defined. Intuitively, the convergence rate should not depend on the eigenvalue zero, since the objective function is constant in the direction of the corresponding eigenvectors. The classic proof of the convergence rate of steepest descent for quadratic functions via the Kantorovich inequality (see [13, p. 218]) can be generalized directly to the singular case. The convergence rate then depends on the ratio of the largest and the smallest non-zero eigenvalue. Because of space constraints, we omit this proof here. An analogous result was shown by Notay [19] for the application of CG to solving systems of linear equations, which is equivalent to the minimization of quadratic functions. 
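The Kronecker structure of the Hessian at Λ0 = 0 is easy to check numerically. The following NumPy sketch (our own illustration, not part of the paper) verifies that the non-zero eigenvalues of S ⊗ X are exactly the eigenvalues of X scaled by C⁻¹, each with multiplicity C − 1:

```python
import numpy as np

# Numerical check of the spectrum formula sigma(HF(Lambda_0)) = {0} u {C^-1 mu_j(X)}:
# at Lambda_0 = 0 the Hessian is the Kronecker product S (x) X.
rng = np.random.default_rng(0)
N, d, C = 500, 4, 3

x = rng.normal(size=(N, d)) + 2.0            # observations, deliberately shifted
X = x.T @ x / N                              # uncentered covariance matrix
S = (np.eye(C) - np.ones((C, C)) / C) / C    # S = C^-1 (I_C - C^-1 1_C)

H = np.kron(S, X)                            # Hessian at Lambda_0 = 0
eig_H = np.sort(np.linalg.eigvalsh(H))
eig_X = np.sort(np.linalg.eigvalsh(X))

# d zero eigenvalues, then each mu_j(X)/C repeated C - 1 times
expected = np.sort(np.concatenate([np.zeros(d), np.repeat(eig_X / C, C - 1)]))
assert np.allclose(eig_H, expected, atol=1e-10)

# ratio of the largest to the smallest non-zero eigenvalue equals kappa(X)
nonzero = eig_H[eig_H > 1e-10]
print(nonzero.max() / nonzero.min(), eig_X.max() / eig_X.min())
```

In particular, the ratio of the largest to the smallest non-zero eigenvalue of the Hessian coincides with that of X, which is the quantity analyzed below.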
All results about the convergence behavior of conjugate gradient extend to the singular case if, instead of the complete spectrum, only the non-zero eigenvalues are considered. Therefore, Notay defines the condition number of a singular matrix as the ratio of its largest eigenvalue and its smallest non-zero eigenvalue. In the following, we adopt this definition of the condition number. The condition number of the Hessian is then:\n\nκ(HF(Λ0)) = κ(X) = μd(X) / min_{i: μi(X)≠0} μi(X) . (9)\n\nIn the following subsection, we analyze the condition number κ(X).\n\n4.2 The Eigenvalues of X\n\nThe dependence of the convergence behavior on the properties of X is in accordance with experimental observations. Other researchers have noted before that the use of correlated features leads to slower convergence [21]. Minka [17] noted that convergence slows down when adding a constant to the features, because this “introduces correlation, in the sense that” X “has significant off-diagonals.” How can we verify these findings formally? The following theorem concerns the case of uncorrelated features. The proof is an application of Weyl's inequalities (see [9, Theorem 4.3.7]).\n\nTheorem 1. Suppose the features xi, 1 ≤ i ≤ d, are uncorrelated with respect to the empirical distribution. Let μi and σi² denote the empirical mean and variance of xi for 1 ≤ i ≤ d. Without loss of generality, we assume that the features are ordered such that σ1² ≤ . . . ≤ σd². Then the condition number of X = (1/N) Σ_{n=1}^N xn xn^T is bounded by\n\nmax{σ1² + ‖μ‖₂², σd² + μd²} / min{σ1² + μ1², σd²} ≤ κ(X) ≤ (σd² + ‖μ‖₂²) / σ1² . (10)\n\nProof of Theorem 1. Since the features are uncorrelated, we have\n\nX = diag(σ1², . . . , σd²) + μμ^T =: A + B . (11)\n\nThe eigenvalues of the sum of two Hermitian matrices can be estimated with Weyl's inequalities. Let λj(M) denote the j-th eigenvalue in ascending order of a Hermitian d×d matrix M. Weyl's inequalities state that for all Hermitian d×d matrices A, B and all j, k:\n\nλ_{j+k−d}(A + B) ≤ λj(A) + λk(B) , (12)\nλ_{j+k−1}(A + B) ≥ λj(A) + λk(B) . (13)\n\nThe eigenvalues of A are the diagonal elements λj(A) = σj². B is a rank-one matrix with the eigenvalues λd(B) = ‖μ‖₂² and λj(B) = 0 for 1 ≤ j ≤ d − 1. The bounds for κ(X) follow from the application of (13) and (12) to the smallest and largest eigenvalue. For instance, the upper bound on the condition number follows from the application of (12) with j = k = d to the largest eigenvalue and (13) with j = k = 1 to the smallest eigenvalue. The proof of the lower bound is analogous. The bound is sharpened by using the fact that every diagonal element of X is an upper bound for the smallest eigenvalue and a lower bound for the largest eigenvalue (see [9, p. 181]).\n\nAnalyzing the general case of correlated and unnormalized features is more difficult. The idea of the following theorem is to regard the off-diagonal entries as a perturbation of the diagonal matrix. This case can be analyzed with Geršgorin's circle theorem [9, Theorem 6.1.1], which states that all eigenvalues lie in circles around the diagonal entries of the matrix.\n\nTheorem 2. Let μi and σi² denote the empirical mean and variance of xi for 1 ≤ i ≤ d and assume that σ1² ≤ . . . ≤ σd². Let\n\nRi = Σ_{j≠i} |Cov(xj, xi)| (14)\n\ndenote the radius of the i-th Geršgorin circle. 
Then, the largest and smallest eigenvalues of X = (1/N) Σ_{n=1}^N xn xn^T are bounded by:\n\nσ1² − R1 ≤ λ1(X) ≤ min{σ1² + μ1², σd² + Rd} , (15)\nmax{σd² + μd², σ1² − R1 + ‖μ‖₂²} ≤ λd(X) ≤ σd² + Rd + ‖μ‖₂² . (16)\n\nThe proof of Theorem 2 is a direct generalization of the proof of Theorem 1. In contrast to Theorem 1, only the bounds for the eigenvalues of A obtained by Geršgorin's theorem are known instead of the exact eigenvalues. For strongly correlated features, the eigenvalues can be distributed almost arbitrarily according to the bounds (15) and (16). For weakly correlated features, the bounds are tighter. In particular, for normalized features and R1 < 1, Theorem 2 implies:\n\n1 ≤ κ(X) ≤ (1 + Rd) / (1 − R1) . (17)\n\nThis shows that the best conditioning of the optimization problem is obtained for uncorrelated and normalized features. Accordingly, our analysis shows that log-linear training can be accelerated by decorrelating the features and normalizing their means and variances, i.e. after whitening of the data.\n\n4.3 The Regularized Case\n\nIn the following, we investigate the regularized training criterion, i.e. the objective function (3) with α > 0. Since the Hessian of the ℓ2-regularization term is a multiple of the identity, the eigenvalues of the regularization term and of the loss term can be added. This has an important consequence. In the unregularized case, all non-zero eigenvalues depend on the eigenvalues of X. 
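Both effects just described can be illustrated numerically. The following sketch (our own illustration under arbitrary dimensions, not from the paper) shows that whitening drives κ(X) to one, and that adding the ℓ2-term shifts the former zero eigenvalues of the Hessian to α:

```python
import numpy as np

# Sketch: effect of whitening on kappa(X), and of the l2-regularizer on the
# smallest eigenvalue of the Hessian at Lambda_0 = 0.
rng = np.random.default_rng(1)
N, d, C, alpha = 1000, 5, 10, 0.01

A = rng.normal(size=(d, d))
x = rng.normal(size=(N, d)) @ A + 5.0        # correlated, shifted features

def kappa(M, tol=1e-10):
    """Ratio of the largest to the smallest non-zero eigenvalue."""
    ev = np.linalg.eigvalsh(M)
    return ev.max() / ev[ev > tol].min()

X_raw = x.T @ x / N                          # uncentered covariance, raw features

# whitening: subtract the mean, decorrelate, normalize the variances
mu = x.mean(axis=0)
cov = np.cov(x, rowvar=False, bias=True)
evals, evecs = np.linalg.eigh(cov)
W = evecs / np.sqrt(evals)                   # whitening transform
z = (x - mu) @ W
X_white = z.T @ z / N                        # identity up to round-off

print(kappa(X_raw), kappa(X_white))          # large vs. ~1

# with regularization, alpha is added to every eigenvalue of the loss Hessian
S = (np.eye(C) - np.ones((C, C)) / C) / C
H_reg = np.kron(S, X_white) + alpha * np.eye(C * d)
ev = np.linalg.eigvalsh(H_reg)
assert np.isclose(ev.min(), alpha)           # former zero eigenvalues become alpha
```

The last assertion anticipates the closed-form condition number of the regularized Hessian derived next.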
In the regularized case, the eigenvalue zero changes to α, which is then the smallest non-zero eigenvalue of the Hessian. Therefore, the condition number depends only on the largest eigenvalue of X:\n\nκ(HF(Λ0)) = (C⁻¹μd(X) + α) / α . (18)\n\nThis shows that for large regularization parameters, the condition number is close to one and convergence is fast. On the other hand, for small regularization parameters, the condition number gets very large, even if X is well-conditioned. At first glance, it seems paradoxical that a small modification of the objective function can change the convergence behavior completely. But for a small regularization constant, the objective function has a very flat optimum instead of being constant in these directions. Finding the exact optimum is indeed very hard. On the other hand, the optimization is dominated by the unregularized part of the objective function. Therefore, the iterates of the optimization algorithm will be close to an optimum of the unregularized objective function. Since the regularization term is only small, the iterates already correspond to good models according to the objective function.\n\n5 Experimental Results\n\nIn this section, we validate the theoretical results on two classification tasks. The first one is the well-known USPS task for handwritten digit recognition. The second task, IAM, is a large-scale continuous handwriting recognition task with real-life data. Our main interest is the large-scale task, since this is a task for which log-linear models are especially useful.\n\nTable 1: Results on the USPS task for different feature transformations and regularization parameters α. 
The columns “separation” and “termination” list the number of passes through the dataset until separation of the training data, respectively until termination of the optimization algorithm.\n\nPreprocessing | αN | Train error (%) | Separation | Termination\nWhitening and mean norm. | 0.0 | 0.0 | 21 | 66\nMean and variance norm. | 0.0 | 0.0 | 61 | 116\nNone | 0.0 | 0.0 | 356 | 513\nNone | 0.01 | 0.03 | - | 731\nNone | 0.1 | 0.43 | - | 358\nNone | 1.0 | 2.08 | - | 174\nNone | 10.0 | 4.29 | - | 100\n\n5.1 Handwritten Digit Recognition\n\nThe USPS dataset2 consists of 7291 training and 2007 test images from ten classes of handwritten digits. We trained a log-linear classifier directly on the whole image with 16×16 pixels.\nWe used the L-BFGS algorithm for optimization, which is considered the best algorithm for log-linear training. For all experiments, we used a backtracking line search and a history length of ten, which is a standard value given in the literature [14, 21]. We stopped the optimization when the relative change in the objective was below ε = 10⁻⁵, i.e.\n\n(F(Λk−1) − F(Λk)) / F(Λk) < ε . (19)\n\nTable 1 contains the results on the USPS task. The results reflect our analysis of the condition number. Without normalizing mean and variance, the optimization problem is not well-conditioned. It requires more than 500 passes through the dataset until the termination criterion is reached. The optimization takes even longer when a very small non-zero regularization constant is used. This is what we expected from analyzing the condition number: the objective function has a very flat optimum, which slows down convergence. On the other hand, for higher regularization parameters, the optimization is much faster. We applied the normalizations only to the unregularized models, because the feature transformations affect the regularization term. 
Therefore, results with regularization are not comparable when feature transformations are applied. The mean and variance normalization reduced the computational costs greatly, from 513 to 116 iterations. The application of the whitening transformation further reduced the number of iterations to 66. Often, the classification error on the training data reaches its minimum before the optimization algorithm terminates, so one might argue that it is not necessary to run the optimization until the termination criterion is reached. The USPS training data is linearly separable, and for all unregularized trainings, a zero classification error on the training set is reached. It turns out that the effect of the feature transformations is even stronger when the number of iterations until the training data is separated is compared (see Table 1).\n\n2 ftp://ftp.kyb.tuebingen.mpg.de/pub/bs/data/\n\n5.2 Handwritten Text Recognition\n\nOur second task is the IAM handwriting database [15]. In contrast to USPS, where single images are classified into a small number of classes, IAM defines a continuous handwriting recognition task with unrestricted vocabulary, and is therefore much harder. The corpus has a predefined subdivision into training, development, and testing folds. The training fold contains lines of handwritten text with 53k words in total. With our feature extraction, this corresponds to 3,592,006 observations. The development and test folds contain 9k and 25k words, respectively. The IAM database is a large-scale learning problem in the sense that it is not feasible to run the optimization until convergence [2], and the test error is strongly influenced by the optimization accuracy.\n\n5.2.1 Baseline Model\n\nFor our baseline model, we use the conventional generative approach of a statistical classifier based on hidden Markov models (HMMs) with Gaussian mixture models (GMMs) as emission probabilities. The generative classifier maps an observation sequence x_1^T = (x1, . . . , xT) ∈ X to a word sequence ŵ_1^N = (ŵ1, . . . , ŵN) ∈ W according to Bayes' rule:\n\nr : X → W, x_1^T ↦ ŵ_1^N = argmax_{w_1^N ∈ W} pθ(w_1^N)^γ pθ(x_1^T | w_1^N) . (20)\n\nThe prior probability pθ(w_1^N) is a smoothed trigram language model trained on the reference of the training data and the three additional text corpora Lancaster-Oslo-Bergen, Brown, and Wellington, as proposed in [1]. The language model scale γ > 0 has been optimized on the development set. The visual model pθ(x_1^T | w_1^N) is defined by an HMM, which is composed of submodels for each character in the word sequence. In total there are 78 characters, which are modeled by five-state left-to-right HMMs, resulting in 390 distinct states plus one state for the whitespace model. The emission probabilities of the HMM are modeled by GMMs with a single shared covariance matrix. The parameters of the visual model are optimized according to the maximum likelihood criterion with the expectation-maximization (EM) algorithm and a splitting procedure. We obtained best results with 25k mixture components in total. We only used basic deslanting and size normalization for feature preprocessing, as is commonly applied in handwriting recognition. An image slice was extracted at every position. Seven features in a sliding window were concatenated and projected to a thirty-dimensional vector by a principal component analysis (PCA). The recognition lexicon consists of the 50k most frequent words in the language model training data. 
The generative baseline system achieves a word error rate (WER) of 32.8% on the development set and 39.4% on the test set, similar to the results of the GMM/HMM baseline systems of [1, 6, 5].\n\n5.2.2 Hybrid LL/HMM Recognition System\n\nThe main component of the visual model of our baseline system is the GMM for the emission probabilities pθ(x|s). Analogous to the use of neural network outputs in [6], we build a hybrid LL/HMM recognition system by deriving the emission probabilities via pΛ(x|s) = pΛ(s|x) p(x) / p(s). The prior probability p(s) can be estimated easily as the relative frequency, and p(x) can be discarded in recognition without changing the maximizing word sequence.\nWe used our baseline system for generating a state alignment, i.e. an assignment of the feature vectors to an HMM state, and then trained a log-linear model on the resulting training sample (xt, st)_{t=1,...,T}, analogous to the setup on USPS. Note that the training of the log-linear model is conceptually exactly the same as for USPS, and our convergence analysis applies.\nOn large-scale tasks such as IAM, it is not practicable to run the optimization until convergence as on USPS. Instead, we assume a limited training budget for all experiments, which allows for performing 200 iterations, and compare the resulting classifiers. This procedure corresponds to the characterization of large-scale learning tasks by Bottou and Bousquet [2].\nThe performance of a linear classifier on a complex task such as IAM is quite limited. Therefore, we used polynomial feature spaces of degree one (d = 30), two (d = 495), and three (d = 5455), corresponding to polynomial kernels. In contrast to USPS, where the classification error on the training data without regularization was zero, on IAM, the state-classification error on the training data ranges from forty to sixty percent. Thus, the impact of regularization on the performance of the classifier is only minor. 
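The dimensions of the polynomial feature spaces quoted above follow from counting monomials of the 30 raw features. A small sketch (our own illustration, not the authors' code) that reproduces these counts and builds such an expansion:

```python
import numpy as np
from itertools import combinations_with_replacement
from math import comb

# The dimension of a polynomial feature space of degree m over d0 raw features
# is the number of monomials of degree 1..m, i.e. C(d0 + m, m) - 1.
def poly_dim(d0, m):
    return comb(d0 + m, m) - 1

assert [poly_dim(30, m) for m in (1, 2, 3)] == [30, 495, 5455]

def poly_expand(x, m):
    """Map a feature vector to all monomials of degree 1..m (no bias term)."""
    feats = [np.prod(c) for deg in range(1, m + 1)
             for c in combinations_with_replacement(x, deg)]
    return np.array(feats)

x = np.arange(1.0, 31.0)            # a dummy 30-dimensional observation
assert poly_expand(x, 2).shape == (495,)
```

For degree three the expansion is still explicit (5455 features), which is why the log-linear model remains a linear classifier in the expanded space.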
In preliminary experiments, we obtained almost no improvements from regularization. Therefore, we report only the results without regularization.\n\n5.2.3 Results\n\nTable 2: Results on the IAM database for polynomial feature spaces of degree m ∈ {1, 2, 3} with different initializations and preprocessings.\n\nm | Preprocessing | Initialization | WER / dev set (%) | WER / test set (%)\n1 | none | zero / random | 49.9 / 68.3 | 60.1 / 75.5\n1 | mean and var. norm. | zero / random | 49.7 / 48.9 | 58.9 / 58.5\n2 | none | zero / random | 32.4 / >100.0 | 40.2 / >100.0\n2 | mean and var. norm. | zero / random | 30.2 / 34.4 | 38.5 / 41.3\n2 | mean and var. norm. | 1st order | 26.8 | 33.1\n2 | whitening and mean norm. | zero / random | 25.1 / 25.9 | 31.6 / 32.3\n3 | mean and var. norm. | 2nd order | 23.0 | 27.4\n\nThe results on the IAM database (see Table 2) are again in accordance with our theoretical analysis. The first-order features are already decorrelated, but without mean and variance normalization, the convergence is slower, resulting in a worse WER on the development and test set. The difference is moderate when the parameters are initialized with zero, corresponding to a uniform distribution. In a next experiment, we initialized all parameters randomly with plus or minus one. This results in a huge degradation for the unnormalized features and – with exactly the same random initialization – has only a minor impact when normalized features are used. The differences are even larger for the second-order experiments. This can be expected, since mean and variance take on more extreme values when the features are squared, and the features are correlated. For the zero initialization, the improvement from mean and variance normalization is only moderate in WER. For the unnormalized features and random initialization, the optimization did not lead to a usable model for recognition at all. 
The fastest convergence and the best results are obtained by applying the whitening transformation to the features. In addition, the influence of the initialization is smallest in this case. Because of the high dimension of the third-order features, the estimation of the whitening transformation itself is already computationally very expensive. Therefore, we only performed a mean and variance normalization of the third-order features, but initialized the models incrementally from first- to second- to third-order features. In this manner, we obtain our best result of 27.4% WER, a drastic improvement over the generative baseline system (39.4% WER).
Our hybrid LL/HMM system outperforms other systems based on HMMs with comparable preprocessing. Bertolami and Bunke [1] obtain 32.9% WER with an ensemble-based HMM approach. Dreuw et al. [5] obtain 30.0% WER with discriminatively trained GMMs and 29.0% WER with an additional discriminative adaptation method. The system of Graves et al. [7], which has a completely different architecture based on recurrent neural networks, outperforms our system with 25.9% WER. The best published result of 21.2% WER on the IAM database is by España-Boquera et al. [6], who use several specialized neural networks for preprocessing.

6 Discussion

In this paper, we presented a novel convergence analysis for the optimization of the parameters of log-linear models. Our main results are, first, that the convergence of gradient-based optimization algorithms depends on the eigenvalues of the uncentered empirical covariance matrix. For this derivation, we assumed that the quadratic term of the objective function behaves similarly at the optimum as at the initialization. Second, we analyzed the eigenvalues of the covariance matrix. According to this analysis, it is important to normalize the means and variances of the features. Best convergence
Best convergence\nbehavior can be expected when, in addition, the features are decorrelated.\nInterestingly, the same result is obtained by LeCun et al. [11] for neural network training, but their\nanalysis differs from ours in a number of aspects. First, LeCun et al. consider a simpler loss function.\nIn contrast to our analysis, they assume that all components of the observations have identical mean\nand variance and that the components are independent. Furthermore, they \ufb01x the ratio of the number\nof model parameters and the number of training observations. The derivation of the spectrum of the\nHessian is then performed in the limit of in\ufb01nite training data, leading to a continuous spectrum.\nThis approach is more suited for the analysis of online learning. In the case of batch learning, the\ntraining data as well as the model size is \ufb01xed.\nWe veri\ufb01ed our \ufb01ndings on two handwriting recognition tasks and found that the theoretical analysis\npredicted the observed convergence behavior very well. On IAM, a real-life dataset for continuous\nhandwriting recognition, our log-linear system outperforms other systems with comparable archi-\ntecture and preprocessing. This is remarkable, because we use a generic and conceptually simple\nmethod, which is simple to implement and allows for reproducing experimental results easily.\nAn interesting point for future work is the use of approximate decorrelation techniques, e.g. by as-\nsuming a structure for the covariance matrix. This will be useful for very high-dimensional features\nfor which the estimation of the whitening transformation is not feasible.\n\n8\n\n\fReferences\n[1] Bertolami, R., Bunke, H.: HMM-based Ensamble Methods for Of\ufb02ine Handwritten Text Line\n\nRecognition. Pattern Recogn. 41, 3452\u20133460 (2008)\n\n[2] Bottou, L., Bousquet, O.: The tradeoffs of large scale learning. In: Advances in Neural Infor-\n\nmation Processing Systems. pp. 
161–168 (2008)
[3] Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press (2004)
[4] Darroch, J., Ratcliff, D.: Generalized Iterative Scaling for Log-Linear Models. Ann. Math. Stat. 43(5), 1470–1480 (1972)
[5] Dreuw, P., Heigold, G., Ney, H.: Confidence- and Margin-Based MMI/MPE Discriminative Training for Off-Line Handwriting Recognition. Int. J. Doc. Anal. Recogn. pp. 1–16 (2011)
[6] España-Boquera, S., Castro-Bleda, M., Gorbe-Moya, J., Zamora-Martinez, F.: Improving Offline Handwritten Text Recognition with Hybrid HMM/ANN Models. IEEE Trans. Pattern Anal. Mach. Intell. 33(4), 767–779 (April 2011)
[7] Graves, A., Liwicki, M., Fernandez, S., Bertolami, R., Bunke, H., Schmidhuber, J.: A Novel Connectionist System for Unconstrained Handwriting Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 31(5), 855–868 (May 2009)
[8] Horn, R., Johnson, C.: Topics in Matrix Analysis. Cambridge University Press (1994)
[9] Horn, R., Johnson, C.: Matrix Analysis. Cambridge University Press (2005)
[10] Lafferty, J., McCallum, A., Pereira, F.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: Proceedings of the 18th International Conference on Machine Learning. pp. 282–289 (2001)
[11] LeCun, Y., Kanter, I., Solla, S.: Second Order Properties of Error Surfaces: Learning Time and Generalization. In: Advances in Neural Information Processing Systems. pp. 918–924. Morgan Kaufmann Publishers Inc. (1990)
[12] Liu, D., Nocedal, J.: On the Limited Memory BFGS Method for Large-Scale Optimization. Math. Program. 45(1), 503–528 (1989)
[13] Luenberger, D., Ye, Y.: Linear and Nonlinear Programming. Springer Verlag (2008)
[14] Malouf, R.: A Comparison of Algorithms for Maximum Entropy Parameter Estimation. In: Proceedings of the Sixth Conference on Natural Language Learning. pp.
49–55 (2002)
[15] Marti, U., Bunke, H.: The IAM-Database: An English Sentence Database for Offline Handwriting Recognition. Int. J. Doc. Anal. Recogn. 5(1), 39–46 (2002)
[16] McCallum, A., Freitag, D., Pereira, F.: Maximum Entropy Markov Models for Information Extraction and Segmentation. In: Proceedings of the 17th International Conference on Machine Learning. pp. 591–598 (2000)
[17] Minka, T.: Algorithms for Maximum-Likelihood Logistic Regression. Tech. rep., Carnegie Mellon University (2001)
[18] Nocedal, J., Wright, S.: Numerical Optimization. Springer (1999)
[19] Notay, Y.: Solving Positive (Semi)Definite Linear Systems by Preconditioned Iterative Methods. In: Preconditioned Conjugate Gradient Methods, Lecture Notes in Mathematics, vol. 1457, pp. 105–125. Springer (1990)
[20] Salakhutdinov, R., Roweis, S., Ghahramani, Z.: On the Convergence of Bound Optimization Algorithms. In: Uncertainty in Artificial Intelligence. vol. 19, pp. 509–516 (2003)
[21] Sha, F., Pereira, F.: Shallow Parsing with Conditional Random Fields. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. pp. 134–141 (2003)
[22] Sutton, C., McCallum, A.: An Introduction to Conditional Random Fields for Relational Learning. In: Getoor, L., Taskar, B. (eds.) Introduction to Statistical Relational Learning. MIT Press (2007)