{"title": "Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 577, "page_last": 584, "abstract": "On-line handwriting recognition is unusual among sequence labelling tasks in that the underlying generator of the observed data, i.e. the movement of the pen, is recorded directly. However, the raw data can be difficult to interpret because each letter is spread over many pen locations. As a consequence, sophisticated pre-processing is required to obtain inputs suitable for conventional sequence labelling algorithms, such as HMMs. In this paper we describe a system capable of directly transcribing raw on-line handwriting data. The system consists of a recurrent neural network trained for sequence labelling, combined with a probabilistic language model. In experiments on an unconstrained on-line database, we record excellent results using either raw or pre-processed data, well outperforming a benchmark HMM in both cases.", "full_text": "Unconstrained Online Handwriting Recognition with Recurrent Neural Networks\n\nAlex Graves\nTUM, Germany\nalex@idsia.ch\n\nSantiago Fernández\nIDSIA, Switzerland\nsantiago@idsia.ch\n\nMarcus Liwicki\nUniversity of Bern, Switzerland\nliwicki@iam.unibe.ch\n\nHorst Bunke\nUniversity of Bern, Switzerland\nbunke@iam.unibe.ch\n\nJürgen Schmidhuber\nIDSIA, Switzerland and TUM, Germany\njuergen@idsia.ch\n\nAbstract\n\nIn online handwriting recognition the trajectory of the pen is recorded during writing. Although the trajectory provides a compact and complete representation of the written output, it is hard to transcribe directly, because each letter is spread over many pen locations. Most recognition systems therefore employ sophisticated preprocessing techniques to put the inputs into a more localised form. 
However, these techniques require considerable human effort, and are specific to particular languages and alphabets. This paper describes a system capable of directly transcribing raw online handwriting data. The system consists of an advanced recurrent neural network with an output layer designed for sequence labelling, combined with a probabilistic language model. In experiments on an unconstrained online database, we record excellent results using either raw or preprocessed data, well outperforming a state-of-the-art HMM based system in both cases.\n\n1 Introduction\n\nHandwriting recognition is traditionally divided into offline and online recognition. Offline recognition is performed on images of handwritten text. In online handwriting the location of the pen-tip on a surface is recorded at regular intervals, and the task is to map from the sequence of pen positions to the sequence of words.\nAt first sight, it would seem straightforward to label raw online inputs directly. However, the fact that each letter or word is distributed over many pen positions poses a problem for conventional sequence labelling algorithms, which have difficulty processing data with long-range interdependencies. The problem is especially acute for unconstrained handwriting, where the writing style may be cursive, printed or a mix of the two, and the degree of interdependency is therefore difficult to determine in advance. The standard solution is to preprocess the data into a set of localised features. These features typically include geometric properties of the trajectory in the vicinity of every data point, pseudo-offline information from a generated image, and character level shape characteristics [6, 7]. Delayed strokes (such as the crossing of a 't' or the dot of an 'i') require special treatment because they split up the characters and therefore interfere with localisation. 
HMMs [6] and hybrid systems incorporating time-delay neural networks and HMMs [7] are commonly trained with such features.\nThe issue of classifying preprocessed versus raw data has broad relevance to machine learning, and merits further discussion. Using hand crafted features often yields superior results, and in some cases can render classification essentially trivial. However, there are three points to consider in favour of raw data. Firstly, designing an effective preprocessor requires considerable time and expertise. Secondly, hand coded features tend to be more task specific. For example, features designed for English handwriting could not be applied to languages with substantially different alphabets, such as Arabic or Chinese. In contrast, a system trained directly on pen movements could be applied to any alphabet. Thirdly, using raw data allows feature extraction to be built into the classifier, and the whole system to be trained together. For example, convolutional neural networks [10], in which a globally trained hierarchy of network layers is used to extract progressively higher level features, have proved effective at classifying raw images, such as objects in cluttered scenes or isolated handwritten characters [15, 11]. (Note that convolutional nets are less suitable for unconstrained handwriting, because they require the text images to be presegmented into characters [10].)\nIn this paper, we apply a recurrent neural network (RNN) to online handwriting recognition. The RNN architecture is bidirectional Long Short-Term Memory [3], chosen for its ability to process data with long time dependencies. The RNN uses the recently introduced connectionist temporal classification output layer [2], which was specifically designed for labelling unsegmented sequence data. An algorithm is introduced for applying grammatical constraints to the network outputs, thereby providing word level transcriptions. 
Experiments are carried out on the IAM online database [12], which contains forms of unconstrained English text acquired from a whiteboard. The performance of the RNN system using both raw and preprocessed input data is compared to that of an HMM based system using preprocessed data only [13]. To the best of our knowledge, this is the first time whole sentences of unconstrained handwriting have been directly transcribed from raw online data.\nSection 2 describes the network architecture, the output layer and the algorithm for applying grammatical constraints. Section 3 provides experimental results, and conclusions are given in Section 4.\n\n2 Method\n\n2.1 Bidirectional Long Short-Term Memory\n\nOne of the key benefits of RNNs is their ability to make use of previous context. However, for standard RNN architectures, the range of context that can in practice be accessed is limited. The problem is that the influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up exponentially as it cycles around the recurrent connections. This is often referred to as the vanishing gradient problem [4].\nLong Short-Term Memory (LSTM; [5]) is an RNN architecture designed to address the vanishing gradient problem. An LSTM layer consists of multiple recurrently connected subnets, known as memory blocks. Each block contains a set of internal units, known as cells, whose activation is controlled by three multiplicative 'gate' units. The effect of the gates is to allow the cells to store and access information over long periods of time.\nFor many tasks it is useful to have access to future as well as past context. Bidirectional RNNs [14] achieve this by presenting the input data forwards and backwards to two separate hidden layers, both of which are connected to the same output layer. 
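This forward-backward wiring can be sketched with a toy example (our own illustration, using a minimal single-unit tanh RNN rather than the LSTM cells used in the paper; the weights are arbitrary):

```python
import math

def rnn_pass(xs, w_in=0.5, w_rec=0.3):
    # Minimal single-unit tanh RNN: h_t = tanh(w_in * x_t + w_rec * h_{t-1}).
    h, hs = 0.0, []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)
        hs.append(h)
    return hs

def bidirectional_pass(xs):
    # Run one hidden layer over the input forwards and one backwards,
    # then pair the two activations at each time step, so the output
    # layer sees context from both the past and the future.
    fwd = rnn_pass(xs)
    bwd = rnn_pass(xs[::-1])[::-1]  # process reversed input, re-align in time
    return list(zip(fwd, bwd))

ctx = bidirectional_pass([0.1, 0.9, -0.4])
# ctx[1] now depends on inputs both before and after time step 1
```

At every time step the output layer thus receives one activation summarising the past and one summarising the future of the sequence.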
Bidirectional LSTM (BLSTM) [3] combines the above architectures to provide access to long-range, bidirectional context.\n\n2.2 Connectionist Temporal Classification\n\nConnectionist temporal classification (CTC) [2] is an objective function designed for sequence labelling with RNNs. Unlike previous objective functions it does not require pre-segmented training data, or postprocessing to transform the network outputs into labellings. Instead, it trains the network to map directly from input sequences to the conditional probabilities of the possible labellings.\nA CTC output layer contains one more unit than there are elements in the alphabet L of labels for the task. The output activations are normalised with the softmax activation function [1]. At each time step, the first |L| outputs are used to estimate the probabilities of observing the corresponding labels. The extra output estimates the probability of observing a 'blank', or no label. The combined output sequence estimates the joint probability of all possible alignments of the input sequence with all possible labellings. The probability of a particular labelling can then be estimated by summing over the probabilities of all the alignments that correspond to it.\nMore precisely, for an input sequence x of length T, choosing a label (or blank) at every time step according to the probabilities implied by the network outputs defines a probability distribution over the set of length T sequences of labels and blanks. We denote this set L'^T, where L' = L \cup {blank}. To distinguish them from labellings, we refer to the elements of L'^T as paths. Assuming that the label probabilities at each time step are conditionally independent given x, the conditional probability of a path \pi \in L'^T is given by\n\np(\pi|x) = \prod_{t=1}^{T} y^t_{\pi_t},   (1)\n\nwhere y^t_k is the activation of output unit k at time t. Denote the set of sequences of length less than or equal to T on the alphabet L as L^{\leq T}. Paths are then mapped onto labellings l \in L^{\leq T} by an operator B that removes first the repeated labels, then the blanks. For example, both B(a, -, a, b, -) and B(-, a, a, -, -, a, b, b) yield the labelling (a, a, b). Since the paths are mutually exclusive, the conditional probability of a given labelling l \in L^{\leq T} is the sum of the probabilities of all paths corresponding to it:\n\np(l|x) = \sum_{\pi \in B^{-1}(l)} p(\pi|x).   (2)\n\nAlthough a naive calculation of the above sum would be unfeasible, it can be efficiently evaluated with a graph-based algorithm [2], similar to the forward-backward algorithm for HMMs.\nTo allow for blanks in the output paths, for each label sequence l \in L^{\leq T} consider a modified label sequence l', with blanks added to the beginning and the end and inserted between every pair of labels. 
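As an illustrative sketch (ours, not part of the original system), the operator B and the path sum of Eqn. (2) can be written out directly; the brute-force enumeration below is exponential in T and serves only to make the definitions concrete:

```python
from itertools import product

BLANK = '-'

def collapse(path):
    # The operator B: remove repeated labels first, then blanks.
    out = [c for i, c in enumerate(path) if i == 0 or c != path[i - 1]]
    return tuple(c for c in out if c != BLANK)

def labelling_prob(y, labelling):
    # Brute-force Eqn. (2): sum p(pi|x) over every path that B collapses
    # onto the labelling. y[t][k] is the softmax output for label k at
    # time t. Exponential in T -- only for checking small cases.
    labels = list(y[0].keys())
    total = 0.0
    for path in product(labels, repeat=len(y)):
        if collapse(path) == tuple(labelling):
            p = 1.0
            for t, k in enumerate(path):
                p *= y[t][k]
            total += p
    return total

# The two B examples from the text:
assert collapse(('a', BLANK, 'a', 'b', BLANK)) == ('a', 'a', 'b')
assert collapse((BLANK, 'a', 'a', BLANK, BLANK, 'a', 'b', 'b')) == ('a', 'a', 'b')
```

In practice the sum is of course never enumerated; the forward-backward recursion described next evaluates it efficiently.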
The length of l' is therefore |l'| = 2|l| + 1.\nFor a labelling l, define the forward variable \alpha_t(s) as the summed probability of all paths whose length t prefixes are mapped by B onto the length s/2 prefix of l, i.e.\n\n\alpha_t(s) = P(\pi_{1:t} : B(\pi_{1:t}) = l_{1:s/2}, \pi_t = l'_s | x) = \sum_{\pi : B(\pi_{1:t}) = l_{1:s/2}} \prod_{t'=1}^{t} y^{t'}_{\pi_{t'}},   (3)\n\nwhere, for some sequence s, s_{a:b} is the subsequence (s_a, s_{a+1}, ..., s_{b-1}, s_b), and s/2 is rounded down to an integer value.\nThe backward variables \beta_t(s) are defined as the summed probability of all paths whose suffixes starting at t map onto the suffix of l starting at label s/2:\n\n\beta_t(s) = P(\pi_{t+1:T} : B(\pi_{t:T}) = l_{s/2:|l|}, \pi_t = l'_s | x) = \sum_{\pi : B(\pi_{t:T}) = l_{s/2:|l|}} \prod_{t'=t+1}^{T} y^{t'}_{\pi_{t'}}   (4)\n\nBoth the forward and backward variables are calculated recursively [2]. The label sequence probability is given by the sum of the products of the forward and backward variables at any time step:\n\np(l|x) = \sum_{s=1}^{|l'|} \alpha_t(s) \beta_t(s).   (5)\n\nThe objective function for CTC is the negative log probability of the network correctly labelling the entire training set. Let S be a training set, consisting of pairs of input and target sequences (x, z), where target sequence z is at most as long as input sequence x. Then the objective function is:\n\nO^{CTC} = -\sum_{(x,z) \in S} \ln(p(z|x)).   (6)\n\nThe network can be trained with gradient descent by differentiating O^{CTC} with respect to the outputs, then using backpropagation through time to differentiate with respect to the network weights.\nNoting that the same label (or blank) may be repeated several times for a single labelling l, we define the set of positions where label k occurs as lab(l, k) = {s : l'_s = k}, which may be empty. We then set l = z and differentiate (5) with respect to the unnormalised network outputs a^t_k to obtain:\n\n\frac{\partial O^{CTC}}{\partial a^t_k} = -\frac{\partial \ln(p(z|x))}{\partial a^t_k} = y^t_k - \frac{1}{p(z|x)} \sum_{s \in lab(z,k)} \alpha_t(s) \beta_t(s).   (7)\n\nOnce the network is trained, we would ideally label some unknown input sequence x by choosing the most probable labelling l*:\n\nl* = \arg\max_l p(l|x).   (8)\n\nUsing the terminology of HMMs, we refer to the task of finding this labelling as decoding. Unfortunately, we do not know of a tractable decoding algorithm that is guaranteed to give optimal results. However a simple and effective approximation is given by assuming that the most probable path corresponds to the most probable labelling, i.e.\n\nl* \approx B(\arg\max_\pi p(\pi|x)).   (9)\n\n2.3 Integration with an External Grammar\n\nFor some tasks we want to constrain the output labellings according to a predefined grammar. For example, in speech and handwriting recognition, the final transcriptions are usually required to form sequences of dictionary words. In addition it is common practice to use a language model to weight the probabilities of particular sequences of words.\nWe can express these constraints by altering the probabilities in (8) to be conditioned on some probabilistic grammar G, as well as the input sequence x:\n\nl* = \arg\max_l p(l|x, G).   (10)\n\nAbsolute requirements, for example that l contains only dictionary words, can be incorporated by setting the probability of all sequences that fail to meet them to 0.\nAt first sight, conditioning on G seems to contradict a basic assumption of CTC: that the labels are conditionally independent given the input sequences (see Eqn. (1)). 
Since the network attempts to model the probability of the whole labelling at once, there is nothing to stop it from learning inter-label transitions direct from the data, which would then be skewed by the external grammar. However, CTC networks are typically only able to learn local relationships such as commonly occurring pairs or triples of labels. Therefore as long as G focuses on long range label interactions (such as the probability of one word following another when the outputs are letters) it doesn't interfere with the dependencies modelled by CTC.\nThe basic rules of probability tell us that\n\np(l|x, G) = \frac{p(l|x) p(l|G) p(x)}{p(x|G) p(l)},\n\nwhere we have used the fact that x is conditionally independent of G given l. If we assume x is independent of G, this reduces to\n\np(l|x, G) = \frac{p(l|x) p(l|G)}{p(l)}.\n\nThat assumption is in general false, since both the input sequences and the grammar depend on the underlying generator of the data, for example the language being spoken. However it is a reasonable first approximation, and is particularly justifiable in cases where the grammar is created using data other than that from which x was drawn (as is common practice in speech and handwriting recognition, where independent textual corpora are used to generate language models).\nFinally, if we assume that all label sequences are equally probable prior to any knowledge about the input or the grammar, we can drop the p(l) term in the denominator to get\n\nl* = \arg\max_l p(l|x) p(l|G).   (11)\n\nNote that, since the number of possible label sequences is finite (because both L and |l| are finite), assigning equal prior probabilities does not lead to an improper prior.\nWe now describe an algorithm, based on the token passing algorithm for HMMs [16], that allows us to find an approximate solution to (11) for a simple grammar.\nLet G consist of a dictionary D containing W words, and a set of W^2 bigrams p(w|ŵ) 
that define the probability of making a transition from word ŵ to word w. The probability of any labelling that does not form a sequence of dictionary words is 0.\nFor each word w, define the modified word w' as w with blanks added at the beginning and end and between each pair of labels. Therefore |w'| = 2|w| + 1. Define a token tok = (score, history) to be a pair consisting of a real valued score and a history of previously visited words. In fact, each token corresponds to a particular path through the network outputs, and its score is the log probability of that path. The basic idea of the token passing algorithm is to pass along the highest scoring tokens at every word state, then maximise over these to find the highest scoring tokens at the next state. The transition probabilities are used when a token is passed from the last state in one word to the first state in another. The output word sequence is given by the history of the highest scoring end-of-word token at the final time step.\nAt every time step t of the length T output sequence, each segment s of each modified word w' holds a single token tok(w, s, t). This is the highest scoring token reaching that segment at that time. 
In addition we define the input token tok(w, 0, t) to be the highest scoring token arriving at word w at time t, and the output token tok(w, -1, t) to be the highest scoring token leaving word w at time t.\n\n1: Initialisation:\n2: for all words w \in D do\n3:   tok(w, 1, 1) = (ln(y^1_b), (w))\n4:   tok(w, 2, 1) = (ln(y^1_{w_1}), (w))\n5:   if |w| = 1 then\n6:     tok(w, -1, 1) = tok(w, 2, 1)\n7:   else\n8:     tok(w, -1, 1) = (-∞, ())\n9:   tok(w, s, 1) = (-∞, ()) for all other s\n10: Algorithm:\n11: for t = 2 to T do\n12:   sort output tokens tok(w, -1, t-1) by ascending score\n13:   for all words w \in D do\n14:     w* = \arg\max_{ŵ \in D} tok(ŵ, -1, t-1).score + ln(p(w|ŵ))\n15:     tok(w, 0, t).score = tok(w*, -1, t-1).score + ln(p(w|w*))\n16:     tok(w, 0, t).history = tok(w*, -1, t-1).history + w\n17:     for segment s = 1 to |w'| do\n18:       P = {tok(w, s, t-1), tok(w, s-1, t-1)}\n19:       if w'_s ≠ blank and s > 2 and w'_{s-2} ≠ w'_s then\n20:         add tok(w, s-2, t-1) to P\n21:       tok(w, s, t) = token in P with highest score\n22:       tok(w, s, t).score += ln(y^t_{w'_s})\n23:     tok(w, -1, t) = highest scoring of {tok(w, |w'|, t), tok(w, |w'|-1, t)}\n24: Termination:\n25: find output token tok*(w, -1, T) with highest score at time T\n26: output tok*(w, -1, T).history\n\nAlgorithm 1: CTC Token Passing Algorithm\n\nThe algorithm's worst case complexity is O(TW^2), since line 14 requires a potential search through all W words. However, because the output tokens tok(w, -1, t-1) are sorted in order of score, the search can be terminated when a token is reached whose score is less than the current best score with the transition included. 
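As a rough illustration of the segment dynamics in Algorithm 1, here is a hypothetical Python sketch of our own that covers only the within-word token recursion (the stay/advance/skip moves of the inner segment loop), with no dictionary loop, word histories or bigram search:

```python
import math

NEG_INF = float('-inf')
BLANK = '-'

def word_score(word, y):
    # Best log-probability path through the modified word w' (blanks
    # around and between the labels, so |w'| = 2|w| + 1). A token may
    # stay in its segment, advance by one, or skip over a blank when
    # the two neighbouring labels differ.
    w_mod = [BLANK]
    for c in word:
        w_mod += [c, BLANK]
    T, S = len(y), len(w_mod)
    score = [NEG_INF] * S
    # A path may start with the leading blank or the first label.
    score[0] = math.log(y[0][w_mod[0]])
    score[1] = math.log(y[0][w_mod[1]])
    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            best = score[s]                      # stay in segment s
            if s >= 1:
                best = max(best, score[s - 1])   # advance from s-1
            if s >= 2 and w_mod[s] != BLANK and w_mod[s - 2] != w_mod[s]:
                best = max(best, score[s - 2])   # skip the blank
            if best > NEG_INF:
                new[s] = best + math.log(y[t][w_mod[s]])
        score = new
    # A complete path must end in the last label or the trailing blank.
    return max(score[S - 1], score[S - 2])
```

Scoring every dictionary word this way and taking the maximum would correspond to the simplified no-bigram variant; the full algorithm additionally threads word histories and bigram transition scores through the tokens.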
The typical complexity is therefore considerably lower, with a lower bound of O(TW log W) to account for the sort. If no bigrams are used, lines 14-16 can be replaced by a simple search for the highest scoring output token, and the complexity reduces to O(TW).\nNote that this is the same as the complexity of HMM decoding, if the search through bigrams is exhaustive. Much work has gone into developing more efficient decoding techniques (see e.g. [9]), typically by pruning improbable branches from the tree of labellings. Such methods are essential for applications where a rapid response is required, such as real time transcription. In addition, many decoders use more sophisticated language models than simple bigrams. Any HMM decoding algorithm could be applied to CTC outputs in the same way as token passing. However, we have stuck with a relatively basic algorithm since our focus here is on recognition rather than decoding.\n\n3 Experiments\n\nThe experimental task was online handwriting recognition, using the IAM-OnDB handwriting database [12], which is available for public download at http://www.iam.unibe.ch/~fki/iamondb/.\nFor CTC, we record both the character error rate, and the word error rate using Algorithm 1 with a language model and a dictionary. For the HMM system, the word error rate is quoted from the literature [13]. Both the character and word error rate are defined as the total number of insertions, deletions and substitutions in the algorithm's transcription of the test set, divided by the combined length of the target transcriptions in the test set.\nWe compare results using both raw inputs direct from the pen sensor, and a preprocessed input representation designed for HMMs.\n\n3.1 Data and Preprocessing\n\nIAM-OnDB consists of pen trajectories collected from 221 different writers using a 'smart whiteboard' [12]. 
The writers were asked to write forms from the LOB text corpus [8], and the position of their pen was tracked using an infra-red device in the corner of the board. The input data consisted of the x and y pen coordinates, the points in the sequence when individual strokes (i.e. periods when the pen is pressed against the board) end, and the times when successive position measurements were made. Recording errors in the x, y data were corrected by interpolating to fill in for missing readings, and removing steps whose length exceeded a certain threshold.\nIAM-OnDB is divided into a training set, two validation sets, and a test set, containing respectively 5364, 1438, 1518 and 3859 written lines taken from 775, 192, 216 and 544 forms. The data sets contained a total of 3,298,424, 885,964, 1,036,803 and 2,425,242 pen coordinates respectively. For our experiments, each line was used as a separate sequence (meaning that possible dependencies between successive lines were ignored).\nThe character level transcriptions contain 80 distinct target labels (capital letters, lower case letters, numbers, and punctuation). A dictionary consisting of the 20,000 most frequently occurring words in the LOB corpus was used for decoding, along with a bigram language model optimised on the training and validation sets [13]. 5.6% of the words in the test set were not in the dictionary.\nTwo input representations were used. The first contained only the offset of the x, y coordinates from the top left of the line, the time from the beginning of the line, and the marker for the ends of strokes. We refer to this as the raw input representation. The second representation used state-of-the-art preprocessing and feature extraction techniques [13]. We refer to this as the preprocessed input representation. 
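For concreteness, the raw input representation described above might be assembled as in the following sketch (ours; the record layout and units are assumed, not the actual IAM-OnDB file format):

```python
def raw_features(points):
    # points: list of (x, y, time, end_of_stroke) records for one line
    # of handwriting, as produced by the pen sensor.
    x0 = min(p[0] for p in points)   # left edge of the line
    y0 = min(p[1] for p in points)   # top edge of the line
    t0 = points[0][2]                # time of the first measurement
    # Four inputs per time step: x offset, y offset from the top left
    # of the line, elapsed time, and the end-of-stroke marker.
    return [(x - x0, y - y0, t - t0, float(eos))
            for (x, y, t, eos) in points]

feats = raw_features([(105, 42, 0.30, False), (107, 45, 0.31, True)])
```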
Briefly, in order to account for the variance in writing styles, the pen trajectories were normalised with respect to such properties as the slant, skew and width of the letters, and the slope of the line as a whole. Two sets of input features were then extracted, the first consisting of 'online' features, such as pen position, pen speed, line curvature etc., and the second consisting of 'offline' features created from a two dimensional window of the image created by the pen.\n\n3.2 Experimental Setup\n\nThe CTC network used the BLSTM architecture, as described in Section 2.1. The forward and backward hidden layers each contained 100 single cell memory blocks. The input layer was fully connected to the hidden layers, which were fully connected to themselves and the output layer. The output layer contained 81 units (80 characters plus the blank label). For the raw input representation, there were 4 input units and a total of 100,881 weights. For the preprocessed representation, there were 25 inputs and 117,681 weights. tanh was used for the cell activation functions and a logistic sigmoid in the range [0, 1] was used for the gates. For both input representations, the data was normalised so that each input had mean 0 and standard deviation 1 on the training set. The network was trained with online gradient descent, using a learning rate of 10^-4 and a momentum of 0.9. Training was stopped after no improvement was recorded on the validation set for 50 training epochs.\nThe HMM setup [13] contained a separate, left-to-right HMM with 8 states for each character (8 × 81 = 648 states in total). Diagonal mixtures of 32 Gaussians were used to estimate the observation probabilities. All parameters, including the word insertion penalty and the grammar scale factor, were optimised on the validation set.\n\nTable 1: Word Error Rate (WER) on IAM-OnDB. LM = language model. CTC results are a mean over 4 runs, ± standard error. All differences were significant (p < 0.01).\n\nSystem  Input         LM   WER\nHMM     preprocessed  yes  35.5% [13]\nCTC     raw           no   30.1 ± 0.5%\nCTC     preprocessed  no   26.0 ± 0.3%\nCTC     raw           yes  22.8 ± 0.2%\nCTC     preprocessed  yes  20.4 ± 0.3%\n\n3.3 Results\n\nThe character error rate for the CTC network with the preprocessed inputs was 11.5 ± 0.05%. From Table 1 we can see that with a dictionary and a language model this translates into a mean word error rate of 20.4%, which is a relative error reduction of 42.5% compared to the HMM. Without the language model, the error reduction was 26.8%. With the raw input data CTC achieved a character error rate of 13.9 ± 0.1%, and word error rates that were close to those recorded with the preprocessed data, particularly when the language model was present.\nThe key difference between the input representations is that the raw data is less localised, and therefore requires more use of context. A useful indication of the network's sensitivity to context is provided by the derivatives of the output y^t_k at a particular point t in the data sequence with respect to the inputs x^{t'} at all points 1 ≤ t' ≤ T. We refer to these derivatives as the sequential Jacobian. Looking at the relative magnitude of the sequential Jacobian over time gives an idea of the range of context used, as illustrated in Figure 1.\n\n4 Conclusion\n\nWe have combined a BLSTM CTC network with a probabilistic language model. We have applied this system to an online handwriting database and obtained results that substantially improve on a state-of-the-art HMM based system. We have also shown that the network's performance with raw sensor inputs is comparable to that with sophisticated preprocessing. 
As far as we are aware, our system is the first to successfully recognise unconstrained online handwriting using raw inputs only.\n\nAcknowledgments\n\nThis research was funded by EC Sixth Framework project \u201cNanoBioTact\u201d, SNF grant 200021-111968/1, and the SNF program \u201cInteractive Multimodal Information Management (IM)2\u201d.\n\nFigure 1: Sequential Jacobian for an excerpt from the IAM-OnDB, with raw inputs (left) and preprocessed inputs (right). For ease of visualisation, only the derivative with highest absolute value is plotted at each time step. The reconstructed image was created by plotting the pen coordinates recorded by the sensor. The individual strokes are alternately coloured red and black. For both representations, the Jacobian is plotted for the output corresponding to the label 'i' at the point when 'i' is emitted (indicated by the vertical dashed lines). Because bidirectional networks were used, the range of sensitivity extends in both directions from the dashed line. For the preprocessed data, the Jacobian is sharply peaked around the time when the output is emitted. For the raw data it is more spread out, suggesting that the network makes more use of long-range context. Note the spike in sensitivity at the very end of the raw input sequence: this corresponds to the delayed dot of the 'i'.\n\nReferences\n\n[1] J. S. Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fogleman-Soulie and J. Herault, editors, Neurocomputing: Algorithms, Architectures and Applications, pages 227-236. Springer-Verlag, 1990.\n[2] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proc. 23rd Int. Conf. on Machine Learning, Pittsburgh, USA, 2006.\n[3] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602-610, June/July 2005.\n[4] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.\n[5] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Comp., 9(8):1735-1780, 1997.\n[6] J. Hu, S. G. Lim, and M. K. Brown. Writer independent on-line handwriting recognition using an HMM approach. Pattern Recognition, 33:133-147, 2000.\n[7] S. Jaeger, S. Manke, J. Reichert, and A. Waibel. On-line handwriting recognition: the NPen++ recognizer. Int. Journal on Document Analysis and Recognition, 3:169-180, 2001.\n[8] S. Johansson, R. Atwell, R. Garside, and G. Leech. The tagged LOB corpus user's manual. Norwegian Computing Centre for the Humanities, 1986.\n[9] P. Lamere, P. Kwok, W. Walker, E. Gouvea, R. Singh, B. Raj, and P. Wolf. Design of the CMU Sphinx-4 decoder. In Proc. 8th European Conf. on Speech Communication and Technology, Aug. 2003.\n[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278-2324, Nov. 1998.\n[11] Y. LeCun, F. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proc. of CVPR'04. IEEE Press, 2004.\n[12] M. Liwicki and H. Bunke. IAM-OnDB - an on-line English sentence database acquired from handwritten text on a whiteboard. In Proc. 8th Int. Conf. on Document Analysis and Recognition, volume 2, pages 956-961, 2005.\n[13] M. Liwicki, A. Graves, S. Fernández, H. Bunke, and J. Schmidhuber. 
A novel approach to on-line handwriting recognition based on bidirectional long short-term memory networks. In Proc. 9th Int. Conf. on Document Analysis and Recognition, Curitiba, Brazil, Sep. 2007.\n[14] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45:2673-2681, Nov. 1997.\n[15] P. Y. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Proc. 7th Int. Conf. on Document Analysis and Recognition, page 958, Washington, DC, USA, 2003. IEEE Computer Society.\n[16] S. Young, N. Russell, and J. Thornton. Token passing: A simple conceptual model for connected speech recognition systems. Technical Report CUED/F-INFENG/TR38, Cambridge University Eng. Dept., 1989.\n", "award": [], "sourceid": 894, "authors": [{"given_name": "Alex", "family_name": "Graves", "institution": null}, {"given_name": "Marcus", "family_name": "Liwicki", "institution": null}, {"given_name": "Horst", "family_name": "Bunke", "institution": null}, {"given_name": "J\u00fcrgen", "family_name": "Schmidhuber", "institution": null}, {"given_name": "Santiago", "family_name": "Fern\u00e1ndez", "institution": null}]}