{"title": "Comparison of Human and Machine Word Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 94, "page_last": 100, "abstract": "", "full_text": "
Comparison of Human and Machine Word Recognition

M. Schenkel
Dept of Electrical Eng., University of Sydney
Sydney, NSW 2006, Australia
schenkel@sedal.usyd.edu.au

C. Latimer
Dept of Psychology, University of Sydney
Sydney, NSW 2006, Australia

M. Jabri
Dept of Electrical Eng., University of Sydney
Sydney, NSW 2006, Australia
marwan@sedal.usyd.edu.au

Abstract

We present a study concerned with word recognition rates for heavily degraded documents. We compare human with machine reading capabilities in a series of experiments which explores the interaction of word/non-word recognition, word frequency and legality of non-words with degradation level. We also study the influence of character segmentation, and compare human performance with that of our artificial neural network model for reading. We found that the proposed computer model uses word context as efficiently as humans, but performs slightly worse on the pure character recognition task.

1 Introduction

Optical Character Recognition (OCR) of machine-print document images has matured considerably during the last decade. Recognition rates as high as 99.5% have been reported on good quality documents. However, for lower image resolutions (200 dpi and below), noisy images, and images with blur or skew, the recognition rate declines considerably. In bad quality documents, character segmentation is as big a problem as the actual character recognition. In many cases, characters tend either to merge with neighbouring characters (dark documents) or to break into several pieces (light documents), or both. We have developed a reading system based on a combination of neural networks and hidden Markov models (HMM), specifically for low resolution and degraded documents.

To assess the limits of the system and to see where possible improvements are still to be expected, an obvious comparison is between its performance and that of the best reading system known, the human reader. It has been argued that humans use an extremely wide range of context information during reading, such as current topics, syntax and semantic analysis, in addition to simple lexical knowledge. Such higher level context is very hard to model, and we decided to run a first comparison on a word recognition task, excluding any context beyond word knowledge.

The main questions asked for this study are: how does human performance compare with our system when it comes to pure character recognition (no context at all) of bad quality documents? How do they compare when word context can be used? Does character segmentation information help in reading?

2 Data Preparation

We created as stimuli 36 data sets, each containing 144 character strings, 72 words and 72 non-words, all lower case. The data sets were generated from 6 original sets, each containing 144 unique words/non-words. For each original set we used three ways to divide the words over the different degradation levels, such that each word appears once in each degradation level. We also had two ways to assign segmented/non-segmented presentation, so that each word is presented once segmented and once non-segmented. This counterbalancing creates the 36 sets out of the six original ones. The order of presentation within a test set was randomized with respect to degradation, segmentation and lexical status.
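The counterbalancing can be realised in several ways; the exact assignment scheme is not given here, so the following Python sketch is illustrative only (function and variable names are not from the study). It rotates the degradation-level assignment three ways and alternates the segmentation assignment two ways for each of the six original sets, yielding 36 test sets with randomized presentation order:

import random

DEGRADATION_LEVELS = (1, 2, 3)

def make_test_sets(original_sets):
    """original_sets: 6 lists of 144 character strings each (illustrative)."""
    test_sets = []
    for strings in original_sets:
        for rotation in range(3):        # 3 assignments of degradation level
            for seg_offset in range(2):  # 2 assignments of segmentation
                items = []
                for i, s in enumerate(strings):
                    items.append({
                        "string": s,
                        "degradation": DEGRADATION_LEVELS[(i + rotation) % 3],
                        "segmented": (i + seg_offset) % 2 == 0,
                    })
                random.shuffle(items)    # randomize presentation order
                test_sets.append(items)
    return test_sets                     # 6 x 3 x 2 = 36 sets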
All character strings were printed in 'Times Roman 10 pt' font. Degradation was achieved by photocopying and faxing the printed document before scanning it at 200 dpi. Care was taken to randomize the print position of the words so that as few systematic degradation differences as possible were introduced.

Words were picked from a dictionary of the 44,000 most frequent words in the 'Sydney Morning Herald'. The length of the words was restricted to between 5 and 9 characters. They were divided in a 3x3x2 mixed factorial design containing 3 word-frequency groups, 3 stimulus degradation levels and visually segmented/non-segmented words. The three word-frequency groups were: 1 to 10 occurrences/million (o/m) as low frequency, 11 to 40 o/m as medium frequency, and 41 or more o/m as high frequency. Each participant was presented with four examples per stimulus class (e.g. four high frequency words in medium degradation level, not segmented).

The non-words conformed to a 2x3x2 design containing legal/illegal non-words, 3 stimulus degradation levels and visually segmented/non-segmented strings. The illegal non-words (e.g. 'ptvca') were generated by randomly selecting a word length between 5 and 9 characters (using the same word length frequencies as the dictionary), then randomly picking characters (using the same character frequencies as the dictionary), and keeping the unpronounceable sequences. The legal non-words (e.g. 'slunk') were generated using trigrams (with the dictionary used to compute the trigram probabilities) and keeping pronounceable sequences. Six examples per non-word stimulus class were used in each test set (e.g. six illegal non-words in high degradation level, segmented).
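As a rough illustration of this generation procedure (not the actual stimulus code), the sketch below draws illegal non-word candidates from the dictionary's character and length frequencies and grows legal non-word candidates from overlapping dictionary trigrams. The frequency tables, the dictionary set and the pronounceability test, none of which are specified in detail above, are assumed to be supplied by the caller:

import random

def sample_illegal_nonword(char_freq, length_freq, is_pronounceable):
    """Random characters drawn with dictionary character frequencies;
    keep only unpronounceable strings (e.g. 'ptvca')."""
    chars, cw = zip(*char_freq.items())
    lengths, lw = zip(*length_freq.items())
    while True:
        n = random.choices(lengths, weights=lw)[0]            # length 5..9
        s = "".join(random.choices(chars, weights=cw, k=n))
        if not is_pronounceable(s):
            return s

def sample_legal_nonword(trigram_freq, length_freq, dictionary, is_pronounceable):
    """Grow a string from overlapping dictionary trigrams; keep only
    pronounceable strings that are not real words (e.g. 'slunk')."""
    lengths, lw = zip(*length_freq.items())
    tris = list(trigram_freq)
    while True:
        n = random.choices(lengths, weights=lw)[0]
        s = random.choices(tris, weights=[trigram_freq[t] for t in tris])[0]
        while len(s) < n:
            nxt = [t for t in tris if t[:2] == s[-2:]]        # overlap last two chars
            if not nxt:
                break
            t = random.choices(nxt, weights=[trigram_freq[t] for t in nxt])[0]
            s += t[2]
        if len(s) == n and s not in dictionary and is_pronounceable(s):
            return s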
3 Human Reading

There were 36 participants in the study. Participants were students and staff of the University of Sydney, recruited by advertisement and paid for their service. They were all native English speakers, aged between 19 and 52, with no reported uncorrected visual deficits.

The participants viewed the images, one at a time, on a computer monitor and were asked to type in the character string they thought would best fit the image. They had been instructed that half of the character strings were English words and half non-words, and they were informed about the degradation levels and the segmentation hints. Participants were asked to be as fast and as accurate as possible. After an initial training session of 30 randomly picked character strings, not from the test sets but from an independent training set, the participants had a short break and were then presented with the test set, one string at a time. After a Carriage Return was typed, the time was recorded and the next word was displayed. Training and testing took about one hour. The words were about 1-1.5 cm large on the screen and viewed at a distance of 60 cm, which corresponds to a viewing angle of 1°.

4 Machine Reading

For the machine reading tests, we used our integrated segmentation/recognition system, which combines a sliding window technique with a neural network and an HMM [6]. In the following we describe the basic workings without going into too much detail on the specific algorithms. For a more detailed description see [6].

A sliding window approach to word recognition performs no segmentation on the input data of the recognizer. It consists basically of sweeping a window over the input word in small steps. At each step the window is taken to be a tentative character and corresponding character class scores are produced. Segmentation and recognition decisions are then made on the basis of the sequence of character scores produced, possibly taking contextual information into account.

In the preprocessing stage we normalize the word to a fixed height. The result is a grey-normalized pixel map of the word. This pixel map is the input to a neural network which estimates a posteriori probabilities of occurrence for each character, given the input in the sliding window, whose length corresponds approximately to two characters. We use a space displacement neural network (SDNN), a multi-layer feed-forward network with local connections and shared weights whose layers perform successively higher-level feature extraction. SDNNs are derived from Time Delay Neural Networks, which have been used successfully in speech recognition [2] and handwriting recognition [4, 1]. Thanks to its convolutional structure, the computational complexity of the sliding window approach is kept tractable: only about one eighth of the network connections are re-evaluated for each new input window. The outputs of the SDNN are processed by an HMM. In our case the HMM implements character duration models; it tries to align the best scores of the SDNN with the corresponding expected character durations. The Viterbi algorithm is used for this alignment, determining simultaneously the segmentation and the recognition of the word. Finding this state sequence is equivalent to finding the most probable path through the graph which represents the HMM. Normally, additive costs are used instead of multiplicative probabilities, and the HMM then selects the word causing the smallest cost.

Our best architecture contains 4 convolutional layers with a total of 50,000 parameters [6]. The training set consisted of a subset of 180,000 characters from the SEDAL database, a low resolution degraded document database which was collected earlier and is independent of any data used in this experiment.

4.1 The Dictionary Model

A natural way of including a dictionary in this process is to restrict the solution space of the HMM to words given by the dictionary. Unfortunately this means calculating the cost for each word in the dictionary, which becomes prohibitively slow with increasing dictionary size (we use a combination of available dictionaries, with a total size of 98,000 words). We therefore chose a two-step process for the dictionary search: in a first step a list of the most probable words is generated, using a fast-matcher technique; in the second step the HMM costs are calculated only for the words in the proposed list.

To generate the word list, we take the character string as found by the HMM without the dictionary and calculate the edit-distance between that string and all the words in the dictionary. The edit-distance measures how many edit operations (insertion, deletion and substitution) are necessary to convert a given input string into a target word [3, 5]. We then select all dictionary words that have the smallest edit-distance to the string recognized without using the dictionary. The resulting word list contains on average 10 words, and its length varies considerably depending on the quality of the initial string.
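A minimal sketch of this fast-matcher step, assuming the dictionary is simply a collection of word strings (the function names are illustrative, not taken from the system):

def edit_distance(a, b):
    """Levenshtein distance with unit costs for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def word_shortlist(raw_string, dictionary):
    """All dictionary words at minimal edit-distance from the raw HMM output."""
    distances = {w: edit_distance(raw_string, w) for w in dictionary}
    best = min(distances.values())
    return [w for w, d in distances.items() if d == best]

In practice the inner distance computation can be abandoned once it exceeds the best distance seen so far; whether such a cut-off was used here is not stated, but the essential point is that this shortlist step is far cheaper than evaluating the HMM cost for all 98,000 dictionary words.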
For all words in the word list the HMM cost is then calculated, and the word with the smallest cost is the proposed dictionary word. As the calculation of the edit-distance is much faster than the calculation of the HMM costs, the recognition speed is increased substantially.

In a last step the difference in cost between the proposed dictionary word and the initial string is calculated. If this difference is smaller than a threshold, the system returns the dictionary word; otherwise the original string is returned. This allows for the recognition of non-dictionary words. The value of the threshold determines the amount of reliance on the dictionary: a high value will correct most words but will also force non-words to be recognized as words, while a low value leaves the non-words unchanged but does not help for words either. Thus the value of the threshold influences the difference between word and non-word recognition. We chose the value such that the overall error rate is optimized.

4.2 The Case of Segmented Data

When character segmentation is given, we know how many characters we have and where to look for them. There is no need for an HMM, and we simply sum up the character probabilities over the x-coordinate in the region corresponding to a segment. This leaves a vector of 26 scores (the whole alphabet) for each character in the input string. With no dictionary constraints, we simply pick the label corresponding to the highest probability for each character. The dictionary is used in the same way as before, replacing the HMM scores by word scores calculated directly from the corresponding character probabilities.
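The following sketch illustrates the segmented case under some assumptions that are not spelled out above: char_probs is the SDNN's per-column character posterior matrix, segments holds one (start, end) column range per character, and word scores are combined as sums of log-scores, which is one plausible reading of "calculating the word scores directly from the corresponding character probabilities" rather than the documented rule:

import math
import string

ALPHABET = string.ascii_lowercase  # 26 character classes

def segment_scores(char_probs, segments):
    """char_probs: list of 26-vectors, one per x-column (assumed format).
    segments: list of (start, end) column ranges, one per character slot."""
    scores = []
    for start, end in segments:
        scores.append([sum(col[k] for col in char_probs[start:end])
                       for k in range(len(ALPHABET))])
    return scores

def recognize_unconstrained(scores):
    # Pick the best-scoring letter independently for each character slot.
    return "".join(ALPHABET[max(range(26), key=s.__getitem__)] for s in scores)

def recognize_with_dictionary(scores, shortlist):
    # Score each candidate word directly from the per-slot character scores.
    def word_score(word):
        return sum(math.log(s[ALPHABET.index(c)] + 1e-12)
                   for s, c in zip(scores, word))
    candidates = [w for w in shortlist if len(w) == len(scores)]
    return max(candidates, key=word_score) if candidates else None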
5 Results

Recognition Performance

Figure 1: Human reading performance (per-character error rate versus degradation level, for words and non-words, segmented and non-segmented).
Figure 2: Machine reading performance (same axes as Figure 1).

Figure 1 depicts the recognition results for human readers. All results are per-character error rates computed with the edit-distance. All results reported as significant pass an F-test with p < .01. As expected there was a significant interaction between error rate and degradation, and non-words clearly have higher error rates than words. Character segmentation also has an influence on the error rate; segmentation seems to help slightly more for higher degradations.

Figure 2 shows the performance of the machine algorithm. Again, greater degradation leads to higher error rates and non-words have higher error rates than words. Segmentation hints lead to significantly better recognition for all degradation levels; in fact there is no interaction between degradation and segmentation for the machine algorithm. In general the machine benefited more from segmentation than humans did.

One would expect a smaller gain from lexical knowledge for higher error rates (i.e. higher degradation), as in the limit of complete degradation all error rates will be 100%. Both humans and machine show this 'closing of the gap'.

Figure 3: Segmented data (human versus machine error rates by degradation level).
Figure 4: Non-segmented data (human versus machine error rates by degradation level).

More interesting is the direct comparison between the error rates for humans and machine, shown in Figures 3 and 4. The difference for non-words reflects the difference in ability to recognize the geometrical shape of characters without context. For degradation levels 1 and 2, the machine has the same reading abilities as humans for segmented data and loses only about 7% in the non-segmented case. For degradation level 3, the machine clearly performs worse than human readers.

The difference between word and non-word error rates reflects the ability of the participant to use lexical knowledge. Note that the task contains word/non-word discrimination as well as recognition. It is striking how similar the behaviour of humans and machine is for degradation levels 1 and 2.

Timing Results

Figure 5 shows the word entry times for humans. As the main goal was to compare recognition rates, we did not emphasize entry speed when instructing the participants. However, we recorded the word entry time for each word (which includes inspection time and typing).

Figure 5: Human reading times (z-scored entry time by degradation level).
Figure 6: Non-segmented reading times, human versus machine.

When analysing the timing data, the only interest was in relative differences between word groups. Times were therefore converted for each participant into z-scores (zero mean, standard deviation of one) and statistics were computed over the z-scores of all participants.
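A minimal sketch of this per-participant normalization (illustrative data layout):

from statistics import mean, pstdev

def zscore_times(times_by_participant):
    """times_by_participant: dict of participant id -> list of entry times."""
    z = {}
    for pid, times in times_by_participant.items():
        m, s = mean(times), pstdev(times)
        z[pid] = [(t - m) / s for t in times]   # zero mean, unit std per participant
    return z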
\n\nto recognize than non-segmented for humans which we believe stems from  participants not \nbeing  used to reading segmented  data.  When  asked,  participants reported difficulties  in \nusing the segmentation lines.  Interestingly this  segmentation effect  is  significant only for \nwords but not for non-words. \nAs  predicted there is  also  an interaction between  time and  degradation.  Greater degra(cid:173)\ndations take longer to recognize.  Again, the degradation effect for  time is  only significant \nfor  words but barely for  non-words. \nOur machine reading algorithm behaves differently in segmented and non-segmented mode \nwith respect  to time consumption.  In segmented mode, the time for  evaluating the word \nlist  in  our  system  is  very  short  compared  to  the  base  recognition  time,  as  there  is  no \nHMM involved.  Accordingly we found no or very little effects on timing for our system for \nsegmented data.  All  the timing information for  the machine  refer to the non-segmented \ncase  (see Figure 6). \n\nFrequency and Legality \n\nTable  5  shows  word  frequencies,  legality  of non-words  and entry-time.  Our experiment \nconfirmed the well  known frequency  and legality effect for  humans in recognition rate as \nwell as time and respectively for frequency.  The only exception is that there is no difference \nin error rate for middle and low frequency  words. \nThe  machine  shows  (understandably)  no  frequency  effect  in  error  rate  or  time,  as  all \nlexical  words  had the  same  prior probability.  Interestingly  even  when  using  the correct \nprior probabilities we  could  not produce a strong word frequency\u00b7 effect  for  the machine. \nAlso  no  legality  effect  was  observed for  the  error  rate.  One  way  to incorporate legality \neffects would  be the use of Markov chains such as n-grams. \nNote  however,  how  the recognition  time for  non-words is  higher than for  words  and  the \nlegality effect for the recognition time.  Recognition times for our system in non-segmented \nmode depend  mainly on the time it takes to evaluate the word list.  Non-words generally \nproduce  longer  word  lists  than  words,  because there are  no  good  quality  matches  for  a \nnon-word in the dictionary (on average a word list length of 8.6 words was found for words \nand of 14.5  for non-words).  Also  illegal  non-words  produce  longer word  lists  than  legal \nones,  again  because  the match  quality for  illegal  non-words  is  worse  than for  legal  ones \n(average length for  illegal non-words  15.9 and for  legal non-words  13.2).  The z-scores for \nthe word list  length parallel nicely  the recognition time scores. \nIn segmented  mode,  the time  for  evaluating the word list  is  very short  compared to the \n\n\f100 \n\nM. Schenkel,  C. Latimer and M.  Jabri \n\nbase recognition  time,  as  there  is  no  HMM  involved.  Accordingly  we  found  no  or very \nlittle effects  on timing for our system in the segmented case. \n\nError l%J \n\nWords 41+ \nWords  11-40 \nWords  1-10 \nLegal  Non-W. \nlllegal Non-W. \n\nHumans \n\nMachine \n\nError  z-Time  Error  z-Time \n-0.14 \n0.22 \n-0.19 \n0.27 \n-0.22 \n0.26 \n0.09 \n0.36 \n0.46 \n0.28 \n\n-0.37 \n-0.13 \n-0.06 \n0.07 \n0.31 \n\n0.36 \n0.34 \n0.33 \n0.47 \n0.49 \n\nTable 1:  Human and Machine Error rates for the different word and non-word \nclasses.  The z-times for the machine are for  the non-segmented data only. \n\n6  Discussion \n\n. 
6 Discussion

The ability to recognize the geometrical shape of characters without the possibility of using any sort of context information is reflected in the error rate for illegal non-words. The difference between the error rate for illegal non-words and the one for words reflects the ability to use lexical knowledge. To our surprise, the behaviour of humans and machine is very similar for both tasks, indicating a near to optimal machine recognition system. Clearly this does not mean our system is a good model for human reading. Many effects such as semantic and repetition priming are not reproduced, and call for a system which is able to build semantic classes and memorize the stimuli presented. Nevertheless, we believe that our experiment empirically validates the verification model we implemented, using real world data.

Acknowledgments

This research is supported by a grant from the Australian Research Council (grant No. A49530190).

References

[1] I. Guyon, P. Albrecht, Y. Le Cun, J. Denker, and W. Hubbard. Design of a neural network character recognizer for a touch terminal. Pattern Recognition, 24(2):105-119, 1991.

[2] K. J. Lang and G. E. Hinton. A Time Delay Neural Network architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie Mellon University, Pittsburgh, PA, 1988.

[3] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics-Doklady, 10(8):707-710, 1966.

[4] O. Matan, C. J. C. Burges, Y. Le Cun, and J. Denker. Multi-digit recognition using a Space Displacement Neural Network. In J. E. Moody, editor, Advances in Neural Information Processing Systems 4, pages 488-495, Denver, 1992. Morgan Kaufmann.

[5] T. Okuda, E. Tanaka, and K. Tamotsu. A method for the correction of garbled words based on the Levenshtein metric. IEEE Transactions on Computers, C-25(2):172-177, 1976.

[6] M. Schenkel and M. Jabri. Degraded printed document recognition using convolutional neural networks and hidden Markov models. In Proceedings of the ACNN, Melbourne, 1997.
", "award": [], "sourceid": 1393, "authors": [{"given_name": "Markus", "family_name": "Schenkel", "institution": null}, {"given_name": "Cyril", "family_name": "Latimer", "institution": null}, {"given_name": "Marwan", "family_name": "Jabri", "institution": null}]}