{"title": "Discriminative Learning for Label Sequences via Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 1001, "page_last": 1008, "abstract": null, "full_text": "Discriminative Learning for Label Sequences via Boosting\n\nYasemin Altun, Thomas Hofmann and Mark Johnson*\n\nDepartment of Computer Science\n\n*Department of Cognitive and Linguistic Sciences\n\nBrown University, Providence, RI 02912\n\n{altun,th}@cs.brown.edu, Mark_Johnson@brown.edu\n\nAbstract\n\nThis paper investigates a boosting approach to discriminative learning of label sequences based on a sequence rank loss function. The proposed method combines many of the advantages of boosting schemes with the efficiency of dynamic programming methods and is attractive both conceptually and computationally. In addition, we discuss alternative approaches based on the Hamming loss for label sequences. The sequence boosting algorithm offers an interesting alternative to methods based on HMMs and the more recently proposed Conditional Random Fields. Application areas for the presented technique range from natural language processing and information extraction to computational biology. We include experiments on named entity recognition and part-of-speech tagging which demonstrate the validity and competitiveness of our approach.\n\n1 Introduction\n\nThe problem of annotating or segmenting observation sequences arises in many applications across a variety of scientific disciplines, most prominently in natural language processing, speech recognition, and computational biology. 
Well-known applications include part-of-speech (POS) tagging, named entity classification, information extraction, text segmentation and phoneme classification in text and speech processing [7], as well as problems like protein homology detection, secondary structure prediction or gene classification in computational biology [3].\n\nUp to now, the predominant formalism for modeling and predicting label sequences has been based on Hidden Markov Models (HMMs) and variations thereof. Yet, despite their success, generative probabilistic models, of which HMMs are a special case, have two major shortcomings, which this paper is not the first to point out. First, generative probabilistic models are typically trained using maximum likelihood estimation (MLE) for a joint sampling model of observation and label sequences. As has been emphasized frequently, MLE based on the joint probability model is inherently non-discriminative and thus may lead to suboptimal prediction accuracy. Secondly, efficient inference and learning in this setting often requires making questionable conditional independence assumptions. More precisely, in the case of HMMs, it is assumed that the Markov blanket of the hidden label variable at time step t consists of the previous and next labels as well as the t-th observation. This implies that all dependencies on past and future observations are mediated through neighboring labels.\n\nIn this paper, we investigate the use of discriminative learning methods for learning label sequences. This line of research continues previous approaches for learning conditional models, namely Conditional Random Fields (CRFs) [6] and discriminative re-ranking [1, 2]. 
CRFs have two main advantages compared to HMMs: they are trained discriminatively by maximizing a conditional (or pseudo-) likelihood criterion, and they are more flexible in modeling additional dependencies, such as direct dependencies of the t-th label on past or future observations. However, we strongly believe there are two further lines of research that are worth pursuing and may offer additional benefits or improvements.\n\nFirst of all, and this is the main emphasis of this paper, an exponential loss function such as the one used in boosting algorithms [9, 4] may be preferable to the logarithmic loss function used in CRFs. In particular, we will present a boosting algorithm that has the additional advantage of performing implicit feature selection, typically resulting in very sparse models. This is important for model regularization as well as for reasons of efficiency in high-dimensional feature spaces. Secondly, we will also discuss the use of loss functions that explicitly minimize the zero/one loss on labels, i.e. the Hamming loss, as an alternative to loss functions based on ranking or predicting entire label sequences.\n\n2 Additive Models and Exponential Families\n\nFormally, learning label sequences is a generalization of the standard supervised classification problem. The goal is to learn a discriminant function for sequences, i.e. a mapping from observation sequences $X = (x_1, x_2, \\ldots, x_t, \\ldots)$ to label sequences $Y = (y_1, y_2, \\ldots, y_t, \\ldots)$. We assume the availability of a training set of labeled sequences $\\mathcal{X} \\equiv \\{(X^i, Y^i) : i = 1, \\ldots, n\\}$ from which to learn this mapping.\n\nIn this paper, we focus on discriminant functions that can be written as additive models. 
The models under consideration take the following general form:\n\n$$F_\\theta(X, Y) = \\sum_t F_\\theta(X, Y; t), \\quad \\text{with} \\quad F_\\theta(X, Y; t) = \\sum_k \\theta_k f_k(X, Y; t) \\tag{1}$$\n\nHere $f_k$ denotes a (discrete) feature in the language of maximum entropy modeling, or a weak learner in the language of boosting. In the context of label sequences, $f_k$ will typically be either of the form $f_k^{(1)}(x_{t+s}, y_t)$ (with $s \\in \\{-1, 0, 1\\}$) or $f_k^{(2)}(y_{t-1}, y_t)$. The first type of features will model dependencies between the observation sequence $X$ and the t-th label in the sequence, while the second type will model inter-label dependencies between neighboring label variables. For ease of presentation, we will assume that all features are binary, i.e. each learner corresponds to an indicator function. A typical way of defining a set of weak learners is as follows:\n\n$$f_k^{(1)}(x_{t+s}, y_t) = \\delta(y_t, y(k)) \\, \\chi_k(x_{t+s}) \\tag{2}$$\n\n$$f_k^{(2)}(y_{t-1}, y_t) = \\delta(y_t, y(k)) \\, \\delta(y_{t-1}, \\bar{y}(k)) \\tag{3}$$\n\nwhere $\\delta$ denotes the Kronecker delta and $\\chi_k$ is a binary feature function that extracts a feature from an observation pattern; $y(k)$ and $\\bar{y}(k)$ refer to the label values for which the weak learner becomes \"active\".\n\nThere is a natural way to associate a conditional probability distribution over label sequences $Y$ with an additive model $F_\\theta$ by defining an exponential family for every fixed observation sequence $X$:\n\n$$P_\\theta(Y | X) \\equiv \\frac{\\exp[F_\\theta(X, Y)]}{Z_\\theta(X)}, \\qquad Z_\\theta(X) \\equiv \\sum_Y \\exp[F_\\theta(X, Y)] \\tag{4}$$\n\nThis distribution is in exponential normal form and the parameters $\\theta$ are also called natural or canonical parameters. By performing the sum over the sequence index $t$, we can see that the corresponding sufficient statistics are given by $S_k(X, Y) \\equiv \\sum_t f_k(X, Y; t)$. 
These sufficient statistics simply count the number of times the feature $f_k$ has been \"active\" along the labeled sequence $(X, Y)$.\n\n3 Logarithmic Loss and Conditional Random Fields\n\nIn CRFs, the log-loss of the model with parameters $\\theta$ w.r.t. a set of sequences $\\mathcal{X}$ is defined as the negative sum of the conditional log-probabilities of each training label sequence given the observation sequence,\n\n$$\\mathcal{H}^{\\log}(\\theta; \\mathcal{X}) \\equiv -\\sum_i \\log P_\\theta(Y^i | X^i) \\tag{5}$$\n\nAlthough [6] proposed a modification of improved iterative scaling for parameter estimation in CRFs, gradient-based methods such as conjugate gradient descent have often been found to be more efficient for minimizing the convex loss function in Eq. (5) (cf. [8]). The gradient can be readily computed as\n\n$$\\nabla_\\theta \\mathcal{H}^{\\log} = \\sum_i \\left( E[S(X, Y) | X = X^i] - S(X^i, Y^i) \\right) \\tag{6}$$\n\nwhere expectations are taken w.r.t. $P_\\theta(Y | X)$. The stationary equations then simply state that, uniformly averaged over the training data, the observed sufficient statistics should match their conditional expectations. Computationally, the evaluation of $S(X^i, Y^i)$ is straightforward counting, while summing over all sequences $Y$ to compute $E[S(X, Y) | X = X^i]$ can be performed using dynamic programming, since the dependency structure between labels is a simple chain.\n\n4 Ranking Loss Functions for Label Sequences\n\nAs an alternative to logarithmic loss functions, we propose to minimize an upper bound on the ranking loss [9] adapted to label sequences. The ranking loss of a discriminant function $F_\\theta$ w.r.t. a set of training sequences is defined as\n\n$$\\mathcal{H}^{rnk}(\\theta; \\mathcal{X}) = \\sum_i \\sum_{Y \\neq Y^i} \\delta\\left(F_\\theta(X^i, Y) - F_\\theta(X^i, Y^i)\\right), \\qquad \\delta(x) \\equiv \\begin{cases} 1 & \\text{if } x \\geq 0 \\\\ 0 & \\text{otherwise} \\end{cases} \\tag{7}$$\n\nwhich is simply the number of label sequences that are ranked higher than or equal to the true label sequence, summed over all training sequences. 
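To make the rank loss of Eq. (7) and the distribution of Eq. (4) concrete, the following is a minimal brute-force sketch in Python. It enumerates every label sequence of a toy input, so it is only usable for very short sequences. The label alphabet LABELS, the dictionary layout of theta, and the helper names score, rank_loss and cond_prob are illustrative assumptions and not part of the method described here; features are restricted to observation/label and label/label indicators, as in Section 2.

```python
import itertools
import math

LABELS = ('A', 'B')  # hypothetical label alphabet, for illustration only

def score(theta, x, y):
    # Additive score F(X, Y): sum of weights of active indicator features.
    total = 0.0
    for t, (obs, lab) in enumerate(zip(x, y)):
        total += theta.get(('emit', obs, lab), 0.0)
        if t > 0:
            total += theta.get(('trans', y[t - 1], lab), 0.0)
    return total

def rank_loss(theta, data):
    # Count competing sequences that score at least as high as the true one.
    loss = 0
    for x, y_true in data:
        f_true = score(theta, x, y_true)
        for y in itertools.product(LABELS, repeat=len(x)):
            if y != tuple(y_true) and score(theta, x, y) >= f_true:
                loss += 1
    return loss

def cond_prob(theta, x, y_true):
    # P(Y|X) of the exponential family, normalized by explicit enumeration.
    z = sum(math.exp(score(theta, x, y))
            for y in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(theta, x, y_true)) / z
```

With all parameters at zero, every competing sequence ties with the true one, so a single length-2 training sequence over two labels contributes 3 to the rank loss and has conditional probability 1/4.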
It is straightforward to see (based on a term-by-term comparison) that an upper bound on the rank loss is given by the following exponential loss function\n\n$$\\mathcal{H}^{\\exp}(\\theta; \\mathcal{X}) \\equiv \\sum_i \\sum_{Y \\neq Y^i} \\exp\\left[F_\\theta(X^i, Y) - F_\\theta(X^i, Y^i)\\right] = \\sum_i \\left[\\frac{1}{P_\\theta(Y^i | X^i)} - 1\\right] \\tag{8}$$\n\nInterestingly, this simply leads to a loss function that uses the inverse conditional probability of the true label sequence, if we define this probability via the exponential form in Eq. (4). Notice that compared to [1], we include all sequences and not just the top N list generated by some external mechanism. As we will show shortly, an explicit summation is possible because a dynamic programming formulation is available to compute sums over all sequences efficiently.\n\nIn order to derive gradient equations for the exponential loss, we can simply make use of the elementary facts\n\n$$\\nabla_\\theta(-\\log P(\\theta)) = -\\frac{\\nabla_\\theta P(\\theta)}{P(\\theta)}, \\quad \\text{and} \\quad \\nabla_\\theta \\frac{1}{P(\\theta)} = -\\frac{\\nabla_\\theta P(\\theta)}{P(\\theta)^2} = \\frac{\\nabla_\\theta(-\\log P(\\theta))}{P(\\theta)} \\tag{9}$$\n\nThen it is easy to see that\n\n$$\\nabla_\\theta \\mathcal{H}^{\\exp} = \\sum_i \\frac{E[S(X, Y) | X = X^i] - S(X^i, Y^i)}{P_\\theta(Y^i | X^i)} \\tag{10}$$\n\nThe only difference between Eq. (6) and Eq. (10) is the non-uniform weighting of different sequences by their inverse probability, hence putting more emphasis on training label sequences that receive a small overall (conditional) probability.\n\n5 Boosting Algorithm for Label Sequences\n\nAs an alternative to a simple gradient method, we now turn to the derivation of a boosting algorithm, following the boosting formulation presented in [9]. Let us introduce a relative weight (or distribution) $D(i, Y)$ for each label sequence $Y$ w.r.t. a training instance $(X^i, Y^i)$, i.e. $\\sum_i \\sum_Y D(i, Y) = 1$,\n\n$$D(i, Y) \\equiv \\frac{\\exp[F_\\theta(X^i, Y) - F_\\theta(X^i, Y^i)]}{\\sum_j \\sum_{Y' \\neq Y^j} \\exp[F_\\theta(X^j, Y') - F_\\theta(X^j, Y^j)]}, \\quad \\text{for } Y \\neq Y^i \\tag{11}$$\n\n$$= D(i) \\cdot \\frac{P_\\theta(Y | X^i)}{1 - P_\\theta(Y^i | X^i)}, \\qquad D(i) \\equiv \\frac{P_\\theta(Y^i | X^i)^{-1} - 1}{\\sum_j \\left[P_\\theta(Y^j | X^j)^{-1} - 1\\right]} \\tag{12}$$\n\nIn addition, we define $D(i, Y^i) = 0$. Eq. (12) shows how we can split $D(i, Y)$ into a relative weight for each training instance, given by $D(i)$, and a relative weight of each sequence, given by the re-normalized conditional probability $P_\\theta(Y | X^i)$. Notice that $D(i) \\to 0$ as we approach the perfect prediction case of $P_\\theta(Y^i | X^i) \\to 1$.\n\nWe define a boosting algorithm which in each round aims at minimizing the partition function or weight normalization constant $Z_k$ w.r.t. a weak learner $f_k$ and a corresponding optimal parameter increment $\\Delta\\theta_k$\n\n$$Z_k(\\Delta\\theta_k) \\equiv \\sum_i D(i) \\sum_{Y \\neq Y^i} \\frac{P_\\theta(Y | X^i)}{1 - P_\\theta(Y^i | X^i)} \\exp\\left[\\Delta\\theta_k \\left(S_k(X^i, Y) - S_k(X^i, Y^i)\\right)\\right] \\tag{13}$$\n\n$$= \\sum_b \\left(\\sum_i D(i) P_\\theta(b | X^i; k)\\right) \\exp[b \\Delta\\theta_k], \\tag{14}$$\n\nwhere $P_\\theta(b | X^i; k) \\equiv \\sum_{Y \\in \\mathcal{Y}(b; X^i)} P_\\theta(Y | X^i) / (1 - P_\\theta(Y^i | X^i))$ and $\\mathcal{Y}(b; X^i) \\equiv \\{Y : Y \\neq Y^i \\wedge (S_k(X^i, Y) - S_k(X^i, Y^i)) = b\\}$. This minimization problem is only tractable if the number of features is small, since a dynamic programming run with accumulators [6] for every feature seems to be required in order to compute the probabilities $P_\\theta(b | X^i; k)$, i.e. the probability that the k-th feature is active exactly $b$ times more often than along the true label sequence (with $b$ possibly negative), conditioned on the observation sequence $X^i$.\n\nIn cases where this is intractable (and we assume this will be the case in most applications), one can instead minimize an upper bound on every $Z_k$. The general idea is to exploit the convexity of the exponential function and to bound\n\n$$e^{\\Delta\\theta \\, x} \\leq \\frac{x^{\\max} - x}{x^{\\max} - x^{\\min}} e^{\\Delta\\theta \\, x^{\\min}} + \\frac{x - x^{\\min}}{x^{\\max} - x^{\\min}} e^{\\Delta\\theta \\, x^{\\max}} \\tag{15}$$\n\nwhich is valid for every $x \\in [x^{\\min}; x^{\\max}]$. 
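The convexity argument can be checked numerically. The Python sketch below assumes the standard chord bound for a convex function: for x in [x_min, x_max], exp(delta * x) is at most the linear interpolation between exp(delta * x_min) and exp(delta * x_max). The function names chord_bound and bound_holds, and the grid resolution, are illustrative assumptions.

```python
import math

def chord_bound(x, x_min, x_max, delta):
    # Weight on the left endpoint: 1 at x_min, 0 at x_max.
    w = (x_max - x) / (x_max - x_min)
    # Linear interpolation of exp(delta * x) between the two endpoints.
    return w * math.exp(delta * x_min) + (1.0 - w) * math.exp(delta * x_max)

def bound_holds(x_min, x_max, delta, steps=50):
    # Check exp(delta * x) <= chord_bound(x) on an evenly spaced grid.
    for n in range(steps + 1):
        x = x_min + (x_max - x_min) * n / steps
        if math.exp(delta * x) > chord_bound(x, x_min, x_max, delta) + 1e-12:
            return False
    return True
```

The bound is tight at both endpoints and holds for positive as well as negative increments, which is what allows the sum over all competing label sequences to be replaced by two exponential terms per training sequence.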
\nWe introduce the following shorthand notation: $u_{ik}(Y) \\equiv S_k(X^i, Y) - S_k(X^i, Y^i)$, $u_{ik}^{\\max} = \\max_{Y \\neq Y^i} u_{ik}(Y)$, $u_{ik}^{\\min} = \\min_{Y \\neq Y^i} u_{ik}(Y)$, $u_k^{\\max} = \\max_i u_{ik}^{\\max}$, $u_k^{\\min} = \\min_i u_{ik}^{\\min}$ and $\\pi_i(Y) \\equiv P_\\theta(Y | X^i) / (1 - P_\\theta(Y^i | X^i))$, which allows us to rewrite\n\n$$Z_k(\\Delta\\theta_k) = \\sum_i D(i) \\sum_{Y \\neq Y^i} \\pi_i(Y) \\exp[\\Delta\\theta_k u_{ik}(Y)] \\tag{16}$$\n\n$$\\leq \\sum_i D(i) \\sum_{Y \\neq Y^i} \\pi_i(Y) \\left[\\frac{u_{ik}^{\\max} - u_{ik}(Y)}{u_{ik}^{\\max} - u_{ik}^{\\min}} e^{\\Delta\\theta_k u_{ik}^{\\min}} + \\frac{u_{ik}(Y) - u_{ik}^{\\min}}{u_{ik}^{\\max} - u_{ik}^{\\min}} e^{\\Delta\\theta_k u_{ik}^{\\max}}\\right]$$\n\n$$= \\sum_i D(i) \\left(r_{ik} e^{\\Delta\\theta_k u_{ik}^{\\min}} + (1 - r_{ik}) e^{\\Delta\\theta_k u_{ik}^{\\max}}\\right), \\quad \\text{where} \\tag{17}$$\n\n$$r_{ik} \\equiv \\sum_{Y \\neq Y^i} \\pi_i(Y) \\frac{u_{ik}^{\\max} - u_{ik}(Y)}{u_{ik}^{\\max} - u_{ik}^{\\min}} \\tag{18}$$\n\nBy taking the second derivative w.r.t. $\\Delta\\theta_k$, it is easy to verify that this is a convex function in $\\Delta\\theta_k$, which can be minimized with a simple line search.\n\nIf one is willing to accept a looser bound, one can instead work with the interval $[u_k^{\\min}; u_k^{\\max}]$, which is the union of the intervals $[u_{ik}^{\\min}; u_{ik}^{\\max}]$ for every training sequence $i$, and obtain the upper bound\n\n$$Z_k(\\Delta\\theta_k) \\leq r_k e^{\\Delta\\theta_k u_k^{\\min}} + (1 - r_k) e^{\\Delta\\theta_k u_k^{\\max}} \\tag{19}$$\n\n$$r_k \\equiv \\sum_i D(i) \\sum_{Y \\neq Y^i} \\pi_i(Y) \\frac{u_k^{\\max} - u_{ik}(Y)}{u_k^{\\max} - u_k^{\\min}} \\tag{20}$$\n\nThis bound can be minimized analytically, giving\n\n$$\\Delta\\theta_k = \\frac{1}{u_k^{\\max} - u_k^{\\min}} \\log\\left(\\frac{-r_k u_k^{\\min}}{(1 - r_k) u_k^{\\max}}\\right) \\tag{21}$$\n\nbut it will in general lead to more conservative step sizes.\n\nThe final boosting procedure picks at every round the feature for which the upper bound on $Z_k$ is minimal and then performs an update $\\theta_k \\leftarrow \\theta_k + \\Delta\\theta_k$. Of course, one might also use more elaborate techniques to find the optimal $\\Delta\\theta_k$ once $f_k$ has been selected, since the upper bound approximation may underestimate the optimal step sizes. 
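As a small numerical illustration of the closed-form step of Eq. (21) for the looser bound of Eq. (19), the Python sketch below evaluates the two-term bound and its analytic minimizer. The function names loose_bound and analytic_step are illustrative assumptions. Note that the logarithm is only defined when u_min < 0 < u_max and 0 < r_k < 1; the sketch rejects other inputs rather than guessing how they should be handled.

```python
import math

def loose_bound(delta, r_k, u_min, u_max):
    # Two-term upper bound on Z_k as a function of the increment delta.
    return r_k * math.exp(delta * u_min) + (1.0 - r_k) * math.exp(delta * u_max)

def analytic_step(r_k, u_min, u_max):
    # Closed-form minimizer of loose_bound, from setting its derivative to zero.
    if not (u_min < 0.0 < u_max and 0.0 < r_k < 1.0):
        raise ValueError('closed form needs u_min < 0 < u_max and 0 < r_k < 1')
    ratio = -r_k * u_min / ((1.0 - r_k) * u_max)
    return math.log(ratio) / (u_max - u_min)
```

For example, with r_k = 0.7, u_min = -1 and u_max = 1 the step is 0.5 * log(7/3), and the bound at that step is below its value of 1 at a zero increment, so the selected feature strictly reduces the bound.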
It is important to see that the quantities involved ($r_{ik}$ and $r_k$, respectively) are simple expectations of sufficient statistics that can be computed for all features simultaneously with a single dynamic programming run per sequence.\n\n6 Hamming Loss for Label Sequences\n\nIn many applications one is primarily interested in the label-by-label loss or Hamming loss [9]. Here we investigate how to train models by minimizing an upper bound on the Hamming loss. The following logarithmic loss aims at maximizing the log-probability for each individual label and is given by\n\n$$\\mathcal{F}^{\\log}(\\theta; \\mathcal{X}) \\equiv -\\sum_i \\sum_t \\log P_\\theta(y_t^i | X^i) = -\\sum_i \\sum_t \\log \\sum_{Y : y_t = y_t^i} P_\\theta(Y | X^i) \\tag{22}$$\n\nAgain, focusing on gradient descent methods, the gradient is given by\n\n$$\\nabla_\\theta \\mathcal{F}^{\\log} = \\sum_i \\sum_t \\left( E[S(X, Y) | X = X^i] - E[S(X, Y) | X = X^i, y_t = y_t^i] \\right) \\tag{23}$$\n\nAs can be seen, the expected sufficient statistics are now compared not to their empirical values, but to their expected values, conditioned on a given label value $y_t^i$ (and not the entire sequence $Y^i$). In order to evaluate these expectations, one can perform dynamic programming using the algorithm described in [5], which has (independently of our work) focused on the use of Hamming loss functions in the context of CRFs. This algorithm has the complexity of the forward-backward algorithm scaled by a constant.\n\nSimilar to the log-loss case, one can define an exponential loss function that corresponds to a margin-like quantity at every single label. We propose minimizing the following loss function\n\n$$\\mathcal{F}^{\\exp}(\\theta; \\mathcal{X}) \\equiv \\sum_i \\sum_t \\sum_Y \\exp\\left[F_\\theta(X^i, Y) - \\log \\sum_{Y' : y'_t = y_t^i} \\exp[F_\\theta(X^i, Y')]\\right] \\tag{24}$$\n\n$$= \\sum_{i,t} \\frac{\\sum_Y \\exp[F_\\theta(X^i, Y)]}{\\sum_{Y : y_t = y_t^i} \\exp[F_\\theta(X^i, Y)]} = \\sum_{i,t} P_\\theta(y_t^i | X^i)^{-1} \\tag{25}$$\n\nAs a motivation, we point out that for the case of sequences of length 1, this will reduce to the standard multi-class exponential loss. 
Effectively, in this model the prediction of a label $y_t$ will mimic the probabilistic marginalization, i.e. $\\hat{y}_t^i = \\arg\\max_y F_\\theta(X^i, y; t)$, with $F_\\theta(X^i, y; t) = \\log \\sum_{Y : y_t = y} \\exp[F_\\theta(X^i, Y)]$.\n\nSimilar to the log-loss case, the gradient is given by\n\n$$\\nabla_\\theta \\mathcal{F}^{\\exp} = \\sum_{i,t} \\frac{E[S(X, Y) | X = X^i] - E[S(X, Y) | X = X^i, y_t = y_t^i]}{P_\\theta(y_t^i | X^i)} \\tag{26}$$\n\nAgain, we see the same differences between the log-loss and the exponential loss, but this time for individual labels. Labels for which the marginal probability $P_\\theta(y_t^i | X^i)$ is small are accentuated in the exponential loss. The computational complexity for computing $\\nabla_\\theta \\mathcal{F}^{\\exp}$ and $\\nabla_\\theta \\mathcal{F}^{\\log}$ is practically the same. We have not been able to derive a boosting formulation for this loss function, mainly because it cannot be written as a sum of exponential terms. We have thus resorted to conjugate gradient descent methods for minimizing $\\mathcal{F}^{\\exp}$ in our experiments.\n\n7 Experimental Results\n\n7.1 Named Entity Recognition\n\nNamed Entity Recognition (NER), a subtask of Information Extraction, is the task of finding the phrases that contain person, location and organization names, times and quantities. Each word is tagged with the type of the name as well as its position in the name phrase (i.e. whether it is the first item of the phrase or not) in order to represent the boundary information.\n\nWe used a Spanish corpus which was provided for the Special Session of CoNLL2002 on NER. The data is a collection of news wire articles and is tagged for person names, organizations, locations and miscellaneous names.\n\nWe used simple binary features to ask questions about the word being tagged, as well as the previous tag (i.e. HMM features). An example feature would be: Is the current word='Clinton' and the tag='Person-Beginning'? We also used features to ask detailed questions (i.e. spelling features) about the current word (e.g.: Is the current word capitalized and the tag='Location-Intermediate'?) and the neighboring words. These questions cannot be asked (in a principled way) in a generative HMM. We ran experiments comparing the different loss functions optimized with the conjugate gradient method and the boosting algorithm. We designed three sets of features: HMM features (= S1), S1 and detailed features of the current word (= S2), and S2 and detailed features of the neighboring words (= S3).\n\nThe results summarized in Table 1 demonstrate the competitiveness of the proposed loss functions with respect to $\\mathcal{H}^{\\log}$. We observe that with different sets of features, the ordering of the performance of the loss functions changes. Boosting performs worse than the conjugate gradient when only HMM features are used, since there is not much information in the features other than the identity of the word to be labeled. Consequently, the boosting algorithm needs to include almost all weak learners in the ensemble and cannot exploit feature sparseness. When there are more detailed features, the boosting algorithm is competitive with the conjugate gradient method, but has the advantage of generating sparser models. The conjugate gradient method uses all of the available features, whereas boosting uses only about 10% of the features.\n\nTable 1: Test error of the Spanish corpus for named entity recognition. 
\n\nFeature Set | | Objective: log | Objective: exp | Objective: boost\nS1 | $\\mathcal{H}$ | 6.60 | 6.95 | 8.05\nS1 | $\\mathcal{F}$ | 6.73 | 7.33 | -\nS2 | $\\mathcal{H}$ | 6.72 | 7.03 | 6.93\nS2 | $\\mathcal{F}$ | 6.67 | 7.49 | -\nS3 | $\\mathcal{H}$ | 6.15 | 5.84 | 6.77\nS3 | $\\mathcal{F}$ | 5.90 | 5.10 | -\n\n7.2 Part of Speech Tagging\n\nWe used the Penn TreeBank corpus for the part-of-speech tagging experiments. The features were similar to the feature sets S1 and S2 described above in the context of NER. Table 2 summarizes the experimental results obtained on this task. It can be seen that the test errors obtained by different loss functions lie within a relatively small range. Qualitatively, the behavior of the different optimization methods is comparable to the NER experiments.\n\nTable 2: Test error of the Penn TreeBank corpus for POS.\n\nFeature Set | | Objective: log | Objective: exp | Objective: boost\nS1 | $\\mathcal{H}$ | 4.69 | 5.04 | 10.58\nS1 | $\\mathcal{F}$ | 4.88 | 4.96 | -\nS2 | $\\mathcal{H}$ | 4.37 | 4.74 | 5.09\nS2 | $\\mathcal{F}$ | 4.71 | 4.90 | -\n\n7.3 General Comments\n\nEven with the tighter bound in the boosting formulation, the same features are selected many times, because of the conservative estimate of the step size for parameter updates. We expect to speed up the convergence of the boosting algorithm by using a more sophisticated line search mechanism to compute the optimal step length, a conjecture that will be addressed in future work.\n\nAlthough we did not use real-valued features in our experiments, we observed that including real-valued features in a conjugate gradient formulation is a challenge, whereas it is very natural to have such features in a boosting algorithm. 
\n\nWe noticed in our experiments that defining a distribution over the training instances using the inverse conditional probability creates problems in the boosting formulation for data sets that are highly unbalanced in terms of the length of the training sequences. To overcome this problem, we divided the sentences into pieces such that the variation in the length of the sentences is small. The conjugate gradient optimization, on the other hand, did not appear to suffer from this problem.\n\n8 Conclusion and Future Work\n\nThis paper makes two contributions to the problem of learning label sequences. First, we have presented an efficient algorithm for discriminative learning of label sequences that combines boosting with dynamic programming. The algorithm compares favorably with the best previous approach, Conditional Random Fields, and offers additional benefits such as model sparseness. Secondly, we have discussed the use of methods that optimize a label-by-label loss and have shown that these methods bear promise for further improving classification accuracy. Our future work will investigate the performance (in both accuracy and computational expense) of the different loss functions in different conditions (e.g. noise level, size of the feature set).\n\nAcknowledgments\n\nThis work was sponsored by an NSF-ITR grant, award number IIS-0085940.\n\nReferences\n\n[1] M. Collins. Discriminative reranking for natural language parsing. In Proceedings 17th International Conference on Machine Learning, pages 175-182. Morgan Kaufmann, San Francisco, CA, 2000.\n\n[2] M. Collins. Ranking algorithms for named-entity extraction: Boosting and the voted perceptron. In Proceedings 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 489-496, 2002. 
\n\n[3] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.\n\n[4] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28:337-374, 2000.\n\n[5] S. Kakade, Y.W. Teh, and S. Roweis. An alternative objective function for Markovian fields. In Proceedings 19th International Conference on Machine Learning, 2002.\n\n[6] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, pages 282-289. Morgan Kaufmann, San Francisco, CA, 2001.\n\n[7] C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.\n\n[8] T. Minka. Algorithms for maximum-likelihood logistic regression. Technical report, CMU, Department of Statistics, TR 758, 2001.\n\n[9] R. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297-336, 1999.\n", "award": [], "sourceid": 2333, "authors": [{"given_name": "Yasemin", "family_name": "Altun", "institution": null}, {"given_name": "Thomas", "family_name": "Hofmann", "institution": null}, {"given_name": "Mark", "family_name": "Johnson", "institution": null}]}