{"title": "Data-Dependent Structural Risk Minimization for Perceptron Decision Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 336, "page_last": 342, "abstract": "", "full_text": "On Parallel Versus  Serial  Processing: \n\nA  Computational Study of Visual  Search \n\nEyal  Cohen \n\nDepartment of Psychology \n\nTel-Aviv University  Tel  Aviv 69978,  Israel \n\neyalc@devil. tau .ac .il \n\nEytan Ruppin \n\nDepartments of Computer Science  &  Physiology \n\nTel-Aviv University Tel  Aviv  69978,  Israel \n\nruppin@math.tau .ac.il \n\nAbstract \n\nA novel neural network model of pre-attention processing in visual(cid:173)\nsearch  tasks  is  presented.  Using displays of line orientations taken \nfrom  Wolfe's experiments  [1992], we  study the  hypothesis  that the \ndistinction  between  parallel  versus  serial  processes  arises  from  the \navailability of global information in  the internal representations  of \nthe  visual  scene.  The  model  operates  in  two  phases.  First,  the \nvisual  displays  are  compressed  via  principal-component-analysis. \nSecond,  the compressed data is processed by a target detector mod(cid:173)\nule in order to identify the existence of a  target in the display.  Our \nmain  finding  is  that  targets  in  displays  which  were  found  exper(cid:173)\nimentally  to  be  processed  in  parallel can  be  detected  by  the  sys(cid:173)\ntem,  while  targets  in  experimentally-serial  displays  cannot .  This \nfundamental  difference  is  explained  via  variance  analysis  of  the \ncompressed representations,  providing a  numerical criterion distin(cid:173)\nguishing parallel from serial displays.  Our model yields a  mapping \nof response-time slopes that is similar to Duncan and Humphreys's \n\"search  surface\"  [1989],  providing  an  explicit  formulation  of their \nintuitive  notion  of feature  similarity.  It presents  a  neural  realiza(cid:173)\ntion of the  processing that may underlie the classical metaphorical \nexplanations of visual search. \n\n\fOn Parallel versus Serial Processing: A  Computational Study a/Visual Search \n\n11 \n\n1 \n\nIntroduction \n\nThis  paper  presents  a  neural-model of pre-attentive  visual  processing.  The  model \nexplains why  certain displays can be processed  very fast,  \"in parallel\" , while others \nrequire slower,  \"serial\"  processing, in subsequent attentional systems.  Our approach \nstems from  the observation that the  visual environment is  overflowing with diverse \ninformation,  but  the  biological  information-processing  systems  analyzing  it  have \na  limited  capacity  [1].  This  apparent  mismatch  suggests  that  data  compression \nshould  be  performed  at  an  early  stage  of perception,  and  that  via  an  accompa(cid:173)\nnying  process  of  dimension  reduction,  only  a  few  essential  features  of the  visual \ndisplay  should  be  retained.  We  propose  that  only  parallel  displays  incorporate \nglobal features  that enable fast  target  detection,  and hence  they  can  be  processed \npre-attentively,  with  all  items  (target  and  dis tractors)  examined  at  once.  On  the \nother  hand,  in  serial  displays'  representations,  global  information  is  obscure  and \ntarget  detection  requires  a  serial,  attentional scan of local features  across  the  dis(cid:173)\nplay.  
Using principal component analysis (PCA), our main goal is to demonstrate that neural systems employing compressed, dimensionally reduced representations of the visual information can successfully process only parallel displays and not serial ones. The source of this difference will be explained via variance analysis of the displays' projections on the principal axes.

The modeling of visual attention in cognitive psychology involves the use of metaphors, e.g., Posner's beam of attention [2]. A visual attention system of a surviving organism must supply fast answers to burning issues such as detecting a target in the visual field and characterizing its primary features. An attentional system employing a constant-speed beam of attention [3] probably cannot perform such tasks fast enough, and a pre-attentive system is required. Treisman's feature integration theory (FIT) describes such a system [4]. According to FIT, features of separate dimensions (shape, color, orientation) are first coded pre-attentively in a locations map and in separate feature maps, each map representing the values of a particular dimension. Then, in the second stage, attention "glues" the features together, conjoining them into objects at their specified locations. This hypothesis was supported using the visual-search paradigm [4], in which subjects are asked to detect a target within an array of distractors which differ on given physical dimensions such as color, shape or orientation. As long as the target is significantly different from the distractors in one dimension, the reaction time (RT) is short and shows almost no dependence on the number of distractors (low RT slope). This result suggests that in this case the target is detected pre-attentively, in parallel. However, if the target and distractors are similar, or the target specifications are more complex, reaction time grows considerably as a function of the number of distractors [5, 6], suggesting that the displays' items are scanned serially using an attentional process.

FIT and other related cognitive models of visual search are formulated on the conceptual level and do not offer a detailed description of the processes involved in transforming the visual scene from an ordered set of data points into given values in specified feature maps. This paper presents a novel computational explanation of the source of the distinction between parallel and serial processing, progressing from general metaphorical terms to a neural network realization. Interestingly, we also come out with a computational interpretation of some of these metaphorical terms, such as feature similarity.

2 The Model

We focus our study on visual-search experiments of line orientations performed by Wolfe et al. [7], using three set-sizes composed of 4, 8 and 12 items. The number of items equals the number of distractors + target in target displays, and in non-target displays the target was replaced by another distractor, keeping a constant set-size. Five experimental conditions were simulated: (A) - a 20 degrees tilted target among vertical distractors (homogeneous background).
(B) - a vertical target among 20 degrees tilted distractors (homogeneous background). (C) - a vertical target among a heterogeneous background (a mixture of lines with ±20, ±40, ±60, ±80 degrees orientations). (E) - a vertical target among two flanking distractor orientations (at ±20 degrees), and (G) - a vertical target among two flanking distractor orientations (±40 degrees). The response times (RT) as a function of the set-size measured by Wolfe et al. [7] show that type A, B and G displays are scanned in a parallel manner (1.2, 1.8, 4.8 msec/item for the RT slopes), while type C and E displays are scanned serially (19.7, 17.5 msec/item). The input displays of our system were prepared following Wolfe's prescription: nine images of the basic line orientations were produced as nine matrices of gray-level values. Displays for the various conditions of Wolfe's experiments were produced by randomly assigning these matrices into a 4x4 array, yielding 128x100 display-matrices that were transformed into 12800-dimensional display-vectors. A total number of 2400 displays were produced in 30 groups (80 displays in each group): 5 conditions (A, B, C, E, G) x target/non-target x 3 set-sizes (4, 8, 12).

Our model is composed of two neural network modules connected in sequence as illustrated in Figure 1: a PCA module which compresses the visual data into a set of principal axes, and a Target Detector (TD) module. The latter module uses the compressed data obtained by the former module to detect a target within an array of distractors. The system is presented with line-orientation displays as described above.

[Figure 1: General architecture of the model. The display feeds the input layer (12800 units) of the PCA data-compression module; the PCA output layer feeds the Target Detector (TD) module, with an intermediate layer (12 units) and an output layer (1 unit) coding TARGET = +1, NO-TARGET = -1.]

For the PCA module we use the neural network proposed by Sanger, with the connections' values updated in accordance with his Generalized Hebbian Algorithm (GHA) [8]. The outputs of the trained system are the projections of the display-vectors along the first few principal axes, ordered with respect to their eigenvalue magnitudes. Compressing the data is achieved by choosing outputs from the first few neurons (maximal variance and minimal information loss). Target detection in our system is performed by a feed-forward (FF) 3-layered network, trained via a standard back-propagation algorithm in a supervised-learning manner. The input layer of the FF network is composed of the first eight output neurons of the PCA module. The transfer function used in the intermediate and output layers is the hyperbolic tangent function.

3 Results

3.1 Target Detection

The performance of the system was examined in two simulation experiments. In the first, the PCA module was trained only with "parallel" task displays, and in the second, only with "serial" task displays. (A minimal sketch of this two-stage pipeline is given below.)
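As an illustration of the two-stage architecture just described, the following Python sketch (not the authors' code; the array shapes, learning rates and stand-in random data are assumptions) estimates the leading principal axes with Sanger's GHA rule and trains a small tanh network on the resulting projections to output +1 for target displays and -1 for non-target displays.

import numpy as np

def train_gha(X, n_components=8, lr=1e-3, epochs=50, seed=0):
    # Estimate the leading principal axes of the rows of X (n_samples, n_features)
    # with Sanger's Generalized Hebbian Algorithm.
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(n_components, X.shape[1]))
    Xc = X - X.mean(axis=0)                        # work with zero-mean data
    for _ in range(epochs):
        for x in Xc[rng.permutation(len(Xc))]:
            y = W @ x                              # projections on current axes
            # Hebbian term minus Gram-Schmidt-like correction (Sanger's rule)
            W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W                                       # rows approximate the principal axes

def train_detector(Z, labels, hidden=12, lr=0.01, epochs=2000, seed=0):
    # Feed-forward tanh network mapping PCA projections Z (n_samples, k)
    # to +1 (target) / -1 (no target), trained by plain gradient descent.
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(Z.shape[1], hidden))
    W2 = rng.normal(scale=0.1, size=(hidden, 1))
    y = labels.reshape(-1, 1).astype(float)
    for _ in range(epochs):
        H = np.tanh(Z @ W1)                        # intermediate layer
        out = np.tanh(H @ W2)                      # output layer
        g_out = (out - y) * (1 - out ** 2)         # squared-error gradient
        W2 -= lr * H.T @ g_out / len(Z)
        g_hid = (g_out @ W2.T) * (1 - H ** 2)
        W1 -= lr * Z.T @ g_hid / len(Z)
    return W1, W2

# Stand-in data: 200 small random "displays" instead of the 2400 x 12800 set.
rng = np.random.default_rng(0)
X = rng.random((200, 500))
labels = np.where(rng.random(200) > 0.5, 1.0, -1.0)
W = train_gha(X, n_components=8)
Z = (X - X.mean(axis=0)) @ W.T                     # compressed representations
W1, W2 = train_detector(Z, labels)

As in the model above, the detector sees only the first eight GHA outputs, so its capacity to find a target depends entirely on what survives the compression.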
There is an inherent difference in the ability of the model to detect targets in parallel versus serial displays. In parallel task conditions (A, B, G) the target detector module learns the task after a comparatively small number (800 to 2000) of epochs, reaching a performance level of almost 100%. However, the target detector module is not capable of learning to detect a target in serial displays (C, E conditions). Interestingly, these results hold (1) whether the preceding PCA module was trained to perform data compression using parallel task displays or serial ones, (2) whether the target detector was a linear simple perceptron or the more powerful, non-linear network depicted in Figure 1, and (3) whether the full set of 144 principal axes (with non-zero eigenvalues) was used.

3.2 Information Span

To analyze the differences between parallel and serial tasks we examined the eigenvalues obtained from the PCA of the training-set displays. The eigenvalues of condition B (parallel) displays in 4 and 12 set-sizes and of condition C (serial-task) displays are presented in Figure 2. Each training set contains a mixture of target and non-target displays.

[Figure 2: Eigenvalue spectrum of displays with different set-sizes, for parallel (a) and serial (b) tasks; eigenvalue magnitude is plotted against principal-axis number for 4-item and 12-item displays. Due to the sparseness of the displays (a few black lines on a white background), it takes only 31 principal axes to describe the parallel training-set in full (see Fig. 2a; the remaining axes have zero eigenvalues, indicating that they contain no additional information), and 144 axes for the serial set (only the first 50 axes are shown in Fig. 2b).]

As evident, the eigenvalue distributions of the two display types are fundamentally different: in the parallel task, most of the eigenvalue "mass" is concentrated in the first few (15) principal axes, testifying that indeed, the dimension of the parallel displays' space is quite confined. But for the serial task, the eigenvalues are distributed almost uniformly over 144 axes. This inherent difference is independent of set-size: 4 and 12-item displays have practically the same eigenvalue spectra.

3.3 Variance Analysis

The target detector inputs are the projections of the display-vectors along the first few principal axes. Thus, some insight into the source of the difference between parallel and serial tasks can be gained by performing a variance analysis on these projections. The five different task conditions were analyzed separately, taking a group of 85 target displays and a group of 85 non-target displays for each set-size.
Two types of variances were calculated for the projections on the 5th principal axis: the "within groups" variance, which is a measure of the statistical noise within each group of 85 displays, and the "between groups" variance, which measures the separation between target and non-target groups of displays for each set-size. These variances were averaged for each task (condition), over all set-sizes. The resulting ratios Q of within-groups to between-groups standard deviations are: QA = 0.0259, QB = 0.0587 and QG = 0.0114 for parallel displays (A, B, G), and QE = 0.2125, QC = 0.771 for serial ones (E, C).

As evident, for parallel task displays the Q values are smaller by an order of magnitude compared with the serial displays, indicating a better separation between target and non-target displays in parallel tasks. Moreover, using Q as a criterion for the parallel/serial distinction one can predict that displays with Q << 1 will be processed in parallel, and serially otherwise, in accordance with the experimental response time (RT) slopes measured by Wolfe et al. [7]. These differences are further demonstrated in Figure 3, depicting projections of display-vectors on the sub-space spanned by the 5th, 6th and 7th principal axes. Clearly, for the parallel task (condition B), the PCA representations of the target-displays (plus signs) are separated from non-target representations (circles), while for serial displays (condition C) there is no such separation. It should be emphasized that there is no other principal axis along which such a separation is manifested for serial displays.

[Figure 3: Projections of display-vectors on the sub-space spanned by the 5th, 6th and 7th principal axes. Plus signs and circles denote target and non-target display-vectors respectively, (a) for a parallel task (condition B), and (b) for a serial task (condition C). Set-size is 8 items.]

While Treisman and her co-workers view the distinction between parallel and serial tasks as a fundamental one, Duncan and Humphreys [5] claim that there is no sharp distinction between them, and that search efficiency varies continuously across tasks and conditions. The determining factors according to Duncan and Humphreys are the similarities between the target and the non-targets (T-N similarities) and the similarities between the non-targets themselves (N-N similarity).
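Before taking up Duncan and Humphreys's similarity account, the Q criterion just described can be written down in a few lines. The Python sketch below is hedged: the paper reports Q as the ratio of within-groups to between-groups standard deviations averaged over set-sizes, but the exact pooling is not spelled out, so the equal-weight pooling here is an assumption.

import numpy as np

def q_ratio(target_proj, nontarget_proj):
    # Ratio Q of within-groups to between-groups standard deviation for the
    # projections of target and non-target displays on one principal axis.
    # Equal-weight pooling of the two groups is an assumption of this sketch.
    within_var = 0.5 * (np.var(target_proj, ddof=1) + np.var(nontarget_proj, ddof=1))
    grand_mean = 0.5 * (np.mean(target_proj) + np.mean(nontarget_proj))
    between_var = 0.5 * ((np.mean(target_proj) - grand_mean) ** 2 +
                         (np.mean(nontarget_proj) - grand_mean) ** 2)
    return np.sqrt(within_var / between_var)

# Q << 1 predicts a parallel (pre-attentive) display; larger Q predicts serial search.

Per the text, this ratio is computed per set-size and then averaged over the three set-sizes of each condition.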
Displays with a homogeneous background (high N-N similarity) and a target which is significantly different from the distractors (low T-N similarity) will exhibit parallel, low RT slopes, and vice versa. This claim was illustrated by them using a qualitative "search surface" description as shown in Figure 4a. Based on results from our variance analysis, we can now examine this claim quantitatively: we have constructed a "search surface" using actual numerical data of RT slopes from Wolfe's experiments, replacing the N-N similarity axis by its mathematical manifestation, the within-groups standard deviation, and T-N similarity by the between-groups standard deviation.(1) The resulting surface (Figure 4b) is qualitatively similar to Duncan and Humphreys's. This interesting result testifies that the PCA representation succeeds in producing a viable realization of such intuitive terms as input similarity, and is compatible with the way we perceive the world in visual search tasks.

[Figure 4: RT rates versus: (a) input similarities (the search surface, reprinted from Duncan and Humphreys, 1989), and (b) standard deviations (within and between) of the PCA variance analysis. The asterisks denote Wolfe's experimental data.]

(1) In general, each principal axis contains information from different features, which may mask the information concerning the existence of a target. Hence, the first principal axis may not be the best choice for a discrimination task. In our simulations, the 5th axis, for example, was primarily dedicated to target information, and was hence used for the variance analysis (obviously, the neural network uses information from all the first eight principal axes).

4 Summary

In this work we present a two-component neural network model of pre-attentive visual processing. The model has been applied to the visual search paradigm performed by Wolfe et al. Our main finding is that when global-feature compression is applied to visual displays, there is an inherent difference between the representations of serial and parallel-task displays: the neural network studied in this paper has succeeded in detecting a target among distractors only for displays that were experimentally found to be processed in parallel. Based on the outcome of the variance analysis performed on the PCA representations of the visual displays, we present a quantitative criterion enabling one to distinguish between serial and parallel displays. Furthermore, the resulting "search surface" generated by the PCA components is in close correspondence with the metaphorical description of Duncan and Humphreys.

The network demonstrates an interesting generalization ability: naturally, it can learn to detect a target in parallel displays from examples of such displays. However, it can also learn to perform this task from examples of serial displays only!
On the other hand, we find that it is impossible to learn serial tasks, irrespective of the combination of parallel and serial displays that are presented to the network during the training phase. This generalization ability is manifested not only during the learning phase, but also during the performance phase; displays belonging to the same task have a similar eigenvalue spectrum, irrespective of the actual set-size of the displays, and this result holds true for parallel as well as for serial displays.

The role of PCA in perception was previously investigated by Cottrell [9], who designed a neural network which performed tasks such as face identification and gender discrimination. One might argue that PCA, being a global component analysis, is not compatible with the existence of local feature detectors (e.g. orientation detectors) in the cortex. Our work is in line with recent proposals [10] that there exist two pathways for sensory input processing: a fast sub-cortical pathway that contains limited information, and a slow cortical pathway which is capable of providing richer representations of the stimuli. Given this assumption, this paper has presented the first neural realization of the processing that may underlie the classical metaphorical explanations involved in visual search.

References

[1] J. K. Tsotsos. Analyzing vision at the complexity level. Behavioral and Brain Sciences, 13:423-469, 1990.

[2] M. I. Posner, C. R. Snyder, and B. J. Davidson. Attention and the detection of signals. Journal of Experimental Psychology: General, 109:160-174, 1980.

[3] Y. Tsal. Movement of attention across the visual field. Journal of Experimental Psychology: Human Perception and Performance, 9:523-530, 1983.

[4] A. Treisman and G. Gelade. A feature integration theory of attention. Cognitive Psychology, 12:97-136, 1980.

[5] J. Duncan and G. Humphreys. Visual search and stimulus similarity. Psychological Review, 96:433-458, 1989.

[6] A. Treisman and S. Gormican. Feature analysis in early vision: Evidence from search asymmetries. Psychological Review, 95:15-48, 1988.

[7] J. M. Wolfe, S. R. Friedman-Hill, M. I. Stewart, and K. M. O'Connell. The role of categorization in visual search for orientation. Journal of Experimental Psychology: Human Perception and Performance, 18:34-49, 1992.

[8] T. D. Sanger. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2:459-473, 1989.

[9] G. W. Cottrell. Extracting features from faces using compression networks: Face, identity, emotion and gender recognition using holons. Proceedings of the 1990 Connectionist Models Summer School, pages 328-337, 1990.

[10] J. L. Armony, D. Servan-Schreiber, J. D. Cohen, and J. E. LeDoux. Computational modeling of emotion: exploration through the anatomy and physiology of fear conditioning. Trends in Cognitive Sciences, 1(1):28-34, 1997.
Data-Dependent Structural Risk Minimisation for Perceptron Decision Trees

John Shawe-Taylor
Dept of Computer Science
Royal Holloway, University of London
Egham, Surrey TW20 0EX, UK
Email: jst@dcs.rhbnc.ac.uk

Nello Cristianini
Dept of Engineering Mathematics
University of Bristol
Bristol BS8 1TR, UK
Email: nello.cristianini@bristol.ac.uk

Abstract

Perceptron Decision Trees (also known as Linear Machine DTs, etc.) are analysed in order that data-dependent Structural Risk Minimisation can be applied. Data-dependent analysis is performed which indicates that choosing the maximal margin hyperplanes at the decision nodes will improve the generalization. The analysis uses a novel technique to bound the generalization error in terms of the margins at individual nodes. Experiments performed on real data sets confirm the validity of the approach.

1 Introduction

Neural network researchers have traditionally tackled classification problems by assembling perceptron or sigmoid nodes into feedforward neural networks. In this paper we consider a less common approach where the perceptrons are used as decision nodes in a decision tree structure. The approach has the advantage that more efficient heuristic algorithms exist for these structures, while the advantages of inherent parallelism are if anything greater, as all the perceptrons can be evaluated in parallel, with the path through the tree determined in a very fast post-processing phase.

Classical Decision Trees (DTs), like the ones produced by popular packages such as CART [5] or C4.5 [9], partition the input space by means of axis-parallel hyperplanes (one at each internal node), hence inducing categories which are represented by (axis-parallel) hyperrectangles in such a space.

A natural extension of that hypothesis space is obtained by associating to each internal node hyperplanes in general position, hence partitioning the input space by means of polygonal (polyhedral) categories.

This approach has been pursued by many researchers, often with different motivations, and hence the resulting hypothesis space has been given a number of different names: multivariate DTs [6], oblique DTs [8], DTs using linear combinations of the attributes [5], Linear Machine DTs, Neural Decision Trees [12], Perceptron Trees [13], etc. We will call them Perceptron Decision Trees (PDTs), as they can be regarded as binary trees having a simple perceptron associated to each decision node.

Different algorithms for top-down induction of PDTs from data have been proposed, based on different principles [10], [5], [8]. Experimental study of learning by means of PDTs indicates that their performance is sometimes better than that of traditional decision trees in terms of generalization error, and usually much better in terms of tree-size [8], [6], but on some data sets PDTs can be outperformed by normal DTs.

We investigate an alternative strategy for improving the generalization of these structures, namely placing maximal margin hyperplanes at the decision nodes (a minimal sketch of how such a tree classifies an input is given below).
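To fix ideas, the following Python sketch shows how a perceptron decision tree classifies a point, in the sense of the Generalized Decision Tree definition given in the next section. The Node class and the toy weights are purely illustrative and are not part of the paper.

import numpy as np

class Node:
    # Internal node of a perceptron decision tree: a weight vector w (with the
    # threshold folded in as the last coordinate) and two children, which are
    # either further Nodes or leaf labels (0 or 1).
    def __init__(self, w, left, right):
        self.w, self.left, self.right = np.asarray(w, dtype=float), left, right

def pdt_classify(node, x):
    # Follow the path determined by the sign of each node's perceptron output.
    x_aug = np.append(x, 1.0)                      # augment with a constant coordinate
    while isinstance(node, Node):
        node = node.right if node.w @ x_aug > 0 else node.left
    return node                                    # a leaf label, 0 or 1

# Example: a depth-2 tree over R^2 with arbitrarily chosen weights.
tree = Node([1.0, -1.0, 0.0],
            left=Node([0.0, 1.0, -0.5], left=0, right=1),
            right=1)
print(pdt_classify(tree, np.array([0.2, 0.9])))    # prints 1 for this input

All node perceptrons can be evaluated in parallel on the same input, with the tree traversal reduced to the fast post-processing step mentioned above.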
By use of a novel analysis we are able to demonstrate that improved generalization bounds can be obtained for this approach. Experiments confirm that such a method delivers more accurate trees on all tested databases.

2 Generalized Decision Trees

Definition 2.1 Generalized Decision Trees (GDT).
Given a space X and a set of boolean functions F = {f : X → {0, 1}}, the class GDT(F) of Generalized Decision Trees over F are functions which can be implemented using a binary tree where each internal node is labeled with an element of F, and each leaf is labeled with either 1 or 0.

To evaluate a particular tree T on input x ∈ X, all the boolean functions associated to the nodes are assigned the same argument x ∈ X, which is the argument of T(x). The values assumed by them determine a unique path from the root to a leaf: at each internal node the left (respectively right) edge to a child is taken if the output of the function associated to that internal node is 0 (respectively 1). The value of the function at the assignment of an x ∈ X is the value associated to the leaf reached. We say that input x reaches a node of the tree if that node is on the evaluation path for x.

In the following, the nodes are the internal nodes of the binary tree, and the leaves are its external ones.

Examples.

- Given X = {0, 1}^n, a Boolean Decision Tree (BDT) is a GDT over
  F_BDT = {f : f(x) = x_i, for all x ∈ X}.

- Given X = R^n, a C4.5-like Decision Tree (CDT) is a GDT over
  F_CDT = {f_{i,θ} : f_{i,θ}(x) = 1 ⇔ x_i > θ}.
  This kind of decision tree, defined on a continuous space, is the output of common algorithms like C4.5 and CART, and we will call them - for short - CDTs.

- Given X = R^n, a Perceptron Decision Tree (PDT) is a GDT over
  F_PDT = {w^T x : w ∈ R^{n+1}},
  where we have assumed that the inputs have been augmented with a coordinate of constant value, hence implementing a thresholded perceptron.

3 Data-dependent SRM

We begin with the definition of the fat-shattering dimension, which was first introduced in [7], and has been used for several problems in learning since [1, 4, 2, 3].

Definition 3.1 Let F be a set of real valued functions. We say that a set of points X is γ-shattered by F relative to r = (r_x)_{x ∈ X} if there are real numbers r_x indexed by x ∈ X such that for all binary vectors b indexed by X, there is a function f_b ∈ F satisfying

  f_b(x) ≥ r_x + γ  if b_x = 1,
  f_b(x) ≤ r_x − γ  otherwise.

The fat-shattering dimension fat_F of the set F is a function from the positive real numbers to the integers which maps a value γ to the size of the largest γ-shattered set, if this is finite, or infinity otherwise.

As an example which will be relevant to the subsequent analysis consider the class:

  F_lin = {x → <w, x> + θ : ||w|| = 1}.

We quote the following result from [11].

Corollary 3.2 [11] Let F_lin be restricted to points in a ball of n dimensions of radius R about the origin and with thresholds |θ| ≤ R. Then

  fat_{F_lin}(γ) ≤ min{9R²/γ², n + 1} + 1.
The following theorem bounds the generalisation of a classifier in terms of the fat-shattering dimension rather than the usual Vapnik-Chervonenkis or Pseudo dimension. Let T_θ denote the threshold function at θ: T_θ : R → {0, 1}, T_θ(α) = 1 iff α > θ. For a class of functions F, T_θ(F) = {T_θ(f) : f ∈ F}.

Theorem 3.3 [11] Consider a real valued function class F having fat-shattering function bounded above by the function afat : R → N which is continuous from the right. Fix θ ∈ R. If a learner correctly classifies m independently generated examples z with h = T_θ(f) ∈ T_θ(F) such that er_z(h) = 0 and γ = min_i |f(x_i) − θ|, then with confidence 1 − δ the expected error of h is bounded from above by

  ε(m, k, δ) = (2/m) (k log(8em/k) log(32m) + log(8m/δ)),

where k = afat(γ/8).

The importance of this theorem is that it can be used to explain how a classifier can give better generalisation than would be predicted by a classical analysis of its VC dimension. Essentially, expanding the margin performs an automatic capacity control for function classes with small fat-shattering dimensions. The theorem shows that when a large margin is achieved it is as if we were working in a lower VC class. We should stress that in general the bounds obtained should be better for cases where a large margin is observed, but that a priori there is no guarantee that such a margin will occur. Therefore a priori only the classical VC bound can be used. In view of corresponding lower bounds on the generalisation error in terms of the VC dimension, the a posteriori bounds depend on a favourable probability distribution making the actual learning task easier. Hence, the result will only be useful if the distribution is favourable, or at least not adversarial. In this sense the result is a distribution dependent result, despite not being distribution dependent in the traditional sense that assumptions about the distribution have had to be made in its derivation. The benign behaviour of the distribution is automatically estimated in the learning process.

In order to perform a similar analysis for perceptron decision trees we will consider the set of margins obtained at each of the nodes, bounding the generalization as a function of these values.

4 Generalisation analysis of the Tree Class

It turns out that bounding the fat-shattering dimension of PDTs viewed as real function classifiers is difficult. We will therefore do a direct generalization analysis, mimicking the proof of Theorem 3.3 but taking into account the margins at each of the decision nodes in the tree.

Definition 4.1 Let (X, d) be a (pseudo-) metric space, let A be a subset of X and ε > 0. A set B ⊆ X is an ε-cover for A if, for every a ∈ A, there exists b ∈ B such that d(a, b) < ε. The ε-covering number of A, N_d(ε, A), is the minimal cardinality of an ε-cover for A (if there is no such finite cover then it is defined to be ∞).

We write N(ε, F, x) for the ε-covering number of F with respect to the l∞ pseudo-metric measuring the maximum discrepancy on the sample x. These numbers are bounded in the following lemma.
Lemma 4.2 (Alon et al. [1]) Let F be a class of functions X → [0, 1] and P a distribution over X. Choose 0 < ε < 1 and let d = fat_F(ε/4). Then

  E(N(ε, F, x)) ≤ 2 (4m/ε²)^{d log(2em/(dε))},

where the expectation E is taken w.r.t. a sample x ∈ X^m drawn according to P^m.

Corollary 4.3 [11] Let F be a class of functions X → [a, b] and P a distribution over X. Choose 0 < ε < 1 and let d = fat_F(ε/4). Then

  E(N(ε, F, x)) ≤ 2 (4m(b − a)²/ε²)^{d log(2em(b − a)/(dε))},

where the expectation E is over samples x ∈ X^m drawn according to P^m.

We are now in a position to tackle the main lemma, which bounds the probability over a double sample that the first half has zero error and the second has error greater than an appropriate ε. Here, error is interpreted as being differently classified at the output of the tree. In order to simplify the notation in the following lemma we assume that the decision tree has K nodes. We also denote fat_{F_lin}(γ) by fat(γ) to simplify the notation.

Lemma 4.4 Let T be a perceptron decision tree with K decision nodes with margins γ_1, γ_2, ..., γ_K at the decision nodes. If it has correctly classified m labelled examples generated independently according to the unknown (but fixed) distribution P, then we can bound the following probability to be less than δ:

  P^{2m} { xy : ∃ a tree T : T correctly classifies x, fraction of y misclassified > ε(m, K, δ) } < δ,

where ε(m, K, δ) = (1/m)(D log(4m) + log(2/δ)), D = Σ_{i=1}^{K} k_i log(4em/k_i) and k_i = fat(γ_i/8).

Proof: Using the standard permutation argument, we may fix a sequence xy and bound the probability under the uniform distribution on swapping permutations that the sequence satisfies the condition stated. We consider generating minimal γ_k/2-covers B_k^{xy} for each value of k, where γ_k = min{γ' : fat(γ'/8) ≤ k}. Suppose that for node i of the tree the margin γ_i of the hyperplane w_i satisfies fat(γ_i/8) = k_i. We can therefore find f_i ∈ B_{k_i}^{xy} whose output values are within γ_{k_i}/2 of w_i. We now consider the tree T' obtained by replacing the node perceptrons w_i of T with the corresponding f_i. This tree performs the same classification function on the first half of the sample, and the margin remains larger than γ_i − γ_{k_i}/2 > γ_{k_i}/2. If a point in the second half of the sample is incorrectly classified by T, it will either still be incorrectly classified by the adapted tree T' or will, at one of the decision nodes i in T', be closer to the decision boundary than γ_{k_i}/2. The point is thus distinguishable from left hand side points, which are both correctly classified and have margin greater than γ_{k_i}/2 at node i. Hence, that point must be kept on the right hand side in order for the condition to be satisfied. Hence, the fraction of permutations that can be allowed for one choice of the functions from the covers is 2^{−εm}. We must take the union bound over all choices of the functions from the covers. Using the techniques of [11] the number of these choices is bounded by Corollary 4.3 as follows:

  ∏_{i=1}^{K} 2 (8m)^{k_i log(4em/k_i)} = 2^K (8m)^D,

where D = Σ_{i=1}^{K} k_i log(4em/k_i).
The value of ε in the lemma statement therefore ensures that this union bound is less than δ. □

Using the standard lemma due to Vapnik [14, page 168] to bound the error probabilities in terms of the discrepancy on a double sample, combined with Lemma 4.4, gives the following result.

Theorem 4.5 Suppose we are able to classify an m sample of labelled examples using a perceptron decision tree with K nodes, obtaining margins γ_i at node i. Then we can bound the generalisation error with probability greater than 1 − δ to be less than

  (1/m) ( D log(4m) + log( (8m)^K C(2K, K) / ((K + 1) δ) ) ),

where C(2K, K) denotes the binomial coefficient, D = Σ_{i=1}^{K} k_i log(4em/k_i) and k_i = fat(γ_i/8).

Proof: We must bound the probabilities over different architectures of trees and different margins. We simply have to choose the values of ε to ensure that the individual δ's are sufficiently small that the total over all possible choices is less than δ. The details are omitted in this abstract. □

5 Experiments

The theoretical results obtained in the previous section imply that an algorithm which produces large margin splits should have better generalization, since increasing the margins in the internal nodes has the effect of decreasing the bound on the test error.

In order to test this strategy, we have performed the following experiment, divided in two parts: first run a standard perceptron decision tree algorithm, and then for each decision node generate a maximal margin hyperplane implementing the same dichotomy in place of the decision boundary generated by the algorithm.

Input: Random m sample x with corresponding classification b.
Algorithm: Find a perceptron decision tree T which correctly classifies the sample using a standard algorithm;
  Let K = number of decision nodes of T;
  From tree T create T' by executing the following loop:
    For each decision node i replace the weight vector w_i by the vector w'_i which realises the maximal margin hyperplane agreeing with w_i on the set of inputs reaching node i;
    Let the margin of w'_i on the inputs reaching node i be γ_i;
Output: Classifier T', with a bound on the generalisation error in terms of the number of decision nodes K and D = Σ_{i=1}^{K} k_i log(4em/k_i), where k_i = fat(γ_i/8).

Note that the classification of T and T' agree on the sample and hence that T' is consistent with the sample.

As a PDT learning algorithm we have used OC1 [8], created by Murthy, Kasif and Salzberg and freely available over the internet. It is a randomized algorithm, which performs simulated annealing for learning the perceptrons. The details about the randomization, the pruning, and the splitting criteria can be found in [8].

The data we have used for the test are 4 of the 5 sets used in the original OC1 paper, which are publicly available in the UCI data repository [16]. (A minimal sketch of the margin-refitting step and of the resulting bound is given below.)
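The following Python sketch illustrates the second stage and the evaluation of the Theorem 4.5 bound from the measured margins. It is not the authors' implementation: a linear SVC with a very large C is used here as a stand-in hard-margin solver for the relabelled subsample reaching each node (separable by construction, since the dichotomy is taken from the OC1 tree), and the per-node radii and dimensions passed to the bound, as well as the use of natural logarithms, are assumptions of the sketch; Corollary 3.2 is used to bound k_i = fat(γ_i/8).

import numpy as np
from math import comb, e, log
from sklearn.svm import SVC    # large-C linear SVC approximates a hard-margin split

def max_margin_split(X_node, side):
    # Refit one decision node: the maximal-margin hyperplane realising the SAME
    # dichotomy (side[j] in {0, 1}) on the inputs X_node reaching that node.
    svm = SVC(kernel="linear", C=1e6).fit(X_node, side)
    w, b = svm.coef_.ravel(), svm.intercept_[0]
    return w, b, 1.0 / np.linalg.norm(w)           # geometric margin of the split

def pdt_bound(margins, radii, dims, m, delta=0.05):
    # Theorem 4.5 bound from the per-node margins gamma_i, with
    # k_i = fat(gamma_i / 8) bounded via Corollary 3.2 (radius R, dimension n per node).
    # Natural logarithms are used here; the theorem does not fix the base explicitly.
    K = len(margins)
    ks = [min(9.0 * R ** 2 / (g / 8.0) ** 2, n + 1) + 1
          for g, R, n in zip(margins, radii, dims)]
    D = sum(k * log(4 * e * m / k) for k in ks)
    return (D * log(4 * m)
            + log((8.0 * m) ** K * comb(2 * K, K) / ((K + 1) * delta))) / m

Because the relabelled subsample at each node is separable, the hard-margin problem is well posed and the refitted tree agrees with the original one on the training sample, as required by the procedure above.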
The results we have obtained on these data are compatible with the ones reported in the original OC1 paper, the differences being due to different divisions between training and testing sets and their sizes; the absence in our experiments of cross-validation and other techniques to estimate the predictive accuracy of the PDT; and the inherently randomized nature of the algorithm.

The second stage of the experiment involved finding - for each node - the hyperplane which performs the same split as performed by the OC1 tree but with the maximal margin. This can be done by considering the subsample reaching each node as perfectly divided in two parts, and feeding the data, accordingly relabelled, to an algorithm which finds the optimal split in the linearly separable case. The maximal margin hyperplanes are then placed in the decision nodes and the new tree is tested on the same testing set.

The data sets we have used are: Wisconsin Breast Cancer, Pima Indians Diabetes, Boston Housing (transformed into a classification problem by thresholding the price at $21,000) and the classical Iris data studied by Fisher (more information about the databases and their authors is in [8]). All the details about sample sizes, numbers of attributes and results (training and testing accuracy, tree size) are summarised in Table 1.

We were not particularly interested in achieving a high testing accuracy, but rather in observing whether improved performance can be obtained by increasing the margin. For this reason we did not try to optimize the performance of the original classifier by using cross-validation, or a convenient training/testing set ratio. The relevant quantity, in this experiment, is the difference in the testing error between a PDT with arbitrary margins and the same tree with optimized margins. This quantity has turned out to be always positive, and to range from 1.7 to 2.8 percent of gain, on test errors which were already very low.

Table 1: Training accuracy, testing accuracy of the OC1 tree and of the same tree with maximal-margin (FAT) splits (all in %), together with sample sizes, numbers of attributes and classes, and tree size.

        train   OC1 test  FAT test  #trs  #ts  attrib.  classes  nodes
CANC    96.53   93.52     95.37     249   108     9        2       1
IRIS    96.67   96.67     98.33      90    60     4        3       2
DIAB    89.00   70.48     72.45     209   559     8        2       4
HOUS    95.90   81.43     84.29     306   140    13        2       7

References

[1] Noga Alon, Shai Ben-David, Nicolo Cesa-Bianchi and David Haussler, "Scale-sensitive Dimensions, Uniform Convergence, and Learnability," in Proceedings of the Conference on Foundations of Computer Science (FOCS), 1993. Also to appear in Journal of the ACM.

[2] Martin Anthony and Peter Bartlett, "Function learning from interpolation", Technical Report, 1994. (An extended abstract appeared in Computational Learning Theory, Proceedings 2nd European Conference, EuroCOLT'95, pages 211-221, ed. Paul Vitanyi, Lecture Notes in Artificial Intelligence 904, Springer-Verlag, Berlin, 1995.)

[3] Peter L. Bartlett and Philip M. Long, "Prediction, Learning, Uniform Convergence, and Scale-Sensitive Dimensions," Preprint, Department of Systems Engineering, Australian National University, November 1995.

[4] Peter L. Bartlett, Philip M. Long, and Robert C.
Williamson,  \"Fat-shattering \n\nand the learnability  of Real-valued  Functions,\"  Journal of Computer and  Sy.(cid:173)\ntem  Science.,  52(3), 434-452,  (1996). \n\n[5]  Breiman  L.,  Friedman  J.H.,  Olshen  R.A., Stone C.J., \"Classification and Re(cid:173)\n\ngression  Trees\",  Wadsworth International Group, Belmont,  CA,  1984. \n\n[6]  Brodley  C.E., UtgofF P.E., Multivariate  Decision  Trees,  Machine  Learning  19, \n\npp.  45-77,  1995. \n\n[7]  Michael J. Kearns and Robert E. Schapire,  \"Efficient Distribution-free Learning \nof Probabilistic Concepts,\" pages 382-391 in Proceeding. of the Slit Sympo.ium \non  the  Foundation.  of Computer Science,  IEEE  Computer Society  Press,  Los \nAlamitos,  CA,  1990. \n\n[8]  Murthy S.K., Kasif S., Salzberg S., A System for Induction of Oblique Decision \n\nTrees,  Journal of Artificial  Intelligence  Research,  2 (1994), pp.  1-32. \n\n[9]  Quinlan  J.R.,  \"C4.5:  Programs  for  Machine  Learning\",  Morgan  Kaufmann, \n\n1993. \n\n[10]  Sankar A., Mammone R.J., Growing and Pruning Neural Tree Networks, IEEE \n\nTransactions on  Computers, 42:291-299,  1993. \n\n[11]  John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson,  Martin Anthony, \nStructural  Risk  Mjnjmization  over  Data-Dependent  Hierarchi~, NeuroCOLT \nTechnical  Report  NC-TR-96-053, 1996. \n(ftp:llftp.dc \u2022\u2022 rhbDc.ac.uk/pub/Deurocolt/t.c~.port.). \n\n[12]  J.A.  Sirat, and J.-P. Nadal, \"Neural  trees:  a new  tool for classification\",  Net(cid:173)\n\nwork,  1,  pp. 423-438,  1990 \n\n[13]  UtgofF  P.E.,  Perceptron  Trees:  a  Case  Study in  Hybrid  Concept  Representa(cid:173)\n\ntions,  Connection  Science  1 (1989), pp.  377-391. \n\n[14]  Vladimir  N.  Vapnik,  E.timation  of Dependence.  Baled  on  Empirical  Data, \n\nSpringer-Verlag,  New  York,  1982. \n\n[15]  Vladimir  N.  Vapnik,  The  Nature  of Statiltical  Learning  Theory,  Springer(cid:173)\n\nVerlag,  New  York,  1995 \n\n[16]  University  of  California, \n\nIrvine \n\nhttp://www.icB.uci.edu/ mlearn/MLRepoBitory.html \n\nMachine  Learning  Repository, \n\n\f", "award": [], "sourceid": 1359, "authors": [{"given_name": "John", "family_name": "Shawe-Taylor", "institution": null}, {"given_name": "Nello", "family_name": "Cristianini", "institution": null}]}