{"title": "A Mixture of Experts Classifier with Learning Based on Both Labelled and Unlabelled Data", "book": "Advances in Neural Information Processing Systems", "page_first": 571, "page_last": 577, "abstract": null, "full_text": "A  Mixture of Experts  Classifier with \nLearning Based on Both Labelled and \n\nUnlabelled Data \n\nDavid J. Miller and Hasan  S.  Uyar \n\nDepartment of Electrical  Engineering \n\nThe  Pennsylvania State  University \n\nUniversity Park, Pa.  16802 \nmiller@perseus.ee.psu.edu \n\nAbstract \n\nWe  address  statistical  classifier  design  given  a  mixed  training  set  con(cid:173)\nsisting  of  a  small  labelled  feature  set  and  a  (generally  larger)  set  of \nunlabelled features.  This situation  arises,  e.g., for  medical images,  where \nalthough  training  features  may  be  plentiful,  expensive  expertise  is  re(cid:173)\nquired  to  extract  their  class  labels.  We  propose  a  classifier  structure \nand learning  algorithm  that make effective  use of unlabelled data to im(cid:173)\nprove performance.  The learning  is  based  on  maximization  of the  total \ndata  likelihood,  i.e.  over  both  the  labelled  and  unlabelled  data  sub(cid:173)\nsets.  Two distinct EM learning  algorithms  are proposed,  differing  in  the \nEM  formalism  applied  for  unlabelled  data.  The classifier,  based  on  a \njoint probability model for features  and labels,  is  a  \"mixture of experts\" \nstructure that is  equivalent to the  radial  basis function  (RBF) classifier, \nbut unlike RBFs, is  amenable  to likelihood-based training.  The scope of \napplication  for  the  new  method  is  greatly  extended  by  the  observation \nthat test data, or any new data to classify, is in fact  additional,  unlabelled \ndata - thus,  a combined learning/classification operation - much akin  to \nwhat  is  done  in  image  segmentation  - can  be  invoked  whenever  there \nis  new  data to classify.  Experiments  with  data sets from  the  UC  Irvine \ndatabase  demonstrate  that  the  new  learning  algorithms  and  structure \nachieve substantial performance  gains  over alternative  approaches. \n\n1 \n\nIntroduction \n\nStatistical classifier design is fundamentally a supervised learning problem, wherein \na  decision  function,  mapping an  input  feature  vector  to  an  output  class  label,  is \nlearned  based on  representative (feature,class label)  training pairs.  While a  variety \nof classifier  structures  and  associated  learning  algorithms  have  been  developed,  a \ncommon  element  of nearly  all  approaches  is  the  assumption  that  class  labels  are \n\n\f572 \n\nD.  J.  Miller and H.  S.  Uyar \n\nknown  for  each  feature  vector  used  for  training.  This  is  certainly  true  of  neu(cid:173)\nral  networks such  as  multilayer perceptrons  and  radial  basis  functions  (RBFs),  for \nwhich classification is  usually  viewed as function  approximation, with  the networks \ntrained  to minimize the squared distance  to  target class  values.  Knowledge of class \nlabels  is  also  required  for  parametric classifiers  such  as  mixture  of Gaussian  clas(cid:173)\nsifiers,  for  which  learning  typically involves dividing  the  training  data into subsets \nby class  and then  using  maximum likelihood estimation (MLE)  to separately learn \neach  class  density.  
While labelled training data may be plentiful for some applications, for others, such as remote sensing and medical imaging, the training set is in principle vast but the size of the labelled subset may be inadequate. The difficulty in obtaining class labels may arise due to limited knowledge or limited resources, as expensive expertise is often required to derive class labels for features. In this work, we address classifier design under these conditions, i.e. the training set $X$ is assumed to consist of two subsets, $X = \{X_l, X_u\}$, where $X_l = \{(x_1, c_1), (x_2, c_2), \ldots, (x_{N_l}, c_{N_l})\}$ is the labelled subset and $X_u = \{x_{N_l+1}, \ldots, x_N\}$ is the unlabelled subset^1. Here, $x_i \in \mathbb{R}^k$ is the feature vector and $c_i \in \mathcal{I}$ is the class label from the label set $\mathcal{I} = \{1, 2, \ldots, N_c\}$. \n\n^1 This problem can be viewed as a type of \"missing data\" problem, wherein the missing items are class labels. As such, it is related to, albeit distinct from, supervised learning involving missing and/or noisy feature components, addressed in (Ghahramani and Jordan 1995), (Tresp et al. 1995). \n\nThe practical significance of this mixed training problem was recognized in (Lippmann 1989). However, despite this realization, there has been surprisingly little work done on this problem. One likely reason is that it does not appear possible to incorporate unlabelled data directly within conventional supervised learning methods such as back propagation. For these methods, unlabelled features must either be discarded or preprocessed in a suboptimal, heuristic fashion to obtain class label estimates. We also note the existence of work which is less than optimistic concerning the value of unlabelled data for classification (Castelli and Cover 1994). However, (Shashahani and Landgrebe 1994) found that unlabelled data could be used effectively in label-deficient situations. While we build on their work, as well as on our own previous work (Miller and Uyar 1996), our approach differs from (Shashahani and Landgrebe 1994) in several important respects. First, we suggest a more powerful mixture-based probability model with an associated classifier structure that has been shown to be equivalent to the RBF classifier (Miller 1996). The practical significance of this equivalence is that unlike RBFs, which are trained in a conventional supervised fashion, the RBF-equivalent mixture model is naturally suited for statistical training (MLE). The statistical framework is the key to incorporating unlabelled data in the learning. A second departure from prior work is the choice of learning criterion. We maximize the joint data likelihood and suggest two distinct EM algorithms for this purpose, whereas the conditional likelihood was considered in (Shashahani and Landgrebe 1994). We have found that our approach achieves superior results. A final novel contribution is a considerable expansion of the range of situations for which the mixed training paradigm can be applied. This is made possible by the realization that test data or new data to classify can also be viewed as an unlabelled set, available for \"training\". This notion will be clarified in the sequel. \n\n2 Unlabelled Data and Classification \n\nHere we briefly provide some intuitive motivation for the use of unlabelled data. 
Suppose, not very restrictively, that the data is well-modelled by a mixture density, in the following way. The feature vectors are generated according to the density $f(x|\Theta) = \sum_{l=1}^{L} \alpha_l f(x|\theta_l)$, where $f(x|\theta_l)$ is one of $L$ component densities, with non-negative mixing parameters $\alpha_l$ such that $\sum_{l=1}^{L} \alpha_l = 1$. Here, $\theta_l$ is the set of parameters specifying the component density, with $\Theta = \{\theta_l\}$. The class labels are also viewed as random quantities and are assumed chosen conditioned on the selected mixture component $m_i \in \{1, 2, \ldots, L\}$ and possibly on the feature value, i.e. according to the probabilities $P[c_i | x_i, m_i]$^2. Thus, the data pairs are assumed generated by selecting, in order, the mixture component, the feature value, and the class label, with each selection depending in general on preceding ones. The optimal classification rule for this model is the maximum a posteriori rule: \n\n$S(x_i) = \arg\max_k \sum_{j=1}^{L} P[c_i = k | m_i = j, x_i] \, P[m_i = j | x_i]$,   (1) \n\nwhere $P[m_i = j | x_i] = \alpha_j f(x_i|\theta_j) / \sum_{l=1}^{L} \alpha_l f(x_i|\theta_l)$, and where $S(\cdot)$ is a selector function with range in $\mathcal{I}$. Since this rule is based on the a posteriori class probabilities, one can argue that learning should focus solely on estimating these probabilities. However, if the classifier truly implements (1), then implicitly it has been assumed that the estimated mixture density accurately models the feature vectors. If this is not true, then presumably estimates of the a posteriori probabilities will also be affected. This suggests that even in the absence of class labels, the feature vectors can be used to better learn a posteriori probabilities via improved estimation of the mixture-based feature density. A commonly used measure of mixture density accuracy is the data likelihood. \n\n3 Joint Likelihood Maximization for a Mixture of Experts Classifier \n\nThe previous section basically argues for a learning approach that uses labelled data to directly estimate a posteriori probabilities and unlabelled data to estimate the feature density. A criterion which essentially fulfills these objectives is the joint data likelihood, computed over both the labelled and unlabelled data subsets. Given our model, the joint data log-likelihood is written in the form \n\n$\log L = \sum_{x_i \in X_u} \log \sum_{l=1}^{L} \alpha_l f(x_i|\theta_l) + \sum_{x_i \in X_l} \log \sum_{l=1}^{L} \alpha_l P[c_i | x_i, m_i = l] f(x_i|\theta_l)$.   (2) \n\nThis objective function consists of a \"supervised\" term based on $X_l$ and an \"unsupervised\" term based on $X_u$. The joint data likelihood was previously considered in a learning context in (Xu et al. 1995). However, there the primary justification was simplification of the learning algorithm in order to allow parameter estimation based on fixed point iterations rather than gradient descent. A small numerical sketch of the quantities in (1) and (2) is given below. 
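The following is a minimal numerical sketch of these quantities, assuming Gaussian component densities and, for concreteness, the feature-independent class probabilities $\beta_{k|j}$ of the GM model introduced in the next section; the function and variable names (component_posteriors, alpha, mu, Sigma, beta) are illustrative choices, not from the paper. 

# Illustrative sketch (not the authors' code): component posteriors P[m=j|x]
# and the joint log-likelihood (2) for Gaussian components, with beta[k, j]
# standing in for P[c=k | m=j].
import numpy as np
from scipy.stats import multivariate_normal

def component_posteriors(X, alpha, mu, Sigma):
    """Return (P[m_i=j | x_i], alpha_j * f(x_i|theta_j)) as (N, L) arrays."""
    dens = np.stack([alpha[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                     for j in range(len(alpha))], axis=1)
    return dens / dens.sum(axis=1, keepdims=True), dens

def joint_log_likelihood(X_l, c_l, X_u, alpha, mu, Sigma, beta):
    """Eq. (2): unsupervised term over X_u plus supervised term over X_l."""
    _, dens_u = component_posteriors(X_u, alpha, mu, Sigma)
    _, dens_l = component_posteriors(X_l, alpha, mu, Sigma)
    unsup = np.log(dens_u.sum(axis=1)).sum()
    # the supervised term weights component l by beta_{c_i|l}
    sup = np.log((dens_l * beta[c_l, :]).sum(axis=1)).sum()
    return unsup + sup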
Here, in contrast, the joint likelihood allows the inclusion of unlabelled samples in the learning. We next consider two special cases of the probability model described until now. \n\n^2 The usual assumption made is that components are \"hard-partitioned\", in a deterministic fashion, to classes. Our random model includes the \"partitioned\" one as a special case. We have generally found this model to be more powerful than the \"partitioned\" one (Miller and Uyar 1996). \n\nThe \"partitioned\" mixture (PM) model: This is the previously mentioned case where mixture components are \"hard-partitioned\" to classes (Shashahani and Landgrebe 1994). This is written $M_j \in C_k$, where $M_j$ denotes mixture component $j$ and $C_k$ is the subset of components owned by class $k$. The posterior probabilities have the form \n\n$P[c_i = k | x] = \sum_{j: M_j \in C_k} \alpha_j f(x|\theta_j) \Big/ \sum_{l=1}^{L} \alpha_l f(x|\theta_l)$.   (3) \n\nThe generalized mixture (GM) model: The form of the posterior for each mixture component is now $P[c_i | m_i, x_i] = P[c_i | m_i] \equiv \beta_{c_i|m_i}$, i.e., it is independent of the feature value. The overall posterior probability takes the form \n\n$P[c_i | x_i] = \sum_{j=1}^{L} \left( \frac{\alpha_j f(x_i|\theta_j)}{\sum_{l=1}^{L} \alpha_l f(x_i|\theta_l)} \right) \beta_{c_i|j}$.   (4) \n\nThis model was introduced in (Miller and Uyar 1996) and was shown there to lead to performance improvement over the PM model. Note that the probabilities have a \"mixture of experts\" structure, where the \"gating units\" are the probabilities $P[m_i = j | x_i]$ (in parentheses), and with the \"expert\" for component $j$ just the probability $\beta_{c_i|j}$. Elsewhere (Miller 1996), it has been shown that the associated classifier decision function is in fact equivalent to that of an RBF classifier (Moody and Darken 1989). Thus, we suggest a probability model equivalent to a widely used neural network classifier, but with the advantage that, unlike the standard RBF, the RBF-equivalent probability model is amenable to statistical training, and hence to the incorporation of unlabelled data in the learning. Note that more powerful models $P[c_i | m_i, x_i]$ that do condition on $x_i$ are also possible. However, such models will require many more parameters, which will likely hurt generalization performance, especially in a label-deficient learning context. The gating/expert form of (4) is sketched in code below. 
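The sketch below evaluates the GM class posterior (4) and the resulting MAP decision, reusing component_posteriors() from the previous sketch; beta is an (N_c x L) array whose entry beta[k, j] plays the role of $\beta_{k|j}$. It is an illustration under the same assumptions, not the authors' implementation. 

def gm_class_posteriors(X, alpha, mu, Sigma, beta):
    """P[c_i = k | x_i] = sum_j P[m_i = j | x_i] * beta_{k|j}, as in (4)."""
    gate, _ = component_posteriors(X, alpha, mu, Sigma)   # (N, L) gating probabilities
    return gate @ beta.T                                  # (N, N_c) class posteriors

def classify(X, alpha, mu, Sigma, beta):
    """The MAP rule (1) under the GM model: pick the class with the largest posterior."""
    return gm_class_posteriors(X, alpha, mu, Sigma, beta).argmax(axis=1)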
Interestingly, for the mixed training problem, there are two Expectation-Maximization (EM) (Dempster et al. 1977) formulations that can be applied to maximize the likelihood associated with a given probability model. These two formulations lead to distinct methods that take different learning \"trajectories\", although both ascend in the data likelihood. The difference between the formulations lies in how the \"incomplete\" and \"complete\" data elements are defined within the EM framework. We will develop these two approaches for the suggested GM model. \n\nEM-I (No class labels assumed): Distinct data interpretations are given for $X_l$ and $X_u$. In this case, for $X_u$, the incomplete data consists of the features $\{x_i\}$ and the complete data consists of $\{(x_i, m_i)\}$. For $X_l$, the incomplete data consists of $\{(x_i, c_i)\}$, with the complete data now the triple $\{(x_i, c_i, m_i)\}$. To clarify, in this case mixture labels are viewed as the sole missing data elements, for $X_u$ as well as for $X_l$. Thus, in effect class labels are not even postulated to exist for $X_u$. \n\nEM-II (Class labels assumed): The definitions for $X_l$ are the same as before. However, for $X_u$, the complete data now consists of the triple $\{(x_i, c_i, m_i)\}$, i.e. class labels are also assumed missing for $X_u$. \n\nFor Gaussian components, we have $\theta_l = \{\mu_l, \Sigma_l\}$, with $\mu_l$ the mean vector and $\Sigma_l$ the covariance matrix. For EM-I, the resulting fixed point iterations for updating the parameters are: \n\n$\alpha_j^{(t+1)} = \frac{1}{N} \Big( \sum_{x_i \in X_l} P[m_i = j | x_i, c_i, \Theta^{(t)}] + \sum_{x_i \in X_u} P[m_i = j | x_i, \Theta^{(t)}] \Big), \quad \forall j$ \n\n$\mu_j^{(t+1)} = \Big( \sum_{x_i \in X_l} x_i P[m_i = j | x_i, c_i, \Theta^{(t)}] + \sum_{x_i \in X_u} x_i P[m_i = j | x_i, \Theta^{(t)}] \Big) \Big/ \Big( \sum_{x_i \in X_l} P[m_i = j | x_i, c_i, \Theta^{(t)}] + \sum_{x_i \in X_u} P[m_i = j | x_i, \Theta^{(t)}] \Big), \quad \forall j$ \n\n$\Sigma_j^{(t+1)} = \Big( \sum_{x_i \in X_l} S_{ij}^{(t)} P[m_i = j | x_i, c_i, \Theta^{(t)}] + \sum_{x_i \in X_u} S_{ij}^{(t)} P[m_i = j | x_i, \Theta^{(t)}] \Big) \Big/ \Big( \sum_{x_i \in X_l} P[m_i = j | x_i, c_i, \Theta^{(t)}] + \sum_{x_i \in X_u} P[m_i = j | x_i, \Theta^{(t)}] \Big), \quad \forall j$ \n\n$\beta_{k|j}^{(t+1)} = \sum_{x_i \in X_l \cap c_i = k} P[m_i = j | x_i, c_i, \Theta^{(t)}] \Big/ \sum_{x_i \in X_l} P[m_i = j | x_i, c_i, \Theta^{(t)}], \quad \forall k, j$   (5) \n\nHere, $S_{ij}^{(t)} \equiv (x_i - \mu_j^{(t)})(x_i - \mu_j^{(t)})^T$. New parameters are computed at iteration $t+1$ based on their values at iteration $t$. In these equations, $P[m_i = j | x_i, c_i, \Theta^{(t)}] = \alpha_j^{(t)} \beta_{c_i|j}^{(t)} f(x_i|\theta_j^{(t)}) / \sum_{l=1}^{L} \alpha_l^{(t)} \beta_{c_i|l}^{(t)} f(x_i|\theta_l^{(t)})$ and $P[m_i = j | x_i, \Theta^{(t)}] = \alpha_j^{(t)} f(x_i|\theta_j^{(t)}) / \sum_{l=1}^{L} \alpha_l^{(t)} f(x_i|\theta_l^{(t)})$. For EM-II, it can be shown that the resulting re-estimation equations are identical to those in (5) except regarding the parameters $\{\beta_{k|j}\}$. The updates for these parameters now take the form \n\n$\beta_{k|j}^{(t+1)} = \frac{1}{N \alpha_j^{(t+1)}} \Big( \sum_{x_i \in X_l \cap c_i = k} P[m_i = j | x_i, c_i, \Theta^{(t)}] + \sum_{x_i \in X_u} P[m_i = j, c_i = k | x_i, \Theta^{(t)}] \Big)$. \n\nHere, we identify $P[m_i = j, c_i = k | x_i, \Theta^{(t)}] = \alpha_j^{(t)} \beta_{k|j}^{(t)} f(x_i|\theta_j^{(t)}) / \sum_{l=1}^{L} \alpha_l^{(t)} f(x_i|\theta_l^{(t)})$. In this formulation, joint probabilities for class and mixture labels are computed for data in $X_u$ and used in the estimation of $\{\beta_{k|j}\}$, whereas in the previous formulation $\{\beta_{k|j}\}$ are updated solely on the basis of $X_l$. While this does appear to be a significant qualitative difference between the two methods, both do ascend in $\log L$, and in practice we have found that they achieve comparable performance. \n\n4 Combined Learning and Classification \n\nThe range of application for mixed training is greatly extended by the following observation: test data (with labels withheld), or for that matter, any new batch of data to be classified, can be viewed as a new, unlabelled data set. Hence, this new data can be taken to be $X_u$ and used for learning (based on EM-I or EM-II) prior to its classification. What we are suggesting is a combined learning/classification operation that can be applied whenever there is a new batch of data to classify; one EM-I iteration and this combined operation are sketched in code below. 
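Below is a compact sketch of one EM-I fixed-point iteration (5) and of the combined learning/classification operation, under the same Gaussian/GM assumptions and helper functions as the earlier sketches. Names such as em1_step and learn_and_classify, and the choice of 20 iterations, are illustrative assumptions rather than details from the paper. 

def em1_step(X_l, c_l, X_u, alpha, mu, Sigma, beta):
    """One EM-I iteration: labelled responsibilities use beta, unlabelled ones do not."""
    N = len(X_l) + len(X_u)
    _, dens_l = component_posteriors(X_l, alpha, mu, Sigma)
    r_l = dens_l * beta[c_l, :]                    # P[m=j | x_i, c_i] up to normalization
    r_l /= r_l.sum(axis=1, keepdims=True)
    r_u, _ = component_posteriors(X_u, alpha, mu, Sigma)
    w = r_l.sum(axis=0) + r_u.sum(axis=0)          # effective count for each component
    alpha_new = w / N
    mu_new = (r_l.T @ X_l + r_u.T @ X_u) / w[:, None]
    Sigma_new = []
    for j in range(len(alpha)):
        d_l, d_u = X_l - mu[j], X_u - mu[j]        # scatter about the current mean, as in S_ij
        S = (r_l[:, j, None] * d_l).T @ d_l + (r_u[:, j, None] * d_u).T @ d_u
        Sigma_new.append(S / w[j])
    # EM-I: beta is re-estimated from the labelled subset only
    beta_new = np.stack([r_l[c_l == k].sum(axis=0) for k in range(beta.shape[0])])
    beta_new /= r_l.sum(axis=0)
    return alpha_new, mu_new, np.array(Sigma_new), beta_new

def learn_and_classify(X_l, c_l, X_new, alpha, mu, Sigma, beta, n_iter=20):
    """Combined learning/classification: treat the new batch as X_u, adapt, then label it."""
    for _ in range(n_iter):
        alpha, mu, Sigma, beta = em1_step(X_l, c_l, X_new, alpha, mu, Sigma, beta)
    return classify(X_new, alpha, mu, Sigma, beta)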
In the usual supervised learning setting, there is a clear division between the learning and classification (use) phases. In this setting, modification of the classifier for new data is not possible (because the data is unlabelled), while for test data such modification is a form of \"cheating\". However, in our suggested scheme, this learning from unlabelled data is viewed simply as part of the classification operation. This is analogous to image segmentation, wherein we have a common energy function that is minimized for each new image to be segmented. Each such minimization determines a model local to the image and a segmentation for the image. Our \"segmentation\" is just classification, with $\log L$ playing the role of the energy function. It may consist of one term which is always fixed (based on a given labelled training set) and one term which is modified based on each new batch of unlabelled data to classify. We can envision several distinct learning contexts where this scheme can be used, as well as different ways of realizing the combined learning/classification operation^3. One use is in classification of an image/speech archive, where each image/speaker segment is a separate data \"batch\". Each batch to classify can be used as an unlabelled \"training\" set, either in concert with a representative labelled data set, or to modify a design based on such a set^4. Effectively, this scheme would adapt the classifier to each new data batch. A second application is supervised learning wherein the total amount of data is fixed. Here, we need to divide the data into training and test sets, with the conflicting goals of i) achieving a good design and ii) accurately measuring generalization performance. Combined learning and classification can be used here to mitigate the loss in performance associated with the choice of a large test set. More generally, our scheme can be used effectively in any setting where the new data to classify is either a) sizable or b) innovative relative to the existing training set. \n\n5 Experimental Results \n\nFigure 1a shows results for the 40-dimensional, 3-class waveform+noise data set from the UC Irvine database. The 5000 data pairs were split into equal-size training and test sets. Performance curves were obtained by varying the amount of labelled training data. For each choice of $N_l$, various learning approaches produced 6 solutions based on random parameter initialization, for each of 7 different labelled subset realizations. The test set performance was then averaged over these 42 \"trials\". All schemes used $L = 12$ components. DA-RBF (Miller et al. 1996) is a deterministic annealing method for RBF classifiers that has been found to achieve very good results when given adequate training data^5. However, this supervised learning method is forced to discard unlabelled data, which severely handicaps its performance relative to EM-I, especially for small $N_l$, where the difference is substantial. 
TEM-I and TEM-II are results for the EM methods (both I and II) in combined learning and classification mode, i.e., where the 2500 test vectors were also used as part of $X_u$. As seen in the figure, this leads to additional, significant performance gains for small $N_l$. Note also that performance of the two EM methods is comparable. Figure 1b shows results of similar experiments performed on 6-class satellite imagery data (sat), also from the UC Irvine database. For this set, the feature dimension is 36, and we chose $L = 18$ components. Here we compared EM-I with the method suggested in (Shashahani and Landgrebe 1994) (SL), based on the PM model. EM-I is seen to achieve substantial performance gains over this alternative learning approach. Note also that the EM-I performance is nearly constant over the entire range of $N_l$. \n\n[Figure 1: test set performance versus the amount of labelled training data $N_l$: (a) waveform+noise data set, (b) satellite imagery data set.] \n\nFuture work will investigate practical applications of combined learning and classification, as well as variations on this scheme which we have only briefly outlined. Moreover, we will investigate possible extensions of the methods described here for the regression problem. \n\n^3 The image segmentation analogy in fact suggests an alternative scheme where we perform joint likelihood maximization over both the model parameters and the \"hard\", missing class labels. This approach, which is analogous to segmentation methods such as ICM, would encapsulate the classification operation directly within the learning. Such a scheme will be investigated in future work. \n\n^4 Note that if the classifier is simply modified based on $X_u$, EM-I will not need to update $\{\beta_{k|j}\}$, while EM-II must update the entire model. \n\n^5 We assumed the same number of basis functions as mixture components. Also, for the DA design, there was only one initialization, since DA is roughly insensitive to this choice. \n\nAcknowledgements \n\nThis work was supported in part by National Science Foundation Career Award IRI-9624870. \n\nReferences \n\nV. Castelli and T. M. Cover. On the exponential value of labeled samples. Pattern Recognition Letters, 16:105-111, 1995. \n\nA. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Ser. B, 39:1-38, 1977. \n\nZ. Ghahramani and M. I. Jordan. Supervised learning from incomplete data via an EM approach. In Neural Information Processing Systems 6, 120-127, 1994. \n\nM. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214, 1994. \n\nR. P. Lippmann. Pattern classification using neural networks. IEEE Communications Magazine, 27:47-64, 1989. \n\nD. J. Miller, A. Rao, K. Rose, and A. Gersho. A global optimization method for statistical classifier design. 
IEEE Transactions on Signal Processing, Dec. 1996. \n\nD. J. Miller and H. S. Uyar. A generalized Gaussian mixture classifier with learning based on both labelled and unlabelled data. Conf. on Info. Sci. and Sys., 1996. \n\nD. J. Miller. A mixture model equivalent to the radial basis function classifier. Submitted to Neural Computation, 1996. \n\nJ. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281-294, 1989. \n\nB. Shashahani and D. Landgrebe. The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon. IEEE Transactions on Geoscience and Remote Sensing, 32:1087-1095, 1994. \n\nV. Tresp, R. Neuneier, and S. Ahmad. Efficient methods for dealing with missing data in supervised learning. In Neural Information Processing Systems 7, 689-696, 1995. \n\nL. Xu, M. I. Jordan, and G. E. Hinton. An alternative model for mixtures of experts. In Neural Information Processing Systems 7, 633-640, 1995. \n", "award": [], "sourceid": 1208, "authors": [{"given_name": "David", "family_name": "Miller", "institution": null}, {"given_name": "Hasan", "family_name": "Uyar", "institution": null}]}