{"title": "Probabilistic Modeling for Face Orientation Discrimination: Learning from Labeled and Unlabeled Data", "book": "Advances in Neural Information Processing Systems", "page_first": 854, "page_last": 860, "abstract": null, "full_text": "Probabilistic Modeling for Face Orientation \n\nDiscrimination: \n\nLearning from  Labeled and Unlabeled Data \n\nShumeet Baluja \nbaluja@cs.cmu.edu \n\nJustsystem Pittsburgh Research Center & \n\nSchool of Computer Science, Carnegie Mellon University \n\nAbstract \n\nThis paper presents probabilistic modeling methods to solve the problem of dis(cid:173)\ncriminating between five facial  orientations with  very  little labeled data.  Three \nmodels are explored. The first model maintains no inter-pixel dependencies, the \nsecond model is capable of modeling a set of arbitrary pair-wise dependencies, \nand the last model allows  dependencies  only  between  neighboring pixels. We \nshow that for all three of these models, the accuracy of the learned models can \nbe greatly improved by  augmenting a small number of labeled training images \nwith  a  large  set of unlabeled  images using  Expectation-Maximization.  This  is \nimportant because it is often difficult to obtain image labels, while many unla(cid:173)\nbeled images  are  readily  available.  Through  a  large  set  of empirical  tests,  we \nexamine the benefits  of unlabeled data  for  each  of the  models.  By  using only \ntwo randomly selected labeled examples per class, we can discriminate between \nthe five  facial orientations with an accuracy of 94%; with six labeled examples, \nwe achieve an accuracy of 98%. \n\n1  Introduction \n\nThis  paper  examines  probabilistic  modeling  techniques  for  discriminating  between  five \nface orientations:  left profile, left semi-profile, frontal,  right semi-profile, and right profile. 
\nThree models are explored: the first model represents no inter-pixel dependencies, the sec(cid:173)\nond model  is  c,~pable of modeling a  set of arbitrary  pair-wise dependencies,  and the  last \nmodel allows'~~~ndencies only between neighboring pixels. \nModels  which  capture  inter-pixel  dependencies  can  provide  better  classification  perfor(cid:173)\nmance than those that do not capture dependencies.  The difficulty in using the more com(cid:173)\nplex models, however, is that as more dependencies are modeled, more parameters must be \nestimated - which requires more training data.  We show that by  using Expectation-Maxi(cid:173)\nmization, the accuracy of what is  learned can be greatly improved by  augmenting a small \nnumber of labeled training images with unlabeled images, which are much easier to obtain. \nThe remainder of this section describes the problem  of face  orientation discrimination  in \ndetail. Section 2 provides a brief description ofthe probabilistic models explored. Section 3 \npresents  results  with  these  models with  varying amounts  of training  data. Also  shown  is \nhow  Expectation-Maximization  can  be used to  augment the  limited  labeled training data \nwith  unlabeled training  data.  Section 4  briefly  discusses  related work.  Finally,  Section  5 \ncloses the paper with conclusions and suggestions for future work. \n\n\fProbabilistic Modelingfor Face Orientation Discrimination \n\n855 \n\n1.1  Detailed Problem Description \n\nThe  interest  in  face  orientation  discrimination  arises  from  two  areas.  First,  the  rapid \nincrease  in  the  availability  of inexpensive  cameras  makes  it  practical  to  create  systems \nwhich  automatically monitor a  person while using  a  computer.  By  using  motion,  color, \nand size cues, it is possible to quickly fmd and segment a person's face when he/she is sit(cid:173)\nting in front of a computer monitor. 
By determining whether the person is looking directly at the computer, or is staring away from the computer, we can provide feedback to any user interface that could benefit from knowing whether a user is paying attention or is distracted (such as computer-based tutoring systems for children, computer games, or even car-mounted cameras that monitor drivers).

Second, to perform accurate face detection for use in video-indexing or content-based image retrieval systems, one approach is to design detectors specific to each face orientation, such as [Rowley et al., 1998, Sung 1996]. Rather than applying all detectors to every location, a face-orientation system can be applied to each candidate face location to "route" the candidate to the appropriate detector, thereby reducing the potential for false-positives, and also reducing the computational cost of applying each detector. This approach was taken in [Rowley et al., 1998].

For the experiments in this paper, each image to be classified is 20x20 pixels. The face is centered in the image, and comprises most of the image. Sample faces are shown in Figure 1. Empirically, our experiments show that accurate pose discrimination is possible from binary versions of the images. First, the images were histogram-equalized to values between 0 and 255. This is a standard non-linear transformation that maps an approximately equal number of pixels to each value within the 0-255 range. It is used to improve the contrast in images. Second, to "binarize" the images, pixels with intensity above 128 were mapped to a value of 255; otherwise the pixels were mapped to a value of 0.

[Figure 1 panels: Frontal, Right Half Profile, Right Profile, Left Half Profile, Left Profile]
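The two-step preprocessing just described (histogram equalization to the 0-255 range, then thresholding at 128) can be sketched as follows. This is a minimal illustration under our own assumptions: the rank-based equalization and the function name are not the authors' code.

```python
import numpy as np

def binarize_face(image):
    """Histogram-equalize a grayscale face image to values in 0-255,
    then threshold at 128 so every pixel becomes 0 or 255."""
    # Histogram equalization: map pixel ranks onto 0-255 so that an
    # approximately equal number of pixels falls on each value.
    flat = image.flatten()
    ranks = flat.argsort().argsort()              # rank of each pixel
    equalized = (ranks * 255) // max(len(flat) - 1, 1)
    equalized = equalized.reshape(image.shape)
    # Binary quantization: intensity above 128 -> 255, otherwise 0.
    return np.where(equalized > 128, 255, 0)
```

On a 20x20 input this produces exactly the kind of two-valued image shown in the right half of Figure 1.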
\n' J J \u2022 \nI \n\n\\'  ' . 1 \nfII'!-\u2022 , t . \n... \n~ ... \nl \n.J \u00b7b  , .... \n\ni ~~ \nl \n\nJ \n\nt. \n\nlIZ! \n\n6-\n; \n\nOriginal \n\nFigure 1: 4 images of \neach of the 5 classes to be \ndiscriminated, Note the \nvariability in the images. \nLeft: Original Images. \nRight:  Images after \nhistogram equalization \nand binary quantization. \n\n2  Methods Explored \n\nThis  section  provides  a  description  of the  probabilistic  models  explored:  Naive-Bayes, \nDependency Trees  (as  proposed by  [Chow  and Liu,  1968]),  and  a  dependence  network \nwhich models dependencies only between neighboring pixels.  For more details on  using \nBayesian \"multinets\" (independent networks trained to  model  each  class) for  classifica(cid:173)\ntion in a manner very similar to that used in this paper, see [Friedman, et at.,  1997]. \n\n2.1  The Naive-Bayes Model \n\nThe first, and simplest, model assumes that each pixel is independent of every other pixel. \nAlthough this  assumption is  clearly violated in  real  images, the model often yields good \nresults with limited training data since it requires the estimation of the fewest parameters. \nAssuming that each  image  belongs exclusively to  one of the  five  face  classes to be dis-\n\n\f856 \n\nS.  Baluja \n\ncriminated, the probability of the image belonging to a particular class is given as follows: \n\nCI \n\nP(ImagelClassc) x P(Classc) \nP (  assc Jlmage)  = -...:....-;;.\"\".,.;.--P-(l-m...!:a,;..ge-)-.:...-~ \n\nP(lmagelClassc) =  I1 P(Pixel, IClassc) \n\n400 \n\n,  ~ I \n\nP(PixelilClassJ is estimated directly from the training data by: \n\nk+ \n\nl: \n\nPixel, x  P(Classcllmage) \n\nP(Pixelri C lassc) = _\"\"'Tr-\"-a\"\"\"l1\"\"ng .... /\"\"ma .... g .... es'--___ ___  _ \nP(Classcllmage) \n\n2k+ \n\nl: \n\nTrarnrng/mages \n\nSince  we  are  only  counting  examples  from  the  training  images,  P(ClasscIImage)  is \nknown.  
The notation P(Class_c | Image) is used to represent image labels because it is convenient for describing the counting process with both labeled and unlabeled data (this will be described in detail in Section 3). With the labeled data, P(Class_c | Image) is in {0,1}. Later, P(Class_c | Image) may not be binary; instead, the probability mass may be divided between classes. Pixel_i is in {0,1} since the images are binary. k is a smoothing constant, set to 0.001.

When used for classification, we compute the posterior probabilities and take the maximum, c_predicted, where:

c_predicted = argmax_c P(Class_c | Image) = argmax_c P(Image | Class_c)

For simplicity, P(Class_c) is assumed equal for all c; P(Image) is a normalization constant which can be ignored since we are only interested in finding the maximum posterior probability.

2.2 Optimal Pair-Wise Dependency Trees

We wish to model a probability distribution P(X_1, ..., X_400 | Class_c), where each X corresponds to a pixel in the image. Instead of assuming pixel independence, we restrict our model to the following form:

P(X_1 ... X_n | Class_c) = Prod_{i=1}^{n} P(X_i | Pa_{X_i}, Class_c)

where Pa_{X_i} is X_i's single "parent" variable. We require that there be no cycles in these "parent-of" relationships: formally, there must exist some permutation m = (m_1, ..., m_n) of (1, ..., n) such that (Pa_{X_i} = X_j) implies m(j) < m(i) for all i. In other words, we restrict P' to factorizations representable by Bayesian networks in which each node (except the root) has one parent, i.e., tree-shaped graphs.

A method for finding the optimal model within these restrictions is presented in [Chow and Liu, 1968].
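The construction from [Chow and Liu, 1968], described next, can be sketched for binary variables as below. The helper name and the brute-force pairwise loop are our own simplifications; an implementation over all 400 pixels would vectorize the mutual-information computation.

```python
import numpy as np
from itertools import combinations

def chow_liu_edges(data):
    """Weight every pair of binary variables by mutual information,
    then keep the edges of a maximum spanning tree (Kruskal's
    algorithm on edges sorted by descending weight)."""
    n = data.shape[1]
    eps = 1e-12                      # avoids log(0) for empty cells
    edges = []
    for i, j in combinations(range(n), 2):
        mi = 0.0
        for a in (0, 1):
            for b in (0, 1):
                p_ab = np.mean((data[:, i] == a) & (data[:, j] == b)) + eps
                p_a = np.mean(data[:, i] == a) + eps
                p_b = np.mean(data[:, j] == b) + eps
                mi += p_ab * np.log(p_ab / (p_a * p_b))
        edges.append((mi, i, j))
    # Maximum spanning tree via Kruskal with union-find.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for mi, i, j in sorted(edges, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree                      # exactly n - 1 edges
```

Directing the resulting tree edges away from an arbitrary root yields the single-parent factorization required above.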
A complete weighted graph G is created in which each variable X_i is represented by a corresponding vertex V_i, and in which the weight w_ij for the edge between vertices V_i and V_j is set to the mutual information I(X_i, X_j) between X_i and X_j. The edges in the maximum spanning tree of G determine an optimal set of (n-1) conditional probabilities with which to construct a tree-based model of the original probability distribution.

We calculate the probabilities P(X_i) and P(X_i, X_j) directly from the dataset. From these, we calculate the mutual information, I(X_i, X_j), between all pairs of variables X_i and X_j:

I(X_i, X_j) = Sum_{a,b} P(X_i = a, X_j = b) * log [ P(X_i = a, X_j = b) / (P(X_i = a) * P(X_j = b)) ]

The maximum spanning tree minimizes the Kullback-Leibler divergence D(P||P') between the true and estimated distributions:

D(P || P') = Sum_X P(X) * log [ P(X) / P'(X) ]

as shown in [Chow & Liu, 1968]. Among all distributions of the same form, this distribution maximizes the likelihood of the data when the data is a set of empirical observations drawn from any unknown distribution.

2.3 Local Dependency Models

Unlike the Dependency Trees presented in the previous section, the local dependency networks only model dependencies between adjacent pixels. The most obvious dependencies to model are each pixel's eight neighbors. The dependencies are shown graphically in Figure 2 (left). The difficulty with the above representation is that two pixels may be dependent upon each other (if this model were represented as a Bayesian network, it would contain cycles). Therefore, to avoid problems with circular dependencies, we use the following model instead.
Each pixel is still connected to each of its eight neighbors; however, the arcs are directed such that the dependencies are acyclic. In this local dependence network, each pixel is only dependent on four of its neighbors: the three neighbors to the right and the one immediately below. The dependencies which are modeled are shown graphically in Figure 2 (right). The dependencies are:

P(Image | Class_c) = Prod_{i=1}^{400} P(Pixel_i | Parents_i, Class_c)

Figure 2: Diagram of the dependencies maintained. Each square represents a pixel in the image. Dependencies are shown only for two pixels. (Left) Model with 8 dependencies - note that because this model has circular dependencies, we do not use it. Instead, we use the model shown on the Right. (Right) Model used has 4 dependencies per pixel. By imposing an ordering on the pixels, circular dependencies are avoided.

3 Performance with Labeled and Unlabeled Data

In this section, we compare the results of the three probabilistic models with varying amounts of labeled training data. The training set consists of between 1 and 500 labeled training examples, and the testing set contains 5500 examples. Each experiment is repeated at least 20 times with random train/test splits of the data.

3.1 Using only Labeled Data

In this section, experiments are conducted with only labeled data. Figure 3 (left) shows each model's accuracy in classifying the images in the test set into the five classes. As expected, as more training data is used, the performance improves for all models.

Note that the model with no-dependencies performs the best when there is little data.
\nHowever,  as the amount of data  increases, the relative performance of this model,  com(cid:173)\npared to the other models which account for dependencies,  decreases.  It  is  interesting to \nnote that when there is  little data, the Dependency Trees perform poorly. Since these trees \ncan select dependencies between any two pixels, they are the most susceptible to fmding \nspurious dependencies. However, as the amount of data increases, the performance of this \nmodel rapidly improves. By using all of the labeled data (500 examples total), the Depen(cid:173)\ndency Tree and the Local-Dependence network perform approximately the same, achiev(cid:173)\ning a correct classification rate of approximately 99% . \n\nCIoMfkwtIap Poi Ibi \n\n..tda.....\"  LaboIecI  Data \n\nI All \n\n0.11 \n\n0 ... \n\nI  0.'111 \nI 0 . .  \n\n0 . .  \n\n0'\" \n\no.m \n\n0:111 \n\n/ \n\n, , \n, , \n, \n\nI \nl \n, \nI \n.......... \" \n\n/ , \n/ , , , \n\n-------\n\n, \n,/ \n\n--=-.., \n~\"\"-J \n----\n~\".. \n- - .... ~ \n\n0 ... \n\n0\"  0.' \nI \n... \n3 \nI \n... \n\n, , , \n, \n, \nI , , \n........  , \n.... , \n\nI \nI \nI \n\nI \nI \n\n.... \n\nl.oa \n\n\u2022. 711 \n\n0.40 \n\no.m \n\n0:111 \n\nau \n\nus \n\n..(cid:173)\n( .......... J \n\n--~ \n\n---- ~,.,. \n\n- - .... D~ \n\n10 \n\n......... \n\n:00 \n\n1011 \n\n-\n\n1011  -\n\nFigure 3:  Perfonnance of the three models. X Axis: Amount oflabeled training data used. Y Axis: Percent \ncorrect on an independent test set. In the left graph, only labeled data was used.  In the right graph, unlabeled and \nlabeled data was used (the total number of examples were 500, with varying amounts of labeled data). \n\n3.2 Augmenting the Models with Unlabeled Data \n\nWe can augment what is learned from  only using the labeled examples by  incorporating \nunlabeled  examples  through  the  use  of the  Expectation-Maximization  (EM)  algorithm. 
\nAlthough the details of EM are beyond the scope of this paper, the resulting algorithm is \neasily described (for a description of EM and applications to filling in missing values, see \n[Dempster et al.,  1977] and [Ghahramani &  Jordan,  1994]): \n\n1. \n\n2. \n\n3. \n\nBuild the models using only the labeled data (as in Section 2). \n\nUse the models to probabilistically label the unlabeled images. \n\nUsing  the  images  with  the  probabilistically  assigned  labels,  and  the \nimages  with  the  given  labels,  recalculate  the  models'  parameters.  As \nmentioned  in  section  2,  for  the  images  labeled  by  this  process, \nP(Classcllmage)  is  not restricted to  {0,1};  the probability  mass  for  an \nimage may be spread to multiple classes. \n\n4. \n\nIf a pre-specified termination condition is not met, go to step 2. \n\nThis process is used for each classifier. The termination condition was five iterations; after \nfive iterations, there was little change in the models' parameters. \nThe performance of the three classifiers with unlabeled data is shown in  Figure 3(right). \nNote that with small amounts of data,  the performance of all of the classifiers  improved \ndramatically when the unlabeled data is used.  Figure 4 shows the percent improvement by \nusing the unlabeled data to  augment the  labeled data.  Note that the  error is  reduced by \n\n\fProbabilistic Modelingfor Face Orientation Discrimination \n\n859 \n\nalmost 90% with the use of unlabeled data (see the case with Dependency Trees with only \n4 labeled examples,  in which the accuracy rates increase from  44% to 92.5%).  With only \n50  labeled examples,  a classification  accuracy of 99% was obtained.  This accuracy was \nobtained with  almost  an  order of magnitude  fewer  labeled examples than  required with \nclassifiers which used only labeled examples. \nIn almost every case examined, the addition of unlabeled data helped performance. 
However, unlabeled data actually hurt the no-dependency model when a large amount of labeled data already existed. With large amounts of labeled data, the parameters of the model were estimated well. Incorporating unlabeled data may have hurt performance because the underlying generative process modeled did not match the real generative process. Therefore, the additional data provided may not have been labeled with the accuracy required to improve the model's classification performance. It is interesting to note that with the more complex models, such as the dependency trees or local dependence networks, even with the same amount of labeled data, unlabeled data improved performance.

[Nigam, et al., 1998] have reported similar performance degradation when using a large number of labeled examples and EM with a naive-Bayesian model to classify text documents. They describe two methods for overcoming this problem. First, they adjust the relative weight of the labeled and unlabeled data in the M-step by using cross-validation. Second, they provide multiple centroids per class, which improves the data/model fit. Although not presented here due to space limitations, the first method was attempted - it improved the performance on the face orientation discrimination task.
Figure 4: Improvement for each model by using unlabeled data to augment the labeled data. Left: with only 1 labeled example, Middle: 4 labeled, Right: 50 labeled. The bars in light gray represent the performance with only labeled data, the dark bars indicate the performance with the unlabeled data. The number in parentheses indicates the absolute (in contrast to relative) percentage change in classification performance with the use of unlabeled data.

4 Related Work

There is a large amount of work which attempts to discover attributes of faces, including (but not limited to) face detection, face expression discrimination, face recognition, and face orientation discrimination (for example [Rowley et al., 1998][Sung, 1996][Bartlett & Sejnowski, 1997][Cottrell & Metcalfe, 1991][Turk & Pentland, 1991]). The work presented in this paper demonstrates the effective incorporation of unlabeled data into image classification procedures; it should be possible to use unlabeled data in any of these tasks.

The closest related work is presented in [Nigam et al., 1998]. They used naive-Bayes methods to classify text documents into a pre-specified number of groups. By using unlabeled data, they achieve significant classification performance improvement over using labeled documents alone. Other work which has employed EM for learning from labeled and unlabeled data includes [Miller and Uyar, 1997], who used a mixture of experts classifier, and [Shahshahani & Landgrebe, 1994], who used a mixture of Gaussians. However, the dimensionality of their input was at least an order of magnitude smaller than used here. There is a wealth of other related work, such as [Ghahramani & Jordan, 1994], who have used EM to fill in missing values in the training examples.
In their work, class labels can be regarded as another feature value to fill in. Other approaches to reducing the need for large amounts of labeled data take the form of active learning, in which the learner can ask for the labels of particular examples. [Cohn, et al., 1996] and [McCallum & Nigam, 1998] provide good overviews of active learning.

5 Conclusions & Future Work

This paper has made two contributions. The first contribution is to solve the problem of discriminating between five face orientations with very little data. With only two labeled example images per class, we were able to obtain classification accuracies of 94% on separate test sets (with the local dependence networks with 4 parents). With only a few more examples, this was increased to greater than 98% accuracy. This task has a range of applications in the design of user-interfaces and user monitoring.

We also explored the use of multiple probabilistic models with unlabeled data. The models varied in their complexity, ranging from modeling no dependencies between pixels, to modeling four dependencies per pixel. While the no-dependency model performs well with very little labeled data, when given a large amount of labeled data, it is unable to match the performance of the other models presented. The Dependency-Tree models perform the worst when given small amounts of data because they are most susceptible to finding spurious dependencies in the data. The local dependency models performed the best overall, both by working well with little data, and by being able to exploit more data, whether labeled or unlabeled. By using EM to incorporate unlabeled data into the training of the classifiers, we improved the performance of the classifiers by up to approximately 90% when little labeled data was available.

The use of unlabeled data is vital in this domain.
It is time-consuming to hand label many images, but many unlabeled images are often readily available. Because many similar tasks, such as face recognition and facial expression discrimination, suffer from the same problem of limited labeled data, we hope to apply the methods described in this paper to these applications. Preliminary results on related recognition tasks have been promising.

Acknowledgments

Scott Davies helped tremendously with discussions about modeling dependencies. I would also like to acknowledge the help of Andrew McCallum for discussions of EM, unlabeled data and the related work. Many thanks are given to Henry Rowley who graciously provided the data set. Finally, thanks are given to Kaari Flagstad for comments on drafts of this paper.

References

Bartlett, M. & Sejnowski, T. (1997) "Viewpoint Invariant Face Recognition using ICA and Attractor Networks", in Adv. in Neural Information Processing Systems (NIPS) 9.

Chow, C. & Liu, C. (1968) "Approximating Discrete Probability Distributions with Dependence Trees". IEEE Transactions on Information Theory, 14: 462-467.

Cohn, D.A., Ghahramani, Z. & Jordan, M. (1996) "Active Learning with Statistical Models", Journal of Artificial Intelligence Research 4: 129-145.

Cottrell, G. & Metcalfe (1991) "Face, Gender and Emotion Recognition using Holons", NIPS 3.

Dempster, A.P., Laird, N.M., Rubin, D.B. (1977) "Maximum Likelihood from Incomplete Data via the EM Algorithm", J. Royal Statistical Society Series B, 39: 1-38.

Friedman, N., Geiger, D., Goldszmidt, M. (1997) "Bayesian Network Classifiers", Machine Learning 29.

Ghahramani, Z. & Jordan, M. (1994) "Supervised Learning from Incomplete Data Via an EM Approach", NIPS 6.

McCallum, A. & Nigam, K. (1998) "Employing EM in Pool-Based Active Learning", in ICML-98.

Miller, D. & Uyar, H.
(1997) \"A Mixture of Experts Classifier with Learning based on both Labeled and Unla(cid:173)\nbeled data\", in Adv. in Neural Information Processing Systems 9. \nNigam, K.  McCallum,  A.,  Thrun,  S.,  Mitchell, T.  (1998),  \"Learning  to  Classify  Text from  Labeled  and  Unla(cid:173)\nbeled Examples\", to appear in AAAI-98. \nRowley,  H.,  Baluja,  S. &  Kanade,  T.  (1998) \"Neural  Network-Based  Face  Detection\", IEEE-Transactions  on \nPattern Analysis and Machine Intelligence (PAMI). Vol. 20, No. 1, January,  1998. \nShahshahani, B.  &  Landgrebe, D. (1994) \"The Effect of Unlabeled  samples  in  reducing the small  sample size \nproblem and mitigating the Hughes Phenomenon\", IEEE Trans. on Geosc.  and Remote Sensing 32. \nSung,  K.K. (1996),  Learning and Example Selection for Object and Pattern Detection.  Ph.D.  Thesis, MIT AI \nLab - AI  Memo 1572. \nTurk, M. &  Pentland, A.  (1991) \"Eigenfaces for Recognition\". J. Cog Neurosci. 3 (I). \n\n\f", "award": [], "sourceid": 1567, "authors": [{"given_name": "Shumeet", "family_name": "Baluja", "institution": null}]}