{"title": "Human Face Detection in Visual Scenes", "book": "Advances in Neural Information Processing Systems", "page_first": 875, "page_last": 881, "abstract": null, "full_text": "Human Face Detection in Visual Scenes \n\nHenry A. Rowley \nhar@cs.cmu.edu \n\nShumeet Baluja \nbaluja@cs.cmu.edu \n\nTakeo Kanade \ntk@cs.cmu.edu \n\nSchool of Computer Science, Carnegie Mellon University, Pittsburgh, PA  15213, USA \n\nAbstract \n\nWe  present  a  neural  network-based face  detection  system.  A retinally \nconnected  neural  network  examines  small  windows  of an  image,  and \ndecides  whether each  window  contains  a  face.  The  system  arbitrates \nbetween multiple networks to improve performance over a single network. \nWe  use  a  bootstrap  algorithm for training, which  adds  false  detections \ninto the training set as  training progresses.  This eliminates the difficult \ntask  of manually  selecting  non-face  training examples,  which  must  be \nchosen  to span the entire space of non-face images.  Comparisons with \nanother state-of-the-art face  detection system are presented;  our system \nhas better performance in terms of detection and false-positive rates. \n\n1 \n\nINTRODUCTION \n\nIn this paper, we present a neural network-based algorithm to detect frontal views of faces \nin gray-scale images.  The algorithms and training methods are general, and can be applied \nto other views of faces,  as well as  to similar object and pattern recognition problems. \n\nTraining a neural network for the face detection task is challenging because of the difficulty \nin  characterizing  prototypical \"non-face\"  images.  Unlike in  face  recognition,  where  the \nclasses  to  be  discriminated  are  different faces,  in  face  detection,  the  two  classes  to  be \ndiscriminated are \"images containing faces\"  and \"images not containing faces\".  
It is easy \nto get a representative sample of images which contain faces, but much harder to get a \nrepresentative sample of those which do not. The size of the training set for the second \nclass can grow very quickly. \n\nWe avoid the problem of using a huge training set of non-faces by selectively adding images \nto the training set as training progresses [Sung and Poggio, 1994]. This \"bootstrapping\" \nmethod reduces the size of the training set needed. Detailed descriptions of this training \nmethod, along with the network architecture, are given in Section 2. In Section 3 the \nperformance of the system is examined. We find that the system is able to detect 92.9% of \nfaces with an acceptable number of false positives. Section 4 compares this system with a \nsimilar system. Conclusions and directions for future research are presented in Section 5. \n\n2 DESCRIPTION OF THE SYSTEM \n\nOur system consists of two major parts: a set of neural network-based filters, and a system \nto combine the filter outputs. Below, we describe the design and training of the filters, \n\n\f876 \n\nH. A. ROWLEY, S. BALUJA, T. KANADE \n\nwhich scan the input image for faces. This is followed by descriptions of algorithms for \narbitrating among multiple networks and for merging multiple overlapping detections. \n\n2.1 STAGE ONE: A NEURAL-NETWORK-BASED FILTER \n\nThe first component of our system is a filter that receives as input a small square region of \nthe image, and generates an output ranging from 1 to -1, signifying the presence or absence \nof a face, respectively. To detect faces anywhere in the input, the filter must be applied at \nevery location in the image. To allow detection of faces larger than the window size, the \ninput image is repeatedly reduced in size (by subsampling), and the filter is applied at each \nsize. 
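A minimal sketch of this repeated subsampling, in NumPy (which the original system predates): the 20-pixel window size is from Section 2.1.1, but the scale step of 1.2 per level is an assumed value, as this paper does not state it.

```python
import numpy as np

def image_pyramid(image, window=20, step=1.2):
    # Build the list of progressively subsampled images to which the
    # fixed-size filter window is applied at every position.
    # `step` (size reduction per level) is assumed, not from the paper.
    levels = [image]
    while min(levels[-1].shape) / step >= window:
        h, w = levels[-1].shape
        # Nearest-neighbour subsampling; the interpolation method is
        # also not specified in the paper.
        rows = (np.arange(int(h / step)) * step).astype(int)
        cols = (np.arange(int(w / step)) * step).astype(int)
        levels.append(levels[-1][np.ix_(rows, cols)])
    return levels
```

Scanning every level with the 20-by-20 filter then finds faces at any position and (approximately) any size at least as large as the window.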
The set of scaled input images is known as an \"image pyramid\", and is illustrated in \nFigure 1. The filter itself must have some invariance to position and scale. The amount \nof invariance in the filter determines the number of scales and positions at which the filter \nmust be applied. \n\nWith these points in mind, we can give the filtering algorithm (see Figure 1). It consists \nof two main steps: a preprocessing step, followed by a forward pass through a neural \nnetwork. The preprocessing consists of lighting correction, which equalizes the intensity \nvalues across the window, followed by histogram equalization, which expands the range of \nintensities in the window [Sung and Poggio, 1994]. The preprocessed window is used as \nthe input to the neural network. The network has retinal connections to its input layer; the \nreceptive fields of each hidden unit are shown in the figure. Although the figure shows a \nsingle hidden unit for each subregion of the input, these units can be replicated. Similar \narchitectures are commonly used in speech and character recognition tasks [Waibel et al., \n1989, Le Cun et al., 1989]. \n\n(Figure panels: input image pyramid, extracted window, lighting correction, preprocessing, neural network.) \n\nFigure 1: The basic algorithm used for face detection. \n\nExamples of output from a single filter are shown in Figure 2. In the figure, each box \nrepresents the position and size of a window to which the neural network gave a positive \nresponse. The network has some invariance to position and scale, which results in multiple \nboxes around some faces. Note that there are some false detections; we present methods to \neliminate them in Section 2.2. We next describe the training of the network which generated \nthis output. \n\n2.1.1 Training Stage One \n\nTo train a neural network to serve as an accurate filter, a large number of face and non-face \nimages are needed. 
Nearly 1050 face examples were gathered from face databases at CMU \nand Harvard. The images contained faces of various sizes, orientations, positions, and \nintensities. The eyes and upper lip of each face were located manually, and these points \nwere used to normalize each face to the same scale, orientation, and position. A 20-by-20 \npixel region containing the face is extracted and preprocessed (by applying lighting correction \nand histogram equalization). In the training set, 15 faces were created from each original \nimage, by slightly rotating (up to 10\u00b0), scaling (90%-110%), translating (up to half a pixel), \n\n\fFigure 2: Images with all the above \nthreshold detections indicated by boxes. \n\nFigure 3: Example face images, randomly \nmirrored, rotated, translated, and scaled \nby small amounts. \n\nand mirroring each face. A few example images are shown in Figure 3. \n\nIt is difficult to collect a representative set of non-faces. Instead of collecting the images \nbefore training is started, the images are collected during training, as follows [Sung and \nPoggio, 1994]: \n\n1. Create 1000 non-face images using random pixel intensities. \n\n2. Train a neural network to produce an output of 1 for the face examples, and -1 for \nthe non-face examples. \n\n3. Run the system on an image of scenery which contains no faces. Collect subimages \nin which the network incorrectly identifies a face (an output activation > 0). \n\n4. Select up to 250 of these subimages at random, and add them into the training set. \nGo to step 2. \n\nSome examples of non-faces that are collected during training are shown in Figure 4. We \nused 120 images for collecting negative examples in this bootstrapping manner. 
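In outline, steps 1-4 can be sketched as below. Here 'train' and 'false_detections' are hypothetical stand-ins for training the network and scanning the face-free scenery images; the sketch only tracks how the non-face training set grows.

```python
import random

def bootstrap_nonfaces(train, false_detections,
                       n_random=1000, max_new=250, rounds=5):
    # Sketch of the bootstrap loop of [Sung and Poggio, 1994].
    # `train(nonfaces)` returns a classifier trained against the fixed
    # face set; `false_detections(net)` returns subimages of face-free
    # scenery that the classifier wrongly accepts (output > 0).
    # Both callables are stand-ins, not part of this paper.

    # Step 1: start from randomly generated 20x20 non-face windows.
    nonfaces = [[random.random() for _ in range(400)]
                for _ in range(n_random)]
    net = None
    for _ in range(rounds):
        net = train(nonfaces)             # step 2: (re)train the filter
        errors = false_detections(net)    # step 3: collect false alarms
        random.shuffle(errors)
        nonfaces.extend(errors[:max_new])  # step 4: add up to 250, loop
    return net, nonfaces
```

In the real system `train` is always given the fixed face examples as well; only the negative set is grown by the loop.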
A typical \ntraining run selects approximately 8000 non-face images from the 146,212,178 subimages \nthat are available at all locations and scales in the scenery images. \n\nFigure 4: Some non-face examples which are collected during training. \n\n2.2 STAGE TWO: ARBITRATION AND MERGING OVERLAPPING DETECTIONS \n\nThe examples in Figure 2 showed that just one network cannot eliminate all false detections. \nTo reduce the number of false positives, we apply two networks, and use arbitration to \nproduce the final decision. Each network is trained in a similar manner, with random \ninitial weights, random initial non-face images, and random permutations of the order of \npresentation of the scenery images. The detection and false positive rates of the individual \nnetworks are quite close. However, because of different training conditions and because \nof self-selection of negative training examples, the networks will have different biases and \nwill make different errors. \n\nFor the work presented here, we used very simple arbitration strategies. Each detection \nby a filter at a particular position and scale is recorded in an image pyramid. One way to \n\n\fcombine two such pyramids is by ANDing. This strategy signals a detection only if both \nnetworks detect a face at precisely the same scale and position. This ensures that, if a \nparticular false detection is made by only one network, the combined output will not have \nthat error. The disadvantage is that if an actual face is detected by only one network, it will \nbe lost in the combination. Similar heuristics, such as ORing the outputs, were also tried. \n\nFurther heuristics (applied either before or after the arbitration step) can be used to improve \nthe performance of the system. 
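Treating each detection pyramid as a list of boolean arrays (one per scale), ANDing and ORing amount to elementwise operations; this is an illustration of the strategy, not the authors' implementation.

```python
import numpy as np

def and_pyramids(pyr_a, pyr_b):
    # A detection survives only if both networks fired at exactly the
    # same scale and position, so a false detection made by one
    # network alone is suppressed.
    return [a & b for a, b in zip(pyr_a, pyr_b)]

def or_pyramids(pyr_a, pyr_b):
    # A detection made by either network is kept; this trades more
    # false positives for a higher detection rate.
    return [a | b for a, b in zip(pyr_a, pyr_b)]
```

The cost of ANDing, as noted above, is that a real face found by only one of the two networks is lost.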
Note that in Figure 2, most faces are detected at multiple \nnearby positions or scales, while false detections often occur at single locations. At each \nlocation in an image pyramid representing detections, the number of detections within a \nspecified neighborhood of that location can be counted. If the number is above a threshold, \nthen that location is classified as a face. These detections are then collapsed down to a \nsingle point, located at their centroid. When this is done before arbitration, the centroid \nlocations rather than the actual outputs from the networks are ANDed together. \n\nIf we further assume that a position is correctly identified as a face, then all other detections \nwhich overlap it are likely to be errors, and can therefore be eliminated. There are relatively \nfew cases in which this heuristic fails; however, one such case is illustrated in the left two \nfaces in Figure 2B, in which one face partially occludes another. Together, the steps of \ncombining multiple detections and eliminating overlapping detections will be referred to as \nmerging detections. In the next section, we show that by merging detections and arbitrating \namong multiple networks, we can reduce the false detection rate significantly. \n\n3 EMPIRICAL RESULTS \n\nA large number of experiments were performed to evaluate the system. Because of space \nrestrictions only a few results are reported here; further results are presented in [Rowley et \nal., 1995]. We first show an analysis of which features the neural network is using to detect \nfaces, and then present the error rates of the system over two large test sets. \n\n3.1 SENSITIVITY ANALYSIS \n\nIn order to determine which part of the input image the network uses to decide whether \nthe input is a face, we performed a sensitivity analysis using the method of [Baluja and \nPomerleau, 1995]. 
We collected a test set of face images (based on the training database, but \nwith different randomized scales, translations, and rotations than were used for training), \nand used a set of negative examples collected during the training of an earlier version of \nthe system. Each of the 20-by-20 pixel input images was divided into 100 two-by-two \npixel subimages. For each subimage in turn, we went through the test set, replacing that \nsubimage with random noise, and tested the neural network. The sum of squared errors \nmade by the network is an indication of how important that portion of the image is for the \ndetection task. Plots of the error rates for two networks we developed are shown in Figure 5. \n\nFigure 5: Sum of squared errors (z-axis) \non a small test set resulting from \nadding noise to various portions of \nthe input image (horizontal plane), \nfor two networks. Network 1 uses \ntwo sets of the hidden units illustrated \nin Figure 1, while network 2 \nuses three sets. \n\nThe networks rely most heavily on the eyes, then on the nose, and then on the mouth \n(Figure 5). Anecdotally, we have seen this behavior on several real test images: the \nnetwork's accuracy decreases more when an eye is occluded than when the mouth is \noccluded. Further, when both eyes of a face are occluded, it is rarely detected. \n\n\f3.2 TESTING \n\nThe system was tested on two large sets of images. Test Set A was collected at CMU, and \nconsists of 42 scanned photographs, newspaper pictures, images collected from the World \nWide Web, and digitized television pictures. Test Set B consists of 23 images provided \nby Sung and Poggio; it was used in [Sung and Poggio, 1994] to measure the accuracy \nof their system. 
These test sets require the system to analyze 22,053,124 and 9,678,084 \nwindows, respectively. Table 1 shows the performance for the two networks working alone, \nthe effect of overlap elimination and collapsing multiple detections, and the results of using \nANDing and ORing for arbitration. Each system has a better false positive rate (but a worse \ndetection rate) on Test Set A than on Test Set B, because of differences in the types of \nimages in the two sets. Note that for systems using arbitration, the ratio of false detections \nto windows examined is extremely low, ranging from 1 in 146,638 to 1 in 5,513,281, \ndepending on the type of arbitration used. Figure 6 shows some example output images \nfrom the system, produced by merging the detections from networks 1 and 2, and ANDing \nthe results. Using another neural network to arbitrate among the two networks gives about \nthe same performance as the simpler schemes presented above [Rowley et al., 1995]. 
\n\nTable 1: Detection and Error Rates \n\n(Test Set A contains 169 faces; Test Set B contains 155. For each test set the columns give \nthe number of faces missed, the detection rate, the number of false detections, and the rate of \nfalse detections per window examined.) \n\nSystem | Test Set A: miss, detect, false, rate | Test Set B: miss, detect, false, rate \n0) Ideal System | 0, 100.0%, 0, 0/22053124 | 0, 100.0%, 0, 0/9678084 \n\nSingle network, no heuristics: \n1) Network 1 (52 hidden units, 2905 connections) | 17, 89.9%, 507, 1/43497 | 11, 92.9%, 353, 1/27417 \n2) Network 2 (78 hidden units, 4357 connections) | 20, 88.2%, 385, 1/57281 | 10, 93.5%, 347, 1/27891 \n\nSingle network, with heuristics: \n3) Network 1 -> merge detections | 24, 85.8%, 222, 1/99338 | 12, 92.3%, 126, 1/76810 \n4) Network 2 -> merge detections | 27, 84.0%, 179, 1/123202 | 13, 91.6%, 123, 1/78684 \n\nArbitrating among two networks: \n5) Networks 1 and 2 -> AND -> merge detections | 52, 69.2%, 4, 1/5513281 | 34, 78.1%, 3, 1/3226028 \n6) Networks 1 and 2 -> merge detections -> AND | 36, 78.7%, 15, 1/1470208 | 20, 87.1%, 15, 1/645206 \n7) Networks 1 and 2 -> merge -> OR -> merge | 26, 84.6%, 90, 1/245035 | 11, 92.9%, 64, 1/151220 \n\n4 COMPARISON TO OTHER SYSTEMS \n\n[Sung and Poggio, 1994] reports a face-detection system based on clustering techniques. \nTheir system, like ours, passes a small window over all portions of the image, and determines \nwhether a face exists in each window. Their system uses a supervised clustering method \nwith six \"face\" and six \"non-face\" clusters. Two distance metrics measure the distance of \nan input image to the prototype clusters. The first metric measures the \"partial\" distance \nbetween the test pattern and the cluster's 75 most significant eigenvectors. The second \ndistance metric is the Euclidean distance between the test pattern and its projection in \nthe 75 dimensional subspace. 
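As an illustration of the two per-cluster metrics: the sketch below assumes the 'partial' distance is a variance-scaled, Mahalanobis-style distance within the subspace of the leading eigenvectors; this paper does not give the exact formula, so that form is an assumption.

```python
import numpy as np

def cluster_distances(x, mu, eigvecs, eigvals):
    # eigvecs: (d, k) orthonormal columns (k = 75 in their system);
    # eigvals: (k,) variances along those directions; mu: cluster mean.
    centered = x - mu
    coeffs = eigvecs.T @ centered          # coordinates in the subspace
    # First metric: distance measured *within* the subspace, scaled by
    # the per-direction variances (assumed Mahalanobis-style form).
    d1 = float(coeffs @ (coeffs / eigvals))
    # Second metric: Euclidean distance from the pattern to its
    # projection, i.e. the reconstruction residual.
    d2 = float(np.linalg.norm(centered - eigvecs @ coeffs))
    return d1, d2
```

With six face and six non-face clusters, the twelve (d1, d2) pairs give the 24 inputs to their final classifier.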
These distance measures have close ties with Principal \nComponents Analysis (PCA), as described in [Sung and Poggio, 1994]. The last step in \ntheir system is to use either a perceptron or a neural network with a hidden layer, trained \nto classify points using the two distances to each of the clusters (a total of 24 inputs). \nTheir system is trained with 4000 positive examples, and nearly 47500 negative examples \ncollected in the \"bootstrap\" manner. In comparison, our system uses approximately 16000 \npositive examples and 8000 negative examples. \n\nTable 2 shows the accuracy of their system on Test Set B, along with the results of our \n\n\fFigure 6: Output produced by System 6 in Table 1. For each image, three numbers are shown: \nthe number of faces in the image, the number of faces detected correctly, and the number of false \ndetections. Some notes on specific images: Although the system was not trained on hand-drawn \nfaces, it detects them in K and R. One false detect is present in both D and R. Faces are missed in D \n(removed because a false detect overlapped it), B (one due to occlusion, and one due to large angle), \nand in N (babies with fingers in their mouths are not well represented in training data). Images B, \nD, F, K, L, and M were provided by Sung and Poggio at MIT. Images A, G, O, and P were scanned \nfrom photographs, image R was obtained with a CCD camera, images J and N were scanned from \nnewspapers, images H, I, and Q were scanned from printed photographs, and image C was obtained \noff of the World Wide Web. Images P and B correspond to Figures 2A and 2B. \n\n\fsystem using a variety of arbitration heuristics. In [Sung and Poggio, 1994], only 149 faces \nwere labelled in the test set, while we labelled 155 (some are difficult for either system to \ndetect). 
The number of missed faces is therefore six more than the values listed in their \npaper. Also note that [Sung and Poggio, 1994] check a slightly smaller number of windows \nover the entire test set; this is taken into account when computing the false detection rates. \nThe table shows that we can achieve higher detection rates with fewer false detections. \n\nTable 2: Comparison of [Sung and Poggio, 1994] and Our System on Test Set B \n\nSystem | Missed faces, Detect rate, False detects, Rate \n5) Networks 1 and 2 -> AND -> merge | 34, 78.1%, 3, 1/3226028 \n6) Networks 1 and 2 -> merge -> AND | 20, 87.1%, 15, 1/645206 \n7) Networks 1 and 2 -> merge -> OR -> merge | 11, 92.9%, 64, 1/151220 \n[Sung and Poggio, 1994] (Multi-layer network) | 36, 76.8%, 5, 1/1929655 \n[Sung and Poggio, 1994] (Perceptron) | 28, 81.9%, 13, 1/742175 \n\n5 CONCLUSIONS AND FUTURE RESEARCH \n\nOur algorithm can detect up to 92.9% of faces in a set of test images with an acceptable \nnumber of false positives. This is a higher detection rate than [Sung and Poggio, 1994]. The \nsystem can be made more conservative by varying the arbitration heuristics or thresholds. \n\nCurrently, the system does not use temporal coherence to focus attention on particular \nportions of the image. In motion sequences, the location of a face in one frame is a strong \npredictor of the location of a face in the next frame. Standard tracking methods can be applied to \nfocus the detector's attention. The system's accuracy might be improved with more positive \nexamples for training, by using separate networks to recognize different head orientations, \nor by applying more sophisticated image preprocessing and normalization techniques. \n\nAcknowledgements \n\nThe authors thank Kah-Kay Sung and Dr. Tomaso Poggio (at MIT), Dr. Woodward Yang (at Harvard), \nand Michael Smith (at CMU) for providing training and testing images. 
We also thank Eugene Fink, \nXue-Mei Wang, and Hao-Chi Wong for comments on drafts of this paper. \n\nThis work was partially supported by a grant from Siemens Corporate Research, Inc., and by the \nDepartment of the Army, Army Research Office under grant number DAAH04-94-G-0006. Shumeet \nBaluja was supported by a National Science Foundation Graduate Fellowship. The views and \nconclusions in this document are those of the authors, and should not be interpreted as necessarily \nrepresenting official policies or endorsements, either expressed or implied, of the sponsoring agencies. \n\nReferences \n\n[Baluja and Pomerleau, 1995] Shumeet Baluja and Dean Pomerleau. Encouraging distributed input \nreliance in spatially constrained artificial neural networks: Applications to visual scene analysis \nand control. Submitted, 1995. \n\n[Le Cun et al., 1989] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, \nand L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural \nComputation, 1:541-551, 1989. \n\n[Rowley et al., 1995] Henry A. Rowley, Shumeet Baluja, and Takeo Kanade. Human face detection \nin visual scenes. Technical Report CMU-CS-95-158R, Carnegie Mellon University, November 1995. \nAlso available at http://www.cs.cmu.edu/~har/faces.html. \n\n[Sung and Poggio, 1994] Kah-Kay Sung and Tomaso Poggio. Example-based learning for view-based \nhuman face detection. A.I. Memo 1521, CBCL Paper 112, MIT, December 1994. \n\n[Waibel et al., 1989] Alex Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and \nKevin J. Lang. Phoneme recognition using time-delay neural networks. Readings in Speech \nRecognition, pages 393-404, 1989. 
\n\n\f", "award": [], "sourceid": 1168, "authors": [{"given_name": "Henry", "family_name": "Rowley", "institution": null}, {"given_name": "Shumeet", "family_name": "Baluja", "institution": null}, {"given_name": "Takeo", "family_name": "Kanade", "institution": null}]}