{"title": "Learning to Find Pictures of People", "book": "Advances in Neural Information Processing Systems", "page_first": 782, "page_last": 788, "abstract": null, "full_text": "Learning to Find Pictures of People \n\nSergey Ioffe \n\nComputer Science Division \n\nU.C. Berkeley \n\nBerkeley CA 94720 \nioffe@cs.berkeley.edu \n\nDavid Forsyth \n\nComputer Science Division \n\nU.C. Berkeley \n\nBerkeley CA 94720 \ndaf@cs.berkeley.edu \n\nAbstract \n\nFinding articulated objects, like people, in pictures presents a particularly difficult object recognition problem. We show how to find people by finding putative body segments, and then constructing assemblies of those segments that are consistent with the constraints on the appearance of a person that result from kinematic properties. Since a reasonable model of a person requires at least nine segments, it is not possible to present every group to a classifier. Instead, the search can be pruned by using projected versions of a classifier that accepts groups corresponding to people. We describe an efficient projection algorithm for one popular classifier, and demonstrate that our approach can be used to determine whether images of real scenes contain people. \n\n1 Introduction \n\nSeveral typical collections containing over ten million images are listed in [2]. There is an extensive literature on obtaining images from large collections using features computed from the whole image, including colour histograms, texture measures and shape measures; a partial review appears in [5]. \n\nHowever, in the most comprehensive field study of usage practices (a paper by Enser [2] surveying the use of the Hulton Deutsch collection), there is a clear user preference for searching these collections on image semantics. 
An ideal search tool would be a quite general object recognition system that could be adapted quickly and easily to the types of objects sought by a user. An important special case is finding people and determining what they are doing. This is hard, because people have many internal degrees of freedom. We follow the approach of [3], and represent people as collections of cylinders, each representing a body segment. Regions that could be the projections of cylinders are easily found using techniques similar to those of [1]. Once these regions are found, they must be assembled into collections that are consistent with the appearance of images of real people, which are constrained by the kinematics of human joints; consistency is tested with a classifier. Since there are many candidate segments, a brute force search is impossible. We show how this search can be pruned using projections of the classifier. \n\n2 Learning to Build Segment Configurations \n\nSuppose that $N$ segments have been found in an image, and there are $m$ body parts. We will define a labeling as a set $L = \\{(l_1, s_1), (l_2, s_2), \\ldots, (l_k, s_k)\\}$ of pairs where each segment $s_i \\in \\{1, \\ldots, N\\}$ is labeled with the label $l_i \\in \\{1, \\ldots, m\\}$. A labeling is complete if it represents a full $m$-segment configuration (Fig. 2(a,b)). \nAssume we have a classifier $C$ that for any complete labeling $L$ outputs $C(L) > 0$ if $L$ corresponds to a person-like configuration, and $C(L) < 0$ otherwise. Finding all the possible body configurations in an image is equivalent to finding all the complete labelings $L$ for which $C(L) > 0$. This cannot be done with brute-force search through the entire set. The search can be pruned if, for an (incomplete) labeling $L'$, there is no complete $L \\supseteq L'$ such that $C(L) > 0$. For instance, if two segments cannot represent the upper and lower left 
arm, as in Figure 1a, then we do not consider any complete labelings where they are labeled as such. \n\nProjected classifiers make the search for body configurations efficient by pruning labelings using the properties of smaller sub-labelings (as in [7], who use manually determined bounds and do not learn the tests). Given a classifier $C$ which is a function of a set of features whose values depend on segments with labels $l_1, \\ldots, l_m$, the projected classifier $C_{l_1 \\ldots l_k}$ is a function of all those features that depend only on the segments with labels $l_1, \\ldots, l_k$. In particular, $C_{l_1 \\ldots l_k}(L') > 0$ if there is some extension $L$ of $L'$ such that $C(L) > 0$ (see Figure 1). The converse need not be true: the feature values required to bring a projected point inside the positive volume of $C$ may not be realized with any labeling of the current set of segments $1, \\ldots, N$. For a projected classifier to be useful, it must be easy to compute the projection, and it must be effective in rejecting labelings at an early stage. These are strong requirements which are not satisfied by most good classifiers; for example, in our experience a support vector machine with a positive definite quadratic kernel projects easily but typically yields unrestrictive projected classifiers. \n\n2.1 Building Labelings Incrementally \n\nAssume we have a classifier $C$ that accepts assemblies corresponding to people and that we can construct projected classifiers as we need them. We will now show how to use them to construct labelings, using a pyramid of classifiers. \n\nA pyramid of classifiers (Fig. 1(c)), determined by the classifier $C$ and a permutation of labels $(l_1, \\ldots, l_m)$, consists of nodes $N_{l_i \\ldots l_j}$ corresponding to each of the projected classifiers $C_{l_i \\ldots l_j}$, $i \\le j$. Each of the bottom-level nodes $N_{l_i}$ receives the set of all segments in the image as the input. 
The top node $N_{l_1 \\ldots l_m}$ outputs the set of all complete labelings $L = \\{(l_1, s_1), \\ldots, (l_m, s_m)\\}$ such that $C(L) > 0$, i.e. the set of all assemblies in the image classified as people. Further, each node $N_{l_i \\ldots l_j}$ outputs the set of all sub-labelings $L = \\{(l_i, s_i), \\ldots, (l_j, s_j)\\}$ such that $C_{l_i \\ldots l_j}(L) > 0$. \n\nFigure 1: (a) Two segments that cannot correspond to the left upper and lower arm. Any configuration where they do can be rejected using a projected classifier regardless of the other segments that might appear in the configuration. (b) Projecting a classifier $C(\\{(l_1, s_1), (l_2, s_2)\\})$. The shaded area is the volume classified as positive, for the feature set $\\{x(s_1), y(s_1, s_2)\\}$. Finding the projection $C_{l_1}$ amounts to projecting off the features that cannot be computed from $s_1$ only, i.e., $y(s_1, s_2)$. (c) A pyramid of classifiers. Each node outputs sub-assemblies accepted by the corresponding projected classifier. Each node except those in the bottom row works by forming labelings from the outputs of its two children, and filtering the result using the corresponding projected classifier. The top node outputs the set of all complete labelings that correspond to body configurations. \n\nThe nodes $N_{l_i}$ at the bottom level work by selecting all segments $s_i$ in the image for which $C_{l_i}(\\{(l_i, s_i)\\}) > 0$. Each of the remaining nodes has two parts: merging and filtering. The merging stage of node $N_{l_i \\ldots l_j}$ merges the outputs of its children by computing the set of all labelings $\\{(l_i, s_i), \\ldots, (l_j, s_j)\\}$ where $\\{(l_i, s_i), \\ldots, (l_{j-1}, s_{j-1})\\}$ and $\\{(l_{i+1}, s_{i+1}), \\ldots, (l_j, s_j)\\}$ are in the outputs of $N_{l_i \\ldots l_{j-1}}$ and $N_{l_{i+1} \\ldots l_j}$, respectively. 
\nThe filtering stage then selects, from the resulting set of labelings, those for which $C_{l_i \\ldots l_j}(\\cdot) > 0$, and the resulting set is the output of $N_{l_i \\ldots l_j}$. It is clear, from the definition of projected classifiers, that the output of the pyramid is, in fact, the set of all complete $L$ for which $C(L) > 0$ (note that $C_{l_1 \\ldots l_m} = C$). \nThe only constraint on the order in which the outputs of nodes are computed is that children nodes have to be applied before parents. In our implementation, we use nodes $N_{l_i \\ldots l_j}$ where $j$ changes from 1 to $m$, and, for each $j$, $i$ changes from $j$ down to 1. This is equivalent to computing sets of labelings of the form $\\{(l_1, s_1), \\ldots, (l_j, s_j)\\}$ in order, where getting $(j+1)$-segment labelings from $j$-segment ones is itself an incremental process, whereby we check labels against $l_{j+1}$ in the order $l_j, l_{j-1}, \\ldots, l_1$. In practice, we choose the latter order on the fly for each increment step using a greedy algorithm, to minimize the size of labeling sets that are constructed (note that in this case the classifiers no longer form a pyramid). The order $(l_1, \\ldots, l_m)$ in which labels are added to an assembly needs to be fixed. We determine this order with a greedy algorithm by running a large segment set through the labeling builder and choosing the next label to add so as to minimize the number of labelings that result. \n\n2.2 Classifiers that Project \n\nIn our problem, each segment from the set $\\{1, \\ldots, N\\}$ is a rectangle in some position and orientation. Given a complete labeling $L = \\{(1, s_1), \\ldots, (m, s_m)\\}$, we want to have $C(L) > 0$ iff the segment arrangement produced by $L$ looks like a person. 
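The incremental construction of Section 2.1 can be illustrated with a minimal Python sketch. Everything in it is a toy assumption rather than the system described here: segments are reduced to single numbers, the only features are differences between consecutive labels, and the hypothetical function accepts stands in for a projected classifier by testing only the features computable from the labels present so far.

```python
from itertools import product

# Hypothetical toy setup: each segment is a single number (a 1-D position),
# and the label names below are placeholders chosen for this example.
LABELS = ['torso', 'upper_arm', 'lower_arm']
SEGMENTS = [0, 1, 2, 5]


def accepts(labeling):
    # Stand-in for a projected classifier: test only the features that can be
    # computed from the (label, segment) pairs present so far. Each feature
    # (difference between consecutive segments) must fall in the box (0, 2].
    values = [segment for _, segment in labeling]
    return all(0 < b - a <= 2 for a, b in zip(values, values[1:]))


def build_labelings(segments, labels, classifier):
    # Grow labelings one label at a time, discarding any partial labeling
    # that the projected classifier rejects.
    partial = [()]
    for label in labels:
        partial = [
            labeling + ((label, segment),)
            for labeling in partial
            for segment in segments
            if classifier(labeling + ((label, segment),))
        ]
    return partial


def brute_force(segments, labels, classifier):
    # Reference: test every complete labeling with the full classifier.
    candidates = (tuple(zip(labels, choice))
                  for choice in product(segments, repeat=len(labels)))
    return [labeling for labeling in candidates if classifier(labeling)]


if __name__ == '__main__':
    print(build_labelings(SEGMENTS, LABELS, accepts))
```

In this toy case the incremental builder calls the classifier 32 times (4 + 16 + 12), against 64 calls for the exhaustive search, and both return the same single complete labeling.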
\n\nFigure 2: (a) All segments extracted for an image. (b) A labeled segment configuration corresponding to a person, where T=torso, LUA=left upper arm, etc. The head is not marked because we are not looking for it with our method. The single left leg segment in (a) has been broken in (b) to generate the upper and lower leg segments. (c) (top) A combination of a bounding box (the dashed line) and a boosted classifier, for two features $x$ and $y$. Each plane in the boosted classifier is a thick line with the positive half-space indicated by an arrow; the associated weight $\\beta$ is shown next to the arrow. The shaded area is the positive volume of the classifier, which consists of the points $P$ where $\\sum_f w_f(P(f)) > 1/2$. The weights $w_x(\\cdot)$ and $w_y(\\cdot)$ are shown along the x- and y-axes, respectively, and the total weight $w_x(P(x)) + w_y(P(y))$ is shown for each region of the bounding box. (bottom) The projected classifier, given by $w_x(P(x)) > 1/2 - \\delta = 0.1$ where $\\delta = \\max_{P(y)} w_y(P(y)) = \\max\\{0.25, 0.4, 0.15\\} = 0.4$. \n\nEach feature will depend on a few segments (1 to 3 in our experiments). Our kinematic features are invariant to translation, uniform scaling or rotation of the segment set, and include angles between segments and ratios of lengths, widths and distances. We expect the features that correspond to human configurations to lie within small fractions of their possible value ranges. 
This suggests using an axis-aligned bounding box, with bounds learned from a collection of positive labelings, for a good first separation, and then using a boosted version of a weak classifier that splits the feature space on a single feature value (as in [6]). This classifier projects particularly well, using a simple algorithm described in section 2.3. \nEach weak classifier (Fig. 2(c)) is defined by the feature $f_j$ on which the split is made, the position $p_j$ of the splitting hyperplane, and the direction $d_j \\in \\{1, -1\\}$ that determines which half-space is positive. A point $P$ is classified as positive iff $d_j(P(f_j) - p_j) > 0$, where $P(f_j)$ is the value of feature $f_j$. The boosting algorithm will associate a weight $\\beta_j$ with each plane (so that $\\sum_j \\beta_j = 1$), and the resulting classifier will classify a point as positive iff $\\sum_{j : d_j(P(f_j) - p_j) > 0} \\beta_j > 1/2$, that is, iff the total weight of the weak classifiers that classify the point as positive is more than half of the total weight of the classifiers. The set $\\{f_j\\}$ may have repeating features (which may have different $p_j$, $d_j$ and weight values), and does not need to span the entire feature set. \n\nBy grouping together the weights corresponding to planes splitting on the same feature, we finally rewrite the classifier as $\\sum_f w_f(P(f)) > 1/2$, where $w_f(P(f)) = \\sum_{j : f_j = f, d_j(P(f) - p_j) > 0} \\beta_j$ is the weight associated with the particular value of feature $f$; it is a piece-wise constant function and depends on which of the intervals given by $\\{p_j \\mid f_j = f\\}$ this value falls in. \n\n2.3 Projecting a Boosted Classifier \n\nGiven a classifier constructed as above, we need to construct classifiers that depend on some identified subset of the features. The geometry of our classifiers, whose positive regions consist of unions of axis-aligned bounding boxes, makes this easy to do. 
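As a rough illustration of the grouped weights w_f and of projecting a feature away, the following Python sketch uses the weights shown in Figure 2(c). The threshold positions, the tuple layout (feature, p, d, beta) and all function names are assumptions made for this example, and the bounding box of Figure 2(c) is ignored.

```python
# Each weak classifier is (feature, p, d, beta). The beta values follow the
# weights shown in Figure 2(c); the threshold positions p are made up.
STUMPS = [
    ('x', 1.0, +1, 0.22),
    ('x', 2.0, +1, 0.38),
    ('y', 1.0, +1, 0.25),
    ('y', 2.0, -1, 0.15),
]


def feature_weight(stumps, feature, value):
    # w_f(P(f)): total weight of the stumps on this feature that vote positive.
    return sum(beta for f, p, d, beta in stumps
               if f == feature and d * (value - p) > 0)


def max_feature_weight(stumps, feature):
    # delta = max of w_g over all values of g. w_g is piecewise constant, so
    # one probe inside each interval between thresholds is enough.
    cuts = sorted({p for f, p, d, beta in stumps if f == feature})
    if not cuts:
        return 0.0
    probes = [cuts[0] - 1.0, cuts[-1] + 1.0]
    probes += [(a + b) / 2 for a, b in zip(cuts, cuts[1:])]
    return max(feature_weight(stumps, feature, v) for v in probes)


def classify(stumps, point, threshold=0.5):
    # Positive iff the summed per-feature weights exceed the threshold.
    total = sum(feature_weight(stumps, f, v) for f, v in point.items())
    return total > threshold


def project_away(stumps, feature, threshold=0.5):
    # Drop the stumps on the projected feature and lower the threshold by delta.
    delta = max_feature_weight(stumps, feature)
    kept = [s for s in stumps if s[0] != feature]
    return kept, threshold - delta


if __name__ == '__main__':
    kept, thr = project_away(STUMPS, 'y')
    print(classify(kept, {'x': 1.5}, thr))
```

With these assumed thresholds, projecting away y leaves a classifier on x alone that accepts a value of x exactly when some value of y makes the full classifier accept.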
\n\nLet $g$ be the feature to be projected away, perhaps because the value depends on a label that is not available. The projection of the classifier should classify a point $P'$ in the (lower-dimensional) feature space as positive iff $\\max_P \\sum_f w_f(P(f)) > 1/2$, where $P$ is a point which projects into $P'$ but can have any value for $P(g)$. We can rewrite this expression as $\\sum_{f \\ne g} w_f(P'(f)) + \\max_{P(g)} w_g(P(g)) > 1/2$. The value of $\\delta = \\max_{P(g)} w_g(P(g))$ is readily available and independent of $P'$. We can see that, with the feature projected away, we obtain $\\sum_{f \\ne g} w_f(P'(f)) > 1/2 - \\delta$. Any number of features can be projected away in a sequence in this fashion. An example of the projected classifier is shown in Figure 2(c). \nThe classifier $C$ we are using allows for an efficient building of labelings, in that the features do not need to be recomputed when we move from $C_{l_1 \\ldots l_k}$ to $C_{l_1 \\ldots l_{k+1}}$. We achieve this efficiency by carrying along with a labeling $L = \\{(l_1, s_1), \\ldots, (l_k, s_k)\\}$ the sum $\\sigma(L) = \\sum_{f \\in F(l_1 \\ldots l_k)} w_f(P(f))$, where $F(l_1 \\ldots l_k)$ is the set of all features computable from the segments labeled as $l_1, \\ldots, l_k$, and $\\{P(f)\\}$ are the values of these features. When we add another segment to get $L' = \\{(l_1, s_1), \\ldots, (l_{k+1}, s_{k+1})\\}$, we can compute $\\sigma(L') = \\sigma(L) + \\sum_{f \\in F(l_1 \\ldots l_{k+1}) \\setminus F(l_1 \\ldots l_k)} w_f(P'(f))$. In other words, when we add a label $l_{k+1}$, we need to compute only those features that require $s_{k+1}$ for their computation. \n\n3 Experimental Results \n\nWe report results for a system that automatically identifies potential body segments (using the techniques described in [4]), and then applies the assembly process described above. Images for which assemblies that are kinematically consistent with a person are found are reported as having people in them. 
The segment finder may find either 1 or 2 segments for each limb, depending on whether it is bent or straight; because the pruning is so effective, we can allow segments to be broken into two equal halves lengthwise (like the left leg in Fig. 2(b)), both of which are tested. \n\n3.1 Training \n\nThe training set included 79 images without people, selected randomly from the COREL database, and 274 images each with a single person on uniform background. The images with people have been scanned from books of human models [10]. All segments in the test images were reported; in the control images, only segments whose interior corresponded to human skin in colour and texture were reported. Control images, both for the training and for the test set, were chosen so that all had at least 30% of their pixels similar to human skin in colour and texture. This gives a more realistic test of the system performance by excluding regions that are obviously not human, and reduces the number of segments in the control images to the same order of magnitude as those in the test images. \n\nFeatures   Test   Control \n367        120    28 \n567        120    86 \n(a) \n\nFeatures   False Neg.   False Pos. \n367        37%          [illegible] \n567        49%          10% \n(b) \n\nTable 1: (a) Number of images of people (test) and without people (control) processed by the classifiers with 367 and 567 features. (b) False negative (images with a person where no body configuration was found) and false positive (images with no people where a person was detected) rates. \n\nThe models are all wearing either swim suits or no clothes, otherwise segment finding fails; it is an open problem to segment people wearing loose clothing. There is a wide variation in the poses of the training examples, although all body segments are visible. The sets of segments corresponding to people were then hand-labeled. 
\nOf the 274 images with people, segments for each body part were found in 193 images. The remaining 81 resulted in incomplete configurations, which could still be used for computing the bounding box used to obtain a first separation. Since we assume that if a configuration looks like a person then its mirror image would too, we double the number of body configurations by flipping each one about a vertical axis. The bounding box is then computed from the resulting 548 points in the feature space, without looking at the images without people. \n\nThe boosted classifier was trained to separate two classes: the 193 x 2 = 386 points corresponding to body configurations, and 60727 points that did not correspond to people but lay in the bounding box, obtained by using the bounding box classifier to incrementally build labelings for the images with no people. We added 1178 synthetic positive configurations obtained by randomly selecting each limb and the torso from one of the 386 real images of body configurations (which were rotated and scaled so the torso positions were the same in all of them) to give an effect of joining limbs and torsos from different images rather like children's flip-books. Remarkably, the boosted classifier classified each of the real data points correctly but misclassified 976 out of the 1178 synthetic configurations as negative; the synthetic examples were unexpectedly more similar to the negative examples than the real positive examples were. \n\n3.2 Results \n\nThe test dataset was separate from the training set and included 120 images with a person on a uniform background, and varying numbers of control images, reported in Table 1. We report results for two classifiers, one using 567 features and the other using a subset of 367 of those features. Table 1b shows the false positive and false negative rates achieved for each of the two classifiers. 
By marking 51% of test images and only 10% of control images, the classifier using 567 features compares extremely favorably with that of [3], which marked 54% of test images and 38% of control images using hand-tuned tests to form groups of four segments. In 55 of the 59 images where there was a false negative, a segment corresponding to a body part was missed by the segment finder, meaning that the overall system performance significantly understates the classifier performance. There are few signs of overfitting, probably because the features are highly redundant. Using the larger set of features makes labeling faster (by a factor of about five), because more configurations are rejected earlier. \n\n4 Conclusions and Future Work \n\nGroups of segments that satisfy kinematic constraints, learned from images of real people, quite reliably correspond to people and can be used to identify them. Our trick of projecting classifiers is effective at pruning an otherwise completely unmanageable correspondence search. Future issues include: fusing responses from face finders (such as those of [11, 9]); exploiting patterns of shading on human limbs to get better selectivity (as in [8]); determining the configuration of the person, which might tell what they are doing; and exploiting the kinematic similarities between humans and many animals to build systems that can find many different types of animal without searching the classes one by one. \n\nReferences \n\n[1] J.M. Brady and H. Asada. Smoothed local symmetries and their implementation. International Journal of Robotics Research, 3(3), 1984. \n\n[2] P.G.B. Enser. Query analysis in a visual information retrieval context. J. Document and Text Management, 1(1):25-52, 1993. \n\n[3] M.M. Fleck, D.A. Forsyth, and C. Bregler. Finding naked people. In European Conference on Computer Vision 1996, Vol. II, pages 592-602, 1996. 
\n\n[4] D.A. Forsyth and M.M. Fleck. Body plans. In IEEE Conf. on Computer Vision and Pattern Recognition, 1997. \n\n[5] D.A. Forsyth, J. Malik, M.M. Fleck, H. Greenspan, T. Leung, S. Belongie, C. Carson, and C. Bregler. Finding pictures of objects in large collections of images. In Proc. 2nd International Workshop on Object Representation in Computer Vision, 1996. \n\n[6] Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Machine Learning - 13, 1996. \n\n[7] W.E.L. Grimson and T. Lozano-Perez. Localizing overlapping parts by searching the interpretation tree. IEEE Trans. Patt. Anal. Mach. Intell., 9(4):469-482, 1987. \n\n[8] J. Haddon and D.A. Forsyth. Shading primitives. In Int. Conf. on Computer Vision, 1997. to appear. \n\n[9] H.A. Rowley, S. Baluja, and T. Kanade. Human face detection in visual scenes. In D.S. Touretzky, M.C. Mozer, and M.E. Hasselmo, editors, Advances in Neural Information Processing 8, pages 875-881, 1996. \n\n[10] Elte Shuppan. Pose file, volume 1-7. Books Nippan, 1993-1996. A collection of photographs of human models, annotated in Japanese. \n\n[11] K-K Sung and T. Poggio. Example based learning for view based face detection. AI memo 1521, MIT, 1994. \n", "award": [], "sourceid": 1596, "authors": [{"given_name": "Sergey", "family_name": "Ioffe", "institution": null}, {"given_name": "David", "family_name": "Forsyth", "institution": null}]}