#### Authors

Oliver Williams, Andrew Blake, Roberto Cipolla

#### Abstract

There has been substantial progress in the past decade in the development of object classifiers for images, for example of faces, humans and vehi- cles. Here we address the problem of contaminations (e.g. occlusion, shadows) in test images which have not explicitly been encountered in training data. The Variational Ising Classifier (VIC) algorithm models contamination as a mask (a field of binary variables) with a strong spa- tial coherence prior. Variational inference is used to marginalize over contamination and obtain robust classification. In this way the VIC ap- proach can turn a kernel classifier for clean data into one that can tolerate contamination, without any specific training on contaminated positives.

1 Introduction

Recent progress in discriminative object detection, especially for faces, has yielded good performance and efficiency [1, 2, 3, 4]. Such systems are capable of classifying those positives that can be generalized from positive training data. This is restrictive in practice in that test data may contain distortions that take it outside the strict ambit of the training positives. One example would be lighting changes (to a face) but this can be addressed reasonably effectively by a normalizing transformation applied to training and test images; doing so is common practice in face classification. Other sorts of disruption are not so easily factored out. A prime example is partial occlusion.

The aim of this paper is to extend a classifier trained on clean positives to accept also partially occluded positives, without further training. The approach is to capture some of the regularity inherent in a typical pattern of contamination, namely its spatial coherence. This can be thought of as extending the generalizing capability of a classifier to tolerate the sorts of image distortion that occur as a result of contamination.

As done previously in one-dimension, for image contours [5], the Variational Ising Classi- fier (VIC) models contamination explicitly as switches with a strong coherence prior in the form of an Ising model, but here over the full two-dimensional image array. In addition, the Ising model is loaded with a bias towards non-contamination. The aim is to incorporate these hidden contamination variables into a kernel classifier such as [1, 3]. In fact the Rel- evance Vector Machine (RVM) is particularly suitable [6] as it is explicitly probabilistic, so that contamination variables can be incorporated as a hidden layer of random variables.

                                                                  edge

neighbours of i                        i


Figure 1: The 2D Ising model is applied over a graph with edges e between neigh- bouring pixels (connected 4-wise).

Classification is done by marginalization over all possible configurations of the hidden vari- able array, and this is made tractable by variational (mean field) inference. The inference scheme makes use of "hallucination" to fill in parts of the object that are unobserved due to occlusion.

Results of VIC are given for face detection. First we show that the classifier performance is not significantly damaged by the inclusion of contamination variables. Then a contam- inated test set is generated using real test images and computer generated contaminations. Over this test data the VIC algorithm does indeed perform significantly better than a con- ventional classifier (similar to [4]). The hidden variable layer is shown to operate effec- tively, successfully inferring areas of contamination. Finally, inference of contamination is shown working on real images with real contaminations.

2 Bayesian modelling of contamination

Classification requires P (F |I), the posterior for the proposition F that an object is present given the image data intensity array I. This can be computed in terms of likelihoods

             P (F | I) = P (I | F )P (F )/ P (I | F )P (F ) + P (I | F )P (F )          (1)


so then the test P (F | I) > 1 becomes 2

                             log P (I | F ) - log P (I | F ) > t                        (2)


where t is a prior-dependent threshold that controls the tradeoff between positive and neg- ative classification errors. Suppose we are given a likelihood P (I|, F ) for the presence of a face given contamination , an array of binary "observation" variables corresponding to each pixel Ij of I, such that j = 0 indicates contamination at that pixel, whereas j = 1 indicates a successfully observed pixel. Then, in principle,

                                 P (I|F ) =         P (I|, F )P (),                   (3)


(making the reasonable assumption P (|F ) = P (), that the pattern of contamination is object independent) and similarly for log P (I | F ). The marginalization itself is intractable, requiring a summation over all 2N possible configurations of , for images with N pixels. Approximating that marginalization is dealt with in the next section. In the meantime, there are two other problems to deal with: specifying the prior P (); and specifying the likeli- hood under contamination P (I|, F ) given only training data for the unoccluded object.

2.1 Prior over contaminations

The prior contains two terms: the first expresses the belief that contamination will occur in coherent regions of a subimage. This takes the form of an Ising model [7] with energy

UI() that penalizes adjacent pixels which differ in their labelling (see Figure 1); the second term UC biases generally against contamination a priori and its balance with the first term is mediated by the constant . The total prior energy is then

          U () = UI() + UC() =              [1 - (e -  )] +                     (                                                                  1         e2                      j ),     (4)                                              e                                      j


where (x) = 1 if x = 0 and 0 otherwise, and e1, e2 are the indices of the pixels at either end of edge e (figure 1). The prior energy determines a probability via a temperature constant 1/T0 [7]:

                   P ()  e-U()/T0 = e-UI()/T0e-UC()/T0                                            (5)


2.2 Relevance vector machine

An unoccluded classifier P (F |I, = 0) can be learned from training data using a Rele- vance Vector Machine (RVM) [6], trained on a database of frontal face and non-face im- ages [8] (see Section 4 for details). The probabilistic properties of the RVM make it a good choice when (later) it comes to marginalising over . For now we consider how to construct the likelihood itself. First the conventional, unoccluded case is considered for which the posterior P (F |I) is learned from positive and negative examples. Kernel functions [9] are computed between a candidate image I and a subset of relevance vectors {xk}, retained from the training set. Gaussian kernels are used here to compute

                       y(I) =         wk exp -             (Ij - xkj)2 .                               (6)                                      k                     j


where wk are learned weights, and xkj is the jth pixel of the kth relevance vector. Then the posterior is computed via the logistic sigmoid function as

                                                                  1                            P (F |I,  = 1) = (y(I)) =                           .                          (7)                                                                  1 + e-y(I)


and finally the unoccluded data-likelihood would be

                          P (I|F,  = 1)  (y(I))/P (F ).                                              (8)


2.3 Hallucinating appearance

The aim now is to derive the occluded likelihood from the unoccluded case, where the con- tamination mask is known, without any further training. To do this, (8) must be extended to give P (I|F, ) for arbitrary masks , despite the fact the pixels Ij from the object are not observed wherever j = 0. In principle one should take into account all possible (or at least probable) values for the occluded pixels. Here, for simplicity, a single fixed hallu- cination is substituted for occluded pixels, then we proceed as if those values had actually been observed. This gives

                            P (I|F, )  (~                                                     y(I, ))/P (F )                                         (9)