Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Amnon Shashua, Tamir Hazan*

This paper presents a general family of algebraic positive definite simi- larity functions over spaces of matrices with varying column rank. The columns can represent local regions in an image (whereby images have varying number of local parts), images of an image sequence, motion tra- jectories in a multibody motion, and so forth. The family of set kernels we derive is based on a group invariant tensor product lifting with param- eters that can be naturally tuned to provide a cook-book of sorts covering the possible "wish lists" from similarity measures over sets of varying cardinality. We highlight the strengths of our approach by demonstrat- ing the set kernels for visual recognition of pedestrians using local parts representations.

1 Introduction

In the area of learning from observations there are two main paths that are often mutually exclusive: (i) the design of learning algorithms, and (ii) the design of data representations. The algorithm designers take pride in the fact that their algorithm can generalize well given straightforward data representations (most notable example is SVM [11]), whereas those who work on data representations demonstrate often remarkable results with sophisticated data representations using only straightforward learning algorithms (e.g. [5, 10, 6]). This dichotomy is probably most emphasized in the area of computer vision, where image under- standing from observations involve data instances of images or image sequences containing huge amounts of data. A straightforward representation treating all the measurements as a single vector, such as the raw pixel data, or a transformed raw-pixel data, places un- reasonable demands on the learning algorithm. The "holistic" representations suffer also from sensitivity to occlusions, invariance to local and global transformations, non-rigidity of local parts of the object, and so forth.

Practitioners in the area of data representations have long noticed that a collection of local representations (part-based representations) can be most effective to ameliorate changes of appearance [5, 10, 6]. The local data representations vary in their sophistication, but share the same principle where an image corresponds to a collection of points each in a relatively small dimensional space -- instead of a single point in high-dimensional space induced by holistic representations. In general, the number of points (local parts) per image may vary and the dimension of each point may vary as well. The local representations tend

```
School of Engineering and Computer Science, Hebrew University of Jerusalem, Jerusalem 91904, Israel
```

to be robust against occlusions, local and global transformations and preserve the original resolution of the image (the higher the resolution the more parts are generated per image).

The key for unifying local and holistic representations for inference engines is to design positive definite similarity functions (a.k.a. kernels) over sets (of vectors) of varying cardi- nalities. A Support Vector Machine (SVM) [11] can then handle sets of vectors as a single instance via application of those "set kernels". A set kernel would be useful also to other types of inference engines such as kernel versions of PCA, LDA, CCA, ridge regression and any algorithm which can be mapped onto inner-products between pairs of data instances (see [8] for details on kernel methods).

Formally, we consider an instance being represented by a collection of vectors, which for the sake of convenience, form the columns of a matrix. We would like to find an algebraic family of similarity functions sim(A, B) over matrices A, B which satisfy the following requirements: (i) sim(A, B) is an inner product, i.e., sim(A, B) = (A) (B) for some mapping () from matrices to vectors, (ii) sim(A, B) is built over local kernel functions k(ai, bj) over columns ai and bj of A, B respectively, (iii) The column cardinality (rank of column space) of A and B need not be the same (number of local parts may differ from image to image), and (iv) the parameters of sim(A, B) should induce the properties of in- variance to order (alignement) of parts, part occlusions, and degree of interactions between local parts. In a nutshell, our work provides a cook-book of sorts which fundamentally covers the possible algebraic kernels over collections of local representations built on top of local kernels by combining (linearly and non-linearly) local kernels to form a family of global kernels over local representations.

The design of a kernel over sets of vectors has been recently attracting much attention in the computer vision and machine learning literature. A possible approach is to fit a distribution to the set of vectors and define the kernel as a distribution matching measure [9, 12, 4]. This has the advantage that the number of local parts can vary but at the expense of fitting a distribution to the variation over parts. The variation could be quite complex at times, unlikely to fit into a known family of distributions in many situations of interest, and in practice the sample size (number of columns of A) is not sufficiently large to reliably fit a distribution. The alternative, which is the approach taken in this paper, is to create a kernel over sets of vectors in a direct manner. When the column cardinality is equal it is possible to model the similarity measure as a function over the principal angles between the two column spaces ([14] and references therein) while for varying column cardinality only heuristic similarity measures (which are not positive definite) have so far been introduced [13].

It is important to note that although we chose SVM over local representations as the appli- cation to demonstrate the use of set kernels, the need for adequately working with instances made out of sets of various cardinalities spans many other application domains. For exam- ple, an image sequence may be represented by a set (ordered or unordered) of vectors, where each vector stands for an image, the pixels in an image can be represented as a tuple consisting of position, intensity and other attributes, motion trajectories of multiply mov- ing bodies can be represented as a collection of vectors, and so on. Therefore, the problem addressed in this paper is fundamental both theoretically and from a practical perspective as well.

2 The General Family of Inner-Products over Matrices

We wish to derive the general family of positive definite similarity measures sim(A, B) over matrices A, B which have the same number of rows but possibly different column rank (in particular, different number of columns). Let A be of dimensions n k and B of dimension n q where n is fixed and k, q can vary at will over the application of sim(, ) on pairs of matrices. Let m = max{n, k, q} be the upper bound over all values

of k, q encountered by the data. Let ai, bj be the column vectors of matrices A, B and let k(ai, bj) be the local kernel function. For example, in the context where the column vectors represent local parts of an image, then the matching function k(, ) between pairs of local parts provides the building blocks of the overall similarity function. The local kernel is some positive definite function k(x, y) = (x) (y) which is the inner-product between the "feature"-mapped vectors x, y for some feature map (). For example, if () is the polynomial map of degree up to d, then k(x, y) = (1 + x y)d.

The local kernels can be combined in a linear or non-linear manner. When the combination is linear the similarity becomes the analogue of the inner-product between vectors extended to matrices. We will refer to the linear family as sim(A, B) =< A, B > and that will be the focus of this section. In the next section we will derive the general (algebraic) non- linear family which is based on "lifting" the input matrices A, B onto higher dimensional spaces and feeding the result onto the < , > machinery developed in this section, i.e., sim(A, B) =< (A), (B) >.

We will start by embedding A, B onto m m matrices by zero padding as follows. Let ei denote the i'th standard basis vector (0, .., 0, 1, 0, .., 0) of Rm. The the embedding is represented by linear combinations of tensor products:

```
n k n q
A aijei ej, B bltel et. i=1 j=1 l=1 t=1
```

Note that A, B are the upper-left blocks of the zero-padded matrices. Let S be a positive semi definite m2 m2 matrix represented by S = p G r=1 r Fr where Gr , Fr are m m matrices1. Let ^ Fr be the q k upper-left sub-matrix of Fr , and let ^ Gr be the n n upper-left sub-matrix of Gr. We will be using the following three identities: Gx1 F x2 = (G F )(x1 x2), (G F )(G F ) = GG F F ,

```
< x1 x2, y >= ( )( ). 1 y2 x1 y1 x2 y2 The inner-product < A, B > over all p.s.d. matrices S has the form:
< A, B > = < aijei ej, ( Gr Fr) bltel et > i,j r l,t
= aijblt < ei ej, Grel Fret > r i,j,l,t
= aijblt(e G F i r el)(ej r et) r i,j,l,t
= aijblt(Gr)il(Fr)jt r i,j,l,t
= (A ^ GrB)jt(Fr)jt r lt
= trace (A ^ GrB) ^ Fr r
```

We have represented the inner product < A, B > using the choice of m m matrices Gr, Fr instead of the choice of a single m2 m2 p.s.d. matrix S. The matrices Gr, Fr

1Any S can be represented as a sum over tensor products: given column-wise ordering, the matrix G F is composed of n n blocks of the form fij G. Therefore, take Gr to be the n n blocks of S and Fr to be the elemental matrices which have "1" in coordinate r = (i, j) and zero everywhere else.

must be selected such that p G r=1 r Fr is positive semi definite. The problem of decid- ing on the the necessary conditions on Fr and Gr such that the sum over tensor products is p.s.d is difficult. Even deciding whether a given S has a separable decomposition is known to be NP-hard [3]. The sufficient conditions are easy -- choosing Gr, Fr to be positive semi definite would make p G r=1 r Fr positive semi definite as well. In this context (of separable S) we need one more constraint in order to work with non-linear local ker- nels k(x, y) = (x) (y): the matrices ^ G ~ r = ~ M M r r must "distribute with the kernel", namely there exist Mr such that k(M ~ r x, Mr y) = (Mr x) (Mry) = (x) ~ M M r r (y) = (x) ^ Gr(y).

To summarize the results so far, the most general, but seperable, analogue of the inner- product over vectors to the inner-product of matrices of varying column cardinality has the form: < A, B >= trace(H ^ r Fr ) (1) r

Where the entries of Hr consists of k(Mrai, Mrbj) over the columns of A, B after possibly undergoing global coordinate changes by Mr (the role of ^ Gr), and ^ Fr are the q k upper- left sub-matrix of positive definite m m matrices Fr .

The role of the matrices ^ Gr is to perform global coordinate changes of Rn before applica- tion of the kernel k() on the columns of A, B. These global transformations include pro- jections (say onto prototypical "parts") that may be given or "learned" from a training set. The matrices ^ Fr determine the range of interaction between columns of A and columns of B. For example, when ^ Gr = I then < A, B >= trace(A B ^ F ) where ^ F is the upper-left submatrix with the appropriate dimension of some fixed m m p.s.d matrix F = F r r . Note that entries of A B are k(ai, bj). In other words, when Gr = I, < A, B > boils down to a simple linear super-position of the local kernels, k(a ij i, bj )fij where the en- tries fij are part of the upper-left block of a fixed positive definite matrix F where the block dimensions are commensurate with the number of columns of A and those of B. The various choices of F determine the type of invariances one could obtain from the simi- larity measure. For example, when F = I the similarity is simply the sum (average) of the local kernels k(ai, bi) thereby assuming we have a strict alignment between the local parts represented by A and the local parts represented by B. On the other end of the in- variance spectrum, when F = 11 (all entries are "1") the similarity measure averages over all interactions of local parts k(ai, bj) thereby achieving an invariance to the order of the parts. A decaying weighted interaction such as fij = -|i-j| would provide a middle ground between the assumption of strict alignment and the assumption of complete lack of alignment. In the section below we will derive the non-linear version of sim(A, B) based on the basic machinery of < A, B > of eqn. (1) and lifting operations on A, B.

3 Lifting Matrices onto Higher Dimensions

The family of sim(A, B) =< A, B > forms a weighted linear superposition of the local kernel k(ai, bj). Non-linear combinations of local kernels emerge using map- pings (A) from the input matrices onto other higher-dimensional matrices, thus forming sim(A, B) =< (A), (B) >. Additional invariance properties and parameters control- ling the perfromance of sim(A, B) emerge with the introduction of non-linear combina- tions of local kernels, and those will be discussed later on in this section.

Consider the general d-fold lifting (A) = Ad which can be viewed as a nd kd matrix. Let Fr be a p.s.d. matrix of dimension md md and ^ Fr be the upper-left qd kd block of Fr. Let Gr = ( ^ Gr)d be a p.s.d matrix of dimension nd nd where ^ Gr is p.s.d. n n matrix. Using the identity (Ad) Bd = (A B)d we obtain the inner-product in the

lifted space: < Ad, Bd >= trace (A ^ GrB)d ^ Fr . r

By taking linear combinations of < Al, Bl >, l = 1, ..., d, we get the general non- homogenous d-fold inner-product simd(A, B). A this point the formulation is general but somewhat unwieldy computational-wise. The key for computational simplification lay in the fact that choices of Fr determine not only local interactions (as in the linear case) but also group invariances. The group invariances are a result of applying symmetric operators on the tensor product space -- we will consider two of those operators here, known as the the d-fold alternating tensor Ad = A .... A and the d-fold symmetric tensor Ad = A ... A. These lifting operations introduce the determinant and permanent operations on submatrices of A ^ GrB, as described below.

The alternating tensor is a multilinear map of Rn, (A .... A)(x1 ... xd) = Ax1 ... Axd, where 1 x1 ... xd = sign()x d! (1) .... x(d), Sd

where Sd is the symmetric group over d letters and Sd are the permutations of the group. If x1, ..., xn form a basis of Rn, then the n elements x ... x , where 1 d i1 id i1 < ... < id n form a basis of the alternating d - f old tensor product of Rn, denoted as dRn. If A Rnk is a linear map on Rn sending points to Rk, then Ad is a linear map on dRn sending x1 ... xd to Ax1 ... Axd, i.e., sending points in dRn to points in dRk. The matrix representation of Ad is called the "d'th compound matrix" Cd(A) whose (i1, ..., id|j1, ..., jd) entry has the value det(A[i1, ..., id : j1, ..., jd]) where the determinant is of the d d block constructed by choosing the rows i1, ..., id and the columns j1, ..., jd of A. In other words, Cd(A) has n rows and k columns d d (instead of nd kd necessary for Ad) whose entries are equal to the d d minors of A. When k = d, Ck(A) is a vector known as the Grasmanian of A, and when n = k = d then Cd(A) = det(A). Finally, the identity (Ad) Bd = (A B)d specializes to (Ad) Bd = (A B)d which translates to the identity Cd(A) Cd(B) = Cd(A B) known as the Binet-Cauchy theorem [1]. Taken together, the "d-fold alternating kernel" d(A, B) is defined by:

d(A, B) =< Ad, Bd >=< Cd(A), Cd(B) >= trace Cd(A ^ GrB) ^ Fr , (2) r

where ^ Fr is the q k upper-left submatrix of the p.s.d m m matrix F d d d d r . Note that the local kernel plugs in as the entries of (A ^ GrB)ij = k(Mrai, Mrbj) where ^ Gr = M M r r .

Another symmetric operator on the tensor product space is via the d-fold symmetric tensor space SymdRn whose points are: 1 x1 xd = x d! (1) .... x(d). Sd

The analogue of Cd(A) is the "d'th power matrix" Rd(A) whose (i1, ..., id|j1, ..., jd) entry has the value perm(A[i1, ..., id : j1, ..., jd]) and which stands for the map Ad

```
(A A)(x1 xd) = Ax1 Axd.
```

In other words, Rd(A) has n+d-1 rows and k+d-1 columns whose entries are equal to d d the dd permanents of A. The analogue of the Binet-Cauchy theorem is Rd(A) Rd(B) =

Rd(A B). The ensuing kernel similarity function, referred to as the "d-fold symmetric kernel" is:

Symd(A, B) =< Ad, Bd >=< Rd(A), Rd(B) >= trace Rd(A ^ GrB) ^ Fr (3) r

where ^ Fr is the q+d-1 k+d-1 upper-left submatrix of the positive definite m+d-1 d d d n+d-1 matrix F d r . Due to lack of space we will stop here and spend the remainder of this section in describing in laymen terms what are the properties of these similarity measures, how they can be constructed in practice and in a computationally efficient manner (despite the combinatorial element in their definition).

3.1 Practical Considerations

To recap, the family of similarity functions sim(A, B) comprise of the linear version < A, B > (eqn. 1) and non-linear versions l(A, B), Syml(A, B) (eqns. 2,3) which are group projections of the general kernel < Ad, Bd >. These different similarity func- tions are controlled by the choice of three items: Gr, Fr and the parameter d representing the degree of the tensor product operator. Specifically, we will focus on the case Gr = I and on d(A, B) as a representative of the non-linear family. The role of ^ Gr is fairly in- teresting as it can be viewed as a projection operator from "parts" to prototypical parts that can be learned from a training set but we leave this to the full length article that will appear later.

Practically, to compute d(A, B) one needs to run over all d d blocks of the k q ma- trix A B (whose entries are k(ai, bj)) and for each block compute the determinant. The similarity function is a weighted sum of all those determinants weighted by fij. By appro- priate selection of F one can control both the complexity (avoid running over all possible d d blocks) of the computation and the degree of interaction between the determinants. These determinants have an interesting geometric interpretation if those are computed over unitary matrices -- as described next.

Let A = QARA and B = QBRB be the QR factorization of the matrices, i.e., QA has orthonormal columns which span the column space of A, then it has been recently shown [14] that R-1 can be computed from A using only operations over k(a A i, aj ). Therefore, the product Q Q A BR-1, can be computed using only local A B , which is equal to R-T A B kernel applications. In other words, for each A compute R-1 (can be done using only A inner-products over columns of A), then when it comes to compute A B compute in- stead R-T A BR-1 which is equivalent to computing Q Q A B A B . Thus effectively we have replaced every A with QA (unitary matrix).

Now, d(QA, QB) for unitary matrices is the sum over the product of the cosine principal angles between d-dim subspaces spanned by columns of A and B. The value of each determinant of the d d blocks of Q Q A B is equal to the product of the cosine principal angles between the respective d-dim subspaces determined by corresponding selection of d columns from A and d columns from B. For example, the case k = q = d produces d(QA, QB) = det(Q Q Q A B ) which is the product of the eigenvalues of the matrix QA B . Those eigenvalues are the cosine of the principal angles between the column space of A and the column space of B [2]. Therefore, det(Q Q A B ) measures the "angle" between the two subspaces spanned by the respective columns of the input matrices -- in particular is invariant to the order of the columns. For smaller values of d we obtain the sum over such products between subspaces spanned by subsets of d columns between A and B.

The advantage of smaller values of d is two fold: first it enables to compute the similarity when k = q and second breaks down the similarity between subspaces into smaller pieces. The entries of the matrix F determine which subspaces are being considered and the inter- action between subspaces in A and B. A diagonal F compares corresponding subspaces

```
(a) (b)
```

Figure 1: (a) The configuration of the nine sub-regions is displayed over the gradient image. (b) some of the positive examples -- note the large variation in appearance, pose and articulation.

between A and B whereas off-diagonal entries would enable comparisons between differ- ent choices of subspaces in A and in B. For example, we may want to consider choices of d columns arranged in a "sliding" fashion, i.e., column sets {1, .., d}, {2, ..., d + 1}, ... and so forth, instead of the combinatorial number of all possible choices. This selection is associated with a sparse diagonal F where the non-vanishing entries along the diagonal have the value of "1" and correspond to the sliding window selections.

To conclude, in the linear version < A, B > the role of F is to determine the range of interaction between columns of A and columns of B, whereas with the non-linear version it is the interaction between d-dim subspaces rather than individual columns. We could select all possible interactions (exponential number) or any reduced interaction set such as the sliding window rule (linear number of choices) as described above.

Do not remove: This comment is monitored to verify that the site is working properly