Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Anton Schwaighofer, Volker Tresp, Kai Yu*

We present a novel method for learning with Gaussian process regres- sion in a hierarchical Bayesian framework. In a first step, kernel matri- ces on a fixed set of input points are learned from data using a simple and efficient EM algorithm. This step is nonparametric, in that it does not require a parametric form of covariance function. In a second step, kernel functions are fitted to approximate the learned covariance matrix using a generalized Nystrom method, which results in a complex, data driven kernel. We evaluate our approach as a recommendation engine for art images, where the proposed hierarchical Bayesian method leads to excellent prediction performance.

1 Introduction

In many real-world application domains, the available training data sets are quite small, which makes learning and model selection difficult. For example, in the user preference modelling problem we will consider later, learning a preference model would amount to fitting a model based on only 20 samples of a user's preference data. Fortunately, there are situations where individual data sets are small, but data from similar scenarios can be obtained. Returning to the example of preference modelling, data for many different users are typically available. This data stems from clearly separate individuals, but we can expect that models can borrow strength from data of users with similar tastes. Typically, such problems have been handled by either mixed effects models or hierarchical Bayesian modelling.

In this paper we present a novel approach to hierarchical Bayesian modelling in the context of Gaussian process regression, with an application to recommender systems. Here, hier- archical Bayesian modelling essentially means to learn the mean and covariance function of the Gaussian process.

In a first step, a common collaborative kernel matrix is learned from the data via a simple and efficient EM algorithm. This circumvents the problem of kernel design, as no paramet- ric form of kernel function is required here. Thus, this form of learning a covariance matrix is also suited for problems with complex covariance structure (e.g. nonstationarity).

A portion of the learned covariance matrix can be explained by the input features and, thus,

generalized to new objects via a content-based kernel smoother. Thus, in a second step, we generalize the covariance matrix (learned by the EM-algorithm) to new items using a generalized Nystrom method. The result is a complex content-based kernel which itself is a weighted superposition of simple smoothing kernels. This second part could also be applied to other situations where one needs to extrapolate a covariance matrix on a finite set (e.g. a graph) to a continuous input space, as, for example, required in induction for semi-supervised learning [14].

The paper is organized as follows. Sec. 2 casts Gaussian process regression in a hierarchical Bayesian framework, and shows the EM updates to learn the covariance matrix in the first step. Extrapolating the covariance matrix is shown in Sec. 3. We illustrate the function of the EM-learning on a toy example in Sec. 4, before applying the proposed methods as a recommender system for images in Sec. 4.1.

1.1 Previous Work

In statistics, modelling data from related scenarios is typically done via mixed effects mod- els or hierarchical Bayesian (HB) modelling [6]. In HB, parameters of models for indi- vidual scenarios (e.g. users in recommender systems) are assumed to be drawn from a common (hyper)prior distribution, allowing the individual models to interact and regular- ize each other. Recent examples of HB modelling in machine learning include [1, 2]. In other contexts, this learning framework is called multi-task learning [4]. Multi-task learn- ing with Gaussian processes has been suggested by [8], yet with the rather stringent as- sumption that one has observations on the same set of points in each individual scenario. Based on sparse approximations of GPs, a more general GP multi-task learner with para- metric covariance functions has been presented in [7]. In contrast, the approach presented in this paper only considers covariance matrices (and is thus non-parametric) in the first step. Only in a second extrapolation step, kernel smoothing leads to predictions based on a covariance function that is a data-driven combination of simple kernel functions.

2 Learning GP Kernel Matrices via EM

The learning task we are concerned with can be stated as follows: The data are observations from M different scenarios. In the i.th scenario, we have observations yi = (yi , . . . , yi ) 1 N i on a total of N i points, Xi = {xi , . . . , xi } 1 . In order to analyze this data in a hierarchical N i Bayesian way, we assume that the data for each scenario is a noisy sample of a Gaussian process (GP) with unknown mean and covariance function. We assume that mean and covariance function are shared across different scenarios.1

In the first modelling step presented in this section, we consider transductive learning ("la- belling a partially labelled data set"), that is, we are interested in the model's behavior only on points X, with X = M Xi and cardinality N = |X|. This situation is relevant i=1 for most collaborative filtering applications. Thus, test points are the unlabelled points in each scenario. This reduces the whole "infinite dimensional" Gaussian process to its finite dimensional projection on points X, which is an N -variate Gaussian distribution with co- variance matrix K and mean vector m. For the EM algorithm to work, we also require that there is some overlap between scenarios, that is, Xi Xj = for some i, j. Coming back to the user modelling problem mentioned above, this means that at least some items have been rated by more than one user.

Thus, our first modelling step focusses on directly learning the covariance matrix K and

```
1Alternative HB approaches for collaborative filtering, like that discussed in [5], assume that model weights are drawn from a shared Gaussian distribution.
```

m from the data via an efficient EM algorithm. This may be of particular help in problems where one would need to specify a complex (e.g. nonstationary) covariance function.

Following the hierarchical Bayesian assumption, the data observed in each scenario is thus a partial sample from N (y | m, K + 21), where 1 denotes the unit matrix. The joint model is simply M

```
p(m, K) p(yi | f i)p(f i | m, K), (1)
i=1
```

where p(m, K) denotes the prior distribution for mean and covariance. We assume a Gaus- sian likelihood p(yi | f i) with diagonal covariance matrix 21.

2.1 EM Learning

For the above hierarchical Bayesian model, Eq. (1), the marginal likelihood becomes

```
M
p(m, K) p(yi | f i)p(f i | m, K) df i. (2)
i=1
```

To obtain simple and stable solutions when estimating m and K from the data, we con- sider point estimates of the parameters m and K, based on a penalized likelihood approach with conjugate priors.2 The conjugate prior for mean m and covariance K of a multivari- ate Gaussian is the so-called Normal-Wishart distribution [6], which decomposes into the product of an inverse Wishart distribution for K and a Normal distribution for m,

```
p(m, K) = N (m | , -1K)Wi-1(K|, U ). (3)
```

That is, the prior for the Gram matrix K is given by an inverse Wishart distribution with scalar parameter > 1/2(N - 1) and U being a symmetric positive-definite matrix. Given the covariance matrix K, m is Gaussian distributed with mean and covariance -1K, where is a positive scalar. The parameters can be interpreted in terms of an equivalent data set for the mean (this data set has size A, with A = , and mean = ) and a data set for the covariance that has size B, with = (B + N )/2, and covariance S, U = (B/2)S.

In order to write down the EM algorithm in a compact way, we denote by I(i) the set of indices of those data points that have been observed in the i.th scenario, that is I(i) = {j | j {1, . . . , N } and xj Xi}. Keep in mind that in most applications of interest N i N such that most targets are missing in training. KI(i),I(i) denotes the square submatrix of K that corresponds to points I(i), that is, the covariance matrix for points in the i.th scenario. By K,I(i) we denote the covariance matrix of all N points versus those in the i.th scenario.

Do not remove: This comment is monitored to verify that the site is working properly