Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Yves Grandvalet, Yoshua Bengio*

We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables to incorporate unlabeled data in the standard supervised learning. Our approach in- cludes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solu- tion benefits from unlabeled data. The method challenges mixture mod- els when the data are sampled from the distribution class spanned by the generative model. The performances are definitely in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the "cluster assumption". Finally, we also illustrate that the method can also be far superior to manifold learning in high dimension spaces.

1 Introduction

In the classical supervised learning classification framework, a decision rule is to be learned from a learning set Ln = {xi, yi}ni , where each example is described by a pattern =1 xi X and by the supervisor's response yi = {1, . . . , K }. We consider semi-supervised learning, where the supervisor's responses are limited to a subset of Ln.

In the terminology used here, semi-supervised learning refers to learning a decision rule on X from labeled and unlabeled data. However, the related problem of transductive learning, i.e. of predicting labels on a set of predefined patterns, is addressed as a side issue. Semi- supervised problems occur in many applications where labeling is performed by human experts. They have been receiving much attention during the last few years, but some important issues are unresolved [10].

In the probabilistic framework, semi-supervised learning can be modeled as a missing data problem, which can be addressed by generative models such as mixture models thanks to the EM algorithm and extensions thereof [6].Generative models apply to the joint den- sity of patterns and class (X, Y ). They have appealing features, but they also have major drawbacks. Their estimation is much more demanding than discriminative models, since the model of P (X, Y ) is exhaustive, hence necessarily more complex than the model of

```
This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence IST-2002-506778. This publication only reflects the authors' views.
```

P (Y |X). More parameters are to be estimated, resulting in more uncertainty in the es- timation process. The generative model being more precise, it is also more likely to be misspecified. Finally, the fitness measure is not discriminative, so that better models are not necessarily better predictors of class labels. These difficulties have lead to proposals aiming at processing unlabeled data in the framework of supervised classification [1, 5, 11]. Here, we propose an estimation principle applicable to any probabilistic classifier, aiming at making the most of unlabeled data when they are beneficial, while providing a control on their contribution to provide robustness to the learning scheme.

2 Derivation of the Criterion

Do not remove: This comment is monitored to verify that the site is working properly