{"title": "Generalization in Clustering with Unobserved Features", "book": "Advances in Neural Information Processing Systems", "page_first": 683, "page_last": 690, "abstract": null, "full_text": "Generalization in Clustering with Unobserved Features\nEyal Krupka and Naftali Tishby School of Computer Science and Engineering, Interdisciplinary Center for Neural Computation The Hebrew University Jerusalem, 91904, Israel {eyalkr,tishby}@cs.huji.ac.il\n\nAbstract\nWe argue that when objects are characterized by many attributes, clustering them on the basis of a relatively small random subset of these attributes can capture information on the unobserved attributes as well. Moreover, we show that under mild technical conditions, clustering the objects on the basis of such a random subset performs almost as well as clustering with the full attribute set. We prove finite sample generalization theorems for this novel learning scheme that extend analogous results from the supervised learning setting. The scheme is demonstrated for collaborative filtering of users with movie ratings as attributes.\n\n1 Introduction\n\nData clustering is unsupervised classification of objects into groups based on their similarity [1]. Often, it is desirable to have the clusters match some labels that are unknown to the clustering algorithm. In this context, a good data clustering is expected to have homogeneous labels in each cluster, under some constraints on the number or complexity of the clusters. This can be quantified by mutual information (see e.g. [2]) between the objects' cluster identity and their (unknown) labels, for a given complexity of clusters. Since the clustering algorithm has no access to the labels, it is unclear how the algorithm can optimize the quality of the clustering. Even worse, the clustering quality depends on the specific choice of the unobserved labels. 
For example, a good clustering of documents with respect to topics is very different from a clustering with respect to authors. In our setting, instead of trying to cluster by some \"arbitrary\" labels, we try to predict unobserved features from observed ones. In this sense our target \"labels\" are yet other features that \"happened\" to be unobserved. For example, when clustering fruits based on their observed features, such as shape, color and size, the target of clustering is to match unobserved features, such as nutritional value and toxicity. In order to theoretically analyze and quantify this new learning scheme, we make the following assumptions. Consider an infinite set of features, and assume that we observe only a random subset of n features, called observed features. The other features are called unobserved features. We assume that the random selection of features is done uniformly and independently.\n\nTable 1: Analogy with supervised learning\nSupervised learning | Clustering with unobserved features\nTraining set | n randomly selected features (observed features)\nTest set | Unobserved features\nLearning algorithm | Cluster the instances into k clusters\nHypothesis class | All possible partitions of m instances into k clusters\nMin generalization error | Max expected information on unobserved features\nERM | Observed Information Maximization (OIM)\nGood generalization | Mean observed and unobserved information are similar\n\nThe clustering algorithm has access only to the observed features of m instances. After the clustering, one of the unobserved features is randomly and uniformly selected to be a target label, i.e. clustering performance is measured with respect to this feature. Obviously, the clustering algorithm cannot be directly optimized for this specific feature. The question is whether we can optimize the expected performance on the unobserved feature, based on the observed features alone. The expectation is over the random selection of the target feature. 
In other words, can we find clusters that match as many unobserved features as possible? Perhaps surprisingly, for a large enough number of observed features, the answer is yes. We show that for any clustering algorithm, the average performance of the clustering with respect to the observed and unobserved features is similar. Hence we can indirectly optimize clustering performance with respect to the unobserved features, in analogy to generalization in supervised learning. These results are universal and do not require any additional assumptions such as an underlying model or a distribution that created the instances. In order to quantify these results, we define two terms: the average observed information and the expected unobserved information. Let T be the variable which represents the cluster for each instance, and $\\{X_1, ..., X_\\infty\\}$ the set of random variables which denote the features. The average observed information, denoted by Iob, is the average mutual information between T and each of the observed features. In other words, if the observed features are $\\{X_1, ..., X_n\\}$ then $I_{ob} = \\frac{1}{n} \\sum_{j=1}^{n} I(T; X_j)$. The expected unobserved information, denoted by Iun, is the expected value of the mutual information between T and a randomly selected unobserved feature, i.e. $E_j\\{I(T; X_j)\\}$. Note that whereas Iob can be measured directly, this paper deals with the question of how to infer and maximize Iun. Our main results consist of two theorems. The first is a generalization theorem. It gives an upper bound on the probability of a large difference between Iob and Iun for all possible clusterings. It also states a uniform convergence in probability of $|I_{ob} - I_{un}|$ as the number of observed features increases. Conceptually, the observed mean information, Iob, is analogous to the training error in standard supervised learning [3], whereas the unobserved information, Iun, is similar to the generalization error. 
The second theorem states that under a constraint on the number of clusters, and a large enough number of observed features, one can achieve nearly the best possible performance, in terms of Iun. Analogous to the principle of Empirical Risk Minimization (ERM) in statistical learning theory [3], this is done by maximizing Iob. Table 1 summarizes the correspondence of our setting to that of supervised learning. The key difference is that in supervised learning, the set of features is fixed and the training instances (samples) are assumed to be randomly drawn from some distribution. In our setting, the set of instances is fixed, but the set of observed features is assumed to be randomly selected. Our new theorems are evaluated empirically in section 3, on a data set of movie ratings. This empirical test also suggests one future research direction: use the framework suggested in this paper for collaborative filtering. Our main point in this paper, however, is the new conceptual framework and not a specific algorithm or experimental performance.\n\nRelated work. The idea of an information tradeoff between complexity and information on target variables is similar to the idea of the information bottleneck [4]. But unlike the bottleneck method, here we are trying to maximize information on unobserved variables, using finite samples. In the framework of learning with labeled and unlabeled data [5], a fundamental issue is the link between the marginal distribution P(x) over examples x and the conditional P(y|x) for the label y [6]. From this point of view our approach assumes that y is a feature in itself.\n\n2 Mathematical Formulation and Analysis\n\nConsider a set of discrete random variables $\\{X_1, ..., X_L\\}$, where L is very large ($L \\to \\infty$). We randomly, uniformly and independently select $n \\ll L$ variables from this set. These variables are the observed features and their indexes are denoted by $\\{q_1, ..., q_n\\}$. The remaining L - n variables are the unobserved features. 
A clustering algorithm has access only to the observed features over m instances {x[1], ..., x[m]}. The algorithm assigns a cluster label $t_i \\in \\{1, ..., k\\}$ to each instance x[i], where k is the number of clusters. Let T denote the cluster label assigned by the algorithm. Shannon's mutual information between two variables is a function of their joint distribution, defined as $I(T; X_j) = \\sum_{t, x_j} P(t, x_j) \\log \\frac{P(t, x_j)}{P(t) P(x_j)}$. Since we are dealing with a finite number of samples, m, the distribution P is taken as the empirical joint distribution of (T, Xj), for every j. For a random j, this empirical mutual information is a random variable on its own. The average observed information, Iob, is now defined as $I_{ob} = \\frac{1}{n} \\sum_{i=1}^{n} I(T; X_{q_i})$. In general, Iob is higher when clusters are more coherent, i.e. elements within each cluster have many similar attributes. The expected unobserved information, Iun, is defined as $I_{un} = E_j\\{I(T; X_j)\\}$. Since n is much smaller than L, we can assume that a randomly selected feature is with high probability from the unobserved set. Equivalently, Iun can be the mean mutual information between the clusters and each of the unobserved features, $I_{un} = \\frac{1}{L-n} \\sum_{j \\notin \\{q_1, ..., q_n\\}} I(T; X_j)$. The goal of the clustering algorithm is to find cluster labels {t1, ..., tm} that maximize Iun, subject to a constraint on their complexity - henceforth considered as the number of clusters ($k \\le D$) for simplicity, where D is an integer bound.\n\nBefore discussing how to maximize Iun, we consider first the problem of estimating it. Similar to the generalization error in supervised learning, Iun cannot be estimated directly in the learning algorithm, but we may be able to bound the difference between the observed information Iob - our \"training error\" - and Iun - the \"generalization error\". To obtain generalization this bound should be uniform over all possible clusterings with a high probability over the randomly selected features. 
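As a rough illustration of the definitions above, the following Python sketch computes the empirical mutual information between cluster labels and one feature, and averages it over a set of observed features to obtain Iob. The function names and the toy data are assumptions made for this example and do not come from the paper; natural logarithms are used, so values are in nats.

```python
from collections import Counter
from math import log

def mutual_information(t, x):
    '''Empirical mutual information (nats) between two equal-length sequences.'''
    m = len(t)
    joint = Counter(zip(t, x))   # counts of (cluster, feature value) pairs
    pt = Counter(t)              # cluster counts
    px = Counter(x)              # feature value counts
    # sum over observed pairs of P(t,x) * log( P(t,x) / (P(t) P(x)) )
    return sum((c / m) * log(c * m / (pt[a] * px[b]))
               for (a, b), c in joint.items())

def average_information(t, features):
    '''Mean of I(T; X_j) over a list of feature columns (Iob when they are the observed ones).'''
    return sum(mutual_information(t, x) for x in features) / len(features)

# Toy data (hypothetical): 4 instances, 2 clusters, 2 observed features.
t = [0, 0, 1, 1]
observed = [[1, 1, 2, 2],   # fully determined by the cluster
            [1, 2, 1, 2]]   # independent of the cluster
i_ob = average_information(t, observed)
```

Applying the same `average_information` function to held-out feature columns gives an empirical stand-in for Iun, which is how the two quantities can be compared on real data.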
The following lemma argues that such uniform convergence in probability of Iob to Iun always occurs.\n\nLemma 1 With the definitions above,\n$$\\Pr\\left\\{ \\sup_{\\{t_1, ..., t_m\\}} |I_{ob} - I_{un}| > \\epsilon \\right\\} \\le 2 e^{-2n\\epsilon^2/(\\log k)^2 + m \\log k}, \\quad \\forall \\epsilon > 0$$\nwhere the probability is over the random selection of the observed features.\n\nProof: For fixed cluster labels, {t1, ..., tm}, and a random feature j, the mutual information I(T; Xj) is a function of the random variable j, and hence I(T; Xj) is a random variable in itself. Iob is the average of n such independent random variables and Iun is its expected value. Clearly, for all j, $0 \\le I(T; X_j) \\le \\log k$. Using Hoeffding's inequality [7], $\\Pr\\{|I_{ob} - I_{un}| > \\epsilon\\} \\le 2 e^{-2n\\epsilon^2/(\\log k)^2}$. Since there are at most $k^m$ possible partitions, the union bound is sufficient to prove lemma 1.\n\nNote that for any $\\epsilon > 0$, the probability that $|I_{ob} - I_{un}| > \\epsilon$ goes to zero as $n \\to \\infty$. The convergence rate of Iob to Iun is bounded by $O(\\log n/\\sqrt{n})$. As expected, this upper bound decreases as the number of clusters, k, decreases. Unlike the standard bounds in supervised learning, this bound increases with the number of instances (m), and decreases with increasing number of observed features (n). This is because in our scheme the training size is not the number of instances, but rather the number of observed features (see Table 1). However, in the next theorem we obtain an upper bound that is independent of m, and hence is tighter for large m.\n\nTheorem 1 (Generalization Theorem) With the definitions above,\n$$\\Pr\\left\\{ \\sup_{\\{t_1, ..., t_m\\}} |I_{ob} - I_{un}| > \\epsilon \\right\\} \\le 8 (\\log k) e^{-\\frac{n\\epsilon^2}{8(\\log k)^2} + \\frac{4k \\max_j |X_j|}{\\epsilon} \\log k - \\log \\epsilon}, \\quad \\forall \\epsilon > 0$$\nwhere $|X_j|$ denotes the alphabet size of $X_j$ (i.e. the number of different values it can obtain). Again, the probability is over the random selection of the observed features. The convergence rate here is bounded by $O(\\log n/\\sqrt[3]{n})$. However, for relatively large n one can use the bound in lemma 1, which converges faster. 
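To get a feel for how the two bounds behave, the sketch below evaluates their logarithms for given n, m, k and epsilon. It assumes natural logarithms throughout and follows the bound expressions as we read them from the statements of Lemma 1 and Theorem 1; the function names and the numeric inputs are hypothetical. Working in log space avoids overflow when m log k is large.

```python
from math import log

def lemma1_log_bound(n, m, k, eps):
    '''log of 2 * exp(-2 n eps^2 / (log k)^2 + m log k).'''
    return log(2) - 2 * n * eps ** 2 / log(k) ** 2 + m * log(k)

def theorem1_log_bound(n, k, eps, max_alphabet):
    '''log of 8 (log k) * exp(-n eps^2 / (8 (log k)^2)
    + (4 k max_alphabet / eps) log k - log eps). Does not depend on m.'''
    return (log(8 * log(k))
            - n * eps ** 2 / (8 * log(k) ** 2)
            + (4 * k * max_alphabet / eps) * log(k)
            - log(eps))

# Hypothetical setting: 2 clusters, features with 6 possible values.
k, eps, max_alphabet = 2, 0.1, 6
few_instances = lemma1_log_bound(n=1000, m=10, k=k, eps=eps)
many_instances = lemma1_log_bound(n=1000, m=4000, k=k, eps=eps)
independent_of_m = theorem1_log_bound(n=10 ** 6, k=k, eps=eps, max_alphabet=max_alphabet)
```

A bound is informative only when its logarithm is negative. In this toy setting the Lemma 1 value grows linearly with m, while the Theorem 1 value is unaffected by m but needs a much larger n before it drops below zero.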
A detailed proof of theorem 1 can be found in [8]. Here we provide the outline of the proof.\n\nProof outline: From the given m instances and any given cluster labels {t1, ..., tm}, draw uniformly and independently $m'$ instances (repeats allowed) and denote their indexes by $\\{i_1, ..., i_{m'}\\}$. We can estimate I(T; Xj) from the empirical distribution of (T, Xj) over the $m'$ instances. This distribution is denoted by $\\hat{P}(t, x_j)$ and the corresponding mutual information is denoted by $I_{\\hat{P}}(T; X_j)$. Theorem 1 is built up from the following upper bounds, which are independent of m, but depend on the choice of $m'$. The first bound is on $E\\left( I(T; X_j) - I_{\\hat{P}}(T; X_j) \\right)$, where the expectation is over random selection of the $m'$ instances. From this bound we derive upper bounds on $|I_{ob} - E(\\hat{I}_{ob})|$ and $|I_{un} - E(\\hat{I}_{un})|$, where $\\hat{I}_{ob}$, $\\hat{I}_{un}$ are the estimated values of Iob, Iun based on the subset of $m'$ instances. The last required bound is on the probability that $\\sup_{\\{t_1, ..., t_m\\}} |E(\\hat{I}_{ob}) - E(\\hat{I}_{un})| > \\epsilon_1$, for any $\\epsilon_1 > 0$. This bound is obtained from lemma 1. The choice of $m'$ is independent of m. Its value should be large enough for the estimations $\\hat{I}_{ob}$, $\\hat{I}_{un}$ to be accurate, but not too large, so as to limit the number of possible clusterings over the $m'$ instances.\n\nWe now describe the above mentioned upper bounds in more detail. Using Paninski [9] (proposition 1) it is easy to show that the bias between I(T; Xj) and its maximum likelihood estimation, based on $\\hat{P}(t, x_j)$, is bounded as follows.\n$$\\left| E_{\\{i_1, ..., i_{m'}\\}} \\left( I(T; X_j) - I_{\\hat{P}}(T; X_j) \\right) \\right| \\le \\log\\left( 1 + \\frac{k |X_j| - 1}{m'} \\right) \\le \\frac{k |X_j|}{m'} \\quad (1)$$\nFrom this equation we obtain,\n$$|I_{ob} - E_{\\{i_1, ..., i_{m'}\\}}(\\hat{I}_{ob})|, \\quad |I_{un} - E_{\\{i_1, ..., i_{m'}\\}}(\\hat{I}_{un})| \\le k \\max_j |X_j| / m' \\quad (2)$$\nUsing lemma 1 we have an upper bound on the probability that $\\sup_{\\{t_1, ..., t_m\\}} |\\hat{I}_{ob} - \\hat{I}_{un}| > \\epsilon$ over the random selection of features, as a function of $m'$. 
However, the upper bound we need is on the probability that $\\sup_{\\{t_1, ..., t_m\\}} |E(\\hat{I}_{ob}) - E(\\hat{I}_{un})| > \\epsilon_1$. Note that the expectations $E(\\hat{I}_{ob})$, $E(\\hat{I}_{un})$ are taken over the random selection of the subset of $m'$ instances, for a set of features that were randomly selected once. In order to link these two probabilities, we need the following lemma.\n\nLemma 2 Consider a function f of two independent random variables (Y, Z). We assume that $f(y, z) \\le c$, $\\forall y, z$, where c is some constant. If $\\Pr\\{f(Y, Z) > \\tilde{\\epsilon}\\} \\le \\delta$, then\n$$\\Pr_Z\\{E_y(f(y, Z)) \\ge \\epsilon\\} \\le \\frac{c - \\tilde{\\epsilon}}{\\epsilon - \\tilde{\\epsilon}} \\delta, \\quad \\forall \\epsilon > \\tilde{\\epsilon}$$\nThe proof of this lemma is rather standard and is given in [8]. From lemmas 1 and 2 it is easy to show that\n$$\\Pr\\left\\{ \\sup_{\\{t_1, ..., t_m\\}} \\left| E_{\\{i_1, ..., i_{m'}\\}} \\left( \\hat{I}_{ob} - \\hat{I}_{un} \\right) \\right| > \\epsilon_1 \\right\\} \\le \\frac{4 \\log k}{\\epsilon_1} e^{-\\frac{n \\epsilon_1^2}{2 (\\log k)^2} + m' \\log k} \\quad (3)$$\nLemma 2 is used, where Z represents the random selection of features, Y represents the random selection of $m'$ instances, $f(y, z) = \\sup_{\\{t_1, ..., t_m\\}} |\\hat{I}_{ob} - \\hat{I}_{un}|$, $c = \\log k$, and $\\tilde{\\epsilon} = \\epsilon_1/2$. From eq. 2 and 3 it can be shown that\n$$\\Pr\\left\\{ \\sup_{\\{t_1, ..., t_m\\}} |I_{ob} - I_{un}| > \\epsilon_1 + \\frac{2k \\max_j |X_j|}{m'} \\right\\} \\le \\frac{4 \\log k}{\\epsilon_1} e^{-\\frac{n \\epsilon_1^2}{2 (\\log k)^2} + m' \\log k}$$\nBy selecting $\\epsilon_1 = \\epsilon/2$, $m' = 4k \\max_j |X_j|/\\epsilon$, we obtain theorem 1.\n\nNote that the selection of $m'$ depends on $k \\max_j |X_j|$. This reflects the fact that in order to accurately estimate I(T; Xj), we need a number of instances, $m'$, which is much larger than the product of the alphabet sizes of T, Xj. We can now return to the problem of specifying a clustering that maximizes Iun, using only the observed features. For reference, we will first define Iun of the best possible clusters.\n\nDefinition 1 Maximally achievable unobserved information: Let $I^*_{un,D}$ be the maximum value of Iun that can be achieved by any clustering {t1, ..., tm}, subject to the constraint $k \\le D$, for some constant D:\n$$I^*_{un,D} = \\sup_{\\{\\{t_1, ..., t_m\\}: k \\le D\\}} I_{un}$$\nThe clustering that achieves this value is called the best clustering. 
The average observed information of this clustering is denoted by $I^*_{ob,D}$.\n\nDefinition 2 Observed information maximization algorithm: Let IobMax be any clustering algorithm that, based on the values of observed features alone, selects the cluster labels {t1, ..., tm} having the maximum possible value of Iob, subject to the constraint $k \\le D$.\n\nLet $\\tilde{I}_{ob,D}$ be the average observed information achieved by the IobMax algorithm. Let $\\tilde{I}_{un,D}$ be the expected unobserved information achieved by the IobMax algorithm. The next theorem states that IobMax not only maximizes Iob, but also Iun.\n\nTheorem 2 With the definitions above,\n$$\\Pr\\left\\{ \\tilde{I}_{un,D} \\le I^*_{un,D} - \\epsilon \\right\\} \\le 8 (\\log k) e^{-\\frac{n \\epsilon^2}{32 (\\log k)^2} + \\frac{8k \\max_j |X_j|}{\\epsilon} \\log k - \\log(\\epsilon/2)}, \\quad \\forall \\epsilon > 0 \\quad (4)$$\nwhere the probability is over the random selection of the observed features.\n\nProof: We now define a bad clustering as a clustering whose expected unobserved information satisfies $I_{un} \\le I^*_{un,D} - \\epsilon$. Using Theorem 1, the probability that $|I_{ob} - I_{un}| > \\epsilon/2$ for any of the clusterings is upper bounded by the right term of equation 4. If for all clusterings $|I_{ob} - I_{un}| \\le \\epsilon/2$, then surely $I^*_{ob,D} \\ge I^*_{un,D} - \\epsilon/2$ (see Definition 1) and Iob of all bad clusterings satisfies $I_{ob} \\le I^*_{un,D} - \\epsilon/2$. Hence the probability that a bad clustering has a higher average observed information than the best clustering is upper bounded as in Theorem 2.\n\nAs a result of this theorem, when n is large enough, even an algorithm that knows the value of all the features (observed and unobserved) cannot find a clustering with the same complexity (k) which is significantly better than the clustering found by the IobMax algorithm.\n\n3 Empirical Evaluation\n\nIn this section we describe an experimental evaluation of the generalization properties of the IobMax algorithm for a finite large number of features. We examine the difference between Iob and Iun as a function of the number of observed features and the number of clusters used. 
We also compare the value of Iun achieved by the IobMax algorithm to the maximum achievable $I^*_{un,D}$ (see Definition 1). Our evaluation uses a data set typically used for collaborative filtering. Collaborative filtering refers to methods of making predictions about a user's preferences, by collecting preferences of many users. For example, collaborative filtering for movie ratings could make predictions about rating of movies by a user, given a partial list of ratings from this user and many other users. Clustering methods are used for collaborative filtering by clustering users based on the similarity of their ratings (see e.g. [10]). In our setting, each user is described as a vector of movie ratings. The rating of each movie is regarded as a feature. We cluster users based on the set of observed features, i.e. rated movies. In our context, the goal of the clustering is to maximize the information between the clusters and unobserved features, i.e. movies that have not yet been rated by any of the users. By Theorem 2, given a large enough number of rated movies, we can achieve the best possible clustering of users with respect to unseen movies. In this region, no additional information (such as user age, taste, rating of more movies) beyond the observed features can improve Iun by more than some small $\\epsilon$. The purpose of this section is not to suggest a new algorithm for collaborative filtering or compare it to other methods, but simply to illustrate our new theorems on empirical data.\n\nDataset. We used MovieLens (www.movielens.umn.edu), which is a movie rating data set. It was collected and distributed by GroupLens Research at the University of Minnesota. It contains approximately 1 million ratings for 3900 movies by 6040 users. Ratings are on a scale of 1 to 5. We used only a subset consisting of 2400 movies by 4000 users. In our setting, each instance is a vector of ratings $(x_1, ..., x_{2400})$ by a specific user. 
Each movie is viewed as a feature, where the rating is the value of the feature.\n\nExperimental Setup. We randomly split the 2400 movies into two groups, denoted by \"A\" and \"B\", of 1200 movies (features) each. We used a subset of the movies from group \"A\" as observed features and all movies from group \"B\" as the unobserved features. The experiment was repeated with 10 random splits and the results averaged. We estimated Iun by the mean information between the clusters and ratings of movies from group \"B\".\n\n[Figure 1: three panels. (a) 2 Clusters and (b) 6 Clusters, each with x-axis: number of observed features (movies) (n). (c) Fixed n (1200), x-axis: number of clusters (k).]\n\nFigure 1: Iob, Iun and $I^*_{un}$ per number of training movies and clusters. In (a) and (b) the number of movies is variable, and the number of clusters is fixed. In (c) the number of observed movies is fixed (1200), and the number of clusters is variable. The overall mean information is low, since the rating matrix is sparse.\n\nHandling Missing Values. In this data set, most of the values are missing (not rated). We handle this by defining the feature variable as 1,2,...,5 for the ratings and 0 for a missing value. We maximize the mutual information based on the empirical distribution of values that are present, and weight it by the probability of presence for this feature. Hence, $I_{ob} = \\frac{1}{n} \\sum_{j=1}^{n} P(X_j \\ne 0) I(T; X_j | X_j \\ne 0)$ and $I_{un} = E_j\\{P(X_j \\ne 0) I(T; X_j | X_j \\ne 0)\\}$. The weighting prevents 'overfitting' to movies with few ratings. 
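The presence-weighted information described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names, the default of 0 as the missing marker, and the toy ratings are ours, and the paper's actual implementation is described in [8] rather than here. For one feature, the sketch computes the mutual information over the instances where the rating is present and multiplies it by the fraction of instances that have a rating.

```python
from collections import Counter
from math import log

MISSING = 0  # value used in the text to mark a missing rating

def mutual_information(t, x):
    '''Empirical mutual information (nats) between two equal-length sequences.'''
    m = len(t)
    joint, pt, px = Counter(zip(t, x)), Counter(t), Counter(x)
    return sum((c / m) * log(c * m / (pt[a] * px[b]))
               for (a, b), c in joint.items())

def weighted_information(t, x, missing=MISSING):
    '''P(X != missing) * I(T; X | X != missing) for a single feature column.'''
    present = [(a, b) for a, b in zip(t, x) if b != missing]
    if not present:
        return 0.0  # a feature with no ratings carries no information
    p_present = len(present) / len(t)
    t_present, x_present = zip(*present)
    return p_present * mutual_information(t_present, x_present)

# Toy example (hypothetical): 4 users in 2 clusters, one movie rated by 2 of them.
clusters = [0, 0, 1, 1]
ratings = [5, 0, 1, 0]
w = weighted_information(clusters, ratings)
```

Averaging `weighted_information` over the observed movies gives the modified Iob. Because of the presence weight, a movie with very few ratings contributes little even when those few ratings line up perfectly with the clusters.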
Since the observed features were selected at random, the statistics of missing values of the observed and unobserved features are the same. Hence, all theorems are applicable to these definitions of Iob and Iun as well.\n\nGreedy IobMax Algorithm. We cluster the users using a simple greedy clustering algorithm. The input to the algorithm is all users, represented solely by the observed features. Since this algorithm can only find a local maximum of Iob, we ran the algorithm 10 times (each with a different random initialization) and selected the result that had the maximum value of Iob. More details about this algorithm can be found in [8].\n\nIn order to estimate $I^*_{un,D}$ (see Definition 1), we also ran the same algorithm with all the features available to it (i.e. also features from group \"B\"). The algorithm finds clusters that maximize the mean mutual information on features from group \"B\".\n\nResults. The results are shown in Figure 1. As n increases, Iob decreases and Iun increases, until they converge to each other. For small n, the clustering 'overfits' to the observed features. This is similar to training and test errors in supervised learning. For large n, Iun approaches $I^*_{un,D}$, which means the IobMax algorithm found nearly the best possible clustering, as expected from theorem 2. As the number of clusters increases, both Iob and Iun increase, but the difference between them also increases.\n\n4 Discussion and Summary\n\nWe introduce a new learning paradigm: clustering based on observed features that generalizes to unobserved features. Our results are summarized by two theorems that tell us how, without knowing the value of the unobserved features, one can estimate and maximize information between the clusters and the unobserved features. The key assumption that enables us to prove the theorems is the random independent selection of the observed features. 
Another interpretation of the generalization theorem, without using this assumption, might be combinatorial. The difference between the observed and unobserved information is large only for a small portion of all possible partitions into observed and unobserved features. This means that almost any arbitrary partition generalizes well. The importance of clustering which preserves information on unobserved features is that it enables us to learn new - previously unobserved - attributes from a small number of examples. Suppose that after clustering fruits based on their observed features, we eat a chinaberry (footnote 1) and thus, we \"observe\" (by getting sick), the previously unobserved attribute of toxicity. Assuming that in each cluster, all fruits have similar unobserved attributes, we can conclude that all fruits in the same cluster, i.e. all chinaberries, are likely to be poisonous. We can even relate the IobMax principle to cognitive clustering in sensory information processing. In general, a symbolic representation (e.g. assigning object names in language) may be based on a similar principle - find a representation (clusters) that contains significant information on as many observed features as possible, while still remaining simple. Such representations are expected to contain information on other rarely viewed salient features.\n\nAcknowledgments\nWe thank Amir Globerson, Ran Bachrach, Amir Navot, Oren Shriki, Avner Dor and Ilan Sutskover for helpful discussions. We also thank the GroupLens Research Group at the University of Minnesota for use of the MovieLens data set. Our work is partly supported by a grant from the Israeli Academy of Science.\n\nReferences\n[1] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264-323, September 1999.\n[2] T. M. Cover and J. A. Thomas. Elements Of Information Theory. Wiley Interscience, 1991.\n[3] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.\n[4] N. Tishby, F. Pereira, and W. Bialek. 
The information bottleneck method. Proc. 37th Allerton Conf. on Communication and Computation, 1999.\n[5] M. Seeger. Learning with labeled and unlabeled data. Technical report, University of Edinburgh, 2002.\n[6] M. Szummer and T. Jaakkola. Information regularization with partially labeled data. In NIPS, 2003.\n[7] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30, 1963.\n[8] E. Krupka and N. Tishby. Generalization in clustering with unobserved features. Technical report, Hebrew University, 2005. http://www.cs.huji.ac.il/~tishby/nips2005tr.pdf.\n[9] L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15:1101-1253, 2003.\n[10] B. Marlin. Collaborative filtering: A machine learning perspective. Master's thesis, University of Toronto, 2004.\n\nFootnote 1: Chinaberries are the fruits of the Melia azedarach tree, and are poisonous.\n", "award": [], "sourceid": 2896, "authors": [{"given_name": "Eyal", "family_name": "Krupka", "institution": null}, {"given_name": "Naftali", "family_name": "Tishby", "institution": null}]}