{"title": "Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1772, "page_last": 1780, "abstract": "One of the main challenges in data clustering is to define an appropriate similarity measure between two objects. Crowdclustering addresses this challenge by defining the pairwise similarity based on the manual annotations obtained through crowdsourcing. Despite its encouraging results, a key limitation of crowdclustering is that it can only cluster objects when their manual annotations are available. To address this limitation, we propose a new approach for clustering, called \\textit{semi-crowdsourced clustering} that effectively combines the low-level features of objects with the manual annotations of a subset of the objects obtained via crowdsourcing. The key idea is to learn an appropriate similarity measure, based on the low-level features of objects, from the manual annotations of only a small portion of the data to be clustered. One difficulty in learning the pairwise similarity measure is that there is a significant amount of noise and inter-worker variations in the manual annotations obtained via crowdsourcing. We address this difficulty by developing a metric learning algorithm based on the matrix completion method. Our empirical study with two real-world image data sets shows that the proposed algorithm outperforms state-of-the-art distance metric learning algorithms in both clustering accuracy and computational efficiency.", "full_text": "Semi-Crowdsourced Clustering: Generalizing Crowd\n\nLabeling by Robust Distance Metric Learning\n\nJinfeng Yi\u2020, Rong Jin\u2020, Anil K. 
Jain\u2020, Shaili Jain(cid:92), Tianbao Yang\u2021\n\n\u2020Michigan State University, East Lansing, MI 48824, USA\n\n(cid:92)Yale University, New Haven, CT 06520, USA\n\n\u2021Machine Learning Lab, GE Global Research, San Ramon, CA 94583, USA\n{yijinfen, rongjin, jain}@cse.msu.edu, shaili.jain@yale.edu, tyang@ge.com\n\nAbstract\n\nOne of the main challenges in data clustering is to de\ufb01ne an appropriate similar-\nity measure between two objects. Crowdclustering addresses this challenge by\nde\ufb01ning the pairwise similarity based on the manual annotations obtained through\ncrowdsourcing. Despite its encouraging results, a key limitation of crowdclus-\ntering is that it can only cluster objects when their manual annotations are avail-\nable. To address this limitation, we propose a new approach for clustering, called\nsemi-crowdsourced clustering that effectively combines the low-level features of\nobjects with the manual annotations of a subset of the objects obtained via crowd-\nsourcing. The key idea is to learn an appropriate similarity measure, based on the\nlow-level features of objects and from the manual annotations of only a small por-\ntion of the data to be clustered. One dif\ufb01culty in learning the pairwise similarity\nmeasure is that there is a signi\ufb01cant amount of noise and inter-worker variations in\nthe manual annotations obtained via crowdsourcing. We address this dif\ufb01culty by\ndeveloping a metric learning algorithm based on the matrix completion method.\nOur empirical study with two real-world image data sets shows that the proposed\nalgorithm outperforms state-of-the-art distance metric learning algorithms in both\nclustering accuracy and computational ef\ufb01ciency.\n\n1\n\nIntroduction\n\nCrowdsourcing provides an easy and relatively inexpensive way to utilize human capabilities to\nsolve dif\ufb01cult computational learning problems (e.g. image annotation in ESP game [17]). 
It divides a large task into a number of small-scale tasks, often referred to as Human Intelligence Tasks (HITs), and asks a human worker to solve each individual HIT. It then combines the partial solutions obtained from the individual HITs to form the final solution. In the past, crowdsourcing has been explored for a number of machine learning tasks (e.g., classification and clustering) [21, 10, 19].\nCrowdclustering [10] exploits the crowdsourcing paradigm for data clustering. The key idea is to first obtain manual annotations of objects through crowdsourcing. The annotations can either be in the form of grouping objects based on their perceived similarities [10] or keyword assignments to individual objects (e.g., images) by human workers [25]. A pairwise similarity matrix is then computed from the acquired annotations, and is used to cluster the objects. Unlike conventional clustering techniques, where the similarity measure is defined based on the features of objects, in crowdclustering the pairwise similarities are derived from the manual annotations, which better capture the underlying inter-object similarity. Studies [10] have shown that crowdclustering performs significantly better than conventional clustering methods, given a sufficiently large number of manual annotations for all the objects to be clustered.\n\nFigure 1: The proposed framework for semi-crowdsourced clustering. The given $N$ objects $(o_1, o_2, \\ldots, o_N)$ need to be clustered, but only a small subset of the $N$ objects $(o'_1, o'_2, \\ldots, o'_n)$ have been annotated by crowdsourcing, $n \\ll N$.\n\nDespite the encouraging results obtained via crowdclustering, a main shortcoming of crowdclustering is that it can only cluster objects for which manual annotations are available, significantly limiting its application to large scale clustering problems.
For instance, when clustering hundreds of thousands of objects, it is not feasible to have each object manually annotated by multiple workers. To address this limitation, we study the problem of semi-crowdsourced clustering: given the annotations obtained through crowdsourcing for a small subset of the objects, the objective is to cluster the entire collection of objects. Figure 1 depicts the proposed framework. Given a set of $N$ objects to be clustered, the objective is to learn a pairwise similarity measure from the crowdsourced labels of $n$ objects ($n \\ll N$) and the object feature vectors $x$. Note that the available crowdclustering algorithms [10, 25] expect all $N$ objects to be labeled by crowdsourcing.\nThe key to semi-crowdsourced clustering is to define an appropriate similarity measure for the subset of objects that do not have manual annotations (i.e., the remaining $N - n$ objects). To this end, we propose to learn a similarity function, based on the object features, from the pairwise similarities derived from the manual annotations of the subset of $n$ objects; we then apply the learned similarity function to compute the similarity between any two objects, and perform data clustering based on the computed similarities. In this study, for computational simplicity, we restrict ourselves to a linear similarity function: given two objects $O_i$ and $O_j$ with feature representations $x_i$ and $x_j$, respectively, their similarity is $\\mathrm{sim}(O_i, O_j) = x_i^{\\top} M x_j$, where $M \\succeq 0$ is the learned distance metric.\nLearning a linear similarity function from given pairwise similarities (sometimes referred to as pairwise constraints when the similarities are binary) is known as distance metric learning, which has been studied extensively in the literature [24]. The key challenge of distance metric learning in semi-crowdsourced clustering arises due to the noise in the pairwise similarities obtained from manual annotations.
According to [25], large disagreements are often observed among human workers in specifying pairwise similarities. As a result, pairwise similarities based on the majority vote among human workers often disagree with the true cluster assignments of objects. As an example, the authors in [25] show that for the Scenes data set [8], more than 80% of the pairwise labels obtained from human workers are inconsistent with the true cluster assignment. This large noise in the pairwise similarities due to crowdsourcing could seriously misguide the distance metric learning and lead to poor prediction performance, as already demonstrated in [12] as well as in our empirical study.\nWe propose a metric learning algorithm that explicitly addresses the presence of noise in pairwise similarities obtained via crowdsourcing. The proposed algorithm uses the matrix completion technique [3] to rectify the noisy pairwise similarities, and regression analysis to efficiently learn a distance metric from the restored pairwise similarities.\n\nFigure 2: The proposed framework of learning a distance metric from noisy manual annotations\n\nMore specifically, the proposed algorithm for clustering $N$ objects consists of three components: (i) filtering the noisy pairwise similarities for the $n$ annotated objects by keeping only the object pairs whose pairwise similarities are agreed upon by most of the workers (not merely a majority of the workers); the result of the filtering step is a partially observed $n \\times n$ similarity matrix ($n \\ll N$) with most of its entries removed/unobserved; (ii) recovering the $n \\times n$ similarity matrix from the partially observed entries by using a matrix completion algorithm; and (iii) applying a regression algorithm to learn a distance metric from the recovered similarity matrix, and clustering the $N$ objects based on the $N \\times N$ pairwise similarities computed with the learned distance metric.
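As a concrete toy illustration of step (iii), the regression step can be a single regularized least-squares solve, and the learned metric then scores every pair in the full collection. This is a rough NumPy sketch, not the authors' implementation; the names `learn_metric`, `all_pair_similarities`, `X_sub`, `A_hat`, and the ridge weight `lam` are illustrative assumptions.

```python
import numpy as np

def learn_metric(X_sub, A_hat, lam=1.0):
    """Least-squares fit of a metric M so that x_i^T M x_j ~ A_hat[i, j].

    X_sub: d x n feature matrix of the n crowd-annotated objects.
    A_hat: n x n completed similarity matrix (ideally Y Y^T, hence PSD).
    A ridge term lam * n * I keeps the inverse well conditioned.
    """
    d, n = X_sub.shape
    G = X_sub @ X_sub.T + lam * n * np.eye(d)      # regularized Gram matrix
    Ginv = np.linalg.inv(G)
    return Ginv @ X_sub @ A_hat @ X_sub.T @ Ginv   # PSD whenever A_hat is PSD

def all_pair_similarities(X_all, M):
    """Generalize to every object: S = X^T M X over the full collection."""
    return X_all.T @ M @ X_all

# Toy usage: two well-separated clusters, 'ideal' similarities Y Y^T.
rng = np.random.default_rng(0)
y = np.array([0] * 5 + [1] * 5)
X_sub = (np.eye(2)[y] * 4 + rng.normal(size=(10, 2))).T   # d=2, n=10
A_hat = (y[:, None] == y[None, :]).astype(float)          # plays the role of Y Y^T
M = learn_metric(X_sub, A_hat, lam=0.1)
S = all_pair_similarities(X_sub, M)
```

Because `A_hat` is positive semi-definite, the congruence `Ginv @ (X A_hat X^T) @ Ginv` keeps `M` positive semi-definite as well, so no explicit PSD constraint is needed.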
Figure 2 shows the basic steps of the proposed algorithm.\nCompared to the existing approaches to distance metric learning [24], the proposed algorithm has the following three advantages: (i) by exploiting the matrix completion technique, the proposed algorithm is robust to a large amount of noise in the pairwise similarities; (ii) by utilizing regression analysis, the proposed algorithm is computationally efficient and does not have to handle the positive semi-definite constraint, a key computational bottleneck for most distance metric learning algorithms; (iii) the learned distance metric, with high probability, is close to the optimal metric learned from the perfect or true similarities (i.e., a similarity of 1 when two objects are in the same cluster and 0 otherwise) for arbitrarily large $n$.\nWe finally note that, in addition to distance metric learning, both kernel learning [16] and constrained clustering [2] can be applied to generalize the information in the manual annotations acquired by crowdsourcing. In this work, we focus on distance metric learning. The related work, as well as a discussion on exploring kernel learning and constrained clustering techniques for semi-crowdsourced clustering, can be found in Section 4.\n\n2 Semi-Crowdsourced Clustering by Robust Distance Metric Learning\n\nWe first present the problem and a general framework for semi-crowdsourced clustering. We then describe the proposed algorithm for learning a distance metric from a small set of noisy pairwise similarities that are derived from manual annotations.\n\n2.1 Problem Definition and Framework\nLet $D = \\{O_1, \\ldots, O_N\\}$ be the set of $N$ objects to be clustered, and let $X = (x_1, \\ldots, x_N)$ be their feature representation, where $x_i \\in \\mathbb{R}^d$ is a vector of $d$ dimensions. We randomly sample a subset of $n \\ll N$ objects from the collection $D$, denoted by $\\hat{D} = \\{\\hat{O}_1, \\ldots, \\hat{O}_n\\}$, and obtain their manual annotations by crowdsourcing. Let $m$ be the number of HITs used by crowdsourcing. Given the manual annotations collected from the $k$-th HIT, we define a similarity matrix $A^k \\in \\mathbb{R}^{n \\times n}$ such that $A^k_{i,j} = 1$ if objects $\\hat{O}_i$ and $\\hat{O}_j$ share common annotations (i.e., share common annotated keywords or are assigned to the same cluster by the worker), zero if they do not, and $-1$ if either of the two objects is not annotated by the $k$-th HIT (i.e., an unlabeled pair). Note that we only consider a binary similarity measure in this study because our goal is to perfectly reconstruct the ideal pairwise similarities based on the true cluster assignments (i.e., 1 when both objects are assigned to the same cluster and zero otherwise). The objective of semi-crowdsourced clustering is to cluster all the $N$ objects in $D$ based on the features in $X$ and the $m$ similarity matrices $\\{A^k\\}_{k=1}^m$ (each of size $n \\times n$) for the objects in $\\hat{D}$. Throughout this paper, we assume that the number of clusters, denoted by $r$, is given a priori$^1$.\n\n$^1$We may relax this requirement by estimating the number of clusters via some heuristic, e.g., considering the number of clusters as the rank of the completed matrix $\\hat{A}$.\n\nTo generalize the pairwise similarities from the subset $\\hat{D}$ to the entire collection of objects $D$, we propose to first learn a distance metric from the similarity matrices $\\{A^k\\}_{k=1}^m$, and then compute the pairwise similarity for all the $N$ objects in $D$ using the learned distance metric. The challenge is how to learn an appropriate distance metric from a set of similarity matrices $\\{A^k\\}_{k=1}^m$. A straightforward approach is to combine the multiple similarity matrices into a single similarity matrix by computing their average. More specifically, let $\\tilde{A} \\in \\mathbb{R}^{n \\times n}$ be the average similarity matrix.
We have\n\n$\\tilde{A}_{i,j} = \\frac{1}{\\sum_{k=1}^m I(A^k_{i,j} \\geq 0)} \\sum_{k=1}^m I(A^k_{i,j} \\geq 0)\\, A^k_{i,j}$\n\nwhere $A^k_{i,j} < 0$ indicates that the pair $(\\hat{O}_i, \\hat{O}_j)$ is not labeled by the $k$-th HIT (i.e., either object $\\hat{O}_i$ or $\\hat{O}_j$ is not annotated by the $k$-th worker) and $I(z)$ is an indicator function that outputs 1 when $z$ is true and zero otherwise. We then learn a distance metric $M$ from $\\tilde{A}$. The main problem with this simple strategy is that, due to the large disagreements among workers in determining the pairwise similarities, the average similarities do not correlate well with the true cluster assignments. In the next subsection, we develop an efficient and robust algorithm that learns a distance metric from a set of noisy similarity matrices.\n\n2.2 Learning a Distance Metric from a Set of Noisy Similarity Matrices\n\nAs illustrated in Figure 2, the proposed algorithm consists of three steps: a filtering step, a matrix completion step, and a distance metric learning step. For the first two steps, namely the data preprocessing steps, we follow the idea proposed in [25].\nFiltering step. To filter out the uncertain object pairs, we introduce two thresholds $d_0$ and $d_1$ ($1 \\geq d_1 > d_0 \\geq 0$) for the average similarity matrix $\\tilde{A}$. Since any similarity value smaller than $d_0$ indicates that most workers put the corresponding object pair into different clusters, we simply set it to 0. Similarly, we set any similarity value larger than $d_1$ to 1. Object pairs with similarity values in the range between $d_0$ and $d_1$ are treated as uncertain object pairs and are discarded (i.e., marked as unobserved) from the similarity matrix.
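In code, the averaging and filtering just described might be sketched as follows. This is an illustrative NumPy sketch under assumed names (`average_similarity`, `filter_similarities`, NaN as the 'unobserved' marker); `-1` marks an unlabeled pair, as in the definition of $A^k$.

```python
import numpy as np

def average_similarity(A_list):
    """Average the per-HIT similarity matrices, ignoring entries coded -1."""
    A_stack = np.stack(A_list).astype(float)    # shape m x n x n
    labeled = A_stack >= 0                      # indicator I(A^k_ij >= 0)
    counts = labeled.sum(axis=0)                # how many HITs labeled each pair
    sums = np.where(labeled, A_stack, 0.0).sum(axis=0)
    return np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)

def filter_similarities(A_tilde, d0=0.0, d1=0.8):
    """Keep only confident pairs: <= d0 -> 0, >= d1 -> 1, rest unobserved."""
    A = np.full_like(A_tilde, np.nan)           # NaN marks 'unobserved'
    A[A_tilde <= d0] = 0.0
    A[A_tilde >= d1] = 1.0
    observed = ~np.isnan(A)                     # the observed set (Delta)
    return A, observed

# Toy usage: three HITs over 3 objects; -1 means 'pair not labeled'.
A1 = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]])
A2 = np.array([[1, 1, -1], [1, 1, -1], [-1, -1, 1]])
A3 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
A_tilde = average_similarity([A1, A2, A3])
A, observed = filter_similarities(A_tilde, d0=0.0, d1=0.8)
```

Here the pair (0, 1) averages to 2/3, which falls strictly between the thresholds, so it is discarded as uncertain, while the confidently labeled pairs survive as 0/1 entries.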
The resulting partially observed similarity matrix $A$ is given by\n\n$A_{i,j} = \\begin{cases} 1 & \\tilde{A}_{i,j} \\in [d_1, 1] \\\\ 0 & \\tilde{A}_{i,j} \\in [0, d_0] \\\\ \\text{unobserved} & \\text{otherwise} \\end{cases}$   (1)\n\nWe also define $\\Delta$ as the set of observed entries in $A$:\n\n$\\Delta = \\{(i, j) \\in [n] \\times [n] : \\tilde{A}_{i,j} \\geq 0,\\ \\tilde{A}_{i,j} \\notin (d_0, d_1)\\}$\n\nMatrix completion step. Since $A$ is constructed from the partial clustering results generated by different workers, we expect some of the binary similarity values in $A$ to be incorrect. We introduce a matrix $E \\in \\mathbb{R}^{n \\times n}$ to capture the incorrect entries in $A$. If $A^*$ is the perfect similarity matrix, we have $P_{\\Delta}(A^* + E) = P_{\\Delta}(A)$, where $P_{\\Delta}$ outputs a matrix with $[P_{\\Delta}(B)]_{i,j} = B_{i,j}$ if $(i, j) \\in \\Delta$ and zero otherwise. With appropriately chosen thresholds $d_0$ and $d_1$, we expect most of the observed entries in $A$ to be correct and, as a result, $E$ to be a sparse matrix. To reconstruct the perfect similarity matrix $A^*$ from $A$, following the matrix completion theory [3], we solve the following optimization problem\n\n$\\min_{\\hat{A}, E}\\ |\\hat{A}|_* + C|E|_1 \\quad \\text{s.t.}\\ P_{\\Delta}(\\hat{A} + E) = P_{\\Delta}(A),$   (2)\n\nwhere $|\\hat{A}|_*$ is the nuclear norm of matrix $\\hat{A}$ and $|E|_1 = \\sum_{i,j} |E_{i,j}|$ is the $\\ell_1$ norm of $E$. Using the facts that $E$ is a sparse matrix and $\\hat{A}$ is of low rank [14], under the two assumptions made in [25], with a high probability we have $A^* = \\hat{A}$, where $\\hat{A}$ is the optimal solution for (2). For completeness, we include the theoretical result for the problem in (2) in the supplementary document.\nDistance metric learning step. This step learns a distance metric from the completed similarity matrix $\\hat{A}$. A common problem shared by most distance metric learning algorithms is their high computational cost due to the constraint that a distance metric has to be positive semi-definite.
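The completion problem (2) above — a low-rank term plus a sparse error term under an observed-entry constraint — can be approximately solved with a standard ADMM scheme built from the usual nuclear-norm and $\ell_1$ proximal operators. This is a minimal sketch, not the authors' solver; `complete_similarity`, `mask`, and the parameter defaults are illustrative assumptions.

```python
import numpy as np

def shrink(X, tau):
    """Soft-thresholding: proximal operator of tau * |.|_1."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding: proximal operator of tau * |.|_*."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def complete_similarity(A, mask, C=None, mu=0.25, n_iter=500):
    """Approximately solve min |L|_* + C|E|_1 s.t. P(L + E) = P(A) via ADMM.

    Entries off the mask are unconstrained: E simply absorbs the residual
    there, so only the observed entries tie L to the data.
    """
    n = A.shape[0]
    C = C if C is not None else 1.0 / np.sqrt(n)
    D = np.where(mask, A, 0.0)
    L = np.zeros_like(D)
    E = np.zeros_like(D)
    Y = np.zeros_like(D)                            # scaled dual variable
    for _ in range(n_iter):
        L = svt(D - E + Y / mu, 1.0 / mu)           # low-rank update
        R = D - L + Y / mu
        E = np.where(mask, shrink(R, C / mu), R)    # sparse / free residual
        Y += mu * (D - L - E)                       # dual ascent
    return L

# Toy usage: two clusters of 8, two flipped labels and one hidden pair.
n = 16
y = np.array([0] * 8 + [1] * 8)
A_star = (y[:, None] == y[None, :]).astype(float)   # ideal Y Y^T
A = A_star.copy()
A[0, 9] = A[9, 0] = 1.0                             # flipped (noisy) pair
A[3, 12] = A[12, 3] = 1.0                           # another flipped pair
mask = np.ones((n, n), dtype=bool)
mask[1, 10] = mask[10, 1] = False                   # unobserved entries
L_hat = complete_similarity(A, mask)
```

On this toy input the recovered `L_hat` is close to the ideal block matrix: the flipped pairs migrate into the sparse component and the hidden entries are filled in from the low-rank structure.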
In this study, we develop an efficient algorithm for distance metric learning that does not have to deal with the positive semi-definite constraint. Our algorithm is based on the key observation that, with a high probability, the completed similarity matrix $\\hat{A}$ is positive semi-definite. This is because, according to Theorem 1 of [25], with a probability at least $1 - n^{-3}$, $\\hat{A} = Y Y^{\\top}$, where $Y \\in \\{0, 1\\}^{n \\times r}$ is the true cluster assignment. This property guarantees the resulting distance metric to be positive semi-definite.\nThe proposed distance metric learning algorithm is based on a standard regression algorithm [15]. Given the similarity matrix $\\hat{A}$, the optimal distance metric $M$ is given by the regression problem\n\n$\\min_{M \\in \\mathbb{R}^{d \\times d}}\\ \\hat{L}(M) = \\sum_{i,j=1}^{n} (\\hat{x}_i^{\\top} M \\hat{x}_j - \\hat{A}_{i,j})^2 = |\\hat{X}^{\\top} M \\hat{X} - \\hat{A}|_F^2$   (3)\n\nwhere $\\hat{x}_i$ is the feature vector for the sampled object $\\hat{O}_i$ and $\\hat{X} = (\\hat{x}_1, \\ldots, \\hat{x}_n)$. The optimal solution to (3), denoted by $\\hat{M}$, is given by\n\n$\\hat{M} = (\\hat{X}\\hat{X}^{\\top})^{-1} \\hat{X} \\hat{A} \\hat{X}^{\\top} (\\hat{X}\\hat{X}^{\\top})^{-1}$   (4)\n\nwhere $Z^{-1}$ is the pseudo-inverse of $Z$. It is straightforward to verify that $\\hat{M} \\succeq 0$ if $\\hat{A} \\succeq 0$.\nDirectly using the solution in (4) could result in overfitting the similarity matrix $\\hat{A}$ because of the potential singularity of $\\hat{X}\\hat{X}^{\\top}$. We address this challenge by a smoothing technique, i.e.,\n\n$\\hat{M}_s = (\\hat{X}\\hat{X}^{\\top} + \\lambda n I)^{-1} \\hat{X} \\hat{A} \\hat{X}^{\\top} (\\hat{X}\\hat{X}^{\\top} + \\lambda n I)^{-1}$   (5)\n\nwhere $I$ is the identity matrix of size $d \\times d$ and $\\lambda > 0$ is a smoothing parameter used to address the overfitting and the curse of dimensionality. Note that the computation in (5) can be simplified by expressing $\\hat{M}_s$ in terms of the singular values and singular vectors of $\\hat{X}$. We omit the details due to the space constraints.\nWe now state the theoretical property of $\\hat{M}_s$. Let $A(O_i, O_j)$ be the perfect similarity that outputs 1 when $O_i$ and $O_j$ belong to the same cluster and zero otherwise. It is straightforward to see that $A(O_i, O_j) = y_i^{\\top} y_j$, where $y_i \\in \\{0, 1\\}^r$ is the cluster assignment for object $O_i$. To learn an ideal distance metric from the perfect similarity measure $A(O_i, O_j)$, we generalize the regression problem in (3) as follows\n\n$\\min_{M \\in \\mathbb{R}^{d \\times d}}\\ L(M) = E_{x_i, x_j}\\left[(x_i^{\\top} M x_j - A(O_i, O_j))^2\\right]$   (6)\n\nThe solution to (6) is given by $M = C_X^{-1} B B^{\\top} C_X^{-1}$, where $C_X = E_{x_i}[x_i x_i^{\\top}]$ and $B = E_{x_i}[x_i y_i^{\\top}]$. Let $M_s$ be the smoothed version of the ideal distance metric $M$, i.e., $M_s = (C_X + \\lambda I)^{-1} B B^{\\top} (C_X + \\lambda I)^{-1}$. The following theorem shows that, with a high probability, the difference between $\\hat{M}_s$ and $M_s$ is small if both $\\lambda$ and $n$ are not too small.\nTheorem 1. Assume $|x|_2 \\leq 1$ for the feature representation of any object, and assume the conditions in Theorem 1 of [25] hold. Then, with a probability $1 - 3n^{-3}$, we have\n\n$|M_s - \\hat{M}_s|_2 = O\\left(\\frac{\\ln n}{\\lambda^2 \\sqrt{n}}\\right)$\n\nwhere $|Z|_2$ stands for the spectral norm of matrix $Z$.\nThe detailed proof can be found in the supplementary materials. Given the learned distance metric $\\hat{M}_s$, we construct a similarity matrix $S = X^{\\top} \\hat{M}_s X$ and then apply a spectral clustering algorithm [18] to compute the final data partition of the $N$ objects.\n\n3 Experiments\n\nIn this section, we demonstrate empirically that the proposed semi-crowdsourced clustering algorithm is both effective and efficient.\n\n3.1 Data Sets, Baselines, and Parameter Settings\n\nData Sets. Two real-world image data sets are used in our experiments: (i) the ImageNet data set, a subset of the larger ImageNet database [6]; the subset contains 6,408 images belonging to 7 categories: tractor, horse cart, bench, blackberry, violin, saxophone, and hammer. (ii) The PASCAL 07 data set, a subset of the PASCAL Visual Object Classes Challenge 2007 database [7]; the subset contains 2,989 images belonging to five classes: car, dog, chair, cat and bird. We chose these specific image categories because they yield relatively low classification performance in the ImageNet competition and the PASCAL VOC Challenge, indicating that it could be difficult to cluster these images using low-level features without side information. The image features for these data sets were downloaded from the homepages of the ImageNet database^2 and the research group of Learning and Recognition in Vision (LEAR)^3, respectively.\nTo perform crowdlabeling, we follow [25] and ask human workers to annotate images with keywords of their choice in each HIT.
A total of 249 and 332 workers were employed using the Ama-\nzon\u2019s Mechanical Turk [13] to annotate images from ImageNet and PASCAL datasets, respectively.\nOn average, each image is annotated by \ufb01ve different workers, with three keywords from each\nindividual worker. For every HIT, the pairwise similarity between two images (i.e. Ak\ni,j used in\nSection 2.1) is set to 1 if the two images share at least one common annotated keyword and zero,\notherwise 4.\nBaselines. Two baseline methods are used as reference points in our study: (a) the Base method\nthat clusters images directly using image features without distance metric learning, and (b) the Raw\n\nmethod that runs the proposed algorithm against the average similarity matrix (cid:101)A without \ufb01ltering\n\nand matrix completion steps. The comparison to the Base method allows us to examine the effect of\ndistance metric learning in semi-crowdsourced clustering, and the comparison to the Raw method\nreveals the effect of \ufb01ltering and matrix completion steps in distance metric learning.\nWe compare the proposed algorithm for distance metric learning to the following \ufb01ve state-of-the-art\ndistance metric learning algorithms: (a) GDM, the global distance metric learning algorithm [23],\n(b) RCA, the relevant component analysis [1], (c) DCA, the discriminative component analysis [11],\n(d) ITML, the information theoretic metric learning algorithm [5], and (e) LMNN, the large margin\nnearest neighbor classi\ufb01er [20]. Some of the other state-of-the-art distance metric learning algo-\nrithms (e.g. the neighborhood components analysis (NCA) [9]) were excluded from the comparison\nbecause they can only work with class assignments, instead of pairwise similarities, and therefore\nare not applicable in our case. The code for the baseline algorithms was provided by their respective\nauthors (In LMNN, Principal Component Analysis (PCA) is used at \ufb01rst to reduce the data to lower\ndimensions). 
For a fair comparison, all distance metric learning algorithms are applied to the pair-\n\nwise constraints derived from (cid:98)A, the n \u00d7 n pairwise similarity matrix reconstructed by the matrix\n\ncompletion algorithm. We refer to the proposed distance metric learning algorithm as Regression\nbased Distance Metric Learning, or RDML for short, and the proposed semi-crowdsourced clus-\ntering algorithm as Semi-Crowd.\nParameter Settings. Two criteria are used in determining the values for d0 and d1 in (1). First,\nd0 (d1) should be small (large) enough to ensure that most of the retained pairwise similarities are\nconsistent with the cluster assignments. Second, d0 (d1) should be large (small) enough to obtain a\nsuf\ufb01cient number of observed entries in the partially observed matrix A. For both data sets, we set\nd0 to 0 and d1 to 0.8. We follow the heuristic proposed in [25] to determine the parameter C in (2),\nwhich is selected to generate balanced clustering results. Parameter \u03bb in (5) is set to 1. We varied \u03bb\nfrom 0.5 to 5 and found that the clustering results essentially remain unchanged.\nEvaluation. Normalized mutual information (NMI for short) [4] is used to measure the coherence\nbetween the inferred clustering and the ground truth categorization. The number of sampled images\nis varied from 100, 300, 600 to 1, 000. All the experiments are performed on a PC with Intel Xeon\n2.40 GHz processor and 16.0 GB of main memory. Each experiment is repeated \ufb01ve times, and the\nperformance averaged over the \ufb01ve trials is reported.\n\n2http://www.image-net.org/download-features\n3http://lear.inrialpes.fr/people/guillaumin/data.php\n4We tried several other similarity measures (e.g. cosine similarity measure and tf.idf weighting) and found\n\nthat none of them yielded better performance than the simple similarity measure used in this work\n\n6\n\n\f(a) ImageNet data set\n(b) PASCAL 07 data set\nFigure 3: NMI vs. no. 
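The NMI score used for evaluation can be computed directly from contingency counts. This small sketch (not the evaluation code used in the paper) follows the standard definition $\mathrm{NMI}(U, V) = I(U; V) / \sqrt{H(U) H(V)}$; the function name `nmi` is illustrative.

```python
import numpy as np
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two flat cluster labelings."""
    n = len(labels_a)
    pa = Counter(labels_a)                   # cluster sizes in partition A
    pb = Counter(labels_b)                   # cluster sizes in partition B
    pab = Counter(zip(labels_a, labels_b))   # joint contingency counts
    mi = sum((c / n) * np.log((c * n) / (pa[a] * pb[b]))
             for (a, b), c in pab.items())
    ha = -sum((c / n) * np.log(c / n) for c in pa.values())
    hb = -sum((c / n) * np.log(c / n) for c in pb.values())
    return mi / max(np.sqrt(ha * hb), 1e-12)

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # a pure relabeling scores 1.0
```

NMI is invariant to permutations of the cluster labels, which is why it is a natural score for comparing an inferred clustering against a ground-truth categorization.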
of sampled images (n) used in crowdlabeling.\n\n(a) Two images incorrectly placed in dif-\nferent clusters by the Base method (similari-\nty 0.16) but correctly grouped into the same\ncluster by the proposed method (similarity\n0.66).\n\n(b) Two images incorrectly placed in d-\nifferent clusters by the Base method (sim-\nilarity 0.31) but correctly grouped into\nthe same cluster by the proposed method\n(similarity 0.85)\n\n(c) Two images incorrectly grouped into the\nsame cluster by the Base method (similarity\n0.72) but correctly clustered to different clus-\nters by the proposed method (similarity 0.22)\n\nFigure 4: Sample image pairs that are incorrectly clustered by the Base method but correctly clustered by the\n\nproposed method (the similarity of our method is based on the normalized distance metric(cid:99)Ms).\n\n3.2 Experimental Results\n\nFirst, we examine the effect of distance metric learning algorithm on semi-crowdsourced clustering.\nFigure 3 compares the clustering performance with six different metric learning algorithms with\nthat of the Base method that does not learn a distance metric. We observed that four of the distance\nmetric learning algorithms (i.e. GDM, ITML, LMNN and the proposed RDML) outperform the\nBase method, while RCA and DCA fail to improve the clustering performance of Base. We con-\njecture that the failure of RCA and DCA methods is due to their sensitivity to the noisy pairwise\nsimilarities. In fact, RCA and DCA can yield better performance than the Base method if all the\npairwise similarities are consistent with the cluster assignments. Compared to all the baseline dis-\ntance metric learning algorithms, RDML, the proposed distance metric learning algorithm, yields\nthe best clustering results for both the data sets and for all values of n (i.e. the number of anno-\ntated images) considered here. Furthermore, the performance of RDML gradually stabilizes as the\nnumber of sampled images increases. 
This is consistent with our theoretical analysis in Theorem 1,\nand implies that only a modest number of annotated images is needed by the proposed algorithm to\nlearn an appropriate distance metric. This observation is particularly useful for crowdclustering as\nit is expensive to reliably label a very large number of images. Figure 4 shows some example image\npairs for which the Base method fails to make correct cluster assignments, but the proposed RDML\nmethod successfully corrects these mistakes with the learned distance metric.\nOur next experiment evaluates the impact of \ufb01ltering and matrix completion steps. In Figure 3,\nwe compare the clustering results of the proposed algorithm for semi-crowdsourced clustering\n(i.e. Filtering+Matrix-Completion+RDML) to the Raw method that runs the proposed distance met-\nric algorithm RDML without the \ufb01ltering and matrix completion steps. Based on these experiments,\nwe can make the following observations: (i) the proposed distance metric learning algorithms per-\nforms better than the Raw method, particularly when the number of annotated images is small; (ii)\nthe gap between the proposed semi-crowdsourced clustering method and the Raw method decreases\nas the sample size increases. These results indicate the importance of \ufb01ltering and matrix comple-\ntion steps for the crowdsourced data in semi-crowdsourced clustering. Finally, it is interesting to\nobserve that the Raw method still outperforms all the baseline methods, which further veri\ufb01es the\neffectiveness of the proposed algorithm for distance metric learning.\nFinally, we evaluate the computational ef\ufb01ciency of the proposed distance metric learning algorithm.\nTable 1 shows that the proposed distance metric learning algorithm is signi\ufb01cantly more ef\ufb01cient\nthan the baseline approaches evaluated here. 
The last row of Table 1 indicates the run time for the matrix completion step. Since all the distance metric learning algorithms are applied to the similarity matrix recovered by the matrix completion algorithm, the computational cost of matrix completion is shared by all the distance metric learning algorithms used in our evaluation. We observe that the matrix completion step, particularly for large sample sizes, is computationally demanding, a problem that will be investigated in our future work.\n\nTable 1: CPU time (in seconds) for learning the distance metrics.\n\nCPU time (s)       |      ImageNet Data Set       |     PASCAL 07 Data Set\nSample size (n)    |   100    300    600   1,000  |   100    300    600   1,000\nRDML (proposed)    |   4.2    6.3    8.0    11.2  |  27.4   34.2   41.7    47.3\nGDM [23]           | 11384  14706  18140   25155  | 26346  36795  44237   53468\nLMNN [20]          |  59.8    157    330     629  |  55.1    124    277     527\nITML [5]           |  2128   2376   2692    3081  |  5311   5721   6104    6653\nDCA [11]           |   8.5    9.2   14.5    20.7  |  51.2   64.1   72.7    82.3\nRCA [1]            |   9.7   13.5   18.6    23.6  |  71.4   92.7    103     122\nMatrix Completion  |  12.4   74.2    536    1916  |  12.8   86.6    615    1873\n\n4 Related Work and Discussion\n\nCrowdclustering was first proposed in [10]. It divided the task of clustering a collection of images into a number of human intelligence tasks (or HITs). In each HIT, a small subset of images is randomly sampled from the collection, and a worker is asked to cluster the subset of images into multiple groups. By using a large number of HITs, the authors ensure that every image in the collection is included in at least one HIT. In [25], the authors extend the definition of HITs for crowdclustering by asking workers to annotate images by keywords and then derive pairwise similarities between images based on the commonality of annotated keywords. A major limitation of both these studies, as pointed out earlier, is that they can only cluster images that have been manually annotated.
Although the matrix completion technique was \ufb01rst proposed for crowdclustering in [25],\nit had a different goal from this work. In [25], matrix completion was used to estimate the similarity\nmatrix, while the proposed approach uses matrix completion to estimate a distance metric, so that\ncrowdsourced labels can be generalized to cluster those images which were not annotated during\ncrowdsourcing.\nOur work is closely related to distance metric learning that learns a distance metric consistent with\na given subset of pairwise similarities/constraints [24]. Although many studies on distance metric\nlearning have been reported, only a few address the challenge of learning a reliable distance metric\nfrom noisy pairwise constraints [12, 22]. One limitation of these earlier studies is that they can\nonly work with a relatively small number (typically less than 30%) of noisy pairwise constraints.\nIn contrast, in semi-crowdsourced clustering, we expect that a signi\ufb01cantly larger percentage of\npairwise similarities are inconsistent with the true cluster assignments (as many as 80% [25]).\nOne limitation of distance metric learning is that it is restricted to a linear similarity function. Kernel\nlearning generalizes distance metric learning to a nonlinear similarity function by mapping each\ndata point to a high dimensional space through a kernel function [16]. We plan to learn a kernel\nbased similarity function from a subset of manually annotated objects. Besides distance metric\nlearning, an alternative approach to incorporate the manual annotations into the clustering process\nis constrained clustering (or semi-supervised clustering) [2]. Compared to distance metric learning,\nconstrained clustering can be computationally more expensive. 
Unlike distance metric learning, which learns a distance metric from pairwise constraints only once and applies the learned metric to cluster any set of objects, a constrained clustering algorithm has to be rerun whenever a new set of objects needs to be clustered. To exploit the strength of constrained clustering algorithms, we plan to explore hybrid approaches that effectively combine distance metric learning with constrained clustering for more accurate and efficient semi-crowdsourced clustering.

Acknowledgments

This work was supported in part by the National Science Foundation (IIS-0643494) and the Office of Naval Research (Award nos. N00014-12-1-0431, N00014-11-1-0100, N00014-12-1-0522, and N00014-09-1-0663).

References

[1] Aharon Bar-Hillel, Tomer Hertz, Noam Shental, and Daphna Weinshall. Learning a Mahalanobis metric from equivalence constraints. JMLR, 2005.
[2] Sugato Basu, Ian Davidson, and Kiri Wagstaff. Constrained Clustering: Advances in Algorithms, Theory, and Applications. Chapman & Hall/CRC, 2008.
[3] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
[4] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (2nd ed.). Wiley, 2006.
[5] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML, pages 209–216, 2007.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[8] L. Fei-Fei and P. Perona.
A Bayesian hierarchical model for learning natural scene categories. In CVPR, pages 524–531, 2005.
[9] Jacob Goldberger, Sam T. Roweis, Geoffrey E. Hinton, and Ruslan Salakhutdinov. Neighbourhood components analysis. In NIPS, 2004.
[10] R. Gomes, P. Welinder, A. Krause, and P. Perona. Crowdclustering. In NIPS, 2011.
[11] S. C. H. Hoi, W. Liu, M. R. Lyu, and W.-Y. Ma. Learning distance metrics with contextual constraints for image retrieval. In CVPR, pages 2072–2078, 2006.
[12] Kaizhu Huang, Rong Jin, Zenglin Xu, and Cheng-Lin Liu. Robust metric learning by smooth optimization. In UAI, 2010.
[13] Panagiotis G. Ipeirotis. Analyzing the Amazon Mechanical Turk marketplace. ACM Crossroads, 17(2):16–21, 2010.
[14] Ali Jalali, Yudong Chen, Sujay Sanghavi, and Huan Xu. Clustering partially observed graphs via convex optimization. In ICML, pages 1001–1008, 2011.
[15] D. C. Montgomery, E. A. Peck, and G. G. Vining. Introduction to Linear Regression Analysis, volume 49. John Wiley & Sons, 2007.
[16] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.
[17] L. Seneviratne and E. Izquierdo. Image annotation through gaming. In Proceedings of the 2nd K-Space PhD Jamboree Workshop, 2008.
[18] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. PAMI, 2000.
[19] Omer Tamuz, Ce Liu, Serge Belongie, Ohad Shamir, and Adam Kalai. Adaptively learning the crowd kernel. In ICML, 2011.
[20] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS, 2006.
[21] P. Welinder, S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds. In NIPS, 2010.
[22] Lei Wu, Steven C. H. Hoi, Rong Jin, Jianke Zhu, and Nenghai Yu.
Distance metric learning from uncertain side information for automated photo tagging. ACM TIST, 2011.
[23] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In NIPS, 2002.
[24] Liu Yang and Rong Jin. Distance metric learning: A comprehensive survey. Technical report, Department of Computer Science and Engineering, Michigan State University, 2006.
[25] Jinfeng Yi, Rong Jin, Anil K. Jain, and Shaili Jain. Crowdclustering with sparse pairwise labels: A matrix completion approach. In AAAI Workshop on Human Computation, 2012.