{"title": "Estimating Jaccard Index with Missing Observations: A Matrix Calibration Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 2620, "page_last": 2628, "abstract": "The Jaccard index is a standard statistics for comparing the pairwise similarity between data samples. This paper investigates the problem of estimating a Jaccard index matrix when there are missing observations in data samples. Starting from a Jaccard index matrix approximated from the incomplete data, our method calibrates the matrix to meet the requirement of positive semi-definiteness and other constraints, through a simple alternating projection algorithm. Compared with conventional approaches that estimate the similarity matrix based on the imputed data, our method has a strong advantage in that the calibrated matrix is guaranteed to be closer to the unknown ground truth in the Frobenius norm than the un-calibrated matrix (except in special cases they are identical). We carried out a series of empirical experiments and the results confirmed our theoretical justification. The evaluation also reported significantly improved results in real learning tasks on benchmarked datasets.", "full_text": "Estimating Jaccard Index with Missing Observations:\n\nA Matrix Calibration Approach\n\nWenye Li\n\nMacao Polytechnic Institute\n\nMacao SAR, China\n\nwyli@ipm.edu.mo\n\nAbstract\n\nThe Jaccard index is a standard statistics for comparing the pairwise similarity be-\ntween data samples. This paper investigates the problem of estimating a Jaccard\nindex matrix when there are missing observations in data samples. Starting from\na Jaccard index matrix approximated from the incomplete data, our method cali-\nbrates the matrix to meet the requirement of positive semi-de\ufb01niteness and other\nconstraints, through a simple alternating projection algorithm. 
Compared with conventional approaches that estimate the similarity matrix based on the imputed data, our method has a strong advantage in that the calibrated matrix is guaranteed to be closer to the unknown ground truth in the Frobenius norm than the un-calibrated matrix (except in special cases where they are identical). We carried out a series of empirical experiments and the results confirmed our theoretical justification. The evaluation also reported significantly improved results in real learning tasks on benchmark datasets.

1 Introduction

A critical task in data analysis is to determine how similar two data samples are. Applications arise in many science and engineering disciplines. For example, in the statistical and computing sciences, similarity analysis lays a foundation for cluster analysis, pattern classification, image analysis and recommender systems [15, 8, 17].

A variety of similarity models have been established for different types of data. When data samples can be represented as algebraic vectors, popular choices include the cosine similarity model, the linear kernel model, and so on [24, 25]. When each vector element takes a value of zero or one, the Jaccard index model is routinely applied; it measures the similarity as the ratio of the number of unique elements common to two samples against the total number of unique elements in either of them [14, 23].

Despite its wide application, the Jaccard index model faces a non-trivial challenge when data samples are not fully observed. As a treatment, imputation approaches may be applied, which replace the missing observations with substituted values and then calculate the Jaccard index based on the imputed data.
Unfortunately, with a large portion of missing observations, imputing data samples often becomes unreliable or even infeasible, as evidenced in our evaluation.

Instead of trying to fill in the missing values, this paper investigates a completely different approach based on matrix calibration. Starting from an approximate Jaccard index matrix that is estimated from incomplete samples, the proposed method calibrates the matrix to meet the requirement of positive semi-definiteness and other constraints. The calibration procedure is carried out with a simple yet flexible alternating projection algorithm.

The proposed method has a strong theoretical advantage. The calibrated matrix is guaranteed to be better than, or at least identical to (in special cases), the un-calibrated matrix in terms of a shorter Frobenius distance to the true Jaccard index matrix, which was verified empirically as well. Besides, our evaluation of the method also reported improved results in learning applications, and the improvement was especially significant with a high portion of missing values.

A note on notation. Throughout the discussion, a data sample, Ai (1 ≤ i ≤ n), is treated as a set of features. Let F = {f1, · · · , fd} be the set of all possible features. Without causing ambiguity, Ai also represents a binary-valued vector. If the j-th (1 ≤ j ≤ d) element of vector Ai is one, it means fj ∈ Ai (feature fj belongs to sample Ai); if the element is zero, fj ∉ Ai; if the element is marked as missing, it remains unknown whether feature fj belongs to sample Ai or not.

2 Background

2.1 The Jaccard index

The Jaccard index is a commonly used statistical indicator for measuring the pairwise similarity [14, 23].
For two nonempty and finite sets Ai and Aj, it is defined as the ratio of the number of elements in their intersection against the number of elements in their union:

J*ij = |Ai ∩ Aj| / |Ai ∪ Aj|

where |·| denotes the cardinality of a set.

The Jaccard index has a value of 0 when the two sets have no elements in common, 1 when they have exactly the same elements, and a value strictly between 0 and 1 otherwise. The two sets are more similar (have more common elements) when the value gets closer to 1.

For n sets A1, · · · , An (n ≥ 2), the Jaccard index matrix is defined as the n × n matrix J* = {J*ij} (i, j = 1, · · · , n). The matrix is symmetric and all its diagonal elements are 1.

2.2 Handling missing observations

When data samples are fully observed, the accurate Jaccard index can be obtained trivially by enumerating the intersection and the union of each pair of samples if both the number of samples and the number of features are small. For samples with a large number of features, the index can often be approximated by MinHash and related methods [5, 18], which avoid the explicit counting of the intersection and the union of the two sets.

When data samples are not fully observed, however, obtaining the accurate Jaccard index generally becomes infeasible. One naïve approximation is to ignore the features with missing values: only those features that have no missing values in any sample are used to calculate the Jaccard index. Obviously, for a large dataset with missing-at-random features, it is very likely that this method will throw away all features and therefore does not work at all.

The mainstream work tries to replace the missing observations with substituted values, and then calculates the Jaccard index based on the imputed data.
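Before turning to imputation, the fully observed computations described above, exact enumeration and its MinHash approximation, can be sketched in a few lines of Python (the sample sets, the affine hash construction and the signature length are illustrative choices, not from the paper):

```python
import random
import zlib

def jaccard(a, b):
    """Exact Jaccard index J = |A ∩ B| / |A ∪ B| of two nonempty finite sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# k random affine hash functions over a large prime field (one common construction)
P = 2_147_483_647
random.seed(0)
hashers = [(random.randrange(1, P), random.randrange(P)) for _ in range(512)]

def signature(s):
    """MinHash signature: the minimum hash value of the set under each hash function."""
    keys = [zlib.crc32(x.encode()) for x in s]
    return [min((a * k + b) % P for k in keys) for a, b in hashers]

def minhash_estimate(s, t):
    """The fraction of agreeing signature entries estimates the Jaccard index."""
    sig_s, sig_t = signature(s), signature(t)
    return sum(u == v for u, v in zip(sig_s, sig_t)) / len(sig_s)

A = {"f1", "f2", "f3", "f5"}
B = {"f1", "f3", "f4", "f5", "f6"}
print(jaccard(A, B))             # 3 shared features out of 6 distinct -> 0.5
print(minhash_estimate(A, B))    # concentrates around 0.5 as the signature grows
```

The MinHash signatures never touch the intersection or union explicitly, which is why the scheme scales to sets with very many features.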
Several simple approaches, including the zero, median and k-nearest neighbors (kNN) methods, are popularly used. A missing element may be set to zero, often implying that the corresponding feature does not exist in a sample. It can also be set to the median value (or the mean value) of the feature over all samples, or sometimes over a number of nearest neighboring instances.

A more systematic imputation framework is based on the classical expectation maximization (EM) algorithm [6], which generalizes maximum likelihood estimation to the case of incomplete data. Assuming the existence of unobserved latent variables, the algorithm alternates between the expectation step and the maximization step, and finds maximum likelihood or maximum a posteriori estimates of the unobserved variables. In practice, the imputation is often carried out by iterating between learning a mixture of clusters of the filled data and re-filling missing values using cluster means, weighted by the posterior probability that a cluster generates the samples [11].

3 Solution

Our work investigates the Jaccard index matrix estimation problem for incomplete data. Instead of throwing away the unobserved features or imputing the missing values, a completely different solution based on matrix calibration is designed.

3.1 Initial approximation

For a sample Ai, denote by Oi+ the set of features that are known to be in Ai, and denote by Oi− the set of features that are known to be not in Ai. Let Oi = Oi+ ∪ Oi−. If Oi = F, Ai is fully observed without missing values; otherwise, Ai is not fully observed with missing values.
The complement of Oi with respect to F, denoted by Ōi, gives Ai's unknown features, i.e., its missing values.

For two samples Ai and Aj with missing values, we approximate their Jaccard index by:

J0ij = |(Oi+ ∩ Oj) ∩ (Oj+ ∩ Oi)| / |(Oi+ ∩ Oj) ∪ (Oj+ ∩ Oi)| = |Oi+ ∩ Oj+| / |(Oi+ ∩ Oj) ∪ (Oj+ ∩ Oi)|

Here we assume that each sample has at least one observed feature. It is obvious that J0ij is equal to the ground truth J*ij if the samples are fully observed.

There exists an interval [ℓij, µij] in which the true value J*ij lies, where

ℓij = 1 if i = j, and ℓij = |Oi+ ∩ Oj+| / |F \ (Oi− ∩ Oj−)| otherwise,

and

µij = 1 if i = j, and µij = |F \ (Oi− ∪ Oj−)| / |(Ōi ∩ Ōj) ∪ Oi+ ∪ Oj+| otherwise.

The lower bound ℓij is obtained from the extreme case of setting the missing values in a way that the two sets have the fewest features in their intersection while having the most features in their union. On the contrary, the upper bound µij is obtained from the other extreme. When the samples are fully observed, the interval shrinks to a single point ℓij = µij = J*ij.

3.2 Matrix calibration

Denote by J* = {J*ij} (i, j = 1, · · · , n) the true Jaccard index matrix for a set of data samples {A1, · · · , An}. We have [2]:

Theorem 1.
For a given set of data samples, its Jaccard index matrix J* is positive semi-definite.

For data samples with missing values, the matrix J0 = {J0ij} (i, j = 1, · · · , n) often loses positive semi-definiteness. Nevertheless, it can be calibrated to restore the property by seeking an n × n matrix J = {Jij} that minimizes:

L0(J) = ‖J − J0‖²F

subject to the constraints:

J ⪰ 0, and ℓij ≤ Jij ≤ µij (1 ≤ i, j ≤ n),

where J ⪰ 0 requires J to be positive semi-definite, and ‖·‖F denotes the Frobenius norm of a matrix, with ‖J‖²F = Σij J²ij.

Let Mn be the set of n × n symmetric matrices. The feasible region defined by the constraints, denoted by R, is a nonempty closed and convex subset of Mn. Following standard results in optimization theory [20, 3, 10], the problem of minimizing L0(J) is convex. Denote by PR the projection onto R. The unique solution is given by the projection of J0 onto R: J0R = PR(J0).

For J0R, we have:

Theorem 2. ‖J* − J0R‖²F ≤ ‖J* − J0‖²F. The equality holds iff J0 ∈ R, i.e., J0 = J0R.

Proof. Define an inner product on Mn that induces the Frobenius norm:

⟨X, Y⟩ = trace(XᵀY), for X, Y ∈ Mn.

Then

‖J* − J0‖²F = ‖(J* − J0R) − (J0 − J0R)‖²F
= ‖J* − J0R‖²F + ‖J0 − J0R‖²F − 2⟨J* − J0R, J0 − J0R⟩
≥ ‖J* − J0R‖²F − 2⟨J* − J0R, J0 − J0R⟩
≥ ‖J* − J0R‖²F

The second "≥" holds due to Kolmogorov's criterion, which states that the projection of J0 onto R, J0R, is unique and characterized by:

J0R ∈ R, and ⟨J − J0R, J0 − J0R⟩ ≤ 0 for all J ∈ R.

The equality holds iff ‖J0 − J0R‖²F = 0 and ⟨J* − J0R, J0 − J0R⟩ = 0, i.e., J0 = J0R.

This key observation shows that projecting J0 onto the feasible region R produces an improved estimate of J*, even though this ground truth matrix remains unknown to us.

3.3 Projection onto subsets

Based on the results in Section 3.2, we seek a minimizer of L0(J) to improve the estimate J0. Define two nonempty closed and convex subsets of Mn:

S = {X | X ∈ Mn, X ⪰ 0}

and

T = {X | X ∈ Mn, ℓij ≤ Xij ≤ µij (1 ≤ i, j ≤ n)}.

Obviously R = S ∩ T. Our minimization problem now becomes finding the projection of J0 onto the intersection of the two sets S and T with respect to the Frobenius norm. This can be done by studying the projection onto the two sets individually.
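The entrywise bounds that define T come from Section 3.1; a small Python sketch of the initial estimate J0ij and the interval [ℓij, µij] (the feature universe and the two partially observed samples are toy choices):

```python
F = set(range(8))  # toy feature universe f0..f7

def initial_and_bounds(pos_i, neg_i, pos_j, neg_j):
    """Initial estimate J0 and the interval [l, u] containing the true Jaccard index.
    pos/neg hold the features known present (O+) / known absent (O-) in each sample."""
    obs_i, obs_j = pos_i | neg_i, pos_j | neg_j            # observed features O_i, O_j
    unk_i, unk_j = F - obs_i, F - obs_j                    # unknown features
    common = pos_i & pos_j
    j0 = len(common) / len((pos_i & obs_j) | (pos_j & obs_i))
    l = len(common) / len(F - (neg_i & neg_j))             # fewest common / largest union
    u = len(F - (neg_i | neg_j)) / len((unk_i & unk_j) | pos_i | pos_j)  # most / smallest
    return j0, l, u

# Sample A: 0,1,2 present; 3,4 absent; 5,6,7 unobserved.
# Sample B: 0,1,5 present; 2,6 absent; 3,4,7 unobserved.
j0, l, u = initial_and_bounds({0, 1, 2}, {3, 4}, {0, 1, 5}, {2, 6})
print(j0, l, u)  # 0.666..., 0.25, 0.8 -- and l <= J* <= u for any completion
```

Both extremes are attainable here: completing the samples as A = {0,1,2,6,7}, B = {0,1,3,4,5} reaches the lower bound 2/8, while A = {0,1,2,5,7}, B = {0,1,5,7} reaches the upper bound 4/5.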
Denote by PS the projection onto S, and by PT the projection onto T. For the projection onto T, a straightforward result based on Kolmogorov's criterion is:

Theorem 3. For a given matrix X ∈ Mn, its projection onto T, XT = PT(X), is given by

(XT)ij = Xij if ℓij ≤ Xij ≤ µij;  (XT)ij = ℓij if Xij < ℓij;  (XT)ij = µij if Xij > µij.

For the projection onto S, a well known result is the following [12, 16, 13]:

Theorem 4. For X ∈ Mn and its spectral decomposition X = UΣUᵀ, where Σ = diag(λ1, · · · , λn), the projection of X onto S is given by XS = PS(X) = UΣ′Uᵀ, where Σ′ = diag(λ′1, · · · , λ′n) and

λ′i = λi if λi ≥ 0, and λ′i = 0 otherwise.

The matrix XS = PS(X) is the positive semi-definite matrix that most closely approximates X with respect to the Frobenius norm.

3.4 Dykstra's algorithm

A classical result on the orthogonal projection onto the intersection of subspaces is von Neumann's alternating projection algorithm. Let H be a Hilbert space with two closed subspaces C1 and C2. The orthogonal projection onto the intersection C1 ∩ C2 can be obtained as the product of the two projections, PC1 PC2, when the two projections commute (PC1 PC2 = PC2 PC1). When they do not commute, von Neumann's work shows that for each x0 ∈ H, the projection of x0 onto the intersection is the limit point of a sequence of alternating projections onto each subspace: limk→∞ (PC2 PC1)^k (x0) = PC1∩C2 (x0).
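A minimal NumPy sketch of the two projections in Theorems 3 and 4 (function names are illustrative):

```python
import numpy as np

def project_box(X, L, U):
    """P_T: clip every entry into its interval [l_ij, u_ij] (Theorem 3)."""
    return np.clip(X, L, U)

def project_psd(X):
    """P_S: zero out the negative eigenvalues in the spectral decomposition (Theorem 4)."""
    lam, Q = np.linalg.eigh((X + X.T) / 2)   # symmetrize, then eigendecompose
    return (Q * np.maximum(lam, 0)) @ Q.T

X = np.array([[1.0, 0.9],
              [0.9, -0.2]])                  # symmetric but indefinite
Xs = project_psd(X)
print(np.linalg.eigvalsh(Xs))                # all eigenvalues now >= 0
```

Each call is a single clip or a single eigendecomposition, which is what makes the projections onto S and T individually cheap.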
The algorithm generalizes to any finite number of subspaces and the projections onto them.

Unfortunately, different from the application in [19], in our problem S and T are not subspaces but convex subsets, and von Neumann's convergence result does not apply: the limit point of the generated sequence may be a non-optimal point.

To handle this difficulty, Dykstra extended von Neumann's work and proposed an algorithm that works with subsets [9]. Consider the case of C = ∩(i=1..r) Ci, where C is nonempty and each Ci is a closed and convex subset of H. Assume that for any x ∈ H, obtaining PC(x) is hard, while obtaining each PCi(x) is easy. Starting from x0 ∈ H, Dykstra's algorithm produces two sequences, the iterates {x^k_i} and the increments {I^k_i}. The two sequences are generated by:

x^k_0 = x^{k−1}_r
x^k_i = PCi(x^k_{i−1} − I^{k−1}_i)
I^k_i = x^k_i − (x^k_{i−1} − I^{k−1}_i)

where i = 1, · · · , r and k = 1, 2, · · · . The initial values are given by x^0_r = x0 and I^0_i = 0.

The sequence {x^k_i} converges to the optimal solution with a theoretical guarantee [9, 10]:

Theorem 5. Let C1, · · · , Cr be closed and convex subsets of a Hilbert space H such that C = ∩(k=1..r) Ck ≠ ∅. For any i = 1, · · · , r and any x0 ∈ H, the sequence {x^k_i} converges strongly to x0C = PC(x0) (i.e., ‖x^k_i − x0C‖ → 0 as k → ∞).

The convergence rate of Dykstra's algorithm for polyhedral sets is linear [7], which coincides with the convergence rate of von Neumann's alternating projection method.

3.5 An iterative method

Based on the discussion in Section 3.4, we have a simple approach, shown in Algorithm 1, that finds the projection of an initial matrix J0 onto the nonempty set R = S ∩ T. Here the projections onto S and T are given by the two theorems in Section 3.3. The algorithm stops when Jk falls into the feasible region or when a maximal number of iterations is reached. For practical implementation, a more robust stopping criterion can be adopted [1].

3.6 Related work

Finding a positive semi-definite matrix that is closest to a given matrix is a well-studied problem in mathematical optimization, and a number of methods have been proposed. The idea of the alternating projection method was first applied in a financial application [13]. The problem can also be phrased as a semi-definite programming (SDP) model [13] and solved via an interior-point method. In the work of [21] and [4], the quasi-Newton method and the projected gradient method were applied to the Lagrangian dual of the original problem, which reported faster results than the SDP formulation.
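For reference alongside these solvers, the simple Dykstra iteration of Sections 3.4-3.5 for our two sets S and T can be sketched in a few lines of NumPy (a simplified illustration under stated assumptions, not the paper's implementation; the stopping rule and iteration cap are arbitrary):

```python
import numpy as np

def calibrate(J0, L, U, max_iter=1000, tol=1e-9):
    """Dykstra-style alternating projections onto S (PSD cone) and T (entrywise box),
    following the update pattern of Algorithm 1."""
    Jt = J0.copy()
    inc_S = np.zeros_like(J0)                 # increment attached to the PSD projection
    inc_T = np.zeros_like(J0)                 # increment attached to the box projection
    for _ in range(max_iter):
        lam, Q = np.linalg.eigh(Jt - inc_S)
        Js = (Q * np.maximum(lam, 0)) @ Q.T   # P_S: drop negative eigenvalues
        inc_S = Js - (Jt - inc_S)
        Jt_new = np.clip(Js - inc_T, L, U)    # P_T: clip into [l_ij, u_ij]
        inc_T = Jt_new - (Js - inc_T)
        if np.linalg.norm(Jt_new - Jt) < tol:
            return Jt_new
        Jt = Jt_new
    return Jt

# Toy 3x3 "Jaccard-like" matrix that is symmetric but indefinite.
J0 = np.array([[1.0, 0.9, 0.1],
               [0.9, 1.0, 0.9],
               [0.1, 0.9, 1.0]])
L = np.eye(3)                                 # force the diagonal to stay 1
U = np.ones((3, 3))
K = calibrate(J0, L, U)
# K lies in the box and is (numerically) positive semi-definite.
```

The increments are what distinguish Dykstra's scheme from plain alternating projection; without them the iterates would converge to a feasible point, but not necessarily to the nearest one.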
An even faster Newton method was developed in [22] by investigating the dual problem, which is unconstrained with a twice continuously differentiable objective function, and has a quadratically convergent solution.

Algorithm 1 Projection onto R = S ∩ T
Require: Initial matrix J0
  k = 0
  J^0_T = J0
  I^0_S = 0
  I^0_T = 0
  while NOT CONVERGENT do
    J^{k+1}_S = PS(J^k_T − I^k_S)
    I^{k+1}_S = J^{k+1}_S − (J^k_T − I^k_S)
    J^{k+1}_T = PT(J^{k+1}_S − I^k_T)
    I^{k+1}_T = J^{k+1}_T − (J^{k+1}_S − I^k_T)
    k = k + 1
  end while
  return J^k = J^k_T

4 Evaluation

To evaluate the performance of the proposed method, four benchmark datasets were used in our experiments.

• MNIST: a grayscale image database of handwritten digits ("0" to "9"). After binarization, each image is represented as a 784-dimensional 0-1 vector.

• USPS: another grayscale image database of handwritten digits. After binarization, each image is represented as a 256-dimensional 0-1 vector.

• PROTEIN: a bioinformatics database with three classes of instances. Each instance is represented as a sparse 357-dimensional 0-1 vector.

• WEBSPAM: a dataset with both spam and non-spam web pages. Each page is represented as a 0-1 vector. The data are highly sparse: on average, one vector has about 4,000 non-zero values out of more than 16 million features.

Our experiments have two objectives. One is to verify the effectiveness of the proposed method in estimating the Jaccard index matrix, by measuring the deviation of the calibrated matrix from the ground truth in the Frobenius norm. The other is to evaluate the performance of the calibrated matrix in general learning applications. The comparison is made against the popular imputation approaches listed in Section 2.2, including the zero, kNN and EM¹ approaches.
(As the median approach gave very similar performance to the zero approach, its results were not reported separately.)

4.1 Jaccard index matrix estimation

The experiment was carried out under various settings. For each dataset, we experimented with 1,000 and 10,000 samples respectively. For each sample, different portions (from 10% to 90%) of the feature values were marked as missing; the values were assumed to be missing at random, and all features had the same probability of being marked.

As mentioned in Section 3, for the proposed calibration approach, an initial Jaccard index matrix was first built from the incomplete data. The matrix was then calibrated to meet the positive semi-definiteness requirement and the lower- and upper-bound requirements. For the imputation approaches, the Jaccard index matrix was calculated directly from the imputed data.

Note that for the kNN approach, we iterated over k from 1 to 5 and collected the best result, which actually overestimates its performance. Under some settings, the results of the EM approach were not available due to its prohibitive computational requirements on our platform.

The results are presented through the comparison of mean square deviations from the ground truth Jaccard index matrix J*.
For an n × n estimated matrix J′, its mean square deviation from J* is defined as the squared Frobenius distance between the two matrices divided by the number of elements, i.e., (1/n²) Σij (J′ij − J*ij)². In addition to the comparison with the popular approaches, the mean square deviation between the un-calibrated matrix J0 and J*, shown as NO_CALIBRATION, is also reported as a baseline.

Figure 1: Mean square deviations from the ground truth on benchmark datasets by different methods. Horizontal: percentages of observed values (from 10% to 90%); vertical: mean square deviations in log-scale. (a)-(d): MNIST, USPS, PROTEIN and WEBSPAM with 1,000 samples; (e)-(h): the same datasets with 10,000 samples. (For better visualization of the results, which are shown in color, the reader is referred to the soft copy of this paper.)

¹ftp://ftp.cs.toronto.edu/pub/zoubin/old/EMcode.tar.Z

Figure 1 shows the results. It can be seen that the calibrated matrices reported the smallest deviation from the ground truth.
The improvement is especially significant when the ratio of observed features is low (i.e., the missing ratio is high). The calibrated matrix is guaranteed to be no worse than the un-calibrated matrix; as evidenced in the results, the imputation approaches offer no such guarantee.

4.2 Supervised learning

Knowing the improved results in reducing the deviation from the ground truth matrix, we would like to further investigate whether this improvement indeed benefits practical applications, specifically in supervised learning.

We applied the calibrated results in nearest neighbor classification tasks. Given a training set of labeled samples, we tried to predict the labels of the samples in the testing set. For each testing sample, its label was determined by the label of the training sample that had the largest Jaccard index value with it.

Similarly, the experiment was carried out with 1,000/10,000 samples and with different portions of missing values, from 10% to 90%. In each run, 90% of the samples were randomly chosen as the training set and the remaining 10% were used as the testing set. The mean and standard deviation of the classification errors over 1,000 runs were reported. As a reference, the results from the ground truth matrix J*, shown as FULLY_OBSERVED, were also included.

Figure 2 shows the results. Again, the matrix calibration method reported evidently improved results over the imputation approaches in most experiments.
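The nearest-neighbor protocol above reduces to a row-wise argmax over the (estimated) similarity matrix; a minimal sketch with a toy matrix and toy labels:

```python
import numpy as np

def nn_predict(J, train_idx, test_idx, labels):
    """1-NN: each test sample takes the label of the training sample
    with the largest Jaccard similarity to it."""
    train_idx = np.asarray(train_idx)
    sub = J[np.ix_(test_idx, train_idx)]          # test-by-train similarity block
    return labels[train_idx[np.argmax(sub, axis=1)]]

J = np.array([[1.0, 0.8, 0.2, 0.7],
              [0.8, 1.0, 0.3, 0.6],
              [0.2, 0.3, 1.0, 0.1],
              [0.7, 0.6, 0.1, 1.0]])
labels = np.array([0, 0, 1, 0])
pred = nn_predict(J, [0, 1, 2], [3], labels)
print(pred)  # sample 3 is most similar to sample 0 -> label 0
```

Because prediction only consumes the similarity matrix, any of the compared estimators (imputed or calibrated) can be plugged in unchanged.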
The improvement verified the benefits brought by the reduced deviation from the true Jaccard index matrix, and therefore justified the usefulness of the proposed method in learning applications.

Figure 2: Classification errors on benchmark datasets by different methods. Horizontal: percentages of observed values (from 10% to 90%); vertical: classification errors. (a)-(d): MNIST, USPS, PROTEIN and WEBSPAM with 1,000 samples; (e)-(h): the same datasets with 10,000 samples. (For better visualization of the results, which are shown in color, the reader is referred to the soft copy of this paper.)

5 Discussion and conclusion

The Jaccard index measures the pairwise similarity between data samples and is routinely used in real applications. Unfortunately, in practice it is non-trivial to estimate the Jaccard index matrix for incomplete data samples.
This paper investigates the problem and proposes a matrix calibration approach that is completely different from the existing methods. Instead of throwing away the unknown features or imputing the missing values, the proposed approach calibrates any approximate Jaccard index matrix by enforcing positive semi-definiteness and the other constraints on the matrix. It is theoretically shown and empirically verified that the approach indeed brings improvement in practical problems.

One point not particularly addressed in this paper is computational complexity. We adopted a simple alternating projection procedure based on Dykstra's algorithm. The computational cost of the algorithm depends heavily on the successive matrix decompositions, which become expensive as the matrix grows. Calibrating a Jaccard index matrix for 1,000 samples finishes in seconds on our platform, while calibrating a matrix for 10,000 samples takes more than an hour. Further investigation of faster solutions is thus necessary for scalability.

There is, however, a simple divide-and-conquer heuristic for calibrating a large matrix. First, divide the matrix into small sub-matrices. Then, calibrate each sub-matrix to meet the constraints. Finally, merge the results. Although the heuristic may not give the optimal result, it is still guaranteed to produce a matrix better than, or identical to, the un-calibrated one. The heuristic runs with high parallel efficiency and easily scales to very large matrices. The detailed discussion is omitted here due to the space limit.

Acknowledgments

The work is supported by The Science and Technology Development Fund (Project No. 006/2014/A), Macao SAR, China.

References

[1] E.G. Birgin and M. Raydan. Robust stopping criteria for Dykstra's algorithm. SIAM Journal on Scientific Computing, 26(4):1405–1414, 2005.

[2] M.
Bouchard, A.L. Jousselme, and P.E. Doré. A proof for the positive definiteness of the Jaccard index matrix. International Journal of Approximate Reasoning, 54(5):615–626, 2013.

[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

[4] S. Boyd and L. Xiao. Least-squares covariance matrix adjustment. SIAM Journal on Matrix Analysis and Applications, 27(2):532–546, 2005.

[5] A.Z. Broder, M. Charikar, A.M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 327–336. ACM, 1998.

[6] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

[7] F. Deutsch. Best Approximation in Inner Product Spaces. Springer, New York, NY, USA, 2001.

[8] R.O. Duda and P.E. Hart. Pattern Classification. John Wiley and Sons, Hoboken, NJ, USA, 2000.

[9] R.L. Dykstra. An algorithm for restricted least squares regression. Journal of the American Statistical Association, 78(384):837–842, 1983.

[10] R. Escalante and M. Raydan. Alternating Projection Methods. SIAM, Philadelphia, PA, USA, 2011.

[11] Z. Ghahramani and M.I. Jordan. Supervised learning from incomplete data via an EM approach. In Advances in Neural Information Processing Systems, volume 6, pages 120–127. Morgan Kaufmann, 1994.

[12] G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, USA, 1996.

[13] N.J. Higham. Computing the nearest correlation matrix: a problem from finance. IMA Journal of Numerical Analysis, 22:329–343, 2002.

[14] P. Jaccard. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37–50, 1912.

[15] A.K. Jain, M.N. Murty, and P.J. Flynn. Data clustering: A review.
ACM Computing Surveys, 31(3):264–323, 1999.

[16] D.L. Knol and J.M.F. ten Berge. Least-squares approximation of an improper correlation matrix by a proper one. Psychometrika, 54(1):53–61, 1989.

[17] J. Leskovec, A. Rajaraman, and J. Ullman. Mining of Massive Datasets. Cambridge University Press, New York, NY, USA, 2014.

[18] P. Li and A.C. König. Theory and applications of b-bit minwise hashing. Communications of the ACM, 54(8):101–109, 2011.

[19] W. Li, K.H. Lee, and K.S. Leung. Large-scale RLSC learning without agony. In Proceedings of the 24th International Conference on Machine Learning, pages 529–536. ACM, 2007.

[20] D.G. Luenberger. Optimization by Vector Space Methods. John Wiley & Sons, New York, NY, USA, 1969.

[21] J. Malick. A dual approach to semidefinite least-squares problems. SIAM Journal on Matrix Analysis and Applications, 26(1):272–284, 2004.

[22] H. Qi and D. Sun. A quadratically convergent Newton method for computing the nearest correlation matrix. SIAM Journal on Matrix Analysis and Applications, 28(2):360–385, 2006.

[23] D.J. Rogers and T.T. Tanimoto. A computer program for classifying plants. Science, 132(3434):1115–1118, 1960.

[24] G. Salton, A. Wong, and C.S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.

[25] B. Schölkopf and A.J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, Cambridge, MA, USA, 2001.