{"title": "Spectral Hashing", "book": "Advances in Neural Information Processing Systems", "page_first": 1753, "page_last": 1760, "abstract": "Semantic hashing seeks compact binary codes of datapoints so that the Hamming distance between codewords correlates with semantic similarity. Hinton et al. used a clever implementation of autoencoders to find such codes. In this paper, we show that the problem of finding a best code for a given dataset is closely related to the problem of graph partitioning and can be shown to be NP hard. By relaxing the original problem, we obtain a spectral method whose solutions are simply a subset of thresholded eigenvectors of the graph Laplacian. By utilizing recent results on convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel datapoint. Taken together, both learning the code and applying it to a novel point are extremely simple. Our experiments show that our codes significantly outperform the state of the art.", "full_text": "Spectral Hashing\n\nYair Weiss1,3, Antonio Torralba1, Rob Fergus2\n\n1CSAIL, MIT, 32 Vassar St., Cambridge, MA 02139; torralba@csail.mit.edu\n2Courant Institute, NYU, 715 Broadway, New York, NY 10003; fergus@cs.nyu.edu\n3School of Computer Science, Hebrew University, 91904, Jerusalem, Israel; yweiss@cs.huji.ac.il\n\nAbstract\n\nSemantic hashing [1] seeks compact binary codes of datapoints so that the Hamming distance between codewords correlates with semantic similarity. In this paper, we show that the problem of finding a best code for a given dataset is closely related to the problem of graph partitioning and can be shown to be NP hard. By relaxing the original problem, we obtain a spectral method whose solutions are simply a subset of thresholded eigenvectors of the graph Laplacian. 
By utilizing recent results on convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel datapoint. Taken together, both learning the code and applying it to a novel point are extremely simple. Our experiments show that our codes outperform the state of the art.\n\n1 Introduction\n\nWith the advent of the Internet, it is now possible to use huge training sets to address challenging tasks in machine learning. As a motivating example, consider the recent work of Torralba et al., who collected a dataset of 80 million images from the Internet [2, 3]. They then used this weakly labeled dataset to perform scene categorization. To categorize a novel image, they simply searched for similar images in the dataset and used the labels of these retrieved images to predict the label of the novel image. A similar approach was used in [4] for scene completion.\n\nAlthough conceptually simple, actually carrying out such methods requires highly efficient ways of (1) storing millions of images in memory and (2) quickly finding similar images to a target image.\n\nSemantic hashing, introduced by Salakhutdinov and Hinton [5], is a clever way of addressing both of these challenges. In semantic hashing, each item in the database is represented by a compact binary code. The code is constructed so that similar items will have similar binary codewords, and there is a simple feedforward network that can calculate the binary code for a novel input. Retrieving similar neighbors is then done simply by retrieving all items with codes within a small Hamming distance of the code for the query. This kind of retrieval can be amazingly fast: millions of queries per second on standard computers. The key for this method to work is to learn a good code for the dataset. 
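The Hamming-ball retrieval just described can be sketched as follows; `hamming_ball_lookup` and the toy table are illustrative names, not code from the paper:

```python
from itertools import combinations

def hamming_ball_lookup(table, code, radius=1):
    """Return all items whose binary code lies within the given Hamming
    radius of `code`.  `table` maps bit-tuples to lists of items."""
    n = len(code)
    hits = list(table.get(code, []))
    # Enumerate every code at distance <= radius by flipping bit subsets.
    for r in range(1, radius + 1):
        for idx in combinations(range(n), r):
            probe = tuple(b ^ 1 if i in idx else b for i, b in enumerate(code))
            hits.extend(table.get(probe, []))
    return hits

# Toy database: three items with 4-bit codes.
table = {}
for item, code in [("a", (0, 0, 1, 1)), ("b", (0, 1, 1, 1)), ("c", (1, 1, 0, 0))]:
    table.setdefault(code, []).append(item)

print(hamming_ball_lookup(table, (0, 0, 1, 1)))  # ['a', 'b']
```

Because the lookup touches only a fixed number of hash buckets rather than the whole database, the cost per query is independent of the number of stored items.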
We need a code that is (1) easily computed for a novel input, (2) requires a small number of bits to code the full dataset and (3) maps similar items to similar binary codewords.\n\nTo simplify the problem, we will assume that the items have already been embedded in a Euclidean space, say R^d, in which Euclidean distance correlates with the desired similarity. The problem of finding such a Euclidean embedding has been addressed in a large number of machine learning algorithms (e.g. [6, 7]). In some cases, domain knowledge can be used to define a good embedding. For example, Torralba et al. [3] found that a 512 dimensional descriptor known as the GIST descriptor gives an embedding where Euclidean distance induces a reasonable similarity function on the items. But simply having a Euclidean embedding does not give us a fast retrieval mechanism.\n\nIf we forget about the requirement of having a small number of bits in the codewords, then it is easy to design a binary code so that items that are close in Euclidean space will map to similar binary codewords. This is the basis of the popular locality sensitive hashing method E2LSH [8]. As shown in [8], if every bit in the code is calculated by a random linear projection followed by a random threshold, then the Hamming distance between codewords will asymptotically approach the Euclidean distance between the items. But in practice this method can lead to very inefficient codes. Figure 1 illustrates the problem on a toy dataset of points uniformly sampled in a two dimensional rectangle. The figure plots the average precision at Hamming distance 1 using an E2LSH encoding. As the number of bits increases the precision improves (and approaches one with many bits), but the rate of convergence can be very slow.\n\nRather than using random projections to define the bits in a code, several authors have pursued machine learning approaches. 
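The random-projection construction just described (one random linear projection and one random threshold per bit) can be sketched as below; `lsh_encoder` is a hypothetical helper, not part of the E2LSH package itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_encoder(dim, n_bits, scale=1.0):
    """Draw random projections and thresholds once, and return an encoder
    mapping points in R^dim to n_bits-bit binary codes."""
    W = rng.normal(size=(n_bits, dim))       # random linear projections
    t = rng.uniform(-scale, scale, n_bits)   # random thresholds
    return lambda X: (X @ W.T > t).astype(np.uint8)

encode = lsh_encoder(dim=2, n_bits=8)
X = rng.uniform(0, 1, size=(5, 2))           # toy 2D points
codes = encode(X)
print(codes.shape)  # (5, 8)
```

Note that nothing here adapts to the data distribution, which is exactly why many bits are needed before the Hamming distance tracks the Euclidean one.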
In [5] the authors used an autoencoder with several hidden layers. The architecture can be thought of as a restricted Boltzmann machine (RBM) in which there are only connections between layers and not within layers. In order to learn 32 bits, the middle layer of the autoencoder has 32 hidden units, and noise was injected during training to encourage these bits to be as binary as possible. This method indeed gives codes that are much more compact than the E2LSH codes. In [9] they used multiple stacked RBMs to learn a non-linear mapping between input vector and code bits. Backpropagation using a Neighborhood Components Analysis (NCA) objective function was used to refine the weights in the network to preserve the neighborhood structure of the input space. Figure 1 shows that the RBM gives much better performance compared to random bits. A simpler machine learning algorithm (Boosting SSC) was pursued in [10], who used adaBoost to classify a pair of input items as similar or nonsimilar. Each weak learner was a decision stump, and the output of all the weak learners on a given input is a binary code. Figure 1 shows that this boosting procedure also works much better than E2LSH codes, although slightly worse than the RBMs1.\n\nThe success of machine learning approaches over LSH is not limited to synthetic data. In [5], RBMs gave several orders of magnitude improvement over LSH in document retrieval tasks. In [3] both RBMs and Boosting were used to learn binary codes for a database of millions of images and were found to outperform LSH. Also, the retrieval speed using these short binary codes was found to be significantly faster than LSH (which was faster than other methods such as KD trees).\n\nThe success of machine learning methods leads us to ask: what is the best code for performing semantic hashing for a given dataset? 
We formalize the requirements for a good code and show that these are equivalent to a particular form of graph partitioning. This shows that even for a single bit, the problem of finding optimal codes is NP hard. On the other hand, the analogy to graph partitioning suggests a relaxed version of the problem that leads to very efficient eigenvector solutions. These eigenvectors are exactly the eigenvectors used in many spectral algorithms including spectral clustering and Laplacian eigenmaps [6, 11]. This leads to a new algorithm, which we call “spectral hashing”, where the bits are calculated by thresholding a subset of eigenvectors of the Laplacian of the similarity graph. By utilizing recent results on convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel datapoint. Taken together, both learning the code and applying it to a novel point are extremely simple. Our experiments show that our codes outperform the state of the art.\n\n1All methods here use the same retrieval algorithm, i.e. semantic hashing. In many applications of LSH and Boosting SSC, a different retrieval algorithm is used whereby the binary code only creates a shortlist and exhaustive search is performed on the shortlist. Such an algorithm is impractical for the scale of data we are considering.\n\n[Figure 1: panels showing the training samples and the partitions induced by each bit for LSH, stumps boosting SSC, and RBM (two hidden layers), together with a plot of the proportion of good neighbors for Hamming distance < 2 as a function of the number of bits.]\n\nFigure 1: Building hash codes to find neighbors. Neighbors are defined as pairs of points in 2D whose Euclidean distance is less than ε. 
The toy dataset is formed by uniformly sampling points in a two dimensional rectangle. The figure plots the average precision (number of neighbors in the original space divided by number of neighbors in a Hamming ball using the hash codes) at Hamming distance ≤ 1 for three methods. The plots on the left show how each method partitions the space to compute the bits to represent each sample. Despite the simplicity of this toy data, the methods still require many bits in order to get good performance.\n\n2 Analysis: what makes a good code\n\nAs mentioned earlier, we seek a code that is (1) easily computed for a novel input, (2) requires a small number of bits to code the full dataset and (3) maps similar items to similar binary codewords. Let us first ignore the first requirement, that codewords be easily computed for a novel input, and search only for a code that is efficient (i.e. requires a small number of bits) and similarity preserving (i.e. maps similar items to similar codewords). For a code to be efficient, we require that each bit has a 50% chance of being one or zero, and that different bits are independent of each other. Among all codes that have this property, we will seek the ones where the average Hamming distance between similar points is minimal.\n\nLet {y_i}, i = 1 … n, be the list of codewords (binary vectors of length k) for n datapoints and let W be the n × n affinity matrix. Since we are assuming the inputs are embedded in R^d so that Euclidean distance correlates with similarity, we will use W(i, j) = exp(−‖x_i − x_j‖²/ε²). Thus the parameter ε defines the distance in R^d which corresponds to similar items. 
Using this notation, the average Hamming distance between similar neighbors can be written ∑_ij W_ij ‖y_i − y_j‖². If we relax the independence assumption and require the bits to be uncorrelated, we obtain the following problem:\n\nminimize: ∑_ij W_ij ‖y_i − y_j‖²   (1)\nsubject to: y_i ∈ {−1, 1}^k\n∑_i y_i = 0\n(1/n) ∑_i y_i y_i^T = I\n\nwhere the constraint ∑_i y_i = 0 requires each bit to fire 50% of the time, and the constraint (1/n) ∑_i y_i y_i^T = I requires the bits to be uncorrelated.\n\nObservation: For a single bit, solving problem 1 is equivalent to balanced graph partitioning and is NP hard.\n\nProof: Consider an undirected graph whose vertices are the datapoints and where the weight between item i and j is given by W(i, j). Consider a code with a single bit. The bit partitions the graph into two equal parts (A, B): vertices where the bit is on and vertices where the bit is off. For a single bit, ∑_ij W_ij ‖y_i − y_j‖² is simply (up to a constant factor) the weight of the edges cut by the partition: cut(A, B) = ∑_{i ∈ A, j ∈ B} W(i, j). 
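As a quick numerical check (not from the paper), the pairwise objective in problem 1 can be verified to equal twice the trace form trace(Y^T (D − W) Y) that the spectral relaxation works with; W, Y and ε follow the notation above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data and Gaussian affinity W(i, j) = exp(-||xi - xj||^2 / eps^2).
X = rng.uniform(0, 1, size=(50, 2))
eps = 0.3
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-D2 / eps**2)

# Random +/-1 codes with k = 4 bits per point.
Y = rng.choice([-1.0, 1.0], size=(50, 4))

# Objective of problem (1), written both ways.
obj_pairs = sum(W[i, j] * ((Y[i] - Y[j]) ** 2).sum()
                for i in range(50) for j in range(50))
L = np.diag(W.sum(1)) - W              # graph Laplacian D - W
obj_trace = 2 * np.trace(Y.T @ L @ Y)  # factor 2 comes out of the algebra

assert np.isclose(obj_pairs, obj_trace)
```

Expanding ‖y_i − y_j‖² = ‖y_i‖² + ‖y_j‖² − 2 y_i^T y_j and summing against W gives the trace identity used in the relaxation.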
Thus problem 1 is equivalent to minimizing cut(A, B) with the requirement that |A| = |B|, which is known to be NP hard [12].\n\nFor k bits the problem can be thought of as trying to find k independent balanced partitions, each of which should have as low a cut as possible.\n\n2.1 Spectral Relaxation\n\nBy introducing an n × k matrix Y whose j-th row is y_j^T, and a diagonal n × n matrix with D(i, i) = ∑_j W(i, j), we can rewrite the problem as:\n\nminimize: trace(Y^T (D − W) Y)   (2)\nsubject to: Y(i, j) ∈ {−1, 1}\nY^T 1 = 0\nY^T Y = I\n\nThis is of course still a hard problem, but by removing the constraint that Y(i, j) ∈ {−1, 1} we obtain an easy problem whose solutions are simply the k eigenvectors of D − W with minimal eigenvalue (after excluding the trivial eigenvector 1, which has eigenvalue 0).\n\n2.2 Out of Sample Extension\n\nThe fact that the solutions to the relaxed problem are the k eigenvectors of D − W with minimal eigenvalue would suggest simply thresholding these eigenvectors to obtain a binary code. But this would only tell us how to compute the code representation of items in the training set. This is the problem of out-of-sample extension of spectral methods, which is often solved using the Nystrom method [13, 14]. But note that the cost of calculating the Nystrom extension of a new datapoint is linear in the size of the dataset. In our setting, where there can be millions of items in the dataset, this is impractical. In fact, calculating the Nystrom extension is as expensive as doing exhaustive nearest neighbor search.\n\nIn order to enable efficient out-of-sample extension we assume the datapoints x_i ∈ R^d are samples from a probability distribution p(x). 
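On the training set alone, the relaxed solution can be sketched numerically as below; this is an illustrative implementation with an arbitrarily chosen ε, not the out-of-sample method developed next:

```python
import numpy as np

def training_codes(X, k, eps):
    """Binary codes for the training set via the spectral relaxation:
    threshold the k eigenvectors of D - W with smallest nonzero eigenvalue."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-D2 / eps**2)
    L = np.diag(W.sum(1)) - W
    vals, vecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    # Skip the trivial constant eigenvector (eigenvalue 0), then threshold.
    return (vecs[:, 1:k + 1] > 0).astype(np.uint8)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
codes = training_codes(X, k=4, eps=0.2)
print(codes.shape)  # (200, 4)
```

The dense eigendecomposition here is O(n³), which is exactly why the Nystrom route (or the analytic eigenfunctions below) is needed once the dataset grows.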
The equations in problem 1 are now seen to be sample averages, which we replace with their expectations:\n\nminimize: ∫ ‖y(x_1) − y(x_2)‖² W(x_1, x_2) p(x_1) p(x_2) dx_1 dx_2   (3)\nsubject to: y(x) ∈ {−1, 1}^k\n∫ y(x) p(x) dx = 0\n∫ y(x) y(x)^T p(x) dx = I\n\nwith W(x_1, x_2) = exp(−‖x_1 − x_2‖²/ε²). Relaxing the constraint that y(x) ∈ {−1, 1}^k now gives a spectral problem whose solutions are eigenfunctions of the weighted Laplace-Beltrami operators defined on manifolds [15, 16, 13, 17]. More explicitly, define the weighted Laplacian L_p as an operator that maps a function f to g = L_p f with g(x) = D(x) f(x) − ∫ W(s, x) f(s) p(s) ds and D(x) = ∫ W(x, s) p(s) ds. The solutions to the relaxation of problem 3 are functions that satisfy L_p f = λf with minimal eigenvalue (ignoring the trivial solution f(x) = 1, which has eigenvalue 0). As discussed in [16, 15, 13], with proper normalization, the eigenvectors of the discrete Laplacian defined by n points sampled from p(x) converge to eigenfunctions of L_p as n → ∞.\n\nWhat do the eigenfunctions of L_p look like? One important special case is when p(x) is a separable distribution. A simple case of a separable distribution is a multidimensional uniform distribution Pr(x) = ∏_i u_i(x_i), where u_i is a uniform distribution in the range [a_i, b_i]. Another example is a multidimensional Gaussian, which is separable once the space has been rotated so that the Gaussian is axis aligned.\n\nObservation: [17] If p(x) is separable, and similarity between datapoints is defined as exp(−‖x_i − x_j‖²/ε²), then the eigenfunctions of the continuous weighted Laplacian L_p have an outer product form. 
That is, if Φ_i(x) is an eigenfunction of the weighted Laplacian defined on R^1 with eigenvalue λ_i, then Φ_i(x_1)Φ_j(x_2) ⋯ Φ_d(x_d) is an eigenfunction of the d dimensional problem with eigenvalue λ_i λ_j ⋯ λ_d.\n\nSpecifically, for the case of a uniform distribution on [a, b], the eigenfunctions of the one-dimensional Laplacian L_p are extremely well studied objects in mathematics. They correspond to the fundamental modes of vibration of a metallic plate. The eigenfunctions Φ_k(x) and eigenvalues λ_k are:\n\nΦ_k(x) = sin(π/2 + (kπ/(b − a)) x)   (4)\n\nλ_k = 1 − exp(−(ε²/2) |kπ/(b − a)|²)   (5)\n\nA similar equation is also available for the one dimensional Gaussian. In this case the eigenfunctions of the one-dimensional Laplacian L_p are (in the limit of small ε) solutions to the Schrodinger equation and are related to Hermite polynomials. Figure 2 shows the analytical eigenfunctions for a 2D rectangle in order of increasing eigenvalue. The eigenvalue (which corresponds to the cut) determines which k bits will be used. Note that the eigenvalue depends on the aspect ratio of the rectangle and the spatial frequency: it is better to cut the long dimension before the short one, and low spatial frequencies are preferred. Note that the eigenfunctions do not depend on the radius of similar neighbors ε. The radius does change the eigenvalue but does not affect the ordering.\n\nWe distinguish between single-dimension eigenfunctions, which are of the form Φ_k(x_1) or Φ_k(x_2), and outer-product eigenfunctions, which are of the form Φ_k(x_1)Φ_l(x_2). These outer-product eigenfunctions are shown marked with a red border in the figure. 
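Equations 4 and 5 are trivial to evaluate; the small sketch below (with illustrative parameter values) also checks the two ordering properties just stated:

```python
import numpy as np

def phi(k, x, a, b):
    """Analytic eigenfunction of the 1D weighted Laplacian for a
    uniform distribution on [a, b] (equation 4)."""
    return np.sin(np.pi / 2 + k * np.pi / (b - a) * x)

def lam(k, a, b, eps):
    """Corresponding eigenvalue (equation 5); smaller means a better cut."""
    return 1 - np.exp(-(eps**2 / 2) * (k * np.pi / (b - a)) ** 2)

# Longer dimensions and lower spatial frequencies give smaller eigenvalues,
# so those cuts are preferred.
print(lam(1, 0, 4, eps=0.5) < lam(1, 0, 1, eps=0.5))  # True: cut the long side first
print(lam(1, 0, 1, eps=0.5) < lam(2, 0, 1, eps=0.5))  # True: low frequency first
```

Changing eps rescales every eigenvalue monotonically, which is why it does not affect the ordering of the cuts.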
As we now discuss, these outer-product eigenfunctions should be avoided when building a hashing code.\n\nObservation: Suppose we build a code by thresholding the k eigenfunctions of L_p with minimal eigenvalue, y(x) = sign(Φ_k(x)). If any of the eigenfunctions is an outer-product eigenfunction, then that bit is a deterministic function of other bits in the code.\n\nProof: This follows from the fact that sign(Φ_1(x_1)Φ_2(x_2)) = sign(Φ_1(x_1)) sign(Φ_2(x_2)).\n\nThis observation highlights the simplification we made in relaxing the independence constraint and requiring only that the bits be uncorrelated. Indeed the bits corresponding to outer-product eigenfunctions are approximately uncorrelated, but they are surely not independent.\n\nThe exact form of the eigenfunctions of the 1D continuous Laplacian for different distributions is a matter of ongoing research [17]. We have found, however, that the bit codes obtained by thresholding the eigenfunctions are robust to the exact form of the distribution. In particular, simply fitting a multidimensional rectangle distribution to the data (by using PCA to align the axes, and then assuming a uniform distribution on each axis) works surprisingly well for a wide range of distributions. In particular, using the analytic eigenfunctions of a uniform distribution on data sampled from a Gaussian works as well as using the numerically calculated eigenvectors, and far better than boosting or RBMs trained on the Gaussian distribution.\n\nTo summarize, given a training set of points {x_i} and a desired number of bits k, the spectral hashing algorithm works by:\n\n• Finding the principal components of the data using PCA.\n• Calculating the k smallest single-dimension analytical eigenfunctions of L_p using a rectangular approximation along every PCA direction. 
This is done by evaluating the k smallest eigenvalues for each direction using equation 5, thus creating a list of dk eigenvalues, and then sorting this list to find the k smallest eigenvalues.\n• Thresholding the analytical eigenfunctions at zero, to obtain binary codes.\n\n[Figure 2: eigenfunctions and thresholded eigenfunctions for a uniform rectangular distribution in 2D.]\n\nFigure 2: Left: Eigenfunctions for a uniform rectangular distribution in 2D. Right: Thresholded eigenfunctions. Outer-product eigenfunctions have a red frame. The eigenvalues depend on the aspect ratio of the rectangle and the spatial frequency of the cut: it is better to cut the long dimension first, and lower spatial frequencies are better than higher ones.\n\n[Figure 3: partitions induced by the codes from LSH, Boosting SSC, RBM (two hidden layers), and spectral hashing, with a) 3 bits, b) 7 bits, c) 15 bits.]\n\nFigure 3: Comparison of the neighborhoods defined by Hamming balls of different radii using codes obtained with LSH, Boosting, RBM and spectral hashing when using 3, 7 and 15 bits. The yellow dot denotes a test sample. The red points correspond to the locations that are within a Hamming distance of zero. Green corresponds to a Hamming ball of radius 1, and blue to radius 2.\n\nThis simple algorithm has two obvious limitations. First, it assumes that the data were generated by a multidimensional uniform distribution. We have experimented with using multidimensional Gaussians instead. Second, even though it avoids the trivial 3-way dependencies that arise from outer-product eigenfunctions, other high-order dependencies between the bits may exist. We have experimented with using only frequencies that are powers of two to avoid these dependencies. 
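The summarized algorithm can be sketched end to end as follows; this is a minimal illustration, assuming the eigenfunction of equation 4 is evaluated with each coordinate shifted to [0, b − a], and with an arbitrarily chosen ε:

```python
import numpy as np

def spectral_hash(X, n_bits, eps=1.0):
    """Sketch of the summarized algorithm: PCA, fit a rectangle, enumerate
    single-dimension analytic eigenfunctions, keep the n_bits with smallest
    eigenvalue, and threshold at zero."""
    # 1. PCA to align the axes.
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt.T
    # 2. Fit a rectangle: the range [a_j, b_j] along each PCA direction.
    a, b = Z.min(0), Z.max(0)
    # 3. List eigenvalues 1 - exp(-(eps^2/2) (k*pi/(b-a))^2) for each
    #    direction j and frequency k, then keep the n_bits smallest.
    d = Z.shape[1]
    cands = [(1 - np.exp(-(eps**2 / 2) * (k * np.pi / (b[j] - a[j])) ** 2), j, k)
             for j in range(d) for k in range(1, n_bits + 1)]
    cands.sort()
    # 4. Threshold the corresponding eigenfunctions at zero.
    bits = [np.sin(np.pi / 2 + k * np.pi / (b[j] - a[j]) * (Z[:, j] - a[j])) > 0
            for _, j, k in cands[:n_bits]]
    return np.stack(bits, 1).astype(np.uint8)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2)) * [4, 1]   # elongated rectangle
codes = spectral_hash(X, n_bits=4, eps=0.5)
print(codes.shape)  # (500, 4)
```

On the elongated toy rectangle above, the smallest eigenvalues come from low frequencies along the long axis, so those cuts claim the first bits.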
Neither of these more complicated variants of spectral hashing gave a significant improvement in performance in our experiments.\n\nFigure 4a compares the performance of spectral hashing to LSH, RBMs and Boosting on a 2D rectangle, and figure 3 visualizes the Hamming balls for the different methods. Despite the simplicity of spectral hashing, it outperforms the other methods. Even when we apply RBMs and Boosting to the output of spectral hashing, the performance does not improve. A similar pattern of results is shown in high dimensional synthetic data (figure 4b).\n\nSome insight into the superior performance can be obtained by comparing the partitions that each bit defines on the data (figures 2, 1). Recall that we seek partitions that give low cut value and are approximately independent. LSH, which uses random linear partitions, may give very unbalanced partitions. RBMs and Boosting both find good partitions, but the partitions can be highly dependent on each other.\n\n3 Results\n\nIn addition to the synthetic results, we applied the different algorithms to the image databases discussed in [3]. Figure 5 shows retrieval results for spectral hashing, RBMs and boosting on the “labelme” dataset. Note that even though spectral hashing uses a terrible model of the statistics of the database (it simply assumes an N dimensional rectangle), it performs better than boosting, which actually uses the distribution (the difference in performance relative to RBMs is not significant). 
Not only is the performance numerically better, but our visual inspection of the retrieved neighbors suggests that with a small number of bits, the retrieved images are better using spectral hashing than with boosting.\n\n[Figure 4: proportion of good neighbors for Hamming distance < 2 as a function of the number of bits, for spectral hashing, boosting + spectral hashing, RBM, RBM + spectral hashing, stumps boosting SSC, and LSH, on a) a 2D uniform distribution and b) a 10D uniform distribution.]\n\nFigure 4: Left: results on 2D rectangles with different methods. Even though spectral hashing is the simplest, it gives the best performance. Right: a similar pattern of results for a 10 dimensional distribution.\n\n[Figure 5: retrieval examples (input, gist neighbors, spectral hashing 10 bits, boosting 10 bits) and the proportion of good neighbors for Hamming distance < 2 versus number of bits for spectral hashing, RBM, Boosting SSC and LSH.]\n\nFigure 5: Performance of different binary codes on the LabelMe dataset described in [3]. The data is certainly not uniformly distributed, and yet spectral hashing gives better retrieval performance than boosting and LSH.\n\nFigure 6 shows retrieval results on a dataset of 80 million images. 
This dataset is obviously more challenging, and even using exhaustive search some of the retrieved neighbors are semantically quite different. Still, the majority of retrieved neighbors seem to be semantically relevant, and with 64 bits spectral hashing enables this performance in fractions of a second.\n\n4 Discussion\n\nWe have discussed the problem of learning a code for semantic hashing. We defined a hard criterion for a good code that is related to graph partitioning and used a spectral relaxation to obtain an eigenvector solution. We used recent results on convergence of graph Laplacian eigenvectors to obtain analytic solutions for certain distributions and showed the importance of avoiding redundant bits that arise from separable distributions.\n\nThe final algorithm we arrive at, spectral hashing, is extremely simple: one simply performs PCA on the data and then fits a multidimensional rectangle. The aspect ratio of this multidimensional rectangle determines the code using a simple formula. Despite this simplicity, the method is comparable, if not superior, to state-of-the-art methods.\n\nFigure 6: Retrieval results on a dataset of 80 million images using the original gist descriptor, and hash codes built with spectral hashing with 32 bits and 64 bits. The input image corresponds to the image on the top-left corner; the rest are the 24 nearest neighbors using Hamming distance for the hash codes and L2 for gist.\n\nReferences\n\n[1] R. R. Salakhutdinov and G. E. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In AISTATS, 2007.\n\n[2] A. Torralba, R. Fergus, and W. T. Freeman. Tiny images. Technical Report MIT-CSAIL-TR-2007-024, Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 2007.\n\n[3] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large databases for recognition. 
In CVPR, 2008.\n\n[4] James Hays and Alexei A. Efros. Scene completion using millions of photographs. ACM Transactions on Graphics (SIGGRAPH 2007), 26(3), 2007.\n\n[5] R. R. Salakhutdinov and G. E. Hinton. Semantic hashing. In SIGIR Workshop on Information Retrieval and Applications of Graphical Models, 2007.\n\n[6] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, pages 585–591, 2001.\n\n[7] Geoffrey E. Hinton and Sam T. Roweis. Stochastic neighbor embedding. In NIPS, pages 833–840, 2002.\n\n[8] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, pages 459–468, 2006.\n\n[9] R. R. Salakhutdinov and G. E. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In AI and Statistics, 2007.\n\n[10] G. Shakhnarovich, P. Viola, and T. Darrell. Fast pose estimation with parameter sensitive hashing. In ICCV, 2003.\n\n[11] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems 14, 2001.\n\n[12] J. Shi and J. Malik. Normalized cuts and image segmentation. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 731–737, 1997.\n\n[13] Yoshua Bengio, Olivier Delalleau, Nicolas Le Roux, Jean-François Paiement, Pascal Vincent, and Marie Ouimet. Learning eigenfunctions links spectral embedding and kernel PCA. Neural Computation, 16(10):2197–2219, 2004.\n\n[14] Charless Fowlkes, Serge Belongie, Fan R. K. Chung, and Jitendra Malik. Spectral grouping using the Nyström method. IEEE Trans. Pattern Anal. Mach. Intell., 26(2):214–225, 2004.\n\n[15] R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. 
Proceedings of the National Academy of Sciences, 102(21):7426–7431, May 2005.\n\n[16] M. Belkin and P. Niyogi. Towards a theoretical foundation for Laplacian based manifold methods. Journal of Computer and System Sciences, 2007.\n\n[17] Boaz Nadler, Stephane Lafon, Ronald R. Coifman, and Ioannis G. Kevrekidis. Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. Arxiv, 2008. http://arxiv.org/.\n", "award": [], "sourceid": 806, "authors": [{"given_name": "Yair", "family_name": "Weiss", "institution": null}, {"given_name": "Antonio", "family_name": "Torralba", "institution": null}, {"given_name": "Rob", "family_name": "Fergus", "institution": null}]}