{"title": "Space-Time Local Embeddings", "book": "Advances in Neural Information Processing Systems", "page_first": 100, "page_last": 108, "abstract": "Space-time is a profound concept in physics. This concept was shown to be useful for dimensionality reduction. We present basic definitions with interesting counter-intuitions. We give theoretical propositions to show that space-time is a more powerful representation than Euclidean space. We apply this concept to manifold learning for preserving local information. Empirical results on non-metric datasets show that more information can be preserved in space-time.", "full_text": "Space-Time Local Embeddings\n\nJun Wang2\n\nAlexandros Kalousis3,1\n\nSt\u00b4ephane Marchand-Maillet1\n\nKe Sun1\u2217\n\n1 Viper Group, Computer Vision and Multimedia Laboratory, University of Geneva\n\nsunk.edu@gmail.com, Stephane.Marchand-Maillet@unige.ch, and 2 Expedia,\nSwitzerland, jwang1@expedia.com, and 3 Business Informatics Department, University\n\nof Applied Sciences, Western Switzerland, Alexandros.Kalousis@hesge.ch\n\nAbstract\n\nSpace-time is a profound concept in physics. This concept was shown to be\nuseful for dimensionality reduction. We present basic de\ufb01nitions with interest-\ning counter-intuitions. We give theoretical propositions to show that space-time\nis a more powerful representation than Euclidean space. We apply this concept\nto manifold learning for preserving local information. Empirical results on non-\nmetric datasets show that more information can be preserved in space-time.\n\n1\n\nIntroduction\n\nAs a simple and intuitive representation, the Euclidean space (cid:60)d has been widely used in various\nlearning tasks. In dimensionality reduction, n given high-dimensional points in (cid:60)D, or their pair-\nwise (dis-)similarities, are usually represented as a corresponding set of points in (cid:60)d (d < D).\nThe representation power of (cid:60)d is limited. Some of its limitations are listed next. 
❶ The maximum number of points that can share a common nearest neighbor is limited (2 for ℝ; 5 for ℝ^2) [1, 2], while such centralized structures do exist in real data. ❷ ℝ^d can embed at most (d + 1) points with uniform pair-wise similarities. It is hard to model pair-wise relationships with little variance. ❸ Even if d is large enough, ℝ^d as a metric space must satisfy the triangle inequality, and therefore must admit transitive similarities [2], meaning that a neighbor's neighbor should also be nearby. Such relationships can be violated in real data, e.g. social networks. ❹ The Gram matrix of n real vectors must be positive semi-definite (p.s.d.). Therefore ℝ^d cannot faithfully represent the negative eigen-spectrum of input similarities, which was discovered to be meaningful [3].

To tackle the above limitations of Euclidean embeddings, a commonly-used method is to impose a statistical mixture model. Each embedding point is a random point over several candidate locations with some mixture weights. These candidate locations can be in the same ℝ^d [4], which allows an embedding point to jump across a long distance through a "statistical worm-hole". Or, they can be in m independent ℝ^d's [2, 5], resulting in m different views of the input data.

Another approach beyond Euclidean embeddings is to change the embedding destination to a curved space M^d. This M^d can be a Riemannian manifold [6] with a positive definite metric, or equivalently, a curved surface embedded in a Euclidean space [7, 8]. Learning such an embedding requires a closed-form expression of the distance measure. This M^d can also be semi-Riemannian [9] with an indefinite metric.
This semi-Riemannian representation, under the names "pseudo-Euclidean space", "Minkowski space", or, more conveniently, "space-time", was shown [3, 7, 10-12] to be a powerful representation for non-metric datasets. In these works, an embedding is obtained through a spectral decomposition of a "pseudo-Gram" matrix, which is computed based on some input data. On the other hand, manifold learning methods [4, 13, 14] are capable of learning a p.s.d. kernel Gram matrix that encapsulates useful information into a narrow band of its eigen-spectrum. Usually, local neighborhood information is more strongly preserved than non-local information [4, 15], so that the input information is unfolded in a non-linear manner to achieve the desired compactness.

The present work advocates the space-time representation. Section 2 introduces the basic concepts. Section 3 gives several simple propositions that describe the representation power of space-time. As novel contributions, section 4 applies the space-time representation to manifold learning. Section 5 shows that, using the same number of parameters, more information can be preserved by such embeddings than by Euclidean embeddings. This leads to new data visualization techniques. Section 6 concludes and discusses possible extensions.

* Corresponding author

2 Space-time

The fundamental measurements in geometry are established by the concept of a metric [6]. Intuitively, it is a locally- or globally-defined inner product. The metric of a Euclidean space ℝ^d is everywhere the identity. The inner product between any two vectors y1 and y2 is ⟨y1, y2⟩ = y1^T I_d y2, where I_d is the d × d identity matrix.
A space-time ℝ^{ds,dt} is a (ds + dt)-dimensional real vector space, where ds ≥ 0, dt ≥ 0, and the metric is

M = \begin{pmatrix} I_{d_s} & 0 \\ 0 & -I_{d_t} \end{pmatrix}.   (1)

This metric is not trivial. It is semi-Riemannian with a background in physics [9]. A point in ℝ^{ds,dt} is called an event, denoted by y = (y^1, ..., y^{ds}, y^{ds+1}, ..., y^{ds+dt})^T. The first ds dimensions are space-like, where the measurements are exactly the same as in a Euclidean space. The last dt dimensions are time-like, and cause counter-intuitions. In accordance with the metric M in eq. (1),

∀ y1, y2 ∈ ℝ^{ds,dt},  ⟨y1, y2⟩ = \sum_{l=1}^{d_s} y_1^l y_2^l − \sum_{l=d_s+1}^{d_s+d_t} y_1^l y_2^l.   (2)

In analogy to using inner products to define distances, the following definition gives a dissimilarity measure between two events in ℝ^{ds,dt}.

Definition 1. The space-time interval, or shortly interval, between any two events y1 and y2 is

c(y1, y2) = ⟨y1, y1⟩ + ⟨y2, y2⟩ − 2⟨y1, y2⟩ = \sum_{l=1}^{d_s} (y_1^l − y_2^l)^2 − \sum_{l=d_s+1}^{d_s+d_t} (y_1^l − y_2^l)^2.   (3)

The space-time interval c(y1, y2) can be positive, zero or negative. With respect to a reference point y0 ∈ ℝ^{ds,dt}, the set {y : c(y, y0) = 0} is called a light cone. Figure 1a shows a light cone in ℝ^{2,1}. Within the light cone, c(y, y0) < 0, i.e. the interval is negative; outside the light cone, c(y, y0) > 0. The following counter-intuitions help to establish the concept of space-time.

A low-dimensional ℝ^{ds,dt} can accommodate an arbitrarily large number of events sharing a common nearest neighbor. In ℝ^{2,1}, let A = (0, 0, 1), and put {B1, B2, ...} evenly on the circle {(y^1, y^2, 0) : (y^1)^2 + (y^2)^2 = 1} at time 0. Then A is the unique nearest neighbor of B1, B2, ....

A low-dimensional ℝ^{ds,dt} can represent uniform pair-wise similarities between an arbitrarily large number of points. In ℝ^{1,1}, the similarities within {A_i : A_i = (i, i)}_{i=1}^n are uniform.

In ℝ^{ds,dt}, the triangle inequality is not necessarily satisfied. In ℝ^{2,1}, let A = (−1, 0, 0), B = (0, 0, 1), C = (1, 0, 0). Then c(A, C) > c(A, B) + c(B, C). The trick is that, as B's absolute time value increases, its intervals with all events at time 0 shrink. Correspondingly, similarity measures in ℝ^{ds,dt} can be non-transitive: the fact that B is similar to A and to C independently does not necessarily mean that A and C are similar.

A neighborhood of y0 ∈ ℝ^{2,1} is {(y^1, y^2, y^3) : (y^1 − y_0^1)^2 + (y^2 − y_0^2)^2 − (y^3 − y_0^3)^2 ≤ ε}, where ε ∈ ℝ. This hyperboloid has infinite "volume", no matter how small ε is. Comparatively, a neighborhood in ℝ^d is much narrower, with an exponentially shrinking volume as its radius decreases.

Figure 1: (a) A space-time with a light cone around a reference point y0; (b) A space-time "compass" in ℝ^{1,1}. The colored lines show equal-interval contours with respect to the origin; (c) All possible embeddings in ℝ^{2,1} (resp. ℝ^3) are mapped to a sub-manifold of Δ_n, as shown by the red (resp. blue) line. Dimensionality reduction projects the input p* onto these sub-manifolds, e.g.
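The counter-intuitions above are easy to check numerically. The following is a minimal numpy sketch (the `interval` helper is illustrative, not from the authors' code) verifying the shared-nearest-neighbor example and the triangle-inequality violation from this section:

```python
import numpy as np

def interval(y1, y2, ds, dt):
    """Space-time interval of eq. (3): squared Euclidean distance on the
    first ds (space-like) dimensions minus that on the last dt (time-like)."""
    d = np.asarray(y1, float) - np.asarray(y2, float)
    return float(np.sum(d[:ds] ** 2) - np.sum(d[ds:ds + dt] ** 2))

# In R^{2,1}, A = (0, 0, 1) is the unique nearest neighbour of arbitrarily
# many points placed evenly on the unit circle at time 0.
A = np.array([0.0, 0.0, 1.0])
n = 12
B = [np.array([np.cos(t), np.sin(t), 0.0])
     for t in np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)]
for i, Bi in enumerate(B):
    c_AB = interval(A, Bi, 2, 1)          # 1 - 1 = 0 for every Bi
    nearest_other = min(interval(Bi, Bj, 2, 1)
                        for j, Bj in enumerate(B) if j != i)
    assert c_AB < nearest_other           # A is Bi's nearest neighbour

# Triangle inequality can fail: with B2 lifted in time,
# c(A2, C2) > c(A2, B2) + c(B2, C2).
A2 = np.array([-1.0, 0.0, 0.0])
B2 = np.array([0.0, 0.0, 1.0])
C2 = np.array([1.0, 0.0, 0.0])
assert interval(A2, C2, 2, 1) > interval(A2, B2, 2, 1) + interval(B2, C2, 2, 1)
```

Here c(A2, B2) = c(B2, C2) = 0 while c(A2, C2) = 4, matching the non-transitivity argument in the text.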
by minimizing the KL divergence.

3 The representation capability of space-time

This section formally discusses some basic properties of ℝ^{ds,dt} in relation to dimensionality reduction. We first build a tool to shift between two different representations of an embedding: a matrix of c(y_i, y_j) and a matrix of ⟨y_i, y_j⟩. From straightforward derivations, we have

Lemma 1. C_n = {C ∈ ℝ^{n×n} : ∀i, C_ii = 0; ∀i ≠ j, C_ij = C_ji} and K_n = {K ∈ ℝ^{n×n} : ∀i, \sum_{j=1}^n K_ij = 0; ∀i ≠ j, K_ij = K_ji} are two families of real symmetric matrices, with dim(C_n) = dim(K_n) = n(n − 1)/2. A linear mapping from C_n to K_n and its inverse are given by

K(C) = −(1/2) (I_n − (1/n) e e^T) C (I_n − (1/n) e e^T),   C(K) = diag(K) e^T + e diag(K)^T − 2K,   (4)

where e = (1, ..., 1)^T, and diag(K) means the diagonal entries of K as a column vector.

C_n and K_n are the sets of interval matrices and "pseudo-Gram" matrices, respectively [3, 12]. In particular, a p.s.d. K ∈ K_n is a Gram matrix, and the corresponding C(K) is a squared distance matrix. The double centering mapping K(C) is widely used to generate a (pseudo-)Gram matrix from a dissimilarity matrix.

Proposition 2. ∀ C* ∈ C_n, there exist n events in ℝ^{ds,dt}, such that ds + dt ≤ n − 1 and their intervals are C*.

Proof. ∀ C* ∈ C_n, K* = K(C*) has the eigen-decomposition K* = \sum_{l=1}^{rank(K*)} λ*_l v*_l (v*_l)^T, where rank(K*) ≤ n − 1 and {v*_l} are orthonormal. For each l = 1, ..., rank(K*), \sqrt{|λ*_l|} v*_l gives the coordinates in one dimension, which is space-like if λ*_l > 0 or time-like if λ*_l < 0.

Remark 2.1. ℝ^{ds,dt} (ds + dt ≤ n − 1) can represent any interval matrix C* ∈ C_n, or equivalently, any K* ∈ K_n.
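The constructive proof of Proposition 2 can be sketched in a few lines of numpy (a hedged illustration; variable names are mine, not the authors'): double-center an arbitrary interval matrix, eigen-decompose it, and read off space-like and time-like coordinates from the signs of the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary interval matrix C* in C_n: symmetric with zero diagonal,
# generally not embeddable in any Euclidean space.
n = 6
C = rng.normal(size=(n, n))
C = C + C.T
np.fill_diagonal(C, 0.0)

# Double centering, eq. (4): K(C) = -1/2 J C J with J = I - e e^T / n.
J = np.eye(n) - np.ones((n, n)) / n
K = -0.5 * J @ C @ J

# Eigen-decomposition: positive eigenvalues give space-like dimensions,
# negative ones time-like dimensions (Proposition 2).
lam, V = np.linalg.eigh(K)
keep = np.abs(lam) > 1e-9                 # drop the null direction along e
Y = V[:, keep] * np.sqrt(np.abs(lam[keep]))   # event coordinates
sign = np.sign(lam[keep])                 # +1 space-like, -1 time-like

# The reconstructed pseudo-Gram and interval matrices match the input.
K_rec = (Y * sign) @ Y.T
C_rec = np.diag(K_rec)[:, None] + np.diag(K_rec)[None, :] - 2 * K_rec
assert np.allclose(K_rec, K)
assert np.allclose(C_rec, C)
```

With a random C, some eigenvalues are negative, so the recovered events necessarily use time-like dimensions; a Euclidean embedding could only realize the p.s.d. case.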
Comparatively, ℝ^d (d ≤ n − 1) can only represent {K ∈ K_n : K ⪰ 0}.

A pair-wise distance matrix in ℝ^d is invariant to rotations. In other words, the direction information of a point cloud is completely discarded. In ℝ^{ds,dt}, some direction information is kept to distinguish between space-like and time-like dimensions. As shown in fig. 1b, one can tell the direction in ℝ^{1,1} by moving a point along the curve {(y^1)^2 + (y^2)^2 = 1} and measuring its interval w.r.t. the origin.

Local embedding techniques often use similarity measures in a statistical simplex Δ_n = {p = (p_ij) : 1 ≤ i < j ≤ n; ∀i, ∀j, p_ij > 0; \sum_{i,j: i<j} p_ij = 1}. This Δ_n has one less dimension than C_n and K_n, so that dim(Δ_n) = n(n − 1)/2 − 1. A mapping from K_n (C_n) to Δ_n is given by

p_ij ∝ f(C_ij(K)),   (5)

where f(·) is a positive-valued, strictly monotonically decreasing function, so that a large probability mass is assigned to a pair of events with a small interval. Proposition 2 trivially extends to

Proposition 3. ∀ p* ∈ Δ_n, there exist n events in ℝ^{ds,dt}, such that ds + dt ≤ n − 1 and their similarities are p*.

Remark 3.1. ℝ^{ds,dt} (ds + dt ≤ n − 1) can represent any n × n symmetric positive similarities. Moreover, for δ > δ_1, K*(δ) ≻ 0, and the number of positive eigenvalues of K*(δ) increases monotonically with δ.

With enough dimensions, any p* ∈ Δ_n can be perfectly represented in a space-only, or time-only, or space-time-mixed ℝ^{ds,dt}. There is no particular reason to favor a space-only model, because the objective of dimensionality reduction is to get a compact model with a small number of dimensions, regardless of whether they are space-like or time-like.
Formally, K_n^{ds,dt} = {K_+ − K_− : rank(K_+) ≤ ds; rank(K_−) ≤ dt; K_+ ⪰ 0; K_− ⪰ 0} is a low-rank subset of K_n. In the domain K_n, dimensionality reduction based on the input p* finds some K̂^{ds,dt} ∈ K_n^{ds,dt} which is close to the curve K*(δ). In the probability domain Δ_n, the image of K_n^{ds,dt} under some mapping g : K_n → Δ_n is g(K_n^{ds,dt}). As shown in fig. 1c, dimensionality reduction finds some p̂^{ds,dt} ∈ g(K_n^{ds,dt}), so that p̂^{ds,dt} is the closest point to p* w.r.t. some information-theoretic measure. The proximity of p* to p̂^{ds,dt}, i.e. its proximity to g(K_n^{ds,dt}), measures the quality of the model ℝ^{ds,dt} as the embedding target space, when the model scale or the number of dimensions is given.

We will investigate the latter approach, which depends on the choice of ds, dt, the mapping g, and some proximity measure on Δ_n. We will show that, with the same number of dimensions ds + dt, the region g(K_n^{ds,dt}) with space-time-mixed dimensions is naturally close to certain input p*.

4 Space-time local embeddings

We project a given similarity matrix p* ∈ Δ_n to some K̂ ∈ K_n^{ds,dt}, or equivalently, to a set of events Y = {y_i}_{i=1}^n ⊂ ℝ^{ds,dt}, so that ∀i, ∀j, ⟨y_i, y_j⟩ = K̂_ij as in eq. (2), and the similarities among these events resemble p*. As discussed in section 3, a mapping g : K_n → Δ_n helps transfer K_n^{ds,dt} into a sub-manifold of Δ_n, so that the projection can be done inside Δ_n. This mapping, expressed in the event coordinates, is given by

p_ij(Y) ∝ \frac{\exp(‖y_i^t − y_j^t‖^2)}{1 + ‖y_i^s − y_j^s‖^2},   (7)

where y^s = (y^1, ..., y^{ds})^T, y^t = (y^{ds+1}, ..., y^{ds+dt})^T, and ‖·‖ denotes the 2-norm. For any pair of events y_i and y_j, p_ij(Y) increases when their space coordinates move close, and/or when their time coordinates move away. This agrees with the basic intuitions of space-time. For time-like dimensions, the heat kernel is used to make p_ij(Y) sensitive to time variations. This helps to suppress events with large absolute time values, which make the embedding less interpretable. For space-like dimensions, the Student-t kernel, as suggested by t-SNE [13], is used, so that there is more "volume" to accommodate the often high-dimensional input data. Based on our experience, this hybrid parametrization of p_ij(Y) can model real data better than alternative parametrizations. Similar to SNE [4] and t-SNE [13], an optimal embedding can be obtained by minimizing the Kullback-Leibler (KL) divergence from the input p* to the output p(Y), given by

KL(Y) = \sum_{i,j: i<j} p*_ij \ln \frac{p*_ij}{p_ij(Y)}.   (8)

During gradient descent, {y_i^s} are updated by the delta-bar-delta scheme as used in t-SNE [13], where each scalar parameter has its own adaptive learning rate initialized to γ_s > 0; {y_i^t} are updated based on one global adaptive learning rate initialized to γ_t > 0. The learning of time should be more cautious, because p_ij(Y) is more sensitive to time variations by eq. (7). Therefore, the ratio γ_t/γ_s should be very small, e.g. 1/100.

5 Empirical results

Aiming at potential applications in data visualization and social network analysis, we compare SNE [4], t-SNE [13], and the method proposed in section 4, denoted as SNEST. They are based on the same optimizer but correspond to different sub-manifolds of Δ_n, as presented by the curves in fig. 1c.
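The hybrid parametrization of eq. (7) and the objective of eq. (8) are easy to state in code. Below is a hedged sketch (function names and the toy data are mine; the delta-bar-delta optimizer itself is omitted):

```python
import numpy as np

def similarities(Y, ds, dt):
    """Output similarities of eq. (7): Student-t kernel on the space-like
    part, inverted heat kernel on the time-like part, normalised over i<j."""
    Ys, Yt = Y[:, :ds], Y[:, ds:ds + dt]
    S2 = ((Ys[:, None, :] - Ys[None, :, :]) ** 2).sum(-1)  # space distances^2
    T2 = ((Yt[:, None, :] - Yt[None, :, :]) ** 2).sum(-1)  # time distances^2
    P = np.exp(T2) / (1.0 + S2)
    iu = np.triu_indices(len(Y), k=1)
    p = P[iu]
    return p / p.sum()

def kl(p_star, p):
    """KL divergence of eq. (8) from input to output similarities."""
    return float(np.sum(p_star * np.log(p_star / p)))

rng = np.random.default_rng(0)
n = 10
Y = rng.normal(size=(n, 3))                 # events in R^{2,1}
p = similarities(Y, ds=2, dt=1)
assert np.isclose(p.sum(), 1.0)
p_star = np.full_like(p, 1.0 / len(p))      # uniform input similarities
assert kl(p_star, p) >= 0.0
```

Note the positive exponent on the time term: pulling two events apart in time increases their similarity, which is exactly the space-time counter-intuition exploited by the embedding.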
Given different embeddings of the same dataset using the same number of dimensions, we perform model selection based on the KL divergence, as explained at the end of section 3.

We generated a toy dataset SCHOOL, representing a school with two classes. Each class has 20 students standing evenly on a circle, where each student is communicating with his (her) 4 nearest neighbours, and one teacher, who is communicating with all the students in the same class and with the teacher in the other class. The input p* is distributed evenly on the pairs (i, j) who are socially connected.

NIPS22 contains a 4197 × 3624 author-document matrix from NIPS 1988 to 2009 [2]. After discarding the authors who have only one NIPS paper, we get 1418 authors who co-authored 2121 papers. The co-authorship matrix is CA_{1418×1418}, where CA_ij denotes the number of papers that author i co-authored with author j. The input similarity p* is computed so that p*_ij ∝ CA_ij (1/\sum_j CA_ij + 1/\sum_i CA_ij), where the number of co-authored papers is normalized by each author's total number of papers. NIPS17 is built in the same way using only the first 17 volumes.

GrQc is an arXiv co-authorship graph [16] with 5242 nodes and 14496 edges. After removing one isolated node, a matrix CA_{5241×5241} gives the numbers of co-authored papers between any two authors who submitted to the general relativity and quantum cosmology category from January 1993 to April 2003. The input similarity p* satisfies p*_ij ∝ CA_ij (1/\sum_j CA_ij + 1/\sum_i CA_ij).

W5000 contains the semantic similarities among 5000 English words in WS_{5000×5000} [2, 17]. Each WS_ij is an asymmetric non-negative similarity from word i to word j. The input is normalized into a probability vector p* so that p*_ij ∝ WS_ij/\sum_j WS_ij + WS_ji/\sum_i WS_ji. W1000 is built in the same way using a subset of 1000 words.

Table 1 shows the KL divergence in eq. (8). In most cases, SNEST with a fixed number of free parameters has the lowest KL. On NIPS22, GrQc, W1000 and W5000, the embedding by SNEST in ℝ^{2,1} is even better than SNE and t-SNE in ℝ^4, meaning that the embedding by SNEST is both compact and faithful. This is in contrast to the mixture approach for visualization [2], which multiplies the number of parameters to get a faithful representation.

Table 1: KL divergence of different embeddings. After repeated runs on different configurations for each embedding, the minimal KL that we achieved within 5000 epochs is shown. The bold numbers show the winners among SNE, t-SNE and SNEST using the same number of parameters.

                  SCHOOL  NIPS17  NIPS22  GrQc   W1000  W5000
SNE → ℝ^2          0.52    1.88    2.98    3.19   3.67   4.93
SNE → ℝ^3          0.36    0.85    1.79    1.82   3.20   4.42
SNE → ℝ^4          0.19    0.35    1.01    1.03   2.76   3.93
t-SNE → ℝ^2        0.61    0.88    1.29    1.24   2.15   3.00
t-SNE → ℝ^3        0.58    0.85    1.23    1.14   2.00   2.79
t-SNE → ℝ^4        0.58    0.84    1.22    1.11   1.96   2.74
SNEST → ℝ^{1,1}    0.43    0.91    1.62    2.34   2.59   3.64
SNEST → ℝ^{2,1}    0.31    0.60    0.97    1.00   1.92   2.57
SNEST → ℝ^{3,1}    0.29    0.54    0.93    0.88   1.79   2.39

Fixing the free parameters to two dimensions, t-SNE in ℝ^2 has the best overall performance, and SNEST in ℝ^{1,1} is worse. We also discovered that, using d dimensions, ℝ^{d−1,1} usually performs better than alternative choices such as ℝ^{d−2,2}, which are not shown due to space limitation. A time-like dimension allows adaptation to non-metric data. The investigated similarities, however, are

Figure 2: (a) The embedding of SCHOOL by SNEST in ℝ^{2,1}. The black (resp.
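The symmetric similarity construction used for the co-authorship datasets can be sketched as follows (a minimal illustration with hypothetical counts; `coauthor_similarities` is my name, not the authors'):

```python
import numpy as np

def coauthor_similarities(CA):
    """Input similarities for NIPS22/NIPS17/GrQc:
    p*_ij proportional to CA_ij * (1/sum_j CA_ij + 1/sum_i CA_ij),
    i.e. co-authored papers normalised by each author's total output."""
    CA = np.asarray(CA, float)
    totals = CA.sum(axis=1)                      # papers per author
    P = CA * (1.0 / totals[:, None] + 1.0 / totals[None, :])
    iu = np.triu_indices(len(CA), k=1)           # keep pairs i < j
    p = P[iu]
    return p / p.sum()

# Toy 3-author example (hypothetical co-authorship counts).
CA = np.array([[0, 2, 1],
               [2, 0, 0],
               [1, 0, 0]])
p = coauthor_similarities(CA)
assert np.isclose(p.sum(), 1.0) and np.all(p >= 0)
```

The per-author normalization keeps prolific authors from dominating p* merely by volume.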
colored) dots denote the students (resp. teachers). The paper coordinates (resp. color) represent the space (resp. time) coordinates. The links mean social connections. (b) The contour of exp(‖y_i^t − y_j^t‖^2) / (1 + ‖y_i^s − y_j^s‖^2) in eq. (7) as a function of ‖y_i^s − y_j^s‖ (x-axis) and ‖y_i^t − y_j^t‖ (y-axis). The unit of the displayed levels is 10^{-3}.

mainly space-like, in the sense that a random pair of people or words are more likely to be dissimilar (space-like) rather than similar (time-like). According to our experience, on such datasets, good performance is often achieved with mainly space-like dimensions mixed with a small number of time-like dimensions, e.g. ℝ^{2,1} or ℝ^{3,1}, as suggested by table 1.

To interpret the embeddings, fig. 2a presents the embedding of SCHOOL in ℝ^{2,1}, where the space and time are represented by paper coordinates and three color levels, respectively. Each class is embedded as a circle. The center of each class, the teacher, is lifted to a different time, so as to be near to all students in the same class. One teacher being blue while the other is red creates a "hyper-link" between the teachers, because their large time difference makes them nearby in ℝ^{2,1}.

Figures 3 and 4 show the embeddings of NIPS22 and W5000 in ℝ^{2,1}. Similar to the (t-)SNE visualizations [2, 4, 13], it is easy to find close authors or words embedded nearby. The learned p(Y), however, is not equivalent to the visual proximity, because of the counter-intuitive time dimension. How much does the visual proximity reflect the underlying p(Y)? From the histogram of the time coordinates, we see that the time values are in the narrow range [−1.5, 1.5], while the range of the space coordinates is at least 100 times larger. Figure 2b shows the similarity function on the right-hand side of eq. (7) over an interesting range of ‖y_i^s − y_j^s‖ and ‖y_i^t − y_j^t‖. In this range, large similarity values are very sensitive to space variations, and their red level curves are almost vertical, meaning that the similarity information is largely carried by space coordinates. Therefore, the visualization of neighborhoods is relatively accurate: visually nearby points are indeed similar, and proximity in a neighborhood is informative regarding p(Y). On the other hand, small similarity values are less sensitive to space variations, and their blue level curves span a large distance in space, meaning that the visual distance between dissimilar points is less informative regarding p(Y). For example, a visual distance of 165 with a time difference of 1 has roughly the same similarity as a visual distance of 100 with no time difference. This is a matter of embedding dissimilar samples far or very far, and it does not affect much the visual perception, which naturally requires less accuracy on such samples. However, perception errors could still occur in these plots, although they are increasingly unlikely as the observation radius turns small.

Figure 3: An embedding of NIPS22 in ℝ^{2,1}. "Major authors" with at least 10 NIPS papers or with a time value in the range (−∞, −1] ∪ [1, ∞) are shown by their names. Other authors are shown by small dots. The paper coordinates are in space-like dimensions. The positions of the displayed names are adjusted up to a tiny radius to avoid text overlap. The color of each name represents the time dimension. The font size is proportional to the absolute time value.
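The "165 with time difference 1 versus 100 with no time difference" comparison can be checked directly from the unnormalised similarity of eq. (7); the following small sketch confirms the two values agree to well under one percent:

```python
import numpy as np

def sim(space_dist, time_diff):
    """Unnormalised similarity of eq. (7) for a single pair of events."""
    return np.exp(time_diff ** 2) / (1.0 + space_dist ** 2)

# A visual (space) distance of 165 with time difference 1 is roughly as
# similar as a visual distance of 100 with no time difference.
a = sim(165.0, 1.0)   # e / (1 + 165^2)
b = sim(100.0, 0.0)   # 1 / (1 + 100^2)
assert abs(a - b) / b < 0.01
```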
In viewing such visualizations, one must take into account the time represented by the colors and font sizes, and remember that a point with a large absolute time value should be weighted higher in similarity judgment.

Consider the learning of y_i by eq. (9): if the input p*_ij is larger than what can be faithfully modeled in a space-only model, then j will push i to a different time. Therefore, the absolute value of time is a significance measurement. By fig. 2a, the connection hubs, and points with remote connections, are more likely to be at a different time. Emphasizing the embedding points with large absolute time values helps the user to focus on important points. One can easily identify well-known authors and popular words in figs. 3 and 4. This type of information is not discovered by traditional embeddings.

Figure 4: An embedding of W5000 in ℝ^{2,1}. Only a subset is shown for a clear visualization. The position of each word represents its space coordinates, up to tiny adjustments to avoid overlap. The color of each word shows its time value. The font size represents the absolute time value.

6 Conclusions and Discussions

We advocate the use of the space-time representation for non-metric data. While previous works on such embeddings [3, 12] compute an indefinite kernel by simple transformations of the input data, we learn a low-rank indefinite kernel by manifold learning, trying to better preserve the neighbours [4]. We discovered that, using the same number of dimensions, certain input information is better preserved in space-time than in Euclidean space.
We built a space-time visualizer of non-metric data, which automatically discovers important points.

To enhance the proposed visualization, an interactive interface can allow the user to select one reference point and show the true similarity values, e.g. by aligning the other points so that the visual distances correspond to the similarities. Proper constraints or regularization could be proposed, so that the time values are discrete or sparse, and the resulting embedding can be more easily interpreted.

The proposed learning is on a sub-manifold K_n^{ds,dt} ⊂ K_n, or a corresponding sub-manifold of Δ_n. Another interesting sub-manifold of K_n could be {K − t t^T : K ≻ 0; t ∈ ℝ^n}, which extends the p.s.d. cone to any matrix in K_n with a compact negative eigen-spectrum. It is possible to construct a sub-manifold of K_n so that the embedder can learn whether a dimension is space-like or time-like. As another axis of future investigation, given the large family of manifold learners, there can be many ways to project the input information onto these sub-manifolds. The proposed method SNEST is based on the KL divergence in Δ_n. Some immediate extensions can be based on other dissimilarity measures in K_n or Δ_n.
This could also be useful for faithful representations of graph datasets with indefinite weights.

Acknowledgments

This work has been supported by the Department of Computer Science, University of Geneva, in collaboration with the Swiss National Science Foundation project MAAYA (grant number 144238).

References

[1] K. Zeger and A. Gersho. How many points in Euclidean space can have a common nearest neighbor? In International Symposium on Information Theory, page 109, 1994.
[2] L. van der Maaten and G. E. Hinton. Visualizing non-metric similarities in multiple maps. Machine Learning, 87(1):33-55, 2012.
[3] J. Laub and K. R. Müller. Feature discovery in non-metric pairwise data. JMLR, 5(Jul):801-818, 2004.
[4] G. E. Hinton and S. T. Roweis. Stochastic neighbor embedding. In NIPS 15, pages 833-840. MIT Press, 2003.
[5] J. Cook, I. Sutskever, A. Mnih, and G. E. Hinton. Visualizing similarity data with a mixture of maps. In AISTATS'07, pages 67-74, 2007.
[6] J. Jost. Riemannian Geometry and Geometric Analysis. Universitext. Springer, 6th edition, 2011.
[7] R. C. Wilson, E. R. Hancock, E. Pekalska, and R. P. W. Duin. Spherical embeddings for non-Euclidean dissimilarities. In CVPR'10, pages 1903-1910, 2010.
[8] D. Lunga and O. Ersoy. Spherical stochastic neighbor embedding of hyperspectral data. Geoscience and Remote Sensing, IEEE Transactions on, 51(2):857-871, 2013.
[9] B. O'Neill. Semi-Riemannian Geometry With Applications to Relativity.
Number 103 in Pure and Applied Mathematics. Academic Press, 1983.

[10] L. Goldfarb. A unified approach to pattern recognition. Pattern Recognition, 17(5):575–582, 1984.

[11] E. Pekalska and R. P. W. Duin. The Dissimilarity Representation for Pattern Recognition: Foundations and Applications. World Scientific, 2005.

[12] J. Laub, J. Macke, K. R. Müller, and F. A. Wichmann. Inducing metric violations in human similarity judgements. In NIPS 19, pages 777–784. MIT Press, 2007.

[13] L. van der Maaten and G. E. Hinton. Visualizing data using t-SNE. JMLR, 9(Nov):2579–2605, 2008.

[14] N. D. Lawrence. Spectral dimensionality reduction via maximum entropy. In AISTATS'11, JMLR W&CP 15, pages 51–59, 2011.

[15] K. Q. Weinberger, F. Sha, and L. K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In ICML'04, pages 839–846, 2004.

[16] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data, 1(1), 2007.

[17] D. L. Nelson, C. L. McEvoy, and T. A. Schreiber. The University of South Florida word association, rhyme, and word fragment norms. 1998. http://www.usf.edu/FreeAssociation.