{"title": "A Statistical Recurrent Model on the Manifold of Symmetric Positive Definite Matrices", "book": "Advances in Neural Information Processing Systems", "page_first": 8883, "page_last": 8894, "abstract": "In a number of disciplines, the data (e.g., graphs, manifolds) to be\nanalyzed are non-Euclidean in nature.  Geometric deep learning\ncorresponds to techniques that generalize deep neural network models\nto such non-Euclidean spaces. Several recent papers have shown how\nconvolutional neural networks (CNNs) can be extended to learn with\ngraph-based data.  In this work, we study the setting where the data\n(or measurements) are ordered, longitudinal or temporal in nature and\nlive on a Riemannian manifold -- this setting is common in a variety\nof problems in statistical machine learning, vision and medical\nimaging. We show how recurrent statistical recurrent network models\ncan be defined in such spaces. We give an efficient algorithm and\nconduct a rigorous analysis of its statistical properties. We perform\nextensive numerical experiments demonstrating competitive performance\nwith state of the art methods but with significantly less number of\nparameters. We also show applications to a statistical analysis task\nin brain imaging, a regime where deep neural network models have only\nbeen utilized in limited ways.", "full_text": "A Statistical Recurrent Model on the Manifold of\n\nSymmetric Positive De\ufb01nite Matrices\u2217\n\nRudrasis Chakraborty\u2020\n\nChun-Hao Yang\u2020(cid:93)\n\nXingjian Zhen\u2021(cid:93) Monami Banerjee\u2020\n\nDerek Archer\u2020\n\nDavid Vaillancourt\u2020 Vikas Singh\u2021\n\u2020University of Florida, Gainesville, USA\n\u2021University of Wisconsin Madison, USA\n\n(cid:93) Equal contribution\n\nBaba C. Vemuri\u2020\n\nAbstract\n\nIn a number of disciplines, the data (e.g., graphs, manifolds) to be analyzed are\nnon-Euclidean in nature. 
Geometric deep learning corresponds to techniques that generalize deep neural network models to such non-Euclidean spaces. Several recent papers have shown how convolutional neural networks (CNNs) can be extended to learn with graph-based data. In this work, we study the setting where the data (or measurements) are ordered, longitudinal or temporal in nature and live on a Riemannian manifold; this setting is common in a variety of problems in statistical machine learning, vision and medical imaging. We show how statistical recurrent network models can be defined in such spaces. Then, we present an efficient algorithm and conduct a rigorous analysis of its statistical properties. We perform numerical experiments demonstrating competitive performance with state-of-the-art methods but with significantly fewer parameters. We also show applications to a statistical analysis task in brain imaging, a regime where deep neural network models have only been utilized in limited ways.

1 Introduction

In the last decade or so, deep neural network models have been very successful in learning complicated patterns from data such as images, videos and speech [41, 39]; this has led to a number of breakthroughs as well as deployments in turnkey applications. A popular neural network architecture that has contributed to these advancements is the convolutional neural network (CNN). In the classical definition of convolution, one often assumes that the data correspond to discrete measurements, acquired at equally spaced intervals (i.e., in Euclidean space), of a scalar (or vector) valued function. Clearly, for images, the Euclidean lattice grid assumption makes sense and the use of convolutional architectures is appropriate; as described in [11], a number of properties such as stationarity, locality and compositionality follow.
While the assumption that the underlying data satisfy a Euclidean structure is explicit or implicit in an overwhelming majority of models, recently there has been growing interest in applying or extending deep learning models to non-Euclidean data. This line of work is called geometric deep learning and typically deals with data such as manifolds and graphs [11]. Existing results describe strategies for leveraging the mathematical properties of such geometric or structured data, specifically, the lack of (a) a global linear structure, (b) a global coordinate system, and (c) shift invariance/equivariance, by incorporating these ideas explicitly into the deep networks used to model them [13, 37, 18, 31, 30, 19].

Separate from the evolving body of work at the interface of convolutional neural networks and structured data, there is a mature literature in statistical machine learning [40] and computer vision demonstrating how exploiting the structure (or geometry) of the data can yield advantages.

*This research was funded in part by NSF grants IIS-1525431 and IIS-1724174 to BCV, R01 NS052318 to DV, and NSF CAREER award 1252725 and R01 EB022883 to VS. XZ and VS were also supported by UW CPCP (U54 AI117924).

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
Structured data abound in various data analysis tasks: directional data in measurements from antennas [44], time series data (curves) in finance [60] and the health sciences [20], surface normal vectors on the unit sphere (in vision or graphics) [58], probability density functions (in functional data analysis) [56], covariance matrices (for use in conditional independences, image textures) [62], rigid motions (registration) [48], shape representations (shape space analysis) [34], tree-based data (parse trees in natural language processing) [51], subspaces (videos, segmentation) [65, 23], low-rank matrices [12, 63], and kernel matrices [53] are common examples. In neuroimaging, an image may have a structured measurement at each voxel to describe water diffusion [7, 64, 42, 32, 4, 15, 35] or local structural change [29, 68, 36]. The study of the interface between geometry/structure and analysis methods has yielded effective practical tools: in order to define loss functions that make sense for the data at hand, one first needs a metric that is intrinsic to the structure of the data.

The foregoing discussion, for the most part, covers differential geometry inspired algorithms for non-sequential (or non-temporal) data. The study of analogous schemes for temporal or longitudinal data is less well-developed. But the analysis of dynamical scenes and stochastic processes is an important area of machine learning and vision, and it is here that some results have shown the benefits of explicitly using geometric ideas. Examples include the modeling of the temporal evolution of features in dynamic scenes for action recognition [2, 9, 61], tractography [14, 50], and so on. There are also proposals for modeling stochastic linear dynamical systems (LDS) [22, 2, 9, 61]. In [2, 3], the authors studied the Riemannian geometry of LDS to define distances and first-order statistics.
Given that the marriage between deep learning and learning on non-Euclidean domains is fairly recent, the existing body of work primarily deals with attempts to generalize the popular CNN architectures. Few results exist that study recurrent models for non-Euclidean structured domains.

The broad success of recurrent neural network (RNN) architectures, including long short-term memory (LSTM) [28] and the gated recurrent unit (GRU) [17], in sequential modeling tasks such as natural language processing (NLP) has motivated a number of attempts to apply such ideas to model stochastic processes or to characterize dynamical scenes which can be viewed as a sequence of images. Several works have proposed variants of RNNs to model dynamical scenes, including [57, 21, 46, 54, 66]. In the recent past, developments have been made to reduce the number of parameters in RNNs and to make RNNs faster [38, 66]. In [6, 27], the authors proposed an efficient way to handle the vanishing and exploding gradient problems of RNNs using unitary weight matrices. In [33], the authors proposed an RNN model which combines the remembering ability of unitary RNNs with the ability of gated RNNs to effectively forget redundant/irrelevant information. Despite these results, we find that no existing model describes a recurrent model for structured (specifically, manifold-valued) data. The main contribution of this paper is to describe a recurrent model (and accompanying theoretical analysis) that falls under the umbrella of "geometric deep learning": it exploits the geometry of non-Euclidean data but is specifically designed for temporal or ordered measurements.

2 Preliminaries: Key Ingredients from Riemannian geometry

In this section, we will first give a brief overview of the Riemannian geometry of n × n symmetric positive definite matrices (henceforth denoted by SPD(n)).
Note that our development is not limited to SPD(n), but choosing a specific manifold simplifies the presentation and the notation significantly. Then, we will present the key ingredients needed for our proposed recurrent model.

Differential Geometry of SPD(n): Let SPD(n) be the set of n × n symmetric positive definite matrices. The group of n × n full rank matrices, denoted by GL(n) and called the general linear group, acts on SPD(n) via the group action g.A := gAg^T, where g ∈ GL(n) and A ∈ SPD(n). One can define a GL(n)-invariant intrinsic metric d_GL on SPD(n) as

d_GL(A, B) = sqrt( trace( Log(A^{-1}B)^2 ) ),

see [26]. Here, Log is the matrix logarithm. This metric is intrinsic but requires a spectral decomposition for calculations, a computationally intensive task for large matrices. In [16], the Jensen-Bregman LogDet (JBLD) divergence was introduced on SPD(n). As the name suggests, this is not a metric, but as proved in [55], the square root of JBLD turns out to be a metric (called the Stein metric), which is defined as

d(A, B) = sqrt( log det( (A + B)/2 ) - (1/2) log det(AB) ).

Here, we use the notation d without any subscript to denote the Stein metric. It is easy to see that the Stein metric is computationally much more efficient than the GL(n)-invariant natural metric on SPD(n) as no eigendecomposition is required. This will be useful for training our recurrent model. In the remainder of the paper, we will assume the metric on SPD(n) to be the Stein metric. Now, we describe a few operations on SPD(n) which are needed to define the recurrent model.

"Translation" operation on SPD(n): Let I be the set of all isometries on SPD(n), i.e., given g ∈ I, d(g.A, g.B) = d(A, B) for all A, B ∈ SPD(n), where . is the group action as defined earlier.
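These definitions are easy to check numerically. Below is a minimal numpy/scipy sketch of the Stein metric and of the isometry property of the congruence action (the helper names `stein_distance`, `translate` and `random_spd` are ours, not from the paper):

```python
import numpy as np
from scipy.linalg import expm

def stein_distance(A, B):
    """Stein metric: d(A, B) = sqrt(log det((A+B)/2) - (1/2) log det(AB)).

    slogdet avoids any eigendecomposition of A or B."""
    _, ld_mid = np.linalg.slogdet((A + B) / 2.0)
    _, ld_A = np.linalg.slogdet(A)
    _, ld_B = np.linalg.slogdet(B)
    val = ld_mid - 0.5 * (ld_A + ld_B)
    return np.sqrt(max(val, 0.0))   # clip round-off negatives when A is close to B

def translate(A, g):
    """Group action ("translation"): T_A(g) = g A g^T."""
    return g @ A @ g.T

def random_spd(n, rng):
    """Hypothetical test helper: a generic SPD matrix G G^T + I."""
    G = rng.standard_normal((n, n))
    return G @ G.T + np.eye(n)
```

A quick check that an orthogonal `g` (here built as the exponential of a skew-symmetric matrix) leaves the Stein distance unchanged confirms the isometry discussed above.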
It is clear that I forms a group (henceforth denoted by G), and for a given g ∈ G and A ∈ SPD(n), g.A ↦ B, for some B ∈ SPD(n), is a group action. One can easily see that, endowed with the Stein metric, G = GL(n). In this work, we will choose a subgroup of GL(n), namely O(n), as our choice of G, where O(n) is the set of n × n orthogonal matrices and g.A := gAg^T. Since the O(n) group operation preserves the distance, we call this group operation "translation", analogous to the case of Euclidean space, and denote it by T_A(g) := gAg^T.

Parametrization of SPD(n): Let A ∈ SPD(n). We will obtain the Cholesky factorization A = LL^T, where L is an invertible lower triangular matrix. This gives a unique parametrization of SPD(n). Let the parametrization be A = Chol((l_1, l_2, ..., l_n, ..., l_{n(n+1)/2})^t). With a slight abuse of notation, we will use Chol to denote both decomposition and construction based on the type of the domain of the function, i.e., Chol(A) := L and Chol(L) := LL^T = A. Note that here l_1, l_2, ..., l_n are the diagonal entries of L and are positive, while l_{n+1}, ..., l_{n(n+1)/2} can be any real numbers.

Parametrization of O(n): O(n) is a Lie group [25] of n × n orthogonal matrices (of dimension n(n-1)/2), with the corresponding Lie algebra o(n) consisting of the set of n × n skew-symmetric matrices. The Lie algebra is a vector space, so we will use the corresponding element of the Lie algebra to parametrize a point on O(n). Given g ∈ O(n), we take the matrix logarithm log(g) to obtain the parametrization as a skew-symmetric matrix.
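Both parametrizations can be sketched in a few lines; the round trips below are a minimal illustration (function names are ours, and we keep the skew-symmetric matrix small so that the principal matrix logarithm recovers it exactly):

```python
import numpy as np
from scipy.linalg import expm, logm

def chol_params(A):
    """Parametrize A in SPD(n) by its Cholesky factor L (A = L L^T):
    first the n (positive) diagonal entries, then the strict lower triangle."""
    L = np.linalg.cholesky(A)
    n = A.shape[0]
    return np.concatenate([np.diag(L), L[np.tril_indices(n, -1)]])

def chol_reconstruct(params, n):
    """Inverse map: rebuild L from its entries, then return A = L L^T."""
    L = np.zeros((n, n))
    L[np.diag_indices(n)] = params[:n]
    L[np.tril_indices(n, -1)] = params[n:]
    return L @ L.T

def orth_params(g):
    """Parametrize g in O(n) by its matrix logarithm, a skew-symmetric matrix;
    its n(n-1)/2 free entries sit above the diagonal."""
    S = logm(g).real
    return S[np.triu_indices(g.shape[0], 1)]

def orth_reconstruct(params, n):
    """Rebuild the skew-symmetric matrix and map back to O(n) via expm."""
    U = np.zeros((n, n))
    U[np.triu_indices(n, 1)] = params
    return expm(U - U.T)
```

Both maps are exact up to floating-point error, which is what makes them convenient parameter spaces for optimization later on.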
So, g = exp((g_1, g_2, ..., g_{n(n-1)/2})^t), where exp is the matrix exponential operator.

Weighted Fréchet mean (wFM) of matrices on SPD(n): Given {X_i}_{i=1}^N ⊂ SPD(n) and weights {w_i}_{i=1}^N with w_i ≥ 0 for all i and Σ_i w_i = 1, the weighted Fréchet mean (wFM) [24] is:

M* = argmin_M Σ_{i=1}^N w_i d²(X_i, M)    (1)

The existence and uniqueness of the Fréchet mean (FM) is discussed in detail in [1]. In this paper, we will assume that the samples lie within a geodesic ball of an appropriate radius so that the FM exists and is unique. We will use FM({X_i}, {w_i}) to denote the wFM of {X_i} with weights {w_i}. With the above tools in hand, we are now ready to formulate the statistical recurrent neural network on SPD(n), dubbed SPD-SRU.

3 A Statistical Recurrent Network Model in the space of SPD(n) matrices

The main motivation for our work comes from the statistical recurrent unit (SRU) model on Euclidean spaces in [47]. To set up our formulation, we will briefly review the SRU formulation followed by details of our recurrent model for manifold-valued measurements.

What is the Statistical Recurrent Unit (SRU)? The authors in [47] propose an interesting model for sequential (or temporal) data based on an un-gated recurrent unit (called the Statistical Recurrent Unit (SRU)). The model maintains the sequential dependency in the input samples through a simple summary statistic, the so-called exponential moving average. Even though the proposal is based on an un-gated architecture, the development and experiments show that the results from SRU are competitive with more complex alternatives like LSTM and GRU. One reason put forth in that work is that using appropriately designed summary statistics, one can essentially emulate complicated gated units and still capture long-term relations (or memory) in sequences.
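Returning briefly to the wFM in (1): it has no closed form under the Stein metric in general, and a direct (if slow) way to compute it is numerical optimization over the Cholesky parametrization of Section 2. A minimal scipy sketch under these assumptions (`stein_sq` and `wfm_optimize` are our names; Section 4 develops the much faster recursive estimator actually used):

```python
import numpy as np
from scipy.optimize import minimize

def stein_sq(A, B):
    """Squared Stein metric d^2(A, B)."""
    _, ld_mid = np.linalg.slogdet((A + B) / 2.0)
    _, ld_A = np.linalg.slogdet(A)
    _, ld_B = np.linalg.slogdet(B)
    return ld_mid - 0.5 * (ld_A + ld_B)

def wfm_optimize(Xs, ws, n):
    """wFM of (1): argmin_M sum_i w_i d^2(X_i, M).

    M = L L^T is searched over its Cholesky entries; the diagonal of L is kept
    positive via exp, so the optimization is unconstrained."""
    tril = np.tril_indices(n)

    def unpack(theta):
        L = np.zeros((n, n))
        L[tril] = theta
        L[np.diag_indices(n)] = np.exp(L[np.diag_indices(n)])
        return L @ L.T

    def objective(theta):
        M = unpack(theta)
        return sum(w * stein_sq(X, M) for w, X in zip(ws, Xs))

    theta0 = np.zeros(n * (n + 1) // 2)   # start at the identity matrix
    res = minimize(objective, theta0, method="Powell",
                   options={"xtol": 1e-10, "ftol": 1e-12, "maxiter": 1000})
    return unpack(res.x)
```

As a sanity check, the wFM of two identical samples with equal weights recovers the sample itself; the cost of this generic optimizer is exactly what motivates Section 4.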
This property is particularly attractive when we study recurrent models for more complicated measurements such as manifolds. Recall that the key challenge in extending statistical machine learning models to manifolds involves re-deriving many of the classical (Euclidean) arithmetic and geometric operations while respecting the geometry of the manifold of interest. The simplicity of un-gated units provides an excellent starting point. Below, we describe the key update equations that define the SRU.

Let x_1, x_2, ..., x_T be an input sequence in R^n, presented to the model. As in most recurrent models, the training process in SRU proceeds by updating the weights of the model. Let the weight matrices be denoted by W (the node is indexed by the superscript). The update rules for SRU are as follows:

r_t = ReLU( W^(r) μ_{t-1} + b^(r) )    (2)
φ_t = ReLU( W^(φ) r_t + W^(x) x_t + b^(φ) )    (3)
∀α ∈ J,  μ_t^(α) = α μ_{t-1}^(α) + (1 - α) φ_t    (4)
o_t = ReLU( W^(o) μ_t + b^(o) )    (5)

where J is the set of different scales. The SRU formulation is analogous to mean map embedding (MME) but applied to non-i.i.d. samples. Since the average of a set of i.i.d. samples will essentially marginalize over time, simple averaging will lose the temporal/sequential information. On the other hand, the SRU computes a moving average over time which captures the average of the data seen so far, i.e., when computing μ from φ (as shown in Fig. 1). This is very similar to taking the average of stochastic processes and looking at the "average process". Further, by looking at averages over different scales, essentially, we can uncover statistics computed over different time scales.
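The updates (2)-(5) amount to only a few lines of numpy. A minimal single-layer sketch (the dimensions, initialization scale and class name below are our own illustrative choices, not from [47]):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

class SRUCell:
    """Un-gated statistical recurrent unit, eqs. (2)-(5).

    A moving average mu^(alpha) is kept at every scale alpha in J; the stacked
    averages feed both the recurrent map (2) and the output map (5)."""

    def __init__(self, d_in, d_stat, alphas=(0.0, 0.5, 0.9, 0.99), seed=0):
        rng = np.random.default_rng(seed)
        self.alphas = alphas
        d_mu = d_stat * len(alphas)                 # stacked moving averages
        self.W_r = 0.1 * rng.standard_normal((d_stat, d_mu))
        self.b_r = np.zeros(d_stat)
        self.W_phi = 0.1 * rng.standard_normal((d_stat, d_stat))
        self.W_x = 0.1 * rng.standard_normal((d_stat, d_in))
        self.b_phi = np.zeros(d_stat)
        self.W_o = 0.1 * rng.standard_normal((d_stat, d_mu))
        self.b_o = np.zeros(d_stat)
        self.mu = np.zeros((len(alphas), d_stat))

    def step(self, x):
        mu_flat = self.mu.ravel()
        r = relu(self.W_r @ mu_flat + self.b_r)                  # (2)
        phi = relu(self.W_phi @ r + self.W_x @ x + self.b_phi)   # (3)
        for i, a in enumerate(self.alphas):                      # (4)
            self.mu[i] = a * self.mu[i] + (1.0 - a) * phi
        return relu(self.W_o @ self.mu.ravel() + self.b_o)       # (5)
```

Feeding a sequence through `step` makes the role of the scales concrete: alpha near 1 retains a long history, alpha near 0 tracks the most recent phi.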
This is because μ is not only a function of φ but also a function of {x_i}_{i=1}^{t-1} via r_t. This dependence on the past "tokens" in the sequence is shown in Fig. 1 by a dashed line. With this description, we can easily list the key operational components in the update rules in (2)-(5) and then evaluate if such components can be generalized to serve as the building blocks of our proposed model.

Which low-level operations are needed? We can verify that the key ingredients to define the model in SRU are (i) weighted sum; (ii) addition of bias; (iii) moving average; and (iv) non-linearity. In principle, if we can generalize each of these operations to the SPD(n) manifold, it will provide us the basic components to define the model. Observe that items (i) and (iii) are essentially a weighted sum if we impose a convexity constraint on the weights. Then, the weighted sum for the Euclidean setting can be generalized using the wFM as defined in Section 2 (denoted by FM). If we can do so, it will also provide a way to compute moving averages on SPD(n). The second operation we can identify above is the translation on Euclidean spaces. This can be achieved by the "translation" operation on SPD(n) as defined in Section 2 (denoted by T). Finally, in order to generalize the ReLU on SPD(n), we will use the standard ReLU on the parameter space (this will be the local chart of SPD(n)) and then map it back onto the manifold. This means that we have generalized each of the key components. With this in hand, we are ready to present our proposed recurrent model on SPD(n). We first formally describe our SPD-SRU layer and then contrast it with the SRU layer, to help see the main differences.

Figure 1: Sketch of an SPD-SRU and SRU layer (dashed line represents dependence on the previous time point).

Basic components of the SPD-SRU model. Let X_1, X_2, ..., X_T be an input temporal or ordered sequence of points on SPD(n).
The update rules for a layer of SPD-SRU are as follows:

Y_t = FM( {M_{t-1}^(α)}_{α∈J}, {w^(y,α)} ),   R_t = T( Y_t, g^(r) )    (6)
T_t = FM( {R_t, X_t}, w^(t) ),   Φ_t = T( T_t, g^(p) )    (7)
∀α ∈ J,  M_t^(α) = FM( {M_{t-1}^(α), Φ_t}, α )    (8)
S_t = FM( {M_t^(α)}_{α∈J}, {w^(s,α)} ),   O_t = Chol( ReLU( Chol( T( S_t, g^(y) ) ) ) )    (9)

where t ∈ {1, ..., T} and M_0^(α) is initialized to be a diagonal n × n matrix with small positive values. Similar to before, the set J consists of positive real numbers from the unit interval. Now, computing the FM at the different elements of J will give a wFM at different "scales", exactly as desired. Analogous to the SRU, here the M_t^(α) are computed by averaging Φ_t at different scales as shown in Fig. 1. This model leverages the context based on previous data by allowing the moving averages, M_t^(α), to depend on past data, {X_i}_{i=1}^{t-1}, through R_t (as shown in Fig. 1).

Comparison between the SPD-SRU and the SRU layer: In the SPD-SRU unit above, each update identity is a generalization of an update equation of SRU. In (6), we compute the weighted combination of the previous FMs (computed using different "scales") with a "translation"; the input is {M_{t-1}^(α)} and the output is R_t. This update equation is analogous to the weighted combination of the past means with bias as given in (2), where the input is {μ_{t-1}^(α)} and the output is r_t. This update rule calculates a weighted combination of the past information. In (7), we compute a weighted combination of the previous information, R_t, and the current point or token, X_t, with a "translation". The input of this equation is R_t and X_t and the output is Φ_t. This is analogous to (3), where the input is r_t and x_t and the output is φ_t. This update rule combines old and new information. Now, we will update the new information based on the combined information at the current time step, i.e., Φ_t. This is accomplished in (8). Here, we are computing an FM (average) at different "scales". Computing averages at different "scales" essentially allows including information from previous data points which have been seen at various time scales. This step is a generalization of (4). In this step, the input is {M_{t-1}^(α)} and Φ_t (μ_{t-1}^(α) and φ_t respectively in (4)) and the output is {M_t^(α)} ({μ_t^(α)} in (4)). This step is the combined information gathered at the current time step. Finally, in (9), we use a weighted combination of the current FMs (averages) and output O_t. This corresponds to the last update rule in SRU, i.e., (5). Observe that we did not use the ReLU operation in each update rule of SPD-SRU, in contrast to SRU. This is because these update rules are highly nonlinear, unlike in the SRU; hence, a ReLU unit at the final output of the layer is sufficient.
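One SPD-SRU step, eqs. (6)-(9), can be sketched as follows. This is an illustrative sketch only: we substitute the cheap log-Euclidean weighted mean for the Stein-metric wFM used in the paper, the `eps` safeguard in the ReLU step is our addition, and all weights and g matrices are placeholders:

```python
import numpy as np
from scipy.linalg import expm, logm

def wfm(mats, w):
    """Weighted mean of SPD matrices; an illustrative stand-in for the Stein
    wFM. The log-Euclidean weighted mean is cheap and stays on SPD(n)."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    S = sum(wi * logm(M).real for wi, M in zip(w, mats))
    return expm((S + S.T) / 2.0)

def translate(A, g):
    """"Translation" group action g A g^T."""
    return g @ A @ g.T

def chol_relu(A, eps=1e-4):
    """Chol(ReLU(Chol(A))): ReLU on the Cholesky factor, mapped back to SPD(n).
    eps keeps the diagonal strictly positive (our safeguard, not in the paper)."""
    L = np.linalg.cholesky(A)
    L = np.maximum(L, 0.0)
    L[np.diag_indices_from(L)] = np.maximum(np.diag(L), eps)
    return L @ L.T

def spd_sru_step(X, M_alphas, alphas, params):
    """One SPD-SRU step; params holds placeholder weights w and matrices g."""
    Y = wfm(M_alphas, params["w_y"])                       # (6) combine past FMs
    R = translate(Y, params["g_r"])
    T = wfm([R, X], [params["w_t"], 1.0 - params["w_t"]])  # (7) mix old and new
    Phi = translate(T, params["g_p"])
    M_alphas = [wfm([M, Phi], [a, 1.0 - a])                # (8) moving FMs
                for M, a in zip(M_alphas, alphas)]
    S = wfm(M_alphas, params["w_s"])                       # (9) output statistic
    O = chol_relu(translate(S, params["g_y"]))
    return O, M_alphas
```

Since the output O is again SPD, the step can be cascaded layer over layer, exactly as described in the text.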
Also, notice that O_t ∈ SPD(n); hence, we can cascade multiple SPD-SRU layers, i.e., in the next layer the input sequence will be O_1, O_2, ..., O_T. The update equations track the "averages" (FM) at varying scales. This is the reason we call our framework a statistical recurrent network. We will shortly see that our framework can utilize parameters more efficiently and requires very few parameters because of the ability to use the covariance structure.

Important properties of the SPD-SRU model: The "translation" operator T is analogous to "adding" a bias term in a standard neural network. One reason we call it "translation" is that the action of O(n) preserves the metric. Notice that although in this description we track the FMs at different scales, one may easily use other statistics, e.g., the Fréchet median, mode, etc. The key bottleneck is to efficiently compute the moving statistic (whatever it may be), which will be discussed shortly. Note that the SPD-SRU formulation can be generalized to other manifolds. In fact, it can be easily generalized to Riemannian homogeneous spaces [26] for two reasons: (a) closed-form expressions for the Riemannian exponential and inverse exponential maps exist, and (b) a group G acts transitively on these spaces, hence we can generalize the definition of "translation". Other manifolds are also possible but the technical details will be different. Now, we will comment on learning the parameters of our proposed model.

Learning the parameters: Notice that using the parametrization of O(n), we will learn the "bias" term on the parameter space, which is a vector space. The weights in the wFM must satisfy the non-negativity constraint. In order to ensure that this property is satisfied, we will learn the square root of each weight, which is unconstrained, i.e., ranges over the entire real line.
We will impose the affine constraint explicitly by normalizing the weights. Hence, all the trainable parameters lie in Euclidean space and the optimization of these parameters is unconstrained; standard techniques therefore suffice.

Remarks. It is interesting to observe that the update equations in (6)-(9) involve group operations and wFM computation. But as evident from (1), the wFM computation requires numerical optimization, which is not computationally efficient. This is a bottleneck. For example, for our proposed model, on a batch size of 20 with 15 × 15 matrices and T = 50, we need to compute the FM 3000 times, even for just 10 epochs. Next, we will develop a formulation to make this wFM computation faster since it is invoked hundreds of times in a typical training procedure.

4 An efficient way to compute the wFM on SPD(n)

The foregoing discussion describes how the computation of the wFM needs an optimization on the SPD manifold. If this sub-module is slow, the demands of the overall runtime will rule out practical adoption. In contrast, if this sub-module is fast but numerically or statistically unstable, the errors will propagate in unpredictable ways and can adversely affect the parameter estimation. Thus, we need a scheme that balances performance and efficiency.

Estimation of the FM from samples is a well-researched topic. For instance, the authors in [45, 49] used Riemannian gradient descent to compute the FM. But the algorithm has a runtime complexity of O(iN), where N is the number of samples and i is the number of iterations for convergence. This procedure comes with provable consistency guarantees; thus, while it will serve our goals in theory, we find that the runtime for each run makes training incredibly slow. On the other hand, the O(N) recursive FM estimator using the Stein metric presented in [52] is fast and apt for this task if no additional assumptions are made.
However, it comes with no theoretical guarantees of consistency.

Key Observation. We found that with a few important changes to the idea described in [52], one can derive an FM estimator that retains the attractive efficiency behavior and is provably consistent. The key ingredient here involves using a novel isometric mapping from the SPD manifold to the unit Hilbert sphere. Next, we present the main idea followed by the analysis.

Proposed Idea. Let {X_i}_{i=1}^N ⊂ SPD(n) be the samples for which we want to compute the FM, which will be used in (6)-(9). The authors in [52] presented a recursive Stein mean estimator given below:

M_1 = X_1,   M_k = M_{k-1} [ ( T_k + ((2w_k - 1)²/4)(I - T_k)² )^{1/2} - ((2w_k - 1)/2)(I - T_k) ]    (10)

where T_k = M_{k-1}^{-1} X_k and {w_i} is the set of weights. Instead, briefly, our strategy is (i) use an isometric mapping from SPD(n) to the unit Hilbert sphere; and (ii) make use of an efficient way to compute the FM on the unit Hilbert sphere. This isometric mapping to the Hilbert sphere then transfers the problem of proving consistency of the estimator from SPD(n) to that on the Hilbert sphere, where it is easier to prove, as shown below. This then leads to consistency of the FM estimator on SPD(n).

We define the isometric mapping from SPD(n) with the Stein metric to S∞, i.e., the infinite-dimensional unit hypersphere. In order to define it, notice that we need to define a metric d_S on S∞ such that (SPD(n), d) and (S∞, d_S) are isometric. This procedure and the associated consistency analysis are described below (all proofs are in the supplement).

Definition 1. Let A ∈ SPD(n). Let f := G(A) be the Gaussian density with 0 mean and covariance matrix A. Now, we normalize the density f by f ↦ f/‖f‖ to map it onto S∞.
Let Φ : SPD(n) → S∞ be that mapping. We define the metric on S∞ as d_S(f̃, g̃) = sqrt( - log ⟨f̃, g̃⟩² ), where ⟨·,·⟩ is the L2 inner product. The following proposition proves the isometry between SPD(n) with the Stein metric and the hypersphere with the new metric. Let A, B ∈ SPD(n). Then:

Proposition 1. Let f̃ = Φ(A) and g̃ = Φ(B). Then, d(2A, 2B) = d_S(f̃, g̃).

Note that Φ maps a point on SPD(n) to the positive orthant of S∞, denoted by H, since the components of any probability vector are non-negative. We should point out that in this metric space there are no geodesics, since it is not a length space. As a result, we cannot simply take the consistency proof of the stochastic gradient descent based FM estimator presented in [10] for an arbitrary Riemannian manifold and apply it here. Hence, the recursive FM presented next for the identity in (10), with the mapping described above, will need a separate consistency analysis.

Recursive Fréchet mean algorithm on (H, d_S). Let {x_i}_{i=1}^N be the samples on (H, d_S), where H is the positive orthant of S∞. Then the FM of the given samples, denoted by m*, is defined as m* = argmin_m Σ_{i=1}^N d_S²(x_i, m). Our recursive algorithm to compute the wFM of {x_i}_{i=1}^N is:

m_1 = x_1,   m_k = argmin_x ( w_k d_S²(x_k, x) + (1 - w_k) d_S²(m_{k-1}, x) )    (11)

where m_k is the k-th estimate of the FM. At each step of our algorithm, we simply calculate a wFM of two points, and we chose the weights to be the Euclidean weights. So, in order to construct a recursive algorithm, we need a closed-form expression of the wFM, as stated next.

Proposition 2. The minimizer of (11) is given by m_k = (sin(θ - α)/sin(θ)) m_{k-1} + (sin(α)/sin(θ)) x_k, where θ = arccos(⟨m_{k-1}, x_k⟩), α = arctan( (-1 + sqrt(4c²(1 - w_k) - 4c²(1 - w_k)² + 1)) / (2c(1 - w_k)) ), and c = tan(θ).

Consistency and convergence analysis of the estimator. The following proposition (see supplement for the proof) gives us the weak consistency of this estimator and also the convergence rate.

Proposition 3. (a) Var(m_k) → 0 as k → ∞. (b) The rate of convergence of the proposed recursive FM estimator is superlinear.

Due to Proposition 1, we obtain a consistency result for (10) with our mapping. These results suggest that we now have a suitable FM estimator which is consistent and efficient; this can be used as a black-box module in our RNN formulation in (6)-(9).

5 Experiments

In this section, we demonstrate the application of SPD-SRU to answer three key questions: (1) Using the manifold constraint, what do we save in terms of the number of parameters/time, and is the performance competitive? (2) When the data are not manifold-valued, can we still use our framework with the geometry constraint? (3) In a real application, what improvements can we get over the baseline? We perform three sets of experiments to answer these questions, namely: (a) classification of moving patterns on the Moving MNIST data, (b) classification of actions on the UCF11 dataset, and (c) permutation testing to detect group differences between patients with and without Parkinson's disease. In the following subsections, we discuss each of these datasets in more detail and present the performance of our SPD-SRU.
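Before turning to the individual experiments, note that the recursive Stein-mean update of (10) in Section 4 is simple to implement. A minimal numpy/scipy sketch (the function name is ours; with the running-average weights w_k = 1/k, the iterate tracks the Stein FM of X_1, ..., X_k):

```python
import numpy as np
from scipy.linalg import sqrtm

def recursive_stein_mean(Xs, ws=None):
    """Recursive Stein-mean estimator, eq. (10):
        T_k = M_{k-1}^{-1} X_k,
        M_k = M_{k-1} [ sqrt(T_k + ((2w_k-1)^2/4)(I-T_k)^2)
                        - ((2w_k-1)/2)(I-T_k) ].
    ws[k-1] is the weight of the k-th sample; the first sample's weight is
    implicit in M_1 = X_1. Default: w_k = 1/k (running-average weights)."""
    n = Xs[0].shape[0]
    I = np.eye(n)
    M = Xs[0].copy()
    for k, X in enumerate(Xs[1:], start=2):
        w = 1.0 / k if ws is None else ws[k - 1]
        T = np.linalg.solve(M, X)            # T_k = M^{-1} X, no explicit inverse
        c = 2.0 * w - 1.0
        root = sqrtm(T + (c * c / 4.0) * (I - T) @ (I - T)).real
        M = M @ (root - (c / 2.0) * (I - T))
        M = (M + M.T) / 2.0                  # clean up numerical asymmetry
    return M
```

Two sanity checks follow directly from the update: identical samples are a fixed point (T_k = I gives M_k = M_{k-1}), and w_k = 1 returns the newest sample exactly (the square root collapses to (I + T_k)/2, so M_k = M_{k-1} T_k = X_k).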
Our code is available from https://goo.gl/SfAezS.

5.1 Savings in terms of number of parameters/time and experiments on vision datasets

In this section, we perform two sets of experiments, namely (1) classification of moving patterns on the Moving MNIST data and (2) classification of actions on the UCF11 data, to show the improvement of our proposed framework over the state-of-the-art methods in terms of number of parameters/time. We compared with LSTM [28], SRU [47], TT-GRU and TT-LSTM [66]. In the first two classification applications, we use a convolution block before the recurrent unit for all the competing methods except for TT-GRU and TT-LSTM. In our SPD-SRU model, before the recurrent layer, we included a covariance block analogous to [67] after one convolution layer ([67] includes details of the construction for the covariance block). So, the input of our SPD-SRU layer is a sequence of matrices in SPD(c + 1), where c is the number of channels from the convolution layer.

Table 1: Comparative results on Moving MNIST (10-fold testing accuracy by orientation difference, in degrees)

Mode      # params.   time (s)/epoch   10-15          30-60          10-15-20
SPD-SRU   1559        ~6.2             0.96 ± 0.02    1.00 ± 0.00    0.94 ± 0.02
TT-GRU    2240        ~2.0             0.52 ± 0.04    1.00 ± 0.00    0.47 ± 0.03
TT-LSTM   2304        ~2.0             0.51 ± 0.04    1.00 ± 0.00    0.37 ± 0.02
SRU       159862      ~3.5             0.75 ± 0.19    1.00 ± 0.00    0.73 ± 0.14
LSTM      252342      ~4.5             0.71 ± 0.07    0.97 ± 0.01    0.57 ± 0.13

Classification of moving patterns in Moving MNIST data. We used the Moving MNIST data as generated in [57]. For this experiment we performed 2- and 3-class classification experiments. In each class, we generated 1000 sequences, each of length 20, showing 2 digits moving in a 64 × 64 frame.
Though within a class the digits are random, we fixed the moving pattern by fixing the speed and direction of the movement. In this experiment, we kept the speed the same for all sequences, but two sequences from two different classes differ in orientation by at least 5° and at most 30°. We observe experimentally that SPD-SRU achieves very good 10-fold testing accuracy even when the orientation difference between two classes is only 5°. In fact, SPD-SRU uses the smallest number of parameters among all methods tested and still offers the best average testing accuracy.
In Table 1, we report the mean and standard deviation of the 10-fold testing accuracy. We should point out that the training accuracy for all the competing methods is > 95% in all cases. For TT-RNN, we reshaped the input to be 4 × 8 × 8 × 16 and kept the output shape and rank to be 4 × 4 × 4 × 4 and 1 × 4 × 4 × 4 × 1 respectively. The number of output units for LSTM is set to 10 and the number of statistics for SRU is set to 80. Note that we tried different hyperparameters for SRU, LSTM, and TT-RNN, and the ones reported here are those requiring the fewest parameters to reach the reported testing accuracy. For the convolution layer, we chose the kernel size to be 5 × 5 and the input and output channels to be 5 and 10 respectively, i.e., the dimension of the SPD matrix is 11 for this experiment. As before, these parameters are chosen so that the number of parameters is smallest for the reported testing accuracy.
One can see from the table that SPD-SRU uses the fewest parameters and achieves very good classification accuracy even for a 5° orientation difference and for three classes. Note that TT-RNN is the closest to SPD-SRU in terms of the number of parameters. For comparison, we conduct an experiment where we vary the orientation difference from 30° to 5°.
The testing accuracies are shown in Fig. 2. We can see that only SPD-SRU maintains good 10-fold testing accuracy for all orientation differences, while the performance of TT-RNN (both variants) deteriorates as we decrease the difference between the orientations of the two classes (the effect size). In terms of training time, SPD-SRU takes around 6 seconds per epoch, while the fastest method, TT-RNN, takes around 2 seconds. However, in this experiment SPD-SRU takes 75 epochs to converge to the reported results while TT-RNN takes around 400 epochs; so, although TT-RNN is faster per epoch, the total training time for TT-RNN and SPD-SRU is almost the same. We should also point out that although SPD-SRU has fewer trainable parameters than TT-RNN, the per-epoch time difference is due to constructing the covariance in each epoch, which can be optimized via faster implementations.
Classification of actions in UCF11 data. We performed an action classification experiment on the UCF11 dataset [43]. It contains in total 1600 video clips belonging to 11 classes that summarize the human action visible in each clip, such as basketball shooting, diving, and others. We followed the same preprocessing steps as in [66]. Each frame has resolution 320 × 240. We generate a sequence of RGB frames of size 160 × 120 from each clip at 24 fps. The lengths of the frame sequences from each video are therefore in the range 204-1492, with an average of 483.7. For SPD-SRU, we chose two convolution layers with kernel size 7 × 7 and number of output channels 5 and 7 respectively, followed by 5 SPD-SRU layers. Hence, the dimension of the covariance matrices is 8 × 8 for this experiment. For TT-GRU and TT-LSTM, we used the same configurations of input and output factorization as given in [66]. For SRU and LSTM we set the number of statistics and the number of output units to 750.
For both SRU and LSTM we used 3 convolution layers with kernel size 7 × 7 and output channels 10, 15, and 25 respectively to get the reported testing accuracies.

Figure 2: Comparison of 10-fold testing accuracy (y-axis) versus orientation difference in degrees (x-axis) for SPD-SRU, TT-GRU, and TT-LSTM.

Table 2: Comparative results on UCF11 data

Model     # params.   time/epoch   Test acc.
SPD-SRU   3337        ~76          0.78
TT-GRU    6048        ~42          0.78
TT-LSTM   6176        ~33          0.78
SRU       2535630     ~50          0.75
LSTM      14626425    ~57          0.70

All the models achieve > 90% training accuracy. We report the testing accuracy along with the number of parameters and time per epoch in Table 2. From this experiment, we can see that the number of parameters for SPD-SRU is significantly smaller than for the other models, without sacrificing testing accuracy. In terms of training time, SPD-SRU takes approximately 3 times longer per epoch than TT-RNN, but SPD-SRU converges in 50 epochs versus roughly 100 for TT-RNN. Furthermore, we would like to point out that after 400 epochs, SPD-SRU gives 79.90% testing accuracy. Hence, analogous to the previous experiment, we can conclude that SPD-SRU maintains very good classification accuracy while keeping the number of trainable parameters very small. This experiment also indicates that SPD-SRU can achieve competitive performance on real data with a small number of trainable parameters in comparable time.

5.2 Application on manifold-valued data

From the previous two experiments, we can conclude that SPD-SRU requires a smaller number of parameters. We now turn to a neuroimaging application where the data are manifold valued. Because the number of parameters is small, we can perform statistical testing on brain connectivity at the fiber bundle level.
We seek to find group differences between subjects with and without Parkinson's disease (denoted 'PD' and 'CON') based on the M1 fiber tracts in both hemispheres of the brain.
Permutation testing to detect group differences. The data pool consists of dMRI (human) brain scans acquired from 50 'PD' patients and 44 'CON' healthy controls. All images were collected using a 3.0T MR scanner (Philips Achieva) and a 32-channel quadrature volume head coil. The parameters of the diffusion imaging acquisition sequence were: gradient directions = 64, b-values = 0/1000 s/mm², repetition time = 7748 ms, echo time = 86 ms, flip angle = 90°, field of view = 224 × 224 mm, matrix size = 112 × 112, number of contiguous axial slices = 60, and SENSE factor P = 2. We used the FSL [8] software to extract the M1 fiber tracts (denoted 'LM1' and 'RM1') [5], which consist of 33 and 34 points respectively (see Fig. 3 for the M1-SMATT fiber tract template). We fit a diffusion tensor and extract 3 × 3 SPD matrices. For each of the two classes, we use 3 layers of SPD-SRU to learn the tract patterns, yielding two models for 'PD' and 'CON' (denoted 'mPD' and 'mCON').
We then use a permutation test based on a "distance" between 'mPD' and 'mCON'. We define the distance between two network models as proposed in [59] (denote it by d_mod). Here, we assume each subject is independent, hence the use of permutation testing is sensible. We perform permutation testing for each tract as follows: (i) randomly permute the class labels of the subjects and learn 'mPD' and 'mCON' models for each of the new groups;
(ii) compute the model distance d_mod^j for the permuted groups; (iii) repeat steps (i)-(ii) 10,000 times (indexed by j) and report the p-value as the fraction of times d_mod^j > d_mod. In other words, we ask whether we can reject the null hypothesis that there is no significant difference between the tract models learned from the two classes.
As a baseline, we use the following scheme: (i) for each tract of each subject, compute the FM of the SPD matrices along the tract; (ii) apply Cramér's test based on the Stein distance; (iii) perform the permutation testing based on Cramér's test.
We found that using our SPD-SRU model with 3 layers, the p-values for 'LM1' and 'RM1' are 0.01 and 0.032 respectively, while the baseline method gives p-values of 0.17 and 0.34 respectively. Hence, unlike the baseline method, using SPD-SRU we can reject the null hypothesis with 95% confidence. To the best of our knowledge, this is the first result demonstrating an RNN-based statistical significance test applied to tract-based group testing in neuroimaging.

Figure 3: M1-SMATT template

6 Conclusions

Non-Euclidean or manifold-valued data are ubiquitous in science and engineering. In this work, we studied the setting where the data (or measurements) are ordered, longitudinal or temporal in nature and live on a Riemannian manifold. This setting is common in a variety of problems in statistical machine learning, vision, and medical imaging. We presented a generalization of the RNN to such non-Euclidean spaces and analyzed its theoretical properties. Our proposed framework is fast and needs far fewer parameters than the state of the art. Experiments show competitive performance on benchmark computer vision datasets in comparable time. We also applied our framework to perform statistical analysis of brain connectivity, demonstrating its applicability to manifold-valued data.

References
[1] Bijan Afsari. Riemannian Lp center of mass: existence, uniqueness, and convexity.
Proceedings of the American Mathematical Society, 139(2):655–673, 2011.
[2] Bijan Afsari, Rizwan Chaudhry, Avinash Ravichandran, and René Vidal. Group action induced distances for averaging and clustering linear dynamical systems with applications to the analysis of dynamic scenes. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2208–2215. IEEE, 2012.
[3] Bijan Afsari and René Vidal. The alignment distance on spaces of linear dynamical systems. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 1162–1167. IEEE, 2013.
[4] Iman Aganj, Christophe Lenglet, and Guillermo Sapiro. ODF reconstruction in q-ball imaging with solid angle consideration. In Biomedical Imaging: From Nano to Macro, 2009. ISBI'09. IEEE International Symposium on, pages 1398–1401. IEEE, 2009.
[5] Derek B Archer, David E Vaillancourt, and Stephen A Coombes. A template and probabilistic atlas of the human sensorimotor tracts using diffusion MRI. Cerebral Cortex, pages 1–15, 2017.
[6] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128, 2016.
[7] Peter J Basser, James Mattiello, and Denis LeBihan. MR diffusion tensor spectroscopy and imaging. Biophysical Journal, 66(1):259–267, 1994.
[8] Timothy EJ Behrens, H Johansen Berg, Saad Jbabdi, Matthew FS Rushworth, and Mark W Woolrich. Probabilistic diffusion tractography with multiple fibre orientations: What can we gain? NeuroImage, 34(1):144–155, 2007.
[9] Alessandro Bissacco, Alessandro Chiuso, Yi Ma, and Stefano Soatto. Recognition of human gaits. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 2, pages II–II. IEEE, 2001.
[10] Silvere Bonnabel. Stochastic gradient descent on Riemannian manifolds.
IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.
[11] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
[12] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.
[13] Rudrasis Chakraborty, Monami Banerjee, and Baba C Vemuri. H-CNNs: Convolutional neural networks for Riemannian homogeneous spaces. arXiv preprint arXiv:1805.05487, 2018.
[14] Guang Cheng, Hesamoddin Salehian, John R Forder, and Baba C Vemuri. Tractography from HARDI using an intrinsic unscented Kalman filter. IEEE Transactions on Medical Imaging, 34(1):298–305, 2015.
[15] Guang Cheng, Hesamoddin Salehian, and Baba C Vemuri. Efficient recursive algorithms for computing the mean diffusion tensor and applications to DTI segmentation. In European Conference on Computer Vision, pages 390–401. Springer, 2012.
[16] Anoop Cherian, Suvrit Sra, Arindam Banerjee, and Nikolaos Papanikolopoulos. Efficient similarity search for covariance matrices via the Jensen-Bregman LogDet divergence. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2399–2406. IEEE, 2011.
[17] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
[18] Taco S Cohen, Mario Geiger, Jonas Koehler, and Max Welling. Spherical CNNs. arXiv preprint arXiv:1801.10130, 2018.
[19] Taco S Cohen and Max Welling. Steerable CNNs. arXiv preprint arXiv:1612.08498, 2016.
[20] Francesca Dominici, Aidan McDermott, Scott L Zeger, and Jonathan M Samet. On the use of generalized additive models in time-series studies of air pollution and health.
American Journal of Epidemiology, 156(3):193–203, 2002.
[21] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
[22] Gianfranco Doretto, Alessandro Chiuso, Ying Nian Wu, and Stefano Soatto. Dynamic textures. International Journal of Computer Vision, 51(2):91–109, 2003.
[23] Ehsan Elhamifar and René Vidal. Sparse subspace clustering. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 2790–2797. IEEE, 2009.
[24] Maurice Fréchet. Les éléments aléatoires de nature quelconque dans un espace distancié. Ann. Inst. H. Poincaré, 10(3):215–310, 1948.
[25] Brian Hall. Lie groups, Lie algebras, and representations: an elementary introduction, volume 222. Springer, 2015.
[26] Sigurdur Helgason. Differential geometry and symmetric spaces, volume 12. Academic Press, 1962.
[27] Mikael Henaff, Arthur Szlam, and Yann LeCun. Recurrent orthogonal networks and long-memory tasks. arXiv preprint arXiv:1602.06662, 2016.
[28] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[29] Xue Hua, Alex D Leow, Neelroop Parikshak, Suh Lee, Ming-Chang Chiang, Arthur W Toga, Clifford R Jack Jr, Michael W Weiner, Paul M Thompson, Alzheimer's Disease Neuroimaging Initiative, et al. Tensor-based morphometry as a neuroimaging biomarker for Alzheimer's disease: an MRI study of 676 AD, MCI, and normal subjects. NeuroImage, 43(3):458–469, 2008.
[30] Zhiwu Huang and Luc J Van Gool. A Riemannian network for SPD matrix learning.
In AAAI, volume 2, page 6, 2017.
[31] Zhiwu Huang, Jiqing Wu, and Luc Van Gool. Building deep networks on Grassmann manifolds. arXiv preprint arXiv:1611.05742, 2016.
[32] Bing Jian, Baba C Vemuri, Evren Özarslan, Paul R Carney, and Thomas H Mareci. A novel tensor distribution model for the diffusion-weighted MR signal. NeuroImage, 37(1):164–176, 2007.
[33] Li Jing, Caglar Gulcehre, John Peurifoy, Yichen Shen, Max Tegmark, Marin Soljačić, and Yoshua Bengio. Gated orthogonal recurrent units: On learning to forget. arXiv preprint arXiv:1706.02761, 2017.
[34] David G Kendall. Shape manifolds, procrustean metrics, and complex projective spaces. Bulletin of the London Mathematical Society, 16(2):81–121, 1984.
[35] Hyunwoo J Kim, Nagesh Adluru, Maxwell D Collins, Moo K Chung, Barbara B Bendlin, Sterling C Johnson, Richard J Davidson, and Vikas Singh. Multivariate general linear models (MGLM) on Riemannian manifolds with applications to statistical analysis of diffusion weighted images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2705–2712, 2014.
[36] Hyunwoo J Kim, Nagesh Adluru, Heemanshu Suri, Baba C Vemuri, Sterling C Johnson, and Vikas Singh. Riemannian nonlinear mixed effects models: Analyzing longitudinal deformations in neuroimaging. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[37] Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. arXiv preprint arXiv:1802.03690, 2018.
[38] Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. A clockwork RNN. arXiv preprint arXiv:1402.3511, 2014.
[39] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks.
In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[40] Guy Lebanon et al. Riemannian geometry and statistical machine learning. LAP LAMBERT Academic Publishing, 2015.
[41] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[42] C. Lenglet, M. Rousson, and R. Deriche. DTI segmentation by statistical surface evolution. IEEE Transactions on Medical Imaging, 25(6):685–700, 2006.
[43] Jingen Liu, Jiebo Luo, and Mubarak Shah. Recognizing realistic actions from videos "in the wild". In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1996–2003. IEEE, 2009.
[44] Konstantinos Mammasis and Robert W Stewart. Spherical statistics and spatial correlation for multielement antenna systems. EURASIP Journal on Wireless Communications and Networking, 2010(1):307265, 2010.
[45] Maher Moakher and Philipp G Batchelor. Symmetric positive-definite matrices: From geometry to applications and visualization. In Visualization and Processing of Tensor Fields, pages 285–298. Springer, 2006.
[46] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 4694–4702. IEEE, 2015.
[47] Junier B Oliva, Barnabás Póczos, and Jeff Schneider. The statistical recurrent unit. arXiv preprint arXiv:1703.00381, 2017.
[48] FC Park and Bahram Ravani. Bezier curves on Riemannian manifolds and Lie groups with kinematics applications. Journal of Mechanical Design, 117(1):36–40, 1995.
[49] Xavier Pennec, Pierre Fillard, and Nicholas Ayache.
A Riemannian framework for tensor computing. International Journal of Computer Vision, 66(1):41–66, 2006.
[50] Sonia Pujol, William Wells, Carlo Pierpaoli, Caroline Brun, James Gee, Guang Cheng, Baba Vemuri, Olivier Commowick, Sylvain Prima, Aymeric Stamm, et al. The DTI challenge: toward standardized evaluation of diffusion tensor imaging tractography for neurosurgery. Journal of Neuroimaging, 25(6):875–882, 2015.
[51] Chris Quirk, Arul Menezes, and Colin Cherry. Dependency treelet translation: Syntactically informed phrasal SMT. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 271–279. Association for Computational Linguistics, 2005.
[52] Hesamoddin Salehian, Guang Cheng, Baba C Vemuri, and Jeffrey Ho. Recursive estimation of the Stein center of SPD matrices and its applications. In Proceedings of the IEEE International Conference on Computer Vision, pages 1793–1800, 2013.
[53] Bernhard Schölkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002.
[54] Shikhar Sharma, Ryan Kiros, and Ruslan Salakhutdinov. Action recognition using visual attention. arXiv preprint arXiv:1511.04119, 2015.
[55] Suvrit Sra. Positive definite matrices and the symmetric Stein divergence. Technical report, 2011.
[56] Anuj Srivastava, Ian Jermyn, and Shantanu Joshi. Riemannian analysis of probability density functions with applications in vision. In Computer Vision and Pattern Recognition, 2007. CVPR 2007. IEEE Conference on, pages 1–8. IEEE, 2007.
[57] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pages 843–852, 2015.
[58] Julian Straub, Jason Chang, Oren Freifeld, and John Fisher III. A Dirichlet process mixture model for spherical data.
In Artificial Intelligence and Statistics, pages 930–938, 2015.
[59] Umberto Triacca. Measuring the distance between sets of ARMA models. Econometrics, 4(3):32, 2016.
[60] Ruey S Tsay. Analysis of financial time series, volume 543. John Wiley & Sons, 2005.
[61] Pavan Turaga, Ashok Veeraraghavan, and Rama Chellappa. Statistical analysis on Stiefel and Grassmann manifolds with applications in computer vision. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
[62] Oncel Tuzel, Fatih Porikli, and Peter Meer. Region covariance: A fast descriptor for detection and classification. Computer Vision–ECCV 2006, pages 589–600, 2006.
[63] Bart Vandereycken. Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2):1214–1236, 2013.
[64] Zhizhou Wang and Baba C Vemuri. DTI segmentation using an information theoretic tensor dissimilarity measure. IEEE Transactions on Medical Imaging, 24(10):1267–1277, 2005.
[65] Jia Xu, Vamsi K Ithapu, Lopamudra Mukherjee, James M Rehg, and Vikas Singh. GOSUS: Grassmannian online subspace updates with structured-sparsity. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 3376–3383. IEEE, 2013.
[66] Yinchong Yang, Denis Krompass, and Volker Tresp. Tensor-train recurrent neural networks for video classification. arXiv preprint arXiv:1707.01786, 2017.
[67] Kaicheng Yu and Mathieu Salzmann. Second-order convolutional neural networks. arXiv preprint arXiv:1703.06817, 2017.
[68] Ernesto Zacur, Matias Bossa, and Salvador Olmos. Multivariate tensor-based morphometry with a right-invariant Riemannian distance on GL+(n).
Journal of Mathematical Imaging and Vision, 50(1-2):18–31, 2014.