Paper ID: 1235
Title: Multilinear Dynamical Systems for Tensor Time Series
Reviews

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The wealth of multi-dimensional structured data found in many applications requires methods for decomposition, modeling, and analysis. This paper considers time-evolving multi-dimensional arrays and proposes generalizations of vector/matrix dynamical systems. The paper is fairly well written and offers an exposition of basic tensor manipulations - I checked the math and it appears sound.

I have some comments and questions below:
1) Why are these multi-dimensional objects tensors? A multi-dimensional array is in general not a tensor, just a structured collection of indexed data. A tensor has more fundamental properties - invariance to coordinates, for example - as a tensor is a *geometric object* that defines relationships between vectors and matrices in generic coordinates. Perhaps the confusion lies in the common usage of "tensor" to describe the scalar coefficients of the tensor, which are indeed just the entries of a multi-dimensional array. I am sure I am not alone in being confused about how an array of financial time series can be regarded as a tensor.

2) There is considerable work in the fMRI literature on dynamics, factorizations, and [scalar] tensor coefficients. How does your work relate to, e.g., that of Beckmann et al.?

3) Given that the tensor coefficients may be reshaped using generalized vec() operators, and that tensor structure may be preserved using structured indicator matrices - so that the system can be solved using standard dynamical modeling - what is the added benefit of direct tensor modeling?

4) Parameter estimation in tensor, or high-dimensional structured, modeling is notoriously sensitive to priors, algorithm, latent dimension, and the amount of data. I found little comment regarding your experiences with these.

5) The choice of latent decomposition is critically dependent upon the choice of the MLDS dimensionality. The paper quotes 'set' values. What is the sensitivity of the forecast error to these?
Q2: Please summarize your review in 1-2 sentences
An interesting topic, with some incremental theoretical work. The paper suffers from a potential lack of detail with respect to critical parameter settings and the knock-on sensitivity analysis. Some of the concepts are obscured to a reader not familiar with tensors, especially given the confused definition of a "tensor".

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
In this paper the authors propose a natural extension of linear dynamical systems to deal with sequences of tensors. There are two advantages: first, the information conveyed in the structure of the tensors is kept; second, the tensor factorization of the transition and projection operators allows for keeping a low number of learning parameters. The paper is well written and the idea developed is novel as far as I can tell.

In the following I mention some small details:
1.- I understand that tensor A in Sec. 5.1 has dimensionality J_1 x ... x J_M x J_1 x ... x J_M. This should be clarified, because the previous time tensor A appears in the paper (in Lemma 1) it has a different dimensionality.
2.- \Gamma appears in the statement of Lemma 2 but is only defined later, in its proof.
3.- In the experiments the number of tensors in a sequence is denoted by N, whereas K is used before.

The experiments show the superiority of the proposed method with respect to the LDS, as one may expect a priori. It would also be interesting to report the running time (according to the description of the algorithm, it looks quite demanding).

I acknowledge that I have read the author rebuttal and the reviews. I think the paper is acceptable, although after reading the other reviews (Stephen's, especially point 3) and the rebuttal, I agree that I overestimated the paper a bit in my first appraisal and will decrease my score.
Q2: Please summarize your review in 1-2 sentences
Interesting extension of linear dynamical systems to consider tensor sequences. Novel method and well written paper.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a tensorial LDS model called MLDS. As such, the model is a restriction of the general Kalman filter model, with state transition and observation operators having a certain Kronecker product factorization. The model can be understood intuitively as a sequence of core tensors that form a Markov chain; the observed tensor at each time slice, as well as the core tensor at the next time slice, are obtained by Tucker models with tensor factors C and A, respectively. The term "multilinear" in the title is misleading in this respect, as the model is linear.
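
To make the "linear" point concrete: a Tucker/multilinear projection acts linearly on the vectorized core tensor, which can be checked numerically. A minimal sketch with hypothetical dimensions (illustrating the general matricization identity, not the authors' code):

    import numpy as np

    rng = np.random.default_rng(0)
    J = (2, 3, 4)   # hypothetical latent (core) dimensions
    I = (3, 4, 5)   # hypothetical observation dimensions
    Z = rng.standard_normal(J)
    C = [rng.standard_normal((I[m], J[m])) for m in range(3)]  # per-mode factors

    # multilinear (Tucker) projection: X = Z x_1 C1 x_2 C2 x_3 C3
    X = np.einsum('abc,ia,jb,kc->ijk', Z, C[0], C[1], C[2])

    # the same map on the vectorization (column-major): vec(X) = (C3 kron C2 kron C1) vec(Z)
    K = np.kron(C[2], np.kron(C[1], C[0]))
    assert np.allclose(K @ Z.reshape(-1, order='F'), X.reshape(-1, order='F'))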

After setting up the notation and background, the authors derive the posterior distribution of the hidden tensors given the observed tensor in Sec. 4, Lemma 2. This is a direct generalization of the standard marginalization and conditioning results for multivariate Gaussians. What is not discussed is the form of the state covariance and whether it admits a useful factorization (compare with matrix Gaussians, where the covariance is given as a Kronecker product). I suspect that this model does not admit such a structured covariance, but I would have welcomed at least a discussion of this issue, because calculating the covariance would be the computational bottleneck for large Z; this is not mentioned much here and is somewhat swept under the carpet.
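
For reference, the vector-case result that Lemma 2 generalizes is the standard Gaussian conditioning identity: if

$\begin{pmatrix} x \\ y \end{pmatrix} \sim \mathcal{N}\!\left(\begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}\right), \qquad \text{then} \qquad x \mid y \sim \mathcal{N}\!\left(\mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y),\; \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right).$

For a vectorized tensor state, $\Sigma_{yy}$ is $\prod_m I_m \times \prod_m I_m$ and $\Sigma_{xx}$ is $\prod_m J_m \times \prod_m J_m$, so forming and inverting these matrices is exactly the cost in question unless they inherit Kronecker structure.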

The ideas are quite analogous to the theory of linear dynamical systems (LDS), and the tensor notation is inevitably quite heavy, yet the authors did a pretty good job of keeping the notation intuitive by drawing analogies with the vector case. Perhaps not surprisingly, the paper concludes that by preserving the linear structure, MLDS is more effective than a plain LDS where the tensors are simply vectorized. However, the paper does not give much detail on inference (filtering/smoothing recursions) and instead focuses on parameter inference via EM.


Another issue I would have liked to see discussed is identifiability and the justification of the Tucker transition model. The states of an LDS are identifiable up to a similarity transform; what is the situation for the model discussed here? On the latter point, whilst the observation model is clear, it would be good at least to justify the Tucker transition model; does it have a physical meaning? Otherwise I would suspect that taking an unconstrained transition operator A, or the other extreme of taking A as the identity (e.g., a drifting core tensor), are equally viable.
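
To spell out the vector-case fact referred to here: for an LDS $z_{t+1} = A z_t + w_t$, $x_t = C z_t + v_t$, any invertible $T$ yields an equivalent model via

$z_t \mapsto T z_t, \quad A \mapsto T A T^{-1}, \quad C \mapsto C T^{-1}, \quad Q \mapsto T Q T^{\top}, \quad \mu_0 \mapsto T \mu_0, \quad \Sigma_0 \mapsto T \Sigma_0 T^{\top},$

which leaves the distribution over the observations unchanged. The question is which transforms preserve the Kronecker-factored form of the MLDS operators.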


Sec. 5.2 develops an EM algorithm for parameter estimation. This section is rather short and many details are omitted. The classical EM algorithm for learning an LDS derives update rules for each parameter, including the noise covariances. Although the authors state that 'MLDS inherits the E-step of the LDS and the M-step of the LDS for all parameters except A and C', it is not clear what they mean, and the update rules for the noise covariances cannot be seen at a glance. Furthermore, the authors do not explicitly state the update rules for A and C and merely derive the gradient. In Eq. (11), the authors use \phi and \varphi, which are defined via expectations. In the LDS, these expectations require smoothed estimates (in addition to filtered estimates). The paper nowhere mentions the smoothing recursions explicitly, and the reader has to rederive the results from scratch. From the description in the paper it is not clear how these expectations are computed, making the work hard to reproduce.
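
For the record, the quantities such an E-step needs are the standard smoothed moments of the (vectorized) LDS, obtained from the Rauch-Tung-Striebel recursions; in vector notation, with $T$ the sequence length (a sketch of the textbook result, not the authors' derivation):

$J_t = P_{t|t} A^{\top} P_{t+1|t}^{-1}, \qquad \hat{z}_{t|T} = \hat{z}_{t|t} + J_t(\hat{z}_{t+1|T} - \hat{z}_{t+1|t}), \qquad P_{t|T} = P_{t|t} + J_t(P_{t+1|T} - P_{t+1|t})J_t^{\top},$

from which the expected sufficient statistics follow as

$\mathbb{E}[z_t z_t^{\top}] = P_{t|T} + \hat{z}_{t|T}\hat{z}_{t|T}^{\top}, \qquad \mathbb{E}[z_{t+1} z_t^{\top}] = P_{t+1|T} J_t^{\top} + \hat{z}_{t+1|T}\hat{z}_{t|T}^{\top}.$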

Despite all the small glitches, I find the model interesting and well executed enough to deserve publication.
I am surprised that the literature does not contain special cases of this model, such as a 'matrix-variate Kalman filter'.


Short comments
The abstract and introduction should be clearer and state the precise results about inference and learning algorithms that are obtained in this paper.

The experimental section is good. If the authors plan to resubmit this paper, they should focus on Section 4 and Section 5.

The statement in the related work section (page 3, line 122) is not entirely correct, as GCTF is not limited to CP but can accommodate arbitrary factorization forms and fairly general noise models.

The statement in the conclusion, "permitting us to learn all but the transition and projection operators A and C of the MLDS via the Kalman filter", is vague (and actually incorrect, as the Kalman filter provides the filtered state estimates and covariances).

The citation to [18] looks off-topic and should either be omitted or elaborated further to stress why the non-existence may be an issue.
Q2: Please summarize your review in 1-2 sentences
A potentially useful and elegant special case of the LDS for tensor variate data is developed. The results would be hard to reproduce as details about inference are missing.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a formulation for linear dynamical systems in which the observed and latent variables are tensors. The key is to use the tensor-variate normal distribution in place of the standard vector-variate normal. The authors show that one can handle the tensor normal using a simple algebra between tensors that is quite similar to the standard vector/matrix algebra. They extend the standard EM approach for model inference in the LDS to tensor variables by taking advantage of the algebra they derive. Finally, they compare the model with a vectorized LDS for matrix time series (M=2) when the dimensionality of the latent tensor is known.
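
For readers unfamiliar with it, the tensor-variate (array) normal is, in its standard form, the Gaussian whose vectorization has Kronecker-structured covariance:

$\mathcal{X} \sim \mathcal{TN}(\mathcal{M};\, U_1, \dots, U_M) \;\Longleftrightarrow\; \operatorname{vec}(\mathcal{X}) \sim \mathcal{N}\big(\operatorname{vec}(\mathcal{M}),\, U_M \otimes \cdots \otimes U_1\big),$

where $U_m$ is the covariance factor for mode $m$ and column-major vectorization is assumed; for M=2 this reduces to the familiar matrix normal.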

This paper is well written. The authors give generic theorems for tensor algebra that are useful in handling the tensor-variate normal. To the best of my knowledge (I'm not extremely familiar with the domain, though), this is the first paper that derives an EM algorithm based on the tensor-variate normal. I think we should recognize the authors for the effort.

The extension of the LDS to tensors is interesting enough from a theoretical perspective, but I wonder whether such an extension is really rewarding. In fact, the authors only give M=2 examples for given latent dimensionalities. In that case, the tensor algebra reduces to the standard matrix algebra described in standard multivariate analysis textbooks. In a journal version of the paper, I would like to see simplified formulas customized for the M=2 case, where all the mysterious-looking symbols are not necessary.
Q2: Please summarize your review in 1-2 sentences
This paper appears to be the first work deriving an EM algorithm for a tensor-variate LDS based on a systematic treatment of the tensor-variate Gaussian. It should be accepted, although its practical utility is rather questionable for M larger than 2.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Response to Assigned_Reviewer_3
Thank you for the constructive feedback.

Our definition of a tensor as a multidimensional array is consistent with the definition in [2].

Thanks to your suggestion, we will discuss TPICA of "Tensorial extensions of independent component analysis for multisubject FMRI analysis" by Beckmann et al. (2005). The MLDS is composed of Tucker operators while TPICA belongs to the family of CP decompositions. Note that the CP decomposition is a special case of the Tucker decomposition.

We agree that a structured indicator matrix formulation is possible but believe it may lead to a more complicated representation of the same model. Our direct tensor modeling offers a great reduction in model complexity for tensor time series and the formulation is simple and intuitive.

When the tensors are vectorized, our model becomes a constrained LDS; our experiments focused solely on the consequences of those constraints. The number of parameters is determined by the MLDS dimensions. The forecast error is shown as a function of the number of parameters in Figure 3(d). Illustrating the forecast error as the number of dimensions of each mode increases would require a graph with M+1 axes, where M is the number of modes. Such a graph cannot be visualized in general.
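
As a rough illustration of how the parameter count scales with the chosen dimensions (a sketch with hypothetical mode sizes, assuming per-mode factored transition and projection operators):

    import math

    J = (5, 7, 9)     # hypothetical latent dimensions per mode
    I = (10, 20, 30)  # hypothetical observation dimensions per mode

    # factored operators: one small matrix per mode
    mlds_params = sum(j * j for j in J) + sum(i * j for i, j in zip(I, J))

    # vectorized LDS: full transition and projection matrices
    d_lat, d_obs = math.prod(J), math.prod(I)
    lds_params = d_lat * d_lat + d_obs * d_lat

    print(mlds_params, lds_params)  # a few hundred vs. roughly two million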

Response to Assigned_Reviewer_4
Thank you for pointing out the errata. We will fix all of them.

Each iteration for maximizing (11) takes O(p^M) time, where p is the number of elements of the largest projection matrix and M is the number of modes.

Response to Assigned_Reviewer_5
Thank you for the helpful comments. We will correct the GCTF statement. “Multilinear” in the title refers to the multilinear Tucker operators of the MLDS, which can be found in "Multilinear operators for higher-order decompositions" by Kolda (2006). We show that the multilinear operators of the MLDS can be represented as linear operators via matricization.

Note that when the tensors are vectorized, the MLDS becomes an LDS with additional constraints on the transition and projection operators. The smoothing equations for inference of the MLDS latent states are almost the same as those for the LDS. Due to the space limit, we omitted these details. We will make this clearer in the revision so that the reader understands that there is no need to rederive anything, and we will offer a longer technical report version for reproducibility. The matricized covariances of the MLDS are computed in a similar way to the covariances of the LDS. On this note, we found in the experiments that the main bottleneck in the computation lies in computing A and C, not in computing the covariances.

Although a discussion of identifiability is beyond the scope of our work, we believe the MLDS is also identifiable up to a similarity transform.

Response to Assigned_Reviewer_6
Thank you for the discussion. Because tensor methods such as ours are required for M>1, we chose M=2 examples for simplicity. Many M>2 examples exist; we discussed one of them in the introduction. Nevertheless, we will simplify the notation for the M=2 case.