__ Summary and Contributions__: This paper addresses the multivariate time-series forecasting problem. The authors propose StemGNN, which captures inter-series correlations and intra-series patterns using the GFT and the DFT, respectively. Specifically, the authors define a new structure, the StemGNN block, that handles both pattern learning and prediction. StemGNN is made up of two StemGNN blocks, which make full use of the data and the residual. To the authors’ knowledge, this paper is the first to adopt such ideas.

__ Strengths__: 1. This paper combines inter-series and intra-series pattern extraction, which offers a new way to make predictions in multivariate time series. Traditional methods for multivariate time-series prediction depend on a prior multivariate dependency structure that is usually defined manually; the authors instead use the GFT to model inter-series correlations. Furthermore, the DFT is used to capture intra-series temporal dependencies, jointly with the GFT in the spectral domain.
2. StemGNN is made up of two StemGNN blocks defined by the authors, which make full use of both the data and its residual; this helps in learning intra-series features. In addition, it trains on both forecast and backcast results, which are described in detail in the paper. The model depicts the patterns and relationships clearly.
3. Experiments on 9 real-world datasets show good results on three metrics (MAE, RMSE, and MAPE) and good generality as well. The authors also design several ablation studies to demonstrate the significance of each part.
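For reference, the three metrics named in point 3 can be sketched as follows. This is a minimal illustration using the standard textbook definitions, not the authors' evaluation code; the function names and the toy values are assumptions made for the example, and MAPE is reported in percent and assumes no zero targets.

```python
import math

def mae(y_true, y_pred):
    # Mean absolute error.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Root mean squared error.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    # Mean absolute percentage error, in percent.
    # Assumption: no entry of y_true is zero.
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy values, purely illustrative.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]

print(mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```

MAE and RMSE are scale-dependent, while MAPE is relative, which is why the three can rank models differently across datasets.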

__ Weaknesses__: 1. A stacked network does bring more capacity, but the more complex the network is, the more its training costs. It would be better to show how the metrics vary during training.
2. In the test phase, how long does the model take to make predictions? If its time complexity is linear or sublinear in the prediction length, applications will be easier.
3. In the w/o residual ablation experiment, the residual connection is removed rather than the second StemGNN block. I wonder how much the performance would decline if the second block were removed.

__ Correctness__: This paper proposes a brand-new idea for solving the multivariate time-series prediction problem by incorporating intra-series patterns and inter-series relations. Relations are revealed by the latent correlation layer, and patterns are learned by a seq-to-seq cell. On the basis of the experimental results, the approach can be regarded as correct.

__ Clarity__: In general, this paper is well written. However, line 190 on page 5 reads: “ is the convolution kernel with the size of 3 in our experiments, is the the Hadamard product and nonlinear sigmoid gate determines how much information in the current input is closely related to the sequential pattern”; this sentence contains a redundant “the”.
Another thing that could be improved is the missing notation table. Several symbols are used in the paper, along with several variants of them that represent different states. While reading the paper, I sometimes had to go back and check the meaning of a symbol.

__ Relation to Prior Work__: In the second section, the authors discuss related work at length and analyze its weaknesses and its differences from the proposed model. First, the authors divide time-series forecasting into two directions: univariate techniques and multivariate techniques. They then describe several univariate techniques, such as FC-LSTM and N-BEATS, and analyze the weakness of this kind of technique, which can be summarized as not considering correlations between different time series. For multivariate techniques, the authors list many examples. Traditional methods such as TCN and DeepGLO tend to treat multivariate time series as a tensor and use matrix techniques to learn the features. DCRNN, ST-GCN, and GraphWaveNet incorporate graph convolutional networks to capture dependencies such as spatio-temporal dependencies. However, their biggest weakness is that they either ignore inter-series correlations or require a dependency graph as a prior. In addition, to the authors’ knowledge, the proposed StemGNN is the first to capture temporal patterns and multivariate dependencies jointly in the spectral domain. These two points clearly distinguish it from previous contributions.

__ Reproducibility__: Yes

__ Additional Feedback__: Figure 2 in the COVID-19 analysis shows that the model predicts only the trend, and fluctuations are largely missed (especially in the Germany subfigure). This may be caused by data quality; it would be better to add prediction figures for other datasets.
Stacked networks are being used more broadly; however, the more complex the network is, the higher its computation cost. I wonder how big the difference is between a single StemGNN block and two StemGNN blocks. I would also like to see how the metrics vary during training, because sometimes a dataset cannot support the successful training of a deep network.
As you mention at the end of the paper, “directly applying eigenvalue decomposition is prohibitive for very large graphs high-dimensional time-series”. I would also like information about the test phase: how long does the model take to make predictions? I believe a time complexity that is linear or sublinear in the prediction length would make applications easier.

__ Summary and Contributions__: The authors propose a novel neural network architecture for forecasting multivariate time series. The method estimates a correlation matrix, which along with the data is fed into a StemGNN layer. The StemGNN layer is boosted once. The whole network is trained to minimize a joint forecasting and backcasting loss. The StemGNN transforms the data and correlation matrix into the spectral domain to linearly decouple the components. It then featurizes the components and transforms them back to the time domain, yielding both a forecast

__ Strengths__: The authors present a strong set of comparisons and ablative evidence in favor of their claim that the proposed architecture establishes a new state of the art in multivariate time-series forecasting. The method is conceptually appropriate and appears sound.

__ Weaknesses__: The method appears to assume a static correlation matrix. How would the method perform if correlations were not static?
The authors make statements suggesting the method identifies "clear patterns" (ln 13) but give only a correlation matrix as an example. I find the automatic calculation of series correlations to be only a small benefit. One could after all compute the empirical covariance matrix from data directly. To be sure, the network can tune its believed covariances to enhance predictive performance, so joint calculation is different than direct calculation.
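To make the "direct calculation" alternative concrete, here is a minimal sketch of computing an empirical correlation matrix straight from the data. The synthetic series and variable names are assumptions for illustration only; nothing here is taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500  # number of time steps (illustrative)

# Three synthetic series: the first two share a common driver,
# the third is independent noise.
base = rng.standard_normal(T)
X = np.stack([
    base + 0.1 * rng.standard_normal(T),
    base + 0.1 * rng.standard_normal(T),
    rng.standard_normal(T),
])  # shape (N, T) with N = 3

# np.corrcoef treats each row as one variable by default.
C = np.corrcoef(X)
print(np.round(C, 2))
```

Such a matrix is fixed once computed, whereas a learned correlation layer can be adjusted by the forecasting loss, which is the distinction drawn above.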

__ Correctness__: The methods and evaluation appear correct.

__ Clarity__: The paper is well written and easy to follow.
Figure 2 could use a more expressive caption.

__ Relation to Prior Work__: Prior work and its connection to the presented material is clearly discussed.

__ Reproducibility__: Yes

__ Additional Feedback__: In the rebuttal the authors have clarified my concerns. I would ask that they try to make the role of the correlation matrix as clear as possible in the camera ready version.

__ Summary and Contributions__: This paper addresses the problem of multivariate time-series prediction. The premise of the problem is, given N possibly correlated time series, predict the next H time steps for each of the time series. The paper improves on existing methods by proposing a novel deep neural network based algorithm that simultaneously accounts for the “spatial” and temporal correlations.
The proposed algorithm first constructs an adjacency matrix to capture the similarity between the different time series by using a self-attention based similarity measure. After this, the data is passed through two “stemGNN” blocks, with each block as described below.
The data X (of size NxT, where N is the number of nodes and T the number of time steps) is then transformed into the eigenspace of the above adjacency matrix. The transformed set of N time series is then passed through a DFT block, and then through a 1D convolutional layer. The signal is now in the spectral domain along both the spatial and temporal axes. A graph convolution operation is then applied using the explicit eigen-decomposition computed earlier. Finally, each of the time series is transformed back into the canonical domain and passed through two separate neural networks, one for forecasting each series and the other for “backcasting”.
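The two spectral transforms described above can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions, not the authors' implementation: the per-series representation `R` and the projections `Wq`, `Wk` are random stand-ins for learned quantities, the attention matrix is symmetrized so that a symmetric eigensolver applies, and the learned layers (1D convolution, graph convolution, forecast and backcast networks) are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, d = 4, 16, 8  # series, time steps, feature width (all illustrative)

X = rng.standard_normal((N, T))

# 1. Attention-style adjacency from per-series representations.
#    R, Wq, Wk are hypothetical stand-ins for learned quantities.
R = rng.standard_normal((N, d))
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
scores = (R @ Wq) @ (R @ Wk).T / np.sqrt(d)
scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
A = np.exp(scores)
A /= A.sum(axis=1, keepdims=True)
A = 0.5 * (A + A.T)  # assumption: symmetrize so eigh is applicable

# 2. Normalized graph Laplacian and its eigen-decomposition.
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(N) - D_inv_sqrt @ A @ D_inv_sqrt
eigvals, U = np.linalg.eigh(L)

# 3. Graph Fourier transform: project the N series onto the eigenvectors.
X_gft = U.T @ X

# 4. Discrete Fourier transform along the time axis.
X_spec = np.fft.rfft(X_gft, axis=1)

# (The learned layers would operate on X_spec here.)

# 5. Invert both transforms to return to the original domain.
X_back = U @ np.fft.irfft(X_spec, n=T, axis=1)
print(np.allclose(X_back, X))
```

Because both transforms are orthogonal (up to the usual DFT scaling), the round trip is lossless when no learned layer sits in between; any change in the output comes from the layers applied in the spectral domain.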
Contributions:
The main contributions of the paper are:
1. A novel approach to handling the multivariate time-series prediction problem by first applying a spectral transformation on both the spatial and temporal axes.
2. A self-attention based graph adjacency matrix construction method that can help identify connections across different time series.
3. Improved empirical performance across 9 different public datasets and demonstration of the utility of the model on COVID-19 cases prediction.
%%%% Edit after rebuttal %%%%%%
The authors have sufficiently addressed my concerns. I already had a good recommendation for this paper and will retain it.
%%%%%%%%%%%%%%%%%%

__ Strengths__: Strengths:
1. The paper is very clearly written and is easy to understand. The proposed method is thoroughly validated using experiments, including ablation studies on the model and a high-level analysis of the model’s predictions on the COVID-19 dataset. I especially appreciate the visualization of the eigenvectors.
2. The proposed method has a good empirical performance on the datasets considered and does better than all of the baselines considered. The gain in performance is considerable (if not stellar). However, the proposed idea is interesting and justifies the good performance.
3. The proposed architecture is well-motivated and the idea of using operations in the 2D spectral domain is very interesting.

__ Weaknesses__: Weaknesses:
1. The proposed model has many good attributes, but one disadvantage that stands out to me is the complexity of explicitly computing the eigen-decomposition of the graph Laplacian. In my opinion, the merits of the paper outweigh this drawback, but it is something the paper can improve upon. The O(N^3) complexity can easily be prohibitive on large datasets. For example, the COVID analysis is done at a country level. If this resolution were to be increased to state/county level, the resulting graph would be massive. I wonder whether using approximations of the eigen-modes would affect the performance of the model.
Also, going by the results of the ablation study, removing the GFT leads to only an incremental reduction in performance. So I wonder how critical the GFT is, given that it is an expensive operation.
2. The paper could also mention or give examples where the time-series prediction did not do well. For example, in the COVID-19 dataset, the UK and Russia are learnt to be highly correlated, even though they are not neighboring countries. Further, were there countries where the predictions were not great? How did the trajectories of these countries relate to the top eigenvectors of the Laplacian?
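On the eigen-decomposition cost raised in point 1, one well-known way to avoid it is to restrict the spectral filter to a polynomial in the Laplacian, which can then be applied with repeated matrix products and no eigenvectors at all. The sketch below is not something the paper proposes; it is a swapped-in technique shown only to illustrate the trade-off, and the graph, the signal, and the coefficients `theta` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 6, 10  # nodes and time steps (illustrative)

# Random symmetric weighted graph and its normalized Laplacian.
A = rng.random((N, N))
A = 0.5 * (A + A.T)
np.fill_diagonal(A, 0.0)
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(N) - D_inv_sqrt @ A @ D_inv_sqrt

X = rng.standard_normal((N, T))
theta = np.array([0.5, -0.3, 0.1])  # hypothetical filter coefficients

def spectral_filter(L, X, theta):
    # Explicit route: eigen-decompose, scale each graph frequency, transform back.
    eigvals, U = np.linalg.eigh(L)  # O(N^3)
    response = sum(c * eigvals**k for k, c in enumerate(theta))
    return U @ (response[:, None] * (U.T @ X))

def polynomial_filter(L, X, theta):
    # Eigenvector-free route: accumulate sum_k theta_k * L^k X.
    out = np.zeros_like(X)
    Lk_X = X.copy()
    for c in theta:
        out += c * Lk_X
        Lk_X = L @ Lk_X
    return out

print(np.allclose(spectral_filter(L, X, theta), polynomial_filter(L, X, theta)))
```

The two routes agree exactly for polynomial responses. The second needs only K products with L, which is cheap when L is sparse, at the price of restricting the filter to low-order polynomials.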

__ Correctness__: The proposed method and the experiments are technically sound.

__ Clarity__: The paper is very clear to read.

__ Relation to Prior Work__: Mostly yes. However, after a search for relevant papers, I noticed that citation [10] is never actually referred to in the paper (I double-checked this, but there is a small chance I missed it). I wonder how the results of that paper compare against the proposed method. In general, it may not be a good idea to include a paper only in the references and not mention it in the body of the paper.

__ Reproducibility__: Yes

__ Additional Feedback__: I really enjoyed reading this paper and I liked the proposed ideas.

__ Summary and Contributions__: The paper proposes an architecture for multivariate time-series forecasting that manages to capture inter- and intra-series correlations in the frequency domain without requiring prior domain knowledge. The architecture is experimentally validated through an ablation study that indicates that all of its portions are indispensable.

__ Strengths__: The model outperforms existing models on 3 benchmarks and a real-world study while being able to identify plausible correlations among the time series.

__ Weaknesses__: 1) The authors do not compare with the model in [15]:
"Modeling long-and short-term temporal patterns with deep neural networks." This restricts the potential impact of the model.
2) The model has many components whose hyperparameters are not fully provided (one may have to trace them in the source code).
3) The paper does not propose a conceptual or computational novelty. It combines existing modules to achieve its results.

__ Correctness__: 1) The authors should report learning curves to demonstrate convergence of the learning algorithm.
2) For the experiments in Table 1, what is the value of H?

__ Clarity__: The paper was incoherent at some points. In particular,
1) In Figure 1, it is not explained what the Yi are.
2) In equation (2), B() (the backcasting loss) is not defined.
3) In equation (3), are w^q and w^k learnable parameters?
4) How many eigenvectors are used in the GFT?

__ Relation to Prior Work__: The major difference from previous works, which is clearly described, is that the proposed model uses a data-driven approach for identifying inter-series correlations.

__ Reproducibility__: No

__ Additional Feedback__: I would suggest that the order of the descriptions in section 4.3 match the order of the components of the model in Figure 1 (for example, the description of Spectral Graph Convolution shouldn't be placed before the Spe-Seq Cell).