NeurIPS 2020

Benchmarking Deep Learning Interpretability in Time Series Predictions


Review 1

Summary and Contributions: This paper introduces a set of benchmark datasets for time-series saliency methods, a set of associated metrics for evaluating such methods, an empirical finding that common methods produce poor saliency maps on time-series data, and a method for improving them.

Strengths:
- The empirical evaluation seems very solid.
- The comparison with CNNs makes it convincing that the saliency problems they identify really are specific to time-series models, which seems like a novel finding.
- Research on evaluating saliency methods with respect to objective ground truth is always nice to see and has generally been more significant than interpretability papers that are only grounded subjectively.
- Overall, the problem is relevant.

Weaknesses:
- As the authors acknowledge, they do not provide much theoretical insight into why such problems occur with saliency methods on time-series data, or why certain architectures (e.g. LSTMs) are generally "harder to interpret".
- The paper relies entirely on gradient-based saliency methods (even for the SHAP approximation), but there are more diverse, lower-tech approaches, and leaving them out makes the paper less informative. For example, I'd be curious to see an empirical comparison to simple masking-based saliency (i.e. the change in model predictions when a feature is set to 0 at a time step); a sketch of such an occlusion baseline is given below. Perhaps it would do comparatively better without TSR and would provide insight into why TSR is helpful.
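To make the suggestion concrete, here is a minimal sketch of such an occlusion baseline, assuming a PyTorch classifier that takes inputs of shape (time, features) and returns class logits; the function name and the zero baseline are illustrative choices, not anything taken from the paper.

    import torch

    def occlusion_saliency(model, x, target_class, baseline=0.0):
        # Masking-based saliency: score each (time, feature) cell by the drop in the
        # target-class probability when that cell is replaced with a baseline value.
        model.eval()
        with torch.no_grad():
            base_prob = torch.softmax(model(x.unsqueeze(0)), dim=-1)[0, target_class]
            T, F = x.shape  # time steps, features
            saliency = torch.zeros(T, F)
            for t in range(T):
                for f in range(F):
                    x_masked = x.clone()
                    x_masked[t, f] = baseline            # occlude one feature at one time step
                    prob = torch.softmax(model(x_masked.unsqueeze(0)), dim=-1)[0, target_class]
                    saliency[t, f] = base_prob - prob    # larger drop => more important
        return saliency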

Correctness: Overall, the claims and methodology seem sound.

Clarity: The paper is well and clearly written. However, the figures are definitely too small -- at least the ones which were squeezed into the main paper, rather than the supplement :)

Relation to Prior Work: Although I am not extremely well versed in the time-series literature, it seems like the paper does a good job discussing its relationship to the current state of the field.

Reproducibility: Yes

Additional Feedback: Update in response to author feedback: I still feel this paper is good and worth including, despite other reviewers' concerns that the contribution is too marginal. I also think the benchmark datasets this paper introduces are particularly valuable.


Review 2

Summary and Contributions: This submission compares the performance of various saliency-based interpretability methods across diverse neural architectures on a set of synthetic datasets. The authors also propose a simple two-step temporal saliency rescaling approach to improve the performance for time series data. Overall, this work is OK, but lacks in-depth analysis and novelty.

Strengths: It is meaningful work to establish a test dataset for comparing saliency-based interpretability methods on time-series data. The new, simple method proposed in the submission also achieves a significant performance improvement on this synthetic dataset.

Weaknesses: This submission lacks depth and novelty, and has limited contributions to the machine learning community. The tested dataset is an extension of previous synthetic data, and it is a question to me whether this dataset covers most of the issues real data has. This paper also lacks a theoretical analysis of why the previous methods fail and why the proposed approach performs better. There is no in-depth discussion of the limitations of the new approach. In addition, there is a lack of guidance on the selection of the threshold in the algorithm.

Correctness: Yes, I think the proposed method is reasonable.

Clarity: Yes, this submission is well written and well organized.

Relation to Prior Work: Yes, the authors clearly discussed the related works in the Background and Related Work section.

Reproducibility: Yes

Additional Feedback: Thank you to the authors for addressing my concerns in the rebuttal. I keep my score unchanged because I still think the novelty of this submission is not sufficient for a NeurIPS publication.


Review 3

Summary and Contributions: This paper is an extensive evaluation of the performance of saliency- and gradient-based methods (developed extensively for images) for time-series (classification) interpretability. The time-series setup is interesting because the importance of features can change over time in the classification setting. The claim of the paper is that saliency methods do not extend well to time series. Based on this observation, the authors propose a new algorithm that first finds important time steps, by masking each time step and evaluating the change in the relevance score, and then obtains feature importance at the relevant time points. Experimental evaluation shows that, among the class of saliency methods, the authors' proposed method provides better interpretability on the evaluated datasets.
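As I read it, the described two-step procedure amounts roughly to the sketch below. This is only my reading of the summary above, not the authors' actual implementation; the grad_saliency argument (any per-cell relevance function), the thresholding rule, and the (time, features) input shape are assumptions.

    import torch

    def temporal_saliency_rescaling(model, x, target_class, grad_saliency, alpha=0.5):
        # Step 1: mask each time step and measure the change in the total relevance
        # score to obtain a per-time-step importance.
        R = grad_saliency(model, x, target_class)         # baseline relevance map, shape (T, F)
        T, F = x.shape
        time_importance = torch.zeros(T)
        for t in range(T):
            x_masked = x.clone()
            x_masked[t, :] = 0.0                          # mask the whole time step
            R_masked = grad_saliency(model, x_masked, target_class)
            time_importance[t] = (R - R_masked).abs().sum()
        # Step 2: compute feature importance only at time steps deemed relevant,
        # rescaled by the per-time-step importance.
        rescaled = torch.zeros(T, F)
        for t in range(T):
            if time_importance[t] > alpha * time_importance.max():
                rescaled[t] = time_importance[t] * R[t].abs()
        return rescaled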

Strengths: The experimental evaluation is quite extensive, and the work is relevant to the NeurIPS audience. I have some questions regarding the empirical evaluation, which I have outlined below; if they are not addressed, I am uncertain of the soundness of the claims, and they also diminish the novelty of the contribution to some extent. I did see some useful discussions around the choice of evaluation metrics for saliency methods.

Weaknesses:
1. My first concern is that the setup is not explained clearly. The Problem Definition should be more elaborate. Is it a time-series classification setting? Will the conclusions apply if the label is available for every time instance t? Are all time points explaining the prediction y at t=T? That is, what is S(X) representing?
2. I think the premise that gradient-based saliency methods might work in time-series settings is valid, but I am not entirely convinced why they will work, knowing that RNNs and LSTMs suffer from the vanishing gradient problem, which seems to have inspired the work of [21] Aya Abdelsalam Ismail, Mohamed Gunady, Luiz Pessoa, Hector Corrada Bravo, and Soheil Feizi. Input-cell attention reduces vanishing saliency of recurrent neural networks. In Advances in Neural Information Processing Systems 32, 2019. Nonetheless, there is some merit in exhaustively evaluating existing methods on time-series data, which brings me to my concern around how such datasets were designed.
3. 6 out of 7 datasets used for this extensive evaluation are not time-series data but image data where one spatial axis is assumed to be temporal. I have major concerns about evaluating time-series methods this way. It is highly unclear (nor is it justified in the paper) why this is a reasonable evaluation setup. Why weren't temporal generative processes, state-space models, or even Gaussian processes used as reliable data-generating mechanisms on which these methods are evaluated (a minimal sketch of such a generator is given after this list)? The main insight would still likely be that saliency methods are noisy and that the architectural challenges of RNNs around vanishing saliency will in fact affect the quality of the explanations. The authors finally mention this in Line 226: "These observations suggest that saliency maps fail when feature and time domains are conflated". I think reaching this conclusion doesn't say much about the methods at all and is more a claim about the dataset generation and its assumptions.
4. The authors claim in the introduction that they have compared the SHAP method, but I only see SHAP + gradient derivatives rather than the original SHAP method. This method is not developed for time series, only for tabular data, so it is unclear how they would do this evaluation, nor is it described anywhere in the main draft.
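For concreteness, one possible generator along the lines suggested in point 3, using a Gaussian process prior over time and a class-dependent informative window; this is purely illustrative and not something taken from the paper.

    import numpy as np

    def make_gp_series(T=100, n_features=5, length_scale=10.0, seed=0):
        # Each feature is drawn from a Gaussian process with an RBF kernel over time,
        # and a class-dependent informative window is planted in one feature, so the
        # ground-truth salient region is known by construction.
        rng = np.random.default_rng(seed)
        t = np.arange(T)
        K = np.exp(-0.5 * ((t[:, None] - t[None, :]) / length_scale) ** 2)
        K += 1e-6 * np.eye(T)                               # jitter for numerical stability
        X = rng.multivariate_normal(np.zeros(T), K, size=n_features).T   # shape (T, n_features)
        label = int(rng.integers(2))
        start = 10 if label == 0 else 60                    # window position encodes the class
        X[start:start + 10, 0] += 2.0
        return X, label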

Correctness: It will help if the authors clarify all the concerns I raised above. As of now, I do not believe the methodology is entirely correct. The claim that saliency methods cannot identify important features but can identify important time steps needs to be tested on realistic time series rather than on images where one axis is assumed to be temporal; the correlations found in genuinely temporal settings should not be conflated with such image-based correlations.

Clarity: The paper is well written and all conclusions clearly stated.

Relation to Prior Work: Yes prior work is clearly outlined, motivation of the study is clear, all baseline descriptions are clear.

Reproducibility: Yes

Additional Feedback: I have read the author response and acknowledge that the authors have taken a lot of care to address the concerns I raised. With that in mind, I have raised my score from 3 to 4. I suggest the authors include other attribution baselines relevant to time series in their evaluation (beyond the standard saliency ones) to form a convincing argument for their two-step saliency approach.
============================================================
Please see above for all concerns raised.


Review 4

Summary and Contributions: The main contributions of this paper are: (1) presenting an extensive study and analysis of existing saliency-based interpretability methods on temporal data, and (2) proposing a two-step temporal saliency rescaling (TSR) approach.

Strengths: The strengths are mentioned in the previous section. In addition, the proposed method performs better than existing ones.

Weaknesses: The authors did a great job of studying and analysing the previous methods, but the proposed method is computationally expensive and not novel.

Correctness: Yes, claims and method are sound and correct.

Clarity: This paper is well-written.

Relation to Prior Work: Yes. Prior works are discussed and the difference is identified properly.

Reproducibility: Yes

Additional Feedback: