NeurIPS 2020

Adaptive Online Estimation of Piecewise Polynomial Trends


Review 1

Summary and Contributions: This paper studies the problem of forecasting a TV^k-bounded sequence and gives an algorithm that attains the optimal regret bound. The main contribution is to connect batch non-parametric regression to online stochastic optimization. The authors also use techniques from wavelet computation, which is a very interesting connection.

Strengths: The main contribution of this paper is closing the gap between the previous upper and lower bounds. Moreover, the techniques and connections developed in the paper are of independent interest and might be useful for future work on related problems.

Weaknesses: The writing quality of this paper can be improved. Specifically, the introduction could start with a broader picture of the problem and give more intuition about the connection between online estimation and the wavelet techniques.

Correctness: The technical content in the main paper looks valid; however, I did not get a chance to verify all the details in the appendix.

Clarity: The presentation of the paper can be improved. In particular, the structure of the paper, especially the introduction, could be better organized.

Relation to Prior Work: The paper makes a comprehensive comparison with previous work. It would be nice if the authors could say more about the differences between the techniques used in this paper and those in prior work.

Reproducibility: Yes

Additional Feedback: After the response phase, and considering the additional feedback received, I maintain my initial assessment of the paper.


Review 2

Summary and Contributions: This paper proposes a new method for online trend estimation. The method is designed to operate in a streaming data setting, where it sequentially predicts the elements of a time series given observations of all previous points. The objective is to minimise the cumulative error of the predictions over some fixed horizon. The authors' proposed method operates in a framework similar to that of Baby and Wang (2019, NeurIPS). In that paper, the time series being predicted is assumed to have bounded total variation. In the present paper, this restriction is generalised and the authors consider time series with bounds on higher order total variation - i.e. they difference the time series k > 1 times and assume a bound on the magnitude of the resulting sequence. In line 246 the authors state this is equivalent to a bound on the l_1 norm of the (k+1)^{th} derivative. The proposed method predicts the next observation as an exponentially weighted function of previous data. Rather than using all of the previously observed data, the algorithm incorporates a wavelet-based change detection method to detect substantial non-stationarities and limits how far back in the data to go based on this. This approach is similar in structure to the algorithm in Baby and Wang (2019), but the wavelet-based procedure is different. The first difference is in the way that sequences of length 2^p, where p is an integer, are constructed (as required for the wavelet transform): the present paper creates two overlapping sequences instead of padding with extra 0s. The second difference is the threshold for declaring a change, which is tuned to the more general framework of the present paper. The authors prove an order-optimal bound on the accumulated regret of the approach, and discuss extensions to more complex problems: alternative norm bounds on the (k+1)^{th} derivative, and the setting of sparse changes.
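To make these two ingredients concrete, here is a minimal Python sketch (my own illustration, not code from the paper): higher_order_tv computes the l_1 norm of the k-times-differenced sequence, and overlapping_dyadic_blocks forms two overlapping blocks of power-of-two length rather than zero-padding. The function names and the exact splitting rule are my assumptions.

```python
import numpy as np

def higher_order_tv(theta, k):
    """l1 norm of the k-times-differenced sequence (an illustrative TV^k-style measure).

    Note: the paper's exact definition/normalization may differ; this is only a sketch.
    """
    return np.abs(np.diff(theta, n=k)).sum()

def overlapping_dyadic_blocks(y):
    """Split y into two overlapping blocks whose lengths equal the largest power of two
    not exceeding len(y), instead of zero-padding up to the next power of two.
    The exact splitting rule used in the paper may differ; this is only a sketch."""
    n = len(y)
    m = 2 ** int(np.floor(np.log2(n)))
    return y[:m], y[n - m:]

# Example: a piecewise-linear trend has second-order differences that are zero
# everywhere except at the single kink, so its "TV^2-style" measure is small.
theta = np.concatenate([np.linspace(0.0, 1.0, 50), np.linspace(1.0, 0.5, 51)[1:]])
print(higher_order_tv(theta, k=2))   # small: only the kink contributes
a, b = overlapping_dyadic_blocks(theta)
print(len(a), len(b))                # both are powers of two (here 64 and 64)
```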

Strengths: The proposed method is an interesting generalisation of the ARROWS algorithm from Baby and Wang (2019). The theory is, as far as I can tell, correct and guarantees that the authors have an order-optimal approach. The authors present their theoretical contributions nicely alongside those of other work, which makes their contribution in this regard clear. I am not the most familiar with this area and as such find it a little difficult to comment on the novelty of the contribution - it seems to make material improvements/extensions to the approach of Baby and Wang (2019), and the proof seems to use new ideas related to the particular wavelet transform.

Weaknesses: As I will discuss below, I think the principal weaknesses of the paper are in the clarity of its exposition.

Correctness: As far as I can tell the theoretical results are sound. The intuition behind the method seems appropriate to the problem. There is no empirical work to assess.

Clarity: I think the writing of the paper could be improved; there were several (key) passages I had to read multiple times to get the message. The most important example of this was the description of the policy in Section 4.2. A good start to making this easier to understand would be a few sentences at the outset describing the general shape of the algorithm without notation (i.e. construct an estimator and compute residuals -> wavelet decomposition of these residuals -> etc.). When introducing the notation T and x and the functions pack and recenter, a little more explanation of what they are used for at that point would also help the reader. In terms of the notation, I think it would be better to replace $T$ with $T_\beta$, as it was not initially clear to me where the $\beta$ term that controls the high-probability bound came into the algorithm. Further, I think the paper should have a proper introductory paragraph stating its scope and aims, rather than diving straight into a list of notation. Is 'comparator sequence' ever really defined? Section 4.3, which gives the performance guarantees, just starts with theory; it might be better structured with some linking text explaining the purpose of the theorems, rather than a sequence of theorem statements and disjoint remarks.

Relation to Prior Work: As the work is so closely related to Baby and Wang (2019), I think a clear paragraph stating "this is what that paper does, and this is how this paper is different" would be really helpful to the reader. It seems to me that the authors have done their due diligence in referencing related work, although I am not an expert in this sub-domain, so other reviewers may be better qualified to comment on this.

Reproducibility: Yes

Additional Feedback: UPDATE: Thank you for your response to the reviews. In light of the other positive reviews and your commitment to improve the exposition, I have increased my score.


Review 3

Summary and Contributions: This paper considers non-stationary stochastic optimization with squared error losses and noisy gradient feedback. The main proposal is a new variational constraint that generalizes previous ones used in dynamic regret analysis. The new constraint captures both the sparsity and the intensity of changes in the underlying dynamics. On the algorithmic side, several seemingly disparate components (the VAW forecaster and CDJV wavelets) are combined, which is of interest to both the online learning and statistical learning communities.
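For intuition, one plausible way to encode both aspects is a comparator class of the following form (this is my own sketch of what "sparsity and intensity" could mean, not necessarily the paper's exact definition):

```latex
\[
  \mathcal{C}(s, C_n) \;=\; \Bigl\{\, \theta_{1:n} \;:\;
      \|D^{k+1}\theta_{1:n}\|_0 \le s, \quad
      \|D^{k+1}\theta_{1:n}\|_1 \le C_n \,\Bigr\},
\]
```

where $D^{k+1}$ denotes the $(k+1)$-th order difference operator, the $\ell_0$ term bounds the number of change points (sparsity), and the $\ell_1$ term bounds their total magnitude (intensity).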

Strengths:
+ The new variational budget strictly generalizes those of previous studies [Besbes et al., 2015; Baby and Wang, 2019]. It can capture both the sparsity and the intensity of changes in the underlying dynamics, so in scenarios that indeed satisfy the piecewise-stationary assumption, the proposed algorithm will enjoy the desired guarantees, both empirically and theoretically.
+ Several new pieces from the offline statistical community are combined and introduced to the online nonparametric community (to the best of my knowledge), and the analysis is non-trivial and likely useful for other studies.
+ The justification and the empirical demonstration of the scaling of n^k are nice; both of them address my concerns about the scaling issue.

Weaknesses: As far as I can see, the results of this paper only hold for the square loss, because the algorithm and analysis rely heavily on access to an unbiased gradient estimate. This access is due to the special form of the square loss (as shown in line 60). So my question is: can the results be generalized to more general function classes? Is there any lower bound justifying the optimality of the results? It seems that the minimax lower bound of Proposition 11 only covers the special case k = 0.
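To spell out this point, the special structure of the square loss gives an unbiased gradient estimate under additive zero-mean noise, roughly as follows (a sketch in my own notation, which may differ from the paper's line 60):

```latex
\[
  f_t(x) = (x - \theta_t)^2, \qquad y_t = \theta_t + \epsilon_t, \quad \mathbb{E}[\epsilon_t] = 0,
\]
\[
  g_t(x) := 2\,(x - y_t) \;\Longrightarrow\;
  \mathbb{E}[g_t(x)] = 2\,(x - \theta_t) = \nabla f_t(x).
\]
```

So the plug-in gradient computed from the noisy observation is unbiased; for a general loss, $\mathbb{E}[\nabla f_t(x; y_t)]$ need not equal $\nabla f_t(x; \theta_t)$, which is why extending the analysis beyond the square loss seems non-trivial.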

Correctness: Correct as far as I have checked.

Clarity: Yes.

Relation to Prior Work: Yes, it is clearly discussed in Appendix B.

Reproducibility: No

Additional Feedback: It would be better to add some high-level proof ideas in Section 4.3 to improve readability.