NeurIPS 2020

Online Multitask Learning with Long-Term Memory


Review 1

Summary and Contributions: The paper defines a multi-task learning framework that essentially combines online learning in multi-task models with models that switch between hypotheses, in the regime where the number of switches is much larger than the total number of distinct hypotheses that appear. The other main contributions are efficient algorithms in this model for the cases of a finite hypothesis class and of hypotheses drawn from an RKHS.

Strengths: The novel model defined in this paper may be of interest.

Weaknesses: The motivation behind the proposed setting is unclear. The paper is very hard to read (at least for someone with no specific background in multi-task learning); I am not sure I understood the model itself or the contribution (see "Clarity" for more details).

Correctness: The results and claims are of a theoretical/algorithmic nature and proofs seem well grounded.

Clarity: The novel setting introduced in the paper is not well explained. In particular, does a sequence of tasks arrive with all instances of a task appearing sequentially, or does an instance from some task arrive at each round according to an internal counter? In either case, what is the difference from having instances arrive one after the other and imposing the switching constraints and the long-term memory property on that single sequence?

Relation to Prior Work: It seems so: many relevant papers are mentioned, and how the current paper generalizes them is discussed.

Reproducibility: Yes

Additional Feedback: ---- Update ----- Thanks for the rebuttal; I am still not sure I have fully understood the setting (and the differences with respect to the sequential single-task one).


Review 2

Summary and Contributions: Under the multitask framework, the authors study binary classification problems in a nonstationary environment, which they capture in terms of switches. They provide algorithms as well as analysis in two settings: finite and infinite hypothesis classes.

Strengths: They assume neither a relationship between the tasks nor fixed distributions on the losses, which makes the problem flexible. The results can be seen as an extension of [26] to multiple tasks. In particular, when the number of modes is much smaller than the number of switches, they attain a reasonable regret.

Weaknesses: 1. The algorithms require some prior knowledge of the problem, such as the number of tasks and switches and the time horizon, as well as full-information feedback, which is due to the binary loss. 2. I think the authors need a further survey of contextual bandits with switching regret; here the task index resembles the context. [Luo et al., 2018] would be close to this paper. Moreover, the idea of "meta-experts" can also be seen in [Wu et al., 2019].

[Luo et al., 2018] Luo, Haipeng, et al. "Efficient contextual bandits in non-stationary worlds." Conference On Learning Theory, 2018.
[Wu et al., 2019] Wu, Yi-Shan, Po-An Wang, and Chi-Jen Lu. "Lifelong Optimization with Low Regret." The 22nd International Conference on Artificial Intelligence and Statistics, 2019.

Correctness: The algorithms and proofs seem reasonable and detailed. However, 50 pages is too long to review in full.

Clarity: It took me a long time to resolve some simple questions. For instance, the example (lines 17-18) describes a scenario in which multiple drones are doing a job. This intuitively suggests a setting where multiple tasks appear concurrently, which contradicts what Figure 2 expresses. Also, some notation is not consistent with common convention, and the lengthy inequalities make the proofs hard to follow.

Relation to Prior Work: The authors provide a clear survey of long-term memory and multi-task learning.

Reproducibility: Yes

Additional Feedback: When people consider an algorithm under a multi-task framework, they usually take advantage of transferring knowledge among the tasks. I suggest the authors theoretically and numerically compare their regret upper bounds with those obtained by treating each task individually. Also, some works on switching regret would provide insight into a parameter-free design.

________________ After rebuttal ________________

As all the reviewers agree, this paper is not well written, which makes it hard for us to evaluate this work. And I am not convinced by the authors' response. For these reasons, I would like to keep my score.


Review 3

Summary and Contributions: In prediction with expert advice, there is a well-known fixed share algorithm. It enables the learner to compete against meta-experts, which can (not too often) switch between base experts. This paper puts forward a multitask version of the switching framework. On every prediction round, we receive a task, which may be thought of as side information drawn from a finite set. The timeline becomes partitioned among a finite number of tasks, and the switches of a meta-expert are counted only within a task. The numbers of switches are then added across tasks, which prevents the problem from decomposing into independent subproblems. The predictions and outcomes in the paper are bits and the loss is the number of mistakes. However, the learner is allowed to make randomised predictions, which makes the situation equivalent to outputting predictions from the unit interval under the absolute loss. The difficulty in such situations lies in providing an efficient weight update. Meta-experts can always be merged in principle, but the straightforward approach is computationally infeasible. There should be an update rule that deals efficiently with a small number of weights. The paper proposes algorithms for these settings. As far as I understand the construction, separate arrays of weights are maintained for the tasks. The bound naturally extends the fixed share bound. A more complicated bound is obtained for experts from an RKHS.
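For orientation, the single-task bounds being generalized have roughly the following shapes (constants and lower-order terms omitted; these are the classic fixed share and long-term memory bounds, not the paper's statements): with $N$ base experts and a comparator making $k$ switches among $m$ distinct experts,

$$ L_T - L_T^* = O\!\left(k \ln N + k \ln \frac{T}{k}\right) \quad \text{(fixed share)}, \qquad L_T - L_T^* = O\!\left(m \ln N + k \ln m + k \ln \frac{T}{k}\right) \quad \text{(long-term memory)}. $$

The point of long-term memory is that the $\ln N$ factor is paid only $m$ times rather than $k$ times; the paper's multitask bounds extend this shape with the switch counts summed across tasks.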

Strengths: This is a very natural framework with obvious practical applications. The results make an important contribution to prediction with expert advice. The treatment of the topic is comprehensive.

Weaknesses: I find the writing of the paper difficult to follow. It takes a lot of time to get to the part where the results are formulated.

Correctness: Yes

Clarity: Mostly so. Typos: page 2, line -4: "to the exploit" -> "to exploit".

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Thank you for the answer to my comments. I am keeping my score. Here are some suggestions for improvement. You may want to discuss the relation to the literature on the basis of the following two cases: 1. I can think of a trivial approach to the problem: one can handle s tasks with N^s superexperts (one base expert per task) and apply the standard fixed share bound. But then the term k\ln N in the standard bound turns into ks\ln N, which is much worse than Theorem 1 (see the accounting after this paragraph). 2. Setting k to 0 reduces the setting of the paper to Section 4 of [3] (their K is s in this paper). Theorem 1 seems to reduce to Theorem 6 of [3], modulo the transition between convex and mixable environments, at least asymptotically for large m.
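To spell out the accounting in point 1 (a back-of-the-envelope computation, constants and lower-order terms omitted): a superexpert fixes one of the $N$ base experts for each of the $s$ tasks, so there are $N^s$ superexperts, and the switching term of the standard fixed share bound becomes

$$ k \ln (N^s) = k s \ln N, $$

whereas the corresponding term in Theorem 1 remains $k \ln N$.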


Review 4

Summary and Contributions: This paper investigates the problem of online multitask tracking in the binary classification setting. Here, the learner maintains a hypothesis for each of $s$ tasks. During each trial, the adversary selects one task $i \in [s]$ and supplies an instance $x^i_t$ for that task. Next, the learner makes a prediction $\hat y^i_t \in \{0,1\}$ on $x^i_t$ using its hypothesis $h_t^i$. Then, the adversary reveals the true label $y^i_t \in \{0,1\}$, and the learner incurs the zero-one loss $| y^i_t - \hat y^i_t |$ for the selected task $i$. Performance is measured in terms of (expected) multitask tracking regret: the competitor is not a single hypothesis but a sequence of hypotheses parameterized by a number of switches $k$ and a number of modes $m$. Furthermore, the regret is measured not on a single task but on the whole set of $s$ tasks. Two main hypothesis classes are considered: finite ones and infinite ones defined on an RKHS. Regret bounds with complexity results are provided for both classes.
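For concreteness, the trial structure can be rendered as the following toy loop (the uniform random draws are mere placeholders for the adversary's choices and the learner's hypotheses; this is not the paper's algorithm):

import random

# Toy rendering of the online multitask protocol described above.
s, T = 3, 100                      # number of tasks, horizon
mistakes = [0] * s                 # zero-one loss accumulated per task

for t in range(T):
    i = random.randrange(s)        # adversary selects a task i in [s]
    x_it = random.random()         # adversary supplies instance x^i_t
    y_hat = random.randint(0, 1)   # learner predicts \hat y^i_t using h^i_t
    y = random.randint(0, 1)       # adversary reveals the true label y^i_t
    mistakes[i] += abs(y - y_hat)  # learner incurs |y^i_t - \hat y^i_t|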

Strengths: Basically, there are two main variants of the online multitask learning protocol: on each trial, (i) the learner receives $s$ observations and processes them simultaneously (e.g. [21, 46]), or (ii) the learner observes a single observation for some task selected adversarially (e.g. [1, 4]). The authors examine (ii) for both finite classes and infinite ones. Although some results are already known (notably [1]), this is a challenging problem worth further investigation.

Weaknesses: Unfortunately, the paper has several major weaknesses:

* In the online multitask expert setting (Section 3), the authors claim that their framework is more general than the related work [1,3,4,7] by allowing switches between hypotheses for each task. Yet, the number $m$ of modes is known in advance, so for each task $i \in [s]$ we know that there are at most $m$ best hypotheses. Thus, unless I missed something, we can simply replace the $s$ tasks by $m \times s$ ones (i.e. each task in $[s]$ consists of $m$ different subtasks) and just apply the results obtained by [1] for the shifting multitask problem with expert advice (Corollary 1 in [1]) in order to get a bound that is essentially similar to (3). Still, I am aware that there are some differences between [1] and the present paper; for example, the present contribution uses the zero-one loss, while convex losses are advocated in [1]. Yet, in order to highlight the novelty of the present contribution, a more detailed comparison with related work should be provided.

* In the online multitask linear setting, and more generally the RKHS setting (Section 4), there is something wrong about the results. Namely, the authors use the zero-one loss and measure the learner's regret with respect to that function (line 66). In this setting, it is well known that for linear functions of dimension $d \geq 2$ the Littlestone dimension is infinite (see e.g. [52]). So, we cannot hope to find sublinear regret bounds for linear functions with respect to the zero-one loss (see the note after these points)! Actually, in many (if not most) papers about online linear classification, the zero-one loss is replaced with a convex surrogate (e.g. the hinge loss) in order to obtain nontrivial results. But this is not the case for the results presented in Section 4. Finally, it is well known that learning halfspaces in the agnostic PAC setting is NP-hard, and even NP-hard to approximate [Ben-David & Simon, NIPS'00; Birnbaum & Shalev-Shwartz, NIPS'12]. So, one cannot hope to find a polytime algorithm with sublinear regret in the online learning setting (with zero-one loss), which is more challenging than the agnostic PAC learning setting, unless strong assumptions are made about the problem. Yet again, if the authors are convinced of the tractability of their computational result, a rigorous comparison with hardness results should be provided.
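For reference, a rough statement of the fact invoked in the second point (the standard agnostic online lower bound, not a statement from the paper): if $\mathcal{H}$ shatters a tree of depth $T$, then by drawing the labels uniformly at random along the shattered tree the adversary forces any (possibly randomized) learner to satisfy

$$ \mathbb{E}\left[ \sum_{t=1}^T \mathcal{L}_{01}(y_t, \hat y_t) \right] - \min_{h \in \mathcal{H}} \sum_{t=1}^T \mathcal{L}_{01}(y_t, h(x_t)) \geq \frac{T}{2}, $$

since the learner errs with probability $1/2$ on every round while shattering guarantees some $h \in \mathcal{H}$ consistent with the realized labels. With infinite Littlestone dimension this holds for every horizon $T$, so no learner achieves sublinear regret under the zero-one loss.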

Correctness: I haven't checked all the results, since the extended version of the paper in the SM is 50 pages long. Yet, just take the result (2) formalized by Theorem 50 in Appendix C. Here, I agree with Lemma 49:

$$ \sum_{t=1}^T \mathcal{L}_{hi}(y_t, \bar y_t) - \sum_{t=1}^T \mathcal{L}_{hi}(y_t, \langle u, x_t \rangle) \leq \sqrt{U^2 X^2 T}. \quad \text{(i)} $$

Indeed, in the above bound, the regret is defined with respect to the hinge loss. But how does one arrive at the result in Theorem 50? Here, we have:

$$ \sum_{t=1}^T \mathcal{L}_{01}(y_t, \bar y_t) - \sum_{t=1}^T \mathcal{L}_{01}(y_t, \langle u, x_t \rangle) \leq \frac{1}{2} \sqrt{U^2 X^2 T}. \quad \text{(ii)} $$

Note that here the regret is defined with respect to the (non-convex) zero-one loss. Obviously, using the fact that the hinge loss is a surrogate, one can easily obtain from (i) that:

$$ \sum_{t=1}^T \mathcal{L}_{01}(y_t, \bar y_t) - \sum_{t=1}^T \mathcal{L}_{hi}(y_t, \langle u, x_t \rangle) \leq \sqrt{U^2 X^2 T}. \quad \text{(iii)} $$

But (iii) is "not" a regret bound with the zero-one loss. In the proof of Theorem 50 (line 1072) it is just written that "The bound follows from applying Lemmas 23 and 24 from [33] to Lemma 49." Well, I don't get it, because in [33] those lemmas are used to prove a mistake bound for transductive matrix completion in the realizable setting. So the proof here looks flawed.
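To make the gap explicit: since the hinge loss dominates the zero-one loss pointwise, $\mathcal{L}_{01}(y, z) \leq \mathcal{L}_{hi}(y, z)$ for all $y, z$, the comparator term subtracted in (iii) is larger than the one subtracted in (ii), so

$$ \sum_{t=1}^T \mathcal{L}_{01}(y_t, \bar y_t) - \sum_{t=1}^T \mathcal{L}_{hi}(y_t, \langle u, x_t \rangle) \;\leq\; \sum_{t=1}^T \mathcal{L}_{01}(y_t, \bar y_t) - \sum_{t=1}^T \mathcal{L}_{01}(y_t, \langle u, x_t \rangle). $$

A bound on the smaller left-hand side, which is what (iii) provides, therefore implies nothing about the zero-one regret on the right-hand side that (ii) claims.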

Clarity: The online multitask expert setting is relatively well written, but the online multitask RKHS setting requires more polishing. Notably, the hypothesis class $\mathcal H_K$ is not formally defined, so I could not understand the meaning of $\mathcal H^{(x)}_K$. On this point, I guess that we are talking about kernel classifiers. A quite standard definition here is to specify any hypothesis in $\mathcal H_K$ as a mapping $h: \mathcal X \rightarrow \{-1,+1\}$ with $h(x) = \mathrm{sign}(w^{\top} x)$, where $w$ and $x$ are taken from an RKHS (e.g. $\mathbb{R}^n$) with kernel function $K$. All my comments above about this part rely on this assumption about $\mathcal H_K$.

Relation to Prior Work: As indicated above, the paper should be better positioned with respect to related work. There are some missing references related to the online multitask learning setting (ii) and the hardness of learning linear functions with the zero-one loss:

[Ben-David & Simon, NIPS'00] S. Ben-David and H. Simon. Efficient learning of linear perceptrons. In NIPS, 2000.
[Birnbaum & Shalev-Shwartz, NIPS'12] A. Birnbaum and S. Shalev-Shwartz. Learning halfspaces with the zero-one loss: Time-accuracy tradeoffs. In NIPS, 2012.
[Saha et al., AISTATS'11] A. Saha, P. Rai, H. Daume III, and S. Venkatasubramanian. Online learning of multiple tasks and their relationships. In AISTATS, 2011.

Reproducibility: No

Additional Feedback: To sum up:

* In the online multitask expert setting, I would suggest providing a more rigorous comparison with the related work. As indicated above, it seems that the present framework can be cast as a problem already investigated in [1].

* In the online linear/RKHS setting, I would suggest replacing the zero-one loss with standard convex losses in order to provide correct results. Yet, if the authors are convinced of the correctness of their results with the 0-1 loss, some proofs (see comment above) should be revised, and most importantly a detailed comparison with known hardness results should be provided.

--- After rebuttal ---

I am still concerned about the correctness of the results for kernel classifiers. Take one of the first regret bounds in the paper, given by (2) (page 2), whose proof is given in Appendix C (Theorem 50). Yet, Theorem 50 is stated for a very specific sequence of comparators $u_t$ for which the absolute value of $\langle u_t, x_t \rangle$ is equal to one (a crucial point I overlooked in my original review). But (2) is a standard regret bound, defined for any hypothesis $h$ in $\mathcal H_K^{(x)}$. Actually, in their response, the authors agreed that their hypothesis space can have an infinite Littlestone dimension. So, I still believe that (2) is flawed.