Paper ID: 953
Title: Lifelong Learning with Non-i.i.d. Tasks
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a theoretical analysis of Lifelong learning setting which goes beyond the traditional task i.i.d assumption. Two settings are considered: stationary environment with non-independent tasks and non-stationary environments.

I liked the paper, it is written well and I feel I was able to follow it even though I am not an expert in PAC-Bayesian analysis. I did miss a few concrete examples, particularly at the introduction level, motivating the two possible learning settings and relating them (or contrasting) to existing lifelong learning approaches if possible.

A synthetic data example was given for the non-stationary environments, it would have been nice to add an example for the stationary setting as well.
Q2: Please summarize your review in 1-2 sentences
I think this paper adds an important theoretical analysis to the Lifelong learning setting. I feel the paper would have benefited from a few concrete examples.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper presents some extensions to the Pentina and Lampert's PAC-Bayesian analysis of "Lifelong Learning" problems (ICML 2014) , where a learner must adapt to various tasks exploiting knowledge from previously seen ones. The main contributions are risk bounds dedicated to two scenarios where the observed task are not sampled independently from each other. Roughly speaking, the first scenario share similarities with domain adaptation (albeit the risk bound is given on an average of all possible domains, instead of on a specific target domain) and the second is quasi-identical (up to my knowledge) to distribution drift.

In the first setting (Section 3), the authors cleverly reuse Ralaivola et al.'s chromatic PAC-Bayesian theory to represent dependencies between tasks. However, this result alone let me unsatisfied. I wonder to which extent this result can be useful to the ambitious "lifelong learning" problem the authors are interested in. At the end of Section 3, the authors open the way to use the result as a "quality measure" benchmark or to the conception of a learning algorithm. If provided, such empirical study could be a good way to enhance this work.

The second setting (Section 4) is very similar to the "distribution drift" scenario. I would like the authors to compare their results with other ones in this area (e.g., Mohri and Medina's "New Analysis and Algorithm for Learning with Drifting Distributions", ALT 2012). I also think that this setting must be studied more deeply. Contrary to the author claim, I think that the assumption of Equation (18) is very restrictive: it is unlikely that *all* algorithms of a considered family would have *exactly* the same error on every two consecutive tasks. This assumption should be relaxed to become realistic (as in Mohri and Medina 2012). Moreover, the algorithm should be able to consider more than the two previous tasks.

Finally, throughout the paper, I wondered why the authors choose this specific formulation of PAC-Bayesian theorems. While Theorem 1 tradeoff is very easy to understand, the obtained bound is less tight than the bound of McAllester ("PAC-Bayesian stochastic model selection", Mach Learn 2003), where the complexity term appears under a square root, while being "explicit" (see also the slightly tighter version of Germain et al.'s "Risk Bounds for the Majority Vote...", JMLR 2015). Consequently, I presume that the further bounds for non-i.i.d. tasks are not as tight as they can be. I think this must be discussed in the paper.

Some typos and minor comments: - There is a small error in Theorem 1 proof: When applying Hoeffding's lemma (Equation (3) of the Supplementary Material), there should be a "4" before lambda^2, because the subtraction between losses lies in [-1,1]. I also think that the same error appears at Equations (11), (25) and (26) of the main paper. - Line 89: The dot is misplaced - Supplementary Material, Line 78: Fro -> For - For the reader's benefit, it is preferable to cite the more recent version of Ralaivola et al.'s "Chromatic PAC-Bayes bounds for non-iid data" (JMLR 2010 instead of AISTATS 2009) - I personally prefer when the definitions, theorems, lemmas, etc share the same counter.
Q2: Please summarize your review in 1-2 sentences
The paper is clearly written and contains interesting theoretical results. However, each of the two studied scenarios deserves to be studied more deeply to be published.

Author Feedback
Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their helpful comments.


Assigned_Reviewer_2:

> "the paper would have benefited from a few concrete examples"

We will extend the discussion and include more examples.
Note that the non-i.i.d. setting is very common, e.g.
the active users of speech recognition software are
not i.i.d. sampled of time-zone, etc.

> "example for the stationary setting as well"

see lines 220/221: the setup (in particular the algorithm) for such
experiments would be identical to [Pentina 2014], except with dependent
instead of independent tasks for training.


Assigned_Reviewer_3:

> "I wonder to which extent this result can be useful to the ambitious "lifelong learning" problem" ?

The setting generalizes the setting of [Pentina 2014], which already targeted the lifelong learning
situation. We believe it is a significant generalization, since for real-world learning tasks, an
i.i.d. assumption is unrealistic (see the above time-zone example).

> empirical study

see Reviewer 2.

> relation to "distribution drift"

Both setting have commonalities, but they differ in important aspects:

* in distribution drift, e.g. [Mohri and Medina, 2012], one observes a sequence of examples
from a time-varying data distribution and bounds the error of a hypothesis on future samples
based on its performance at previous time steps.

* in our setting, we observe sample sets corresponding to tasks from a time-varying tasks environment.
At any time step, we learn a new predictor (that might be very different from the one of the previous
time step). Samples from previous tasks act as context/prior, not as training data, and our bound
quantifies the performance not of a single hypothesis, but of a transfer algorithm.

This difference is gives the bounds different characteristics:

* In distribution drift, the bound contains a term that grows linearly with the number of samples.
The tightest bound is achieved by disregarding samples from too far in the past. This makes sense,
since if the underlying data distribution changes, a single hypothesis cannot be expected to work
well in the distant past as well as in the future.

* In lifelong setting, we expect the performance to get better the more tasks we observe.
This is reflected in the bound by the terms decreasing with n. Note that his is only possible
because we bound the quality of transfer algorithms, not fixed hypotheses.

A final difference lies in the used assumptions:

* Distribution drift assumes the changes between steps to be small, but otherwise arbitrary,
i.e. a form of worst case assumption.

* We do not assume that changes between task environments are small, but "consistent",
i.e. past experience can be used to identify a transfer algorithm that can compensate
for the changes.
While this is indeed a restriction, we believe it is not as strict as it sounds: it is
not required that all algorithms have exactly the same error on every two consecutive
tasks, as the statement is only in expectation.

We agree, however, that it would be interesting and promising to relax the assumption
to allow for (small) deviations, along the lines of the distribution drift work.
Thank you for the suggestion.

> the algorithm should be able to consider more than the two previous tasks

It would be possible to give the transfer algorithm access to a larger
(but fixed) amount of tasks.

> specific formulation of PAC-Bayesian theorems

We chose this bound mainly to keep the expressions comparable with [Pentina 2014].
Also, we like that it allows the combination of multiple bounds without technical
complications and gives rise more directly to practical algorithms than e.g.
the square-root variant or the 'small-kl' bounds, even if these might be tighter
in some situations.
We'll add an explanation of our choice and the relation to tighter bounds in the
manuscript.

> minor comments:

thank you, we will fix these.




Assigned_Reviewer_4, Assigned_Reviewer_5, Assigned_Reviewer_6:

thank you for your comments.