NeurIPS 2020

### Review 1

Summary and Contributions: This work studies the restless bandit problem. Their main contribution is proposing a explore-then-commit type algorithm. By assuming the underline Markov chain are birth-death processes, they can show that the resulting regret upper bound is polynomial in the size of the problem. Besides, they can also reduce the computational complexity from exponential to polynomial as well.

Strengths: The results are interesting as they can show significant constant reduction in regret. The empirical evaluation also shows the good performance of the proposed algorithm.

Weaknesses: One key assumption is that the underline Markov chain is a birth-death process. Since the chain becomes linear, then it is not suprising that the regret and complexity are reduced to polynomial instead of exponential.

Correctness: The claims and methos are correct.

Clarity: Yes. The paper is well written and organized.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: 1) As I mentioned above, one question is whether the results here can be generalized to other cases in restless bandits. 1.a) Could this model be generalized to infinite state birth-death process? For example, M/M/1 queue in queueing system? 1.b) Could this model be generalized to other stochastic process that are not birth-death process? For example, Moran process? 2) Since the algorithm is an explore-then-commit type algorithm, it is also crucial in the experimental results to show the error bar of regret. Because this type of algorithm usually suffer from large variance in empirical performance and is not analyzed in theory. 3) The regret in Theorem 1 depends on M^3. Is this factor tight enough? Could it be reduced to M^2 or even M? ======= After Rebuttal ====== I think the authors are aware of the concerns about birth-death assumption and the lacking of lower bound on the regret complexity. These are not fully addressed in the rebuttal. So I maintained my score.

### Review 2

Summary and Contributions: The paper considers an online restless bandit problem, where each arm’s state evolves over time regardless of the player's actions, according to a birth-death Markov chain, the reward of pulling each arm depends on both the arm and the current state of the pulled arm. A player at each time can only observe the reward and the state of the pulled arm. An online algorithm (Restless-UCB) for learning the optimal policy that maximizes the total expected reward is proposed, which the paper claims to achieve a better regret upper bound (sub-linear) and computation complexity than existing algorithms [28]. The proposed algorithm is of the explore-then-commit type, where the arms are explored for a pre-specified amount of time (o(T)), and then use parameter estimates from the exploration to construct an offline instance of the problem; an oracle is assumed available to general an optimal policy from the offline instance, which is then used indefinitely thereafter. For the special case considered in the paper, birth-death chain with positively correlated state transitions and monotonically decreasing rewards as "birth" progresses, this algorithm achieves sub-linear strong regret (on par but slightly worse) with much lower complexity compared to prior work such as [28].

Strengths: The paper is clearly written and appears technically sound. As a special case of more general restless bandits, it obtains computationally efficient results.

Weaknesses: 1. Comparison with prior work is woefully inadequate. For instance, earlier in the paper (page 2) citing [35, 10, 21] it is stated these prior works have a log(T) regret; later on page 6, citing the same papers, it is stated that these suffer from \Theta(T) regret. A quick glance at these paper reveals a number of differences: (i) these earlier works are for weak regret, comparing with single best action policies, while this submission is for strong regret, comparing with the best dynamic policy, (ii) earlier works [35] are for much more general Markov chains, while this paper has much more limiting assumptions (birth-death chain, positively correlated state transitions, monotonic state rewards), (iii) the earlier works do indeed have log(T) regret, and are generally finite time bound (bound holds uniform in time). Also, it is incorrect to say that these restrictive assumptions are common in the bandit literature (page 4): they are perhaps common in the sub-literature of using bandit models to solve dynamic spectrum access problems, but is not at all common in the general bandit literature, and does not appear to have much support/motivation outside of this area. 2. The more relevant comparison, with [28] is equally problematic and misleading. Again, the assumption of a birth-death chain with positively correlated state transitions and monotonically decreasing rewards makes the present problem a very special case of a general Markov chain, so it is not surprising that one can construct algorithm to take advantage of these special features to reduce its complexity. These are indeed key ideas to this paper (Lemmas 1 & 2) because of these restrictive assumptions; these are not necessarily key ideas that prior works like [28] missed (as the wording would seem to suggest) because the latter was considering a much more general model. It would be highly desirable and more informative for the paper to state clearly that their model is a special case (starting from the paper title), but the assumptions allowed them to obtain better results. 3. In the same context, it would also be very helpful if the paper could explain what happens if one takes the result from [28] and applies it to this special case, rather than directly quoting the results form [28] which obviously is for the more general case.

Correctness: Technically correct, but some misleading statements (see above).

Clarity: Yes.

Reproducibility: No

Additional Feedback: - What is the oracle used in the experiments? - Suggest revising the title to make it clear the problem model is not for any restless bandits. ====== post rebuttal ============ The authors have addressed my concern on literature review. The limitation of birth-death chain remains. I have revised my score to 6.

### Review 3

Summary and Contributions: This paper studies the online restless bandit problem, where the state of each arm evolves according to a Markov chain, and the reward of pulling an arm depends on both the pulled arm and the current state of the corresponding Markov chain. This work propose is a learning policy that follows the explore-then-commit principle, called restless-UCB. The main challenges of online restless bandit learning include high computation complexity and exponential factor in the regret bound. Corresponding to the challenges, the main contributions of this paper include (1) low regret bound: the proposed algorithm has a regret bound which is only polynomial in the number of arms and states. (2) low computation complexity. Restless bandit solutions usually have high (exponential with respect to N) computation complexity. The proposed algorithm has a linear computation complexity.

Strengths: Strength: (1) The improvement in the computation complexity of restless bandit has great practical meaning in order for the restless bandit to be useful (2) The theoretical analysis developed in this paper is solid and novel.

Weaknesses: Weakness: (1) Birth-death MDP: This paper focuses on the birth-death MDP setting. Although the authors emphasized that Birth-death MDP has many important applications, it would be helpful to discuss a potential extension to the more general (PO)MDP setting. (2) Availability of the oracle: While achieving an O(N) computation complexity, the proposed solutions depends on the availability of a good oracle. The authors should justify a little more about whether such an oracle can be easily available in different settings of the problem (this is also related to weakness 1). (3) State sparsity: one main property used to reduce regret is the sparsity of state transition. However, this assumption is not well justified in the paper.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes but needs further improvments. (1) It will be helpful to compare different assumptions used in existing restless bandit solutions. For example is the state sparsity assumption already exploited in existing literature. (2) Restless bandit is also related with non-stationary bandit. Although not directly comparable, it would be helpful to also discuss the relationship with non-statioanry bandits. In addition, many only statistical tests developed in non-stationary bandit learning literature can potentially be used to estimate or predict the current state of the restless bandit.

Reproducibility: Yes

Additional Feedback: *************************Post rebuttal*********************** My concerns about prior work and the oracle are largely addressed by the authors' responses. I am sticking with my original score (7, accept).