
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper addresses the problem of riskaversive decision making within a MDP framework and CVaR. Firstly, it is showed that a CVaR measure can be interpreted as worstcase expected cost under MDP parameter perturbations given a fixed error budget, and this result provides a unified framework for risksensitive and robust decision making. An approximate value iteration algorithm for CVaR MDPs and its convergence rate is also presented. The approach is empirically illustrated on the problem of getting to a goal state while avoiding obstacles in 2D gridworld.
Quality: The paper is well written and technically sound. All relevant references are given.
Clarity: The paper is rather clear, but I have minor concerns in some mathematical equations as follows:
 In Line 95, equation (1), If I understood correctly, the CVaR at confidence level $\alpha$ would be defined as $CVaR_\alpha(Z) = \min_{w\in\mathbb{R}} \{ w + \frac{1}{1\alpha} \mathbb{E}[(Zw)^{+}]\} if $CVaR_\alpha(Z)=\mathbb{E}[ZZ \ge VaR_\alpha(Z)]$, which is nondecreasing in $\alpha$. In the case of
$CVaR_\alpha(Z)$ decreasing in $\alpha$, (1) is correct but $CVaR_\alpha(Z)$ is written as $\mathbb E[ZZ\ge VaR_{1\apha}(Z)]$.  In Line 121, the space of histories up to time $t$ should be defined recursively as $H_t = H_{t1} \times \mathcal X \times \mathcal A$.  In Line 123, deterministic historydependent policies...  In Line 204, Theorem 2, I couldn't find Theorem 10 in [17]. Also, if I understand correctly, $H_t$ would be $h_t$ since the history up to timestep $t$ is assumed to be given.  In Line 309, it would be good if $y_{i+1}$ could be explicitly defined though its definition is quite clear from the context.
 In Line 357, $y_1$ would be written as 0 according to the assumption made in Algorithm 1.  In Line 353~359, the adaptive procedure for additional interpolation points is understandable but a bit confusing. We already have a set of interpolation points $\mathbf Y(x)=\{y_1, y_2, ..., y_{N(x)}\}$ which satisfy the condition $\log \theta = \log y_{i+1}  \log y_i$. The additional points in $(y'_2, y_2)$ are not explicitly denoted so they don't seem to be related to the condition $\log \theta \ge
\log y_{i+1}  \log y_i$. Some other notations might be needed for the additional interpolation points. If we can add new interpolation points anywhere, this addition can affect $V_t(x,y_i)$? Is $V_t(x,y_i)$ linearly interpolated rather than exactly updated from (possibly approximated) $V_{t+1}$?  In Algorithm 1, it would be good to see the list of input parameters.  The parameters used for the experiment would be mentioned (e.g. $\theta$ and $\epsilon$).
In Line 393, it is stated that $M$ is chosen to be $2/(1\gamma)$. It would be great if the authors could explain in more detail how $M$ was chosen and it affects the computational complexity.
Originality: CVaR MDPs are not new, but this paper provides a novel equivalence between a CVaR objective and a robust MDP formulation. This is an improvement over the previously suggested equivalence in terms of conservativeness in parameter perturbations.
The paper also proposes an exact value iteration for CVaR MDPs, which runs on an MDP augmented with a continuous state variable which denotes the confidence level allowed. This challenge has been tackled by exploiting linear interpolation and the concavity, and the resulting approximate algorithm is guaranteed to converge and its error is also theoretically bounded. This paper seems sufficiently novel.
Significance: Riskaversive planning is an interesting problem and this paper provides significant contributions.
Q2: Please summarize your review in 12 sentences
This paper presents an equivalence between RMDPs and CVaR MDPs, and proposes a value iteration algorithm for CVaR MDPs with a convergence guarantee and a bound on approximation error. The algorithm and experiments are sounds.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper shows that a CVaR objective has an alternative interpretation as expected cost under worstcase modeling errors, if modeling errors are bounded by a budjet, bounding their product.
The paper also introduces an approximate value iteration algorithm for CVaR MDPs, which is to the best of their knowledge the first solution to CVaR MDPs with convergence error guarantees, and the first algorithm to approximate globally optimal policies for CVaR MDPs. Points that require improvement are:
(1) While proofs can be found in the supplementary material, it
seems necessary to have at least proof sketches inside the paper,
especially given that one of the two main contributions of the paper
is the proof of proposition 1.
(2) The paper heavily relies on an error budget constraint that
seems nonintuitive/nonrealistic to me. It would be good to
explain whether this errorbudget just makes the math easier, or
does it also have a realistic interpretation.
(3) Experiment with a more complex domain could strenghten the p
The paper is of highquality, and very wellpresented.
The paper provides significant insight on the connection between risksensitivity and robustness, as well as a practical algorithm for discrete state/action CVAR MDPs. Robust MDPs are an important research problem, and the paper makes progress towards better understanding and solving them.
Detailed comments by order of appearance: =========================================  line 39 'worstcase scenario'  what does it mean: worstcase cost under worstcase parameters? or expectedcost under worstcase parameters?
 line 47  why purturbing only transition probabilities, and not reward as well? would your result hold with reward purturbation? Some clarification is needed here, and possibly in the abstract.
 line 79  'compute globally optimal'  since your algorithm is an approximate VI, wouldn't it be more accurate to say "approximate globally optimal..." and mention the dependence on interpolation resolution?
 line 119 'our results easily generalize'  if they indeed generalize, it would be good to briefly explain how at a later section of the paper.
 line 122 it should be H x A x X
 line 128 you use Z_t for C(x,a)  please briefly clarify the connection explicitely
 line 144 'worstcase expected ... in presence of worstcase' => 'expected ... in presence of worstcase' Also, could you relate 'expected reward under worstcase purturbation'
to 'worstcase reward under worstcase purturbation' and possibly to
to 'worstcase reward under expected purturbation'?
 line 148  (x_0...x_T)  it's not clear if you include actions in the trajectory, and if not, why not (later it becomes clearer when you use a deterministic, fixed policy)
 line 151  why (x_0...x_T) starting from x_1, especially given that cost is C(x,a)?
 line 165  the error budget seems uninutitive to me. Beyond the fact that it's not clear why errors should be bounded by a budget, this specific form of budget which bounds the product of errors seem uninutitive.
Other than helping the math, does this come in realistic contexts? if yes, could you demonstrate?
Also does your algorithm needs to know the budget (\eta) in advance? if not, what happens when: the assumed budget is smaller than the actual errors? if yes, is there a way to find a budget (upperbound) efficiently in the presence of the exponential number of possible error combinations?
Also, it's not clear to my why bounding the delta product from above bounds the errors: very small deltas (e.g. delta=0) applied to 'good' transitions may be as bad  don't they?
line 166  "states that the worst cannot happen at each time"  it sounds more accurate to say "*can* state that the worst...", since if the budget is large enough, the worst can happen at each time
 While proofs can be found in the supplementary material, it seems necessary to have at least proof sketches inside the paper, especially for proposition 1, and
especially given that some proofs are part of the main claimed contributions of the paper.
 line 177  if you claim the comparison is 'instructive', it seems appropriate, to guide the reader and provide a brief explanation to what is instructive in the comparison.
 Theorem 2: it might add to clarify if 'Z' is replaced by something line \mathcal{Z}_t
 experiments:
 The experiments verify that your algorithm behaves as expected with
respect to risk and model errors on a grid world example. This is
good as a first step, but it would be better verify your algorithm
on a more complex domain.
 it seems that you built the obstacles in a way that the length of
the optimal path for any given (increasing) safety level is
increasing. It would be good explicitely clarify that (I didn't find
a mention of that in the paper).
 initially it confuses that the obstacles are yellow, given that the
value function is also yellow in parts  it seems like obstacles have
high value
 shouldn't M be negative (since your plotted value function has
negative values)
Q2: Please summarize your review in 12 sentences
The paper proves a novel connection between risksensitivity and robustness for a
a CVaR objective. The paper is interesting and well presented. Points that require
improvement are:
(1) providing at least proof sketches inside the paper, especially given
that the proofs are claimed as a main contribution
(2) The paper heavily relies on an error budget constraint that
seems nonintuitive/nonrealistic to me. It would be good to
explain whether this errorbudget just makes the math easier, or
does it also have a realistic interpretation.
(3) Experiment with a more complex domain could strenghten the paper
Q1:Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 5000 characters. Note
however, that reviewers and area chairs are busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We would like to thank the reviewers for thoroughly
reading our paper.
All comments and suggestions will be addressed
in the final version of this paper. In particular:
To reviewer
1: Adaptive interpolation: there is a simple trick to perform adaptive
interpolation without recalculating V_t. Note that the resolution
requirement is y_{i+1}/y_{i} <= \theta. Therefore, it's sufficient to
only add points between y'_2 and y_2 (under the requirement y_{i+1}/y_{i}
<= \theta, which is trivial to code), and keep the other points
unchanged. Since all the additional points now belong to the same original
interpolation segment (y_1,y_2), the linear interpolation of V_t does not
change, and may be used 'as is' when calculating V_{t+1}. Thank you for
pointing this out  we will clarify this issue in the algorithm, and also
modify the notation.
To reviewer 2: The choice of \alpha=0.11 is
due to a logspace partitioning of [0,1] to 21 points, and has no special
meaning. To obtain the results for \alpha=0.1, one can either interpolate
the corresponding value functions of 2 adjacent alphas or include
alpha=0.1 in the set of interpolation points.
To reviewer 3: The
discussion on the relation between CVaR and budgetconstrained MDP not
only motivates our work, but also suggests an important link between the
robust MDP framework in [13] and the dual representation of CVaR. A
similar study for dynamic risk can be found in the paper "Robustness and
risksensitivity in Markov decision processes" by Osogami (2012). The
static risk counterpart is subtler and models a less conservative
situation.
Our methods indeed require explicit knowledge of the MDP
while the CVaR policy gradient algorithms in [4,5,8,25] generally do not.
We will tone down the comparison. Thank you for pointing out the
additional reference; we will duly cite it.
To reviewer 4: We
will add a proof sketch and the relevant discussions to the final version
of the main paper.
We comment on the error budget.
Modeling
parameter uncertainty (or, equivalently, model errors) in *sequential*
decision problems is an open problem that is actively being investigated
in reinforcement learning, operations research, and stochastic
control.
The most common, but naive approach, is the rectangular
uncertainty model. In this model each state is treated independently, and
model uncertainty *for each state* is independently measured using
statistical confidence intervals on the transition and reward
distributions observed from the data. The error in the total reward is
then calculated by assuming *the worstcase* parameter realization (within
the confidence intervals) at each state. As is wellknown, this approach
is often too conservative to be of practical use.
A different
approach is to assume a statistical distribution on *the errors* (i.e., on
the probability of having an error in a state, or on its possible
magnitude). This idea was explored in [13], and it was shown, using
standard concentration inequalities, that when the error probabilities are
IID, then the *number* of errors in a given trajectory is bounded.
Similarly, if the error magnitudes are IID, then the sum of error
magnitudes  *the error budget*  is also bounded.
A third
approach is to adopt a Bayesian view, and assume a distribution on the
*parameter values*, instead of errors induced by an adversary.
The
error budget introduced in our paper resembles the approach of [13].
However, further work is required to rigorously relate it to a
*statistical model* of parameter uncertainty. We hope that the CVaR
connection we discovered will drive such future research; it is an
interesting and important topic, which we intend to pursue. Furthermore,
we believe that our results may extend beyond CVaR to general coherent
risk measures, which may offer additional flexibility in modeling
parameter uncertainty.
Regarding the upper bound on the Delta
product, note that in Lines 161162 the perturbations are required to be
probability distributions (i.e., positive and sum to 1). Thus, the upper
bound is also implicitly a lower bound. In particular, for eta=1, the
smallest possible value of eta, there is no
perturbation. 
