__ Summary and Contributions__: The paper considers a framework for learning an optimal policy in a non-Markovian dynamic treatment regime (DTR). The framework includes a regularization term that encourages the optimal solution (one satisfying the first-order necessary conditions) to solve the nonparametric estimating equation.
The value functions are estimated by backwards recursion. Because DTRs additionally take the history as input, the neural network architecture uses RNNs.
----
Thanks to the authors for their response. I have read it. I agree the derivation is more difficult in the longitudinal setting due to the additional terms.
I think, however, that the previous use of this idea (algorithmically, and with multiple ablations) raises the bar a bit for this paper to provide evidence of the utility of the TMLE-inspired regularization. The paper is a bit short on this evidence beyond the numerical simulations, where it again achieves good performance, but the experiments do not include enough ablations to isolate the improvements due to the RNN, the NN architecture, or the regularization term. I think a more complete story here would really make this paper an accept. As it stands, this and other issues with the explanation of the algorithm and architecture (the main contributions) lead me to suggest a score of 6.

__ Strengths__: methodological:
Proposes gradient regularization to encourage the solution of a nonparametric estimating equation for a dynamic treatment regime.
Proposes an end-to-end learning framework with an RNN architecture to jointly learn outcome and propensity models.
Algorithms for learning optimal policies in DTRs are a bit lacking, owing to the difficulty not only of incorporating nuisance models for off-policy evaluation but also of re-estimating nuisance models during policy learning.
(This is discussed in Zhang, Baqun, et al. "A robust method for estimating optimal treatment regimes." Biometrics 68.4 (2012): 1010-1018, and a preliminary approach is discussed there.)
While there is extensive work in estimation of dynamic treatment regimes, learning optimal policies with algorithmically sound techniques has been a weakness in the literature.

__ Weaknesses__: The primary theoretical justification rests on the first-order necessary conditions satisfied by the regularizer term.
The primary contribution beyond this is algorithmic (but some of the details of the algorithmic approach are a bit unclear). It seems to me that the appropriate baseline should be AIPW with a gradient based approach on some relaxation of the policy indicator functions (similar "outcome weighted learning" schemes appear in the biostatistics literature), see:
Zhao, Ying-Qi, et al. "New statistical learning methods for estimating optimal dynamic treatment regimes." Journal of the American Statistical Association 110.510 (2015): 583-598.
Even so, the paper can be viewed as applying targeted regularization to the task of policy learning. This certainly leads to a large nonconvex problem, but by appeal to the algorithmic success of neural networks, it can be viewed as an algorithmic contribution.
Empirical evaluation:
The empirical evaluation could be more structured, since it provides the main justification for the approach. For example, the results on policy improvement value could be driven primarily by the use of RNNs to capture temporal dynamics in the backwards-recursive outcome models.

__ Correctness__: The claims are correct to the best of my knowledge.
(There are some difficulties regarding the optimization landscape of neural nets and properties of local optima, but this is inevitable for the off-policy learning problem).

__ Clarity__: The paper could be improved on the clarity front. It requires understanding several fields with extensive notation, so this is a difficult task. Clarity could be improved by expanding the appendices with background material so that the discussion is self-contained. (For example, expanding on the semiparametric efficiency discussion and the derivation of the EIC in the appendix could be helpful.)
In particular, the presentation of some of the engineering decisions and architectures seems to suffer a bit due to lack of space. Given that a key contribution of the paper is the engineering of the overall policy learning system, it would be great to be more detailed in this regard.

__ Relation to Prior Work__: The paper does a reasonable job situating itself relative to prior work given the breadth of relevant areas.
However, a major concern is the relationship to the TMLE-inspired regularization approach taken in:
Shi, Claudia, David Blei, and Victor Veitch. "Adapting neural networks for the estimation of treatment effects." Advances in Neural Information Processing Systems. 2019.
I believe the fundamental observation regarding the regularization term is the same as in that paper (except that this paper applies it to the more complicated EIC form for a non-Markovian DTR, with a more complicated derivation). That is, a regularization term can be added with an additional \epsilon parameter such that a zero partial derivative of that term with respect to \epsilon implies \E_n EIC = 0.
I may be a bit mistaken, but where exactly is the difference? Is the derived regularization a direct extension to the sequential setting, or is the proposed regularization in this setting a bit different? I went through the proofs but it is still not entirely clear to me where this differs; so further clarity/exposition on this front would be important. This would be important to get a clear sense of novelty and aid comparison to the batch setting.
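To make the shared observation concrete, here is a minimal sketch of the single-time-step (batch) case, as I understand it from Shi et al. It is not the paper's longitudinal regularizer. The function names (`clever_covariate`, `fit_epsilon`, `mean_eic`, `simulate`), the simulated data, and the deliberately misspecified outcome model are all illustrative assumptions. The nuisance fits are held fixed and \epsilon is solved in closed form, whereas the regularized approach would train \epsilon jointly with the networks.

```python
import random

def clever_covariate(t, g):
    """H(t, x) = t/g(x) - (1-t)/(1-g(x)) for a binary treatment t."""
    return t / g - (1 - t) / (1 - g)

def fit_epsilon(y, t, q_obs, g):
    """Closed-form minimizer over eps of mean (y - q_obs - eps*H)^2."""
    h = [clever_covariate(ti, gi) for ti, gi in zip(t, g)]
    num = sum(hi * (yi - qi) for hi, yi, qi in zip(h, y, q_obs))
    den = sum(hi * hi for hi in h)
    return num / den

def mean_eic(y, t, q0, q1, g, eps):
    """Empirical mean of the ATE efficient influence curve, evaluated at the
    eps-perturbed outcome model and its plug-in estimate psi."""
    n = len(y)
    q1_eps = [q + eps * clever_covariate(1, gi) for q, gi in zip(q1, g)]
    q0_eps = [q + eps * clever_covariate(0, gi) for q, gi in zip(q0, g)]
    psi = sum(a - b for a, b in zip(q1_eps, q0_eps)) / n
    total = 0.0
    for i in range(n):
        q_obs_eps = q1_eps[i] if t[i] == 1 else q0_eps[i]
        resid = y[i] - q_obs_eps
        total += (q1_eps[i] - q0_eps[i] - psi) + clever_covariate(t[i], g[i]) * resid
    return total / n

def simulate(n, seed):
    """Hypothetical data; the outcome model below is wrong on purpose."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    g = [0.2 + 0.6 * xi for xi in x]            # propensity, kept in [0.2, 0.8]
    t = [1 if rng.random() < gi else 0 for gi in g]
    y = [xi + ti + rng.gauss(0.0, 0.1) for xi, ti in zip(x, t)]
    q0 = [0.5 * xi for xi in x]                 # misspecified Q(0, x)
    q1 = [0.5 * xi + 0.7 for xi in x]           # misspecified Q(1, x)
    return y, t, q0, q1, g

y, t, q0, q1, g = simulate(200, seed=0)
q_obs = [b if ti == 1 else a for a, b, ti in zip(q0, q1, t)]
eps = fit_epsilon(y, t, q_obs, g)
print(abs(mean_eic(y, t, q0, q1, g, eps)))  # zero up to floating-point error
```

The first-order condition in \epsilon is exactly the statement that the weighted residual term of the EIC averages to zero, which is why the question of what changes in the sequential setting matters for assessing novelty.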
Another prior work that should be discussed is longitudinal TMLE for off policy evaluation in non-

__ Reproducibility__: Yes

__ Additional Feedback__:
Certainly the paper is an ambitious project. The score is not higher because some steps are suboptimal (using a "softmax" relaxation of the indicator function rather than a properly calibrated scoring loss; refitting nuisance estimates for every policy value) and other steps are explained in a somewhat unclear fashion. Overall, the paper would benefit a great deal from additional ablation studies. It seems that the primary contribution is algorithmic, in addition to deriving the regularization term. (I still find it somewhat unclear why this should be expected to perform better finite-sample regularization.)
Comparing against ADR on the when-to-stop task seems a bit strange because contextual optimal stopping is such a specialized case. If one knew this were the task at hand, a more specialized approach on the structure of the value/q-functions would be favorable anyway.
While it is certainly reassuring that good performance can be maintained even in this specialized setting (the general approach does not suffer too much relative to a specialized approach), it is hard to compare this approach to previous approaches in the general DTR setting studied in the paper, such as optimizing the AIPW with the chosen optimization approach for the indicator function loss, e.g., lines 238-240 here. This would be a more direct/informative comparison.
See:
Zhao, Ying-Qi, et al. "New statistical learning methods for estimating optimal dynamic treatment regimes." Journal of the American Statistical Association 110.510 (2015): 583-598.
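To illustrate the kind of baseline I have in mind, here is a minimal sketch under strong simplifying assumptions: a single decision stage, a binary treatment, fixed nuisance estimates, and a linear policy whose indicator is relaxed to a sigmoid. None of this is taken from the paper; the names `aipw_scores`, `relaxed_value`, and `fit_policy` and the toy numbers are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def aipw_scores(y, t, q0, q1, g):
    """Per-unit doubly robust scores for 'never treat' (s0) and 'always treat' (s1)."""
    s1 = [q1i + ti / gi * (yi - q1i)
          for yi, ti, q1i, gi in zip(y, t, q1, g)]
    s0 = [q0i + (1 - ti) / (1 - gi) * (yi - q0i)
          for yi, ti, q0i, gi in zip(y, t, q0, g)]
    return s0, s1

def relaxed_value(a, b, x, s0, s1):
    """AIPW value of the soft policy p(x) = sigmoid(a + b*x), which replaces
    the indicator 1{a + b*x > 0} by a smooth surrogate."""
    total = 0.0
    for xi, s0i, s1i in zip(x, s0, s1):
        p = sigmoid(a + b * xi)
        total += p * s1i + (1.0 - p) * s0i
    return total / len(x)

def fit_policy(x, s0, s1, steps=200, lr=0.5):
    """Plain gradient ascent on the relaxed AIPW value."""
    a, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for xi, s0i, s1i in zip(x, s0, s1):
            p = sigmoid(a + b * xi)
            common = p * (1.0 - p) * (s1i - s0i)   # derivative of the value w.r.t. the logit
            grad_a += common / n
            grad_b += common * xi / n
        a += lr * grad_a
        b += lr * grad_b
    return a, b

# Toy scores: treatment helps when x > 0 and hurts when x < 0.
x = [-2.0, -1.0, 1.0, 2.0]
s0 = [0.0, 0.0, 0.0, 0.0]
s1 = [-1.0, -1.0, 1.0, 1.0]
a, b = fit_policy(x, s0, s1)
print(b > 0, relaxed_value(a, b, x, s0, s1) > relaxed_value(0.0, 0.0, x, s0, s1))
```

A multi-stage version would replace the single propensity by a product over time steps, which is where the variance concerns arise. Comparing against such a baseline with the same relaxation would isolate the contribution of the regularizer.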

__ Summary and Contributions__: A crucial need in medicine is to decide how to optimally treat a patient. This problem can be defined in the context of a dynamic treatment regime (DTR), which is a sequence of individualized treatment rules based on the patient's history and previous treatments. However, learning a DTR from offline data can lead to sub-optimal solutions, since the bias due to time-varying confounders is ignored. This work presents a novel method called Gradient Regularized V-learning (GRV) to address this issue and learn the value function of a DTR.

__ Strengths__: 1. The proposed GRV estimator is stable and achieves optimal asymptotic efficiency while relaxing the high variance caused by unstable inverse propensity score (IPS) product.
2. Theoretical results show that the optimality condition is met when the proposed regularizer is minimized.
3. Based on the designed estimator, two DTR learning algorithms have been developed.
4. Superior performance of the proposed method has been validated through both synthetic and real-world simulations.
5. Paper is well-written and fairly easy to follow.

__ Weaknesses__: 1. According to Table 1, the ADR method shows performance competitive with the proposed GRV method. It would be better to include a short discussion of their relative advantages and disadvantages (e.g., running time, stability) to better establish the advantage of the proposed GRV method.

__ Correctness__: Yes

__ Clarity__: Yes

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The authors propose gradient regularized V-learning to learn the optimal dynamic treatment regime. It is semiparametrically efficient and allows the use of complex RNNs.

__ Strengths__: This estimator is statistically efficient. It allows the usage of complex deep learning methods.

__ Weaknesses__: The paper is very complicated, and the explanation of the main algorithm is insufficient. I could follow the paper until Section 4.2, but I was puzzled by Section 4.3. For example, more explanation of the network architecture and a fuller description of the objective function would be desirable. The expression for the objective function is very long, and the authors need to motivate why it should be used.
Another issue is the comparison with DR. The paper says that DR is efficient but has poor finite-sample properties because of the PS model. The proposed method also uses the inverse PS and suffers from the same problem. Besides, DR is much simpler to implement and can partially avoid the problem by normalizing the inverse propensity score weights. In this sense, I still cannot see why we have to use the very complex objective function in Section 4.3.
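As a small illustration of the normalization I mean, here is a sketch for a single time step and a single treatment arm, comparing unnormalized and self-normalized inverse propensity weighting. The function names and toy numbers are my own assumptions, not the paper's. In a DTR the weight would be a product of inverse propensities over time steps, which makes the instability worse.

```python
def ipw_mean(y, t, g):
    """Unnormalized (Horvitz-Thompson) estimate of E[Y(1)]."""
    n = len(y)
    return sum(ti * yi / gi for yi, ti, gi in zip(y, t, g)) / n

def normalized_ipw_mean(y, t, g):
    """Self-normalized (Hajek) estimate: weights are rescaled to sum to one,
    so the estimate always lies within the range of the observed outcomes."""
    w = [ti / gi for ti, gi in zip(t, g)]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

# Toy data: a constant outcome and one very small propensity.
y = [3.0, 3.0, 3.0, 3.0]
t = [1, 0, 1, 1]
g = [0.1, 0.5, 0.5, 0.9]
print(ipw_mean(y, t, g))             # inflated well above 3 by the 1/0.1 weight
print(normalized_ipw_mean(y, t, g))  # recovers the constant outcome
```

With a constant outcome the normalized estimate returns that constant, while the unnormalized one is driven by the single large weight.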
Another thing the reader would want to know is the connection with TMLE. As far as I know, TMLE introduces epsilon and updates the initial estimator so that the final estimator, combined with epsilon, aligns with the efficient influence curve. It looks like the proposed method does a similar thing. Is the proposed method fundamentally different from TMLE?
>>>>>>>>>>>>>>>>>> (After reading rebuttal)
The authors tried to address my concerns, and I have updated the score. I understand that the paper is very dense and that allocating the space is difficult. In my personal opinion, the most important addition would be more explanation of the objective function, so I recommend that the authors add it to the main draft.

__ Correctness__: I think so.

__ Clarity__: The explanation and the introduction are clear. However, the main algorithm and the statements regarding it (Section 4.3), which are the most important part, are not well explained.

__ Relation to Prior Work__: Well written.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: A dynamic treatment regime (DTR) is a sequence of treatment rules indicating how to individualize treatments for a patient based on the previously assigned treatments and the evolving covariate history. The authors propose Gradient Regularized V-learning (GRV), a novel method for estimating the value function of a DTR. GRV regularizes the underlying outcome and propensity score models with respect to the optimality condition in semiparametric estimation theory and enables recurrent neural network models to estimate the value function under a DTR accurately.

__ Strengths__: Dynamic decision-making in complex non-Markovian environments is very challenging, especially when dealing with the treatment of patients. It is crucial to give certain treatments at certain times in order to improve the patients' condition. This paper addresses the problem by formulating a value function and introducing an efficient method (GRV) to estimate the value functions using recurrent neural networks (RNNs). The authors also include a theoretical analysis of optimality for their proposed method, which adds to the value of the paper.

__ Weaknesses__: It seems to me that there are not enough experiments in the paper. The authors include only a simulated experiment in the paper and a real-world MIMIC experiment in the supplement. Even the MIMIC experiment is not very convincing as to the practicality of the proposed method.

__ Correctness__: They seem to be correct.

__ Clarity__: Yes.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: I liked the idea of using RNNs to estimate the value functions and of providing theoretical justification for the solution of the proposed method. However, what is missing from the paper is an actual real-world problem that demands such a method. Although there is a MIMIC experiment in the supplement, the results are not very convincing. For example, in Fig 6b the proposed DTR suggests that more patients receive 1, 2, and 3 treatments between time steps t=5 and t=8, and that fewer patients receive #treatment=4. It is therefore hard to interpret the results and to claim that the proposed DTR outperforms the others. I think some other convincing examples could add to the value of the paper and the practicality of the method.
The response has been read.