__ Summary and Contributions__: This paper proposes a new algorithm for sparse phase retrieval, which involves recovering a k-sparse signal from squared linear measurements. Specifically, the algorithm comprises (continuous-time) mirror descent with the hypentropy mirror map, initialized away from zero. This algorithm requires no tuning parameters (beyond a temperature parameter that can be estimated fairly robustly from the measurements).
The authors prove that this algorithm recovers any signal with approximately uniform-magnitude coefficients --- despite the fact that no sparsity regularization or projection is needed. In this sense, the mirror descent algorithm automatically adapts to the sparsity level.
The authors also provide several synthetic simulation results that support the analysis. In particular, the convergence of the algorithm (measured in terms of Bregman divergence) is theoretically broken down into three phases; simulations show that this is not merely an artifact of the proof techniques, and that mirror descent indeed exhibits such convergence behavior.
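To make the object under review concrete, the following is a minimal discretized sketch of mirror descent with the hypentropy mirror map on the phase retrieval least-squares loss. Note that the paper analyzes the continuous-time flow with a support-aware initialization; the step size, temperature, iteration count, and the simple constant initialization below are my own illustrative choices, not the authors' scheme.

```python
import numpy as np

def mirror_descent_spr(A, y, beta=0.01, eta=0.1, iters=5000):
    """Discretized mirror descent for sparse phase retrieval with the
    hypentropy mirror map psi_beta(x) = sum_i (x_i*arcsinh(x_i/beta)
    - sqrt(x_i^2 + beta^2)), whose gradient acts coordinatewise as
    arcsinh(x/beta). The update runs in the dual (mirror) space and
    maps back to the primal via x = beta*sinh(u)."""
    m, n = A.shape
    u = np.arcsinh(np.ones(n))  # dual init: x(0) = beta on every coordinate
    for _ in range(iters):
        x = beta * np.sinh(u)                  # primal iterate
        Ax = A @ x
        # gradient of f(x) = (1/(4m)) * sum_j ((a_j^T x)^2 - y_j)^2
        grad = A.T @ ((Ax ** 2 - y) * Ax) / m
        u -= eta * grad                        # mirror step in dual space
    return beta * np.sinh(u)

# Small synthetic instance: uniform-magnitude k-sparse signal,
# noiseless squared Gaussian measurements.
rng = np.random.default_rng(0)
n, m, k = 100, 400, 3
x_star = np.zeros(n)
x_star[:k] = 1.0 / np.sqrt(k)
A = rng.standard_normal((m, n))
y = (A @ x_star) ** 2
x_hat = mirror_descent_spr(A, y)
# recovery is only possible up to a global sign
err = min(np.linalg.norm(x_hat - x_star), np.linalg.norm(x_hat + x_star))
```

On small synthetic instances such as the one above, this sketch recovers the signal up to global sign, with no explicit sparsity regularization, echoing the paper's setting.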

__ Strengths__: + This paper is excellently written.
+ The theoretical contributions are clear and compelling. To my knowledge, this is the first algorithm for sparse phase retrieval that succeeds without any parameter tuning, which can be a significant hurdle.
+ The proofs involve several interesting new elements that I have not seen before in the phase retrieval literature. These include: the use of non-Euclidean gradient flow techniques (i.e., continuous-time mirror descent); the use of the Bregman divergence of the mirror map to characterize convergence; careful analysis of on-support and off-support convergence; the careful choice of the temperature parameter in mirror descent; connections to exponentiated gradient descent (EG-plus-minus); and many others.
+ This paper also fills in a few gaps in the analysis offered by the recent preprint [53] that also attempts to solve sparse phase retrieval using (vanilla) gradient descent with an overparameterized representation.
+ Big picture-wise, the paper adds to the growing body of literature on implicit regularization achieved by gradient-based dynamics. Sparse phase retrieval is a "prototypical" non-convex problem, in that the objective function has exponentially many local minima and the sparsity constraint is non-differentiable. Understanding the behavior of gradient flow-type techniques on such problems can be very useful for more general machine learning problems relevant to the NeurIPS community.

__ Weaknesses__: [The following weaknesses can be major or minor, depending on how one interprets their relative significance. I personally think all these points are minor and should not detract from the contributions of the paper.]
- The absence of noise in the measurement model is somewhat limiting. What would happen to the performance of mirror descent in such cases? I suspect the method might become somewhat unstable since mirror descent requires computing the inverse of the metric tensor in each iteration.
- Practical aspects (such as how to properly discretize the continuous time updates) are not addressed.
- The analysis succeeds only for signals with (approximately) similar-magnitude coefficients.

__ Correctness__: I believe the claims are correct, although I did not check completely.
The experimental results are limited to small synthetic examples but evocative of the theoretical analysis.

__ Clarity__: The paper is very well-written.

__ Relation to Prior Work__: The paper does a very good job of connecting to existing work in both the phase retrieval as well as the algorithmic machine learning literature.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The authors consider the sparse (real) gaussian phase retrieval (GPR) problem,
with mean squared error objective and m square measurements of a k-sparse
n-dimensional signal. Their main contribution is a full convergence analysis
for a continuous-time mirror descent algorithm for this objective, using the
hypentropy mirror map and an initialization scheme that requires knowledge of
one coordinate of the ground truth's support (Theorem 2): they prove that when
the ell^2-normalized minimum nonzero element of the ground truth vector has
magnitude \Omega(1/sqrt(k)), the number of samples satisfies m >= \max{k^2,
\log^3 n} \log^2 n, and the mirror descent trajectory X(t) is bounded, the
trajectory X(t) converges in ell^2 to the ground truth at a rate that is
initially linear, and eventually sublinear, with constants depending on k and
n; the ultimate accuracy and length of the linear convergence phase can be
controlled through the mirror descent parameter \beta, which is the algorithm's
only hyperparameter. In reasonable scaling regimes for the sparsity k, this
result matches the sample complexities of other algorithms for sparse GPR. The
authors' analysis is motivated by a connection to the recently-proposed
Hadamard Wirtinger flow (HWF) algorithm, a practical algorithm for sparse GPR
that potentially offers increased adaptivity relative to other algorithms for
sparse GPR, as it does not enforce sparsity of the signal through explicit
regularization or hard-thresholding steps, and has been shown empirically
to offer performance advantages in terms of signal recovery rate over other
sparse GPR algorithms [53]. In particular, the authors give calculations
(section 5) that show HWF can be interpreted as a first-order approximation to
an algorithm equivalent to continuous-time mirror descent on the sparse GPR
objective with hypentropy mirror map, and they illustrate the fruitfulness of
this connection by presenting an experiment that demonstrates intuitions
obtained from the mirror descent convergence theorem in the context of the HWF
algorithm (after initializing HWF in a way inspired by the theorem). Besides
this motivation, the authors' analysis contributes to the growing literature on
gradient flow analyses for nonconvex problems, and in particular mirror descent
analyses for nonconvex problems.
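To fix notation for the dynamics being analyzed, my understanding (up to constants, and up to the exact additive normalization of the mirror map, which may differ from the paper's) is:

```latex
\psi_\beta(x) \;=\; \sum_{i=1}^n \left( x_i \,\mathrm{arcsinh}\!\left(\tfrac{x_i}{\beta}\right) - \sqrt{x_i^2 + \beta^2} \right),
\qquad
\frac{\mathrm{d}}{\mathrm{d}t}\, \nabla \psi_\beta\bigl(X(t)\bigr) \;=\; -\nabla f\bigl(X(t)\bigr),
```

where f denotes the empirical loss over the m squared measurements. Since \nabla\psi_\beta(x) = \mathrm{arcsinh}(x/\beta) acts coordinatewise, the flow is equivalent to dX_i/dt = -\sqrt{X_i^2 + \beta^2}\, \partial_i f(X(t)), which makes the multiplicative, HWF-like character of the dynamics (fast movement on large coordinates, vanishing speed near zero) transparent.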

__ Strengths__:
- The authors analyze a conceptual algorithm for sparse GPR that has desirable
qualitative features, as in HWF [53]: namely adaptivity and fewer tuning
parameters, with those present having simple interpretations. The convergence
analysis is complete, with almost all features one would desire for an
analysis of this type: the sample complexity matches that of standard sparse
GPR algorithms, the hypotheses are not overly restrictive, and the
aspects of the convergence behavior established that are not straightforward
(i.e., the switching of the rate from linear to sublinear) are explained at a
quantitative level and a qualitative level through the experiments.
- The authors provide a concrete connection to HWF (section 5), which could
enable the techniques underlying the convergence analysis of
the present work to be applied to a similar convergence analysis of HWF. The
experiments here make the point well, since the role of \beta in the
convergence analysis of Theorem 2 appears to express itself identically when
HWF is initialized with a mirror-descent-motivated setting of the
initialization weight (this kind of understanding appears to be absent from
the setting of the \alpha parameter in [53], for example).
- The analysis and result of Theorem 2 lead to a useful qualitative ``three
stage'' interpretation of the convergence process that seems suggestive of
some of the convergence behavior of HWF observed in [53]: as described in
lines 216-226, the convergence schedule goes from a slow ``warm-up'' phase to
a linear convergence phase to a sublinear phase, and the authors use the
analysis underlying the convergence result to give concrete reasons for the
behavior in each of these stages.
- Although the sublinear rate of convergence `in the tail' suggested by Theorem
2 seems pessimistic relative to the linear rates of convergence guaranteed
for other sparse GPR algorithms (e.g. SPARTA), the authors also discuss (lines
227-236) the interplay between the convergence process and the setting of the
mirror descent parameter \beta, demonstrating that with proper setting of
\beta, one can essentially guarantee via Theorem 2 linear convergence to a
desired accuracy (in particular, by making \beta very small, as the authors
discuss). As in the previous item, these considerations are displayed in a
small experiment in section 5 (Figure 1), which echoes the ideas the authors
discuss.
- The paper contributes to the general understanding of the performance of
continuous time mirror descent on nonconvex optimization problems with
statistical structure; understanding the technical conditions necessary for
success in such settings, as advanced here, may be of general use.

__ Weaknesses__:
- The continuous-time mirror descent algorithm studied here is not a practical
algorithm for sparse GPR, since in practice one would have to work with a
discretization, and the analysis only applies to the continuous time setting.
However, this type of analysis is becoming more standard recently, and can be
valuable for getting algorithmic insight in a setting with fewer technical
difficulties (as the authors discuss in the introduction). This
notion is borne out convincingly in the calculations and experiments of section
5, which transfer some insights from the convergence analysis of Theorem 2 to
the empirical performance of HWF.
- It is not a weakness per se, but a point where connections to the intuitions
behind improved recovery performance of HWF in certain sparsity regimes
(namely, regimes where there are a small number of entries in the ground
truth x* that are as large as a constant multiple of the ell^2 norm)
demonstrated empirically in [53] are missed: as the authors discuss in Remark
1, the sample complexity in Theorem 2 goes to O(k^2) instead of involving
k || x* ||_2^2 || x* ||_{\infty}^{-2} (ignoring logarithmic factors) as the
HWF experiments suggest should be the case, which precludes an improved
sample complexity in the aforementioned setting.

__ Correctness__: - Due to the length of the appendix I have not been able to review all of the
proofs of the supporting lemmas for Theorem 2 line by line. Based on the
authors' summary of the main ideas of the proof and my understanding of prior
works with a somewhat similar flavor (e.g. Sun et al. (2017), Chen et al.
(2019)), I believe the high-level approach and tools should be commensurate
with the task of establishing Theorem 2.

__ Clarity__: - The paper is well written and organized; the authors' writing makes it
easy to understand the mirror descent algorithm, the initialization scheme,
easy to understand the mirror descent algorithm, the initialization scheme,
and other relevant technical details. Examples and intuitions (e.g. sections
4-5) are pertinent and aid in appreciation of subtleties of the material.
- Small potential typo issues:
- - (7) and (8) involve the Bregman divergence of \psi instead of \Phi?
- - I may be missing something (per the discussion in ``Scaling with signal
magnitude'' on page 6), but is the definition of x^{\star}_{min} in line 181
supposed to be normalized by || x^{\star} ||_2? This seems to clash with the
normalization that is used in the lower bounds on x^{\star}_{min} in the
first lines of the statements of Lemma 1 and Theorem 2; but I may be missing
something here.
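For reference, assuming the standard definition, the Bregman divergence of a differentiable mirror map \psi is

```latex
D_\psi(x, y) \;=\; \psi(x) - \psi(y) - \langle \nabla\psi(y),\, x - y \rangle,
```

so the divergence appearing in (7) and (8) should presumably be taken with respect to whichever of \psi or \Phi denotes the hypentropy map.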

__ Relation to Prior Work__: - The paper is well-referenced, with an appropriate discussion of contemporary
works on algorithms for sparse phase retrieval (and how they differ, in terms
of advantages/disadvantages, from HWF/the present work's mirror descent
algorithm), and of the technical approaches used in similar works analyzing
mirror descent for nonconvex problems. In particular, exploiting the presence
in this problem of an ``along-trajectory'' variational coherence property is
likely similar to the kinds of analyses used to prove that first-order methods
succeed in related problems (e.g. matrix sensing/factorization).

__ Reproducibility__: Yes

__ Additional Feedback__: --- POST-REBUTTAL ---
I thank the authors for their response to my review, in particular for their insightful response to my fairly vaguely-worded thoughts on the sample complexity in the main theorem. After reading the other referees' reviews, I have decided to keep my score the same. I am eager to see what theoretical work will follow from the authors' result.
---------------------

__ Summary and Contributions__: This paper studies the problem of sparse phase retrieval. For this, a new approach based on mirror descent (using the hypentropy mirror map) is proposed.
The authors prove that, with at least on the order of k^2 measurements (ignoring log factors; here k denotes the sparsity of the signal), their (continuous-time) algorithm recovers the unknown signal.
Furthermore, the claims are supported by experiments and the connection to the "Hadamard Wirtinger Flow" is discussed.
-------------------
After reading the rebuttal as well as the other referees' comments, I changed my opinion on the strength of the paper. For this reason, I will change my score to 7.

__ Strengths__: Sparse phase retrieval is an important problem to both the machine learning and the signal processing community. This paper provides a new approach together with a theoretical analysis. I think that this new point of view and the connection between mirror descent, Hadamard Wirtinger Flow, and sparse phase retrieval is inspiring and will trigger follow-up work.
In addition, this paper might help in understanding non-convex problems and mirror descent for those problems in general.

__ Weaknesses__: -For the main result, it is required that x_min be at least of the order of ||x||/sqrt(k).
This is a fairly strong additional assumption, and it is also rather unintuitive: signals whose mass is equidistributed over the support should be harder to recover than signals which are largely unbalanced.
The paper would be stronger if this assumption were relaxed.
Note that, for example, [10] does not need this assumption. I would ask the authors, when citing this paper in Remark 1, to acknowledge this rather than say "the sample complexity in Theorem 2 matches that of existing results". Moreover, the authors might point the interested reader to the exact location in the proof where this assumption is needed.
(While strengthening the result may not be possible during the rebuttal phase, I would ask the authors to consider the points listed in the section above.)
-Regarding the experiment in Section 5: The authors have chosen m=n=1000. This means the number of measurements is already as large as the ambient dimension. In my own experience, this can lead to many artifacts. The experiments would be much more convincing if the authors devised a setting where k^2 < m << n.

__ Correctness__: The claims and proofs in the paper are sound.

__ Clarity__: The paper is well written and very clear.

__ Relation to Prior Work__: In general, prior work is discussed with sufficient detail.
However, the authors might add the following references which deal with sparse phase retrieval:
-"Structured Signal Recovery From Quadratic Measurements: Breaking Sample Complexity Barriers via Nonconvex Optimization", Soltanolkotabi
https://ieeexplore.ieee.org/document/8606215
(solves phase retrieval with optimal sample complexity via projected gradient descent given a good initialization)
-"Solving Equations of Random Convex Functions via Anchored Regression" Bahmani, Romberg
https://link.springer.com/article/10.1007/s10208-018-9401-4
(solves sparse phase retrieval with optimal sample complexity given a good initialization)

__ Reproducibility__: Yes

__ Additional Feedback__:
-Appendix, p. 14: In the last equation of the proof of Lemma 1, a factor 2 is missing after the integration.