Review for NeurIPS paper: A Spectral Energy Distance for Parallel Speech Synthesis

NeurIPS 2020

A Spectral Energy Distance for Parallel Speech Synthesis

Review 1

Summary and Contributions: The authors propose a learning method that generates speech fully in parallel, without explicit likelihood training. The method is based on a generalized energy distance between the distributions of generated and real speech. This spectral energy distance can be estimated from mini-batches without bias, and it stabilizes the training of implicit generative models. When combined with adversarial training, the method generates high-quality speech as measured by Mean Opinion Score.

Strengths: 1. Stabilizes the training of implicit generative models using a generalized energy distance; 2. Synthesizes high-fidelity speech in parallel in both the GED and GED + GAN settings.

Weaknesses: The proposed generalized energy distance is similar to the techniques proposed in "Improved Techniques for Training GANs" (Salimans et al., NIPS 2016). The "attractive" term in GED is similar to feature matching, and the "repulsive" term in GED is similar to mini-batch discrimination, a technique used to counter mode collapse. Further comparison is needed.
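
For reference, the two terms can be read off the standard form of the generalized energy distance. The block below is a sketch in assumed notation (p for the data distribution, q for the model distribution, d for a distance between waveforms); it is not quoted from the paper.

```latex
D^2(p, q) = 2\,\mathbb{E}\big[d(x, y)\big] - \mathbb{E}\big[d(x, x')\big] - \mathbb{E}\big[d(y, y')\big],
\qquad x, x' \sim p, \quad y, y' \sim q .
% The term E[d(x, x')] does not depend on q, so minimizing over q is equivalent to minimizing
L(q) = \mathbb{E}\big[\, 2\,d(x, y) - d(y, y') \,\big] .
```

The first term of L(q) pulls generated samples toward real ones (the "attractive" term, the counterpart of feature matching). The second term rewards two independent generated samples for differing from each other (the "repulsive" term, the counterpart of mini-batch discrimination).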

Correctness: The paper is technically solid.

Clarity: The paper is well written.

Relation to Prior Work: See comment above.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This work proposes the spectral energy distance for training parallel waveform models. It has an interesting repulsive term, which addresses the over-smoothing problem of spectrogram loss for high-fidelity speech synthesis.
Pros:
- The repulsive term for spectrogram loss is well motivated and quite interesting. Overall, it is a complementary solution or even a good substitute for GAN loss, because it is much simpler to train.
Cons:
- The ablation studies are not performed in a thorough way (see my comments).

Strengths: See my comments.

Weaknesses: See my comments.

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: No

Additional Feedback: Comments:
- Section 2: Flow-based models are not necessarily large. The new state-of-the-art WaveFlow is a small-footprint flow-based model for raw audio. The authors may reference WaveFlow and correct the inaccurate claim in the related work section.
- After Eq. (1), the definitions of p(x) and q(y) should be stated.
- It is interesting that the repulsive term provides such a noticeable improvement in audio fidelity for a spectrogram-based loss.
- It would be more interesting to see an ablation study that investigates the individual contributions of the repulsive term and the multi-scale spectrogram loss. For example, what is the MOS when the repulsive term is combined with a single-scale spectrogram loss? I usually don't take measures such as FDSD seriously, as they generally cannot provide meaningful comparisons across different models, which the authors also observe.
- As suggested by Parallel WaveGAN, combining a multi-scale spectrogram loss with a GAN loss can also provide good results. It would be very nice to see an ablation study with MOS results that varies three design choices: 1) with or without the repulsive term, 2) single-scale or multi-scale spectrogram loss, 3) with or without the GAN loss. This would single out and emphasize the benefit of the repulsive term under different circumstances.
- Linguistic-feature-based TTS systems are almost infeasible to reproduce, as they involve a huge number of hand-engineered features. I wish the authors would provide as much information about these linguistic/pitch features as they can, for example, which model predicts these features at synthesis time. In contrast, a neural vocoder experiment conditioned on mel spectrograms would be much easier to reproduce.

=== Post-rebuttal update ===
I've read the rebuttal. I think the authors have made a substantial effort to improve the paper in several respects, including ablation studies, related work, and reproducibility. Thus, I've changed the overall score from 6 to 7. Overall, I like this work. Note that being more parallel, or having fewer sequential steps, is not necessarily an advantage. WaveFlow has more sequential steps than WaveGlow, but it runs faster on GPU at synthesis time due to its small footprint.
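
The three-way ablation requested in the comments above amounts to a 2 x 2 x 2 grid of training configurations. A minimal sketch of enumerating that grid follows; the key names and labels are hypothetical and chosen only for illustration, not taken from the paper or its code.

```python
from itertools import product

# Hypothetical labels for the three design choices named in the review.
DESIGN_CHOICES = {
    "repulsive_term": (False, True),
    "spectrogram_loss": ("single-scale", "multi-scale"),
    "gan_loss": (False, True),
}


def ablation_grid(choices=DESIGN_CHOICES):
    """Return one config dict per combination of the design choices."""
    names = list(choices)
    return [
        dict(zip(names, values))
        for values in product(*(choices[name] for name in names))
    ]


if __name__ == "__main__":
    # Each printed row is one system that would need its own MOS evaluation.
    for index, config in enumerate(ablation_grid(), start=1):
        print(index, config)
```

Reporting MOS for all eight rows would separate the effect of the repulsive term from the effects of the multi-scale loss and the GAN loss.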

Review 3

Summary and Contributions: This paper derives a spectral distance for parallel TTS systems and shows that, when combined with a GAN loss, it outperforms GAN-based TTS systems.

Strengths: 1. The derivation of the proposed loss is clear and well-grounded. 2. The explanation of the repulsive term is clear and enlightening. 3. The generated audio samples are convincing.

Weaknesses: The major weakness of this paper is that the contribution is not among the most significant, nor is it well supported by the experiments. Specifically:
1. Although the proposed loss term is well grounded, it is essentially an Lp norm of the error in the frequency domain. Applying an Lp norm to spectral error has been studied previously for many applications that generate speech waveforms, e.g. speech enhancement [1]. The repulsive term is the major novelty in this context, but it is neither highlighted in the motivation nor fully studied in the empirical evaluation.
2. The performance advantage of the proposed method over the baselines is not very pronounced. The main motivation is that existing parallel synthesis baselines, including flow-based and GAN-based methods, are very hard to train, whereas the proposed loss is very easy to train. However, we do not see competitive performance when the proposed loss is used alone. The tradeoff between training complexity and performance is ubiquitous, and this paper fails to show that the proposed loss term achieves a good tradeoff. The only system that performs (marginally) better than the baselines is the one combined with the GAN loss, but this defeats the purpose of having a simpler training scheme.
3. The authors claim that the proposed loss term is still useful when combined with a GAN, because it replaces the conditional generation of the GAN. However, this claim is unwarranted, because the authors did not compare the system with a conditional GAN against the one without. It would therefore be very helpful if the authors also implemented GED + conditional GAN.
4. There are other research attempts to find efficient AND high-quality synthesis architectures, e.g. WaveFlow [2], which is much easier to train and can still produce high-quality synthesis. It would be nice if the authors discussed their contribution in relation to such work.
5. Considering that the repulsive term is the major novelty, there should be a more thorough evaluation of it. For example, it would be interesting to see the results of GED + GAN without the repulsive term. If the repulsive term is truly irreplaceable, this result is expected to be much poorer than the best model.
In summary, the motivation, related work, and evaluation sections need to be improved in order to highlight the significance of the paper's contribution.
[1] Pandey, Ashutosh, and DeLiang Wang. "A new framework for CNN-based speech enhancement in the time domain." IEEE/ACM Transactions on Audio, Speech, and Language Processing 27.7 (2019): 1179-1188.
[2] Ping, Wei, et al. "WaveFlow: A compact flow-based model for raw audio." arXiv preprint arXiv:1912.01219 (2019).
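
Items 1 and 5 above can be illustrated with a small sketch that separates the attractive term (a spectral Lp error) from the repulsive term. It assumes a single-scale L1 distance on magnitude spectra computed with a naive DFT; the function names are hypothetical, and the paper's actual multi-scale distance is not reproduced here.

```python
import cmath
import math


def magnitude_spectrum(signal):
    """Magnitudes of the naive O(n^2) DFT of a real-valued sequence."""
    n = len(signal)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


def spectral_l1(a, b):
    """L1 distance between the magnitude spectra of two equal-length signals."""
    return sum(abs(u - v) for u, v in zip(magnitude_spectrum(a), magnitude_spectrum(b)))


def ged_loss(real, gen_a, gen_b, dist=spectral_l1, repulsive=True):
    """Minibatch estimate of E[2 d(x, y) - d(y, y')].

    real[i] is a real signal; gen_a[i] and gen_b[i] are two independent model
    samples drawn for the same conditioning input. With repulsive=False only
    the attractive term 2 d(x, y) is kept, i.e. a plain spectral L1 loss.
    """
    total = 0.0
    for x, y, y_prime in zip(real, gen_a, gen_b):
        total += 2.0 * dist(x, y)       # attractive: match the real signal
        if repulsive:
            total -= dist(y, y_prime)   # repulsive: keep samples diverse
    return total / len(real)


if __name__ == "__main__":
    x = [0.0, 1.0, 0.0, -1.0]
    flat = [1.0, 1.0, 1.0, 1.0]
    print(ged_loss([x], [x], [flat]))                   # with the repulsive term
    print(ged_loss([x], [x], [flat], repulsive=False))  # attractive term only
```

When the generator is deterministic (gen_a equals gen_b), the repulsive term vanishes and the two settings give the same value, which is why the "GED + GAN without repulsive term" comparison requested in item 5 isolates the effect of sample diversity.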

Correctness: The derivation of the proposed loss term is correct. The empirical methodology is largely correct, although I remain skeptical of the FDSD metric. More specifically, since FDSD uses DeepSpeech2 features, and the input to DeepSpeech2 is a spectrogram, it is possible that any loss term that directly targets the spectrogram, such as the GED loss, would have an advantage under this metric. This could also explain why GED+iSTFT performs better than GED: the former directly generates spectrograms and thus has a greater degree of freedom to fit the distribution in the frequency domain. It would be helpful if the authors could comment on this.
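
For context, and assuming FDSD follows the usual Fréchet-distance construction (Gaussians fitted to DeepSpeech2 features of real and generated audio), the quantity being compared would take the form below. The symbols are assumed notation, not taken from the paper.

```latex
\mathrm{FD}^2 = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\big( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \big)
```

Here (\mu_r, \Sigma_r) and (\mu_g, \Sigma_g) are the feature mean and covariance for real and generated speech. Because the features are computed from spectrograms, a loss that matches spectrogram statistics directly could lower this value without a matching gain in perceived quality, which is the concern raised above.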

Clarity: This paper has a clear explanation and is well-written.

Relation to Prior Work: The authors should discuss prior work that generates speech in the time domain but applies a spectral loss term. They should also include other TTS systems that attempt to simplify the training/generation procedure, e.g. WaveFlow. See items 1 and 4 under Weaknesses.

Reproducibility: Yes

Additional Feedback: I would like to thank the authors for the response. It addresses some of my concerns. Therefore, I have adjusted my score accordingly.