{"title": "Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion", "book": "Advances in Neural Information Processing Systems", "page_first": 6793, "page_last": 6803, "abstract": "End-to-end models for raw audio generation are a challenge, specially if they have to work with non-parallel data, which is a desirable setup in many situations. Voice conversion, in which a model has to impersonate a speaker in a recording, is one of those situations. In this paper, we propose Blow, a single-scale normalizing flow using hypernetwork conditioning to perform many-to-many voice conversion between raw audio. Blow is trained end-to-end, with non-parallel data, on a frame-by-frame basis using a single speaker identifier. We show that Blow compares favorably to existing flow-based architectures and other competitive baselines, obtaining equal or better performance in both objective and subjective evaluations. We further assess the impact of its main components with an ablation study, and quantify a number of properties such as the necessary amount of training data or the preference for source or target speakers.", "full_text": "Blow: a single-scale hyperconditioned \ufb02ow for\n\nnon-parallel raw-audio voice conversion\n\nJoan Serr\u00e0\n\nTelef\u00f3nica Research\n\njoan.serra@telefonica.com\n\nSantiago Pascual\n\nUniversitat Polit\u00e8cnica de Catalunya\n\nsanti.pascual@upc.edu\n\nCarlos Segura\n\nTelef\u00f3nica Research\n\ncarlos.seguraperales\n\n@telefonica.com\n\nAbstract\n\nEnd-to-end models for raw audio generation are a challenge, specially if they have\nto work with non-parallel data, which is a desirable setup in many situations. Voice\nconversion, in which a model has to impersonate a speaker in a recording, is one\nof those situations. In this paper, we propose Blow, a single-scale normalizing\n\ufb02ow using hypernetwork conditioning to perform many-to-many voice conversion\nbetween raw audio. 
Blow is trained end-to-end, with non-parallel data, on a frame-\nby-frame basis using a single speaker identi\ufb01er. We show that Blow compares\nfavorably to existing \ufb02ow-based architectures and other competitive baselines,\nobtaining equal or better performance in both objective and subjective evaluations.\nWe further assess the impact of its main components with an ablation study, and\nquantify a number of properties such as the necessary amount of training data or\nthe preference for source or target speakers.\n\n1\n\nIntroduction\n\nEnd-to-end generation of raw audio waveforms remains a challenge for current neural systems.\nDealing with raw audio is more demanding than dealing with intermediate representations, as it\nrequires a higher model capacity and a usually larger receptive \ufb01eld. In fact, producing high-level\nwaveform structure was long thought to be intractable, even at a sampling rate of 16 kHz, and is\nonly starting to be explored with the help of autoregressive models [1\u20133], generative adversarial\nnetworks [4, 5] and, more recently, normalizing \ufb02ows [6, 7]. Nonetheless, generation without long-\nterm context information still leads to sub-optimal results, as existing architectures struggle to capture\nsuch information, even if they employ a theoretically suf\ufb01ciently large receptive \ufb01eld (cf. [8]).\nVoice conversion is the task of replacing a source speaker identity by a targeted different one while\npreserving spoken content [9, 10]. It has multiple applications, the main ones being in the medical,\nentertainment, and education domains (see [9, 10] and references therein). Voice conversion systems\nare usually one-to-one or many-to-one, in the sense that they are only able to convert from a single or,\nat most, a handful of source speakers to a unique target one. 
While this may be sufficient for some cases, it limits their applicability and, at the same time, it prevents them from learning from multiple targets. In addition, voice conversion systems are usually trained with parallel data, in a strictly supervised fashion. To do so, one needs input/output pairs of recordings with the corresponding source/target speakers pronouncing the same underlying content with a relatively accurate temporal alignment. Collecting such data is non-scalable and, in the best of cases, problematic. Thus, researchers are shifting towards the use of non-parallel data [11-15]. However, non-parallel voice conversion is still an open issue, with results that are far from those using parallel data [10].
In this work, we explore the use of normalizing flows for non-parallel, many-to-many, raw-audio voice conversion. We propose Blow, a normalizing flow architecture that learns to convert voice recordings end-to-end with minimal supervision. It only employs individual audio frames, together with an identifier or label that signals the speaker identity in such frames.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Blow inherits some structure from Glow [16], but introduces several improvements that, besides yielding better likelihoods, prove crucial for effective voice conversion. Improvements include the use of a single-scale structure, many blocks with few flows in each, a forward-backward conversion mechanism, a conditioning module based on hypernetworks [17], shared speaker embeddings, and a number of data augmentation strategies for raw audio. We quantify the effectiveness of Blow both objectively and subjectively, obtaining comparable or even better performance than a number of baselines. 
We also perform an ablation study to quantify the relative importance of every new component, and assess further aspects such as the preference for source/target speakers or the relation between objective scores and the amount of training audio. We use public data and make our code available at https://github.com/joansj/blow. A number of voice conversion examples are provided at https://blowconversions.github.io.

2 Related work

To the best of our knowledge, there are no published works utilizing normalizing flows for voice conversion, and only three using normalizing flows for audio in general. Prenger et al. [6] and Kim et al. [7] concurrently propose using normalizing flows as a decoder from mel spectrograms to raw audio. Their models are based on Glow, but with a WaveNet [1] structure in the affine coupling network. Yamaguchi et al. [18] employ normalizing flows for audio anomaly detection and cross-domain image translation. They propose the use of class-dependent statistics to adaptively normalize flow activations, as done with AdaBN for regular networks [19].

2.1 Normalizing flows

Based on Barlow's principle of redundancy reduction [20], Redlich [21] and Deco and Brauer [22] already used invertible volume-preserving neural architectures. In more recent times, Dinh et al. [23] proposed performing factorial learning via maximum likelihood for image generation, still with volume-preserving transformations. Rezende and Mohamed [24] and Dinh et al. [25] introduced the usage of non-volume-preserving transformations, the former adopting the terminology of normalizing flows and the use of affine and radial transformations [26]. Kingma and Dhariwal [16] proposed an effective architecture for image generation and manipulation that leverages 1×1 invertible convolutions. 
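The 1×1 invertible convolution just mentioned admits a compact illustration: the same channel-mixing matrix W is applied at every position, its inverse undoes the mixing, and its contribution to the log-likelihood is simply the number of positions times log |det W|. The following is a minimal pure-Python sketch with a toy 2×2 mixing matrix and hand-picked values of our own choosing, not tied to any flow library:

```python
import math

def mix_channels(h, W):
    # A 1x1 invertible convolution applies the same channel-mixing
    # matrix W at every time step: out[t] = W @ h[t].
    return [[sum(W[i][k] * col[k] for k in range(len(col)))
             for i in range(len(W))] for col in h]

def inv2(W):
    # Inverse of a 2x2 matrix, used to run the convolution backwards.
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    return [[W[1][1] / det, -W[0][1] / det],
            [-W[1][0] / det, W[0][0] / det]]

def logdet_contribution(W, timesteps):
    # The layer adds timesteps * log|det W| to the log-likelihood.
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    return timesteps * math.log(abs(det))

W = [[2.0, 1.0], [0.0, 0.5]]   # invertible toy matrix (det = 1)
h = [[0.3, -0.2], [1.0, 0.4]]  # two time steps, two channels
mixed = mix_channels(h, W)
restored = mix_channels(mixed, inv2(W))
```

Running the mixing forward and then with the inverted matrix recovers the input exactly, which is what makes the layer usable inside an invertible flow.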
Despite having gained little attention compared to generative adversarial networks, autoregressive models, or variational autoencoders, flow-based models feature a number of merits that make them especially attractive [16], including exact inference and likelihood evaluation, efficient synthesis, a useful latent space, and some potential for gradient memory savings.

2.2 Non-parallel voice conversion

Non-parallel voice conversion has a long tradition of approaches using classical machine learning techniques [27-30]. However, today, neural networks dominate the field. Some approaches make use of automatic speech recognition or text representations to disentangle content from acoustics [31, 32]. This easily removes the characteristics of the source speaker, but further challenges the generator, which needs additional context to properly define the target voice. Many approaches employ a vocoder for obtaining an intermediate representation and as a generation module. Those typically convert between intermediate representations using variational autoencoders [11, 12], generative adversarial networks [13, 14], or both [15]. Finally, there are a few works employing a fully neural architecture on raw audio [33]. In that case, parts of the architecture may be pre-trained or not learned end-to-end. Besides voice conversion, there are some works dealing with non-parallel music or audio conversion: Engel et al. [34] propose a WaveNet autoencoder for note synthesis and instrument timbre transformations; Mor et al. [35] incorporate a domain-confusion loss for general musical translation, and Nachmani and Wolf [36] incorporate an identity-agnostic loss for singing voice conversion; Haque et al. 
[37] use a sequence-to-sequence model for audio style transfer.

3 Flow-based generative models

Flow-based generative models learn a bijective mapping from input samples x ∈ X to latent representations z ∈ Z such that z = f(x) and x = f^{-1}(z). This mapping f, commonly called a normalizing flow [24], is a function parameterized by a neural network, and is composed of a sequence of k invertible transformations f = f_1 ∘ ··· ∘ f_k. Thus, the relationship between x and z, which are of the same dimensionality, can be expressed [16] as

    x ≜ h_0 ↔ h_1 ↔ h_2 ··· ↔ h_k ≜ z,

where the i-th double arrow corresponds to the invertible transformation f_i. For a generative approach, we want to model the probability density p(X) in order to be able to generate realistic samples. This is usually intractable in a direct way, but we can now use f to model the exact log-likelihood

    L(X) = (1/|X|) Σ_{i=1}^{|X|} log(p(x_i)).    (1)

For a single sample x, and using a change of variables, the inverse function theorem, compositionality, and logarithm properties (Appendix A), we can write

    log(p(x)) = log(p(z)) + Σ_{i=1}^{k} log |det(∂f_i(h_{i-1}) / ∂h_{i-1})|,

where ∂f_i(h_{i-1})/∂h_{i-1} is the Jacobian matrix of f_i at h_{i-1} and the log-determinants measure the change in log-density made by f_i. In practice, one chooses transformations f_i with triangular Jacobian matrices to achieve a fast calculation of the determinant and ensure invertibility, albeit these may not be as expressive as more elaborate ones (see for instance [38-40]). 
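To make the change-of-variables computation concrete, here is a minimal one-dimensional sketch (toy affine flow steps with scales and shifts chosen by hand; a real flow parameterizes them with neural networks):

```python
import math

def gaussian_logp(z):
    # Log-density of a unit Gaussian prior, the usual choice for p(z).
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def flow_logp(x, steps):
    # log p(x) = log p(z) + sum_i log|det df_i/dh_{i-1}|
    h, logdet = x, 0.0
    for scale, shift in steps:   # each f_i: h -> scale * h + shift
        h = scale * h + shift
        logdet += math.log(abs(scale))
    return gaussian_logp(h) + logdet

# Two toy steps whose composition is x -> x - 2.5 (total |det| = 1).
steps = [(2.0, 1.0), (0.5, -3.0)]
lp = flow_logp(0.4, steps)
```

Because the two toy steps compose to x ↦ x − 2.5 with unit total Jacobian determinant, the result equals the Gaussian log-density evaluated at the composed output, which is easy to verify by hand.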
Similarly, one chooses an isotropic unit Gaussian for p(z) in order to allow fast sampling and straightforward operations.
A number of structures and parameterizations of f and f_i have been proposed for image generation, the most popular ones being RealNVP [25] and Glow [16]. More recently, other works have proposed improvements for better density estimation and image generation in multiple contexts [38-43]. RealNVP uses a block structure with batch normalization, masked convolutions, and affine coupling layers. It combines those with 2×2 squeezing operations and alternating checkerboard and channel-wise masks. Glow goes one step further and, besides replacing batch normalization by activation normalization (ActNorm), introduces a channel-wise mixing through invertible 1×1 convolutions. Its architecture is composed of 3 to 6 blocks, formed by a 2×2 squeezing operation and 32 to 64 steps of flow, which comprise a sequence of ActNorm, 1×1 invertible convolution, and affine coupling. For the affine coupling, three convolutional layers with rectified linear units (ReLUs) are used. Both Glow and RealNVP feature a multi-scale structure that factors out components of z at different resolutions, with the intention of defining intermediary levels of representation at different granularities. This is also the strategy followed by other image generation flows and the two existing audio generation ones [6, 7].

4 Blow

Blow inherits some structure from Glow, but incorporates several modifications that we show are key for effective voice conversion. The main ones are the use of (1) a single-scale structure, (2) more blocks with fewer flows in each, (3) a forward-backward conversion mechanism, (4) a hyperconditioning module, (5) shared speaker embeddings, and (6) a number of data augmentation strategies for raw audio. We now provide an overview of the general structure (Fig. 
1).
We use one-dimensional 2× squeeze operations with an alternate pattern [25] and a series of steps of flow (Fig. 1, left). A step of flow is composed of a linear invertible layer as channel mixer (similar to a 1×1 invertible convolution in the two-dimensional case), ActNorm, and a coupling network with affine coupling (Fig. 1, center). Coupling networks are formed by one-dimensional convolutions and hyperconvolutions with ReLU activations (Fig. 1, right). The last convolution and the hyperconvolution of the coupling network have a kernel width of 3, while the intermediate convolution has a kernel width of 1 (we use 512×512 channels). The same speaker embedding feeds all coupling networks, and is independently adapted for each hyperconvolution. Following common practice, we compare the output z against a unit isotropic Gaussian and optimize the log-likelihood L (Eq. 1) normalized by the dimensionality of z.

4.1 Single-scale structure

Besides the aforementioned ability to deal with intermediary levels of representation, a multi-scale structure is thought to encourage the gradient flow and, therefore, facilitate the training of very deep models [44] like normalizing flows.

Figure 1: Blow schema featuring its block structure (left), steps of flow (center), and coupling network with hyperconvolution module (right).

Here, in preliminary analysis, we observed that speaker identity traits were present almost exclusively at the coarser level of representation. 
Moreover, we found that, by removing the multi-scale structure and carrying the same input dimensionality across blocks, not only were gradients flowing without issue, but better log-likelihoods were also obtained (see below).
We believe that gradients still flow without factoring out block activations because the log-determinant term in the loss function is still factored out at every flow step (Appendix A). Therefore, some gradient is still shuttled back to the corresponding layer and below. The fact that we obtain better log-likelihoods with a single-scale structure was somewhat expected, as block activations now undergo further processing in subsequent blocks. However, to our understanding, this aspect seems to be missed in the likelihood-based evaluation of current image generation flows.

4.2 Many blocks

Flow-based image generation models deal with images between 32×32 and 256×256 pixels. For raw audio, a one-dimensional input of 256 samples at 16 kHz corresponds to 16 ms, which is insufficient to capture any interesting speech construct. Phoneme duration can be between 50 and 180 ms [45], and we need a little more length to model some phoneme transition. Therefore, we need to increase the input and the receptive field of the model. To do so, flow-based audio generation models [6, 7] opt for more aggressive squeezing factors, together with a WaveNet-style coupling network with dilation up to 2^8. In Blow, in contrast, we opt for using many blocks with relatively few flow steps each. In particular, we use 8 blocks with 12 flows each (an 8×12 structure). Since every block has a 2× squeeze operation, this implies a total squeezing of 2^8 samples.
Considering two convolutions of kernel width 3, an 8×12 structure yields a receptive field of roughly 12500 samples that, at 16 kHz, corresponds to 781 ms. 
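The squeezing arithmetic can be checked with a minimal sketch of a one-dimensional 2× squeeze (pure Python on [channels][time] nested lists; the even/odd alternate pattern is our reading of the squeeze in [25]):

```python
def squeeze(h, factor=2):
    # [channels][time] -> [channels*factor][time/factor]; each channel is
    # split into its even- and odd-indexed samples (alternate pattern).
    out = []
    for channel in h:
        for k in range(factor):
            out.append(channel[k::factor])
    return out

# A frame of 4096 raw samples passed through 8 blocks, each with a 2x squeeze:
h = [list(range(4096))]   # 1 channel, 4096 time steps
for _ in range(8):
    h = squeeze(h)
# Total squeezing is 2^8 = 256: we end up with 256 channels of 16 time steps.
```

Each squeeze halves the time axis and doubles the channel count, so eight of them yield a total squeezing factor of 256 while preserving every sample.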
However, to allow for larger batch sizes, we use an input frame size of 4096 samples (256 ms at 16 kHz). This is sufficient to accommodate, at least, one phoneme and one phoneme transition if we cut in the middle of words, and is comparable to the receptive field of other successful models like WaveNet. Blow operates on a frame-by-frame basis without context; we admit that this could be insufficient to model long-range speaker-dependent prosody, but nonetheless believe it is enough to model core speaker identity traits.

4.3 Forward-backward conversion

The default strategy to perform image manipulation [16] or class-conditioning [41, 42] in Glow-based models is to work in the z space. This has a number of interesting properties, including the possibility to perform progressive changes or interpolations, and the potential for few-shot learning or manipulations based on small data. However, we observed that, for voice conversion, results following this strategy were largely unsatisfactory (Appendix B).
Instead of using z to perform identity manipulations, we think of it as an identity-agnostic representation. Our idea is that any supplied condition specifying some real input characteristic of x should be useful to transform x to z, especially if we consider a maximum likelihood objective. That is, knowing a condition/characteristic of the input should facilitate the discovery of further similarities that were hidden by said condition/characteristic, and thus facilitate learning. 
Following this line of thought, if conditioning at multiple levels in the flow from x to z progressively gets us to a condition-free z space (Appendix C.3), then, when transforming back from z to x with a different condition, that should also progressively imprint the characteristics of this new condition to the output x. Blow uses the source speaker identifier y_S for transforming x^(S) to z, and the target speaker identifier y_T for transforming z to the converted audio frame x^(T).

4.4 Hyperconditioning

A straightforward place to introduce conditioning in flow-based models is the coupling network, as no Jacobian matrix needs to be computed and no invertibility constraints apply. Furthermore, in the case of affine channel-wise couplings [16, 25], the coupling network is in charge of performing most of the transformation, so we want it to have a great representation power, possibly boosted by further conditioning information. A common way to condition the coupling network is to add or concatenate some representation to its input layers. However, based on our observations that concatenation tended to be ignored and that addition was not powerful enough, we decided to perform conditioning directly with the weights of the convolutional kernels. That is, a conditioning representation determines the weights employed by a convolution operator, as done with hypernetworks [17]. We do it at the first layer of the coupling network (Fig. 1, right).
Using one-dimensional convolutions, and given an input activation matrix H, for the i-th convolutional filter we have

    h^(i) = W_y^(i) ∗ H + b_y^(i),    (2)

where ∗ is the one-dimensional convolution operator, and W_y^(i) and b_y^(i) represent the i-th kernel weights and bias, respectively, imposed by condition y. A set of n condition-dependent kernels and biases K_y can be obtained by

    K_y = {(W_y^(1), b_y^(1)), ..., (W_y^(n), b_y^(n))} = g(e_y),    (3)

where g is an adapter network that takes the conditioning representation e_y as input, which in turn depends on condition identifier y (the speaker identity in our case). Vector e_y is an embedding that can either be fixed or initialized at some pre-calculated feature representation of a speaker, or learned from scratch if we need a standalone model. In this paper we choose the standalone version.

4.5 Structure-wise shared embeddings

We find that learning one e_y per coupling network usually leads to sub-optimal results. We hypothesize that, given a large number of steps of flow (or coupling networks), independent conditioning representations do not need to focus on the essence of the condition (the speaker identity), and are thus free to learn any combination of numbers that minimizes the negative log-likelihood, irrespective of their relation with the condition. Therefore, to reduce the freedom of the model, we decide to constrain such representations. Loosely inspired by the StyleGAN architecture [46], we set a single learnable embedding e_y that is shared by each coupling network in all steps of flow (Fig. 1, left). This reduces both the number of parameters and the freedom of the model, and turns out to yield better results. Following a similar reasoning, we also use the smallest possible adapter network g (Fig. 1, right): a single linear layer with bias that merely performs dimensionality adjustment.

4.6 Data augmentation

To train Blow, we discard silent frames (Appendix B) and then enhance the remaining ones with 4 data augmentation strategies. Firstly, we apply a temporal jitter. 
We shift the start j of each frame x as j′ = j + ⌊U(−ξ, ξ)⌉, where U is a uniform random number generator and ξ is half of the frame size. Secondly, we use a random pre-/de-emphasis filter. Since the identity of the speaker is not going to vary with a simple filtering strategy, we apply an emphasis filter [47] with a coefficient α = U(−0.25, 0.25). Thirdly, we perform a random amplitude scaling. Speaker identity is also going to be preserved with scaling, plus we want the model to be able to deal with any amplitude between −1 and 1. We use x′ = U(0, 1) · x / max(|x|). Finally, we randomly flip the values in the frame. Auditory perception is relative to an average pressure level, so we can flip the sign of x to obtain a different input with the same perceptual qualities: x′ = sgn(U(−1, 1)) · x.

4.7 Implementation details

We now outline the details that differ from the common implementation of flow-based generative models and further refer the interested reader to the provided code for a full account of them. We also want to note that we did not perform any hyperparameter tuning on Blow.
General — We train Blow with Adam using a learning rate of 10^−4 and a batch size of 114. We anneal the learning rate by a factor of 5 if 10 epochs have passed without improvement in the validation set, and stop training at the third time this happens. We use an 8×12 structure, with 2× alternate-pattern squeezing operations. For the coupling network, we split channels into two halves, and use one-dimensional convolutions with 512 filters and kernel widths 3, 1, and 3. Embeddings are of dimension 128. We train with a frame size of 4096 at 16 kHz with no overlap, and initialize the ActNorm weights with one data-augmented batch (batches contain a random mixture of frames from all speakers). 
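The four augmentation strategies of Section 4.6 can be sketched as follows (a minimal pure-Python version on plain sample lists; the clamping of the jittered start index so the frame stays inside the signal is our own addition):

```python
import random

def emphasis(x, alpha):
    # Pre-/de-emphasis filter: y[t] = x[t] - alpha * x[t-1].
    return [x[0]] + [x[t] - alpha * x[t - 1] for t in range(1, len(x))]

def augment_frame(signal, start, size):
    # (1) Temporal jitter: shift the frame start by up to half a frame
    # (clamped here so the frame stays inside the signal).
    xi = size // 2
    j = max(0, min(len(signal) - size, start + round(random.uniform(-xi, xi))))
    x = signal[j:j + size]
    # (2) Random pre-/de-emphasis with coefficient in [-0.25, 0.25].
    x = emphasis(x, random.uniform(-0.25, 0.25))
    # (3) Random amplitude scaling, keeping the frame within [-1, 1].
    peak = max(abs(v) for v in x) or 1.0
    gain = random.uniform(0.0, 1.0)
    x = [gain * v / peak for v in x]
    # (4) Random sign flip: a perceptually equivalent input.
    if random.random() < 0.5:
        x = [-v for v in x]
    return x
```

None of the four operations changes who is speaking, which is what makes them safe label-preserving augmentations for a speaker-conditioned model.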
We synthesize with a Hann window and 50% overlap, normalizing the entire utterance between −1 and 1. We implement Blow using PyTorch [48].
Coupling — As done in the official Glow code (but not mentioned in the paper), we find that constraining the scaling factor that comes out of the coupling network improves the stability of training. For affine couplings with channel-wise concatenation

    H′ = [ H_{1:c} , s′(H_{1:c}) (H_{c+1:2c} + t(H_{1:c})) ],

where 2c is the total number of channels, we use

    s′(H_{1:c}) = σ(s(H_{1:c}) + 2) + ε,

where σ corresponds to the sigmoid function and ε is a small constant to prevent an infinite log-determinant (and division by 0 in the reverse pass).
Hyperconditioning — If we strictly follow Eqs. 2 and 3, the hyperconditioning operation can involve both a large GPU memory footprint (n different kernels per batch element) and time-consuming calculations (a double loop for every kernel and batch element). This can, in practice, make the operation impossible to perform for a very deep flow-based architecture like Blow. However, by restricting the dimensionality of kernels W_y^(i) such that every channel is convolved with its own set of kernels, we can achieve a minor GPU footprint and a tractable number of parameters per adaptation network. This corresponds to depthwise separable convolutions [49], and can be implemented with grouped convolution [50], available in most deep learning libraries.

5 Experimental setup

To study the performance of Blow we use the VCTK data set [51], which comprises 46 h of audio from 109 speakers. We downsample it to 16 kHz and randomly extract 10% of the sentences for validation and 10% for testing (we use a simple parsing script to ensure that the same sentence text does not get into different splits, see Appendix B). 
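The constrained affine coupling described above can be sketched as follows (toy per-element s and t functions of our own choosing; in Blow these are convolutional networks acting on the whole first half of the channels):

```python
import math

EPS = 1e-6  # small constant preventing an infinite log-determinant

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def coupling_forward(h, s, t):
    # H' = [ H_{1:c} , s'(H_{1:c}) * (H_{c+1:2c} + t(H_{1:c})) ],
    # with the constrained scale s'(.) = sigmoid(s(.) + 2) + EPS.
    c = len(h) // 2
    h1 = h[:c]
    scale = [sigmoid(s(v) + 2.0) + EPS for v in h1]
    h2 = [sc * (v + t(w)) for sc, v, w in zip(scale, h[c:], h1)]
    return h1 + h2

def coupling_inverse(hp, s, t):
    # The first half is untouched, so the same scale and translation can
    # be recomputed from it and the transformation undone exactly.
    c = len(hp) // 2
    h1 = hp[:c]
    scale = [sigmoid(s(v) + 2.0) + EPS for v in h1]
    h2 = [v / sc - t(w) for sc, v, w in zip(scale, hp[c:], h1)]
    return h1 + h2

s = lambda v: 0.5 * v    # toy "scale network"
t = lambda v: v - 1.0    # toy "translation network"
h = [0.5, -1.0, 2.0, 0.25]
roundtrip = coupling_inverse(coupling_forward(h, s, t), s, t)
```

The sigmoid keeps the scale in (EPS, 1 + EPS), which is the stability trick: the log-determinant, a sum of log-scales, can never blow up, and the reverse pass never divides by zero.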
With this amount of data, the training of Blow\ntakes 13 days using three GeForce RTX 2080-Ti GPUs1. Conversions are performed between all\npossible gender combinations, from test utterances to randomly-selected VCTK speakers.\nTo compare with existing approaches, we consider two \ufb02ow-based generative models and two\ncompetitive voice conversion systems. As \ufb02ow-based generative models we adapt Glow [16] to the\none-dimensional case and replicate a version of Glow with a WaveNet coupling network following [6,\n7] (Glow-WaveNet). Conversion is done both via manipulation of the z space and by learning an\nidentity conditioner (Appendix B). These models use the same frame size and have the same number\nof \ufb02ow steps as Blow, with a comparable number of parameters. As voice conversion systems we\nimplement a VQ-VAE architecture with a WaveNet decoder [33] and an adaptation of the StarGAN\narchitecture to voice conversion like StarGAN-VC [14]. VQ-VAE converts in the waveform domain,\nwhile StarGAN does it between mel-cepstrums. Both systems can be considered as very competitive\nfor the non-parallel voice conversion task. We do not use pre-training nor transfer learning in any of\nthe models.\nTo quantify performance, we carry out both objective and subjective evaluations. 
As objective metrics we consider the per-dimensionality log-likelihood of the flow-based models (L) and a spoofing measure reflecting the percentage of times a conversion is able to fool a speaker identification classifier (Spoofing). The classifier is an MFCC-based single-layer classifier trained with the same split as the conversion systems (Appendix B).

Table 1: Objective scores and their relative difference for possible Blow alternatives (5 min per speaker, 100 epochs).

Configuration                                        L [nat/dim]     Spoofing [%]
Blow                                                 4.30            66.2
1: with 3×32 structure                               4.01 (−6.7%)    17.2 (−74.0%)
2: with 3×32 structure (squeeze of 8)                4.21 (−2.1%)    65.7 (−0.8%)
3: with multi-scale structure                        3.64 (−15.3%)   3.5 (−94.7%)
4: with multi-scale structure (5×19, squeeze of 4)   3.99 (−7.2%)    16.6 (−74.9%)
5: with additive conditioning (coupling network)     4.28 (−0.5%)    39.5 (−40.3%)
6: with additive conditioning (before ActNorm)       4.28 (−0.5%)    22.5 (−66.0%)
7: without data augmentation                         4.15 (−3.5%)    28.3 (−57.2%)

Table 2: Objective and subjective voice conversion scores. For all measures, higher is better. The first two reference rows correspond to using original recordings from source or target speakers as target.

Approach          L [nat/dim]   Spoofing [%]   Naturalness [1-5]   Similarity [%]
Source as target  n/a           1.1            4.83                10.6
Target as target  n/a           99.3           4.83                98.5
Glow              4.11          1.2            n/a                 n/a
Glow-WaveNet      4.18          3.1            n/a                 n/a
StarGAN           n/a           44.4           2.87                61.8
VQ-VAE            n/a           65.0           2.42                69.7
Blow              4.45          89.3           2.83                77.6

¹Nonetheless, conversion plus synthesis with 1 GPU and 50% overlap is around 14× faster than real time.

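The Spoofing measure itself reduces to a simple count over conversions; a minimal sketch (generic speaker labels, not the actual MFCC classifier):

```python
def spoofing_rate(classifier_ids, target_ids):
    # Percentage of converted utterances that the speaker-identification
    # classifier assigns to the intended target speaker.
    hits = sum(p == t for p, t in zip(classifier_ids, target_ids))
    return 100.0 * hits / len(target_ids)
```

A perfect conversion system would score 100%, while feeding the classifier unconverted source audio should score near 0% (the "source as target" reference row).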
For the subjective evaluation we follow Wester et al.\n[52] and consider the naturalness of the speech (Naturalness) and the similarity of the converted\nspeech to the target identity (Similarity). Naturalness is based on a mean opinion score from 1 to 5,\nwhile Similarity is an aggregate percentage from a binary rating. A total of 33 people participated in\nthe subjective evaluation. Further details on our experimental setup are given in Appendix B.\n\n6 Results\n\n6.1 Ablation study\n\nFirst of all, we assess the effect of the introduced changes with objective scores L and Spoo\ufb01ng. Due\nto computational constraints, in this set of experiments we limit training to 5 min of audio per speaker\nand 100 epochs. The results are in Table 1. In general, we see that all introduced improvements\nare important, as removing any of them always implies worse scores. Nonetheless, some are more\ncritical than others. The most critical one is the use of a single-scale structure. The two alternatives\nwith a multi-scale structure (3\u20134) yield the worst likelihoods and spoo\ufb01ngs, to the point that (3) does\nnot even perform any conversion. Using an 8\u00d712 structure instead of the original 3\u00d732 structure\nof Glow can also have a large effect (1). However, if we further tune the squeezing factor we can\nmitigate it (2). Substituting the hyperconditioning module by a regular convolution plus a learnable\nadditive embedding has a marginal effect on L, but a crucial effect on Spoo\ufb01ng (5\u20136). Finally, the\nproposed data augmentation strategies also prove to be important, at least with 5 min per speaker (7).\n\n6.2 Voice conversion\n\nIn Table 2 we show the results for both objective and subjective scores. The two objective scores, L\nand Spoo\ufb01ng, indicate that Blow outperforms the other considered approaches. 
It achieves a relative L increment of 6% over Glow-WaveNet and a relative Spoofing increment of 37% over VQ-VAE. Another thing to note is that the adapted Glow-based models, although achieving a reasonable likelihood, are not able to perform conversion, as their Spoofing is very close to that of the "source as target" reference. Because of that, we discarded those in the subjective evaluation.

Figure 2: Objective scores with respect to amount of training (A-B) and target/source speaker (C-D).

The subjective evaluation confirms the good performance of Blow. In terms of Naturalness, StarGAN outperforms Blow, albeit by only a 1% relative difference, without statistical significance (ANOVA, p = 0.76). However, both approaches are significantly below the reference audios (p < 0.05). In terms of similarity to the target, Blow outperforms both StarGAN and VQ-VAE by a relative 25 and 11%, respectively. Statistical significance is observed between Blow and StarGAN (Barnard's test, p = 0.02) but not between Blow and VQ-VAE (p = 0.13). Further analysis of the obtained subjective scores can be found in Appendix C. To put Blow's results into further perspective, we can have a look at the non-parallel task of the last voice conversion challenge [10], where systems that do not perform transfer learning or pre-training achieve Naturalness scores slightly below 3.0 and Similarity scores equal to or lower than 75%. Example conversions can be listened to at https://blowconversions.github.io.

6.3 Amount of training data and source/target preference

To conclude, we study the behavior of the objective scores when decreasing the amount of training audio (including the inherent silence in the data set, which we estimate is around 40%). We observe that, at 100 epochs, training with 18 h yields almost the same likelihood (Fig. 2A) and spoofing (Fig. 2B) as training with the full set of 37 h. 
In addition, we do not observe any clear relationship between Spoofing and per-speaker training duration (Appendix C). What we do observe, however, is a tendency with regard to source and target identities. If we average spoofing scores for a given target identity, we obtain both almost-perfect scores close to 100% and some scores below 50% (Fig. 2C). In contrast, if we average spoofing scores for a given source identity, those are almost always above 70% and below 100% (Fig. 2D). This indicates that the target identity is critical for the conversion to succeed, largely independently of the source. We hypothesize that this is due to the way normalizing flows are trained (maximizing likelihood only for single inputs and identifiers, never performing an actual conversion to a target speaker), but leave the analysis of this phenomenon for future work.

7 Conclusion

In this work we put forward the potential of flow-based generative models for raw audio synthesis, and especially for the challenging task of non-parallel voice conversion. We propose Blow, a single-scale hyperconditioned flow that features a many-block structure with shared embeddings and performs conversion in a forward-backward manner. Because Blow departs from existing flow-based generative models in these aspects, it is able to outperform them and to compete with, or even improve upon, existing non-parallel voice conversion systems. We also quantify the impact of the proposed improvements and assess the effect that the amount of training data and the selection of source/target speakers can have on the final result.
As future work, we want to improve the model so that it can deal with other tasks, such as speech enhancement or instrument conversion, perhaps by further enhancing the hyperconditioning mechanism or, more simply, by tuning its structure or hyperparameters.

Acknowledgments

We are grateful to all participants of the subjective evaluation for their input and feedback. We thank Antonio Bonafonte, Ferran Diego, and Martin Pielot for helpful comments. SP acknowledges partial support from the project TEC2015-69266-P (MINECO/FEDER, UE).

References

[1] A. Van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: a generative model for raw audio. ArXiv, 1609.03499, 2016.

[2] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: an unconditional end-to-end neural audio generation model. In Proc. of the Int. Conf. on Learning Representations (ICLR), 2017.

[3] N. Kalchbrenner, E. Elsen, K. Simonyan, N. Casagrande, E. Lockhart, F. Stimberg, A. Van den Oord, S. Dieleman, and K. Kavukcuoglu. Efficient neural audio synthesis. In Proc. of the Int. Conf. on Machine Learning (ICML), pages 2410–2419, 2018.

[4] S. Pascual, A. Bonafonte, and J. Serrà. SEGAN: speech enhancement generative adversarial network. In Proc. of the Int. Speech Communication Association Conf. (INTERSPEECH), pages 3642–3646, 2017.

[5] C. Donahue, J. McAuley, and M. Puckette. Adversarial audio synthesis. In Proc. of the Int. Conf. on Learning Representations (ICLR), 2019.

[6] R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: a flow-based generative network for speech synthesis. In Proc. of the IEEE Int. Conf.
on Acoustics, Speech and Signal Processing (ICASSP), pages 3617–3621, 2019.

[7] S. Kim, S.-G. Lee, J. Song, and S. Yoon. FloWaveNet: a generative flow for raw audio. In Proc. of the Int. Conf. on Machine Learning (ICML), pages 3370–3378, 2019.

[8] S. Dieleman, A. Van den Oord, and K. Simonyan. The challenge of realistic music generation: modeling raw audio at scale. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, pages 7989–7999. Curran Associates, Inc., 2018.

[9] S. H. Mohammadi and A. Kain. An overview of voice conversion systems. Speech Communication, 88:65–82, 2017.

[10] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling. The voice conversion challenge 2018: promoting development of parallel and nonparallel methods. In Proc. of Odyssey, The Speaker and Language Recognition Workshop, pages 195–202, 2018.

[11] Y. Saito, Y. Ijima, K. Nishida, and S. Takamichi. Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors. In Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 5274–5278, 2018.

[12] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo. ACVAE-VC: non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder. ArXiv, 1808.05092, 2018.

[13] T. Kaneko and H. Kameoka. CycleGAN-VC: non-parallel voice conversion using cycle-consistent adversarial networks. In Proc. of the European Signal Processing Conf. (EUSIPCO), pages 2114–2118, 2018.

[14] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo. StarGAN-VC: non-parallel many-to-many voice conversion with star generative adversarial networks. In Proc. of the IEEE Spoken Language Technology Workshop (SLT), pages 266–273, 2018.

[15] C. C. Hsu, H. T. Hwang, Y. C. Wu, Y. Tsao, and H. M. Wang.
Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks. In Proc. of the Int. Speech Communication Association Conf. (INTERSPEECH), pages 3364–3368, 2017.

[16] D. P. Kingma and P. Dhariwal. Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, pages 10215–10224. Curran Associates, Inc., 2018.

[17] D. Ha, A. Dai, and Q. V. Le. HyperNetworks. In Proc. of the Int. Conf. on Learning Representations (ICLR), 2017.

[18] M. Yamaguchi, Y. Koizumi, and N. Harada. AdaFlow: domain-adaptive density estimator with application to anomaly detection and unpaired cross-domain translation. In Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 3647–3651, 2019.

[19] Y. Li, N. Wang, J. Shi, H. Hou, and J. Liu. Adaptive batch normalization for practical domain adaptation. Pattern Recognition, 80:109–117, 2018.

[20] H. B. Barlow. Unsupervised learning. Neural Computation, 1:295–311, 1989.

[21] A. N. Redlich. Supervised factorial learning. Neural Computation, 5:750–766, 1993.

[22] G. Deco and W. Brauer. Higher order statistical decorrelation without information loss. In Advances in Neural Information Processing Systems (NeurIPS), volume 7, pages 247–254. MIT Press, 1995.

[23] L. Dinh, D. Krueger, and Y. Bengio. NICE: non-linear independent components estimation. In Proc. of the Int. Conf. on Learning Representations (ICLR), 2015.

[24] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. of the Int. Conf. on Machine Learning (ICML), pages 1530–1538, 2015.

[25] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. In Proc. of the Int. Conf. on Learning Representations (ICLR), 2017.

[26] E. G. Tabak and C. V. Turner.
A family of non-parametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013.

[27] A. Mouchtaris, J. Van der Spiegel, and P. Mueller. Non-parallel training for voice conversion based on a parameter adaptation approach. IEEE Trans. on Audio, Speech and Language Processing, 14(3):952–963, 2006.

[28] D. Erro, A. Moreno, and A. Bonafonte. INCA algorithm for training voice conversion systems from nonparallel corpora. IEEE Trans. on Audio, Speech and Language Processing, 18(5):944–953, 2010.

[29] Z. Wu, T. Kinnunen, E. S. Chng, and H. Li. Mixture of factor analyzers using priors from non-parallel speech for voice conversion. IEEE Signal Processing Letters, 19(12):914–917, 2012.

[30] T. Kinnunen, L. Juvela, P. Alku, and J. Yamagishi. Non-parallel voice conversion using i-vector PLDA: towards unifying speaker verification and transformation. In Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 5535–5539, 2017.

[31] F.-L. Xie, F. K. Soong, and H. Li. A KL divergence and DNN-based approach to voice conversion without parallel training sentences. In Proc. of the Int. Speech Communication Association Conf. (INTERSPEECH), pages 287–291, 2016.

[32] S. O. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou. Neural voice cloning with a few samples. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, pages 10019–10029. Curran Associates, Inc., 2018.

[33] A. Van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, pages 6306–6315. Curran Associates, Inc., 2017.

[34] J. Engel, C. Resnick, A. Roberts, S. Dieleman, D. Eck, K. Simonyan, and M. Norouzi. Neural audio synthesis of musical notes with WaveNet autoencoders. In Proc. of the Int. Conf.
on Machine Learning (ICML), pages 1068–1077, 2017.

[35] N. Mor, L. Wolf, A. Polyak, and Y. Taigman. A universal music translation network. In Proc. of the Int. Conf. on Learning Representations (ICLR), 2019.

[36] E. Nachmani and L. Wolf. Unsupervised singing voice conversion. ArXiv, 1904.06590, 2019.

[37] A. Haque, M. Guo, and P. Verma. Conditional end-to-end audio transforms. In Proc. of the Int. Speech Communication Association Conf. (INTERSPEECH), pages 2295–2299, 2018.

[38] W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. FFJORD: free-form continuous dynamics for scalable reversible generative models. In Proc. of the Int. Conf. on Learning Representations (ICLR), 2019.

[39] J. Ho, X. Chen, A. Srinivas, R. Duan, and P. Abbeel. Flow++: improving flow-based generative models with variational dequantization and architecture design. In Proc. of the Int. Conf. on Machine Learning (ICML), pages 2722–2730, 2019.

[40] L. Dinh, J. Sohl-Dickstein, R. Pascanu, and H. Larochelle. A RAD approach to deep mixture models. ArXiv, 1903.07714, 2019.

[41] M. Livne and D. J. Fleet. TzK flow: conditional generative model. ArXiv, 1811.01837, 2018.

[42] S. J. Hwang and W. H. Kim. Conditional recurrent flow: conditional generation of longitudinal samples with applications to neuroimaging. ArXiv, 1811.09897, 2018.

[43] E. Hoogeboom, R. Van den Berg, and M. Welling. Emerging convolutions for generative normalizing flows. In Proc. of the Int. Conf. on Machine Learning (ICML), pages 2771–2780, 2019.

[44] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. of the Int. Conf. on Learning Representations (ICLR), 2015.

[45] B. Ziolko and M. Ziolko. Time durations of phonemes in Polish language for speech and speaker recognition. In Z.
Vetulani, editor, Human language technology - Challenges for computer science and linguistics, volume 6562 of Lecture Notes in Computer Science. Springer, Berlin, Germany, 2011.

[46] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proc. of the Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.

[47] P. Boersma and D. Weenink. Praat: doing phonetics by computer, 2019. URL http://www.praat.org/.

[48] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NeurIPS Workshop on The Future of Gradient-based Machine Learning Software & Techniques (NeurIPS-Autodiff), 2017.

[49] L. Kaiser, A. N. Gomez, and F. Chollet. Depthwise separable convolutions for neural machine translation. In Proc. of the Int. Conf. on Learning Representations (ICLR), 2018.

[50] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 25, pages 1097–1105. Curran Associates, Inc., 2012.

[51] C. Veaux, J. Yamagishi, and K. MacDonald. CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit, 2012. URL http://dx.doi.org/10.7488/ds/1994.

[52] M. Wester, Z. Wu, and J. Yamagishi. Analysis of the voice conversion challenge 2016 evaluation results. In Proc. of the Int. Speech Communication Association Conf. (INTERSPEECH), pages 1637–1641, 2016.

[53] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In Proc. of the Conf. on Computer Vision and Pattern Recognition (CVPR), pages 8789–8797, 2018.

[54] M. Morise, F. Yokomori, and K. Ozawa. WORLD: a vocoder-based high-quality speech synthesis system for real-time applications.
IEICE Transactions on Information and Systems, 99(7):1877–1884, 2016.

[55] S. O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, S. Sengupta, and M. Shoeybi. Deep voice: real-time neural text-to-speech. In Proc. of the Int. Conf. on Machine Learning (ICML), pages 195–204, 2017.