{"title": "Blind channel identification for speech dereverberation using l1-norm sparse learning", "book": "Advances in Neural Information Processing Systems", "page_first": 921, "page_last": 928, "abstract": null, "full_text": "Blind channel identification for speech dereverberation using l1-norm sparse learning\n\nYuanqing Lin\u2020, Jingdong Chen\u2021, Youngmoo Kim\u266f, Daniel D. Lee\u2020\n\n\u2020GRASP Laboratory, Department of Electrical and Systems Engineering, University of Pennsylvania\n\n\u266f Department of Electrical and Computer Engineering, Drexel University\n\n\u2021Bell Laboratories, Alcatel-Lucent\n\nAbstract\n\nSpeech dereverberation remains an open problem after more than three decades of research. The most challenging step in speech dereverberation is blind channel identification (BCI). Although many BCI approaches have been developed, their performance is still far from satisfactory for practical applications. The main difficulty in BCI lies in finding an appropriate acoustic model, which not only can effectively resolve solution degeneracies due to the lack of knowledge of the source, but also robustly models real acoustic environments. This paper proposes a sparse acoustic room impulse response (RIR) model for BCI, that is, an acoustic RIR can be modeled by a sparse FIR filter. Under this model, we show how to formulate the BCI of a single-input multiple-output (SIMO) system as an l1-norm regularized least squares (LS) problem, which is convex and can be solved efficiently with guaranteed global convergence. The sparseness of solutions is controlled by the l1-norm regularization parameters. We propose a sparse learning scheme that infers the optimal l1-norm regularization parameters directly from microphone observations under a Bayesian framework. 
Our results show that the\nproposed approach is effective and robust, and it yields source estimates in real\nacoustic environments with high \ufb01delity to anechoic chamber measurements.\n\n1 Introduction\n\nSpeech dereverberation, which may be viewed as a denoising technique, is crucial for many speech\nrelated applications, such as hands-free teleconferencing and automatic speech recognition. It is a\nchallenging signal processing task and remains an open problem after more than three decades of\nresearch. Although many approaches [1] have been developed for speech dereverberation, blind\nchannel identi\ufb01cation (BCI) is believed to be the key to thoroughly solving the dereverberation\nproblem. Most BCI approaches rely on source statistics (higher order statistics [2] or statistics\nof LPC coef\ufb01cients [3]), or spatial difference among multiple channels [4] for resolving solution\ndegeneracies due to the lack of knowledge of the source. The performance of these approaches\ndepends on how well they model real acoustic systems (mainly sources and channels). The BCI\napproaches using source statistics need a long sequence of data to build up the statistics, and their\nperformance often degrades signi\ufb01cantly in real acoustic environments where acoustic systems are\ntime-varying and only approximately time-invariant during a short time window. Besides the data\nef\ufb01ciency issue, there are some other dif\ufb01culties in the BCI approaches using source statistics, for\nexample, non-stationarity of a speech source, whitening side effect, and non-minimum phase of\na \ufb01lter [2]. In contrast, the BCI approaches exploiting channel spatial difference are blind to the\nsource, and thus they avoid those dif\ufb01culties arising in assuming source statistics. 
Unfortunately, these approaches are often too ill-conditioned to tolerate even a very small amount of ambient noise. In general, BCI for speech dereverberation is an active research area, and the main challenge is how to build an effective acoustic model that not only can resolve solution degeneracies due to the lack of knowledge of the source, but also robustly models real acoustic environments.\n\nTo address the challenge, this paper proposes a sparse acoustic room impulse response (RIR) model for BCI, that is, an acoustic RIR can be modeled by a sparse FIR filter. The sparse RIR model is theoretically sound [5], and it has been shown to be useful for estimating RIRs in real acoustic environments when the source is given a priori [6]. In this paper, the sparse RIR model is combined with channel spatial difference, resulting in a blind sparse channel identification (BSCI) approach for a single-input multiple-output (SIMO) acoustic system. The BSCI approach aims to resolve some of the difficulties in conventional BCI approaches. It is blind to the source and therefore avoids the difficulties arising in assuming source statistics. Meanwhile, the BSCI approach is expected to be robust to ambient noise. It has been shown that, when the source is given a priori [7], prior knowledge about sparse RIRs plays an important role in robustly estimating RIRs in noisy acoustic environments. Furthermore, the statistics describing the sparseness of RIRs are governed by acoustic room characteristics, and thus they are close to stationary with respect to a specific room. This is advantageous in terms of both learning the statistics and applying them in channel identification.\n\nBased on the cross relation formulation [4] of BCI, this paper develops a BSCI algorithm that incorporates the sparse RIR model. 
Our choice for enforcing sparsity is l1-norm regularization [8], which\nhas been the driving force for many emerging \ufb01elds in signal processing, such as sparse coding and\ncompressive sensing. In the context of BCI, two important issues need to be addressed when using\nl1-norm regularization. First, the existing cross relation formulation for BCI is nonconvex, and di-\nrectly enforcing l1-norm regularization will result in an intractable optimization. Second, l1-norm\nregularization parameters are critical for deriving correct solutions, and their improper setting may\nlead to totally irrelevant solutions. To address these two issues, this paper shows how to formulate\nthe BCI of a SIMO system into a convex optimization, indeed an unconstrained least squares (LS)\nproblem, which provides a \ufb02exible platform for incorporating l1-norm regularization; it also shows\nhow to infer the optimal l1-norm regularization parameters directly from microphone observations\nunder a Bayesian framework.\n\nWe evaluate the proposed BSCI approach using both simulations and experiments in real acoustic\nenvironments. Simulation results illustrate the effectiveness of the proposed sparse RIR model in\nresolving solution degeneracies, and they show that the BSCI approach is able to robustly and accu-\nrately identify \ufb01lters from noisy microphone observations. When applied to speech dereverberation\nin real acoustic environments, the BSCI approach yields source estimates with high \ufb01delity to ane-\nchoic chamber measurements. All of these demonstrate that the BSCI approach has the potential for\nsolving the dif\ufb01cult speech dereverberation problem.\n\n2 Blind sparse channel identi\ufb01cation (BSCI)\n\n2.1 Previous work\n\nOur BSCI approach is based on the cross relation formulation for blind SIMO channel identi\ufb01ca-\ntion [4]. 
In a one-speaker two-microphone system, the microphone signals at time k can be written as:\n\nx_i(k) = s(k) * h_i + n_i(k), i = 1, 2, (1)\n\nwhere * denotes linear convolution, s(k) is a source signal, h_i represents the channel impulse response between the source and the ith microphone, and n_i(k) is ambient noise. The cross relation formulation is based on a clever observation: x_2(k) * h_1 = x_1(k) * h_2 = s(k) * h_1 * h_2 if the microphone signals are noiseless [4]. Then, without requiring any knowledge of the source signal, the channel filters can be identified by minimizing the squared cross relation error. In matrix-vector form, the optimization can be written as\n\nh_1^*, h_2^* = argmin_{||h_1||^2 + ||h_2||^2 = 1} (1/2) ||X_2 h_1 - X_1 h_2||^2 (2)\n\nwhere X_i is the (N + L - 1) \u00d7 L convolution Toeplitz matrix whose first row and first column are [x_i(k - N + 1), 0, . . . , 0] and [x_i(k - N + 1), x_i(k - N + 2), . . . , x_i(k), 0, . . . , 0]^T, respectively, N is the microphone signal length, L is the filter length, h_i (i = 1, 2) are L \u00d7 1 vectors representing the filters, || \u00b7 || denotes the l2-norm, and the constraint is to avoid the trivial zero solution. It is easy to see that the above optimization is a minimum eigenvalue problem, and it can be solved by eigenvalue decomposition. As shown in [4], the eigenvalue decomposition approach finds the true solution within a constant time delay and a constant scalar factor when 1) the system is noiseless; 2) the two filters are co-prime (namely, have no common zeros); and 3) the system is sufficiently excited (i.e., the source needs to have enough frequency bands).\n\nUnfortunately, the eigenvalue decomposition approach has not been demonstrated to be useful for speech dereverberation in real acoustic environments. 
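The eigenvalue-decomposition solver of Eq. 2 can be sketched in a few lines of NumPy. This is an illustrative reimplementation (not the authors' code), and it assumes the exact filter length L is known:

```python
import numpy as np

def conv_matrix(x, L):
    """(N + L - 1) x L Toeplitz matrix X such that X @ h = x * h (full convolution)."""
    X = np.zeros((len(x) + L - 1, L))
    for j in range(L):
        X[j:j + len(x), j] = x
    return X

def cross_relation_eig(x1, x2, L):
    """Eq. 2: minimize ||X2 h1 - X1 h2||^2 s.t. ||h1||^2 + ||h2||^2 = 1,
    solved as a minimum-eigenvalue problem."""
    A = np.hstack([conv_matrix(x2, L), -conv_matrix(x1, L)])
    w, V = np.linalg.eigh(A.T @ A)   # eigenvalues in ascending order
    h = V[:, 0]                      # unit-norm eigenvector of the smallest eigenvalue
    return h[:L], h[L:]
```

With noiseless data, co-prime filters, and a sufficiently exciting source, the smallest eigenvalue is zero and its eigenvector recovers the stacked filters up to a scalar factor.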
This is because the conditions for finding true solutions are difficult to sustain. First, microphone signals in real acoustic environments are always immersed in excessive ambient noise (such as air-conditioning noise), and thus the noiseless assumption is never true. Second, the filters must be co-prime, which requires precise information about the filter order; however, the filter order itself is hard to determine accurately since the filters modeling RIRs are often thousands of taps long. As a result, the eigenvalue decomposition approach is often ill-conditioned and very sensitive to even a very small amount of ambient noise.\n\nOur proposed sparse RIR model aims to alleviate those difficulties. Under the sparse RIR model, sparsity regularization automatically determines the filter order since surplus filter coefficients are forced to be zero. Furthermore, previous work [7] has demonstrated that, when the source is given a priori, sparsity regularization plays an important role in robustly estimating RIRs in noisy acoustic environments. In order to exploit the sparse RIR model, we first reformulate the cross-relation BCI as a convex optimization, which provides a flexible platform for enforcing l1-norm sparsity regularization.\n\n2.2 Convex formulation\n\nThe optimization in Eq. 2 is nonconvex because its domain, ||h_1||^2 + ||h_2||^2 = 1, is nonconvex. We propose to replace it with a convex singleton linear constraint, and the optimization becomes\n\nh_1^*, h_2^* = argmin_{h_1(l)=1} (1/2) ||X_2 h_1 - X_1 h_2||^2 (3)\n\nwhere h_1(l) is the lth element of filter h_1. It is easy to see that, when the microphone signals are noiseless, the optimizations in Eqs. 2 and 3 yield equivalent solutions within a constant time delay and a constant scalar factor. 
Because the optimization is a minimization, h1(l) tends to align with\nthe largest coef\ufb01cient in \ufb01lter h1, which normally is the coef\ufb01cient corresponding to the direct path.\nConsequently, the singleton linear constraint removes two degrees of freedom in \ufb01lter estimates: a\nconstant time delay (by \ufb01xing l) and a constant scalar factor [by \ufb01xing h1(l) = 1]. The choice of l\n(0 \u2264 l \u2264 L \u2212 1) is arbitrary as long as the direct path in \ufb01lter h2 is no more than l samples earlier\nthan the one in \ufb01lter h1.\n\nThe new formulation in Eq. 3 has many advantages. It is convex and indeed an unconstrained LS\nproblem since the singleton linear constraint can be easily substituted into the objective function.\nFurthermore, the new LS formulation is more robust to ambient noise than the eigenvalue decompo-\nsition approach in Eq. 2. This can be better viewed in the frequency domain. Because the squared\ncross relation error (the objective function in Eqs. 2 and 3) is weighted in the frequency domain by\nthe power spectrum density of a common source, the total \ufb01lter energy constraint in Eq. 2 may be\n\ufb01lled with less signi\ufb01cant frequency bands which contribute little to the source and are weighted\nless in the objective function. As a result, the eigenvalue decomposition approach is very sensitive\nto noise. In contrast, the singleton linear constraint in Eq. 
3 has much less coupling in filter energy allocation, and the new LS approach is more robust to ambient noise.\n\nThe BSCI approach then incorporates the LS formulation with l1-norm sparsity regularization, and the optimization becomes\n\nh_1^*, h_2^* = argmin_{h_1(l)=1} (1/2) ||X_2 h_1 - X_1 h_2||^2 + \u03bb\u2032 \u03a3_{j=0}^{L-1} [|h_1(j)| + |h_2(j)|] (4)\n\nwhere \u03bb\u2032 is a nonnegative scalar regularization parameter that balances the preference between the squared cross relation error and the sparseness of the solutions as described by their l1-norm. The setting of \u03bb\u2032 is critical for deriving appropriate solutions, and we will show how to compute its optimal setting in a Bayesian framework in Section 2.3. Given \u03bb\u2032, the optimization in Eq. 4 is convex and can be solved by various methods with guaranteed global convergence. We implemented the Mehrotra predictor-corrector primal-dual interior point method [9], which is known to yield better search directions than Newton\u2019s method. Our implementation usually solves the optimization in Eq. 4 to extreme accuracy (relative duality gap less than 10^{-14}) in fewer than 20 iterations.\n\n2.3 Bayesian l1-norm sparse learning for blind channel identification\n\nThe l1-norm regularization parameter \u03bb\u2032 in Eq. 4 is critical for deriving appropriately sparse solutions. How to determine its optimal setting is still an open research topic. A recent development is to solve the optimization in Eq. 4 with respect to all possible values of \u03bb\u2032 [10], with cross-validation then employed to find an appropriate solution. However, it is not easy to obtain extra data for cross-validation in BCI since real acoustic environments are often time-varying. In this study, we develop a Bayesian framework for inferring the optimal regularization parameters for the BSCI formulation in Eq. 4. 
A similar Bayesian framework can be found in [7], where the source was assumed to be known a priori.\n\nThe optimization in Eq. 4 is a maximum-a-posteriori estimation under the following probabilistic assumptions:\n\nP(X_2 h_1 - X_1 h_2 | \u03c3^2, h_1, h_2) = 1/(2\u03c0\u03c3^2)^{(N+L-1)/2} exp{-(1/(2\u03c3^2)) ||X_2 h_1 - X_1 h_2||^2}, (5)\n\nP(h_1, h_2 | \u03bb) = (\u03bb/2)^{2L} exp{-\u03bb \u03a3_{j=0}^{L-1} [|h_1(j)| + |h_2(j)|]} (6)\n\nwhere the cross relation error is an I.I.D. zero-mean Gaussian with variance \u03c3^2, and the filter coefficients are governed by a Laplacian sparse prior with the scalar parameter \u03bb. Then, the regularization parameter \u03bb\u2032 in Eq. 4 can be written as\n\n\u03bb\u2032 = \u03c3^2 \u03bb. (7)\n\nWhen the ambient noise [n_1(k) and n_2(k) in Eq. 1] is an I.I.D. zero-mean Gaussian with variance \u03c3_0^2, the parameter \u03c3^2 can be approximately written as\n\n\u03c3^2 = \u03c3_0^2 (||h_1||^2 + ||h_2||^2), (8)\n\nbecause x_2(k) * h_1 - x_1(k) * h_2 = n_2(k) * h_1 - n_1(k) * h_2. The above form of \u03c3^2 is only an approximation because the cross relation error is temporally correlated through the convolution. Nevertheless, since the cross relation error is the result of the convolutive mixing, its distribution will be close to the Gaussian with its variance described by Eq. 8 according to the central limit theorem. We choose to estimate the ambient noise level (\u03c3_0^2) directly from microphone observations via restricted maximum likelihood [11]:\n\n\u03c3_0^2 = min_{s, h_1, h_2} (1/(N - L - 1)) \u03a3_{i=1}^{2} \u03a3_{k=0}^{N-1} ||x_i(k) - s(k) * h_i||^2 (9)\n\nwhere the denominator N - L - 1 (but not 2N) accounts for the loss of the degrees of freedom during the optimization. 
The above minimization is solved by coordinate descent, alternating between the source and the filters. It is initialized with the LS solution of Eq. 3 and is often able to yield a good \u03c3_0^2 estimate in a few iterations. Note that each iteration can be computed efficiently in the frequency domain. Meanwhile, the parameter \u03bb can be computed by\n\n\u03bb = 2L / \u03a3_{j=0}^{L-1} [|h_1(j)| + |h_2(j)|], (10)\n\nas a result of finding the optimal Laplacian distribution given its sufficient statistics.\n\nWith Eqs. 8 and 10, finding the optimal regularization parameters reduces to computing the statistics of the filters, ||h_1||^2 + ||h_2||^2 and \u03a3_{j=0}^{L-1} [|h_1(j)| + |h_2(j)|]. These statistics are closely related to acoustic room characteristics and may be computed from them if they are known a priori. For example, the reverberation time of a room defines how fast echoes decay by 60 dB, and it can be used to compute the filter statistics. More generally, we choose to compute the statistics directly from microphone observations in the Bayesian framework by maximizing the marginal likelihood, P(X_2 h_1 - X_1 h_2 | \u03c3^2, \u03bb) = \u222b_{h_1(l)=1} P(X_2 h_1 - X_1 h_2, h_1, h_2 | \u03c3^2, \u03bb) dh_1 dh_2. The optimization is through Expectation-Maximization (EM) updates [7]:\n\n\u03c3^2 \u2190 \u03c3_0^2 \u222b_{h_1(l)=1} (||h_1||^2 + ||h_2||^2) Q(h_1, h_2) dh_1 dh_2 (11)\n\n\u03bb \u2190 2L / \u222b_{h_1(l)=1} (\u03a3_{j=0}^{L-1} [|h_1(j)| + |h_2(j)|]) Q(h_1, h_2) dh_1 dh_2 (12)\n\nwhere h_1 and h_2 are treated as hidden variables, \u03c3^2 and \u03bb are parameters, and Q(h_1, h_2) \u221d exp{-(1/(2\u03c3^2)) ||X_2 h_1 - X_1 h_2||^2 - \u03bb \u03a3_{j=0}^{L-1} [|h_1(j)| + |h_2(j)|]} is the probability distribution of h_1 and h_2 given the current estimates of \u03c3^2 and \u03bb. The integrals in Eqs. 11 and 12 can be computed using the variational scheme described in [7]. 
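Given \u03c3^2 and \u03bb (hence \u03bb\u2032 = \u03c3^2 \u03bb), the inner optimization in Eq. 4 is a standard l1-regularized LS. A minimal sketch, substituting accelerated proximal gradient (FISTA) for the interior-point solver used in the paper, and assuming l = 0 (i.e., the direct path of h_1 at index 0):

```python
import numpy as np

def conv_matrix(x, L):
    """(N + L - 1) x L Toeplitz matrix X such that X @ h = x * h (full convolution)."""
    X = np.zeros((len(x) + L - 1, L))
    for j in range(L):
        X[j:j + len(x), j] = x
    return X

def soft(v, t):
    """Soft thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def bsci_l1(x1, x2, L, lam, n_iter=3000):
    """Eq. 4 with the singleton constraint h1(0) = 1 substituted into the
    objective; lam plays the role of lambda'. Solved by FISTA for illustration
    only -- the paper uses a primal-dual interior point method."""
    X1, X2 = conv_matrix(x1, L), conv_matrix(x2, L)
    M = np.hstack([X2[:, 1:], -X1])          # free variables z = [h1(1:); h2]
    c = X2[:, 0]                             # fixed contribution of h1(0) = 1
    step = 1.0 / np.linalg.norm(M, 2) ** 2   # 1 / Lipschitz constant of the gradient
    z = y = np.zeros(2 * L - 1)
    t = 1.0
    for _ in range(n_iter):
        z_new = soft(y - step * (M.T @ (M @ y + c)), step * lam)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = z_new + ((t - 1.0) / t_new) * (z_new - z)
        z, t = z_new, t_new
    return np.concatenate([[1.0], z[:L - 1]]), z[L - 1:]
```

The substitution of h_1(0) = 1 turns the constrained problem into an unconstrained one over the remaining 2L - 1 coefficients, so any generic l1-regularized LS solver applies.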
The EM updates often converge to a good estimate of \u03c3^2 and \u03bb in a few iterations. Moreover, since the filter statistics are relatively stationary for a specified room, the Bayesian inference may be carried out off-line and only once if the room conditions stay the same.\n\nAfter the filters are identified by BCI approaches, the source can be computed by various methods [12]. We choose to estimate the source by the following optimization:\n\ns^* = argmin_s \u03a3_{i=1}^{2} \u03a3_{k=0}^{N-1} ||x_i(k) - s(k) * h_i||^2, (13)\n\nwhich yields the maximum-likelihood (ML) estimate if the filter estimates are accurate.\n\n3 Simulations and Experiments\n\n3.1 Simulations\n\n3.1.1 Simulations with artificial RIRs\n\nWe first employ a simulated example to illustrate the effectiveness of the proposed sparse RIR model for BCI. In the simulation, we used a speech sequence of 1024 samples (with 16 kHz sampling rate) as the source (s) and simulated two 16-sample FIR filters (h_1 and h_2). The filter h_1 had nonzero elements only at indices 0, 2, and 12 with amplitudes of 1, -0.7, and 0.5, respectively; the filter h_2 had nonzero elements only at indices 2, 6, 8, and 10 with amplitudes of 1, -0.6, 0.6, and 0.4, respectively. Notice that both h_1 and h_2 are sparse. Then the simulated microphone observations (x_1 and x_2) were computed by Eq. 1 with the ambient noise being real noise recorded in a classroom. The noise was scaled so that the signal-to-noise ratio (SNR) of the microphone signals was approximately 20 dB. Because a large portion of the noise (mainly air-conditioning noise) was at low frequency, the microphone observations were high-passed with a cut-off frequency of 100 Hz before they were fed to the BCI algorithms. In the BSCI algorithm, the l1-norm regularization parameters, \u03c3^2 and \u03bb, were estimated in the Bayesian framework using the update rules given in Eqs. 
11 and 12.\n\nFigure 1 shows the filters identified by the different BCI approaches. Compared to the conventional eigenvalue decomposition method (Eq. 2), the new convex LS approach (Eq. 3) is more robust to ambient noise and yielded better filter estimates, even though the estimates still appear to be convolved with a common filter. The proposed BSCI approach (Eq. 4) yielded filter estimates that are almost identical to the true ones. It is evident that the proposed sparse RIR model played a crucial role in robustly and accurately identifying the filters in a blind manner. The robustness and accuracy gained by the BSCI approach become essential when the filters are thousands of taps long in real acoustic environments.\n\n3.1.2 Simulations with measured RIRs\n\nHere we employ simulations using RIRs measured in real rooms to demonstrate the effectiveness of the proposed BSCI approach for speech dereverberation. Its performance is compared to the beamforming, the eigenvalue decomposition (Eq. 2), and the LS (Eq. 3) approaches. In the simulation, the source sequence (s) was a sentence of speech (approximately 1.5 seconds), and the filters (h_1 and h_2) were two measured RIRs from the York MARDY database (http://www.commsp.ee.ic.ac.uk/sap/mardy.htm), down-sampled to 16 kHz (from the original 48 kHz). The original filters in the database were not sparse, but they had many tiny coefficients which were in the range of measurement uncertainty. To make the simulated filters sparse, we simply zeroed out those coefficients whose amplitudes were less than 2% of the maximum. Finally, we truncated the filters to a length of 2048 since there were very few nonzero coefficients after that. With the simulated source and filters, we then computed microphone observations using Eq. 1 with ambient noise being real noise recorded in a classroom. 
For testing the robustness of the different BCI algorithms, the ambient noise was scaled to different levels so that the SNRs varied from 60 dB to 10 dB. Similar to the previous simulations, the simulated observations were high-passed with a cutoff\n\n[Figure 1 plot: panels h_1 and h_2; x-axis: Time (sample); legend: Estimated, True.]\n\nFigure 1: Identified filters by three different BCI approaches in a simulated example: the eigenvalue decomposition approach (denoted as eig-decomp) in Eq. 2, the LS approach in Eq. 3, and the blind sparse channel identification (BSCI) approach in Eq. 4. The solid-dot lines represent the estimated filters, and the dot-square lines indicate the true filters within a constant time delay and a constant scalar factor.\n\n[Figure 2 plot: panels Filter estimates and Source estimates; y-axis: Normalized correlation (%); x-axis: Noise level (dB); legend: eigen-decomp, LS, BSCI, beamforming.]\n\nFigure 2: The simulation results using measured real RIRs. The normalized correlation (defined in Eq. 14) of the estimates was computed with respect to their true values. The filters were identified by three different approaches: the eigenvalue decomposition approach (denoted as eigen-decomp) in Eq. 
2, the LS approach in Eq. 3, and the blind sparse channel identification (BSCI) approach in Eq. 4. After the filters were identified, the source was estimated by Eq. 13. The source estimated by beamforming is also presented as a baseline reference.\n\nfrequency of 100 Hz before they were fed to the different BCI algorithms. In the BSCI approach, the l1-norm regularization parameters were iteratively computed using the updates in Eqs. 11 and 12. After the filters were identified, the source was estimated using Eq. 13.\n\nBecause both the filter and source estimates by BCI algorithms are within a constant time delay and a constant scalar factor, we use normalized correlation for evaluating the estimates. Let \u02c6s and s_0 denote an estimated source and the true source, respectively; then the normalized correlation C(\u02c6s, s_0) is defined as\n\nC(\u02c6s, s_0) = max_m \u03a3_k \u02c6s(k - m) s_0(k) / (||\u02c6s|| ||s_0||) (14)\n\nwhere m and k are sample indices, and || \u00b7 || denotes the l2-norm. It is easy to see that the normalized correlation is between 0% and 100%: it is equal to 0% when the two signals are uncorrelated, and it is equal to 100% only when the two signals are identical within a constant time delay and a constant scalar factor. The definition in Eq. 14 is also applicable to the evaluation of filter estimates.\n\nThe simulation results are shown in Fig. 2. Similar to what we observed in the previous example, the convex LS approach (Eq. 3) shows significant improvement in both filter and source estimation compared to the eigenvalue decomposition approach (Eq. 2). 
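The metric in Eq. 14 amounts to a peak of the cross-correlation sequence. A minimal sketch (the function name is ours; multiply by 100 for the percentages reported in the figures):

```python
import numpy as np

def normalized_correlation(est, ref):
    """Eq. 14: peak cross-correlation normalized by the signal norms, in [0, 1].
    Taking the absolute value additionally makes the metric invariant to a
    sign flip of the constant scalar factor."""
    xcorr = np.correlate(est, ref, mode="full")  # inner products at every shift m
    return np.max(np.abs(xcorr)) / (np.linalg.norm(est) * np.linalg.norm(ref))
```

By the Cauchy-Schwarz inequality the value reaches 1 exactly when one signal is a delayed, scaled copy of the other.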
In fact, the eigenvalue decomposition\n\n[Figure 3 plot: y-axis: Normalized correlation (%); x-axis: Experiments 1-10; legend: Beamforming, Eig-decomp, LS, BSCI.]\n\nFigure 3: The source estimates of 10 experiments in real acoustic environments. The normalized correlation was with respect to their anechoic chamber measurement. The filters were identified by three different BCI approaches: the eigenvalue decomposition approach (denoted as eig-decomp) in Eq. 2, the LS approach in Eq. 3, and the blind sparse channel identification (BSCI) approach in Eq. 4. The beamforming results serve as the baseline performance for comparison.\n\n[Figure 4 plot: left: filters h_1 and h_2, Amplitude vs. Time (samples); right: panels A (anechoic chamber measurement), B (real room recording, left microphone), C (source estimate using the filters identified by BSCI).]\n\nFigure 4: Results of Experiment 6 in Fig. 3. Left: the filters estimated by the proposed blind sparse channel identification (BSCI) approach. They are sparse, as indicated by the enlarged segments. Right: a segment of the source estimate (shown in C) using the BSCI approach. 
It is compared with its anechoic measurement (shown in A) and its microphone recording (shown in B).\n\napproach did not yield relevant results because it was too ill-conditioned due to the long filters. The remarkable performance came from the BSCI approach, which incorporates the convex LS formulation with the sparse RIR model. In particular, the BSCI approach yielded higher than 90% normalized correlation in source estimates when the SNR was better than 20 dB, and higher than 99% normalized correlation in the low-noise limit. The performance of the canonical delay-and-sum beamforming is also presented as the baseline for all BCI algorithms.\n\n3.2 Experiments\n\nWe also evaluated the proposed BSCI approach using signals recorded in real acoustic environments. We carried out 10 experiments in total in a reverberant room. In each experiment, a sentence of speech (approximately 1.5 seconds, and the same for all experiments) was played through a loudspeaker (NSW2-326-8A, Aura Sound) and recorded by a matched omnidirectional microphone pair (M30MP, Earthworks). The speaker-microphone positions (and thus the RIRs) were different in different experiments. Because the recordings had a large amount of low-frequency noise, they were high-passed with a cutoff frequency of 100 Hz before they were fed to the BCI algorithms. In the BSCI approach, the l1-norm regularization parameters, \u03c3^2 and \u03bb, were iteratively computed using the updates in Eqs. 11 and 12. After the filters were identified, the sources were computed using Eq. 13. We also made recordings in the anechoic chamber at Bell Labs using the same instruments and settings, and the anechoic measurement served as the approximate ground truth for evaluating the performance of the different BCI approaches.\n\nFigure 3 shows the source estimates in the 10 experiments in terms of their normalized correlation to the anechoic measurement. 
The performance of the proposed BSCI is compared with the beamforming, the eigenvalue decomposition (Eq. 2), and the convex LS (Eq. 3) approaches. The results of the 10 experiments unanimously support our previous findings in simulations. First, the convex LS approach yielded significantly better source estimates than the eigenvalue decomposition method. Second, the proposed BSCI approach, which incorporates the convex LS formulation with the sparse RIR model, yielded the most dramatic results, achieving 85% or higher normalized correlation in source estimates in most experiments, while the LS approach obtained only approximately 70%.\n\nFigure 4 shows one instance of the filter and source estimates. The estimated filters have about 2000 zeros out of 3072 coefficients in total, and thus they are sparse. This observation experimentally validates our hypothesis of the sparse RIR model, namely, that an acoustic RIR can be modeled by a sparse FIR filter. The source estimate shown in Fig. 4 vividly illustrates the convolution and dereverberation process. It plots only a small segment to reveal greater detail. As we see, the anechoic measurement was clean and had a clear harmonic structure; the signal recorded in the reverberant room was smeared by echoes during the convolution process; and the dereverberation using our BSCI approach deblurred the signal and recovered the underlying harmonic structure.\n\n4 Discussion\n\nWe propose a blind sparse channel identification (BSCI) approach for speech dereverberation. It consists of three important components. The first is the sparse RIR model, which effectively resolves solution degeneracies and robustly models real acoustic environments. The second is the convex formulation, which guarantees global convergence of the proposed BSCI algorithm. 
And the third\nis the Bayesian l1-norm sparse learning scheme that infers the optimal regularization parameters\nfor deriving optimally sparse solutions. The results demonstrate that the proposed BSCI approach\nholds the potential to solve the speech dereverberation problem in real acoustic environments, which\nhas been recognized as a very dif\ufb01cult problem in signal processing. The acoustic data used in this\npaper are available at http://www.seas.upenn.edu/\u223clinyuanq/Research.html.\n\nOur future work includes side-by-side comparison between our BSCI approach and existing source\nstatistics based BCI approaches. Our goal is to build a uniform framework that combines various\nprior knowledge about acoustic systems for best solving the speech dereverberation problem.\n\nReferences\n\n[1] T. Nakatani, M. Miyoshi, and K. Kinoshita, \u201cOne microphone blind dereverberation based on quasi-\n\nperiodicity of speech signals,\u201d in NIPS 16. 2004.\n\n[2] A. Hyvarinen, J. Karhunen, and E. Oja, Independent Component Analysis, New York, NY: John Wiley\n\nand Sons, 2001.\n\n[3] H. Attias, J. C. Platt, A. Acero, and L. Deng, \u201cSpeech denoising and dereverberation using probabilistic\n\nmodels,\u201d in NIPS 13, 2000.\n\n[4] L. Tong, G. Xu, and T. Kailath, \u201cBlind identi\ufb01cation and equalization based on second-order statistics: A\n\ntime domain approach,\u201d IEEE Trans. Information Theory, vol. 40, no. 2, pp. 340\u2013349, 1994.\n\n[5] J. B. Allen and D. A. Berkley, \u201cImage method for ef\ufb01ciently simulating small-room acoustics,\u201d J.\n\nAcoustical Society America, vol. 65, pp. 943\u2013950, 1979.\n\n[6] D. L. Duttweiler, \u201cProportionate normalized least-mean-squares adaptation in echo cancelers,\u201d IEEE\n\nTrans. Speech Audio Processing, vol. 8, pp. 508\u2013518, 2000.\n\n[7] Y. Lin and D. D. Lee, \u201cBayesian L1-norm sparse learning,\u201d in Proc. ICASSP, 2006.\n\n[8] S. S. Chen, D. L. Donoho, and M. A. 
Saunders, \u201cAtomic decomposition by basis pursuit,\u201d SIAM J.\n\nScienti\ufb01c Computing, vol. 20, no. 1, pp. 33\u201361, 1998.\n\n[9] S. J. Wright, Primal-Dual Interior Point Methods, Philadelphia, PA: SIAM, 1997.\n\n[10] D. M. Malioutov, M. Cetin, and A. S. Willsky, \u201cHomotopy continuation for sparse signal representation,\u201d\n\nin Proc. ICASSP, 2005.\n\n[11] D.A. Harville, \u201cMaximum likelihood approaches to variance component estimation and to related prob-\n\nlems,\u201d J. American Statistical Association, vol. 72, pp. 320\u2013338, 1977.\n\n[12] M. Miyoshi and Y. Kaneda, \u201cInverse \ufb01ltering of room acoustics,\u201d IEEE Trans. Acoustics, Speech, and\n\nSignal Processing, vol. 36, no. 2, pp. 145\u2013152, 1988.\n\n8\n\n\f", "award": [], "sourceid": 793, "authors": [{"given_name": "Yuanqing", "family_name": "Lin", "institution": null}, {"given_name": "Jingdong", "family_name": "Chen", "institution": null}, {"given_name": "Youngmoo", "family_name": "Kim", "institution": null}, {"given_name": "Daniel", "family_name": "Lee", "institution": null}]}