{"title": "A Classification-based Cocktail-party Processor", "book": "Advances in Neural Information Processing Systems", "page_first": 1425, "page_last": 1432, "abstract": "", "full_text": "A classification-based cocktail-party \n\nprocessor \n\n \n Nicoleta Roman, DeLiang Wang \n Department of Computer and Information \n Science and Center for Cognitive Science \n The Ohio State University \n Columbus, OH 43210, USA \n {niki,dwang}@cis.ohio-state.edu \n \n\n \n\n Abstract \n\n \n\nGuy J. Brown \n\n \n\nDepartment of Computer Science \n\nUniversity of Sheffield \n211 Portobello Street \nSheffield, S1 4DP, UK \ng.brown@dcs.shef.ac.uk \n\ninteraural \n\nlearning approach \n\nAt a cocktail party, a listener can selectively attend to a single \nvoice and filter out other acoustical interferences. How to simulate \nthis perceptual ability remains a great challenge. This paper \ndescribes a novel supervised \nto speech \nsegregation, in which a target speech signal is separated from \ninterfering sounds using spatial location cues: interaural time \ndifferences (ITD) and \nintensity differences (IID). \nMotivated by the auditory masking effect, we employ the notion of \nan ideal time-frequency binary mask, which selects the target if it \nis stronger than the interference in a local time-frequency unit. \nWithin a narrow frequency band, modifications to the relative \nstrength of the target source with respect to the interference trigger \nsystematic changes for estimated ITD and IID. For a given spatial \nconfiguration, this interaction produces characteristic clustering in \nthe binaural feature space. Consequently, we perform pattern \nclassification in order to estimate ideal binary masks. A systematic \nevaluation in terms of signal-to-noise ratio as well as automatic \nspeech recognition performance shows that the resulting system \nproduces masks very close to ideal binary ones. A quantitative \ncomparison shows that our model yields significant improvement \nin performance over an existing approach. Furthermore, under \ncertain conditions the model produces large speech intelligibility \nimprovements with normal listeners. \n\n1 Introduction \n\nThe perceptual ability to detect, discriminate and recognize one utterance in a \nbackground of acoustic interference has been studied extensively under both \nmonaural and binaural conditions [1, 2, 3]. The human auditory system is able to \nsegregate a speech signal from an acoustic mixture using various cues, including \nfundamental frequency (F0), onset time and location, in a process that is known as \n\n\f \n\nauditory scene analysis (ASA) [1]. F0 is widely used in computational ASA systems \nthat operate upon monaural input \u2013 however, systems that employ only this cue are \nlimited to voiced speech [4, 5, 6]. Increased speech intelligibility in binaural \nlistening compared to the monaural case has prompted research in designing \ncocktail-party processors based on spatial cues [7, 8, 9]. Such a system can be \napplied to, among other things, enhancing speech recognition in noisy environments \nand improving binaural hearing aid design. \nIn this study, we propose a sound segregation model using binaural cues extracted \nfrom the responses of a KEMAR dummy head that realistically simulates the \nfiltering process of the head, torso and external ear. A typical approach for signal \nreconstruction uses a time-frequency (T-F) mask: T-F units are weighted selectively \nin order to enhance the target signal. 
Related models for estimating target masks through clustering have been proposed previously [11, 12]. Notably, the experimental results of Jourjine et al. [12] suggest that speech signals in a multiple-speaker condition obey, to a large extent, disjoint orthogonality in time and frequency; that is, at most one source has nonzero energy at a specific time and frequency. Such models, however, assume input taken directly from microphone recordings, and head-related filtering is not considered. Simulating human binaural hearing introduces different constraints, as well as clues to the problem. First, both ITD and IID should be utilized, since IID is more reliable than ITD at higher frequencies. Second, frequency-dependent combinations of ITD and IID arise naturally for a fixed spatial configuration. Consequently, channel-dependent training should be performed for each frequency band.

The rest of the paper is organized as follows. The next section presents the architecture of the model and describes our method for azimuth localization. Section 3 is devoted to ideal binary mask estimation, which constitutes the core of the model. Section 4 presents the performance of the system and a quantitative comparison with the Bodden [7] model. Section 5 concludes the paper.

2 Model architecture and azimuth localization

Our model consists of the following stages: 1) a model of the auditory periphery; 2) frequency-dependent ITD/IID extraction and azimuth localization; 3) estimation of an ideal binary mask.

The input to our model is a mixture of two or more signals presented at different, but fixed, locations. Signals are sampled at 44.1 kHz. We follow a standard procedure for simulating free-field acoustic signals from monaural signals (no reverberation is modeled). Binaural signals are obtained by filtering the monaural signals with measured head-related transfer functions (HRTF) from a KEMAR dummy head [13]. The HRTFs introduce a natural combination of ITD and IID into the signals, which is extracted in the subsequent stages of the model.

To simulate the auditory periphery we use a bank of 128 gammatone filters spanning 80 Hz to 5 kHz, as described in [4]. In addition, the gains of the gammatone filters are adjusted in order to simulate the middle-ear transfer function. In the final step of the peripheral model, the output of each gammatone filter is half-wave rectified in order to simulate the firing rates of the auditory nerve. Saturation effects are modeled by taking the square root of the signal.

Current models of azimuth localization almost invariably start with Jeffress's cross-correlation mechanism. For all frequency channels, we use the normalized cross-correlation computed at lags equally distributed in the plausible range from –1 ms to 1 ms, using an integration window of 20 ms. Frequency-dependent nonlinear transformations are used to map the time-delay axis onto the azimuth axis, resulting in a cross-correlogram structure. In addition, a 'skeleton' cross-correlogram is formed by replacing the peaks in the cross-correlogram with Gaussians of narrower widths that are inversely proportional to the channel center frequency. This results in a sharpening effect, similar in principle to lateral inhibition. Assuming fixed sources, multiple locations are determined as peaks after summating the skeleton cross-correlogram across frequency and time. The number of sources and their locations computed here, as well as the target source location, feed into the next stage.
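A minimal sketch of the frame-level cross-correlation for one channel follows; the single global normalization and the exact frame handling are assumptions consistent with the description above, not the original code:

    import numpy as np

    # left, right: peripheral outputs of one channel for a 20-ms frame
    # (about 882 samples at 44.1 kHz).
    def normalized_cross_correlation(left, right, fs=44100, max_lag_ms=1.0):
        max_lag = int(fs * max_lag_ms / 1000.0)
        lags = np.arange(-max_lag, max_lag + 1)
        norm = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2)) + 1e-12
        corr = np.empty(len(lags))
        for k, lag in enumerate(lags):
            if lag >= 0:
                # pair left(t) with right(t - lag)
                corr[k] = np.dot(left[lag:], right[:len(right) - lag])
            else:
                corr[k] = np.dot(left[:lag], right[-lag:])
            corr[k] /= norm
        return lags * 1000.0 / fs, corr

The frame-level ITD estimate is the lag at the correlation maximum; summating such functions across frequency and time, after the azimuth mapping and skeleton sharpening described above, yields the source locations.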
3 Binary mask estimation

The objective of this stage of the model is to develop an efficient mechanism for estimating an ideal binary mask based on observed patterns of the extracted ITD and IID features. Our theoretical analysis of two-source interactions in the case of pure tones shows relatively smooth changes of ITD and IID with the relative strength R between the two sources in narrow frequency bands [14]. More specifically, when the frequencies vary uniformly within a narrow band, the derived mean values of the ITD/IID estimates vary monotonically with respect to R.

To capture this relationship in the context of real signals, statistics are collected for individual spatial configurations during training. We employ a training corpus consisting of 10 speech utterances from the TIMIT database (see [14] for details). In the two-source case, we divide the corpus into two equal sets: target and interference. In the three-source case, we select 4 signals for the target set and 2 interfering sets of 3 signals each.

For all frequency channels, local estimates of ITD, IID and R are based on 20-ms time frames with a 10-ms overlap between consecutive frames. In order to eliminate the multi-peak ambiguity in the cross-correlation function for mid- and high-frequency channels, we use the following strategy. We compute $\mathrm{ITD}_i$ as the peak location of the cross-correlation within a range of $2\pi/\omega_i$ centered at the target ITD, where $\omega_i$ indicates the center frequency of the $i$th channel. IID and R, on the other hand, are computed as follows:

$$\mathrm{IID}_i = 20 \log_{10} \frac{\sum_t r_i^2(t)}{\sum_t l_i^2(t)}, \qquad R_i = \frac{\sum_t s_i^2(t)}{\sum_t s_i^2(t) + \sum_t n_i^2(t)}$$

where $l_i$ and $r_i$ refer to the left and right peripheral outputs of the $i$th channel, respectively, $s_i$ refers to the output for the target signal, and $n_i$ to that for the acoustic interference. In computing $\mathrm{IID}_i$, we use 20 instead of 10 in order to compensate for the square root operation in the peripheral model.
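For concreteness, a minimal sketch of these frame-level statistics (the premixed target and interference outputs are available only during training; variable names are ours):

    import numpy as np

    # l_i, r_i: left/right peripheral outputs of channel i for one 20-ms frame.
    # s_i, n_i: premixed target and interference outputs for the same frame.
    def frame_statistics(l_i, r_i, s_i, n_i):
        # The factor 20 (rather than 10) compensates for the square root
        # applied in the peripheral model.
        iid = 20.0 * np.log10(np.sum(r_i ** 2) / np.sum(l_i ** 2))
        r = np.sum(s_i ** 2) / (np.sum(s_i ** 2) + np.sum(n_i ** 2))
        return iid, r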
Fig. 1 shows empirical results obtained for a two-source configuration on the training corpus. The data exhibit a systematic shift of both ITD and IID with respect to the relative strength R. Moreover, the theoretical mean values obtained in the case of pure tones [14] match the empirical ones very well. This observation extends to multiple-source scenarios. As an example, Fig. 2 displays histograms that show the relationship between R and both ITD (Fig. 2A) and IID (Fig. 2B) for a three-source situation. Note that the interfering sources introduce systematic deviations in the binaural cues. Consider a worst case: the target is silent and the two interferences have equal energy in a given T-F unit. This results in binaural cues indicating an auditory event at half the distance between the two interference locations; for Fig. 2, this is 0°, the target location. However, the data in Fig. 2 show a low probability for this case, exhibiting instead a clustering phenomenon, which suggests that in most cases only one source dominates a T-F unit.

Figure 1. Relationship between ITD/IID and relative strength R for a two-source configuration: target in the median plane and interference on the right side at 30°. The solid curve shows the theoretical mean and the dashed curve shows the data mean. A: Scatter plot of ITD and R estimates for a filter channel with center frequency 500 Hz. B: Results for IID for a filter channel with center frequency 2.5 kHz.

Figure 2. Relationship between ITD/IID and relative strength R for a three-source configuration: target in the median plane and interference at –30° and 30°. Statistics are obtained for a channel with center frequency 1.5 kHz. A: Histogram of ITD and R samples. B: Histogram of IID and R samples. C: Clustering in the ITD-IID space.

By displaying the information in the joint ITD-IID space (Fig. 2C), we observe location-based clustering of the binaural cues, clearly marked by strong peaks that correspond to distinct active sources. There exists a tradeoff between ITD and IID across frequencies: ITD is most salient at low frequencies and IID at high frequencies [2]. However, a fixed cutoff frequency separating the effective use of ITD and IID does not exist across different spatial configurations. This motivates our choice of a joint ITD-IID feature space that optimizes system performance across different configurations. Differential training seems necessary for different channels, given that ITD and, especially, IID values vary with the center frequency.

Since the goal is to estimate an ideal binary mask, we focus on detecting decision regions in the two-dimensional ITD-IID space for individual frequency channels. Consequently, supervised learning techniques can be applied. For the $i$th channel, we test the following two hypotheses. The first one is $H_1$: target is dominant, or $R_i > 0.5$; the second one is $H_2$: interference is dominant, or $R_i < 0.5$. Based on estimates of the bivariate densities $p(x|H_1)$ and $p(x|H_2)$, classification is done by the maximum a posteriori decision rule: decide $H_1$ if $p(H_1)p(x|H_1) > p(H_2)p(x|H_2)$.

There exists a plethora of techniques for probability density estimation, ranging from parametric techniques (e.g., mixtures of Gaussians) to nonparametric ones (e.g., kernel density estimators). In order to fully characterize the distribution of the data, we use kernel density estimation independently for each frequency channel. One approach to choosing the smoothing parameters is the least-squares cross-validation method, which is utilized in our estimation.
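A minimal sketch of the channel-wise classifier, using SciPy's Gaussian kernel density estimator as a stand-in; note the assumptions: the default bandwidth rule replaces the least-squares cross-validation actually used, and the prior is left as a parameter:

    import numpy as np
    from scipy.stats import gaussian_kde

    # features: (2, N) array of (ITD, IID) training pairs for one channel.
    # target_dominant: boolean array of length N, True where R > 0.5.
    def train_channel_classifier(features, target_dominant):
        p_h1 = gaussian_kde(features[:, target_dominant])    # p(x | H1)
        p_h2 = gaussian_kde(features[:, ~target_dominant])   # p(x | H2)
        return p_h1, p_h2

    # MAP rule: label a T-F unit 1 when p(H1)p(x|H1) > p(H2)p(x|H2).
    def classify(p_h1, p_h2, x, prior_h1=0.5):
        x = np.asarray(x).reshape(2, -1)
        return prior_h1 * p_h1(x) > (1.0 - prior_h1) * p_h2(x)

Applying classify to the (ITD, IID) estimate of every T-F unit yields the estimated binary mask for that channel.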
One cue not employed in our model is the interaural time difference between the signal envelopes (IED). Auditory models generally employ IED in the high-frequency range, where the auditory system becomes gradually insensitive to ITD. We have compared the performance of the three binaural cues, ITD, IID and IED, and found no benefit from using IED in our system once ITD and IID are incorporated [14].

4 Performance and comparison

The performance of a segregation system can be assessed in different ways, depending on the intended application. To evaluate our model extensively, we use the following three criteria: 1) a signal-to-noise ratio (SNR) measure using the original target as signal; 2) ASR rates using our model as a front-end; and 3) human speech intelligibility tests.

To conduct the SNR evaluation, a segregated signal is reconstructed from a binary mask using a resynthesis method described in [5]. To quantitatively assess system performance, we measure the SNR using the original target speech as signal:

$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_t s_o^2(t)}{\sum_t \left( s_o(t) - s_e(t) \right)^2}$$

where $s_o(t)$ represents the resynthesized original speech and $s_e(t)$ the speech reconstructed from an estimated mask. One can measure the initial SNR by replacing the denominator with $\sum_t s_N^2(t)$, where $s_N(t)$ is the resynthesized original interference.
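In code, this evaluation is direct; a sketch, assuming the resynthesis of [5] has already produced time-aligned signals:

    import numpy as np

    # s_o: resynthesized original target; s_e: reconstruction from an
    # estimated mask; s_n: resynthesized original interference.
    def snr_db(s_o, s_e):
        return 10.0 * np.log10(np.sum(s_o ** 2) / np.sum((s_o - s_e) ** 2))

    def initial_snr_db(s_o, s_n):
        # Denominator is the interference energy, not the error energy.
        return 10.0 * np.log10(np.sum(s_o ** 2) / np.sum(s_n ** 2))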
Fig. 3 shows systematic results for two-source scenarios using the Cooke corpus [4], which is commonly used in sound separation studies. The corpus contains 100 mixtures obtained from 10 speech utterances mixed with 10 types of intrusion. We compare the SNR gain obtained by our model against that obtained using the ideal binary mask across the different noise types. Excellent results are obtained when the target is close to the median plane, for an azimuth separation as small as 5°. Performance degrades when the target source is moved to the side of the head, from an average gain of 13.7 dB for the target in the median plane (Fig. 3A) to 1.7 dB when the target is at 80° (Fig. 3B). When the spatial separation increases, performance improves even for side targets, reaching an average gain of 14.5 dB in Fig. 3C. This performance profile is in qualitative agreement with experimental data [2].

Fig. 4 illustrates the performance in a three-source scenario with the target in the median plane and two interfering sources at –30° and 30°. Here 5 speech signals from the Cooke corpus form the target set and the other 5 form one interference set; the second interference set contains the 10 intrusions. The performance degrades compared to the two-source situation, from an average SNR of about 12 dB to 4.1 dB. However, the average SNR gain obtained is approximately 11.3 dB. This ability of our model to segregate mixtures of more than two sources distinguishes it from blind source separation with independent component analysis.

In order to draw a quantitative comparison, we have implemented Bodden's cocktail-party processor using the same 128-channel gammatone filterbank [7]. The localization stage of this model uses an extended cross-correlation mechanism based on contralateral inhibition, and it adapts to HRTFs. The separation stage of the model estimates the weights of a Wiener filter as the ratio between a desired excitation and an actual one. Although the Bodden model is more flexible in that it incorporates aspects of the precedence effect into the localization stage, the estimation of the Wiener filter weights is less robust than our binary estimation of ideal masks. As shown in Fig. 5, our model achieves a considerable improvement over the Bodden system, producing a 3.5 dB average gain.

Figure 3. Systematic results for the two-source configuration. Black bars correspond to the SNR of the initial mixture, white bars indicate the SNR obtained using the ideal binary mask, and gray bars show the SNR from our model. Results are obtained for speech mixed with ten intrusion types (N0: pure tone; N1: white noise; N2: noise burst; N3: 'cocktail party'; N4: rock music; N5: siren; N6: trill telephone; N7: female speech; N8: male speech; N9: female speech). A: Target at 0°, interference at 5°. B: Target at 80°, interference at 85°. C: Target at 60°, interference at 90°.

Figure 4. Evaluation for a three-source configuration: target at 0° and two interfering sources at –30° and 30°. Black bars correspond to the SNR of the initial mixture, white bars to the SNR obtained using the ideal binary mask, and gray bars to the SNR from our model.

Figure 5. SNR comparison between the Bodden model (white bars) and our model (gray bars) for a two-source configuration: target at 0° and interference at 30°. Black bars correspond to the SNR of the initial mixture.

For the ASR evaluation, we use the missing-data technique as described in [10]. In this approach, a continuous-density hidden Markov model recognizer is modified such that only the acoustic features indicated as reliable in a binary mask are used during decoding. Hence, it works seamlessly with the output of our speech segregation system. We have implemented the missing-data algorithm with the same 128-channel gammatone filterbank. Feature vectors are obtained from the Hilbert envelope at the output of each gammatone filter. More specifically, each feature vector is extracted by smoothing the envelope with an 8-ms first-order filter, sampling at a frame rate of 10 ms, and finally log-compressing. We use the bounded marginalization method for classification [10].
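A sketch of this per-channel feature extraction; realizing the 8-ms first-order smoother from a time constant, and the epsilon guarding the logarithm, are our assumptions:

    import numpy as np
    from scipy.signal import hilbert, lfilter

    # channel_out: output of one gammatone filter over a whole utterance.
    def missing_data_features(channel_out, fs=44100, tau=0.008, hop=0.010):
        envelope = np.abs(hilbert(channel_out))
        # First-order (one-pole) lowpass with an 8-ms time constant.
        a = np.exp(-1.0 / (tau * fs))
        smoothed = lfilter([1.0 - a], [1.0, -a], envelope)
        # Sample at a 10-ms frame rate, then log-compress.
        frames = smoothed[::int(hop * fs)]
        return np.log(frames + 1e-10)

Stacking these features across the 128 channels gives one spectral feature vector per 10-ms frame for the recognizer.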
The task domain is recognition of connected digits; both training and testing are performed on acoustic features from the left-ear signal, using the male speaker dataset of the TIDigits database.

Fig. 6A shows the correctness scores for a two-source condition, where the male target speaker is located at 0° and the interference is another male speaker at 30°. The performance of our model is systematically compared against the ideal masks at four SNR levels: 5 dB, 0 dB, –5 dB and –10 dB. Similarly, Fig. 6B shows the results for the three-source case, with an added female speaker at –30°. The ideal mask exhibits only a slight and gradual degradation in recognition performance with decreasing SNR and increasing number of sources. Observe that large improvements over the baseline performance are obtained across all conditions. This shows the strong potential of applying our model to robust speech recognition.

Figure 6. Recognition performance at different SNR values for the original mixture (dotted line), the ideal binary mask (dashed line) and the estimated mask (solid line). A: Correctness score for the two-source case. B: Correctness score for the three-source case.

Finally, we evaluate our model on speech intelligibility with normal-hearing listeners. We use the Bamford-Kowal-Bench sentence database, which contains short, semantically predictable sentences [15]. The score is evaluated as the percentage of keywords correctly identified, ignoring minor errors such as tense and plurality. To eliminate potential location-based priming effects, we randomly swap the locations of target and interference across trials. In the unprocessed condition, binaural signals are produced by convolving the original signals with the corresponding HRTFs, and the signals are presented to the listener dichotically. In the processed condition, our algorithm is used to reconstruct the target signal at the better ear, and the result is presented diotically.

Figure 7. Keyword intelligibility scores for twelve native English speakers (median values and interquartile ranges) before (white bars) and after processing (black bars). A: Two-source condition (0° and 5°). B: Three-source condition (0°, 30° and –30°).

Fig. 7A gives the keyword intelligibility score for a two-source configuration. Three SNR levels are tested: 0 dB, –5 dB and –10 dB, where the SNR is computed at the better ear. Here the target is a male speaker and the interference is babble noise. Our algorithm improves the intelligibility score for all tested conditions, and the improvement becomes larger as the SNR decreases (61% at –10 dB). Our informal observations suggest, as expected, that the intelligibility of unprocessed mixtures improves when the two sources are separated more widely than 5°.
Fig. 7B shows the results for a three-source configuration, where our model yields a 40% improvement. Here the interfering sources are one female speaker and another male speaker, resulting in an initial SNR of –10 dB at the better ear.

5 Conclusion

We have observed systematic deviations of the ITD and IID cues with respect to the relative strength between target and acoustic interference, as well as configuration-specific clustering in the joint ITD-IID feature space. Consequently, supervised learning of binaural patterns is employed for individual frequency channels and different spatial configurations to estimate an ideal binary mask, which cancels the acoustic energy in T-F units where the interference is stronger. Evaluation using both SNR and ASR measures shows that the system estimates ideal binary masks very well. A comparison shows a significant improvement in performance over the Bodden model. Moreover, our model produces substantial speech intelligibility improvements in two- and three-source conditions.

Acknowledgments

This research was supported in part by an NSF grant (IIS-0081058) and an AFOSR grant (F49620-01-1-0027). A preliminary version of this work was presented at ICASSP 2002.

References

[1] A. S. Bregman, Auditory Scene Analysis, Cambridge, MA: MIT Press, 1990.

[2] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization, Cambridge, MA: MIT Press, 1997.

[3] A. Bronkhorst, “The cocktail party phenomenon: a review of research on speech intelligibility in multiple-talker conditions,” Acustica, vol. 86, pp. 117-128, 2000.

[4] M. P. Cooke, Modeling Auditory Processing and Organization, Cambridge, U.K.: Cambridge University Press, 1993.

[5] G. J. Brown and M. P. Cooke, “Computational auditory scene analysis,” Computer Speech and Language, vol. 8, pp. 297-336, 1994.

[6] G. Hu and D. L. Wang, “Monaural speech separation,” Proc. NIPS, 2002.

[7] M. Bodden, “Modeling human sound-source localization and the cocktail-party-effect,” Acta Acustica, vol. 1, pp. 43-55, 1993.

[8] C. Liu et al., “A two-microphone dual delay-line approach for extraction of a speech sound in the presence of multiple interferers,” J. Acoust. Soc. Am., vol. 110, pp. 3218-3230, 2001.

[9] T. Wittkop and V. Hohmann, “Strategy-selective noise reduction for binaural digital hearing aids,” Speech Comm., vol. 39, pp. 111-138, 2003.

[10] M. P. Cooke, P. Green, L. Josifovski and A. Vizinho, “Robust automatic speech recognition with missing and unreliable acoustic data,” Speech Comm., vol. 34, pp. 267-285, 2001.

[11] H. Glotin, F. Berthommier and E. Tessier, “A CASA-labelling model using the localisation cue for robust cocktail-party speech recognition,” Proc. EUROSPEECH, pp. 2351-2354, 1999.

[12] A. Jourjine, S. Rickard and O. Yilmaz, “Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures,” Proc. ICASSP, 2000.

[13] W. G. Gardner and K. D. Martin, “HRTF measurements of a KEMAR dummy-head microphone,” MIT Media Lab Technical Report #280, 1994.

[14] N. Roman, D. L. Wang and G. J. Brown, “Speech segregation based on sound localization,” J. Acoust. Soc. Am., vol. 114, pp. 2236-2252, 2003.

[15] J. Bench and J. Bamford, Speech Hearing Tests and the Spoken Language of Hearing-Impaired Children, London: Academic Press, 1979.
", "award": [], "sourceid": 2414, "authors": [{"given_name": "Nicoleta", "family_name": "Roman", "institution": null}, {"given_name": "Deliang", "family_name": "Wang", "institution": null}, {"given_name": "Guy", "family_name": "Brown", "institution": null}]}