{"title": "One Microphone Source Separation", "book": "Advances in Neural Information Processing Systems", "page_first": 793, "page_last": 799, "abstract": null, "full_text": "One Microphone Source Separation \n\nSam T. Roweis \n\nGatsby Unit, University College London \n\nroweis@gatsby.ucl. a c.uk \n\nAbstract \n\nSource separation,  or computational auditory  scene analysis , attempts to extract \nindividual  acoustic objects from  input which contains  a mixture of sounds  from \ndifferent  sources,  altered  by  the  acoustic  environment.  Unmixing  algorithms \nsuch  as  lCA  and  its  extensions  recover sources  by  reweighting  multiple  obser(cid:173)\nvation sequences, and thus cannot operate when only  a single observation  signal \nis  available.  I present a technique  called  refiltering  which  recovers  sources  by \na nonstationary  reweighting  (\"masking\")  of frequency  sub-bands  from  a  single \nrecording,  and  argue for  the  application  of statistical algorithms to  learning this \nmasking  function .  I  present results  of a  simple  factorial  HMM  system  which \nlearns on recordings of single speakers and can then separate mixtures using only \none observation signal by  computing the masking function and then refiltering. \n\n1  Learning from data in computational auditory scene analysis \nImagine listening to many pianos being played simultaneously. If each pianist were striking \nkeys randomly it would be very  difficult to  tell  which note came from which piano.  But \nif each  were  playing  a  coherent  song,  separation  would  be  much  easier  because  of the \nstructure of music.  Now  imagine teaching a computer to  do the separation by  showing it \nmany musical scores as \"training data\".  Typical auditory perceptual input contains a mix(cid:173)\nture of sounds from different sources, altered by the acoustic environment.  Any biological \nor  artificial  hearing  system  must  extract individual  acoustic  objects  or  streams  in  order \nto  do  successful localization,  denoising and recognition.  Bregman [1]  called this  process \nauditory scene analysis in analogy to vision.  Source separation, or computational auditory \nscene analysis  (CASA) is  the  practical realization  of this  problem via computer analysis \nof microphone recordings and  is  very  similar to the musical task described above.  It has \nbeen investigated by research groups with different emphases. The CASA community have \nfocused on both multiple and single microphone source separation problems under highly \nrealistic  acoustic  conditions,  but  have  used  almost  exclusively  hand  designed  systems \nwhich include substantial knowledge of the human auditory system and its psychophysical \ncharacteristics  (e.g. [2,3]).  Unfortunately,  it is  difficult  to  incorporate large  amounts  of \ndetailed  statistical  knowledge  about  the  problem  into  such  an  approach.  On  the  other \nhand, machine learning researchers, especially those working on independent components \nanalysis  (lCA) and related algorithms, have focused  on the case  of multiple microphones \nin  simplified mixing environments and have used powerful \"blind\" statistical  techniques. \nThese  \"unmixing\"  algorithms  (even  those  which  attempt  to  recover  more  sources  than \nsignals) cannot operate on single recordings. 
Furthermore, since they often depend only on the joint amplitude histogram of the observations, they can be very sensitive to the details of filtering and reverberation in the environment. The goal of this paper is to bring together the robust representations of CASA and methods which learn from data to solve a restricted version of the source separation problem: isolating acoustic objects from only a single microphone recording.

2  Refiltering vs. unmixing

Unmixing algorithms reweight multiple simultaneous recordings m_k(t) (generically called microphones) to form a new source object s(t):

    s(t) = α_1 m_1(t) + α_2 m_2(t) + ... + α_K m_K(t)    (1)

The unmixing coefficients α_i are constant over time and are chosen to optimize some property of the set of recovered sources, which often translates into a kurtosis measure on the joint amplitude histogram of the microphones. The intuition is that unmixing algorithms are finding spikes (or dents, for low kurtosis sources) in the marginal amplitude histogram. The time ordering of the datapoints is often irrelevant.

Unmixing depends on a fine timescale, sample-by-sample comparison of several observation signals. Humans, on the other hand, cannot hear histogram spikes¹ and perform well on many monaural separation tasks. We are doing structural analysis, or a kind of perceptual grouping, on the incoming sound. But what is being grouped? There is substantial evidence that the energy across time in different frequency bands can carry relatively independent information. This suggests that the appropriate subparts of an audio signal may be narrow frequency bands over short times. To generate these parts, one can perform multiband analysis: break the original signal y(t) into many subband signals b_i(t), each filtered to contain only energy from a small portion of the spectrum. The results of such an analysis are often displayed as a spectrogram, which shows energy (using colour or grayscale) as a function of time (ordinate) and frequency (abscissa). (For example, one is shown at the top left of figure 5.) In the musical analogy, a spectrogram is like a musical score in which the colour or grey level of each note tells you how hard to hit the piano key.

The basic idea of refiltering is to construct new sources by selectively reweighting the multiband signals b_i(t). Crucially, however, the mixing coefficients are no longer constant over time; they are now called masking signals. Given a set of masking signals, denoted α_i(t), a source s(t) can be recovered by modulating the corresponding subband signals from the original input and summing:

    s(t) = α_1(t) b_1(t) + α_2(t) b_2(t) + ... + α_K(t) b_K(t)    (2)

The α_i(t) are gain knobs on each subband that we can twist over time to bring bands in and out of the source as needed. This performs masking on the original spectrogram. (An equivalent operation can be performed in the frequency domain.²) This approach, illustrated in figure 1, forms the basis of many CASA approaches (e.g. [2,3,4]).
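To make the refiltering operation of equation (2) concrete, the following Python/NumPy sketch (an illustration, not code from the paper) carries it out in the frequency domain: it masks the magnitude of each short-time DFT with a per-band, per-frame gain while preserving phase, then resynthesizes by overlap-add. The window length, hop size and helper names are assumptions, and binary masks are just the special case where alpha contains only zeros and ones.

# Minimal refiltering sketch (illustrative): recover a source from a single
# recording y by reweighting its time-frequency cells with masking signals
# alpha (one gain per frequency band per analysis window), as in eq. (2).
import numpy as np

def stft(y, win=512, hop=256):
    """Short-time DFT of y with a Hamming window; returns a (frames x freq) complex array."""
    w = np.hamming(win)
    frames = [np.fft.rfft(w * y[t:t + win]) for t in range(0, len(y) - win, hop)]
    return np.array(frames)

def istft(Y, win=512, hop=256):
    """Overlap-add resynthesis matching stft() above."""
    w = np.hamming(win)
    out = np.zeros(hop * (len(Y) - 1) + win)
    norm = np.zeros_like(out)
    for k, frame in enumerate(Y):
        out[k * hop:k * hop + win] += w * np.fft.irfft(frame, n=win)
        norm[k * hop:k * hop + win] += w ** 2
    return out / np.maximum(norm, 1e-8)

def refilter(y, alpha, win=512, hop=256):
    """alpha: (frames x freq) masking signals in [0, 1] (binary or real valued).
    Modulates the magnitude of each short-time DFT while preserving its phase."""
    Y = stft(y, win, hop)
    S = alpha * np.abs(Y) * np.exp(1j * np.angle(Y))   # mask magnitude, keep phase
    return istft(S, win, hop)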
For any specific choice of masking signals α_i(t), refiltering attempts to isolate a single source from the input signal and suppress all other sources and background noises. Different sources can be isolated by choosing different masking signals. Henceforth, I will make a strong simplifying assumption that the α_i(t) are binary and constant over a timescale T of roughly 30ms. This is physically unrealistic, because the energy in each small region of time-frequency never comes entirely from a single source. However, in practice, for small numbers of sources, this approximation works quite well (figure 3). (Think of ignoring collisions by assuming separate piano players do not often hit the same note at the same time.)

¹ Try randomly permuting the time order of samples in a stereo mixture containing several sources and see if you still hear distinct streams when you play it back.
² Make a conventional spectrogram of the original signal y(t) and modulate the magnitude of each short-time DFT while preserving its phase: s^w(τ) = F^{-1}{ α_i^w · |F{y^w(τ)}| · ∠F{y^w(τ)} }, where s^w(τ) and y^w(τ) are the w-th windows (blocks) of the recovered and original signals, α_i^w is the masking signal for subband i in window w, and F{·} is the DFT.

Figure 1: The refiltering approach to one microphone source separation. Multiband analysis of the original signal y(t) gives sub-band signals b_i(t) which are modulated by masking signals α_i(t) (binary or real valued between 0 and 1) and recombined to give the estimated source or object s(t).

Refiltering can also be thought of as a highly nonstationary Wiener filter in which both the signal and noise spectra are re-estimated at a rate 1/T; the binary assumption is equivalent to assuming that over a timescale T the signal and noise spectra are nonoverlapping.

It is a fortunate empirical fact that refiltering, even with binary masking signals, can cleanly separate sources from a single mixed recording. This can be demonstrated by taking several isolated sources or noises and mixing them in a controlled way. Since the original components are known, an "optimal" set of masking signals can be computed. For example, we might set α_i(t) equal to the ratio of energy from one source in band i around times t ± T to the sum of energies from all sources in the same band at that time (as recommended by the Wiener filter), or to a binary version which thresholds this ratio. Constructing masks in this way is also useful for generating labeled training data, as discussed below.
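Because these "optimal" masks have such a simple form, they are easy to sketch. The fragment below (my illustration, reusing the stft helper from the earlier sketch) computes the Wiener-style energy-ratio mask for one source from the spectrograms of two known isolated sources, plus its thresholded binary version; the 0.5 threshold is an assumption.

# Illustrative sketch: "optimal" masks from known isolated sources s1, s2,
# e.g. for generating labeled training data or controlled separation tests.
import numpy as np

def ideal_masks(s1, s2, win=512, hop=256, thresh=0.5):
    """Return (ratio_mask, binary_mask) for source 1, given both clean sources."""
    E1 = np.abs(stft(s1, win, hop)) ** 2          # energy of source 1 per band/frame
    E2 = np.abs(stft(s2, win, hop)) ** 2          # energy of source 2 per band/frame
    ratio = E1 / (E1 + E2 + 1e-12)                # Wiener-style energy ratio in [0, 1]
    binary = (ratio > thresh).astype(float)       # thresholded, binary masking signals
    return ratio, binary

# Usage: mix the sources, then refilter the mixture with the ideal binary mask.
#   y = s1 + s2
#   _, alpha = ideal_masks(s1, s2)
#   s1_hat = refilter(y, alpha)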
3  Multiband grouping as a statistical pattern recognition problem

Since one-microphone source separation using refiltering is possible if the masking signals are well chosen, the essential problem becomes: how can the α_i(t) be computed automatically from a single mixed recording? The goal is to group or "tag" together regions of the spectrogram that belong to the same auditory object. Fortunately, in audition (as in vision), natural signals, especially speech, exhibit a lot of regularity in the way energy is distributed across the time-frequency plane. Grouping cues based on these regularities have been studied for many years by psychophysicists and are hand built into many CASA systems. Cues are based on the idea of suspicious coincidences: roughly, "things that move together likely belong together". Thus, frequencies which exhibit common onsets, offsets, or upward/downward sweeps are more likely to be grouped into the same stream (figure 2). Also, many real world sounds have harmonic spectra, so frequencies which lie exactly on a harmonic "stack" are often perceptually grouped together. (Musically, piano players do not hit keys randomly, but instead use chords and repeated melodies.)

Figure 2: Examples of three common grouping cues for energy which often comes from a single source. (left) Harmonic stacking: frequencies which lie exactly on harmonic multiples of a single base frequency. (middle) Common onset: frequencies which suddenly increase or decrease their energy together. (right) Frequency co-modulation: energy which moves up or down in frequency at the same time.

There are several ways that statistical pattern recognition might be applied to take advantage of these cues. Methods may be roughly grouped into unsupervised ones, which learn models of isolated sources and then try to explain mixed input as being caused by the interaction of individual source models; and supervised methods, which explicitly model grouping in mixed acoustic input but require labeled data consisting of mixed input as well as masking signals. Luckily, it is very easy to generate such data by mixing isolated sources in a controlled way, although the subsequent supervised learning can be difficult.³

Figure 3: Each point represents the energy from one source versus another in a narrow frequency band over a 32ms window. The plot shows all frequencies over a 2 second period from a speech mixture. Typically when one source has large energy the other does not. The binary assumption on the masking signals α_i(t) is equivalent to projecting the points shown onto either the horizontal or vertical axis.

4  Results using factorial-max HMMs

Here, I will describe one (purely unsupervised) method I have pursued for automatically generating masking signals from a single microphone. The approach first trains speaker dependent hidden Markov models (HMMs) on isolated data from single talkers. These pre-trained models are then combined in a particular way to build a separation system.

First, for each speaker, a simple HMM is fit using patches of narrowband spectrograms as the pattern vectors.⁴ The emission densities model the typical spectral patterns produced by each talker, while the transition probabilities encourage spectral continuity. HMM training was initialized by first training a mixture of Gaussians on each speaker's data (with a single shared covariance matrix) independent of time order. Each mixture had 8192 components of dimension 1026 = 513 x 2; thus each HMM had 8192 states. To avoid overfitting, the transition matrices were regularized after training so that each transition (even those unobserved in the training set) had a small finite probability.
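As a scaled-down illustration of just this initialization and regularization step (not the author's code), one might write the following sketch; the component count, the smoothing constant, and the use of hard state assignments in place of full HMM training are all simplifying assumptions.

# Illustrative sketch of speaker-model initialization: fit a mixture of Gaussians
# with a single shared (tied) covariance to spectrogram patch vectors, ignoring
# time order, then smooth the transition matrix so every transition (even ones
# unobserved in training) keeps a small finite probability.
import numpy as np
from sklearn.mixture import GaussianMixture

def init_speaker_model(patches, n_states=64, eps=1e-4):
    """patches: (n_frames x d) array of spectrogram patch vectors for one talker."""
    gmm = GaussianMixture(n_components=n_states, covariance_type='tied')
    gmm.fit(patches)                      # emission means + shared covariance, no time order
    labels = gmm.predict(patches)         # hard state assignment for each frame

    # Count observed transitions between consecutive frames to initialize T_ij.
    T = np.zeros((n_states, n_states))
    for i, j in zip(labels[:-1], labels[1:]):
        T[i, j] += 1

    # Regularize: add a small floor probability and renormalize each row.
    T += eps
    T /= T.sum(axis=1, keepdims=True)
    return gmm.means_, gmm.covariances_, T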
Next, to separate a new single recording which is a mixture of known speakers, these pre-trained models are combined into a factorial hidden Markov model (FHMM) architecture [5]. A FHMM consists of two or more underlying Markov chains (the hidden states) which evolve independently. The observation y_t at any time depends on the states of all the chains. A simple way to model this dependence is to have each chain c independently propose an output y^c and then combine them to generate the observation according to some rule y_t = Q(y^1_t, y^2_t, ..., y^C_t). Below, I use a model with only two chains, whose states are denoted x_t and z_t. At each time, one chain proposes an output vector a_{x_t} and the other proposes b_{z_t}. The key part of the model is the function Q: observations are generated by taking the elementwise maximum of the proposals and adding noise. This maximum operation reflects the observation that the log magnitude spectrogram of a mixture of sources is very nearly the elementwise maximum of the individual spectrograms. The full generative model for this "factorial-max HMM" can be written simply as:

    p(x_t = j | x_{t-1} = i) = T_ij    (3)
    p(z_t = j | z_{t-1} = i) = U_ij    (4)
    p(y_t | x_t, z_t) = N(max[a_{x_t}, b_{z_t}], R)    (5)

where N(μ, Σ) denotes a Gaussian distribution with mean μ and covariance Σ, and max[·,·] is the elementwise maximum operation on two vectors. (There are also densities on the initial states x_1 and z_1.) This model is illustrated in figure 4. It ignores two aspects of the spectrogram data: first, Gaussian noise is used although the observations are nonnegative; second, the probability factor requiring the non-maximum output proposal to be less than the maximum proposal is missing. However, in practice these approximations are not too severe, and making them allows an efficient inference procedure (see below).

³ Recall that refiltering can only isolate one auditory stream at a time from the scene (we are always separating "a source" from "the background"). This makes learning the masking signals an unusual problem because for any input (spectrogram) there are as many correct answers as objects in the scene. Such a highly multimodal distribution on outputs given inputs means that the mapping from auditory input to masking signals cannot be learned using backprop or other single-valued function approximators which take the average of the possible maskings present in the training data.

⁴ The observations are created by concatenating the values of 2 adjacent columns of the log magnitude periodogram into a single vector. The original waveforms were sampled at 16kHz. Periodogram windows of 32ms at a frame rate of 16ms were analyzed using a Hamming tapered DFT zero padded to length 1024. This gave 513 frequency samples from DC to Nyquist. Average signal energy was normalized across the most recent 8 frames before computing each DFT.

Figure 4: Factorial HMM with max output semantics. Two Markov chains x_t and z_t evolve independently. Observations y_t are the elementwise max of the individual emission vectors, max[a_{x_t}, b_{z_t}], plus Gaussian noise.

In the experiment presented below, each chain represents a speaker dependent HMM (one male and one female). The emission and transition probabilities from each speaker's pre-trained HMM were used as the parameters for the combined FHMM. (The output noise covariance R is shared between the two HMMs.)
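To make equations (3)-(5) concrete, the following sketch (illustrative, not from the paper) samples a state and observation sequence from a two-chain factorial-max HMM; the uniform initial state distributions, the diagonal output covariance, and the parameter names and shapes are my own assumptions.

# Illustrative sketch of the factorial-max generative model of eqs. (3)-(5):
# two independent Markov chains each propose an emission vector, and the
# observation is their elementwise maximum plus Gaussian noise.
import numpy as np

def sample_factorial_max_hmm(T, U, A, B, r, length, rng=np.random.default_rng(0)):
    """T, U: transition matrices; A, B: (n_states x d) emission means for each chain;
    r: (d,) diagonal output noise variances (the shared covariance R)."""
    x = rng.integers(T.shape[0])            # initial states (uniform, for simplicity)
    z = rng.integers(U.shape[0])
    X, Z, Y = [], [], []
    for _ in range(length):
        proposal = np.maximum(A[x], B[z])                # elementwise max of the two proposals
        y = proposal + rng.normal(0.0, np.sqrt(r))       # add Gaussian noise, eq. (5)
        X.append(x); Z.append(z); Y.append(y)
        x = rng.choice(T.shape[0], p=T[x])               # chain 1 transition, eq. (3)
        z = rng.choice(U.shape[0], p=U[z])               # chain 2 transition, eq. (4)
    return np.array(X), np.array(Z), np.array(Y)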
Given an input waveform, the observation sequence Y = y_1, ..., y_T is created from the spectrogram as before.⁴ Separation is done by first inferring a joint underlying state sequence {x_t, z_t} of the two Markov chains in the model and then using the difference of their individual output predictions to compute a binary masking signal:

    α_t(i) = 1 if a_{x_t}(i) > b_{z_t}(i),  and  α_t(i) = 0 if a_{x_t}(i) ≤ b_{z_t}(i)    (6)

Ideally, the inferred state sequences {x_t, z_t} should be the mode of the posterior distribution p(x_t, z_t | Y). Since the hidden chains share a single visible output variable, naive inference in the FHMM graphical model yields an intractable amount of work, exponential in the size of the state space of each submodel. However, because all of the observations are nonnegative and the max operation is used to combine output proposals, there is an efficient trick for computing the best joint state trajectory. At each time, we can upper bound the log-probability of generating the observation vector if one chain is in state i, no matter what state the other chain is in. Computing these bounds for each state setting of each chain requires only a linear amount of work in the size of the state spaces. With these bounds in hand, each time we evaluate the probability of a specific pair of states we can eliminate from consideration all state settings of either chain whose bounds are worse than the achieved probability. If pairs of states are evaluated in a sensible heuristic order (for example by ranking the bounds), this results in practice in almost all possible configurations being quickly eliminated. (This trick turns out to be equivalent to αβ search in game trees.)
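The paper does not spell out the exact form of the bound, but one plausible version follows from noting that max[a_i, b_j] is elementwise at least a_i, so the only unavoidable error in dimension k, whatever the other chain does, is a_i(k) - y(k) when a_i(k) exceeds y(k). The sketch below is my reconstruction under additional simplifying assumptions (a diagonal noise covariance, a single frame considered in isolation with the transition terms ignored, and constant normalization terms dropped), not the author's implementation.

# Illustrative reconstruction of the per-frame pruning trick.
import numpy as np

def frame_bounds(y, A, r):
    """Upper bound on log p(y | x=i, z=anything) for every state i of one chain.
    y: (d,) observation; A: (n_states x d) emission means; r: (d,) noise variances."""
    shortfall = np.maximum(A - y, 0.0)                 # penalty only where a_i(k) > y(k)
    return -0.5 * np.sum(shortfall ** 2 / r, axis=1)   # shared constants omitted

def exact_loglik(y, a, b, r):
    """Exact (unnormalized) log p(y | x, z) for one state pair, eq. (5) with diagonal R."""
    mu = np.maximum(a, b)
    return -0.5 * np.sum((y - mu) ** 2 / r)

def best_pair(y, A, B, r):
    """Find argmax_{i,j} log p(y | x=i, z=j) while skipping pairs ruled out by the bounds."""
    bx, bz = frame_bounds(y, A, r), frame_bounds(y, B, r)
    pair_bound = np.minimum(bx[:, None], bz[None, :])   # valid upper bound for each pair
    order = np.argsort(-pair_bound, axis=None)           # most promising pairs first
    best, best_ij = -np.inf, (0, 0)
    for flat in order:
        i, j = np.unravel_index(flat, pair_bound.shape)
        if pair_bound[i, j] <= best:                      # no remaining pair can beat the incumbent
            break
        ll = exact_loglik(y, A[i], B[j], r)
        if ll > best:
            best, best_ij = ll, (i, j)
    return best_ij, best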
The training data for the model consists only of spectrograms of isolated examples of each speaker, but inference can be done on test data which is a spectrogram of a single mixture of known speakers. The results of separating a simple two speaker mixture are shown below. The test utterance was formed by linearly mixing two out-of-sample utterances (one male and one female) from the same speakers as the models were trained on. Figure 5 shows the original mixed spectrogram (top left) as well as the sequence of outputs a_{x_t} (bottom left) and b_{z_t} (bottom right) from each chain. The chain with the maximum output in any sub-band at any time has α_i(t) = 1, otherwise α_i(t) = 0 (top right). The FHMM system achieves good separation from only a single microphone (see figure 6).

Figure 5: (top left) Original spectrogram of mixed utterance. (bottom) Male and female spectrograms predicted by the factorial HMM and used to compute refiltering masks. (top right) Masking signals α_i(t), computed by comparing the magnitudes of each model's predictions.

Figure 6: Test separation results, using a 2-chain speaker dependent factorial-max HMM, followed by refiltering. (See figure 4 and text for details.) (A) Original waveform of mixed utterance. (B) Original isolated male & female waveforms. (C) Estimated male and female waveforms.

5  Conclusions

In this paper I have argued for the marriage of learning algorithms with the refiltering approach to CASA. I have presented results from a simple factorial HMM system on a speaker dependent separation problem which indicate that automatically learned one-microphone separation systems may be possible. In the machine learning community, the one-microphone separation problem has received much less attention than unmixing problems, while CASA researchers have not employed automatic learning techniques to full effect. Scene analysis is an interesting and challenging learning problem with exciting and practical applications, and the refiltering setup has many nice properties. First, it can work if the masking signals are chosen properly. Second, it is easy to generate lots of training data, both supervised and unsupervised. Third, a good learning algorithm, when presented with enough data, should automatically discover the sorts of grouping cues which have been built into existing systems by hand.

Furthermore, in the refiltering paradigm there is no need to make a hard decision about the number of sources present in an input. Each proposed masking has an associated score or probability; groupings with high scores can be considered "sources", while ones with low scores might be parts of the background or mixtures of other faint sources. CASA returns a collection of candidate maskings and their associated scores, and then it is up to the user to decide, based on the range of scores, the number of sources in the scene.

Many existing approaches to speech and audio processing have the potential to be applied to the monaural source separation problem. The unsupervised factorial HMM system presented in this paper is very similar to the work in the speech recognition community on parallel model combination [6,7]; however, rather than using the combined models to evaluate the likelihood of speech in noise, the efficiently inferred states are being used to generate a masking signal for refiltering. Wan and Nelson have developed dual EKF methods [8] and applied them to speech denoising, but have also informally demonstrated their potential application to monaural source separation. Attias and colleagues [9] developed a fully probabilistic model of speech in noise and used variational Bayesian techniques to perform inference and learning, allowing denoising and dereverberation; their approach clearly has the potential to be applied to the separation problem as well. Cauwenberghs [10] has a very promising approach to the problem for purely harmonic signals that takes advantage of powerful phase constraints which are ignored by other algorithms.

Unsupervised and supervised approaches can be combined to various degrees. Learning models of isolated sounds may be useful for developing feature detectors; conjunctions of such feature detectors can then be trained in a supervised fashion using labeled data. The oscillatory correlation algorithm of Brown and Wang [4] has a low level module to detect features in the correlogram and a high level module to do grouping.
Related ideas in machine vision, such as Markov networks [11] and minimum normalized cut [12], use low level operations to define weights between pixels and then higher level computations to group pixels together.

Acknowledgements

Thanks to Hagai Attias, Guy Brown, Geoff Hinton and Lawrence Saul for many insightful discussions about the CASA problem, and to three anonymous referees and many visitors to my poster for helpful comments, criticisms and references to work I had overlooked.

References

[1] A.S. Bregman. (1994) Auditory Scene Analysis. MIT Press.
[2] G. Brown & M. Cooke. (1994) Computational auditory scene analysis. Computer Speech and Language 8.
[3] D. Ellis. (1994) A computer implementation of psychoacoustic grouping rules. Proc. 12th Intl. Conf. on Pattern Recognition, Jerusalem.
[4] G. Brown & D.L. Wang. (2000) An oscillatory correlation framework for computational auditory scene analysis. NIPS 12.
[5] Z. Ghahramani & M.I. Jordan. (1997) Factorial hidden Markov models. Machine Learning 29.
[6] A.P. Varga & R.K. Moore. (1990) Hidden Markov model decomposition of speech and noise. IEEE Conf. Acoustics, Speech & Signal Processing (ICASSP'90).
[7] M.J.F. Gales & S.J. Young. (1996) Robust continuous speech recognition using parallel model combination. IEEE Trans. Speech & Audio Processing 4.
[8] E.A. Wan & A.T. Nelson. (1998) Removal of noise from speech using the dual EKF algorithm. IEEE Conf. Acoustics, Speech & Signal Processing (ICASSP'98).
[9] H. Attias, J.C. Platt & A. Acero. (2001) Speech denoising and dereverberation using probabilistic models. This volume.
[10] G. Cauwenberghs. (1999) Monaural separation of independent acoustical components. IEEE Symp. Circuits & Systems (ISCAS'99).
[11] W. Freeman & E. Pasztor. (1999) Markov networks for low-level vision. Mitsubishi Electric Research Laboratory Technical Report TR99-08.
[12] J. Shi & J. Malik. (1997) Normalized cuts and image segmentation. IEEE Conf. Computer Vision and Pattern Recognition (CVPR'97), Puerto Rico.
", "award": [], "sourceid": 1885, "authors": [{"given_name": "Sam", "family_name": "Roweis", "institution": null}]}