{"title": "Learning from Infinite Data in Finite Time", "book": "Advances in Neural Information Processing Systems", "page_first": 673, "page_last": 680, "abstract": null, "full_text": "Learning from Infinite Data in Finite Time\n\nPedro Domingos\n\nGeoff Hulten\n\nDepartment of Computer Science and Engineering\n\nUniversity of Washington\n\nSeattle, WA 98195-2350, U.S.A.\n\n{pedrod, ghulten}@cs.washington.edu\n\nAbstract\n\nWe propose the following general method for scaling learning algorithms to arbitrarily large data sets. Consider the model $M_{\\vec{n}}$ learned by the algorithm using $n_i$ examples in step $i$ ($\\vec{n} = (n_1, \\ldots, n_m)$), and the model $M_\\infty$ that would be learned using infinite examples. Upper-bound the loss $L(M_{\\vec{n}}, M_\\infty)$ between them as a function of $\\vec{n}$, and then minimize the algorithm's time complexity $f(\\vec{n})$ subject to the constraint that $L(M_\\infty, M_{\\vec{n}})$ be at most $\\epsilon$ with probability at least $1 - \\delta$. We apply this method to the EM algorithm for mixtures of Gaussians. Preliminary experiments on a series of large data sets provide evidence of the potential of this approach.\n\n1 An Approach to Large-Scale Learning\n\nLarge data sets make it possible to reliably learn complex models. On the other hand, they require large computational resources to learn from. While in the past the factor limiting the quality of learnable models was typically the quantity of data available, in many domains today data is super-abundant, and the bottleneck is the time required to process it. Many algorithms for learning on large data sets have been proposed, but in order to achieve scalability they generally compromise the quality of the results to an unspecified degree. 
We believe this unsatisfactory state of affairs is avoidable, and in this paper we propose a general method for scaling learning algorithms to arbitrarily large databases without compromising the quality of the results. Our method makes it possible to learn in finite time a model that is essentially indistinguishable from the one that would be obtained using infinite data.\n\nConsider the simplest possible learning problem: estimating the mean of a random variable $x$. If we have a very large number of samples, most of them are probably superfluous. If we are willing to accept an error of at most $\\epsilon$ with probability at least $1 - \\delta$, Hoeffding bounds [4] (for example) tell us that, irrespective of the distribution of $x$, only $n = \\frac{1}{2}(R/\\epsilon)^2 \\ln(2/\\delta)$ samples are needed, where $R$ is $x$'s range. We propose to extend this type of reasoning beyond learning single parameters, to learning complex models. The approach we propose consists of three steps:\n\n1. Derive an upper bound on the relative loss between the finite-data and infinite-data models, as a function of the number of samples used in each step of the finite-data algorithm.\n\n2. Derive an upper bound on the time complexity of the learning algorithm, as a function of the number of samples used in each step.\n\n3. Minimize the time bound (via the number of samples used in each step) subject to target limits on the loss.\n\nIn this paper we exemplify this approach using the EM algorithm for mixtures of Gaussians. In earlier papers we applied it (or an earlier version of it) to decision tree induction [2] and k-means clustering [3]. Despite its wide use, EM has long been criticized for its inefficiency (see discussion following Dempster et al. [1]), and has been considered unsuitable for large data sets [8]. 
Many approaches to speeding it up have been proposed (see Thiesson et al. [6] for a survey). Our method can be seen as an extension of progressive sampling approaches like Meek et al. [5]: rather than minimize the total number of samples needed by the algorithm, we minimize the number needed by each step, leading to potentially much greater savings; and we obtain guarantees that do not depend on unverifiable extrapolations of learning curves.\n\n2 A Loss Bound for EM\n\nIn a mixture of Gaussians model, each $D$-dimensional data point $x_j$ is assumed to have been independently generated by the following process: 1) randomly choose a mixture component $k$; 2) randomly generate a point from it according to a Gaussian distribution with mean $\\mu_k$ and covariance matrix $\\Sigma_k$. In this paper we will restrict ourselves to the case where the number $K$ of mixture components and the probability of selection $P(\\mu_k)$ and covariance matrix for each component are known. Given a training set $S = \\{x_1, \\ldots, x_N\\}$, the learning goal is then to find the maximum-likelihood estimates of the means $\\mu_k$. The EM algorithm [1] accomplishes this by, starting from some set of initial means, alternating until convergence between estimating the probability $p(\\mu_k | x_j)$ that each point was generated by each Gaussian (the E step), and computing the ML estimates of the means $\\hat{\\mu}_k = \\sum_{j=1}^{N} w_{jk} x_j / \\sum_{j=1}^{N} w_{jk}$ (the M step), where $w_{jk} = p(\\mu_k | x_j)$ from the previous E step. In the basic EM algorithm, all $N$ examples in the training set are used in each iteration. The goal in this paper is to speed up EM by using only $n_i < N$ examples in the $i$th iteration, while guaranteeing that the means produced by the algorithm do not differ significantly from those that would be obtained with arbitrarily large $N$. 
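A minimal sketch of the E and M steps just described, run on a random subsample in each iteration. It assumes, purely for brevity, spherical Gaussians that share one known standard deviation sigma, so the Gaussian normalising constant cancels in the E step. The function names em_step and subsampled_em, the fixed per-iteration sample size, and the seed parameter are hypothetical and not part of the original text.

```python
import math
import random


def em_step(points, means, priors, sigma):
    '''One EM iteration that re-estimates only the means.

    Assumption: every component is a spherical Gaussian with the same
    known sigma; priors are the known selection probabilities.
    '''
    dim = len(means[0])
    sums = [[0.0] * dim for _ in means]
    totals = [0.0] * len(means)
    for x in points:
        # E step: unnormalised responsibility of each component for x.
        resp = []
        for mu, prior in zip(means, priors):
            sq = sum((a - b) ** 2 for a, b in zip(x, mu))
            resp.append(prior * math.exp(-sq / (2.0 * sigma ** 2)))
        norm = sum(resp)
        if norm == 0.0:
            continue  # x is too far from every mean to carry any weight
        # M step accumulation: weighted sum of the points per component.
        for k, r in enumerate(resp):
            w = r / norm
            totals[k] += w
            for d in range(dim):
                sums[k][d] += w * x[d]
    new_means = []
    for k, mu in enumerate(means):
        if totals[k] > 0.0:
            new_means.append([s / totals[k] for s in sums[k]])
        else:
            new_means.append(list(mu))  # keep the old mean if no weight
    return new_means


def subsampled_em(data, means, priors, sigma, n_per_iter, iters, seed=0):
    '''Run EM for a fixed number of iterations on fresh random subsamples.'''
    rng = random.Random(seed)
    for _ in range(iters):
        sample = rng.sample(data, min(n_per_iter, len(data)))
        means = em_step(sample, means, priors, sigma)
    return means
```

Here the number of examples per iteration is a fixed argument; choosing it per iteration so that the loss stays bounded is the subject of the rest of the paper.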
\n\nLet $M_{\\vec{n}} = (\\hat{\\mu}_1, \\ldots, \\hat{\\mu}_K)$ be the vector of mean estimates obtained by the finite-data EM algorithm (i.e., using $n_i$ examples in iteration $i$), and let $M_\\infty = (\\mu_1, \\ldots, \\mu_K)$ be the vector obtained using infinite examples at each iteration. In order to proceed, we need to quantify the difference between $M_{\\vec{n}}$ and $M_\\infty$. A natural choice is the sum of the squared errors between corresponding means, which is proportional to the negative log-likelihood of the finite-data means given the infinite-data ones:\n\n$$L(M_{\\vec{n}}, M_\\infty) = \\sum_{k=1}^{K} \\|\\hat{\\mu}_k - \\mu_k\\|^2 = \\sum_{k=1}^{K} \\sum_{d=1}^{D} |\\hat{\\mu}_{kd} - \\mu_{kd}|^2 \\qquad (1)$$\n\nwhere $\\hat{\\mu}_{kd}$ is the $d$th coordinate of $\\hat{\\mu}_k$, and similarly for $\\mu_{kd}$.\n\nAfter any given iteration of EM, $|\\hat{\\mu}_{kd} - \\mu_{kd}|$ has two components. One, which we call the sampling error, derives from the fact that $\\hat{\\mu}_{kd}$ is estimated from a finite sample, while $\\mu_{kd}$ is estimated from an infinite one. The other component, which we call the weighting error, derives from the fact that, due to sampling errors in previous iterations, the weights $w_{jk}$ used to compute the two estimates may differ. Let $\\mu_{kdi}$ be the infinite-data estimate of the $d$th coordinate of the $k$th mean produced in iteration $i$, $\\hat{\\mu}_{kdi}$ be the corresponding finite-data estimate, and $\\tilde{\\mu}_{kdi}$ be the estimate that would be obtained if there were no weighting errors in that iteration. Then the sampling error at iteration $i$ is $|\\tilde{\\mu}_{kdi} - \\mu_{kdi}|$, the weighting error is $|\\hat{\\mu}_{kdi} - \\tilde{\\mu}_{kdi}|$, and the total error is $|\\hat{\\mu}_{kdi} - \\mu_{kdi}| \\le |\\hat{\\mu}_{kdi} - \\tilde{\\mu}_{kdi}| + |\\tilde{\\mu}_{kdi} - \\mu_{kdi}|$.\n\nGiven bounds on the total error of each coordinate of each mean after iteration $i-1$, we can derive a bound on the weighting error after iteration $i$ as follows. 
Bounds on $\\mu_{kd,i-1}$ for each $d$ imply bounds on $p(x_j | \\mu_{ki})$ for each example $x_j$, obtained by substituting the maximum and minimum allowed distances between $x_{jd}$ and $\\mu_{kd,i-1}$ into the expression of the Gaussian distribution. Let $p^+_{jki}$ be the upper bound on $p(x_j | \\mu_{ki})$, and $p^-_{jki}$ be the lower bound. Then the weight of example $x_j$ in mean $\\mu_{ki}$ can be bounded from below by $w^-_{jki} = p^-_{jki} P(\\mu_k) / \\sum_{k'=1}^{K} p^+_{jk'i} P(\\mu_{k'})$, and from above by $w^+_{jki} = \\min\\{p^+_{jki} P(\\mu_k) / \\sum_{k'=1}^{K} p^-_{jk'i} P(\\mu_{k'}), 1\\}$. Let $w^{(+)}_{jki} = w^+_{jki}$ if $x_{jd} \\ge 0$ and $w^{(+)}_{jki} = w^-_{jki}$ otherwise, and let $w^{(-)}_{jki} = w^-_{jki}$ if $x_{jd} \\ge 0$ and $w^{(-)}_{jki} = w^+_{jki}$ otherwise. Then\n\n$$|\\hat{\\mu}_{kdi} - \\tilde{\\mu}_{kdi}| = \\left| \\hat{\\mu}_{kdi} - \\frac{\\sum_{j=1}^{n_i} w_{jki} x_{jd}}{\\sum_{j=1}^{n_i} w_{jki}} \\right| \\le \\max \\left\\{ \\left| \\hat{\\mu}_{kdi} - \\frac{\\sum_{j=1}^{n_i} w^{(+)}_{jki} x_{jd}}{\\sum_{j=1}^{n_i} w^-_{jki}} \\right|, \\left| \\hat{\\mu}_{kdi} - \\frac{\\sum_{j=1}^{n_i} w^{(-)}_{jki} x_{jd}}{\\sum_{j=1}^{n_i} w^+_{jki}} \\right| \\right\\} \\qquad (2)$$\n\nA corollary of Hoeffding's [4] Theorem 2 is that, with probability at least $1 - \\delta$, the sampling error is bounded by\n\n$$|\\tilde{\\mu}_{kdi} - \\mu_{kdi}| \\le \\sqrt{\\frac{R_d^2 \\ln(2/\\delta) \\sum_{j=1}^{n_i} w_{jki}^2}{2 \\left( \\sum_{j=1}^{n_i} w_{jki} \\right)^2}} \\qquad (3)$$\n\nwhere $R_d$ is the range of the $d$th coordinate of the data (assumed known¹). This bound is independent of the distribution of the data, which will ensure that our results are valid even if the data was not truly generated by a mixture of Gaussians, as is often the case in practice. On the other hand, the bound is more conservative than distribution-dependent ones, requiring more samples to reach the same guarantees.\n\nThe initialization step is error-free, assuming the finite- and infinite-data algorithms are initialized with the same means. 
Therefore the weighting error in the first iteration is zero, and Equation 3 bounds the total error. From this we can bound the weighting error in the second iteration according to Equation 2, and therefore bound the total error by the sum of Equations 2 and 3, and so on for each iteration until the algorithms converge. If the finite- and infinite-data EM converge in the same number of iterations $m$, the loss due to finite data is $L(M_{\\vec{n}}, M_\\infty) = \\sum_{k=1}^{K} \\sum_{d=1}^{D} |\\hat{\\mu}_{kdm} - \\mu_{kdm}|^2$ (see Equation 1).\n\n¹ Although a normally distributed variable has infinite range, our experiments show that assuming a sufficiently wide finite range does not significantly affect the results.\n\nAssume that the convergence criterion is $\\sum_{k=1}^{K} \\|\\mu_{ki} - \\mu_{k,i-1}\\|^2 \\le \\gamma$. In general (with probability specified below), infinite-data EM converges at one of the iterations for which the minimum possible change in mean positions is below $\\gamma$, and is guaranteed to converge at the first iteration for which the maximum possible change is below $\\gamma$. More precisely, it converges at one of the iterations for which $\\sum_{k=1}^{K} \\sum_{d=1}^{D} (\\max\\{|\\hat{\\mu}_{kd,i-1} - \\hat{\\mu}_{kdi}| - |\\hat{\\mu}_{kd,i-1} - \\mu_{kd,i-1}| - |\\hat{\\mu}_{kdi} - \\mu_{kdi}|, 0\\})^2 \\le \\gamma$, and is guaranteed to converge at the first iteration for which $\\sum_{k=1}^{K} \\sum_{d=1}^{D} (|\\hat{\\mu}_{kd,i-1} - \\hat{\\mu}_{kdi}| + |\\hat{\\mu}_{kd,i-1} - \\mu_{kd,i-1}| + |\\hat{\\mu}_{kdi} - \\mu_{kdi}|)^2 \\le \\gamma$. To obtain a bound for $L(M_{\\vec{n}}, M_\\infty)$, finite-data EM must be run until the latter condition holds. Let $I$ be the set of iterations at which infinite-data EM could have converged. Then we finally obtain\n\n$$L(M_{\\vec{n}}, M_\\infty) \\le \\max_{i \\in I} \\sum_{k=1}^{K} \\sum_{d=1}^{D} \\left( |\\hat{\\mu}_{kdm} - \\hat{\\mu}_{kdi}| + |\\hat{\\mu}_{kdi} - \\mu_{kdi}| \\right)^2 \\qquad (4)$$\n\nwhere $m$ is the total number of iterations carried out. This bound holds if all of the Hoeffding bounds (Equation 3) hold. 
Since each of these bounds fails with probability at most $\\delta$, the bound above fails with probability at most $\\delta^* = KDm\\delta$ (by the union bound). As a result, the growth with $K$, $D$ and $m$ of the number of examples required to reach a given loss bound with a given probability is only $O(\\sqrt{\\ln KDm})$.\n\nThe bound we have just derived utilizes run-time information, namely the distance of each example to each mean along each coordinate in each iteration. This allows it to be tighter than a priori bounds. Notice also that it would be trivial to modify the treatment for any other loss criterion that depends only on the terms $|\\hat{\\mu}_{kdm} - \\mu_{kdm}|$ (e.g., absolute loss).\n\n3 A Fast EM Algorithm\n\nWe now apply the previous section's result to reduce the number of examples used by EM at each iteration while keeping the loss bounded. We call the resulting algorithm VFEM. The goal is to learn in minimum time a model whose loss relative to EM applied to infinite data is at most $\\epsilon^*$ with probability at least $1 - \\delta^*$. (The reason to use $\\epsilon^*$ instead of $\\epsilon$ will become apparent below.) Using the notation of the previous section, if $n_i$ examples are used at each iteration then the running time of EM is $O(KD \\sum_{i=1}^{m} n_i)$, and can be minimized by minimizing $\\sum_{i=1}^{m} n_i$. Assume for the moment that the number of iterations $m$ is known. Then, using Equation 1, we can state the goal more precisely as follows.\n\nGoal: Minimize $\\sum_{i=1}^{m} n_i$, subject to the constraint that $\\sum_{k=1}^{K} \\|\\hat{\\mu}_{km} - \\mu_{km}\\|^2 \\le \\epsilon^*$ with probability at least $1 - \\delta^*$.\n\nA sufficient condition for $\\sum_{k=1}^{K} \\|\\hat{\\mu}_{km} - \\mu_{km}\\|^2 \\le \\epsilon^*$ is that $\\forall k \\; \\|\\hat{\\mu}_{km} - \\mu_{km}\\| \\le \\sqrt{\\epsilon^*/K}$. 
We thus proceed by first minimizing $\\sum_{i=1}^{m} n_i$ subject to $\\|\\hat{\\mu}_{km} - \\mu_{km}\\| \\le \\sqrt{\\epsilon^*/K}$ separately for each mean.² In order to do this, we need to express $\\|\\hat{\\mu}_{km} - \\mu_{km}\\|$ as a function of the $n_i$'s. By the triangle inequality, $\\|\\hat{\\mu}_{ki} - \\mu_{ki}\\| \\le \\|\\hat{\\mu}_{ki} - \\tilde{\\mu}_{ki}\\| + \\|\\tilde{\\mu}_{ki} - \\mu_{ki}\\|$. By Equation 3, $\\|\\tilde{\\mu}_{ki} - \\mu_{ki}\\| \\le \\sqrt{\\frac{1}{2} R^2 \\ln(2/\\delta) \\sum_{j=1}^{n_i} w_{jki}^2 / (\\sum_{j=1}^{n_i} w_{jki})^2}$, where $R^2 = \\sum_{d=1}^{D} R_d^2$ and $\\delta = \\delta^*/KDm$ per the discussion following Equation 4. The $(\\sum_{j=1}^{n_i} w_{jki})^2 / \\sum_{j=1}^{n_i} w_{jki}^2$ term is a measure of the diversity of the weights, being equal to $1/(1 - \\mathrm{Gini}(W'_{ki}))$, where $W'_{ki}$ is the vector of normalized weights $w'_{jki} = w_{jki} / \\sum_{j'=1}^{n_i} w_{j'ki}$. It attains a minimum of 1 when all the weights but one are zero, and a maximum of $n_i$ when all the weights are equal and non-zero. However, we would like to have a measure whose maximum is independent of $n_i$, so that it remains approximately constant whatever the value of $n_i$ chosen (for sufficiently large $n_i$). The measure will then depend only on the underlying distribution of the data. Thus we define $\\beta_{ki} = (\\sum_{j=1}^{n_i} w_{jki})^2 / (n_i \\sum_{j=1}^{n_i} w_{jki}^2)$, obtaining $\\|\\tilde{\\mu}_{ki} - \\mu_{ki}\\| \\le \\sqrt{R^2 \\ln(2/\\delta) / (2 \\beta_{ki} n_i)}$. Also, $\\|\\hat{\\mu}_{ki} - \\tilde{\\mu}_{ki}\\| = \\sqrt{\\sum_{d=1}^{D} |\\hat{\\mu}_{kdi} - \\tilde{\\mu}_{kdi}|^2}$, with $|\\hat{\\mu}_{kdi} - \\tilde{\\mu}_{kdi}|$ bounded by Equation 2. To keep the analysis tractable, we upper-bound this term by a function proportional to $\\|\\hat{\\mu}_{k,i-1} - \\mu_{k,i-1}\\|$. This captures the notion that the weighting error in one iteration should increase with the total error in the previous one. Combining this with the bound for $\\|\\tilde{\\mu}_{ki} - \\mu_{ki}\\|$, we obtain\n\n$$\\|\\hat{\\mu}_{ki} - \\mu_{ki}\\| \\le \\alpha_{ki} \\|\\hat{\\mu}_{k,i-1} - \\mu_{k,i-1}\\| + \\sqrt{\\frac{R^2 \\ln(2/\\delta)}{2 \\beta_{ki} n_i}} \\qquad (5)$$\n\nwhere $\\alpha_{ki}$ is the proportionality constant. Given this equation and $\\|\\hat{\\mu}_{k0} - \\mu_{k0}\\| = 0$, it can be shown by induction that\n\n$$\\|\\hat{\\mu}_{km} - \\mu_{km}\\| \\le \\sum_{i=1}^{m} \\frac{r_{ki}}{\\sqrt{n_i}} \\qquad (6)$$\n\nwhere\n\n$$r_{ki} = \\sqrt{\\frac{R^2 \\ln(2/\\delta)}{2 \\beta_{ki}}} \\prod_{l=i+1}^{m} \\alpha_{kl} \\qquad (7)$$\n\nThe target bound will thus be satisfied by minimizing $\\sum_{i=1}^{m} n_i$ subject to $\\sum_{i=1}^{m} (r_{ki} / \\sqrt{n_i}) = \\sqrt{\\epsilon^*/K}$.³ Finding the $n_i$'s by the method of Lagrange multipliers yields\n\n$$n_i = \\frac{K}{\\epsilon^*} \\left( \\sum_{j=1}^{m} r_{ki}^{1/3} r_{kj}^{2/3} \\right)^2 \\qquad (8)$$\n\nThis equation will produce a required value of $n_i$ for each mean. To guarantee the desired $\\epsilon^*$, it is sufficient to make $n_i$ equal to the maximum of these values.\n\n² This will generally lead to a suboptimal solution; improving it is a matter for future work.\n\n³ This may lead to a suboptimal solution for the $n_i$'s, in the unlikely case that $\\|\\hat{\\mu}_{km} - \\mu_{km}\\|$ increases with them.\n\nThe VFEM algorithm consists of a sequence of runs of EM, with each run using more examples than the last, until the bound $L(M_{\\vec{n}}, M_\\infty) \\le \\epsilon^*$ is satisfied, with $L(M_{\\vec{n}}, M_\\infty)$ bounded according to Equation 4. In the first run, VFEM postulates a maximum number of iterations $m$, and uses it to set $\\delta = \\delta^*/KDm$. If $m$ is exceeded, for the next run it is set to 50% more than the number needed in the current run. (A new run will be carried out if either the $\\delta^*$ or $\\epsilon^*$ target is not met.) The number of examples used in the first run of EM is the same for all iterations, and is set to $1.1(K/2)(R/\\epsilon^*)^2 \\ln(2/\\delta)$. This is 10% more than the number of examples that would theoretically be required in the best possible case (no weighting errors in the last iteration, leading to a pure Hoeffding bound, and a uniform distribution of examples among mixture components). The numbers of examples for subsequent runs are set according to Equation 8. For iterations beyond the last one in the previous run, the number of examples is set as for the first run. 
A run of EM is terminated when $\\sum_{k=1}^{K} \\sum_{d=1}^{D} (|\\hat{\\mu}_{kd,i-1} - \\hat{\\mu}_{kdi}| + |\\hat{\\mu}_{kd,i-1} - \\mu_{kd,i-1}| + |\\hat{\\mu}_{kdi} - \\mu_{kdi}|)^2 \\le \\gamma$ (see discussion preceding Equation 4), or two iterations after $\\sum_{k=1}^{K} \\|\\hat{\\mu}_{ki} - \\hat{\\mu}_{k,i-1}\\|^2 \\le \\gamma/3$, whichever comes first. The latter condition avoids overly long unproductive runs. If the user target bound is $\\epsilon$, $\\epsilon^*$ is set to $\\min\\{\\epsilon, \\gamma/3\\}$, to facilitate meeting the first criterion above. When the convergence threshold for infinite-data EM is not reached even when using the whole training set, VFEM reports that it was unable to find a bound; otherwise the bound obtained is reported.\n\nVFEM ensures that the total number of examples used in one run is always at least twice the number $n$ used in the previous run. This is done by, if $\\sum n_i < 2n$, setting the $n_i$'s instead to $n'_i = 2n(n_i / \\sum n_i)$. If at any point $\\sum n_i > mN$, where $m$ is the number of iterations carried out and $N$ is the size of the full training set, $\\forall i \\; n_i = N$ is used. Thus, assuming that the number of iterations does not decrease with the number of examples, VFEM's total running time is always less than three times the time taken by the last run of EM. (The worst case occurs when the one-but-last run is carried out on almost the full training set.)\n\nThe run-time information gathered in one run is used to set the $n_i$'s for the next run. We compute each $\\alpha_{ki}$ as $\\|\\hat{\\mu}_{ki} - \\tilde{\\mu}_{ki}\\| / \\|\\hat{\\mu}_{k,i-1} - \\mu_{k,i-1}\\|$. The approximations made in the derivation will be good, and the resulting $n_i$'s accurate, if the means' paths in the current run are similar to those in the previous run. This may not be true in the earlier runs, but their running time will be negligible compared to that of later runs, where the assumption of path similarity from one run to the next should hold. 
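The allocation of Equation 8 and the doubling rule described above can be sketched as follows. This is only an illustration under stated assumptions: the list r stands for the values r_ki of a single mean k, which are taken as given rather than estimated from a previous run, and the function names allocate_examples and enforce_doubling are hypothetical. Sizes are returned as real numbers; rounding up to integers and taking the maximum across means would be done by the caller.

```python
import math


def allocate_examples(r, eps_star, num_means):
    '''Per-iteration sample sizes from Equation 8 for one mean.

    r[i] plays the role of r_ki (assumed positive). The result makes
    sum(r[i] / sqrt(n[i])) equal to sqrt(eps_star / num_means).
    '''
    total = sum(ri ** (2.0 / 3.0) for ri in r)
    scale = num_means / eps_star
    # (sum_j r_i^(1/3) r_j^(2/3))^2 factors as (r_i^(1/3) * total)^2.
    return [scale * (ri ** (1.0 / 3.0) * total) ** 2 for ri in r]


def enforce_doubling(n, prev_total):
    '''Rescale n so its sum is at least twice the previous run's total.'''
    total = sum(n)
    if total >= 2 * prev_total:
        return list(n)
    # Keep the proportions n_i / sum(n) and stretch the sum to 2 * prev_total.
    return [2 * prev_total * ni / total for ni in n]
```

Iterations with a larger r_ki, that is, those whose sampling error propagates more strongly to the final means, receive more examples under this allocation.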
\n\n4 Experiments\n\nWe conducted a series of experiments on large synthetic data sets to compare VFEM with EM. All data sets were generated by mixtures of spherical Gaussians with means $\\mu_k$ in the unit hypercube. Each data set was generated according to three parameters: the dimensionality $D$, the number of mixture components $K$, and the standard deviation $\\sigma$ of each coordinate in each component. The means were generated one at a time by sampling each dimension uniformly from the range $(2\\sigma, 1 - 2\\sigma)$. This ensured that most of the data points generated were within the unit hypercube. The range of each dimension in VFEM was set to one. Rather than discard points outside the unit hypercube, we left them in to test VFEM's robustness to outliers. Any $\\mu_k$ that was less than $(\\sqrt{D}/K)\\sigma$ away from a previously generated mean was rejected and regenerated, since problems with very close means are unlikely to be solvable by either EM or VFEM. Examples were generated by choosing one of the means $\\mu_k$ with uniform probability, and setting the value of each dimension of the example by randomly sampling from a Gaussian distribution with mean $\\mu_{kd}$ and standard deviation $\\sigma$. We compared VFEM to EM on 64 data sets of 10 million examples each, generated by using every possible combination of the following parameters: $D \\in \\{4, 8, 12, 16\\}$; $K \\in \\{3, 4, 5, 6\\}$; $\\sigma \\in \\{.01, .03, .05, .07\\}$. In each run the two algorithms were initialized with the same means, randomly selected with the constraint that no two be less than $\\sqrt{D}/(2K)$ apart. VFEM was allowed to converge before EM's guaranteed convergence criterion was met (see discussion preceding Equation 4). 
All experiments were run on a 1 GHz Pentium III machine under Linux, with $\\gamma = 0.0001DK$, $\\delta^* = 0.05$, and $\\epsilon^* = \\min\\{0.01, \\gamma\\}$.\n\nTable 1: Experimental results. Values are averages over the number of runs shown. Times are in seconds, and #EA is the total number of example accesses made by the algorithm, in millions.\n\nRuns      Algorithm  #Runs  Time  #EA    Loss  D     K    $\\sigma$\nBound     VFEM       40     217   1.21   2.51  10.5  4.2  0.029\nBound     EM         40     3457  19.75  2.51  10.5  4.2  0.029\nNo bound  VFEM       24     7820  43.19  1.20  9.1   4.9  0.058\nNo bound  EM         24     4502  27.91  1.20  9.1   4.9  0.058\nAll       VFEM       64     3068  16.95  2.02  10    4.5  0.04\nAll       EM         64     3849  22.81  2.02  10    4.5  0.04\n\nThe results are shown in Table 1. Losses were computed relative to the true means, with the best match between true means and empirical ones found by greedy search. Results for runs in which VFEM achieved and did not achieve the required $\\epsilon^*$ and $\\delta^*$ bounds are reported separately. VFEM achieved the required bounds and was able to stop early on 62.5% of its runs. When it found a bound, it was on average 16 times faster than EM. When it did not, it was on average 73% slower. The losses of the two algorithms were virtually identical in both situations. VFEM was more likely to converge rapidly for higher $D$'s and lower $K$'s and $\\sigma$'s. When achieved, the average loss bound for VFEM was 0.006554, and for EM it was 0.000081. In other words, the means produced by both algorithms were virtually identical to those that would be obtained with infinite data.⁴\n\nWe also compared VFEM and EM on a large real-world data set, obtained by recording a week of Web page requests from the entire University of Washington campus. The data is described in detail in Wolman et al. 
[7], and the preprocessing carried out for these experiments is described in Domingos & Hulten [3]. The goal was to cluster patterns of Web access in order to support distributed caching. On a dataset with $D = 10$ and 20 million examples, with $\\delta^* = 0.05$, $\\gamma = 0.001$, $\\epsilon^* = \\gamma/3$, $K = 3$, and $\\sigma = 0.01$, VFEM achieved a loss bound of 0.00581 and was two orders of magnitude faster than EM (62 seconds vs. 5928), while learning essentially the same means.\n\nVFEM's speedup relative to EM will generally approach infinity as the data set size approaches infinity. The key question is thus: what are the data set sizes at which VFEM becomes worthwhile? The tentative evidence from these experiments is that they will be in the millions. Databases of this size are now common, and their growth continues unabated, auguring well for the use of VFEM.\n\n5 Conclusion\n\nLearning algorithms can be sped up by minimizing the number of examples used in each step, under the constraint that the loss between the resulting model and the one that would be obtained with infinite data remain bounded. In this paper we applied this method to the EM algorithm for mixtures of Gaussians, and observed the resulting speedups on a series of large data sets.\n\n⁴ The much higher loss values relative to the true means, however, indicate that infinite-data EM would often find only local optima (unless the greedy search itself only found a suboptimal match).\n\nAcknowledgments\n\nThis research was partly supported by NSF CAREER and IBM Faculty awards to the first author, and by a gift from the Ford Motor Company.\n\nReferences\n\n[1] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977. 
\n\n[2] P. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 71-80, Boston, MA, 2000. ACM Press.\n\n[3] P. Domingos and G. Hulten. A general method for scaling up machine learning algorithms and its application to clustering. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 106-113, Williamstown, MA, 2001. Morgan Kaufmann.\n\n[4] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30, 1963.\n\n[5] C. Meek, B. Thiesson, and D. Heckerman. The learning-curve method applied to clustering. Technical Report MSR-TR-01-34, Microsoft Research, Redmond, WA, 2000.\n\n[6] B. Thiesson, C. Meek, and D. Heckerman. Accelerating EM for large databases. Technical Report MSR-TR-99-31, Microsoft Research, Redmond, WA, 2001.\n\n[7] A. Wolman, G. Voelker, N. Sharma, N. Cardwell, M. Brown, T. Landray, D. Pinnel, A. Karlin, and H. Levy. Organization-based analysis of Web-object sharing and caching. In Proceedings of the Second USENIX Conference on Internet Technologies and Systems, pp. 25-36, Boulder, CO, 1999.\n\n[8] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pp. 103-114, Montreal, Canada, 1996. ACM Press.", "award": [], "sourceid": 2064, "authors": [{"given_name": "Pedro", "family_name": "Domingos", "institution": null}, {"given_name": "Geoff", "family_name": "Hulten", "institution": null}]}