{"title": "MIMIC: Finding Optima by Estimating Probability Densities", "book": "Advances in Neural Information Processing Systems", "page_first": 424, "page_last": 430, "abstract": null, "full_text": "MIMIC: Finding Optima by Estimating Probability Densities\n\nJeremy S. De Bonet, Charles L. Isbell, Jr., Paul Viola\n\nArtificial Intelligence Laboratory\nMassachusetts Institute of Technology\nCambridge, MA 02139\n\nAbstract\n\nIn many optimization problems, the structure of solutions reflects complex relationships between the different input parameters. For example, experience may tell us that certain parameters are closely related and should not be explored independently. Similarly, experience may establish that a subset of parameters must take on particular values. Any search of the cost landscape should take advantage of these relationships. We present MIMIC, a framework in which we analyze the global structure of the optimization landscape. A novel and efficient algorithm for the estimation of this structure is derived. We use knowledge of this structure to guide a randomized search through the solution space and, in turn, to refine our estimate of the structure. Our technique obtains significant speed gains over other randomized optimization procedures.\n\n1 Introduction\n\nGiven some cost function C(x) with local minima, we may search for the optimal x in many ways. Variations of gradient descent are perhaps the most popular. When most of the minima are far from optimal, the search must either include a brute-force component or incorporate randomization. Classical examples include Simulated Annealing (SA) and Genetic Algorithms (GAs) (Kirkpatrick, Gelatt and Vecchi, 1983; Holland, 1975). 
In all cases, in the process of optimizing C(x) many thousands or perhaps millions of samples of C(x) are evaluated. Most optimization algorithms take these millions of pieces of information and compress them into a single point x, the current estimate of the solution (one notable exception is GAs, to which we will return shortly). Imagine splitting the search process into two parts, both taking t/2 time steps. Both parts are structurally identical: taking a description of C(), they start their search from some initial point. The sole benefit enjoyed by the second part of the search over the first is that the initial point is perhaps closer to the optimum. Intuitively, there must be some additional information that could be learned from the first half of the search, if only to warn the second half about avoidable mistakes and pitfalls.\n\nWe present an optimization algorithm called Mutual-Information-Maximizing Input Clustering (MIMIC). It attempts to communicate information about the cost function obtained from one iteration of the search directly to later iterations of the search. It does this in an efficient and principled way. There are two main components of MIMIC: first, a randomized optimization algorithm that samples from those regions of the input space most likely to contain the minimum for C(); second, an effective density estimator that can be used to capture a wide variety of structure on the input space, yet is computable from simple second-order statistics on the data. MIMIC's results on simple cost functions indicate an order of magnitude improvement in performance over related approaches. Further experiments on a k-color map coloring problem yield similar improvements. 
\n\n2 Related Work\n\nMany well-known optimization procedures neither represent nor utilize the structure of the optimization landscape. In contrast, Genetic Algorithms (GAs) attempt to capture this structure by an ad hoc embedding of the parameters onto a line (the chromosome). The intent of the crossover operation in standard genetic algorithms is to preserve and propagate a group of parameters that might be partially responsible for generating a favorable evaluation. Even when such groups exist, many of the offspring generated do not preserve the structure of these groups because the choice of crossover point is random.\n\nIn problems where the benefit of a parameter is completely independent of the value of all other parameters, the population simply encodes information about the probability distribution over each parameter. In this case, the crossover operation is equivalent to sampling from this distribution; the more crossovers, the better the sample. Even in problems where fitness is obtained through the combined effects of clusters of inputs, the GA crossover operation is beneficial only when its randomly chosen clusters happen to closely match the underlying structure of the problem. Because of the rarity of such a fortuitous occurrence, the benefit of the crossover operation is greatly diminished. As a result, GAs have a checkered history in function optimization (Baum, Boneh and Garrett, 1995; Lang, 1995). One of our goals is to incorporate insights from GAs in a principled optimization framework.\n\nThere have been other attempts to capture the advantages of GAs. Population Based Incremental Learning (PBIL) attempts to incorporate the notion of a candidate population by replacing it with a single probability vector (Baluja and Caruana, 1995). 
Each element of the vector is the probability that a particular bit in a solution is on. During the learning process, the probability vector can be thought of as a simple model of the optimization landscape. Bits whose values are firmly established have probabilities that are close to 1 or 0. Those that are still unknown have probabilities close to 0.5.\n\nWhen it is the structure of the components of a candidate rather than the particular values of the components that determines how it fares, it can be difficult to move PBIL's representation towards a viable solution. Nevertheless, even in these sorts of problems PBIL often outperforms genetic algorithms because those algorithms are hindered by the fact that random crossovers are infrequently beneficial.\n\nA distinct but related technique was proposed by Sabes and Jordan for a reinforcement learning task (Sabes and Jordan, 1995). In their framework, the learner must generate actions so that a reinforcement function can be completely explored. Simultaneously, the learner must exploit what it has learned so as to optimize the long-term reward. Sabes and Jordan chose to construct a Boltzmann distribution from the reinforcement function: $p(x) = \\frac{\\exp(R(x)/T)}{Z_T}$, where R(x) is the reinforcement function for action x, T is the temperature, and $Z_T$ is a normalization factor. They use this distribution to generate actions. At high temperatures this distribution approaches the uniform distribution, and results in random exploration of R(). At low temperatures only those actions which garner large reinforcement are generated. By reducing T, the learner progresses from an initially randomized search to a more directed search about the true optimal action. 
Interestingly, their estimate for p(x) is to some extent a model of the optimization landscape which is constructed during the learning process. To our knowledge, Sabes and Jordan have neither attempted optimization over high-dimensional spaces, nor attempted to fit p(x) with a complex model.\n\n3 MIMIC\n\nKnowing nothing else about C(x), it might not be unreasonable to search for its minimum by generating points from a uniform distribution over the inputs p(x). Such a search allows none of the information generated by previous samples to affect the generation of subsequent samples. Not surprisingly, much less work might be necessary if samples were generated from a distribution, $p^{\\theta}(x)$, that is uniformly distributed over those x's where $C(x) \\le \\theta$ and has a probability of 0 elsewhere. For example, if we had access to $p^{\\theta_M}(x)$ for $\\theta_M = \\min_x C(x)$, a single sample would be sufficient to find an optimum.\n\nThis insight suggests a process of successive approximation: given a collection of points for which $C(x) \\le \\theta_0$, a density estimator for $p^{\\theta_0}(x)$ is constructed. From this density estimator additional samples are generated, a new threshold established, $\\theta_1 = \\theta_0 - \\epsilon$, and a new density estimator created. The process is repeated until the values of C(x) cease to improve.\n\nThe MIMIC algorithm begins by generating a random population of candidates chosen uniformly from the input space. From this population the median fitness is extracted and is denoted $\\theta_0$. The algorithm then proceeds:\n\n1. Update the parameters of the density estimator of $p^{\\theta_i}(x)$ from a sample.\n2. Generate more samples from the distribution $p^{\\theta_i}(x)$.\n3. Set $\\theta_{i+1}$ equal to the Nth percentile of the data. 
Retain only the points less than $\\theta_{i+1}$.\n\nThe validity of this approach is dependent on two critical assumptions: $p^{\\theta}(x)$ can be successfully approximated with a finite amount of data; and $D(p^{\\theta-\\epsilon}(x) \\| p^{\\theta}(x))$ is small enough so that samples from $p^{\\theta}(x)$ are also likely to be samples from $p^{\\theta-\\epsilon}(x)$ (where $D(p \\| q)$ is the Kullback-Leibler divergence between p and q). Bounds on these conditions can be used to prove convergence in a finite number of successive approximation steps.\n\nThe performance of this approach is dependent on the nature of the density approximator used. We have chosen to estimate the conditional distributions for every pair of parameters in the representation, a total of $O(n^2)$ numbers. In the next section we will show how we use these conditional distributions to construct a joint distribution which is closest in the KL sense to the true joint distribution. Such an approximator is capable of representing clusters of highly related parameters. While this might seem similar to the intuitive behavior of crossover, this representation is strictly more powerful. More importantly, our clusters are learned from the data, and are not pre-defined by the programmer.\n\n4 Generating Events from Conditional Probabilities\n\nThe joint probability distribution over a set of random variables, $X = \\{X_i\\}$, is:\n\n$$p(X) = p(X_1 \\mid X_2 \\ldots X_n)\\, p(X_2 \\mid X_3 \\ldots X_n) \\cdots p(X_{n-1} \\mid X_n)\\, p(X_n) \\quad (1)$$\n\nGiven only pairwise conditional probabilities, $p(X_i \\mid X_j)$, and unconditional probabilities, $p(X_i)$, we are faced with the task of generating samples that match as closely as possible the true joint distribution, p(X). 
It is not possible to capture all possible joint distributions of n variables using only the unconditional and pairwise conditional probabilities; however, we would like to describe the true joint distribution as closely as possible. Below, we derive an algorithm for choosing such a description.\n\nGiven a permutation of the numbers between 1 and n, $\\pi = i_1 i_2 \\ldots i_n$, we define a class of probability distributions, $\\hat{p}_{\\pi}(X)$:\n\n$$\\hat{p}_{\\pi}(X) = p(X_{i_1} \\mid X_{i_2})\\, p(X_{i_2} \\mid X_{i_3}) \\cdots p(X_{i_{n-1}} \\mid X_{i_n})\\, p(X_{i_n}) \\quad (2)$$\n\nThe distribution $\\hat{p}_{\\pi}(X)$ uses $\\pi$ as an ordering for the pairwise conditional probabilities. Our goal is to choose the permutation $\\pi$ that maximizes the agreement between $\\hat{p}_{\\pi}(X)$ and the true distribution p(X). The agreement between two distributions can be measured by the Kullback-Leibler divergence:\n\n$$\\begin{aligned} D(p \\| \\hat{p}_{\\pi}) &= \\int_X p\\,[\\log p - \\log \\hat{p}_{\\pi}]\\,dX \\\\ &= E_p[\\log p] - E_p[\\log \\hat{p}_{\\pi}] \\\\ &= -h(p) - E_p[\\log p(X_{i_1} \\mid X_{i_2})\\, p(X_{i_2} \\mid X_{i_3}) \\cdots p(X_{i_{n-1}} \\mid X_{i_n})\\, p(X_{i_n})] \\\\ &= -h(p) + h(X_{i_1} \\mid X_{i_2}) + h(X_{i_2} \\mid X_{i_3}) + \\cdots + h(X_{i_{n-1}} \\mid X_{i_n}) + h(X_{i_n}). \\end{aligned}$$\n\nThis divergence is always non-negative, and is zero only when $\\hat{p}_{\\pi}(X)$ and p(X) are identical distributions. The optimal $\\pi$ is defined as the one that minimizes this divergence. For a distribution that can be completely described by pairwise conditional probabilities, the optimal $\\pi$ will generate a distribution that is identical to the true distribution. Insofar as the true distribution cannot be captured this way, the optimal $\\hat{p}_{\\pi}(X)$ will diverge from that distribution.\n\nThe first term in the divergence does not depend on $\\pi$. Therefore, the cost function, $J_{\\pi}(X)$, we wish to minimize is:\n\n$$J_{\\pi}(X) = h(X_{i_1} \\mid X_{i_2}) + h(X_{i_2} \\mid X_{i_3}) + \\cdots + h(X_{i_{n-1}} \\mid X_{i_n}) + h(X_{i_n}) \\quad (3)$$\n\nThe optimal $\\pi$ is the one that produces the lowest pairwise entropy with respect to the true distribution. By searching over all n! permutations, it is possible to determine the optimal $\\pi$. 
In the interests of computational efficiency, we employ a straightforward greedy algorithm to pick a permutation:\n\n1. $i_n = \\arg\\min_j \\hat{h}(X_j)$.\n2. $i_k = \\arg\\min_j \\hat{h}(X_j \\mid X_{i_{k+1}})$, where $j \\neq i_{k+1} \\ldots i_n$ and $k = n-1, n-2, \\ldots, 2, 1$,\n\nwhere $\\hat{h}()$ is the empirical entropy. Once a distribution is chosen, generating samples is also straightforward:\n\n1. Choose a value for $X_{i_n}$ based on its empirical probability $\\hat{p}(X_{i_n})$.\n2. For $k = n-1, n-2, \\ldots, 2, 1$, choose element $X_{i_k}$ based on the empirical conditional probability $\\hat{p}(X_{i_k} \\mid X_{i_{k+1}})$.\n\nThe first algorithm runs in time $O(n^2)$ and the second in time $O(n^2)$.\n\n5 Experiments\n\nTo measure the performance of MIMIC, we performed three benchmark experiments and compared our results with those obtained using several standard optimization algorithms.\n\nWe will use four algorithms in our comparisons:\n\n1. MIMIC: the algorithm above with 200 samples per iteration\n2. PBIL: standard population based incremental learning\n3. RHC: randomized hill climbing\n4. GA: a standard genetic algorithm with single crossover and 10% mutation rate\n\n5.1 Four Peaks\n\nThe Four Peaks problem is taken from (Baluja and Caruana, 1995). Given an N-dimensional input vector X, the four peaks evaluation function is defined as:\n\n$$f(X, T) = \\max[\\mathrm{tail}(0, X), \\mathrm{head}(1, X)] + R(X, T) \\quad (4)$$\n\nwhere\n\n$$\\mathrm{tail}(b, X) = \\text{number of trailing } b\\text{'s in } X \\quad (5)$$\n$$\\mathrm{head}(b, X) = \\text{number of leading } b\\text{'s in } X \\quad (6)$$\n$$R(X, T) = \\begin{cases} N & \\text{if } \\mathrm{tail}(0, X) > T \\text{ and } \\mathrm{head}(1, X) > T \\\\ 0 & \\text{otherwise} \\end{cases} \\quad (7)$$\n\nThere are two global maxima for this function. 
They are achieved either when there are T + 1 leading 1's followed by all 0's or when there are T + 1 trailing 0's preceded by all 1's. There are also two suboptimal local maxima that occur with a string of all 1's or all 0's. For large values of T, this problem becomes increasingly difficult because the basins of attraction for the inferior local maxima become larger.\n\nResults for running the algorithms are shown in figure 1. In all trials, T was set to be 10% of N, the total number of inputs. The MIMIC algorithm consistently maximizes the function with approximately one tenth the number of evaluations required by the second best algorithm.\n\n[Figure 1 plot: Function Evaluations Required to Maximize 4 Peaks, versus number of inputs, for MIMIC, PBIL, RHC and GA.]\n\nFigure 1: Number of evaluations of the Four-Peak cost function for different algorithms plotted for a variety of problem sizes.\n\n5.2 Six Peaks\n\nThe Six Peaks problem is a slight variation on Four Peaks where\n\n$$R(X, T) = \\begin{cases} N & \\text{if } \\mathrm{tail}(0, X) > T \\text{ and } \\mathrm{head}(1, X) > T, \\text{ or } \\mathrm{tail}(1, X) > T \\text{ and } \\mathrm{head}(0, X) > T \\\\ 0 & \\text{otherwise} \\end{cases} \\quad (8)$$\n\nThis function has two additional global maxima where there are T + 1 leading 0's followed by all 1's or when there are T + 1 trailing 1's preceded by all 0's. 
In this case, it is not the values of the candidates that are important, but their structure: the first T + 1 positions should take on the same value, the last T + 1 positions should take on the same value, these two groups should take on different values, and the middle positions should all take on the same value.\n\nResults for this problem are shown in figure 2. As might be expected, PBIL performed worse than on the Four Peaks problem because it tends to oscillate in the middle of the space while contradictory signals pull it back and forth. The random crossover operation of the GA occasionally was able to capture some of the underlying structure, resulting in an improved relative performance of the GA. As we expected, the MIMIC algorithm was able to capture the underlying structure of the problem, and combine information from all the maxima. Thus MIMIC consistently maximizes the Six Peaks function with approximately one fiftieth the number of evaluations required by the other algorithms.\n\n5.3 Max K-Coloring\n\nA graph is K-Colorable if it is possible to assign one of K colors to each of the nodes of the graph such that no adjacent nodes have the same color. Determining whether a graph is K-Colorable is known to be NP-Complete. Here, we define Max K-Coloring to be the task of finding a coloring that minimizes the number of adjacent pairs colored the same.\n\nResults for this problem are shown in figure 2. We used a subset of graphs with a single solution (up to permutations of color) so that the optimal solution is dependent only on the structure of the parameters. Because of this, PBIL performs poorly. 
GAs perform better because any crossover point is representative of some of the underlying structure of the graphs used. Finally, MIMIC performs best because it is able to capture all of the structural regularity within the inputs.\n\n[Figure 2 plots: Function Evaluations Required to Maximize 6 Peaks (left) and Function Evaluations Required to Maximize K-Coloring (right), each versus number of inputs, for MIMIC, PBIL, RHC and GA.]\n\nFigure 2: Number of evaluations of the Six-Peak cost function (left) and the K-Color cost function (right) for a variety of problem sizes.\n\n6 Conclusions\n\nWe have described MIMIC, a novel optimization algorithm that converges faster and more reliably than several other existing algorithms. MIMIC accomplishes this in two ways. First, it performs optimization by successively approximating the conditional distribution of the inputs given a bound on the cost function. Throughout this process, the optimum of the cost function becomes gradually more likely. As a result, MIMIC directly communicates information about the cost function from the early stages to the later stages of the search. Second, MIMIC attempts to discover common underlying structure about optima by computing second-order statistics and sampling from a distribution consistent with those statistics. 
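As a rough illustration of the second mechanism, the following Python sketch orders categorical variables into a chain using empirical entropies computed from pairwise counts, then samples from that chain. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (entropy, cond_entropy, greedy_chain, sample_chain) are hypothetical, samples are assumed to be equal-length tuples of hashable values, and ties are broken by lowest index.

```python
import math
import random
from collections import Counter


def entropy(column):
    # Empirical entropy of one variable (or of a list of tuples), in nats.
    n = len(column)
    return -sum(c / n * math.log(c / n) for c in Counter(column).values())


def cond_entropy(child, parent):
    # Empirical conditional entropy h(child | parent) = h(child, parent) - h(parent).
    return entropy(list(zip(child, parent))) - entropy(parent)


def greedy_chain(samples):
    # Returns the ordering [i_n, i_(n-1), ..., i_1]: the lowest-entropy variable
    # first, then repeatedly the unused variable with the lowest entropy
    # conditioned on the variable chosen just before it.
    cols = list(zip(*samples))
    remaining = set(range(len(cols)))
    order = [min(sorted(remaining), key=lambda j: entropy(cols[j]))]
    remaining.remove(order[0])
    while remaining:
        prev = cols[order[-1]]
        nxt = min(sorted(remaining), key=lambda j: cond_entropy(cols[j], prev))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def sample_chain(samples, order, rng):
    # Draw one new point: the first variable from its empirical marginal, then
    # each later variable from its empirical conditional given its predecessor.
    x = [None] * len(order)
    pool = samples
    for pos, i in enumerate(order):
        if pos > 0:
            parent = order[pos - 1]
            # Restrict to observed samples that agree with the parent value.
            pool = [s for s in samples if s[parent] == x[parent]]
        x[i] = rng.choice(pool)[i]
    return tuple(x)
```

In a full MIMIC loop, the samples passed in would be the points retained below the current threshold, and points drawn by sample_chain would be evaluated against the cost function before the threshold is lowered.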
\n\nAcknowledgments\n\nIn this research, Jeremy De Bonet is supported by the DOD Multidisciplinary Research Program of the University Research Initiative, Charles Isbell by a fellowship granted by AT&T Labs-Research, and Paul Viola by Office of Naval Research Grant No. N00014-96-1-0311. Greg Galperin helped in the preparation of this paper.\n\nReferences\n\nBaluja, S. and Caruana, R. (1995). Removing the genetics from the standard genetic algorithm. Technical report, Carnegie Mellon University.\n\nBaum, E. B., Boneh, D., and Garrett, C. (1995). Where genetic algorithms excel. In Proceedings of the Conference on Computational Learning Theory, New York. Association for Computing Machinery.\n\nHolland, J. H. (1975). Adaptation in Natural and Artificial Systems. The Michigan University Press.\n\nKirkpatrick, S., Gelatt, C., and Vecchi, M. (1983). Optimization by Simulated Annealing. Science, 220(4598):671-680.\n\nLang, K. (1995). Hill climbing beats genetic search on a boolean circuit synthesis problem of Koza's. In Twelfth International Conference on Machine Learning.\n\nSabes, P. N. and Jordan, M. I. (1995). Reinforcement learning by probability matching. In David S. Touretzky, M. M. and Perrone, M., editors, Advances in Neural Information Processing, volume 8, Denver 1995. MIT Press, Cambridge.", "award": [], "sourceid": 1328, "authors": [{"given_name": "Jeremy", "family_name": "De Bonet", "institution": null}, {"given_name": "Charles", "family_name": "Isbell", "institution": null}, {"given_name": "Paul", "family_name": "Viola", "institution": null}]}