{"title": "Memory Capacity of Linear vs. Nonlinear Models of Dendritic Integration", "book": "Advances in Neural Information Processing Systems", "page_first": 157, "page_last": 163, "abstract": null, "full_text": "Memory Capacity of Linear vs. Nonlinear Models of Dendritic Integration\n\nPanayiota Poirazi*\nBiomedical Engineering Department\nUniversity of Southern California\nLos Angeles, CA 90089\npoirazi@scf.usc.edu\n\nBartlett W. Mel*\nBiomedical Engineering Department\nUniversity of Southern California\nLos Angeles, CA 90089\nmel@lnc.usc.edu\n\nAbstract\n\nPrevious biophysical modeling work showed that nonlinear interactions among nearby synapses located on active dendritic trees can provide a large boost in the memory capacity of a cell (Mel, 1992a, 1992b). The aim of our present work is to quantify this boost by estimating the capacity of (1) a neuron model with passive dendritic integration where inputs are combined linearly across the entire cell followed by a single global threshold, and (2) an active dendrite model in which a threshold is applied separately to the output of each branch, and the branch subtotals are combined linearly. We focus here on the limiting case of binary-valued synaptic weights, and derive expressions which measure model capacity by estimating the number of distinct input-output functions available to both neuron types. We show that (1) the application of a fixed nonlinearity to each dendritic compartment substantially increases the model's flexibility, (2) for a neuron of realistic size, the capacity of the nonlinear cell can exceed that of the same-sized linear cell by more than an order of magnitude, and (3) the largest capacity boost occurs for cells with a relatively large number of dendritic subunits of relatively small size. 
We validated the analysis by empirically measuring memory capacity with randomized two-class classification problems, where a stochastic delta rule was used to train both linear and nonlinear models. We found that large capacity boosts predicted for the nonlinear dendritic model were readily achieved in practice.\n\n*http://lnc.usc.edu\n\n1  Introduction\n\nBoth physiological evidence and connectionist theory support the notion that in the brain, memories are stored in the pattern of learned synaptic weight values. Experiments in a variety of neuronal preparations, however, indicate that the efficacy of synaptic transmission can undergo substantial fluctuations up or down, or both, during brief trains of synaptic stimuli. Large fluctuations in synaptic efficacy on short time scales seem inconsistent with the conventional connectionist assumption of stable, high-resolution synaptic weight values. Furthermore, a recent experimental study suggests that excitatory synapses in the hippocampus (a region implicated in certain forms of explicit memory) may exist in only a few long-term stable states, where the continuous grading of synaptic strength seen in standard measures of long-term potentiation (LTP) may exist only in the average over a large population of two-state synapses with randomly staggered thresholds for learning (Petersen, Malenka, Nicoll, & Hopfield, 1998). According to conventional connectionist notions, the possibility that individual synapses hold only one or two bits of long-term state information would seem to have serious implications for the storage capacity of neural tissue. Exploration of this question is one of the main themes of this paper. 
\n\nIn a related vein, we have found in previous biophysical modeling studies that nonlinear interactions between synapses co-activated on the same branch of an active dendritic tree could provide an alternative form of long-term storage capacity. This capacity, which is largely orthogonal to that tied up in conventional synaptic weights, is contained instead in the spatial permutation of synaptic connections onto the dendritic tree, which could in principle be modified in the course of learning or development (Mel, 1992a, 1992b). In a more abstract setting, we recently showed that a large repository of model flexibility lies in the choice as to which of a large number of possible interaction terms available in high dimension is actually included in a learning machine's discriminant function, and that the excess capacity contained in this \"choice flexibility\" can be quantified using straightforward counting arguments (Poirazi & Mel, 1999).\n\n2  Two Alternative Models of Dendritic Integration\n\nIn this paper, we use a similar function-counting approach to address the more biologically relevant case of a neuron with multiple quasi-independent dendritic compartments (fig. 1). Our primary objective has been to compare the memory capacity of a cell assuming two different modes of dendritic integration. According to the linear model, the neuron's activation level $a_L(x)$ prior to thresholding is given by a weighted sum of its inputs over the cell as a whole. According to the nonlinear model, the k synaptic inputs to each branch are first combined linearly, a static (e.g. 
sigmoidal) nonlinearity is applied to each of the m branch subtotals, and the resulting branch outputs are summed to produce the cell's overall activity $a_N(x)$:\n\n$$a_L(x) = \\sum_{j=1}^{m} \\sum_{i=1}^{k} x_{ij}, \\qquad a_N(x) = \\sum_{j=1}^{m} g\\left(\\sum_{i=1}^{k} x_{ij}\\right) \\qquad (1)$$\n\nHere $x_{ij}$ denotes the input driving synaptic site i on branch j. The expressions for $a_L$ and $a_N$ were written in similar form to emphasize that the models have an identical number of synaptic weights, differing only in the presence or absence of a fixed nonlinear function g applied to the branch subtotals. Though individual synaptic weights in both models are constrained to have a value of 1, any of the d input lines may form multiple connections on the same or different branches as a means of representing graded synaptic strengths. Similarly, an input line which forms no connection has an implicit weight of 0. In light of this restriction to positive (or zero) weight values, both the linear and nonlinear models are split into two opponent channels $a^+$ and $a^-$ dedicated to positive vs. negative coefficients, respectively. This leads to a final output for each model:\n\n$$y_L(x) = \\mathrm{sgn}\\left[a_L^+(x) - a_L^-(x)\\right], \\qquad y_N(x) = \\mathrm{sgn}\\left[a_N^+(x) - a_N^-(x)\\right] \\qquad (2)$$\n\nwhere the sgn operator maps the total activation level into a class label of {-1, 1}.\n\nFigure 1: A cell is modeled as a set of m identical branches connected to a soma, where each branch contains k synaptic contacts driven by one of d distinct input lines.\n\nIn the following, we derive expressions for the number of distinct parameter states available to the linear vs. nonlinear models, a measure which we have found to be a reliable predictor of storage capacity under certain restrictions (Poirazi & Mel, 1999). 
Based on these expressions, we compute the capacity boost provided by the branch nonlinearity as a function of the number of branches m, synaptic sites per branch k, and input space dimensionality d. Finally, we test the predictions of the analytical model by training both linear and nonlinear models on randomized classification problems using a stochastic delta rule, and empirically measure and compare the storage capacities of the two models.\n\n3  Results\n\n3.1  Counting Parameter States: Linear vs. Nonlinear Model\n\nWe derived expressions for $B_L$ and $B_N$, which estimate the total number of parameter bits available to the linear vs. nonlinear models, respectively:\n\n$$B_N = 2 \\log_2 \\binom{\\binom{k+d-1}{k} + m - 1}{m}, \\qquad B_L = 2 \\log_2 \\binom{s+d-1}{s} \\qquad (3)$$\n\nwhere $s = m \\cdot k$ is the total number of synaptic sites. These expressions estimate the number of non-redundant states in each neuron type, i.e., those assignments of input lines to dendritic sites which yield distinct input-output functions $y_L$ or $y_N$.\n\nThese formulae are plotted in figure 2A with d = 100, where each curve represents a cell with a fixed number of branches (indicated by m). In each case, the capacity increases steadily as the number of synapses per branch, k, is increased. The logarithmic growth in the capacity of the linear model (evident in an asymptotic analysis of the expression for $B_L$) is shown at the bottom of the graph (circles), from which it may be seen that the boost in capacity provided by the dendritic branch nonlinearity increases steadily with the number of synaptic sites. For a cell with 100 branches containing 100 synaptic sites each, the capacity boost relative to the linear model exceeds a factor of 20. 
\n\nFigure 2B shows that for a given total number of synaptic sites, in this case s = m\u00b7k = 10,000, the capacity of the nonlinear cell is maximized for a specific choice of m and k. The peak of each of the three curves (computed for different values of d) occurs for a cell containing 1,250 branches with 8 synapses each. However, the capacity is only moderately sensitive to the branch count: the capacity of a cell with 100 branches of 100 synapses each, for example, lies within a factor of two of the optimal configuration. The linear cell capacities can be found at the far right edge of the plot (m = 10,000), since a nonlinear model with one synapse per branch has a number of trainable states identical to that of a linear model.\n\n3.2  Validating the Analytical Model\n\nTo test the predictions of the analytical model, we trained both linear and nonlinear cells on randomized two-class classification problems. Training samples were drawn from a 40-dimensional spherical Gaussian distribution and were randomly assigned positive or negative labels; in some runs, training patterns were evenly divided between positive and negative labels, with similar results. Each of the 40 original input dimensions was recoded using a set of 10 one-dimensional binary, non-overlapping receptive fields with centers spaced along each dimension such that all receptive fields would be activated equally often. This manipulation mapped the original 40-dimensional learning problem into 400 dimensions, thereby increasing the discriminability of the training samples. The relative memory capacity of linear vs. nonlinear cells was then determined empirically by comparing the number of training patterns learnable at a fixed error rate of 2%. 
\n\nThe learning rule used for both cell types was similar to the \"clusteron\" learning rule described in (Mel, 1992a), and involved two mechanisms known to contribute to neural development: (1) random activity-independent synapse formation, and (2) activity-dependent synapse stabilization. In each iteration, a set of 25 synapses was chosen at random, and the \"worst\" synapse was identified based on the correlation over the training set of (i) the input's pre-synaptic activity, (ii) the post-synaptic activity (i.e. the local nonlinear branch response for the nonlinear energy model or a constant of 1 for the linear model), and (iii) a global \"delta\" signal with a value of 0 if the cell responded correctly to the input pattern, or \u00b11 if the cell responded incorrectly. The poorest-performing synapse on the branch was then targeted for replacement with a new synapse drawn at random from the d input lines. The probability that the replacement actually occurred was given by a Boltzmann equation based on the difference in the training set error rates before and after the replacement. A \"temperature\" variable was gradually lowered over the course of the simulation, which was terminated when no further improvement in error rates was seen.\n\n[Figure 2, panel A: Capacity of Linear vs. Nonlinear Model for Various Geometries (d = 100); x-axis: total synaptic sites. Panel B: Capacity of Linear vs. Nonlinear Model for Different Input Space Dimensions (s = 10,000); x-axis: number of branches (m).]\n\nFigure 2: Comparison of linear vs. nonlinear model capacity as a function of branch geometry. A. Capacity in bits for linear and several nonlinear cells with different branch counts (for d = 100). For each curve indexed by branch count m, sites per branch k increases from left to right as indicated iconically beneath the x-axis. For all cells, capacity increases with an increasing number of sites, though the capacity of the linear model grows logarithmically, leading to an increasingly large capacity boost for the size-matched nonlinear cells. B. Capacity of a nonlinear model with 10,000 sites for different values of input space dimension d. Branch count m grows along the x-axis. Cells at the right edge of the plot contain only one synapse per branch, and thus have a number of modifiable parameters (and hence capacity) equivalent to that of the linear model. All three curves show that there exists an optimal geometry which maximizes the capacity of the nonlinear model (in this case 1,250 branches with 8 synapses each).\n\nResults of the learning runs are shown in fig. 3, where the analytical capacity (measured in bits) was scaled to the numerical capacity (measured in training patterns learned at 2% error). Two key features of the theoretical curves (dashed lines) are echoed in the empirical performance curves (solid lines), including the much larger storage capacity of the nonlinear cell model, and the specific cell geometry which maximizes the capacity boost. 
\n\n4  Discussion\n\nWe found using both analytical and numerical methods that in the limit of low-resolution synaptic weights, application of a fixed output nonlinearity to each compartment of a dendritic tree leads to a significant boost in capacity relative to a cell whose post-synaptic integration is linear. For example, given a cell with 10,000 synaptic contacts originating from 400 distinct input lines, the analysis predicts a 23-fold increase in capacity for the nonlinear cell, while numerical simulations using a stochastic delta rule actually achieve a 15-fold boost.\n\n[Figure 3 legend: dashed lines, Analytical (Bits/14); solid lines, Numerical (Training Patterns); curves labeled Nonlinear Model and Linear Model. x-axis: Number of Branches (m); y-axis: capacity.]\n\nFigure 3: Comparison of capacity boost predicted by analysis vs. that observed empirically when linear and nonlinear models were trained using the same stochastic delta rule. Dashed lines: analytical curves for linear vs. nonlinear model for a cell with 10,000 sites show capacity for varying cell geometries. Solid lines: empirical performance for the same two cells at 2% error criterion, using a subunit nonlinearity $g(x) = x^{10}$ (similar results were seen using a sigmoidal nonlinearity, though the parameters of the optimal sigmoid depended on the cell geometry). For both analytical and numerical curves, peak capacity is seen for a cell with 1,000 branches (10 synapses per branch). Capacity exceeds that of the same-sized linear model by a factor of 15 at the peak, and by more than a factor of 7 for cells ranging from about 3 to 60 synapses per branch (horizontal dotted line).\n\nGiven that a linear and a nonlinear model have an identical number of synaptic contacts with uniform synaptic weight values, what accounts for the capacity boost? The principal insight gained in this work is that the attachment of a fixed nonlinearity to each branch in a neuron substantially increases its underlying \"model flexibility\", i.e. confers upon the cell a much larger choice of distinct input-output relations from which to select during learning. This may be illustrated as follows. For the linear model, branching structure is irrelevant so that $y_L$ depends only on the number of input connections formed from each of the d input lines. All spatial permutations of a set of input connections are thus interchangeable and produce identical cell responses. This massive redundancy confines the capacity of the linear model to grow only logarithmically with an increasing number of synaptic sites (fig. 2A), an unfortunate limitation for a brain in which the formation of large numbers of synaptic contacts between neurons is routine. 
In contrast, the model with nonlinear subunits contains many fewer redundancies: most spatial permutations of the same set of input connections lead to non-identical values of $y_N$, since an input x swapped from branch $b_1$ to branch $b_2$ leads to the elimination of the k - 1 interaction terms involving x on branch $b_1$ and the creation of k - 1 new interaction terms on branch $b_2$.\n\nInterestingly, the particular form of the branch nonlinearity has virtually no effect on the capacity of the cell as far as the counting arguments are concerned (though it can have a profound effect on the cell's \"representational bias\"; see below), since the principal effect of the nonlinearity in our capacity calculations is to break the symmetry among the different branches.\n\nThe issue of representational bias is a critical one, however, and must be considered when attempting to predict absolute or relative performance rates for particular classifiers confronted with specific learning problems. Thus, intrinsic differences in the geometry of linear vs. nonlinear discriminant functions mean that the parameters available to the two models may be better or worse suited to solve a given learning problem, even if the two models were equated for total parameter flexibility. While such biases are not taken into account in our analysis, they could nonetheless have a substantial effect on measured error rates, and could thus throw a performance advantage to one machine or the other. One danger is that performance differences measured empirically could be misinterpreted as arising from differences in underlying model capacity, when in fact they arise from differential suitability of the two classifiers for the learning problem at hand. 
To avoid this difficulty, the random classification problems we used to empirically assess memory capacity were chosen to level the playing field for the linear vs. nonlinear cells, since in a previous study we found that the coefficients on linear vs. nonlinear (quadratic) terms were about equally efficient as features for this task. In this way, differences in measured performance on these tasks were primarily attributable to underlying capacity differences, rather than differences in representational bias. This experimental control permitted more meaningful comparisons between our analytical and empirical tests (fig. 3).\n\nThe problem of representational bias crops up in a second guise, wherein the analytical expressions for capacity in eq. 3 can significantly overestimate the actual performance of the cell. This occurs when a particular ensemble of learning problems fails to utilize all of the entropy available in the cell's parameter space, for example by requiring the cell to visit only a small subset of its parameter states relatively often. This invalidates the maximum parameter entropy assumption made in the derivation of eq. 3, so that measured performance will tend to fall below predicted values. The actual performance of either model when confronted with an ensemble of learning problems will thus be determined by (1) the number of trainable parameters available to the neuron (as measured by eq. 3), (2) the suitability of the neuron's parameters for solving the assigned learning problems, and (3) the utilization of parameters, which relates to the entropy in the joint probability of the parameter values averaged over the ensemble of learning problems. 
In our comparisons here of linear and nonlinear cells, we have calculated (1), and have attempted to control for (2) and (3).\n\nIn conclusion, our results build upon the results of earlier biophysical simulations, and indicate that in the limit of a large number of low-resolution synaptic weights, nonlinear dendritic processing could nonetheless have a major impact on the storage capacity of neural tissue.\n\nReferences\n\nMel, B. W. (1992a). The clusteron: Toward a simple abstraction for a complex neuron. In Moody, J., Hanson, S., & Lippmann, R. (Eds.), Advances in Neural Information Processing Systems, vol. 4, pp. 35-42. Morgan Kaufmann, San Mateo, CA.\n\nMel, B. W. (1992b). NMDA-based pattern discrimination in a modeled cortical neuron. Neural Comp., 4, 502-516.\n\nPetersen, C. C. H., Malenka, R. C., Nicoll, R. A., & Hopfield, J. J. (1998). All-or-none potentiation at CA3-CA1 synapses. Proc. Natl. Acad. Sci. USA, 95, 4732-4737.\n\nPoirazi, P., & Mel, B. W. (1999). Choice and value flexibility jointly contribute to the capacity of a subsampled quadratic classifier. Neural Comp., in press.\n", "award": [], "sourceid": 1646, "authors": [{"given_name": "Panayiota", "family_name": "Poirazi", "institution": null}, {"given_name": "Bartlett", "family_name": "Mel", "institution": null}]}