{"title": "On Neural Networks with Minimal Weights", "book": "Advances in Neural Information Processing Systems", "page_first": 246, "page_last": 252, "abstract": null, "full_text": "On Neural Networks with Minimal Weights\n\nVasken Bohossian\n\nJehoshua Bruck\n\nCalifornia Institute of Technology\n\nMail Code 136-93\n\nPasadena, CA 91125\n\nE-mail: {vincent, bruck}@paradise.caltech.edu\n\nAbstract\n\nLinear threshold elements are the basic building blocks of artificial neural networks. A linear threshold element computes a function that is a sign of a weighted sum of the input variables. The weights are arbitrary integers; actually, they can be very big integers, exponential in the number of the input variables. However, in practice, it is difficult to implement big weights. In the present literature a distinction is made between the two extreme cases: linear threshold functions with polynomial-size weights as opposed to those with exponential-size weights. The main contribution of this paper is to fill up the gap by further refining that separation. Namely, we prove that the class of linear threshold functions with polynomial-size weights can be divided into subclasses according to the degree of the polynomial. In fact, we prove a more general result: that there exists a minimal weight linear threshold function for any arbitrary number of inputs and any weight size. To prove those results we have developed a novel technique for constructing linear threshold functions with minimal weights.\n\n1 Introduction\n\nHuman brains are by far superior to computers for solving hard problems like combinatorial optimization and image and speech recognition, although their basic building blocks are several orders of magnitude slower. 
This observation has boosted interest in the field of artificial neural networks [Hopfield 82], [Rumelhart 82]. The latter are built by interconnecting multiple artificial neurons (or linear threshold gates), whose behavior is inspired by that of biological neurons. Artificial neural networks have found promising applications in pattern recognition, learning and other data processing tasks. However, most of the research has been oriented towards the practical aspect of neural networks, simulating or building networks for particular tasks and then comparing their performance with that of more traditional methods for those particular tasks. To compare neural networks to other computational models one needs to develop the theoretical settings in which to estimate their capabilities and limitations.\n\n1.1 Linear Threshold Gate\n\nThe present paper focuses on the study of a single linear threshold gate (artificial neuron) with binary inputs and output as well as integer weights (synaptic coefficients). Such a gate is mathematically described by a linear threshold function.\n\nDefinition 1 (Linear Threshold Function)\nA linear threshold function of $n$ variables is a Boolean function $f : \\{-1, 1\\}^n \\to \\{-1, 1\\}$ that can be written as\n\n$$f(x) = \\operatorname{sgn}(F(x)) = \\begin{cases} 1, & \\text{for } F(x) \\ge 0 \\\\ -1, & \\text{otherwise} \\end{cases} \\qquad \\text{where } F(x) = w \\cdot x = \\sum_{i=1}^{n} w_i x_i$$\n\nfor any $x \\in \\{-1, 1\\}^n$ and a fixed $w \\in \\mathbb{Z}^n$.\n\nAlthough we could allow the weights $w_i$ to be real numbers, it is known [Muroga 71], [Raghavan 88] that for a binary input neuron, one needs $O(n \\log n)$ bits per weight, where $n$ is the number of inputs. 
So in the rest of the paper, we will assume without loss of generality that all weights are integers.\n\n1.2 Motivation\n\nMany experimental results in the area of neural networks have indicated that the magnitudes of the coefficients in the linear threshold elements grow very fast with the size of the inputs and therefore limit the practical use of the network. One natural question to ask is the following. How limited is the computational power of the network if one limits oneself to threshold elements with only \"small\" growth in the size of the coefficients? To answer that question we have to define a measure of the magnitudes of the weights. Note that, given a function $f$, the weight vector $w$ is not unique (see Example 1 below).\n\nDefinition 2 (Weight Space)\nGiven a linear threshold function $f$ we define $W$ as the set of all weights that satisfy Definition 1, that is $W = \\{ w \\in \\mathbb{Z}^n : \\forall x \\in \\{-1, 1\\}^n, \\operatorname{sgn}(w \\cdot x) = f(x) \\}$.\n\nHere follows a measure of the size of the weights.\n\nDefinition 3 (Minimal Weight Size)\nWe define the size of a weight vector as the sum of the absolute values of the weights. The minimal weight size of a linear threshold function is defined as:\n\n$$S[f] = \\min_{w \\in W} \\left( \\sum_{i=1}^{n} |w_i| \\right)$$\n\nThe particular vector that achieves the minimum is called a minimal weight vector.\n\nNaturally, $S[f]$ is a function of $n$.\n\nIt has been shown [Hastad 94], [Myhill 61], [Shawe-Taylor 92], [Siu 91] that there exists a linear threshold function that can be implemented by a single threshold element with exponentially growing weights, $S[f] \\sim 2^n$, but cannot be implemented by a threshold element with smaller, polynomially growing weights, $S[f] \\sim n^d$, $d$ constant. 
In light of that result the above question was dealt with by defining a class within the set of linear threshold functions: the class of functions with \"small\" (i.e., polynomially growing) weights [Siu 91]. Most of the recent research focuses on the power of circuits with small weights, relative to circuits with arbitrary weights [Goldmann 92], [Goldman 93]. Rather than dealing with circuits we are interested in studying a single threshold gate. The main contribution of the present paper is to further refine the division of small versus arbitrary weights. We separate the set of functions with small weights into classes indexed by $d$, the degree of polynomial growth, and show that all of them are non-empty. In particular, we develop a technique for proving that a weight vector is minimal. We use that technique to construct a function of size $S[f] = s$ for an arbitrary $s$.\n\n1.3 Approach\n\nThe main difficulty in analyzing the size of the weights of a threshold element is due to the fact that a single linear threshold function can be implemented by different sets of weights as shown in the following example.\n\nExample 1 (A Threshold Function with Minimal Weights)\nConsider the following two sets of weights (weight vectors).\n$w_1 = (1, 2, 4)$, $F_1(x) = x_1 + 2x_2 + 4x_3$\n$w_2 = (2, 4, 8)$, $F_2(x) = 2x_1 + 4x_2 + 8x_3$\n\nThey both implement the same threshold function\n\n$$f(x) = \\operatorname{sgn}(F_2(x)) = \\operatorname{sgn}(2F_1(x)) = \\operatorname{sgn}(F_1(x))$$\n\nA closer look reveals that $f(x) = \\operatorname{sgn}(x_3)$, implying that none of the above weight vectors has minimal size. Indeed, the minimal one is $w_3 = (0, 0, 1)$ and $S[f] = 1$.\n\nIt is in general difficult to determine if a given set of weights is minimal [Amaldi 93], [Willis 63]. 
Our technique consists of limiting the study to only a particular subset of linear threshold functions, a subset for which it is possible to prove that a given weight vector is minimal. That subset is loosely defined by the requirement that there exist input vectors for which $f(x) = f(-x)$. The existence of such a vector, called a root of $f$, puts a constraint on the weight vector used to implement $f$. The larger the set of roots, the larger the constraint on the set of weight vectors, which in turn helps determine the minimal one. A detailed description of the technique is given in Section 2.\n\n1.4 Organization\n\nHere follows a brief outline of the rest of the paper. Section 2 mathematically defines the setting of the problem and derives some basic results on the properties of functions that admit roots. Those results are used as building blocks for the proof of the main result in Section 3. Section 2 also introduces a construction method for functions with minimal weights. Section 3 presents the main result: for any weight size, $s$, and any number of inputs, $n$, there exists an $n$-input linear threshold function that requires weights of size $S[f] = s$. Section 4 presents some applications of the result of Section 3 and indicates future research directions.\n\n2 Construction of Minimal Threshold Functions\n\nThe present section defines the mathematical tools used to construct functions with minimal weights.\n\n2.1 Mathematical setting\n\nWe are interested in constructing functions for which the minimal weight is easily determined. Finding the minimal weight involves a search; we are therefore interested in finding functions with a constrained weight space. The following tools allow us to put constraints on $W$. 
\n\nDefinition 4 (Root Space of a Boolean Function)\nA vector $v \\in \\{-1, 1\\}^n$ such that $f(v) = f(-v)$ is called a root of $f$. We define the root space, $R$, as the set of all roots of $f$.\n\nDefinition 5 (Root Generator Matrix)\nFor a given weight vector $w \\in W$ and a root $v \\in R$, the root generator matrix, $G = (g_{ij})$, is a $(k \\times n)$-matrix, with entries in $\\{-1, 0, 1\\}$, whose rows $g$ are orthogonal to $w$ and equal to $v$ at all non-zero coordinates, namely,\n\n1. $Gw = 0$\n2. $g_{ij} = 0$ or $g_{ij} = v_j$ for all $i$ and $j$.\n\nExample 2 (Root Generator Matrix)\nSuppose that we are given a linear threshold function specified by a weight vector $w = (1, 1, 2, 4, 1, 1, 2, 4)$. By inspection we determine one root $v = (1, 1, 1, 1, -1, -1, -1, -1)$. Notice that $w_1 + w_2 - w_7 = 0$, which can be written as $g \\cdot w = 0$, where $g = (1, 1, 0, 0, 0, 0, -1, 0)$ is a row of $G$. Set $r = v - 2g$. Since $g$ is equal to $v$ at all non-zero coordinates, $r \\in \\{-1, 1\\}^n$. Also $r \\cdot w = v \\cdot w - 2 g \\cdot w = 0$. We have generated a new root: $r = (-1, -1, 1, 1, -1, -1, 1, -1)$.\n\nLemma 6 (Orthogonality of G and W)\nFor a given weight vector $w \\in W$ and a root $v \\in R$,\n\n$$u G^T = 0$$\n\nholds for any weight vector $u \\in W$.\nProof. For an arbitrary $u \\in W$ and an arbitrary row, $g_i$, of $G$, let $r = v - 2g_i$. By definition of $g_i$, $r \\in \\{-1, 1\\}^n$ and $r \\cdot w = 0$. That implies $f(r) = f(-r)$: $r$ is a root of $f$. For any weight vector $u \\in W$, $\\operatorname{sgn}(u \\cdot r) = \\operatorname{sgn}(-u \\cdot r)$. Therefore $u \\cdot (v - 2g_i) = 0$ and finally, since $v \\cdot u = 0$ we get $u \\cdot g_i = 0$. $\\Box$\n\nLemma 7 (Minimality)\nFor a given weight vector $w \\in W$ and a root $v \\in R$, if $\\operatorname{rank}(G) = n - 1$ (i.e., $G$ has $n - 1$ independent rows) and $|w_i| = 1$ for some $i$, then $w$ is the minimal weight vector.\nProof. From Lemma 6 any weight vector $u$ satisfies $u G^T = 0$. 
$\\operatorname{rank}(G) = n - 1$ implies that $\\dim(W) = 1$, i.e., all possible weight vectors are integer multiples of each other. Since $|w_i| = 1$, all vectors are of the form $u = kw$, for $k \\ge 1$. Therefore $w$ has the smallest size. $\\Box$\n\nWe complete Example 2 with an application of Lemma 7.\n\nExample 3 (Minimality)\nGiven $w = (1, 1, 2, 4, 1, 1, 2, 4)$ and $v = (1, 1, 1, 1, -1, -1, -1, -1)$ we can construct:\n\n$$G = \\begin{pmatrix} 1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\\\ 0 & 1 & 0 & 0 & 0 & -1 & 0 & 0 \\\\ 0 & 0 & 1 & 0 & 0 & 0 & -1 & 0 \\\\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & -1 \\\\ 1 & 0 & 0 & 0 & 0 & -1 & 0 & 0 \\\\ 1 & 1 & 0 & 0 & 0 & 0 & -1 & 0 \\\\ 1 & 1 & 1 & 0 & 0 & 0 & 0 & -1 \\end{pmatrix}$$\n\nIt is easy to verify that $\\operatorname{rank}(G) = n - 1 = 7$ and therefore, by Lemma 7, $w$ is minimal and $S[f] = 16$.\n\n2.2 Construction of minimal weight vectors\n\nIn Example 3 we saw how, given a weight vector, one can show that it is minimal. In this section we present an example of a linear threshold function with minimal weight size, with an arbitrary number of input variables.\n\nWe would like to construct a weight vector and show that it is minimal. Let the number of inputs, $n$, be even. Let $w$ consist of two identical blocks: $(w_1, w_2, \\ldots, w_{n/2}, w_1, w_2, \\ldots, w_{n/2})$. Clearly, $v = (1, 1, \\ldots, 1, -1, -1, \\ldots, -1)$ is a root and $G$ is the corresponding generator matrix.\n\n$$G = \\begin{pmatrix} 1 & 0 & \\cdots & 0 & -1 & 0 & \\cdots & 0 \\\\ 0 & 1 & \\cdots & 0 & 0 & -1 & \\cdots & 0 \\\\ \\vdots & & \\ddots & & \\vdots & & \\ddots & \\\\ 0 & 0 & \\cdots & 1 & 0 & 0 & \\cdots & -1 \\end{pmatrix}$$\n\n3 The Main Result\n\nThe following theorem states that given an integer $s$ and a number of variables $n$ there exists a function of $n$ variables and minimal weight size $s$.\n\nTheorem 8 (Main Result)\nFor any pair $(s, n)$ that satisfies\n\n1. $n \\le s \\le 2^{n/2}$, for $n$ even; [...], for $n$ odd\n\n2. 
$s$ is even\n\nthere exists a linear threshold function of $n$ variables, $f$, with minimal weight size $S[f] = s$.\nProof. Given a pair $(s, n)$ that satisfies the above conditions, we first construct a weight vector $w$ that satisfies $\\sum_{i=1}^{n} |w_i| = s$, then show that it is the minimal weight vector of the function $f(x) = \\operatorname{sgn}(w \\cdot x)$. The proof is shown only for $n$ even.\nCONSTRUCTION.\n\n1. Define $(a_1, a_2, \\ldots, a_{n/2}) = (1, 1, \\ldots, 1)$.\n\n2. If $\\sum_{i=1}^{n/2} a_i < s/2$ then increase by one the smallest $a_i$ such that $a_i < 2^{i-2}$. (In the case of a tie take the $a_i$ with smallest index $i$.)\n\n3. Repeat the previous step until $\\sum_{i=1}^{n/2} a_i = s/2$ or $(a_1, a_2, \\ldots, a_{n/2}) = (1, 1, 2, 4, \\ldots, 2^{n/2 - 2})$.\n\n4. Set $w = (a_1, a_2, \\ldots, a_{n/2}, a_1, a_2, \\ldots, a_{n/2})$.\n\nBecause we increase the size by one unit at a time the algorithm will converge to the desired result for any even integer $s$ that satisfies $n \\le s \\le 2^{n/2}$. We have a construction for any valid $(s, n)$ pair. Let us show that $w$ is minimal.\nMINIMALITY. Given that $w = (a_1, a_2, \\ldots, a_{n/2}, a_1, a_2, \\ldots, a_{n/2})$ we find a root $v = (1, 1, \\ldots, 1, -1, -1, \\ldots, -1)$ and $n/2$ rows of the generator matrix $G$ corresponding to the equations $w_i = w_{i + n/2}$. To form additional rows note that the first $k$ $a_i$'s are powers of two (where $k$ depends on $s$ and $n$). Those can be written as $a_i = \\sum_{j=1}^{i-1} a_j$ and generate $k - 1$ rows. And finally note that all other $a_i$, $i > k$, are smaller than $2^{k+1}$. Hence, they can be written as a binary expansion $a_i = \\sum_{j=1}^{k} a_{ij} a_j$ where $a_{ij} \\in \\{0, 1\\}$. There are $n/2 - k$ such weights. $G$ has a total of $n - 1$ independent rows. $\\operatorname{rank}(G) = n - 1$ and $w_1 = 1$, therefore by Lemma 7, $w$ is minimal and $S[f] = s$. $\\Box$\n\nExample 4 (A Function of 10 variables and size $S[f] = 26$)\nWe start with $a = (1, 1, 1, 1, 1)$. 
We iterate: (1,1,2,1,1), (1,1,2,2,1), (1,1,2,2,2), (1,1,2,3,2), (1,1,2,3,3), (1,1,2,4,3), (1,1,2,4,4), and finally (1,1,2,4,5). The construction algorithm converges to $a = (1, 1, 2, 4, 5)$. We claim that $w = (a, a) = (1, 1, 2, 4, 5, 1, 1, 2, 4, 5)$ is minimal. Indeed, $v = (1, 1, 1, 1, 1, -1, -1, -1, -1, -1)$ and\n\n$$G = \\begin{pmatrix} 1 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 \\\\ 0 & 1 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\\\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & -1 & 0 & 0 \\\\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & -1 & 0 \\\\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & -1 \\\\ 1 & 0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\\\ 1 & 1 & 0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 \\\\ 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & -1 & 0 \\\\ 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & -1 \\end{pmatrix}$$\n\nis a matrix of rank 9.\n\nExample 5 (Functions with Polynomial Size)\nThis example shows an application of Theorem 8. We define $\\widehat{LT}^{(d)}$ as the set of linear threshold functions for which $S[f] \\le n^d$. The Theorem states that for any even $n$ there exists a function $f$ of $n$ variables and minimum weight $S[f] = n^d$. The implication is that for all $d$, $\\widehat{LT}^{(d-1)}$ is a proper subset of $\\widehat{LT}^{(d)}$.\n\n4 Conclusions\n\nWe have shown that for any reasonable pair of integers $(n, s)$, where $s$ is even, there exists a linear threshold function of $n$ variables with minimal weight size $S[f] = s$. We have developed a novel technique for constructing linear threshold functions with minimal weights that is based on the existence of root vectors. An interesting application of our method is the computation of a lower bound on the number of linear threshold functions [Smith 66]. In addition, our technique can help in studying the trade-offs between a number of important parameters associated with linear threshold (neural) circuits, including the number of elements, the number of layers, the fan-in, fan-out and the size of the weights. 
\n\nAcknowledgements\n\nThis work was supported in part by the NSF Young Investigator Award CCR-9457811, by the Sloan Research Fellowship, by a grant from the IBM Almaden Research Center, San Jose, California, by a grant from the AT&T Foundation and by the Center for Neuromorphic Systems Engineering as a part of the National Science Foundation Engineering Research Center Program; and by the California Trade and Commerce Agency, Office of Strategic Technology.\n\nReferences\n\n[Amaldi 93] E. Amaldi and V. Kann. The complexity and approximability of finding maximum feasible subsystems of linear relations. Ecole Polytechnique Federale de Lausanne Technical Report, ORWP 93/11, August 1993.\n\n[Goldmann 92] M. Goldmann, J. Hastad, and A. Razborov. Majority gates vs. general weighted threshold gates. Computational Complexity, (2):277-300, 1992.\n\n[Goldman 93] M. Goldmann and M. Karpinski. Simulating threshold circuits by majority circuits. In Proc. 25th ACM STOC, pages 551-560, 1993.\n\n[Hastad 94] J. Hastad. On the size of weights for threshold gates. SIAM J. Disc. Math., 7:484-492, 1994.\n\n[Hopfield 82] J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proc. of the USA National Academy of Sciences, 79:2554-2558, 1982.\n\n[Muroga 71] S. Muroga. Threshold Logic and its Applications. Wiley-Interscience, 1971.\n\n[Myhill 61] J. Myhill and W. H. Kautz. On the size of weights required for linear-input switching functions. IRE Trans. Electronic Computers, (EC-10):288-290, 1961.\n\n[Raghavan 88] P. Raghavan. Learning in threshold networks: a computational model and applications. Technical Report RC 13859, IBM Research, July 1988.\n\n[Rumelhart 82] D. Rumelhart and J. McClelland. 
Parallel distributed processing: Explorations in the microstructure of cognition. MIT Press, 1982.\n\n[Shawe-Taylor 92] J. S. Shawe-Taylor, M. H. G. Anthony, and W. Kern. Classes of feedforward neural networks and their circuit complexity. Neural Networks, 5:971-977, 1992.\n\n[Siu 91] K. Siu and J. Bruck. On the power of threshold circuits with small weights. SIAM J. Disc. Math., 4(3):423-435, August 1991.\n\n[Smith 66] D. R. Smith. Bounds on the number of threshold functions. IEEE Transactions on Electronic Computers, June 1966.\n\n[Willis 63] D. G. Willis. Minimum weights for threshold switches. In Switching Theory in Space Techniques. Stanford University Press, Stanford, Calif., 1963.\n", "award": [], "sourceid": 1066, "authors": [{"given_name": "Vasken", "family_name": "Bohossian", "institution": null}, {"given_name": "Jehoshua", "family_name": "Bruck", "institution": null}]}