{"title": "Sample Complexity for Learning Recurrent Perceptron Mappings", "book": "Advances in Neural Information Processing Systems", "page_first": 204, "page_last": 210, "abstract": null, "full_text": "Sample Complexity for Learning \nRecurrent Percept ron Mappings \n\nBhaskar Dasgupta \n\nEduardo D. Sontag \n\nDepartment of Computer Science \n\nUniversity of Waterloo \n\nWaterloo, Ontario N2L 3G 1 \n\nDepartment of Mathematics \n\nRutgers University \n\nNew Brunswick, NJ 08903 \n\nCANADA \n\nUSA \n\nbdasgupt~daisy.uwaterloo.ca \n\nsontag~control.rutgers.edu \n\nAbstract \n\nRecurrent perceptron classifiers generalize the classical perceptron \nmodel. They take into account those correlations and dependences \namong input coordinates which arise from linear digital filtering. \nThis paper provides tight bounds on sample complexity associated \nto the fitting of such models to experimental data. \n\n1 \n\nIntroduction \n\nOne of the most popular approaches to binary pattern classification, underlying \nmany statistical techniques, is based on perceptrons or linear discriminants; see \nfor instance the classical reference (Duda and Hart, 1973). In this context, one is \ninterested in classifying k-dimensional input patterns \n\nV=(Vl, . . . ,Vk) \n\ninto two disjoint classes A + and A -. A perceptron P which classifies vectors into \nA + and A -\nis characterized by a vector (of \"weights\") C E lR k, and operates as \nfollows. One forms the inner product \n\nC.V = CIVI + ... CkVk . \n\nIf this inner product is positive, v is classified into A +, otherwise into A - . \nIn signal processing and control applications, the size k of the input vectors v is \ntypically very large, so the number of samples needed in order to accurately \"learn\" \nan appropriate classifying perceptron is in principle very large. On the other hand, \nin such applications the classes A + and A-often can be separated by means of a \ndynamical system of fairly small dimensionality. 
The existence of such a dynamical system reflects the fact that the signals of interest exhibit context dependence and correlations, and this prior information can help in narrowing down the search for a classifier. Various dynamical system models for classification appear for instance when learning finite automata and languages (Giles et al., 1990) and in signal processing as a channel equalization problem (at least in the simplest 2-level case) when modeling linear channels transmitting digital data from a quantized source, e.g. (Baksho et al., 1991) and (Pulford et al., 1991).

Sample Complexity for Learning Recurrent Perceptron Mappings    205

When dealing with linear dynamical classifiers, the inner product c · v represents a convolution by a separating vector c that is the impulse response of a recursive digital filter of some order n ≪ k. Equivalently, one assumes that the data can be classified using a c that is n-recursive, meaning that there exist real numbers r_1, ..., r_n so that

c_j = Σ_{i=1}^{n} c_{j-i} r_i,  j = n+1, ..., k.

Seen in this context, the usual perceptrons are nothing more than the very special subclass of "finite impulse response" systems (all poles at zero); thus it is appropriate to call the more general class "recurrent" or "IIR (infinite impulse response)" perceptrons. Some authors, particularly Back and Tsoi (Back and Tsoi, 1991; Back and Tsoi, 1995), have introduced these ideas in the neural network literature. There is also related work in control theory dealing with such classifying, or more generally quantized-output, linear systems; see (Delchamps, 1989; Koplon and Sontag, 1993).

The problem that we consider in this paper is: if one assumes that there is an n-recursive vector c that serves to classify the data, and one knows n but not the particular vector, how many labeled samples v^(i) are needed so as to be able to reliably estimate c?
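As a concrete illustration of the recursion just defined, the following sketch (our own, not from the paper; the helper names are ours) extends n freely chosen weights to a length-k n-recursive sequence and applies the resulting perceptron rule, so that only n + n numbers determine a classifier on k-dimensional inputs.

```python
def extend_n_recursive(c_init, r, k):
    """Extend the first n weights to length k via the recursion
    c_j = sum_{i=1..n} c_{j-i} * r_i for j = n+1, ..., k (1-based)."""
    n = len(c_init)
    assert len(r) == n and k >= n
    c = list(c_init)
    for _ in range(k - n):
        # c[-i] is c_{j-i} when appending c_j
        c.append(sum(c[-i] * r[i - 1] for i in range(1, n + 1)))
    return c

def classify(c, v):
    """Perceptron rule: +1 if the inner product c . v is positive, else -1."""
    return 1 if sum(ci * vi for ci, vi in zip(c, v)) > 0 else -1

# A 2-recursive sequence of length 8 is determined by 2 + 2 = 4 numbers,
# even though the input patterns v live in dimension k = 8.
c = extend_n_recursive([1.0, 0.5], r=[0.9, -0.2], k=8)
label = classify(c, [1.0] * 8)
```

Here c is the impulse response of a second-order recursive filter; a plain (FIR) perceptron corresponds to taking all r_i = 0.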
More specifically, we want to be able to guarantee that any classifying vector consistent with the seen data will classify "correctly with high probability" the unseen data as well. This is done by computing the VC dimension of the related concept class and then applying well-known results from computational learning theory. Very roughly speaking, the main result is that the number of samples needed is proportional to the logarithm of the length k (as opposed to k itself, as would be the case if one did not take advantage of the recurrent structure). Another application of our results, again by appealing to the literature from computational learning theory, is to the case of "noisy" measurements or more generally data not exactly classifiable in this way; for example, our estimates show roughly that if one succeeds in classifying 95% of a data set of size proportional to log q, then with high confidence one is assured of a correspondingly small prediction error rate on future (unlabeled) samples.

Section 5 contains a result on polynomial-time learnability: for n constant, the class of concepts introduced here is PAC learnable. Generalizations to the learning of real-valued (as opposed to Boolean) functions are discussed in Section 6. For reasons of space we omit many proofs; the complete paper is available by electronic mail from the authors.

2 Definitions and Statements of Main Results

Given a set 𝒳 and a subset X of 𝒳, a dichotomy on X is a function

δ : X → {-1, 1}.

Assume given a class ℱ of functions 𝒳 → {-1, 1}, to be called the class of classifier functions. The subset X ⊆ 𝒳 is shattered by ℱ if each dichotomy on X is the restriction to X of some
φ ∈ ℱ. The VC dimension of ℱ, denoted vc(ℱ), is the supremum (possibly infinite) of the cardinalities of the subsets of 𝒳 shattered by ℱ. Pick any two integers n > 0 and q ≥ 0. A sequence

c = (c_1, ..., c_{n+q}) ∈ ℝ^{n+q}

is said to be n-recursive if there exist real numbers r_1, ..., r_n so that

c_{n+j} = Σ_{i=1}^{n} c_{n+j-i} r_i,  j = 1, ..., q.

(In particular, every sequence of length n is n-recursive, but the interesting cases are those in which q ≠ 0, and in fact q ≫ n.) Given such an n-recursive sequence c, we may consider its associated perceptron classifier. This is the map

φ_c : ℝ^{n+q} → {-1, 1} : (x_1, ..., x_{n+q}) ↦ sign(Σ_{i=1}^{n+q} c_i x_i),

where the sign function is understood to be defined by sign(z) = -1 if z ≤ 0 and sign(z) = 1 otherwise. (Changing the definition at zero to be +1 would not change the results to be presented in any way.) We now introduce, for each two fixed n, q as above, a class of functions:

ℱ_{n,q} := {φ_c | c ∈ ℝ^{n+q} is n-recursive}.

This is understood as a function class with respect to the input space 𝒳 = ℝ^{n+q}, and we are interested in estimating vc(ℱ_{n,q}).

Our main result will be as follows (all logs in base 2):

Theorem 1

max{n, n⌊log(⌊1 + (q+1)/n⌋)⌋} ≤ vc(ℱ_{n,q}) ≤ min{n + q, 18n + 4n log(q+1)}.

Note that, in particular, when q > max{2 + n^2, 32}, one has the tight estimates

(n/2) log q ≤ vc(ℱ_{n,q}) ≤ 8n log q.

The organization of the rest of the paper is as follows. In Section 3 we state an abstract result on VC dimension, which is then used in Section 4 to prove Theorem 1. Finally, Section 6 deals with bounds on the sample complexity needed for identification of linear dynamical systems, that is to say, the real-valued functions obtained when not taking "signs" when defining the maps φ_c.

3 An Abstract Result on VC Dimension

Assume that we are given two sets 𝒳 and Λ, to be called in this context the set of inputs and the set of parameter values respectively. Suppose that we are also given a function

F : Λ × 𝒳 → {-1, 1}.
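For concreteness, the two sides of the inequality in Theorem 1 are easy to evaluate numerically; the sketch below (our own illustration, with our own function names, following our reading of the displayed bounds) makes the n log q growth visible.

```python
from math import floor, log2

def vc_lower(n, q):
    # Theorem 1, left side: max{ n, n * floor(log2(floor(1 + (q+1)/n))) }
    return max(n, n * floor(log2(floor(1 + (q + 1) / n))))

def vc_upper(n, q):
    # Theorem 1, right side: min{ n + q, 18n + 4n*log2(q+1) }
    return min(n + q, 18 * n + 4 * n * log2(q + 1))

# For n = 2 and q = 100, the VC dimension is pinned between 10 and about 89;
# both sides grow like n*log(q) once q is large relative to n, in contrast
# with the n + q parameters of an unconstrained perceptron.
lo, hi = vc_lower(2, 100), vc_upper(2, 100)
```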
Associated to this data is the class of functions

ℱ := {F(λ, ·) : 𝒳 → {-1, 1} | λ ∈ Λ}

obtained by considering F as a function of the inputs alone, one such function for each possible parameter value λ. Note that, given the same data, one could dually study the class

ℱ* := {F(·, ξ) : Λ → {-1, 1} | ξ ∈ 𝒳}

which obtains by fixing the elements of 𝒳 and thinking of the parameters as inputs. It is well-known (and in any case a consequence of the more general result to be presented below) that vc(ℱ) ≥ ⌊log(vc(ℱ*))⌋, which provides a lower bound on vc(ℱ) in terms of the "dual VC dimension." A sharper estimate is possible when Λ can be written as a product of n sets

Λ = Λ_1 × Λ_2 × ... × Λ_n    (1)

and that is the topic which we develop next.

We assume from now on that a decomposition of the form in Equation (1) is given, and will define a variation of the dual VC dimension by asking that only certain dichotomies on Λ be obtained from ℱ*. We define these dichotomies only on "rectangular" subsets of Λ, that is, sets of the form

L = L_1 × ... × L_n ⊆ Λ

with each L_i ⊆ Λ_i a nonempty subset. Given any index 1 ≤ κ ≤ n, by a κ-axis dichotomy on such a subset L we mean any function δ : L → {-1, 1} which depends only on the κth coordinate, that is, there is some function φ : L_κ → {-1, 1} so that δ(λ_1, ..., λ_n) = φ(λ_κ) for all (λ_1, ..., λ_n) ∈ L; an axis dichotomy is a map that is a κ-axis dichotomy for some κ. A rectangular set L will be said to be axis-shattered if every axis dichotomy is the restriction to L of some function of the form F(·, ξ) : Λ → {-1, 1}, for some ξ ∈ 𝒳.

Theorem 2 If L = L_1 × ... × L_n ⊆ Λ can be axis-shattered and each set L_i has cardinality r_i, then vc(ℱ) ≥ ⌊log(r_1)⌋ + ... + ⌊log(r_n)⌋.
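When the sets L_i and the pool of test inputs ξ are finite, axis-shattering involves only finitely many dichotomies and can be checked by brute force. The following sketch (our own illustration; all names are hypothetical, not from the paper) enumerates every axis dichotomy on a rectangular set and tests whether each is realized by some F(·, ξ).

```python
from itertools import product

def is_axis_shattered(L_factors, F, inputs):
    """Check the hypothesis of Theorem 2 on L = L_1 x ... x L_n:
    every dichotomy on L that depends on a single coordinate must coincide,
    on L, with lam -> F(lam, xi) for some xi in the finite pool `inputs`."""
    L = list(product(*L_factors))
    realized = {tuple(F(lam, xi) for lam in L) for xi in inputs}
    for kappa, Lk in enumerate(L_factors):
        # enumerate all phi : L_kappa -> {-1, +1}
        for signs in product((-1, 1), repeat=len(Lk)):
            phi = dict(zip(Lk, signs))
            dichotomy = tuple(phi[lam[kappa]] for lam in L)
            if dichotomy not in realized:
                return False
    return True

# Toy example: Lambda = {0,1} x {0,1}; each "input" xi names a coordinate
# kappa and a sign table, and F simply reads the table off that coordinate.
F = lambda lam, xi: xi[1][lam[xi[0]]]
inputs = [(k, s) for k in (0, 1) for s in product((-1, 1), repeat=2)]
ok = is_axis_shattered([(0, 1), (0, 1)], F, inputs)
```

In the toy example each r_i = 2, so when the check succeeds Theorem 2 gives vc(ℱ) ≥ ⌊log 2⌋ + ⌊log 2⌋ = 2.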
(In the special case n = 1 one recovers the classical result vc(ℱ) ≥ ⌊log(vc(ℱ*))⌋.) The proof of Theorem 2 is omitted due to space limitations.

4 Proof of Main Result

We recall the following result; it was proved, using Milnor-Warren bounds on the number of connected components of semi-algebraic sets, by Goldberg and Jerrum:

Fact 4.1 (Goldberg and Jerrum, 1995) Assume given a function F : Λ × 𝒳 → {-1, 1} and the associated class of functions ℱ := {F(λ, ·) : 𝒳 → {-1, 1} | λ ∈ Λ}. Suppose that Λ = ℝ^k and 𝒳 = ℝ^n, and that the function F can be defined in terms of a Boolean formula involving at most s polynomial inequalities in k + n variables, each polynomial being of degree at most d. Then vc(ℱ) ≤ 2k log(8eds).

Using the above Fact and bounds for the standard "perceptron" model, it is not difficult to prove the following Lemma.

Lemma 4.2 vc(ℱ_{n,q}) ≤ min{n + q, 18n + 4n log(q+1)}

Next, we consider the lower bound of Theorem 1.

Lemma 4.3 vc(ℱ_{n,q}) ≥ max{n, n⌊log(⌊1 + (q+1)/n⌋)⌋}

Proof As ℱ_{n,q} contains the class of functions sign(Σ_{i=1}^{n} c_i x_i) with c ∈ ℝ^n (obtained by taking r_1 = ... = r_n = 0), the bound vc(ℱ_{n,q}) ≥ n follows from the classical perceptron case; the remainder of the argument is omitted for reasons of space.

5 Polynomial-Time Learnability

For each fixed n > 0, the consistency problem for ℱ_{n,q} can be solved in time polynomial in q and s in the unit cost model, and in time polynomial in q, s, and L in the logarithmic cost model.

Since vc(ℱ_{n,q}) = O(n + n log(q+1)), it follows from here that the class ℱ_{n,q} is learnable in time polynomial in q (and L in the logarithmic cost model). Due to space limitations, we must omit the proof; it is based on the application of recent results regarding computational complexity aspects of the first-order theory of real-closed fields.

6 Pseudo-Dimension Bounds

In this section, we obtain results on the learnability of linear systems dynamics, that is, the class of functions obtained if one does not take the sign when defining recurrent perceptrons.
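In code terms (a minimal sketch of our own; the names are ours, not the paper's), dropping the sign simply exposes the raw linear response, and taking the sign recovers the Boolean classifier of Section 2.

```python
def response(c, v):
    """Real-valued linear system response: the raw inner product c . v,
    with no sign taken."""
    return sum(ci * vi for ci, vi in zip(c, v))

def sign_version(c, v):
    """Taking the sign (with sign(0) = -1, as in Section 2) recovers the
    recurrent perceptron classifier from the real-valued response."""
    return 1 if response(c, v) > 0 else -1
```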
The connection between VC dimension and sample complexity is only meaningful for classes of Boolean functions; in order to obtain learnability results applicable to real-valued functions one needs metric entropy estimates for certain spaces of functions. These can in turn be bounded through the estimation of Pollard's pseudo-dimension. We next briefly sketch the general framework for learning due to Haussler (based on previous work by Vapnik, Chervonenkis, and Pollard) and then compute a pseudo-dimension estimate for the class of interest.

The basic ingredients are two complete separable metric spaces 𝒳 and 𝒴 (called respectively the sets of inputs and outputs), a class ℱ of functions f : 𝒳 → 𝒴 (called the decision rule or hypothesis space), and a function ℓ : 𝒴 × 𝒴 → [0, r] ⊂ ℝ (called the loss or cost function). The function ℓ is such that the class of functions (x, y) ↦ ℓ(f(x), y) is "permissible" in the sense of Haussler and Pollard. Now, one may introduce, for each f ∈ ℱ, the function

A_{f,ℓ} : 𝒳 × 𝒴 × ℝ → {-1, 1} : (x, y, t) ↦ sign(ℓ(f(x), y) - t),

as well as the class 𝒜_{ℱ,ℓ} consisting of all such A_{f,ℓ}. The pseudo-dimension of ℱ with respect to the loss function ℓ, denoted by PD[ℱ, ℓ], is defined as:

PD[ℱ, ℓ] := vc(𝒜_{ℱ,ℓ}).

Due to space limitations, the relationship between the pseudo-dimension and the sample complexity of the class ℱ will not be discussed here; the reader is referred to the references (Haussler, 1992; Maass, 1994) for details.

For our application we define, for any two nonnegative integers n, q, the class

ℱ'_{n,q} := {φ