{"title": "The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network", "book": "Advances in Neural Information Processing Systems", "page_first": 5410, "page_last": 5419, "abstract": "An important factor contributing to the success of deep learning has been the remarkable ability to optimize large neural networks using simple first-order optimization algorithms like stochastic gradient descent. While the efficiency of such methods depends crucially on the local curvature of the loss surface, very little is actually known about how this geometry depends on network architecture and hyperparameters. In this work, we extend a recently-developed framework for studying spectra of nonlinear random matrices to characterize an important measure of curvature, namely the eigenvalues of the Fisher information matrix. We focus on a single-hidden-layer neural network with Gaussian data and weights and provide an exact expression for the spectrum in the limit of infinite width. We find that linear networks suffer worse conditioning than nonlinear networks and that nonlinear networks are generically non-degenerate. We also predict and demonstrate empirically that by adjusting the nonlinearity, the spectrum can be tuned so as to improve the efficiency of first-order optimization methods.", "full_text": "TheSpectrumoftheFisherInformationMatrixofaSingle-Hidden-LayerNeuralNetworkJeffreyPenningtonGoogleBrainjpennin@google.comPratikWorahGoogleResearchpworah@google.comAbstractAnimportantfactorcontributingtothesuccessofdeeplearninghasbeentheremarkableabilitytooptimizelargeneuralnetworksusingsimple\ufb01rst-orderop-timizationalgorithmslikestochasticgradientdescent.Whiletheef\ufb01ciencyofsuchmethodsdependscruciallyonthelocalcurvatureofthelosssurface,verylittleisactuallyknownabouthowthisgeometrydependsonnetworkarchitectureandhyperparameters.Inthiswork,weextendarecently-developedframeworkforstudyingspectraofnonlinearrandommatricestocharacterizeanimportantmeasureofcurvature,namelytheeigenvaluesoftheFisherinformationmatrix.Wefocusonasingle-hidden-layerneuralnetworkwithGaussiandataandweightsandprovideanexactexpressionforthespectruminthelimitofin\ufb01nitewidth.We\ufb01ndthatlinearnetworkssufferworseconditioningthannonlinearnetworksandthatnonlinearnetworksaregenericallynon-degenerate.Wealsopredictanddemonstrateempiricallythatbyadjustingthenonlinearity,thespectrumcanbetunedsoastoimprovetheef\ufb01ciencyof\ufb01rst-orderoptimizationmethods.1IntroductionInrecentyears,thesuccessofdeeplearninghasspreadfromclassicalproblemsinimagerecogni-tion[1],audiosynthesis[2],translation[3],andspeechrecognition[4]tomorediverseapplicationsinunexpectedareassuchasproteinstructureprediction[5],quantumchemistry[5]anddrugdiscov-ery[6].Theseempiricalsuccessescontinuetooutpacethedevelopmentofaconcretetheoreticalunderstandingofhowandinwhatcontextsdeeplearningworks.Acentraldif\ufb01cultyinanalyzingdeeplearningsystemsstemsfromthecomplexityofneuralnetworklosssurfaces,whicharehighlynon-convexfunctions,oftenofmillionsorevenbillions[7]ofparameters.Optimizationinsuchhigh-dimensionalspacesposesmanychallenges.Formostproblemsindeeplearning,second-ordermethodsaretoocostlytoperformexactly.Despiterecentdevelopmentsonef\ufb01cientapproximationsofthesemethods,suchastheNeumannoptimizer[8]andK-FAC[9],mostpractitionersusegradientdescentanditsvariants[10],[11].Despitetheirwidespreaduse,itisnotobviouswhy\ufb01rst-ordermethodsareoftensuccessfulindeeplearningsinceitisknownthat\ufb01rst-ordermethodsperformpoorlyinthepresenceofpathologicalcurva
An important open question in this direction is to what extent pathological curvature pervades deep learning and how it can be mitigated. More broadly, in order to continue improving neural network models and performance, we aim to better understand the conditions under which first-order methods will work well, and how those conditions depend on model design choices and hyperparameters.

Among the variety of objects that may be used to quantify the geometry of the loss surface, two matrices have elevated importance: the Hessian matrix and the Fisher information matrix. From the perspective of Euclidean coordinate space, the Hessian matrix is the natural object with which to quantify the local geometry of the loss surface. It is also the fundamental object underlying many second-order optimization schemes, and its spectrum provides insights into the nature of critical points. From the perspective of information geometry, distances are measured in model space rather than in coordinate space, and the Fisher information matrix defines the metric and determines the update directions in natural gradient descent [12]. In contrast to the standard gradient, the natural gradient defines the direction in parameter space which gives the largest change in the objective per unit change in the model, as measured by Kullback-Leibler divergence. As we discuss in Section 2, the Hessian and the Fisher are related; for the squared-error loss functions that we consider in this work, it turns out that the Fisher equals the Gauss-Newton approximation of the Hessian, so the connection is concrete.

A central difficulty in building up a robust understanding of the properties of these curvature matrices stems from the fact that they are high-dimensional, and the empirical estimation of their spectra is limited by memory and computational constraints. These limitations typically prevent direct calculations for models with more than a few tens of thousands of parameters, and it is difficult to know whether conclusions drawn from such small models would generalize to the mega- or giga-dimensional networks used in practice. It is therefore important to develop theoretical tools to analyze the spectra of these matrices.

In general, the spectra will depend in intimate ways on the specific parameter values of the weights and the distribution of input data to the network. It is not feasible to precisely capture all of these details, and even if a theory were developed that did so, it would not be clear how to derive generalizable conclusions from it. We therefore focus on a simplified configuration in which the weights and inputs are taken to be random variables. The analysis then becomes a well-defined computation in random matrix theory.

The Fisher is a nonlinear function of the weights and data. To compute its spectrum, we extend the framework developed by Pennington and Worah [13] for studying random matrices with nonlinear dependencies. As we describe in Section 2.4, the Fisher also has an internal block structure that complicates the resulting combinatorial analysis. The main technical contribution of this work is to extend the nonlinear random matrix theory of [13] to matrices with nontrivial internal structure.

The result of our analysis is an explicit characterization of the spectrum of the Fisher information matrix of a single-hidden-layer neural network with squared loss, random Gaussian weights, and random Gaussian input data in the limit of large width. We draw several nontrivial and potentially surprising conclusions about the spectrum. For example, linear networks suffer worse conditioning than any nonlinear network, and although nonlinear networks may have many small eigenvalues, they are generically non-degenerate. Our results also suggest precise ways to tune the nonlinearity in order to improve the conditioning of the spectrum, and our empirical simulations show improvements in the speed of first-order optimization as a result.

2 Preliminaries

2.1 Notation and problem statement

Consider a single-hidden-layer neural network with weight matrices $W^{(1)}, W^{(2)} \in \mathbb{R}^{n \times n}$ and pointwise activation function $f: \mathbb{R} \to \mathbb{R}$. For input $X \in \mathbb{R}^n$, the output of the network $\hat{Y}(X) \in \mathbb{R}^n$ is given by $\hat{Y}(X) = W^{(2)} f(W^{(1)} X)$. For concreteness, we focus our analysis on the case of squared loss, in which case

$$ L(\theta) = \mathbb{E}_{X,Y}\, \tfrac{1}{2} \| Y - \hat{Y}(X) \|_2^2, \qquad (1) $$

where $Y \in \mathbb{R}^n$ are the regression targets and $\theta$ denotes the vector of all parameters $\{W^{(1)}, W^{(2)}\}$. The matrix of second derivatives, or Hessian, of the loss with respect to the parameters can be written as

$$ H = H^{(0)} + H^{(1)}, \qquad (2) $$

where

$$ H^{(0)}_{ij} = \mathbb{E}_X \sum_\alpha \frac{\partial \hat{Y}_\alpha}{\partial \theta_i} \frac{\partial \hat{Y}_\alpha}{\partial \theta_j}, \qquad H^{(1)}_{ij} = \mathbb{E}_X \sum_\alpha (\hat{Y}(X) - Y)_\alpha \frac{\partial^2 \hat{Y}_\alpha}{\partial \theta_i \partial \theta_j}. \qquad (3) $$

In this work we focus on the positive-semi-definite matrix $H^{(0)}$, which is known as the Gauss-Newton matrix. It can also be written as $H^{(0)} = J^T J$, where $J \in \mathbb{R}^{n \times 2n^2}$ is the Jacobian matrix of $\hat{Y}$ with respect to the parameters $\theta$. For models with squared loss, it is known that the Gauss-Newton matrix is equal to the Fisher information matrix of the model distribution with respect to its parameters [14]. As such, by studying $H^{(0)}$ we simultaneously examine the Gauss-Newton matrix and the Fisher information matrix.

The distribution of eigenvalues, or spectrum, of curvature matrices like $H^{(0)}$ plays an important role in optimization, as it characterizes the local geometry of the loss surface and the efficiency of first-order optimization methods. In this work, we seek to build a detailed understanding of this spectrum and how the architectural components of the neural network influence it. In order to isolate these factors from idiosyncratic behavior related to the specifics of the data and weight configurations, we focus on a vanilla baseline configuration in which the data and the weights are both taken to be iid Gaussian random variables. Concretely, we take $X \sim \mathcal{N}(0, I_n)$ and $W^{(l)}_{ij} \sim \mathcal{N}(0, \tfrac{1}{n})$, and we will be interested in computing the expected distribution of eigenvalues of $H^{(0)}$ for large $n$. From this perspective, the problem can be framed as a computation in random matrix theory, the principles behind which we now review.
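To make this setup concrete, the following minimal NumPy sketch (the code and all names in it are ours, not the paper's; tanh stands in as an example activation with zero Gaussian mean) assembles the Jacobian $J \in \mathbb{R}^{n \times 2n^2}$ and estimates $H^{(0)} = \mathbb{E}_X[J^T J]$ by Monte Carlo over $X$:

```python
import numpy as np

def fisher_spectrum(n=32, num_samples=500, seed=0):
    """Monte Carlo estimate of the eigenvalues of H^(0) = E_X[J^T J]."""
    rng = np.random.default_rng(seed)
    f = np.tanh                                      # example activation (zero Gaussian mean)
    df = lambda z: 1.0 / np.cosh(z) ** 2             # its derivative
    W1 = rng.normal(0.0, 1.0 / np.sqrt(n), (n, n))   # W^(l)_ij ~ N(0, 1/n)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(n), (n, n))
    H = np.zeros((2 * n * n, 2 * n * n))
    for _ in range(num_samples):
        x = rng.normal(size=n)                       # X ~ N(0, I_n)
        z = W1 @ x                                   # pre-activations
        # First-layer block of J, eqn. (11): J1[i, ab] = W2[i, a] f'(z_a) x_b.
        J1 = ((W2 * df(z))[:, :, None] * x).reshape(n, -1)
        # Second-layer block, eqn. (11): J2[i, cd] = delta_{ic} f(z_d).
        J2 = np.einsum('ic,d->icd', np.eye(n), f(z)).reshape(n, -1)
        J = np.concatenate([J1, J2], axis=1)         # J is n x 2n^2
        H += J.T @ J / num_samples
    return np.linalg.eigvalsh(H)

eigs = fisher_spectrum()
print(eigs.min(), eigs.max())  # compare with z1 and z6 of Section 3.2
```

At widths this small the estimate is noisy, but a histogram of the resulting eigenvalues already displays the qualitative features derived in Section 3.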
2.2 Spectral density and the Stieltjes transform

The empirical spectral density of a matrix $M$ is defined as

$$ \rho_M(\lambda) = \frac{1}{n} \sum_{j=1}^{n} \delta(\lambda - \lambda_j(M)), \qquad (4) $$

where the $\lambda_j(M)$, $j = 1, \dots, n$, denote the $n$ eigenvalues of $M$, including multiplicity, and $\delta$ is the Dirac delta function. The limiting spectral density is the limit of eqn. (4) as $n \to \infty$, if it exists. For $z \in \mathbb{C} \setminus \mathrm{supp}(\rho_M)$, the Stieltjes transform $G$ of $\rho_M$ is defined as

$$ G(z) = \int \frac{\rho_M(t)}{z - t}\, dt = -\frac{1}{n}\, \mathbb{E}\, \mathrm{tr}(M - z I_n)^{-1}, \qquad (5) $$

where the expectation is with respect to the random variables $W$ and $X$. The quantity $(M - z I_n)^{-1}$ is the resolvent of $M$. The spectral density can be recovered from the Stieltjes transform using the inversion formula

$$ \rho_M(\lambda) = -\frac{1}{\pi} \lim_{\epsilon \to 0^+} \mathrm{Im}\, G(\lambda + i\epsilon). \qquad (6) $$

2.3 Moment method

One of the main tools for computing the limiting spectral distributions of random matrices is the moment method, which, as the name suggests, is based on computations of the moments of $\rho_M$. The asymptotic expansion of eqn. (5) for large $z$ gives the Laurent series

$$ G(z) = \sum_{k=0}^{\infty} \frac{m_k}{z^{k+1}}, \qquad (7) $$

where $m_k$ is the $k$th moment of the distribution $\rho_M$,

$$ m_k = \int dt\, \rho_M(t)\, t^k = \frac{1}{n}\, \mathbb{E}\, \mathrm{tr}\, M^k. \qquad (8) $$

If one can compute $m_k$, then the density $\rho_M$ can be obtained via eqns. (7) and (6). The idea behind the moment method is to compute $m_k$ by expanding out powers of $M$ inside the trace as

$$ \frac{1}{n}\, \mathbb{E}\, \mathrm{tr}\, M^k = \frac{1}{n}\, \mathbb{E} \sum_{i_1, \dots, i_k \in [n]} M_{i_1 i_2} M_{i_2 i_3} \cdots M_{i_{k-1} i_k} M_{i_k i_1}, \qquad (9) $$

and evaluating the leading contributions to the sum as $n \to \infty$. We will use the moment method in order to compute the limiting spectral density of the Fisher. As a first step in that direction, we focus on the properties of the layer-wise block structure in the Fisher induced by the neural network architecture.
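As a sanity check of eqns. (4)-(9) on a familiar example, the sketch below (our code, not the paper's) computes the empirical Stieltjes transform of a Wishart matrix and inverts it with eqn. (6); the Wishart case is a useful warm-up because its limiting moments are the Catalan numbers, which reappear in the linear case of Section 2.5:

```python
import numpy as np

# Empirical Stieltjes transform and its inversion, eqns. (5)-(6), for a
# Wishart matrix M = W W^T with W_ij ~ N(0, 1/n).
n = 1000
W = np.random.default_rng(1).normal(0.0, 1.0 / np.sqrt(n), (n, n))
evals = np.linalg.eigvalsh(W @ W.T)

def G(z):
    # G(z) = -(1/n) tr(M - z I_n)^{-1} = (1/n) sum_j 1/(z - lambda_j)
    return np.mean(1.0 / (z - evals))

lam = np.linspace(1e-3, 4.5, 300)
rho = np.array([-G(l + 1e-2j).imag / np.pi for l in lam])  # eqn. (6) at small epsilon
# rho now approximates the Marchenko-Pastur density supported on [0, 4].

# The moments (1/n) tr M^k of eqn. (8) approach the Catalan numbers 1, 2, 5, ...
print([round(float(np.mean(evals ** k)), 2) for k in (1, 2, 3)])
```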
2.4 Block structure of the Fisher

As described above, in our single-hidden-layer setting with squared loss, the Fisher is given by

$$ H^{(0)} = \mathbb{E}_X\big[ J^T J \big], \qquad J_{\alpha i} = \frac{\partial \hat{Y}_\alpha}{\partial \theta_i}. \qquad (10) $$

Because the parameters of the model are organized into two layers, it is convenient to partition the Fisher into a $2 \times 2$ block matrix,

$$ H^{(0)} = \begin{pmatrix} H^{(0)}_{11} & H^{(0)}_{12} \\ H^{(0)T}_{12} & H^{(0)}_{22} \end{pmatrix}. $$

Furthermore, because the parameters of each layer are matrices, it is useful to regard each block of the Fisher as a four-index tensor. In particular,

$$ [H^{(0)}_{11}]_{a_1 b_1, a_2 b_2} = \mathbb{E}_X\Big[ \sum_i J^{(1)}_{i, a_1 b_1} J^{(1)}_{i, a_2 b_2} \Big], \quad [H^{(0)}_{12}]_{a_1 b_1, c_1 d_1} = \mathbb{E}_X\Big[ \sum_i J^{(1)}_{i, a_1 b_1} J^{(2)}_{i, c_1 d_1} \Big], \quad [H^{(0)}_{22}]_{c_1 d_1, c_2 d_2} = \mathbb{E}_X\Big[ \sum_i J^{(2)}_{i, c_1 d_1} J^{(2)}_{i, c_2 d_2} \Big]. $$

The Jacobian entries $J^{(l)}_{i, ab}$ equal the derivatives of $\hat{Y}_i$ with respect to the weight variables $W^{(l)}_{ab}$,

$$ J^{(1)}_{i, ab} = W^{(2)}_{ia}\, f'\Big( \sum_k W^{(1)}_{ak} X_k \Big) X_b, \qquad J^{(2)}_{j, cd} = \delta_{cj}\, f\Big( \sum_l W^{(1)}_{dl} X_l \Big), \qquad (11) $$

where $\delta_{cj}$ denotes the Kronecker delta, i.e. it is 1 if $c = j$ and 0 otherwise. In order to proceed by the method of moments, we will need to compute the normalized traces of powers of the Fisher, $\mathrm{tr}[(H^{(0)})^d]$, for any $d$. The block structure of the Fisher makes the explicit representation of these traces somewhat complicated. The following proposition helps simplify the expressions.

Proposition 1. Let $A_1 \in \mathbb{R}^{n \times k_1}$, $A_2 \in \mathbb{R}^{n \times k_2}$, and $A = [A_1, A_2] \in \mathbb{R}^{n \times (k_1 + k_2)}$. Then

$$ \mathrm{tr}[(A^T A)^d] = \sum_{b \in \{1,2\}^d} \mathrm{tr} \prod_{i=1}^{d} A_{b_i} A_{b_i}^T = \sum_{b \in \{1,2\}^d} \mathrm{tr}\Big( A_{b_d}^T A_{b_1} \prod_{i=1}^{d-1} A_{b_i}^T A_{b_{i+1}} \Big). \qquad (12) $$

Using Proposition 1 with $A_1 = J^{(1)}$ and $A_2 = J^{(2)}$, we have

$$ \mathrm{tr}[(H^{(0)})^d] = \sum_{b \in \{1,2\}^d} \mathrm{tr}\, \mathbb{E}_X\big[ J^{(b_d)T} J^{(b_1)} \big] \prod_{i=1}^{d-1} \mathbb{E}_X\big[ J^{(b_i)T} J^{(b_{i+1})} \big], \qquad (13) $$

which expresses the traces of the block Fisher entirely in terms of products of its constituent blocks. In order to carry out the moment method to completion, we need the expected normalized traces $m_k$,

$$ m_k = \frac{1}{2n^2}\, \mathbb{E}_W\, \mathrm{tr}[(H^{(0)})^k], \qquad (14) $$

in the limit of large $n$ (the trace is normalized by $2n^2$, the dimension of the Fisher, so that the definition matches eqn. (8)). Because the nonlinearity significantly complicates the analysis, we first illustrate the basics of the methodology in the linear case before moving on to the general case.

2.5 An Illustrative Example: The Linear Case

Let us assume that $f$ is the identity function, i.e. $f(z) = z$. In this case, eqn. (11) can be written as

$$ J^{(1)} = W^{(2)T} \otimes X, \qquad J^{(2)} = I \otimes W^{(1)} X. \qquad (15) $$

Using the fact that $\mathbb{E}_X[X X^T] = I_n$, eqn. (13) gives

$$ \frac{1}{n^2}\, \mathrm{tr}[(H^{(0)})^d] = \mathbb{E}_W \sum_{k=0}^{d} \binom{d}{k} \Big[ \tfrac{1}{n} \mathrm{tr}(W^{(2)} W^{(2)T})^{d-k} \Big] \Big[ \tfrac{1}{n} \mathrm{tr}(W^{(1)} W^{(1)T})^{k} \Big] \to \sum_{k=0}^{d} \binom{d}{k} C_{d-k} C_k, \qquad (16) $$

as $n \to \infty$, where $C_n$ is the $n$th Catalan number. The series can be summed to obtain the Stieltjes transform, whose imaginary part gives the following explicit form for the spectrum:

$$ \rho(\lambda) = \frac{1}{2} \delta(\lambda) + \left[ \frac{1}{2\pi^2}\, E\Big( \frac{(8-\lambda)\lambda}{16} \Big) + \frac{4-\lambda}{8\pi^2}\, K\Big( \frac{(8-\lambda)\lambda}{16} \Big) \right] \mathbf{1}_{[0,8]}, \qquad (17) $$

where $K$ and $E$ are the complete elliptic integrals of the first and second kind,

$$ K(k) = \int_0^{\pi/2} \frac{d\theta}{\sqrt{1 - k \sin^2\theta}}, \qquad E(k) = \int_0^{\pi/2} d\theta\, \sqrt{1 - k \sin^2\theta}. \qquad (18) $$

Notice that the spectrum is highly degenerate, with half of the eigenvalues equaling zero. This degeneracy can be attributed to the $GL(n)$ symmetry of the product $W^{(2)} W^{(1)}$ under $\{W^{(1)}, W^{(2)}\} \to \{G W^{(1)}, W^{(2)} G^{-1}\}$. Fig. 1a shows excellent agreement between the predicted spectral density and finite-width empirical simulations.

[Figure 1: Empirical spectra of the Fisher for single-hidden-layer networks of width 128 (orange) and the theoretical prediction of the spectra (black) for (a) linear and (b) erf_1 (see eqn. (30)) networks. Insets show logarithmic scale.]
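Eqn. (17) can be checked numerically. In the sketch below (our code), scipy's ellipk and ellipe use the same parameter convention as eqn. (18); the continuous part of the density should carry total mass 1/2, with the remaining mass in the delta function at zero, and should reproduce the limiting moments $m_1 = \tfrac{1}{2}\sum_k \binom{1}{k} C_{1-k} C_k = 1$ and $m_2 = \tfrac{1}{2}\sum_k \binom{2}{k} C_{2-k} C_k = 3$ implied by eqns. (14) and (16):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import ellipe, ellipk

def rho_cont(lam):
    """Continuous part of the linear-case density, eqn. (17)."""
    m = (8.0 - lam) * lam / 16.0
    return ellipe(m) / (2 * np.pi ** 2) + (4.0 - lam) * ellipk(m) / (8 * np.pi ** 2)

mass = quad(rho_cont, 0, 8)[0]                      # ~0.5 (the delta at 0 holds the rest)
m1 = quad(lambda l: l * rho_cont(l), 0, 8)[0]       # ~1.0, the first limiting moment
m2 = quad(lambda l: l ** 2 * rho_cont(l), 0, 8)[0]  # ~3.0, the second limiting moment
print(mass, m1, m2)
```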
3 The Stieltjes transform of H^(0)

3.1 Main result

If $f: \mathbb{R} \to \mathbb{R}$ is an activation function with zero Gaussian mean and finite Gaussian moments,

$$ \int \frac{dx}{\sqrt{2\pi}}\, e^{-x^2/2} f(x) = 0, \qquad \left| \int \frac{dx}{\sqrt{2\pi}}\, e^{-x^2/2} f(x)^k \right| < \infty \quad \text{for } k > 1, \qquad (19) $$

then the Stieltjes transform of the limiting spectral density of $H^{(0)}$ is given by the following theorem.

Theorem 1. The Stieltjes transform of the spectral density of the Fisher information matrix of a single-hidden-layer neural network with squared loss, activation function $f$, weight matrices $W^{(1)}, W^{(2)} \in \mathbb{R}^{n \times n}$ with iid entries $W^{(l)}_{ij} \sim \mathcal{N}(0, \tfrac{1}{n})$, no biases, and iid inputs $X \sim \mathcal{N}(0, I_n)$ is given by the following integral as $n \to \infty$:

$$ G(z) = \int_{\mathbb{R}} \int_{\mathbb{R}} \frac{\lambda_1 + \lambda_2 - 2z}{2\zeta^2 \big( (\eta - \zeta)(\eta' - \zeta) + \lambda_1 (z - \eta + \zeta) + \lambda_2 (z - \eta' + \zeta) - z^2 \big)}\, d\mu_1(\lambda_1)\, d\mu_2(\lambda_2), \qquad (20) $$

where the constants $\eta$, $\eta'$, and $\zeta$ are determined by the nonlinearity,

$$ \eta = \int_{\mathbb{R}} f(x)^2\, \frac{e^{-x^2/2}}{\sqrt{2\pi}}\, dx, \qquad \eta' = \int_{\mathbb{R}} f'(x)^2\, \frac{e^{-x^2/2}}{\sqrt{2\pi}}\, dx, \qquad \zeta = \left( \int_{\mathbb{R}} f'(x)\, \frac{e^{-x^2/2}}{\sqrt{2\pi}}\, dx \right)^2, \qquad (21) $$

and the measures $d\mu_1$ and $d\mu_2$ are given by

$$ d\mu_1(\lambda_1) = \frac{1}{2\pi} \sqrt{\frac{\eta' + 3\zeta - \lambda_1}{\lambda_1 - \eta' + \zeta}}\; \mathbf{1}_{[\eta' - \zeta,\, \eta' + 3\zeta]}\, d\lambda_1, \qquad d\mu_2(\lambda_2) = \frac{1}{2\pi} \sqrt{\frac{\eta + 3\zeta - \lambda_2}{\lambda_2 - \eta + \zeta}}\; \mathbf{1}_{[\eta - \zeta,\, \eta + 3\zeta]}\, d\lambda_2. \qquad (22) $$

Remark 1. A straightforward application of Carlson's algorithm [15] can reduce the integral in eqn. (20) to a combination of three standard elliptic integrals.

Remark 2. The spectral density can be recovered from eqn. (20) through the inversion formula, eqn. (6).

Remark 3. Although the result in Theorem 1 is written in terms of $f'$, it is not necessary that $f$ be differentiable. In fact, the weak derivative can be used in place of the derivative, as the proof of the reduction (see also [13]) to final form uses integration by parts only. Therefore, just the existence of a weak derivative for $f$ suffices. In particular, the result holds for $|x|$ and ReLU functions.

The proof of Theorem 1 is quite long and technical, so it is deferred to the Supplementary Material. The basic idea underlying the proof is very similar to that utilized in [13]. The calculation of the moments is divided into two sub-problems, one of enumerating certain connected outer-planar graphs, and another of evaluating certain high-dimensional integrals that correspond to walks in those graphs.

Fig. 1 shows the excellent agreement of the predicted spectrum with empirical simulations of finite-width networks. Fig. 2 highlights the region of the spectrum for which the asymptotic behavior is slow to set in and suggests that empirical simulations with small networks may not provide an accurate portrayal of the behavior of large networks. Fig. 2a shows the predicted spectra for a variety of nonlinearities.

[Figure 2: (a) Theoretical predictions for the spectra of various nonlinearities ($x$, erf_1, srelu_0, f_opt); see eqns. (28) and (30). The linear case is degenerate and more poorly conditioned than the nonlinear cases. (b) Theoretical prediction of the spectrum for erf_1 compared with empirical simulations at widths 16, 32, 64, and 128. Practical constraints restrict the width to small values, but slow convergence toward the asymptotic prediction can be observed.]

3.2 Features of the spectrum

Owing to eqn. (6), the branch points and poles of $G(z)$ encode information about the delta function peaks, spectral edges, and discontinuities in the derivative of $\rho(\lambda)$. These special points can be determined directly from the integral representation for $G(z)$ in eqn. (20) by examining the zeros of the denominator of the integrand. In particular, the following six values of $z$ are locations of the poles at the integration endpoints and determine the salient features of the spectral density:

$$ z_1 = \eta - \zeta, \qquad z_2 = \eta + 3\zeta, \qquad z_3 = \tfrac{1}{2}\Big( \eta + \eta' + 6\zeta - \sqrt{(\eta' - \eta)^2 + 64\zeta^2} \Big), \qquad (23) $$

$$ z_4 = \eta' - \zeta, \qquad z_5 = \eta' + 3\zeta, \qquad z_6 = \tfrac{1}{2}\Big( \eta + \eta' + 6\zeta + \sqrt{(\eta' - \eta)^2 + 64\zeta^2} \Big). \qquad (24) $$

In the Supplementary Material, we establish the relative ordering of constants $0 \le \zeta \le \eta \le \eta'$, which implies that the minimum and maximum eigenvalues of $H^{(0)}$ are given by

$$ \lambda_{\min} = z_1, \qquad \lambda_{\max} = z_6. \qquad (25) $$

The Supplementary Material also shows that the equality $\eta = \zeta$ only holds for linear networks, which implies that the minimum eigenvalue is nonzero for every nonlinear activation function. There are two delta function peaks in the spectrum, which are located at

$$ \lambda^{(1)}_{\mathrm{peak}} = \lambda_{\min} = z_1, \qquad \lambda^{(2)}_{\mathrm{peak}} = z_4. \qquad (26) $$

These peaks indicate specific eigenvalues that have nonvanishing probability of occurrence. The peaks coalesce when $\eta = \eta'$, which can only happen for linear activation functions, in which case $\eta = \eta' = \zeta$, so the peaks occur at $\lambda = 0$, as illustrated in Fig. 2a. That figure also shows that the spectrum may consist of two disconnected components, in which case $z_2$ is the location of the right edge of the left component. Finally, the derivative of the spectrum is discontinuous at $z_3$ and $z_5$. These predictions can be verified in Fig. 2a by consulting Table 1, which provides numerical values of these special points for the various nonlinearities appearing in the figure.

Table 1: Properties of nonlinearities. The columns give the constants of eqn. (21) and the locations of the spectral features, eqns. (23)-(24).

| Nonlinearity | η | η' | ζ | z1 | z2 | z3 | z4 | z5 | z6 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| x | 1 | 1 | 1 | 0 | 4 | 0 | 0 | 4 | 8 |
| erf_1(x) | 1 | 1.226 | 0.914 | 0.086 | 3.741 | 0.198 | 0.312 | 3.966 | 7.51 |
| srelu_0(x) | 1 | 1.467 | 0.733 | 0.267 | 3.200 | 0.491 | 0.733 | 3.667 | 6.377 |
| f_opt(x) | 1 | 1.923 | 0.077 | 0.923 | 1.231 | 1.138 | 1.846 | 2.154 | 2.247 |
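Table 1 can be reproduced directly from eqns. (21), (23), and (24). The sketch below (our code; Gauss-Hermite quadrature stands in for the Gaussian integrals, and an even number of nodes is chosen so that the step-function derivative of srelu_0 is summed exactly by symmetry, though the quadrature of the kinked srelu_0 itself remains slightly approximate) evaluates $(\eta, \eta', \zeta)$ and $z_1, \dots, z_6$ for the nonlinearities of Table 1:

```python
import numpy as np
from scipy.special import erf

# Gauss-Hermite quadrature for E[g(X)], X ~ N(0, 1): hermegauss uses the
# weight exp(-x^2/2), so dividing the weights by sqrt(2 pi) normalizes them.
x, w = np.polynomial.hermite_e.hermegauss(100)
w = w / np.sqrt(2.0 * np.pi)

def moments(f, df):
    """eta, eta', zeta of eqn. (21)."""
    return np.sum(w * f(x) ** 2), np.sum(w * df(x) ** 2), np.sum(w * df(x)) ** 2

def z_points(eta, etap, zeta):
    """The six special points of eqns. (23)-(24)."""
    r = np.sqrt((etap - eta) ** 2 + 64.0 * zeta ** 2)
    return (eta - zeta, eta + 3 * zeta, (eta + etap + 6 * zeta - r) / 2,
            etap - zeta, etap + 3 * zeta, (eta + etap + 6 * zeta + r) / 2)

c0 = np.sqrt(0.5 - 1.0 / (2.0 * np.pi))                    # normalizes srelu_0
c1 = np.sqrt(4.0 / np.pi * np.arctan(np.sqrt(5.0)) - 1.0)  # normalizes erf_1
nonlinearities = {
    'x':       (lambda t: t, lambda t: np.ones_like(t)),
    'erf_1':   (lambda t: erf(t) / c1,
                lambda t: 2.0 / np.sqrt(np.pi) * np.exp(-t ** 2) / c1),
    'srelu_0': (lambda t: (np.maximum(t, 0.0) - 1.0 / np.sqrt(2.0 * np.pi)) / c0,
                lambda t: (t > 0) / c0),
    'f_opt':   (lambda t: (t + np.sqrt(6.0) * (t ** 2 - 1.0)) / np.sqrt(13.0),
                lambda t: (1.0 + 2.0 * np.sqrt(6.0) * t) / np.sqrt(13.0)),
}
for name, (f, df) in nonlinearities.items():
    eta, etap, zeta = moments(f, df)
    print(name, np.round([eta, etap, zeta], 3), np.round(z_points(eta, etap, zeta), 3))
```

Running this reproduces the rows of Table 1 to the displayed precision.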
4 Empirical analysis

4.1 A measure of conditioning

Using the results from Section 3.1, the first two moments can be given explicitly as

$$ m_1 = \lim_{n \to \infty} \frac{1}{2n^2}\, \mathrm{tr}[H^{(0)}] = \tfrac{1}{2}(\eta + \eta'), \qquad m_2 = \lim_{n \to \infty} \frac{1}{2n^2}\, \mathrm{tr}[(H^{(0)})^2] = \tfrac{1}{2}(\eta^2 + \eta'^2 + 4\zeta^2). \qquad (27) $$

A scale-invariant measure of conditioning of the Fisher is just $m_2 / m_1^2$, which is lower-bounded by 1 and which quantifies how tightly concentrated the spectrum is around its mean. Ideally, this quantity should be as small as possible to avoid pathological curvature and to enable fast first-order optimization. One advantage of $m_2 / m_1^2$ compared to other condition numbers such as $\lambda_{\max}/\lambda_{\min}$ or $\lambda_{\max}$ is that it is scale-invariant and well-defined even in the presence of degeneracy in the spectrum.

By expanding $f$ in a basis of Hermite polynomials, we show in the Supplementary Material that among the functions with zero Gaussian mean,

$$ f_{\mathrm{opt}}(x) = \frac{1}{\sqrt{13}} \Big( x + \sqrt{6}\, (x^2 - 1) \Big) \qquad (28) $$

minimizes the ratio $m_2 / m_1^2$. Note that we have removed the freedom to rescale $f_{\mathrm{opt}}$ by a constant by enforcing $\eta = 1$. Curiously, a linear activation function actually maximizes the ratio, implying that nonlinearity invariably improves conditioning, at least by this measure. The relative conditioning of spectra resulting from various activation functions can be observed in Fig. 2a.

The function $f_{\mathrm{opt}}(x)$ grows quickly for large $|x|$ and may be too unstable to use in actual neural networks. Alternative functions could be found by solving the optimization problem

$$ f^* = \arg\min_f \frac{m_2}{m_1^2}, \qquad (29) $$

subject to some constraints, for example that $f$ be monotone increasing, have zero Gaussian mean, and saturate for large $|x|$. Such a problem could be solved via variational calculus; see the Supplementary Material.
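In terms of $(\eta, \eta', \zeta)$, eqn. (27) gives $m_2 / m_1^2 = 2(\eta^2 + \eta'^2 + 4\zeta^2)/(\eta + \eta')^2$, so the ranking described above can be read off from the Table 1 values; a small sketch (our code):

```python
def conditioning_ratio(eta, etap, zeta):
    """m2 / m1^2 with m1 and m2 as in eqn. (27)."""
    return 2.0 * (eta ** 2 + etap ** 2 + 4.0 * zeta ** 2) / (eta + etap) ** 2

# (eta, eta', zeta) taken from Table 1:
for name, params in [('x', (1.0, 1.0, 1.0)), ('erf_1', (1.0, 1.226, 0.914)),
                     ('srelu_0', (1.0, 1.467, 0.733)), ('f_opt', (1.0, 1.923, 0.077))]:
    print(name, round(conditioning_ratio(*params), 3))
# x 3.0 > erf_1 ~2.36 > srelu_0 ~1.74 > f_opt ~1.11
```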
4.2 Efficiency of gradient descent

Another way to investigate the ratio $m_2 / m_1^2$ is to see how well it correlates with the efficiency of first-order optimization. For this purpose, we examine two one-parameter classes of well-behaved activation functions related to ReLU and the error function,

$$ \mathrm{srelu}_\alpha(x) = \frac{[x]_+ + \alpha [-x]_+ - \frac{1+\alpha}{\sqrt{2\pi}}}{\sqrt{\tfrac{1}{2}(1 + \alpha^2) - \tfrac{1}{2\pi}(1 + \alpha)^2}}, \qquad \mathrm{erf}_\alpha(x) = \frac{\mathrm{erf}(\alpha^2 x)}{\sqrt{\tfrac{4}{\pi} \tan^{-1}\sqrt{1 + 4\alpha^4} - 1}}. \qquad (30) $$

Here $\mathrm{srelu}_\alpha$ is the shifted leaky ReLU function studied in [13]. Both $\mathrm{srelu}_\alpha$ and $\mathrm{erf}_\alpha$ have zero Gaussian mean and are normalized such that $\eta = 1$ for all $\alpha$. Changing $\alpha$ does affect $\eta'$, $\zeta$, and the ratio $m_2 / m_1^2$, which implies that different functions within these one-parameter families may behave quite differently under gradient descent.

We designed a simple and controlled experiment to explore these differences in the context of neural network training. The setup is a modified student-teacher framework, in which the student is initialized with the teacher's parameters, but the regression targets are perturbed so that the student's parameters are suboptimal. We then ask by how much the student can decrease the loss with one optimally-chosen step in the gradient direction. Concretely, we define

$$ Y_i = W^{(2)}_t f(W^{(1)}_t X_i) + \epsilon_i, \qquad i = 1, \dots, M, \qquad (31) $$

for teacher weights $[W^{(l)}_t]_{ij} \sim \mathcal{N}(0, \tfrac{1}{n})$, $X_i \sim \mathcal{N}(0, I_n)$, and $\epsilon_i \sim \mathcal{N}(0, \epsilon^2 I_n)$, with width $n = 2^7$, number of samples $M = 2^{17}$, and perturbation size $\epsilon = 10^{-3}$. The loss is defined as

$$ L(W_s) = \sum_{i=1}^{M} \tfrac{1}{2} \| Y_i - W^{(2)}_s f(W^{(1)}_s X_i) \|_2^2. \qquad (32) $$

We are interested in the maximal single-step loss decrease when $W_s$ is initialized at $W_t$, i.e.

$$ \Delta L = \min_\eta \big[ L(W_t - \eta \nabla L|_{W_t}) - L(W_t) \big]. \qquad (33) $$

For the two classes of activation functions in eqn. (30), we empirically measured $\Delta L$ as a function of $\alpha$. In Fig. 3 we compare the results with our theoretical predictions for $m_2 / m_1^2$ as a function of $\alpha$. The agreement is excellent, suggesting that our theory may be able to make practical predictions regarding the training efficiency of actual neural networks.

[Figure 3: Comparison of the conditioning measure m2/m1^2 and the single-step loss reduction ΔL (eqn. (33)) as the activation function changes for (a) srelu_α and (b) erf_α (eqn. (30)). The curves are highly correlated, suggesting the possibility of improved first-order optimization performance by tuning the spectrum of the Fisher through the choice of activation function.]
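A scaled-down version of this experiment fits in a few lines. The sketch below (our code) shrinks $n$ and $M$ from the paper's $2^7$ and $2^{17}$, uses tanh as a stand-in activation, and replaces the exact line search of eqn. (33) with a scan over step sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, eps = 64, 2 ** 12, 1e-3
f = np.tanh
df = lambda t: 1.0 / np.cosh(t) ** 2

W1 = rng.normal(0.0, 1.0 / np.sqrt(n), (n, n))       # teacher weights, eqn. (31)
W2 = rng.normal(0.0, 1.0 / np.sqrt(n), (n, n))
X = rng.normal(size=(n, M))                          # X_i ~ N(0, I_n)
Y = W2 @ f(W1 @ X) + eps * rng.normal(size=(n, M))   # perturbed targets, eqn. (31)

def loss(V1, V2):
    """Squared loss of eqn. (32)."""
    return 0.5 * np.sum((Y - V2 @ f(V1 @ X)) ** 2)

# Gradients of eqn. (32) at the teacher weights (the student's initialization).
H1 = f(W1 @ X)
R = Y - W2 @ H1                                      # residuals
G2 = -R @ H1.T                                       # dL/dW2
G1 = -((W2.T @ R) * df(W1 @ X)) @ X.T                # dL/dW1

L0 = loss(W1, W2)
dL = min(loss(W1 - s * G1, W2 - s * G2) - L0 for s in np.geomspace(1e-6, 1e-1, 50))
print(dL)  # the single-step decrease Delta L of eqn. (33)
```

Repeating this for the families in eqn. (30) over a grid of α values gives the ΔL curves compared against m2/m1^2 in Fig. 3.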
5 Conclusions

In this work, we computed the spectrum of the Fisher information matrix of a single-hidden-layer neural network with squared loss, Gaussian weights, and Gaussian data in the limit of large network width. Our explicit results indicate that linear networks suffer worse conditioning than nonlinear networks and that, although nonlinear networks may have numerous small eigenvalues, they are generically non-degenerate. We also showed that by tuning the nonlinearity it is possible to adjust the spectrum in such a way that the efficiency of first-order optimization methods can be improved. By undertaking this analysis, we demonstrated how to extend the techniques developed in [13] for studying random matrices with nonlinear dependencies to the block-structured curvature matrices that are relevant for optimization in deep learning. The techniques presented here pave the way for future work studying deep learning via random matrix theory.

References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[2] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[3] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[4] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82-97, 2012.
[5] Garrett Goh, Nathan Hodas, and Abhinav Vishnu. Deep learning for computational chemistry. arXiv preprint arXiv:1701.04503, 2017.
[6] Han Altae-Tran, Bharath Ramsundar, Aneesh S. Pappu, and Vijay Pande. Low data drug discovery with one-shot learning. ACS Central Science, 2017.
[7] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. ICLR, 2017. URL http://arxiv.org/abs/1701.06538.
[8] Shankar Krishnan, Ying Xiao, and Rif A. Saurous. Neumann optimizer: A practical optimization algorithm for deep neural networks. In International Conference on Learning Representations, 2018.
[9] Roger B. Grosse and James Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In Proceedings of the 33rd International Conference on Machine Learning, ICML, pages 573-582, 2016.
[10] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.
[11] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[12] S. I. Amari. Natural gradient works efficiently in learning. Neural Computation, 1998.
[13] Jeffrey Pennington and Pratik Worah. Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems, pages 2634-2643, 2017.
[14] Tom Heskes. On "natural" learning and pruning in multilayered perceptrons. Neural Computation, 12(4):881-901, 2000.
[15] B. C. Carlson. A table of elliptic integrals of the third kind. Mathematics of Computation, 51(183):267-280, 1988.
[16] Mariano Giaquinta and Stefan Hildebrandt. Calculus of Variations I. Springer, 1994.
[17] Richard Stanley. Polygon dissections and standard Young tableaux. Journal of Combinatorial Theory, Series A, 1996.