{"title": "PacGAN: The power of two samples in generative adversarial networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1498, "page_last": 1507, "abstract": "Generative adversarial networks (GANs) are a technique for learning generative models of complex data distributions from samples. Despite remarkable advances in generating realistic images, a major shortcoming of GANs is the fact that they tend to produce samples with little diversity, even when trained on diverse datasets. This phenomenon, known as mode collapse, has been the focus of much recent work. We study a principled approach to handling mode collapse, which we call packing. The main idea is to modify the discriminator to make decisions based on multiple samples from the same class, either real or artificially generated. We draw analysis tools from binary hypothesis testing, in particular the seminal result of Blackwell, to prove a fundamental connection between packing and mode collapse. We show that packing naturally penalizes generators with mode collapse, thereby favoring generator distributions with less mode collapse during the training process. 
Numerical experiments on benchmark datasets suggest that packing provides significant improvements.", "full_text": "PacGAN: The power of two samples in generative adversarial networks

Zinan Lin (ECE Department, Carnegie Mellon University, zinanl@andrew.cmu.edu), Ashish Khetan (IESE Department, University of Illinois at Urbana-Champaign, ashish.khetan09@gmail.com), Giulia Fanti (ECE Department, Carnegie Mellon University, gfanti@andrew.cmu.edu), Sewoong Oh (IESE Department, University of Illinois at Urbana-Champaign, swoh@illinois.edu)

Abstract

Generative adversarial networks (GANs) are a technique for learning generative models of complex data distributions from samples. Despite remarkable advances in generating realistic images, a major shortcoming of GANs is the fact that they tend to produce samples with little diversity, even when trained on diverse datasets. This phenomenon, known as mode collapse, has been the focus of much recent work. We study a principled approach to handling mode collapse, which we call packing. The main idea is to modify the discriminator to make decisions based on multiple samples from the same class, either real or artificially generated. We draw analysis tools from binary hypothesis testing, in particular the seminal result of Blackwell [4], to prove a fundamental connection between packing and mode collapse. We show that packing naturally penalizes generators with mode collapse, thereby favoring generator distributions with less mode collapse during the training process. Numerical experiments on benchmark datasets suggest that packing provides significant improvements.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

1 Introduction

Generative adversarial networks (GANs) are a technique for training generative models to produce realistic examples from an unknown data distribution [10]. Suppose we are given N i.i.d. samples X_1, ..., X_N from an unknown probability distribution P over some high-dimensional space R^p (e.g., images). The goal of generative modeling is to learn a model that can draw samples from distribution P. In data-driven generative modeling, this model is typically formulated as a function G : R^d → R^p that maps a low-dimensional code vector Z ∈ R^d, drawn from a standard distribution (e.g., a spherical Gaussian), to the high-dimensional domain of interest. A breakthrough in training such generative models was achieved by the innovative idea of GANs. GANs train two neural networks, called the generator G(Z) and the discriminator D(X). The role of the generator is to produce realistic samples, and the role of the discriminator is to distinguish generated samples from real data. These two neural networks play a dynamic minimax game against each other. If trained long enough, eventually the generator learns to produce samples that are indistinguishable from real data (but preferably different from the training samples). Concretely, GANs search for the parameters of the neural networks G and D that optimize the following minimax objective:

G^* ∈ argmin_G max_D V(G, D) = argmin_G max_D E_{X∼P}[log(D(X))] + E_{Z∼P_Z}[log(1 − D(G(Z)))],   (1)

where P is the distribution of the real data, and P_Z is the distribution of the input code vector Z. Critically, [10] shows that the global optimum of (1) is achieved if and only if P = Q, where Q is the generated distribution of G(Z). The solution to the minimax problem (1) can be approximated by iteratively training two “competing” neural networks, the generator G and the discriminator D. Each model can be updated by backpropagating the gradient of the loss function to its parameters.

A major challenge in training GANs is a phenomenon known as mode collapse, which refers to a lack of diversity in generated samples. Indeed, GANs commonly miss modes when trained on multimodal distributions. For instance, when trained on hand-written digits with ten modes, the generator might fail to produce some of the digits [24]. Several approaches have been proposed to fight mode collapse, e.g. [7, 8]. Proposed solutions rely on modified architectures, loss functions, and optimization algorithms. Although each of these proposed methods empirically mitigates mode collapse, we lack rigorous explanations of why the empirical gains are achieved, especially when those gains are sensitive to hyperparameters.

Our Contributions. In this work, we examine GANs through the lens of hypothesis testing. By viewing the discriminator as performing a binary hypothesis test on samples (i.e., whether they were drawn from distribution P or Q), we can apply classical hypothesis testing results to the analysis of GANs. This view leads to three contributions:

(1) Conceptual: We propose a formal definition of mode collapse that abstracts away the geometric properties of the underlying data distributions (Section 3). This definition is closely related to the notion of ROC curves in binary hypothesis testing. Given this definition, we provide a new interpretation of the pair of distributions (P, Q) as a two-dimensional region called the mode collapse region, where P is the true data distribution and Q the generated one. The mode collapse region provides new insights on how to reason about the relationship between those two distributions.

(2) Analytical: Through the lens of hypothesis testing and mode collapse regions, we show that if the discriminator is allowed to see samples from the m-th order product distributions P^m and Q^m, instead of the usual target distribution P and generator distribution Q, then the corresponding loss when training the generator naturally penalizes generator distributions with strong mode collapse (Section 3). Hence, a generator trained with this type of discriminator will choose distributions that exhibit less mode collapse. The region interpretation of mode collapse and the corresponding data processing inequalities provide novel analysis tools for proving strong and sharp results with simple proofs. Technically, this leads to a novel geometric analysis technique to find the optimal solutions of the infinite-dimensional non-convex optimization problems of interest in Eqs. (2) and (3).

(3) Algorithmic: We propose a new GAN framework to mitigate mode collapse, which we call PacGAN. PacGAN can be seamlessly applied to existing GANs, requiring only a small modification to the discriminator architecture (Section 2). The key idea is to pass m “packed” or concatenated samples to the discriminator, which are jointly classified as either real or generated. This allows the discriminator to do binary hypothesis testing based on the product distributions (P^m, Q^m), which naturally penalizes mode collapse (Section 3). We demonstrate on benchmark datasets that PacGAN significantly improves upon competing approaches in mitigating mode collapse (Section 4), notably minibatch discrimination [24].

Related Work

Three primary challenges appear in the GAN literature: (i) they are unstable to train, (ii) they are challenging to evaluate, and (iii) they exhibit mode collapse (more broadly, they do not generalize). Our work explicitly addresses challenge (iii), which is the focus of this section. Mode collapse is a byproduct of poor generalization, i.e., the generator does not learn the true data distribution; this phenomenon is of significant interest [1, 2, 3, 18]. Prior work has observed two types of mode collapse: entire modes from the input data are never generated, or the generator only creates images within a subset of a particular mode [9, 27, 3, 7, 20, 23]. These phenomena are not well-understood, but a number of explanatory hypotheses have been proposed, including improper objective functions [1, 2] and weak discriminators [20, 24, 2, 17]. Building on the second hypothesis, we show that a packed discriminator can significantly reduce mode collapse, both theoretically and in practice. We compare packing to three main approaches for mitigating mode collapse:

(1) Joint Architectures. In encoder-decoder architectures, the GAN learns an encoding G^{-1}(X) from the data space to a lower-dimensional latent space, in addition to the usual decoding G(Z) from the latent space to the data space (e.g., BiGAN [7], adversarially learned inference (ALI) [8], VEEGAN [25]). Despite empirical gains in such joint architectures, we find that packing captures more modes for a fixed generator and discriminator architecture, with less architectural and computational overhead. Also, recent work suggests that such architectures may be unable to prevent mode collapse [2].

(2) Augmented Discriminators. Several proposals have strengthened the discriminator by providing it with image labels [5] and/or more samples. A latter approach, minibatch discrimination [24], feeds an array of data samples to the discriminator, which uses the minibatch as side information to classify each sample individually. Recent work improved minibatch discrimination by progressively training discriminators on larger minibatches, with impressive visual results [13]. While packing and minibatch discrimination start from the same intuition, that showing multiple samples to the discriminator helps mitigate mode collapse, the two ideas are implemented in completely different discriminator architectures. PacGAN is easier to implement, empirically effective, and our theoretical analysis shows that packing is a principled way to use batched samples. For example, in the experiment in Appendix B.2 (left column of Table 6), the default DCGAN discriminator has 585 weights in total in the Unrolled GAN implementation; the proposed PacDCGAN4 adds only 162 more weights to the discriminator, while the minibatch discriminator adds 1,225,732 more weights.

(3) Optimization-based solutions. GANs are typically trained with iterative generator-discriminator parameter updates, which can lead to non-convergence [17], a worse problem than mode collapse. Unrolled GANs [20] propose an optimization that accounts for k gradient steps when computing gradients. We observe that packing achieves better empirical performance with less overhead.

2 PacGAN Framework

There are many ways to implement the idea of packing, each with tradeoffs. In this section, we present a simple packing framework that serves as the basis for our empirical experiments and a concrete example of packing. A primary reason for this architectural choice is to emphasize only the effect of packing in numerical experiments, and to isolate it from any other effects that might result from other (more sophisticated) changes to the architecture. However, our analysis in Section 3 is agnostic to the packing implementation, and we discuss potential alternative packing architectures in Section 5, especially those that explicitly impose permutation invariance.

We start with an existing GAN, defined by a generator architecture, a discriminator architecture, and a loss function. We call this triplet the mother architecture. The PacGAN framework maintains the same generator architecture, loss function, and hyperparameters as the mother architecture. However, instead of using a discriminator D(X) that maps a single sample (either real or generated) to a (soft) label, we use an augmented discriminator D(X_1, X_2, ..., X_m) that maps m samples to a single (soft) label. These m samples are drawn independently from the same distribution, either real (jointly labelled Y = 1) or generated (Y = 0). We refer to the concatenation of samples with the same label as packing, the resulting discriminator as a packed discriminator, and the number m of concatenated samples as the degree of packing. The proposed approach can be applied to any existing GAN architecture and any loss function, as long as it uses a discriminator D(X) that classifies a single input sample. We use the notation “Pac(X)(m)”, where (X) is the name of the mother architecture and (m) is the packing degree. For example, if we take an original GAN and feed the discriminator three packed samples, we call this “PacGAN3”.

We implement packing by keeping all hidden layers of the discriminator identical to the mother architecture, and increasing the number of nodes in the input layer by a factor of m. For example, in Figure 1, we start with a fully-connected, feed-forward discriminator. Each sample X is two-dimensional, so the input layer has two nodes. Under PacGAN2, we multiply the size of the input layer by the packing degree m = 2, and the connections to the first hidden layer are adjusted so that the first two layers remain fully-connected, as in the mother architecture. The grid-patterned nodes in Figure 1 represent input nodes for the second sample.

Figure 1: PacGAN(m) augments the input layer by a factor of m. The number of weights between the first two layers is increased to preserve the mother architecture's connectivity. Packed samples are concatenated and fed to the input layer; grid-patterned nodes are input nodes for the second sample.

Similarly, when packing a DCGAN, which uses convolutional neural networks for both the generator and the discriminator, we simply stack the images into a tensor of depth m. For instance, the discriminator for PacDCGAN4 on the MNIST dataset of handwritten images [16] would take an input of size 28×28×4, since each individual black-and-white MNIST image is 28×28 pixels. Only the input layer and the number of weights in the corresponding first convolutional layer will increase in depth by a factor of 4. As in standard GANs, we train the packed discriminator with a bag of samples from the real data and the generator. However, each minibatch in the stochastic gradient descent now consists of packed samples (X_1, X_2, ..., X_m, Y), which the discriminator jointly classifies. Intuitively, packing helps the discriminator detect mode collapse because lack of diversity is more obvious in a set of samples than in a single sample.

3 Theoretical Analysis of PacGAN

In this section, we show a fundamental connection between the principle of packing and mode collapse in GANs. We provide a complete understanding of how packing changes the loss as seen by the generator, by focusing on (a) the optimal discriminator over the family of all measurable functions; (b) the population expectation; and (c) the 0-1 loss function of the form

max_D E_{X∼P}[I(D(X))] + E_{G(Z)∼Q}[1 − I(D(G(Z)))],   subject to D(X) ∈ {0, 1}.

This discriminator provides (an approximation of) the total variation distance, and the generator tries to minimize the total variation distance d_TV(P, Q), as widely known in the GAN literature [10]. The reason we make this assumption is primarily for clarity and analytical tractability: total variation distance highlights the effect of packing in a way that is cleaner and easier to understand than if we were to analyze Jensen-Shannon divergence. We want to understand how this 0-1 loss, as provided by such a discriminator, changes with the degree of packing m. As a packed discriminator sees m packed samples, each drawn i.i.d. from one joint class (i.e., either real or generated), we can consider these packed samples as a single sample drawn from a product distribution: P^m for real and Q^m for generated. The resulting loss provided by the packed discriminator is therefore d_TV(P^m, Q^m).

We first provide a formal mathematical definition of mode collapse, which leads to a two-dimensional representation of any pair of distributions (P, Q) as a mode-collapse region. This region representation provides not only conceptual clarity regarding mode collapse, but also proof techniques that are essential to proving our main results. We defer all the proofs to the Appendix. In Appendix E, we show the proposed mode collapse region is equivalent to the ROC curve for binary hypothesis testing. This allows us to use powerful mathematical techniques from binary hypothesis testing, including the data processing inequality.

Definition 1. A target distribution P and a generator Q exhibit (ε, δ)-mode collapse for 0 ≤ ε < δ ≤ 1 if there exists a set S ⊆ X such that P(S) ≥ δ and Q(S) ≤ ε.

Intuitively, larger δ and smaller ε indicate more severe mode collapse. That is, if a large portion of the target, P(S) ≥ δ, in some set S in the domain X is missing in the generator, Q(S) ≤ ε, we declare (ε, δ)-mode collapse. Similarly, when we exchange the roles of P and Q, and have P(S) ≤ ε and Q(S) ≥ δ, we say P and Q exhibit (ε, δ)-mode augmentation. This definition has a fundamental connection to the ROC region in detection theory and binary hypothesis testing, a connection that is critical for our proof techniques; it is detailed in Appendices D and E.

A key observation is that two pairs of distributions can have the same total variation distance while exhibiting very different mode collapse patterns. To see this, consider a toy example in Figure 2, with a uniform target distribution P = U([0, 1]), a mode-collapsing generator Q_1 = U([0.2, 1]), and a non-mode-collapsing generator Q_2 = 0.6 U([0, 0.5]) + 1.4 U([0.5, 1]).

Figure 2: A formal definition of (ε, δ)-mode collapse and its accompanying region representation captures the intensity of mode collapse for the generators Q_1, which has mode collapse, and Q_2, which does not, for the toy example distributions P, Q_1, and Q_2 shown on the left. The region of (ε, δ)-mode collapse that is achievable is shown in grey.

The appropriate way to precisely represent mode collapse is to visualize it through a two-dimensional region we call the mode collapse region. For a given pair (P, Q), the corresponding mode collapse region R(P, Q) is defined as the convex hull of the region of points (ε, δ) such that (P, Q) exhibit (ε, δ)-mode collapse, as shown in Figure 2:

R(P, Q) := conv({(ε, δ) | δ > ε and (P, Q) has (ε, δ)-mode collapse}).

There is a fundamental connection between the mode collapse region and the ROC curve in hypothesis testing (Appendix E). An unpacked discriminator, observing only the TV distance between the generator distribution Q and the true distribution P, cannot distinguish between two candidate generators Q_1 and Q_2 with d_TV(P, Q_1) = d_TV(P, Q_2) but different mode collapse regions. The key insight of this work is that by instead considering product distributions, the total variation distance d_TV(P^m, Q^m) varies in a way that is closely tied to the mode collapse region of (P, Q). For instance, Figure 3 (left) shows the achievable range of d_TV(P^m, Q^m) conditioned on d_TV(P, Q) = τ for τ = 0.11. Within this achievable range, some pairs (P, Q) have rapidly increasing total variation, occupying the upper part of the region (shown in red, middle panel of Figure 3), and others have slowly increasing total variation, occupying the lower part (shown in blue, right panel of Figure 3). We formally show in the following that there is a fundamental connection between the evolution of the total variation distance and the degree of mode collapse. Namely, distributions with strong mode collapse occupy the upper region, and hence will be penalized by a packed discriminator.

Evolution of total variation distances with mode collapse. We analyze how the total variation evolves for the set of all pairs (P, Q) that have the same total variation distance τ when unpacked, with m = 1, and have (ε, δ)-mode collapse for some 0 ≤ ε < δ ≤ 1. The solution of the following optimization problem gives the desired range:

min_{P,Q} or max_{P,Q}  d_TV(P^m, Q^m)   (2)
subject to  d_TV(P, Q) = τ,
(P, Q) has (ε, δ)-mode collapse,

where the maximization and minimization are over all probability measures P and Q, and the mode collapse constraint is defined in Definition 1. We provide the optimal solution analytically and establish that mode-collapsing pairs occupy the upper part of the total variation region; that is, total variation increases rapidly as we pack more samples together (Figure 3, middle panel).

Theorem 2. For all 0 ≤ ε < δ ≤ 1 and an integer m, if 1 ≥ τ ≥ δ − ε, then the solution to the maximization in (2) is 1 − (1 − τ)^m, and the solution to the minimization is

min{ min_{0 ≤ α ≤ 1 − τδ/(δ−ε)} d_TV(P_inner1(α)^m, Q_inner1(α)^m),  min_{1 − τδ/(δ−ε) ≤ α ≤ 1 − τ} d_TV(P_inner2(α)^m, Q_inner2(α)^m) },
where P_inner1(α)^m, Q_inner1(α)^m, P_inner2(α)^m, and Q_inner2(α)^m are the m-th order product distributions of discrete random variables distributed as P_inner1(α) = [δ, 1 − α − δ, α], Q_inner1(α) = [ε, 1 − α − τ − ε, α + τ], P_inner2(α) = [1 − α, α], and Q_inner2(α) = [1 − α − τ, α + τ]. If τ < δ − ε, then the optimization in (2) has no solution and the feasible set is empty.

Figure 3: The range of d_TV(P^m, Q^m) achievable by pairs with d_TV(P, Q) = τ, for a choice of τ = 0.11, defined by the solutions of the optimization (4) provided in Theorem 4 in the Appendix (left panel). The range of d_TV(P^m, Q^m) achievable by those pairs that also have (ε = 0.00, δ = 0.1)-mode collapse (middle panel). A similar range achievable by pairs of distributions that do not have (ε = 0.07, δ = 0.1)-mode collapse or (ε = 0.07, δ = 0.1)-mode augmentation (right panel). Pairs (P, Q) with strong mode collapse occupy the top region (near the upper bound) and pairs with weak mode collapse occupy the bottom region (near the lower bound).

One implication is that the distribution pairs (P, Q) at the top of the total variation evolution region are those with the strongest mode collapse. Another implication is that a pair (P, Q) with strong mode collapse (i.e., with larger δ and smaller ε in the constraint) will be penalized more under packing, and hence a generator minimizing an approximation of d_TV(P^m, Q^m) will be unlikely to select a distribution that exhibits such strong mode collapse.

Evolution of total variation distances without mode collapse. We next analyze how the total variation evolves for the set of all pairs (P, Q) that have the same total variation distance τ when unpacked, with m = 1, and do not have (ε, δ)-mode collapse for some 0 ≤ ε < δ ≤ 1. Because of the symmetry of the total variation distance, mode collapse for (Q, P) is just as damaging as mode collapse for (P, Q) when it comes to how fast total variation distances evolve. Hence, we characterize this evolution for the family of pairs of distributions that have neither type of mode collapse. The solution of the following optimization problem gives the desired range of total variation distances:

min_{P,Q} or max_{P,Q}  d_TV(P^m, Q^m)   (3)
subject to  d_TV(P, Q) = τ,
(P, Q) does not have (ε, δ)-mode collapse,
(Q, P) does not have (ε, δ)-mode collapse.

We provide the optimal solution analytically and establish that the pairs (P, Q) with weak mode collapse occupy the bottom part of the evolution of the total variation distances (see Figure 3, right panel), and also will be penalized less under packing. Hence a generator minimizing (approximate) d_TV(P^m, Q^m) is likely to generate distributions with weak mode collapse.

Theorem 3. If δ + ε ≤ 1 and δ − ε ≤ τ ≤ (δ − ε)/(δ + ε), then the solution to the maximization in (3) is

max_{α + β ≤ 1 − τ, ετ/(δ−ε) ≤ α, β}  d_TV(P_outer1(α, β)^m, Q_outer1(α, β)^m),

where P_outer1(α, β)^m and Q_outer1(α, β)^m are the m-th order product distributions of discrete random variables distributed as P_outer1(α, β) = [(α(δ − ε) − ετ)/(α − ε), α(α + τ − δ)/(α − ε), 1 − τ − α − β, β, 0] and Q_outer1(α, β) = [0, α, 1 − τ − α − β, β(β + τ − δ)/(β − ε), (β(δ − ε) − ετ)/(β − ε)]. The solution to the minimization in (3) is

min_{ετ/(δ−ε) ≤ α ≤ 1 − δτ/(δ−ε)}  d_TV(P_inner(α)^m, Q_inner(α, τ)^m),

where P_inner(α) and Q_inner(α, τ) are defined as in Theorem 4 in the Appendix. We can prove the exact solution of the optimization for all values of ε and δ, which we provide in the Appendix. We also refer the reader to the Appendix for more illustrations of the regions occupied by various choices of ε and δ, for mode-collapsing and non-mode-collapsing distributions.

4 Experiments

On standard benchmark datasets, we compare PacGAN to several baseline GAN architectures, some explicitly designed to mitigate mode collapse: GAN [10], minibatch discrimination (MD) [24], DCGAN [22], VEEGAN [25], Unrolled GANs [20], and ALI [8]. We also implicitly compare against BiGAN [7], which is conceptually identical to ALI. To isolate the effects of packing, we make minimal choices in the architecture and hyperparameters of our packing implementation. Our goal is to reproduce experiments from the literature, apply packing to the simplest baseline GAN, and observe how packing affects performance. Whenever possible, we use exactly the same choice of architecture, hyperparameters, and loss function as a baseline in each experiment; we change only the discriminator to accept packed samples. All code to reproduce our experiments can be found at https://github.com/fjxmlzn/PacGAN.

Metrics. We measure several previously-used metrics. The first is the number of modes that are produced by a generator [7, 20, 25]. In labelled datasets, this number can be evaluated using a third-party trained classifier that classifies the generated samples [25]. A second metric, used in [25], is the number of high-quality samples, which is the proportion of the samples that are within x standard deviations from the center of a mode. Finally, we measure the reverse Kullback-Leibler divergence between the induced distribution from generated samples and the induced distribution from the real samples. Each of these metrics has shortcomings; for example, the number of observed modes ignores class imbalance, and all of the metrics assume the modes are known a priori.

Datasets. We use synthetic and real datasets. The 2D-ring [25] is a mixture of eight two-dimensional spherical Gaussians with means (cos((2π/8)i), sin((2π/8)i)) and variances 10^{-4} in each dimension, for i ∈ {1, ..., 8}. The 2D-grid [25] is a mixture of 25 two-dimensional spherical Gaussians with means (−4 + 2i, −4 + 2j) and variances 0.0025 in each dimension, for i, j ∈ {0, 1, 2, 3, 4}. The MNIST dataset [16] consists of 70K images of handwritten digits, each 28×28 pixels. Unmodified, this dataset has 10 modes (digits). As in [20, 25], we augment the number of modes by stacking the images: we generate a new dataset of 128K images where each image consists of three random MNIST images stacked into a 28×28×3 RGB image. This new stacked MNIST dataset has (with high probability) 1000 = 10×10×10 modes. Finally, we include experiments on the CelebA dataset, which is a collection of 200K facial images of celebrities [19].

4.1 Synthetic data experiments

Our first experiment measures the effect of the number of discriminator parameters on mode collapse. Packed architectures have more discriminator nodes (and parameters) than the mother architecture, which could artificially mitigate mode collapse by giving the discriminator more capacity. We compare this effect to the effect of packing on the 2D-grid dataset. In Figure 4, we evaluate the number of modes recovered and the reverse KL divergence for ALI, GAN, MD, and PacGAN, while varying the number of total parameters in each architecture (discriminator, and encoder if one exists). The experimental details are included in Appendix A.2. For MD, the metrics first improve and then degrade with the number of parameters. We suspect that this may be because MD is very sensitive to experiment settings, as the same MD architecture has very different performance on the 2D-grid and 2D-ring datasets (Appendix A.1). For ALI, GAN, and PacGAN, despite varying the number of parameters by an order of magnitude, we do not see significant evidence of the metrics improving with the number of parameters. This suggests that the advantages of PacGAN and ALI compared to GAN do not stem from having more parameters. Packing also seems to increase the number of modes recovered and decrease the reverse KL divergence; we explore this phenomenon more in subsequent experiments.
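For concreteness, the 2D-grid benchmark and the modes-recovered metric described above can be sketched as follows. This is an illustrative reimplementation, not the authors' released code; the helper names are ours, and counting a sample as high-quality when it lies within 3 standard deviations of a mode center is an assumed threshold.

```python
import numpy as np

def sample_2d_grid(n, rng):
    # 2D-grid dataset: mixture of 25 spherical Gaussians with means
    # (-4 + 2i, -4 + 2j) for i, j in {0, ..., 4} and variance 0.0025
    # (std 0.05) in each dimension
    means = np.array([[-4 + 2 * i, -4 + 2 * j]
                      for i in range(5) for j in range(5)], dtype=float)
    idx = rng.integers(0, len(means), size=n)
    return means[idx] + rng.normal(scale=0.05, size=(n, 2)), means

def modes_recovered(samples, means, std=0.05, thresh=3.0):
    # a mode counts as recovered if some high-quality sample (within
    # `thresh` standard deviations of the center) is nearest to it
    d = np.linalg.norm(samples[:, None, :] - means[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    high_quality = d.min(axis=1) <= thresh * std
    return len(set(nearest[high_quality]))

samples, means = sample_2d_grid(10000, np.random.default_rng(0))
assert modes_recovered(samples, means) == 25  # real data covers every mode
```

Running the metric on real data recovers all 25 modes; a mode-collapsed generator would leave some mode centers with no nearby high-quality sample.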
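The total variation evolution in Theorem 2 can also be checked numerically. The following sketch is ours, not part of the paper: it computes d_TV(P^m, Q^m) by brute-force enumeration for a maximally mode-collapsed discrete pair with d_TV(P, Q) = τ, and confirms that the packed distance matches the theorem's upper bound 1 − (1 − τ)^m.

```python
import itertools

def tv(p, q):
    # total variation distance between two discrete distributions
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def tv_packed(p, q, m):
    # brute-force d_TV(P^m, Q^m) over all length-m tuples of symbols
    total = 0.0
    for idx in itertools.product(range(len(p)), repeat=m):
        pm = qm = 1.0
        for i in idx:
            pm *= p[i]
            qm *= q[i]
        total += abs(pm - qm)
    return 0.5 * total

tau = 0.11
# extremal (0, tau)-mode-collapsed pair: P puts mass tau on a symbol
# that Q never emits, and vice versa, so d_TV(P, Q) = tau
P = [tau, 1.0 - tau, 0.0]
Q = [0.0, 1.0 - tau, tau]
assert abs(tv(P, Q) - tau) < 1e-9
for m in range(1, 6):
    # matches the maximization in Theorem 2: 1 - (1 - tau)^m
    assert abs(tv_packed(P, Q, m) - (1.0 - (1.0 - tau) ** m)) < 1e-9
```

With τ = 0.11 the packed distance grows from 0.11 at m = 1 toward 1 as m increases, illustrating why a packed discriminator penalizes mode-collapsed generators more heavily.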
Figure 4: Effect of the number of parameters on evaluation metrics: modes recovered (higher is better) and reverse KL divergence (lower is better) versus parameter count, for GAN, PacGAN2-4, Minibatch Discrimination, and ALI.

4.2 Stacked MNIST experiments

For our stacked MNIST experiments, we generate samples from the trained generator. Each of the three channels in each sample is classified by a pre-trained third-party MNIST classifier, and the resulting three digits determine which of the 1000 modes the sample belongs to. We measure the number of modes captured, as well as the KL divergence between the generated distribution over modes and the expected (uniform) one.

In the first experiment, we replicate Table 2 from [25], which measured the number of observed modes in a generator trained on the stacked MNIST dataset, as well as the KL divergence of the generated mode distribution. In line with [25], we used a DCGAN-like architecture for these experiments (details in Appendix B.1; we build on https://github.com/carpedm20/DCGAN-tensorflow). Our results are shown in Table 1. The first four rows are copied directly from [25]. The last three rows are computed using a basic DCGAN, with packing in the discriminator. We find that packing gives good mode coverage, reaching all 1,000 modes in every trial. Again, packing the simplest DCGAN fully captures all the modes in the benchmark test, so we do not pursue packing more complex baseline architectures. We also observe that MD is very unstable throughout training, which makes it capture even fewer modes than GAN. One factor that contributes to MD's instability may be that MD requires too many parameters: the number of discriminator parameters in MD is 47,976,773, whereas GAN has 4,310,401 and PacGAN4 only needs 4,324,801.

Stacked MNIST                    | Modes           | KL
DCGAN [22]                       | 99.0            | 3.40
ALI [8]                          | 16.0            | 5.40
Unrolled GAN [20]                | 48.7            | 4.32
VEEGAN [25]                      | 150.0           | 2.95
Minibatch Discrimination [24]    | 24.5 ± 7.67     | 5.49 ± 0.418
DCGAN (our implementation)       | 78.9 ± 6.46     | 4.50 ± 0.127
PacDCGAN2 (ours)                 | 1000.0 ± 0.00   | 0.06 ± 0.003
PacDCGAN3 (ours)                 | 1000.0 ± 0.00   | 0.06 ± 0.003
PacDCGAN4 (ours)                 | 1000.0 ± 0.00   | 0.07 ± 0.005

Table 1: Two measures of mode collapse proposed in [25] for the stacked MNIST dataset: number of modes captured by the generator and reverse KL divergence over the generated mode distribution. The DCGAN, PacDCGAN, and MD results are averaged over 10 trials, with standard error reported.

4.3 CelebA experiments

Finally, we measure the diversity of images generated from the CelebA dataset as in [3], by estimating the probability of collision in a batch of generated images. If there exists at least one pair of near-duplicate images in the batch, a collision is declared, which indicates lack of diversity. The details of how we determine duplicates and of our architecture are deferred to Appendix C. We find that packing significantly improves the diversity of samples, and if the size of the discriminator is small, packing also improves sample quality. See Appendix C for generated samples.

discriminator size | DCGAN | PacDCGAN2
273K               | 1     | 0.33
4×273K             | 0.42  | 0
16×273K            | 0.86  | 0
25×273K            | 0.65  | 0.17

Table 2: Probability of at least one pair of near-duplicate images appearing in a batch of 1024 images generated from DCGAN and PacDCGAN2 on the CelebA dataset.

5 Discussion

In this work, we propose a packing framework that theoretically and empirically mitigates mode collapse with low overhead. Our analysis leads to several interesting open questions, including how to apply these analysis techniques to more general classes of loss functions, such as Jensen-Shannon divergence and Wasserstein distances. This would complete the understanding of the superiority of our approach observed in experiments with JS divergence in Section 4 and with Wasserstein distance in Appendix B.3. Another important question is what packing architecture to use. For instance, a framework that provides permutation invariance may give better results, such as graph neural networks [6, 26, 15] or deep sets [28].

Acknowledgement

The authors would like to thank Sreeram Kannan and Alex Dimakis for the initial discussions that led to the inception of the packing idea, and Vyas Sekar for valuable discussions about GANs. We thank Akash Srivastava, Luke Metz, Tu Nguyen, and Yingyu Liang for providing insights and/or the implementation details of their proposed architectures for VEEGAN [25], Unrolled GAN [20], D2GAN [21], and MIX+GAN [2]. We thank the anonymous reviewers for their constructive feedback. This work is supported by NSF awards CNS-1527754, CCF-1553452, CCF-1705007, and RI-1815535, and a Google Faculty Research Award. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC). This work is partially supported by the generous research credits on AWS cloud computing resources from Amazon.

References

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[2] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573, 2017.
[3] S. Arora and Y. Zhang. Do GANs actually learn the distribution? An empirical study. arXiv preprint arXiv:1706.08224, 2017.
[4] D. Blackwell. Equivalent comparisons of experiments. The Annals of Mathematical Statistics, 24(2):265–272, 1953.
[5] T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.
[6] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
[7] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[8] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
[9] I. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[12] P. Kairouz, S. Oh, and P. Viswanath. The composition theorem for differential privacy. IEEE Transactions on Information Theory, 63(6):4037–4049, June 2017.
[13] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[14] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[15] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[16] Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[17] J. Li, A. Madry, J. Peebles, and L. Schmidt. Towards understanding the dynamics of generative adversarial networks. arXiv preprint arXiv:1706.09884, 2017.
[18] S. Liu, O. Bousquet, and K. Chaudhuri. Approximation and convergence properties of generative adversarial learning. arXiv preprint arXiv:1705.08991, 2017.
[19] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[20] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
[21] T. Nguyen, T. Le, H. Vu, and D. Phung. Dual discriminator generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2667–2677, 2017.
[22] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[23] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
[24] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[25] A. Srivastava, L. Valkov, C. Russell, M. Gutmann, and C. Sutton. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. arXiv preprint arXiv:1705.07761, 2017.
[26] K. K. Thekumparampil, C. Wang, S. Oh, and L.-J. Li. Attention-based graph neural network for semi-supervised learning. arXiv preprint arXiv:1803.03735, 2018.
[27] I. Tolstikhin, S. Gelly, O. Bousquet, C.-J. Simon-Gabriel, and B. Schölkopf. AdaGAN: Boosting generative models. arXiv preprint arXiv:1701.02386, 2017.
[28] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In Advances in Neural Information Processing Systems, pages 3391–3401, 2017.
", "award": [], "sourceid": 769, "authors": [{"given_name": "Zinan", "family_name": "Lin", "institution": "Carnegie Mellon University"}, {"given_name": "Ashish", "family_name": "Khetan", "institution": "Amazon AI Labs"}, {"given_name": "Giulia", "family_name": "Fanti", "institution": "CMU"}, {"given_name": "Sewoong", "family_name": "Oh", "institution": "University of Washington"}]}