{"title": "Insights on representational similarity in neural networks with canonical correlation", "book": "Advances in Neural Information Processing Systems", "page_first": 5727, "page_last": 5736, "abstract": "Comparing different neural network representations and determining how representations evolve over time remain challenging open questions in our understanding of the function of neural networks. Comparing representations in neural networks is fundamentally difficult as the structure of representations varies greatly, even across groups of networks trained on identical tasks, and over the course of training. Here, we develop projection weighted CCA (Canonical Correlation Analysis) as a tool for understanding neural networks, building off of SVCCA, a recently proposed method (Raghu et al, 2017). We first improve the core method, showing how to differentiate between signal and noise, and then apply this technique to compare across a group of CNNs, demonstrating that networks which generalize converge to more similar representations than networks which memorize, that wider networks converge to more similar solutions than narrow networks, and that trained networks with identical topology but different learning rates converge to distinct clusters with diverse representations. We also investigate the representational dynamics of RNNs, across both training and sequential timesteps, finding that RNNs converge in a bottom-up pattern over the course of training and that the hidden state is highly variable over the course of a sequence, even when accounting for linear transforms. Together, these results provide new insights into the function of CNNs and RNNs, and demonstrate the utility of using CCA to understand representations.", "full_text": "Insights on representational similarity in neural\n\nnetworks with canonical correlation\n\nAri S. 
Morcos\u2217\u2021\n\nDeepMind\u2020\n\narimorcos@gmail.com\n\nMaithra Raghu\u2217\u2021\n\nGoogle Brain, Cornell University\n\nmaithrar@gmail.com\n\nSamy Bengio\nGoogle Brain\n\nbengio@google.com\n\nAbstract\n\nComparing different neural network representations and determining how repre-\nsentations evolve over time remain challenging open questions in our understand-\ning of the function of neural networks. Comparing representations in neural net-\nworks is fundamentally dif\ufb01cult as the structure of representations varies greatly,\neven across groups of networks trained on identical tasks, and over the course\nof training. Here, we develop projection weighted CCA (Canonical Correlation\nAnalysis) as a tool for understanding neural networks, building off of SVCCA,\na recently proposed method [22]. We \ufb01rst improve the core method, showing\nhow to differentiate between signal and noise, and then apply this technique to\ncompare across a group of CNNs, demonstrating that networks which general-\nize converge to more similar representations than networks which memorize, that\nwider networks converge to more similar solutions than narrow networks, and that\ntrained networks with identical topology but different learning rates converge to\ndistinct clusters with diverse representations. We also investigate the representa-\ntional dynamics of RNNs, across both training and sequential timesteps, \ufb01nding\nthat RNNs converge in a bottom-up pattern over the course of training and that\nthe hidden state is highly variable over the course of a sequence, even when ac-\ncounting for linear transforms. Together, these results provide new insights into\nthe function of CNNs and RNNs, and demonstrate the utility of using CCA to\nunderstand representations.\n\n1\n\nIntroduction\n\nAs neural networks have become more powerful, an increasing number of studies have sought to de-\ncipher their internal representations [26, 16, 4, 2, 11, 25, 21]. 
Most of these have focused on the role\nof individual units in the computations performed by individual networks. Comparing population\nrepresentations across networks has proven especially dif\ufb01cult, largely because networks converge\nto apparently distinct solutions in which it is dif\ufb01cult to \ufb01nd one-to-one mappings of units [16].\nRecently, [22] applied Canonical Correlation Analysis (CCA) as a tool to compare representations\nacross networks. CCA had previously been used for tasks such as computing the similarity between\nmodeled and measured brain activity [23], and training multi-lingual word embeddings in language\nmodels [5]. Because CCA is invariant to linear transforms, it is capable of \ufb01nding shared structure\nacross representations which are super\ufb01cially dissimilar, making CCA an ideal tool for comparing\nthe representations across groups of networks and for comparing representations across time in\nRNNs.\nUsing CCA to investigate the representations of neural networks, we make three main contributions:\n\n\u2217equal contribution, in alphabetical order\n\u2020Work done while at DeepMind; currently at Facebook AI Research (FAIR)\n\u2021To whom correspondence should be addressed: arimorcos@gmail.com, maithrar@gmail.com\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f1. We analyse the technique introduced in [22], and identify a key challenge: the method\ndoes not effectively distinguish between the signal and the noise in the representation. We\naddress this via a better aggregation technique (Section 2.2).\n\n2. 
Building off of [21], we demonstrate that groups of networks which generalize converge to more similar solutions than those which memorize (Section 3.1), that wider networks converge to more similar solutions than narrower networks (Section 3.2), and that networks with identical topology but distinct learning rates converge to a small set of diverse solutions (Section 3.3).

3. Using CCA to analyze RNN representations over training, we find that, as with CNNs [22], RNNs exhibit bottom-up convergence (Section 4.1). Across sequence timesteps, however, we find that RNN representations vary significantly (Section A.3).

2 Canonical Correlation Analysis on Neural Network Representations

Canonical Correlation Analysis [10] is a statistical technique for relating two sets of observations arising from an underlying process. It identifies the "best" (correlation-maximizing) linear relationships, under mutual orthogonality and norm constraints, between two sets of multidimensional variates.

Concretely, in our setting, the underlying process is a neural network being trained on some task. The multidimensional variates are neuron activation vectors over some dataset X. As in [22], a neuron activation vector denotes the outputs a single neuron z has on X: if X = {x_1, ..., x_m}, then the neuron z outputs scalars z(x_1), ..., z(x_m), which can be stacked to form a vector.^4

A single neuron activation vector is one multidimensional variate, and a layer of neurons gives us a set of multidimensional variates. In particular, we can consider two layers, L_1 and L_2, of a neural network as two sets of observations, to which we can then apply CCA to determine the similarity between the two layers. Crucially, this similarity measure is invariant to (invertible) affine transforms of either layer, which makes it especially apt for neural networks, where the representation at each layer typically goes through an affine transform before later use. Most importantly, it also enables comparisons between different neural networks,^5 which is not naively possible due to the lack of any neuron-to-neuron alignment.

^4 This is different from the vector of all neuron outputs on a single input, z_1(x_1), ..., z_N(x_1), which is also sometimes referred to as an activation vector.
^5 Including those with different topologies, such that L_1 and L_2 have different sizes.

2.1 Mathematical Details of Canonical Correlation

Here we overview the formal mathematical interpretation of CCA, as well as the optimization problem used to compute it. Let L_1 and L_2 be a × n and b × n matrices respectively, with L_1 representing a multidimensional variates and L_2 representing b multidimensional variates. We wish to find vectors w in R^a and s in R^b such that the correlation

$$\rho = \frac{\langle w^T L_1, \; s^T L_2 \rangle}{\lVert w^T L_1 \rVert \cdot \lVert s^T L_2 \rVert}$$

is maximized. Assuming the variates in L_1, L_2 are centered, and letting \Sigma_{L_1,L_1} denote the a-by-a covariance of L_1, \Sigma_{L_2,L_2} the b-by-b covariance of L_2, and \Sigma_{L_1,L_2} the cross-covariance, we have

$$\frac{\langle w^T L_1, \; s^T L_2 \rangle}{\lVert w^T L_1 \rVert \cdot \lVert s^T L_2 \rVert} = \frac{w^T \Sigma_{L_1,L_2} s}{\sqrt{w^T \Sigma_{L_1,L_1} w}\,\sqrt{s^T \Sigma_{L_2,L_2} s}}$$

We can change basis to w = \Sigma_{L_1,L_1}^{-1/2} u and s = \Sigma_{L_2,L_2}^{-1/2} v to get

$$\frac{w^T \Sigma_{L_1,L_2} s}{\sqrt{w^T \Sigma_{L_1,L_1} w}\,\sqrt{s^T \Sigma_{L_2,L_2} s}} = \frac{u^T \Sigma_{L_1,L_1}^{-1/2} \Sigma_{L_1,L_2} \Sigma_{L_2,L_2}^{-1/2} v}{\sqrt{u^T u}\,\sqrt{v^T v}}$$

which can be solved with a singular value decomposition:

$$\Sigma_{L_1,L_1}^{-1/2} \Sigma_{L_1,L_2} \Sigma_{L_2,L_2}^{-1/2} = U \Lambda V \qquad (*)$$

with u, v in (*) being the first left and right singular vectors, and the top singular value of \Lambda corresponding to the canonical correlation coefficient \rho \in [0, 1], which tells us how well correlated the vectors w^T L_1 = u^T \Sigma_{L_1,L_1}^{-1/2} L_1 and s^T L_2 = v^T \Sigma_{L_2,L_2}^{-1/2} L_2 (both vectors in R^n) are.

In fact, u, v, \rho are really the first in a series, and can be denoted u^{(1)}, v^{(1)}, \rho^{(1)}. Next in the series are u^{(2)}, v^{(2)}, the second left and right singular vectors, and \rho^{(2)}, the corresponding second highest singular value of \Lambda. \rho^{(2)} denotes the correlation between (u^{(2)})^T \Sigma_{L_1,L_1}^{-1/2} L_1 and (v^{(2)})^T \Sigma_{L_2,L_2}^{-1/2} L_2, which is the next highest possible correlation under the constraints \langle u^{(1)}, u^{(2)} \rangle = 0 and \langle v^{(1)}, v^{(2)} \rangle = 0.

The output of CCA is a series of singular vectors u^{(i)}, v^{(i)} which are pairwise orthogonal; their corresponding vectors in R^n, (u^{(i)})^T \Sigma_{L_1,L_1}^{-1/2} L_1 and (v^{(i)})^T \Sigma_{L_2,L_2}^{-1/2} L_2; and finally their correlation coefficients \rho^{(i)} \in [0, 1], with \rho^{(i)} \leq \rho^{(j)} for i > j. Letting c = min(a, b), we end up with c non-zero \rho^{(i)}.

Note that the orthogonality of u^{(i)}, u^{(j)} also results in the orthogonality of the CCA directions (u^{(i)})^T \Sigma_{L_1,L_1}^{-1/2} L_1 and (u^{(j)})^T \Sigma_{L_1,L_1}^{-1/2} L_1, as

$$\langle (u^{(i)})^T \Sigma_{L_1,L_1}^{-1/2} L_1, \; (u^{(j)})^T \Sigma_{L_1,L_1}^{-1/2} L_1 \rangle = (u^{(i)})^T \Sigma_{L_1,L_1}^{-1/2} L_1 L_1^T \Sigma_{L_1,L_1}^{-1/2} u^{(j)} = (u^{(i)})^T u^{(j)} = 0 \qquad (**)$$

Figure 1: CCA distinguishes between stable and unstable parts of the representation over the course of training. Sorted CCA coefficients (\rho_t^{(i)}) comparing representations of layer L at times t through training with its representation at the final timestep T, for CNNs trained on CIFAR-10 (a), and RNNs trained on PTB (b) and WikiText-2 (c). For all of these networks, at a time t_0 < T (indicated in title), the performance converges to match final performance (see Figure A1). However, many \rho_t^{(i)} are unconverged, corresponding to unnecessary parts of the representation (noise). To distinguish between the signal and noise portions of the representation, we apply CCA between L at a timestep t_early early in training and L at timestep T/2 to get \rho_{T/2}. We take the 100 most converged vectors (according to \rho_{T/2}) to form S, and the 100 least converged vectors to form B. We then compute the CCA similarity between S and L at times t > t_early, and similarly for B. S remains stable through training (signal), while B rapidly becomes uncorrelated (d-f). Note that the sudden spike at T/2 in the unstable representation occurs because it is chosen to be the least correlated with step T/2.

2.2 Beyond Mean CCA Similarity

To determine the representational similarity between two layers L_1, L_2, [22] prunes neurons with a preprocessing SVD step, and then applies CCA to L_1, L_2. They then represent the similarity of L_1, L_2 by the mean correlation coefficient. Adapting this to make a distance measure:

$$d_{SVCCA}(L_1, L_2) = 1 - \frac{1}{c}\sum_{i=1}^{c} \rho^{(i)}$$

One drawback of this measure is that it implicitly assumes that all c CCA vectors are equally important to the representations at layer L_1. 
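The optimization problem above reduces to an SVD of the whitened cross-covariance, so the full set of coefficients, and the mean-based distance just described, can be sketched in a few lines of numpy. This is an illustrative reimplementation of the definitions in the text, not the authors' released code; the eigenvalue clipping (`eps`) is a numerical-stability choice of this sketch, and the SVD preprocessing step of [22] is omitted for brevity.

```python
import numpy as np

def cca_coefficients(L1, L2, eps=1e-10):
    """Canonical correlation coefficients rho^(1) >= rho^(2) >= ...

    L1: (a, n) array of a neuron activation vectors over n datapoints.
    L2: (b, n) array of b neuron activation vectors over the same datapoints.
    """
    # CCA assumes centered variates.
    L1 = L1 - L1.mean(axis=1, keepdims=True)
    L2 = L2 - L2.mean(axis=1, keepdims=True)
    n = L1.shape[1]

    S11 = L1 @ L1.T / (n - 1)   # (a, a) covariance
    S22 = L2 @ L2.T / (n - 1)   # (b, b) covariance
    S12 = L1 @ L2.T / (n - 1)   # (a, b) cross-covariance

    def inv_sqrt(S):
        # Symmetric inverse square root; clip tiny eigenvalues for stability.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(np.maximum(w, eps) ** -0.5) @ V.T

    # Singular values of Sigma_11^{-1/2} Sigma_12 Sigma_22^{-1/2}
    # are the canonical correlations (equation (*) in the text).
    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return np.clip(np.linalg.svd(T, compute_uv=False), 0.0, 1.0)

def mean_cca_distance(L1, L2):
    """1 - mean correlation coefficient (SVD preprocessing of [22] omitted)."""
    return 1.0 - float(cca_coefficients(L1, L2).mean())
```

Because the coefficients are invariant to invertible linear maps, two layers related by such a map give all coefficients near 1 and a distance near 0, while two unrelated random layers give small coefficients.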
However, there has been ample evidence that DNNs do not rely on the full dimensionality of a layer to represent high performance solutions [12, 6, 1, 20, 15, 21, 14]. As a result, the mean correlation coefficient will typically underestimate the degree of similarity.

To investigate this further, we first asked whether, over the course of training, all CCA vectors converge to their final representations before the network's performance converges. To test this, we computed the CCA similarity between layer L at times t throughout training and layer L at the final timestep T. Viewing the sorted CCA coefficients \rho, we can see that many of the coefficients continue to change well after the network's performance has converged (Figure 1a-c, Figure A1). This result suggests that the unconverged coefficients and their corresponding vectors may represent "noise" which is unnecessary for high network performance.

We next asked whether the CCA vectors which stabilize early in training remain stable. To test this, we computed the CCA vectors between layer L at timestep t_early in training and timestep T/2. We then computed the similarity of the top 100 vectors (those which stabilized early) and of the bottom 100 vectors (those which had not) with the representation at all other training times. Consistent with our intuition, we found that the vectors which stabilized early remained stable, while the unstable vectors continued to vary, and therefore likely represent noise.

These results suggest that task-critical representations are learned by midway through training, while the noise only approaches its final value towards the end. We therefore suggest a simple and easy to compute variation that takes this into account. We also discuss an alternate approach in Section A.2.

Projection Weighting   One way to address this issue is to replace the mean by a weighted mean, in which canonical correlations which are more important to the underlying representation receive higher weight. We propose a simple method, projection weighting, to determine these weights. We base our proposition on the hypothesis that CCA vectors which account for (loosely speaking) a larger proportion of the original outputs are likely to be more important to the underlying representation. More formally, let layer L_1 have neuron activation vectors [z_1, ..., z_a], and CCA vectors h_i = (u^{(i)})^T \Sigma_{L_1,L_1}^{-1/2} L_1. We know from (**) that h_i, h_j are orthogonal. Because computing CCA can result in the accrual of small numerical errors [24], we first explicitly orthonormalize h_1, ..., h_c via Gram-Schmidt. We then identify how much of the original output is accounted for by each h_i:

$$\tilde{\alpha}_i = \sum_j |\langle h_i, z_j \rangle|$$

Normalizing this to get weights \alpha_i, with \sum_i \alpha_i = 1, we can compute the projection weighted CCA distance:^6

$$d(L_1, L_2) = 1 - \sum_{i=1}^{c} \alpha_i \rho^{(i)}$$

^6 We note that this is technically a pseudo-distance rather than a distance, as it is non-symmetric.

As a simple test of the benefits of projection weighting, we constructed a toy case in which we used CCA to compare the representations of two networks with common (signal) and uncommon (noise) structure, each of a fixed dimensionality. We then used the naive mean and the projection weighted mean to measure the CCA distance between these two networks as a function of the ratio of signal dimensions to noise dimensions. As expected, we found that while the naive mean was extremely sensitive to this ratio, the projection weighted mean was largely robust (Figure 2).

Figure 2: Projection weighted (PWCCA) vs. SVCCA vs. unweighted mean. The unweighted mean (blue) and projection weighted mean (red) were used to compare synthetic ground truth signal and uncommon (noise) structure, each of fixed dimensionality. As the signal-to-noise ratio decreases, the unweighted mean underestimates the shared structure, while the projection weighted mean remains largely robust. SVCCA performs better than the unweighted mean but less well than projection weighting.

3 Using CCA to measure the similarity of converged solutions

Because CCA measures the distance between two representations independent of linear transforms, it enables formerly difficult comparisons between the representations of different networks. Here, we use this property of CCA to evaluate whether groups of networks trained on CIFAR-10 with different random initializations converge to similar solutions under the following conditions:

• When trained on identically randomized labels (as in [27]) or on the true labels (Section 3.1)
• As network width is varied (Section 3.2)
• In a large sweep of 200 networks (Section 3.3)

3.1 Generalizing networks converge to more similar solutions than memorizing networks

Figure 3: Generalizing networks converge to more similar solutions than memorizing networks. Groups of 5 networks were trained on CIFAR-10 with either the true labels (generalizing) or a fixed random permutation of the labels (memorizing). The pairwise CCA distance was then compared within each group and between generalizing and memorizing networks (inter) for each layer, based on the training data and the projection weighted CCA coefficient (with thresholding to remove low-variance noise). While both categories converged to similar solutions in early layers, likely reflecting convergent edge detectors, etc., generalizing networks converge to significantly more similar solutions in later layers. At the softmax, sets of both generalizing and memorizing networks converged to nearly identical solutions, as all networks achieved near-zero training loss. Error bars represent the mean ± std weighted mean CCA distance across pairwise comparisons.

It has recently been observed that DNNs are capable of solving image classification tasks even when the labels have been randomly permuted [27]. Such networks must, by definition, memorize the training data, and therefore cannot generalize beyond the training set. 
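The projection weighted distance of Section 2.2 can be sketched in numpy as follows. This is an independent, hedged illustration rather than the authors' released implementation; the eigenvalue clipping (`eps`) and the use of QR in place of classical Gram-Schmidt (numerically equivalent here, since the rows being orthonormalized are already nearly orthonormal) are choices of this sketch.

```python
import numpy as np

def pwcca_distance(L1, L2, eps=1e-10):
    """Projection weighted CCA pseudo-distance, computed from the
    perspective of L1 (it is not symmetric).

    L1: (a, n), L2: (b, n) arrays of neuron activations over n datapoints.
    """
    L1 = L1 - L1.mean(axis=1, keepdims=True)
    L2 = L2 - L2.mean(axis=1, keepdims=True)
    n = L1.shape[1]

    S11 = L1 @ L1.T / (n - 1)
    S22 = L2 @ L2.T / (n - 1)
    S12 = L1 @ L2.T / (n - 1)

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(np.maximum(w, eps) ** -0.5) @ V.T

    A, B = inv_sqrt(S11), inv_sqrt(S22)
    U, rho, Vt = np.linalg.svd(A @ S12 @ B)
    c = min(L1.shape[0], L2.shape[0])
    rho = np.clip(rho[:c], 0.0, 1.0)

    # CCA vectors h_i = (u^(i))^T Sigma_11^{-1/2} L1, stacked as rows,
    # re-orthonormalized to guard against accrued numerical error.
    H = (U.T @ A @ L1)[:c]
    Q, _ = np.linalg.qr(H.T)
    H = Q.T

    # alpha_i ∝ sum_j |<h_i, z_j>|: how much of the original neuron
    # outputs each CCA direction accounts for.
    alpha = np.abs(H @ L1.T).sum(axis=1)
    alpha /= alpha.sum()
    return 1.0 - float(alpha @ rho)
```

Two layers related by an invertible linear map should give a distance near 0, while an unrelated random layer should give a much larger distance.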
However, the representational properties which distinguish networks which memorize from those which generalize remain unclear. In particular, we hypothesize that the representational similarity within a group of generalizing networks (networks trained on the true labels) should differ from the representational similarity within a group of memorizing networks (networks trained on random labels).

To test this hypothesis, we trained groups of five networks with identical topology on either unmodified CIFAR-10 or CIFAR-10 with random labels, all of which were trained to near-zero training loss.^7 Critically, the same set of random labels was used for all networks.

^7 Details of the architectures and training procedures for this and the following experiments can be found in Appendix A.4.

Figure 4: Larger networks converge to more similar solutions. Groups of 5 networks with different random initializations were trained on CIFAR-10. Pairwise CCA distance was computed for members of each group. Groups of larger networks converged to more similar solutions than groups of smaller networks (a). Test accuracy was highly correlated with the degree of convergent similarity, as measured by CCA distance (b).

To evaluate the similarity of converged solutions, we then measured the pairwise projection weighted CCA distance for each layer among networks trained on unmodified CIFAR-10 ("Generalizing"), among networks trained on randomized-label CIFAR-10 ("Memorizing"), and between each pair of networks trained on unmodified and random 
For all analyses, the representation in a given layer was obtained by\naveraging across all spatial locations within each \ufb01lter.\nRemarkably, we found that not only do generalizing networks converge to more similar solutions\nthan memorizing networks (to be expected, since generalizing networks are more constrained), but\nmemorizing networks are as similar to each other as they are to a generalizing network. This result\nsuggests that the solutions found by memorizing networks were as diverse as those found across\nentirely different dataset labellings.\nWe also found that at early layers, all networks converged to equally similar solutions, regardless of\nwhether they generalize or memorize (Figure 3). Intuitively, this makes sense as the feature detectors\nfound in early layers of CNNs are likely required regardless of the dataset labelling. In contrast,\nhowever, at later layers, groups of generalizing networks converged to substantially more similar\nsolutions than groups of memorizing networks (Figure 3). Even among networks which generalize,\nthe CCA distance between solutions found in later layers was well above zero, suggesting that the\nsolutions found were quite diverse. At the softmax layer, sets of both generalizing and memorizing\nnetworks converged to highly similar solutions when CCA distance was computed based on training\ndata; when test data was used, however, only generalizing networks converged to similar softmax\noutputs (Figure A10), again re\ufb02ecting that each memorizing network memorizes the training data\nusing a different strategy.\nImportantly, because each network learned a different linear transform of a similar solution, tradi-\ntional distance metrics, such as cosine or Euclidean distance, were insuf\ufb01cient to reveal this differ-\nence (Figure A5). 
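The within-group versus between-group ("inter") comparisons used throughout this section can be organized as below. This is a generic bookkeeping sketch under stated assumptions: the group names are illustrative, and `distance_fn` is a pluggable callable that would be the projection weighted CCA distance in the paper's analyses.

```python
import numpy as np

def pairwise_group_distances(groups, distance_fn):
    """Mean pairwise distance within and between groups of networks.

    groups: dict mapping a group name to a list of (neurons, n) activation
            matrices (one per network, same layer, same input data).
    distance_fn: callable taking two activation matrices, returning a float.
    Returns a dict mapping (group_a, group_b) -> mean pairwise distance.
    """
    out = {}
    names = sorted(groups)
    for gi, a in enumerate(names):
        for b in names[gi:]:
            ds = []
            for i, X in enumerate(groups[a]):
                for j, Y in enumerate(groups[b]):
                    if a == b and j <= i:
                        continue  # each unordered pair once, no self-pairs
                    ds.append(distance_fn(X, Y))
            # Guard against singleton groups with no within-group pairs.
            out[(a, b)] = float(np.mean(ds)) if ds else float("nan")
    return out
```

With a real distance function, the "Generalizing"/"Generalizing", "Memorizing"/"Memorizing", and "Generalizing"/"Memorizing" entries would correspond to the three traces compared in Figure 3.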
Additionally, while unweighted CCA revealed the same broad pattern, it does not reveal that generalizing networks become more similar in the final two layers (Figure A9).

3.2 Wider networks converge to more similar solutions

In the model compression literature, it has been repeatedly noted that while networks are robust to the removal of a large fraction of their parameters (in some cases, as many as 90%), networks initialized and trained from the start with fewer parameters converge to poorer solutions than those derived from pruning a large network [8, 9, 6, 1, 20, 15]. Recently, [7] proposed the "lottery ticket hypothesis," which posits that larger networks are more likely to converge to good solutions because they are more likely to contain a sub-network with a "lucky" initialization. If this were true, we might expect that groups of larger networks are more likely to contain the same "lottery ticket" sub-network and are therefore more likely to converge to similar solutions than smaller networks.

To test this intuition, we trained groups of convolutional networks with increasing numbers of filters at each layer. We then used projection weighted CCA to measure the pairwise similarity between each group of networks of the same size. Consistent with our intuition, we found that larger networks converged to much more similar solutions than smaller networks (Figure 4a).^8 This is also consistent with the equivalence of deep networks to Gaussian processes (GPs) in the limit of infinite width [13, 17]. If each unit in a layer corresponds to a draw from a GP, then as the number of units increases, the CCA distance will go to zero.

^8 To control for variability in CCA distance due to comparisons across representations of different sizes, a random subset of 128 filters from the final layer was used for all network comparisons. This bias should, if anything, lead to an overestimate of the distance between groups of larger networks, as they are more heavily subsampled.

Interestingly, we also found that networks which converged to more similar solutions also achieved noticeably higher test accuracy. In fact, we found that across pairs of networks, the correlation between test accuracy and the pairwise CCA distance was -0.96 (Figure 4b), suggesting that the CCA distance between groups of identical networks with different random initializations (computed using the train data) may serve as a strong predictor of test accuracy. It may therefore enable accurate prediction of test performance without requiring the use of a validation set.

3.3 Across many initializations and learning rates, networks converge to discriminable clusters of solutions

Figure 5: CCA reveals clusters of converged solutions across networks with different random initializations and learning rates. 200 networks with identical topology and varying learning rates were trained on CIFAR-10. The CCA distance between the eighth layer of each pair of networks was computed, revealing five distinct subgroups of networks (a). These five subgroups align almost perfectly with the subgroups discovered in [21] (b; colors correspond to bars in a), despite the fact that the clusters in [21] were generated using robustness to cumulative ablation, an entirely separate metric.

Here, we ask whether networks trained on the same data with different initializations and learning rates converge to the same solutions. To test this, we measured the pairwise CCA distance between networks trained on unmodified CIFAR-10. Interestingly, when we plotted the pairwise distance matrix (Figure 5a), we observed a block diagonal structure consistent with five clusters of converged network solutions, with one cluster highly dissimilar to the other four. Despite the fact that these networks all achieved similar train loss (and many reached similar test accuracy as well), these clusters corresponded to the learning rate used to train each network. This result suggests that there exist multiple minima in the optimization landscape to which networks may converge, largely specified by the optimization parameters.

In [21], the authors also observed clusters of network solutions using the relationship between networks' robustness to cumulative deletion or "ablation" of filters and generalization error. To test whether the same clusters are found via these distinct approaches, we assigned a color to each cluster found using CCA (see bars on left and top in Figure 5a), and used these colors to identify the same networks in a plot of ablation robustness vs. generalization error (Figure 5b). Surprisingly, the clusters found using CCA aligned nearly perfectly with those observed using ablation robustness. This result suggests not only that networks with different learning rates converge to distinct clusters of solutions, but also that these clusters can be uncovered independently using multiple methods, each of which measures a different property of the learned solution. Moreover, analyzing these networks using traditional metrics, such as generalization error, would obscure the differences between many of these networks.

Figure 6: RNNs exhibit bottom-up learning dynamics. To test whether layers converge to their final representation over the course of training with a particular structure, we compared each layer's representation over the course of training to its final representation using CCA. 
In shallow RNNs trained on PTB (a), and\nWikiText-2 (b), we observed a clear bottom-up convergence pattern, in which early layers converge to their\n\ufb01nal representation before later layers. In deeper RNNs trained on WikiText-2, we observed a similar pattern\n(c). Importantly, the weighted mean reveals this effect much more accurately than the unweighted mean, which\nis also supported by control experiments (Figure A8) (d), revealing the importance of appropriate weighting of\nCCA coef\ufb01cients.\n\n4 CCA on Recurrent Neural Networks\n\nSo far, CCA has been used to study feedforward networks. We now use CCA to investigate RNNs.\nOur RNNs are LSTMs used for the Penn Treebank (PTB) and WikiText-2 (WT2) language mod-\nelling tasks, following the implementation in [18, 19].\nOne speci\ufb01c question we explore is whether the learning dynamics of RNNs mirror the \u201cbottom\nup\u201d convergence observed in the feedforward case in [22], as well as investigating whether CCA\nproduces qualitatively better outputs than other metrics. However, in the case of RNNs, there are two\npossible notions of \u201ctime\u201d. There is the training timestep, which affects the values of the weights,\nbut also a \u2018sequence timestep\u2019 \u2013 the number of tokens of the sequence that have been fed into the\nrecurrent net. 
This latter notion of time does not explicitly change the weights, but results in updated\nvalues of the cell state and hidden state of the network, which of course affect the representations of\nthe network.\nIn this work, we primarily focus on the training notion of time; however, we perform a preliminary\ninvestigation of the sequence notion of time as well, demonstrating that CCA is capable of \ufb01nd-\ning similarity across sequence timesteps which are missed by traditional metrics (Figures A2, A4),\nbut also that even CCA often fails to \ufb01nd similarity in the hidden state across sequence timesteps,\nsuggesting that representations over sequence timesteps are often not linearly similar (Figure A3).\n\n4.1 Learning Dynamics Through Training Time\n\nTo measure the convergence of representations through training time, we computed the projection\nweighted mean CCA value for each layer\u2019s representation throughout training to its \ufb01nal representa-\ntion. We observed bottom-up convergence in both Penn Treebank and WikiText-2 (Figure 6a-b). We\nrepeated these experiments with cosine and Euclidean distance (Figure A8), \ufb01nding that while these\nother metrics also reveal a bottom up convergence, the results with CCA highlight this phenomena\nmuch more clearly.\nWe also observed bottom-up convergence in a deeper LSTM trained on WikiText-2 (the larger\ndataset) (Figure 6c). Interestingly, we found that this result changes noticeably if we use the un-\nweighted mean CCA instead, demonstrating the importance of the weighting scheme (Figure 6d).\n\n5 Discussion and future work\n\nIn this study, we developed CCA as a tool to gain insights on many representational properties of\ndeep neural networks. We found that the representations in hidden layers of a neural network contain\nboth \u201csignal\u201d components, which are stable over training and correspond to performance curves,\nand an unstable \u201cnoise\u201d component. 
Using this insight, we proposed projection weighted CCA, adapting [22]. Leveraging the ability of CCA to compare across different networks, we investigated the properties of converged solutions of convolutional neural networks (Section 3), finding that networks which generalize converge to more similar solutions than those which memorize (Section 3.1), that wider networks converge to more similar solutions than narrow networks (Section 3.2), and that across otherwise identical networks with different random initializations and learning rates, networks converge to diverse clusters of solutions (Section 3.3). We also used projection weighted CCA to study the dynamics (both across training time and sequence steps) of RNNs (Section 4), finding that RNNs exhibit bottom-up convergence over the course of training (Section 4.1), and that across sequence timesteps, RNN representations vary nonlinearly (Section A.3).
One interesting direction for future work is to examine what is unique about directions which are preserved across networks trained with different initializations. Previous work has demonstrated that these directions are sufficient for the network computation [22], but the properties that make these directions special remain unknown. Furthermore, the attributes which specifically distinguish the diverse solutions found in Figure 5 remain unclear. We also observed that networks which converge to similar solutions exhibit higher generalization performance (Figure 4b). In future work, it would be interesting to explore whether this insight could be used as a regularizer to improve network performance.
Additionally, it would be useful to explore whether this result is consistent in RNNs as well as CNNs. Another interesting direction would be to investigate which aspects of the representations present in RNNs are stable over time and which aspects vary. Finally, in previous work [22], it was observed that fixing layers in CNNs over the course of training led to better test performance ("freeze training"). An interesting open question would be to investigate whether a similar training protocol could be adapted for RNNs.

Acknowledgments

We would like to thank Jascha Sohl-Dickstein for critical feedback on the manuscript, and Jason Yosinski, Jon Kleinberg, Martin Wattenberg, Neil Rabinowitz, Justin Gilmer, and Avraham Ruderman for helpful discussion.

References

[1] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. J. Emerg. Technol. Comput. Syst., 13(3):32:1–32:18, February 2017.

[2] Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML'17), June 2017.

[3] Maurice S. Bartlett. The statistical significance of canonical correlations. Biometrika, 32:29–37, 1941.

[4] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition, 2017.

[5] Manaal Faruqui and Chris Dyer. Improving vector space word representations using multilingual correlation. In Association for Computational Linguistics, 2014.

[6] Mikhail Figurnov, Aizhan Ibraimova, Dmitry P Vetrov, and Pushmeet Kohli.
PerforatedCNNs: Acceleration through elimination of redundant convolutions. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 947–955. Curran Associates, Inc., 2016.

[7] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Training pruned neural networks. CoRR, abs/1803.03635, 2018.

[8] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of the 4th International Conference on Learning Representations (ICLR'16), October 2015.

[9] Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. CoRR, abs/1506.02626, 2015.

[10] Harold Hotelling. Relations between two sets of variates. Biometrika, 28:321–337, 1936.

[11] Andrej Karpathy, Justin Johnson, and Li Fei-Fei. Visualizing and understanding recurrent networks. International Conference on Learning Representations Workshop, abs/1506.02078, 2016.

[12] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 598–605. Morgan-Kaufmann, 1990.

[13] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. In International Conference on Learning Representations (ICLR'18), 2018.

[14] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. In International Conference on Learning Representations, April 2018.

[15] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient ConvNets.
In International Conference on Learning Representations (ICLR'17), pages 1–10, 2017.

[16] Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John Hopcroft. Convergent learning: Do different neural networks learn the same representations? In Feature Extraction: Modern Questions and Challenges, pages 196–212, 2015.

[17] A. G. D. G. Matthews, J. Hron, M. Rowland, R. E. Turner, and Z. Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations (ICLR'18), 2018.

[18] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.

[19] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240, 2018.

[20] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In International Conference on Learning Representations (ICLR'17), November 2016.

[21] Ari S. Morcos, David G. T. Barrett, Neil C. Rabinowitz, and Matthew Botvinick. On the importance of single directions for generalization. In International Conference on Learning Representations (ICLR'18), 2018.

[22] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, 2017.

[23] David Sussillo, Mark M Churchland, Matthew T Kaufman, and Krishna V Shenoy. A neural network that finds a naturalistic solution for the production of muscle activity. Nature Neuroscience, 18(7):1025–1033, 2015.

[24] Viivi Uurtio, João M. Monteiro, Jaz Kandola, John Shawe-Taylor, Delmiro Fernandez-Reyes, and Juho Rousu. A tutorial on canonical correlation methods.
ACM Comput. Surv., 50(6):95:1–95:33, November 2017.

[25] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. In Deep Learning Workshop, International Conference on Machine Learning (ICML), 2015.

[26] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.

[27] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. International Conference on Learning Representations (ICLR'17), abs/1611.03530, 2017.