{"title": "Source Separation as a By-Product of Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 459, "page_last": 465, "abstract": null, "full_text": "Source Separation as a \n\nBy-Product of Regularization \n\nSepp Hochreiter \n\nFakultat fur lnformatik \n\nTechnische Universitat Munchen \n\n80290 M unchen, Germany \n\nJ urgen Schmidhuber \n\nIDSIA \n\nCorso Elvezia 36 \n\n6900 Lugano, Switzerland \n\nhochreit~informatik.tu-muenchen.de \n\njuergen~idsia.ch \n\nAbstract \n\nThis paper reveals a previously ignored connection between two \nimportant fields: regularization and independent component anal(cid:173)\nysis (ICA). We show that at least one representative of a broad \nclass of algorithms (regularizers that reduce network complexity) \nextracts independent features as a by-product. This algorithm is \nFlat Minimum Search (FMS), a recent general method for finding \nlow-complexity networks with high generalization capability. FMS \nworks by minimizing both training error and required weight pre(cid:173)\ncision. According to our theoretical analysis the hidden layer of \nan FMS-trained autoassociator attempts at coding each input by \na sparse code with as few simple features as possible. In experi(cid:173)\nments the method extracts optimal codes for difficult versions of \nthe \"noisy bars\" benchmark problem by separating the underlying \nsources, whereas ICA and PCA fail. Real world images are coded \nwith fewer bits per pixel than by ICA or PCA. \n\n1 \n\nINTRODUCTION \n\nIn the field of unsupervised learning several information-theoretic objective func(cid:173)\ntions (OFs) have been proposed to evaluate the quality of sensory codes. Most OFs \nfocus on properties of the code components - we refer to them as code component(cid:173)\noriented OFs, or COCOFs. 
Some COCOFs explicitly favor near-factorial, minimally redundant codes of the input data [2, 17, 23, 7, 24], while others favor local codes [22, 3, 15]. Recently there has also been much work on COCOFs encouraging biologically plausible sparse distributed codes [19, 9, 25, 8, 6, 21, 11, 16]. \n\nWhile COCOFs express desirable properties of the code itself, they neglect the costs of constructing the code from the data. \n\n\f460 \n\nS. Hochreiter and J. Schmidhuber \n\nE.g., coding input data without redundancy may be very expensive in terms of the information required to describe the code-generating network, which may need many finely tuned free parameters. We believe that one of sensory coding's objectives should be to reduce the cost of code generation through data transformations, and postulate that an important scarce resource is the number of bits required to describe the mappings that generate and process the codes. \n\nHence we shift the point of view and focus on the information-theoretic costs of code generation. We use a novel approach to unsupervised learning called \"low-complexity coding and decoding\" (LOCOCODE [14]). Without assuming particular goals such as data compression, subsequent classification, etc., but in the spirit of research on minimum description length (MDL), LOCOCODE generates so-called lococodes that (1) convey information about the input data, (2) can be computed from the data by a low-complexity mapping (LCM), and (3) can be decoded by an LCM. We will see that by minimizing coding/decoding costs LOCOCODE can yield efficient, robust, noise-tolerant mappings for processing inputs and codes. \n\nLococodes through regularizers. To implement LOCOCODE we apply regularization to an autoassociator (AA) whose hidden layer activations represent the code. 
The hidden layer is forced to code information about the input data by minimizing the training error; the regularizer reduces coding/decoding costs. Our regularizer of choice will be Flat Minimum Search (FMS) [13]. \n\n2 FLAT MINIMUM SEARCH: REVIEW AND ANALYSIS \n\nFMS is a general gradient-based method for finding low-complexity networks with high generalization capability. FMS finds a large region in weight space such that each weight vector from that region has similar small error. Such regions are called \"flat minima\". In MDL terminology, few bits of information are required to pick a weight vector in a \"flat\" minimum (corresponding to a low-complexity network): the weights may be given with low precision. FMS automatically prunes weights and units, and reduces output sensitivity with respect to the remaining weights and units. Previous FMS applications focused on supervised learning [12, 13]. \n\nNotation. Let O, H, I denote index sets for output, hidden, and input units, respectively. For l ∈ O ∪ H, the activation y^l of unit l is y^l = f(s_l), where s_l = Σ_m w_lm y^m is the net input of unit l (m ∈ H for l ∈ O and m ∈ I for l ∈ H), w_lm denotes the weight on the connection from unit m to unit l, f denotes the activation function, and for m ∈ I, y^m denotes the m-th component of an input vector. W = |(O × H) ∪ (H × I)| is the number of weights. \n\nAlgorithm. FMS' objective function E features an unconventional error term: \n\nB = Σ_{i,j: i ∈ O∪H} log Σ_{k ∈ O} (∂y^k / ∂w_ij)^2 + W log Σ_{k ∈ O} ( Σ_{i,j: i ∈ O∪H} |∂y^k / ∂w_ij| / √(Σ_{k ∈ O} (∂y^k / ∂w_ij)^2) )^2 . \n\nE = E_q + λB is minimized by gradient descent, where E_q is the training set mean squared error (MSE), and λ a positive \"regularization constant\" scaling B's influence. Choosing λ corresponds to choosing a tolerable error level (there is no a priori \"optimal\" way of doing so). B measures the weight precision (the number of bits needed to describe all weights in the net). 
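As an illustrative sketch (not the authors' gradient-based implementation [13], which differentiates B analytically), B can be evaluated for a toy two-layer linear net by estimating each output sensitivity ∂y^k/∂w_ij with finite differences; the network sizes, weights, and helper names below are arbitrary choices for illustration:

```python
import math

def forward(W1, W2, x):
    # toy linear net: hidden h = W1 x, output y = W2 h
    h = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    return [sum(w * hi for w, hi in zip(row, h)) for row in W2]

def sensitivities(W1, W2, x, eps=1e-6):
    # finite-difference estimate of dy^k/dw for every weight in both layers
    base = forward(W1, W2, x)
    sens = []                      # one row per weight, one entry per output k
    for W in (W1, W2):
        for i in range(len(W)):
            for j in range(len(W[i])):
                W[i][j] += eps
                pert = forward(W1, W2, x)
                W[i][j] -= eps
                sens.append([(p - b) / eps for p, b in zip(pert, base)])
    return sens

def flat_minimum_penalty(W1, W2, x):
    S = sensitivities(W1, W2, x)
    n_weights = len(S)
    # first term: sum_{ij} log sum_k (dy^k/dw_ij)^2  -- drives sensitivities down
    t1 = sum(math.log(sum(d * d for d in row)) for row in S)
    # second term: W log sum_k (sum_{ij} |dy^k/dw_ij| / sqrt(sum_k (dy^k/dw_ij)^2))^2
    t2 = n_weights * math.log(sum(
        sum(abs(row[k]) / math.sqrt(sum(d * d for d in row)) for row in S) ** 2
        for k in range(len(S[0]))))
    return t1 + t2
```

For this linear net, uniformly shrinking the first-layer weights shrinks the output sensitivities of the second-layer weights while leaving the per-weight normalized ratios in the second term unchanged, so the penalty decreases - the direction in which FMS prunes.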
Given a constant number of output units, FMS can be implemented efficiently, namely, with standard backprop's order of computational complexity [13]. \n\n2.1 FMS: A Novel Analysis \n\nSimple basis functions (BFs). A BF is the function determining the activation of a code component in response to a given input. Minimizing B's term \n\nT1 := Σ_{i,j: i ∈ O∪H} log Σ_{k ∈ O} (∂y^k / ∂w_ij)^2 \n\nobviously reduces output sensitivity with respect to weights (and therefore units). T1 is responsible for pruning weights (and, therefore, units). T1 is one reason why low-complexity (or simple) BFs are preferred: weight precision (or complexity) is mainly determined by ∂y^k / ∂w_ij. \n\nSparseness. Because T1 tends to make unit activations decrease to zero, it favors sparse codes. But T1 also favors a sparse hidden layer in the sense that few hidden units contribute to producing the output. B's second term \n\nT2 := W log Σ_{k ∈ O} ( Σ_{i,j: i ∈ O∪H} |∂y^k / ∂w_ij| / √(Σ_{k ∈ O} (∂y^k / ∂w_ij)^2) )^2 \n\npunishes units with similar influence on the output. We reformulate it: \n\nT2 = W log Σ_{k ∈ O} Σ_{i,j: i ∈ O∪H} Σ_{u,v: u ∈ O∪H} ( |∂y^k / ∂w_ij| |∂y^k / ∂w_uv| ) / ( √(Σ_{k ∈ O} (∂y^k / ∂w_ij)^2) √(Σ_{k ∈ O} (∂y^k / ∂w_uv)^2) ) . \n\nSee the intermediate steps in [14]. We observe: (1) an output unit that is very sensitive with respect to two given hidden units will heavily contribute to T2 (compare the numerator in the last term of T2). (2) This large contribution can be reduced by making both hidden units have large impact on other output units (see the denominator in the last term of T2). \n\nFew separated basis functions. Hence FMS tries to figure out a way of using (1) as few BFs as possible for determining the activation of each output unit, while simultaneously (2) using the same BFs for determining the activations of as many output units as possible (common BFs). 
(1) and T1 separate the BFs: the force towards simplicity (see T1) prevents input information from being channelled through a single BF; the force towards few BFs per output makes them non-redundant. (1) and (2) cause few BFs to determine all outputs. \n\nSummary. Collectively T1 and T2 (which make up B) encourage sparse codes based on few separated simple basis functions producing all outputs. Due to space limitations, a more detailed analysis (e.g., of linear output activation) had to be left to a TR [14] (on the WWW). \n\n\f462 \n\nS. Hochreiter and J. Schmidhuber \n\n3 EXPERIMENTS \n\nWe compare LOCOCODE to \"independent component analysis\" (ICA, e.g., [5, 1, 4, 18]) and \"principal component analysis\" (PCA, e.g., [20]). ICA is realized by Cardoso's JADE algorithm, which is based on whitening and subsequent joint diagonalization of 4th-order cumulant matrices. To measure the information conveyed by the resulting codes we train a standard backprop net on the training set used for code generation. Its inputs are the code components; its task is to reconstruct the original input. The test set consists of 500 off-training-set exemplars (in the case of real-world images we use a separate test image). Coding efficiency is the average number of bits needed to code a test set input pixel. The code components are scaled to the interval [0, 1] and partitioned into discrete intervals. Assuming independence of the code components, we estimate the probability of each discrete code value by Monte Carlo sampling on the training set. To obtain the test set codes' bits per pixel (Shannon's optimal value), the average sum of all negative logarithms of the code component probabilities is divided by the number of input components. All details necessary for reimplementation are given in [14]. \n\nNoisy bars (adapted from [10, 11]). The input is a 5 x 5 pixel grid with horizontal and vertical bars at random positions. 
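A minimal generator for such inputs might look as follows (a sketch following the benchmark description in this section; the appearance probability p is left as a parameter, and the rule that overlapping bars keep the maximum intensity is our assumption, since the mixing rule is not spelled out here):

```python
import random

def noisy_bars(rng, p=0.2, noise_var=0.05):
    """One 5x5 'noisy bars' input: each of the 10 possible bars (5 horizontal,
    5 vertical) appears independently with probability p; a present bar's
    pixels take a random intensity in [0.1, 0.5], background pixels are -0.5,
    and zero-mean Gaussian noise is added to every pixel. Overlapping bars
    keep the maximum intensity (an assumption)."""
    grid = [[-0.5] * 5 for _ in range(5)]
    for r in range(5):                          # horizontal bars
        if rng.random() < p:
            v = rng.uniform(0.1, 0.5)
            for c in range(5):
                grid[r][c] = max(grid[r][c], v)
    for c in range(5):                          # vertical bars (mixing allowed)
        if rng.random() < p:
            v = rng.uniform(0.1, 0.5)
            for r in range(5):
                grid[r][c] = max(grid[r][c], v)
    sd = noise_var ** 0.5
    return [[pix + rng.gauss(0.0, sd) for pix in row] for row in grid]
```

Calling noisy_bars(random.Random(0)) returns one 5 x 5 list of noisy pixel intensities.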
The task is to extract the independent features (the bars). Each of the 10 possible bars appears with probability 1/5. In contrast to [10, 11] we allow for bar type mixing - this makes the task harder. Bar intensities vary in [0.1, 0.5]; input units that see a pixel of a bar are activated correspondingly, others adopt activation -0.5. We add Gaussian noise with variance 0.05 and mean 0 to each pixel. For ICA and PCA we have to provide information about the number (ten) of independent sources (tests with n assumed sources will be denoted by ICA-n and PCA-n). LOCOCODE does not require this - using 25 hidden units (HUs) we expect LOCOCODE to prune the 15 superfluous HUs. \n\nResults. See Table 1. While the reconstruction errors of all methods are similar, LOCOCODE has the best coding efficiency. 15 of the 25 HUs are indeed automatically pruned: LOCOCODE finds an optimal factorial code which exactly mirrors the pattern generation process. PCA codes and ICA-15 codes, however, are unstructured and dense. While ICA-10 codes are almost sparse and do recognize some sources, the sources are not clearly separated as with LOCOCODE - compare the weight patterns shown in [14]. \n\nReal-world images. Now we use more realistic input data, namely subsections of: 1) an aerial shot of a village, 2) an image of wood cells, and 3) an image of a striped piece of wood. Each image has 150 x 150 pixels, each taking on one of 256 gray levels. 7 x 7 (5 x 5 for the village) pixel subsections are randomly chosen as training inputs. Test sets stem from images similar to 1), 2), and 3). \n\nResults. For the village image LOCOCODE discovers on-center-off-surround hidden units forming a sparse code. For the other two images LOCOCODE also finds appropriate feature detectors; using compact, low-complexity features, it always codes more efficiently than ICA and PCA - see the weight patterns shown in [14]. 
Using its compact, \n\n\fSource Separation as a By-Product of Regularization \n\n463 \n\nexpo \n\n5 x 5 \n\n5 x 5 \n\ninput meth. \nfield \nbars \n5x5 LOC \nbars \nlCA \n5 x 5 \nbars \nPCA \n5 x 5 \nbars \nlCA \n5x5 \nbars \nPCA \n5 x 5 \nvillage 5x5 LOC \nvillage \nlCA \nvillage 5x5 PCA \nlCA \nvillage \nvillage 5x5 PCA \nvillage 7x7 LOC \nvillage 7x7 \nlCA \nvillage 7x7 PCA \nvillage 7x7 \nlCA \nvillage 7x7 PCA \n7x7 LOC \ncell \nlCA \ncell \n7x7 \n7x7 PCA \ncell \nlCA \ncell \n7x7 \ncell \n7x7 PCA \npiece 7x7 LOC \npiece 7x7 \nlCA \npiece 7x7 PCA \npiece 7x7 \nlCA \npiece 7x7 PCA \n\nnum. \ncamp. \n\n10 \n10 \n10 \n15 \n15 \n8 \n8 \n8 \n10 \n10 \n10 \n10 \n10 \n15 \n15 \n11 \n11 \n11 \n15 \n15 \n4 \n4 \n4 \n10 \n10 \n\nrec. \nerror \n1.05 \n1.02 \n1.03 \n0.71 \n0.72 \n1.05 \n1.04 \n1.04 \n1.11 \n0.97 \n8.29 \n7.90 \n9.21 \n6.57 \n8.03 \n0.840 \n0.871 \n0.722 \n0.360 \n0.329 \n0.831 \n0.856 \n0.830 \n0.716 \n0.534 \n\ncode \ntype \nsparse \nsparse \ndense \ndense \ndense \nsparse \nsparse \ndense \nsparse \ndense \nsparse \ndense \ndense \ndense \ndense \nsparse \nsparse \nsparse \nsparse \ndense \nsparse \nsparse \nsparse \nsparse \nsparse \n\n20 \n\n50 \n\nbits per pixel: # intervals \n10 \n100 \n1.367 \n1.678 \n1.655 \n2.502 \n2.469 \n1.068 \n1.165 \n1.098 \n1.495 \n1.355 \n0.688 \n0.796 \n0.795 \n1.198 \n1.189 \n0.961 \n0.983 \n0.960 \n1.315 \n1.283 \n0.392 \n0.400 \n0.397 \n1.004 \n0.908 \n\n1.163 \n1.446 \n1.418 \n2.142 \n2.108 \n0.895 \n0.978 \n0.916 \n1.273 \n1.123 \n0.547 \n0.652 \n0.648 \n0.981 \n0.972 \n0.814 \n0.829 \n0.811 \n1.099 \n1.073 \n0.347 \n0.352 \n0.348 \n0.878 \n0.775 \n\n0.836 \n1.086 \n1.062 \n1.604 \n1.584 \n0.622 \n0.710 \n0.663 \n0.934 \n0.807 \n0.368 \n0.463 \n0.461 \n0.694 \n0.690 \n0.611 \n0.622 \n0.610 \n0.818 \n0.798 \n0.269 \n0.276 \n0.269 \n0.697 \n0.590 \n\n0.584 \n0.811 \n0.796 \n1.189 \n1.174 \n0.436 \n0.520 \n0.474 \n0.679 \n0.578 \n0.250 \n0.318 \n0.315 \n0.477 \n0.474 \n0.457 \n0.468 \n0.452 \n0.609 \n0.581 \n0.207 \n0.207 
\n0.207 \n0.535 \n0.448 \n\nTable 1: Overview of experiments: name of experiment, input field size, coding \nmethod, number of relevant code components (code size), reconstruction error, na(cid:173)\nture of code observed on the test set. PCA's and ICA 's code sizes need to be pre wired. \nLOCOCODE's, however, are found automatically (we always start with 25 HUs). The \nfinal 4 columns show the coding efficiency measured in bits per pixel, assuming the \nreal-valued HU activations are partitioned into 10, 20, 50, and 100 discrete inter(cid:173)\nvals. LOCOCODE codes most effiCiently. \n\n4 CONCLUSION \n\nAccording to our analysis LOCOCODE attempts to describe single inputs with as few \nand as simple features as possible. Given the statistical properties of many visual \ninputs (with few defining features), this typically results in sparse codes. Unlike \nobjective functions of previous methods, however, LOCOCODE's does not contain \nan explicit term enforcing, say, sparse codes -\nsparseness or independence are not \nviewed as a good things a priori. Instead we focus on the information-theoretic \ncomplexity of the mappings used for coding and decoding. The resulting codes \ntypically compromise between conflicting goals. They tend to be sparse and exhibit \nlow but not minimal redundancy -\nif the cost of minimal redundancy is too high. \n\nOur results suggest that LOCOCODE'S objective may embody a general principle of \nunsupervised learning going beyond previous, more specialized ones. We see that \nthere is at least one representative (FMS) of a broad class of algorithms (regularizers \nthat reduce network complexity) which (1) can do optimal feature extraction as a \nby-product, (2) outperforms traditional ICA and PCA on visual source separation \ntasks, and (3) unlike ICA does not even need to know the number of independent \nsources in advance. This reveals an interesting, previously ignored connection be-\n\n\f464 \n\nS. Hochreiter and J. 
Schmidhuber \n\ntween regularization and ICA, and may represent a first step towards unification of \nregularization and unsupervised learning. \nMore. Due to space limitations, much additional theoretical and experimental \nanalysis had to be left to a tech report (29 pages, 20 figures) on the WWW: see \n[14]. \n\nAcknowledgments. This work was supported by DFG grant SCHM 942/3-1 and \nDFG grant BR 609/10-2 from \"Deutsche Forschungsgemeinschaft\". \n\nReferences \n\n[1] S. Amari, A. Cichocki, and H.H. Yang. A new learning algorithm for blind \nsignal separation. In David S. Touretzky, Michael C. Mozer, and Michael E. \nHasselmo, editors, Advances in Neural Information Processing Systems 8, pages \n757-763. The MIT Press, Cambridge, MA, 1996. \n\n[2] H. B. Barlow, T. P. Kaushal, and G. J. Mitchison. Finding minimum entropy \n\ncodes. Neural Computation, 1(3):412- 423, 1989. \n\n[3] H. G. Barrow. Learning receptive fields . In Proceedings of the IEEE 1st Annual \n\nConference on Neural Networks, volume IV, pages 115- 121. IEEE, 1987. \n\n[4] A. J. Bell and T. J . Sejnowski. An information-maximization approach to \nblind separation and blind deconvolution. Neural Computation, 7(6):1129-\n1159,1995. \n\n[5] J.-F. Cardoso and A. Souloumiac. Blind beamforming for non Gaussian signals. \n\nlEE Proceedings-F, 140(6):362- 370, 1993. \n\n[6] P. Dayan and R. Zemel. Competition and multiple cause models. Neural \n\nComputation, 7:565- 579, 1995. \n\n[7] G. Deco and L. Parra. Nonlinear features extraction by unsupervised redun(cid:173)\n\ndancy reduction with a stochastic neural network. Technical report, Siemens \nAG, ZFE ST SN 41, 1994. \n\n[8] D. J . Field. What is the goal of sensory coding? Neural Computation, 6:559-\n\n601, 1994. \n\n[9] P. Foldilik and M. P. Young. Sparse coding in the primate cortex. In M. A. \nArbib, editor, The Handbook of Brain Theory and Neural Networks, pages 895-\n898. The MIT Press, Cambridge, Massachusetts, 1995. \n\n[10] G. E. Hinton, P. 
Dayan, B. J. Frey, and R. M. Neal. The wake-sleep algorithm for unsupervised neural networks. Science, 268:1158-1161, 1995. \n\n[11] G. E. Hinton and Z. Ghahramani. Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society B, 352:1177-1190, 1997. \n\n[12] S. Hochreiter and J. Schmidhuber. Simplifying nets by discovering flat minima. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7, pages 529-536. MIT Press, Cambridge, MA, 1995. \n\n[13] S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1-42, 1997. \n\n[14] S. Hochreiter and J. Schmidhuber. LOCOCODE. Technical Report FKI-222-97, Revised Version, Fakultät für Informatik, Technische Universität München, 1998. \n\n[15] T. Kohonen. Self-Organization and Associative Memory. Springer, second edition, 1988. \n\n[16] M. S. Lewicki and B. A. Olshausen. Inferring sparse, overcomplete image codes using an efficient coding framework. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems 10, 1998. To appear. \n\n[17] R. Linsker. Self-organization in a perceptual network. IEEE Computer, 21:105-117, 1988. \n\n[18] L. Molgedey and H. G. Schuster. Separation of independent signals using time-delayed correlations. Physical Review Letters, 72(23):3634-3637, 1994. \n\n[19] M. C. Mozer. Discovering discrete distributed representations with iterative competitive learning. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 627-634. San Mateo, CA: Morgan Kaufmann, 1991. \n\n[20] E. Oja. Neural networks, principal components, and subspaces. International Journal of Neural Systems, 1(1):61-68, 1989. \n\n[21] B. A. Olshausen and D. J. Field. 
Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607-609, 1996. \n\n[22] D. E. Rumelhart and D. Zipser. Feature discovery by competitive learning. In Parallel Distributed Processing, pages 151-193. MIT Press, 1986. \n\n[23] J. Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863-879, 1992. \n\n[24] S. Watanabe. Pattern Recognition: Human and Mechanical. Wiley, New York, 1985. \n\n[25] R. S. Zemel and G. E. Hinton. Developing population codes by minimizing description length. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 11-18. San Mateo, CA: Morgan Kaufmann, 1994. \n", "award": [], "sourceid": 1619, "authors": [{"given_name": "Sepp", "family_name": "Hochreiter", "institution": null}, {"given_name": "J\u00fcrgen", "family_name": "Schmidhuber", "institution": null}]}