Greedy Layer-Wise Training of Deep Networks

Part of Advances in Neural Information Processing Systems 19 (NIPS 2006)


Authors

Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle

Abstract

Recent analyses (Bengio, Delalleau, & Le Roux, 2006; Bengio & Le Cun, 2007) of modern non-parametric machine learning algorithms that are kernel machines, such as Support Vector Machines (SVMs) and graph-based manifold and semi-supervised learning algorithms, suggest fundamental limitations of some learning algorithms. The problem is clear in kernel-based approaches when the kernel is "local" (e.g., the Gaussian kernel), i.e., K(x, y) converges to a constant as ||x - y|| increases. These analyses point to the difficulty of learning "highly-varying functions", i.e., functions that have a large number of "variations" in the domain of interest, e.g., they would require a large number of pieces to be well represented by a piecewise-linear approximation. Since the number of pieces can be made to grow exponentially with the number of factors of variation in the input, this is connected with the well-known curse of dimensionality for classical non-parametric learning algorithms (for regression, classification and density estimation). If the shapes of all these pieces are unrelated, one needs enough examples for each piece in order to generalize properly. However, if these shapes are related and can be predicted from each other, "non-local" learning algorithms have the potential to generalize to pieces not covered by the training set. Such ability would seem necessary for learning in complex domains such as Artificial Intelligence tasks (e.g., related to vision, language, speech, robotics).

Kernel machines (not only those with a local kernel) have a shallow architecture, i.e., only two levels of data-dependent computational elements. This is also true of feedforward neural networks with a single hidden layer (which can become SVMs when the number of hidden units becomes large (Bengio, Le Roux, Vincent, Delalleau, & Marcotte, 2006)). A serious problem with shallow architectures is that they can be very inefficient in terms of the number of computational units (e.g., bases, hidden units), and thus in terms of required examples (Bengio & Le Cun, 2007). One way to represent a highly-varying function compactly (with few parameters) is through the composition of many non-linearities, i.e., with a deep architecture. For example, the parity function with d inputs requires O(2^d) examples and parameters to be represented by a Gaussian SVM (Bengio et al., 2006), O(d^2) parameters for a one-hidden-layer neural network, O(d) parameters and units for a multi-layer network with O(log_2 d) layers, and O(1) parameters with a recurrent neural network. More generally, Boolean functions (such as the function that computes the multiplication of two numbers from their d-bit representation) expressible by O(log d) layers of combinatorial logic with O(d) elements in each layer may require O(2^d) elements when expressed with only 2 layers (Utgoff & Stracuzzi, 2002; Bengio & Le Cun, 2007). When the representation of a concept requires an exponential number of elements, e.g., with a shallow circuit, the number of training examples required to learn the concept may also become impractically large. Formal analyses of the computational complexity of shallow circuits can be found in (Hastad, 1987) or (Allender, 1996). They point in the same direction: shallow circuits are much less expressive than deep ones.

However, until recently, it was believed too difficult to train deep multi-layer neural networks. Empirically, deep networks were generally found to be no better, and often worse, than neural networks with one or two hidden layers (Tesauro, 1992). As this is a negative result, it has not been much reported in the machine learning literature. A reasonable explanation is that gradient-based optimization starting from random initialization may get stuck near poor solutions. An approach that has been explored with some success in the past is based on constructively adding layers. This was previously done using a supervised criterion at each stage (Fahlman & Lebiere, 1990; Lengelle & Denoeux, 1996). Hinton, Osindero, and Teh (2006) recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. The training strategy for such networks may hold great promise as a principle to help address the problem of training deep networks. Upper layers of a DBN are supposed to represent more "abstract" concepts that explain the input observation x, whereas lower layers extract "low-level features" from x: the network learns simpler concepts first, and builds on them to learn more abstract ones. This strategy, studied in detail here, has not yet been much exploited in machine learning. We hypothesize that three aspects of this strategy are particularly important: first, pre-training one layer at a time in a greedy way; second, using unsupervised learning at each layer in order to preserve information from the input; and finally, fine-tuning the whole network with respect to the ultimate criterion of interest.

We first extend DBNs and their component layers, Restricted Boltzmann Machines (RBM), so that they can more naturally handle continuous-valued inputs. Second, we perform experiments to better understand the advantage brought by the greedy layer-wise unsupervised learning. The basic question to answer is whether or not this approach helps to solve a difficult optimization problem. In DBNs, RBMs are used as building blocks, but applying the same strategy with auto-encoders yielded similar results. Finally, we discuss a problem that arises with the layer-wise greedy unsupervised procedure when the input distribution is not informative enough about the conditional distribution of the target given the input. We evaluate a simple and successful solution to this problem.
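The three-stage recipe just described (greedy layer-wise pre-training, an unsupervised criterion at each layer, global supervised fine-tuning) can be made concrete with a short sketch. The code below is not the authors' implementation: it substitutes a tied-weight sigmoid autoencoder for the RBM at each layer (a variant the text above reports behaving similarly), and all hyper-parameters, layer sizes, and the toy data are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class Layer:
    """One hidden layer, pre-trained as a tied-weight sigmoid autoencoder."""
    def __init__(self, n_in, n_out):
        self.W = rng.normal(0.0, 0.1, (n_in, n_out))
        self.b = np.zeros(n_out)   # hidden bias
        self.c = np.zeros(n_in)    # reconstruction bias (used only during pre-training)

    def forward(self, x):
        return sigmoid(x @ self.W + self.b)

    def pretrain_step(self, x, lr=0.1):
        # One gradient step on the squared reconstruction error x -> h -> x_hat.
        h = self.forward(x)
        x_hat = sigmoid(h @ self.W.T + self.c)
        d_out = (x_hat - x) * x_hat * (1 - x_hat)      # error at the reconstruction
        d_h = (d_out @ self.W) * h * (1 - h)           # error at the hidden layer
        n = len(x)
        self.W -= lr * (x.T @ d_h + d_out.T @ h) / n   # tied weights: encoder + decoder terms
        self.b -= lr * d_h.sum(axis=0) / n
        self.c -= lr * d_out.sum(axis=0) / n

def pretrain_greedy(layers, X, epochs=100, lr=0.1):
    """Stages 1-2: train each layer unsupervised on the representation
    produced by the (frozen) layers below it."""
    rep = X
    for layer in layers:
        for _ in range(epochs):
            layer.pretrain_step(rep, lr)
        rep = layer.forward(rep)   # feed the next layer
    return layers

def finetune(layers, W_out, b_out, X, y, epochs=200, lr=0.1):
    """Stage 3: supervised fine-tuning of the whole stack by backpropagating
    the cross-entropy of a logistic output unit (binary labels y in {0, 1})."""
    for _ in range(epochs):
        acts = [X]
        for layer in layers:
            acts.append(layer.forward(acts[-1]))
        p = sigmoid(acts[-1] @ W_out + b_out)
        delta = (p - y) / len(X)                       # gradient at the output pre-activation
        d = delta @ W_out.T
        W_out -= lr * acts[-1].T @ delta
        b_out -= lr * delta.sum(axis=0)
        for layer, h, h_in in zip(reversed(layers), reversed(acts[1:]), reversed(acts[:-1])):
            d = d * h * (1 - h)
            d_prev = d @ layer.W.T                     # propagate before updating the weights
            layer.W -= lr * h_in.T @ d
            layer.b -= lr * d.sum(axis=0)
            d = d_prev
    return W_out, b_out

# Toy usage (shapes and data are illustrative): 200 examples, 20 inputs, two hidden layers.
X = rng.random((200, 20))
y = (X[:, :1] + X[:, 1:2] > 1.0).astype(float)
layers = pretrain_greedy([Layer(20, 15), Layer(15, 10)], X)
W_out, b_out = finetune(layers, rng.normal(0.0, 0.1, (10, 1)), np.zeros(1), X, y)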

2 Deep Belief Nets

Let x be the input, and g^i the hidden variables at layer i, with joint distribution

P(x, g^1, g^2, ..., g^ℓ) = P(x | g^1) P(g^1 | g^2) ··· P(g^{ℓ-2} | g^{ℓ-1}) P(g^{ℓ-1}, g^ℓ)
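To make this factorization concrete, here is a minimal numpy sketch (not from the paper) of ancestral sampling from such a model: a draw from the top-level joint P(g^{ℓ-1}, g^ℓ), which a DBN models with an RBM, is approximated by a short Gibbs chain, and each lower layer is then sampled from P(g^i | g^{i+1}), assumed here to be a factorized sigmoid conditional; all weights, biases, and layer sizes are placeholders.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

def sample_dbn(top_W, down_weights, down_biases, gibbs_steps=100):
    """Ancestral sample following
    P(x, g^1, ..., g^l) = P(x|g^1) P(g^1|g^2) ··· P(g^{l-2}|g^{l-1}) P(g^{l-1}, g^l).
    `top_W` parameterizes the top-level joint P(g^{l-1}, g^l) (an RBM, biases
    omitted for brevity, approximately sampled with a short Gibbs chain);
    `down_weights[i]`, `down_biases[i]` parameterize a factorized sigmoid
    conditional going from layer i+1 down to layer i (layer 0 is x)."""
    n_lower, n_top = top_W.shape
    g_top = bernoulli(np.full(n_top, 0.5))
    for _ in range(gibbs_steps):                     # alternating Gibbs on the top pair
        g_lower = bernoulli(sigmoid(top_W @ g_top))
        g_top = bernoulli(sigmoid(top_W.T @ g_lower))
    h = g_lower                                      # a sample of g^{l-1}
    for W, b in zip(reversed(down_weights), reversed(down_biases)):
        h = bernoulli(sigmoid(W @ h + b))            # sample g^i given g^{i+1}, then x given g^1
    return h                                         # a sample of x

# Placeholder sizes: x has 6 units, hidden layers g^1, g^2, g^3 have 4, 3, 2 units.
down_weights = [rng.normal(0.0, 0.1, (6, 4)), rng.normal(0.0, 0.1, (4, 3))]
down_biases = [np.zeros(6), np.zeros(4)]
top_W = rng.normal(0.0, 0.1, (3, 2))                 # joint over (g^2, g^3)
x_sample = sample_dbn(top_W, down_weights, down_biases)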