On the Optimality of Incremental Neural Network Algorithms

Part of Advances in Neural Information Processing Systems 11 (NIPS 1998)


Authors

Ron Meir, Vitaly Maiorov

Abstract

We study the approximation of functions by two-layer feedforward neural networks, focusing on incremental algorithms which greedily add units, estimating single unit parameters at each stage. As opposed to standard algorithms for fixed architectures, the optimization at each stage is performed over a small number of parameters, mitigating many of the difficult numerical problems inherent in high-dimensional non-linear optimization. We establish upper bounds on the error incurred by the algorithm when approximating functions from the Sobolev class, thereby extending previous results which only provided rates of convergence for functions in certain convex hulls of functional spaces. By comparing our results to recently derived lower bounds, we show that the greedy algorithms are nearly optimal. Combined with estimation error results for greedy algorithms, a strong case can be made for this type of approach.

1 Introduction and background

A major problem in the application of neural networks to real-world problems is the excessively long time required for training large networks of a fixed architecture. Moreover, theoretical results establish the intractability of such training in the worst case [9], [4]. Additionally, the problem of determining the architecture and size of the network required to solve a certain task is left open. Due to these problems, several authors have considered incremental algorithms which construct the network by adding hidden units one at a time, estimating each unit's parameters as it is added. These approaches possess two desirable attributes: first, the optimization is done step-wise, so that only a small number of parameters need to be optimized at each stage; and second, the structure of the network is established concomitantly with the learning, rather than being specified in advance.

* This work was supported in part by a grant from the Israel Science Foundation.
† The author was partially supported by the Center for Absorption in Science, Ministry of Immigrant Absorption, State of Israel.


However, until recently these algorithms have been rather heuristic in nature, as no guaranteed performance bounds had been established. Note that while there has been a recent surge of interest in these types of algorithms, they in fact date back to work done in the early seventies (see [3] for a historical survey).
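To make the step-wise scheme described above concrete, the following sketch adds one unit at a time, fitting the new unit's parameters to the current residual by least squares while all previously added units are kept fixed. It is an illustration only, not the algorithm analyzed in this paper: the tanh ridge function, the squared-error objective, the choice of optimizer, and the names fit_incremental and predict are assumptions made for the sake of the example.

import numpy as np
from scipy.optimize import minimize

def fit_incremental(X, y, n_units=10, seed=0):
    # Assumed shapes: X is an (m, d) array of inputs, y an (m,) array of targets.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    units = []                      # list of (a, b, c) for the unit c * tanh(a.x + b)
    residual = y.astype(float).copy()

    for _ in range(n_units):
        def unit_loss(theta):
            a, b, c = theta[:d], theta[d], theta[d + 1]
            return np.mean((residual - c * np.tanh(X @ a + b)) ** 2)

        # Only d + 2 parameters are optimized at this stage; all previously
        # added units stay frozen, which is the point of the incremental scheme.
        theta0 = rng.normal(scale=0.1, size=d + 2)
        theta = minimize(unit_loss, theta0, method="L-BFGS-B").x
        a, b, c = theta[:d], theta[d], theta[d + 1]
        units.append((a, b, c))
        residual = residual - c * np.tanh(X @ a + b)

    return units

def predict(units, X):
    # Output of the resulting two-layer network.
    return sum(c * np.tanh(X @ a + b) for a, b, c in units)

Variants that mix the new unit into the existing network by a convex combination, as in the results discussed next, follow the same template.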

The first theoretical result establishing performance bounds for incremental approximation in Hilbert space was given by Jones [8]. This work was later extended by Barron [2], and applied to neural network approximation of functions characterized by certain conditions on their Fourier coefficients. The work of Barron has been extended in two main directions. First, Lee et al. [10] have considered approximating general functions using Hilbert space techniques, while Donahue et al. [7] have provided powerful extensions of Jones' and Barron's results to general Banach spaces. One of the most impressive results of the latter work is the demonstration that iterative algorithms can, in many cases, achieve nearly optimal rates of convergence when approximating convex hulls.
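To fix ideas, the flavor of these Hilbert-space results can be stated schematically as follows; this is a loose paraphrase for orientation only, with constants and technical conditions suppressed (see [8], [2] and [7] for precise statements). If $f \in \overline{\mathrm{co}}\,G$ for a subset $G$ of a Hilbert space satisfying $\|g\| \le b$ for all $g \in G$, then the greedy updates

\[
f_n = (1 - \alpha_n) f_{n-1} + \alpha_n g_n, \qquad g_n \in G,\ \alpha_n \in [0,1],
\]

with $(g_n, \alpha_n)$ chosen (near-)optimally at each stage, achieve

\[
\|f - f_n\|^2 \le \frac{c}{n}
\]

for a constant $c$ depending only on $b$ and $f$.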

While this paper is concerned mainly with issues of approximation, we comment that it is highly relevant to the statistical problem of learning from data in neural networks. First, Lee et al. [10] give estimation error bounds for algorithms performing incremental optimization with respect to the training error. Under certain regularity conditions, they are able to achieve rates of convergence comparable to those obtained by the much more computationally demanding algorithm of empirical error minimization. Moreover, it is well known that upper bounds on the approximation error are needed in order to obtain performance bounds, both for parametric and nonparametric estimation, where the latter is achieved using the method of complexity regularization. Finally, as pointed out by Donahue et al. [7], lower bounds on the approximation error are crucial in establishing worst-case speed limitations for learning.
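Schematically, and using notation introduced only for this remark, the interplay between the two kinds of bounds can be seen from the standard decomposition for an estimator $\hat{f}$ selected from a class $\mathcal{F}_n$ of $n$-unit networks on the basis of a finite sample:

\[
\|\hat{f} - f\| \;\le\; \|\hat{f} - f_n^*\| \;+\; \|f_n^* - f\|,
\]

where $f_n^*$ is a (near-)best approximation to $f$ from $\mathcal{F}_n$; the first term is the estimation error and the second the approximation error. Upper bounds on the approximation term, combined with complexity control of $\mathcal{F}_n$ for the estimation term, yield overall performance guarantees, while lower bounds on the approximation term limit the rate attainable by any learning scheme confined to $\mathcal{F}_n$.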

The main contribution of this paper is as follows. For functions belonging to the Sobolev class (see definition below), we establish, under appropriate conditions, near-optimal rates of convergence for the incremental approach, and obtain explicit bounds on the parameter values of the network. The latter bounds are often crucial for establishing estimation error rates. In contrast to the work in [10] and [7], we characterize approximation rates for functions belonging to standard smoothness classes, such as the Sobolev class. The former work establishes rates of convergence with respect to the convex hulls of certain subsets of functions, which do not relate in any simple way to standard functional classes (such as Lipschitz, Sobolev, Hölder, etc.). As far as we are aware, the results reported here are the first such bounds for incremental neural network procedures. A detailed version of this work, complete with full proofs, is available in [13].

2 Problem statement

We make use of the nomenclature and definitions from [7]. Let $H$ be a Banach space of functions with norm $\|\cdot\|$. For concreteness we assume henceforth that the norm is given by the $L_q$ norm, $1 < q < \infty$, denoted by $\|\cdot\|_q$. Let $\mathrm{lin}_n H$ consist of all sums of the form $\sum_{i=1}^n a_i g_i$, with $g_i \in H$ and arbitrary $a_i$, and let $\mathrm{co}_n H$ be the set of such sums with $a_i \in [0,1]$ and $\sum_{i=1}^n a_i = 1$. The distances, measured in the $L_q$ norm, from a function $f$ are given by

\[
\mathrm{dist}(\mathrm{lin}_n H, f) = \inf\{\|h - f\|_q : h \in \mathrm{lin}_n H\},
\qquad
\mathrm{dist}(\mathrm{co}_n H, f) = \inf\{\|h - f\|_q : h \in \mathrm{co}_n H\}.
\]
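Since every convex combination of $n$ elements is in particular a linear combination of $n$ elements, $\mathrm{co}_n H \subseteq \mathrm{lin}_n H$, and hence, for every $f$,

\[
\mathrm{dist}(\mathrm{lin}_n H, f) \;\le\; \mathrm{dist}(\mathrm{co}_n H, f),
\]

an immediate consequence of the definitions, noted here for orientation.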

The linear span of $H$ is given by $\mathrm{lin}\,H = \cup_n \mathrm{lin}_n H$, while the convex hull of $H$ is $\mathrm{co}\,H = \cup_n \mathrm{co}_n H$. We follow standard notation and denote closures of sets by a bar, e.g. $\overline{\mathrm{co}}\,H$ is the closure of the convex hull of $H$. In this work we focus on the special case where

\[
H = H_\eta \triangleq \{\, g : g(x) = c\,\sigma(a^T x + b),\ |c| \le \eta,\ \|\sigma(\cdot)\|_q \le 1 \,\},
\]