{"title": "Merging Constrained Optimisation with Deterministic Annealing to \"Solve\" Combinatorially Hard Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 1025, "page_last": 1032, "abstract": "", "full_text": "Merging Constrained Optimisation with Deterministic Annealing to \"Solve\" Combinatorially Hard Problems \n\nPaul Stolorz* \nSanta Fe Institute \n1660 Old Pecos Trail, Suite A \nSanta Fe, NM 87501 \n\nABSTRACT \n\nSeveral parallel analogue algorithms, based upon mean field theory (MFT) approximations to an underlying statistical mechanics formulation, and requiring an externally prescribed annealing schedule, now exist for finding approximate solutions to difficult combinatorial optimisation problems. They have been applied to the Travelling Salesman Problem (TSP), as well as to various issues in computational vision and cluster analysis. I show here that any given MFT algorithm can be combined in a natural way with notions from the areas of constrained optimisation and adaptive simulated annealing to yield a single homogeneous and efficient parallel relaxation technique, for which an externally prescribed annealing schedule is no longer required. The results of numerical simulations on 50-city and 100-city TSP problems are presented, which show that the ensuing algorithms are typically an order of magnitude faster than the MFT algorithms alone, and which also, on occasion, yield superior solutions. \n\n1 INTRODUCTION \n\nSeveral promising parallel analogue algorithms, which can be loosely described by the term \"deterministic annealing\", or \"mean field theory (MFT) annealing\", have \n\n*also at Theoretical Division and Center for Nonlinear Studies, MSB213, Los Alamos National Laboratory, Los Alamos, NM 87545. 
recently been proposed as heuristics for tackling difficult combinatorial optimisation problems [1, 2, 3, 4, 5, 6, 7]. However, the annealing schedules must be imposed externally in a somewhat ad hoc manner in these procedures (although they can be made adaptive to a limited degree [8]). As a result, a number of authors [9, 10, 11] have considered the alternative analogue approach of Lagrangian relaxation, a form of constrained optimisation due originally to Arrow [12], as a different means of tackling these problems. The various alternatives require the introduction of a new set of variables, the Lagrange multipliers. Unfortunately, these usually lead in turn to either the inclusion of expensive penalty terms, or the consideration of restricted classes of problem constraints. The penalty terms also tend to introduce unwanted local minima in the objective function, and they must be included even when the algorithms are exact [13, 10]. These drawbacks prevent their easy application to large-scale combinatorial problems containing 100 or more variables. \n\nIn this paper I show that the technical features of analogue mean field approximations can be merged with both Lagrangian relaxation methods and with the broad philosophy of adaptive annealing without, importantly, requiring the large computational resources that typically accompany the Lagrangian methods. The result is a systematic procedure for crafting from any given MFT algorithm a single parallel homogeneous relaxation technique which needs no externally prescribed annealing schedule. In this way the computational power of the analogue heuristics is greatly enhanced. In particular, the Lagrangian framework can be used to construct an efficient adaptation of the elastic net algorithm [2], which is perhaps the most promising of the analogue heuristics. 
The results of numerical experiments are presented which display both increased computational efficiency and, on occasion, better solutions (avoidance of some local minima) compared with deterministic annealing. The qualitative mechanism at the root of this behaviour is also described. Finally, I note that the apparatus can be generalised to a procedure that uses several multipliers, in a manner that roughly parallels the notion of different temperatures at different physical locations in the simulated annealing heuristic. \n\n2 DETERMINISTIC ANNEALING \n\nThe deterministic annealing procedures consist of tracking the local minimum of an objective function of the form \n\nF(x, T) = U(x) - T S(x) \n\n(1) \n\nwhere x represents the analogue variables used to describe the particular problem at hand, and T ≥ 0 (initially chosen large) is an adjustable annealing, or temperature, parameter. As T is lowered, the objective function undergoes a qualitative change from a convex to a distinctly non-convex function. Provided the annealing schedule is slow enough, however, it is hoped that the local minimum near T = 0 is a close approximation to the global solution of the problem. \n\nThe function S(x) represents an analogue approximation [5, 4, 7] to the entropy of an underlying discrete statistical physics system, while F(x, T) approximates its free energy. The underlying discrete system forms the basis of the simulated annealing heuristic [14]. Although a general and powerful technique, this heuristic is an inherently stochastic procedure which must consider many individual discrete tours at each and every temperature T. The deterministic annealing approximations have the advantage of being deterministic, so that an approximate solution at a given temperature can be found with much less computational effort. 
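The loop structure common to these procedures can be sketched as follows. This is a minimal toy, not the paper's TSP implementation: the two-variable energy U(x) = -w·x - J x_0 x_1, the mean-field entropy of independent 0/1 units, and the geometric cooling ratio are all illustrative choices, with the `T *= ratio` line playing the role of the externally prescribed annealing schedule.

```python
import numpy as np

def anneal(w, J=0.2, T0=1.0, T_min=1e-2, ratio=0.9, inner=50):
    """Generic deterministic (mean-field) annealing sketch: at each
    externally prescribed temperature T, iterate the mean-field
    self-consistency equations to a local minimum of
    F(x, T) = U(x) - T*S(x), then lower T.
    U(x) = -w.x - J*x0*x1 is a toy stand-in for a real problem energy."""
    x = np.full_like(w, 0.5)   # start at the high-temperature symmetric point
    T = T0
    while T > T_min:
        for _ in range(inner):
            # mean-field update: x_i = sigmoid(local field / T),
            # which solves grad_x F(x, T) = 0 for this separable entropy
            x = 1.0 / (1.0 + np.exp(-(w + J * x[::-1]) / T))
        T *= ratio             # the externally imposed annealing schedule
    return x
```

As T falls, the minimiser hardens toward a corner of the unit hypercube, mimicking the convex-to-non-convex transition described above.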
In both cases, however, the complexity of the problem under consideration shows up in the need to determine with great care an annealing schedule for lowering the temperature parameter. \n\nThe primary contribution of this paper consists in pursuing the relationship between deterministic annealing and statistical physics one step further, by making explicit use of the fact that, due to the statistical physics embedding of the deterministic annealing procedures, \n\nS(x_min) → 0 as T → 0 \n\n(2) \n\nwhere x_min is the local minimum obtained for the parameter value T. This deceptively simple observation allows the consideration of the somewhat different approach of Lagrange multiplier methods to automatically determine a dynamics for T in the analogue heuristics, using as a constraint the vanishing of the entropy function at zero temperature. This particular fact has not been explicitly used in any previous optimisation procedures based upon Lagrange multipliers, although it is implicit in the work of [9]. Most authors have focussed instead on the syntactic constraints contained in the function U(x) when incorporating Lagrange multipliers. As a result the issue of eliminating an external annealing schedule has not been directly confronted. \n\n3 LAGRANGE MULTIPLIERS \n\nMultiplier methods seek the critical points of a \"Lagrangian\" function \n\nF(x, λ) = U(x) - λ S(x) \n\n(3) \n\nwhere the notation of (1) has been retained, in accordance with the philosophy discussed above. The only difference is that the parameter T has been replaced by a variable λ (the Lagrange multiplier), which is to be treated on the same basis as the variables x. By definition, the critical points of F(x, λ) obey the so-called Kuhn-Tucker conditions \n\n∇_x F(x, λ) = 0 = ∇_x U(x) - λ ∇_x S(x) \n∇_λ F(x, λ) = 0 = -S(x) \n\n(4) \n\nThus, at any critical point of this function, the constraint S(x) = 0 is satisfied. 
This corresponds to a vanishing entropy estimate in (1). Hopefully, in addition, U(x) is minimised, subject to the constraint. \n\nThe difficulty with this approach when used in isolation is that finding the critical points of F(x, λ) entails, in general, the minimisation of a transformed \"unconstrained\" function, whose set of local minima contains the critical points of F as a subset. This transformed function is required in order to ensure an algorithm which is convergent, because the critical points of F(x, λ) are saddle points, not local minima. One well-known way to do this is to add a term S^2(x) to (3), giving an augmented Lagrangian with the same fixed points as (3), but hopefully with better convergence properties. Unfortunately, the transformed function is invariably more complicated than F(x, λ), typically containing extra quadratic penalty terms (as in the above case), which tend to convert harmless saddle points into unwanted local minima. It also leads to greater computational overhead, usually in the form of either second derivatives of the functions U(x) and S(x), or of matrix inversions [13, 10] (although see [11] for an approach which minimises this overhead). For large-scale combinatorial problems such as the TSP these disadvantages become prohibitive. In addition, the entropic constraint functions occurring in deterministic annealing tend to be quite complicated nonlinear functions of the variables involved, often with peculiar behaviour near the constraint condition. In these cases (the Hopfield/Tank method is an example) a term quadratic in the entropy cannot simply be added to (3) in a straightforward way to produce a suitable augmented Lagrangian (of course, such a procedure is possible with several of the terms in the internal energy U(x)). 
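The saddle-point difficulty is easy to see in a toy problem (my own illustration, not from the paper): minimise U(x) = (x-2)^2 subject to a stand-in constraint S(x) = x - 1 = 0. Naive joint gradient descent on F(x, λ) = U(x) - λ S(x) runs away from the saddle, whereas first-order descent in x combined with ascent in λ (the Arrow-style scheme exploited in the next section) converges to it.

```python
def simulate(sign, x=0.0, lam=0.0, dt=0.01, steps=2000):
    """Euler-integrate x' = -dF/dx, lam' = sign * dF/dlam for the toy
    Lagrangian F(x, lam) = (x - 2)**2 - lam * (x - 1).
    sign = +1: Arrow-style ascent in the multiplier (converges);
    sign = -1: naive joint gradient descent (diverges from the saddle)."""
    for _ in range(steps):
        dx = -(2.0 * (x - 2.0) - lam)      # -dF/dx
        dlam = sign * (-(x - 1.0))         # sign * dF/dlam
        x, lam = x + dt * dx, lam + dt * dlam
    return x, lam
```

The unique critical point is (x, λ) = (1, -2); linearising shows one positive eigenvalue under joint descent but two negative eigenvalues under descent/ascent, which is why the constraint is enforced without any quadratic penalty.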
4 COMBINING BOTH METHODS \n\nThe best features of each of the two approaches outlined above may be retained by using the following modification of the original first-order Arrow technique: \n\ndx_i/dt = -∇_{x_i} F(x, λ) = -∇_{x_i} U(x) + λ ∇_{x_i} S(x) \ndλ/dt = +∇_λ F(x, λ) = -S(x) + c/λ \n\n(5) \n\nwhere F(x, λ) is a slightly modified \"free energy\" function given by \n\nF(x, λ) = U(x) - λ S(x) + c ln λ \n\n(6) \n\nIn these expressions, c > 0 is a constant, chosen small on the scale of the other parameters, and characterises the sole, inexpensive, penalty requirement. It is needed purely in order to ensure that λ remain positive. In fact, in the numerical experiments that I will present, this penalty term for λ was not even used - the algorithm was simply terminated at a suitably small value of λ. \n\nThe reason for insisting upon λ > 0, in contrast to most first-order relaxation methods, is that it ensures that the free energy objective function is bounded below with respect to the x variables. This in turn allows (5) to be proven locally convergent [15] using techniques discussed in [13]. Furthermore, the methods described by (5) are found empirically to be globally convergent as well. This feature is in fact the key to their computational efficiency, as it means that they need not be grafted onto more sophisticated and inefficient methods in order to ensure convergence. This behaviour can be traced to the fact that the \"free energy\" functions, while non-convex overall with respect to x, are nevertheless convex over large volumes of the solution space. The point can be illustrated by the construction of an energy function similar to that used by Platt and Barr [9], which also displays the mechanism by which some of the unwanted local minima in deterministic annealing may be avoided. These issues are discussed further in Section 6. 
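Equations (5)-(6) can be exercised on a small toy system (again my own stand-in, not the paper's elastic net): x in (0,1)^2 with an illustrative cost U(x) = -w·x, the mean-field entropy S(x) = Σ_i [-x_i ln x_i - (1-x_i) ln(1-x_i)] as the constraint, and the c/λ barrier keeping λ positive.

```python
import numpy as np

def entropy(x):
    # Analogue (mean-field) entropy of independent 0/1 units;
    # it is positive in the interior and vanishes at the corners.
    return float(np.sum(-x * np.log(x) - (1 - x) * np.log(1 - x)))

def relax(w, c=0.01, dt=0.002, steps=20000):
    """First-order descent/ascent of eqs. (5)-(6) on the toy free energy
    F(x, lam) = -w.x - lam*S(x) + c*ln(lam): descend in x, ascend in lam."""
    x, lam = np.full_like(w, 0.5), 1.0
    for _ in range(steps):
        x_dot = w + lam * np.log((1 - x) / x)   # -grad_x U + lam * grad_x S
        lam_dot = -entropy(x) + c / lam         # +grad_lam F, eq. (5)
        x = np.clip(x + dt * x_dot, 1e-9, 1 - 1e-9)  # keep x in (0,1)
        lam = max(lam + dt * lam_dot, 1e-6)          # numerical safeguard
    return x, lam
```

Because dλ/dt = -S(x) + c/λ, the multiplier falls while the entropy estimate is large and levels off as S(x) approaches zero, so the cooling emerges from the relaxation itself rather than from an external schedule.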
The algorithms described above have several features which distinguish them from previous work. Firstly, the entropy estimate S(x) has been chosen explicitly as the appropriate constraint function, a fact which has previously been unexploited in the optimisation context (although a related piecewise linear function has been used by [9]). Further, since this estimate is usually positive for the mean field theory heuristics, λ (the only new variable) decreases monotonically in a manner roughly similar to the temperature decrease schedule used in simulated and deterministic annealing, but with the ad hoc drawback now removed. Moreover, there is no requirement that the system be at or near a fixed point each time λ is altered - there is simply one homogeneous dynamical system which must approach a fixed point only once at the very end of the simulation, and furthermore λ appears linearly except near the end of the procedure (a major reason for its efficiency). Finally, the algorithms do not require computationally cumbersome extra structure in the form of quadratic penalty terms, second derivatives or inverses, in contrast to the usual Lagrangian relaxation techniques. All of these features can be seen to be due to the statistical physics setting of the annealing \"Lagrangian\", and the use of an entropic constraint instead of the more usual syntactic constraints. \n\nThe apparatus outlined above can immediately be used to adapt the Hopfield/Tank heuristic for the Travelling Salesman Problem (TSP) [1], which can easily be written in the form (1). However, the elastic net method [2] is known to be a somewhat superior method, and is therefore a better candidate for modification. 
There is an impediment to the procedure here: the objective function for the elastic net is actually of the form \n\nF(x, λ) = U(x) - λ S(x, λ) \n\n(7) \n\nwhich precludes the use of a true Lagrange multiplier, since λ now appears non-trivially in the constraint function itself! However, I find, surprisingly, that the algorithm obtained by applying the Lagrangian relaxation apparatus in a straightforward way as before still leads to a coherent algorithm. The equations are \n\ndx_i/dt = -∇_{x_i} F(x, λ) = -∇_{x_i} U(x) + λ ∇_{x_i} S(x, λ) \ndλ/dt = +ζ ∇_λ F(x, λ) = -ζ [S(x, λ) + λ ∇_λ S(x, λ)] \n\n(8) \n\nThe parameter ζ > 0 is chosen so that an explicit barrier term for λ can be avoided. It is the only remaining externally prescribed part of the former annealing schedule, and is fixed just once at the beginning of the algorithm. \n\nIt can be shown that the global convergence of (8) is highly plausible in general (and seems always to occur in practice), as in the simpler case described by (5). Secondly, and most importantly, it can be shown that the constraints obeyed at the new fixed points satisfy the syntax of the original discrete problem [15]. The procedure is not limited to the elastic net method for the TSP. The mean field approximations discussed in [3, 4, 5] all behave in a similar way, and can therefore be adapted successfully to Lagrangian relaxation methods. The form of the elastic net entropy function suggests a further natural generalisation of the procedure. A different \"multiplier\" λ_a can be assigned to each city a, each variable being responsible for satisfying a different additive component of the entropy constraint. The idea has an obvious parallel to the notion in simulated annealing of lowering the temperature in different geographical regions at different rates in response to the behaviour of the system. 
The number of extra variables required is a modest computational investment, since there are typically many more tour points than city points for a given implementation. \n\n5 RESULTS FOR THE TSP \n\nNumerical simulations were performed on various TSP instances using the elastic net method, the Lagrangian adaptation with a single global Lagrange multiplier, and the modification discussed above involving one Lagrange multiplier for each city. The results are shown in Table 1. The tours for the Lagrangian relaxation methods are about 0.5% shorter than those for the elastic net, although these differences are not yet at a statistically significant level. The differences in the computational requirements are, however, much more dramatic. No attempt has been made to optimise any of the techniques by using sophisticated descent procedures, although the size of the update step has been chosen to separately optimise each method. \n\nTable 1: Performance of the heuristics described in the text on a set of 40 randomly distributed 50-city instances of the TSP in the unit square. CPU times quoted are for a SUN SPARCstation 1+. α and β are the standard tuning parameters [4]. \n\nMETHOD             α    β    TOUR LENGTH   CPU (SEC) \nElastic net        0.2  2.5  5.95 ± 0.10   260 ± 33 \nGlobal multiplier  0.4  2.5  5.92 ± 0.09   49 ± 5 \nLocal multipliers  0.4  2.5  5.92 ± 0.08   82 ± 12 \n\nI have also been able to obtain a superior solution to the 100-city problem analysed by Durbin and Willshaw [2], namely a solution of length 7.746 [15] (cf. length 7.783 for the elastic net) in a fraction of the time taken by elastic net annealing. This represents an improvement of roughly 0.5%. 
Although still about 0.5% longer than the best tour found by simulated annealing, this result is quite encouraging, because it was obtained with far less CPU time than simulated annealing, and in substantially less time than the elastic net: improvements upon solutions within about 1% of optimality typically require a substantial increase in CPU investment. \n\n6 HOW IT WORKS - VALLEY ASCENT \n\nInspection of the solutions obtained by the various methods indicates that the multiplier schemes can sometimes exchange enough \"inertial\" energy to overcome the energy barriers which trap the annealing methods, thus offering better solutions as well as much-improved computational efficiency. This point is illustrated in Figure 1(a), which displays the evolution of the following function during the algorithm for a typical set of parameters: \n\nE = (1/2) Σ_i (dx_i/dt)^2 + (1/2) (dλ/dt)^2 \n\n(9) \n\nThe two terms can be thought of as different components of an overall kinetic energy E. During the procedure, energy can be exchanged between these two components, so the function E(t) does not decrease monotonically with time. This allows the system to occasionally escape from local minima. Nevertheless, after a long enough time the function does decrease smoothly, ensuring convergence to a valid solution to the problem. \n\nFigure 1: (a) Evolution of variables for a typical 50-city TSP, plotted against iteration number. The solid curve shows the total kinetic energy E given by (9). The dotted curve shows the λ component of this energy, and the dash-dotted curve shows the x component. (b) Trajectories taken by various algorithms on a schematic free energy surface. The two dash-dotted curves show possible paths for elastic net annealing, each ascending a valley floor. The dotted curve shows a Lagrangian relaxation, which displays oscillations about the valley floor leading to the superior solution. \n\nThe basic mechanism can also be understood by plotting schematically the free energy \"surface\" F(x, λ), as shown in Figure 1(b). This surface has a single valley in the foreground, where λ is large. Bifurcations occur as λ becomes smaller, with a series of saddles, each a valid problem solution, being reached in the background at λ = 0. Deterministic annealing can be viewed as the ascent of just one of these valleys along the valley floor. It is hoped that the broadest and deepest minimum is chosen at each valley bifurcation, leading eventually to the lowest background saddle point as the optimal solution. A typical trajectory for one of the Lagrangian modifications also consists roughly of the ascent of one of these valleys. However, oscillations about the valley floor now occur on the way to the final saddle point, due to the interplay between the different kinetic components displayed in Figure 1(a). It is hoped that the extra degrees of freedom allow valleys to be explored more fully near bifurcation points, thus biasing the larger valleys more than deterministic annealing. Notice that in order to generate the λ dynamics, computational significance is now assigned to the actual value of the free energy in the new schemes, in contrast to the situation in regular annealing. 
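The quantity in (9) is straightforward to monitor numerically. The sketch below records E(t) along a toy Lagrangian relaxation (an illustrative two-variable system of my own, not the paper's TSP run); in such a separable toy the two components mostly decay together rather than exchanging energy as dramatically as in Figure 1(a), but it shows how E is assembled and that it vanishes as the single homogeneous system reaches its fixed point.

```python
import numpy as np

def kinetic_trace(w, c=0.01, dt=0.002, steps=20000):
    """Record E = 0.5*sum_i (dx_i/dt)**2 + 0.5*(dlam/dt)**2 (eq. (9))
    along a toy first-order relaxation of F = -w.x - lam*S(x) + c*ln(lam)."""
    x, lam = np.full_like(w, 0.5), 1.0
    trace = []
    for _ in range(steps):
        x_dot = w + lam * np.log((1 - x) / x)                   # -grad_x F
        s = float(np.sum(-x * np.log(x) - (1 - x) * np.log(1 - x)))
        lam_dot = -s + c / lam                                  # +grad_lam F
        trace.append(0.5 * float(np.dot(x_dot, x_dot)) + 0.5 * lam_dot ** 2)
        x = np.clip(x + dt * x_dot, 1e-9, 1 - 1e-9)
        lam = max(lam + dt * lam_dot, 1e-6)
    return trace
```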
7 CONCLUSION \n\nIn summary, a simple yet effective framework has been developed for systematically generalising any algorithm described by a mean field theory approximation procedure to a Lagrangian method which replaces annealing by the relaxation of a single dynamical system. Even in the case of the elastic net, which has a slightly awkward form, the resulting method can be shown to be sensible, and I find in fact that it substantially improves the speed (and accuracy) of that method. The adaptations depend crucially upon the vanishing of the analogue entropy at zero temperature. This allows the entropy to be used as a powerful constraint function, even though it is a highly nonlinear function and might be expected at first sight to be unsuitable for the task. In fact, this observation can also be applied in a wider context to design objective functions and architectures for neural networks which seek to improve generalisation ability by limiting the number of network parameters [16]. \n\nReferences \n\n[1] J.J. Hopfield and D.W. Tank. Neural computation of decisions in optimization problems. Biol. Cybern., 52:141-152, 1985. \n[2] R. Durbin and D. Willshaw. An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326:689-691, 1987. \n[3] D. Geiger and F. Girosi. Coupled Markov random fields and mean field theory. In D. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 660-667. Morgan Kaufmann, 1990. \n[4] A.L. Yuille. Generalised deformable models, statistical physics, and matching problems. Neural Comp., 2:1-24, 1990. \n[5] P.D. Simic. Statistical mechanics as the underlying theory of \"elastic\" and \"neural\" optimisations. NETWORK: Comp. Neural Syst., 1:89-103, 1990. \n[6] A. Blake and A. Zisserman. Visual Reconstruction. MIT Press, 1987. \n[7] C. Peterson and B. Soderberg. 
A new method for mapping optimization problems onto neural networks. Int. J. Neural Syst., 1:3-22, 1989. \n[8] D.J. Burr. An improved elastic net method for the travelling salesman problem. In IEEE 2nd International Conf. on Neural Networks, volume 1, pages 69-76, 1988. \n[9] J.C. Platt and A.H. Barr. Constrained differential optimization. In D.Z. Anderson, editor, Neural Information Proc. Systems, pages 612-621. AIP, 1988. \n[10] A.G. Tsirukis, G.V. Reklaitis, and M.F. Tenorio. Nonlinear optimization using generalised Hopfield networks. Neural Comp., 1:511-521, 1989. \n[11] E. Mjolsness and C. Garrett. Algebraic transformations of objective functions. Neural Networks, 3:651-669, 1990. \n[12] K.J. Arrow, L. Hurwicz, and H. Uzawa. Studies in Linear and Nonlinear Programming. Stanford University Press, 1958. \n[13] D.P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Academic Press, 1982. See especially Chapter 4. \n[14] S. Kirkpatrick, C.D. Gelatt Jr., and M.P. Vecchi. Optimization by simulated annealing. Science, 220:671-680, 1983. \n[15] P. Stolorz. Merging constrained optimisation with deterministic annealing to \"solve\" combinatorially hard problems. Technical report LA-UR-91-3593, Los Alamos National Laboratory, 1991. \n[16] P. Stolorz. Analogue entropy as a constraint in adaptive learning and optimisation. Technical report, in preparation, Santa Fe Institute, 1992. \n", "award": [], "sourceid": 567, "authors": [{"given_name": "Paul", "family_name": "Stolorz", "institution": null}]}