
Los Alamos NATIONAL LABORATORY

Title: Implementing Size-Optimal Discrete Neural Networks Requires Analog Circuitry

Author(s): Valeriu Beiu

Submitted to: IX European Signal Processing Conference, Rhodes, Greece, September 8-11, 1998

Los Alamos National Laboratory, an affirmative action/equal opportunity employer, is operated by the University of California for the U.S. Department of Energy under contract W-7405-ENG-36. By acceptance of this article, the publisher recognizes that the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or to allow others to do so, for U.S. Government purposes. The Los Alamos National Laboratory requests that the publisher identify this article as work performed under the auspices of the U.S. Department of Energy.


DISCLAIMER

This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.



Implementing Size-Optimal Discrete Neural Networks Requires Analog Circuitry

Valeriu Beiu

Los Alamos National Laboratory, Division NIS-1, MS D466, Los Alamos, New Mexico 87545, USA

Phone: +1-505-467-2430 • Fax: +1-505-665-7395 • E-mail: [email protected]

1. Introduction

In this paper a network is an acyclic graph having several input nodes and some (at least one) output nodes. If a synaptic weight is associated with each edge, and each node computes the weighted sum of its inputs, to which a nonlinear activation function is then applied (artificial neuron, or simply neuron): $f(\mathbf{x}) = f(x_1, \ldots, x_\Delta) = \sigma\left(\sum_{i=1}^{\Delta} w_i x_i + \theta\right)$, the network is a neural network (NN), with the synaptic weights $w_i \in \mathbb{R}$, $\theta \in \mathbb{R}$ known as the threshold, $\Delta$ being the fan-in, and $\sigma$ a nonlinear activation function. Because the underlying graph is acyclic, the network has no feedback connections and can be layered. That is why such a network is also known as a multilayer feedforward neural network, and it is commonly characterised by two cost functions: its depth (the number of layers) and its size (i.e., the number of neurons).
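As a minimal illustrative sketch of the neuron model defined above (the logistic function below is just one arbitrary choice of the nonlinear activation $\sigma$; it is not prescribed here), a single node computing $\sigma(\sum_i w_i x_i + \theta)$ can be written as:

```python
import math

def logistic(s):
    # One possible choice for the nonlinear activation function sigma.
    return 1.0 / (1.0 + math.exp(-s))

def neuron(x, w, theta, sigma=logistic):
    # A single artificial neuron: sigma applied to the weighted sum of the
    # inputs plus the threshold theta; len(w) == len(x) is the fan-in Delta.
    return sigma(sum(wi * xi for wi, xi in zip(w, x)) + theta)

# Example: a neuron with fan-in 3.
print(neuron([1.0, 0.0, 1.0], [0.5, -0.3, 0.8], theta=-0.6))
```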

The paper will constructively show that in order to obtain minimum size NNs (i.e., size-optimal ones) for implementing any Boolean function (BF), the nonlinear activation function has to be the identity function. Hence, size-optimal hardware implementations of discrete NNs (i.e., NNs implementing BFs) can be obtained only in analog circuitry.

2. Previous Results

NNs have been experimentally shown to be quite effective in many applications (see Applications of Neural Networks in [2], together with Part F: Applications of Neural Computation and Part G: Neural Networks in Practice: Case Studies from [23]). This success has led researchers to undertake a rigorous analysis of the mathematical properties that enable them to perform so well. It has generated two directions of research: (i) to find existence/constructive proofs for what is now known as the "universal approximation problem"; (ii) to find tight bounds on the size needed by the approximation problem (or some particular cases). The paper will focus on both aspects, for the particular case when the functions to be implemented are Boolean.

The first line of research has concentrated on the approximation capabilities of NNs. It was started in 1987 [24, 35] by showing that Kolmogorov's superpositions [33] can be interpreted as a NN with one hidden layer, thus giving an existence proof. The first nonconstructive proof was given using a continuous activation function [18, 19] (see also [28]). The fact that NNs are computationally universal, when modifiable connections are allowed, was thus established. Different enhancements were later presented in the literature (see Chapter 1 in [9]), but all these results (with the partial exception of [3, 32, 34]) were obtained "provided that sufficiently many hidden units are available" (i.e., no claims on the size minimality were made). More constructive solutions were later obtained in very small depth [30, 40, 41], but their size grows fast with respect to the number of dimensions and/or examples, or with the required precision. Recently, an explicit numerical algorithm for superpositions has been detailed [49, 50].

The other line of research was to find the smallest size NN which can realise an arbitrary function given a set of m vectors from $\mathbb{R}^n$. Many results have been obtained for threshold gates (TGs). One of the first lower bounds on the size of a threshold gate circuit (TGC) for "almost all" n-ary BFs was given in [39]: $\mathrm{size} \geq 2\,(2^n/n)^{1/2}$.

Later, a very tight upper bound $\mathrm{size} \leq 2\,(2^n/n)^{1/2}\,\{1 + O[(2^n/n)^{-1/2}]\}$ was proven in depth = 4 [36]. Similar existence exponential bounds can be found in [46], which gives an $\Omega(2^{n/3})$ lower bound for arbitrary BFs. For classification problems (i.e., real inputs, Boolean output), one of the first results was that a NN of depth = 3 and size = m - 1 could compute an arbitrary dichotomy. The main improvements have been:

Baum [4] presented a TGC with one hidden layer having $\lceil m/n \rceil$ neurons capable of realising an arbitrary dichotomy on a set of m points in general position in $\mathbb{R}^n$; if the points are on the corners of the n-dimensional hypercube (i.e., binary vectors), m - 1 nodes are still needed; a slightly tighter bound was proven in [27]: only $\lceil 1 + (m-2)/n \rceil$ neurons are needed in the hidden layer for realising an arbitrary dichotomy on a set of m points which satisfy a more relaxed topological assumption;




also, the m - 1 nodes condition was shown to be the least upper bound needed; Arai [1] showed that m - 1 hidden neurons are necessary for arbitrary separability (any mapping between input and output for the case of binary-valued units), but improved the bound for the dichotomy problem to m/3 (without any condition on the inputs); these hidden-unit bounds are compared numerically in the short sketch below.
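The following small sketch is illustrative only: it simply evaluates the hidden-unit counts quoted above (Baum's $\lceil m/n \rceil$, the $\lceil 1 + (m-2)/n \rceil$ bound of [27], the m - 1 bound for binary inputs, and Arai's m/3) for a few sample values of m and n.

```python
import math

def hidden_unit_bounds(m, n):
    # m: number of points to be separated, n: input dimension.
    return {
        "Baum (general position)": math.ceil(m / n),
        "Huang & Huang [27]": math.ceil(1 + (m - 2) / n),
        "binary inputs (m - 1)": m - 1,
        "Arai (no conditions, m/3)": math.ceil(m / 3),
    }

for m, n in [(100, 10), (1000, 10), (1000, 50)]:
    print(m, n, hidden_unit_bounds(m, n))
```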

A size-optimal result was detailed by Red'kin in 1970 [44]:

Theorem 1 (from [44]). The realisation complexity (i.e., the number of threshold elements) of $E_{n,m}$ (i.e., BFs of n variables that have exactly m groups of ones in their truth table) is at most $2\,(2m)^{1/2} + 3$.
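For a rough, illustrative feel of how the bound of Theorem 1 scales (this evaluation is not part of the original text), one can tabulate $2(2m)^{1/2} + 3$ for a few values of m:

```python
import math

# Theorem 1: at most 2*(2m)^(1/2) + 3 threshold elements, where m is the number
# of groups of ones in the truth table (not the number of data points).
for m in (4, 16, 64, 256, 1024):
    print(m, math.ceil(2 * math.sqrt(2 * m) + 3))
```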

Some existence lower bounds for the arbitrary dichotomy problem are as follows: (i) a depth-2 TGC requires at least $m/(n \log(m/n))$ TGs; (ii) a depth-3 TGC requires at least $2\,(m/\log m)^{1/2}$ TGs in each of the two hidden layers (if $m \gg n^2$); (iii) an arbitrarily interconnected TGC without feedback needs $(2m/\log m)^{1/2}$ TGs (if $m \gg n^2$). All these results are valid for unlimited fan-in TGs. Departing from these lines, Horne and Hush [26] detail a solution for limited fan-in TGCs.

Theorem 3 (from [26]). Arbitrary Boolean functions of the form $f : \{0,1\}^n \to \{0,1\}^m$ can be implemented in a NN of perceptrons restricted to fan-in 2 with a node complexity of $O[m\,2^n/(n + \log m)]$ and requiring $O(n)$ layers.

One study [17] tried to unify these two lines of research by first presenting analytical solutions for the general NN problem in one dimension (having infinite size!), and then giving practical solutions for the one-dimensional case (i.e., including an upper bound on the size). Extensions to the n-dimensional case using three- and four-layer solutions were derived under piecewise constant approximations (having constant or variable width partitions), and under piecewise linear approximations (using ramps instead of sigmoids).

3. Main Contribution

As has been seen from all the previous results, the known bounds for size are exponential if TGCs are used for solving general problems (i.e., arbitrary BFs), but they: (i) reveal a gap, thus encouraging research efforts to reduce it; and (ii) suggest that NNs with more hidden layers (depth not limited to a small constant) might have a smaller size. Still, the size will remain exponential.
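To make the exponential growth concrete, here is a small illustrative computation (an added sketch, using the $2(2^n/n)^{1/2}$ threshold-gate count quoted in Section 2 for almost all n-ary BFs):

```python
# Illustrative only: growth of the ~2*(2^n / n)^(1/2) threshold-gate count
# for almost all n-ary Boolean functions (see Section 2).
for n in (8, 16, 32, 64):
    size = 2 * (2 ** n / n) ** 0.5
    print(f"n = {n:3d}: about {size:.3e} threshold gates")
```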

A different approach is to use Kolmogorov's superposition theorem, which shows that there are NNs having only 2n + 1 neurons in the hidden layer which can approximate any function! We start from [49, 50], which present a constructive solution for the general case. The constructive solution presented there builds neurons having fan-in $\Delta \leq 2n + 1$. The known weight bounds $2^{(\Delta-1)/2} < \mathrm{weight} < (\Delta+1)^{(\Delta+1)/2}/2^{\Delta}$ hold for any fan-in $\Delta \geq 4$. Hence, one would expect to have a precision of between $\Delta$ [38] and $\Delta \log \Delta$ bits per weight [42, 43, 48]. Unfortunately, the constructive solution for Kolmogorov's superpositions requires a double exponential precision for the weights:

$$\mathrm{weights} < (2n+2)^{-\frac{n^{r} - 1}{n - 1}}$$

(see equation (1) in Theorem 1 from [49]), thus limiting its applicability for hardware implementations. By restricting the class of functions to BFs, it can be shown that: (i) the nonlinear function $\psi$ becomes the identity function; (ii) the precision of the weights is reduced to $(2n+2)^{-n}$, i.e., $O(n \log n)$ bits per weight; and the conclusions follow that:

digital implementations of arbitrary BFs require exponential size;
analog implementations of arbitrary BFs as TGCs require exponential size;
analog implementations of arbitrary BFs as NNs can be done in linear size, having linear fan-in and polynomial precision weights and thresholds, if the nonlinear function is the identity function.
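As a rough, purely illustrative check of the precision figures above (this sketch is not part of the paper; it only evaluates the bounds as quoted, i.e., $2^{(\Delta-1)/2} < \mathrm{weight} < (\Delta+1)^{(\Delta+1)/2}/2^{\Delta}$ for threshold gates, and the $n \log_2(2n+2) = O(n \log n)$ bits per weight implied by the $(2n+2)^{-n}$ precision of the BF-restricted construction):

```python
import math

def tg_weight_bounds(delta):
    # Weight bounds quoted in Section 3 for a threshold gate of fan-in delta >= 4:
    # 2^((delta - 1)/2) < weight < (delta + 1)^((delta + 1)/2) / 2^delta.
    lower = 2.0 ** ((delta - 1) / 2)
    upper = (delta + 1.0) ** ((delta + 1) / 2) / 2.0 ** delta
    return lower, upper

def bits_per_weight_bf(n):
    # A precision of (2n + 2)^(-n) corresponds to about n * log2(2n + 2),
    # i.e. O(n log n), bits per weight.
    return n * math.log2(2 * n + 2)

for delta in (4, 8, 16, 32):
    lo_w, up_w = tg_weight_bounds(delta)
    print(f"fan-in {delta:2d}: weights between {lo_w:.3e} and {up_w:.3e} "
          f"(~{math.log2(lo_w):.1f} to {math.log2(up_w):.1f} bits)")

for n in (8, 16, 32, 64):
    print(f"n = {n:2d}: about {bits_per_weight_bf(n):.1f} bits per weight (BF-restricted case)")
```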

Extended List of References

[1] M. Arai. Bounds on the Number of Hidden Units in Binary-valued Three-layer Neural Networks. Neural Networks, 6(6), 855-860, 1993.
[2] M.A. Arbib (ed.). The Handbook of Brain Theory and Neural Networks. MIT Press, Cambridge, MA, 1995.
[3] A.R. Barron. Universal Approximation Bounds for Superpositions of a Sigmoidal Function. IEEE Trans. on Information Theory, 39(3), 930-945, 1993.
[4] E.B. Baum. On the Capabilities of Multilayer Perceptrons. J. of Complexity, 4, 193-215, 1988.
[5] V. Beiu. Optimal VLSI Implementations of Neural Networks: VLSI-Friendly Learning Algorithms. Chapter 18 in J.G. Taylor (ed.): Neural Networks and Their Applications, John Wiley & Sons, Chichester, 255-276, 1996.
[6] V. Beiu. Digital Integrated Circuit Implementations. Chapter E1.4 in [23], 1996.
[7] V. Beiu. Optimization of Circuits Using a Constructive Learning Algorithm. Tech. Rep. LA-UR-97-851, Los Alamos National Laboratory, USA; in A.B. Bulsari and S. Kallio (eds.): Neural Networks in Engineering Systems (EANN'97, Stockholm, Sweden), Systeemitekniikan seura ry (Systems Engineering Association, SEA), Abo Akademis Tryckeri, Turku, Finland, 291-294, 1997.
[8] V. Beiu. On the Circuit and VLSI Complexity of Threshold Gate COMPARISON. Tech. Rep. LA-UR-96-3591, Los Alamos National Laboratory, USA; accepted for publication in Neurocomputing, 1998.
[9] V. Beiu. VLSI Complexity of Discrete Neural Networks. Gordon and Breach & Harwood Academics Publishing, Newark, NJ, 1998.
[10] V. Beiu and T. De Pauw. Tight Bounds on the Size of Neural Networks for Classification Problems. Tech. Rep. LA-UR-97-60, Los Alamos National Laboratory, USA; in J. Mira, R. Moreno-Diaz and J. Cabestany (eds.): Biological and Artificial Computation: From Neuroscience to Technology, Lecture Notes in Computer Science vol. 1240, Springer, Berlin, 743-752, 1997.
[11] V. Beiu and S. Draghici. Limited Weights Neural Networks: Very Tight Entropy Based Bounds. Tech. Rep. LA-UR-97-294, Los Alamos National Laboratory, USA; in D.W. Pearson (ed.): Proc. 2nd Intl. ICSC Symp. on Soft Computing (SOCO'97, Nimes, France), ICSC Academic Press, Canada, 111-118, 1997.
[12] V. Beiu and H.E. Makaruk. Deeper Sparser Nets Are Size-Optimal. Tech. Rep. LA-UR-97-1916, Los Alamos National Laboratory, USA, 1997 (accepted for publication).
[13] V. Beiu and J.G. Taylor. VLSI Optimal Neural Network Learning Algorithm. In D.W. Pearson, N.C. Steele and R.F. Albrecht (eds.): Artificial Neural Nets and Genetic Algorithms (ICANNGA'95, Ales, France), Springer-Verlag, NY, 61-64, 1995.
[14] V. Beiu and J.G. Taylor. Optimal Mapping of Neural Networks onto FPGAs. In J. Mira and F. Sandoval (eds.): From Natural to Artificial Neural Computation, Lecture Notes in Computer Science vol. 930 (IWANN'95, Malaga, Spain), Springer-Verlag, Berlin, 822-829, 1995.
[15] V. Beiu and J.G. Taylor. Direct Synthesis of Neural Networks. Proc. MicroNeuro'96 (Lausanne, Switzerland), IEEE CS Press, Los Alamitos, CA, 257-264, 1996.
[16] A.L. Blum and R.L. Rivest. Training a 3-Node Neural Network Is NP-complete. Neural Networks, 5(1), 117-127, 1992.
[17] A. Bulsari. Some Analytical Solutions to the General Approximation Problem for Feedforward Neural Networks. Neural Networks, 6(7), 991-996, 1993.
[18] G. Cybenko. Continuous Valued Neural Networks with Two Hidden Layers Are Sufficient. Tech. Rep., Tufts Univ., 1988.
[19] G. Cybenko. Approximation by Superpositions of a Sigmoidal Function. Math. of Control, Signals and Systems, 2, 303-314, 1989.
[20] B. DasGupta, H.T. Siegelmann and E.D. Sontag. On the Complexity of Training Neural Networks with Continuous Activation Functions. Tech. Rep. 93-61, Dept. of CS, Univ. of Minnesota, 1993; also in IEEE Trans. on Neural Networks, 6, 1490-1504, 1995.
[21] J.S. Denker and B.S. Wittner. Network Generality, Training Required, and Precision Required. In D.Z. Anderson (ed.): Neural Information Processing Systems (NIPS*87), American Inst. of Physics, NY, 219-222, 1988.
[22] S. Draghici and I.K. Sethi. On the Possibilities of the Limited Precision Weights Neural Networks in Classification Problems. In J. Mira, R. Moreno-Diaz and J. Cabestany (eds.): Biological and Artificial Computation: From Neuroscience to Technology (IWANN'97, Lanzarote, Spain), Lecture Notes in Computer Science vol. 1240, Springer, Berlin, 753-762, June 1997.
[23] E. Fiesler and R. Beale (eds.). Handbook of Neural Computation. Oxford Univ. Press and the Inst. of Physics, NY, 1996.
[24] R. Hecht-Nielsen. Kolmogorov's Mapping Neural Network Existence Theorem. Proc. IEEE Intl. Conf. on Neural Networks, IEEE Press, 11-14, 1987.
[25] J.L. Holt and J.-N. Hwang. Finite Precision Error Analysis of Neural Network Hardware Implementations. IEEE Trans. on Comp., 42(3), 281-290, 1993.
[26] B.G. Horne and D.R. Hush. On the Node Complexity of Neural Networks. Neural Networks, 7(9), 1413-1426, 1994.
[27] S.-C. Huang and Y.-F. Huang. Bounds on the Number of Hidden Neurons of Multilayer Perceptrons in Classification and Recognition. IEEE Trans. on Neural Networks, 2(1), 47-55, 1991.
[28] B. Irie and S. Miyake. Capabilities of Three-Layered Perceptrons. Proc. Intl. Conf. on Neural Networks ICNN'88 (San Diego, CA), vol. 1, 641-648, 1988.
[29] J.S. Judd. Neural Network Design and the Complexity of Learning. MIT Press, Cambridge, MA, 1990.
[30] H. Katsuura and D.A. Sprecher. Computational Aspects of Kolmogorov's Superposition Theorem. Neural Networks, 7(3), 455-461, 1994.
[31] A.H. Khan and E.L. Hines. Integer-Weight Neural Networks. Electronics Letters, 30(13), 1237-1238, 1994.
[32] P. Koiran. On the Complexity of Approximating Mappings Using Feedforward Networks. Neural Networks, 6(5), 649-653, 1993.
[33] A.N. Kolmogorov. On the Representation of Continuous Functions of Many Variables by Superposition of Continuous Functions of One Variable and Addition. Dokl. Akad. Nauk SSSR, 114, 953-956, 1957 (English translation in Transl. Amer. Math. Soc., 2(28), 55-59, 1963).
[34] V. Kurkova. Kolmogorov's Theorem and Multilayer Neural Networks. Neural Networks, 5(4), 501-506, 1992.
[35] Y. LeCun. Modeles connexionnistes de l'apprentissage. MSc thesis, Universite Pierre et Marie Curie, Paris, 1987.
[36] O.B. Lupanov. The Synthesis of Circuits from Threshold Elements. Problemy Kibernetiki, 20, 109-140, 1973.
[37] R.C. Minnick. Linear-Input Logic. IRE Trans. on Electronic Computers, 10, 6-16, 1961.
[38] J. Myhill and W.H. Kautz. On the Size of Weights Required for Linear-Input Switching Functions. IRE Trans. on Electronic Computers, EC-10, 288-290, 1961.
[39] E.I. Neciporuk. The Synthesis of Networks from Threshold Elements. Soviet Mathematics-Doklady, 5(1), 163-166, 1964; English transl.: Automation Express, 7(1), 35-39 and 7(2), 27-32, 1964.
[40] M. Nees. Approximate Versions of Kolmogorov's Superposition Theorem, Proved Constructively. J. of Computational and Applied Math., 54(2), 239-250, 1994.
[41] M. Nees. Chebyshev Approximation by Discrete Superposition: Application to Neural Networks. Advances in Computational Mathematics, 5(2), 137-152, 1996.
[42] I. Parberry. Circuit Complexity and Neural Networks. MIT Press, Cambridge, MA, 1994.
[43] P. Raghavan. Learning in Threshold Networks: A Computational Model and Applications. Tech. Rep. RC 13859, IBM Res., 1988; also in Proc. 1st Workshop on Computational Learning Theory, ACM Press, 19-27, 1988.
[44] N.P. Red'kin. Synthesis of Threshold Circuits for Certain Classes of Boolean Functions. Kibernetika, 6(5), 6-9, 1970 (English translation in Cybernetics, 6(5), 540-544, 1973).
[45] V.P. Roychowdhury, A. Orlitsky and K.-Y. Siu. Lower Bounds on Threshold and Related Circuits Via Communication Complexity. IEEE Trans. on Information Theory, 40(2), 467-474, 1994.
[46] K.-Y. Siu, V. Roychowdhury and T. Kailath. Depth-Size Tradeoffs for Neural Computations. IEEE Trans. on Comp., 40(12), 1402-1412, 1991.
[47] J. Sima. Back-propagation Is Not Efficient. Neural Networks, 9(6), 1017-1023, 1996.
[48] E.D. Sontag. Shattering All Sets of k Points in "General Position" Requires (k - 1)/2 Parameters. Report SYCON 96-01, Rutgers Center for Systems & Control, Dept. of Maths., Rutgers Univ., New Brunswick, NJ, 1996; also in Neural Computation, 9(2), 337-348, 1997.
[49] D.A. Sprecher. A Numerical Implementation of Kolmogorov's Superpositions. Neural Networks, 9(5), 765-772, 1996.
[50] D.A. Sprecher. A Numerical Implementation of Kolmogorov's Superpositions II. Neural Networks, 10(3), 447-457, 1997.
[51] M. Stevenson and S. Huq. On the Capability of Threshold Adalines with Limited-Precision Weights. Neural Computation, 8(8), 1603-1610, 1996.
[52] J. Wray and G.G.R. Green. Neural Networks, Approximation Theory, and Finite Precision Computation. Neural Networks, 8(1), 31-37, 1995.
