
Artificial Neural Networks in Accounting and Finance: Modeling Issues

James R. Coakley* and Carol E. Brown
Oregon State University, USA

Abstract

This article reviews the literature on artificial neural networks (ANNs) applied to accounting and finance problems and summarizes the ‘suggestions’ from this literature. The first section reviews the basic foundation of ANNs to provide a common basis for further elaboration and suggests criteria that should be used to determine whether the use of an ANN is appropriate. The second section of the paper discusses development of ANN models including: selection of the learning algorithm, choice of the error and transfer functions, specification of the architecture, preparation of the data to match the architecture, and training of the network. The final section presents some general guidelines and a brief summary of research progress and open research questions. Copyright 2000 John Wiley & Sons, Ltd.

INTRODUCTION

Artificial neural networks (ANNs) have emerged as a powerful statistical modeling technique. ANNs detect the underlying functional relationships within a set of data and perform such tasks as pattern recognition, classification, evaluation, modeling, prediction and control (Lawrence and Andriola, 1992). Several systems based on ANNs and in widespread commercial use for accounting and financial tasks are described by Brown et al. (1995). These include:

- FALCON, used by six of the ten largest credit card companies to screen transactions for potential fraud.

- Inspector, used by Chemical Bank to screen foreign currency transactions.

* Correspondence to: James R. Coakley, College of Business, Department of Accounting, Finance and Information Management, Bexell Hall 200, Corvallis, OR 97331-2603, USA. E-mail: [email protected]

Copyright 2000 John Wiley & Sons, Ltd. Accepted 7 July 1999

International Journal of Intelligent Systems in Accounting, Finance & Management

Int. J. Intell. Sys. Acc. Fin. Mgmt. 9, 119–144 (2000)

- Several ANNs used to assist in managing investments by making predictions about debt and equity securities as well as derivative instruments.

- Several ANNs used for credit granting, including GMAC’s Credit Advisor that grants instant credit for automobile loans.

- AREAS, used for residential property valuation.

Developers of commercially used ANNs generally consider the inner workings of their systems to be proprietary information that gives them a competitive advantage. Thus, they are reluctant to disclose information about those systems. The fact that a system is in commercial use provides some insight into the kind of tasks and domains for which ANNs are likely to be useful. Unfortunately, the lack of details on internal system structure precludes using the published information about those systems to help answer open research questions about the best structure for ANNs in accounting and finance.


In the past few years, many researchers have used ANNs to analyze traditional classification and prediction problems in accounting and finance. The research that has compared the ANNs with traditional statistical methods produced mixed results. In spite of all the research, we still do not have any general guidelines to tell us when to use ANNs or which ANN architecture to employ.

Numerous articles have appeared recently that surveyed journal articles on ANNs applied to business situations. Wong et al. (1997) surveyed 203 articles from 1988 through 1995. They classified the articles by year of publication, application area (accounting, finance, etc.), journal, various decision characteristics (problem domain, decision process phase, level of management, level of task interdependence), means of development, integration with other technologies, comparative technique (discriminant analysis, regression analysis, logit and ID3 were the most common techniques), and major contribution. The survey included five articles in accounting and auditing, and 54 articles in finance. Given the focus toward decision process, and the limited coverage of accounting and auditing, the paper does not provide much insight into accounting and finance issues.

O’Leary (1998) analyzed 15 articles that applied ANNs to predict corporate failure or bankruptcy. For each study, he provided information about the data, the ANN model and software (means of development), the structure of the ANN (input, hidden and output layers), training and testing, and the alternative parametric methods used as a benchmark. He then analyzed the overall ability of the ANN models to perform the prediction task.

Zhang et al. (1998) surveyed 21 articles that addressed modeling issues when ANNs are applied for forecasting, and an additional 11 studies that compared the relative performance of ANNs with traditional statistical methods. For the modeling issues, they addressed the type of data, size of the training and test samples, architecture of the model (number of nodes in each layer and transfer function), training algorithm used, and the method of data normalization. The paper provides insights into these modeling issues by summarizing the numerous, sometimes conflicting, suggestions derived from the articles. Only one of the articles pertained to an accounting and finance problem.

Vellido et al. (1999) surveyed 123 articles from 1992 through 1998. They included 8 articles in accounting and auditing, and 44 articles in finance (23 on bankruptcy prediction, 11 on credit evaluation, and 10 in other areas). They provided information on the ANN model applied, the method used to validate training of the model, the sample size and number of decision variables, the comparative parametric/linear technique used as a benchmark, and the main contribution of the article. They summarize the most frequently cited advantages and disadvantages of the ANN models. The paper does provide a sense of which ANN methodologies have been applied to accounting and finance problems, but offers little insight into how to choose and develop an appropriate ANN model.

The objective of this paper is to provide a broader review of the literature on ANNs applied to accounting and finance problems, focusing on the modeling issues, and summarizing the ‘suggestions’ from this literature. It is more like a tutorial on modeling issues than a critical analysis. A subsequent paper will provide a more critical look at these modeling issues in the accounting and finance articles.

The first section will review the basic foundation of ANNs to provide a common basis for further elaboration. Based on this foundation, we suggest criteria that should be used to determine whether it is appropriate to use an ANN. For a more detailed description of ANNs, we refer the reader to numerous other articles that provide insights into various networks (Anderson and Rosenfeld, 1988; Hecht-Nielsen, 1990; Hertz et al., 1991; Hoptroff et al., 1991; Lawrence, 1991; Rumelhart and McClelland, 1986; Waite and Hardenbergh, 1989; Wasserman, 1989).

The second section of the paper discusses development of ANN models. Once the decision to use an ANN is made, the researcher faces numerous modeling decisions concerning:

- The selection of the learning algorithm
- The choice of the error and transfer functions
- The specification of the architecture
- The appropriate preparation of the data to match the architecture, and
- The approach used to train the network.

The final section presents some general guidelines, conclusions on research progress and open research questions.

FOUNDATIONS

Background

ANNs are structures of highly interconnected elementary computational units. They are called neural because the model of the nervous systems of animals inspired them. Each computational unit (see Figure 1) has a set of input connections that receive signals from other computational units and a bias adjustment, a set of weights for each input connection and bias adjustment, and a transfer function that transforms the sum of the weighted inputs and bias to decide the value of the output from the computational unit. The sum value for the computational unit (node j) is the linear combination of all signals from each connection (Ai) times the value of the connection weight between node j and connection i (Wji) (equation (1)). Note that equation (1) is similar to the equation form of multiple regression: Y′ = B0 + Σi [Bi * Xi]. The output for node j is the result of applying a transfer function g (equation (2)) to the sum value (Sumj).

Sumj = Σi [Wji * Ai]   (1)

Oj = g(Sumj)   (2)

Figure 1  Structure of a computational unit (node j)

If the transfer function applied in equation (2) is linear, then the computational unit resembles the multiple regression model. If the transfer function applied in equation (2) is the sigmoid (see description below), then the computational unit resembles the logistic regression model. The only difference between the ANN and regression models is the manner in which the values for the weights are established. ANNs employ a dynamic programming approach to iteratively adjust the weights until the error is minimized, while the regression models compute the weights using a mathematical technique that minimizes the squared error.
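To make equations (1) and (2) concrete, here is a minimal Python sketch (our illustration, not code from the paper; the function names and input values are made up). With the linear transfer function the unit mirrors the regression form above; with the sigmoid it mirrors logistic regression:

```python
import math

def node_output(inputs, weights, bias, transfer):
    # Equation (1): weighted sum of the input signals A_i plus the bias term
    sum_j = sum(w * a for w, a in zip(weights, inputs)) + bias
    # Equation (2): apply the transfer function g to the sum
    return transfer(sum_j)

linear = lambda h: h                             # resembles multiple regression
sigmoid = lambda h: 1.0 / (1.0 + math.exp(-h))   # resembles logistic regression

inputs = [0.5, -1.2, 0.8]    # signals A_i from connected units
weights = [0.4, 0.1, -0.7]   # connection weights W_ji
bias = 0.2

print(node_output(inputs, weights, bias, linear))
print(node_output(inputs, weights, bias, sigmoid))
```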

Most ANNs applied in the literature are actually a network of these computational units (hereafter referred to as nodes) interconnected to function as a collective system. The architecture of the network defines how the nodes in a network are interconnected. A multi-layer, feed-forward architecture is depicted in Figure 2. The nodes are organized into a series of layers with an input layer, one or more hidden layers, and an output layer. Data flows through this network in one direction only, from the input layer to the output layer.

Figure 2  Feed-forward neural network structure with two hidden layers
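A sketch of the corresponding forward pass (again our illustration, with made-up layer sizes and weights) shows the one-way flow of data from layer to layer:

```python
import math

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def layer_forward(inputs, weight_matrix, biases):
    # One row of weights per node in this layer; each node applies
    # equations (1) and (2) to the outputs of the previous layer.
    return [sigmoid(sum(w * a for w, a in zip(row, inputs)) + b)
            for row, b in zip(weight_matrix, biases)]

def feed_forward(inputs, layers):
    # layers is a list of (weight_matrix, biases) pairs ordered from the
    # first hidden layer to the output layer; data flows one way only.
    for weight_matrix, biases in layers:
        inputs = layer_forward(inputs, weight_matrix, biases)
    return inputs

# Two inputs -> two hidden nodes -> one output node (weights made up)
hidden = ([[0.5, -0.3], [0.8, 0.1]], [0.0, -0.2])
output = ([[1.2, -1.0]], [0.1])
print(feed_forward([0.6, 0.9], [hidden, output]))
```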

What is the advantage of a multi-layered network? An ANN model with no hidden layers (a single computational unit) can separate data that falls on opposite sides of a hyperplane (see the left panel in Figure 3). Compared to a linear model, the hyperplane generated by the ANN model will be non-linear (note that it has a large linear section, but curves at the endpoints). In this example, the linear and non-linear ANN models would produce comparable classification accuracies (each would misclassify four of the 49 data points).

If a single hidden layer is added, then each node in the hidden layer will form a hyperplane. The nodes in the output layer of this single hidden layer network combine the hyperplanes to create convex open or closed regions. An ANN model with two nodes in a single hidden layer is depicted in the middle panel in Figure 3. In this example, a single node in the output layer selects which of the hyperplanes to apply. The classification accuracy of this one hidden layer model has increased to 48 of the 49 data points.

If two hidden layers are used, then the second hidden layer combines the hyperplanes from the first hidden layer into convex regions, and the nodes in the output layer now combine the convex regions to form concave regions.

Figure 3  Example to demonstrate the advantages of hidden layers


The ANN model depicted in the right panel of Figure 3 would have six nodes in the first hidden layer and four nodes in the second hidden layer. The single node in the output layer would select among the four convex regions. Note that we are now able to attain 100% classification accuracy with the ANN model.

Research has shown that an ANN with two hidden layers can approximate a particular set of functions to a given accuracy (Cybenko, 1988; Lapedes and Farber, 1987). Research has also shown that it is possible to use a single hidden layer to approximate a continuous function and achieve the desired accuracy (Cybenko, 1989; Hecht-Nielsen, 1990; Hertz et al., 1991; Hornik et al., 1989). Since a single hidden-layer model is less complex, most researchers have favored these over two hidden-layer models. We were not able to find any references in the literature where more than two hidden layers were used. Thus, we will limit the discussions in this paper to one and two hidden-layer models.

Networks that contain feedback connections are said to be recurrent. Recurrent networks recirculate previous outputs back to the inputs, similar to the concept of using lagged variables in forecasting. The recurrent feature of an ANN inherently considers the moving average factor in a time series and outperforms a multi-layered network (Jhee and Lee, 1993). However, most ANN applications in accounting and finance have employed a multi-layer, feed-forward architecture.

Linear versus Non-linear Models

As discussed above, an ANN with no hidden layers is similar to a generalized linear model. Feedforward ANNs with hidden layers are a subset of the larger class of non-linear regression and discrimination models. Generally, one chooses a non-linear model over a linear model when the underlying relationships between the variables are either known to be non-linear, or are not known a priori.

Linear economic models are not able to capture non-linear patterns and trends in the relationships between and within most econometric predictor variables used in accounting and finance research. For example, a time series can be broken down into four components: secular trend, cyclical variation, seasonal fluctuation and irregular fluctuation. The irregular fluctuation can be further divided into deterministic chaotic behavior and stochastic noise. Conventional linear techniques cannot distinguish between these two sub-components of random noise and non-linear relationships (Wong, 1991).

Austin et al. (1997) used an ANN to identify mathematical relationships between current values of macroeconomic and financial variables and future stock market valuations. They found that linear economic models were not able to capture non-linear patterns and trends in the relationships between and within stock and bond price movements. The linear techniques could not distinguish between random noise and non-linear relationships.

When applied to classification tasks, ANNs have the ability to induce algorithms for recognizing patterns. Thus, the knowledge base is inferred from the facts presented to train the network. The disadvantage, however, is that these networks do not have facilities to explain how a conclusion is reached. Hence, they are best suited for rather straightforward discrimination and classification problems involving complex relationships. Curram and Mingers (1994) found that ANNs provided better classification rates than linear discriminant analysis on data sets that were non-linear, but were generally slightly worse where the data were linearly separable. When using real data where the degree of non-linearity is often not known, they suggest using both methods and comparing the performance.

A limitation of using financial ratios with linear discrimination models is that the ratio must provide a dichotomous signal. For example, a ratio value above or below a given threshold must always signal financial distress. In practice, a ratio may signal distress both when it is higher than normal and when it is lower than normal (Coats and Fant, 1993). So in these situations, ANNs may provide a better model to capture the underlying relationships between the financial ratios and the dependent variable.

In general, an ANN should not be used to model inherently linear relationships. At best, the ANN will turn out to be a far more complex and cumbersome approach to achieve comparable results. ANNs should be applied when there is some evidence of non-linearity in the relationships between the dependent and explanatory variables.

Parametric versus Non-parametric

Parametric models, and regression models in particular, have become abused statistical methods. Tests are routinely performed and inferences made without verifying normality of errors, independence of errors and constancy of error variance (Marques et al., 1991). ANNs can be used as parametric models. Previous research has shown that back-propagation will minimize the least squared error if (1) the model does not get trapped in a local optimum and (2) there are an adequate number of nodes in the hidden layer. If an ANN is used as a parametric model, the same sort of distributional assumptions are required for error terms as for a statistical model. These distributional assumptions of noise that is independent between each case and normally distributed with equal variance affect the ability to make statistical inferences about the results.

However, ANNs applied as non-parametric models can easily incorporate multiple sources of evidence without simplifying assumptions concerning the functional form of the relationship between output and predictor variables. In many cases, the predictor variables are not normally distributed and customary transformations have been unsatisfactory. Markham and Ragsdale (1995) found that the data typically used in the study of bank failures did not fit normal distributions even after data transformations. When such statistical assumptions (distribution, independence of multiple features, etc.) are not valid, an ANN that does not rely on these assumptions provides better generalization properties and seems to be better suited to handle small sample problems (Niles et al., 1989).


Parametric statistical models require the developer to specify the nature of the functional relationships between dependent and independent variables. ANNs use the data to develop an internal representation of the relationship between the variables so that a priori assumptions about underlying parameter distributions are not required. Thus, better results would be expected from the ANN if the known relationships between the variables do not fit the appropriate parametric statistical model.

Most parametric models require that the input variables be linearly separable. That is, no two input variables can have similar influences on the dependent variables. When financial ratios and aggregate account balances are used as input, this requirement can be easily violated.

Within a parametric model, outliers in a data set influence the size of the correlation coefficient, the average value for a group, or the variability of scores within a group. Univariate outliers, cases that are extreme with respect to one of the input variables, are relatively easy to detect. There are situations where the outlier is caused by the interaction of two or more variables. These multivariate outliers are harder to detect since the values for each individual variable are within bounds (they were not detected as a univariate outlier). There are numerous aspects of an ANN that make them more robust with respect to outliers (non-linear transfer functions within each computational unit, moving average weight adjustments, etc.). Research has shown that ANNs are more robust than regression when outliers are present in the data (Marques et al., 1991; Subramanian et al., 1993).
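As an illustration of the detection problem (our sketch, not taken from the studies cited; the cutoff values are arbitrary assumptions), a univariate screen checks each variable separately, while a multivariate screen such as Mahalanobis distance can flag cases whose individual values are all within bounds:

```python
import numpy as np

def univariate_outliers(X, z_cut=3.0):
    # Flag cases extreme on any single variable (z-score screen);
    # X is an (n_cases, n_variables) array
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    return np.any(z > z_cut, axis=1)

def multivariate_outliers(X, d_cut=3.0):
    # Flag cases whose *combination* of values is extreme, even though
    # each individual value may be within bounds (Mahalanobis distance)
    centered = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.sum((centered @ inv_cov) * centered, axis=1)
    return np.sqrt(d2) > d_cut
```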

In summary, ANNs applied as non-parametric models are not as constrained by distribution-related requirements as most traditional statistical models. The non-parametric ANN model may be preferred over traditional parametric statistical models in those situations where the input data do not meet the assumptions required by the parametric model, or when large outliers are evident in the data set.


Selection Criteria

Most ANNs used in the accounting and financial literature have been applied as statistical methods, not as artificial intelligence techniques that attempt to mimic the basic architecture and functioning mechanisms of the human brain. We consider an ANN as a non-linear statistical method that offers certain advantages over its parametric counterparts. As such, the first fundamental decision that should be made by a researcher is whether it is appropriate to use an ANN to solve accounting and finance problems, as opposed to a traditional parametric statistical model.

DETERMINE APPROPRIATE PARAMETRIC MODEL

As a first step, the researcher should determine the research question being asked and the appropriate parametric modeling approach that would be used if all the underlying assumptions were valid. Tabachnick and Fidell (1983) offer a classification scheme relating research questions to the appropriate parametric models. We have extended this classification scheme to distinguish the nature of the output variable(s) (continuous versus discrete). We have also categorized ANN applications to accounting and finance problems according to this scheme (see Table 1).

The first category assesses the strength of the relationship between the input and output variables. The greater the strength of the relationship, the greater the ability to predict the value of the output variable. When the output variables are continuous, the traditional parametric models applied to answer these research questions have been regression-based and econometric approaches (Box–Jenkins ARIMA). The accounting and finance problems which fall into this category include the application of forecasting techniques to stock prices and other market-related indicators, cost estimation, and other managerial forecasting studies. When the output variables are discrete, then logistical regression models must be applied. However, the accounting and finance studies which have used logistic regression have focused more on the two-group classification problem and are included as classification studies.


Table 1  Classification scheme for research questions

Research question type: Assess degree of relationship (forecasting, pattern recognition)
  Type of output: Continuous
  Parametric model: Regression; econometric approaches (Box–Jenkins ARIMA)
  Accounting and finance ANN applications:
  - Stock price and market-related studies (White, 1988; Kamijo and Tanigawa, 1990; Kimoto et al., 1990; Yoon and Swales, 1991; Bergerson and Wunsch, 1991; Wong, 1991; Trippi and De Sieno, 1992; Kryzanowski et al., 1993; Grudnitski and Osburn, 1993; Charitou and Charalambous, 1995; Austin et al., 1997; Kohara et al., 1997)
  - Cost estimation (Hoyt and Lay, 1995)
  - Analytic review (Coakley and Brown, 1993; Coakley, 1995)
  - Bankruptcy (Udo, 1993)
  - Managerial forecasting studies (Jhee and Lee, 1993; Choi et al., 1997)

  Type of output: Discrete
  Parametric model: LOGIT
  Accounting and finance ANN applications: See two-group classification below.

Research question type: Predict group membership (two-group and multiple-group classification)
  Type of output: Discrete
  Parametric model: LOGIT; discriminant functions; usually includes tests of statistical significance and the evaluation of the various predictor variables
  Accounting and finance ANN applications (prediction of group membership only):
  Two-group classification studies:
  - Predictions of management fraud (Fanning et al., 1995; Green and Choi, 1997; Fanning and Cogger, 1988)
  - Auditor’s going concern uncertainty opinion (Hansen and Messier, 1991; Coats and Fant, 1993; Lenard et al., 1995; Anandarajan and Anandarajan, 1999)
  - Bank failures (Bell et al., 1990; Chung and Tam, 1992; Tam and Kiang, 1992; Subramanian et al., 1993; Markham and Ragsdale, 1995; Bell, 1997; Etheridge and Sriram, 1997)
  - Bankruptcy (Boucher, 1990; Raghupathi et al., 1991; Salchenberger et al., 1992; Odom and Sharda, 1992; Brockett et al., 1994; Fanning and Cogger, 1994; Wilson and Sharda, 1994; Boritz et al., 1995; Boritz and Kennedy, 1995; Huang et al., 1994; Barniv et al., 1997; Jo et al., 1997)
  - Credit evaluation and underwriting (Collins et al., 1988; Reilly et al., 1991; Jensen, 1992; Grudnitski et al., 1995; Torsun, 1996)
  - Bond ratings (Dutta and Shekhar, 1988; Surkan and Singleton, 1991)
  Multiple-group classification studies:
  - Bond ratings (Kwon et al., 1997; Daniels et al., 1997; Maher and Sen, 1997)

Research question type: Assess underlying structure
  Type of output: Continuous
  Parametric model: Principal components analysis; factor analysis
  Accounting and finance ANN applications: Traditional parametric techniques in this category were applied to achieve data reduction. ANN models with Hebbian learning are closely related to principal components analysis. No studies were found that applied ANN models in this category.

  Type of output: Discrete
  Parametric model: Cluster analysis
  Accounting and finance ANN applications:
  - Prediction of bank failures (Etheridge and Sriram, 1997): a Categorical Learning ANN was used, but results were compared with logit and discriminant analysis rather than a k-means clustering algorithm.
  - Bankruptcy (Chen et al., 1995; Serrano-Cinca, 1996): Kohonen networks are very similar to k-means cluster analysis.

Research question type: Significance of group differences
  Type of output: Continuous (parametric model: ANOVA) or Discrete (parametric model: PROBIT)
  Accounting and finance ANN applications: None found. A limitation of the ANN models is the computational complexity required to assess the effect of an input variable on the output variables.

Note: A short summary of each study cited in this table can be found in Coakley et al. (1995).


The second category measures the degree to which group membership can be predicted from the various input variables. The groups act as discrete levels of the output variable (hence there are no comparable models for continuous output variables). This type of research question is analyzed with logistic regression and discriminant functions, and usually includes tests of statistical significance and the evaluation of the various predictor variables. The ANN models focus on the prediction of group membership only. We further subdivide the accounting and finance studies in this category into two-group and multiple-group classification problems. The two-group classification studies include predictions of management fraud, the auditor’s going concern uncertainty opinion, bank failures, bankruptcy, and credit evaluation and underwriting. The studies on predicting bond ratings range from two-group classification to multiple-group classifications.

The third category deals with finding an underlying structure inherent in a set of financial data. These types of problems are answered with principal components analysis (and factor analysis) when the output variables are continuous, and cluster analysis when the output variables are discrete. ANN models with Hebbian learning are closely related to principal components analysis. In the accounting and finance literature, the traditional parametric techniques in this category were applied to achieve data reduction. We did not find any studies where these types of ANN models were applied to accounting and finance problems, although determining the information content of financial statements seems a likely candidate.

There were two studies that applied Kohonen networks to perform k-means cluster analysis. Although Etheridge and Sriram (1997) did apply a Categorical Learning ANN, the study focused on the prediction of bank failures and the results were compared with logit and discriminant analysis (versus a k-means clustering algorithm). Chen et al. (1995) evaluated bankruptcy data and found that the Self Organizing Map achieved higher classification accuracy than conventional clustering methods (84.8% compared to 54.3%). Serrano-Cinca (1996) applied a Self Organizing Feature Map to cluster firms into zones of bankruptcy, zones of solvency, and a zone of doubtful insolvency. This zone of doubtful insolvency contained the firms that were mis-classified in previous studies using the same data set (Odom and Sharda, 1992; Wilson and Sharda, 1994). In addition to improved classification accuracy, the SOFM provides graphic output depicting the risk of bankruptcy and the financial characteristics of the firm.
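To illustrate the kinship between the two techniques (a sketch of ours, not taken from the studies cited), compare one update step of k-means with one winner-take-all Kohonen update; both pull a prototype vector toward the cases it represents:

```python
import numpy as np

def kmeans_step(X, centroids):
    # k-means: assign each case to its nearest centroid, then move each
    # centroid to the mean of its assigned cases
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = np.argmin(dists, axis=1)
    return np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                     else centroids[k] for k in range(len(centroids))])

def kohonen_step(x, weights, lr=0.1):
    # Kohonen (winner-take-all, neighborhood omitted for brevity): nudge
    # the winning node's weight vector toward the presented input pattern
    winner = np.argmin(((weights - x) ** 2).sum(axis=1))
    weights[winner] += lr * (x - weights[winner])
    return weights
```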

The final category measures the significance of group differences. The emphasis is on inferring population differences among groups. The classic parametric model is Analysis of Variance (ANOVA) when the output variable is continuous, and PROBIT when the output variable is discrete. One of the limitations of the ANN models is the computational complexity required to assess the effect of an input variable on the output variables. To date, there are no known attempts to apply ANN models in this manner.


Validate Underlying Assumptions

Once the nature of the research question has been established, and the appropriate parametric model is determined, the next step is to validate the underlying assumptions for the parametric model.

(1) If the validity of these assumptions is questionable, then an ANN may provide better results.

- When applied to determine the degree of relationships with continuous outputs, ANNs have been shown to outperform ARIMA (Choi et al., 1997), and multiple regression (Kohara et al., 1997; White, 1988).

- When applied to determine the degree of relationships with discrete outputs (two-group classifications), ANNs have been shown to produce lower type I errors (e.g. 0.13 for ANN, 0.536 for logit and 0.533 for discriminant analysis), but also higher type II errors (0.09 for ANN, 0.012 for logit and 0.014 for discriminant analysis). In general, the combined type I and II errors are lower for the ANNs over the other methods. However, the comparison of the methods becomes dependent on the relative cost of the type I versus type II errors (Etheridge and Sriram, 1997).

(2) If complex non-linear relationships are expected between the predictor variables, then a non-linear ANN should provide better results than a linear model. If the non-linear relationships do not exist, the ANN results should be similar to linear model results (Charitou and Charalambous, 1996).

(3) Marques et al. (1991) suggest that ANNs will perform better than their parametric counterparts under conditions of high noise and low sample size. The ANNs become less competitive when there is low noise or large samples. Slightly different results were obtained by Markham and Rakes (1998). Using simulated data, they compared linear regression and ANN models across 20 datasets varying the sample size and the variance in the error. With low variance, linear regression outperformed ANN models across all sample sizes. With high variance, ANN models were better. With medium variance, regression performed better with small sample sizes, while ANNs performed better with large sample sizes.

(4) With respect to classification problems, ANNs perform better as the number of variables and groups increase, and are more robust to changes in sample size and proportion (Subramanian et al., 1993).

If the underlying distributional properties of the data are not known a priori, it is recommended that both ANN and parametric models be developed and compared. The use of an ANN model does not guarantee better results over a parametric model. Poor performance with both models could indicate the lack of a deterministic relationship between inputs and outputs. In general, failures in ANN performance can be attributed to the use of an inadequate learning algorithm, the use of an inappropriate model and/or architecture, invalid data representation to the model, or inadequate training of the network. These factors are discussed in the next section.

DEVELOPMENT OF THE ANN MODEL

Once the decision to use an ANN is made, the researcher faces numerous decisions including selection of the learning algorithm, choice of the error and transfer functions, specification of the architecture, preparation of the data to match the architecture, and training of the network.

Selection of the Learning Algorithm

ANNs effectively filter input to produce output. More specifically, an ANN looks for patterns in a set of examples applied to the input layer of the network, and learns from those examples to produce new patterns, the output. Knowledge within the ANN is maintained in the weights. The process of learning is implemented by changing the weights until the desired response is attained at the output nodes. In an ANN with linear transfer functions, the weights can be derived using matrix manipulation (much like solving the general linear model). In an ANN with non-linear transfer functions, two learning mechanisms can be used to derive the weights: unsupervised learning and supervised learning.

Unsupervised learning is analogous to a cluster analysis approach. The network self-organizes to identify the salient properties of the input data set. The input patterns are classified according to their degree of similarity, with similar patterns activating the same output pattern.

Supervised learning accepts input examples, computes the output values, compares the computed output values to the desired output values (termed target values), and then adjusts the network weights to reduce the difference (error). The learning process is repeated until the difference between the computed and target output values is at an acceptably low value.

The most common supervised learning algorithm is back-propagation. Back-propagation employs a gradient-descent search method to find weights that minimize the global error from the error function. The error signal from the error function is propagated back through the network from the output layer, making adjustments to the connection weights that are proportional to the error. The process limits overreaction to any single, potentially inconsistent data item by making small shifts in the weights. A complete description of the algorithm and its derivations can be found in numerous sources, including Hertz et al. (1991) and Rumelhart et al. (1986).

The commonly cited disadvantages of the back-propagation learning algorithm are that the training time usually grows exponentially as the number of nodes increases, and there are no assurances that a global minimum will be reached. The use of partial derivatives in the gradient-descent algorithm makes the assumption that the error surface is locally linear. At points of high curvature this linearity assumption does not hold and divergent behavior might occur. To compensate for this possibility, an infinitesimally small step size (learning rate) must be used to guarantee asymptotic convergence to a minimum point.

But small learning coefficients lead to slow learning. The momentum factor (delta weight) was added to the back-propagation algorithm to act as a low-pass filter on the weight adjustment terms. General trends are reinforced whereas oscillatory behavior cancels itself out. It allows a low learning coefficient with faster learning. But incorrect specification of these parameters (learning rate and momentum term) can lead to either extremely slow convergence or to oscillatory behavior without convergence.
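A minimal sketch of this update rule (our illustration, not code from the paper; eta and alpha stand for the learning coefficient and momentum factor, with made-up default values):

```python
def update_weight(w, grad, prev_delta, eta=0.05, alpha=0.9):
    # Gradient-descent step with a momentum term: the new adjustment
    # blends the current downhill step with the previous adjustment,
    # reinforcing consistent trends and damping oscillation
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta

# Usage: carry prev_delta from one training step to the next
w, prev = 0.3, 0.0
w, prev = update_weight(w, grad=0.8, prev_delta=prev)
```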

One technique suggested to speed up learning is the use of the conjugate-gradient algorithm. This algorithm uses the second derivative of the error function to exploit information regarding both the slope and curvature of the response surface. When compared to back-propagation, the conjugate-gradient algorithm has been shown to produce comparable results with much faster training times (Charitou and Charalambous, 1996; Wong, 1991).

A major drawback of the gradient descent algorithms is that there is no way of determining in advance whether the architecture and selected methods will converge. Real error surfaces, with multiple weight dimensions, tend to have very complex features including dents and ravines. Although the gradient-descent method always follows the steepest path towards the bottom of the error surface, it may get stuck within a large dent or ravine on the side of the surface. Numerous methods are available to compensate for this tendency to find local (non-optimal) minima. Some methods adjust the weight derivation process to maintain the momentum established by previous adjustments. Other methods involve starting from a different point on the error surface (using a different set of initial weights) and ensuring that the results are similar. Still other methods involve the dynamic adjustment of the network architecture (trimming nodes or connections between nodes).

Evolutionary programming is a stochastic optimization technique that has been used in accounting and finance problems as an alternative to the conventional gradient methods. Evolutionary programming involves two processes: mutation and selection. The algorithm starts with an original population of weight sets that are evaluated by examining the corresponding outputs. Random mutation of the parents creates new solutions. Specifically, a Gaussian random variable with mean zero and variance equal to the mean squared error of the parent perturbs each weight and bias term. Each offspring is evaluated, and the n ‘best’ solutions are retained (selected) as parents for the next iteration.
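The following sketch (ours, with a hypothetical fitness function supplied by the caller) shows one generation of this mutate-and-select loop; as in the description above, each weight is perturbed by a zero-mean Gaussian whose variance equals the parent’s mean squared error:

```python
import random

def evolve(population, fitness, n_parents):
    # One generation of evolutionary programming: each parent weight set
    # spawns an offspring by Gaussian mutation, then the n best weight
    # sets survive as parents for the next iteration.
    offspring = []
    for weights in population:
        mse = fitness(weights)  # lower is better; also sets mutation scale
        child = [w + random.gauss(0.0, mse ** 0.5) for w in weights]
        offspring.append(child)
    return sorted(population + offspring, key=fitness)[:n_parents]
```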

Genetic algorithms extend this mutation and selection process by adding a recombination phase to create the child nodes that are evaluated. Each child is formed as a cross between two parents, hence the term ‘crossover’. Goldberg (1994) cites numerous advantages of genetic algorithms: they can easily solve problems that have many difficult-to-find optima, they are noise tolerant, and they use very little problem-specific information. They work with the coding of the parameter set, not the specific values of the parameters. The major disadvantage cited for genetic algorithms is the difficulty in specifying the optimal parameter settings for the model.

A genetic algorithm was used by Huang et al. (1994) to predict financially distressed firms using unbalanced groups (the group of financially distressed companies represents less than 3% of the population). The ANN clearly dominated discriminant analysis and logit methods applied to the same data set. Levitan and Gupta (1996) applied a genetic algorithm to the cost driver optimization (CDO) problem in activity-based costing. CDO is a combinatorial optimization problem that is NP-hard. Compared to a greedy algorithm proposed by Babad and Balachandran to solve the CDO problem, the genetic algorithm was less complex (used fewer cost driver parameters) and calculated more accurate costs.

Genetic algorithms have been demonstrated as possible techniques to aid in the development of the ANN model. Back et al. (1996) used a genetic algorithm to select the best input variables for an ANN model applied for bankruptcy prediction. Hansen (1998) demonstrated the use of a genetic algorithm to find the most effective ANN architecture.

Optimal Estimation Theory has also been applied to accounting and finance problems as an alternative learning algorithm to back-propagation. Optimal Estimation Theory (OET), introduced by Shepanski (1988), uses a least squares estimator to calculate the optimal set of connection weights for the presented training set. This method significantly reduces the time required to train a network, but requires having the same number of nodes in the input and hidden layers. This method was used by Boritz et al. (1995) to predict bankruptcy filings. They compared the performance of the OET algorithm with back-propagation, and found mixed results. In general, when the numbers of bankrupt vs. non-bankrupt firms in the training sets were equal, the back-propagation algorithm provided higher classification accuracy. However, Boucher (1990) also used training data sets with equal numbers of bankrupt and non-bankrupt firms and found that the OET algorithm provided higher overall classification accuracy than the back-propagation algorithm.

It appears that the OET algorithm does train faster, but it is not known whether it achieves similar performance. The discussion below on network size suggests that the requirement to maintain a fixed number of nodes in the hidden layer may be a limitation of the OET algorithm. Larger or smaller networks trained with the back-propagation algorithm may achieve better results.

In a survey of ANNs in business, Vellido et al. (1999) found that the gradient descent back-propagation model dominated other methods (of 92 reported papers, 74 used back-propagation). Given the often-cited disadvantages of this approach, it still seems to be the favored one. The algorithm is intuitively simple, and applications incorporating back-propagation are freely available on numerous web sites. In addition, researchers have seen a ten-fold increase in processing power at the desktop in the past decade that may have diminished some of the previous concerns regarding slow training. Back-propagation will be the focus in the remainder of this paper.

Choice of the Error and Transfer Functions

Error Functions

The sum-of-squared-error (SSE) error function is the one most widely applied in the accounting and finance literature. Since the SSE function is differentiable and minimized when its arguments are equal, the resulting error surface (in three dimensions) looks like a bowl. The bowl’s bottom corresponds to the set of weights that produce the minimum error. At any given point on the bowl’s surface, the derivative of the error function provides the slope of the surface with respect to the weights. To minimize the error, the weights are adjusted to move down the slope towards the bowl’s bottom. The SSE function uniformly weights each training trial error in accordance with the square of the magnitude of the error vector (Target − Output). This error-measurement scheme ensures that large errors receive much greater attention than small errors. Also, the SSE takes into account the frequency of occurrence of particular inputs. It is much more sensitive to errors made on commonly encountered inputs than it is to errors on rare inputs (Hecht-Nielsen, 1990).

However, any differentiable function of the target and output variables, F(T, O), that is minimized when its arguments are equal could be used as an error function in the back-propagation model. The more important consideration is that the error function should reflect the statistical nature of the original data as well as any assumptions about measurement errors.

Coakley (1995) proposes that an alternative error function that maximizes the relative entropy between the input data streams may be more appropriate for classification tasks. This error function, based on information theory and proposed by Hertz et al. (1991), involves learning the correct probabilities of a set of hypotheses represented by the output unit. Information theory is applied to derive the expected information content of an input signal, which is expressed as a conditional likelihood function. The entropy error function is formed by maximizing the conditional likelihood function, and corresponds to the maximum conditional likelihood function commonly used in logit analysis (Loh, 1992; Palepu, 1986). The maximum conditional likelihood function is appropriate for dichotomous output variables. As a result of using the entropy error function, the ANN will place more weight on those signals that produce complementary information, and less weight on those signals that provide redundant or conflicting information.
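As an illustration (ours, not code from the paper), the two error functions for a dichotomous target could be computed as follows; the entropy form shown is the standard cross-entropy, which agrees with the relative-entropy/conditional-likelihood function described above up to an additive constant, and assumes outputs strictly between 0 and 1:

```python
import math

def sse_error(targets, outputs):
    # Sum-of-squared-error: each trial contributes the square of the
    # magnitude of (Target - Output)
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))

def entropy_error(targets, outputs):
    # Cross-entropy error for dichotomous targets (t in {0, 1});
    # minimizing it corresponds to maximizing the conditional
    # likelihood, as in logit analysis
    return -sum(t * math.log(o) + (1 - t) * math.log(1 - o)
                for t, o in zip(targets, outputs))
```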

Transfer Functions

The transfer function is used to derive the output of a node given its weight-adjusted input. The use of linear transfer functions requires that the input patterns be linearly independent. If non-linear transfer functions are used, linear independence of the input patterns is not required. Thus, non-linear transfer functions allow ANN models to be applied to a wider range of problems (Hertz et al., 1991).

Four non-linear transfer functions have been proposed in the literature for use with ANN models: sigmoid (logistic), half-sigmoid, sine (or cosine), and hyperbolic tangent (see Anderson and Rosenfeld, 1988; Hecht-Nielsen, 1990; Hertz et al., 1991; Lawrence, 1991; Rumelhart and McClelland, 1986; Stornetta and Huberman, 1987; Waite and Hardenbergh, 1989; Wasserman, 1989). We did not find any accounting and finance papers that reported the use of the sine transfer function, so have excluded that function from the analysis.

The sigmoid function (equation (3)) is a bounded differentiable real function that is defined for all real input values and has a positive derivative everywhere. As depicted in Figure 4, the sigmoid is a semi-linear function that makes a rapid yet smooth transition from a nearly proportional center section to the upper and lower limits. The sigmoid function is centered at 0.5, and provides an output range of zero to one.

g(h) = (High − Low) / (1 + exp(−Gain * (h − Center))) + Low   (3)

where:

- High is the upper limit for the output.
- Low is the lower limit for the output.
- Center is the value of the input at which the output is equal to (High + Low)/2.
- Gain is directly proportional to the derivative of the function at the Center point. With high gain (Gain >> 1), the sigmoid function approximates a step function, while at low gain (Gain << 1) it approximates a linear function.
- h is the sum of the weighted connections.

Figure 4  Comparison of sigmoid, half-sigmoid and hyperbolic tangent transfer functions

Stornetta and Huberman (1987) suggest that the conventional sigmoid function may not be optimal for all data sets. With the back-propagation algorithm, the magnitude of a weight adjustment at a particular node is proportional to the output level of that node. With a sigmoid function, it is possible to get an output value of zero from a node. These zero output values would result in no weight modification. They propose an alternative that changes the output range of the sigmoid transfer function by adding a bias of −0.5. Thus in equation (3), the High and Low values would be set to (0.5, −0.5) respectively. This modified sigmoid function, commonly termed the ‘Half-Sigmoid’, places less weight on those data values close to the mean, yet provides an improved capability for explaining those input data values close to the lower bound. Note in Figure 4 that the Half-Sigmoid function is centered at zero and has a range of ±0.5.

Another alternative is to use the hyperbolic tangent function (equation (4)). Like the half-sigmoid, it is centered at zero. However, the range for the output values is wider (±1 versus ±0.5). This function is more linear in nature and produces an effect similar to truncating outliers in a linear model.

g(h) = tanh(h)   (4)

Coakley et al. (1992) compared the performance of these three transfer functions when used in a multi-layer, feed-forward, back-propagation network. The study was limited to evaluating the forecasting performance of the ANN when applied to a computer-generated time series with seasonal variations and random noise. In general, they found that the hyperbolic tangent transfer function had faster convergence and slightly better predictive accuracy than the sigmoid and half-sigmoid functions. There were no consistent differences between the sigmoid and the half-sigmoid function in terms of either performance or speed of convergence.
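For reference, the three transfer functions compared above can be written directly from equations (3) and (4) (a sketch of ours; the parameter defaults are illustrative):

```python
import math

def sigmoid(h, high=1.0, low=0.0, gain=1.0, center=0.0):
    # Equation (3): generalized sigmoid with configurable output range;
    # defaults give the conventional sigmoid (output range zero to one)
    return (high - low) / (1.0 + math.exp(-gain * (h - center))) + low

def half_sigmoid(h):
    # Stornetta-Huberman variant: High/Low set to (0.5, -0.5),
    # centering the output range at zero
    return sigmoid(h, high=0.5, low=-0.5)

def hyperbolic_tangent(h):
    # Equation (4): wider output range (-1 to +1), more linear near zero
    return math.tanh(h)
```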

For an ANN with no hidden layers and a linear transfer function, Brown et al. (1993) present a theoretical explanation for why learning is faster when the input values are in the range −b to b rather than 0 to b. They extend their arguments to suggest that the use of the hyperbolic tangent transfer function in the hidden layer of an ANN is also better.

Selection Considerations

There are no clear criteria regarding which transfer function to use. The best guidance that can be offered is to consider how the transfer function relates to the error function. To solve the back-propagation algorithm, the weights in the network must be adjusted to minimize the error function. Using the gradient-descent method, the change in the weights (ΔWji) is derived from the derivative of the error function E with respect to the weights Wji (equation (5)). For the derivations presented here, subscript j refers to processing units in the output layer, subscript i refers to the processing units in the input layer, and there are no hidden layers in the model. The equations for the output processing units do not change in form if hidden layers are added to the model. In that case, the subscript i would refer to the processing units in the hidden layer immediately preceding the output layer:


ΔWji = −η (∂E / ∂Wji)   (5)

where:

- Eta (η) is the learning coefficient
- E = f(Target Value − Output Value), f any function with characteristics discussed in the following section
- Output Value = g(Σi [Wji * Ai]) (from equations (1) and (2))

Equation (5) is easier to solve with certain combinations of transfer function and error function. For example, the sigmoid transfer function and sum-of-squared-error error function combination provide a simple solution, as does the combination of the hyperbolic-tangent transfer function and relative-entropy error function.
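To see why the sigmoid/SSE pairing is convenient, consider the standard derivation (supplied here for completeness; it follows from equations (1), (2) and (5) and is not specific to this paper). Taking E = ½ Σj (Tj − Oj)² and applying the chain rule:

∂E/∂Wji = −(Tj − Oj) * g′(Sumj) * Ai

For the sigmoid of equation (3) with High = 1, Low = 0, Gain = 1 and Center = 0, the derivative takes the simple form g′(Sumj) = Oj (1 − Oj), so equation (5) becomes:

ΔWji = η (Tj − Oj) Oj (1 − Oj) Ai

(the factor ½ in E is included only to cancel the 2 from differentiation). The factor Oj (1 − Oj) vanishes as Oj approaches 0 or 1, which is exactly the bounding behavior discussed next.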

With the SSE error function, the partial derivative in equation (5) has a tendency to bound the maximum values of the weights. As Figure 5 depicts, the adjustment to the weights will tend to zero when the target and predicted values are equal, as expected, but will also be zero when the target and predicted values are at the extreme opposites (target = 0, predicted = 1 for the sigmoid case).

Figure 5  Delta for SSE error function

One interesting difference to note in the two graphs in Figure 5 is the relative values of the delta weights. For the sigmoid function, the delta weight values range from −0.15 to +0.15. For the hyperbolic tangent function, the values range from −1.2 to +1.2. This difference has caused some researchers to recommend the sigmoid function for classification problems that involve learning about average behavior, and the hyperbolic tangent function if the problem involves learning about deviations from the average (such as forecasting). This difference in delta weight values also becomes extremely important when the learning coefficient and momentum factors are set. Lower values need to be used if the hyperbolic tangent transfer function is to be used (Coakley et al., 1992).

For the entropy error function, a differentstory emerges. For the sigmoid transfer func-tion, the delta weight is only bounded whenthe predicted value approaches zero (top panelin Figure 6). When combined with the hyper-bolic tangent transfer function, the delta weightadjustment is linear (bottom panel in Figure 6).

Figure 6 Delta for entropy error function


The adjustment is still zero when the target and predicted values are equal, but the adjustment increases linearly as the difference between the target and predicted values increases. The implication is that if the learning coefficient and momentum adjustment factors are incorrectly specified, the ANN model will blow up very early in the training process.
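
To make these shapes concrete, here is a minimal numerical sketch (in Python; the function names are ours) of the output-layer deltas for the combinations discussed above:

```python
import numpy as np

def delta_sigmoid_sse(target, predicted):
    # Sigmoid transfer function with SSE error: the delta includes the
    # derivative g'(net) = y * (1 - y), which vanishes as the prediction
    # saturates at 0 or 1 -- the bounded shape in Figure 5 (about +/-0.15).
    return (target - predicted) * predicted * (1.0 - predicted)

def delta_tanh_sse(target, predicted):
    # Hyperbolic tangent with SSE error: g'(net) = 1 - y**2, giving the
    # wider +/-1.2 range noted above for Figure 5.
    return (target - predicted) * (1.0 - predicted ** 2)

def delta_entropy(target, predicted):
    # With the relative-entropy error the derivative terms cancel, so the
    # adjustment grows linearly with the error (Figure 6, bottom panel).
    return target - predicted

y = np.linspace(-0.99, 0.99, 199)
print(np.abs(delta_sigmoid_sse(1.0, (y + 1.0) / 2.0)).max())  # ~0.15
print(np.abs(delta_tanh_sse(1.0, y)).max())                   # ~1.2
```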

Specification of the ANN Architecture

Determining the appropriate ANN architecture is not a simple, straightforward task. Since numerous alternative architectures exist, the first task is to determine which architecture is appropriate for the problem. Most of the research in accounting and finance has used the multi-layer, feed-forward model. Thus, our discussion will be limited to this architecture.

Input and Output Layers

With the feed-forward architecture, the interface with the environment usually establishes the number of processing nodes in the input and output layers. For example, many authors (Boritz et al., 1995; Boucher, 1990; Coats and Fant, 1993) used an ANN to perform financial distress classifications based on the Altman (1968) model. Since the Altman model uses five financial ratios, each of these ANN architectures has a five-node input layer and a single-node output layer that represents the dichotomous classification of distressed vs. non-distressed.

When applied for forecasting, the number of input nodes corresponds to the number of lagged observations used to discover the underlying time-series patterns. This is a very critical decision, since it captures the complex autocorrelation structures in the data. According to Zhang et al. (1998), there is no systematic way to determine the number of lagged variables to include. They report that a number of studies are using genetic algorithms to determine the optimal input architecture.

The number of output nodes to use is not always straightforward. For example, Yoon and Swales (1991) used a model with one output node for distressed firms and a second output node for non-distressed firms. This does add additional information, in that the network can distinguish firms that exhibit characteristics that are clearly associated with distressed firms, firms that exhibit characteristics that are clearly associated with non-distressed firms, and firms that exhibit characteristics that are associated with both. In the multiple-group classification problem of predicting bond ratings, Kwon et al. (1997) used a single output node for each bond rating category (for a total of five nodes in the output layer) while Maher and Sen (1997) used a single output node to distinguish the six ordinal bond rating categories. Since the data sets are different, it is difficult to do a direct comparison of the results. But both models reported similar classification accuracy, and each model outperformed its parametric counterpart (multiple discriminant analysis in the Kwon study and logit in the Maher and Sen study).
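
To illustrate the two coding schemes, here is a small hedged sketch (Python; the rating indices are hypothetical and not drawn from either study):

```python
import numpy as np

ratings = np.array([0, 2, 4, 1])  # hypothetical ordinal bond-rating indices

# One output node per category, as in Kwon et al. (1997): one-hot targets.
one_hot_targets = np.eye(5)[ratings]

# A single output node spanning the ordinal scale, in the spirit of Maher
# and Sen (1997); scaled into 90% of the sigmoid's output range (see the
# target-scaling discussion under 'Data Transformations' below).
single_node_targets = 0.05 + 0.9 * ratings / 4.0

print(one_hot_targets)
print(single_node_targets)
```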

Internal Architecture

Specifying the internal architecture requires determining the number of hidden layers, and the number of processing nodes in each hidden layer. These two parameters affect the overall performance of the ANN, and usually vary from one implementation to the next.

The ability of a multi-layer network to represent non-linear models has been established through an application of an existence theorem proved by Kolmogorov (1963) and restated for back-propagation neural networks by Hecht-Nielsen (1990). Additional research applied this general theorem to the back-propagation multi-layer feed-forward model:

• At most two hidden layers are needed to approximate a particular set of functions to a given accuracy (Cybenko, 1988; Lapedes and Farber, 1987). The arbitrary accuracy is obtained by including enough processing nodes in each of the hidden layers. Research suggests that an architecture with n input data streams will require at most (2n + 1) processing nodes per hidden layer to achieve the desired accuracy (Cybenko, 1989; Hecht-Nielsen, 1990; Hertz et al., 1991; Hornik et al., 1989).

• It is possible to approximate a continuous function to the desired accuracy with a single hidden layer (Cybenko, 1989; Hecht-Nielsen, 1990; Hertz et al., 1991; Hornik et al., 1989). Extra hidden layers are detrimental due to the increased processing requirements to train and evaluate the model. Therefore, an architecture with a single hidden layer would be preferred as long as the desired accuracy can be obtained.

• A rule of thumb for obtaining good generalization from a network is to use the smallest system that will fit the data (Reed, 1993). One approach is to train successively smaller networks until the smallest one is found that will adequately learn the data. However, these smallest feasible networks tend to be sensitive to initial conditions and learning parameters and may be more likely to become trapped in local minima. So Reed's recommended approach is to train a network that is larger than necessary and remove parts that are not needed. Determining which parts are not needed is usually a trial-and-error process. Parts are selected for removal using a heuristic (e.g. remove the node with the smallest sum of weights). The smaller network is then tested to see if it continues to produce acceptable results. The process is similar to the 'backward elimination procedure' used with regression models. The large initial size allows the network to learn reasonably quickly with less sensitivity to initial conditions, while the reduced complexity of the trimmed system favors improved generalization.

These recommendations for smaller networks can be substantiated with a comparison of results between Curram and Mingers (1994) and Hart (1992). Both studies used similar data sets. Hart (1992) achieved a 29% error rate with a 20-hidden-node network, while Curram and Mingers (1994) attained a 26% error rate with a 10-hidden-node network. This suggests that Hart's results may be subject to overfitting of the data caused by using more hidden nodes than necessary.

When applied for classification tasks, the ANN architectures should have fewer nodes in the hidden layer than in the input layer. The rationale for this heuristic is conceptually simple. If a Hebbian-learning ANN model was developed with the same number of nodes in the input, hidden and output layers, then at convergence the weights would resemble an identity matrix. If only one node was used in the hidden layer, the weights would become collinear with the principal eigenvector of the input data. Having fewer nodes in the hidden layer than in the input layer creates a bottleneck that forces the model to perform data compression, and the resulting weights capture the principal components that form the basis for many parametric methods (Diamantaras and Kung, 1996).

Subramanian et al. (1993) recommend that the number of nodes in a single hidden layer should range from k to n + 1, where k is the number of output nodes and n is the number of input variables. Another rule of thumb is that the number of nodes in the middle layer should be 75% of the number of nodes in the input layer (Salchenberger et al., 1992). Generally speaking, too many nodes in the middle layer (too many connections) produce a network that memorizes the input data and lacks the ability to generalize. Patuwo et al. (1993) also found that the number of nodes affected the classification accuracy. As the number of nodes decreases, the average classification rate decreases in the training sample but increases in the test samples. Thus, increasing the number of nodes in the middle layer will degrade the network's ability to classify outside the training set (overfitting). Hence, excess nodes in the hidden layer will reduce the ability of the network to generalize.

Another consideration is that as the number of hidden nodes in a network is increased, the number of variables and terms is also increased. If the network has more degrees of freedom (the number of connection weights) than the number of training samples, there are not enough examples to constrain the network. This is similar to fitting a high-order polynomial through a small number of points.

De Villiers and Barnard (1992) conducted experiments comparing networks with one and two hidden layers and the same total number of connections (weights). Their results suggest that there is no statistically significant difference between the optimal performance of networks with one or two hidden layers, that networks with a single hidden layer do a better classification on average, and that networks with two hidden layers train more easily if the number of nodes in each of the two hidden layers is equal.

Masters (1994) has proposed a geometric progression rule to determine the number of nodes in the hidden layers. If a single hidden layer is used, then the number of nodes is $\sqrt{n m}$, where n is the number of input nodes and m is the number of output nodes. If two hidden layers are used, then first compute the parameter $r = (n/m)^{1/3}$. The number of nodes in the first layer is $m r^2$ and the number in the second layer is $m r$.
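
A small sketch of this rule (Python; the function name is ours):

```python
import math

def masters_hidden_sizes(n_inputs, n_outputs, hidden_layers=1):
    # Masters (1994) geometric-progression rule, as summarized above.
    if hidden_layers == 1:
        return (round(math.sqrt(n_inputs * n_outputs)),)
    r = (n_inputs / n_outputs) ** (1.0 / 3.0)
    return (round(n_outputs * r ** 2), round(n_outputs * r))

print(masters_hidden_sizes(9, 1))      # one hidden layer: (3,)
print(masters_hidden_sizes(41, 5, 2))  # two hidden layers: (20, 10)
```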

Torsun (1996) used Masters' (1994) geometric progression rule to determine the number of nodes in the initial model, and then progressively deleted nodes until the best performance was attained. Since further pruning of a network can lead to better generalization, the study continued by deleting weights that had the least effect on the solution. The final ANN achieved a classification accuracy of approximately 92%.

To overcome problems associated with the 'trial and error' approach to determining the appropriate internal network architecture, researchers have developed algorithms that adaptively adjust the architecture during training. Most of the algorithms proposed in the literature and applied to accounting and finance problems have started small and added connections:

• Malakooti and Zhou (1994) have developed an adaptive strategy algorithm which starts with a small number of hidden nodes and connections, then adaptively increases the number of connections first, and the number of nodes second, until the 'proper' topology is obtained.

• Fahlman and Lebiere (1990) have developed the Cascade-Correlation (Cascor) training paradigm. Cascor begins with no hidden nodes and then incrementally creates and installs hidden nodes to improve the network's ability to categorize. Coats and Fant (1993) found several advantages of the Cascor network: it determines its own size and design, it retains the structure it has built even if the training set changes, and it does not require back-propagation of error signals through the connections of the network. This eliminates the need for human trial-and-error experimentation to determine the number of hidden nodes. They found the Cascor network provided more effective financial distress pattern classifications than multiple discriminant analysis.

• Fanning and Cogger (1994) have developed the Generalized Adaptive Neural Network Architecture (GANNA), where the architecture of the network is learned by trial and error. This does not require a priori specification of the number of layers, the numbers of nodes, or other design choices. The GANNA uses an evolutionary mechanism to grow networks, with new nodes added provided they offer improvement.

Other examples, termed 'pruning' algorithms, start with a large network and reduce the number of connections during training:

• Armstrong et al. (1991) provide a trimming approach called the Adaptive Logic Network (ALN). The ALN is a Boolean logic-based tree framework that evolves by deleting those branches that are not providing any additional information. Fanning et al. (1995) found that the ALN approach produced results similar to those of the GANNA processor, and superior to the Logit model, when applied to detect management fraud.

• Jhee and Lee (1993) applied a weight-decay technique that adds an extra term to the cost function that penalizes network complexity. This additional term pushes the smaller-valued weights to zero while having minimal impact on the larger-valued weights. In essence, their algorithm deletes connections within the model by freezing the value of the weights at zero (a sketch of this kind of penalty follows this list).
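
As a hedged illustration of this idea (the exact penalty Jhee and Lee used is not reproduced here), a weight-elimination style term has the property the bullet describes: small weights are pushed toward zero while large weights are nearly unaffected:

```python
import numpy as np

def cost_with_complexity_penalty(targets, outputs, weights, lam=1e-3, w0=1.0):
    # SSE cost plus a complexity penalty. For |w| << w0 the penalty acts
    # like lam * (w / w0)**2, shrinking small weights toward zero; for
    # |w| >> w0 it saturates near lam, leaving large weights alone.
    sse = 0.5 * np.sum((targets - outputs) ** 2)
    scaled = (weights / w0) ** 2
    return sse + lam * np.sum(scaled / (1.0 + scaled))
```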

In conclusion, specification of the internal architecture involves tradeoffs between fitting accuracy and generalization ability. While numerous expansion/pruning methods have been proposed, these methods are usually quite complex, difficult to implement, and cannot guarantee selection of the optimum architecture. Most of the heuristics that are offered are based on simulations derived from limited experiments. None of the heuristics works well for all problems.

The design of an ANN is basically problem-dependent and requires an experimental trial-and-error approach. We do know that larger architectures are required for complex response surfaces (refer back to Figure 3 earlier in the paper to see the result of adding additional hidden layers). Multiple-group bond rating classification is a complex problem, and papers addressing this problem have used large architectures. Kwon et al. (1997) used an ANN with 41 nodes in the input layer, two hidden layers with 25 and 15 nodes respectively, and 5 nodes in the output layer. The Maher and Sen (1997) model used a single hidden layer with 2n + 1 nodes. At the other extreme, simple classification problems can be modeled with relatively simple architectures. Charitou and Charalambous (1996) used 9 nodes in the input layer, a single hidden layer with 4 nodes, and one node in the output layer to predict the sign of earnings change.

Preparation of the Sample Data

ANN training requires three sets of data: a training set, which must be representative of the entire domain; an evaluation set, which is used to determine when the best generalization ability has been obtained; and a test set, which is used to evaluate the prediction/classification accuracy of the model. This tends to cause problems in some accounting and financial applications, since the number of instances is sometimes limited.

Sample Size

Baum and Haussler (1989) have estimated that the minimum training set size required for typical classification tasks is N/I, where N is the number of connections (non-zero weights) in the network and I is the fraction of the test examples that include incorrect classifications. For the typical financial distress problem using the five Altman (1968) ratios, and a network with a single hidden layer of five nodes, there will be 36 connections (presuming that a bias node is included at both the input and hidden layers). If a balanced set of training data is used, the fraction of incorrect classifications will be 0.5, suggesting that the training set should include 72 observations. Additional, independent observations will also be required in the evaluation and testing sets.
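
The arithmetic behind this example, as a minimal sketch:

```python
def baum_haussler_min_sample(n_connections, error_fraction):
    # Baum and Haussler (1989) rule of thumb: training set size >= N / I.
    return n_connections / error_fraction

# Five Altman ratios plus a bias node feeding five hidden nodes, and five
# hidden nodes plus a bias feeding one output: (5 + 1) * 5 + (5 + 1) * 1 = 36.
n_connections = (5 + 1) * 5 + (5 + 1) * 1
print(baum_haussler_min_sample(n_connections, 0.5))  # -> 72.0 observations
```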

Patuwo et al. (1993) found that the sample size in the training set affects the classification accuracy. As sample size increases, the average classification rate decreases in the training sample but increases in the test samples. The smaller the sample size, the easier it is for the networks to adjust themselves to the idiosyncrasies of the individual observations in the sample.

Data Transformations

It is generally recommended that the input data for the ANN be manipulated to match the input constraints on the transfer function. That is, data is scaled such that the midpoint of the data range is at the center of the transfer function, and the range of the input data is adjusted to match the input range of the transfer function. For the commonly used sigmoid transfer function, using an input data range of (0–1) will scale the data to the midpoint and limit the calculations in the initial iterations to the near-linear portion of the function. With wider input data ranges, those training samples at or near the extremes will be clipped at the upper or lower bound of the transfer function. Eventually, the derived weights should compensate for any differences between the input data range and the input range of the transfer function. However, correct specification of the input data range facilitates training of the ANN.

Shanker et al. (1996) evaluated two data transformation methods: linear transformation to the zero-to-one range, and statistical standardization. They found that data standardization improves the classification rate but requires more computation time. The improvement was found to diminish as the network became large, and as the sample size increased.
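
A minimal sketch of the two transformations evaluated by Shanker et al. (1996), applied column-wise to a matrix of input variables:

```python
import numpy as np

def scale_zero_one(x):
    # Linear transformation into the (0, 1) input range of the sigmoid.
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def standardize(x):
    # Statistical standardization: zero mean and unit variance per column.
    return (x - x.mean(axis=0)) / x.std(axis=0)
```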


The range of the target (or known output) values must be adjusted to lie within the output range of the transfer functions in the output layer. Otherwise, the error function may have local minima in addition to the global minimum. Target values at or near the boundary will cause saturation of the network. Because non-linear transfer functions have asymptotic limits, a very large net input value is required to cause the transfer function to produce an output value that lies on the boundary. The algorithm will attempt to adjust the weights to large values to achieve the large net input value. A 'better' fit is usually obtained if the target values are scaled to 90% of the output range of the transfer function. So for the sigmoid transfer function, target values should be scaled into the range of 0.05 to 0.95. However, if the ANN is being used to predict group membership, scaling the target values to the full range of the transfer function, in this case 0 to 1, allows the output from the network to be directly interpreted as posterior probabilities.
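
A hedged sketch of the 90% scaling for the sigmoid case:

```python
import numpy as np

def scale_targets(t, out_lo=0.0, out_hi=1.0, margin=0.05):
    # Scale known output values into 90% of the transfer function's output
    # range (0.05 to 0.95 for the sigmoid) so the weights need not grow
    # extreme to reach saturated outputs.
    t01 = (t - t.min()) / (t.max() - t.min())
    return out_lo + margin + t01 * (out_hi - out_lo - 2.0 * margin)

print(scale_targets(np.array([10.0, 20.0, 30.0])))  # -> [0.05 0.5  0.95]
```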

Coding of Categorical Variables

For non-ordered categories, multiple variables must be used (regardless of whether it is an input or output variable). In parametric modeling, all input variables must be linearly independent. Thus, c categories must be coded with c − 1 variables. If the data has been coded with 'effects coding' for the associated parametric model, then the same data can be used with the ANN. ANNs are less sensitive to linear combinations of the input data, so c categories could be coded with c variables. Coding ordinal categories with a single variable makes it easier for the network to use the order of the category to generalize.
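
For illustration, a small sketch of effects coding c categories with c − 1 columns (the helper name is ours):

```python
import numpy as np

def effects_code(categories, c):
    # Effects coding: category k (k < c - 1) gets a 1 in column k; the last
    # category is coded -1 across all columns, keeping the c - 1 columns
    # linearly independent for the parametric model. The same matrix can
    # then be fed to the ANN.
    codes = np.zeros((len(categories), c - 1))
    for row, k in enumerate(categories):
        if k == c - 1:
            codes[row, :] = -1.0
        else:
            codes[row, k] = 1.0
    return codes

print(effects_code([0, 1, 2], 3))
```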

Initializing Connection Weights

The initial connection weights must also be specified prior to training. Generally, these weights are randomly assigned within a pre-specified range. Experiments by Wessels and Barnard (1992) suggest that the initial training weights should occupy the range of $\pm 3A/\sqrt{N}$, where A is the standard deviation of the inputs to the node and N denotes the number of weights leading to a particular node. Results indicate that this method performs close to optimally when the value of A is approximately one.
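
A minimal sketch of this initialization rule:

```python
import numpy as np

def init_node_weights(fan_in, a=1.0, seed=0):
    # Wessels and Barnard (1992): draw the weights feeding a node uniformly
    # from +/- 3A / sqrt(N), where N is the number of incoming weights and
    # A the standard deviation of the node's inputs (near-optimal for A ~ 1).
    bound = 3.0 * a / np.sqrt(fan_in)
    return np.random.default_rng(seed).uniform(-bound, bound, size=fan_in)

print(init_node_weights(6))  # six incoming weights, bound = 3 / sqrt(6)
```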

Training of the Network

Training of the ANN involves propagating the error (generated from the forward processing of the input training records) to adjust the set of weights to minimize the error function. The momentum factor in the algorithm acts as a moving average of the weight adjustments. Thus, if the error from each input record is propagated before the next record is processed, the network will capture the temporal information between the individual record sets (e.g. single-period autocorrelation). If multiple input records are forward processed before propagation, the model captures the temporal information between those record sets (e.g. seasonal fluctuations). If the ANN is being used for classification, then all of the input records should be processed before the error is propagated. If this is unreasonable due to computational constraints (e.g. memory requirements), then the sequence of input patterns should be randomized with each epoch to avoid capturing inadvertent autocorrelations in the data.

The back-propagation algorithm guarantees that the total error in the training set will continue to decrease as the number of epochs increases. With each epoch, the weights are modified to decrease the error on the training patterns. As training progresses, the amount of change in the error function becomes smaller. Convergence occurs when the change in the error function is less than a specified threshold.

However, training with repeated applications of the same data set may result in the phenomenon of overtraining. Similar to the concept of overfitting discussed in the 'Internal Architecture' section above, overtraining occurs when the ANN attempts to exactly fit the limited set of points and loses its ability to interpolate between those points (Hecht-Nielsen, 1990). As training progresses, there is always an intermediate stage at which the algorithm reaches a good balance between accurately fitting the training set examples and still providing a reasonably good interpolation capability.


According to Hecht-Nielsen (1990), the back-propagation model starts with a very flat surface and does a good job of interpolating between points on the surface.

However, the training data may be incomplete, or may contain spurious and misleading regularities due to sampling errors and/or omissions. At some point in the later stages of learning, the network starts to take advantage of these idiosyncrasies in the training data. As training progresses, the surface becomes convoluted as it attempts to 'better' fit the set of training points, and the ability to interpolate between the points diminishes.

This problem is demonstrated in Figure 7. An ANN was developed for time-series prediction. The first 48 periods were used to train the network, and the last 12 periods were used for prediction. After 10 iterations, the ANN results resembled the average of the data points (heavy dashed line). The 'best' predictive ability occurred after 300 iterations. Notice that the best-fit line does not exactly fit the data in either the training or prediction periods. After 3000 iterations, the ANN almost exactly fits the data points in the training periods. However, the fit in the prediction period is very poor, and is clearly worse than the 300-iteration results.

Figure 7 Effect of overtraining on time series data

The problem created by overtraining is determining when sufficient iterations have been accomplished to achieve the desired prediction accuracy. The 'best' predictive performance should be obtained with the set of weights that produces the minimum value for the error function in the testing set of data. Iterations beyond that point will not improve predictive performance. However, since the testing set of data is a holdout sample used to evaluate the performance of the network, it cannot be used to determine when to stop training. Thus, performance is evaluated on an independent set of known observations, termed an evaluation set, after each epoch. The error on this evaluation set will tend to decrease in step with the training error as the network generalizes from the training data to the underlying function. As the network begins to 'overtrain' to the training set of data, the error for the evaluation set starts to increase again, even though the training error continues to decrease. The algorithm should be terminated when performance on the evaluation set begins to deteriorate.
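
A minimal early-stopping sketch of this rule (the two callbacks are hypothetical placeholders for an actual training step and evaluation-set error):

```python
def train_with_early_stopping(train_one_epoch, evaluation_error,
                              max_epochs=10000, patience=5):
    # Stop once the evaluation-set error has failed to improve for
    # `patience` consecutive epochs, and report the best epoch seen; in
    # practice the weights from that best epoch would also be retained.
    best_error, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        err = evaluation_error()
        if err < best_error:
            best_error, best_epoch, bad_epochs = err, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # evaluation error is deteriorating
    return best_epoch, best_error
```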

This technique may not be practical when only a small amount of data is available, since validation data cannot be used for training (Reed, 1993). An alternative approach, applied by Chung and Tam (1992) and Coakley (1995), is the use of a jackknife method to derive the evaluation set of data. In Chung and Tam (1992), examples were excluded from the training sample one at a time, with the remaining examples serving as the training set. This process is repeated n times (where n is the number of observations in the training set) and the average misclassification rate is calculated. Lachenbruch (1967) has shown that this method provides an unbiased estimate of the probability of misclassification. An extension of this approach is k-fold cross-validation, in which the data set is divided into k equal parts. The ANN is trained k times, each time holding out one of the k parts for evaluation and training on the remaining k − 1 parts, and the results of the k models are then averaged. Coakley (1995) used a different approach of drawing a random sample from the training set at the beginning of each epoch.
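
A sketch of the leave-one-out (jackknife) estimate described above, written against generic train/classify callables (hypothetical placeholders):

```python
def jackknife_misclassification(examples, train, classify):
    # Chung and Tam (1992) style leave-one-out: hold each example out in
    # turn, train on the remainder, and average the misclassifications
    # (an unbiased estimate per Lachenbruch, 1967).
    errors = 0
    for i, (x, label) in enumerate(examples):
        model = train(examples[:i] + examples[i + 1:])
        errors += classify(model, x) != label
    return errors / len(examples)
```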

CONCLUSIONS

The non-linear characteristics of ANNs make them a promising alternative to traditional linear and parametric methods. The challenge faced by ANN researchers is that there are no formal theories for determining, a priori, the optimal network model. Development of an ANN model is a complex process that is currently more art than science. Thus, for an ANN applied to a specific problem, experiments must be conducted to determine the performance differences between alternative models.

Based on an extensive survey of ANNs applied to accounting and finance problems, the following guidelines are suggested:

(1) Determine the type of research question and the comparable parametric model that would be used. If the properties of the data do not match the distributional assumptions required by the parametric model, or there is some evidence of possible non-linearity, then development of an ANN model is warranted.

(2) The most widely implemented learning algorithm is back-propagation. For classification problems, there are no clear performance differences between back-propagation and other learning algorithms. Thus, the selection will most likely be driven by the availability of alternative learning algorithms within the software package being used. For forecasting problems, the recurrent algorithms appear to improve predictive ability.

(3) There are no clear guidelines on the selection of the error and transfer functions. The most widely used combination is the SSE error function and sigmoid transfer function. It has been suggested that the relative entropy error function is more appropriate for classification problems that produce dichotomous output. With this error function, the hyperbolic tangent transfer function seems most appropriate. Another consideration is the robustness of the transfer function. During error propagation, rounding errors could produce negative values for the hidden neurons. The sigmoid function cannot handle these values, while the hyperbolic tangent function can. Finally, some researchers recommend the sigmoid function for classification problems that involve learning about average behavior, and the hyperbolic tangent function if the problem involves learning about deviations from the average (such as forecasting).

(4) In terms of architecture, the selection of the number of nodes in the input and output layers is driven by the problem and the nature of the input data. For a classification task, using a single output node forces classification along an ordinal scale, while using multiple output nodes allows duplicate and mis-classifications.

(5) Determining the number of layers, and the number of nodes per layer, is still part of the 'art' of neural networks.

• Assess the complexity of the problem and the resulting response surface. Complex response surfaces require larger networks.

• Assess the noise within the data. Smaller networks should be used with noisy data to increase generalization ability and avoid overfitting to the noise.

• If being applied for classification tasks, the number of nodes in the hidden layer should be less than the number of nodes in the input layer.

• Although numerous heuristics have been suggested for determining the number of nodes, they do not apply across all the reported studies.

• Some algorithms have been developed that adaptively adjust the architecture during training to overcome the 'trial and error' approach typically associated with back-propagation ANNs. The expansion and pruning algorithms both achieve improved accuracies.

(6) Data-preparation requirements for the ANN model are relatively well defined. The input data should be scaled to match the input side of the selected transfer function, while the specified target values should be scaled to match the output side. Most software packages will perform scaling of the input data, and will automatically generate the initial training weights.

(7) If the ANN is used to model time series data, then data sets should be presented in order, and weights should be adjusted after each data set. If the ANN is used for classification tasks, then the entire data set should be presented before weights are adjusted.

(8) Obtaining a sufficient sample size may be problematic, since three separate data sets are required (training, evaluation, and testing). Jackknife and cross-validation methods have been shown to be effective techniques that allow use of smaller data sets.

The availability of sample data appears to be a driving factor in the ANN research performed to date in the accounting and finance areas. Many of the cited articles applied ANNs as an extension of a previous statistical model-based study, or used publicly available stock market data. Of the 123 studies cited by Velido et al. (1999), only 9 were considered to be examples of ANNs applied to a 'real-world' case. The commercial applications of ANNs in the private sector (see Brown et al., 1995) seem to focus on a wider range of problems. We believe the limited applications in published academic research are due to lack of access to appropriate data sets.

It appears that ANNs show promise in problem areas where traditional statistical techniques are inadequate. ANNs provide a non-linear dimension that captures the underlying representations within the data sets. Thus, future applications should focus on problem areas where non-linear relationships are believed to exist within the data.

However, much more research is needed in the following areas:

• How do we systematically build an appropriate ANN model for specific types of accounting and finance problems? Although we have attempted to derive some guidelines from the literature, we are still generalizing across broad problem categories. Replication of the studies using different algorithms, architectures and data preparation techniques is needed to refine these guidelines within specific application areas.

• Are any of the alternative training methods and algorithms 'best' for a specific type of accounting and finance problem? Many of the studies are first attempts to demonstrate the applicability of a technique within a problem area. It is difficult to generalize from these limited experiments. Again, replication of the experiments is needed.

References

Altman E. 1968. Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. The Journal of Finance 23: No. 4, 589–609.

Altman E. 1983. Corporate Financial Distress—A Complete Guide to Predicting, Avoiding and Dealing with Bankruptcy. John Wiley: New York.

Anandarajan M, Anandarajan A. 1998. A comparison of machine learning techniques with a qualitative response model for auditor's going concern reporting. Expert Systems with Applications 16: 385–392.


Anderson JA, Rosenfeld E. 1988. Neurocomputing: Foundations of Research. MIT Press: Cambridge, MA.

Armstrong W, Dwelly A, Liang J, Lin D, Reynolds S. 1991. Learning and generalization in adaptive logic networks. Artificial Neural Networks, Proceedings of the 1991 International Conference on Artificial Neural Networks. 1173–1176.

Austin M, Looney C, Zhuo J. 1997. Security market timing using neural networks. The New Review of Applied Expert Systems 3: 3–14.

Babad YM, Balachandran BV. 1993. Cost driver optimization in activity-based costing. The Accounting Review 68: No. 3, 563–575.

Back B, Laitinen T, Sere K. 1996. Neural networks and genetic algorithms for bankruptcy predictions. Expert Systems with Applications 11: No. 4, 407–413.

Barniv R, Agarwal A, Leach R. 1997. Predicting the outcome following bankruptcy filing: a three-state classification using neural networks. International Journal of Intelligent Systems in Accounting, Finance and Management 6: 177–194.

Baum EB, Haussler D. 1989. What size net gives valid generalization? Neural Computation 1: 151–160.

Bell TB. 1997. Neural Nets or the Logit Model? A comparison of each model's ability to predict commercial bank failures. Intelligent Systems in Accounting, Finance and Management 6: 249–264.

Bell TB, Ribar GS, Verchio J. 1990. Neural nets vs. logistic regression: a comparison of each model's ability to predict commercial bank failures. Proceedings of the 1990 Deloitte Touche/University of Kansas Symposium on Auditing Problems. 29–53.

Bell TB, Szykowny S, Willingham J. 1993. Assessing the likelihood of fraudulent financial reporting: a cascaded logic approach. Working Paper, KPMG Peat Marwick: Montvale, NJ.

Bergerson K, Wunsch DC. 1991. A commodity trading model based on a neural network-expert system hybrid. Proceedings of the IEEE International Conference on Neural Networks, Seattle, WA. I289–I293.

Boritz JE, Kennedy DB. 1995. Effectiveness of neural network types for prediction of business failure. Expert Systems with Applications 9: No. 4, 503–512.

Boritz JE, Kennedy DB, Albuquerque AM. 1995. Predicting corporate failure using a neural network approach. Intelligent Systems in Accounting, Finance and Management 4: No. 2, 95–111.

Boucher J. 1990. Artificial neural systems and the prediction of corporate failure. Master's Project, School of Accountancy, University of Waterloo.

Brockett PL, Cooper WW, Golden LL, Pitaktong U. 1994. A neural network method for obtaining an early warning of insurer insolvency. The Journal of Risk and Insurance 61: No. 3, 402–424.

Brown CE, Coakley JR, Phillips ME. 1995. Neural networks enter the world of management accounting. Management Accounting LXXVI: No. 11, May, 51–57.

Brown M, An PC, Harris CJ, Wang H. 1993. How biased is your multi-layer perceptron? Proceedings of the 1993 World Congress on Neural Networks. 507–511.

Charitou A, Charalambous C. 1996. The prediction of earnings using financial statement information: empirical evidence with Logit models and artificial neural networks. International Journal of Intelligent Systems in Accounting, Finance and Management 5: 199–215.

Chen SK, Mangiameli P, West D. 1995. The comparative ability of self-organizing neural networks to define cluster structure. Omega 23: No. 3, 271–279.

Choi HR, Kim W, An SY. 1997. Recurrent and decomposed neural network-based hotel occupancy forecasting. The New Review of Applied Expert Systems 3: 121–136.

Chung HM, Tam KY. 1992. A comparative analysis of inductive-learning algorithms. Intelligent Systems in Accounting, Finance and Management 2: No. 1, 3–18.

Coakley JR. 1995. Using pattern analysis methods to supplement attention-directing analytical procedures. Expert Systems with Applications 9: No. 4, 513–528.

Coakley JR, Brown CE. 1993. Neural networks applied to ratio analysis in the analytical review process. International Journal of Intelligent Systems in Accounting, Finance and Management 2: No. 1, 19–39.

Coakley JR, McFarlane DD, Perley WG. 1992. Alternative criteria for evaluating artificial neural network performance. TIMS/ORSA Joint National Meeting (Orlando, FL).

Coats PK, Fant LF. 1993. Recognizing financial distress patterns using a neural network tool. Financial Management 22: No. 3, 142–155.

Collins E, Ghosh S, Scofield C. 1988. An application of a multiple neural-network learning system to emulation of mortgage underwriting judgments. Proceedings of the IEEE International Conference on Neural Networks, Vol. 2. 459–466.

Curram SP, Mingers J. 1994. Neural networks, decision tree induction and discriminant analysis: an empirical comparison. Journal of the Operational Research Society 45: No. 4, 440–450.

Cybenko G. 1988. Continuous valued neural networks with two hidden layers are sufficient. Technical Report, Department of Computer Science, Tufts University, Medford, MA.

Cybenko G. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems 2: 303–314.

Daniels H, Kamp B, Verkooijen W. 1997. Modeling non-linearity in economic classification with neural networks. International Journal of Intelligent Systems in Accounting, Finance and Management 6: 287–301.

de Villiers J, Barnard E. 1992. Backpropagation neural nets with one and two hidden layers. IEEE Transactions on Neural Networks 4: No. 1, 136–141.

Diamantaras KI, Kung SY. 1996. Principal Component Neural Networks: Theory and Applications. Wiley: New York.

Duliba KA. 1991. Contrasting neural nets with regression in predicting performance in the transportation industry. Proceedings of the 24th Annual Hawaii International Conference on Systems Sciences, Vol. IV. 163–170.

Dutta S, Shekhar S. 1988. Bond rating: a non-conservative application of neural networks. Proceedings of the IEEE International Conference on Neural Networks, Vol. 2. 433–450.

Elman JL. 1988. Finding Structure in Time. Technical Report, CRL #88-1, Center for Research in Language, University of California, San Diego.

Etheridge HL, Sriram RS. 1997. A comparison of the relative costs of financial distress models: artificial neural networks, Logit and multivariate discriminant analysis. International Journal of Intelligent Systems in Accounting, Finance and Management 6: 235–248.

Fahlman SE, Lebiere C. 1990. The Cascade-Correlation Learning Architecture. Technical Report CMU-CS-90-100, Carnegie-Mellon University.

Fanning KM, Cogger KO. 1994. A comparative analysis of artificial neural networks using financial distress prediction. International Journal of Intelligent Systems in Accounting, Finance and Management 3: No. 4, 241–252.

Fanning KM, Cogger KO. 1998. Neural network detection of management fraud using published financial data. International Journal of Intelligent Systems in Accounting, Finance and Management 7: 21–42.

Fanning K, Cogger KO, Srivastava R. 1995. Detection of management fraud: A neural network approach. Intelligent Systems in Accounting, Finance and Management 4: No. 2, 113–126.

Goldberg DE. 1994. Genetic and evolutionary algorithms come of age. Communications of the ACM 37: No. 3, 113–119.

Green BP, Choi JH. 1997. Assessing the risk of management fraud through neural network technology. Auditing: A Journal of Practice and Theory 16: No. 1, 14–28.

Grudnitski G, Osburn L. 1993. Forecasting S&P and gold futures prices: An application of neural networks. The Journal of Futures Markets 13: No. 6, 631–643.

Grudnitski G, Do AQ, Shilling JD. 1995. A neural network analysis of mortgage choice. Intelligent Systems in Accounting, Finance and Management 4: No. 2, 127–135.

Hansen JV. 1998. Comparative performance of back-propagation networks by genetic algorithms and heuristics. International Journal of Intelligent Systems in Accounting, Finance and Management 7: 69–79.

Hansen JV, Messier WF. 1991. Artificial neural networks: foundations and application to a decision problem. Expert Systems with Applications 3: No. 1, 135–141.

Hart JA. 1992. Using neural networks for classification tasks—some experiments on data sets and practical advice. Journal of the Operational Research Society 43: 215–226.

Hecht-Nielsen R. 1990. Neurocomputing. Addison-Wesley: Reading, MA.

Hertz J, Krogh A, Palmer RG. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley: Reading, MA.

Hoptroff R, Hall T, Bramson MJ. 1991. Forecasting economic turning points with neural nets. Proceedings of the International Joint Conference on Neural Networks. IEEE Service Center: Piscataway, NJ, II.347–II.352.

Hornik K, Stinchcombe M, White H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2: 359–366.

Hoyt RE, Lay CM. 1995. Neural network cost estimates for heart bypass surgery. Expert Systems with Applications 9: No. 4, 529–542.

Huang CS, Dorsey RE, Boose MA. 1994. Life insurer financial distress prediction: a neural network model. Journal of Insurance Regulation 131–167.

Jensen HL. 1992. Using neural networks for credit scoring. Managerial Finance 18: No. 6, 15–26.

Jhee WC, Lee JK. 1993. Performance of neural networks in managerial forecasting. Intelligent Systems in Accounting, Finance and Management 2: No. 1, 55–71.

Jo H, Han I, Lee H. 1997. Bankruptcy prediction using case-based reasoning, neural networks, and discriminant analysis. Expert Systems with Applications 13: No. 2, 97–108.

Kamijo K, Tanigawa T. 1990. Stock price pattern recognition: a recurrent neural network approach. Proceedings of the IEEE International Joint Conference on Neural Networks, Vol. I. San Diego, CA, 215–221.

Kimoto R, Asakawa K, Yoda M, Takeoka M. 1990. Stock market prediction system with modular neural networks. Proceedings of the International Joint Conference on Neural Networks, Vol. 1. 1–6.

Kohara K, Ishikawa T, Fukuhara Y, Nakamura Y. 1997. Stock price prediction using prior knowledge and neural networks. International Journal of Intelligent Systems in Accounting, Finance and Management 6: 11–22.

Kolmogorov AN. 1963. On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. American Mathematical Society Translation 28: 55–59.

Kryzanowski L, Galler M, Wright DW. 1993. Using artificial neural networks to pick stocks. Financial Analysts Journal 49: No. 4, 21–27.

Kwon YS, Han I, Lee KC. 1997. Ordinal pairwise partitioning (OPP) approach to neural networks training in bond rating. International Journal of Intelligent Systems in Accounting, Finance and Management 6: 23–40.

Lachenbruch P. 1967. An almost unbiased method of obtaining confidence intervals for the probability of misclassification in discriminant analysis. Biometrics December, 639–645.

Lapedes A, Farber R. 1987. Nonlinear Signal Processing Using Neural Networks: Prediction and System Modeling. Los Alamos National Laboratory Report LA-UR-87-2662.

Lawrence J. 1991. Introduction to Neural Networks. California Scientific Software: Grass Valley, CA.

Lawrence J, Andriola P. 1992. Three-step method evaluates neural networks for your application. EDN 93–100.

Lenard MJ, Alam P, Madey GR. 1995. The application of neural networks and a qualitative response model to the auditor's going concern uncertainty decision. Decision Sciences 26: No. 2, 209–227.

Levitan A, Gupta M. 1996. Using genetic algorithms to optimize the selection of cost drivers in activity-based costing. International Journal of Intelligent Systems in Accounting, Finance and Management 5: No. 3, 129–146.

Liang TP, Chandler JS, Han I, Roan J. 1992. An empirical investigation of some data effects on the classification accuracy of probit, ID3, and neural networks. Contemporary Accounting Research 9: No. 1, 306–328.

Lippman RP. 1987. An introduction to computing with neural nets. IEEE ASSP Magazine April, 4–22.

Loh L. 1992. Financial characteristics of leveraged buyouts. Journal of Business Research 24: 241–252.

Maher JJ, Sen TK. 1997. Predicting bond ratings using neural networks: a comparison with logistic regression. International Journal of Intelligent Systems in Accounting, Finance and Management 6: 59–72.

Malakooti B, Zhou YQ. 1994. Feedforward artificial neural networks for solving discrete multiple criteria decision making problems. Management Science 40: No. 11, 1542–1561.

Markham IS, Ragsdale CT. 1995. Combining neural networks and statistical predictions to solve the classification problem in discriminant analysis. Decision Sciences 26: No. 2, 229–242.

Markham IS, Rakes TR. 1998. The effect of sample size and variability of data on the comparative performance of artificial neural networks and regression. Computers and Operations Research 25: No. 4, 251–263.

Marques L, Hill T, Worthley R, Remus W. 1991. Neural network models as an alternative to regression. Proceedings of the 24th Annual Hawaii International Conference on Systems Sciences, Vol. IV. 129–146.

Masters T. 1994. Practical Neural Networks. Academic Press: New York.

Niles L, Silverman H, Tajchman G, Bush M. 1989. How limited training data can allow a neural network classifier to outperform an 'optimal' statistical classifier. IEEE Proceedings 1989 International Conference 1. Glasgow, Scotland, 17–20.

Odom M, Sharda R. 1992. A neural network model for bankruptcy prediction. In Neural Networks in Finance and Investing, Trippi RR, Turban E (eds). Probus Publishing: Chicago, IL.

Ohlson J. 1980. Financial ratios and the probabilistic prediction of bankruptcy. Journal of Accounting Research 18: No. 1, 109–131.

O'Leary DE. 1998. Using neural networks to predict corporate failure. International Journal of Intelligent Systems in Accounting, Finance and Management 7: 187–197.

Pal SK, Mitra S. 1992. Multilayer perceptron, fuzzy sets, and classification. IEEE Transactions on Neural Networks 3: No. 5, September, 683–697.

Palepu KG. 1986. Predicting takeover targets: A methodological and empirical analysis. Journal of Accounting and Economics 8: 3–35.

Patuwo E, Hu MY, Hung MS. 1993. Two-group classification using neural networks. Decision Sciences 24: No. 4, 825–845.

Proto VW, Fogel DB, Fogel LJ. 1995. Alternative neural network training methods. IEEE Expert 10: No. 3, 16–22.

Raghupathi W, Schkade LL, Raju BS. 1991. A neural network application for bankruptcy prediction. Proceedings of the 24th Annual Hawaii International Conference on Systems Sciences, Vol. IV. 147–155.

Reed R. 1993. Pruning algorithms—a survey. IEEE Transactions on Neural Networks 4: No. 5, 740–747.

Reilly DL, Collins E, Scofield C, Ghosh S. 1991. Risk assessment of mortgage applications with a neural network system: An update as the test portfolio ages. Proceedings of the IEEE International Conference on Neural Networks. Piscataway, NJ, 1991, II479–II482.

Rumelhart DE, Hinton GE, Williams RJ. 1986. Learning representations by back-propagating errors. Nature 323: 533–536.

Rumelhart DE, McClelland JL. 1986. Parallel Distributed Processing, Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press: Cambridge, MA.

Salchenberger LM, Cinar EM, Lash NA. 1992. Neural networks: a new tool for predicting thrift failures. Decision Sciences 23: No. 4, 899–916.

Schocken S, Ariav G. 1994. Neural networks for decision support: Problems and opportunities. Decision Support Systems 11: 393–414.

Serrano-Cinca C. 1996. Self organizing neural networks for financial diagnosis. Decision Support Systems 17: 227–238.

Shanker M, Hu MY, Hung MS. 1996. Effect of data standardization on neural network training. Omega, International Journal of Management Science 24: No. 4, 385–397.

Sharda R. 1994. Neural networks for the MS/OR analyst: an application bibliography. Interfaces 24: No. 2, March–April, 116–130.

Shepanski JF. 1988. Fast learning in artificial neural systems: Multilayer Perceptron training using optimal estimation. Proceedings of the IEEE Second International Conference on Neural Networks, Vol. 4. San Diego, CA, July.

Stornetta WS, Huberman BA. 1987. An improved three-layer backpropagation algorithm. Proceedings of the IEEE First International Conference on Neural Networks. San Diego, CA.

Subramanian V, Ming SH, Hu MY. 1993. An experimental evaluation of neural networks for classification. Computers and Operations Research 20: No. 7, 769–782.

Surkan A, Singleton J. 1991. Neural networks for bond rating improved by multiple hidden layers. Proceedings of the IEEE International Conference on Neural Networks, Vol. II. 163–168.

Tabachnick BG, Fidell LS. 1983. Using Multivariate Statistics. Harper & Row: New York, NY.

Tam KY, Kiang MY. 1992. Managerial applications of neural networks: the case of bank failure predictions. Management Science 38: No. 7, 926–947.

Taudes A, Natter M, Trcka M. 1998. Real option valuation with neural networks. International Journal of Intelligent Systems in Accounting, Finance and Management 7: 43–52.

Torsun IS. 1996. A neural network for a loan application scoring system. The New Review of Applied Expert Systems 2: 47–62.

Trippi RR, DeSieno D. 1992. Trading equity index futures with a neural network. The Journal of Portfolio Management 19: No. 1, 27–33.

Trippi RR, Turban E (eds). 1992. Neural Networks in Finance and Investing. Probus Publishing: Chicago, IL.

Udo G. 1993. Neural network performance on the bankruptcy classification problem. Computers and Industrial Engineering 25: 377–380.

Velido A, Lisboa PJG, Vanghan J. 1999. Neural networks in business: a survey of applications (1992–1998). Expert Systems with Applications 17: 51–70.

Waite T, Hardenbergh H. 1989. Neural nets. Programmer's Journal 7: No. 3, 10–22.

Wasserman PD. 1989. Neural Computing: Theory and Practice. Van Nostrand Reinhold: New York.

Wessels L, Barnard E. 1992. Avoiding false local minima by proper initialization of connections. IEEE Transactions on Neural Networks 3: No. 6, 899–905.

White H. 1988. Economic prediction using neural networks: the case of IBM daily stock returns. Proceedings of the IEEE International Conference on Neural Networks, Vol. II. 451–458.

Widrow B, Rumelhart DE, Lehr MA. 1994. Neural networks: applications in industry, business and science. Communications of the ACM 37: No. 3, 93–105.

Wilson RL, Sharda R. 1994. Bankruptcy prediction using neural networks. Decision Support Systems 11: 545–557.

Wong BK, Bodnovich TA, Selvi Y. 1997. Neural network applications in business. A review and analysis of the literature (1988–95). Decision Support Systems 19: 301–320.

Wong FS. 1991. A 3D neural network for business forecasting. Proceedings of the 24th Annual Hawaii International Conference on Systems Sciences, Vol. IV. 113–123.

Yoon Y, Swales G. 1991. Predicting stock price performance: a neural network approach. Proceedings of the 24th Annual Hawaii International Conference on Systems Sciences, Vol. IV. 156–162.

Zhang G, Patuwo BE, Hu MY. 1998. Forecasting with artificial neural networks: The state of the art. International Journal of Forecasting 14: 35–62.