7/25/2019 Bayesian Nonparametrics via Neural Networks_Herbert K. H. Lee (SIAM 2004 106s)
Bayesian Nonparametrics via Neural Networks
ASA-SIAM Series on Statistics and Applied Probability

The ASA-SIAM Series on Statistics and Applied Probability is published jointly by the American Statistical Association and the Society for Industrial and Applied Mathematics. The series consists of a broad spectrum of books on topics in statistics and applied probability. The purpose of the series is to provide inexpensive, quality publications of interest to the intersecting membership of the two societies.
Editorial Board

Robert N. Rodriguez, SAS Institute Inc., Editor-in-Chief
David Banks, Duke University
H. T. Banks, North Carolina State University
Richard K. Burdick, Arizona State University
Joseph Gardiner, Michigan State University
Douglas M. Hawkins, University of Minnesota
Susan Holmes, Stanford University
Lisa LaVange, Inspire Pharmaceuticals, Inc.
Gary C. McDonald, Oakland University and National Institute of Statistical Sciences
Francoise Seillier-Moiseiwitsch, University of Maryland, Baltimore County
Lee, H. K. H., Bayesian Nonparametrics via Neural Networks
O'Gorman, T. W., Applied Adaptive Statistical Methods: Tests of Significance and Confidence Intervals
Ross, T. J., Booker, J. M., and Parkinson, W. J., eds., Fuzzy Logic and Probability Applications: Bridging the Gap
Nelson, W. B., Recurrent Events Data Analysis for Product Repairs, Disease Recurrences, and Other Applications
Mason, R. L. and Young, J. C., Multivariate Statistical Process Control with Industrial Applications
Smith, P. L., A Primer for Sampling Solids, Liquids, and Gases: Based on the Seven Sampling Errors of Pierre Gy
Meyer, M. A. and Booker, J. M., Eliciting and Analyzing Expert Judgment: A Practical Guide
Latouche, G. and Ramaswami, V., Introduction to Matrix Analytic Methods in Stochastic Modeling
Peck, R., Haugh, L., and Goodman, A., Statistical Case Studies: A Collaboration Between Academe and Industry, Student Edition
Peck, R., Haugh, L., and Goodman, A., Statistical Case Studies: A Collaboration Between Academe and Industry
Barlow, R., Engineering Reliability
Czitrom, V. and Spagon, P. D., Statistical Case Studies for Industrial Process Improvement
Bayesian Nonparametrics via Neural Networks

Herbert K. H. Lee
University of California, Santa Cruz
Santa Cruz, California

Society for Industrial and Applied Mathematics
Philadelphia, Pennsylvania

American Statistical Association
Alexandria, Virginia
Copyright © 2004 by the American Statistical Association and the Society for Industrial and Applied Mathematics.

10 9 8 7 6 5 4 3 2 1

All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.

No warranties, express or implied, are made by the publisher, authors, and their employers that the programs contained in this volume are free of error. They should not be relied on as the sole basis to solve a problem whose incorrect solution could result in injury to person or property. If the programs are employed in such a manner, it is at the user's own risk and the publisher, authors, and their employers disclaim all liability for such misuse.

Trademarked names may be used in this book without the inclusion of a trademark symbol. These names are used in an editorial context only; no infringement of trademark is intended.

S-PLUS is a registered trademark of Insightful Corporation.

SAS is a registered trademark of SAS Institute Inc.

This research was supported in part by the National Science Foundation (grants DMS 9803433, 9873275, and 0233710) and the National Institutes of Health (grant R01 CA54852-08).

Library of Congress Cataloging-in-Publication Data

Lee, Herbert K. H.
Bayesian nonparametrics via neural networks / Herbert K. H. Lee.
p. cm. -- (ASA-SIAM series on statistics and applied probability)
Includes bibliographical references and index.
ISBN 0-89871-563-6 (pbk.)
1. Bayesian statistical decision theory. 2. Nonparametric statistics. 3. Neural networks (Computer science) I. Title. II. Series.
QA279.5.L43 2004
519.5'42--dc22    2004048151

A portion of the royalties from the sale of this book are being placed in a fund to help students attend SIAM meetings and other SIAM-related activities. This fund is administered by SIAM and qualified individuals are encouraged to write directly to SIAM for guidelines.

SIAM is a registered trademark.
Contents

List of Figures
Preface

1 Introduction
1.1 Statistics and Machine Learning
1.2 Outline of the Book
1.3 Regression Example: Ground-level Ozone Pollution
1.4 Classification Example: Loan Applications
1.5 A Simple Neural Network Example

2 Nonparametric Models
2.1 Nonparametric Regression
2.1.1 Local Methods
2.1.2 Regression Using Basis Functions
2.2 Nonparametric Classification
2.3 Neural Networks
2.3.1 Neural Networks Are Statistical Models
2.3.2 A Brief History of Neural Networks
2.3.3 Multivariate Regression
2.3.4 Classification
2.3.5 Other Flavors of Neural Networks
2.4 The Bayesian Paradigm
2.5 Model Building

3 Priors for Neural Networks
3.1 Parameter Interpretation and Lack Thereof
3.2 Proper Priors
3.3 Noninformative Priors
3.3.1 Flat Priors
3.3.2 Jeffreys Priors
3.3.3 Reference Priors
3.4 Hybrid Priors
3.5 Model Fitting
3.6 Example: Comparing Priors
3.7 Asymptotic Consistency of the Posterior

4 Building a Model
4.1 Model Selection and Model Averaging
4.1.1 Modeling Versus Prediction
4.1.2 Bayesian Model Selection
4.1.3 Computational Approximations for Model Selection
4.1.4 Model Averaging
4.1.5 Shrinkage Priors
4.1.6 Automatic Relevance Determination
4.1.7 Bagging
4.2 Searching the Model Space
4.2.1 Greedy Algorithms
4.2.2 Stochastic Algorithms
4.2.3 Reversible Jump Markov Chain Monte Carlo
4.3 Ozone Data Analysis
4.4 Loan Data Analysis

5 Conclusions

A Reference Prior Derivation

Glossary
Bibliography
Index
List of Figures

1.1 Pairwise scatterplots for the ozone data
1.2 Estimated smooths for the ozone data
1.3 Correlated variables: age vs. current residence for loan applicants
1.4 A neural network fitted function
1.5 Simple neural network model diagram
2.1 A tree model for ozone using only wind speed and humidity
2.2 Example wavelets from the Haar family
2.3 Diagram of nonparametric methods
2.4 Neural network model diagram
2.5 Multivariate-response neural network model diagram
2.6 Probability of loan acceptance by only age, 6 nodes
2.7 Probability of loan acceptance by only age, 4 nodes
3.1 Fitted function with a single hidden node
3.2 Maximum likelihood fit for a two-hidden-node network
3.3 Logistic basis functions of the fit in Figure 3.2
3.4 DAG for the Muller and Rios Insua model
3.5 DAG for the Neal model
3.6 Comparison of priors
4.1 Fitted mean functions for several sizes of networks
4.2 Bagging fitted values
4.3 Fitted ozone values by day of year
4.4 Fitted ozone values by vertical height, humidity, pressure gradient, and inversion base temperature
4.5 Fitted ozone values by actual recorded levels
4.6 Probability of loan acceptance by time in current residence
4.7 Probability of loan acceptance by age
4.8 Probability of loan acceptance by income
Preface
When I first heard about neural networks and how great they were, I was rather skeptical. Being sold as a magical black box, there was enough hype to make one believe that they could solve the world's problems. When I tried to learn more about them, I found that most of the literature was written for a machine learning audience, and I had to grapple with a new perspective and a new set of terminology. After some work, I came to see neural networks from a statistical perspective, as a probability model. One of the primary motivations for this book was to write about neural networks for statisticians, addressing issues and concerns of interest to statisticians, and using statistical terminology. Neural networks are a powerful model, and should be treated as such, rather than disdained as a mere algorithm as I have found some statisticians do. Hopefully this book will prove to be illuminating.
The phrase "Bayesian nonparametrics" means different things to different people. The traditional interpretation usually implies infinite-dimensional processes such as Dirichlet processes, used for problems in regression, classification, and density estimation. While on the surface this book may not appear to fit that description, it is actually close. One of the themes of this book is that a neural network can be viewed as a finite-dimensional approximation to an infinite-dimensional model, and that this model is useful in practice for problems in regression and classification.
Thus the first section of this book will focus on introducing neural networks within the statistical context of nonparametric regression and classification. The rest of the book will examine important statistical modeling issues for Bayesian neural networks, particularly the choice of prior and the choice of model.

While this book will not assume the reader has any prior knowledge about neural networks, neither will it try to be an all-inclusive introduction. Topics will be introduced in a self-contained manner, with references provided for further details of the many issues that will not be directly addressed in this book.
The target audience for this book is practicing statisticians and researchers, as well as students preparing for either or both roles. This book addresses practical and theoretical issues. It is hoped that the users of neural networks will want an understanding of how the model works, which can lead to a better appreciation of knowing when it is working and when it is not. It will be assumed that the reader has already been introduced to the basics of the Bayesian approach, with only a brief review and additional references provided. There are a number of good introductory books on Bayesian statistics available (see Section 2.4), so it does not seem productive to repeat that material here. It will also be assumed that the reader has a solid background in mathematical statistics and in linear regression, such as that which would be acquired as part of a traditional Master's degree in statistics. However, there are few formal proofs, and much of the text should be accessible even without this background. Computational issues will be discussed at conceptual and algorithmic levels.
This work developed from my Ph.D. thesis ("Model Selection and Model Averaging for Neural Networks," Carnegie Mellon University, Department of Statistics, 1998). I am grateful for all the assistance and knowledge provided by my advisor, Larry Wasserman. I would also like to acknowledge the many individuals who have contributed to this effort, including David Banks, Jim Berger, Roberto Carta, Merlise Clyde, Daniel Cork, Scott Davies, Sigrunn Eliassen, Chris Genovese, Robert Gramacy, Rob Kass, Milovan Krnjajic, Meena Mani, Daniel Merl, Andrew Moore, Peter Muller, Mark Schervish, Valerie Ventura, and Kert Viele, as well as the staff at SIAM and a number of anonymous referees and editors, both for this book and for the papers preceding it. At various points during this work, funding has been provided by the National Science Foundation (grants DMS 9803433, 9873275, and 0233710) and the National Institutes of Health (grant R01 CA54852-08).
Chapter 1

Introduction
The goal of this book is to put neural network models firmly into a statistical framework, treating them with the accompanying rigor normally accorded to statistical models. A neural network is frequently seen as either a magical black box or purely as a machine learning algorithm, when in fact there is a definite probability model behind it. This book will start by showing how neural networks are indeed a statistical model for doing nonparametric regression or classification. The focus will be on a Bayesian perspective, although many of the topics will apply to frequentist models as well. As much of the literature on neural networks appears in the computer science realm, many standard modeling questions fail to get addressed. In particular, this book will take a hard look at key modeling issues such as choosing an appropriate prior and dealing with model selection. Most of the existing literature deals with neural networks as an algorithm. The hope of this book is to shift the focus back to modeling.
1.1 Statistics and Machine Learning
The fields of statistics and machine learning are two approaches toward the same goals, with much in common. In both cases, the idea is to learn about a problem from data. In most cases this is either a classification problem, a regression problem, an exploratory data analysis problem, or some combination of the above. Where statistics and machine learning differ most is in their perspective. It is sort of like two people, one standing outside of an airplane and one standing inside the same plane, both asked to describe this plane. The person outside might discuss the length, the wingspan, the number of engines and their layout, and so on. The person inside might comment on the number of rows of seats, the number of aisles, the seat configurations, the amount of overhead storage space, the number of lavatories, and so on. In the end, they are both describing the same plane. The relationship between statistics and machine learning is much like this situation. As much of the terminology differs between the two fields, towards the end of this book a glossary is provided for translating relevant machine learning terms into statistical terms.
As a bit of an overgeneralization, the field of statistics and the methods that come out of it are based on probability models. At the heart of almost all analyses there is some distribution describing a random quantity, and the methodology springs from the probability model. For example, in simple linear regression, we relate the response y to the explanatory variable x via the conditional distribution y ~ N(β0 + β1 x, σ²), using possibly unknown parameters β0, β1, and σ². The core of the model is the probability model.
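This probability-model view can be made concrete with a short simulation. The sketch below assumes nothing beyond the N(β0 + β1 x, σ²) model just described; the particular parameter values, seed, and sample size are arbitrary, chosen only for illustration. Data are drawn from the conditional distribution, and least squares is then used to recover the generating parameters.

```python
import numpy as np

# The probability model behind simple linear regression:
#     y ~ N(beta0 + beta1 * x, sigma^2).
# Parameter values below are arbitrary, for demonstration only.
rng = np.random.default_rng(0)
beta0, beta1, sigma = 2.0, 0.5, 0.3

x = rng.uniform(0, 10, size=500)
y = rng.normal(beta0 + beta1 * x, sigma)   # draws from the conditional distribution

# Least squares recovers the parameters of the generating model.
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # close to (2.0, 0.5)
```

The methodology (here, least squares) springs directly from the probability model: the estimates are optimal precisely because the Gaussian conditional distribution is assumed.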
For a matching overgeneralization, machine learning can be seen as the art of developing algorithms for learning from data. The idea is to devise a clever algorithm that will perform well in practice. It is a pragmatic approach that has produced many useful tools for data analysis.

Many individual data analysis methods are associated only with the field in which they were developed. In some cases, methods have been invented in one field and reinvented in the other, with the terminology remaining completely different, and the research remaining separate. This is unfortunate, since both fields have much to learn from each other. It is like the airplane example, where a more complete description of the plane is available when both observers cooperate.
Traditionally, neural networks are seen as a machine learning algorithm. They were largely developed by the machine learning community (or the precursor community, artificial intelligence); see Section 2.3.2. In most implementations, the focus is on prediction, using the neural network as a means of finding good predictions, i.e., as an algorithm. Yet a neural network is also a statistical model. Chapter 2 will show how it is simply another method for nonparametric regression or classification. There is an underlying probability model, and, with the proper perspective, a neural network is seen as a generalization of linear regression. Further discussion is found in Section 2.3.1.
For a more traditional approach to neural networks, the reader is referred to Bishop (1995) or Ripley (1996), two excellent books that are statistically accurate but have somewhat of a machine learning perspective. Fine (1999) is more statistical in flavor but mostly non-Bayesian. Several good (but not Bayesian) statistical review articles are also available (Cheng and Titterington (1994); Stern (1996); Warner and Misra (1996)). A final key reference is Neal (1996), which details a fully Bayesian approach to neural networks from a perspective that combines elements from both machine learning and statistics.
1.2 Outline of the Book
This section is meant to give the reader an idea of where this book is going from here. The next two sections of this chapter will introduce two data examples that will be used throughout the book, one for regression and one for classification. The final section of this chapter gives a brief introduction to a simple neural network model, to provide some concreteness to the concept of a neural network, before moving on. Chapter 2 provides a statistical context for neural networks, showing how they fit into the larger framework of nonparametric regression and classification. It will also give a brief review of the Bayesian approach and of concepts in model building. Chapter 3 will focus on choosing a prior, which is a key item in a Bayesian analysis. Also included are details on model fitting and a result on posterior asymptotic consistency. Chapter 4 gets into the nuts and bolts of model building, discussing model selection and alternatives such as model averaging, both at conceptual and implementational levels.
1.3 Regression Example: Ground-level Ozone Pollution
An example that will be used throughout this book to illustrate a problem in nonparametric regression is a dataset on ground-level ozone pollution. This dataset first appeared in Breiman and Friedman (1985) and was analyzed extensively by Hastie and Tibshirani (1990). Other authors have also used this dataset, so it is useful to be able to compare results from methods in this book with other published results.
The data consist of ground-level ozone measurements in the Los Angeles area over the course of a year, along with other meteorological variables. The goal is to use the meteorological covariates to predict ozone concentration, which is a pollutant at the level of human activity. The version of the data used here is that in Hastie and Tibshirani (1990), with missing data removed, so that there are complete measurements for 330 days in 1976. For each of these days, the response variable of interest is the daily maximum one-hour-average ozone level in parts per million at Upland, California. Also available are nine possible explanatory variables: VH, the altitude at which the pressure is 500 millibars; WIND, the wind speed (mph) at Los Angeles International Airport (LAX); HUM, the humidity (%) at LAX; TEMP, the temperature (degrees F) at Sandburg Air Force Base; IBH, the temperature inversion base height (feet); DPG, the pressure gradient (mm Hg) from LAX to Daggett; IBT, the inversion base temperature (degrees F) at LAX; VIS, the visibility (miles) at LAX; and DAY, the day of the year (numbered from 1 to 365).
The data are displayed in pairwise scatterplots in Figure 1.1. The first thing to notice is that most variables have a strong nonlinear association with the ozone level, making prediction feasible but requiring a flexible model to capture the nonlinear relationship. Fitting a generalized additive model (GAM; local smoothed functions for each variable without interaction effects), as described in Section 2.1.1 or in Hastie and Tibshirani (1990), produces the fitted relationships between the explanatory variables (rescaled to the unit interval) and ozone displayed in Figure 1.2. Most variables display a strong relationship with ozone, and all but the first are clearly nonlinear. This problem is thus an example of nonparametric regression, in that some smoothing of the data is necessary, but we must determine how much smoothing is optimal. We will apply neural networks to this problem and compare the results to other nonparametric techniques.
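To give a flavor of the local smoothing such models perform, here is a minimal Nadaraya-Watson kernel smoother of the kind discussed in Section 2.1.1. The ozone data themselves are not reproduced in the text, so a synthetic nonlinear relationship (loosely imitating ozone versus DAY over 330 days) stands in; the bandwidth parameter plays exactly the role of the "how much smoothing" choice discussed above.

```python
import numpy as np

# Synthetic stand-in for a nonlinear covariate-response relationship:
# a seasonal signal over the year plus noise (not the actual ozone data).
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 365, 330))
y = 10 + 8 * np.sin(2 * np.pi * x / 365) + rng.normal(0, 1.5, x.size)

def kernel_smooth(x0, x, y, bandwidth):
    """Gaussian-kernel weighted average of y near x0; the bandwidth
    controls how much smoothing is applied."""
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
    return np.sum(w * y) / np.sum(w)

# Evaluate the smooth on a grid of days.
grid = np.linspace(0, 365, 50)
fit = np.array([kernel_smooth(g, x, y, bandwidth=15.0) for g in grid])
```

A small bandwidth tracks the data closely (risking overfitting); a large one flattens the seasonal shape. Choosing it well is the central problem of nonparametric smoothing.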
Another notable feature of this dataset is that many variables are highly related to each other, for example VH, TEMP, and IBT, as can be seen in Figure 1.1. It is important to deal with this multicollinearity to avoid overfitting and reduce the variance of predictions. The favored approaches in this book include selection and model averaging. We shall see that neural networks perform well when compared to existing methods in the literature.
1.4 Classification Example: Loan Applications
As an example of a classification problem, a dataset that we shall revisit comes from consumer banking. One question of interest is classifying loan applications for acceptance or rejection. The high correlation between explanatory variables presents certain challenges. There is also direct interest in the problem of model selection, as will be explained below.
Figure 1.1. Pairwise scatterplots for the ozone data.

Figure 1.2. Estimated smooths for the ozone data.

Historically, banking was a local and personal operation. When a customer wanted a loan, they went to their local bank branch, where they knew the staff, and they applied for a loan. The branch manager would then approve or deny the loan based on a combination of the information in the application and their personal knowledge of the customer. As banking grew in scale, the processing of loan applications moved to centralized facilities and was done by trained loan officers who specialized in making such decisions. The personal connection was gone, but there was still a human looking over each application and making his or her decision based on his or her experience as a loan officer. Recently, banks have been switching to partially or completely automated systems for dealing with most loan applications, basing their decisions on algorithms derived from statistical models. This process has been met with much resistance from the loan officers, who believe that the automated processes will mishandle the nonstandard items on many applications and who are also worried about losing their jobs.
A regional bank wanted to streamline this process while retaining the human decision element. In particular, the bank wanted to find out what information on the application was most important, so that it could greatly simplify the form, making life easier for the customer and creating cost savings for the bank. Thus the primary goal here is one of variable selection. In the process one must find a model that predicts well, but the ultimate goal is to find a subset of explanatory variables that is hopefully not too large and can be used to predict as accurately as possible. Even from a statistical standpoint of prediction, any model would need to be concerned with the large amount of correlation between explanatory variables, so model selection would be of interest to help deal with the multicollinearity. We turn to nonparametric models for classification in the hope that by maximizing the flexibility of the model structure, we can minimize the number of covariates necessary for a good fit. In some sections of this book we will focus on the intermediate goal of predicting whether the loan was actually approved or denied by a human loan officer. In other sections we will be more concerned with variable selection. This dataset involves the following 23 covariates:
1. Birthdate
2. Length of time in current residence
3. Length of time in previous residence
4. Length of time at current employer
5. Length of time at previous employer
6. Line utilization of available credit
7. Number of inquiries on credit file
8. Date of oldest entry in credit file
9. Income
10. Residential status
11. Monthly mortgage payments
12. Number of checking accounts at this bank
13. Number of credit card accounts at this bank
14. Number of personal credit lines at this bank
15. Number of installment loans at this bank
16. Number of accounts at credit unions
17. Number of accounts at other banks
18. Number of accounts at finance companies
19. Number of accounts at other financial institutions
20. Budgeted debt expenses
21. Amount of loan approved
22. Loan type code
23. Presence of a co-applicant
Figure 1.3. Correlated variables: age vs. current residence for loan applicants.

These variables fall mainly into three groups: stability, demographic, and financial. Stability variables include such items as the length of time the applicant has worked in his or her current job. Stability is thought to be positively correlated with intent and ability to repay a loan; for example, a person who has held a given job longer is less likely to lose it (and the corresponding income) and be unable to repay a loan. A person who has lived in his or her current residence for longer is less likely to skip town suddenly, leaving behind an unpaid loan. Demographic variables include items like age. In the United States, it is illegal to discriminate against older people, but younger people can be discriminated against. Many standard demographic variables (e.g., gender) are not legal for use in a loan decision process and are thus not included. Financial variables include the number of other accounts at this bank and at other financial institutions, as well as the applicant's income and budgeted expenses, and are of obvious interest to a loan officer. The particular loans involved are unsecured (no collateral) personal loans, such as a personal line of credit or a vacation loan.
The covariates in this dataset are highly correlated. In some cases, there is even causation. For example, someone with a mortgage will be a homeowner. Another example is that a person cannot have lived in their current residence for more years than they are old. Figure 1.3 shows a plot of age versus length of time the applicant has reported living in their current residence. Clearly all points must fall below the 45-degree line. Any statistical analysis must take this correlation into account, and model selection is an ideal approach.
A standard approach in the machine learning literature is to split the data into two sets: a training set, which is used for fitting the model, and a test set, for which the fitted model makes predictions, and one then compares the accuracy of the predictions to the true values. Since the loan dataset contains 8508 cases, it is large enough to split into a training set of 4000 observations and a test set of the remaining 4508 observations. 4000 observations are enough to fit complex models, and the test set allows one to see if the fitted model is overfitting (fitting the training set quite well but not able to fit the test set well), or if it can predict well on out-of-sample observations, which is typically the desired goal.
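The split described above can be sketched as follows. The loan data themselves are not public, so only the index bookkeeping is shown, and the random seed is arbitrary; this is an illustration of the mechanics, not the actual partition used in the book.

```python
import numpy as np

# Train/test split: 8508 cases divided into a training set of 4000
# and a test set of the remaining 4508.
rng = np.random.default_rng(2)
n_cases, n_train = 8508, 4000

perm = rng.permutation(n_cases)   # a random ordering of the case indices
train_idx = perm[:n_train]        # cases used to fit the model
test_idx = perm[n_train:]         # held out for out-of-sample prediction

# A model would be fit on the training cases, and its predictions
# compared with the true outcomes on the 4508 held-out test cases.
```

Because the test cases play no role in fitting, the test-set error is an honest estimate of out-of-sample performance, which is exactly the check for overfitting described above.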
This is a messy dataset, and no model will be able to predict with great accuracy on cases outside the training set. Most variables are self-reported, so there is potential measurement error for the explanatory variables. There is also arbitrary rounding in the self-reporting. For example, residence times are typically reported to the nearest month for short times, but to the nearest year or nearest five years for people who have lived in the same place for a longer time. This structure is apparent in Figure 1.3, where the horizontal lines in the picture correspond to whole years near the bottom and clearly delineate every five years in the middle. Some information is incomplete, and there are a number of subjective factors known to influence loan officers which are not coded in the data, so that one cannot hope for too precise a fit on this dataset. For example, data on co-applicants was incomplete, and it was not possible to use those variables because of the very large amount of missing data for cases where there was known to be a co-applicant. A trained loan officer can make much better sense of the incomplete co-applicant data, and this is likely a source of some of the error in the models that will be seen in this book. From talking to loan officers, it seems that the weight given to the presence and the attributes of a co-applicant is based largely on factors that are not easily quantified, such as the reliability of the income source of the primary applicant (e.g., some fields of self-employment are deemed better than others, and some employers are known for having more or less employee turnover; an applicant in a less stable job would have more need for a co-applicant).
1.5 A Simple Neural Network Example
In this section, the basic idea of a neural network is introduced via a simple example. More details about neural networks will appear in the following chapters. In particular, Chapter 2 will demonstrate how neural networks can be viewed as a nonparametric model and give some history, and Chapter 3 will explain the interpretation, and lack thereof, of the parameters in the model, as well as methods for fitting the model.
For the ozone data introduced in Section 1.3, consider fitting a model relating ozone levels to the day of the year (coded as 1 through 365). A possible fitted model would be a neural network with two nodes, and one example of such a fit appears in Figure 1.4. The fitted line in the plot has the form

ŷ = β0 + β1 ψ(a1 + b1 x) + β2 ψ(a2 + b2 x), (1.1)

where x is the day of the year; the fitted values of the parameters are discussed below.
How can we make sense of this equation? Neural networks are typically thought of in terms of their hidden nodes. This is best illustrated with a picture. Figure 1.5 displays our simple model, with a single explanatory variable (day of year) and two hidden nodes.
Figure 1.5. Simple neural network model diagram.
Starting at the bottom of the diagram, the explanatory variable (input node) feeds its value to each of the hidden nodes, which then transform it and feed those results to the prediction (output) node, which combines them to give the fitted value at the top. In particular, what each hidden node does is to multiply the value of the explanatory variable by a scalar parameter (b), add a scalar (a), and then take the logistic transformation of it:

ψ(a + b x) = 1 / (1 + exp(-(a + b x))). (1.2)
Soin
equation 1.1),
the first
node
hasa = 21.75an db =0.19. Theb
coefficients
are
shown as the
labels
of the
edges from
the
input node
to the
hidden nodes
in
Figure
1.5 a is
no t
shown).
The prediction node then takes a linear combination of the results of all of the hidden nodes to produce the fitted value. In the figure, the linear coefficients for the hidden nodes are the labels of the edges from the hidden nodes to the prediction node. Returning to equation (1.1), the first node is weighted by 10.58, the second by 13.12, and then 7.13 is added. This results in a fitted value for each day of the year.
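The two-node fit described above can be sketched in code. This is a minimal illustration, not the book's fitted model: the weight and intercept magnitudes come from the text, but their signs (and the second node's intercept entirely) are our assumptions, chosen so the curve rises near day 114 and falls near day 280.

```python
import math

def logistic(z):
    # Equation (1.2): shift by a, scale by b, then the logistic transformation.
    return 1.0 / (1.0 + math.exp(-z))

def two_node_net(x, beta0, beta1, beta2, a1, b1, a2, b2):
    # Prediction node: intercept plus a linear combination of the two hidden nodes.
    z1 = logistic(a1 + b1 * x)
    z2 = logistic(a2 + b2 * x)
    return beta0 + beta1 * z1 + beta2 * z2

# Magnitudes follow the text; the signs of a1 and beta2, and the value of a2,
# are assumptions placing the rise near day 114 and the fall near day 280.
params = dict(beta0=7.13, beta1=10.58, beta2=-13.12,
              a1=-21.75, b1=0.19, a2=-53.2, b2=0.19)

fitted = [two_node_net(day, **params) for day in range(1, 366)]
```

Evaluating the function over days 1 through 365 reproduces the qualitative shape of the plot: roughly constant early in the year, a rise centered near day 114, and a decline centered near day 280.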
To make a bit more sense of the plot, note that x = 114 is the inflection point for the first rise in the graph, and this occurs when a + bx = 0 in equation (1.2). So the first rise is the action of the first hidden node. Similarly, the fitted function decreases because of the second hidden node, where the center of the decrease is around x = 280 = −a/b for that node. We will return to the interpretations of the parameters in Chapter 3.
Chapter 2

Nonparametric Models
Most standard statistical techniques are parametric methods, meaning that a particular family of models, indexed by one or more parameters, is chosen for the data, and the model is fit by choosing optimal values for the parameters (or finding their posterior distribution). Examples include linear regression (with slope and intercept parameters) and logistic regression (with the parameters being the coefficients). In these cases, it is assumed that the choice of model family (e.g., a linear relationship with independent Gaussian error) is the correct family, and all that needs to be done is to fit the coefficients.
The idea behind nonparametric modeling is to move beyond restricting oneself to a particular family of models, instead utilizing a much larger model space. For example, the goal of many nonparametric regression problems is to find the continuous function that best approximates the random process without overfitting the data. In this case, one is not restricted to linear functions, or even differentiable functions. The model is thus nonparametric in the sense that the set of possible models under consideration does not belong to a family that can be indexed by a finite number of parameters.
In practice, instead of working directly with an infinite-dimensional space, such as the space of continuous functions, various classes of methods have been developed to approximate the space of interest. The two classes that will be developed here are local approximations and basis representations. There are many additional methods that do not fit into these categories, but these two are sufficient for providing the context of how neural networks fit into the bigger picture of nonparametric modeling. Further references on nonparametric modeling are widely available, for example, Härdle (1990), Hastie, Tibshirani, and Friedman (2001), Duda, Hart, and Stork (2001), and Gentle (2002).
In this chapter, we will first examine a number of different nonparametric regression methods, with particular attention on those using a basis representation. Next will be a discussion of classification methods, which are often simple extensions of regression techniques. Within this framework, we will introduce neural network models, which will be shown to fit right in with the other basis representation methods. A brief review of the Bayesian paradigm follows, along with a discussion of how it relates to neural networks in particular. Finally, there will be a discussion of model building within the context of nonparametric modeling, which will also set the stage for the rest of the book.
2.1 Nonparametric Regression
The typical linear regression problem is to find a vector of coefficients to maximize the likelihood of the model

    y_i = β_0 + β_1 x_i1 + · · · + β_r x_ir + ε_i,    (2.1)

where y_i is the ith case of the response variable, x_i = (x_i1, …, x_ir) is the vector of corresponding values of the explanatory variables, and the residuals, ε_i, are iid Gaussian with mean zero and a common variance. The assumption of a straight line (or a hyperplane) fit may be overly restrictive. Even a family of transformations may be insufficiently flexible to fit many datasets. Instead, we may want a much richer class of possible response functions.
The typical nonparametric regression model is of the form

    y_i = f(x_i) + ε_i,    (2.2)

where f ∈ F, some class of regression functions, x_i is the vector of explanatory variables, and ε_i is iid additive error with mean zero and constant variance, usually assumed to have a Gaussian distribution (as we will here). The main distinction between the competing nonparametric methods is the class of functions, F, to which f is assumed to belong.
The common feature is that functions in F should be able to approximate a very large range of functions, such as the set of all continuous functions, or the set of all square-integrable functions. We will now look at a variety of different ways to choose F and hence a nonparametric regression model. This section focuses only on the two classes of methods that most closely relate to neural networks: local methods and basis representations.
Note that it is not critical for the flow of this book for the reader to understand all of the details in this section; the variety of methods is provided as context for how neural networks fit into the statistical literature. Descriptions of the methods here are rather brief, with references provided for further information. While going through the methods in this chapter, the reader may find Figure 2.3 helpful in understanding the relationships between these methods.
2.1.1 Local Methods
Some of the simplest approaches to nonparametric regression are those which operate locally. In one dimension, a moving average with a fixed window size would be an obvious example that is both simple and local. With a fixed window, the moving average estimate is a step function. More sophisticated things can be done in terms of the choice of window, the weighting of the averaging, or the shape of the fit over the window, as demonstrated by the following methods.
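The fixed-window moving average just described can be sketched in code (a minimal illustration; the function name and data are ours):

```python
def moving_average_fit(xs, ys, width):
    # Local estimate at x: average the responses whose x-values fall inside a
    # fixed window centered at x. With a fixed window this is a step function.
    def estimate(x):
        in_window = [y for xi, y in zip(xs, ys) if abs(xi - x) <= width / 2]
        return sum(in_window) / len(in_window) if in_window else None
    return estimate

fit = moving_average_fit([1, 2, 3, 4, 5], [2.0, 4.0, 6.0, 8.0, 10.0], width=2)
```

Replacing the plain average by a weighted one gives the kernel smoothers discussed next.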
The idea of kernel smoothing is to use a moving weighted average, where the weight function is referred to as a kernel. A kernel is a continuous, bounded, symmetric function whose integral is one. The simplest kernel would be the indicator function over the interval from −1/2 to 1/2, i.e., K(x) = I_[−1/2,1/2](x).

The single hidden layer neural network regression model is

    ŷ_i = β_0 + Σ_{j=1}^{k} β_j Ψ(γ_j' x_i),    (2.6)

where Ψ is the logistic function:

    Ψ(z) = 1 / (1 + exp(−z)).    (2.7)
In words, a neural network is usually described in terms of its hidden nodes. Each hidden node can be thought of as a function, taking a linear combination of the explanatory variables (γ_j'x_i) as an input and returning the logistic transformation (equation (2.7)) as its output. The neural network then takes a linear combination of the outputs of the hidden nodes to give the fitted value at a point. Figure 2.4 shows this process pictorially. (Note that this diagram is an expanded version of Figure 1.5.) Combining equations (2.6) and (2.7) and
Figure 2.4. Neural network model diagram.
expanding the vector notation yields the following equation, which shows the model in its full glory (or goriness):

    ŷ_i = β_0 + Σ_{j=1}^{k} β_j Ψ(γ_j0 + Σ_{h=1}^{r} γ_jh x_ih),

where k is the number of basis functions and r is the number of explanatory variables. The parameters of this model are k (the number of hidden nodes), β_j for j ∈ {0, …, k}, and γ_jh for j ∈ {1, …, k}, h ∈ {0, …, r}. For a fixed network size (fixed k), the number of parameters, d, is

    d = k(r + 1) + (k + 1),

where the first term in the sum is the number of γ_jh and the second is the number of β_j.
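The expanded model and its parameter count can be sketched in code (function names are ours):

```python
import math

def logistic(z):
    # Psi in equation (2.7).
    return 1.0 / (1.0 + math.exp(-z))

def neural_net(x, beta, gamma):
    # x: the r explanatory variables; beta: [beta_0, ..., beta_k];
    # gamma: k rows [gamma_j0, ..., gamma_jr], intercept first.
    hidden = [logistic(g[0] + sum(gh * xh for gh, xh in zip(g[1:], x)))
              for g in gamma]
    return beta[0] + sum(bj * z for bj, z in zip(beta[1:], hidden))

def n_params(k, r):
    # k(r + 1) gammas plus (k + 1) betas, as in the text.
    return k * (r + 1) + (k + 1)
```

For the two-node, one-variable ozone example of Section 1.5, this gives d = 2(1 + 1) + (2 + 1) = 7 parameters.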
It is clear from equation (2.6) that a neural network is simply a nonparametric regression using a basis representation (compare to equation (2.4)), with the Ψ_j's, location-scale logistic functions, as the bases and the β_j's as their coefficients. The (infinite) set of location-scale logistic functions does span the space of continuous functions, as well as the space of square-integrable functions (Cybenko (1989); Funahashi (1989); Hornik, Stinchcombe, and White (1989)). While these bases are not orthogonal, they have turned out to be quite useful in practice, with a relatively small number able to approximate a wide range of functions.
The key to understanding a neural network model is to think of it in terms of basis functions. Interpretation of the individual parameters will be deferred to Chapter 3. For now, we note that if we use a single basis function (hidden node) and restrict β_0 = 0 and β_1 = 1, then this simple neural network is exactly the same as a standard logistic regression model. Thus a neural network can be interpreted as a linear combination of logistic regressions.
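This special case is easy to verify in a sketch (our own function names): with β_0 = 0 and β_1 = 1, the one-node network output coincides with the logistic regression mean.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def one_node_net(x, gamma):
    # One hidden node with beta_0 = 0 and beta_1 = 1 (gamma holds the
    # intercept first): the network output reduces to logistic(gamma' x).
    return 0.0 + 1.0 * logistic(gamma[0] + sum(g * xh for g, xh in zip(gamma[1:], x)))

def logistic_regression_mean(x, gamma):
    # Standard logistic regression mean with the same coefficients.
    return logistic(gamma[0] + sum(g * xh for g, xh in zip(gamma[1:], x)))
```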
2.3.1 Neural Networks Are Statistical Models
From equation (2.6), it is obvious that a neural network is a standard parametric model, with a likelihood and parameters to be fit. When compared to the models in Section 2.1.2, one can see how neural networks are just another example of a basis function method for nonparametric regression (classification will similarly be discussed shortly). While the form of equation (2.6) may not be the most intuitive expression in the world, it is a special case of equations (2.2) and (2.4) and is clearly a model. It is not magic, and it is only a black box if the user wants to pretend they have never looked at equation (2.6). While some people have claimed that neural networks are purely algorithms and not models, it should now be apparent that they are both algorithms and models.
By viewing neural networks as statistical models, we can now apply many other ideas in statistics in order to understand, improve, and appropriately use these models. The disadvantage of viewing them as algorithms is that it can be difficult to apply knowledge from other algorithms. Taking the model-based perspective, we can be more systematic in discussing issues such as choosing a prior, building a model, checking assumptions for the validity of the model, and understanding uncertainty in our predictions.
It is also worth noting that a neural network can be viewed as a special case of projection pursuit regression (equation (2.3)) where the arbitrary smooth functions are restricted to scaled logistic functions. Furthermore, in the limit with an arbitrarily large number of basis functions, a neural network can be made to converge to a Gaussian process model (Neal (1996)).
2.3.2 Brief History of Neural Networks
While neural networks are statistical models, they have mostly been developed from the algorithmic perspective of machine learning. Neural networks were originally created as an attempt to model the act of thinking by modeling neurons in a brain. Much of the early work in this area traces back to a paper by McCulloch and Pitts (1943), which introduced the idea of an activation function, although the authors used a threshold (indicator) function rather than the sigmoidal functions common today (an S-shaped function that has horizontal asymptotes in both directions from its center and rises smoothly and monotonically between its asymptotes, e.g., equation (2.7)).
This particular model did not turn out to be appropriate for modeling brains but did eventually lead to useful statistical models. Modern neural networks are sometimes referred to as artificial neural networks to emphasize that there is no longer any explicit connection to biology.
Early networks of threshold function nodes were explored by Rosenblatt (1962) (calling them perceptrons, a term that is sometimes still used for nodes) and Widrow and Hoff (1960) (calling them adalines). Threshold activations were found to have severe limitations (Minsky and Papert (1969)), and thus sigmoidal activations became widely used instead (Anderson (1982)).
Much of the recent work on neural networks stems from renewed interest generated by Rumelhart, Hinton, and Williams (1986) and their backpropagation algorithm for fitting the parameters of the network. A number of key papers followed, including Funahashi (1989), Cybenko (1989), and Hornik, Stinchcombe, and White (1989), that showed that neural networks are a way to approximate a function arbitrarily closely as the number of hidden nodes gets large. Mathematically, they have shown that the infinite set of location-scale logistic functions is indeed a basis set for many common spaces, such as the space of continuous functions, or square-integrable functions.
2.3.3 Multivariate Regression
To extend the above model to a multivariate y, we simply treat each dimension of y as a separate output and add a set of connections from each hidden node to each of the dimensions of y. In this implementation, we assume that the error variance is the same for all dimensions, although this would be easily generalized to separate variances for each component. Denote the vector of the ith observation as y_i, so that the gth component (dimension) of y_i is denoted y_ig. The model is now

    y_ig = β_0g + Σ_{j=1}^{k} β_jg Ψ(γ_j0 + Σ_{h=1}^{r} γ_jh x_ih) + ε_ig.

Thus each dimension of the output is modeled as a different linear combination of the same basis functions. This is displayed pictorially in Figure 2.5.
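The multivariate extension can be sketched as follows (names ours): each output dimension applies its own linear combination to the shared hidden-node values.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def multivariate_net(x, betas, gammas):
    # gammas: k rows [gamma_j0, ..., gamma_jr] defining the shared hidden nodes;
    # betas: one row [beta_0g, ..., beta_kg] per output dimension g.
    hidden = [logistic(g[0] + sum(gh * xh for gh, xh in zip(g[1:], x)))
              for g in gammas]
    # Each output dimension is its own linear combination of the same bases.
    return [b[0] + sum(bj * z for bj, z in zip(b[1:], hidden)) for b in betas]

out = multivariate_net([0.0], betas=[[0.0, 1.0], [1.0, 2.0]], gammas=[[0.0, 1.0]])
```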
Figure 2.5. Multivariate response neural network model diagram.
2.3.4 Classification
The multivariate response model is easily extended to the problem of classification. There are several possible approaches, two of which will be discussed here. The first is the more common approach of transforming the output to the probability scale, leading to a standard multinomial likelihood. The second approach uses latent variables, retaining a Gaussian likelihood of sorts, leaving it much closer to the regression model.
For the multinomial likelihood approach, the key to extending this model to classification is to express the categorical response variable as a vector of indicator (dummy) variables. A multivariate neural network is then fit to this vector of indicator variables, with the outputs transformed to a probability scale so that the multinomial likelihood can be used. Let t_i be the categorical response (the target), with its value being the category number to which case i belongs, t_i ∈ {1, …, q}, where q is the number of categories. Note that the ordering of the categories does not matter in this formulation. Let y_i be a vector in the alternate representation of the response (a vector of indicator variables), where y_ig = 1 when g = t_i, and zero otherwise.
For example, suppose we are trying to classify tissue samples into the categories benign tumor, malignant tumor, and no tumor, and our observed dataset (the training set) is {benign, none, malignant, benign, malignant}. We have three categories, so q = 3, and we get the following recoding:

    t_1 = 1 (benign)       y_1 = (1, 0, 0)
    t_2 = 3 (none)         y_2 = (0, 0, 1)
    t_3 = 2 (malignant)    y_3 = (0, 1, 0)
    t_4 = 1 (benign)       y_4 = (1, 0, 0)
    t_5 = 2 (malignant)    y_5 = (0, 1, 0)
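The recoding step can be sketched as follows (the helper and the category numbering, benign = 1, malignant = 2, none = 3, are our own choices; as noted above, the ordering does not matter in this formulation):

```python
def recode(observations, categories):
    # Assign each observation its category number t_i (1-based) and its
    # vector of indicator (dummy) variables y_i, as described in the text.
    t = [categories.index(obs) + 1 for obs in observations]
    y = [[1 if cat == obs else 0 for cat in categories] for obs in observations]
    return t, y

cats = ["benign", "malignant", "none"]
t, y = recode(["benign", "none", "malignant", "benign", "malignant"], cats)
```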
Let p_ig be the (true) underlying probability that y_ig = 1. To use a neural network to fit these probabilities, the continuous-valued output of the neural network, denoted w_ig, is transformed to the probability scale by exponentiating and normalizing by the sum of all exponentiated outputs for that observation. Thus the likelihood is

    L = Π_{i=1}^{n} Π_{g=1}^{q} p_ig^{y_ig}.

The parameters p_ig are deterministic transformations of the neural network parameters β and γ:

    p_ig = exp(w_ig) / Σ_{g'=1}^{q} exp(w_ig'),  with  w_ig = β_0g + Σ_{j=1}^{k} β_jg Ψ(γ_j0 + Σ_{h=1}^{r} γ_jh x_ih),
where i = 1, …, n represents the different cases, j = 1, …, k are the different hidden nodes (logistic basis functions), and g = 1, …, q are the different classes being predicted.
Note that in practice, only the first q − 1 elements of y are used in fitting the neural network so that the problem is of full rank. The w_iq term is set to zero (for all i) for identifiability of the model.
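The transformation to the probability scale, with the qth output pinned to zero for identifiability, can be sketched as:

```python
import math

def softmax_probs(w_first):
    # w_first: the network's continuous outputs for the first q - 1 classes of
    # one case; the qth output is pinned to zero for identifiability.
    w = list(w_first) + [0.0]
    exps = [math.exp(v) for v in w]
    total = sum(exps)
    # Exponentiate and normalize, giving probabilities that sum to one.
    return [e / total for e in exps]

probs = softmax_probs([1.0, -0.5])  # a q = 3 example
```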
For a fixed network size (fixed k), the number of parameters, d, is

    d = k(r + 1) + (q − 1)(k + 1),

where the first term in the sum is the number of γ_jh and the second is the number of β_jg.
This model is referred to as the softmax model in the field of computer science (Bridle (1989)). This method of reparameterizing the probabilities from a continuous scale can be found in other areas of statistics as well, such as generalized linear regression (McCullagh and Nelder (1989, p. 159)).
As an alternative to the multinomial likelihood, one can use a latent variable approach to modify the original model to directly do classification, retaining an underlying Gaussian likelihood, and thus retaining the properties of the original network (as well as being able to more readily reuse computer code).
Again, we code the categories numerically so that the ith observation is in category t_i, and we construct indicator variables for each of the q possible categories. Again, we run the neural network for the indicator variables and denote the continuous-valued outputs (predictions of the latent variables) by w_ig. Now, instead of using a multinomial distribution for t_i, we deterministically set the fitted value to be

    t̂_i = argmax_g w_ig,
i.e., the fitted response is the category with the largest w_ig. Note that we set w_iq = 0 for all i, in order to ensure the model is well specified (otherwise the locations of all of the w_ig's could be shifted by a constant without changing the model). Thus this model preserves the original regression likelihood but now applies it to latent variables, with the latent variables producing a deterministic fitted value on the categorical scale.
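The latent-variable fitted value can be sketched as (our own helper name):

```python
def latent_class_fit(w_first):
    # w_first: latent network outputs for the first q - 1 categories of one
    # case; the qth output is pinned at zero so the model is well specified.
    w = list(w_first) + [0.0]
    # The fitted category is the (1-based) index of the largest latent output.
    return max(range(len(w)), key=lambda g: w[g]) + 1
```

For instance, with q = 3, outputs (2.0, −1.0) yield category 1, while (−2.0, −1.0) yield category 3, since the pinned zero is then the largest.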
This approach also has the advantage of a simple extension to ordinal response variables, those which have a clear ordering but are categorical, where the differences between categories cannot be directly converted to distances, so treating them as continuous is not realistic. For example, a survey may offer the responses of excellent, good, fair, and poor, which are certainly ordered but with no natural distance metric between levels. To fit an ordinal variable, we again let t_i code the category of the ith observation, but this time the ordering of the categories is important. We no longer need the indicator variables, and instead we just fit a single neural network and denote its continuous output by w_i. We then convert this output to a category by dividing the real line into bins and matching the bins to the ordered categories. To be precise, we have additional parameters c_1 ≤ · · · ≤ c_{q−1}, with c_g separating category g from category g + 1.
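The binning step can be sketched as follows (the cutoff values are hypothetical):

```python
import bisect

def ordinal_fit(w, cutoffs):
    # Divide the real line into bins at the cutoffs c_1 <= ... <= c_{q-1} and
    # return the (1-based) ordered category whose bin contains the output w.
    return bisect.bisect_right(cutoffs, w) + 1

# Hypothetical cutoffs for q = 4 ordered survey responses
# (poor < fair < good < excellent).
cuts = [-1.0, 0.0, 1.0]
category = ordinal_fit(0.5, cuts)
```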
Figure 2.6. Predicted probability of loan acceptance from a 6 node network using only the age of the applicant. An x marks a loan that was actually approved, and an O a loan that was actually denied.
2.3.5 Other Flavors of Neural Networks
The exact form of neural network model presented above (equation (2.6)) is referred to as a single hidden layer feed-forward neural network with logistic activation functions and a linear output. As one might expect from all of the terms in that description, there are a large number of variations possible. We shall discuss these briefly, and then we will return to the above basic model for the rest of this book.
The first term is "single hidden layer," which means there is just one set of hidden nodes, the logistic basis functions, which are pictured as the middle row in Figure 2.4. With one set of hidden nodes, we have the straightforward basis representation interpretation. It is quite possible to add additional layers of hidden nodes, where each hidden node takes a linear combination of the outputs from the previous layer and produces its own output, which is a logistic transformation of its input linear combination. The outputs of each layer are taken as the inputs of the next layer. As was proved by several groups, a single layer is all that is necessary to span most spaces of interest, so there is no additional flexibility to be gained by using multiple layers (Cybenko (1989); Funahashi (1989); Hornik, Stinchcombe, and White (1989)). However, sometimes adding layers will give a more compact representation, whereby a complex function can be approximated by a smaller total number of nodes in multiple layers than the number of nodes necessary if only a single layer is used.
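A two-hidden-layer forward pass can be sketched as (names ours):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights):
    # Each node takes a linear combination (intercept first) of the previous
    # layer's outputs and applies the logistic transformation.
    return [logistic(w[0] + sum(wi * v for wi, v in zip(w[1:], inputs)))
            for w in weights]

def two_layer_net(x, w1, w2, beta):
    h1 = layer(x, w1)    # first hidden layer sees the explanatory variables
    h2 = layer(h1, w2)   # second hidden layer sees the first layer's outputs
    return beta[0] + sum(b * z for b, z in zip(beta[1:], h2))
```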
Figure 2.7. Predicted probability of loan acceptance from a 4 node network using only the age of the applicant. An x marks a loan that was actually approved, and an O a loan that was actually denied.
A feed-forward network is one that has a distinct ordering to its layers, so that the inputs to each layer are linear combinations of the outputs of the previous layer, as in Figure 2.4. Sometimes additional connections are made, connecting nonadjacent layers. For example, starting with our standard model, the third layer (the output prediction) could take as inputs a linear combination of both the hidden nodes (the second layer) and the explanatory variables (the first layer), which would give a model of the form

    ŷ_i = β_0 + α'x_i + Σ_{j=1}^{k} β_j Ψ(γ_j0 + Σ_{h=1}^{r} γ_jh x_ih),

which is often called a semiparametric model, because it contains both a parametric component (a standard linear regression, α'x_i) and a nonparametric component (the linear combination of location-scale logistic bases). Significantly more complex models can be made by allowing connections to go down layers as well, creating cycles in the graph of Figure 2.4.
For example, suppose there are four total levels, with two layers of hidden nodes; the inputs of the second row of hidden nodes are linear combinations of the first row of hidden nodes as usual, but the inputs of the first row of hidden nodes include both the explanatory variables as well as the outputs from the second row of hidden nodes, thus creating a cycle. Such networks require iterative solutions and substantially more computing time, and are not as commonly used.
Logistic basis functions are the most commonly used. In theory, any sigmoidal function can be used, as the proofs that the infinite set comprises a basis set depend only on the functions being sigmoidal. In practice, one of two sigmoidal functions is usually chosen, either the logistic or the hyperbolic tangent:

    tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}),

which is a simple transformation of the logistic function, tanh(z) = 2Ψ(2z) − 1. The historical threshold functions discussed in Section 2.3.2 are no longer commonly used.
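The identity tanh(z) = 2Ψ(2z) − 1 is easy to check numerically:

```python
import math

def logistic(z):
    # Psi in equation (2.7).
    return 1.0 / (1.0 + math.exp(-z))

def tanh_via_logistic(z):
    # The hyperbolic tangent as a location-scale change of the logistic:
    # tanh(z) = 2 * Psi(2z) - 1.
    return 2.0 * logistic(2.0 * z) - 1.0

for z in [-3.0, -0.5, 0.0, 1.0, 2.5]:
    assert abs(tanh_via_logistic(z) - math.tanh(z)) < 1e-12
```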
A completely different type of basis function is sometimes used, a radial basis function, which is usually called a kernel in statistics. Thus a radial basis network is basically another name for what statisticians call a mixture model, as described in equation (2.5). Neural network models, particularly radial basis function versions, are sometimes used for density estimation problems (see, for example, Bishop (1995)).
In our basic model, we use a linear output in that the final prediction is simply a linear combination of the outputs of the hidden layer. In other implementations, the logistic transformation may also be applied to the final prediction, producing predictions in the unit interval (which could be rescaled again if necessary). There does not seem to be any substantial advantage to doing so in a regression problem (but it is sometimes done nonetheless). Obviously, if one is fitting values restricted to an interval (e.g., probabilities for a classification problem), such an additional transformation can be quite useful.
2.4 The Bayesian Paradigm
The two main philosophies of probability and statistics are the classical and Bayesian approaches. While they share much in common, they have a fundamental difference in how they interpret probabilities and on the nature of unknown parameters. In the classical (or frequentist) framework, probabilities are typically interpreted as long-run relative frequencies. Unknown parameters are thought of as fixed quantities, so that there is a right answer, even if we will never know what it is. The data are used to find a best guess for the values of the unknown parameters.
Under the Bayesian framework, probabilities are seen as inherently subjective, so that, for each person, their probability statements reflect their personal beliefs about the relative likeliness of the possible outcomes. Unknown parameters are considered to be random, so that they also have probability distributions. Instead of finding a single best guess for the unknown parameters, the usual approach is to estimate the distribution of the parameters using Bayes' theorem. First, a prior distribution must be specified, and this distribution is meant to reflect the subject's personal beliefs about the parameters before the data are observed. In some cases, the prior will be constructed from previous experiments. This approach provides a mathematically coherent method for combining information from different sources.
In other cases, the practitioner may either not know anything about the parameters or may want to use a prior that would be generally acceptable to many other people, rather than a personal one. Such priors are discussed further in Section 3.3. Once the prior, P(θ), is specified for the parameters, θ, Bayes' theorem combines the prior with
the likelihood, f(X|θ), to get the posterior:

    P(θ|X) = f(X|θ) P(θ) / ∫ f(X|θ) P(θ) dθ.
One useful type of prior is a conjugate prior, one that when combined with the likelihood produces a posterior in the same family. For example, if the likelihood specifies that y_1, …, y_n ~ N(μ, 1), then using a normal prior for μ, μ ~ N(a, 1) for some constant a, is a conjugate choice, because the posterior for μ will also be normal; in particular,

    μ | y_1, …, y_n ~ N((a + Σ_{i=1}^{n} y_i) / (n + 1), 1 / (n + 1)).
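This conjugate normal update is a one-liner in code (the function name is ours):

```python
def normal_posterior(a, ys):
    # Prior: mu ~ N(a, 1).  Likelihood: y_i ~ N(mu, 1), iid.
    # Conjugacy gives posterior mu | y ~ N((a + sum(ys)) / (n + 1), 1 / (n + 1)).
    n = len(ys)
    return (a + sum(ys)) / (n + 1), 1.0 / (n + 1)

mean, var = normal_posterior(0.0, [1.0, 1.0, 1.0])
```

With a flat-at-zero prior (a = 0) and three observations equal to 1, the posterior mean 0.75 sits between the prior mean and the sample mean, and the posterior variance 0.25 shrinks as data accumulate.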
Note that the idea of conjugacy is likelihood dependent, in that a prior is conjugate for a particular likelihood. Conjugate priors are widely used for convenience, as they lead to analytically tractable posteriors.
In many cases, such as a neural network, there is no known conjugate prior. Instead, priors for individual parameters may be chosen to be conditionally conjugate, in that when all other parameters are conditioned upon (held fixed), the conditional posterior is of the same family as the conditional prior. For example, in Chapter 3, nearly all of the priors presented for neural networks will put a normal distribution on the β_j parameters (possibly conditioned upon the other parameters), which is conditionally conjugate. The main benefit of conditional conjugacy is that it allows one to use Gibbs sampling, as will be described in Section 3.5, which greatly helps in model fitting.
This book will assume that the reader has some previous knowledge of the Bayesian approach. For more details on the Bayesian approach, as well as for discussion of the philosophical merits and criticisms of the Bayesian approach, the reader is referred to some of the many other sources available (Robert (2001); Congdon (2001); Carlin and Louis (2000); Gelman et al. (1995); Bernardo and Smith (1994); Press (1989); Hartigan (1983); Jeffreys (1961)).
The approach in this book will be the Bayesian one. However, it is more of a pragmatic Bayesian approach, rather than a dogmatic subjectivist approach. As we shall see in Section 3.1, in most cases the parameters do not have straightforward interpretations, and thus it is not feasible to put subjective knowledge into a prior. In the event that the data analyst has substantive prior information, a model with more interpretable parameters should be used instead of a neural network. If something is known about the shape of the data, another more accessible model should be chosen. The strength of neural networks is their flexibility, as was described in Section 2.3. A key benefit of the Bayesian approach is a full accounting of uncertainty. By estimating a posterior distribution, one obtains estimates of error and uncertainty in the process. Such uncertainty can also encompass the choice of model, as will be discussed in the next section.
2.5 Model Building
A full analysis of any dataset involves many steps, starting with exploratory data analysis and moving on to formal model building. The process of choosing a model is typically an iterative one. Outside of the simplest problems, it is quite rare for the first model specified to be the final model. First, one must check the validity of the assumptions. Then one should see if the data suggest that a different model may be more appropriate. For example, when fitting a multiple linear regression, it is important to check that the residuals do follow the assumptions, and, if not, then usually a transformation is performed on the data to improve
the situation. Also important is to check that the right set of variables has been included in the model. Variables which are found to be irrelevant or redundant are removed. Variables which were not initially included but are later found to be helpful would be added. Thus a series of models will be investigated before settling upon a final model or set of models.
The same procedure applies to modeling via neural networks. It is important to check the key assumptions of the model. Residual plots should be made to verify normality, independence, and constant variance (homoscedasticity). Violations of any assumption call for a remedy, typically a transformation of one or more variables. Many of the guidelines for linear regression are applicable here as well.
Also important is the issue of model selection (or some alternative, such as model averaging; see Section 4.1). There are two parts to this issue when dealing with neural networks: choosing the set of explanatory variables and choosing the number of hidden nodes. (An additional complication could be choosing the network structure, if one considers networks beyond just single hidden layer feed-forward neural networks with logistic activation functions and a linear output, but that is beyond the scope of this book.)
First consider choosing an optimal set of explanatory variables. For a given dataset, using more explanatory variables will improve the fit. However, if the variables are not actually relevant, they will not help with prediction. Consider the mean square error for predictions, which has two components: the square of the prediction bias (expected misprediction) and the prediction variance. The use of irrelevant variables will have no effect on the prediction bias but will increase the prediction variance. Inclusion of unique relevant variables will reduce both bias and variance. Inclusion of redundant (e.g., correlated) variables will not change the prediction bias but will increase the variance. Thus the goal is to find all of the nonredundant useful explanatory variables, removing all variables which do not improve prediction.
The second aspect is selecting the optimal number of hidden nodes (or basis functions). Using more nodes allows a more complex fitted function. Fewer nodes lead to a smoother function. For a particular dataset, the more nodes used, the better the fit. With enough hidden nodes, the function can fit the data perfectly, becoming an interpolating function. However, such a function will usually perform poorly at prediction, as it fluctuates wildly in attempting to model the noise in the data in addition to the underlying trend. Just as with any other nonparametric procedure, an optimal amount of smoothing must be found so that the fitted function is neither overfitting (not smooth enough) nor underfitting (too smooth).
Both of these aspects of model selection will be discussed in Chapter 4. That chapter will investigate criteria for selection, as well as methods for searching through the space of possible models. It will also discuss some alternatives to the traditional approach of choosing only a single model.
Chapter 3

Priors for Neural Networks
One of the key decisions in a Bayesian analysis is the choice of prior. The idea is that one's prior should reflect one's current beliefs (either from previous data or from purely subjective sources) about the parameters before one has observed the data. This task turns out to be rather difficult for a neural network, because in most cases the parameters have no interpretable meaning, merely being coefficients in a nonstandard basis expansion (as described in Section 2.3). In certain special cases, the parameters do have intuitive meanings, as will be discussed in the next section. In general, however, the parameters are basically uninterpretable, and thus the idea of putting beliefs into one's prior is rather quixotic. The next two sections discuss several practical choices of priors. This is followed by a practical discussion of parameter estimation, a comparison of some of the priors in this chapter, and some theoretical results on asymptotic consistency.
3.1 Parameter Interpretation and Lack Thereof
In some specific cases, the parameters of a neural network have obvious interpretations. We will first look at these cases, and then a simple example will show how things can quickly go wrong and become virtually uninterpretable.

The first case is that of a neural network with only one hidden node and one explanatory variable. The model for the fitted values is

ŷ = β0 + β1 Ψ(γ0 + γ1 x),

where Ψ is the logistic function.
Figure 3.1 shows this fitted function for β0 = 2, β1 = 4, γ0 = −9, and γ1 = 15 over x in the unit interval. The parameter β0 is an overall location parameter for y, in that y = β0 when the logistic function is close to zero (in this case when x is near or below zero, so here β0 is the y-intercept). The parameter β1 is an overall scale factor for y, in that the logistic function ranges from 0 to 1 and thus β1 is the range of y, which in this case is 6 − 2 = 4. The γ parameters control the location and scale of the logistic function. The center of the logistic function occurs at −γ0/γ1, here 0.6. For larger values of γ1, y changes at a steeper rate in the neighborhood of the center, leveling off as x gets farther away from the center.

Figure 3.1. Fitted function with a single hidden node.
If γ1 is positive, then the logistic rises from left to right. If γ1 is negative, then the logistic will decrease as x increases.
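The single-node interpretation is easy to verify numerically. The sketch below uses the parameter values quoted for Figure 3.1, taking γ0 = −9 (the sign follows from the stated center of 0.6):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def yhat(x, beta0=2.0, beta1=4.0, gamma0=-9.0, gamma1=15.0):
    # One hidden node, one input: beta0 + beta1 * logistic(gamma0 + gamma1 * x).
    return beta0 + beta1 * logistic(gamma0 + gamma1 * x)

print(yhat(0.0))  # near beta0 = 2: the logistic is essentially 0 here (the y-intercept)
print(yhat(0.6))  # at the center -gamma0/gamma1 = 0.6, halfway up: 2 + 4/2 = 4
print(yhat(1.0))  # near beta0 + beta1 = 6: the logistic is essentially 1
```

Each evaluation matches the interpretation in the text: β0 sets the location, β1 the range, and the γ parameters the location and scale of the logistic.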
Note that a similar interpretation can be used when there are additional explanatory variables, with the γ0, γ1, and x above being replaced by a linear combination of the x's with coefficients γj, so that the logistic rises with respect to the entire linear combination.
This interpretation scheme can also be extended to additional logistic functions as long as their centers are spaced sufficiently far apart from each other. In that case, for any given value of x, it will be close to the center of at most one logistic, and it will be in the tails of all others (where a small change in x does not produce a noticeable change in y). Thus the parameters of each logistic function can be dealt with separately, using the previous interpretation locally.
Two logistics with opposite signs on γ1 can be placed so that they are somewhat but not too close to each other, and together they form a bump, jointly acting much like a kernel or a radial basis function. One could attempt to apply some of the interpretations of mixture models here, but in practice, fitted functions rarely behave in this manner.
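The bump construction can be sketched directly; the steepness and centers below are arbitrary illustrative choices, not values from the book. One logistic rises at one center and the other (with γ1 of the opposite sign) falls at a second center, so their sum bulges between the two:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, b=10.0, c1=0.4, c2=0.6):
    # Opposite signs on gamma1: one logistic rises at c1, the other falls at c2.
    return logistic(b * (x - c1)) + logistic(-b * (x - c2))

x = np.linspace(0.0, 1.0, 201)
y = bump(x)
print(x[np.argmax(y)])  # peak at the midpoint of the two centers, 0.5
print(y[0], y[-1])      # tails return to a baseline level of about 1
```

An overall intercept β0 would absorb the baseline, leaving a kernel-like bump centered between the two logistic centers.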
In most real-world problems, there will be some overlap of the logistic functions, which can lead to many other sorts of behavior in the neural network predictions, effectively removing any interpretability. A simple example is shown in Figure 3.2, which deals with a model with a single explanatory variable and only two hidden nodes (logistic functions).
This example involves fitting the motorcycle accident data of Silverman (1985), which tracks the acceleration force on the head of a motorcycle rider in the first moments after impact. Here a neural network with two hidden nodes was used, and the maximum likelihood fit is shown in Figure 3.2. This model fits the data somewhat well, although it does appear to oversmooth in the first milliseconds after the crash. The point is not that this is the best possible fit but rather that this particular fit, which is the maximum likelihood fit, displays interesting behavior with respect to the parameters of the model. Remember that there are only two logistic functions producing this fit, yet there are three points of inflection in the fitted curve. Thus the previous interpretations of the parameters no longer apply.
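That two logistic bases can produce three inflection points is easy to check numerically. The sketch below uses arbitrary illustrative centers and steepness, not the fitted motorcycle-data parameters, and counts sign changes of the second derivative of the fitted curve:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def d2_logistic(z):
    # Second derivative of the logistic: s(1 - s)(1 - 2s).
    s = logistic(z)
    return s * (1 - s) * (1 - 2 * s)

def count_inflections(f2_vals):
    """Count sign changes in a sampled second derivative."""
    s = np.sign(f2_vals)
    s = s[s != 0]  # drop exact zeros landing on grid points
    return int(np.sum(s[:-1] * s[1:] < 0))

x = np.linspace(0.0, 1.0, 2001)
b = 30.0

# A single logistic has exactly one inflection point, at its center.
one = b**2 * d2_logistic(b * (x - 0.45))
print(count_inflections(one))  # 1

# Two logistics whose ranges of action interact: the summed curvature
# changes sign three times, so the sum has three inflection points.
two = b**2 * d2_logistic(b * (x - 0.3)) + b**2 * d2_logistic(b * (x - 0.6))
print(count_inflections(two))  # 3
```

Between the two centers the curvatures of the logistics have opposite signs and must cancel somewhere, creating the extra pair of inflection points that the single-logistic interpretation cannot account for.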
What is going on here is that the two logistic functions have centers very close to each other, so that their ranges of action interact, and do so in a highly nonlinear fashion. These logistic bases are shown in