7/25/2019 Bayesian Nonparametrics via Neural Networks_Herbert K. H. Lee (SIAM 2004 106s)
Bayesian Nonparametrics via Neural Networks
ASA-SIAM Series on Statistics and Applied Probability

The ASA-SIAM Series on Statistics and Applied Probability is published jointly by the American Statistical Association and the Society for Industrial and Applied Mathematics. The series consists of a broad spectrum of books on topics in statistics and applied probability. The purpose of the series is to provide inexpensive, quality publications of interest to the intersecting membership of the two societies.
Editorial Board

Robert N. Rodriguez, SAS Institute Inc., Editor-in-Chief
David Banks, Duke University
H. T. Banks, North Carolina State University
Richard K. Burdick, Arizona State University
Joseph Gardiner, Michigan State University
Douglas M. Hawkins, University of Minnesota
Susan Holmes, Stanford University
Lisa LaVange, Inspire Pharmaceuticals, Inc.
Gary C. McDonald, Oakland University and National Institute of Statistical Sciences
Francoise Seillier-Moiseiwitsch, University of Maryland, Baltimore County
Lee, H. K. H., Bayesian Nonparametrics via Neural Networks
O'Gorman, T. W., Applied Adaptive Statistical Methods: Tests of Significance and Confidence Intervals
Ross, T. J., Booker, J. M., and Parkinson, W. J., eds., Fuzzy Logic and Probability Applications: Bridging the Gap
Nelson, W. B., Recurrent Events Data Analysis for Product Repairs, Disease Recurrences, and Other Applications
Mason, R. L. and Young, J. C., Multivariate Statistical Process Control with Industrial Applications
Smith, P. L., A Primer for Sampling Solids, Liquids, and Gases: Based on the Seven Sampling Errors of Pierre Gy
Meyer, M. A. and Booker, J. M., Eliciting and Analyzing Expert Judgment: A Practical Guide
Latouche, G. and Ramaswami, V., Introduction to Matrix Analytic Methods in Stochastic Modeling
Peck, R., Haugh, L., and Goodman, A., Statistical Case Studies: A Collaboration Between Academe and Industry, Student Edition
Peck, R., Haugh, L., and Goodman, A., Statistical Case Studies: A Collaboration Between Academe and Industry
Barlow, R., Engineering Reliability
Czitrom, V. and Spagon, P. D., Statistical Case Studies for Industrial Process Improvement
Bayesian Nonparametrics via Neural Networks

Herbert K. H. Lee
University of California, Santa Cruz
Santa Cruz, California

Society for Industrial and Applied Mathematics
Philadelphia, Pennsylvania

American Statistical Association
Alexandria, Virginia
Copyright © 2004 by the American Statistical Association and the Society for Industrial and Applied Mathematics.

10 9 8 7 6 5 4 3 2 1

All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.

No warranties, express or implied, are made by the publisher, authors, and their employers that the programs contained in this volume are free of error. They should not be relied on as the sole basis to solve a problem whose incorrect solution could result in injury to person or property. If the programs are employed in such a manner, it is at the user's own risk and the publisher, authors, and their employers disclaim all liability for such misuse.

Trademarked names may be used in this book without the inclusion of a trademark symbol. These names are used in an editorial context only; no infringement of trademark is intended.

S-PLUS is a registered trademark of Insightful Corporation.

SAS is a registered trademark of SAS Institute Inc.

This research was supported in part by the National Science Foundation (grants DMS 9803433, 9873275, and 0233710) and the National Institutes of Health (grant R01 CA54852-08).

Library of Congress Cataloging-in-Publication Data

Lee, Herbert K. H.
Bayesian nonparametrics via neural networks / Herbert K. H. Lee.
p. cm. -- (ASA-SIAM series on statistics and applied probability)
Includes bibliographical references and index.
ISBN 0-89871-563-6 (pbk.)
1. Bayesian statistical decision theory. 2. Nonparametric statistics. 3. Neural networks (Computer science) I. Title. II. Series.
QA279.5.L43 2004
519.5'42--dc22    2004048151

A portion of the royalties from the sale of this book are being placed in a fund to help students attend SIAM meetings and other SIAM-related activities. This fund is administered by SIAM and qualified individuals are encouraged to write directly to SIAM for guidelines.

SIAM is a registered trademark.
Contents

List of Figures
Preface

1 Introduction
1.1 Statistics and Machine Learning
1.2 Outline of the Book
1.3 Regression Example: Ground-level Ozone Pollution
1.4 Classification Example: Loan Applications
1.5 A Simple Neural Network Example

2 Nonparametric Models
2.1 Nonparametric Regression
2.1.1 Local Methods
2.1.2 Regression Using Basis Functions
2.2 Nonparametric Classification
2.3 Neural Networks
2.3.1 Neural Networks Are Statistical Models
2.3.2 A Brief History of Neural Networks
2.3.3 Multivariate Regression
2.3.4 Classification
2.3.5 Other Flavors of Neural Networks
2.4 The Bayesian Paradigm
2.5 Model Building

3 Priors for Neural Networks
3.1 Parameter Interpretation and Lack Thereof
3.2 Proper Priors
3.3 Noninformative Priors
3.3.1 Flat Priors
3.3.2 Jeffreys Priors
3.3.3 Reference Priors
3.4 Hybrid Priors
3.5 Model Fitting
3.6 Example: Comparing Priors
3.7 Asymptotic Consistency of the Posterior

4 Building a Model
4.1 Model Selection and Model Averaging
4.1.1 Modeling Versus Prediction
4.1.2 Bayesian Model Selection
4.1.3 Computational Approximations for Model Selection
4.1.4 Model Averaging
4.1.5 Shrinkage Priors
4.1.6 Automatic Relevance Determination
4.1.7 Bagging
4.2 Searching the Model Space
4.2.1 Greedy Algorithms
4.2.2 Stochastic Algorithms
4.2.3 Reversible Jump Markov Chain Monte Carlo
4.3 Ozone Data Analysis
4.4 Loan Data Analysis

5 Conclusions

A Reference Prior Derivation

Glossary
Bibliography
Index
List of Figures

1.1 Pairwise scatterplots for the ozone data
1.2 Estimated smooths for the ozone data
1.3 Correlated variables: age vs. current residence for loan applicants
1.4 A neural network fitted function
1.5 Simple neural network model diagram
2.1 A tree model for ozone using only wind speed and humidity
2.2 Example wavelets from the Haar family
2.3 Diagram of nonparametric methods
2.4 Neural network model diagram
2.5 Multivariate-response neural network model diagram
2.6 Probability of loan acceptance by only age, 6 nodes
2.7 Probability of loan acceptance by only age, 4 nodes
3.1 Fitted function with a single hidden node
3.2 Maximum likelihood fit for a two-hidden-node network
3.3 Logistic basis functions of the fit in Figure 3.2
3.4 DAG for the Muller and Rios Insua model
3.5 DAG for the Neal model
3.6 Comparison of priors
4.1 Fitted mean functions for several sizes of networks
4.2 Bagging fitted values
4.3 Fitted ozone values by day of year
4.4 Fitted ozone values by vertical height, humidity, pressure gradient, and inversion base temperature
4.5 Fitted ozone values by actual recorded levels
4.6 Probability of loan acceptance by time in current residence
4.7 Probability of loan acceptance by age
4.8 Probability of loan acceptance by income
Preface
When I first heard about neural networks and how great they were, I was rather skeptical. Being sold as a magical black box, there was enough hype to make one believe that they could solve the world's problems. When I tried to learn more about them, I found that most of the literature was written for a machine learning audience, and I had to grapple with a new perspective and a new set of terminology. After some work, I came to see neural networks from a statistical perspective, as a probability model. One of the primary motivations for this book was to write about neural networks for statisticians, addressing issues and concerns of interest to statisticians, and using statistical terminology. Neural networks are a powerful model, and should be treated as such, rather than disdained as a mere algorithm as I have found some statisticians do. Hopefully this book will prove to be illuminating.
The phrase "Bayesian nonparametrics" means different things to different people. The traditional interpretation usually implies infinite-dimensional processes such as Dirichlet processes, used for problems in regression, classification, and density estimation. While on the surface this book may not appear to fit that description, it is actually close. One of the themes of this book is that a neural network can be viewed as a finite-dimensional approximation to an infinite-dimensional model, and that this model is useful in practice for problems in regression and classification.
Thus the first section of this book will focus on introducing neural networks within the statistical context of nonparametric regression and classification. The rest of the book will examine important statistical modeling issues for Bayesian neural networks, particularly the choice of prior and the choice of model.

While this book will not assume the reader has any prior knowledge about neural networks, neither will it try to be an all-inclusive introduction. Topics will be introduced in a self-contained manner, with references provided for further details of the many issues that will not be directly addressed in this book.
The target audience for this book is practicing statisticians and researchers, as well as students preparing for either or both roles. This book addresses practical and theoretical issues. It is hoped that the users of neural networks will want an understanding of how the model works, which can lead to a better appreciation of knowing when it is working and when it is not. It will be assumed that the reader has already been introduced to the basics of the Bayesian approach, with only a brief review and additional references provided. There are a number of good introductory books on Bayesian statistics available (see Section 2.4), so it does not seem productive to repeat that material here. It will also be assumed that the reader has a solid background in mathematical statistics and in linear regression, such as that which would be acquired as part of a traditional Master's degree in statistics. However, there are few formal proofs, and much of the text should be accessible even without this background. Computational issues will be discussed at conceptual and algorithmic levels.
This work developed from my Ph.D. thesis ("Model Selection and Model Averaging for Neural Networks," Carnegie Mellon University, Department of Statistics, 1998). I am grateful for all the assistance and knowledge provided by my advisor, Larry Wasserman. I would also like to acknowledge the many individuals who have contributed to this effort, including David Banks, Jim Berger, Roberto Carta, Merlise Clyde, Daniel Cork, Scott Davies, Sigrunn Eliassen, Chris Genovese, Robert Gramacy, Rob Kass, Milovan Krnjajic, Meena Mani, Daniel Merl, Andrew Moore, Peter Muller, Mark Schervish, Valerie Ventura, and Kert Viele, as well as the staff at SIAM and a number of anonymous referees and editors, both for this book and for the papers preceding it. At various points during this work, funding has been provided by the National Science Foundation (grants DMS 9803433, 9873275, and 0233710) and the National Institutes of Health (grant R01 CA54852-08).
Chapter 1

Introduction
The goal of this book is to put neural network models firmly into a statistical framework, treating them with the accompanying rigor normally accorded to statistical models. A neural network is frequently seen as either a magical black box or purely as a machine learning algorithm, when in fact there is a definite probability model behind it. This book will start by showing how neural networks are indeed a statistical model for doing nonparametric regression or classification. The focus will be on a Bayesian perspective, although many of the topics will apply to frequentist models as well. As much of the literature on neural networks appears in the computer science realm, many standard modeling questions fail to get addressed. In particular, this book will take a hard look at key modeling issues such as choosing an appropriate prior and dealing with model selection. Most of the existing literature deals with neural networks as an algorithm. The hope of this book is to shift the focus back to modeling.
1.1 Statistics and Machine Learning
The fields of statistics and machine learning are two approaches toward the same goals, with much in common. In both cases, the idea is to learn about a problem from data. In most cases this is either a classification problem, a regression problem, an exploratory data analysis problem, or some combination of the above. Where statistics and machine learning differ most is in their perspective. It is sort of like two people, one standing outside of an airplane and one standing inside the same plane, both asked to describe this plane. The person outside might discuss the length, the wingspan, the number of engines and their layout, and so on. The person inside might comment on the number of rows of seats, the number of aisles, the seat configurations, the amount of overhead storage space, the number of lavatories, and so on. In the end, they are both describing the same plane. The relationship between statistics and machine learning is much like this situation. As much of the terminology differs between the two fields, towards the end of this book a glossary is provided for translating relevant machine learning terms into statistical terms.
As a bit of an overgeneralization, the field of statistics and the methods that come out of it are based on probability models. At the heart of almost all analyses there is some distribution describing a random quantity, and the methodology springs from the probability model. For example, in simple linear regression, we relate the response y to the explanatory variable x via the conditional distribution y ~ N(β0 + β1 x, σ²), using possibly unknown parameters β0, β1, and σ². The core of the model is the probability model.
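This probability-model view can be made concrete with a short simulation. The sketch below assumes nothing beyond the N(β0 + β1 x, σ²) model just described; the particular parameter values, seed, and sample size are arbitrary, chosen only for illustration. Data are drawn from the conditional distribution, and least squares is then used to recover the generating parameters.

```python
import numpy as np

# The probability model behind simple linear regression:
#     y ~ N(beta0 + beta1 * x, sigma^2).
# Parameter values below are arbitrary, for demonstration only.
rng = np.random.default_rng(0)
beta0, beta1, sigma = 2.0, 0.5, 0.3

x = rng.uniform(0, 10, size=500)
y = rng.normal(beta0 + beta1 * x, sigma)   # draws from the conditional distribution

# Least squares recovers the parameters of the generating model.
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # close to (2.0, 0.5)
```

The methodology (here, least squares) springs directly from the probability model: the estimates are optimal precisely because the Gaussian conditional distribution is assumed.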
For a matching overgeneralization, machine learning can be seen as the art of developing algorithms for learning from data. The idea is to devise a clever algorithm that will perform well in practice. It is a pragmatic approach that has produced many useful tools for data analysis.

Many individual data analysis methods are associated only with the field in which they were developed. In some cases, methods have been invented in one field and reinvented in the other, with the terminology remaining completely different, and the research remaining separate. This is unfortunate, since both fields have much to learn from each other. It is like the airplane example, where a more complete description of the plane is available when both observers cooperate.
Traditionally, neural networks are seen as a machine learning algorithm. They were largely developed by the machine learning community (or the precursor community, artificial intelligence); see Section 2.3.2. In most implementations, the focus is on prediction, using the neural network as a means of finding good predictions, i.e., as an algorithm. Yet a neural network is also a statistical model. Chapter 2 will show how it is simply another method for nonparametric regression or classification. There is an underlying probability model, and, with the proper perspective, a neural network is seen as a generalization of linear regression. Further discussion is found in Section 2.3.1.
For a more traditional approach to neural networks, the reader is referred to Bishop (1995) or Ripley (1996), two excellent books that are statistically accurate but have somewhat of a machine learning perspective. Fine (1999) is more statistical in flavor but mostly non-Bayesian. Several good (but not Bayesian) statistical review articles are also available (Cheng and Titterington (1994); Stern (1996); Warner and Misra (1996)). A final key reference is Neal (1996), which details a fully Bayesian approach to neural networks from a perspective that combines elements from both machine learning and statistics.
1.2 Outline of the Book
This section is meant to give the reader an idea of where this book is going from here. The next two sections of this chapter will introduce two data examples that will be used throughout the book, one for regression and one for classification. The final section of this chapter gives a brief introduction to a simple neural network model, to provide some concreteness to the concept of a neural network, before moving on. Chapter 2 provides a statistical context for neural networks, showing how they fit into the larger framework of nonparametric regression and classification. It will also give a brief review of the Bayesian approach and of concepts in model building. Chapter 3 will focus on choosing a prior, which is a key item in a Bayesian analysis. Also included are details on model fitting and a result on posterior asymptotic consistency. Chapter 4 gets into the nuts and bolts of model building, discussing model selection and alternatives such as model averaging, both at conceptual and implementational levels.
1.3 Regression Example: Ground-level Ozone Pollution
An example that will be used throughout this book to illustrate a problem in nonparametric regression is a dataset on ground-level ozone pollution. This dataset first appeared in Breiman and Friedman (1985) and was analyzed extensively by Hastie and Tibshirani (1990). Other authors have also used this dataset, so it is useful to be able to compare results from methods in this book with other published results.
The data consist of ground-level ozone measurements in the Los Angeles area over the course of a year, along with other meteorological variables. The goal is to use the meteorological covariates to predict ozone concentration, which is a pollutant at the level of human activity. The version of the data used here is that in Hastie and Tibshirani (1990), with missing data removed, so that there are complete measurements for 330 days in 1976. For each of these days, the response variable of interest is the daily maximum one-hour-average ozone level in parts per million at Upland, California. Also available are nine possible explanatory variables: VH, the altitude at which the pressure is 500 millibars; WIND, the wind speed (mph) at Los Angeles International Airport (LAX); HUM, the humidity (%) at LAX; TEMP, the temperature (degrees F) at Sandburg Air Force Base; IBH, the temperature inversion base height (feet); DPG, the pressure gradient (mm Hg) from LAX to Daggett; IBT, the inversion base temperature (degrees F) at LAX; VIS, the visibility (miles) at LAX; and DAY, the day of the year (numbered from 1 to 365).
The data are displayed in pairwise scatterplots in Figure 1.1. The first thing to notice is that most variables have a strong nonlinear association with the ozone level, making prediction feasible but requiring a flexible model to capture the nonlinear relationship. Fitting a generalized additive model (GAM; local smoothed functions for each variable without interaction effects), as described in Section 2.1.1 or in Hastie and Tibshirani (1990), produces the fitted relationships between the explanatory variables (rescaled to the unit interval) and ozone displayed in Figure 1.2. Most variables display a strong relationship with ozone, and all but the first are clearly nonlinear. This problem is thus an example of nonparametric regression, in that some smoothing of the data is necessary, but we must determine how much smoothing is optimal. We will apply neural networks to this problem and compare the results to other nonparametric techniques.
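To give a flavor of the local smoothing such models perform, here is a minimal Nadaraya-Watson kernel smoother of the kind discussed in Section 2.1.1. The ozone data themselves are not reproduced in the text, so a synthetic nonlinear relationship (loosely imitating ozone versus DAY over 330 days) stands in; the bandwidth parameter plays exactly the role of the "how much smoothing" choice discussed above.

```python
import numpy as np

# Synthetic stand-in for a nonlinear covariate-response relationship:
# a seasonal signal over the year plus noise (not the actual ozone data).
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 365, 330))
y = 10 + 8 * np.sin(2 * np.pi * x / 365) + rng.normal(0, 1.5, x.size)

def kernel_smooth(x0, x, y, bandwidth):
    """Gaussian-kernel weighted average of y near x0; the bandwidth
    controls how much smoothing is applied."""
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
    return np.sum(w * y) / np.sum(w)

# Evaluate the smooth on a grid of days.
grid = np.linspace(0, 365, 50)
fit = np.array([kernel_smooth(g, x, y, bandwidth=15.0) for g in grid])
```

A small bandwidth tracks the data closely (risking overfitting); a large one flattens the seasonal shape. Choosing it well is the central problem of nonparametric smoothing.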
Another notable feature of this dataset is that many variables are highly related to each other, for example VH, TEMP, and IBT, as can be seen in Figure 1.1. It is important to deal with this multicollinearity to avoid overfitting and reduce the variance of predictions. The favored approaches in this book include selection and model averaging. We shall see that neural networks perform well when compared to existing methods in the literature.
1.4 Classification Example: Loan Applications
As an example of a classification problem, a dataset that we shall revisit comes from consumer banking. One question of interest is classifying loan applications for acceptance or rejection. The high correlation between explanatory variables presents certain challenges. There is also direct interest in the problem of model selection, as will be explained below.
Figure 1.1. Pairwise scatterplots for the ozone data.

Figure 1.2. Estimated smooths for the ozone data.

Historically, banking was a local and personal operation. When a customer wanted a loan, they went to their local bank branch, where they knew the staff, and they applied for a loan. The branch manager would then approve or deny the loan based on a combination of the information in the application and their personal knowledge of the customer. As banking grew in scale, the processing of loan applications moved to centralized facilities and was done by trained loan officers who specialized in making such decisions. The personal connection was gone, but there was still a human looking over each application and making his or her decision based on his or her experience as a loan officer. Recently, banks have been switching to partially or completely automated systems for dealing with most loan applications, basing their decisions on algorithms derived from statistical models. This process has been met with much resistance from the loan officers, who believe that the automated processes will mishandle the nonstandard items on many applications and who are also worried about losing their jobs.
A regional bank wanted to streamline this process while retaining the human decision element. In particular, the bank wanted to find out what information on the application was most important, so that it could greatly simplify the form, making life easier for the customer and creating cost savings for the bank. Thus the primary goal here is one of variable selection. In the process one must find a model that predicts well, but the ultimate goal is to find a subset of explanatory variables that is hopefully not too large and can be used to predict as accurately as possible. Even from a statistical standpoint of prediction, any model would need to be concerned with the large amount of correlation between explanatory variables, so model selection would be of interest to help deal with the multicollinearity. We turn to nonparametric models for classification in the hope that by maximizing the flexibility of the model structure, we can minimize the number of covariates necessary for a good fit. In some sections of this book we will focus on the intermediate goal of predicting whether the loan was actually approved or denied by a human loan officer. In other sections we will be more concerned with variable selection. This dataset involves the following 23 covariates:
1. Birthdate
2. Length of time in current residence
3. Length of time in previous residence
4. Length of time at current employer
5. Length of time at previous employer
6. Line utilization of available credit
7. Number of inquiries on credit file
8. Date of oldest entry in credit file
9. Income
10. Residential status
11. Monthly mortgage payments
12. Number of checking accounts at this bank
13. Number of credit card accounts at this bank
14. Number of personal credit lines at this bank
15. Number of installment loans at this bank
16. Number of accounts at credit unions
17. Number of accounts at other banks
18. Number of accounts at finance companies
19. Number of accounts at other financial institutions
20. Budgeted debt expenses
21. Amount of loan approved
22. Loan type code
23. Presence of a co-applicant
Figure 1.3. Correlated variables: age vs. current residence for loan applicants.

These variables fall mainly into three groups: stability, demographic, and financial. Stability variables include such items as the length of time the applicant has worked in his or her current job. Stability is thought to be positively correlated with intent and ability to repay a loan; for example, a person who has held a given job longer is less likely to lose it (and the corresponding income) and be unable to repay a loan. A person who has lived in his or her current residence for longer is less likely to skip town suddenly, leaving behind an unpaid loan. Demographic variables include items like age. In the United States, it is illegal to discriminate against older people, but younger people can be discriminated against. Many standard demographic variables (e.g., gender) are not legal for use in a loan decision process and are thus not included. Financial variables include the number of other accounts at this bank and at other financial institutions, as well as the applicant's income and budgeted expenses, and are of obvious interest to a loan officer. The particular loans involved are unsecured (no collateral) personal loans, such as a personal line of credit or a vacation loan.
The covariates in this dataset are highly correlated. In some cases, there is even causation. For example, someone with a mortgage will be a homeowner. Another example is that a person cannot have lived in their current residence for more years than they are old. Figure 1.3 shows a plot of age versus length of time the applicant has reported living in their current residence. Clearly all points must fall below the 45-degree line. Any statistical analysis must take this correlation into account, and model selection is an ideal approach.
A standard approach in the machine learning literature is to split the data into two sets: a training set, which is used for fitting the model, and a test set, for which the fitted model makes predictions, and one then compares the accuracy of the predictions to the true values. Since the loan dataset contains 8508 cases, it is large enough to split into a training set of 4000 observations and a test set of the remaining 4508 observations. 4000 observations are enough to fit complex models, and the test set allows one to see if the fitted model is overfitting (fitting the training set quite well but not able to fit the test set well), or if it can predict well on out-of-sample observations, which is typically the desired goal.
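The split described above can be sketched as follows. The loan data themselves are not public, so only the index bookkeeping is shown, and the random seed is arbitrary; this is an illustration of the mechanics, not the actual partition used in the book.

```python
import numpy as np

# Train/test split: 8508 cases divided into a training set of 4000
# and a test set of the remaining 4508.
rng = np.random.default_rng(2)
n_cases, n_train = 8508, 4000

perm = rng.permutation(n_cases)   # a random ordering of the case indices
train_idx = perm[:n_train]        # cases used to fit the model
test_idx = perm[n_train:]         # held out for out-of-sample prediction

# A model would be fit on the training cases, and its predictions
# compared with the true outcomes on the 4508 held-out test cases.
```

Because the test cases play no role in fitting, the test-set error is an honest estimate of out-of-sample performance, which is exactly the check for overfitting described above.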
This is a messy dataset, and no model will be able to predict with great accuracy on cases outside the training set. Most variables are self-reported, so there is potential measurement error for the explanatory variables. There is also arbitrary rounding in the self-reporting. For example, residence times are typically reported to the nearest month for short times, but to the nearest year or nearest five years for people who have lived in the same place for a longer time. This structure is apparent in Figure 1.3, where the horizontal lines in the picture correspond to whole years near the bottom and clearly delineate every five years in the middle. Some information is incomplete, and there are a number of subjective factors known to influence loan officers which are not coded in the data, so that one cannot hope for too precise a fit on this dataset. For example, data on co-applicants was incomplete, and it was not possible to use those variables because of the very large amount of missing data for cases where there was known to be a co-applicant. A trained loan officer can make much better sense of the incomplete co-applicant data, and this is likely a source of some of the error in the models that will be seen in this book. From talking to loan officers, it seems that the weight given to the presence and the attributes of a co-applicant is based largely on factors that are not easily quantified, such as the reliability of the income source of the primary applicant (e.g., some fields of self-employment are deemed better than others, and some employers are known for having more or less employee turnover; an applicant in a less stable job would have more need for a co-applicant).
1.5 A Simple Neural Network Example
In this section, the basic idea of a neural network is introduced via a simple example. More details about neural networks will appear in the following chapters. In particular, Chapter 2 will demonstrate how neural networks can be viewed as a nonparametric model and give some history, and Chapter 3 will explain the interpretation, and lack thereof, of the parameters in the model, as well as methods for fitting the model.
For the ozone data introduced in Section 1.3, consider fitting a model relating ozone levels to the day of the year (coded as 1 through 365). A possible fitted model would be a neural network with two nodes, and one example of such a fit appears in Figure 1.4. The fitted line in the plot has the form

ŷ = β0 + β1 ψ(a1 + b1 x) + β2 ψ(a2 + b2 x), (1.1)

where x is the day of the year; the fitted values of the parameters are discussed below.
How can we make sense of this equation? Neural networks are typically thought of in terms of their hidden nodes. This is best illustrated with a picture. Figure 1.5 displays our simple model, with a single explanatory variable (day of year) and two hidden nodes.
Figure 1.5. Simple neural network model diagram.
Starting at the bottom of the diagram, the explanatory variable (input node) feeds its value to each of the hidden nodes, which then transform it and feed those results to the prediction (output) node, which combines them to give the fitted value at the top. In particular, what each hidden node does is to multiply the value of the explanatory variable by a scalar parameter (b), add a scalar (a), and then take the logistic transformation of it:

ψ(a + b x) = 1 / (1 + exp(-(a + b x))). (1.2)
Soin
equation 1.1),
the first
node
hasa = 21.75an db =0.19. Theb
coefficients
are
shown as the
labels
of the
edges from
the
input node
to the
hidden nodes
in
Figure
1.5 a is
no t
shown).
The prediction node then takes a linear combination of the results of all of the hidden nodes to produce the fitted value. In the figure, the linear coefficients for the hidden nodes are the labels of the edges from the hidden nodes to the prediction node. Returning to equation (1.1), the first node is weighted by 10.58, the second by 13.12, and then 7.13 is added. This results in a fitted value for each day of the year.
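The two-node fit described above can be sketched in code. This is a minimal illustration, not the book's fitted model: the weight and intercept magnitudes come from the text, but their signs (and the second node's intercept entirely) are our assumptions, chosen so the curve rises near day 114 and falls near day 280.

```python
import math

def logistic(z):
    # Equation (1.2): shift by a, scale by b, then the logistic transformation.
    return 1.0 / (1.0 + math.exp(-z))

def two_node_net(x, beta0, beta1, beta2, a1, b1, a2, b2):
    # Prediction node: intercept plus a linear combination of the two hidden nodes.
    z1 = logistic(a1 + b1 * x)
    z2 = logistic(a2 + b2 * x)
    return beta0 + beta1 * z1 + beta2 * z2

# Magnitudes follow the text; the signs of a1 and beta2, and the value of a2,
# are assumptions placing the rise near day 114 and the fall near day 280.
params = dict(beta0=7.13, beta1=10.58, beta2=-13.12,
              a1=-21.75, b1=0.19, a2=-53.2, b2=0.19)

fitted = [two_node_net(day, **params) for day in range(1, 366)]
```

Evaluating the function over days 1 through 365 reproduces the qualitative shape of the plot: roughly constant early in the year, a rise centered near day 114, and a decline centered near day 280.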
To make a bit more sense of the plot, note that x = 114 is the inflection point for the first rise in the graph, and this occurs when a + bx = 0 in equation (1.2). So the first rise is the action of the first hidden node. Similarly, the fitted function decreases because of the second hidden node, where the center of the decrease is around x = 280 = −a/b for that node. We will return to the interpretations of the parameters in Chapter 3.
Chapter 2

Nonparametric Models
Most standard statistical techniques are parametric methods, meaning that a particular family of models, indexed by one or more parameters, is chosen for the data, and the model is fit by choosing optimal values for the parameters (or finding their posterior distribution). Examples include linear regression (with slope and intercept parameters) and logistic regression (with the parameters being the coefficients). In these cases, it is assumed that the choice of model family (e.g., a linear relationship with independent Gaussian error) is the correct family, and all that needs to be done is to fit the coefficients.
The idea behind nonparametric modeling is to move beyond restricting oneself to a particular family of models, instead utilizing a much larger model space. For example, the goal of many nonparametric regression problems is to find the continuous function that best approximates the random process without overfitting the data. In this case, one is not restricted to linear functions, or even differentiable functions. The model is thus nonparametric in the sense that the set of possible models under consideration does not belong to a family that can be indexed by a finite number of parameters.
In practice, instead of working directly with an infinite-dimensional space, such as the space of continuous functions, various classes of methods have been developed to approximate the space of interest. The two classes that will be developed here are local approximations and basis representations. There are many additional methods that do not fit into these categories, but these two are sufficient for providing the context of how neural networks fit into the bigger picture of nonparametric modeling. Further references on nonparametric modeling are widely available, for example, Härdle (1990), Hastie, Tibshirani, and Friedman (2001), Duda, Hart, and Stork (2001), and Gentle (2002).
In this chapter, we will first examine a number of different nonparametric regression methods, with particular attention on those using a basis representation. Next will be a discussion of classification methods, which are often simple extensions of regression techniques. Within this framework, we will introduce neural network models, which will be shown to fit right in with the other basis representation methods. A brief review of the Bayesian paradigm follows, along with a discussion of how it relates to neural networks in particular. Finally, there will be a discussion of model building within the context of nonparametric modeling, which will also set the stage for the rest of the book.
2.1 Nonparametric Regression
The typical linear regression problem is to find a vector of coefficients to maximize the likelihood of the model

    y_i = β_0 + β_1 x_i1 + · · · + β_r x_ir + ε_i,    (2.1)

where y_i is the ith case of the response variable, x_i = (x_i1, …, x_ir) is the vector of corresponding values of the explanatory variables, and the residuals, ε_i, are iid Gaussian with mean zero and a common variance. The assumption of a straight line (or a hyperplane) fit may be overly restrictive. Even a family of transformations may be insufficiently flexible to fit many datasets. Instead, we may want a much richer class of possible response functions.
The typical nonparametric regression model is of the form

    y_i = f(x_i) + ε_i,    (2.2)

where f ∈ F, some class of regression functions, x_i is the vector of explanatory variables, and ε_i is iid additive error with mean zero and constant variance, usually assumed to have a Gaussian distribution (as we will here). The main distinction between the competing nonparametric methods is the class of functions, F, to which f is assumed to belong.
The common feature is that functions in F should be able to approximate a very large range of functions, such as the set of all continuous functions, or the set of all square-integrable functions. We will now look at a variety of different ways to choose F and hence a nonparametric regression model. This section focuses only on the two classes of methods that most closely relate to neural networks: local methods and basis representations.
Note that it is not critical for the flow of this book for the reader to understand all of the details in this section; the variety of methods is provided as context for how neural networks fit into the statistical literature. Descriptions of the methods here are rather brief, with references provided for further information. While going through the methods in this chapter, the reader may find Figure 2.3 helpful in understanding the relationships between these methods.
2.1.1 Local Methods
Some of the simplest approaches to nonparametric regression are those which operate locally. In one dimension, a moving average with a fixed window size would be an obvious example that is both simple and local. With a fixed window, the moving average estimate is a step function. More sophisticated things can be done in terms of the choice of window, the weighting of the averaging, or the shape of the fit over the window, as demonstrated by the following methods.
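The fixed-window moving average just described can be sketched in code (a minimal illustration; the function name and data are ours):

```python
def moving_average_fit(xs, ys, width):
    # Local estimate at x: average the responses whose x-values fall inside a
    # fixed window centered at x. With a fixed window this is a step function.
    def estimate(x):
        in_window = [y for xi, y in zip(xs, ys) if abs(xi - x) <= width / 2]
        return sum(in_window) / len(in_window) if in_window else None
    return estimate

fit = moving_average_fit([1, 2, 3, 4, 5], [2.0, 4.0, 6.0, 8.0, 10.0], width=2)
```

Replacing the plain average by a weighted one gives the kernel smoothers discussed next.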
The idea of kernel smoothing is to use a moving weighted average, where the weight function is referred to as a kernel. A kernel is a continuous, bounded, symmetric function whose integral is one. The simplest kernel would be the indicator function over the interval from −1/2 to 1/2, i.e., K(x) = I_[−1/2,1/2](x).

The single hidden layer neural network regression model is

    ŷ_i = β_0 + Σ_{j=1}^{k} β_j Ψ(γ_j' x_i),    (2.6)

where Ψ is the logistic function:

    Ψ(z) = 1 / (1 + exp(−z)).    (2.7)
In words, a neural network is usually described in terms of its hidden nodes. Each hidden node can be thought of as a function, taking a linear combination of the explanatory variables (γ_j'x_i) as an input and returning the logistic transformation (equation (2.7)) as its output. The neural network then takes a linear combination of the outputs of the hidden nodes to give the fitted value at a point. Figure 2.4 shows this process pictorially. (Note that this diagram is an expanded version of Figure 1.5.) Combining equations (2.6) and (2.7) and
Figure 2.4. Neural network model diagram.
expanding the vector notation yields the following equation, which shows the model in its full glory (or goriness):

    ŷ_i = β_0 + Σ_{j=1}^{k} β_j Ψ(γ_j0 + Σ_{h=1}^{r} γ_jh x_ih),

where k is the number of basis functions and r is the number of explanatory variables. The parameters of this model are k (the number of hidden nodes), β_j for j ∈ {0, …, k}, and γ_jh for j ∈ {1, …, k}, h ∈ {0, …, r}. For a fixed network size (fixed k), the number of parameters, d, is

    d = k(r + 1) + (k + 1),

where the first term in the sum is the number of γ_jh and the second is the number of β_j.
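The expanded model and its parameter count can be sketched in code (function names are ours):

```python
import math

def logistic(z):
    # Psi in equation (2.7).
    return 1.0 / (1.0 + math.exp(-z))

def neural_net(x, beta, gamma):
    # x: the r explanatory variables; beta: [beta_0, ..., beta_k];
    # gamma: k rows [gamma_j0, ..., gamma_jr], intercept first.
    hidden = [logistic(g[0] + sum(gh * xh for gh, xh in zip(g[1:], x)))
              for g in gamma]
    return beta[0] + sum(bj * z for bj, z in zip(beta[1:], hidden))

def n_params(k, r):
    # k(r + 1) gammas plus (k + 1) betas, as in the text.
    return k * (r + 1) + (k + 1)
```

For the two-node, one-variable ozone example of Section 1.5, this gives d = 2(1 + 1) + (2 + 1) = 7 parameters.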
It is clear from equation (2.6) that a neural network is simply a nonparametric regression using a basis representation (compare to equation (2.4)), with the Ψ_j's, location-scale logistic functions, as the bases and the β_j's as their coefficients. The (infinite) set of location-scale logistic functions does span the space of continuous functions, as well as the space of square-integrable functions (Cybenko (1989); Funahashi (1989); Hornik, Stinchcombe, and White (1989)). While these bases are not orthogonal, they have turned out to be quite useful in practice, with a relatively small number able to approximate a wide range of functions.
The key to understanding a neural network model is to think of it in terms of basis functions. Interpretation of the individual parameters will be deferred to Chapter 3. For now, we note that if we use a single basis function (hidden node) and restrict β_0 = 0 and β_1 = 1, then this simple neural network is exactly the same as a standard logistic regression model. Thus a neural network can be interpreted as a linear combination of logistic regressions.
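This special case is easy to verify in a sketch (our own function names): with β_0 = 0 and β_1 = 1, the one-node network output coincides with the logistic regression mean.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def one_node_net(x, gamma):
    # One hidden node with beta_0 = 0 and beta_1 = 1 (gamma holds the
    # intercept first): the network output reduces to logistic(gamma' x).
    return 0.0 + 1.0 * logistic(gamma[0] + sum(g * xh for g, xh in zip(gamma[1:], x)))

def logistic_regression_mean(x, gamma):
    # Standard logistic regression mean with the same coefficients.
    return logistic(gamma[0] + sum(g * xh for g, xh in zip(gamma[1:], x)))
```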
2.3.1 Neural Networks Are Statistical Models
From equation (2.6), it is obvious that a neural network is a standard parametric model, with a likelihood and parameters to be fit. When compared to the models in Section 2.1.2, one can see how neural networks are just another example of a basis function method for nonparametric regression (classification will similarly be discussed shortly). While the form of equation (2.6) may not be the most intuitive expression in the world, it is a special case of equations (2.2) and (2.4) and is clearly a model. It is not magic, and it is only a black box if the user wants to pretend they have never looked at equation (2.6). While some people have claimed that neural networks are purely algorithms and not models, it should now be apparent that they are both algorithms and models.
By viewing neural networks as statistical models, we can now apply many other ideas in statistics in order to understand, improve, and appropriately use these models. The disadvantage of viewing them as algorithms is that it can be difficult to apply knowledge from other algorithms. Taking the model-based perspective, we can be more systematic in discussing issues such as choosing a prior, building a model, checking assumptions for the validity of the model, and understanding uncertainty in our predictions.
It is also worth noting that a neural network can be viewed as a special case of projection pursuit regression (equation (2.3)) where the arbitrary smooth functions are restricted to scaled logistic functions. Furthermore, in the limit with an arbitrarily large number of basis functions, a neural network can be made to converge to a Gaussian process model (Neal (1996)).
2.3.2 Brief History of Neural Networks
While neural networks are statistical models, they have mostly been developed from the algorithmic perspective of machine learning. Neural networks were originally created as an attempt to model the act of thinking by modeling neurons in a brain. Much of the early work in this area traces back to a paper by McCulloch and Pitts (1943), which introduced the idea of an activation function, although the authors used a threshold (indicator) function rather than the sigmoidal functions common today (an S-shaped function that has horizontal asymptotes in both directions from its center and rises smoothly and monotonically between its asymptotes, e.g., equation (2.7)).
This particular model did not turn out to be appropriate for modeling brains but did eventually lead to useful statistical models. Modern neural networks are sometimes referred to as artificial neural networks to emphasize that there is no longer any explicit connection to biology.
Early networks of threshold function nodes were explored by Rosenblatt (1962) (calling them perceptrons, a term that is sometimes still used for nodes) and Widrow and Hoff (1960) (calling them adalines). Threshold activations were found to have severe limitations (Minsky and Papert (1969)), and thus sigmoidal activations became widely used instead (Anderson (1982)).
Much of the recent work on neural networks stems from renewed interest generated by Rumelhart, Hinton, and Williams (1986) and their backpropagation algorithm for fitting the parameters of the network. A number of key papers followed, including Funahashi (1989), Cybenko (1989), and Hornik, Stinchcombe, and White (1989), that showed that neural networks are a way to approximate a function arbitrarily closely as the number of hidden nodes gets large. Mathematically, they have shown that the infinite set of location-scale logistic functions is indeed a basis set for many common spaces, such as the space of continuous functions, or square-integrable functions.
2.3.3 Multivariate Regression
To extend the above model to a multivariate y, we simply treat each dimension of y as a separate output and add a set of connections from each hidden node to each of the dimensions of y. In this implementation, we assume that the error variance is the same for all dimensions, although this would be easily generalized to separate variances for each component. Denote the vector of the ith observation as y_i, so that the gth component (dimension) of y_i is denoted y_ig. The model is now

    y_ig = β_0g + Σ_{j=1}^{k} β_jg Ψ(γ_j0 + Σ_{h=1}^{r} γ_jh x_ih) + ε_ig.

Thus each dimension of the output is modeled as a different linear combination of the same basis functions. This is displayed pictorially in Figure 2.5.
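The multivariate extension can be sketched as follows (names ours): each output dimension applies its own linear combination to the shared hidden-node values.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def multivariate_net(x, betas, gammas):
    # gammas: k rows [gamma_j0, ..., gamma_jr] defining the shared hidden nodes;
    # betas: one row [beta_0g, ..., beta_kg] per output dimension g.
    hidden = [logistic(g[0] + sum(gh * xh for gh, xh in zip(g[1:], x)))
              for g in gammas]
    # Each output dimension is its own linear combination of the same bases.
    return [b[0] + sum(bj * z for bj, z in zip(b[1:], hidden)) for b in betas]

out = multivariate_net([0.0], betas=[[0.0, 1.0], [1.0, 2.0]], gammas=[[0.0, 1.0]])
```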
Figure 2.5. Multivariate response neural network model diagram.
2.3.4 Classification
The multivariate response model is easily extended to the problem of classification. There are several possible approaches, two of which will be discussed here. The first is the more common approach of transforming the output to the probability scale, leading to a standard multinomial likelihood. The second approach uses latent variables, retaining a Gaussian likelihood of sorts, leaving it much closer to the regression model.
For the multinomial likelihood approach, the key to extending this model to classification is to express the categorical response variable as a vector of indicator (dummy) variables. A multivariate neural network is then fit to this vector of indicator variables, with the outputs transformed to a probability scale so that the multinomial likelihood can be used. Let t_i be the categorical response (the target), with its value being the category number to which case i belongs, t_i ∈ {1, …, q}, where q is the number of categories. Note that the ordering of the categories does not matter in this formulation. Let y_i be a vector in the alternate representation of the response (a vector of indicator variables), where y_ig = 1 when g = t_i, and zero otherwise.
For example, suppose we are trying to classify tissue samples into the categories benign tumor, malignant tumor, and no tumor, and our observed dataset (the training set) is {benign, none, malignant, benign, malignant}. We have three categories, so q = 3, and we get the following recoding:

    t_1 = 1 (benign)       y_1 = (1, 0, 0)
    t_2 = 3 (none)         y_2 = (0, 0, 1)
    t_3 = 2 (malignant)    y_3 = (0, 1, 0)
    t_4 = 1 (benign)       y_4 = (1, 0, 0)
    t_5 = 2 (malignant)    y_5 = (0, 1, 0)
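The recoding step can be sketched as follows (the helper and the category numbering, benign = 1, malignant = 2, none = 3, are our own choices; as noted above, the ordering does not matter in this formulation):

```python
def recode(observations, categories):
    # Assign each observation its category number t_i (1-based) and its
    # vector of indicator (dummy) variables y_i, as described in the text.
    t = [categories.index(obs) + 1 for obs in observations]
    y = [[1 if cat == obs else 0 for cat in categories] for obs in observations]
    return t, y

cats = ["benign", "malignant", "none"]
t, y = recode(["benign", "none", "malignant", "benign", "malignant"], cats)
```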
Let p_ig be the (true) underlying probability that y_ig = 1. To use a neural network to fit these probabilities, the continuous-valued output of the neural network, denoted w_ig, is transformed to the probability scale by exponentiating and normalizing by the sum of all exponentiated outputs for that observation. Thus the likelihood is

    L = Π_{i=1}^{n} Π_{g=1}^{q} p_ig^{y_ig}.

The parameters p_ig are deterministic transformations of the neural network parameters β and γ:

    p_ig = exp(w_ig) / Σ_{g'=1}^{q} exp(w_ig'),  with  w_ig = β_0g + Σ_{j=1}^{k} β_jg Ψ(γ_j0 + Σ_{h=1}^{r} γ_jh x_ih),
where i = 1, …, n represents the different cases, j = 1, …, k are the different hidden nodes (logistic basis functions), and g = 1, …, q are the different classes being predicted.
Note that in practice, only the first q − 1 elements of y are used in fitting the neural network so that the problem is of full rank. The w_iq term is set to zero (for all i) for identifiability of the model.
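The transformation to the probability scale, with the qth output pinned to zero for identifiability, can be sketched as:

```python
import math

def softmax_probs(w_first):
    # w_first: the network's continuous outputs for the first q - 1 classes of
    # one case; the qth output is pinned to zero for identifiability.
    w = list(w_first) + [0.0]
    exps = [math.exp(v) for v in w]
    total = sum(exps)
    # Exponentiate and normalize, giving probabilities that sum to one.
    return [e / total for e in exps]

probs = softmax_probs([1.0, -0.5])  # a q = 3 example
```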
For a fixed network size (fixed k), the number of parameters, d, is

    d = k(r + 1) + (q − 1)(k + 1),

where the first term in the sum is the number of γ_jh and the second is the number of β_jg.
This model is referred to as the softmax model in the field of computer science (Bridle (1989)). This method of reparameterizing the probabilities from a continuous scale can be found in other areas of statistics as well, such as generalized linear regression (McCullagh and Nelder (1989, p. 159)).
As an alternative to the multinomial likelihood, one can use a latent variable approach to modify the original model to directly do classification, retaining an underlying Gaussian likelihood, and thus retaining the properties of the original network (as well as being able to more readily reuse computer code).
Again, we code the categories numerically so that the ith observation is in category t_i, and we construct indicator variables for each of the q possible categories. Again, we run the neural network for the indicator variables and denote the continuous-valued outputs (predictions of the latent variables) by w_ig. Now, instead of using a multinomial distribution for t_i, we deterministically set the fitted value to be

    t̂_i = argmax_g w_ig,
i.e., the fitted response is the category with the largest w_ig. Note that we set w_iq = 0 for all i, in order to ensure the model is well specified (otherwise the locations of all of the w_ig's could be shifted by a constant without changing the model). Thus this model preserves the original regression likelihood but now applies it to latent variables, with the latent variables producing a deterministic fitted value on the categorical scale.
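The latent-variable fitted value can be sketched as (our own helper name):

```python
def latent_class_fit(w_first):
    # w_first: latent network outputs for the first q - 1 categories of one
    # case; the qth output is pinned at zero so the model is well specified.
    w = list(w_first) + [0.0]
    # The fitted category is the (1-based) index of the largest latent output.
    return max(range(len(w)), key=lambda g: w[g]) + 1
```

For instance, with q = 3, outputs (2.0, −1.0) yield category 1, while (−2.0, −1.0) yield category 3, since the pinned zero is then the largest.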
This approach also has the advantage of a simple extension to ordinal response variables, those which have a clear ordering but are categorical, where the differences between categories cannot be directly converted to distances, so treating them as continuous is not realistic. For example, a survey may offer the responses of excellent, good, fair, and poor, which are certainly ordered but with no natural distance metric between levels. To fit an ordinal variable, we again let t_i code the category of the ith observation, but this time the ordering of the categories is important. We no longer need the indicator variables, and instead we just fit a single neural network and denote its continuous output by w_i. We then convert this output to a category by dividing the real line into bins and matching the bins to the ordered categories. To be precise, we have additional parameters c_1 ≤ · · · ≤ c_{q−1}, with c_g separating category g from category g + 1.
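The binning step can be sketched as follows (the cutoff values are hypothetical):

```python
import bisect

def ordinal_fit(w, cutoffs):
    # Divide the real line into bins at the cutoffs c_1 <= ... <= c_{q-1} and
    # return the (1-based) ordered category whose bin contains the output w.
    return bisect.bisect_right(cutoffs, w) + 1

# Hypothetical cutoffs for q = 4 ordered survey responses
# (poor < fair < good < excellent).
cuts = [-1.0, 0.0, 1.0]
category = ordinal_fit(0.5, cuts)
```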
Figure 2.6. Predicted probability of loan acceptance from a 6 node network using only the age of the applicant. An x marks a loan that was actually approved, and an O a loan that was actually denied.
2.3.5 Other Flavors of Neural Networks
The exact form of neural network model presented above (equation (2.6)) is referred to as a single hidden layer feed-forward neural network with logistic activation functions and a linear output. As one might expect from all of the terms in that description, there are a large number of variations possible. We shall discuss these briefly, and then we will return to the above basic model for the rest of this book.
The first term is "single hidden layer," which means there is just one set of hidden nodes, the logistic basis functions, which are pictured as the middle row in Figure 2.4. With one set of hidden nodes, we have the straightforward basis representation interpretation. It is quite possible to add additional layers of hidden nodes, where each hidden node takes a linear combination of the outputs from the previous layer and produces its own output, which is a logistic transformation of its input linear combination. The outputs of each layer are taken as the inputs of the next layer. As was proved by several groups, a single layer is all that is necessary to span most spaces of interest, so there is no additional flexibility to be gained by using multiple layers (Cybenko (1989); Funahashi (1989); Hornik, Stinchcombe, and White (1989)). However, sometimes adding layers will give a more compact representation, whereby a complex function can be approximated by a smaller total number of nodes in multiple layers than the number of nodes necessary if only a single layer is used.
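A two-hidden-layer forward pass can be sketched as (names ours):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights):
    # Each node takes a linear combination (intercept first) of the previous
    # layer's outputs and applies the logistic transformation.
    return [logistic(w[0] + sum(wi * v for wi, v in zip(w[1:], inputs)))
            for w in weights]

def two_layer_net(x, w1, w2, beta):
    h1 = layer(x, w1)    # first hidden layer sees the explanatory variables
    h2 = layer(h1, w2)   # second hidden layer sees the first layer's outputs
    return beta[0] + sum(b * z for b, z in zip(beta[1:], h2))
```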
Figure 2.7. Predicted probability of loan acceptance from a 4 node network using only the age of the applicant. An x marks a loan that was actually approved, and an O a loan that was actually denied.
A feed-forward network is one that has a distinct ordering to its layers, so that the inputs to each layer are linear combinations of the outputs of the previous layer, as in Figure 2.4. Sometimes additional connections are made, connecting nonadjacent layers. For example, starting with our standard model, the third layer (the output prediction) could take as inputs a linear combination of both the hidden nodes (the second layer) and the explanatory variables (the first layer), which would give a model of the form

    ŷ_i = β_0 + α'x_i + Σ_{j=1}^{k} β_j Ψ(γ_j0 + Σ_{h=1}^{r} γ_jh x_ih),

which is often called a semiparametric model, because it contains both a parametric component (a standard linear regression, α'x_i) and a nonparametric component (the linear combination of location-scale logistic bases). Significantly more complex models can be made by allowing connections to go down layers as well, creating cycles in the graph of Figure 2.4.
For example, suppose there are four total levels, with two layers of hidden nodes; the inputs of the second row of hidden nodes are linear combinations of the first row of hidden nodes as usual, but the inputs of the first row of hidden nodes include both the explanatory variables as well as the outputs from the second row of hidden nodes, thus creating a cycle. Such networks require iterative solutions and substantially more computing time, and are not as commonly used.
Logistic basis functions are the most commonly used. In theory, any sigmoidal function can be used, as the proofs that the infinite set comprises a basis set depend only on the functions being sigmoidal. In practice, one of two sigmoidal functions is usually chosen, either the logistic or the hyperbolic tangent:

    tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}),

which is a simple transformation of the logistic function, tanh(z) = 2Ψ(2z) − 1. The historical threshold functions discussed in Section 2.3.2 are no longer commonly used.
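The identity tanh(z) = 2Ψ(2z) − 1 is easy to check numerically:

```python
import math

def logistic(z):
    # Psi in equation (2.7).
    return 1.0 / (1.0 + math.exp(-z))

def tanh_via_logistic(z):
    # The hyperbolic tangent as a location-scale change of the logistic:
    # tanh(z) = 2 * Psi(2z) - 1.
    return 2.0 * logistic(2.0 * z) - 1.0

for z in [-3.0, -0.5, 0.0, 1.0, 2.5]:
    assert abs(tanh_via_logistic(z) - math.tanh(z)) < 1e-12
```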
A completely different type of basis function is sometimes used, a radial basis function, which is usually called a kernel in statistics. Thus a radial basis network is basically another name for what statisticians call a mixture model, as described in equation (2.5). Neural network models, particularly radial basis function versions, are sometimes used for density estimation problems (see, for example, Bishop (1995)).
In our basic model, we use a linear output in that the final prediction is simply a linear combination of the outputs of the hidden layer. In other implementations, the logistic transformation may also be applied to the final prediction, producing predictions in the unit interval (which could be rescaled again if necessary). There does not seem to be any substantial advantage to doing so in a regression problem (but it is sometimes done nonetheless). Obviously, if one is fitting values restricted to an interval (e.g., probabilities for a classification problem), such an additional transformation can be quite useful.
2.4 The Bayesian Paradigm
The two main philosophies of probability and statistics are the classical and Bayesian approaches. While they share much in common, they have a fundamental difference in how they interpret probabilities and on the nature of unknown parameters. In the classical (or frequentist) framework, probabilities are typically interpreted as long-run relative frequencies. Unknown parameters are thought of as fixed quantities, so that there is a right answer, even if we will never know what it is. The data are used to find a best guess for the values of the unknown parameters.
Under the Bayesian framework, probabilities are seen as inherently subjective, so that, for each person, their probability statements reflect their personal beliefs about the relative likeliness of the possible outcomes. Unknown parameters are considered to be random, so that they also have probability distributions. Instead of finding a single best guess for the unknown parameters, the usual approach is to estimate the distribution of the parameters using Bayes' theorem. First, a prior distribution must be specified, and this distribution is meant to reflect the subject's personal beliefs about the parameters before the data are observed. In some cases, the prior will be constructed from previous experiments. This approach provides a mathematically coherent method for combining information from different sources.
In other cases, the practitioner may either not know anything about the parameters or may want to use a prior that would be generally acceptable to many other people, rather than a personal one. Such priors are discussed further in Section 3.3. Once the prior, P(θ), is specified for the parameters, θ, Bayes' theorem combines the prior with
the likelihood, f(X|θ), to get the posterior:

    P(θ|X) = f(X|θ) P(θ) / ∫ f(X|θ) P(θ) dθ.
One useful type of prior is a conjugate prior, one that when combined with the likelihood produces a posterior in the same family. For example, if the likelihood specifies that y_1, …, y_n ~ N(μ, 1), then using a normal prior for μ, μ ~ N(a, 1) for some constant a, is a conjugate choice, because the posterior for μ will also be normal; in particular,

    μ | y_1, …, y_n ~ N((a + Σ_{i=1}^{n} y_i) / (n + 1), 1 / (n + 1)).
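This conjugate normal update is a one-liner in code (the function name is ours):

```python
def normal_posterior(a, ys):
    # Prior: mu ~ N(a, 1).  Likelihood: y_i ~ N(mu, 1), iid.
    # Conjugacy gives posterior mu | y ~ N((a + sum(ys)) / (n + 1), 1 / (n + 1)).
    n = len(ys)
    return (a + sum(ys)) / (n + 1), 1.0 / (n + 1)

mean, var = normal_posterior(0.0, [1.0, 1.0, 1.0])
```

With a flat-at-zero prior (a = 0) and three observations equal to 1, the posterior mean 0.75 sits between the prior mean and the sample mean, and the posterior variance 0.25 shrinks as data accumulate.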
Note that the idea of conjugacy is likelihood dependent, in that a prior is conjugate for a particular likelihood. Conjugate priors are widely used for convenience, as they lead to analytically tractable posteriors.
In many cases, such as a neural network, there is no known conjugate prior. Instead, priors for individual parameters may be chosen to be conditionally conjugate, in that when all other parameters are conditioned upon (held fixed), the conditional posterior is of the same family as the conditional prior. For example, in Chapter 3, nearly all of the priors presented for neural networks will put a normal distribution on the β_j parameters (possibly conditioned upon the other parameters), which is conditionally conjugate. The main benefit of conditional conjugacy is that it allows one to use Gibbs sampling, as will be described in Section 3.5, which greatly helps in model fitting.
This book will assume that the reader has some previous knowledge of the Bayesian approach. For more details on the Bayesian approach, as well as for discussion of the philosophical merits and criticisms of the Bayesian approach, the reader is referred to some of the many other sources available (Robert (2001); Congdon (2001); Carlin and Louis (2000); Gelman et al. (1995); Bernardo and Smith (1994); Press (1989); Hartigan (1983); Jeffreys (1961)).
The approach in this book will be the Bayesian one. However, it is more of a pragmatic Bayesian approach, rather than a dogmatic subjectivist approach. As we shall see in Section 3.1, in most cases the parameters do not have straightforward interpretations, and thus it is not feasible to put subjective knowledge into a prior. In the event that the data analyst has substantive prior information, a model with more interpretable parameters should be used instead of a neural network. If something is known about the shape of the data, another more accessible model should be chosen. The strength of neural networks is their flexibility, as was described in Section 2.3. A key benefit of the Bayesian approach is a full accounting of uncertainty. By estimating a posterior distribution, one obtains estimates of error and uncertainty in the process. Such uncertainty can also encompass the choice of model, as will be discussed in the next section.
2.5 Model Building
A full analysis of any dataset involves many steps, starting with exploratory data analysis and moving on to formal model building. The process of choosing a model is typically an iterative one. Outside of the simplest problems, it is quite rare for the first model specified to be the final model. First, one must check the validity of the assumptions. Then one should see if the data suggest that a different model may be more appropriate. For example, when fitting a multiple linear regression, it is important to check that the residuals do follow the assumptions, and, if not, then usually a transformation is performed on the data to improve
the situation. Also important is to check that the right set of variables has been included in the model. Variables which are found to be irrelevant or redundant are removed. Variables which were not initially included but are later found to be helpful would be added. Thus a series of models will be investigated before settling upon a final model or set of models.
The same procedure applies to modeling via neural networks. It is important to check the key assumptions of the model. Residual plots should be made to verify normality, independence, and constant variance (homoscedasticity). Violations of any assumption call for a remedy, typically a transformation of one or more variables. Many of the guidelines for linear regression are applicable here as well.
Also important is the issue of model selection (or some alternative, such as model averaging; see Section 4.1). There are two parts to this issue when dealing with neural networks: choosing the set of explanatory variables and choosing the number of hidden nodes. (An additional complication could be choosing the network structure, if one considers networks beyond just single hidden layer feed-forward neural networks with logistic activation functions and a linear output, but that is beyond the scope of this book.)
First consider choosing an optimal set of explanatory variables. For a given dataset, using more explanatory variables will improve the fit. However, if the variables are not actually relevant, they will not help with prediction. Consider the mean square error for predictions, which has two components: the square of the prediction bias (expected misprediction) and the prediction variance. The use of irrelevant variables will have no effect on the prediction bias but will increase the prediction variance. Inclusion of unique relevant variables will reduce both bias and variance. Inclusion of redundant (e.g., correlated) variables will not change the prediction bias but will increase the variance. Thus the goal is to find all of the nonredundant useful explanatory variables, removing all variables which do not improve prediction.
The second aspect is selecting the optimal number of hidden nodes (or basis functions). Using more nodes allows a more complex fitted function. Fewer nodes lead to a smoother function. For a particular dataset, the more nodes used, the better the fit. With enough hidden nodes, the function can fit the data perfectly, becoming an interpolating function. However, such a function will usually perform poorly at prediction, as it fluctuates wildly in attempting to model the noise in the data in addition to the underlying trend. Just as with any other nonparametric procedure, an optimal amount of smoothing must be found so that the fitted function is neither overfitting (not smooth enough) nor underfitting (too smooth).
Both of these aspects of model selection will be discussed in Chapter 4. That chapter will investigate criteria for selection, as well as methods for searching through the space of possible models. It will also discuss some alternatives to the traditional approach of choosing only a single model.
Chapter 3

Priors for Neural Networks
One of the key decisions in a Bayesian analysis is the choice of prior. The idea is that one's prior should reflect one's current beliefs (either from previous data or from purely subjective sources) about the parameters before one has observed the data. This task turns out to be rather difficult for a neural network, because in most cases the parameters have no interpretable meaning, merely being coefficients in a nonstandard basis expansion (as described in Section 2.3). In certain special cases, the parameters do have intuitive meanings, as will be discussed in the next section. In general, however, the parameters are basically uninterpretable, and thus the idea of putting beliefs into one's prior is rather quixotic. The next two sections discuss several practical choices of priors. This is followed by a practical discussion of parameter estimation, a comparison of some of the priors in this chapter, and some theoretical results on asymptotic consistency.
3.1 Parameter Interpretation and Lack Thereof
In some specific cases, the parameters of a neural network have obvious interpretations. We will first look at these cases, and then a simple example will show how things can quickly go wrong and become virtually uninterpretable.

The first case is that of a neural network with only one hidden node and one explanatory variable. The model for the fitted values is

ŷ = β0 + β1 Ψ(γ0 + γ1 x),

where Ψ is the logistic function.
Figure 3.1 shows this fitted function for β0 = 2, β1 = 4, γ0 = −9, and γ1 = 15 over x in the unit interval. The parameter β0 is an overall location parameter for y, in that y = β0 when the logistic function is close to zero (in this case when x is near or below zero, so here β0 is the y-intercept). The parameter β1 is an overall scale factor for y, in that the logistic function ranges from 0 to 1 and thus β1 is the range of y, which in this case is 6 − 2 = 4. The γ parameters control the location and scale of the logistic function. The center of the logistic function occurs at −γ0/γ1, here 0.6. For larger values of γ1, y changes at a steeper rate in the neighborhood of the center, leveling off as x gets farther away from the center.

Figure 3.1. Fitted function with a single hidden node.
If γ1 is positive, then the logistic rises from left to right. If γ1 is negative, then the logistic will decrease as x increases.
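The single-node interpretation is easy to verify numerically. The sketch below uses the parameter values quoted for Figure 3.1, taking γ0 = −9 (the sign follows from the stated center of 0.6):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def yhat(x, beta0=2.0, beta1=4.0, gamma0=-9.0, gamma1=15.0):
    # One hidden node, one input: beta0 + beta1 * logistic(gamma0 + gamma1 * x).
    return beta0 + beta1 * logistic(gamma0 + gamma1 * x)

print(yhat(0.0))  # near beta0 = 2: the logistic is essentially 0 here (the y-intercept)
print(yhat(0.6))  # at the center -gamma0/gamma1 = 0.6, halfway up: 2 + 4/2 = 4
print(yhat(1.0))  # near beta0 + beta1 = 6: the logistic is essentially 1
```

Each evaluation matches the interpretation in the text: β0 sets the location, β1 the range, and the γ parameters the location and scale of the logistic.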
Note that a similar interpretation can be used when there are additional explanatory variables, with the γ0, γ1, and x above being replaced by a linear combination of the x's with coefficients γj, so that the logistic rises with respect to the entire linear combination.
This interpretation scheme can also be extended to additional logistic functions as long as their centers are spaced sufficiently far apart from each other. In that case, for any given value of x, it will be close to the center of at most one logistic, and it will be in the tails of all others (where a small change in x does not produce a noticeable change in y). Thus the parameters of each logistic function can be dealt with separately, using the previous interpretation locally.
Two logistics with opposite signs on γ1 can be placed so that they are somewhat but not too close to each other, and together they form a bump, jointly acting much like a kernel or a radial basis function. One could attempt to apply some of the interpretations of mixture models here, but in practice, fitted functions rarely behave in this manner.
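The bump construction can be sketched directly; the steepness and centers below are arbitrary illustrative choices, not values from the book. One logistic rises at one center and the other (with γ1 of the opposite sign) falls at a second center, so their sum bulges between the two:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, b=10.0, c1=0.4, c2=0.6):
    # Opposite signs on gamma1: one logistic rises at c1, the other falls at c2.
    return logistic(b * (x - c1)) + logistic(-b * (x - c2))

x = np.linspace(0.0, 1.0, 201)
y = bump(x)
print(x[np.argmax(y)])  # peak at the midpoint of the two centers, 0.5
print(y[0], y[-1])      # tails return to a baseline level of about 1
```

An overall intercept β0 would absorb the baseline, leaving a kernel-like bump centered between the two logistic centers.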
In most real-world problems, there will be some overlap of the logistic functions, which can lead to many other sorts of behavior in the neural network predictions, effectively removing any interpretability. A simple example is shown in Figure 3.2, which deals with a model with a single explanatory variable and only two hidden nodes (logistic functions).
This example involves fitting the motorcycle accident data of Silverman (1985), which tracks the acceleration force on the head of a motorcycle rider in the first moments after impact. Here a neural network with two hidden nodes was used, and the maximum likelihood fit is shown in Figure 3.2. This model fits the data somewhat well, although it does appear to oversmooth in the first milliseconds after the crash. The point is not that this is the best possible fit but rather that this particular fit, which is the maximum likelihood fit, displays interesting behavior with respect to the parameters of the model. Remember that there are only two logistic functions producing this fit, yet there are three points of inflection in the fitted curve. Thus the previous interpretations of the parameters no longer apply.
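That two logistic bases can produce three inflection points is easy to check numerically. The sketch below uses arbitrary illustrative centers and steepness, not the fitted motorcycle-data parameters, and counts sign changes of the second derivative of the fitted curve:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def d2_logistic(z):
    # Second derivative of the logistic: s(1 - s)(1 - 2s).
    s = logistic(z)
    return s * (1 - s) * (1 - 2 * s)

def count_inflections(f2_vals):
    """Count sign changes in a sampled second derivative."""
    s = np.sign(f2_vals)
    s = s[s != 0]  # drop exact zeros landing on grid points
    return int(np.sum(s[:-1] * s[1:] < 0))

x = np.linspace(0.0, 1.0, 2001)
b = 30.0

# A single logistic has exactly one inflection point, at its center.
one = b**2 * d2_logistic(b * (x - 0.45))
print(count_inflections(one))  # 1

# Two logistics whose ranges of action interact: the summed curvature
# changes sign three times, so the sum has three inflection points.
two = b**2 * d2_logistic(b * (x - 0.3)) + b**2 * d2_logistic(b * (x - 0.6))
print(count_inflections(two))  # 3
```

Between the two centers the curvatures of the logistics have opposite signs and must cancel somewhere, creating the extra pair of inflection points that the single-logistic interpretation cannot account for.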
What is going on here is that the two logistic functions have centers very close to each other, so that their ranges of action interact, and do so in a highly nonlinear fashion. These logistic bases are shown in