Bayesian Nonparametrics via Neural Networks, by Herbert K. H. Lee (SIAM, 2004)


Bayesian Nonparametrics via Neural Networks


ASA-SIAM Series on Statistics and Applied Probability

The ASA-SIAM Series on Statistics and Applied Probability is published jointly by the American Statistical Association and the Society for Industrial and Applied Mathematics. The series consists of a broad spectrum of books on topics in statistics and applied probability. The purpose of the series is to provide inexpensive, quality publications of interest to the intersecting membership of the two societies.

Editorial Board

Robert N. Rodriguez, SAS Institute Inc., Editor-in-Chief
David Banks, Duke University
H. T. Banks, North Carolina State University
Richard K. Burdick, Arizona State University
Joseph Gardiner, Michigan State University
Douglas M. Hawkins, University of Minnesota
Susan Holmes, Stanford University
Lisa LaVange, Inspire Pharmaceuticals, Inc.
Gary C. McDonald, Oakland University and National Institute of Statistical Sciences
Francoise Seillier-Moiseiwitsch, University of Maryland, Baltimore County

Lee, H. K. H., Bayesian Nonparametrics via Neural Networks
O'Gorman, T. W., Applied Adaptive Statistical Methods: Tests of Significance and Confidence Intervals
Ross, T. J., Booker, J. M., and Parkinson, W. J., eds., Fuzzy Logic and Probability Applications: Bridging the Gap
Nelson, W. B., Recurrent Events Data Analysis for Product Repairs, Disease Recurrences, and Other Applications
Mason, R. L. and Young, J. C., Multivariate Statistical Process Control with Industrial Applications
Smith, P. L., A Primer for Sampling Solids, Liquids, and Gases: Based on the Seven Sampling Errors of Pierre Gy
Meyer, M. A. and Booker, J. M., Eliciting and Analyzing Expert Judgment: A Practical Guide
Latouche, G. and Ramaswami, V., Introduction to Matrix Analytic Methods in Stochastic Modeling
Peck, R., Haugh, L., and Goodman, A., Statistical Case Studies: A Collaboration Between Academe and Industry, Student Edition
Peck, R., Haugh, L., and Goodman, A., Statistical Case Studies: A Collaboration Between Academe and Industry
Barlow, R., Engineering Reliability
Czitrom, V. and Spagon, P. D., Statistical Case Studies for Industrial Process Improvement


Bayesian Nonparametrics via Neural Networks

Herbert K. H. Lee
University of California, Santa Cruz
Santa Cruz, California

Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania
American Statistical Association, Alexandria, Virginia


Copyright © 2004 by the American Statistical Association and the Society for Industrial and Applied Mathematics.

10 9 8 7 6 5 4 3 2 1

All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.

No warranties, express or implied, are made by the publisher, authors, and their employers that the programs contained in this volume are free of error. They should not be relied on as the sole basis to solve a problem whose incorrect solution could result in injury to person or property. If the programs are employed in such a manner, it is at the user's own risk, and the publisher, authors, and their employers disclaim all liability for such misuse.

Trademarked names may be used in this book without the inclusion of a trademark symbol. These names are used in an editorial context only; no infringement of trademark is intended.

S-PLUS is a registered trademark of Insightful Corporation.
SAS is a registered trademark of SAS Institute Inc.

This research was supported in part by the National Science Foundation (grants DMS 9803433, 9873275, and 0233710) and the National Institutes of Health (grant R01CA54852-08).

Library of Congress Cataloging-in-Publication Data

Lee, Herbert K. H.
Bayesian nonparametrics via neural networks / Herbert K. H. Lee.
p. cm. (ASA-SIAM series on statistics and applied probability)
Includes bibliographical references and index.
ISBN 0-89871-563-6 (pbk.)
1. Bayesian statistical decision theory. 2. Nonparametric statistics. 3. Neural networks (Computer science) I. Title. II. Series.
QA279.5.L43 2004
519.5'42 dc22 2004048151

A portion of the royalties from the sale of this book are being placed in a fund to help students attend SIAM meetings and other SIAM-related activities. This fund is administered by SIAM, and qualified individuals are encouraged to write directly to SIAM for guidelines.

SIAM is a registered trademark.


Contents

List of Figures
Preface
1 Introduction
  1.1 Statistics and Machine Learning
  1.2 Outline of the Book
  1.3 Regression Example: Ground-level Ozone Pollution
  1.4 Classification Example: Loan Applications
  1.5 A Simple Neural Network Example
2 Nonparametric Models
  2.1 Nonparametric Regression
    2.1.1 Local Methods
    2.1.2 Regression Using Basis Functions
  2.2 Nonparametric Classification
  2.3 Neural Networks
    2.3.1 Neural Networks Are Statistical Models
    2.3.2 A Brief History of Neural Networks
    2.3.3 Multivariate Regression
    2.3.4 Classification
    2.3.5 Other Flavors of Neural Networks
  2.4 The Bayesian Paradigm
  2.5 Model Building
3 Priors for Neural Networks
  3.1 Parameter Interpretation, and Lack Thereof
  3.2 Proper Priors
  3.3 Noninformative Priors
    3.3.1 Flat Priors
    3.3.2 Jeffreys Priors
    3.3.3 Reference Priors
  3.4 Hybrid Priors
  3.5 Model Fitting
  3.6 Example Comparing Priors
  3.7 Asymptotic Consistency of the Posterior
4 Building a Model
  4.1 Model Selection and Model Averaging
    4.1.1 Modeling Versus Prediction
    4.1.2 Bayesian Model Selection
    4.1.3 Computational Approximations for Model Selection
    4.1.4 Model Averaging
    4.1.5 Shrinkage Priors
    4.1.6 Automatic Relevance Determination
    4.1.7 Bagging
  4.2 Searching the Model Space
    4.2.1 Greedy Algorithms
    4.2.2 Stochastic Algorithms
    4.2.3 Reversible Jump Markov Chain Monte Carlo
  4.3 Ozone Data Analysis
  4.4 Loan Data Analysis
5 Conclusions
A Reference Prior Derivation
Glossary
Bibliography
Index


List of Figures

1.1 Pairwise scatterplots for the ozone data
1.2 Estimated smooths for the ozone data
1.3 Correlated variables: Age vs. current residence for loan applicants
1.4 A neural network fitted function
1.5 Simple neural network model diagram
2.1 A tree model for ozone using only wind speed and humidity
2.2 Example wavelets from the Haar family
2.3 Diagram of nonparametric methods
2.4 Neural network model diagram
2.5 Multivariate response neural network model diagram
2.6 Probability of loan acceptance by only age, 6 nodes
2.7 Probability of loan acceptance by only age, 4 nodes
3.1 Fitted function with a single hidden node
3.2 Maximum likelihood fit for a two-hidden-node network
3.3 Logistic basis functions of the fit in Figure 3.2
3.4 DAG for the Müller and Rios Insua model
3.5 DAG for the Neal model
3.6 Comparison of priors
4.1 Fitted mean functions for several sizes of networks
4.2 Bagging fitted values
4.3 Fitted ozone values by day of year
4.4 Fitted ozone values by vertical height, humidity, pressure gradient, and inversion base temperature
4.5 Fitted ozone values by actual recorded levels
4.6 Probability of loan acceptance by time in current residence
4.7 Probability of loan acceptance by age
4.8 Probability of loan acceptance by income


Preface

When I first heard about neural networks and how great they were, I was rather skeptical. Being sold as a magical black box, there was enough hype to make one believe that they could solve the world's problems. When I tried to learn more about them, I found that most of the literature was written for a machine learning audience, and I had to grapple with a new perspective and a new set of terminology. After some work, I came to see neural networks from a statistical perspective, as a probability model. One of the primary motivations for this book was to write about neural networks for statisticians, addressing issues and concerns of interest to statisticians, and using statistical terminology. Neural networks are a powerful model, and should be treated as such, rather than disdained as a mere algorithm, as I have found some statisticians do. Hopefully this book will prove to be illuminating.

The phrase Bayesian nonparametrics means different things to different people. The traditional interpretation usually implies infinite-dimensional processes such as Dirichlet processes, used for problems in regression, classification, and density estimation. While on the surface this book may not appear to fit that description, it is actually close. One of the themes of this book is that a neural network can be viewed as a finite-dimensional approximation to an infinite-dimensional model, and that this model is useful in practice for problems in regression and classification.

Thus the first section of this book will focus on introducing neural networks within the statistical context of nonparametric regression and classification. The rest of the book will examine important statistical modeling issues for Bayesian neural networks, particularly the choice of prior and the choice of model.

While this book will not assume the reader has any prior knowledge about neural networks, neither will it try to be an all-inclusive introduction. Topics will be introduced in a self-contained manner, with references provided for further details of the many issues that will not be directly addressed in this book.

The target audience for this book is practicing statisticians and researchers, as well as students preparing for either or both roles. This book addresses practical and theoretical issues. It is hoped that the users of neural networks will want an understanding of how the model works, which can lead to a better appreciation of knowing when it is working and when it is not. It will be assumed that the reader has already been introduced to the basics of the Bayesian approach, with only a brief review and additional references provided. There are a number of good introductory books on Bayesian statistics available (see Section 2.4), so it does not seem productive to repeat that material here. It will also be assumed that the reader has a solid background in mathematical statistics and in linear regression, such as that which would be acquired as part of a traditional Master's degree in statistics. However, there are few formal proofs, and much of the text should be accessible even without this background. Computational issues will be discussed at conceptual and algorithmic levels.

This work developed from my Ph.D. thesis ("Model Selection and Model Averaging for Neural Networks," Carnegie Mellon University, Department of Statistics, 1998). I am grateful for all the assistance and knowledge provided by my advisor, Larry Wasserman. I would also like to acknowledge the many individuals who have contributed to this effort, including David Banks, Jim Berger, Roberto Carta, Merlise Clyde, Daniel Cork, Scott Davies, Sigrunn Eliassen, Chris Genovese, Robert Gramacy, Rob Kass, Milovan Krnjajić, Meena Mani, Daniel Merl, Andrew Moore, Peter Müller, Mark Schervish, Valerie Ventura, and Kert Viele, as well as the staff at SIAM and a number of anonymous referees and editors, both for this book and for the papers preceding it. At various points during this work, funding has been provided by the National Science Foundation (grants DMS 9803433, 9873275, and 0233710) and the National Institutes of Health (grant R01CA54852-08).


Chapter 1
Introduction

The goal of this book is to put neural network models firmly into a statistical framework, treating them with the accompanying rigor normally accorded to statistical models. A neural network is frequently seen as either a magical black box or purely as a machine learning algorithm, when in fact there is a definite probability model behind it. This book will start by showing how neural networks are indeed a statistical model for doing nonparametric regression or classification. The focus will be on a Bayesian perspective, although many of the topics will apply to frequentist models as well. As much of the literature on neural networks appears in the computer science realm, many standard modeling questions fail to get addressed. In particular, this book will take a hard look at key modeling issues such as choosing an appropriate prior and dealing with model selection. Most of the existing literature deals with neural networks as an algorithm. The hope of this book is to shift the focus back to modeling.

1.1 Statistics and Machine Learning

The fields of statistics and machine learning are two approaches toward the same goals, with much in common. In both cases, the idea is to learn about a problem from data. In most cases, this is either a classification problem, a regression problem, an exploratory data analysis problem, or some combination of the above. Where statistics and machine learning differ most is in their perspective. It is sort of like two people, one standing outside of an airplane and one standing inside the same plane, both asked to describe this plane. The person outside might discuss the length, the wingspan, the number of engines and their layout, and so on. The person inside might comment on the number of rows of seats, the number of aisles, the seat configurations, the amount of overhead storage space, the number of lavatories, and so on. In the end, they are both describing the same plane. The relationship between statistics and machine learning is much like this situation. As much of the terminology differs between the two fields, towards the end of this book a glossary is provided for translating relevant machine learning terms into statistical terms.

As a bit of an overgeneralization, the field of statistics and the methods that come out of it are based on probability models. At the heart of almost all analyses there is some distribution describing a random quantity, and the methodology springs from the probability model. For example, in simple linear regression, we relate the response $y$ to the explanatory variable $x$ via the conditional distribution $y \sim N(\beta_0 + \beta_1 x, \sigma^2)$, using possibly unknown parameters $\beta_0$, $\beta_1$, and $\sigma^2$. The core of the model is the probability model.

For a matching overgeneralization, machine learning can be seen as the art of developing algorithms for learning from data. The idea is to devise a clever algorithm that will perform well in practice. It is a pragmatic approach that has produced many useful tools for data analysis.

Many individual data analysis methods are associated only with the field in which they were developed. In some cases, methods have been invented in one field and reinvented in the other, with the terminology remaining completely different, and the research remaining separate. This is unfortunate, since both fields have much to learn from each other. It is like the airplane example, where a more complete description of the plane is available when both observers cooperate.

Traditionally, neural networks are seen as a machine learning algorithm. They were largely developed by the machine learning community (or the precursor community, artificial intelligence); see Section 2.3.2. In most implementations, the focus is on prediction, using the neural network as a means of finding good predictions, i.e., as an algorithm. Yet a neural network is also a statistical model. Chapter 2 will show how it is simply another method for nonparametric regression or classification. There is an underlying probability model, and, with the proper perspective, a neural network is seen as a generalization of linear regression. Further discussion is found in Section 2.3.1.

For a more traditional approach to neural networks, the reader is referred to Bishop (1995) or Ripley (1996), two excellent books that are statistically accurate but have somewhat of a machine learning perspective. Fine (1999) is more statistical in flavor but mostly non-Bayesian. Several good (but not Bayesian) statistical review articles are also available (Cheng and Titterington (1994); Stern (1996); Warner and Misra (1996)). A final key reference is Neal (1996), which details a fully Bayesian approach to neural networks from a perspective that combines elements from both machine learning and statistics.

1.2 Outline of the Book

This section is meant to give the reader an idea of where this book is going from here. The next two sections of this chapter will introduce two data examples that will be used throughout the book, one for regression and one for classification. The final section of this chapter gives a brief introduction to a simple neural network model, to provide some concreteness to the concept of a neural network, before moving on. Chapter 2 provides a statistical context for neural networks, showing how they fit into the larger framework of nonparametric regression and classification. It will also give a brief review of the Bayesian approach and of concepts in model building. Chapter 3 will focus on choosing a prior, which is a key item in a Bayesian analysis. Also included are details on model fitting and a result on posterior asymptotic consistency. Chapter 4 gets into the nuts and bolts of model building, discussing model selection and alternatives such as model averaging, both at conceptual and implementational levels.


1.3 Regression Example: Ground-level Ozone Pollution

An example that will be used throughout this book to illustrate a problem in nonparametric regression is a dataset on ground-level ozone pollution. This dataset first appeared in Breiman and Friedman (1985) and was analyzed extensively by Hastie and Tibshirani (1990). Other authors have also used this dataset, so it is useful to be able to compare results from methods in this book with other published results.

The data consist of ground-level ozone measurements in the Los Angeles area over the course of a year, along with other meteorological variables. The goal is to use the meteorological covariates to predict ozone concentration, which is a pollutant at the level of human activity. The version of the data used here is that in Hastie and Tibshirani (1990), with missing data removed, so that there are complete measurements for 330 days in 1976. For each of these days, the response variable of interest is the daily maximum one-hour-average ozone level in parts per million at Upland, California. Also available are nine possible explanatory variables: VH, the altitude at which the pressure is 500 millibars; WIND, the wind speed (mph) at Los Angeles International Airport (LAX); HUM, the humidity (%) at LAX; TEMP, the temperature (degrees F) at Sandburg Air Force Base; IBH, the temperature inversion base height (feet); DPG, the pressure gradient (mm Hg) from LAX to Daggett; IBT, the inversion base temperature (degrees F) at LAX; VIS, the visibility (miles) at LAX; and DAY, the day of the year (numbered from 1 to 365).

The data are displayed in pairwise scatterplots in Figure 1.1. The first thing to notice is that most variables have a strong nonlinear association with the ozone level, making prediction feasible but requiring a flexible model to capture the nonlinear relationship. Fitting a generalized additive model (GAM, local smoothed functions for each variable without interaction effects), as described in Section 2.1.1 or in Hastie and Tibshirani (1990), produces the fitted relationships between the explanatory variables (rescaled to the unit interval) and ozone displayed in Figure 1.2. Most variables display a strong relationship with ozone, and all but the first are clearly nonlinear. This problem is thus an example of nonparametric regression, in that some smoothing of the data is necessary, but we must determine how much smoothing is optimal. We will apply neural networks to this problem and compare the results to other nonparametric techniques.
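To make the additive-model idea concrete, here is a rough backfitting sketch in Python. It is illustrative only, not the fit used in the book: a crude running-mean smoother stands in for the local smoothers of Section 2.1.1, and synthetic data stand in for the ozone measurements.

```python
import numpy as np

def running_mean_smooth(x, r, width=15):
    """Crude local smoother: average the residuals r over a window in x-order."""
    order = np.argsort(x)
    smoothed = np.empty_like(r)
    for rank, idx in enumerate(order):
        lo, hi = max(0, rank - width), min(len(x), rank + width + 1)
        smoothed[idx] = r[order[lo:hi]].mean()
    return smoothed

def backfit_gam(X, y, n_iter=20):
    """Fit y ~ alpha + sum_j f_j(X[:, j]) by backfitting, no interactions."""
    n, p = X.shape
    alpha = y.mean()
    f = np.zeros((n, p))                                   # current estimates of each f_j
    for _ in range(n_iter):
        for j in range(p):
            partial = y - alpha - f.sum(axis=1) + f[:, j]  # partial residuals for f_j
            f[:, j] = running_mean_smooth(X[:, j], partial)
            f[:, j] -= f[:, j].mean()                      # center for identifiability
    return alpha, f

# Synthetic stand-in shaped like the ozone problem: 330 days, a few covariates.
rng = np.random.default_rng(0)
X = rng.uniform(size=(330, 3))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.2, size=330)
alpha, f = backfit_gam(X, y)
print("intercept:", round(alpha, 3), "component matrix shape:", f.shape)
```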

Another notable feature of this dataset is that many variables are highly related to each other, for example VH, TEMP, and IBT, as can be seen in Figure 1.1. It is important to deal with this multicollinearity to avoid overfitting and reduce the variance of predictions. The favored approaches in this book include selection and model averaging. We shall see that neural networks perform well when compared to existing methods in the literature.

1.4 Classification Example: Loan Applications

As an example of a classification problem, a dataset that we shall revisit comes from consumer banking. One question of interest is classifying loan applications for acceptance or rejection. The high correlation between explanatory variables presents certain challenges. There is also direct interest in the problem of model selection, as will be explained below.

Figure 1.1. Pairwise scatterplots for the ozone data.

Historically, banking was a local and personal operation. When a customer wanted a loan, they went to their local bank branch, where they knew the staff, and they applied for a loan. The branch manager would then approve or deny the loan based on a combination of the information in the application and their personal knowledge of the customer. As banking grew in scale, the processing of loan applications moved to centralized facilities and was done by trained loan officers who specialized in making such decisions. The personal connection was gone, but there was still a human looking over each application and making his or her decision based on his or her experience as a loan officer. Recently, banks have been switching to partially or completely automated systems for dealing with most loan applications, basing their decisions on algorithms derived from statistical models. This process has been met with much resistance from the loan officers, who believe that the automated processes will mishandle the nonstandard items on many applications and who are also worried about losing their jobs.

Figure 1.2. Estimated smooths for the ozone data.

A regional bank wanted to streamline this process while retaining the human decision element. In particular, the bank wanted to find out what information on the application was most important, so that it could greatly simplify the form, making life easier for the customer and creating cost savings for the bank. Thus the primary goal here is one of variable selection. In the process, one must find a model that predicts well, but the ultimate goal is to find a subset of explanatory variables that is hopefully not too large and can be used to predict as accurately as possible. Even from a statistical standpoint of prediction, any model would need to be concerned with the large amount of correlation between explanatory variables, so model selection would be of interest to help deal with the multicollinearity. We turn to nonparametric models for classification in the hope that by maximizing the flexibility of the model structure, we can minimize the number of covariates necessary for a good fit. In some sections of this book we will focus on the intermediate goal of predicting whether the loan was actually approved or denied by a human loan officer. In other sections we will be more concerned with variable selection. This dataset involves the following 23 covariates:


1. Birthdate
2. Length of time in current residence
3. Length of time in previous residence
4. Length of time at current employer
5. Length of time at previous employer
6. Line utilization of available credit
7. Number of inquiries on credit file
8. Date of oldest entry in credit file
9. Income
10. Residential status
11. Monthly mortgage payments
12. Number of checking accounts at this bank
13. Number of credit card accounts at this bank
14. Number of personal credit lines at this bank
15. Number of installment loans at this bank
16. Number of accounts at credit unions
17. Number of accounts at other banks
18. Number of accounts at finance companies
19. Number of accounts at other financial institutions
20. Budgeted debt expenses
21. Amount of loan approved
22. Loan type code
23. Presence of a co-applicant

These variables fall mainly into three groups: stability, demographic, and financial. Stability variables include such items as the length of time the applicant has worked in his or her current job. Stability is thought to be positively correlated with intent and ability to repay a loan; for example, a person who has held a given job longer is less likely to lose it (and the corresponding income) and be unable to repay a loan. A person who has lived in his or her current residence for longer is less likely to skip town suddenly, leaving behind an unpaid loan. Demographic variables include items like age. In the United States, it is illegal to discriminate against older people, but younger people can be discriminated against. Many standard demographic variables (e.g., gender) are not legal for use in a loan decision process and are thus not included. Financial variables include the number of other accounts at this bank and at other financial institutions, as well as the applicant's income and budgeted expenses, and are of obvious interest to a loan officer. The particular loans involved are unsecured (no collateral) personal loans, such as a personal line of credit or a vacation loan.

Figure 1.3. Correlated variables: Age vs. current residence for loan applicants.

The covariates in this dataset are highly correlated. In some cases, there is even causation. For example, someone with a mortgage will be a homeowner. Another example is that a person cannot have lived in their current residence for more years than they are old. Figure 1.3 shows a plot of age versus length of time the applicant has reported living in their current residence. Clearly all points must fall below the 45-degree line. Any statistical analysis must take this correlation into account, and model selection is an ideal approach.


A standard approach in the machine learning literature is to split the data into two sets: a training set, which is used for fitting the model, and a test set, for which the fitted model makes predictions, and one then compares the accuracy of the predictions to the true values. Since the loan dataset contains 8508 cases, this is large enough to split into a training set of 4000 observations and a test set of the remaining 4508 observations. 4000 observations are enough to fit complex models, and the test set allows one to see if the fitted model is overfitting (fitting the training set quite well but not able to fit the test set well), or if it can predict well on out-of-sample observations, which is typically the desired goal.
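Such a split is simple to carry out; the following Python sketch mirrors the 4000/4508 split described above, with random stand-in data in place of the loan dataset, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(42)
n_cases, n_covariates = 8508, 23                 # sizes from the loan example
X = rng.normal(size=(n_cases, n_covariates))     # stand-in covariates
y = rng.integers(0, 2, size=n_cases)             # stand-in accept/reject labels

perm = rng.permutation(n_cases)                  # random assignment to the two sets
train, test = perm[:4000], perm[4000:]
X_train, y_train = X[train], y[train]
X_test, y_test = X[test], y[test]
print(len(train), "training cases,", len(test), "test cases")
```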

This is a messy dataset, and no model will be able to predict with great accuracy on cases outside the training set. Most variables are self-reported, so there is potential measurement error for the explanatory variables. There is also arbitrary rounding in the self-reporting. For example, residence times are typically reported to the nearest month for short times, but to the nearest year or nearest five years for people who have lived in the same place for a longer time. This structure is apparent in Figure 1.3, where the horizontal lines in the picture correspond to whole years near the bottom and clearly delineate every five years in the middle. Some information is incomplete, and there are a number of subjective factors known to influence loan officers which are not coded in the data, so that one cannot hope for too precise a fit on this dataset. For example, data on co-applicants was incomplete, and it was not possible to use those variables because of the very large amount of missing data for cases where there was known to be a co-applicant. A trained loan officer can make much better sense of the incomplete co-applicant data, and this is likely a source of some of the error in the models that will be seen in this book. From talking to loan officers, it seems that the weight given to the presence and the attributes of a co-applicant seems to be based largely on factors that are not easily quantified, such as the reliability of the income source of the primary applicant (e.g., some fields of self-employment are deemed better than others, and some employers are known for having more or less employee turnover; an applicant in a less stable job would have more need for a co-applicant).

1.5 A Simple Neural Network Example

In this section, the basic idea of a neural network is introduced via a simple example. More details about neural networks will appear in the following chapters. In particular, Chapter 2 will demonstrate how neural networks can be viewed as a nonparametric model and give some history, and Chapter 3 will explain the interpretation, and lack thereof, of the parameters in the model, as well as methods for fitting the model.

For the ozone data introduced in Section 1.3, consider fitting a model relating ozone levels to the day of the year (coded as 1 through 365). A possible fitted model would be a neural network with two nodes, and one example of such a fit appears in Figure 1.4. The fitted line in the plot is

$\hat{y} = 7.13 + 10.58\,\Psi(a_1 + b_1 x) + 13.12\,\Psi(a_2 + b_2 x)$,   (1.1)

where $x$ is the day of the year and $\Psi$ is the logistic transformation of equation (1.2) below. How can we make sense of this equation? Neural networks are typically thought of in terms of their hidden nodes. This is best illustrated with a picture. Figure 1.5 displays our simple model, with a single explanatory variable (day of year) and two hidden nodes.

Figure 1.5. Simple neural network model diagram.


Starting at the bottom of the diagram, the explanatory variable (input node) feeds its value to each of the hidden nodes, which then transform it and feed those results to the prediction (output) node, which combines them to give the fitted value at the top. In particular, what each hidden node does is to multiply the value of the explanatory variable by a scalar parameter ($b$), add a scalar ($a$), and then take the logistic transformation of it:

$\Psi(a + bx) = \frac{1}{1 + e^{-(a + bx)}}$.   (1.2)

So in equation (1.1), the first node has $a_1 = -21.75$ and $b_1 = 0.19$. The $b$ coefficients are shown as the labels of the edges from the input node to the hidden nodes in Figure 1.5 ($a$ is not shown). The prediction node then takes a linear combination of the results of all of the hidden nodes to produce the fitted value. In the figure, the linear coefficients for the hidden nodes are the labels of the edges from the hidden nodes to the prediction node. Returning to equation (1.1), the first node is weighted by 10.58, the second by 13.12, and then 7.13 is added. This results in a fitted value for each day of the year. To make a bit more sense of the plot, note that $x = 114$ is the inflection point for the first rise in the graph, and this occurs when $a + bx = 0$ in equation (1.2). So the first rise is the action of the first hidden node. Similarly, the fitted function decreases because of the second hidden node (whose slope $b_2$ is negative), where the center of the decrease is around $280 = -a_2/b_2$. We will return to the interpretations of the parameters in Chapter 3.
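For concreteness, here is a small Python sketch of this two-node fit. The first node's coefficients are those given above; the second node's coefficients a2 and b2 were not recoverable here, so the values below are hypothetical ones chosen only to satisfy the stated constraints (b2 < 0 and -a2/b2 approximately 280).

```python
import numpy as np

def logistic(z):
    """Equation (1.2): the logistic transformation."""
    return 1.0 / (1.0 + np.exp(-z))

def two_node_fit(x, a1=-21.75, b1=0.19, a2=28.0, b2=-0.10,
                 w0=7.13, w1=10.58, w2=13.12):
    """Equation (1.1): an intercept plus two weighted logistic hidden nodes.
    a2 and b2 are hypothetical; only b2 < 0 and -a2/b2 = 280 come from the text."""
    return w0 + w1 * logistic(a1 + b1 * x) + w2 * logistic(a2 + b2 * x)

day = np.arange(1, 366)
fitted = two_node_fit(day)
# The first rise is centered where a1 + b1*x = 0, i.e., at x = -a1/b1.
print("first rise centered near day", round(21.75 / 0.19))   # about 114
print("decrease centered near day", round(-28.0 / -0.10))    # 280
```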


Chapter 2
Nonparametric Models

Most standard statistical techniques are parametric methods, meaning that a particular family of models, indexed by one or more parameters, is chosen for the data, and the model is fit by choosing optimal values for the parameters (or finding their posterior distribution). Examples include linear regression (with slope and intercept parameters) and logistic regression (with the parameters being the coefficients). In these cases, it is assumed that the choice of model family (e.g., a linear relationship with independent Gaussian error) is the correct family, and all that needs to be done is to fit the coefficients.

The idea behind nonparametric modeling is to move beyond restricting oneself to a particular family of models, instead utilizing a much larger model space. For example, the goal of many nonparametric regression problems is to find the continuous function that best approximates the random process without overfitting the data. In this case, one is not restricted to linear functions, or even differentiable functions. The model is thus nonparametric in the sense that the set of possible models under consideration does not belong to a family that can be indexed by a finite number of parameters.

In practice, instead of working directly with an infinite-dimensional space, such as the space of continuous functions, various classes of methods have been developed to approximate the space of interest. The two classes that will be developed here are local approximations and basis representations. There are many additional methods that do not fit into these categories, but these two are sufficient for providing the context of how neural networks fit into the bigger picture of nonparametric modeling. Further references on nonparametric modeling are widely available, for example, Härdle (1990), Hastie, Tibshirani, and Friedman (2001), Duda, Hart, and Stork (2001), and Gentle (2002).

In this chapter, we will first examine a number of different nonparametric regression methods, with particular attention on those using a basis representation. Next will be a discussion of classification methods, which are often simple extensions of regression techniques. Within this framework, we will introduce neural network models, which will be shown to fit right in with the other basis representation methods. A brief review of the Bayesian paradigm follows, along with a discussion of how it relates to neural networks in particular. Finally, there will be a discussion of model building within the context of nonparametric modeling, which will also set the stage for the rest of the book.


2.1 Nonparametric Regression

The typical linear regression problem is to find a vector of coefficients to maximize the likelihood of the model

$y_i = \beta_0 + \sum_{h=1}^{r} \beta_h x_{ih} + \epsilon_i$,   (2.1)

where $y_i$ is the $i$th case of the response variable, $x_i = (x_{i1}, \ldots, x_{ir})$ is the vector of corresponding values of the explanatory variables, and the residuals, $\epsilon_i$, are iid Gaussian with mean zero and a common variance. The assumption of a straight line (or a hyperplane) fit may be overly restrictive. Even a family of transformations may be insufficiently flexible to fit many datasets. Instead, we may want a much richer class of possible response functions.

The typical nonparametric regression model is of the form

$y_i = f(x_i) + \epsilon_i$,   (2.2)

where $f \in \mathcal{F}$, some class of regression functions, $x_i$ is the vector of explanatory variables, and $\epsilon_i$ is iid additive error with mean zero and constant variance, usually assumed to have a Gaussian distribution (as we will here). The main distinction between the competing nonparametric methods is the class of functions, $\mathcal{F}$, to which $f$ is assumed to belong.

The common feature is that functions in $\mathcal{F}$ should be able to approximate a very large range of functions, such as the set of all continuous functions, or the set of all square-integrable functions. We will now look at a variety of different ways to choose $\mathcal{F}$ and hence a nonparametric regression model. This section focuses only on the two classes of methods that most closely relate to neural networks: local methods and basis representations. Note that it is not critical for the flow of this book for the reader to understand all of the details in this section; the variety of methods is provided as context for how neural networks fit into the statistical literature. Descriptions of the methods here are rather brief, with references provided for further information. While going through the methods in this chapter, the reader may find Figure 2.3 helpful in understanding the relationships between these methods.

2.1.1 Local Methods

Some of the simplest approaches to nonparametric regression are those which operate locally. In one dimension, a moving average with a fixed window size would be an obvious example that is both simple and local. With a fixed window, the moving average estimate is a step function. More sophisticated things can be done in terms of the choice of window, the weighting of the averaging, or the shape of the fit over the window, as demonstrated by the following methods.

The idea of kernel smoothing is to use a moving weighted average, where the weight function is referred to as a kernel. A kernel is a continuous, bounded, symmetric function whose integral is one. The simplest kernel would be the indicator function over the interval from $-\frac{1}{2}$ to $\frac{1}{2}$, i.e., $K(x) = I_{[-1/2, 1/2]}(x)$.
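To make this concrete, here is a minimal Python sketch of such a moving weighted average (a Nadaraya-Watson-type estimate); the boxcar kernel reproduces the fixed-window moving average above, and the data are synthetic stand-ins.

```python
import numpy as np

def boxcar_kernel(u):
    """The simplest kernel: the indicator of [-1/2, 1/2]."""
    return ((u >= -0.5) & (u <= 0.5)).astype(float)

def kernel_smooth(x, y, grid, bandwidth, kernel=boxcar_kernel):
    """Moving weighted average: weights come from the kernel,
    rescaled by the bandwidth h via K((x - x0) / h)."""
    fitted = np.full(len(grid), np.nan)
    for m, x0 in enumerate(grid):
        w = kernel((x - x0) / bandwidth)
        if w.sum() > 0:
            fitted[m] = np.sum(w * y) / w.sum()
    return fitted

# Toy one-dimensional example.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, size=200))
y = np.sin(x) + rng.normal(scale=0.3, size=200)
print(kernel_smooth(x, y, np.linspace(0, 10, 5), bandwidth=1.0))
```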

2.3 Neural Networks

The basic neural network regression model expresses the response as a linear combination of logistic basis functions of the explanatory variables:

$y_i = \beta_0 + \sum_{j=1}^{k} \beta_j \Psi(\gamma_j^T x_i) + \epsilon_i$,   (2.6)

where $\Psi$ is the logistic function:

$\Psi(z) = \frac{1}{1 + e^{-z}}$.   (2.7)

In words, a neural network is usually described in terms of its hidden nodes. Each hidden node can be thought of as a function, taking a linear combination of the explanatory variables ($\gamma_j^T x_i$) as an input and returning the logistic transformation (equation (2.7)) as its output. The neural network then takes a linear combination of the outputs of the hidden nodes to give the fitted value at a point. Figure 2.4 shows this process pictorially. (Note that this diagram is an expanded version of Figure 1.5.)

Figure 2.4. Neural network model diagram.

Combining equations (2.6) and (2.7) and expanding the vector notation yields the following equation, which shows the model in its full glory (or goriness):

$y_i = \beta_0 + \sum_{j=1}^{k} \frac{\beta_j}{1 + \exp(-\gamma_{j0} - \sum_{h=1}^{r} \gamma_{jh} x_{ih})} + \epsilon_i$,

where $k$ is the number of basis functions and $r$ is the number of explanatory variables. The parameters of this model are $k$ (the number of hidden nodes), $\beta_j$ for $j \in \{0, \ldots, k\}$, and $\gamma_{jh}$ for $j \in \{1, \ldots, k\}$, $h \in \{0, \ldots, r\}$. For a fixed network size (fixed $k$), the number of parameters, $d$, is

$d = k(r + 1) + (k + 1)$,

where the first term in the sum is the number of $\gamma_{jh}$ and the second is the number of $\beta_j$.
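As a check on this bookkeeping, here is a short numpy sketch of the model: it computes the $k$ logistic basis functions of equation (2.7), takes their linear combination as in equation (2.6), and counts the parameters; the sizes and random coefficient values are illustrative assumptions.

```python
import numpy as np

def neural_net_regression(X, beta, gamma):
    """Equation (2.6): beta_0 plus a linear combination of logistic basis functions.
    X: (n, r) explanatory variables; beta: (k+1,); gamma: (k, r+1)."""
    X1 = np.column_stack([np.ones(len(X)), X])    # prepend 1 so gamma_j0 is an intercept
    basis = 1.0 / (1.0 + np.exp(-X1 @ gamma.T))   # (n, k) logistic basis functions
    return beta[0] + basis @ beta[1:]

k, r, n = 3, 9, 330                               # e.g., an ozone-sized problem
rng = np.random.default_rng(0)
X = rng.uniform(size=(n, r))
beta = rng.normal(size=k + 1)
gamma = rng.normal(size=(k, r + 1))
print("fitted values:", np.round(neural_net_regression(X, beta, gamma)[:3], 3))
print("parameter count d =", gamma.size + beta.size)   # k(r+1) + (k+1) = 34
```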

It is clear from equation (2.6) that a neural network is simply a nonparametric regression using a basis representation (compare to equation (2.4)), with the $\Psi_j$'s, location-scale logistic functions, as the bases and the $\beta_j$'s as their coefficients. The (infinite) set of location-scale logistic functions does span the space of continuous functions, as well as the space of square-integrable functions (Cybenko (1989); Funahashi (1989); Hornik, Stinchcombe, and White (1989)). While these bases are not orthogonal, they have turned out to be quite useful in practice, with a relatively small number able to approximate a wide range of functions.

The key to understanding a neural network model is to think of it in terms of basis functions. Interpretation of the individual parameters will be deferred to Chapter 3. For now, we note that if we use a single basis function (hidden node) and restrict $\beta_0 = 0$ and $\beta_1 = 1$, then this simple neural network is exactly the same as a standard logistic regression model. Thus a neural network can be interpreted as a linear combination of logistic regressions.

2.3.1 Neural Networks Are Statistical Models

From equation (2.6), it is obvious that a neural network is a standard parametric model, with a likelihood and parameters to be fit. When compared to the models in Section 2.1.2, one can see how neural networks are just another example of a basis function method for nonparametric regression (classification will similarly be discussed shortly). While the form of equation (2.6) may not be the most intuitive expression in the world, it is a special case of equations (2.2) and (2.4) and is clearly a model. It is not magic, and it is only a black box if the user wants to pretend they have never looked at equation (2.6). While some people have claimed that neural networks are purely algorithms and not models, it should now be apparent that they are both algorithms and models.

By viewing neural networks as statistical models, we can now apply many other ideas in statistics in order to understand, improve, and appropriately use these models. The disadvantage of viewing them as algorithms is that it can be difficult to apply knowledge from other algorithms. Taking the model-based perspective, we can be more systematic in discussing issues such as choosing a prior, building a model, checking assumptions for the validity of the model, and understanding uncertainty in our predictions.

It is also worth noting that a neural network can be viewed as a special case of projection pursuit regression (equation (2.3)), where the arbitrary smooth functions are restricted to scaled logistic functions. Furthermore, in the limit with an arbitrarily large number of basis functions, a neural network can be made to converge to a Gaussian process model (Neal (1996)).

2.3.2 A Brief History of Neural Networks

While neural networks are statistical models, they have mostly been developed from the algorithmic perspective of machine learning. Neural networks were originally created as an attempt to model the act of thinking by modeling neurons in a brain. Much of the early work in this area traces back to a paper by McCulloch and Pitts (1943), which introduced the idea of an activation function, although the authors used a threshold (indicator) function rather than the sigmoidal functions common today (an S-shaped function that has horizontal asymptotes in both directions from its center and rises smoothly and monotonically between its asymptotes, e.g., equation (2.7)). This particular model did not turn out to be appropriate for modeling brains but did eventually lead to useful statistical models. Modern neural networks are sometimes referred to as artificial neural networks to emphasize that there is no longer any explicit connection to biology.

Early networks of threshold function nodes were explored by Rosenblatt (1962) (calling them perceptrons, a term that is sometimes still used for nodes) and Widrow and Hoff (1960) (calling them adalines). Threshold activations were found to have severe limitations (Minsky and Papert (1969)), and thus sigmoidal activations became widely used instead (Anderson (1982)).

Much of the recent work on neural networks stems from renewed interest generated by Rumelhart, Hinton, and Williams (1986) and their backpropagation algorithm for fitting the parameters of the network. A number of key papers followed, including Funahashi (1989), Cybenko (1989), and Hornik, Stinchcombe, and White (1989), that showed that neural networks are a way to approximate a function arbitrarily closely as the number of hidden nodes gets large. Mathematically, they have shown that the infinite set of location-scale logistic functions is indeed a basis set for many common spaces, such as the space of continuous functions, or square-integrable functions.

2.3.3 Multivariate Regression

To extend the above model to a multivariate $y$, we simply treat each dimension of $y$ as a separate output and add a set of connections from each hidden node to each of the dimensions of $y$. In this implementation, we assume that the error variance is the same for all dimensions, although this would be easily generalized to separate variances for each component. Denote the vector of the $i$th observation as $y_i$, so that the $g$th component (dimension) of $y_i$ is denoted $y_{ig}$. The model is now

$y_{ig} = \beta_{0g} + \sum_{j=1}^{k} \beta_{jg} \Psi(\gamma_j^T x_i) + \epsilon_{ig}$.

Thus each dimension of the output is modeled as a different linear combination of the same basis functions. This is displayed pictorially in Figure 2.5.
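A brief numpy sketch of this shared-basis structure (array shapes are illustrative assumptions): the hidden-node outputs are computed once, and each response dimension applies its own coefficient column.

```python
import numpy as np

def multivariate_net(X, beta, gamma):
    """Multivariate response network: beta is (k+1, q), one column per output
    dimension g; the (n, k) matrix of basis functions is shared by all outputs."""
    X1 = np.column_stack([np.ones(len(X)), X])
    basis = 1.0 / (1.0 + np.exp(-X1 @ gamma.T))     # shared hidden nodes
    return beta[0] + basis @ beta[1:]               # (n, q) fitted values

rng = np.random.default_rng(0)
n, r, k, q = 100, 4, 3, 2                           # q = number of output dimensions
X = rng.normal(size=(n, r))
beta = rng.normal(size=(k + 1, q))
gamma = rng.normal(size=(k, r + 1))
print(multivariate_net(X, beta, gamma).shape)       # (100, 2)
```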

Figure 2.5. Multivariate response neural network model diagram.

2.3.4 Classification

The multivariate response model is easily extended to the problem of classification. There are several possible approaches, two of which will be discussed here. The first is the more common approach of transforming the output to the probability scale, leading to a standard multinomial likelihood. The second approach uses latent variables, retaining a Gaussian likelihood of sorts, leaving it much closer to the regression model.

For the multinomial likelihood approach, the key to extending this model to classification is to express the categorical response variable as a vector of indicator (dummy) variables. A multivariate neural network is then fit to this vector of indicator variables, with the outputs transformed to a probability scale so that the multinomial likelihood can be used. Let $t_i$ be the categorical response (the target), with its value being the category number to which case $i$ belongs, $t_i \in \{1, \ldots, q\}$, where $q$ is the number of categories. Note that the ordering of the categories does not matter in this formulation. Let $y_i$ be a vector in the alternate representation of the response (a vector of indicator variables), where $y_{ig} = 1$ when $g = t_i$ and zero otherwise.

For example, suppose we are trying to classify tissue samples into the categories benign tumor, malignant tumor, and no tumor, and our observed dataset (the training set) is {benign, none, malignant, benign, malignant}. We have three categories, so q = 3, and (numbering the categories benign = 1, malignant = 2, none = 3) we get the following recoding:

    t = (1, 3, 2, 1, 2),    y_1 = (1, 0, 0),  y_2 = (0, 0, 1),  y_3 = (0, 1, 0),  y_4 = (1, 0, 0),  y_5 = (0, 1, 0).
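In code, this recoding is a one-hot encoding; a small sketch (hypothetical, with the category numbering used above):

    import numpy as np

    labels = ["benign", "none", "malignant", "benign", "malignant"]
    categories = ["benign", "malignant", "none"]         # categories 1, 2, 3
    t = np.array([categories.index(l) for l in labels])  # 0-based targets
    Y = np.eye(len(categories))[t]                       # rows are the indicator vectors y_i
    print(Y.astype(int))
    # [[1 0 0]
    #  [0 0 1]
    #  [0 1 0]
    #  [1 0 0]
    #  [0 1 0]]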


Let p_{ig} be the (true) underlying probability that y_{ig} = 1. To use a neural network to fit these probabilities, the continuous-valued output of the neural network, denoted w_{ig}, is transformed to the probability scale by exponentiating and normalizing by the sum of all exponentiated outputs for that observation. Thus the likelihood is

    L = \prod_{i=1}^{n} \prod_{g=1}^{q} p_{ig}^{y_{ig}}.

The parameters p_{ig} are deterministic transformations of the neural network parameters beta and gamma:

    p_{ig} = \frac{\exp(w_{ig})}{\sum_{h=1}^{q} \exp(w_{ih})}, \qquad w_{ig} = \beta_{0g} + \sum_{j=1}^{k} \beta_{jg} \Psi\Big(\gamma_{j0} + \sum_{h=1}^{r} \gamma_{jh} x_{ih}\Big),

where i = 1, ..., n represents the different cases, j = 1, ..., k are the different hidden nodes (logistic basis functions), and g = 1, ..., q are the different classes being predicted.

Note that in practice, only the first q - 1 elements of y are used in fitting the neural network so that the problem is of full rank. The w_{iq} term is set to zero (for all i) for identifiability of the model.

For a fixed network size (fixed k), the number of parameters, d, is

    d = k(r + 1) + (k + 1)(q - 1),

where the first term in the sum is the number of gamma_{jh} and the second is the number of beta_{jg}.

This model is referred to as the softmax model in the field of computer science (Bridle (1989)). This method of reparameterizing the probabilities from a continuous scale can be found in other areas of statistics as well, such as generalized linear regression (McCullagh and Nelder (1989, p. 159)).

As an alternative to the multinomial likelihood, one can use a latent variable approach to modify the original model to directly do classification, retaining an underlying Gaussian likelihood, and thus retaining the properties of the original network (as well as being able to more readily reuse computer code).

Again, we code the categories numerically so that the ith observation is in category t_i, and we construct indicator variables for each of the q possible categories. Again, we run the neural network for the indicator variables and denote the continuous-valued outputs (predictions of the latent variables) by w_{ig}. Now instead of using a multinomial distribution for t_i, we deterministically set the fitted value to be

    \hat{t}_i = \arg\max_{g} w_{ig},


i.e., the fitted response is the category with the largest w_{ig}. Note that we set w_{iq} = 0 for all i, in order to ensure the model is well specified (otherwise the locations of all of the w_{ig}'s could be shifted by a constant without changing the model). Thus this model preserves the original regression likelihood but now applies it to latent variables, with the latent variables producing a deterministic fitted value on the categorical scale.

This approach also has the advantage of a simple extension to ordinal response variables, those which have a clear ordering but are categorical, where the differences between categories cannot be directly converted to distances, so treating them as continuous is not realistic. For example, a survey may offer the responses of excellent, good, fair, and poor, which are certainly ordered but with no natural distance metric between levels. To fit an ordinal variable, we again let t_i code the category of the ith observation, but this time the ordering of the categories is important. We no longer need the indicator variables, and instead we just fit a single neural network and denote its continuous output by w_i. We then convert this output to a category by dividing the real line into bins and matching the bins to the ordered categories. To be precise, we have additional parameters c_1, ..., c_{q-1}, with c_g marking the boundary between the bins for categories g and g + 1, so that the fitted category is g when w_i falls between c_{g-1} and c_g (taking c_0 = -infinity and c_q = +infinity).


Figure 2.6. Predicted probability of loan acceptance from a 6-node network using only the age of the applicant. An x marks a loan that was actually approved and an O a loan that was actually denied.

2.3.5 Other Flavors of Neural Networks

The exact form of neural network model presented above (equation (2.6)) is referred to as a single hidden layer feed-forward neural network with logistic activation functions and a linear output. As one might expect from all of the terms in that description, there are a large number of variations possible. We shall discuss these briefly, and then we will return to the above basic model for the rest of this book.

The first term is "single hidden layer," which means there is just one set of hidden nodes, the logistic basis functions, which are pictured as the middle row in Figure 2.4. With one set of hidden nodes, we have the straightforward basis representation interpretation. It is quite possible to add additional layers of hidden nodes, where each hidden node takes a linear combination of the outputs from the previous layer and produces its own output, which is a logistic transformation of its input linear combination. The outputs of each layer are taken as the inputs of the next layer. As was proved by several groups, a single layer is all that is necessary to span most spaces of interest, so there is no additional flexibility to be gained by using multiple layers (Cybenko (1989); Funahashi (1989); Hornik, Stinchcombe, and White (1989)). However, sometimes adding layers will give a more compact representation, whereby a complex function can be approximated by a smaller total number of nodes in multiple layers than the number of nodes necessary if only a single layer is used.


Figure 2.7. Predicted probability of loan acceptance from a 4-node network using only the age of the applicant. An x marks a loan that was actually approved and an O a loan that was actually denied.

A feed-forward network is one that has a distinct ordering to its layers so that the inputs to each layer are linear combinations of the outputs of the previous layer, as in Figure 2.4. Sometimes additional connections are made, connecting nonadjacent layers. For example, starting with our standard model, the third layer (the output prediction) could take as inputs a linear combination of both the hidden nodes (the second layer) and the explanatory variables (the first layer), which would give a model of the form

    y_i = \beta_0 + a'x_i + \sum_{j=1}^{k} \beta_j \Psi\Big(\gamma_{j0} + \sum_{h=1}^{r} \gamma_{jh} x_{ih}\Big) + \varepsilon_i,

which is often called a semiparametric model, because it contains both a parametric component (a standard linear regression, a'x_i) and a nonparametric component (the linear combination of location-scale logistic bases). Significantly more complex models can be made by allowing connections to go down layers as well, creating cycles in the graph of Figure 2.4.

For example, suppose there are four total levels, with two layers of hidden nodes; the inputs of the second row of hidden nodes are linear combinations of the first row of hidden nodes as usual, but the inputs of the first row of hidden nodes include both the explanatory variables as well as the outputs from the second row of hidden nodes, thus creating a cycle. Such networks require iterative solutions and substantially more computing time, and are not as commonly used.


Logistic basis functions are the most commonly used. In theory, any sigmoidal function can be used, as the proofs that the infinite set comprises a basis set depend only on the functions being sigmoidal. In practice, one of two sigmoidal functions is usually chosen, either the logistic or the hyperbolic tangent:

    \Psi_t(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}},

which is a simple transformation of the logistic function, \Psi_t(z) = 2\Psi(2z) - 1. The historical threshold functions discussed in Section 2.3.2 are no longer commonly used.
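The identity can be checked numerically; a quick sketch:

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    z = np.linspace(-5.0, 5.0, 101)
    print(np.allclose(np.tanh(z), 2.0 * logistic(2.0 * z) - 1.0))  # True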

A completely different type of basis function is sometimes used, a radial basis function, which is usually called a kernel in statistics. Thus a radial basis network is basically another name for what statisticians call a mixture model, as described in equation (2.5). Neural network models, particularly radial basis function versions, are sometimes used for density estimation problems (see, for example, Bishop (1995)).

In our basic model, we use a linear output in that the final prediction is simply a linear combination of the outputs of the hidden layer. In other implementations, the logistic transformation may also be applied to the final prediction as well, producing predictions in the unit interval (which could be rescaled again if necessary). There does not seem to be any substantial advantage to doing so in a regression problem (but it is sometimes done nonetheless). Obviously, if one is fitting values restricted to an interval (e.g., probabilities for a classification problem), such an additional transformation can be quite useful.

2.4 The Bayesian Paradigm

The two main philosophies of probability and statistics are the classical and Bayesian approaches. While they share much in common, they have a fundamental difference in how they interpret probabilities and on the nature of unknown parameters. In the classical (or frequentist) framework, probabilities are typically interpreted as long-run relative frequencies. Unknown parameters are thought of as fixed quantities, so that there is a right answer, even if we will never know what it is. The data are used to find a best guess for the values of the unknown parameters.

Under the Bayesian framework, probabilities are seen as inherently subjective, so that, for each person, their probability statements reflect their personal beliefs about the relative likeliness of the possible outcomes. Unknown parameters are considered to be random, so that they also have probability distributions. Instead of finding a single best guess for the unknown parameters, the usual approach is to estimate the distribution of the parameters using Bayes' theorem. First, a prior distribution must be specified, and this distribution is meant to reflect the subject's personal beliefs about the parameters, before the data are observed. In some cases, the prior will be constructed from previous experiments. This approach provides a mathematically coherent method for combining information from different sources. In other cases, the practitioner may either not know anything about the parameters or may want to use a prior that would be generally acceptable to many other people, rather than a personal one. Such priors are discussed further in Section 3.3. Once the prior, P(theta), is specified for the parameters, theta, Bayes' theorem combines the prior with


the likelihood, f(X|theta), to get the posterior:

    P(\theta \mid X) = \frac{f(X \mid \theta)\, P(\theta)}{\int f(X \mid \theta')\, P(\theta')\, d\theta'}.

One useful type of prior is a conjugate prior, one that when combined with the likelihood produces a posterior in the same family. For example, if the likelihood specifies that y_1, ..., y_n ~ N(mu, 1), then using a normal prior for mu, mu ~ N(a, 1) for some constant a, is a conjugate choice, because the posterior for mu will also be normal, in particular,

    \mu \mid y_1, \ldots, y_n \sim N\Big(\frac{a + \sum_{i=1}^{n} y_i}{n + 1}, \; \frac{1}{n + 1}\Big).

Note that the idea of conjugacy is likelihood dependent, in that a prior is conjugate for a particular likelihood. Conjugate priors are widely used for convenience, as they lead to analytically tractable posteriors.
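A small numerical sketch of this normal-normal example (the data here are simulated, purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    a = 0.0                               # prior: mu ~ N(a, 1)
    y = rng.normal(2.0, 1.0, size=20)     # likelihood: y_i ~ N(mu, 1), true mu = 2

    n = len(y)
    post_mean = (a + y.sum()) / (n + 1)   # posterior: N((a + sum y_i)/(n+1), 1/(n+1))
    post_var = 1.0 / (n + 1)
    print(post_mean, post_var)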

In many cases, such as a neural network, there is no known conjugate prior. Instead, priors for individual parameters may be chosen to be conditionally conjugate, in that when all other parameters are conditioned upon (held fixed), the conditional posterior is of the same family as the conditional prior. For example, in Chapter 3, nearly all of the priors presented for neural networks will put a normal distribution on the beta_j parameters (possibly conditioned upon the other parameters), which is conditionally conjugate. The main benefit of conditional conjugacy is that it allows one to use Gibbs sampling, as will be described in Section 3.5, which greatly helps in model fitting.

This book will assume that the reader has some previous knowledge of the Bayesian approach. For more details on the Bayesian approach, as well as for discussion of the philosophical merits and criticisms of the Bayesian approach, the reader is referred to some of the many other sources available (Robert (2001); Congdon (2001); Carlin and Louis (2000); Gelman et al. (1995); Bernardo and Smith (1994); Press (1989); Hartigan (1983); Jeffreys (1961)).

The approach in this book will be the Bayesian one. However, it is more of a pragmatic Bayesian approach, rather than a dogmatic subjectivist approach. As we shall see in Section 3.1, in most cases the parameters do not have straightforward interpretations, and thus it is not feasible to put subjective knowledge into a prior. In the event that the data analyst has substantive prior information, a model with more interpretable parameters should be used instead of a neural network. If something is known about the shape of the data, another more accessible model should be chosen. The strength of neural networks is their flexibility, as was described in Section 2.3. A key benefit of the Bayesian approach is a full accounting of uncertainty. By estimating a posterior distribution, one obtains estimates of error and uncertainty in the process. Such uncertainty can also encompass the choice of model, as will be discussed in the next section.

2.5 Model Building

A full analysis of any dataset involves many steps, starting with exploratory data analysis and moving on to formal model building. The process of choosing a model is typically an iterative one. Outside of the simplest problems, it is quite rare for the first model specified to be the final model. First, one must check the validity of the assumptions. Then one should see if the data suggest that a different model may be more appropriate. For example, when fitting a multiple linear regression, it is important to check that the residuals do follow the assumptions, and, if not, then usually a transformation is performed on the data to improve


the situation. Also important is to check that the right set of variables has been included in the model. Variables which are found to be irrelevant or redundant are removed. Variables which were not initially included but are later found to be helpful would be added. Thus a series of models will be investigated before settling upon a final model or set of models.

The same procedure applies to modeling via neural networks. It is important to check the key assumptions of the model. Residual plots should be made to verify normality, independence, and constant variance (homoscedasticity). Violations of any assumption call for a remedy, typically a transformation of one or more variables. Many of the guidelines for linear regression are applicable here as well.

Also important is the issue of model selection (or some alternative, such as model averaging; see Section 4.1). There are two parts to this issue when dealing with neural networks: choosing the set of explanatory variables and choosing the number of hidden nodes. (An additional complication could be choosing the network structure, if one considers networks beyond just single hidden layer feed-forward neural networks with logistic activation functions and a linear output, but that is beyond the scope of this book.)

First consider choosing an optimal set of explanatory variables. For a given dataset, using more explanatory variables will improve the fit. However, if the variables are not actually relevant, they will not help with prediction. Consider the mean square error for predictions, which has two components: the square of the prediction bias (expected misprediction) and the prediction variance. The use of irrelevant variables will have no effect on the prediction bias but will increase the prediction variance. Inclusion of unique relevant variables will reduce both bias and variance. Inclusion of redundant (e.g., correlated) variables will not change the prediction bias but will increase the variance. Thus the goal is to find all of the nonredundant useful explanatory variables, removing all variables which do not improve prediction.
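In symbols, at a point x with true mean f(x) and predictor yhat(x), the familiar decomposition is

    E\big[(\hat{y}(x) - f(x))^2\big] = \big(E[\hat{y}(x)] - f(x)\big)^2 + \mathrm{Var}\big(\hat{y}(x)\big),

so pruning irrelevant or redundant variables attacks the variance term while leaving the bias term alone.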

The second aspect is selecting the optimal number of hidden nodes (or basis functions). Using more nodes allows a more complex fitted function. Fewer nodes lead to a smoother function. For a particular dataset, the more nodes used, the better the fit. With enough hidden nodes, the function can fit the data perfectly, becoming an interpolating function. However, such a function will usually perform poorly at prediction, as it fluctuates wildly in attempting to model the noise in the data in addition to the underlying trend. Just as with any other nonparametric procedure, an optimal amount of smoothing must be found so that the fitted function is neither overfitting (not smooth enough) nor underfitting (too smooth).

Both of these aspects of model selection will be discussed in Chapter 4. That chapter will investigate criteria for selection, as well as methods for searching through the space of possible models. It will also discuss some alternatives to the traditional approach of choosing only a single model.


Chapter 3

Priors for Neural Networks

One of the key decisions in a Bayesian analysis is the choice of prior. The idea is that one's prior should reflect one's current beliefs (either from previous data or from purely subjective sources) about the parameters before one has observed the data. This task turns out to be rather difficult for a neural network, because in most cases the parameters have no interpretable meaning, merely being coefficients in a nonstandard basis expansion (as described in Section 2.3). In certain special cases, the parameters do have intuitive meanings, as will be discussed in the next section. In general, however, the parameters are basically uninterpretable, and thus the idea of putting beliefs into one's prior is rather quixotic. The next two sections discuss several practical choices of priors. This is followed by a practical discussion of parameter estimation, a comparison of some of the priors in this chapter, and some theoretical results on asymptotic consistency.

3.1 Parameter Interpretation and Lack Thereof

In some specific cases, the parameters of a neural network have obvious interpretations. We will first look at these cases, and then a simple example will show how things can quickly go wrong and become virtually uninterpretable.

The first case is that of a neural network with only one hidden node and one explanatory variable. The model for the fitted values is

    \hat{y} = \beta_0 + \beta_1 \Psi(\gamma_0 + \gamma_1 x),

where \Psi is the logistic function.

Figure 3.1 shows this fitted function for beta_0 = -2, beta_1 = 4, gamma_0 = -9, and gamma_1 = 15, over x in the unit interval.
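A sketch evaluating this one-node fit (the parameter values, including the signs of beta_0 and gamma_0, follow the reconstruction above and should be checked against Figure 3.1):

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    def one_node(x, b0, b1, g0, g1):
        return b0 + b1 * logistic(g0 + g1 * x)

    b0, b1, g0, g1 = -2.0, 4.0, -9.0, 15.0
    print("center:", -g0 / g1)                              # 0.6
    print(one_node(np.linspace(0.0, 1.0, 5), b0, b1, g0, g1))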

The parameter beta_0 is an overall location parameter for y, in that y = beta_0 when the logistic function is close to zero (in this case when x is near or below zero, so here beta_0 is the y-intercept). The parameter beta_1 is an overall scale factor for y, in that the logistic function ranges from 0 to 1, and thus beta_1 is the range of y, which in this case is 2 - (-2) = 4. The gamma parameters control the location and scale of the logistic function. The center of the logistic function occurs at -gamma_0/gamma_1, here 0.6.


Figure 3.1. Fitted function with a single hidden node.

For larger values of gamma_1, y changes at a steeper rate in the neighborhood of the center, leveling off as x gets farther away from the center. If gamma_1 is positive, then the logistic rises from left to right. If gamma_1 is negative, then the logistic will decrease as x increases.

Note that a similar interpretation can be used when there are additional explanatory variables, with the gamma_0, gamma_1, and x above being replaced by a linear combination of the x's with coefficients gamma_j, so that the logistic rises with respect to the entire linear combination.

This interpretation scheme can also be extended to additional logistic functions, as long as their centers are spaced sufficiently far apart from each other. In that case, for any given value of x, it will be close to the center of at most one logistic, and it will be in the tails of all others (where a small change in x does not produce a noticeable change in y). Thus the parameters of each logistic function can be dealt with separately, using the previous interpretation locally.

Two logistics with opposite signs on gamma_1 can be placed so that they are somewhat but not too close to each other, and together they form a bump, jointly acting much like a kernel or a radial basis function. One could attempt to apply some of the interpretations of mixture models here, but in practice, fitted functions rarely behave in this manner.

In most real-world problems, there will be some overlap of the logistic functions, which can lead to many other sorts of behavior in the neural network predictions, effectively removing any interpretability. A simple example is shown in Figure 3.2, which deals with a model with a single explanatory variable and only two hidden nodes (logistic functions). This example involves fitting the motorcycle accident data of Silverman (1985), which tracks the acceleration force on the head of a motorcycle rider in the first moments after


impact. Here a neural network with two hidden nodes was used, and the maximum likelihood fit is shown in Figure 3.2. This model fits the data somewhat well, although it does appear to oversmooth in the first milliseconds after the crash. The point is not that this is the best possible fit but rather that this particular fit, which is the maximum likelihood fit, displays interesting behavior with respect to the parameters of the model. Remember that there are only two logistic functions producing this fit, yet there are three points of inflection in the fitted curve. Thus the previous interpretations of the parameters no longer apply. What is going on here is that the two logistic functions have centers very close to each other, so that their ranges of action interact, and do so in a highly nonlinear fashion. These logistic bases are shown in