SAS Programming and Data Analysis Portfolio - BTReilly



Submitted as part of the curriculum in SAS Programming and Data Analysis

Portfolio by Brian Thomas Reilly

Florida State University
College of Arts and Sciences, Department of Statistics


STA3024 SAS for Data and Statistical Analysis .................... page 2
STA4202 Analysis of Variance and Design of Experiments ........... page 19
STA4203 Applied Regression Methods ............................... page 42
STA4853 Time Series and Forecasting Methods ...................... page 52

Portfolio for SAS Programming and Data Analysis Certificate

Submitted to Florida State University and SAS Institute by Brian Thomas Reilly


STA3024 Project C: SAS for Data and Statistical Analysis
Brian Reilly, Spring 2016


Transportation Security Administration (TSA)

Claims Data from 2002 to 2014

Project C – Formal Report

Table of Contents
1. Motivation for Study ........................... 4
2. Data Explanation ............................... 4
3. Graphical and Numerical Summary ................ 4
4. Statistical Test ............................... 8
5. Conclusions Drawn .............................. 9
6. Source Code .................................... 10


1. Motivation for the Study

As an Actuarial Science major, I am interested in exploring a real-world example of what a Property and Casualty (P&C) company might face when insuring airlines or airports: analyzing claims related to the Transportation Security Administration (TSA) screening process. I am also interested in examining the data with metrics from the actuarial field, such as payment per loss, payment per payment, and aggregate loss.

2. Data Explanation

According to data.gov, the data consists of "Claims made against TSA during a screening process of persons or passenger's property due to an injury, loss, or damage. Claims data information include claim number, incident date, claim type, claim amount, status, and disposition," for a given annual range.

The data is from the following four files:

a. http://www.dhs.gov/sites/default/files/publications/claims-2014.xls
b. http://www.dhs.gov/sites/default/files/publications/claims-2010-2013_0.xls
c. http://www.dhs.gov/sites/default/files/publications/claims-2007-2009_0.xls
d. http://www.dhs.gov/sites/default/files/publications/claims-2002-2006_0.xls

SAS will only read these files through the XLSX engine, so their extensions need to be renamed from ".xls" to ".xlsx". The data will be cleaned by omitting observations where the "Close Amount" and "Disposition" are missing, and the four files will be merged into one table. After cleaning, there are 171,006 observations.

3. Graphical and Numerical Summary

General Numerical Statistics
Number of Claims          171,006
Total Amount Paid         $15,485,087.70
Average Claim Payment     $90.55
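These figures come from the PROC UNIVARIATE "Moments" table in the source code (Section 6). As a sketch of my own (not in the original program), an equivalent one-step summary could be obtained with PROC SQL:

proc sql;
   select count(*)               as Number_of_Claims,
          sum('Close Amount'n)   as Total_Amount_Paid     format=dollar16.2,
          mean('Close Amount'n)  as Average_Claim_Payment format=dollar8.2
   from Claims;
quit;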


Claim Type

These pie charts display the number of claims of each claim type (left) and the total close amount for each type (right). Passenger Property Loss was the most prevalent claim type in both charts, followed by Property Damage; Personal Injury and Other were substantially smaller.

Payment per Loss by Claim Site

These vertical bar charts display the Payment per Loss by Claim Site. The chart on the left displays the frequency of payment for the Checked Baggage and Checkpoint sites; there were significantly more claims at the Checked Baggage site (131,330) than at the Checkpoint site (31,254). The chart on the right displays the Expected Value of the payment per claim compared to the Median Value for the Checked Baggage and Checkpoint sites. For the Checked Baggage site, both the Expected Value and the Median Value (which is virtually 0) are much lower than those for the Checkpoint site. These charts suggest that the frequency is worse for the Checked Baggage site, but the severity is worse for the Checkpoint site.

Payment per Payment by Claim Site

These vertical bar charts display the Payment per Payment by Claim Site. The chart on the left displays the frequency of payment for the Checked Baggage and Checkpoint sites. As with Payment per Loss, there were significantly more payments at the Checked Baggage site (55,201) than at the Checkpoint site (17,608), though the counts are slightly closer. The chart on the right displays the Expected Value of the payment per claim compared to the Median Value for the two sites. For the Checked Baggage site, both the Expected Value and Median Value are once again lower than those for the Checkpoint site, although closer to the Checkpoint values than they were in the Payment per Loss chart. Just as above, the frequency is worse for the Checked Baggage site, but the severity is worse for the Checkpoint site.
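For concreteness, here is a minimal sketch (not part of the original report) of how the two metrics differ, using the cleaned Claims dataset built in Section 6: payment per loss averages the close amount over all claims, zeros included, while payment per payment averages only over claims with a positive payment.

/* Payment per loss: mean close amount over all claims (zeros included) */
proc means data=Claims mean median n;
   var 'Close Amount'n;
run;

/* Payment per payment: mean close amount over paid claims only */
proc means data=Claims mean median n;
   where 'Close Amount'n > 0;
   var 'Close Amount'n;
run;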


Top 5 Worst Airlines

These vertical bar charts display the Top 5 Worst Airlines by Expected Payment per Loss (left) and Expected Payment per Payment (right). The close amounts are much higher for Expected Payment per Payment than for Expected Payment per Loss. Air Inter Europe, Arik Airlines, and Virgin Express appear in both Top 5 charts, and in both the close amount is drastically higher for Air Inter Europe than for any other airline. Based on frequency and close amount in these charts, Air Inter Europe, Virgin Express, and Japan Airlines Express are the top three overall worst airlines.


4. Statistical Test

A more analytical test for the difference in means among the claim sites is Tukey's studentized range test (HSD), requested through the MEANS statement of PROC ANOVA. This test provides groupings of claim sites with similar means. In addition, the CLDIFF option provides the numerical difference between each pair of claim sites with simultaneous confidence limits. It is likely that there will be a significant difference between the means of the Checkpoint and Checked Baggage sites.
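This is the call used in the source code (Section 6) to produce the groupings below:

proc anova data=TSA.Claims plots=none;
   class 'Claim Site'n;
   model 'Close Amount'n = 'Claim Site'n;
   /* Tukey HSD: LINES prints groupings, CLDIFF prints pairwise
      differences with simultaneous confidence limits */
   means 'Claim Site'n / tukey lines cldiff;
run;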

Means with the same letter are not significantly different.

Tukey Grouping     Mean        N   Claim Site
A                2093.5      287   Motor Vehicle
B                 173.7    32756   Checkpoint
B                 143.8     1488   Other
B                  65.9   136140   Checked Baggage
B                  40.0       10   Bus Station
B                  27.5       53   -

Clearly there is a statistically significant difference between the mean of Motor Vehicle claims and all other claim sites. Therefore, the data will be restricted to only Checked Baggage and Checkpoint.

Means with the same letter are not significantly different.

Tukey Grouping      Mean        N   Claim Site
A                173.655    32756   Checkpoint
B                 65.852   136140   Checked Baggage

Comparisons significant at the 0.1 level are indicated by ***.

Claim Site Comparison          Difference Between Means   Simultaneous 90% Confidence Limits
Checkpoint - Checked Baggage                    107.803       103.691   111.914   ***
Checked Baggage - Checkpoint                   -107.803      -111.914  -103.691   ***

These tables suggest that there is a significant difference of $107.80 between the mean close amounts at Checkpoint and Checked Baggage. The difference remains significant at the 90% confidence level, which is lower than the default of 95%.


5. Conclusions Drawn

It would be better to lose an item at a checkpoint instead of in checked baggage, by about $107.80 on average, a difference that holds at the 90% simultaneous confidence level used in the final test. The average compensation per loss at Checkpoint is more than double the average compensation at Checked Baggage. When considering the average payment per settled claim, the gap between Checkpoint and Checked Baggage narrows, yet the average compensation per payment at Checkpoint is still significantly higher. Also, from 2004 to 2014 the number of claims diminished: around 2004 there were over five thousand claims per year, while by 2014 the number had fallen to around one thousand.


6. Source Code

/* Import the claims data from 2014. */
proc import datafile="/home/btr09c/STA3024/Project/claims-2014.xlsx"
            dbms=xlsx replace out=Claims14;
   getnames=yes;
run;

data Claims14;
   /* Preserve data in temporary columns */
   set Claims14(rename=('Close Amount'n=AmountChar 'Claim Number'n=ClaimNum));
   /* Convert AmountChar to numeric; ?? suppresses the invalid data
      message and prevents setting _ERROR_ */
   'Close Amount'n = input(AmountChar, ?? 12.);
   /* Convert ClaimNum to character */
   'Claim ID'n = put(ClaimNum, 16.);
   /* Drop the temporary columns AmountChar and ClaimNum as well as the
      unwanted 'Item Category' and 'Incident Date' */
   drop AmountChar ClaimNum 'Item Category'n 'Incident Date'n;
run;

/* Sort for the merge later */
proc sort data=Claims14;
   by 'Claim ID'n;
run;

/* Import the claims data from 2010 to 2013. */
proc import datafile="/home/btr09c/STA3024/Project/claims-2010-2013.xlsx"
            dbms=xlsx replace out=Claims1013;
   getnames=yes;
run;

data Claims1013;
   /* Same conversions as above */
   set Claims1013(rename=('Close Amount'n=AmountChar 'Claim Number'n=ClaimNum));
   'Close Amount'n = input(AmountChar, ?? 12.);
   'Claim ID'n = put(ClaimNum, 16.);
   drop AmountChar ClaimNum 'Item Category'n 'Incident Date'n;
run;

proc sort data=Claims1013;
   by 'Claim ID'n;
run;


/* Import the claims data from 2007 to 2009. */
proc import datafile="/home/btr09c/STA3024/Project/claims-2007-2009.xlsx"
            dbms=xlsx replace out=Claims0709;
   getnames=yes;
run;

data Claims0709;
   /* Preserve data in a temporary column */
   set Claims0709(rename=('Claim Number'n=ClaimNum));
   /* Convert ClaimNum to character */
   'Claim ID'n = put(ClaimNum, 16.);
   /* Drop the temporary column ClaimNum as well as the unwanted columns
      Item, 'Claim Amount', 'Incident Date', and Status */
   drop ClaimNum Item 'Incident Date'n 'Claim Amount'n Status;
run;

proc sort data=Claims0709;
   by 'Claim ID'n;
run;

/* Import the claims data from 2002 to 2006. */
proc import datafile="/home/btr09c/STA3024/Project/claims-2002-2006.xlsx"
            dbms=xlsx replace out=Claims0206;
   getnames=yes;
run;

data Claims0206;
   set Claims0206(rename=('Claim Number'n=ClaimNum));
   'Claim ID'n = put(ClaimNum, 16.);
   drop ClaimNum Item 'Incident Date'n 'Claim Amount'n Status;
run;

proc sort data=Claims0206;
   by 'Claim ID'n;
run;

/* Suppress the warning for different lengths of 'Airline Name'.
   Note: the GUESSINGROWS option is not available for XLSX in PROC IMPORT. */
options varlenchk=nowarn;


/* Combine the claims from 2002 to 2014 and clean the data. */
data TSA.Claims;
   merge Claims14 Claims1013 Claims0709 Claims0206;
   by 'Claim ID'n;
   /* Dispositions of '-' are omitted as missing data */
   if Disposition='-' then delete;
   /* Dispositions with missing data are omitted */
   else if Disposition=' ' then delete;
   /* Missing 'Date Received' are omitted */
   else if 'Date Received'n=. then delete;
   /* Missing 'Close Amount' are treated as zero */
   else if 'Close Amount'n=. then 'Close Amount'n=0;
   /* Denied claims should have close amounts of zero */
   else if Disposition='Deny' AND 'Close Amount'n>0
      then 'Close Amount'n=0;
run;

/* Delete the partial year-range datasets */
proc datasets library=work nolist;   /* NOLIST suppresses output */
   delete Claims14 Claims1013 Claims0709 Claims0206;
run;

/* Print 50 observations */
proc print data=TSA.Claims(obs=50);
run;

/* PROJECT B - NEW MATERIAL */
data Claims;
   set TSA.Claims;
run;

/* Basic summary statistics. Under "Moments": N = total number of claims,
   Sum Observations = sum of all close amounts,
   Mean = average amount for all claims. */
proc univariate data=Claims;
   var 'Close Amount'n;
run;

/*** DESCRIBE CLAIM TYPES ***/
/* Frequency table for Claim Type */
proc freq data=Claims;
   table 'Claim Type'n;
run;


/* Pie chart for frequency of claims by Claim Type */
goptions colors=(lightseagreen violet lightyellow lightsalmon lightgrey lightcoral);
proc gchart data=Claims;
   pie 'Claim Type'n / fill=solid value=inside percent=inside slice=inside;
run; quit;

/* Pie chart for sum of "Close Amount" by Claim Type */
goptions colors=(lightsalmon lightseagreen lightyellow lightpink lightgrey lightcoral);
proc gchart data=Claims;
   pie 'Claim Type'n / sumvar='Close Amount'n
                       fill=solid value=inside percent=inside slice=inside;
run; quit;

/*** MORE DATA CLEANING AND PROCESSING ***/
/* Subset claims for personal property types */
data all_personal;
   set Claims;
   if ('Claim Type'n = "Passenger Property Loss" OR
       'Claim Type'n = "Property Damage" OR
       'Claim Type'n = "Passenger Theft") then;
   else delete;
run;

/* Subset claims for non-zeros in "Close Amount" */
data NonZero_personal;
   set all_personal;
   if ('Close Amount'n = 0) then delete;
run;

/*** CARRY-ON VS CHECKED ***/
/* Subset all personal claims at "Checkpoint" and "Checked Baggage" */
data all_site;
   set all_personal;
   if ('Claim Site'n = "Checkpoint" OR 'Claim Site'n = "Checked Baggage") then;
   else delete;
run;


/* Subset non-zero-amount personal claims at "Checkpoint" and "Checked Baggage" */
data NonZero_site;
   set NonZero_personal;
   if ('Claim Site'n = "Checkpoint" OR 'Claim Site'n = "Checked Baggage") then;
   else delete;
run;

/* Display carry-on versus checked */
title color=magenta "Frequency of Payment per Loss by Claim Site";
proc sgplot data=all_site;
   vbar 'Claim Site'n / stat=freq fillattrs=(color=magenta) datalabel;
run;

title color=magenta "Payment per Loss by Claim Site";
proc sgplot data=all_site;
   vbar 'Claim Site'n / response='Close Amount'n stat=mean
        barwidth=.35 discreteoffset=-.2 legendlabel="Expected Value";
   vbar 'Claim Site'n / response='Close Amount'n stat=median
        barwidth=.35 discreteoffset=.2 legendlabel="Median Value";
run;

title color=navy "Frequency of Payment per Payment by Claim Site";
proc sgplot data=NonZero_site;
   vbar 'Claim Site'n / stat=freq fillattrs=(color=navy) datalabel;
run;

title color=navy "Payment per Payment by Claim Site";
proc sgplot data=NonZero_site;
   vbar 'Claim Site'n / response='Close Amount'n stat=mean
        barwidth=.35 discreteoffset=-.2 legendlabel="Expected Value";
   vbar 'Claim Site'n / response='Close Amount'n stat=median
        barwidth=.35 discreteoffset=.2 legendlabel="Median Value";
run;

/* Worst airlines */
/* Top 5 average close amount (payment per loss) by airline */
data all_has_airlines;
   set all_personal;
   if 'Airline Name'n = '' then delete;
   else if 'Airline Name'n = '-' then delete;
run;


/* Sort for PROC MEANS */
proc sort data=all_has_airlines;
   by 'Airline Name'n;
run;

/* Get statistics for "Close Amount" by airline */
proc means data=all_has_airlines noprint;
   var 'Close Amount'n;
   by 'Airline Name'n;
   output out=all_stat_airline;
run;

/* Keep only the mean rows for "Close Amount" by airline */
data all_mean_airline;
   set all_stat_airline;
   if _STAT_ ^= "MEAN" then delete;
run;

/* Sort means descending for "Close Amount" by airline */
proc sort data=all_mean_airline;
   by descending 'Close Amount'n;
run;

data top5_avg_amt_all_airline;
   set all_mean_airline;
   if _N_ > 5 then delete;
run;

proc print data=top5_avg_amt_all_airline;
run;

title color=magenta "Top 5 Expected Payment per Loss by Airline";
proc sgplot data=top5_avg_amt_all_airline;
   vbar 'Airline Name'n / response='Close Amount'n
        datalabel=_FREQ_ fillattrs=(color=magenta);
run;

/* Top 5 average close amount (payment per payment) by airline */
data nonzero_has_airlines;
   set NonZero_personal;
   if 'Airline Name'n = '' then delete;
   else if 'Airline Name'n = '-' then delete;
run;

/* Sort for PROC MEANS */
proc sort data=nonzero_has_airlines;
   by 'Airline Name'n;
run;

/* Get statistics for "Close Amount" by airline */
proc means data=nonzero_has_airlines noprint;
   var 'Close Amount'n;
   by 'Airline Name'n;
   output out=nonzero_stat_airline;
run;


/* Keep only the mean rows for "Close Amount" by airline */
data nonzero_mean_airline;
   set nonzero_stat_airline;
   if _STAT_ ^= "MEAN" then delete;
run;

/* Sort means descending for "Close Amount" by airline */
proc sort data=nonzero_mean_airline;
   by descending 'Close Amount'n;
run;

data top5_avg_amt_nonzero_airline;
   set nonzero_mean_airline;
   if _N_ > 5 then delete;
run;

title;
proc print data=top5_avg_amt_nonzero_airline;
run;

title color=navy "Top 5 Expected Payment per Payment by Airline";
proc sgplot data=top5_avg_amt_nonzero_airline;
   vbar 'Airline Name'n / response='Close Amount'n
        datalabel=_FREQ_ fillattrs=(color=navy);
run;

/* ~~~~~~ OVER TIME ~~~~~~ */
/* All claims with year */
data all_year;
   set all_personal;
   y = year('Date Received'n);
   put y;   /* writes each year to the log for a quick sanity check */
   if (y < 2002 OR y > 2014) then delete;
run;

/* Non-zero claims with years */
data nonzero_year;
   set all_year;
   if ('Close Amount'n = 0) then delete;
run;

/* Settled claims with years */
data settled_year;
   set all_year;
   if (Disposition ^= "Settle") then delete;
run;

/* Frequency of all claims by year */
proc freq data=all_year noprint;
   table y / out=freq_all_year;
run;


/* Frequency of non-zero amounts by year */
proc freq data=nonzero_year noprint;
   table y / out=freq_nonzero_year;
run;

/* Frequency of settled claims by year */
proc freq data=settled_year noprint;
   table y / out=freq_settled_year;
run;

/* Merge frequencies for payment per payment and payment per loss;
   rename the frequency counts accordingly */
data disp_year_freq;
   merge freq_all_year     (rename=(count=pmt_loss_freq))
         freq_nonzero_year (rename=(count=pmt_pmt_freq))
         freq_settled_year (rename=(count=settle_freq));
run;

/* Plot the frequency of claims by year */
title "Claim Frequency by Year";
proc sgplot data=disp_year_freq;
   vline y / response=pmt_loss_freq legendlabel="Payment per Loss"
             lineattrs=(color=magenta);
   vline y / response=pmt_pmt_freq legendlabel="Payment per Payment"
             lineattrs=(color=navy);
   vline y / response=settle_freq legendlabel="Paid in Full"
             lineattrs=(color=green);
run;

/* Expected payment per payment and per loss over time */
/* Sort for PROC MEANS */
proc sort data=all_year;
   by y;
run;
proc sort data=nonzero_year;
   by y;
run;

/* Statistics of all claims by year */
proc means data=all_year noprint;
   var 'Close Amount'n;
   by y;
   output out=mean_all_year;
run;

/* Statistics of the non-zero claims by year */
proc means data=nonzero_year noprint;
   var 'Close Amount'n;
   by y;
   output out=mean_nonzero_year;
run;


/* Merge the payment per payment and payment per loss means;
   rename the mean columns accordingly */
data disp_year_avg;
   merge mean_all_year     (rename=('Close Amount'n=pmt_loss_mean))
         mean_nonzero_year (rename=('Close Amount'n=pmt_pmt_mean));
   if _STAT_ ^= "MEAN" then delete;
   pay_perct = pmt_loss_mean / pmt_pmt_mean;
run;

title color=black "Expected Close Amount by Year";
proc sgplot data=disp_year_avg;
   vline y / response=pmt_loss_mean legendlabel="Payment per Loss"
             lineattrs=(color=magenta);
   vline y / response=pmt_pmt_mean legendlabel="Payment per Payment"
             lineattrs=(color=navy);
run;

/* Project C - New Material */
/* Test for the difference in means among the claim sites */
proc anova data=TSA.Claims plots=none;
   class 'Claim Site'n;
   model 'Close Amount'n = 'Claim Site'n;
   means 'Claim Site'n / tukey lines cldiff;
run;

/* Restrict the data to only Checked Baggage and Checkpoint */
data only_checked_and_checkpt;
   set TSA.Claims;
   if 'Claim Site'n = "-" then delete;
   else if 'Claim Site'n = "Other" then delete;
   else if 'Claim Site'n = "Bus Station" then delete;
   else if 'Claim Site'n = "Motor Vehicle" then delete;
run;

/* Retest with the restricted data */
proc anova data=only_checked_and_checkpt plots=none;
   class 'Claim Site'n;
   model 'Close Amount'n = 'Claim Site'n;
   /* lower the confidence to the 90% level to try to separate the groups */
   means 'Claim Site'n / tukey lines cldiff alpha=0.1;
run;



STA4202 Final Project: Analysis of Variance and Design of Experiments
Brian Reilly, Spring 2016


Experiments from A First Course in Design and Analysis of Experiments

By Gary W. Oehlert

Final Project

Table of Contents
1. Problem 13.6 – Soybean Varieties
   a. Two-way ANOVA Model ............................... 21
   b. Residual Analysis ................................. 22
   c. Interaction Plots ................................. 23
   d. ANOVA with Blocking Factor ........................ 24
   e. Estimate Average Difference and Standard Error .... 24
   f. Dunnett's t-Tests ................................. 24
   g. Tukey's Studentized Range (HSD) Test .............. 25
   h. Box-Cox Analysis .................................. 26
   i. Square Root Transformation ........................ 27
   j. Only Significant Factors .......................... 28
2. Problem 15.6 – Yogurt Freshness
   a. Number of Defining Contrasts ...................... 29
   b. Defining Contrast Identification .................. 29
   c. ANOVA Table ....................................... 29
   d. Only Significant Factors .......................... 29
3. Problem 17.1 – Pollutants and Bird Bone Strength
   a. Parallel Lines Model .............................. 30
   b. Separate Slopes Model ............................. 32
   c. Separate Lines Model .............................. 33
Source Code
   For #1 ............................................... 34
   For #2 ............................................... 39
   For #3 ............................................... 40


1. An experiment was conducted to determine how different soybean varieties compete against weeds.

There were sixteen varieties of soybeans and three weed treatments: no herbicide, herbicide applied 2 weeks after planting, and herbicide applied 4 weeks after planting. The measured response is weed biomass in kg/ha. The data are given in pr13_6 from Oehlert.

a) The fit for a tentative two-way ANOVA model with interaction:

Source      DF   Anova SS       Mean Square    F Value   Pr > F
treat        2   85183539.58    42591769.79      19.16   <.0001
var         15   25587029.17     1705801.94       0.77   0.7050
treat*var   30   22426627.08      747554.24       0.34   0.9989

Only the treatment effect looks statistically significant at the 99% confidence level; the variety and interaction effects do not.


b) Residual analysis for the above model:

The model appears to be fairly adequate: the residual histogram looks approximately normal and the QQ plot of the residuals is mostly linear.

[Figure: Fit Diagnostics for biomass – residuals vs. predicted, RStudent vs. predicted and vs. leverage, QQ plot of residuals, observed vs. predicted, Cook's D, residual histogram, and residual-fit spread panels. R-Square 0.5552, Adj R-Square 0.1198, MSE 2.22E6, 48 error DF, 48 parameters, 96 observations.]


c) The two interaction plots for the above model:

These plots are consistent with the non-significant interaction effect: the profiles in the treatment plot are roughly parallel, and although the variety plot is more variable, it shows no systematic pattern of crossing.

[Figures: Interaction Plot for biomass – biomass vs. treat with one profile per variety (Archer, Lambert, M88-250, M89-1006, M89-1743, M89-1926, M89-1946, M89-642, M89-792, M89-794, M90-1682, M90-317, M90-610, Ozzie, Parker, Sturdy), and biomass vs. var with one profile per treatment (none, herb4, herb2).]


d) There were two replications of the experiment, one in St. Paul, MN, and one in Rosemount, MN, and there could be variability due to the location. Here is the fit of another ANOVA model retaining the original two-way structure with interaction but also including the blocking factor under the assumption that the blocking factor does not interact with any other factor:

Source DF Anova SS Mean Square F Value Pr > F

loc 1 50634150.00 50634150.00 42.45 <.0001 treat 2 85183539.58 42591769.79 35.71 <.0001 var 15 25587029.17 1705801.94 1.43 0.1729 treat*var 30 22426627.08 747554.24 0.63 0.9116

The location (blocking factor) and treatment look statistically significant at the 99% confidence level.

e) The estimated average weed biomass difference between the no-herbicide treatment and herbicide applied after 4 weeks is 1432.2, and its standard error is 193.06072.

f) Here is the output of the Dunnett procedure:

The ANOVA Procedure
Dunnett's t Tests for biomass

Alpha                                 0.05
Error Degrees of Freedom                47
Error Mean Square                  1192718
Critical Value of Dunnett's t      2.28037
Minimum Significant Difference      622.61

Comparisons significant at the 0.05 level are indicated by ***.

treat Comparison   Difference Between Means   Simultaneous 95% Confidence Limits
none - herb2                       2282.8         1660.2    2905.4   ***
herb4 - herb2                       850.6          228.0    1473.2   ***

The Dunnett multiple comparisons test for the weed treatments shows that the comparison of the treatments for none and herb2 as well as herb4 and herb2 are both statistically significant at the 95% level.


g) Here is the output of the Tukey procedure:

The ANOVA Procedure
Tukey's Studentized Range (HSD) Test for biomass

Alpha                                     0.05
Error Degrees of Freedom                    47
Error Mean Square                      1192718
Critical Value of Studentized Range    5.11482
Minimum Significant Difference          2280.5

Means with the same letter are not significantly different.

Tukey Grouping     Mean   N   var
A                3341.7   6   M89-1743
A                2950.0   6   Archer
A                2640.0   6   Lambert
A                2510.0   6   M89-1006
A                2385.0   6   M88-250
A                2353.3   6   M89-1926
A                2103.3   6   M90-610
A                2040.0   6   Ozzie
A                2035.0   6   M89-1946
A                2005.0   6   M89-792
A                1845.0   6   M90-1682
A                1795.0   6   M89-642
A                1685.0   6   M90-317
A                1511.7   6   Sturdy
A                1506.7   6   Parker
A                1490.0   6   M89-794

The Tukey multiple comparisons test for the soybean variety shows that there is not a significant difference between the mean of any of the varieties as they all have the same Tukey grouping.


h) Here is the output from proc transreg:

The Box-Cox method suggests that an appropriate transformation of the response is exponentiation to the 0.4 power. Since 0.4 is close to 0.5, the more interpretable square root transformation is used in part i).

[Figure: Box-Cox Analysis for biomass – log likelihood vs. lambda with 95% confidence interval; selected λ = 0.4. The term num_treat has Pr > F < 0.05 at the selected lambda.]


i) With the response transformed by the square root of the biomass, here are the ANOVA table and residual plots:

Source      DF   Type III SS   Mean Square   F Value   Pr > F
loc          1    6790.04826    6790.04826     59.26   <.0001
treat        2   10607.86240    5303.93120     46.29   <.0001
var         15    2455.13192     163.67546      1.43   0.1737
treat*var   30    2358.39280      78.61309      0.69   0.8621

While the location and treatment are still statistically significant at the 99% level, the residual diagnostics for the transformed response are less adequate than for the original model: the residual histogram looks more uniform than normal, and the QQ plot is less linear, which does not support the assumption of normally distributed residuals.

[Figure: Fit Diagnostics for sqrt_biomass – same panel layout as above. R-Square 0.8049, Adj R-Square 0.6056, MSE 114.58, 47 error DF, 49 parameters, 96 observations.]


j) With square root transformation and only the significant factors (alpha = 0.05), here are the ANOVA table and residual plots:

Source   DF   Type III SS   Mean Square   F Value   Pr > F
loc       1    6790.04826    6790.04826     61.25   <.0001
treat     2   10607.86240    5303.93120     47.84   <.0001

Again the location and treatment remain statistically significant; moreover, excluding the non-significant factors improves the residual diagnostics, giving a more linear QQ plot and a more normal-looking residual histogram. This suggests that the square-root-transformed model with only the significant factors is more adequate, due to its more normally distributed residuals.

[Figure: Fit Diagnostics for sqrt_biomass (reduced model) – same panel layout. R-Square 0.6304, Adj R-Square 0.6184, MSE 110.86, 92 error DF, 4 parameters, 96 observations.]


2. From Problem 15.6 (Oehlert): Scientists wish to understand how the amount of sugar (two levels), culture strain (two levels), type of fruit (blueberry or strawberry), and pH (two levels) influence shelf life of refrigerated yogurt. In a preliminary experiment, they produce one batch of each of the sixteen kinds of yogurt. The yogurt is then placed in two coolers, eight batches in each cooler. The response is the number of days till an off odor is detected from the batch. a) There was one defining contrast used to put the yogurt batches in the coolers.

b) The defining contrast used to put the yogurt batches in the coolers was ABCD: every batch in cooler 1 has A·B·C·D = +1 and every batch in cooler 2 has A·B·C·D = -1, so the four-way interaction is confounded with the cooler blocks.

c) Here is the ANOVA table for the treatment interactions:

Source   DF   Type III SS   Mean Square   F Value   Pr > F
cooler    1   90.25000000   90.25000000     20.06   0.0110
A         1    1.00000000    1.00000000      0.22   0.6619
B         1    4.00000000    4.00000000      0.89   0.3992
C         1    4.00000000    4.00000000      0.89   0.3992
D         1   81.00000000   81.00000000     18.00   0.0132
A*B       1    6.25000000    6.25000000      1.39   0.3039
A*C       1    0.25000000    0.25000000      0.06   0.8252
A*D       1    0.25000000    0.25000000      0.06   0.8252
B*C       1    6.25000000    6.25000000      1.39   0.3039
B*D       1    0.25000000    0.25000000      0.06   0.8252
C*D       1    0.25000000    0.25000000      0.06   0.8252

The treatments that are statistically significant at the 0.05 level are cooler (the blocking factor) and D (or pH).
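One way to identify the defining contrast directly (a small sketch of my own, not part of the original assignment) is to compute the ABCD product for each run and cross-tabulate it against the cooler; the product is +1 for every batch in cooler 1 and -1 for every batch in cooler 2:

/* Verify the defining contrast for the blocking */
data check_contrast;
   set Oehlert.pr15_6;
   abcd = A*B*C*D;
run;

proc freq data=check_contrast;
   tables cooler*abcd / nopercent norow nocol;
run;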

d) Once the insignificant interactions are removed, here is the ANOVA table for the treatment interactions:

Source   DF   Type III SS   Mean Square   F Value   Pr > F
cooler    1   90.25000000   90.25000000     28.97   0.0001
D         1   81.00000000   81.00000000     26.00   0.0002

The cooler and D are still statistically significant.


3. From Problem 17.1 (Oehlert):

Pollutants may reduce the strength of bird bones. We believe that the strength reduction, if present, is due to a change in the bone itself, and not a change in the size of the bone. One measure of bone strength is calcium content. We have an instrument which can measure the total amount of calcium in a 1cm length of bone. Bird bones are essentially thin tubes in shape, so the total amount of calcium will also depend on the diameter of the bone.

Thirty-two chicks are divided at random into four groups. Group 1 is a control group and receives a normal diet. Each other group receives a diet including a different toxin (pesticides related to DDT). At 6 weeks, the chicks are sacrificed and the calcium content (in mg) and diameter (in mm) of the right femur is measured for each chick.


a) The data was analyzed using the parallel lines model:

Parameter       Estimate           Standard Error   t Value   Pr > |t|
Intercept        0.744256564 B         0.27403956      2.72     0.0114
group P1        -0.614273759 B         0.05553962    -11.06     <.0001
group P2        -0.593174291 B         0.05761354    -10.30     <.0001
group P3        -0.640936171 B         0.05921678    -10.82     <.0001
group control    0.000000000 B          .                .       .
diam             3.940283686           0.10015321     39.34     <.0001

Source   DF   Type III SS   Mean Square    F Value   Pr > F
group     3    2.01678322    0.67226107      56.04   <.0001
diam      1   18.56676449   18.56676449    1547.84   <.0001

[Figure: Analysis of Covariance for calcium – fitted parallel lines of calcium vs. diam for the control, P1, P2, and P3 groups.]


b) The data was analyzed using the separate slopes model:

Parameter            Estimate           Standard Error   t Value   Pr > |t|
Intercept             0.220664416           0.25524122      0.86     0.3949
diam                  4.133417098 B         0.09492988     43.54     <.0001
diam*group P1        -0.230570351 B         0.02055904    -11.22     <.0001
diam*group P2        -0.218643930 B         0.02165823    -10.10     <.0001
diam*group P3        -0.239305428 B         0.02239929    -10.68     <.0001
diam*group control    0.000000000 B          .                .       .

Source       DF   Type III SS   Mean Square    F Value   Pr > F
diam          1   18.97552857   18.97552857    1614.81   <.0001
diam*group    3    2.02338125    0.67446042      57.40   <.0001

[Figure: Analysis of Covariance for calcium – fitted separate-slopes lines of calcium vs. diam for the control, P1, P2, and P3 groups.]


c) The data was analyzed using the separate lines model:

Parameter            Estimate           Standard Error   t Value   Pr > |t|
Intercept             0.338039547 B         0.52383928      0.65     0.5248
group P1              1.174488414 B         0.67691988      1.74     0.0956
group P2             -1.005149670 B         0.67558492     -1.49     0.1498
group P3             -0.473512513 B         0.64229414     -0.74     0.4681
group control         0.000000000 B          .                .       .
diam                  4.090248437 B         0.19302335     21.19     <.0001
diam*group P1        -0.678410103 B         0.25290625     -2.68     0.0130
diam*group P2         0.173431454 B         0.25599067      0.68     0.5046
diam*group P3        -0.053822482 B         0.24388955     -0.22     0.8272
diam*group control    0.000000000 B          .                .       .

Source       DF   Type III SS   Mean Square    F Value   Pr > F
group         3    0.11886488    0.03962163       4.79   0.0094
diam          1   18.03589628   18.03589628    2181.65   <.0001
diam*group    3    0.12546291    0.04182097       5.06   0.0074

[Figure: Analysis of Covariance for calcium – fitted separate lines of calcium vs. diam for the control, P1, P2, and P3 groups.]


/* Code for #1 */
data Oehlert.pr13_6;
   /* @@ lets INPUT read multiple observations from one line */
   input loc $ treat $ var $ biomass @@;
   datalines;
R herb2 Parker 750      StP herb2 Parker 1440
R herb4 Parker 1630     StP herb4 Parker 890
R none Parker 3590      StP none Parker 740
R herb2 Lambert 870     StP herb2 Lambert 550
R herb4 Lambert 3430    StP herb4 Lambert 2520
R none Lambert 6850     StP none Lambert 1620
R herb2 M89-792 1090    StP herb2 M89-792 130
R herb4 M89-792 2930    StP herb4 M89-792 570
R none M89-792 3710     StP none M89-792 3600
R herb2 Sturdy 1110     StP herb2 Sturdy 400
R herb4 Sturdy 1310     StP herb4 Sturdy 2060
R none Sturdy 2680      StP none Sturdy 1510
R herb2 Ozzie 1150      StP herb2 Ozzie 370
R herb4 Ozzie 1730      StP herb4 Ozzie 2420
R none Ozzie 4870       StP none Ozzie 1700
R herb2 M89-1743 1210   StP herb2 M89-1743 430
R herb4 M89-1743 6070   StP herb4 M89-1743 2790
R none M89-1743 4480    StP none M89-1743 5070
R herb2 M89-794 1330    StP herb2 M89-794 190
R herb4 M89-794 1700    StP herb4 M89-794 1370
R none M89-794 3740     StP none M89-794 610
R herb2 M90-1682 1630
StP herb2 M90-1682 200
R herb4 M90-1682 2000   StP herb4 M90-1682 880
R none M90-1682 3330    StP none M90-1682 3030
R herb2 M89-1946 1660   StP herb2 M89-1946 230
R herb4 M89-1946 2290   StP herb4 M89-1946 2210
R none M89-1946 3180    StP none M89-1946 2640
R herb2 Archer 2210     StP herb2 Archer 1110
R herb4 Archer 3070     StP herb4 Archer 2120
R none Archer 6980      StP none Archer 2210
R herb2 M89-642 2290    StP herb2 M89-642 220
R herb4 M89-642 1530    StP herb4 M89-642 390
R none M89-642 3750     StP none M89-642 2590
R herb2 M90-317 2320    StP herb2 M90-317 330
R herb4 M90-317 1760    StP herb4 M90-317 680
R none M90-317 2320     StP none M90-317 2700
R herb2 M90-610 2480    StP herb2 M90-610 350
R herb4 M90-610 1360    StP herb4 M90-610 1680
R none M90-610 5240     StP none M90-610 1510
R herb2 M88-250 2480    StP herb2 M88-250 350
R herb4 M88-250 1810    StP herb4 M88-250 1020
R none M88-250 6230     StP none M88-250 2420
R herb2 M89-1006 2430   StP herb2 M89-1006 280
R herb4 M89-1006 2420   StP herb4 M89-1006 2350
R none M89-1006 5990    StP none M89-1006 1590
R herb2 M89-1926 3120   StP herb2 M89-1926 260
R herb4 M89-1926 1360   StP herb4 M89-1926 1840
R none M89-1926 5980    StP none M89-1926 1560
;
run;

/* a) Fit a tentative two-way ANOVA model with interaction. */
proc anova data=Oehlert.pr13_6;
   class treat var;
   model biomass = treat var treat*var;
run;

/* b) Perform a residual analysis for the above model. */
proc glm data=Oehlert.pr13_6 plot=diagnostics;
   class treat var;
   model biomass = treat var treat*var;
run;

/* c) Show the two interaction plots for the above model. */
proc glm data=Oehlert.pr13_6 plot=diagnostics;
   class var treat;   /* provides the other interaction plot */
   model biomass = var treat var*treat;
run;

/* d) Modify the code to include the blocking variable. Fit another ANOVA
   model retaining the original two-way structure with interaction but also
   including the blocking factor. */
proc anova data=Oehlert.pr13_6;
   class loc treat var;
   model biomass = loc treat var treat*var;
run;

/* e) Compute the estimated average and its standard error for the weed
   biomass difference between no herbicide treatment and herbicide applied
   after 4 weeks. */
proc glm data=Oehlert.pr13_6;
   class loc treat var;
   model biomass = loc treat var treat*var;
   means treat / tukey cldiff;   /* average differences */
   lsmeans treat / stderr;       /* standard errors     */
run;


/* f) Add a MEANS statement to the code from d) to perform a Dunnett
   multiple comparisons test for the weed treatments. */
proc anova data=Oehlert.pr13_6;
   class loc treat var;
   model biomass = loc treat var treat*var;
   means treat / dunnett;
run;

/* g) Add a MEANS statement to the code from d) to perform a Tukey
   multiple comparisons test for the soybean variety. */
proc anova data=Oehlert.pr13_6;
   class loc treat var;
   model biomass = loc treat var treat*var;
   means var / tukey;
run;

/* h) Use the Box-Cox method to find an appropriate transformation of the
   response. */
data Oehlert.pr13_6;
   set Oehlert.pr13_6;
   /* Create a numeric column for the number of weeks for the treatments */
   if treat = "herb2" then num_treat = 2;
   else if treat = "herb4" then num_treat = 4;
   else num_treat = 0;
run;

proc transreg data=Oehlert.pr13_6;
   model BoxCox(biomass / lambda=-2 to 2 by 0.1 alpha=0.05) = identity(num_treat);
run;

/* i) Transform the response by the square root of the biomass. */
data Oehlert.pr13_6;
   set Oehlert.pr13_6;
   sqrt_biomass = sqrt(biomass);
run;


/* Rerun the analysis from d) using the square root response, showing the
   ANOVA table and key residual plots. */
proc glm data=Oehlert.pr13_6 plot=diagnostics;
   class loc treat var;
   model sqrt_biomass = loc treat var treat*var;
run;

/* j) Keeping the square root transformation, rerun the analysis from d)
   with only the significant factors (alpha = 0.05). */
proc glm data=Oehlert.pr13_6 plot=diagnostics;
   class loc treat;
   model sqrt_biomass = loc treat;
run;


/* Code for #2 */
data Oehlert.pr15_6;
   input cooler $ A B C D days;
   datalines;
1 -1 -1 -1 -1 34
1 1 1 -1 -1 34
1 1 -1 1 -1 32
1 1 -1 -1 1 34
1 -1 1 1 -1 34
1 -1 1 -1 1 39
1 -1 -1 1 1 38
1 1 1 1 1 37
2 1 -1 -1 -1 35
2 -1 1 -1 -1 36
2 -1 -1 1 -1 39
2 -1 -1 -1 1 41
2 1 1 1 -1 39
2 1 1 -1 1 44
2 1 -1 1 1 44
2 -1 1 1 1 42
;
run;

/* c) Analyze the data */
proc glm data=Oehlert.pr15_6;
   class cooler A B C D;
   model days = cooler A B C D A*B A*C A*D B*C B*D C*D;
run;

/* d) Remove the insignificant interactions and rerun the analysis */
proc glm data=Oehlert.pr15_6;
   class cooler D;
   model days = cooler D;
run;


/* Code for #3 */
data Oehlert.pr17_1;
   input group $ diam calcium;
   datalines;
control 2.48 10.41
control 2.81 11.82
control 2.73 11.58
control 2.67 11.14
control 2.90 12.05
control 2.45 10.45
control 2.69 11.39
control 2.94 12.50
P1 3.10 12.10
P1 2.61 10.38
P1 2.49 10.08
P1 2.69 10.71
P1 2.43 9.82
P1 2.52 10.12
P1 2.54 10.16
P1 2.55 10.14
P2 2.57 10.33
P2 2.48 10.03
P2 2.77 11.13
P2 2.30 8.99
P2 2.56 10.06
P2 2.18 8.73
P2 2.65 10.66
P2 2.73 11.03
P3 2.60 10.46
P3 2.17 8.64
P3 2.64 10.48
P3 2.35 9.32
P3 2.89 11.54
P3 2.38 9.48
P3 2.55 10.08
P3 2.29 9.12
;
run;

/* a) Analyze the data using the parallel lines model. */
proc glm data=Oehlert.pr17_1;
   class group;
   model calcium = group diam / solution;
   means group / tukey;
   lsmeans group / stderr pdiff cov out=adjmeans_a;
run;


/* b) Analyze the data using the separate slopes model. */
proc glm data=Oehlert.pr17_1;
   class group;
   model calcium = diam group*diam / solution;
run;

/* c) Analyze the data using the separate lines model. */
proc glm data=Oehlert.pr17_1;
   class group;
   model calcium = group diam group*diam / solution;
run;

/* End of Code - Have a great day! */



STA4203 Final Project: Applied Regression Methods
Brian Reilly, Fall 2015


This project uses the datasets data100.csv and data1000.csv for training, and test1000.csv for testing. All three datasets contain y as the dependent variable and the variables x0-x999 as predictors.

Final Project

Table of Contents
1. data100.csv – 100-Observation Training Data Set
   a. OLS Regression & R2 ............................... 44
   b. Root Mean Squared Error ........................... 44
   c. Forward Selection (0.05 level) .................... 44
   d. RMSE for Forward Selection (from c) ............... 44
   e. Forward Selection (0.001 level) ................... 44
   f. RMSE for Forward Selection (from e) ............... 44
2. data1000.csv – 1000-Observation Training Data Set
   a. Stepwise Selection ................................ 44
   b. RMSE with Test Data Set ........................... 44
   c. Adjusted R2 Criterion Remarks ..................... 44
   d. Ridge Regression .................................. 44
   e. QQ Plot & Normality Test .......................... 45
   f. Residual Diagnostic ............................... 46
   g. Correlated Errors (Durbin-Watson) ................. 46
   h. Outliers .......................................... 46
   i. Correlation Matrix ................................ 47
   j. Condition Index ................................... 47
   k. Box-Cox Method Remarks ............................ 47
   l. Partial Regression Plots .......................... 47
Source Code
   For #1 ............................................... 49
   For #2 ............................................... 51


1. Using the data100.csv data for training the models:

a. When the data was fit with an OLS regression, the R2 of the obtained model was 1. This is expected: with only 100 observations and 1,000 candidate predictors, the model can fit the training data perfectly.

b. The RMSE of the model from a) on the test set test1000.csv is 51.4249.

c. When forward selection with significance level 0.05 was performed on the data, the process stopped once the model perfectly described the training data (R2 reached 1). The selected model used 95 predictors.

d. The RMSE of the model from c) on the test set test1000.csv is 1.7878.

e. When forward selection with significance level 0.001 was performed on the data, the R2 is 0.9584. The model equation used 13 predictors.

f. The RMSE of the model from e) on the test set test1000.csv is 1.2101.

2. Using the data1000.csv data for training the models:

a. When stepwise selection with significance level 0.01 was performed on the data, the R2 is 0.9504. The model equation used 13 predictors.

b. The RMSE of the model from a) on the test set test1000.csv is 1.0023.

c. One cannot perform variable selection using the Adjusted R2 criterion here because that criterion requires evaluating all possible subsets of the predictors, which is computationally infeasible with 1,000 candidates.

d. When Ridge Regression with ridge coefficient 0.0002 was used on the test set test1000.csv, the RMSE is 3.5040.
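The ridge step is not included in the source code below, so here is a minimal sketch of how it could be done, assuming the 13 predictors selected in a); scoring via TYPE=RIDGE (selecting the _TYPE_='RIDGE' rows of the OUTEST= data set) is my assumption, and if PROC SCORE does not accept it, the OUTEST= data set can be subset to those rows first.

/* Hypothetical sketch: ridge fit with ridge coefficient 0.0002 */
proc reg data=data1000 outest=ridge_est ridge=0.0002 noprint;
   model y = x70 x10 x50 x80 x40 x60 x20 x90 x0 x30 x65 x811 x444;
run; quit;

/* Score the test set with the ridge estimates and compute the RMSE
   as in part b) */
proc score data=test1000 score=ridge_est out=ridge_scrd residual type=ridge;
   var x70 x10 x50 x80 x40 x60 x20 x90 x0 x30 x65 x811 x444 y;
run;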


Using the model from a):

e. The QQ plot of the residuals:

The QQ plot appears to be roughly linear, which suggests that the residuals are approximately normally distributed. Moreover, the p-value for the Kolmogorov-Smirnov test for normality is greater than 0.15, so we fail to reject the null hypothesis of normality; this further supports the claim that the residuals are normally distributed.


f. Plot of the Residuals vs. Predicted:

The errors seem to have constant variance.

g. The p-values for the Durbin-Watson test for positive and negative autocorrelation are 0.4308 and 0.5692, respectively. Since neither supports positive or negative autocorrelation, the errors are likely uncorrelated.

h. There were no outliers at the 0.05 or 0.01 levels.


i. The matrix of the correlations between the variables selected in a):

j. The condition index is 2.83585.

k. One cannot use the Box-Cox method to find the best transformation for the response, because some of the responses are negative.

l. The partial regression plots:

[Figure: Partial Plots for y – partial dependent residual vs. partial regressor residual for the Intercept and predictors x40, x80, x50, x10, x70.]


The partial regression plots (continued):

[Figure: Partial Plots for y (continued) – panels for x65, x30, x0, x90, x20, x60, x444, and x811.]


/** 1. Using the data100.csv data for training the models **/
proc import datafile="/home/btr09c/STA4203/Data/data100.csv"
            dbms=csv out=WORK.data100 replace;
   getnames=yes;
run;

proc import datafile="/home/btr09c/STA4203/Data/test1000.csv"
            dbms=csv out=WORK.test1000 replace;
   getnames=yes;
run;

/* #1 a) Full OLS fit */
proc reg data=data100 plots=none outest=data100_a;
   model y = x0-x999;
   output out=data100 predicted=pred;
run; quit;

/* #1 b) Score the test set and compute the RMSE of the residuals */
proc score data=test1000 score=data100_a out=b_scrd residual type=parms;
   var x0-x999 y;
run;
proc univariate data=b_scrd noprint;
   var model1;
   output out=b_stat uss=ss1;
run;
data b_stat;
   set b_stat;
   rmse = sqrt(ss1/1000);
run;

/* #1 c) Forward selection at the 0.05 level */
proc reg data=data100 plots=none outest=data100_c;
   model y = x0-x999 / selection=forward sle=0.05;
   output out=data100 predicted=pred;
run; quit;

/* #1 d) RMSE of the model from c) on the test set */
proc score data=test1000 score=data100_c out=d_scrd residual type=parms;
   var x0-x999 y;
run;
proc univariate data=d_scrd noprint;
   var model1;
   output out=d_stat uss=ss1;
run;
data d_stat;
   set d_stat;
   rmse = sqrt(ss1/1000);
run;


/* #1 e) Refit the 13 predictors chosen by forward selection at 0.001 */
proc reg data=data100 plots=none;
   model y = x81 x31 x61 x0 x20 x70 x49 x89 x10 x40 x60 x80 x30;
   output out=data100 predicted=pred;
run; quit;

proc reg data=data100 plots=none outest=data100_e;
   model y = x0-x999 / selection=forward sle=0.001;
   output out=data100 predicted=pred;
run; quit;

/* #1 f) RMSE of the model from e) on the test set */
proc score data=test1000 score=data100_e out=f_scrd residual type=parms;
   var x0-x999 y;
run;
proc univariate data=f_scrd noprint;
   var model1;
   output out=f_stat uss=ss1;
run;
data f_stat;
   set f_stat;
   rmse = sqrt(ss1/1000);
run;


/** 2. Using the data1000.csv data for training the models **/
proc import datafile="/home/btr09c/STA4203/Data/data1000.csv"
            dbms=csv out=WORK.data1000 replace;
   getnames=yes;
run;

/* Refit the 13 predictors chosen by stepwise selection */
proc reg data=data1000 plots=none;
   model y = x70 x10 x50 x80 x40 x60 x20 x90 x0 x30 x65 x811 x444;
run; quit;

/* #2 a) Stepwise selection at the 0.01 level */
proc reg data=data1000 plots=none outest=data100_a;
   model y = x0-x999 / selection=stepwise sle=0.01 slstay=0.01;
   output out=data100 predicted=pred residual=r_y;
run; quit;

/* #2 b) Score the test set and compute the RMSE */
proc score data=test1000 score=data100_a out=b_scrd residual type=parms;
   var x0-x999 y;
run;
proc univariate data=b_scrd noprint;
   var model1;
   output out=b_stat uss=ss1;
run;
data b_stat;
   set b_stat;
   rmse = sqrt(ss1/1000);
run;

/* #2 g) Durbin-Watson test (DWPROB requests the p-values) and
   #2 l) partial regression plots */
proc reg data=data1000;
   model y = x70 x10 x50 x80 x40 x60 x20 x90 x0 x30 x65 x811 x444
           / partial dwprob;
   output out=data1000 predicted=p_y residual=r_y rstudent=r_student;
run; quit;

/* #2 f) Plot the residuals versus the predicted values */
proc gplot data=data1000;
   plot r_y * p_y;
run;

/* #2 e) QQ plot of the residuals and normality tests
   (NORMAL requests the tests for normality) */
proc univariate data=data1000 noprint;
   qqplot r_y;
run;
proc univariate data=data1000 normal;
   var r_y;
run;

/* #2 h) Bonferroni-style outlier screen on the studentized residuals */
data outliers;
   do i = 1 to 1000 by 1;
      set data1000 point=i;
      if (abs(r_student) >= abs(tinv(0.01/(2*1000), 1000-13-1))) then output;
   end;
   stop;
run;



STA4853 Homework #4: Time Series and Forecasting Methods
Brian Reilly, Spring 2015


The REPAIR series is monthly repair data, the BUS series is the monthly average number of bus passengers per weekday, and the EUREKA series is monthly average temperatures.

Homework #4

Table of Contents

1. REPAIR series
   a. Seasonal ARIMA Model Selection………………………..54
   b. Non-Seasonal Comparison………………………………...55
   c. Forecast and Model Constant……………………………...57

2. BUS series
   a. Seasonal ARIMA Model Selection………………………...58
   b. Forecast (36 Months)……………………………………....59

3. EUREKA series
   a. Select Trend + Stationary ARIMA Process………………..60
   b. Forecast (36 Months)………………………………………60

Source Code
   For Problem 1…………………………………………………61
   For Problem 2………………………………………………....61
   For Problem 3………………………………………………....62

Page 54: SAS Programming and Data Analysis Portfolio - BTReilly


[Figure: Trend and Correlation Analysis for Y(1) — the differenced series plotted against observation (0-100), with ACF, PACF, and IACF plotted against lag]

Problem 1:

(a) For the REPAIR series, Y = log(X), I chose an ARIMA(0,1,1)(0,0,1)12 model without a constant. The log transform was necessary to stabilize the variance over the series. Differencing once removed the slow decay in the ACF versus lag. The drop-off after the first lag in the ACF of the differenced series implies q = 1. The p-value for the constant was relatively high (Pr > |t| = 0.2269) and its estimate was close to zero (MU = 0.01168), so the NOCONSTANT option was used. A seasonal term at lag 12 was included because of the periodic nature of the monthly data.
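Condensed from the Problem 1 source code at the end of this homework, the specification is:

proc arima data=RepairData;
  identify var=Y(1);                      /* first difference of Y = log(X) */
  estimate q=(1,12) noconstant method=ml; /* MA terms at lags 1 and 12, no constant */
quit;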

Page 55: SAS Programming and Data Analysis Portfolio - BTReilly


(b) For the model with the seasonal term, Q = 1, the residuals show no considerable correlation at any lag in the ACF, PACF, or IACF. The white-noise test is likewise not significant at any lag (p-values above 0.05), consistent with white-noise residuals.

For the seasonal model, the QQ-plot of the residuals is generally linear and the distribution of the residuals appears normal.

[Figure: Residual Correlation Diagnostics for Y(1), seasonal model — white noise probability, ACF, PACF, and IACF by lag]

[Figure: Residual Normality Diagnostics for Y(1), seasonal model — QQ-plot and distribution of residuals with kernel and normal overlays]

Page 56: SAS Programming and Data Analysis Portfolio - BTReilly


The non-seasonal model has spikes around lag 12 in the residual ACF, PACF, and IACF. Starting at lag 12, the white-noise test becomes significant (p-values below 0.05), indicating seasonal correlation remaining in the residuals.

Although the QQ-plot of the residuals is generally linear, the distribution of the residuals for the non-seasonal model is negatively skewed.

For the seasonal model, the AIC (-86.4276) and SBC (-81.3624) are smaller than those of the non-seasonal model (AIC = -70.7336, SBC = -68.201), so the seasonal model is the better fit.

[Figure: Residual Correlation Diagnostics for Y(1), non-seasonal model — white noise probability, ACF, PACF, and IACF by lag]

[Figure: Residual Normality Diagnostics for Y(1), non-seasonal model — QQ-plot and distribution of residuals with kernel and normal overlays]
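The two candidate models are fit with back-to-back ESTIMATE statements on the same differenced series; PROC ARIMA prints AIC and SBC with each fit, so the comparison reads directly off the two estimation tables. Condensed from the Problem 1 source code:

proc arima data=RepairData;
  identify var=Y(1);
  estimate q=(1,12) noconstant method=ml; /* seasonal model:     AIC = -86.4276, SBC = -81.3624 */
  estimate q=(1) noconstant method=ml;    /* non-seasonal model: AIC = -70.7336, SBC = -68.201  */
quit;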

Page 57: SAS Programming and Data Analysis Portfolio - BTReilly


(c) Forecast without a constant:

[Figure: Forecasts for Y — predicted values with 95% confidence limits, observations 100-130, model without a constant]

Forecast with a constant:

[Figure: Forecasts for Y — predicted values with 95% confidence limits, observations 100-130, model with a constant]

The forecast with a constant shows predicted values and confidence limits that generally increase with time, whereas the forecast without a constant stabilizes over time, neither increasing nor decreasing.
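Both forecasts use the same seasonal model, re-estimated with and without the constant term; condensed from the Problem 1 source code:

proc arima data=RepairData;
  identify var=Y(1) noprint;
  estimate q=(1,12) noconstant method=ml noprint; /* without a constant */
  forecast lead=36;
  identify var=Y(1) noprint;
  estimate q=(1,12) method=ml noprint;            /* with a constant */
  forecast lead=36;
quit;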

Page 58: SAS Programming and Data Analysis Portfolio - BTReilly


Problem 2:

(a) For the BUS series, Y = sqrt(X), I chose an ARIMA(0,1,1)(0,1,1)12 model without a constant. The series was differenced once at lag 1 and once at lag 12 to remove the slow decay in the ACF versus lag. The drop-off after the first lag in the ACF of the differenced series implies q = 1. The p-value for the constant was relatively high (Pr > |t| = 0.7823) and its estimate was close to zero (MU = -0.01030), so the NOCONSTANT option was used. The seasonal terms at lag 12 reflect the periodic nature of the monthly data.

There do not appear to be any significant spikes at any lag in the residual diagnostics.
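In PROC ARIMA this corresponds to differencing at lags 1 and 12 and estimating MA terms at lags 1 and 12 without a constant; condensed from the Problem 2 source code:

proc arima data=BusData;
  identify var=Y(1 12);                   /* regular and seasonal (lag-12) differences */
  estimate q=(1,12) noconstant method=ml; /* MA terms at lags 1 and 12 */
  forecast lead=36 out=frcst nooutall;    /* 36-month forecast for part (b) */
quit;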

[Figure: Trend and Correlation Analysis for Y(1 12) — the differenced series plotted against observation (0-150), with ACF, PACF, and IACF plotted against lag]

[Figure: Residual Correlation Diagnostics for Y(1 12) — white noise probability, ACF, PACF, and IACF by lag]

Page 59: SAS Programming and Data Analysis Portfolio - BTReilly


The residuals appear to be normal and the QQ-plot is linear.

(b) Forecasting for 36 months beyond the end of the series, on the original scale:

[Figure: Residual Normality Diagnostics for Y(1 12) — QQ-plot and distribution of residuals with kernel and normal overlays]

[Figure: Forecasts for Y — predicted values with 95% confidence limits, observations 140-170]
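Because the model was fit to Y = sqrt(X), the forecasts and 95% limits must be squared to return them to the original passenger scale; condensed from the Problem 2 source code:

data origscale;
  keep osforecast osl95 osu95;
  set frcst;                 /* OUT= data set from the FORECAST statement */
  osforecast=(forecast)**2;  /* point forecast on the original scale */
  osl95=(l95)**2;            /* lower 95% limit */
  osu95=(u95)**2;            /* upper 95% limit */
run;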

Page 60: SAS Programming and Data Analysis Portfolio - BTReilly


Problem 3:

(a) The EUREKA series was modeled as a periodic trend plus a stationary process:

\[
X_t \;=\; \mu \;+\; \beta_1 \sin\!\Big(\frac{2\pi t}{12}\Big) \;+\; \beta_2 \cos\!\Big(\frac{2\pi t}{12}\Big) \;+\; \beta_3 \cos\!\Big(\frac{4\pi t}{12}\Big) \;+\; \beta_4 \cos\!\Big(\frac{6\pi t}{12}\Big) \;+\; \beta_5 \sin\!\Big(\frac{8\pi t}{12}\Big) \;+\; \eta_t,
\]

where \(\eta_t\) is an AR(1) error process.
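The harmonic regressors are created in a DATA step (extended 36 months past the data so the same variables can drive the forecast) and entered through the CROSSCOR= and INPUT= options; the key lines, condensed from the Problem 3 source code:

data trended;
  tpi=2*arcos(-1);                /* 2*pi */
  do time=1 to nobs+36;           /* 36 extra months for forecasting */
    if time <= nobs then set EurekaData nobs=nobs;
    else X=.;                     /* future values left missing, to be forecast */
    xsin1=sin(tpi*time/12);   xcos1=cos(tpi*time/12);
    xcos2=cos(2*tpi*time/12); xcos3=cos(3*tpi*time/12);
    xsin4=sin(4*tpi*time/12);
    output;
  end;
  drop tpi;
run;

proc arima data=trended;
  identify var=X crosscor=(xsin1 xcos1 xcos2 xcos3 xsin4);
  estimate q=1 input=(xsin1 xcos1 xcos2 xcos3 xsin4) method=ml;
  forecast lead=36 out=resids;
quit;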

Maximum Likelihood Estimation

 Estimate    Standard Error   t Value   Approx Pr > |t|   Lag   Variable   Shift
 518.19807   2.09309           247.58   <.0001              0   X              0
 -0.36548    0.08884            -4.11   <.0001              1   X              0
-41.87785    2.88710           -14.51   <.0001              0   xsin1          0
-26.83130    2.87610            -9.33   <.0001              0   xcos1          0
 -5.44357    2.66062            -2.05   0.0408              0   xcos2          0
 -5.64730    2.31676            -2.44   0.0148              0   xcos3          0
 -4.98457    1.90206            -2.62   0.0088              0   xsin4          0

The p-values for these terms are all below 0.05, supporting the inclusion of each factor.

(b) The forecasts for 36 months beyond the end of the series:

The forecast oscillates as the series did, with a confidence interval of relatively constant width around the predictions.

[Figure: Forecasts for X — predicted values with 95% confidence limits, observations 120-145]

Page 61: SAS Programming and Data Analysis Portfolio - BTReilly


/* Code for Problem 1 */
* Import Repair Data;
filename RepFile "/home/btr09c/my_courses/huffer/repair.txt";

data RepairData;
  infile RepFile;
  input X;
  Y = log(X);
run;

proc arima data=RepairData;
  * ARIMA(0,1,1)(0,0,1)_12 without a constant;
  identify var=Y(1);
  estimate q=(1,12) noconstant method=ml;
  * Non-seasonal model for part b;
  estimate q=(1) noconstant method=ml;
  * Forecast without a constant for part c;
  identify var=Y(1) noprint;
  estimate q=(1,12) noconstant method=ml noprint;
  forecast lead=36;
  * Forecast with a constant for part c;
  identify var=Y(1) noprint;
  estimate q=(1,12) method=ml noprint;
  forecast lead=36;
quit;

/* Code for Problem 2 */
* Import Bus Data;
filename BusFile "/home/btr09c/my_courses/huffer/bus.txt";

data BusData;
  infile BusFile;
  input X;
  Y = sqrt(X);
run;

proc arima data=BusData;
  * ARIMA(0,1,1)(0,1,1)_12;
  identify var=Y(1 12);
  estimate q=(1,12) method=ml noconstant;
  forecast lead=36 out=frcst nooutall;
quit;

* Back-transform the forecasts and limits to the original scale;
data origscale;
  keep osforecast osl95 osu95;
  set frcst;
  osforecast=(forecast)**2;
  osl95=(l95)**2;
  osu95=(u95)**2;
run;

proc print data=origscale;
run;

Page 62: SAS Programming and Data Analysis Portfolio - BTReilly


/* Code for Problem 3 */
* Import Eureka Data;
filename EurFile "/home/btr09c/my_courses/huffer/eureka.txt";

data EurekaData;
  infile EurFile;
  input X;
run;

* Build harmonic regressors, extended 36 months past the data;
data trended;
  tpi=2*arcos(-1); /* tpi = 2*pi */
  drop tpi;
  do time=1 to nobs+36;
    if time <= nobs then set EurekaData nobs=nobs;
    else X=.; /* Set future temperature values to missing. */
    xsin1=sin(tpi*time/12);   xcos1=cos(tpi*time/12);
    xsin2=sin(2*tpi*time/12); xcos2=cos(2*tpi*time/12);
    xsin3=sin(3*tpi*time/12); xcos3=cos(3*tpi*time/12);
    xsin4=sin(4*tpi*time/12); xcos4=cos(4*tpi*time/12);
    xsin5=sin(5*tpi*time/12); xcos5=cos(5*tpi*time/12);
    xcos6=cos(6*tpi*time/12);
    output;
  end;
run;

/* Analyze Temperature Data as Periodic Trend Plus Stationary Process */
proc arima data=trended;
  identify var=X crosscor=(xsin1 xcos1 xcos2 xcos3 xsin4);
  estimate q=1 input=(xsin1 xcos1 xcos2 xcos3 xsin4) method=ml;
quit;

/* Refit Model Eliminating Most of the Non-Significant Terms and then Examine the Forecasts.
   Note: the CROSSCOR= list must contain every variable used in INPUT=. */
proc arima data=trended plots=residual(all);
  identify var=X crosscor=(xsin1 xcos1 xcos2 xcos3 xsin4) noprint;
  estimate q=1 input=(xsin1 xcos1 xcos2 xcos3 xsin4) method=ml;
  forecast lead=36 out=resids;
quit;

/* Plot residuals versus (one-step-ahead) forecasts */
proc sgplot data=resids;
  loess x=forecast y=residual;
run;

Page 63: SAS Programming and Data Analysis Portfolio - BTReilly

Brian Thomas Reilly
B.S. Actuarial Science
Florida State University
+1.941.806.8550
[email protected]