The essence of data step programming

Post on 30-Nov-2014


The foundation of SAS programming is DATA step programming, and the essence of DATA step programming is understanding how SAS processes data during the compilation and execution phases. This paper shows what happens “behind the scenes” while a SAS dataset is created. You will learn how a new dataset is built one observation at a time: data move from a raw text file or an existing SAS dataset into the program data vector (PDV), and from the PDV into the newly created SAS dataset. Once DATA step processing is fully understood, the SUM and RETAIN statements become easier to grasp. The paper also covers the related topic of BY-group processing.


The Essence of DATA Step Programming

Arthur Li

City of Hope Comprehensive Cancer Center

Department of Information Science

INTRODUCTION

Fundamental of SAS programming: DATA step programming

Essence of DATA step programming: understanding how SAS processes the data during the compilation and execution phases

A COMMON BEFUDDLEMENT

The newly created SAS dataset is not what we intended:
- there are more or fewer observations than expected
- the value of a variable was not retained correctly

Reason:
- learning only SAS language syntax
- not understanding the fundamental SAS programming concepts

INTRODUCTION

We will cover:
- what happens “behind the scenes” while creating a SAS dataset
- how a new dataset is created, one observation at a time: raw text file or SAS dataset → PDV → SAS dataset
- the SUM and RETAIN statements
- BY-group processing
- transposing dataset examples

DATA STEP PROCESSING OVERVIEW

A DATA step is processed in two sequential phases:

Compilation phase: each statement is scanned for syntax errors.

Execution phase: if there is no syntax error, the DATA step reads and processes the input data.

DATA STEP PROCESSING OVERVIEW

Example1.txt (the first line below is a column ruler, not part of the file):

12345678901234567890
Barbara 61 12D
John    62 175

The weight value 12D in the first record is a data entry error.

Variable names   Columns
Name             1-7
Height           9-10
Weight           12-14

The column input method:
- each variable occupies a fixed field
- the values are standard character or numeric values

Program1:

data ex1;
   infile 'C:\Arthur\example1.txt';
   input name $ 1-7 height 9-10 weight 12-14;
   BMI = 700*weight/(height*height);
   output;
run;

Creating a new variable: BMI
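For readers who want to run Program 1 without creating the file under C:\Arthur, the following is a minimal sketch that reads the same two records in-stream. Replacing the file path with INFILE DATALINES is an assumption made here for portability and is not part of the original program; the columns, the variable names, and the BMI formula are unchanged.

```sas
/* Sketch: Program 1 with the two records supplied in-stream.       */
/* INFILE DATALINES replaces the original file path (assumption).   */
data ex1;
   infile datalines;
   input name $ 1-7 height 9-10 weight 12-14;   /* column input */
   BMI = 700*weight/(height*height);
   output;
datalines;
Barbara 61 12D
John    62 175
;
run;

/* Weight and BMI are missing for Barbara because 12D is not a valid number */
proc print data=ex1;
run;
```

The SAS log should report invalid data for weight in the first record, which is the data entry error that the walk-through relies on.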

COMPILATION PHASE

While Program 1 is compiled, SAS does the following:

1. Creates the input buffer (columns 1, 2, 3, ...). It is used to hold raw data and is not created when reading a SAS dataset.

2. Creates the PDV, the memory area where SAS builds its new dataset, one observation at a time.

3. Adds two automatic variables to the PDV:
   _N_ = 1: the 1st observation is being processed; _N_ = 2: the 2nd observation is being processed
   _ERROR_ = 1: signals a data error in the currently processed observation

4. Adds a space to the PDV for each variable in the INPUT statement: Name, Height, Weight.

5. Adds BMI to the PDV.

6. Marks each variable as D (dropped) or K (kept).

7. Checks for syntax errors: invalid variable names, invalid options, incorrect punctuation, misspelled keywords.

PDV at the end of the compilation phase:

_N_ (D)   _ERROR_ (D)   Name (K)   Height (K)   Weight (K)   BMI (K)
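The D and K flags can be observed directly. The sketch below is an illustration, not part of the original slides: it adds a DROP statement to Program 1, and the output dataset name ex1_bmi is hypothetical. DROP takes effect at compile time, so Height and Weight are flagged D in the PDV, yet they remain available to the BMI calculation during execution.

```sas
/* Sketch: DROP flags variables as D in the PDV at compile time.    */
/* ex1_bmi is a hypothetical dataset name used for illustration.    */
data ex1_bmi;
   infile datalines;
   input name $ 1-7 height 9-10 weight 12-14;
   BMI = 700*weight/(height*height);   /* dropped variables can still be used */
   drop height weight;                 /* flagged D: not written to ex1_bmi   */
datalines;
Barbara 61 12D
John    62 175
;
run;

/* Lists only the variables flagged K: name and BMI */
proc contents data=ex1_bmi;
run;
```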

EXECUTION PHASE

The DATA step works like a loop. It repetitively:
- executes statements
- reads data values
- creates observations, one at a time

1st iteration, at the beginning: _N_ is set to 1 and _ERROR_ is set to 0. The remaining variables are set to missing.

PDV:

_N_ (D)   _ERROR_ (D)   Name (K)   Height (K)   Weight (K)   BMI (K)
1         0                        .            .            .

EXECUTION PHASE

1st iteration (1st data line of Example1.txt: Barbara 61 12D):

1. The INFILE statement identifies the location of Example1.txt.
2. The 1st data line is copied to the input buffer. The input pointer is at the beginning of the input buffer.
3. The INPUT statement reads data values from the input buffer into the PDV:
   - input buffer columns 1-7 → Name in the PDV; the input pointer moves to column 8
   - input buffer columns 9-10 → Height in the PDV; the input pointer moves to column 11
   - SAS tries to read Weight from columns 12-14, but 12D is an invalid value: Weight stays missing, _ERROR_ is set to 1, and the input pointer moves to column 15
4. BMI remains missing: an operation on a missing value yields a missing value.
5. The OUTPUT statement is executed. Only the values marked (K) are copied, as a single observation, to the SAS dataset ex1.

PDV during the 1st iteration:

Step                          _N_   _ERROR_   Name      Height   Weight   BMI
1st line in input buffer      1     0                   .        .        .
Name read                     1     0         Barbara   .        .        .
Height read                   1     0         Barbara   61       .        .
Weight invalid                1     1         Barbara   61       .        .
BMI calculated                1     1         Barbara   61       .        .

Ex1:

    Name      Height   Weight   BMI
1   Barbara   61       .        .

At the end of the DATA step, two things occur automatically:

1. The SAS system returns to the beginning of the DATA step.
2. The values of the variables in the PDV are reset to missing, _N_ is incremented to 2, and _ERROR_ is reset to 0.

PDV:

_N_ (D)   _ERROR_ (D)   Name (K)   Height (K)   Weight (K)   BMI (K)
2         0                        .            .            .

EXECUTION PHASE

2nd iteration (2nd data line of Example1.txt: John 62 175):

1. The 2nd data line is copied to the input buffer. The input pointer is at the beginning of the input buffer.
2. The INPUT statement is executed.
3. BMI is calculated.
4. The OUTPUT statement is executed.

PDV during the 2nd iteration:

Step                          _N_   _ERROR_   Name   Height   Weight   BMI
2nd line in input buffer      2     0                .        .        .
INPUT executed                2     0         John   62       175      .
BMI calculated                2     0         John   62       175      31.8678

Ex1:

    Name      Height   Weight   BMI
1   Barbara   61       .        .
2   John      62       175      31.8678

At the end of the DATA step, two things occur automatically:

1. The SAS system returns to the beginning of the DATA step.
2. The values of the variables in the PDV are reset to missing and _N_ is incremented to 3.

PDV:

_N_ (D)   _ERROR_ (D)   Name (K)   Height (K)   Weight (K)   BMI (K)
3         0                        .            .            .

There are no more records to read, so the SAS system goes on to the next DATA/PROC step:

proc print data=ex1;
run;
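The PDV contents described in the walk-through can be inspected in the SAS log. The following is a minimal sketch, assuming the records are supplied in-stream instead of from the original file; the two PUT statements and the use of DATA _NULL_ are additions for illustration only. PUT _ALL_ writes every variable in the PDV, including _N_ and _ERROR_.

```sas
/* Sketch: trace the PDV at the top and bottom of each iteration.   */
/* DATA _NULL_ is used so that no output dataset is created.        */
data _null_;
   put 'Top of step: ' _all_;      /* variables exist before INPUT runs */
   infile datalines;
   input name $ 1-7 height 9-10 weight 12-14;
   BMI = 700*weight/(height*height);
   put 'After BMI:   ' _all_;      /* values after this iteration's work */
datalines;
Barbara 61 12D
John    62 175
;
run;
```

The first PUT runs three times (for _N_ = 1, 2, and 3), because the step only stops when INPUT finds no more records in the third iteration; the second PUT runs twice.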

THE OUTPUT STATEMENT

The explicit OUTPUT statement:

data ex1;
   set example1;
   BMI = 700*weight/(height*height);
   output;
run;

It writes the current observation from the PDV to a SAS dataset immediately, not at the end of the DATA step.

The implicit OUTPUT statement:

data ex1;
   set example1;
   BMI = 700*weight/(height*height);
run;

It tells SAS to write the observation to the dataset at the end of the DATA step.

Using an explicit OUTPUT overrides the implicit OUTPUT.

We can use more than one OUTPUT statement in the DATA step.
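The word "immediately" matters. As a hedged illustration (the dataset name ex1_early is hypothetical, and example1 is recreated here from in-stream data so the sketch is self-contained), placing OUTPUT before the assignment writes the observation before BMI has been computed, and the explicit OUTPUT suppresses the implicit one at the end of the step.

```sas
/* Sketch: an explicit OUTPUT placed too early.                     */
/* example1 is rebuilt here only to make the example runnable.      */
data example1;
   input name $ height weight;
datalines;
Barbara 61 170
John 62 175
;
run;

data ex1_early;                        /* hypothetical dataset name */
   set example1;
   output;                             /* written now: BMI is still missing */
   BMI = 700*weight/(height*height);   /* computed after the write          */
run;

/* BMI exists as a variable but is missing in both observations */
proc print data=ex1_early;
run;
```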

THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET

When reading a raw dataset (Program 1): raw data → input buffer → PDV → SAS dataset

Raw data:

Barbara 61 12D
John    62 175

SAS dataset:

    Name      Height   Weight   BMI
1   Barbara   61       .        .
2   John      62       175      31.8678

When reading a SAS dataset: SAS dataset → PDV → SAS dataset

data ex1;
   set example1;
   BMI = 700*weight/(height*height);
   output;
run;

Input dataset: Example1 (after “set”)

    Name      Height   Weight
1   Barbara   61       .
2   John      62       175

Output dataset: Ex1 (after “data”)

    Name      Height   Weight   BMI
1   Barbara   61       .        .
2   John      62       175      31.8678

When reading a raw dataset, SAS sets each variable value in the PDV to missing at the beginning of each iteration of execution, except for:
- the automatic variables
- variables that are named in the RETAIN or SUM statement
- data elements in a _TEMPORARY_ array
- variables created in the options of the FILE/INFILE statement

THE DIFFERENCE BETWEEN READING A RAW DATASET AND READING A SAS DATASET

When reading a SAS dataset:

data ex1;
   set example1;
   BMI = 700*weight/(height*height);
   output;
run;

Example1:

    Name      Height   Weight
1   Barbara   61       170
2   John      62       175

1st iteration:
1. At the beginning of the execution phase, SAS sets each variable to missing in the PDV.
2. The SET statement is executed.
3. BMI is calculated.
4. The OUTPUT statement is executed.

2nd iteration:
- Variables that exist in the input dataset: SAS sets each variable to missing in the PDV only before the 1st iteration of the execution. The variables retain their values in the PDV until they are replaced by new values.
- Variables being created in the DATA step: SAS sets each variable to missing in the PDV at the beginning of every iteration of the execution.
- The SET statement is executed.

PDV (_N_ and _ERROR_ are marked D; Name, Height, Weight, and BMI are marked K):

Step                            _N_   _ERROR_   Name      Height   Weight   BMI
1st iteration, at the start     1     0                   .        .        .
1st iteration, SET executed     1     0         Barbara   61       170      .
1st iteration, BMI calculated   1     0         Barbara   61       170      31.9807
2nd iteration, at the start     2     0         Barbara   61       170      .
2nd iteration, SET executed     2     0         John      62       175      .

Ex1 after the 1st iteration:

    Name      Height   Weight   BMI
1   Barbara   61       170      31.9807
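Because variables read with SET keep their values until they are replaced, a value read once stays available in every later iteration. The sketch below illustrates this under stated assumptions: the one-row dataset factor and its variable k are hypothetical names introduced here, and example1 is recreated from in-stream data so the code is self-contained.

```sas
/* Sketch: a value read by SET in the 1st iteration is retained.    */
/* factor and k are hypothetical names used only for illustration.  */
data example1;
   input name $ height weight;
datalines;
Barbara 61 170
John 62 175
;
run;

data factor;
   k = 700;                            /* one observation, one variable */
run;

data ex1;
   if _n_ = 1 then set factor;         /* SET executes in the 1st iteration only   */
   set example1;                       /* replaces Name, Height, Weight every time */
   BMI = k*weight/(height*height);     /* k is still 700 in the 2nd iteration      */
run;

proc print data=ex1;
run;
```

If k had been created by an assignment statement inside an IF _N_ = 1 block instead of being read by SET, it would be reset to missing at the start of the second iteration unless it were named in a RETAIN statement.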

THE RETAIN STATEMENT

Consider the following dataset:

    ID    SCORE
1   A01   3
2   A02   .
3   A03   4

We would like to create a new variable, TOTAL, that accumulates the values of SCORE:

    ID    SCORE   TOTAL
1   A01   3       3
2   A02   .       3
3   A03   4       7

How to do it?
- Set TOTAL to 0 at the first iteration of the execution.
- Then, at each iteration of the execution, add the value of SCORE to TOTAL.

Problem: TOTAL is a new variable that you want to create, so it will be set to missing in the PDV at the beginning of every iteration of the execution.
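The problem can be reproduced with a short sketch. The dataset name ex2_noretain is hypothetical, and ex2 is built here from in-stream data matching the table above. Without any provision for retaining TOTAL, each iteration starts from a missing value, so nothing accumulates.

```sas
/* Sketch: accumulating without RETAIN does not work.               */
data ex2;
   input id $ score;
datalines;
A01 3
A02 .
A03 4
;
run;

data ex2_noretain;                     /* hypothetical dataset name */
   set ex2;
   total = sum(total, score);          /* total is missing at the start of every iteration */
run;

/* TOTAL is 3, missing, 4 instead of the intended 3, 3, 7 */
proc print data=ex2_noretain;
run;
```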

THE RETAIN STATEMENT

To fix this problem, we can use the RETAIN statement:

RETAIN VARIABLE <VALUE>;

It prevents VARIABLE from being initialized each time the DATA step executes.

VARIABLE: name of the variable that we want to retain

VALUE: a numeric value used to initialize VARIABLE only at the first iteration of the DATA step execution. If no initial value is specified, VARIABLE is initialized as missing.

THE RETAIN STATEMENT

data ex2_2;
   set ex2;
   retain total 0;
   total = sum(total, score);
run;

Ex2:

    ID    SCORE
1   A01   3
2   A02   .
3   A03   4

The execution phase begins immediately after the completion of the compilation phase. In the PDV, _N_ and _ERROR_ are marked D; ID, Score, and Total are marked K.

1st iteration:
1. _N_ is 1 and _ERROR_ is 0. ID and SCORE are missing. TOTAL is 0 because of the RETAIN statement.
2. The 1st observation from ex2 is copied to the PDV.
3. The RETAIN statement is a compile-time only statement. It does not execute during the execution phase.
4. TOTAL is calculated.
5. The implicit OUTPUT statement tells the SAS system to write the observation to the dataset.

2nd iteration:
1. _N_ is incremented to 2. ID and SCORE are retained from the previous iteration because the data are read from an existing SAS dataset. TOTAL is also retained because the RETAIN statement is used.
2. The 2nd observation from ex2 is copied to the PDV.
3. TOTAL is calculated.
4. The implicit OUTPUT: the contents of the PDV are written to Ex2_2.

3rd iteration:
1. _N_ is incremented to 3. ID and SCORE are retained from the previous iteration. TOTAL is also retained.
2. The 3rd observation from ex2 is copied to the PDV.
3. TOTAL is calculated.
4. The implicit OUTPUT: the contents of the PDV are written to Ex2_2.

PDV:

Step                              _N_   _ERROR_   ID    Score   Total
1st iteration, at the start       1     0               .       0
1st iteration, SET executed       1     0         A01   3       0
1st iteration, TOTAL calculated   1     0         A01   3       3
2nd iteration, at the start       2     0         A01   3       3
2nd iteration, SET executed       2     0         A02   .       3
2nd iteration, TOTAL calculated   2     0         A02   .       3
3rd iteration, at the start       3     0         A02   .       3
3rd iteration, SET executed       3     0         A03   4       3
3rd iteration, TOTAL calculated   3     0         A03   4       7

Ex2_2:

    ID    SCORE   TOTAL
1   A01   3       3
2   A02   .       3
3   A03   4       7

THE SUM STATEMENT

The SUM statement has the following form:

VARIABLE + EXPRESSION;

VARIABLE: the numeric accumulator variable that is to be created
- It is automatically set to 0 at the beginning of the first iteration of the DATA step execution.
- It is retained in the following iterations.

EXPRESSION: any SAS expression
- If EXPRESSION evaluates to a missing value, it is treated as 0.

The previous program:

data ex2_2;
   set ex2;
   retain total 0;
   total = sum(total, score);
run;

can be rewritten as:

data ex2_2;
   set ex2;
   total + score;
run;
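The same statement form also works as a counter. The sketch below is an illustration only: the dataset name ex2_3 and the variable n_scored are hypothetical, and ex2 is recreated from in-stream data. It keeps the running total and also counts the non-missing scores, relying on the fact that a SAS comparison evaluates to 1 when true and 0 when false.

```sas
/* Sketch: two SUM statements, one accumulator and one counter.     */
data ex2;
   input id $ score;
datalines;
A01 3
A02 .
A03 4
;
run;

data ex2_3;                            /* hypothetical dataset name */
   set ex2;
   total + score;                      /* missing SCORE is treated as 0       */
   n_scored + (score ne .);            /* adds 1 only when SCORE is not missing */
run;

/* TOTAL is 3, 3, 7 and N_SCORED is 1, 1, 2 */
proc print data=ex2_3;
run;
```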

THE SUBSETTING IF STATEMENT

We use the subsetting IF statement to continue processing only the observations that meet the condition of the specified expression:

IF EXPRESSION;

If EXPRESSION is true for the observation, SAS:
- continues to execute the statements in the DATA step
- includes the current observation in the dataset

If EXPRESSION is false for the observation:
- no further statements are processed for that observation
- SAS immediately returns to the beginning of the DATA step
- the remaining program statements in the DATA step are not executed, and the current observation is not written to the output dataset
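A minimal sketch of a subsetting IF follows, using the three-observation dataset from the RETAIN example. The dataset name ex2_scored and the variable doubled are hypothetical and are there only to show that statements after a false IF are skipped.

```sas
/* Sketch: keep only the observations with a non-missing SCORE.     */
data ex2;
   input id $ score;
datalines;
A01 3
A02 .
A03 4
;
run;

data ex2_scored;                       /* hypothetical dataset name */
   set ex2;
   if not missing(score);              /* false for A02: back to the top, nothing written */
   doubled = 2*score;                  /* executed only for A01 and A03                    */
run;

/* Two observations remain: A01 and A03 */
proc print data=ex2_scored;
run;
```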

THE BY-GROUP PROCESSING IN THE DATA STEP

One observation per subject:

    ID    SCORE
1   A01   3
2   A02   .
3   A03   4

Multiple observations per subject (longitudinal data):

    ID    SCORE
1   A01   3
2   A01   4
3   A01   2
4   A02   4
5   A02   2

With longitudinal data, we often need to identify the beginning and end of measurement for each subject. This can be accomplished by using the BY-group processing method.

SAS locates the beginning and end of a BY group by creating two temporary indicator variables for each BY variable: FIRST.VARIABLE and LAST.VARIABLE.

Suppose ID is the “BY” variable:

    ID    SCORE   FIRST.ID   LAST.ID
1   A01   3       1          0         SAS reads the 1st observation for ID = A01
2   A01   4       0          0
3   A01   2       0          1         SAS reads the last observation for ID = A01
4   A02   4       1          0
5   A02   2       0          1
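FIRST.ID and LAST.ID are temporary and are never written to the output dataset, so they cannot be printed directly. One way to look at them, sketched here with hypothetical names (ex3_flags, first_id, last_id) and with the dataset called ex3 as in the later slides, is to copy them into ordinary variables.

```sas
/* Sketch: copy the temporary BY-group indicators into kept variables. */
data ex3;
   input id $ score;
datalines;
A01 3
A01 4
A01 2
A02 4
A02 2
;
run;

data ex3_flags;                        /* hypothetical dataset name */
   set ex3;                            /* already in ID order, so no PROC SORT needed here */
   by id;
   first_id = first.id;                /* 1 on the first observation of each ID */
   last_id  = last.id;                 /* 1 on the last observation of each ID  */
run;

proc print data=ex3_flags;
run;
```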

THE BY-GROUP PROCESSING IN THE DATA STEP

Calculating the total score for each subject:

Ex3:

    ID    SCORE
1   A01   3
2   A01   4
3   A01   2
4   A02   4
5   A02   2

Result:

    ID    TOTAL
1   A01   9
2   A02   6

proc sort data=ex3;
   by id;
run;

data ex3_1 (drop=score);
   set ex3;
   by id;
   if first.id = 1 then total = 0;
   total + score;
   if last.id = 1;
run;
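To follow this program iteration by iteration in the SAS log, a tracing variant can help. This is a sketch under assumptions: the PUT _ALL_ statement is an addition for illustration, and ex3 is created from in-stream data so that the code runs on its own. PUT _ALL_ writes the whole PDV, including FIRST.ID, LAST.ID, _N_, and _ERROR_.

```sas
/* Sketch: trace the PDV just before the subsetting IF.             */
data ex3;
   input id $ score;
datalines;
A01 3
A01 4
A01 2
A02 4
A02 2
;
run;

proc sort data=ex3;
   by id;
run;

data ex3_1 (drop=score);
   set ex3;
   by id;
   if first.id = 1 then total = 0;     /* restart the total for each subject     */
   total + score;                      /* SUM statement: TOTAL is retained       */
   put _all_;                          /* one log line per observation read      */
   if last.id = 1;                     /* only the last row per subject is kept  */
run;

/* Two observations: A01 with TOTAL 9 and A02 with TOTAL 6 */
proc print data=ex3_1;
run;
```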

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

1st iteration:_N_ 1, _ERROR_ 0 FIRST.ID 1, LAST.ID 1 only at beginning of 1st iterationID, Score missingTOTAL 0 because of the SUM statement

_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D

1 0 1 1 . 0

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

The SET statement is executed1st observation PDVFIRST.ID 1 and LAST.ID 0

_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D

1 0 1 0 A01 3 0

1st iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

FIRST.ID = 1: TOTAL 0

_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D

1 0 1 0 A01 3 0

1st iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

TOTAL is accumulated

_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D

1 0 1 0 A01 3 3

1st iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

The subsetting IF statement is evaluated to be FALSE because LAST.ID ≠ 1

_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D

1 0 1 0 A01 3 3

SAS returns to the beginning of the DATA step to begin the 2nd iteration

1st iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

_N_ ↑ 2The values for the rest of the variables are retained

_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D

2 0 1 0 A01 3 3

2nd iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

2nd observation PDVNot the first observation for A01: FIRST.ID 0Not the last observation for A01: LAST.ID 0

_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D

2 0 0 0 A01 4 3

2nd iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

FIRST.ID ≠ 1: no execution

_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D

2 0 0 0 A01 4 3

2nd iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

TOTAL is accumulated

_N_ D _ERROR_D ID K Total KScore DFIRST.ID D LAST.ID D

2 0 0 0 A01 4 7

2nd iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

_N_ D _ERROR_ D FIRST.ID D LAST.ID D ID K SCORE D TOTAL K

2 0 0 0 A01 4 7

The subsetting IF statement is evaluated to be FALSE because LAST.ID ≠ 1

SAS returns to the beginning of the DATA step to begin the 3rd iteration

2nd iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

_N_ ↑ 3

The values for the rest of the variables are retained

_N_ D _ERROR_ D FIRST.ID D LAST.ID D ID K SCORE D TOTAL K

3 0 0 0 A01 4 7

3rd iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

3rd observation → PDV

Not the first observation: FIRST.ID → 0

Last observation for A01: LAST.ID → 1

_N_ D _ERROR_ D FIRST.ID D LAST.ID D ID K SCORE D TOTAL K

3 0 0 1 A01 2 7

3rd iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

FIRST.ID ≠ 1: no execution

_N_ D _ERROR_ D FIRST.ID D LAST.ID D ID K SCORE D TOTAL K

3 0 0 1 A01 2 7

3rd iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

TOTAL is calculated

_N_ D _ERROR_ D FIRST.ID D LAST.ID D ID K SCORE D TOTAL K

3 0 0 1 A01 2 9

3rd iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

_N_ D _ERROR_ D FIRST.ID D LAST.ID D ID K SCORE D TOTAL K

3 0 0 1 A01 2 9

The subsetting IF statement is evaluated to be TRUE

3rd iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

SAS reaches the end of the 3rd iteration

The implicit OUTPUT executes

SAS returns to the beginning of the DATA step to begin the 4th iteration

ID TOTAL

1 A01 9

Ex3_1:

_N_ D _ERROR_ D FIRST.ID D LAST.ID D ID K SCORE D TOTAL K

3 0 0 1 A01 2 9

3rd iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

_N_ ↑ 4

The values for the remaining variables are retained

ID TOTAL

1 A01 9

Ex3_1:

_N_ D _ERROR_ D FIRST.ID D LAST.ID D ID K SCORE D TOTAL K

4 0 0 1 A01 2 9

4th iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

4th observation → PDV

FIRST.ID → 1

LAST.ID → 0

ID TOTAL

1 A01 9

Ex3_1:

_N_ D _ERROR_ D FIRST.ID D LAST.ID D ID K SCORE D TOTAL K

4 0 1 0 A02 4 9

4th iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

FIRST.ID = 1: TOTAL → 0

ID TOTAL

1 A01 9

Ex3_1:

_N_ D _ERROR_ D FIRST.ID D LAST.ID D ID K SCORE D TOTAL K

4 0 1 0 A02 4 0

4th iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

TOTAL is calculated

ID TOTAL

1 A01 9

Ex3_1:

_N_ D _ERROR_ D FIRST.ID D LAST.ID D ID K SCORE D TOTAL K

4 0 1 0 A02 4 4

4th iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

ID TOTAL

1 A01 9

Ex3_1:

_N_ D _ERROR_ D FIRST.ID D LAST.ID D ID K SCORE D TOTAL K

4 0 1 0 A02 4 4

The subsetting IF statement is evaluated to be FALSE

SAS returns to the beginning of the DATA step to begin the 5th iteration

4th iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

_N_ ↑ 5

The values for the remaining variables are retained

ID TOTAL

1 A01 9

Ex3_1:

_N_ D _ERROR_ D FIRST.ID D LAST.ID D ID K SCORE D TOTAL K

5 0 1 0 A02 4 4

5th iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

5th observation → PDV

FIRST.ID → 0

LAST.ID → 1

ID TOTAL

1 A01 9

Ex3_1:

_N_ D _ERROR_ D FIRST.ID D LAST.ID D ID K SCORE D TOTAL K

5 0 0 1 A02 2 4

5th iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

FIRST.ID ≠ 1: no execution

ID TOTAL

1 A01 9

Ex3_1:

_N_ D _ERROR_ D FIRST.ID D LAST.ID D ID K SCORE D TOTAL K

5 0 0 1 A02 2 4

5th iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

TOTAL is calculated

ID TOTAL

1 A01 9

Ex3_1:

_N_ D _ERROR_ D FIRST.ID D LAST.ID D ID K SCORE D TOTAL K

5 0 0 1 A02 2 6

5th iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

The subsetting IF statement is evaluated to be TRUE

ID TOTAL

1 A01 9

Ex3_1:

_N_ D _ERROR_ D FIRST.ID D LAST.ID D ID K SCORE D TOTAL K

5 0 0 1 A02 2 6

5th iteration:

THE BY-GROUP PROCESSING IN THE DATA STEP

data ex3_1 (drop=score); set ex3; by id; if first.id = 1 then total = 0; total + score; if last.id = 1; run;

PDV

ID SCORE

1 A01 3

2 A01 4

3 A01 2

4 A02 4

5 A02 2

ID TOTAL

1 A01 9

2 A02 6

Ex3_1:

_N_ D _ERROR_ D FIRST.ID D LAST.ID D ID K SCORE D TOTAL K

5 0 0 1 A02 2 6

SAS reaches the end of the 5th iteration

The implicit OUTPUT executes

5th iteration:
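The five iterations traced above can be reproduced in a SAS session. The sketch below is only an illustration under two assumptions: the input dataset ex3 is built with DATALINES to match the ID and SCORE values shown on the slides (the slides do not show how ex3 was created), and a PUT statement, which is not part of the original program, is added to write the PDV contents to the log just before the subsetting IF.

```sas
/* Assumed input: same five observations as shown on the slides */
data ex3;
   input id $ score;
   datalines;
A01 3
A01 4
A01 2
A02 4
A02 2
;
run;

data ex3_1 (drop=score);
   set ex3;
   by id;                                /* creates FIRST.ID and LAST.ID     */
   if first.id = 1 then total = 0;       /* reset at the start of each group */
   total + score;                        /* SUM statement: TOTAL is retained */
   /* Added for tracing only: one log line per iteration */
   put _n_= first.id= last.id= id= score= total=;
   if last.id = 1;                       /* keep only the last obs per ID    */
run;
```

The log should show TOTAL going 3, 7, 9 for A01 and 4, 6 for A02, and ex3_1 should contain the two observations shown above (A01 9 and A02 6).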

RESTRUCTURING DATASETS

Restructuring datasets:

data with one observation per subject (the wide format)

data with multiple observations per subject (the long format)

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

S1 – S3 in the wide format correspond to SCORE in the long format

TIME distinguishes the different measurements for each subject

RESTRUCTURING DATASETS

The transformation can be easily done by using ARRAY/PROC TRANSPOSE (see my paper “The Many Ways to Effectively Utilize Array Processing”, paper 244-2011)

For simpler cases, this can also be accomplished without advanced techniques

Here is a solution that uses multiple OUTPUT statements in one DATA step

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

Long:

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

Transform wide → long

2 observations to read → 2 DATA step iterations

Use multiple OUTPUT statements

Any missing values in S1 – S3 will not be output to long

data long (drop=s1-s3);
   set wide;
   time = 1; score = s1;
   if not missing(score) then output;
   time = 2; score = s2;
   if not missing(score) then output;
   time = 3; score = s3;
   if not missing(score) then output;
run;

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

1st iteration:

_N_ → 1

Other variables → missing

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

(blank) . . . . . 1

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

1st iteration:

1st observation from wide → PDV

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A01 3 4 5 . . 1

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

1st iteration:

TIME → 1

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A01 3 4 5 1 . 1

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

1st iteration:

SCORE ← value from S1 (3)

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A01 3 4 5 1 3 1

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

1st iteration:

SCORE ≠ missing: ID, TIME, and SCORE → long

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A01 3 4 5 1 3 1

ID TIME SCORE

1 A01 1 3

Long:

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

1st iteration:

TIME → 2

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A01 3 4 5 2 3 1

ID TIME SCORE

1 A01 1 3

Long:

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

1st iteration:

SCORE ← value from S2 (4)

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A01 3 4 5 2 4 1

ID TIME SCORE

1 A01 1 3

Long:

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

1st iteration:

SCORE ≠ missing: ID, TIME, and SCORE → long

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A01 3 4 5 2 4 1

ID TIME SCORE

1 A01 1 3

2 A01 2 4

Long:

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

1st iteration:

TIME → 3

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A01 3 4 5 3 4 1

ID TIME SCORE

1 A01 1 3

2 A01 2 4

Long:

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

1st iteration:

SCORE ← value from S3 (5)

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A01 3 4 5 3 5 1

ID TIME SCORE

1 A01 1 3

2 A01 2 4

Long:

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

1st iteration:

SCORE ≠ missing: ID, TIME, and SCORE → long

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A01 3 4 5 3 5 1

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

Long:

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

1st iteration:

No implicit OUTPUT executes at the end of the iteration, because the DATA step contains explicit OUTPUT statements

SAS returns to the beginning of the DATA step to begin the 2nd iteration

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A01 3 4 5 3 5 1

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

Long:

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

2nd iteration:

_N_ ↑ 2

ID and S1-S3 are retained from the previous iteration

TIME, SCORE → missing

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A01 3 4 5 . . 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

Long:

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

2nd iteration:

2nd observation from wide → PDV

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A02 4 . 2 . . 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

Long:

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

2nd iteration:

TIME → 1

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A02 4 . 2 1 . 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

Long:

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

2nd iteration:

SCORE ← value from S1 (4)

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A02 4 . 2 1 4 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

Long:

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

2nd iteration:

SCORE ≠ missing: ID, TIME, and SCORE → long

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A02 4 . 2 1 4 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

Long:

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

2nd iteration:

TIME → 2

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A02 4 . 2 2 4 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

Long:

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

2nd iteration:

SCORE ← value from S2 (missing)

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A02 4 . 2 2 . 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

Long:

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

2nd iteration:

SCORE = missing: no output is generated

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A02 4 . 2 2 . 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

Long:

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

2nd iteration:

TIME → 3

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A02 4 . 2 3 . 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

Long:

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

2nd iteration:

SCORE ← value from S3 (2)

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A02 4 . 2 3 2 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

Long:

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

2nd iteration:

SCORE ≠ missing: ID, TIME, and SCORE → long

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A02 4 . 2 3 2 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

Long:

FROM WIDE FORMAT TO LONG FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

Wide:

data long (drop=s1-s3); set wide; time = 1; score = s1; if not missing(score) then output; time = 2; score = s2; if not missing(score) then output; time = 3; score = s3; if not missing(score) then output; run;

2nd iteration:

SAS returns to the beginning of the DATA step to begin the 3rd iteration

With no more observations to read in the 3rd iteration, SAS goes to the next DATA or PROC step

ID K S1 D S2 D S3 D TIME K SCORE K _N_ D

A02 4 . 2 3 2 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

Long:

FROM WIDE FORMAT TO LONG FORMAT
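The three TIME/SCORE/OUTPUT triplets traced above follow one pattern, so they can be folded into a loop over an array, the ARRAY approach mentioned earlier. This is a minimal sketch, not the code from the cited paper; it assumes the dataset wide is created with DATALINES to match the slides, and the array name s is chosen here only for illustration.

```sas
/* Assumed input: the two wide observations shown on the slides */
data wide;
   input id $ s1 s2 s3;
   datalines;
A01 3 4 5
A02 4 . 2
;
run;

data long (drop=s1-s3);
   set wide;
   array s[3] s1-s3;                     /* s[1]=S1, s[2]=S2, s[3]=S3        */
   do time = 1 to 3;
      score = s[time];
      if not missing(score) then output; /* explicit OUTPUT, as in the slides */
   end;
run;
```

Because the OUTPUT statement is still explicit, no implicit OUTPUT runs at the end of each iteration, and long should contain the same five observations as in the walkthrough.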

FROM LONG FORMAT TO WIDE FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

Reading 5 observations but only creating 2 observations

You are not copying data from the PDV to the final dataset at each iteration

You only need to generate one observation once all the observations for each subject have been processed
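The idea of reading every observation but writing only once per subject can be seen in isolation with a small counting example. The sketch below is an illustration only: the dataset name n_per_id and the variable n_obs are hypothetical, and long is assumed to be built with DATALINES to match the slides.

```sas
/* Assumed input: the five long observations shown on the slides */
data long;
   input id $ time score;
   datalines;
A01 1 3
A01 2 4
A01 3 5
A02 1 4
A02 3 2
;
run;

data n_per_id (keep=id n_obs);
   set long;
   by id;
   if first.id then n_obs = 0;   /* reset the counter for each subject        */
   n_obs + 1;                    /* SUM statement: N_OBS is retained          */
   if last.id then output;       /* write once, after the subject's last obs  */
run;
```

Five observations are read, but n_per_id should contain only two: A01 with N_OBS = 3 and A02 with N_OBS = 2.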

FROM LONG FORMAT TO WIDE FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

SCORE → S1, S2, or S3, depending on TIME:

if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score;

Use RETAIN so that S1 – S3 keep their values across iterations

Use BY-group processing: BY ID

Output to the final dataset when LAST.ID = 1

FROM LONG FORMAT TO WIDE FORMAT

ID S1 S2 S3

1 A01 3 4 5

2 A02 4 . 2

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

proc sort data=long;
   by id;
run;

data wide (drop=time score);
   set long;
   by id;
   retain s1 - s3;
   if time = 1 then s1 = score;
   else if time = 2 then s2 = score;
   else s3 = score;
   if last.id;
run;

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

1st iteration:

_N_ → 1

FIRST.ID → 1, LAST.ID → 1

Other variables → missing

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

1 1 1 . . . . .

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

1st iteration:

The SET statement copies the 1st observation to the PDV

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

1 1 1 A01 1 3 . . .

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

1st iteration:

The SET statement copies the 1st observation to the PDV

FIRST.ID → 1 since this is the 1st observation for A01

LAST.ID → 0 since this is not the last observation for A01

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

1 1 0 A01 1 3 . . .

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

1st iteration:

Since TIME = 1, S1 ← SCORE (3)

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

1 1 0 A01 1 3 3 . .

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

1st iteration:

The subsetting IF statement is evaluated to be FALSE

SAS returns to the beginning of the DATA step to begin the 2nd iteration

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

1 1 0 A01 1 3 3 . .

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

2nd iteration:

_N_ ↑ 2

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

2 1 0 A01 1 3 3 . .

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

2nd iteration:

FIRST.ID and LAST.ID are retained; they are automatic variables

ID, TIME, SCORE are retained; they are from the input dataset

S1, S2, and S3 are retained because of the RETAIN statement

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

2 1 0 A01 1 3 3 . .

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

2nd iteration:

The SET statement copies the 2nd observation to the PDV

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

2 1 0 A01 2 4 3 . .

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

2nd iteration:

The SET statement copies the 2nd observation to the PDV

FIRST.ID → 0; this is not the first observation for A01

LAST.ID → 0; this is not the last observation for A01 either

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

2 0 0 A01 2 4 3 . .

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

2nd iteration:

Since TIME = 2, S2 ← SCORE (4)

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

2 0 0 A01 2 4 3 4 .

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

2nd iteration:

The subsetting IF statement is evaluated to be FALSE

SAS returns to the beginning of the DATA step to begin the 3rd iteration

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

2 0 0 A01 2 4 3 4 .

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

3rd iteration:

_N_ ↑ 3

The rest of the variables are retained

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

3 0 0 A01 2 4 3 4 .

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

3rd iteration:

The SET statement copies the 3rd observation to the PDV

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

3 0 0 A01 3 5 3 4 .

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

3rd iteration:

The SET statement copies the 3rd observation to the PDV

FIRST.ID → 0; this is not the first observation for A01

LAST.ID → 1; this is the last observation for A01

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

3 0 1 A01 3 5 3 4 .

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

3rd iteration:

Since TIME = 3, S3 ← SCORE (5)

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

3 0 1 A01 3 5 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

3rd iteration:

The subsetting IF statement is evaluated to be TRUE

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

3 0 1 A01 3 5 3 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

3rd iteration:

The implicit OUTPUT executes; variables marked with (K) are copied to the dataset wide

SAS returns to the beginning of the DATA step to begin the 4th iteration

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

3 0 1 A01 3 5 3 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

4th iteration:

_N_ ↑ 4

The rest of the variables are retained

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

4 0 1 A01 3 5 3 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

4th iteration:

The SET statement copies the 4th observation to the PDV

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

4 0 1 A02 1 4 3 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

4th iteration:

The SET statement copies the 4th observation to the PDV

FIRST.ID → 1; this is the first observation for A02

LAST.ID → 0; this is not the last observation for A02

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

4 1 0 A02 1 4 3 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

4th iteration:

Since TIME = 1, S1 ← SCORE (4)

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

4 1 0 A02 1 4 4 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

4th iteration:

The subsetting IF statement is evaluated to be FALSE

SAS returns to the beginning of the DATA step to begin the 5th iteration

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

4 1 0 A02 1 4 4 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

5th iteration:

_N_ ↑ 5

The rest of the variables are retained

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

5 1 0 A02 1 4 4 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

5th iteration:

The SET statement copies the 5th observation to the PDV

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

5 1 0 A02 3 2 4 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

5th iteration:

The SET statement copies the 5th observation to the PDV

FIRST.ID → 0; this is not the first observation for A02

LAST.ID → 1; this is the last observation for A02

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

5 0 1 A02 3 2 4 4 5

ID S1 S2 S3

1 A01 3 4 5

FROM LONG FORMAT TO WIDE FORMAT

ID TIME SCORE

1 A01 1 3

2 A01 2 4

3 A01 3 5

4 A02 1 4

5 A02 3 2

data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id;run;

5th iteration:

Since TIME = 3, S3 ← SCORE (2)

_N_ D FIRST.ID D LAST.ID D ID K TIME D SCORE D S1 K S2 K S3 K

5 0 1 A02 3 2 4 4 2

ID S1 S2 S3

1 A01 3 4 5

5th iteration: The subsetting IF statement evaluates to TRUE.

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
5        0             1            A02     3         2          4       4       2

Output dataset WIDE:

Obs  ID   S1  S2  S3
1    A01  3   4   5

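5th iteration: The implicit OUTPUT executes.

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
5        0             1            A02     3         2          4       4       2

Output dataset WIDE:

Obs  ID   S1  S2  S3
1    A01  3   4   5
2    A02  4   4   2

The S2 value for A02 is wrong. A02 has no observation with TIME = 2, so S2 should be missing, but the value 4 was retained from A01.

How to fix this?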

FROM LONG FORMAT TO WIDE FORMAT

data wide (drop=time score);
    set long;
    by id;
    retain s1 - s3;
    if first.id then do;
        s1 = .;
        s2 = .;
        s3 = .;
    end;
    if time = 1 then s1 = score;
    else if time = 2 then s2 = score;
    else s3 = score;
    if last.id;
run;
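The reset in the corrected step above can also be written more compactly. The sketch below is one possible variant, not the presenter's code: it uses an ARRAY and CALL MISSING, writes to a hypothetical dataset named WIDE_ALT, and recreates LONG from the slide values so that it runs on its own. It assumes TIME holds the integers 1 to 3. Unlike the original ELSE branch, which sends any other TIME value to S3, this variant ignores out-of-range values.

```sas
/* Recreate the LONG dataset from the values shown on the slides. */
data long;
    input id $ time score;
    datalines;
A01 1 3
A01 2 4
A01 3 5
A02 1 4
A02 3 2
;
run;

/* WIDE_ALT is a hypothetical name used only for this sketch. */
data wide_alt (drop=time score);
    set long;
    by id;
    array s{3} s1-s3;                            /* S1-S3 addressed by TIME   */
    retain s1-s3;                                /* keep values across rows   */
    if first.id then call missing(of s1-s3);     /* reset at each new ID      */
    if 1 <= time <= 3 then s{time} = score;      /* skip out-of-range TIME    */
    if last.id;                                  /* output one row per ID     */
run;

proc print data=wide_alt;
run;
```

The array form scales to more time points by changing only the array bounds, whereas the IF/ELSE chain needs one branch per time point.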

Input dataset LONG:

Obs  ID   TIME  SCORE
1    A01  1     3
2    A01  2     4
3    A01  3     5
4    A02  1     4
5    A02  3     2

4th iteration: _N_ is incremented to 4. The rest of the variables are retained.

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
4        0             1            A01     3         5          3       4       5

Output dataset WIDE:

Obs  ID   S1  S2  S3
1    A01  3   4   5

4th iteration: The SET statement copies the 4th observation to the PDV.

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
4        0             1            A02     1         4          3       4       5

Output dataset WIDE:

Obs  ID   S1  S2  S3
1    A01  3   4   5

4th iteration: The SET statement copies the 4th observation to the PDV. FIRST.ID is set to 1 because this is the first observation for A02. LAST.ID is set to 0 because this is not the last observation for A02.

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
4        1             0            A02     1         4          3       4       5

Output dataset WIDE:

Obs  ID   S1  S2  S3
1    A01  3   4   5

4th iteration: Since FIRST.ID = 1, S1 through S3 are set to missing.

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
4        1             0            A02     1         4          .       .       .

Output dataset WIDE:

Obs  ID   S1  S2  S3
1    A01  3   4   5

4th iteration: Since TIME = 1, S1 is assigned the value of SCORE (4).

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
4        1             0            A02     1         4          4       .       .

Output dataset WIDE:

Obs  ID   S1  S2  S3
1    A01  3   4   5

4th iteration: The subsetting IF statement evaluates to FALSE, so SAS returns to the beginning of the DATA step to begin the 5th iteration.

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
4        1             0            A02     1         4          4       .       .

Output dataset WIDE:

Obs  ID   S1  S2  S3
1    A01  3   4   5

5th iteration: _N_ is incremented to 5. The rest of the variables are retained.

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
5        1             0            A02     1         4          4       .       .

Output dataset WIDE:

Obs  ID   S1  S2  S3
1    A01  3   4   5

5th iteration: The SET statement copies the 5th observation to the PDV.

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
5        1             0            A02     3         2          4       .       .

Output dataset WIDE:

Obs  ID   S1  S2  S3
1    A01  3   4   5

5th iteration: The SET statement copies the 5th observation to the PDV. FIRST.ID is set to 0 because this is not the first observation for A02. LAST.ID is set to 1 because this is the last observation for A02.

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
5        0             1            A02     3         2          4       .       .

Output dataset WIDE:

Obs  ID   S1  S2  S3
1    A01  3   4   5

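5th iteration: Since FIRST.ID is not 1, the DO group is not executed, and S1 through S3 keep their retained values.

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
5        0             1            A02     3         2          4       .       .

Output dataset WIDE:

Obs  ID   S1  S2  S3
1    A01  3   4   5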

5th iteration: Since TIME = 3, S3 is assigned the value of SCORE (2).

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
5        0             1            A02     3         2          4       .       2

Output dataset WIDE:

Obs  ID   S1  S2  S3
1    A01  3   4   5

5th iteration: The subsetting IF statement evaluates to TRUE.

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
5        0             1            A02     3         2          4       .       2

Output dataset WIDE:

Obs  ID   S1  S2  S3
1    A01  3   4   5

5th iteration: SAS reaches the end of the 5th iteration and the implicit OUTPUT executes. S2 for A02 is now correctly missing.

PDV:

_N_ (D)  FIRST.ID (D)  LAST.ID (D)  ID (K)  TIME (D)  SCORE (D)  S1 (K)  S2 (K)  S3 (K)
5        0             1            A02     3         2          4       .       2

Output dataset WIDE:

Obs  ID   S1  S2  S3
1    A01  3   4   5
2    A02  4   .   2
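As a cross-check on the hand-written transposition, PROC TRANSPOSE can build the same wide layout. This is a different technique from the DATA step approach walked through above and is offered only as a sketch for comparison: the output name WIDE_T is hypothetical, and the DATALINES block is an assumption that recreates LONG from the slide values.

```sas
/* Recreate the LONG dataset from the values shown on the slides. */
data long;
    input id $ time score;
    datalines;
A01 1 3
A01 2 4
A01 3 5
A02 1 4
A02 3 2
;
run;

/* One output row per ID; TIME values name the new columns (s1, s2, s3). */
proc transpose data=long out=wide_t (drop=_name_) prefix=s;
    by id;        /* LONG must be sorted by ID */
    id time;      /* column suffix comes from TIME */
    var score;    /* values to spread across the columns */
run;

proc print data=wide_t;
run;
```

Because each BY group is transposed independently, no value carries over from one ID to the next, which is the behavior the FIRST.ID reset restores in the DATA step version.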

CONCLUSION

The most important part of DATA step processing is understanding how data is transferred to the PDV and how it is copied from the PDV to a new dataset.

To be successful SAS programmers, we must thoroughly comprehend how DATA steps are processed.

REFERENCES

Cody, Ron. 2001. Longitudinal Data and SAS® A Programmer’s Guide. Cary, NC: SAS Institute Inc.

ACKNOWLEDGEMENT

I would like to thank MaryAnne DePesquo for inviting me to present at SGF 2011.

CONTACT INFORMATION

Arthur X. Li

City of Hope Comprehensive Cancer Center

Division of Information Science

1500 East Duarte Road

Duarte, CA 91010-3000

Work Phone: (626) 256-4673 ext. 65121

Fax: (626) 471-7106

E-mail: xueli@coh.org