HRP 223 - 2008
HRP223 2008
Copyright © 1999-2008 Leland Stanford Junior University. All rights reserved. Warning: This presentation is protected by copyright law and international treaties. Unauthorized reproduction of this presentation, or any portion of it, may result in severe civil and criminal penalties and will be prosecuted to the maximum extent possible under the law.
HRP 223 - 2008
Topic 4 – Data Manipulation
Why Code
Data step advantages
– Splitting data into many subsets
– Tasks that require looping
– Quickly subsetting
– Complex retains
Minor tweaks with nice payoffs
– Adding Page Numbers
– Inserting Group Names in Titles
– Title and Footnote Justification
– Conditional Highlighting
Including parameters
Common Ground … where
The first week of class you saw that you can point-and-click with EG or write data step code or PROC SQL statements to subset data.
where
The syntax for where is identical in SQL and data steps. Differences vs. if statements:
– main points work in where only
  • sub points work in either
– x between y and z
  • x >= y and x <= z
  • y <= x <= z
– string1 ? string2 or string1 contains string2
  • index(string1, string2) > 0
– string1 =* string2
  • soundex(string1) = soundex(string2)
– x is null or x is missing
  • missing(x)
– string1 like "U%of%A%"
  • use regular expressions (PRX)
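A small sketch of the difference, assuming a made-up dataset called people (the names and ages are invented for illustration). The first step uses operators that only where understands; the second uses the equivalents that work in either where or if.

```sas
* Hypothetical example data;
data people;
  length name $20;
  input name $ age;
  datalines;
Smith 34
Smyth 51
Jones .
;
run;

* between and =* (sounds like) are where-only operators;
proc print data=people;
  where age between 30 and 60 and name =* "Smith";
run;

* The same logic written with pieces that also work in an if statement;
data subset;
  set people;
  if 30 <= age <= 60 and soundex(name) = soundex("Smith");
run;
```

Both versions should keep the Smith and Smyth records and drop the Jones record, whose age is missing.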
Why bother?
If you can use the GUI to write the subsets, why bother learning the code?
– It takes time to make a new dataset. If all you want is to subset for an analysis, it is a LOT faster to add the where into the analysis code.
  • First run the analysis on the complete data.
  • Right click the node and choose open last submitted code.
  • (Tell it to keep all variables.)
  • Scroll to the procedure and add in the where.
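As a rough illustration of adding the where straight into a procedure, here is a sketch that uses sashelp.class, a sample table that ships with SAS, as a stand-in for your own data. The code EG generates will look different; the point is only where the where statement goes.

```sas
* Analysis on the complete data;
proc means data=sashelp.class;
  var height weight;
run;

* Same analysis on a subset, with no new dataset created;
proc means data=sashelp.class;
  where sex = "F";
  var height weight;
run;
```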
Keep All Data
Before the analysis code, SAS puts in instructions to subset the data. Tell it to include all variables by adding the variable in the where statement or just use a *. More on this in a bit.
where Syntax
The where statement, like all SAS statements, begins with a keyword (where) and ends in a semicolon.
– where isDead = "false";
– where isDead ne "true";
– where missing(gender);
– where salary > 100000;
– where country in ("USA", "Japan", "UK");
– where country in ("USA" "Japan" "UK");
where Syntax
Arithmetic
– where salary/12 > 10000;
– where (salary/12) * 1.20 ge 9900;
– where salary + bonus < 120000;
Logical
– where gender ne "M" and salary >= 50000;
– where gender ne "M" or salary >= 50000;
– where country = "UK" or country = "UTAH";
– where country not in ("USA", "AU");
Make Decisions
SAS has many operations available to help you make decisions.
– = eq, ~= ne, < lt, > gt, <= le, >= ge, in ( )
– Not
  • Requires the expression following it to not be true.
– & and, | or, in
  • & requires both operands to be true.
  • | requires at least one operand to be true.
  • in ( ) requires at least one comparison to be true.
– Math operations:
  • + - * / **
Logical Decisions & Compound Expressions
Use the List Data … option on the Describe menu to choose what variables to report, then include validity checks on the data.
Common tests and common problems:
where YODeath < YOBirth;
where Sex = "M" and numPreg > 0;
where Sex = "M" and numPreg > 0 or ageLMP > 0; *** bad ***;
where Sex = "M" and (numPreg > 0 or ageLMP > 0); *** good ***;
– Moral: Use parentheses generously with ands and ors.
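To see why the unparenthesized version is bad, here is a sketch with three invented records (the dataset name checks and its values are assumptions for illustration). SAS evaluates and before or, so the first where also picks up any record with ageLMP > 0, regardless of sex.

```sas
* Hypothetical records: one clean male, one female, one impossible male;
data checks;
  input sex $ numPreg ageLMP;
  datalines;
M 0 0
F 2 50
M 1 0
;
run;

* Read as (sex = "M" and numPreg > 0) or (ageLMP > 0);
proc print data=checks;
  where sex = "M" and numPreg > 0 or ageLMP > 0;
run;

* The parentheses restrict both tests to males;
proc print data=checks;
  where sex = "M" and (numPreg > 0 or ageLMP > 0);
run;
```

The first listing shows the female record along with the impossible male; the second shows only the male with a pregnancy count.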
Looking at Data
The traditional way to look at data is with proc print.
proc print data=parity;
  var gender numBirths yoBirth yoDeath ageLMP;
run;
You can print out the corners of your data table.
proc print data=parity(firstobs=6 obs=6);
  var gender ageLMP;
run;
obs= is the number of the last observation to read, not a count. SAS should have called this lastobs.
Moving Stuff in EG
Last time somebody asked me how to move stuff between process flows and I said that I just copied the entire project.
Actually, you can copy a bunch of stuff then right click and choose “Move to > somewhere”.
Data Step
There are a few things that can be done in a data step that can’t be done in SQL.
Most SAS programmers do not know SQL and I need you to be able to look at their code.
Data Step Parts
Data steps begin with a data statement.
The second statement is usually a set statement or an input statement.
There are any number of additional statements after the set or input line.
The data step ends with a run statement or (if the programmer is too lazy to type run;) at the beginning of the next data step or procedure.
About that second line…
set blah;
– Says you are going to read data from an existing SAS data set (called blah in this case) into your new data set.
input gender $ age;
– Means that you are going to read existing data from this page of code or from a text file. Typically the input statement appears with a datalines statement (for reading from this file) or an infile statement (for reading from another text file).
– "gender $" means that gender is a character string.
– age does not have a $, so it is a numeric variable.
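Both flavors of the second line, sketched with invented data (the dataset names people and copy are placeholders, not from the course files):

```sas
* input plus datalines: read values typed on this page of code;
data people;
  input gender $ age;   * gender is character, age is numeric;
  datalines;
M 34
F 29
;
run;

* set: read an existing SAS data set into a new one;
data copy;
  set people;
run;
```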
Those lines after the 2nd line
Commands that you are likely to see after the set line include:
where statements are used to select what records to include based on the values in the source file.
if-then-else statements are used to check simple logic to assign new values.
select statements are used to perform complex checking and choosing from a list.
How SAS Processes a Dataset
When you create a SAS data set with data step code, SAS does the following things:
1. It figures out what variables it needs to track and it sets aside some space in the computer’s working memory (RAM) to hold the variables. This space is called the Program Data Vector (PDV).
2. It sets the values in the PDV to missing.
3. Then it does all the instructions you tell it to do, in the order you have written them.
4. Then it writes all the variables out to the new dataset.
5. It then repeats the process if there is more data.
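One way to watch these steps happen is to write the PDV to the log with put _all_. This is only a sketch; the dataset old and its two variables are invented for the example.

```sas
* Hypothetical source data;
data old;
  input id refage;
  datalines;
1 45
2 67
;
run;

data new;
  set old;                    * a record is read into the PDV;
  put "Before: " _all_;       * isYoung is still missing here;
  isYoung = (refage < 50);    * 1 if true, 0 if false;
  put "After:  " _all_;       * isYoung now has a value;
run;                          * the record is written to new, then the step repeats;
```

The log lines also show the automatic variables _N_ and _ERROR_, which live in the PDV but are not written to the new dataset.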
Manipulating Data
Say you have a dataset with a bunch of variables. How does SAS keep track of the data and allow you to manipulate it?
Say this is the dataset called “OLD”.
Manipulating Data
When you do a data step, every variable on the set or input line is added to the PDV for the life of the data step.
data new;
  set old;
run;
The variables id, race, case, refage and lname are put into the PDV.
If you don’t tell SAS to do something different, every variable in the PDV is output to the new dataset.
The variables id, race, case, refage and lname are put into the new dataset.
How SAS Really Works
The Program Data Vector
SAS processes information a record at a time in the PDV. The PDV tracks variable names and their contents, plus a couple of automatic variables. The automatic ones don’t get output.
SAS forgets what is in the vector when it reads the next record but you can force it to remember without too much effort.
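The slide does not say which statement it has in mind for making SAS remember; the retain statement is one standard way, sketched here with an invented dataset old. Without the retain, total would be reset to missing at the top of every pass through the data step.

```sas
* Hypothetical source data;
data old;
  input id refage;
  datalines;
1 45
2 67
3 52
;
run;

data running;
  set old;
  retain total 0;            * keep total from one record to the next;
  total = total + refage;    * running sum of refage;
run;
```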
Working With Variables
It is easy to add variables to the PDV. You can create a variable called isMale and set it to a value for everyone in a dataset like this:
data new;
  set old;
  isMale = 'yes';
run;
isMale is added to the PDV.
id, race, case, refage, lname and isMale are put into the new dataset.
Or you can conditionally assign a variable’s value with an if statement:
if refAge < 50 then isYoung = 1;
if then (else) statements
If you want to do something if a condition is true, use an if-then statement.
Remember you need both words, if and then.
data males females;
  set parity;
  if gender = "m" then output males;
  else output females;
run;
What could possibly go wrong?
If you send the people who have a gender of “m” to one dataset and everyone else to another, you will be dealing with major headaches later. The “female” records plus all the “Males”, “ ”, and every misspelling of male and female go into that second file.
I do use if-then statements but very rarely do I use simple else statements.
select-when-otherwise-end
I use select statements instead of complex else logic. The first condition in the block that is true is executed and the rest are ignored.
data males females others;
  set parity;
  select (gender);
    when ("M") output males;
    when ("F") output females;
    otherwise output others;
  end;
run;
Creating a Variable
data x;
  input grade $ @@;
  datalines;
A B C D F
;
run;
data y;
  set x;
  select (grade);
    when ("A") score = "Woop!!!!";
    when ("B", "C") score = "Bah";
    when ("D", "F") score = "Ut oh";
  end;
run;
Complex Decisions
data ovary.affected;
  set ovary.rptca;
  select;
    when (refage = .) agegr = .;
    when (refage < 60) agegr = 1;
    when (refage < 65) agegr = 2;
    when (refage < 70) agegr = 3;
    when (refage >= 70) agegr = 4;
  end;
run;
Note: NO THEN
Note: Select ends with end;
Missing is treated as smaller than any number (think negative infinity), so check for missing before your first <.
The first condition that is true is done and the rest are ignored, so get things in the correct order.
I use select statements to track known problems in a dataset.
data alice2 missingStuff badAge ageThing;
  set alice;
  select;
    * These are FATAL data errors each dataset should have 0 observations;
    * no year blood draw, yob, age at entry in study;
    when (missing(yr_bl_dr) or missing(birthyr) or missing(dadage)) output missingStuff;
    * age from blood draw inconsistent with reported age;
    when ((yr_bl_dr - birthyr) - dadage > 2) output badAge;
    * blood draw before age at birth;
    when (yr_bl_dr - birthyr < 0) output ageThing;
    otherwise output alice2;
  end;
run;
* NOTE: this does not notice multiple errors;
No Otherwise
If you leave off the otherwise statement, SAS will generate an error if the data is not “trapped” by one of the other conditions.
This is very helpful because it makes it easy to see problems.
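A sketch of that behavior, using invented grades: the value Z matches none of the when conditions, and because there is no otherwise, SAS should stop the step and report an unsatisfied WHEN clause in the log.

```sas
* No otherwise statement: the unexpected grade Z triggers an error in the log;
data graded;
  input grade $ @@;
  select (grade);
    when ("A") score = "Great";
    when ("B", "C") score = "Okay";
  end;
  datalines;
A B Z
;
run;
```

Adding an otherwise that routes the surprises to their own dataset, as on the earlier select slides, trades the hard stop for a file you have to remember to check.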
Adding New Variables
As it scans down the page containing a data step, SAS figures out if new variables are character or numeric by looking for quotation marks. The first time it sees a new variable it sets the width in the PDV.
Playing with Character Variables
If you manipulate character strings you want to remember these things:
– upcase()
– lowcase()
What variables and contents are in the new dataset?
data case;
  band = "Skinny Puppy";
  uBand = upcase(band);
  output;
  band = "Assemblage 23";
  lBand = lowcase(band);
  output;
run;
Length
Be sure to set the length of the variable to be wide enough to hold your data.
data case2;
  length band $50.;
  band = "Skinny Puppy";
  uBand = upcase(band);
  output;
  uBand = "";
  band = "Assemblage 23";
  lBand = lowcase(band);
  output;
run;
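What happens without a length statement, as a minimal sketch with made-up values: the first assignment fixes the width of the variable, and anything longer is silently cut off.

```sas
* band gets a width of 4 from its first value;
data short;
  band = "Tool";
  output;
  band = "Skinny Puppy";   * stored as "Skin";
  output;
run;

* Declaring the length first leaves room for the longer value;
data long;
  length band $ 50;
  band = "Tool";
  output;
  band = "Skinny Puppy";
  output;
run;
```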
EG Helps
Combining
EG 4.1 does not have all the functions in SAS 9.1.3 listed. A couple of important missing functions are the CATs.
CAT Function
– Concatenates character strings without removing leading or trailing blanks
CATS Function
– Concatenates character strings and removes leading and trailing blanks
CATT Function
– Concatenates character strings and removes trailing blanks
CATX Function
– Concatenates character strings, removes leading and trailing blanks, and inserts separators
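The four functions side by side, in a sketch with invented names. Because first and last are padded to 10 characters, the differences in blank handling show up in the log.

```sas
data names;
  length first last $10;
  first = "Ada";
  last  = "Lovelace";
  plain  = cat(first, last);        * keeps the padding after Ada;
  tight  = cats(first, last);       * AdaLovelace;
  trail  = catt(first, last);       * trailing blanks removed from each piece;
  spaced = catx(" ", first, last);  * Ada Lovelace;
  put plain= / tight= / trail= / spaced=;
run;
```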
Compressing
Often you will have a variable which has extra characters in it and you want to get rid of them.
– Check digits in medical record numbers.
Use the function compress() to remove the dashes and spaces. The second argument, "- ", lists the characters to remove.
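A minimal sketch of compress() on an invented record number (the variable names raw and clean are placeholders):

```sas
data mrn;
  raw = "123-45 678-9";
  clean = compress(raw, "- ");   * remove every dash and space;
  put clean=;                    * writes clean=123456789 to the log;
run;
```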
Splitting Strings
If you need to break a string of letters into words, use the scan() function.
– Specify the original string, comma, the word number, comma, an optional list of word delimiters.
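A sketch of scan() pulling pieces out of an invented name string. The third argument lists the characters that separate words.

```sas
data words;
  full  = "Doe, Jane Q.";
  last  = scan(full, 1, ",");    * first word, splitting on commas only: Doe;
  first = scan(full, 2, ", ");   * second word, splitting on commas and blanks: Jane;
  put last= first=;
run;
```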
The First Word
Example of Character Functions
Variable Order
There are times when you will want to move a variable to the beginning of the PDV and therefore, to the left side of a dataset. I do this if I am calculating values and I do not want to scroll to the end of the spreadsheet (viewtable) to check a value.
Just reference the variable before the set statement.
data life;
  input subj_id yob yod @@;
  datalines;
1000100 1920 1942 1000101 1921 1942
1000102 1930 1995
;
run;
data span;
  * move age to head of pdv by referencing it before it is read in the set statement;
  age = 0;
  set life;
  age = yod - yob;
run;
Importing Data from External Text Files
You also use the keyword input to get data from a stored text file. Specify an infile statement to define the source of your data and do not use datalines.
data life;
  infile 'c:\projects\blah\life.txt';
  input subj_id yob dob;
run;
Importing Data…the Hard Way
You can specify what column has the data you want as well as how wide it is:
data rawblah;
  infile 'c:\projects\pam\prostate.dat';
  input @1 id 7.
        @3 race 1.
        @2 case 1.
        @24 refage 2.
        @99 l_name $10.;
run;
id 7. means this variable is written as a seven digit number with no decimal places.
l_name $10. tells SAS that this variable is going to hold 10 characters.
Importing Data… the Hard Way
If you have fixed length character variables, specify them with a dollar sign and an informat like this:
input l_name $10.;
If your character variables are of variable length and you want to read them up to a maximum length or a delimiter, include a : in the specification:
input l_name : $10.;
This is handy if you are reading tab-delimited data with character variables with embedded blanks.
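A sketch of the colon modifier on delimited data. It uses commas instead of tabs so the delimiters are visible on the page, and the records are invented. Because the delimiter is a comma, the blanks inside the name are read as part of the value.

```sas
data names;
  infile datalines dlm=",";
  * read l_name up to the next comma, keeping at most 20 characters;
  input id l_name : $20. age;
  datalines;
1,Van Der Berg,45
2,Li,30
;
run;

proc print data=names;
run;
```

For a real tab-delimited file you would point infile at the file and name the tab as the delimiter instead, for example with dlm='09'x.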
Comments
Comment the heck out of the code you write.
Two syntaxes you have seen:
– * blah;
– /* blah */
You can also select a block of code and push
– Control / to comment it out.
– Control Shift / to turn the comment back into code.
What is a bug anyway?
When you write a program and it doesn’t work the way that you intended, it is described as having a bug.
There are many types of bugs. Syntax and semantic errors are relatively easy to find and fix. When these errors happen, SAS can not figure out what you want done. Conceptual errors happen when SAS understands the words you give it but it does not do what you intended. These can be very, very hard to find and fix.
Spotting syntax and semantic bugs is easy. You just need to look in the SAS log.
Syntax Errors
As you try to write code you will see syntax errors and lots of red in the log. Look at the line it marks first. If you can’t see the problem, look for problems (especially a missing semicolon) on the line above where the red begins.
– Misspelled keywords
– Unmatched quotation marks
– Missing semicolons
– Invalid options
What is a bug anyway? (2)
You will look in the log window to find out if SAS found any syntax errors.
* oops forgot the "then";