Efficient SAS programming with Large Data

Efficient SAS programming with Large Data Aidan McDermott Computing Group, March 2007

Page 1: Efficient SAS programming with Large Data

Efficient SAS programming with Large Data

Aidan McDermott

Computing Group, March 2007

Page 2: Efficient SAS programming with Large Data

Axes of Efficiency

• processing speed:
  – CPU
  – real

• storage:
  – disk
  – memory
  – …

• user:
  – functionality
  – interface to other systems
  – ease of use
  – learning

• user development:
  – methodologies
  – reusable code
  – facilitate extension, rewriting
  – maintenance

Page 3: Efficient SAS programming with Large Data

Dataset / Table

Page 4: Efficient SAS programming with Large Data

• Datasets consist of three parts

Page 5: Efficient SAS programming with Large Data

General (and obvious) principles

• Avoid doing the job if possible

• Keep only the data you need to perform a particular task (use DROP, KEEP, WHERE, and subsetting IF)
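A minimal sketch of this principle, assuming a hypothetical wide dataset `big` in which only `siteid`, `date`, and `ozone` are needed. The dataset, variable names, and cut-off date are illustrative and do not come from the slides.

```sas
/* Hypothetical input: a wide dataset where only a few columns matter. */
data big;
  input siteid $ date :date9. ozone temp humidity;
  format date date9.;
  datalines;
A01 01JAN2000 31 12 40
A01 02JAN2000 35 14 42
B07 01JAN2000 28 10 55
;
run;

/* KEEP= and WHERE= as dataset options on the input dataset:
   unwanted variables and observations are filtered out before
   they are brought into the data step. */
data small;
  set big (keep=siteid date ozone
           where=(date >= '02JAN2000'd));
run;
```

Filtering on the input side is usually cheaper than reading everything and discarding it later with a subsetting IF or a DROP statement on the output.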

Page 6: Efficient SAS programming with Large Data

Combining datasets -- concatenation
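The content of this slide was not captured in the transcript. As a hedged illustration of the topic, the sketch below contrasts two standard ways of concatenating datasets; the dataset names `base`, `new`, and `combined` are made up for the example.

```sas
data base;
  input id x;
  datalines;
1 10
2 20
;
run;

data new;
  input id x;
  datalines;
3 30
;
run;

/* Data step concatenation: reads and rewrites every observation
   of both input datasets. */
data combined;
  set base new;
run;

/* PROC APPEND: adds the observations of NEW to the end of BASE
   without reading the observations already stored in BASE. */
proc append base=base data=new;
run;
```

When the base dataset is large and the addition is small, PROC APPEND tends to be the cheaper choice, at the price of modifying `base` in place.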

Page 7: Efficient SAS programming with Large Data

General (and obvious) principles

• Often efficient methods were written to perform the required task – use them.

Page 8: Efficient SAS programming with Large Data

General (and obvious) principles

• Often efficient methods were written to perform other tasks – use them with caution.

• Write data-driven code
  – it’s easier to maintain data than to update code

• Use length statements to limit the size of variables in a dataset to no more than is needed.
  – you don’t always know what size this should be, and you don’t always produce your own data.

• Use formatted data rather than the data itself
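One possible reading of the last two bullets, sketched under assumptions: store short coded values with explicit lengths and attach a format for display. The format name `$sexfmt` and the `patients` dataset are hypothetical.

```sas
/* User-defined format: maps a one-character code to a display label. */
proc format;
  value $sexfmt 'M' = 'Male'
                'F' = 'Female';
run;

data patients;
  /* LENGTH comes before the variables are first used.
     A numeric length of 4 is only safe for modest integer values;
     shortened numerics lose precision for fractions and large values. */
  length patientID $ 6 gender $ 1 visits 4;
  input patientID $ gender $ visits;
  format gender $sexfmt.;
  datalines;
321C-4 M 3
118A-2 F 12
;
run;
```

The dataset stores one byte per `gender` value, while procedures that honor formats display the longer labels.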

Page 9: Efficient SAS programming with Large Data

Memory resident datasets
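The slide body was not captured. One documented way to hold a dataset in memory across several steps is the SASFILE statement; the sketch below assumes that is the kind of technique meant here, and the dataset `work.lookup` is invented for the example.

```sas
data work.lookup;
  do id = 1 to 1000;
    value = id * 2;
    output;
  end;
run;

/* Load the whole dataset into memory; later steps read it from RAM. */
sasfile work.lookup load;

proc means data=work.lookup noprint;
  var value;
  output out=stats mean=mean_value;
run;

/* Release the memory when the dataset is no longer needed. */
sasfile work.lookup close;
```

This only pays off when the same dataset is read several times and fits comfortably in available memory.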

Page 10: Efficient SAS programming with Large Data

Compressing Datasets

• Compress datasets with a compression utility such as compress, gzip, winzip, or pkzip, and decompress them before running each SAS job.
  – delays execution, and you need to keep track of data and program dependencies.

• Use a general-purpose compression utility and decompress the data within SAS for sequential access.
  – system dependent (needs a named pipe), sequential dataset storage.
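A rough sketch of the second approach. The slide mentions a named pipe; this example swaps in the FILENAME statement's PIPE device type (an unnamed pipe) because it is simpler to show. It assumes a Unix-like host with `gzip` on the path, a writable `/tmp`, and a SAS session that allows host commands (XCMD). The file name is hypothetical.

```sas
/* Write a small CSV straight into gzip. */
filename zout pipe 'gzip -c > /tmp/ozone_demo.csv.gz';

data _null_;
  file zout;
  put 'region,siteid,date,ozone';
  put 'East,A01,2000-01-01,31';
  put 'East,A01,2000-01-02,35';
run;

/* Read it back sequentially; the uncompressed text is never
   written to disk. */
filename zin pipe 'gzip -dc /tmp/ozone_demo.csv.gz';

data ozone;
  infile zin dlm=',' dsd firstobs=2;
  input region :$8. siteid :$8. date :yymmdd10. ozone;
  format date yymmdd10.;
run;

filename zout clear;
filename zin clear;
```

Because the data arrives as a stream, only sequential reads are possible; anything that needs random access has to work on a regular SAS dataset instead.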

Page 11: Efficient SAS programming with Large Data

Compressing Datasets

Page 12: Efficient SAS programming with Large Data

SAS internal Compression

• Allows random access to the data and is very effective under the right circumstances. In some cases it doesn’t reduce the size of the data by much.

• “There is a trade-off between data size and CPU time”.
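A small sketch of SAS internal compression using the COMPRESS= dataset option. The datasets `plain` and `squeezed` are invented; the long, mostly blank character variable is chosen because that is a case where compression tends to help.

```sas
data plain;
  length comment $ 200;
  do id = 1 to 10000;
    comment = 'short text';   /* 200 bytes stored, mostly blanks */
    output;
  end;
run;

/* COMPRESS=YES (alias CHAR) applies run-length encoding
   to this output dataset only. */
data squeezed (compress=yes);
  set plain;
run;
```

SAS writes a note to the log reporting by how much compression changed the dataset size, which is a quick way to judge whether the extra CPU time is worth it for a given dataset.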

Page 13: Efficient SAS programming with Large Data

• indata is a large dataset and you want to produce a version of indata without any observations
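Two common ways to do this, shown as a sketch. The `indata` dataset is generated here only so the example runs on its own; `empty` and `empty2` are hypothetical output names.

```sas
/* Stand-in for the large dataset. */
data indata;
  do id = 1 to 100000;
    x = id * 2;
    output;
  end;
run;

/* STOP executes before SET reads anything, so no observation is
   written; the variables still exist because SET was processed
   during the compile phase. */
data empty;
  stop;
  set indata;
run;

/* Same result via a dataset option: read zero observations. */
data empty2;
  set indata (obs=0);
run;
```

Neither version reads the observations of `indata`, so the cost does not depend on how large it is.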

Page 14: Efficient SAS programming with Large Data

The data step is a two-stage process
• compile phase
• execute phase

Page 15: Efficient SAS programming with Large Data

Data step logic

Page 16: Efficient SAS programming with Large Data

Data step logic

Page 18: Efficient SAS programming with Large Data

data step

Page 19: Efficient SAS programming with Large Data

data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;

Name       type  size  drop  retain  format  value
patientID  C     6     n     y
gender     C     1     n     y
admit      N     8     n     y       date8.
length     N     8     n     y
discharge  N     8     n     n       date8.
_N_
_ERROR_                                      0

PDV: compile phase

Page 20: Efficient SAS programming with Large Data

data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;

Name       type  size  drop  retain  format  value
patientID  C     6     n     y               321C-4
gender     C     1     n     y               M
admit      N     8     n     y       date8.  15736
length     N     8     n     y               21
discharge  N     8     n     n       date8.
_N_                                          1
_ERROR_                                      0

PDV: execute phase

Page 21: Efficient SAS programming with Large Data

data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;

Name       type  size  drop  retain  format  value
patientID  C     6     n     y               321C-4
gender     C     1     n     y               M
admit      N     8     n     y       date8.  15736
length     N     8     n     y               21
discharge  N     8     n     n       date8.  15757
_N_                                          1
_ERROR_                                      0

PDV: execute phase

Page 22: Efficient SAS programming with Large Data

data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run; /* implicit output */

Name       type  size  drop  retain  format  value
patientID  C     6     n     y               321C-4
gender     C     1     n     y               M
admit      N     8     n     y       date8.  15736
length     N     8     n     y               21
discharge  N     8     n     n       date8.  15757
_N_                                          1
_ERROR_                                      0

PDV: execute phase

Page 23: Efficient SAS programming with Large Data

data admits;
  set admits;
  discharge = admit + length;
  format discharge date8.;
run;

Name       type  size  drop  retain  format  value
patientID  C     6     n     y               321C-4
gender     C     1     n     y               M
admit      N     8     n     y       date8.  15736
length     N     8     n     y               21
discharge  N     8     n     n       date8.
_N_                                          2
_ERROR_                                      0

PDV: execute phase
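The PDV walk-through above can be observed in the log by printing the vector during execution. This is a sketch, not part of the original slides: it rebuilds a one-row `admits` dataset from the values shown in the tables and adds PUT statements to the same data step.

```sas
/* One-row version of the admits data, using the values from the slides. */
data admits;
  length patientID $ 6 gender $ 1;
  input patientID $ gender $ admit length;
  format admit date8.;
  datalines;
321C-4 M 15736 21
;
run;

data admits;
  set admits;
  put 'Before the assignment:';
  put _all_;                     /* discharge is still missing */
  discharge = admit + length;
  put 'After the assignment:';
  put _all_;                     /* discharge now holds admit + length */
  format discharge date8.;
run;                             /* implicit output */
```

`put _all_` writes every variable in the PDV to the log, including the automatic variables `_N_` and `_ERROR_`.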

Page 24: Efficient SAS programming with Large Data

Efficiency: suspend the PDV activities

Page 25: Efficient SAS programming with Large Data

General principles

• Use BY processing whenever you can
• Given the data below, for each region, siteid, and date, calculate the mean and maximum ozone value.

Page 26: Efficient SAS programming with Large Data

General principles

• Easy:
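The code on this slide was not captured. A plausible version of the "easy" solution, sketched with PROC MEANS and a BY statement; the sample rows and the output names `daily`, `mean_ozone`, and `max_ozone` are assumptions.

```sas
/* Made-up sample with the variables named on the previous slide. */
data ozone;
  input region $ siteid $ date :date9. ozone;
  format date date9.;
  datalines;
East A01 01JAN2000 31
East A01 01JAN2000 35
East A01 02JAN2000 28
West B07 01JAN2000 40
;
run;

/* BY processing requires the data to be sorted by the BY variables. */
proc sort data=ozone;
  by region siteid date;
run;

proc means data=ozone noprint;
  by region siteid date;
  var ozone;
  output out=daily (drop=_type_ _freq_)
         mean=mean_ozone max=max_ozone;
run;
```

The output dataset `daily` holds one observation per region, siteid, and date combination.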

Page 27: Efficient SAS programming with Large Data

General principles

• Suppose there are multiple monitors at each site and you still need to calculate the daily mean?
  – Combine multiple observations onto one line and then compute the statistics?

• Suppose you want the 10% trimmed mean?

• Suppose you want the second maximum?
  – Use arrays to sort the data?
  – Write your own function?
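One way to get the second maximum with BY processing alone, offered as a sketch rather than as the approach the slides go on to recommend: sort each group in descending order of ozone and keep the second observation. The sample data and the names `sorted`, `second_max`, and `seq` are hypothetical.

```sas
data ozone;
  input region $ siteid $ date :date9. ozone;
  format date date9.;
  datalines;
East A01 01JAN2000 31
East A01 01JAN2000 35
East A01 01JAN2000 33
East A01 02JAN2000 28
West B07 01JAN2000 40
West B07 01JAN2000 38
;
run;

/* Largest ozone value first within each region/siteid/date group. */
proc sort data=ozone out=sorted;
  by region siteid date descending ozone;
run;

data second_max;
  set sorted;
  by region siteid date;
  if first.date then seq = 0;   /* restart the counter for each group */
  seq + 1;                      /* sum statement: seq is retained */
  if seq = 2 then output;       /* second-largest value in the group */
  keep region siteid date ozone;
run;
```

Groups with a single observation produce no row, and tied maxima are counted separately, so the second value can equal the maximum. Whether that is the desired definition depends on the analysis.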
