Optimization in SDD using Compress Function

Post on 09-Jun-2015

A Presentation on Optimizing the IDB Process

By Navin Kumar

Process-I

Use of the Compress Option

The Compress option can greatly reduce the execution time required for dataset creation, in some cases by up to ~40%.

Since IDB deals with huge volumes of data, this option is particularly pertinent to our day-to-day programming.

Why use the compress option and what does it do?

Compressing a file reduces the number of bytes required to represent each observation. It converts each observation to a variable-length record, whereas in an uncompressed dataset each observation is a fixed-length record. As a result, fewer I/O operations are required to read or write the data during processing.

The Sub-Options with the Compress Option

Compress=yes or Compress=char: The observations in a SAS data set are compressed by reducing repeated consecutive characters (including blanks) to two-byte or three-byte representations.

Compress=binary: This is highly effective for compressing medium to large blocks of binary data (numeric variables). Because compression works on one observation at a time, the record length needs to be several hundred bytes or larger for it to pay off.

Datasets with many numeric variables (flags), such as events, css and the like, benefit the most.
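
The idea behind Compress=char can be illustrated with a toy run-length encoder. This is only a sketch of the general technique, not the algorithm SAS actually uses: the marker byte, the minimum run length, the 40-character field width and the function names are all assumptions made for the illustration. It shows how fixed-length records padded with blanks turn into shorter, variable-length records.

```python
MARKER = 0xFF  # hypothetical escape byte; this toy assumes it never appears in the data
MIN_RUN = 4    # runs shorter than this are cheaper to store as they are


def rle_encode(record: bytes) -> bytes:
    """Replace each run of MIN_RUN or more identical bytes with a 3-byte token."""
    out = bytearray()
    i = 0
    while i < len(record):
        j = i
        # Extend the run, capping it at 255 so the count fits in one byte.
        while j < len(record) and record[j] == record[i] and j - i < 255:
            j += 1
        run = j - i
        if run >= MIN_RUN:
            out += bytes([MARKER, run, record[i]])  # (marker, count, repeated byte)
        else:
            out += record[i:j]                      # short run: copy unchanged
        i = j
    return bytes(out)


def rle_decode(data: bytes) -> bytes:
    """Inverse of rle_encode."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == MARKER:
            count, value = data[i + 1], data[i + 2]
            out += bytes([value]) * count
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


def fixed_width(name: str, width: int = 40) -> bytes:
    """Pad a value with blanks, the way a fixed-length character variable is stored."""
    return name.ljust(width).encode("ascii")


records = [fixed_width(n) for n in ("SMITH", "ABERNATHY-JONES", "LI")]
encoded = [rle_encode(r) for r in records]

for raw, enc in zip(records, encoded):
    print(len(raw), "bytes uncompressed ->", len(enc), "bytes compressed")
```

Every uncompressed record occupies 40 bytes regardless of content, while the encoded records differ in length because the trailing blanks collapse into a single three-byte token.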

The REUSE Option

Reuse=yes: This option specifies that free space is to be reused. Observations that are added to the SAS data set are inserted wherever enough free space exists, instead of at the end of the SAS data set.

Reuse=no: This is the default option and results in less efficient usage of space if you delete, update or add many observations in a SAS dataset.

(An example would be the Events dataset, which requires run-visit expansion, or Vitals, which requires transposing Vstestcd; basically any dataset whose record count changes extensively during processing.)

It is hence advised to always use the Reuse=yes option whenever we compress datasets.
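
The difference between Reuse=yes and Reuse=no can be pictured with a very simplified model. In this sketch a dataset is a list of slots and `None` marks space freed by a deleted observation. The function name and the slot model are assumptions made for illustration; real SAS tracks free space inside pages of variable-length records.

```python
def append_observations(slots, n_new, reuse):
    """Add n_new observations to a dataset modelled as a list of slots.

    None marks space left behind by a deleted observation.
    reuse=True  : fill freed slots first, then append (like Reuse=yes).
    reuse=False : always append at the end (like Reuse=no).
    """
    slots = list(slots)  # work on a copy so the caller's list is untouched
    for k in range(n_new):
        obs = f"new{k}"
        if reuse and None in slots:
            slots[slots.index(None)] = obs  # drop into the first free slot
        else:
            slots.append(obs)               # grow the dataset
    return slots


# A dataset in which two observations have been deleted.
dataset = ["a", None, "c", None, "e"]

with_reuse = append_observations(dataset, 3, reuse=True)
without_reuse = append_observations(dataset, 3, reuse=False)

print(len(with_reuse), "slots with reuse")
print(len(without_reuse), "slots without reuse")
```

With reuse, two of the three new observations land in the freed slots and the dataset grows by one slot. Without it, the freed slots stay empty and the dataset grows by three.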

Links describing in detail the syntax, functions and documentation of these options have been attached at the end of the presentation for reference.

What does it look like when implemented in code?

Screenshots of Code and Log

Here is an example of how the options statement is to be written.

[Screenshot: options statement]

And here is an example of how the log gives a measure of the extent and effectiveness of the Compress option in your program.

[Screenshot: log output]

Some Figures from Projects and Results of Improved Efficiency

Project Name: XXX1 (Size of Compound: ~8.5 Gb)

Option                        Runtime (hours)   Difference (hours)   Percent reduction vs. old process
Without Compress option       2.553056          -                    -
With Compress+Reuse option    1.432222          1.12083              ~44%
With Compress option only     1.565556          0.9875               ~39%

Project Name: XXX2 (Size of Compound: ~44 Gb)

Option                        Runtime (hours)   Difference (hours)   Percent reduction vs. old process
Without Compress option       10.14             -                    -
With Compress+Reuse option    6.6               3.54                 ~35%
With Compress option only     6.88              3.26                 ~32%

Project Name: XXX3 (Size of Compound: ~44 Gb)

Option                        Runtime (hours)   Difference (hours)   Percent reduction vs. old process
Without Compress option       1.8               -                    -
With Compress+Reuse option    1.1               0.7                  ~40%

Project Name: XXX4 (Size of Compound: ~11.5 Gb)

Option                        Runtime (hours)   Difference (hours)   Percent reduction vs. old process
Without Compress option       4.97              -                    -
With Compress+Reuse option    3.3               1.67                 ~34%
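
The "Difference" and "Percent reduction" figures for the four projects follow from the two runtimes. The short sketch below recomputes them for the Compress+Reuse runs; the function name is an assumption made for illustration. Note that XXX3 works out to about 39%, which the slide rounds to ~40%.

```python
def percent_reduction(old_hours: float, new_hours: float) -> float:
    """Runtime saved, as a percentage of the old (uncompressed) runtime."""
    return (old_hours - new_hours) / old_hours * 100


# (runtime without Compress, runtime with Compress+Reuse), in hours
runs = {
    "XXX1": (2.553056, 1.432222),
    "XXX2": (10.14, 6.6),
    "XXX3": (1.8, 1.1),
    "XXX4": (4.97, 3.3),
}

for project, (old, new) in runs.items():
    saved = old - new
    print(f"{project}: saved {saved:.2f} h, {percent_reduction(old, new):.1f}% reduction")
```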

Do we have Trade-offs?

Yes. When we try to open a compressed dataset, the data explorer is unable to open it and shows blank variables.

But we can work around this by either:

• copying the required dataset into a new dataset, or

• using the option compress=no in the data step.

Recommended Use and Applicability

• It is to be used when dealing with big studies.

• Source programming would benefit most by using it in the single driver program.

• It helps in situations when SDD is very slow.

• Last-minute changes can be implemented faster, so programmers can go home on time.

• If both validation and source programmers are working on last-minute changes near project deadlines, less time is spent waiting for refreshes.

• Fewer complaints and less ill will towards SDD.

Validation Side Advantage

Since validation programs are standalone programs, different for each dataset, validation programmers have the freedom to choose between Compress=char and Compress=binary, whichever suits the dataset best.

When is it advised not to use compress?

• Very small datasets. In the screenshot example, a compressed dataset would be the same as an uncompressed one.

Advantage in its Use

An advantage of specifying compress in the options statement is that the I/O engine automatically decides, dataset by dataset, whether or not to apply compression. Here is an example.
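
The kind of decision described above can be sketched as follows. This is not how the SAS I/O engine is implemented: `zlib` stands in for the real compression routine, and the per-record overhead value and the function name are assumptions made for illustration. The point is only that compression is skipped when its bookkeeping cost would outweigh the bytes saved, which is why very small records gain nothing.

```python
import zlib

OVERHEAD = 24  # hypothetical per-record bookkeeping cost of compression, in bytes


def choose_storage(record: bytes):
    """Return ('compressed', data) only when compression actually saves space."""
    candidate = zlib.compress(record)  # stand-in compressor, not the SAS algorithm
    if len(candidate) + OVERHEAD < len(record):
        return "compressed", candidate
    return "uncompressed", record


tiny_record = b"Y"             # e.g. a single one-character flag
wide_record = b" " * 500       # e.g. a long, mostly blank character field

print(choose_storage(tiny_record)[0])
print(choose_storage(wide_record)[0])
```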

Projects List

Implemented so far in: BIV (LY2963016) in QA, Cialis (LY450190) in QA

Tested in: BIV, Solenza, Cialis, LA294

Hidden as they are projects for a confidential client

THANK YOU