Optimization in SDD using Compress Function
A Presentation on Optimizing the IDB Process
By Navin Kumar
Process-I
Use of the Compress Option
The Compress option can greatly reduce the execution time required for dataset creation, in some cases by up to ~40%. Since IDB deals with huge volumes of data, this option is particularly pertinent to our day-to-day programming.
Why use the compress option and what does it do?
Compressing a file is a process that reduces the number of bytes required to represent each observation. It converts each observation to a variable-length record, whereas in an uncompressed dataset each observation is a fixed-length record. Because the compressed dataset occupies less space, fewer I/O operations are required to read from or write to it during processing.
The Sub-Options with the Compress Option
• Compress=yes or Compress=char: The observations in a SAS data set are compressed by reducing repeated consecutive characters (including blanks) to two-byte or three-byte representations.
• Compress=binary: This is highly effective for compressing medium to large (several hundred bytes or larger) blocks of binary data (numeric variables). Datasets which have a lot of numeric variables (flags), like events, css and the like, are most positively affected.
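As a rough sketch of how the two sub-options might be applied to individual datasets, the data steps below use the COMPRESS= data set option. All dataset and variable names here (comments_raw, flags_raw and so on) are hypothetical and exist only to make the example self-contained.

```sas
/* Hypothetical character-heavy input: long text values padded with blanks */
data work.comments_raw;
    length subjid $10 comment $200;
    do i = 1 to 1000;
        subjid  = cats('S', put(i, z5.));
        comment = 'No findings';
        output;
    end;
    drop i;
run;

/* COMPRESS=CHAR (equivalent to COMPRESS=YES): targets repeated characters and blanks */
data work.comments (compress=char);
    set work.comments_raw;
run;

/* Hypothetical numeric-heavy input: 100 flag variables per observation */
data work.flags_raw;
    array flag{100} flag1-flag100;
    do i = 1 to 1000;
        do j = 1 to 100;
            flag{j} = mod(i + j, 2);
        end;
        output;
    end;
    drop i j;
run;

/* COMPRESS=BINARY: targets observations dominated by numeric variables */
data work.flags (compress=binary);
    set work.flags_raw;
run;
```

Which sub-option pays off more depends on the data, so it is worth comparing the log notes for both before settling on one.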
The REUSE Option
• Reuse=yes: This option specifies that space is to be reused: observations that are added to the SAS data set are inserted wherever enough free space exists, instead of at the end of the SAS data set.
• Reuse=no: This is the default option and results in less efficient usage of space if you delete, update or add many observations in a SAS dataset. (An example would be the dataset Events, which requires run-visit expansion, or Vitals, which requires transposing Vstestcd; basically any dataset which undergoes extensive record count change during its processing.)
It is hence advised to always use the Reuse=yes option whenever we compress datasets.
Links describing in detail the syntax, functions and documentation of these options have been attached at the end of the presentation for reference.
How does it look when implemented in code?
Screenshots of Code and Log
Here is an example of how the options statement is to be written.
And here is an example of how the log will give a measure of the extent and effectiveness of the Compress option in your program.
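A minimal sketch of such an options statement, assuming it is placed at the top of the driver program so that it applies to every dataset created afterwards. The dataset work.demo and its variable are hypothetical, added only so that there is something to compress.

```sas
/* Turn on compression and space reuse for all datasets created from here on */
options compress=yes reuse=yes;

/* Hypothetical dataset with long, repetitive character values */
data work.demo;
    length text $200;
    do i = 1 to 10000;
        text = 'repeated value';
        output;
    end;
    drop i;
run;
```

After the data step runs, the SAS log should carry a note stating how much compression changed the size of work.demo; that note is the measure of effectiveness referred to above.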
Some figures in Projects and Results of improved Efficiency
Project Name: XXX1 (Size of Compound: ~8.5 Gb)
Process                      Runtime (hours)   Difference (hours)   Percent reduction compared to old process
Without Compress option      2.553056          -                    -
With Compress+Reuse option   1.432222          1.12083              ~44%
With Compress option only    1.565556          0.9875               ~39%
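The Difference and Percent Reduction figures follow from the runtimes by simple arithmetic. As an illustrative check (the variable names are made up for this sketch), the XXX1 figures can be recomputed in a data step:

```sas
/* Recompute the XXX1 difference and percent reduction from the runtimes (hours) */
data _null_;
    without_compress    = 2.553056;
    with_compress_reuse = 1.432222;
    with_compress_only  = 1.565556;

    /* Hours saved relative to the uncompressed run */
    diff_reuse = without_compress - with_compress_reuse;
    diff_only  = without_compress - with_compress_only;

    /* Saving expressed as a percentage of the uncompressed runtime */
    pct_reuse = 100 * diff_reuse / without_compress;
    pct_only  = 100 * diff_only  / without_compress;

    /* Write the results to the log */
    put diff_reuse= pct_reuse=;
    put diff_only=  pct_only=;
run;
```

For the Compress+Reuse run this is 1.120834 / 2.553056, or about 43.9%, which rounds to the ~44% quoted.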
Project Name: XXX2 (Size of Compound: ~44 Gb)
Process                      Runtime (hours)   Difference (hours)   Percent reduction compared to old process
Without Compress option      10.14             -                    -
With Compress+Reuse option   6.6               3.54                 ~35%
With Compress option only    6.88              3.26                 ~32%
Project Name: XXX3 (Size of Compound: ~44 Gb)
Process                      Runtime (hours)   Difference (hours)   Percent reduction compared to old process
Without Compress option      1.8               -                    -
With Compress+Reuse option   1.1               0.7                  ~40%
Project Name: XXX4 (Size of Compound: ~11.5 Gb)
Process                      Runtime (hours)   Difference (hours)   Percent reduction compared to old process
Without Compress option      4.97              -                    -
With Compress+Reuse option   3.3               1.67                 ~34%
Do we have Trade-offs?
Yes. When we try to open a compressed dataset, the data explorer is unable to open it and shows blank variables. But we can work around it by either:
• setting the required dataset into a new dataset, or
• using the option compress=no in the data step.
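One way the workaround could look, as a sketch only: the compressed dataset is copied into a new dataset, and the COMPRESS=NO data set option overrides any global compress setting for that copy. The names work.events and work.events_view are hypothetical.

```sas
/* Hypothetical compressed dataset standing in for a real one */
data work.events (compress=yes);
    length evterm $100;
    do i = 1 to 100;
        evterm = 'Headache';
        output;
    end;
    drop i;
run;

/* Copy it into a new, uncompressed dataset that can be opened for browsing */
data work.events_view (compress=no);
    set work.events;
run;
```

The copy is meant for viewing only; the compressed original remains the dataset used in processing.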
Recommended Use and Applicability
• It is to be used when dealing with big studies.
• Source programming would benefit most by using it in the single driver program.
• It would help in situations when SDD is very slow.
• Last-minute changes could be implemented faster, so programmers can go home on time.
• If both validation and source programmers are working on last-minute changes near project deadlines, less time will be spent waiting for refreshes.
• Fewer complaints and less ill will against SDD.
Validation Side Advantage
Since validation programs are stand-alone programs, different for each dataset, validation programmers have the freedom to choose between Compress=char and Compress=binary, whichever suits the dataset best.
When is it advised not to use compress?
• Very small datasets. An example is the screenshot below: here, a compressed dataset would be the same size as an uncompressed one.
Advantage in its Use
An advantage of setting compress in the options statement is that the I/O engine automatically decides when to use compression and when not to. Here is an example.
Projects List
Implemented so far in: BIV (LY2963016) in QA, Cialis (LY450190) in QA
Tested in: BIV, Solenza, Cialis, LA294
Hidden as they are projects for a confidential client
Links For Reference
LINK 1
LINK 2
LINK 3
THANK YOU