Copyright © 2006, SAS Institute Inc. All rights reserved. Think FAST! Use Memory Tables (Hashing) for Faster Merging Gregg P. Snell Data Savant Consulting

Page 1:

Think FAST! Use Memory Tables (Hashing) for Faster Merging
Gregg P. Snell
Data Savant Consulting

Page 2:

Introduction

Merging
  Something we all do
  The “need for speed” increases as
    Files get larger
    Process is repeated (daily, weekly, etc.)
  I/O is the speed killer
    Basic match-merge (requires sorting)
    Indexes are usually faster
    But hashing is fastest!
      No need to sort any file
      Single pass of each file, then memory I/O

Page 3:

Introduction

If hashing is fastest, why are you not using it?
  Dot-notation syntax is weird
  “Object” is ambiguous and hard to understand

Think: hash object = memory table
  Familiar terms
  Side-by-side code comparisons

Page 4:

Topics to be Covered

Hashing – Defined

Quick Review
  Sequential/Direct Access
  Implicit/Explicit Looping
  Indexes

Hash Object = Memory Table

Compare Merge Methods

Limitations and Overcoming Them

Page 5:

Hashing – Defined

Hashing is the process of converting a long-range key (numeric or character) to a smaller-range integer number with a mathematical algorithm or function

+

Key-indexing - the concept of using the value of a table’s key variable as the index into that table
  Think of arrays: client(66216)="POTENTIAL CLIENT";

Introduced to the SAS world by Paul Dorfman at SUGI 25
  “Private Detectives In A Data Warehouse: Key-Indexing, Bitmapping, And Hashing”
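
The two ideas above can be sketched in a few lines. This is only an illustration, not Dorfman's implementation: the array bounds, the sample keys, the slot count of 7, and the use of mod() as the hash function are all assumptions made for the example, and collisions (two keys landing in the same slot) are not handled.

```sas
data _null_;
  /* key-indexing: the key value itself is the array subscript */
  array client(66210:66220) $16 _temporary_;
  client(66216) = "POTENTIAL CLIENT";
  status = client(66216);
  put status=;

  /* hashing: fold a long-range key into a small range of slots */
  do key = 66216, 123456789, 987654321;
    slot = mod(key, 7) + 1;   /* slot always falls between 1 and 7 */
    put key= slot=;
  end;
run;
```

Key-indexing needs one array element for every possible key value, which is why a hash function is used to shrink the range when keys are large or sparse.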

Page 6:

Hashing – Defined

Incorporated into the DATA step with v9

Has two predefined component objects:
  hash object
  hash iterator

These “objects” provide a quick and efficient method to store, search, and retrieve data based on lookup keys
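
The rest of the deck concentrates on the hash object, so here is a minimal sketch of the second component object, the hash iterator. It assumes sashelp.class is available and loads it with the dataset: argument tag instead of an explicit fill loop; the object names h and it are arbitrary.

```sas
data _null_;
  /* never executes, but defines name and age in the program data vector */
  if 0 then set sashelp.class (keep=name age);

  /* hash object loaded from a data set, kept in ascending key order */
  declare hash h (dataset: "sashelp.class", ordered: "a");
  rc = h.DefineKey ("name");
  rc = h.DefineData ("name", "age");
  rc = h.DefineDone ();

  /* hash iterator: walk the memory table from first item to last */
  declare hiter it ("h");
  rc = it.first ();
  do while (rc = 0);
    put name= age=;
    rc = it.next ();
  end;
  stop;
run;
```

The iterator is what you would reach for when you need every row of the memory table rather than a lookup by key.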

Page 7:

Quick Review
  Sequential/Direct Access
  Implicit/Explicit Looping
  Indexes

Page 8:

Quick Review

Sequential/Direct Access

Page 9:

Quick Review – Sequential Access

Sequential access – read one after another
  Top to bottom
  SAS is smart enough to know when the end-of-file (EOF) has been encountered and stops reading

Let’s look at an example…

/* sequential access */
data work.sequential;
  set sashelp.class;
  put _all_;
  output;
run;

Name=Alfred Sex=M Age=14 Height=69 Weight=112.5 _ERROR_=0 _N_=1
Name=Alice Sex=F Age=13 Height=56.5 Weight=84 _ERROR_=0 _N_=2
Name=Barbara Sex=F Age=13 Height=65.3 Weight=98 _ERROR_=0 _N_=3
Name=Carol Sex=F Age=14 Height=62.8 Weight=102.5 _ERROR_=0 _N_=4
Name=Henry Sex=M Age=14 Height=63.5 Weight=102.5 _ERROR_=0 _N_=5
Name=James Sex=M Age=12 Height=57.3 Weight=83 _ERROR_=0 _N_=6
<snip>

Page 11:

Quick Review – Direct Access

Direct access – read specific records
  You must specify which row(s) to read
  SAS has no way of knowing when you want to stop (so tell it)

Let’s look at an example…

/* direct access */
data work.direct;
  do i=2, 3, 9;
    set sashelp.class point=i;
    put _all_;
    output;
  end;
  stop;
run;

i=2 Name=Alice Sex=F Age=13 Height=56.5 Weight=84 _ERROR_=0 _N_=1
i=3 Name=Barbara Sex=F Age=13 Height=65.3 Weight=98 _ERROR_=0 _N_=1
i=9 Name=Jeffrey Sex=M Age=13 Height=62.5 Weight=84 _ERROR_=0 _N_=1

Page 14:

Quick Review

Implicit/Explicit Looping

Page 15:

Quick Review – Implicit/Explicit Looping

Implicit looping
  By default, DATA steps execute an implicit loop

Explicit looping
  You specify what, when, and how long to loop

Page 16:

Quick Review – Implicit/Explicit Looping

Let’s look at an example
  Side by side with sequential access

/* implicit looping */
data work.sequential;
  set sashelp.class;
  put _all_;
  output;
run;

/* explicit looping */
data work.sequential;
  do until (eof);
    set sashelp.class end=eof;
    put _all_;
    output;
  end;
run;

Page 17:

Quick Review – Implicit/Explicit Looping

Explicit looping is usually utilized with direct access

Used in the previous direct access example…

/* direct access */
data work.direct;
  do i=2, 3, 9;
    set sashelp.class point=i;
    put _all_;
    output;
  end;
  stop;
run;

i=2 Name=Alice Sex=F Age=13 Height=56.5 Weight=84 _ERROR_=0 _N_=1
i=3 Name=Barbara Sex=F Age=13 Height=65.3 Weight=98 _ERROR_=0 _N_=1
i=9 Name=Jeffrey Sex=M Age=13 Height=62.5 Weight=84 _ERROR_=0 _N_=1

Page 20:

Quick Review

Indexes

Page 21:

Quick Review – Indexes

Indexes

An optional file created for a SAS dataset

Provides direct access to specific records based on key values

Key values stored in ascending order

Includes pointers to corresponding records

Page 22:

Quick Review – Indexes

Let’s simulate an index

/* simulate an index on variable: age */
data work.class_index;
  set sashelp.class;
  row_id=_n_;
  keep age row_id;
run;

proc sort data=work.class_index;
  by age row_id;
run;

/* collapse to one row per age, with a comma-separated list of row ids (RIDs) */
data work.class_index;
  keep age rid;
  retain age rid;
  length rid $20;
  set work.class_index;
  by age;
  if first.age then rid = trim(put(row_id,best.-L));
  else rid = trim(rid) || ',' || trim(put(row_id,best.-L));
  if last.age then output;
run;

Page 24:

Quick Review – Indexes

Let’s use direct access with a real index

/* create work.class with index */
data work.class (index=(age));
  set sashelp.class;
run;

/* direct access with rows */
data work.direct;
  do i=2, 3, 9;
    set sashelp.class point=i;
    put _all_;
    output;
  end;
  stop;
run;

/* direct access with values */
data work.direct;
  set work.class;
  where age=13;
  put _all_;
  output;
run;

i=2 Name=Alice Sex=F Age=13 Height=56.5 Weight=84 _ERROR_=0 _N_=1
i=3 Name=Barbara Sex=F Age=13 Height=65.3 Weight=98 _ERROR_=0 _N_=1
i=9 Name=Jeffrey Sex=M Age=13 Height=62.5 Weight=84 _ERROR_=0 _N_=1

Page 25:

Quick Review – Indexes

And now with explicit looping

/* direct access */
data work.direct;
  do age=13,14;
    do until (eof);
      set class key=age end=eof;
      if _IORC_=0 then do; /* 0 indicates a match was found */
        put _all_;
        output;
      end;
      /* if no match, reset the error flag and continue */
      else _ERROR_=0;
    end;
  end;
  stop;
run;

age=13 eof=0 Name=Alice Sex=F Height=56.5 Weight=84 _ERROR_=0 _IORC_=0 _N_=1
age=13 eof=0 Name=Barbara Sex=F Height=65.3 Weight=98 _ERROR_=0 _IORC_=0 _N_=1
...
age=14 eof=0 Name=Henry Sex=M Height=63.5 Weight=102.5 _ERROR_=0 _IORC_=0 _N_=1
age=14 eof=0 Name=Judy Sex=F Height=64.3 Weight=90 _ERROR_=0 _IORC_=0 _N_=1

Page 26:

Quick Review – Indexes

And now data driven

/* direct access */
data work.driver;
  age=13; output;
  age=14; output;
run;

data work.direct;
  set work.driver;                      /* <- sequential access & implicit loop */
  do until (eof);                       /* <- explicit loop */
    set work.class key=age end=eof;     /* <- direct access */
    if _IORC_=0 then do;
      put _all_;
      output;
    end;
    else _ERROR_=0;
  end;
run;

Page 29:

Hashing Object = Memory Table

Think of a traditional row/column table

Create it

Define it

Fill it

Access it

Page 30:

Hashing Object = Memory Table

Create it
  Dynamic run-time memory table
  It does not exist until you create it
  It can also be dynamically deleted!

Page 31:

Hashing Object = Memory Table

Create it

data <somename>;
  ...
  /* Create it */
  declare hash h_small ();
  ...
run;

Created a dynamic run-time memory table

h_small, NOT work.h_small, just h_small

No structure (variables/index)

No content (rows).

Page 32:

Hashing Object = Memory Table

Define it

data <somename>;
  ...
  /* Define it */
  length keyvar smallvar1-smallvar2 8 newvar $12;
  rc = h_small.DefineKey ("keyvar");
  rc = h_small.DefineData ("smallvar1", "smallvar2", "newvar");
  rc = h_small.DefineDone ();
  ...
run;

Create an index variable called keyvar

Create three other variables

Notice length is declared before using them

Stop defining
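
An earlier slide notes that a memory table can also be dynamically deleted. A minimal sketch of that life cycle follows, assuming a one-variable table; the variable items and the single row added are purely illustrative, and num_items is used only to show that the table held something before it was deleted.

```sas
data _null_;
  length keyvar 8;

  /* create it and define it */
  declare hash h_small ();
  rc = h_small.DefineKey ("keyvar");
  rc = h_small.DefineData ("keyvar");
  rc = h_small.DefineDone ();

  /* add one illustrative row, then report how many rows are in memory */
  keyvar = 1;
  rc = h_small.add ();
  items = h_small.num_items;
  put items=;

  /* delete it: the memory table is gone; any later h_small call is an error */
  rc = h_small.delete ();
run;
```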

Page 33:

Hashing Object = Memory Table

Fill it

data <somename>;
  ...
  /* Fill it */
  do until ( eof_small );
    set work.small (keep=keyvar smallvar1-smallvar2) end = eof_small;
    newvar = "any text";
    rc = h_small.add ();
  end;
  ...
run;

Use explicit looping

Retrieve values from another table

Assign variables values any way you want

Fill it

Page 34:

Hashing Object = Memory Table

Access it

data <somename>;
  ...
  /* Access it */
  do until ( eof_big );
    set work.big end = eof_big;
    /* reset the lookup variables so a failed find() leaves them missing */
    smallvar1=.;
    smallvar2=.;
    newvar=" ";
    rc = h_small.find ();
    output;
  end;
  ...
run;

Use explicit looping

Load keyvar with a value

Access the memory table by keyvar

Variables assigned only if a match is found

Page 35:

Hashing Object = Memory Table

  ...
  /* Create it */
  declare hash h_small ();

  /* Define it */
  length keyvar smallvar1-smallvar2 8 newvar $12;
  rc = h_small.DefineKey ("keyvar");
  rc = h_small.DefineData ("smallvar1", "smallvar2", "newvar");
  rc = h_small.DefineDone ();

  /* Fill it */
  do until ( eof_small );
    set work.small (keep=keyvar smallvar1-smallvar2) end = eof_small;
    newvar = "any text";
    rc = h_small.add ();
  end;

  /* Access it */
  do until ( eof_big );
    set work.big end = eof_big;
    smallvar1=.;
    smallvar2=.;
    newvar=" ";
    rc = h_small.find ();
    output;
  end;
  ...
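
To try the four steps end to end, the sketch below wraps them in a complete DATA step. The two input tables are hypothetical three-row stand-ins for work.small and work.big, the output name work.merged and the variable bigvar are made up for the example, and call missing() is used as a shorthand for the three reset assignments.

```sas
/* hypothetical sample tables, small enough to check by eye */
data work.small;
  input keyvar smallvar1 smallvar2;
  datalines;
1 10 100
3 30 300
;
run;

data work.big;
  input keyvar bigvar;
  datalines;
1 11
2 22
3 33
;
run;

data work.merged (drop=rc);
  /* Create it and define it */
  length keyvar smallvar1-smallvar2 8 newvar $12;
  declare hash h_small ();
  rc = h_small.DefineKey ("keyvar");
  rc = h_small.DefineData ("smallvar1", "smallvar2", "newvar");
  rc = h_small.DefineDone ();

  /* Fill it */
  do until (eof_small);
    set work.small end=eof_small;
    newvar = "any text";
    rc = h_small.add ();
  end;

  /* Access it */
  do until (eof_big);
    set work.big end=eof_big;
    call missing (smallvar1, smallvar2, newvar);
    rc = h_small.find ();
    output;
  end;
  stop;
run;
```

The row with keyvar=2 has no match in work.small, so its smallvar1, smallvar2, and newvar stay missing, which is the same result a match-merge with "if a" would give.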

Page 36:

Comparing Merge Methods

12 ways to do anything with SAS

Limit of two garden-variety merge techniques
  Match merging
  Merging with indexes

Pentium 4, 2.4GHz, 1.25GB ram, 50GB disk

XP Pro SP2 with SAS 9.1 (TS1M3)

Each run was executed from a new session

Page 37:

Comparing Merge Methods

Create sample tables (lifted from SAS-L)

%let large_obs = 500000;

data work.small ( keep = keyvar small: )
     work.large ( keep = keyvar large: );
  array keys(1:500000) $1 _temporary_;
  length keyvar 8;
  array smallvar [20];  retain smallvar 12;
  array largevar [682]; retain largevar 55;
  do _i_ = 1 to &large_obs ;
    keyvar = ceil (ranuni(1) * &large_obs);
    /* output each key only once, so keyvar is unique in both tables */
    if keys(keyvar) = ' ' then do;
      output large;
      if ranuni(1) < 1/5 then output small;
      keys(keyvar) = 'X';
    end;
  end;
run;

NOTE: The data set WORK.SMALL has 63406 observations and 21 variables.
NOTE: The data set WORK.LARGE has 315975 observations and 683 variables.

Page 40:

Match Merging

Requires sorting

Used in 80% of all code (not statistically verified)

Page 41:

Match Merging

/* basic match-merge with sort */
proc sort data=work.small;
  by keyvar;
run;

NOTE: There were 63406 observations read from the data set WORK.SMALL.
NOTE: The data set WORK.SMALL has 63406 observations and 21 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           2.00 seconds
      cpu time            0.23 seconds

proc sort data=work.large;
  by keyvar;
run;

NOTE: There were 315975 observations read from the data set WORK.LARGE.
NOTE: The data set WORK.LARGE has 315975 observations and 683 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           11:59.46
      cpu time            51.75 seconds

12 minutes for sorting

Page 42:

Match Merging

/* basic match-merge with sort */
data work.match_merge;
  merge work.large (in=a)
        work.small (in=b);
  by keyvar;
  if a;
run;

NOTE: There were 315975 observations read from the data set WORK.LARGE.
NOTE: There were 63406 observations read from the data set WORK.SMALL.
NOTE: The data set WORK.MATCH_MERGE has 315975 obs and 703 variables.
NOTE: DATA statement used (Total process time):
      real time           8:39.31
      cpu time            20.14 seconds

8.5 minutes to merge

Page 43:

Match Merging

12 minute sort + 8:39 merge = 20.5 minutes

I/O was the real speed killer (sorting both files)

Page 44:

Merge with an Index

An index can eliminate the need for sorting

Usually speeds things up

Page 45:

Merge with an Index

options msglevel=i;

/* creating indexes */
proc datasets lib=work nolist;
  modify small;
  index create keyvar;
  modify large;
  index create keyvar;
quit;

INFO: Multiple concurrent threads will be used to create the index.
NOTE: Simple index keyvar has been defined.
NOTE: MODIFY was successful for WORK.SMALL.DATA.
INFO: Multiple concurrent threads will be used to create the index.
NOTE: Simple index keyvar has been defined.
NOTE: MODIFY was successful for WORK.LARGE.DATA.
NOTE: PROCEDURE DATASETS used (Total process time):
      real time           59.46 seconds
      cpu time            6.40 seconds

1 minute for indexing

Page 46:

Merge with an Index

/* merge with indexes (no sorting) */
data work.match_merge_index;
  merge work.large (in=a)
        work.small (in=b);
  by keyvar;
  if a;
run;

INFO: Index keyvar selected for BY clause processing.
INFO: Index keyvar selected for BY clause processing.
NOTE: There were 315975 observations read from the data set WORK.LARGE.
NOTE: There were 63406 observations read from the data set WORK.SMALL.
NOTE: The data set WORK.MATCH_MERGE_INDEX has 315975 observations and 703 variables.
NOTE: DATA statement used (Total process time):
      real time           1:21:18.98
      cpu time            1:03.39

1 hour 21 minutes for merging

Page 47:

Merge with an Index

1 minute index + 81 minute merge = 82 minutes

Indexes usually speed things up
  Generally not when accessing every record
  For every record being read, from each table
    Read from index to get RIDs for each value
    Then read each record by RID
  Essentially doubled the I/O

What if work.large were already sorted?
  Only work.small would need the index

Page 48:

Merge with an Index (large pre-sorted)

/* creating indexes */
proc datasets lib=work nolist;
  modify small;
  index create keyvar;
quit;

INFO: Multiple concurrent threads will be used to create the index.
NOTE: Simple index keyvar has been defined.
NOTE: MODIFY was successful for WORK.SMALL.DATA.
NOTE: PROCEDURE DATASETS used (Total process time):
      real time           2.29 seconds
      cpu time            0.22 seconds

2 seconds for indexing

Page 49:

Merge with an Index (large pre-sorted)

/* merge with index on small (large is already sorted) */
data work.match_merge_index;
  merge work.large (in=a)
        work.small (in=b);
  by keyvar;
  if a;
run;

INFO: Index keyvar selected for BY clause processing.
NOTE: There were 315975 observations read from the data set WORK.LARGE.
NOTE: There were 63406 observations read from the data set WORK.SMALL.
NOTE: The data set WORK.MATCH_MERGE_INDEX has 315975 obs and 703 vars.
NOTE: DATA statement used (Total process time):
      real time           7:46.57
      cpu time            24.20 seconds

7.75 minutes for merging

Page 50:

Merge with an Index (large pre-sorted)

2 second index + 7.75 minute merge = 7.8 minutes
  Eliminated I/O thrashing on work.large

Page 51:

Memory Table Merge

/* merge with memory table (no sorting or indexing required!) */
data work.hash_merge (drop=rc i);

  /* Create it */
  declare hash h_small ();

  /* Define it */
  length keyvar smallvar1-smallvar20 8;
  array smallvar(20);
  rc = h_small.DefineKey ("keyvar");
  rc = h_small.DefineData ("smallvar1","smallvar2","smallvar3",
                           "smallvar4","smallvar5","smallvar6",
                           "smallvar7","smallvar8","smallvar9",
                           "smallvar10","smallvar11","smallvar12",
                           "smallvar13","smallvar14","smallvar15",
                           "smallvar16","smallvar17","smallvar18",
                           "smallvar19","smallvar20");
  rc = h_small.DefineDone ();
  ...

Page 52:

Memory Table Merge

  /* Fill it */
  do until ( eof_small );
    set work.small end = eof_small;
    rc = h_small.add ();
  end;

  /* Merge it */
  do until ( eof_large );
    set work.large end = eof_large;
    /* this loop initializes variables before merging from h_small */
    do i=lbound(smallvar) to hbound(smallvar);
      smallvar(i) = .;
    end;
    rc = h_small.find ();
    output;
  end;

run;

Page 53:

Memory Table Merge

NOTE: There were 63406 observations read from the data set WORK.SMALL.
NOTE: There were 315975 observations read from the data set WORK.LARGE.
NOTE: The data set WORK.HASH_MERGE has 315975 obs and 703 variables.
NOTE: DATA statement used (Total process time):
      real time           7:17.23
      cpu time            16.59 seconds

7.3 minutes for merging

Page 54:

Comparing Merge Methods

Merge results:

Match merge w/sorting = 20.5 minutes

Index merge w/o sorting = 82 minutes

Index merge w/pre-sorting = 7.8 minutes

Memory table merge = 7.3 minutes

Page 55:

Stacking the Odds?

Did I select tables that favor hashing?

NO! And I will prove it!

Rerun the two fastest merges in reverse order
  Merge small onto large

Page 56:

Merge with an Index (large pre-sorted)

/* merge with index on small (large is already sorted) */
data work.match_merge_index;
  merge work.small (in=a)
        work.large (in=b keep=keyvar largevar1-largevar20);
  by keyvar;
  if a;
run;

INFO: Index keyvar selected for BY clause processing.
NOTE: There were 63406 observations read from the data set WORK.SMALL.
NOTE: There were 315975 observations read from the data set WORK.LARGE.
NOTE: The data set WORK.MATCH_MERGE_INDEX has 63406 obs and 41 vars.
NOTE: DATA statement used (Total process time):
      real time           2:08.84
      cpu time            7.11 seconds

2 second index + 2:08 merge = 2.2 minutes

Page 57:

Memory Table Merge

/* merge with memory table (no sorting or indexing required!) */
data work.hash_merge (drop=rc i);

  /* Create it */
  declare hash h_large ();

  /* Define it */
  length keyvar largevar1-largevar20 8;
  array largevar(20);
  rc = h_large.DefineKey ( "keyvar" );
  rc = h_large.DefineData ( "largevar1","largevar2","largevar3",
                            "largevar4","largevar5","largevar6",
                            "largevar7","largevar8","largevar9",
                            "largevar10","largevar11","largevar12",
                            "largevar13","largevar14","largevar15",
                            "largevar16","largevar17","largevar18",
                            "largevar19","largevar20" );
  rc = h_large.DefineDone ();
  ...

Page 58:

Memory Table Merge

  /* Fill it */
  do until ( eof_large );
    set work.large(keep=keyvar largevar1-largevar20) end = eof_large;
    rc = h_large.add ();
  end;

  /* Merge it */
  do until ( eof_small );
    set work.small end = eof_small;
    /* initialize the lookup variables before merging from h_large */
    do i=lbound(largevar) to hbound(largevar);
      largevar(i) = .;
    end;
    rc = h_large.find ();
    output;
  end;

run;

Page 59:

Memory Table Merge

NOTE: There were 315975 observations read from the data set WORK.LARGE.
NOTE: There were 63406 observations read from the data set WORK.SMALL.
NOTE: The data set WORK.HASH_MERGE has 63406 observations and 41 variables.
NOTE: DATA statement used (Total process time):
      real time           1:19.46
      cpu time            6.43 seconds

1:19 merge = 1.3 minutes

Page 60:

Comparing Merge Methods

Merge results:

Match merge w/sorting = 20.5 minutes

Index merge w/o sorting = 82 minutes

Index merge w/pre-sorting = 7.8 minutes

Memory table merge = 7.3 minutes

And when reversing the order of the tables:

Index merge w/pre-sorting = 2.2 minutes

Memory table merge = 1.3 minutes

Page 61:

Limitations and Overcoming Them

Limitation #1

Hash tables are not persisted across DATA steps
  Automatically deleted at the end of the step
  Once deleted, memory is immediately freed up

Overcome it

Multiple merges within a single DATA step, or

Merge multiple tables at once on different keys
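
Merging several tables at once on different keys can be sketched with two memory tables in one DATA step. Everything here is hypothetical and only for illustration: the tables work.regions, work.products, and work.sales, their variables, and the use of the dataset: argument tag in place of explicit fill loops.

```sas
/* hypothetical lookup tables and a fact table that carries both keys */
data work.regions;
  input region_id region_name $;
  datalines;
1 East
2 West
;
run;

data work.products;
  input product_id product_name $;
  datalines;
10 Widget
20 Gadget
;
run;

data work.sales;
  input region_id product_id amount;
  datalines;
1 10 5
2 20 7
1 20 9
;
run;

data work.sales_named (drop=rc);
  /* never executes, but defines the lookup variables in the PDV */
  if 0 then set work.regions work.products;

  /* first memory table, keyed on region_id */
  declare hash h_region (dataset: "work.regions");
  rc = h_region.DefineKey ("region_id");
  rc = h_region.DefineData ("region_name");
  rc = h_region.DefineDone ();

  /* second memory table, keyed on product_id */
  declare hash h_product (dataset: "work.products");
  rc = h_product.DefineKey ("product_id");
  rc = h_product.DefineData ("product_name");
  rc = h_product.DefineDone ();

  /* one pass of the fact table, two lookups per row */
  do until (eof);
    set work.sales end=eof;
    call missing (region_name, product_name);
    rc = h_region.find ();
    rc = h_product.find ();
    output;
  end;
  stop;
run;
```

With match-merging, the same result would take a sort and a merge per key; here work.sales is read once and never sorted.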

Page 62:

Limitations and Overcoming Them

Limitation #2

Hash tables are limited by available memory
  Estimate as variables*length*records
  To include all 682 variables from work.large: (682+1)*8*315975 bytes, or about 1.7 GB

Overcome it

Increase available memory with -memsize

Reduce variable lengths and/or concatenate
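
The estimate above is easy to reproduce before committing to a hash merge. A minimal sketch, using the figures from this slide; the variable names are made up, and the result is only a rough lower bound because whatever memory the hash object needs beyond the raw data is not counted.

```sas
data _null_;
  n_vars        = 682 + 1;   /* 682 largevar columns plus keyvar */
  bytes_per_var = 8;         /* every variable here is an 8-byte numeric */
  n_records     = 315975;    /* rows in work.large */

  est_bytes = n_vars * bytes_per_var * n_records;
  est_gb    = est_bytes / 1e9;

  put est_bytes= comma20. est_gb= 8.2;
run;
```

Swapping in 20 + 1 variables, as in the reversed merge earlier, shows why keeping only the needed columns matters: the same formula drops to roughly 53 MB.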

Page 64:

Limitations and Overcoming Them

Limitation #3

Key values must be distinct – no duplicates!

Should not be the many in a many-one merge

Overcome it

Add variables until the key is unique

Create a sequence variable to make it unique
  For example…

Page 65:

Limitations and Overcoming Them

Create a sequence variable to make it unique

  rc = h_large.DefineKey ( "keyvar","keyseq" );
  rc = h_large.DefineData ( "largevar1","largevar2", … );
  rc = h_large.DefineDone ();

  /* Fill it */
  maxkeyseq=0;
  do until ( eof_large );
    set work.large(keep=keyvar) end = eof_large;
    by keyvar;
    /* number the rows within each keyvar value: 1, 2, 3, ... */
    if first.keyvar then keyseq=0;
    keyseq+1;
    rc = h_large.add ();
    /* remember the largest sequence number seen for any key */
    if last.keyvar then maxkeyseq=max(maxkeyseq,keyseq);
  end;

Page 66:

Limitations and Overcoming Them

Create a sequence variable to make it unique

  /* Merge it */
  do until ( eof_small );
    set work.small end = eof_small;
    /* try every possible sequence number for this keyvar */
    do keyseq=1 to maxkeyseq;
      do i=lbound(largevar) to hbound(largevar);
        largevar(i) = .;
      end;
      rc = h_large.find ();
      output;
    end;
    drop maxkeyseq;
  end;
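
The sequence-variable technique can be tried on a toy case. In this sketch work.large and work.small are hypothetical three-row and two-row tables, work.large is assumed to be already sorted by keyvar (the BY statement requires it), and only one data variable is carried. Unlike the slide code, which outputs a row for every sequence number, this variant writes a row only when find() succeeds; that change is an assumption made to keep the toy output free of empty rows.

```sas
/* hypothetical tables: keyvar repeats in work.large, which is sorted by keyvar */
data work.large;
  input keyvar largevar1;
  datalines;
1 11
1 12
2 21
;
run;

data work.small;
  input keyvar smallvar1;
  datalines;
1 100
2 200
;
run;

data work.hash_merge (drop=rc maxkeyseq);
  length keyvar keyseq largevar1 8;
  declare hash h_large ();
  rc = h_large.DefineKey ("keyvar", "keyseq");   /* composite key is unique */
  rc = h_large.DefineData ("largevar1");
  rc = h_large.DefineDone ();

  /* Fill it: number the rows within each keyvar */
  maxkeyseq = 0;
  do until (eof_large);
    set work.large end=eof_large;
    by keyvar;
    if first.keyvar then keyseq = 0;
    keyseq + 1;
    rc = h_large.add ();
    if last.keyvar then maxkeyseq = max(maxkeyseq, keyseq);
  end;

  /* Merge it: probe each sequence number for the current keyvar */
  do until (eof_small);
    set work.small end=eof_small;
    do keyseq = 1 to maxkeyseq;
      largevar1 = .;
      rc = h_large.find ();
      if rc = 0 then output;   /* assumption: skip sequence numbers with no row */
    end;
  end;
  stop;
run;
```

Each work.small row is written once per matching work.large row, so keyvar=1 produces two output rows and keyvar=2 produces one.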

Page 67:

Conclusion

Merge results:

Match merge w/sorting = 20.5 minutes

Index merge w/o sorting = 82 minutes

Index merge w/pre-sorting = 7.8 minutes

Memory table merge = 7.3 minutes

And when reversing the order of the tables:

Index merge w/pre-sorting = 2.2 minutes

Memory table merge = 1.3 minutes

Page 68:

Acknowledgements
  The Hash Man – Paul Dorfman
  Richard DeVenezia (www.DeVenezia.com)

Recommended reading
  Hash tip sheet
  support.sas.com/rnd/base/topics/datastep/dot/hash-tip-sheet.pdf
