Page 1:

PhUSE 2011: Brighton

TS09: Rectifying Irregular Text Data

A Case for Using Regular Expressions in SAS

Jayshree Garade, Manjusha Gode

Page 2:

Outline

• Problems

• Solutions & Introducing Regular Expressions

• Advantages over SAS String Functions

• Points to note while using Regular Expressions

• References

Page 4:

Problem: Physical abnormalities

SUBJID   TRT   ABNORMALITY
01-011   B     anemia
01-036   D     anaemia
01-026   C     anemea
01-014   B     anemic

Page 5:

Problem: Time point variable …

USUBJID  VISIT  VSDT       PRSDTLTM           VNTR_RT  VNTRTUN
1        1      17-Oct-08  Per 1 D01 Predose  47       /min
1        2      3-Nov-08   Per 1 D01 .5 hr    58       /min
1        2      3-Nov-08   Per 1 D 01 01 hr   51       /min
1        2      3-Nov-08   Per 1d01 02hr      49       /min
1        3      4-Nov-08   day1               53       /min
1        90     3-Feb-09   Poststudy          56       /min

Page 7:

…Problem: Time point variable

USUBJID  VISIT  VSDT       PRSDTLTM           VNTR_RT  VNTRTUN  Time_desc
1        1      17-Oct-08  Per 1 D01 Predose  47       /min     Predose
1        2      3-Nov-08   Per 1 D01 .5 hr    58       /min     Day 1, 0.5 Hour
1        2      3-Nov-08   Per 1 D 01 01 hr   51       /min     Day 1, 1 Hour
1        2      3-Nov-08   Per 1d01 02hr      49       /min     Day 1, 2 Hours
1        3      4-Nov-08   day1               53       /min     Day 1
1        90     3-Feb-09   Poststudy          56       /min     Poststudy

Page 8:

…Problem: Time point variable

PRSDTLTM   Time_desc
D01        Day 1
D 01       Day 1
d01        Day 1
day1       Day 1

Page 10:

…Ways to approach the problem

• Traditional: using SAS string functions

INDEX, TRANWRD, SUBSTR, ANYALNUM, ANYALPHA, ANYDIGIT, ANYSPACE, NOTALNUM, NOTALPHA, NOTUPPER, FIND, FINDC, ANYPUNCT, INDEXC, INDEXW, VERIFY, NOTDIGIT, CALL CATS, CALL CATT, CALL CATX, TRANSLATE, SCAN, SCANQ, CALL SCAN, CALL SCANQ, COMPARE, COMPLEV, CALL COMPCOST, SOUNDEX, COMPGED, SPEDIS, MISSING, RANK, REPEAT, REVERSE …

Page 11:

Alternative Approach to Problem

Introducing REGULAR EXPRESSIONS!!

Page 12:

Introduction – Regular Expressions

• Powerful technique for searching and manipulating text data.

• A mini programming language for pattern matching.

• Two families of pattern-matching functions in SAS:

SAS Regular Expressions (RX functions such as RXPARSE) – available since SAS Version 6.12

PERL Regular Expressions (PRX functions such as PRXPARSE) – available since SAS Version 9
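As a first taste of the PERL family, the abnormality spellings from the first problem can be caught with a single pattern. This is a minimal sketch and not code from the presentation: the data set names abn and abn_clean, the variable abn_term, and the decision to map "anemic" to the same term as the other spellings are all assumptions made for illustration.

```sas
/* Hypothetical input: the four spellings shown on the abnormalities slide */
data abn;
   input subjid $ trt $ abnormality $;
   datalines;
01-011 B anemia
01-036 D anaemia
01-026 C anemea
01-014 B anemic
;
run;

data abn_clean;
   set abn;
   length abn_term $20;   /* abn_term is an assumed name for the cleaned value */
   /* "an", optional "a", "em", then ia / ea / ic; /i ignores letter case. */
   /* \s*$ tolerates the trailing blanks that pad a SAS character value.   */
   if prxmatch('/^ana?em(ia|ea|ic)\s*$/i', abnormality) then abn_term = 'Anemia';
   else abn_term = abnormality;
run;
```

Because the pattern is a constant string, PRXMATCH compiles it once rather than on every row.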

Page 13:

Steps to use Regular Expressions…

1. Problem

2. Required Portion

3. Pattern

4. Regular Expressions

5. Locate Reqd. Portion

6. Process Data

Page 14:

Step 1 – Identify the problem …

USUBJID  VISIT  VSDT       PRSDTLTM           VNTR_RT  VNTRTUN  time_desc
1        1      17-Oct-08  Per 1 D01 Predose  47       /min     Predose
1        2      3-Nov-08   Per 1 D01 .5 hr    58       /min     Day 1, 0.5 Hour
1        2      3-Nov-08   Per 1 D 01 01 hr   51       /min     Day 1, 1 Hour
1        2      3-Nov-08   Per 1d01 02 hr     49       /min     Day 1, 2 Hours
1        3      4-Nov-08   Day1               53       /min     Day 1
1        90     3-Feb-09   Poststudy          56       /min     Poststudy

Page 15:

Step 2 – Visualize the “Required Portion” within the source text

PRSDTLTM            Required portion
Per 1 D01 Predose   D01
Per 1 D01 .5 hr     D01
Per 1 D 01 01 hr    D 01
Per 1d01 02 hr      d01
Day1                Day1
Poststudy           (none)

Page 16:

Step 3 – Identify a pattern

PRSDTLTM
Per 1 D01 Predose
Per 1 D01 .5 hr
Per 1 D 01 01 hr
Per 1d01 02 hr
Day1
Poststudy

Pattern of the portion to EXTRACT, read from left to right:

• Preceding blank (optional)

• ‘D’ or ‘d’

• 2 non-digits (optional, as in the “ay” of “Day1”)

• Following blank (optional)

• One or more digits

• Following blank(s)

Page 17:

Regular Expressions Syntax...at a glance

Metacharacter   Description
*               Matches the previous subexpression zero or more times
+               Matches the previous subexpression one or more times
?               Matches the previous subexpression zero or one time
\d              Matches a digit (0-9)
\D              Matches a non-digit
\w              Matches a word character (upper- or lowercase letter, digit, or underscore)
[abc]           Matches any one of the characters in the brackets
\(              Matches a literal (
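A few of these metacharacters can be tried directly with PRXMATCH, which returns the position of the first match or 0 when there is none. The sketch below is illustrative only; the test strings and the flag names (has_digit, d_then_num, has_paren) are made up for the example.

```sas
data _null_;
   length txt $20;
   /* Assumed test strings, loosely based on the PRSDTLTM values */
   do txt = 'D01', 'D 01', 'day1', 'Poststudy', '(n)';
      has_digit  = prxmatch('/\d/', txt) > 0;          /* \d   : any digit            */
      d_then_num = prxmatch('/[Dd] ?\d+/', txt) > 0;   /* D or d, optional blank, digits */
      has_paren  = prxmatch('/\(/', txt) > 0;          /* \(   : a literal "("        */
      put txt= has_digit= d_then_num= has_paren=;
   end;
run;
```

The flags are written to the SAS log, one line per test string, so each metacharacter can be checked against a value before it goes into a longer pattern.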

Pages 18 to 24:

Step 4 – Write the Regular Expression for the pattern

PRSDTLTM
Per 1 D01 Predose
Per 1 D01 .5 hr
Per 1 D 01 01 hr
Per 1d01 02 hr
Day1
Poststudy

The expression is built up one pattern component at a time:

Pattern component     Regular expression so far
Preceding blank       ("/ ?/")
‘D’ or ‘d’            ("/ ?[Dd]/")
2 non-digits          ("/ ?[Dd](\D\D)?/")
Following blank       ("/ ?[Dd](\D\D)? ?/")
One or more digits    ("/ ?[Dd](\D\D)? ?\d+/")
Following blank(s)    ("/ ?[Dd](\D\D)? ?\d+ +/")

Complete expression:

("/ ?[Dd](\D\D)? ?\d+ +/")
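To see why the optional pieces of the pattern / ?[Dd](\D\D)? ?\d+ +/ matter, it can be compared with a stricter pattern on the six sample values. This is a sketch under assumptions: the stricter pattern / D\d+ / and the flag names strict and flexible are not from the slides, and PRSDTLTM is given a length of 20 so that short values such as "Day1" carry the trailing blanks that the final " +" needs.

```sas
data _null_;
   length PRSDTLTM $20;   /* assumed length; values are blank-padded to 20 */
   do PRSDTLTM = 'Per 1 D01 Predose', 'Per 1 D01 .5 hr', 'Per 1 D 01 01 hr',
                 'Per 1d01 02 hr', 'Day1', 'Poststudy';
      /* Strict: blank, capital D, digits, blank */
      strict   = prxmatch('/ D\d+ /', PRSDTLTM) > 0;
      /* Flexible: the complete pattern built up step by step */
      flexible = prxmatch('/ ?[Dd](\D\D)? ?\d+ +/', PRSDTLTM) > 0;
      put PRSDTLTM= strict= flexible=;
   end;
run;
```

The strict pattern only finds the day text written exactly as " D01 ", while the flexible one also covers "D 01", "d01" and "Day1". Neither should flag "Poststudy", which contains no digits.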

Page 25:

Step 4 – Write the Regular Expression for the pattern

/* Extracting the Day Text portion */
data day_txt;
   set lb.ecg(keep = PRSDTLTM);
   retain day_exp;
   if _n_ = 1 then do;
      * defined to describe the day text pattern;
      day_exp = PRXPARSE("/ ?[Dd](\D\D)? ?\d+ +/");
   end;
run;

Metacharacters: the text between the two / delimiters in the PRXPARSE argument.

Page 26:

Recap… Steps to use Regular Expressions…

1. Problem

2. Required Portion

3. Pattern

4. Regular Expressions

5. Locate Reqd. Portion

6. Process Data

Page 30:

Step 5 – Locate the “Required Portion”

/* Extracting the Day Text portion */
data day_txt;
   set lb.ecg(keep = PRSDTLTM);
   retain day_exp day_nexp;
   if _n_ = 1 then do;
      * defined to describe the day text pattern;
      day_exp = PRXPARSE("/ ?[Dd](\D\D)? ?\d+ +/");
   end;
   * Locating the day text pattern in the PRSDTLTM var;
   CALL PRXSUBSTR(day_exp, PRSDTLTM, dayst, dayln);
run;

Arguments of CALL PRXSUBSTR:

day_exp    Pattern definition
PRSDTLTM   Source variable
dayst      Stores start position of matched string
dayln      Stores length of matched string

Page 31:

Step 6 – Use other SAS text functions to further process data

/* Extracting the Day Text portion */
data day_txt;
   set lb.ecg(keep = PRSDTLTM);
   retain day_exp day_nexp;
   if _n_ = 1 then do;
      * defined to describe the day text pattern;
      day_exp = PRXPARSE("/ ?[Dd](\D\D)? ?\d+ +/");
   end;
   * Locating the day text pattern in the PRSDTLTM var;
   CALL PRXSUBSTR(day_exp, PRSDTLTM, dayst, dayln);
   * Extracting the day text pattern;
   day_txt = substrn(PRSDTLTM, dayst, dayln);
run;

Arguments of SUBSTRN:

PRSDTLTM   Source variable
dayst      Starting position
dayln      Length of matched pattern

Page 32:

…Output

PRSDTLTM            day_txt (extracted string)
Per 1 D01 Predose   D01
Per 1 D01 .5 hr     D01
Per 1 D 01 01 hr    D 01
Per 1d01 02 hr      d01
Day1                Day1
Poststudy
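The extracted strings still differ in spelling, so one more step is needed to reach a uniform "Day 1" text. One possible way, sketched below, is to put the digits in their own capture buffer and read them back with PRXPOSN. This is an assumption-laden variant and not the presenters' code: the extra parentheses around \d+, the data set name day_std, the variable name day_label, and the inline sample data are all introduced here for illustration.

```sas
data day_std;
   length PRSDTLTM $20 day_label $10;   /* day_label is an assumed name */
   retain day_exp;
   if _n_ = 1 then
      /* Same pattern as above, with (\d+) added as capture buffer 2 */
      day_exp = prxparse("/ ?[Dd](\D\D)? ?(\d+) +/");
   input PRSDTLTM $char20.;
   if prxmatch(day_exp, PRSDTLTM) then
      /* INPUT turns "01" into the number 1; CATX builds "Day 1" */
      day_label = catx(' ', 'Day', input(prxposn(day_exp, 2, PRSDTLTM), 8.));
   drop day_exp;
   datalines;
Per 1 D01 Predose
Per 1 D01 .5 hr
Per 1 D 01 01 hr
Per 1d01 02 hr
Day1
Poststudy
;
run;
```

Rows without a day portion, such as "Poststudy", are left with a blank day_label and can be handled separately.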

Page 34:

Advantages…

• Compact solution

• Tremendous flexibility

Concise description.

Highly unstructured data streams.

Multiple matching patterns in one step.
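As one illustration of matching multiple patterns in one step, alternation with | lets a single expression look for several texts at once. The sketch below is hypothetical: the data set name flags, the variable pre_or_post, and the choice of "Predose" and "Poststudy" as the texts of interest are assumptions for the example.

```sas
data flags;
   length PRSDTLTM $20;
   input PRSDTLTM $char20.;
   /* One expression covers both labels, with or without an inner blank, */
   /* and /i makes the match independent of letter case.                 */
   pre_or_post = prxmatch('/pre ?dose|post ?study/i', PRSDTLTM) > 0;
   datalines;
Per 1 D01 Predose
Per 1 D01 .5 hr
Day1
Poststudy
;
run;
```

Doing the same with string functions would typically take one call per spelling variant, plus case handling.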

Page 36:

Look before you leap

Document thoroughly.

Understand patterns.

Define before use.

Define only once in a data step.
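The last two points can be sketched as follows: the pattern is compiled on the first iteration only, its identifier is retained for later rows, and the step stops if the pattern fails to compile. The variable names re, txt and pos and the sample rows are placeholders chosen for this example.

```sas
data _null_;
   length txt $20;
   retain re;                          /* keep the pattern id across rows */
   if _n_ = 1 then do;
      re = prxparse('/\d+/');          /* define before use, and only once */
      if missing(re) then do;          /* PRXPARSE returns missing on a bad pattern */
         put 'Pattern did not compile';
         stop;
      end;
   end;
   input txt $char20.;
   pos = prxmatch(re, txt);            /* reuse the compiled pattern */
   put txt= pos=;
   datalines;
D01
Poststudy
;
run;
```

Without the _n_ = 1 guard, PRXPARSE would run again on every row, which adds avoidable work on large data sets.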

Page 38:

…References

• Support.sas.com

• Paper TU02: An Introduction to Regular Expressions with Examples from Clinical Data. Richard F. Pless, Ovation Research Group, Highland Park, IL

• SUGI 29 Tutorials, Paper 265-29: An Introduction to Perl Regular Expressions in SAS 9. Ron Cody, Robert Wood Johnson Medical School, Piscataway, NJ

• An Introduction to PERL Regular Expression in SAS®. James J. Van Campen, SRI International, Menlo Park, CA

Page 40:

Thank you