Perl 101: Regular Expressions - Meetupfiles.meetup.com/501101/Perl 101- Regular...

Perl 101: Regular Expressions - Alan Voss, Perl Hacker


What are Regular Expressions?

A) Black magic?
B) A form of wizardry?
C) A (mostly) predictable mini-language for detecting patterns in text and manipulating text in an iterative fashion
D) Platform and language independent (for the most part): Perl, JavaScript, Java, PHP, grep -e, etc.
E) The first argument to Perl's split()
F) All of the above


A Basic Example

/The/

● Matches any of:
  ○ The
  ○ Their
  ○ Thesis
  ○ anaEsThesia
  ○ ectoThermiC
  ○ AbsinThe
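As a minimal sketch of the slide above, the snippet below runs /The/ against the same word list. The variable names (@words, @hits, $lower) are illustrative and not part of the original talk.

```perl
use strict;
use warnings;

# Words taken from the slide; each contains the literal substring "The".
my @words = qw(The Their Thesis anaEsThesia ectoThermiC AbsinThe);

# grep returns the words for which the match succeeds.
my @hits = grep { /The/ } @words;
print scalar(@hits), " of ", scalar(@words), " words match /The/\n";

# Matching is case sensitive by default, so lowercase "the" does not match.
my $lower = ('the' =~ /The/) ? 'match' : 'no match';
print "the: $lower\n";
```

The pattern only needs to occur somewhere in the string, which is why words such as "anaEsThesia" count as matches.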


Building on the Basic Example

/tru/i (i: modifier for case insensitivity)

● Matches any of:
  ○ True
  ○ truth
  ○ altruism
  ○ constRUcted
  ○ obStrUctEd
  ○ Restructure


So far, not very interesting.

We can match basic words using a "pattern" of just the word.

But what if we need to match something more interesting?


Building on the Basic Example

/the|tru/i

● Matches any of:
  ○ The
  ○ true
  ○ anaesThesia
  ○ constRUcted
  ○ obStrUctEd
  ○ absinThe


Building on the Basic Example

/chee(z|se)burger/i

● Matches any of:
  ○ cheezburger (sic)
  ○ cheeseburger
  ○ CHEESEburger


Building on the Basic Example

/t(he|ru)/i

● (Still) matches any of:
  ○ The
  ○ true
  ○ anaesThesia
  ○ constRUcted
  ○ obStrUctEd
  ○ absinThe
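The alternation and grouping slides can be checked side by side. This is a small sketch using the words from the slides; the array names are made up for illustration.

```perl
use strict;
use warnings;

my @words = qw(The true anaesThesia constRUcted obStrUctEd absinThe
               cheezburger CHEESEburger);

# Alternation at the top level: "the" or "tru", case-insensitively.
my @alt = grep { /the|tru/i } @words;

# The same alternatives with the shared "t" factored out of the group.
my @grouped = grep { /t(he|ru)/i } @words;

# Grouping limits the alternation to one part of the pattern.
my @burgers = grep { /chee(z|se)burger/i } @words;

print "alternation: @alt\n";
print "grouped:     @grouped\n";
print "burgers:     @burgers\n";
```

Both /the|tru/i and /t(he|ru)/i select the same six words here, while the burger pattern selects only the last two.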


How does it work?

● Regular expressions are string iterator instructions.
● Matching always starts at the beginning of the string.
● Matching continues until total success or partial failure.
● On failure, the engine backtracks until no alternative can succeed at the current starting position; once those are exhausted, the cursor advances one character and matching starts again.


Back to the /The/ example:

The
● T matches T at start of regex
● h matches next character in regex
● e matches the one after that

Success!


Back to the /The/ example:

ectoThermic
● e does not match T at start of regex
  ○ fail, advance, start over
● c does not match T at start of regex
  ○ fail, advance, start over
● t does not match T at start of regex
  ○ fail, advance, start over
● o does not match T at start of regex
  ○ fail, advance, start over


Back to the /The/ example:

ectoThermic (continued)
● T matches T at start of regex
● h matches next character in regex
● e matches the one after that

The rest of the string is ignored.

Success!


Back to the /The/ example:

True
● T matches T at start of regex
● r does not match next character h
  ○ fail, advance, backtrack regex, start over
● r does not match T at start of regex
  ○ fail, advance, start over
● u does not match T at start of regex
  ○ fail, advance, start over
● e does not match T at start of regex
  ○ fail, can't advance, done.

Failure.
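The three walkthroughs above can be observed from Perl by asking where the match started. The sketch below assumes a helper named match_offset, which is hypothetical and exists only for this illustration; it relies on Perl's built-in @- array, whose first element holds the start offset of the last successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offset at which /The/ matched,
# or undef when the engine ran out of starting positions.
sub match_offset {
    my ($string) = @_;
    # $-[0] is the start offset of the whole match.
    return $string =~ /The/ ? $-[0] : undef;
}

for my $word (qw(The ectoThermic True)) {
    my $offset = match_offset($word);
    printf "%-12s %s\n", $word,
        defined $offset ? "matched at offset $offset" : "no match";
}
```

"The" matches at offset 0, "ectoThermic" at offset 4 (after four failed starting positions), and "True" never matches.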


Special Characters

.          matches any character but "\n" (unless the /s modifier is used)
*          0 or more of the preceding character, class, or sub-expression
+          1 or more of the preceding character, class, or sub-expression
?          0 or 1 of the preceding character, class, or sub-expression
{n,m}      minimum of n, maximum of m (m optional) of the preceding character, class, or sub-expression
|          or
[.....]    denotes character class
(.....)    denotes sub regular expression or special / extended syntaxes
\          escape any of these symbols, including itself
\Q ... \E  escape all special characters afterward (to \E, if present)
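A short sketch of a few of these characters in use. The input strings ('color', 'hahaha', '3.14', '3x14') are made up for illustration and do not come from the talk.

```perl
use strict;
use warnings;

# Quantifiers apply to the preceding character, class, or group.
my $optional = 'color'    =~ /^colou?r$/   ? 1 : 0;  # ? allows zero "u"
my $repeated = 'hahaha'   =~ /^(ha){2,3}$/ ? 1 : 0;  # three repeats fit {2,3}
my $too_many = 'hahahaha' =~ /^(ha){2,3}$/ ? 1 : 0;  # four repeats do not

# An interpolated string is treated as a pattern, so its "." is a wildcard.
# Wrapping it in \Q...\E escapes the special characters.
my $literal = '3.14';
my $loose   = '3x14' =~ /^$literal$/     ? 1 : 0;
my $strict  = '3x14' =~ /^\Q$literal\E$/ ? 1 : 0;

print "optional=$optional repeated=$repeated too_many=$too_many\n";
print "loose=$loose strict=$strict\n";
```

The unescaped version accepts '3x14' because the dot matches any character; the \Q...\E version rejects it.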


Special Characters (zero width)

These do not advance the cursor; they just test in place.

\A      beginning of string
\z      absolute end of string
\Z      end of string, or just before a final newline
\G      position where the previous global match stopped
\b      word boundary
^       beginning of string (with /m, also the beginning of each line)
$       end of string, or just before a final newline (with /m, also the end of each line)
(?=)    positive lookahead
(?!)    negative lookahead
(?<=)   positive lookbehind
(?<!)   negative lookbehind
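The differences between the end-of-string anchors are easiest to see on a string that ends in a newline. The following is a sketch with made-up inputs; none of the variable names come from the talk.

```perl
use strict;
use warnings;

my $line = "hello\n";

my $dollar  = $line =~ /hello$/  ? 1 : 0;  # $ may match before a final newline
my $big_z   = $line =~ /hello\Z/ ? 1 : 0;  # \Z behaves the same way
my $small_z = $line =~ /hello\z/ ? 1 : 0;  # \z insists on the very end

# With /m, ^ also matches at the start of each line.
my $text   = "one\ntwo\nthree";
my @firsts = $text =~ /^(\w)/mg;

# A lookahead tests in place: "bar" must follow, but is not consumed.
my ($ahead) = 'foobar' =~ /(foo(?=bar))/;

# \b is a zero-width word boundary.
my $whole_word = 'the cat sat' =~ /\bcat\b/ ? 1 : 0;
my $inside     = 'concatenate' =~ /\bcat\b/ ? 1 : 0;

print "dollar=$dollar Z=$big_z z=$small_z\n";
print "line starts: @firsts\n";
print "lookahead captured: $ahead\n";
print "whole_word=$whole_word inside=$inside\n";
```

Only \z rejects the trailing newline, and the lookahead capture contains "foo" alone because the assertion consumes nothing.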


Examples

/^(ab)?normal$/  # normal, abnormal

/^(stig|fer)ma(ta)?$/  # stigma, stigmata, fermata

/^(ma){1,2}$/  # ma, mama


More Examples

/^ba(na){2}$/  # banana

/^(kn[ia]ck){1,2}$/  # knick, knack, knickknack (knickknick, knackknack)

/^(angio|rhino|osteo|neo)plasty$/  # various surgeries


More Examples

/^br+$/  # indicating varying degrees of coldness

/^[0-9]{3}-[0-9]{2}-[0-9]{4}$/  # social security number (character classes)

/^[0-9]{1,3}(,[0-9]{3})*$/  # a number in the form of 10,201,231

/^(.+).?(??{reverse $1})$/  # a palindrome (awake...?)
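A few of the example patterns above can be tried against sample input. This is only a sketch: the patterns are copied from the slides, while the test strings and the variable names ($ssn_like, $thousands, $knickknack) are assumptions made for illustration.

```perl
use strict;
use warnings;

# Patterns from the slides, compiled once with qr// for reuse.
my $ssn_like   = qr/^[0-9]{3}-[0-9]{2}-[0-9]{4}$/;
my $thousands  = qr/^[0-9]{1,3}(,[0-9]{3})*$/;
my $knickknack = qr/^(kn[ia]ck){1,2}$/;

# Keep only the inputs each pattern accepts.
my @good_numbers = grep { $_ =~ $thousands }
                   ('10,201,231', '999', '1,23', '1234');
my @good_ssns    = grep { $_ =~ $ssn_like }
                   ('123-45-6789', '123-456-789');
my @good_knacks  = grep { $_ =~ $knickknack }
                   ('knickknack', 'knackknick', 'knock');

print "numbers: @good_numbers\n";
print "ssn-like: @good_ssns\n";
print "knacks: @good_knacks\n";
```

Note that '1234' is rejected by the thousands pattern because it requires a comma before every group of three digits.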


Character Classes (use ranges)

Character class   AKA *   Denotes                                   Opposite (not), uses leading ^   AKA *
[ACGT]                    any DNA nucleotide base                   [^ACGT]
[0-9]             \d      any single digit                          [^0-9]                           \D
[A-Za-z]                  any uppercase or lowercase ASCII letter   [^A-Za-z]
[A-Za-z0-9_]      \w      any "word" character                      [^A-Za-z0-9_]                    \W
[\t\r\n\f ]       \s      any whitespace character                  [^\t\r\n\f ]                     \S

* not available in all implementations, but definitely in Perl!
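A brief sketch of classes, negated classes, and the shorthand escapes from the table. The sample strings are invented for illustration.

```perl
use strict;
use warnings;

# A custom class, anchored so that every character must belong to it.
my $is_dna = 'GATTACA' =~ /^[ACGT]+$/ ? 1 : 0;

# A negated class succeeds as soon as it finds one character outside the set.
my $has_other = 'GATTXCA' =~ /[^ACGT]/ ? 1 : 0;

# \d matches digits; \S is the negation of \s (anything but whitespace).
my @numbers = 'order 66, aisle 7'      =~ /(\d+)/g;
my @tokens  = 'split  on   whitespace' =~ /(\S+)/g;

print "is_dna=$is_dna has_other=$has_other\n";
print "numbers: @numbers\n";
print "tokens: @tokens\n";
```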


Greed


Greed

* + ? {n,m} {n,} are all greedy.

They always match as many as they can.

When the regex as a whole fails, the part of the string that was matched greedily is given back one repetition at a time, until either there is nothing more to give back or the whole regex succeeds.


Reluctance or Parsimony

The opposite of greed.

Adding a ? to the quantifier will make it match the minimum required.

A*? might match 0 letter As, even though it could match many, many more with greed.
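The A*? remark can be seen directly by capturing what each quantifier consumed. A minimal sketch, with made-up input strings:

```perl
use strict;
use warnings;

# Greedy: takes every A it can.
my ($greedy) = 'AAAA' =~ /(A*)/;

# Reluctant: zero As already satisfies the pattern, so it stops there.
my ($reluctant) = 'AAAA' =~ /(A*?)/;

# Reluctant quantifiers still extend when the rest of the regex needs it.
my ($forced) = 'AAAAB' =~ /(A*?)B/;

printf "greedy='%s' reluctant='%s' forced='%s'\n",
    $greedy, $reluctant, $forced;
```

The reluctant capture is the empty string, but once a B is required after it, the same quantifier has to take all four As for the match to succeed.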


Greed vs Reluctance/Parsimony

Say you wanted to get the information between quotes in this sentence:

$a = 'The man said, "Heck if I know!"';

You could match that with the following:

/"(.+)"/


Greed vs Reluctance/Parsimony

But what about this sentence:

$a = 'The man said, "Heck if I know!" and then she answered, "Oh, but I do!"';

Will this still work?

/"(.+)"/


Greed vs Reluctance/Parsimony

Nope. The .+ is greedy, and will swallow everything from the first quote to the very last one, even though you didn't mean to capture all of that.

$a = 'The man said, "Heck if I know!" and then she answered, "Oh, but I do!"';

/"(.+)"/


Greed vs Reluctance/Parsimony

You could use a reluctant expression, which would match just the first quotation:

$a = 'The man said, "Heck if I know!" and then she answered, "Oh, but I do!"';

/"(.+?)"/


Greed vs Reluctance/Parsimony

Or use the /g global modifier to match both:

$a = 'The man said, "Heck if I know!" and then she answered, "Oh, but I do!"';

/"(.+?)"/g


Greed vs Reluctance/Parsimony

Or, you could say what you mean (in some cases this is faster, and other times it is slower):

$a = 'The man said, "Heck if I know!" and then she answered, "Oh, but I do!"';

/"([^"]+)"/g
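The four patterns from this sequence of slides can be compared on the same sentence. The sketch below uses a lexical $sentence rather than the slides' $a, since $a is also one of the special variables used by sort; that substitution and the other variable names are choices made here, not the speaker's.

```perl
use strict;
use warnings;

my $sentence = 'The man said, "Heck if I know!" and then she answered, "Oh, but I do!"';

# Greedy: runs from the first quote to the very last one.
my ($greedy) = $sentence =~ /"(.+)"/;

# Reluctant: stops at the first closing quote.
my ($reluctant) = $sentence =~ /"(.+?)"/;

# Reluctant with /g in list context: every quoted section.
my @all = $sentence =~ /"(.+?)"/g;

# Saying what you mean: anything that is not a quote.
my @negated = $sentence =~ /"([^"]+)"/g;

print "greedy:    $greedy\n";
print "reluctant: $reluctant\n";
print "all:       ", join(' | ', @all), "\n";
print "negated:   ", join(' | ', @negated), "\n";
```

On this input the reluctant /g form and the negated-class form return the same two quotations.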


Greed vs Possessive

There is also a shortcut for being greedy without backtracking: append a + to the quantifier, in the same way a ? is appended for reluctant matching.

*+   {2,5}+   ++

'aaaa' =~ /a++a/ will never match, but /a+a/ will.
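The 'aaaa' claim can be checked directly. This sketch also shows the atomic group (?>...), which is another way to write the same "no giving back" behavior in Perl; that comparison is an addition here and not from the slide.

```perl
use strict;
use warnings;

# Greedy a+ takes all four, then gives one back so the final "a" can match.
my $greedy = 'aaaa' =~ /a+a/ ? 1 : 0;

# Possessive a++ takes all four and refuses to give any back.
my $possessive = 'aaaa' =~ /a++a/ ? 1 : 0;

# An atomic group behaves the same way as the possessive quantifier.
my $atomic = 'aaaa' =~ /(?>a+)a/ ? 1 : 0;

print "greedy=$greedy possessive=$possessive atomic=$atomic\n";
```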


Captures and backreferences

/(\d{2,})\1/  # matches a repeating set of numbers

\1 refers to the first set of parentheses (subexpression) in the expression, and says "match that again, please".

In substitutions, which we'll talk about later, you could refer to that as $1, a global in Perl.


Named captures and backreferences

How about making things clearer with names, rather than numbers?

/(\d{2,})\1/
/(?<numbergroup>\d{2,})\g{numbergroup}/  # matches a repeating set of numbers

It is important to note that named captures are just aliases to the numbered backreferences, and can bite you in specific circumstances, e.g. the branch reset pattern.
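After a successful match, a named capture can be read from the built-in %+ hash, and the same group is still available by number. A minimal sketch; the input string and the variable names other than numbergroup are assumptions for illustration.

```perl
use strict;
use warnings;

my $input = 'code 4242 accepted';
my ($named, $numbered);

if ($input =~ /(?<numbergroup>\d{2,})\g{numbergroup}/) {
    $named    = $+{numbergroup};  # the capture, looked up by name
    $numbered = $1;               # the same capture, looked up by number
}

print "named=$named numbered=$numbered\n";

# A run of digits with no repeated block does not match.
my $no_repeat = '1234' =~ /(\d{2,})\1/ ? 1 : 0;
print "no_repeat=$no_repeat\n";
```

The group captures "42" rather than "4242": the greedy \d{2,} has to give digits back until the backreference can repeat what was captured.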


Modifiers

Can be used in combination.

Modifier   Means         Does
/g         Global        Match as many times as possible.
/i         Insensitive   Case insensitivity, even with character classes.
/s         Single        Treats even a multi-lined string as a single line, such that . will match "\n", for example.
/m         Multiple      For multiple-lined strings, ^ and $ will match the beginning and end of each line.
/x         eXtend        Ignores whitespace in the pattern and allows # comments; a good way to document your regex.
/e         Eval          Evaluate the replacement value as an expression, and use the result for the substitution.
/r         Return        Return the modified string without modifying the original (such as $_) during substitution.
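A sketch of several modifiers from the table, alone and in combination. The inputs are made up; the /r modifier assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

# /g and /i combined: every case-insensitive occurrence.
my @hits = 'The other ThEme' =~ /the/gi;

# /s lets . match a newline; /m lets ^ and $ match at line boundaries.
my $dot_plain = "a\nb"     =~ /a.b/    ? 1 : 0;
my $dot_s     = "a\nb"     =~ /a.b/s   ? 1 : 0;
my $line_m    = "one\ntwo" =~ /^two$/m ? 1 : 0;

# /x ignores pattern whitespace and allows comments.
my $ssn_like = qr/
    ^ [0-9]{3}    # three digits
    - [0-9]{2}    # dash, two digits
    - [0-9]{4} $  # dash, four digits, end of string
/x;
my $x_ok = '123-45-6789' =~ $ssn_like ? 1 : 0;

# /r returns the changed copy and leaves the original alone.
my $original = 'Alan wrote this';
my $copy     = $original =~ s/Alan/Someone/r;

print "hits: @hits\n";
print "dot_plain=$dot_plain dot_s=$dot_s line_m=$line_m x_ok=$x_ok\n";
print "original: $original\ncopy: $copy\n";
```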


Substitution

Use s with any special character or bracket set as delimiters.

s/Alan/someone better at Java/

s#\b(\w+)\b#ucfirst $1#ge
"alan loves regular expressions" becomes "Alan Loves Regular Expressions"

s{(\w)(\w)}{$2$1}g or s/(\w{2})/reverse $1/ge
Swap adjacent word characters.
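The substitutions above can be run as follows. This is a sketch: the input strings other than "alan loves regular expressions" are invented, and the last example spells out scalar reverse so that the string reversal does not depend on the context in which the replacement is evaluated.

```perl
use strict;
use warnings;

# Plain substitution with the default / delimiters.
my $speaker = 'Alan is presenting';
$speaker =~ s/Alan/someone better at Java/;

# # as the delimiter, with /e so that ucfirst is called for every word.
my $title = 'alan loves regular expressions';
$title =~ s#\b(\w+)\b#ucfirst $1#ge;

# Bracket delimiters: swap each adjacent pair of word characters.
my $pairs = 'abcdef';
$pairs =~ s{(\w)(\w)}{$2$1}g;

# The same swap using /e and an explicit scalar reverse.
my $reversed = 'abcdef';
$reversed =~ s/(\w{2})/scalar reverse $1/ge;

print "$speaker\n$title\n$pairs\n$reversed\n";
```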


Topics not (heavily) covered, but that are related and interesting

● lookaheads (and the negative counterpart)
● lookbehinds (and the negative counterpart)
● other extended regular expressions
● ?: for blocking the capture of a set of parentheses
● Regexp::Common (named aliases for commonly used regexes, including very complicated ones)
● The entire Regexp:: namespace (including Regexp::Debugger, used in this presentation)