Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i


Transcript of Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i

Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?

A presentation by Brett Florio of FoxyCart.com.

Follow along at bit.ly/regex-makes-me-wanna

Who it’s for?

▷ Beginners looking to understand the basics

▷ Intermediate regex devs wanting a review and some new approaches

▷ Advanced programmers who just don’t really grok regular expressions.

▷ Anybody who hates regex because they don’t understand it.

Slides… are available at bit.ly/regex-makes-me-wanna

How we’ll learn:

Rather than abstract concepts like “cat” and “dog”, we’ll focus on real use-cases you might run across in your daily programming.

What we’ll learn:

▷ Our goal

▷ A brief history of regex

▷ Matching

▷ Validating

▷ Replacing

▷ Working with HTML

▷ Common gotchas

About this presentation!

▷ Co-founded FoxyCart.com (now Foxy.io) in 2007

▷ Dove into regex when @lukestokes told me something was impossible. Proved him wrong.

▷ Spent the past five years traveling full-time or half-time in an RV with my wife and 3 kids.

▷ Currently in Austin, TX, and happy to grab food or drinks if you’re in town!

@brettflorio

http://brettflorio.com/ has more photos like this -->

FoxyCart.com / Foxy.io is where I solve problems.

About @brettflorio

import re

# Credit card number matcher
CREDIT_CARD = re.compile(r'([^\d])([3456][ -]*?(?:\d[ -]*?){12,15})([^\d])')
CC_REPLACEMENT = r'\g<1>XXX_CC_LE_REPLACEMENT_XXX\g<3>'

# Password matching
PASSWORD = re.compile(r'customer_password=(.*?)&')
PASSWORD_REPLACEMENT = 'customer_password=XXX_PW_LE_REPLACEMENT_XXX&'
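As a rough sketch of how patterns like these might be applied to a single log line: the sanitize helper and the sample line below are hypothetical and not taken from the actual recipe.

```python
import re

CREDIT_CARD = re.compile(r'([^\d])([3456][ -]*?(?:\d[ -]*?){12,15})([^\d])')
CC_REPLACEMENT = r'\g<1>XXX_CC_LE_REPLACEMENT_XXX\g<3>'
PASSWORD = re.compile(r'customer_password=(.*?)&')
PASSWORD_REPLACEMENT = 'customer_password=XXX_PW_LE_REPLACEMENT_XXX&'


def sanitize(line):
    """Hypothetical helper: mask card numbers first, then passwords."""
    line = CREDIT_CARD.sub(CC_REPLACEMENT, line)
    return PASSWORD.sub(PASSWORD_REPLACEMENT, line)


# Made-up log line using a well-known test card number.
sample = 'card=4111 1111 1111 1111&customer_password=hunter2&x=1'
print(sanitize(sample))
```

Groups 1 and 3 hold the non-digit characters on either side of the card number, which is why the replacement puts them back.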

A recent real-life regex…

Extra sanitization of logs, in a Chef recipe:

1. Find emails

2. Validate custom input

3. Link @mentions and #tags in text

4. Strip <script> tags

5. Truly validate a subdomain: ^(?!-)[a-z0-9-]{1,63}(?<!-)$

Our goals!

Understand how to:

http://www.totalprosports.com/2012/06/01/soccer-celebrations-special-effects-win-video/

Big thanks to NomadPHP.com!

Check out Daycamp4Developers (PHP Application Security day in June)

1. REGEX: A Brief Intro

With an even briefer coverage of its history.

“Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.”

http://regex.info/blog/2006-09-15/247

▷ 1940s-60s: Lots of smart people

▷ 1970s: g/re/p

▷ 1980s: Perl and Henry Spencer

▷ 1997: PCRE (Perl Compatible Regular Expressions)

Pronunciation: hard or soft ‘g’

Regular expressions’ history

Matching

int preg_match ( string $pattern , string $subject [, array &$matches [, int $flags = 0 [, int $offset = 0 ]]] )

Returns 1 if a match is found, 0 if not, or false on error.

Common regex usage: PHP

Replacing

mixed preg_replace ( mixed $pattern , mixed $replacement , mixed $subject [, int $limit = -1 [, int &$count ]] )

Returns the replaced string or array (based on the $subject).

Matching (all)

int preg_match_all ( string $pattern , string $subject [, array &$matches [, int $flags = PREG_PATTERN_ORDER [, int $offset = 0 ]]] )

Returns # (int) of matches found.
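For comparison, here is a loose mapping of those three PHP functions onto Python's re module. It is only a sketch for orientation: the sample text is made up, and the return types differ from PHP's.

```python
import re

text = 'order 66 shipped, order 99 pending'

# Roughly preg_match: find the first match (None if there is none).
first = re.search(r'order (\d+)', text)

# Roughly preg_match_all: collect every match (here, the captured group).
all_ids = re.findall(r'order (\d+)', text)

# Roughly preg_replace: return a new string with the replacements made.
masked = re.sub(r'\d+', '#', text)

print(first.group(1), all_ids, masked)
```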

Matching

string.match(RegExp);

Returns an array of matches, or null if no matches.

Replacing

string.replace(RegExp, replacement);

Returns the string with the replacements performed.

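Caveats about JavaScript’s regex

▷ No “single-line” or DOTALL mode. (The dot never matches a new line.)

▷ No lookbehind support :(

▷ Same methods for regex and non-regex matching and replacing.

Both of the first two limitations apply to engines older than ES2018, which added the s flag and lookbehind assertions.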

Common regex usage: JS

Problem: Finding email addresses in a codebase.

Goal: /[\w.+-]+@[a-z0-9-]+(\.[a-z0-9-]+)*/i

2. The Basics of Regex Patterns

Hypothetical situation:

Your project has bloated over the years, and both internal and external emails are going everywhere, maybe including terminated employees, personal accounts, etc.

Your mission:

You need to search the whole codebase to find all the emails so you can tidy things up!

Find all the emails!

Or… an alternate story:

You need to strip emails from user-submitted content, to protect privacy or restrict communication (or like Airbnb does).

~12 Special Characters, aka “Metacharacters”

▷ . \ [ ] ? * + { } ( ) ^ $ |

▷ - (sometimes)

Nearly everything else is a literal!

Imagine your input string as bolts, and your pattern as a set of sockets (in order).

An analogy: Sockets!

"Socket wrench and sockets" by Kae - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons

The exact match

If you know exactly what you’re looking for…

You still might get more than you wanted!

https://regex101.com/r/qG8zB1/1

The almighty . and the escape \

The dot (.) matches ANYTHING and EVERYTHING.

Except… new lines, by default. PHP and others can enable DOTALL or single-line mode to have the dot match a new line. JavaScript couldn’t until ES2018 added the s flag.

The backslash \ escapes special characters (metacharacters). So \. makes a dot match just a dot.
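A small sketch of the dot and the escaped dot, using Python's re module (the sample string is made up; re.DOTALL is Python's name for single-line mode):

```python
import re

sample = 'abc a.c a\nc'

# An unescaped dot matches any character except a newline.
any_char = re.findall(r'a.c', sample)

# An escaped dot matches only a literal dot.
literal_dot = re.findall(r'a\.c', sample)

# With DOTALL the dot matches a newline as well.
dotall = re.findall(r'a.c', sample, flags=re.DOTALL)

print(any_char, literal_dot, dotall)
```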

https://regex101.com/r/eR9vT7/1

The almighty . (dot)

The dot (.) matches ANYTHING and EVERYTHING

(except newlines, by default).

Gator Grip Universal Socket, available online.


Toysmith Classic Pin Art, ~$20. Buy one!

Square brackets match what’s inside them.

[abc] ‘a’ ‘b’ or ‘c’

[a-z] Lowercase letters

[0-9] Any single digit

[a-z.] Letters and the dot

A common case is… [A-Za-z0-9_] which has a shortcut: \w “Word” characters

So… let’s try this: [\w.+-]

Character Classes!

https://regex101.com/r/iW3bW4/1
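To see what the [\w.+-] class picks up, here is a minimal sketch in Python (the sample string is made up). One thing to keep in mind: in Python 3, \w also matches non-ASCII letters unless re.ASCII is passed, so it is only exactly [A-Za-z0-9_] in ASCII mode.

```python
import re

# One or more word characters, dots, pluses, or dashes.
local_part = re.compile(r'[\w.+-]+', re.ASCII)

pieces = local_part.findall('first.last+tag, bob_99, ***')
print(pieces)
```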

Dashes need escaping inside square brackets (unless they’re at the start or the end), since they have special meaning.

So… [\w.+-] is fine. The dash is at the end.

But… [\w.\-+] needs escaping.

When in doubt, escaping doesn’t typically hurt. [\w.+\-] is also just fine.
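A quick way to see why the dash matters, sketched in Python. How an engine reacts to a misplaced dash varies; Python's re happens to reject this one outright because it reads .-+ as a backwards range.

```python
import re

ok_at_end = re.compile(r'[\w.+-]')      # dash last: a literal dash
ok_escaped = re.compile(r'[\w.\-+]')    # dash escaped: a literal dash

try:
    re.compile(r'[\w.-+]')              # unescaped dash between '.' and '+'
    range_error = None
except re.error as exc:
    range_error = str(exc)

print(ok_at_end.findall('a-b'), ok_escaped.findall('a-b'), range_error)
```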

Escaping!

? 0 or 1 match (optional)

* 0 or more matches

+ 1 or more matches

But what about at least 3, or 1 through 6 matches?

Repetition!

https://regex101.com/r/sF4tM6/1

https://regex101.com/r/aC3iH8/1

https://regex101.com/r/iE3rB4/1

https://regex101.com/r/uF5lB7/1

https://regex101.com/r/tI4nO0/1

https://regex101.com/r/aX5qG6/1

Curly brackets get you minimum and maximum ranges. Minimum is required:

{1,} At least 1

{1,3} 1 through 3

{1,64} 1 through 64

64 characters is the maximum length of the username portion of an email, so…

More Repetition!
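A minimal sketch of the {1,64} limit in Python. It uses fullmatch so the whole string has to fit the pattern; the test strings are made up.

```python
import re

# Between 1 and 64 characters from the class.
local_part = re.compile(r'[\w.+-]{1,64}')

fits = local_part.fullmatch('a' * 64) is not None       # right at the limit
too_long = local_part.fullmatch('a' * 65) is not None   # one too many
empty = local_part.fullmatch('') is not None            # minimum of 1 not met

# ? is shorthand for {0,1}: the "u" is optional here.
optional_u = re.fullmatch(r'colou?r', 'color') is not None

print(fits, too_long, empty, optional_u)
```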

It looks similar in both PHP…

preg_match('/pattern/i', $subject);

And JavaScript:

string.match(/pattern/i);

Other common modifiers are:

s Makes the dot match newlines as well. (PHP)

g Match all, not just the first. (JavaScript)

m Makes ^ and $ line-specific.

References for PHP and JavaScript

By default, regex is case-sensitive. Adding an “i” after the pattern’s delimiter fixes that.

DON’T FORGET CAPS LOCK
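The same modifiers exist in Python as flags rather than trailing letters. A small sketch, with a made-up sample string:

```python
import re

text = 'Admin@Example.COM\nsecond line'

# Case-sensitive by default: the uppercase letters prevent a match.
strict = re.search(r'admin@example\.com', text)

# re.IGNORECASE plays the role of the "i" modifier.
loose = re.search(r'admin@example\.com', text, flags=re.IGNORECASE)

# re.MULTILINE plays the role of "m": ^ and $ also match at line breaks.
line_starts = re.findall(r'^\w+', text, flags=re.MULTILINE)

print(strict, loose.group(0), line_starts)
```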

Putting it all together

/[\w.+-]+@[a-z0-9-]+(\.[a-z0-9-]+)*/i

(Try it on a project in your text editor.)

A great tool for testing how PHP handles preg_match, preg_match_all, and preg_replace is http://www.phpliveregex.com/

See this example at http://www.phpliveregex.com/p/9yD

What that looks like in PHP

preg_match_all( "/[\w.+-]+@[a-z0-9-]+(\.[a-z0-9-]+)*/i", $input_lines, $output_array);

Array ( [0] => Array ( [0] => [email protected] [1] => [email protected] [2] => [email protected] [3] => [email protected] [4] => [email protected] [5] => [email protected] [6] => [email protected] [7] => admin@localhost [8] => [email protected] [9] => [email protected] [10] => [email protected] )
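The same search sketched in Python, for anyone following along outside PHP. The sample text and its addresses are placeholders, not the ones from the slide.

```python
import re

EMAIL = re.compile(r'[\w.+-]+@[a-z0-9-]+(\.[a-z0-9-]+)*', re.IGNORECASE)

# Made-up sample text.
source = ('Contact alice@example.com, '
          'bob.smith+tag@mail.example.co.uk or admin@localhost.')

# finditer + group(0) returns whole matches; findall would return only
# the contents of the capturing group.
emails = [m.group(0) for m in EMAIL.finditer(source)]
print(emails)
```

Note that the trailing period after admin@localhost is left out of the match, because the pattern only accepts a dot when more domain characters follow it.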

. [] ?

* + {}

Square Brackets

Matches characters inside the brackets. Supports ranges.

[abc] ‘a’ ‘b’ or ‘c’

[a-z] Lowercase letters

[0-9] Any single digit

Quick review before funny gifs!

The Dot and the \w

Matches everything but new lines. If you want to match a dot and only a dot, escape it like \.

\w matches letters, numbers, and the underscore.

Optional

The ? matches 0 or 1

The Star

The * matches 0 or more.

The Plus

Matches 1 or more

Curly Brackets

Min and max ranges.

{1,} At least 1

{1,3} 1 through 3

{1,64} 1 through 64

Problem: Make sure input is what we expect.

Goal 1: /[^0-9a-z\-_.]/

Goal 2: /^[0-9]{1,2}[dwmy]$/

3. Using Regex for Validation

▷ Know your target.

▷ Some targets are impossible:

○ "much.more unusual"@example.com

○ "[email protected]"@example.com

○ "very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com

○ admin@mailserver1 (local domain name with no TLD)

○ !#$%&'*+-/=?^_`{}|[email protected]

○ "()<>[]:,;@\\\"!#$%&'*+-/=?^_`{}| ~.a"@example.org

○ " "@example.org (space between the quotes)

Hooray! But…

Validating things is where you get to determine exactly what you want.

Finding things…is usually a matter of “good enough”.

When not to use regex

http://php.net/manual/en/function.filter-var.php

http://php.net/manual/en/filter.filters.validate.php

Hammer icon by John Caserta, from The Noun Project

Just because you can use regex for validation doesn’t mean you should. PHP’s got lots handled.

filter_var( '[email protected]', FILTER_VALIDATE_EMAIL);

^ Start of string

$ End of string

https://regex101.com/r/sN8pA6/1

if (!preg_match("%^[0-9]{1,2}[dwmy]$%", $_POST["subscription_frequency"])) {
    $IsError = true;
}

Anchors
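A sketch of the same anchored check in Python. The is_valid_frequency helper and the candidate values are made up for illustration.

```python
import re

FREQUENCY = re.compile(r'^[0-9]{1,2}[dwmy]$')


def is_valid_frequency(value):
    """Hypothetical helper mirroring the PHP check: 1-2 digits plus d, w, m, or y."""
    return FREQUENCY.search(value) is not None


for candidate in ['2w', '12m', '123d', '2x', 'every 2w']:
    print(candidate, is_valid_frequency(candidate))
```

Without the ^ and $, 'every 2w' and '123d' would both pass, since the pattern could match just part of the string.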

▷ Imagine writing routing rules. These will do very different things.

Small anchors. Big impact.

index(\.php)? ^index(\.php)?

https://regex101.com/r/dS8zC9/1
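To make the difference concrete, a small Python sketch with a made-up request path:

```python
import re

path = 'admin/index.php'

# Unanchored: happy to find "index.php" in the middle of the path.
unanchored = re.search(r'index(\.php)?', path)

# Anchored: the path must start with "index", so this does not match.
anchored = re.search(r'^index(\.php)?', path)

print(unanchored.group(0), anchored)
```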

Negated Character Classes

[^abc] Anything except a, b, or c, including new lines.

// Ensure input only contains
// alphanumeric, dash, dot, underscore
if (preg_match("/[^0-9a-z\-_.]/i", $product_code)) {
    $IsError = true;
}
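The same idea sketched in Python: instead of describing valid input, look for a single character that is not allowed. The helper name and sample codes are made up.

```python
import re

INVALID_CHAR = re.compile(r'[^0-9a-z\-_.]', re.IGNORECASE)


def has_invalid_char(product_code):
    """Hypothetical helper: True if any disallowed character is present."""
    return INVALID_CHAR.search(product_code) is not None


print(has_invalid_char('ABC-123_x.y'), has_invalid_char('abc 123'))
```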

Problem: Link @mentions and #tags

Goal: /\B@([\w]{2,})/i

4. Finding… and REPLACING

First we need to find them…

▷ @foo but not @foo.bar or [email protected]

▷ \w works well to get us [A-Za-z0-9_]

▷ \B is an anchor, like ^ or $, but it matches “not a word boundary”. It matches a position, not a character.

▷ Wrap a pattern in parentheses to make a “capturing group”.

But wait… We need pieces: ( )

preg_match_all( "/\B@([\w]{2,})/i", $input, $output_array);

Array ( [0] => Array ( [0] => @calevans [1] => @FoxyCart ) [1] => Array ( [0] => calevans [1] => FoxyCart ))

The result…
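A rough Python equivalent, to show the capturing group at work. The sample text and its email address are placeholders.

```python
import re

text = 'Hey @calevans, @FoxyCart will sponsor. Receipts to billing@example.com.'

# With one capturing group, findall returns just the captured part.
mentions = re.findall(r'\B@([\w]{2,})', text)

# The @ in the email address sits right after a word character, which is
# a word boundary, so \B refuses to match there.
print(mentions)
```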

Named capturing groups:

preg_match_all( "/\B@(?P<username>[\w]{2,})/i", $input);

0 => array(0 => @calevans, 1 => @FoxyCart)

username => array(0 => calevans, 1 => FoxyCart)

1 => array(0 => calevans, 1 => FoxyCart)

For complex patterns or ease of reference, you can name capturing groups using (?P<name>) syntax.

The result…
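Python's re module accepts the same (?P<name>) syntax, so a minimal sketch looks like this (sample text made up):

```python
import re

MENTION = re.compile(r'\B@(?P<username>[\w]{2,})')
text = 'Hey @calevans, @FoxyCart will sponsor.'

# Pull each capture out by name instead of by number.
usernames = [m.group('username') for m in MENTION.finditer(text)]

# The group is still available by number as well.
first = MENTION.search(text)
print(usernames, first.group(1), first.groupdict())
```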

It’s replacin’ time!

preg_replace( "/\B@([\w]{2,})/i", "<a href=\"foo?user=$1\">$0</a>", $input);

Hey <a href="foo?user=calevans">@calevans</a>, could you pick up some #ice_cream and #gingerbread for #CoderFaire? <a href="foo?user=FoxyCart">@FoxyCart</a> will sponsor. Email me a receipt at [email protected].

Notice the $0 and $1. $0 is the complete match. $1 is the first captured group. $2 would be the second, etc.
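The replacement syntax differs a little between languages. As a sketch, Python's re.sub writes the first group as \1 and the complete match as \g<0> (the sample text is made up):

```python
import re

text = 'Hey @calevans, could you email billing@example.com?'

linked = re.sub(
    r'\B@([\w]{2,})',
    r'<a href="foo?user=\1">\g<0></a>',   # \1 = username, \g<0> = whole match
    text,
)
print(linked)
```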

A recent example…

Find credit card numbers, before they get submitted, emailed, saved, logged, or backed up.

Visualization by https://jex.im/regulex/

preg_replace is the best.

Problem: Match some HTML tag attributes.

Goal: %name=(['"]?)amount\1%

5. Backreferences and HTML

▷ Backreferences refer back to previous captured groups in the same pattern.

▷ Syntax is \#, where # is the number of the group.

▷ Useful for matching pairs of things (opening/closing quotes and tags).

Backreferences

http://regexr.com/3a8j0
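A short Python sketch of the goal pattern. The \1 has to repeat whatever the first group captured, so the closing quote must equal the opening one (or both must be absent). The sample tags are made up.

```python
import re

ATTR = re.compile(r'''name=(['"]?)amount\1''')

double_quoted = ATTR.search('<input name="amount">')
single_quoted = ATTR.search("<input name='amount'>")
unquoted = ATTR.search('<input name=amount>')
mismatched = ATTR.search('''<input name="amount'>''')

print(double_quoted.group(0), single_quoted.group(0), unquoted.group(0), mismatched)
```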

Problem: Strip script tags without stripping extra stuff.

Goal: %<script.*?</script>%

6. Greediness & the Dot

https://regex101.com/r/uJ7jQ6/1

Greedy by default

This pattern will match as much as it possibly can.

Anytime you use a dot, remember how greedy it is.

https://regex101.com/r/lO1sB7/1

Adding a ? after a repetition metacharacter (+, *, or {m,n}) will make it non-greedy.

Notice the difference. It’ll stop the match as soon as it can instead of as late as it can.

In general, always throw a ? after a + or *.

Go non-greedy!
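The difference is easiest to see side by side. A minimal Python sketch with a made-up HTML snippet:

```python
import re

html = '<script>a()</script><p>keep</p><script>b()</script>'

# Greedy: runs from the first <script to the LAST </script>.
greedy = re.sub(r'<script.*</script>', '', html)

# Non-greedy: stops at the first </script> it can, once per script tag.
lazy = re.sub(r'<script.*?</script>', '', html)

print(repr(greedy), repr(lazy))
```

The greedy version throws away the paragraph in the middle along with both scripts.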

http://xkcd.com/1638/

Slashes and HTML

The / is often used as the pattern delimiter, so it needs to be escaped.

preg_match('/https?:\/\/.*?\//i', $subject);

In PHP you can use others. % or ` (backtick) work well.

preg_match('%https?://.*?/%i', $subject);

preg_match('`https?://.*?/`i', $subject);

In JavaScript, you can’t use others, but you can construct without them…

var re = new RegExp("https?://");

http://php.net/manual/en/regexp.reference.delimiters.php


Problem: Validate a subdomain with dashes (which can’t start or end the string)

Goal: ^(?!-)[a-z0-9-]{1,63}(?<!-)$

7. Lookarounds!

Positive Lookahead:

Match something followed by something else.

(?=)

Negative Lookahead:

Match something not followed by something else.

(?!)

https://regex101.com/r/gK0mE7/1

https://regex101.com/r/mE1fC4/1

Lookaheads
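A small Python sketch of both lookaheads, with made-up file names. The lookahead checks what comes next without making it part of the match.

```python
import re

# Positive: "index" only when ".php" follows; ".php" is not consumed.
followed = re.search(r'index(?=\.php)', 'index.php')

# Negative: "index" only when ".php" does NOT follow.
not_followed = re.search(r'index(?!\.php)', 'index.html')
rejected = re.search(r'index(?!\.php)', 'index.php')

print(followed.group(0), not_followed.group(0), rejected)
```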

Positive Lookbehind:

Match something preceded by something else.

(?<=)

Negative Lookbehind:

Match something not preceded by something else.

(?<!)

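JavaScript engines older than ES2018 don’t support lookbehinds, and where they are supported there are some limitations (many engines only accept fixed-length patterns inside a lookbehind).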

https://regex101.com/r/kL3rA4/1

https://regex101.com/r/xT1gA9/1

Lookbehinds
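And the lookbehind counterparts, sketched in Python with a made-up sentence. Python's re is one of the engines that requires a fixed-width pattern inside a lookbehind.

```python
import re

text = 'cost $40 for 3 items'

# Positive lookbehind: digits that come right after a dollar sign.
priced = re.findall(r'(?<=\$)\d+', text)

# Negative lookbehind: digits NOT preceded by a dollar sign (or by another
# digit, so the match cannot start in the middle of "40").
unpriced = re.findall(r'(?<![$\d])\d+', text)

print(priced, unpriced)
```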

Subdomains can’t be longer than 63 characters, can only contain letters, numbers, and dashes, but cannot start or end with a dash.

The top is without lookarounds.

The bottom is with ‘em.

https://regex101.com/r/jU0yI3/2 from http://stackoverflow.com/a/7933253/862520

https://regex101.com/r/wV7yQ0/2

Practical lookarounds
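Here is the lookaround version of the subdomain check as a Python sketch. The is_valid_subdomain helper and the test labels are made up for illustration.

```python
import re

SUBDOMAIN = re.compile(r'^(?!-)[a-z0-9-]{1,63}(?<!-)$')


def is_valid_subdomain(label):
    """Hypothetical helper: 1-63 chars of a-z, 0-9, dash; no dash at either end."""
    return SUBDOMAIN.search(label) is not None


for label in ['my-shop', '-shop', 'shop-', 'my_shop', 'a' * 63, 'a' * 64]:
    print(label[:12], is_valid_subdomain(label))
```

The (?!-) guards the first character and the (?<!-) guards the last, which leaves the character class in the middle free to stay simple.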

Problem: You can’t get enough regex!

Goal: Learn all the regex!

8. Resources & Homework

Special Characters, aka “Metacharacters”:

▷ caret ^

▷ dollar sign $

▷ period or dot .

▷ question mark ?

▷ asterisk or star *

▷ plus sign +

▷ parentheses ( )

▷ square brackets [ ]

▷ curly brackets { }

▷ pipe |

▷ backslash \

Reading & Resources:

▷ regular-expressions.info

▷ regexr.com is my jam.

▷ regex101.com does a bit more if you need it.

▷ phpliveregex.com shows PHP’s handling of preg_ methods.

▷ jex.im/regulex/ is a super helpful visualization.

Overview

▷ The pipe character, to match one pattern OR another

▷ All the character classes: \s \S \d \D \W

▷ Unicode support, and how frustrating it can be

▷ Non-capturing (or “passive”) groups

▷ Named capturing groups

▷ How the \b and \B work as they relate to the @mentions example. Why does \B@foo match the way it does? How do they relate to \w and \W?

Homework!

You can find me at:

@brettflorio, [email protected]

You can leave feedback at https://joind.in/event/lone-star-php-2017/regex-makes-me-weepgive-up-i

Slides available at bit.ly/regex-makes-me-wanna

Thanks! Any questions?

Thanks again to @calevans and @nomadphp for asking me to do this talk in the first place.

Credits

Thanks also to all the people who made and released these awesome resources for free:

▷ Minicons by Webalys

▷ Presentation template by SlidesCarnival