Regex is Fun
![Page 1: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/1.jpg)
Copyright © 2013 Splunk Inc.
Regex is Fun
David Clawson, SplunkYoda
![Page 2: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/2.jpg)
2
![Page 3: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/3.jpg)
Regular Expressions
• “A regular expression is a pattern which specifies a set of
strings of characters; it is said to match certain strings.”
—Ken Thompson
• Ken Thompson built regular expressions into his version of the QED text editor in the late 1960s
• Invented in the 1940s
• Help celebrate its 70th year
3
![Page 4: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/4.jpg)
Types of Regular Expressions
4
![Page 5: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/5.jpg)
How is Regex used in Python?
Python “re”
Python's built-in "re" module provides excellent support for regular expressions, with a
modern and complete regex flavor.
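Before Python 3.11, the only significant features missing from Python's regex syntax were atomic grouping,
possessive quantifiers, and Unicode properties. Python 3.11 added atomic grouping and possessive quantifiers; Unicode properties (\p{...}) are still not supported by "re".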
Using Regular Expressions in Python
The first thing to do is to import the regexp module into your script with “import re”.
5
![Page 6: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/6.jpg)
How is Regex used in Python?
Call re.search(regex, subject) to apply a regex pattern to a subject string.
• The function returns None if the matching attempt fails, and a Match object otherwise.
• The Match object stores details about the part of the string matched by the regular expression pattern.
Since None evaluates to False, you can easily use re.search() in an if statement.
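A minimal sketch of that if-statement idiom. The helper name first_status_code and the sample strings are made up for illustration.

```python
import re

def first_status_code(line):
    """Return the first run of exactly three digits in line, or None."""
    m = re.search(r"\b\d{3}\b", line)
    if m:  # a Match object is truthy, None is falsy
        return m.group(0)
    return None

print(first_status_code('"GET /index.html HTTP/1.1" 301 20'))
print(first_status_code("no digits here"))
```

The Match object also exposes m.start(), m.end() and m.span() if the position of the match is needed.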
6
![Page 7: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/7.jpg)
How is Regex used in Python?
Do not confuse re.search() with re.match().
• Both functions apply a pattern to a subject string, with one important distinction: re.search() attempts the pattern at every position in the string until it finds a match, whereas re.match() attempts the pattern only at the very start of the string.
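The difference can be seen on a short made-up subject string (the key=value text below is illustrative only):

```python
import re

subject = "status=301 bytes=20"

# re.search() scans forward until the pattern matches somewhere.
found_anywhere = re.search(r"bytes=\d+", subject)

# re.match() only tries position 0, so the same pattern fails here.
found_at_start = re.match(r"bytes=\d+", subject)

# A pattern that does match at position 0 works with both functions.
starts_with_status = re.match(r"status=\d+", subject)

print(found_anywhere.group(0), found_at_start, starts_with_status.group(0))
```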
7
![Page 8: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/8.jpg)
How is Regex used in Python?
To get all matches from a string, call re.findall(regex, subject). This returns a list of all non-overlapping regex matches in the string. "Non-overlapping" means that the string is searched from left to right, and each match attempt starts beyond the previous match.
If the regex contains exactly one capturing group, re.findall() returns a list of the strings matched by that group. If it contains more than one capturing group, it returns a list of tuples,
with each tuple containing the text matched by each of the capturing groups.
The overall regex match is not included in the result, unless you place the entire regex
inside a capturing group.
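A short sketch of how the result shape of re.findall() changes with the number of capturing groups. The key=value subject string is an invented example.

```python
import re

subject = "src=10.0.0.1 dst=10.0.0.2"

# No capturing groups: a list of whole matches.
no_groups = re.findall(r"\w+=\S+", subject)

# One capturing group: a list of strings, one per match.
one_group = re.findall(r"\w+=(\S+)", subject)

# Two capturing groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w+)=(\S+)", subject)

print(no_groups)
print(one_group)
print(two_groups)
```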
8
![Page 9: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/9.jpg)
How is Regex used in Python?
More efficient than re.findall() is re.finditer(regex, subject). It returns an iterator that enables you to loop over the regex matches in the subject string: for m in re.finditer(regex, subject). The for-loop variable m is a Match object with the details of the current match.
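As a rough illustration, the loop below collects the text and position of every run of digits. The subject string reuses the three numeric fields from an Apache-style log line and is otherwise arbitrary.

```python
import re

subject = "301 20 14932"

# Each m is a Match object; group(0) is the matched text,
# start() and end() give its position in the subject.
spans = [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\d+", subject)]

for text, start, end in spans:
    print(text, start, end)
```

Because re.finditer() yields matches one at a time, the loop can stop early without scanning the rest of the string.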
9
![Page 10: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/10.jpg)
How is Regex used in Splunk?
Field extraction
| rex field=_raw "%UC_CALLMANAGER-(?<Severity>\d+)-EndPointUnregistered:"
Configure Line Breaking
LINE_BREAKER = [\r\n]+
Filtering and Routing Data to Queues
REGEX = (?m)^EventCode=(592|593)
Many more…
10
![Page 11: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/11.jpg)
Regex Testing Tools
11
• RegExr http://gskinner.com/RegExr/
• Reggy http://reggyapp.com/
• RegexPal http://regexpal.com/
• Regex Buddy http://www.regexbuddy.com/
• Lars Olav Torvik http://regex.larsolavtorvik.com/
• Rubular http://rubular.com/
![Page 12: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/12.jpg)
Regex Reference Texts
12
• http://www.regular-expressions.info/reference.html - from the creators of RegexBuddy
• Introducing Regular Expressions by Michael Fitzgerald
• Mastering Regular Expressions by Jeffrey Friedl
• Regular Expressions Cookbook by Jan Goyvaerts
• Regular Expressions Pocket Reference by Tony Stubblebine
![Page 13: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/13.jpg)
Basic Concepts of Regular Expressions
Because Knowing leads to Doing
![Page 14: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/14.jpg)
14
![Page 15: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/15.jpg)
Simple Pattern Matching
Matching String Literals
Matching Digits and Non-Digits
Matching Word and Non-Word Characters
Matching Whitespace
Matching Any Character
15
![Page 16: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/16.jpg)
Matching String Literals
Sample Apache Log
10.23.10.11 www.iamcool.com 10.100.0.11 - - [06/Dec/2012:14:39:03 -0800] "GET /Facelift/answers/swelling HTTP/1.1" 301 20 14932 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
A literal string match of the first IP address would be:
10.23.10.11
As a regex, escape the dots (10\.23\.10\.11), because an unescaped dot matches any character.
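One way to build a strictly literal pattern in Python is re.escape(), sketched here against the start of the sample log line. The string "10x23y10z11" is a made-up counterexample showing what an unescaped dot lets through.

```python
import re

log_line = "10.23.10.11 www.iamcool.com 10.100.0.11 - - [06/Dec/2012:14:39:03 -0800]"

# re.escape() backslash-escapes regex metacharacters such as the dot.
literal = re.escape("10.23.10.11")
at_start = re.match(literal, log_line)

# Unescaped, each dot matches any character, so this also matches.
loose = re.fullmatch("10.23.10.11", "10x23y10z11")
strict = re.fullmatch(literal, "10x23y10z11")

print(literal, at_start.group(0), loose is not None, strict)
```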
16
![Page 17: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/17.jpg)
Matching Digits and Non-Digits
\d or \D or [0-9]
\d - match a digit
\D - match a non-digit (any character that is not a digit, including letters, whitespace and punctuation)
[0-9] - match any single digit (called a character class)
[^0-9] - match any single character that is not a digit
17
![Page 18: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/18.jpg)
Matching Words and Non-Words
\w or \W
\w - match any word character; essentially the same as the character class [a-zA-Z0-9_] (note the underscore)
\W – match any non-word character
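A quick check of the shorthand classes in Python, using an invented sample string. Note that for Python 3 str patterns \w and \d are Unicode-aware by default; passing re.ASCII restricts them to the ASCII ranges.

```python
import re

sample = "user_42 logged-in!"

word_runs = re.findall(r"\w+", sample)      # underscore counts as a word character
non_word_runs = re.findall(r"\W+", sample)  # space, hyphen, exclamation mark
digit_runs = re.findall(r"\d+", sample)
non_digit_runs = re.findall(r"\D+", sample)  # letters count as non-digits too

print(word_runs, non_word_runs, digit_runs, non_digit_runs)
```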
18
![Page 19: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/19.jpg)
Matching Whitespace
\s or \S
\s - match whitespace (spaces, tabs, line feeds, carriage returns and other whitespace such as form feeds)
\S – match any character that is not whitespace. Same as [^\s]
19
![Page 20: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/20.jpg)
Character shorthands for whitespace
20
| Character Shorthand | Description |
| --- | --- |
| \f | Form Feed |
| \h | Horizontal Whitespace |
| \H | Not Horizontal Whitespace |
| \n | Newline |
| \r | Carriage Return |
| \t | Horizontal Tab |
| \v | Vertical Tab (whitespace) |
| \V | Not Vertical Whitespace |
![Page 21: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/21.jpg)
Matching Any Character
Dot (.)
Matches any character but line ending characters
\b – matches a word boundary without consuming any characters
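A small sketch of both behaviours in Python; the sample text is made up. By default the dot stops at a newline, and re.DOTALL lifts that restriction.

```python
import re

text = "cat concatenate\ncat"

# \b keeps "cat" from matching inside "concatenate".
whole_words = re.findall(r"\bcat\b", text)

# The dot does not cross the newline unless re.DOTALL is set.
first_line_only = re.match(r".*", text).group(0)
across_lines = re.match(r".*", text, re.DOTALL).group(0)

print(whole_words, repr(first_line_only), repr(across_lines))
```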
21
![Page 22: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/22.jpg)
Boundaries and Alternation
Matching the Beginning and End of Line
List of Regex Special Characters
Alternation and Regex Options
Subpatterns
Capturing and Named Groups
Character Classes
Negated Character Classes
22
![Page 23: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/23.jpg)
Matching Beginning and End of Line
^ OR $
^ - matches the beginning of a line
$ - matches the end of a line
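In Python, ^ and $ anchor to the whole string by default and only become per-line anchors with re.MULTILINE (or an inline (?m)). A sketch with invented event text:

```python
import re

events = "EventCode=592\nEventCode=4624\nEventCode=593"

# Default: ^ matches only at the start of the string, $ only at its end,
# so a three-line subject cannot match this single-line pattern.
whole_string = re.findall(r"^EventCode=(\d+)$", events)

# re.MULTILINE: ^ and $ match at the start and end of every line.
per_line = re.findall(r"^EventCode=(\d+)$", events, re.MULTILINE)

print(whole_string, per_line)
```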
23
![Page 24: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/24.jpg)
List of Regex Special Characters
.^*+?|(){}[]\-
. - matches any character
^ - matches beginning of the line
* - matches zero or more
+ - matches one or more
? - matches zero or one
| - used for alternation (choice of patterns to match)
() - used for grouping
{} - used as a quantifier
[] - used with character classes
\ - used to make a character literal or as a special regex character
- - hyphen is used in a character class range
24
![Page 25: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/25.jpg)
Alternation and Options
| OR ?
| - gives a choice of alternate patterns to match, e.g. (THE|The|the)
(?i) - case insensitive
(?J) - allow duplicate names
(?m) - multiline: ^ and $ match at the start and end of each line
(?s) - single line: dot also matches line endings
(?U) - match lazy by default
(?x) - ignore whitespace and comments in the pattern
(?-…) - unset or turn off options
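Python's re module accepts some of these inline options, namely (?i), (?m), (?s) and (?x); (?J) and (?U) are PCRE options that re does not recognise. A brief sketch with made-up inputs follows. In recent Python versions a global inline flag must appear at the very start of the pattern.

```python
import re

# (?i): case-insensitive matching
any_case = re.findall(r"(?i)the", "THE The the")

# (?s): let the dot match a newline as well
dot_all = re.match(r"(?s)a.b", "a\nb")

# (?x): whitespace and # comments inside the pattern are ignored
octets = re.compile(r"""(?x)
    (\d{1,3})   # first octet
    \.          # literal dot
    (\d{1,3})   # second octet
""")
first_two = octets.match("10.23.10.11").groups()

print(any_case, dot_all is not None, first_two)
```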
25
![Page 26: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/26.jpg)
Subpatterns
Group(s) within a group
(THE|The|the) -has three subpatterns
(t|T)h(e|eir) - matches the, The, their, Their
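A sketch of the two-subpattern example, assuming the intended pattern is (t|T)h(e|eir). The \b word boundaries and the sample sentence are additions for illustration, so that words such as "then" are not matched.

```python
import re

pattern = re.compile(r"\b(t|T)h(e|eir)\b")
text = "The cat took their hat to the vet, then left."

# group(0) is the whole match; groups() holds what each subpattern matched.
matches = [m.group(0) for m in pattern.finditer(text)]
parts = [m.groups() for m in pattern.finditer(text)]

print(matches)
print(parts)
```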
26
![Page 27: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/27.jpg)
Capturing and Named Groups
() OR (?<name>…) OR (?P<name>…)
Groups store their matched content in memory
(it is) (time to eat)
The first group is referenced as $1 and the second as $2
(?<Severity>\d)
Splunk creates a field named Severity from this named group
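The same ideas in Python, as a rough sketch. Python's re module only accepts the (?P<name>…) spelling for named groups, and numbered groups are read with group(1), group(2) rather than $1, $2. The log text after the colon is invented for the example.

```python
import re

line = "%UC_CALLMANAGER-3-EndPointUnregistered: hypothetical message text"

# Named group, Python spelling: (?P<Severity>...)
m = re.search(r"%UC_CALLMANAGER-(?P<Severity>\d+)-EndPointUnregistered:", line)
severity = m.group("Severity")
fields = m.groupdict()

# Numbered groups
numbered = re.match(r"(it is) (time to eat)", "it is time to eat")
first, second = numbered.group(1), numbered.group(2)

print(severity, fields, first, second)
```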
27
![Page 28: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/28.jpg)
Character Classes
[]
[aeiou] - matches any single character listed inside the brackets
[0-9] - matches a range of characters, using a hyphen
[a-zA-Z0-9] - matches any single alphanumeric character
28
![Page 29: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/29.jpg)
Negated Character Classes
[^…]
*** Super important – especially for Splunk field extractions ***
[^aeiou] - matches any character that is NOT a lowercase vowel (consonants, but also digits, punctuation and whitespace)
[^\s] - matches any character that is not whitespace
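A sketch of why negated classes suit field extraction: the value is defined as "everything up to the delimiter". The key=value event below is invented, and the field names are illustrative only.

```python
import re

line = 'src=10.0.0.1 user="alice smith" action=login'

# Value runs until the next whitespace character.
src = re.search(r"src=([^\s]+)", line).group(1)

# Value runs until the closing double quote, so embedded spaces are kept.
user = re.search(r'user="([^"]*)"', line).group(1)

print(src, user)
```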
29
![Page 30: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/30.jpg)
Quantifiers
Greedy, Lazy, Possessive
Matching a certain number of times
30
![Page 31: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/31.jpg)
Greedy, Lazy, Possessive
* + ?
* - match zero or more times
.* - matches as many characters as it can, up to the end of the line (want to avoid this)
+ - match one or more times
\d+ - match all of the digits until there aren’t any more (greedy)
? - match 0 or 1 of the preceding token
colou?r - matches either color or colour
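Greedy versus lazy matching can be seen on a made-up snippet of markup; adding ? after a quantifier makes it lazy.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, then backs off to the last ">".
greedy = re.search(r"<.+>", html).group(0)

# Lazy: .+? grabs as little as possible, stopping at the first ">".
lazy = re.search(r"<.+?>", html).group(0)

# ? as a quantifier: the "u" is optional.
spellings = re.findall(r"colou?r", "color colour colr")

print(greedy, lazy, spellings)
```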
31
![Page 32: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/32.jpg)
Matching a Certain Number of Times
{}
\d{3} - matches exactly 3 digits
\d{1,3} - matches a range of 1 to 3 digits
\d{1,} -same as \d+
\d{0,} -same as \d*
\d{0,1} -same as \d?
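The equivalences above can be checked with a small helper; accepts() and the sample strings are hypothetical names chosen for this sketch.

```python
import re

samples = ["", "7", "42", "123", "1234", "12a"]

def accepts(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(accepts(r"\d{3}"))
print(accepts(r"\d{1,3}"))
print(accepts(r"\d{1,}") == accepts(r"\d+"))
print(accepts(r"\d{0,}") == accepts(r"\d*"))
print(accepts(r"\d{0,1}") == accepts(r"\d?"))
```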
32
![Page 34: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/34.jpg)
Optimized Regular Expressions
Because fast is elegant!
![Page 35: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/35.jpg)
Optimize Regular Expressions
Good: (whiskey)
Better: (?:whiskey)
Capture groups add unnecessary overhead and impact overall performance; use them only when necessary.
![Page 36: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/36.jpg)
Optimize Regular Expressions
Good: splunk|splash
Better: spl(?:unk|ash)
Try to “factor” on the left, when you can, while exposing required text. Less alternation is better.
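Factoring should change only how the engine works, not what it matches. A quick sanity check along those lines, with arbitrary sample strings (it makes no claim about speed, which depends on the engine and the input):

```python
import re

plain = re.compile(r"splunk|splash")
factored = re.compile(r"spl(?:unk|ash)")

samples = ["splunk", "splash", "splat", "I use splunk, not splash", ""]

# Both patterns should find exactly the same matches in every sample.
results_agree = all(plain.findall(s) == factored.findall(s) for s in samples)

print(results_agree)
```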
![Page 37: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/37.jpg)
Optimize Regular Expressions
Good: (?:aussie$|gypsie$)
Better: (?:aus|gyp)sie$
Try to “factor” on the right when input text is close to end of the line. Most regex engines will anchor at end of line when “$” is present.
![Page 38: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/38.jpg)
Optimize Regular Expressions
Good: 0{3,7}
Better: 0000{0,4}
Typically, exposing required or literal text makes the engine execute the regex faster.
![Page 39: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/39.jpg)
Optimize Regular Expressions
Good: (.)*
Better: .*
Useless parentheses add unnecessary overhead. As above, use them only when necessary.
![Page 40: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/40.jpg)
Optimize Regular Expressions
Good: matty[:]
Better: matty:
The character class/set (indicated by []) will add unnecessary overhead when not needed.
![Page 41: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/41.jpg)
Optimize Regular Expressions
Good: ^genti|^collar
Better: ^(?:genti|collar)
Anchoring the regex at the beginning of the line will result in improved performance with most regex engines.
![Page 42: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/42.jpg)
Optimize Regular Expressions
Good: delaney$|connery$
Better: (delaney|connery)$
I said, anchor the regex!
![Page 43: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/43.jpg)
Optimize Regular Expressions
Good: ^src.*:
Better: ^src[^:]*:
Using a negated character class/set instead of lazy/greedy quantifiers will typically result in faster regexes. Lazy/greedy quantifiers will make the regex engine backtrack, which ultimately impacts overall performance.
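One caveat worth checking before swapping one form for the other: the two patterns agree when the subject contains a single colon, but differ when there are several. A sketch with an invented line:

```python
import re

line = "src=10.0.0.1: dst=10.0.0.2: done"

# Greedy .* runs to the end of the line and backtracks to the LAST colon.
greedy = re.match(r"^src.*:", line).group(0)

# The negated class cannot step over a colon, so it stops at the FIRST one.
negated = re.match(r"^src[^:]*:", line).group(0)

print(greedy)
print(negated)
```

For field extraction the negated-class result is usually the intended one, in addition to being cheaper to compute.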
![Page 44: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/44.jpg)
Optimize Regular Expressions
Good: bride|brian
Better: bri(?:de|an)
Full alternation is more expensive than partial alternation. Also, in this case the regex engine will alternate only AFTER ‘bri’ has been matched.
![Page 45: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/45.jpg)
Optimize Regular Expressions
Good: (?:edu|com|net|…)
Better: (?:com|edu|net|…)
Leading the engine to a match by placing the most popular match first may result in faster execution in some engines.
![Page 46: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/46.jpg)
Optimize Regular Expressions
Good: ^.*(answer)
Better: ^.{42}(answer)
Specifying an exact position inside the string, and so leading the engine to a match, will help improve performance drastically compared to using a simple greedy/lazy quantifier.
![Page 47: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/47.jpg)
Optimize Regular Expressions
Good: .*?a
Better: ^.*a
If ‘a’ is near the end of the input string, the greedy version will match faster, as less backtracking will be required.
![Page 48: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/48.jpg)
Optimize Regular Expressions
Good: .*a
Better: ^.*?a
If ‘a’ is near the beginning of the input string, the lazy version will match faster.
![Page 49: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/49.jpg)
Optimize Regular Expressions
Good: :[^:]*:
Better: :[^:]*+:
Example: in ‘:destination’ the second regex fails faster.
![Page 50: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/50.jpg)
Optimize Regular Expressions
Good: :[^:]*:
Better: :(?>[^:]*):
Same as above, using different notation. Explanation: atomic grouping or possessive quantifiers instruct the regex engine not to keep the states captured by * or +, therefore preventing it from unsuccessfully backtracking and in turn failing faster.
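Python's re module only accepts the (?>…) and *+ notations from version 3.11 onwards. On older versions, a common workaround is to emulate an atomic group with a lookahead plus a backreference, since a lookahead is never re-entered once it has matched. The sketch below assumes that workaround, not the notation on the slide, and uses invented subject strings.

```python
import re
import sys

plain = re.compile(r":[^:]*:")

# (?=([^:]*)) captures the longest run of non-colons without consuming it;
# \1 then consumes exactly that text, and the engine cannot give any of it back.
emulated_atomic = re.compile(r":(?=([^:]*))\1:")

fails_plain = plain.search(":destination")
fails_atomic = emulated_atomic.search(":destination")
hit = emulated_atomic.search("a:dst:b").group(0)

# The native atomic-group syntax is compiled only where it is supported.
if sys.version_info >= (3, 11):
    native_atomic = re.compile(r":(?>[^:]*):")
    assert native_atomic.search("a:dst:b").group(0) == hit

print(fails_plain, fails_atomic, hit)
```

Both forms give the same answers as the plain pattern; the difference is only in how much backtracking is attempted before a failure is reported.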
![Page 51: Regex is Fun](https://reader033.fdocuments.net/reader033/viewer/2022061610/5681659c550346895dd87697/html5/thumbnails/51.jpg)
Python for the Masses