Regular Expresions

27
1 Regular expressions Basic Syntax Reference Characters Character Description Example Any character except [\^$.|?*+() All characters except the listed special characters match a single instance of themselves. { and } are literal characters, unless they're part of a valid regular expression token (e.g. the {n} quantifier). a matches a \ (backslash) followed by any of [\^$.|?*+(){} A backslash escapes special characters to suppress their special meaning. \+ matches + \Q...\E Matches the characters between \Q and \E literally, suppressing the meaning of special characters. \Q+-*/\E matches +-*/ \xFF where FF are 2 hexadecimal digits Matches the character with the specified ASCII/ANSI value, which depends on the code page used. Can be used in character classes. \xA9 matches © when using the Latin-1 code page. \n, \r and \t Match an LF character, CR character and a tab character respectively. Can be used in character classes. \r\n matches a DOS/Windows CRLF line break. \a, \e, \f and \v Match a bell character (\x07), escape character (\x1B), form feed (\x0C) and vertical tab (\x0B) respectively. Can be used in character classes. \cA through \cZ Match an ASCII character Control+A through Control+Z, equivalent to \x01 through \x1A. Can be used in character classes. \cM\cJ matches a DOS/Windows CRLF line break. Character Classes or Character Sets [abc] Character Description Example [ (opening square bracket) Starts a character class. A character class matches a single character out of all the possibilities offered by the character class. Inside a character class, different rules apply. The rules in this section are only valid inside character classes. The rules outside this section are not valid in character classes, except for a few character escapes that are indicated with "can be used inside character classes".

description

Regular Expresions

Transcript of Regular Expresions

Page 1: Regular Expresions

1

Regular expressions Basic Syntax Reference

Characters

Character Description Example

Any character except [\^$.|?*+() All characters except the listed special characters match a single instance of themselves. { and } are literal characters, unless they're part of a valid regular expression token (e.g. the {n} quantifier).

a matches a

\ (backslash) followed by any of [\^$.|?*+(){}

A backslash escapes special characters to suppress their special meaning. \+ matches +

\Q...\E Matches the characters between \Q and \E literally, suppressing the meaning of special characters.

\Q+-*/\E matches +-*/

\xFF where FF are 2 hexadecimal digits

Matches the character with the specified ASCII/ANSI value, which depends on the code page used. Can be used in character classes.

\xA9 matches © when using the Latin-1 code page.

\n, \r and \t Match an LF character, CR character and a tab character respectively. Can be used in character classes.

\r\n matches a DOS/Windows CRLF line break.

\a, \e, \f and \v Match a bell character (\x07), escape character (\x1B), form feed (\x0C) and vertical tab (\x0B) respectively. Can be used in character classes.

\cA through \cZ Match an ASCII character Control+A through Control+Z, equivalent to \x01 through \x1A. Can be used in character classes.

\cM\cJ matches a DOS/Windows CRLF line break.

Character Classes or Character Sets [abc]

Character Description Example

[ (opening square bracket) Starts a character class. A character class matches a single character out of all the possibilities offered by the character class. Inside a character class, different rules apply. The rules in this section are only valid inside character classes. The rules outside this section are not valid in character classes, except for a few character escapes that are indicated with "can be used inside character classes".

Page 2: Regular Expresions

2

Any character except ^-]\ add that character to the possible matches for the character class.

All characters except the listed special characters. [abc] matches a, b or c

\ (backslash) followed by any of ^-]\

A backslash escapes special characters to suppress their special meaning. [\^\]] matches ^ or ]

- (hyphen) except immediately after the opening [

Specifies a range of characters. (Specifies a hyphen if placed immediately after the opening [)

[a-zA-Z0-9] matches any letter or digit

^ (caret) immediately after the opening [

Negates the character class, causing it to match a single character not listed in the character class. (Specifies a caret if placed anywhere except after the opening [)

[^a-d] matches x (any character except a, b, c or d)

\d, \w and \s Shorthand character classes matching digits, word characters (letters, digits, and underscores), and whitespace (spaces, tabs, and line breaks). Can be used inside and outside character classes.

[\d\s] matches a character that is a digit or whitespace

\D, \W and \S Negated versions of the above. Should be used only outside character classes. (Can be used inside, but that is confusing.)

\D matches a character that is not a digit

[\b] Inside a character class, \b is a backspace character. [\b\t] matches a backspace or tab character

Dot

Character Description Example

. (dot) Matches any single character except line break characters \r and \n. Most regex flavors have an option to make the dot match line break characters too.

. matches x or (almost) any other character

Page 3: Regular Expresions

3

Anchors

Character Description Example

^ (caret) Matches at the start of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the caret match after line breaks (i.e. at the start of a line in a file) as well.

^. matches a in abc\ndef. Also matches d in "multi-line" mode.

$ (dollar) Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the dollar match before line breaks (i.e. at the end of a line in a file) as well. Also matches before the very last line break if the string ends with a line break.

.$ matches f in abc\ndef. Also matches c in "multi-line" mode.

\A Matches at the start of the string the regex pattern is applied to. Matches a position rather than a character. Never matches after line breaks.

\A. matches a in abc

\Z Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Never matches before line breaks, except for the very last line break if the string ends with a line break.

.\Z matches f in abc\ndef

\z Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Never matches before line breaks.

.\z matches f in abc\ndef

Word Boundaries

Character Description Example

\b Matches at the position between a word character (anything matched by \w) and a non-word character (anything matched by [^\w] or \W) as well as at the start and/or end of the string if the first and/or last characters in the string are word characters.

.\b matches c in abc

\B Matches at the position between two word characters (i.e the position between \w\w) as well as at the position between two non-word characters (i.e. \W\W).

\B.\B matches b in abc

Alternation

Character Description Example

| (pipe) Causes the regex engine to match either the part on the left side, or the part on the right side. Can be strung together into a series of options.

abc|def|xyz matches abc, def or xyz

| (pipe) The pipe has the lowest precedence of all operators. Use grouping to alternate only part of the regular expression.

abc(def|xyz) matches abcdef or abcxyz

Page 4: Regular Expresions

4

Quantifiers

Character Description Example

? (question mark)

Makes the preceding item optional. Greedy, so the optional item is included in the match if possible. abc? matches ab or abc

?? Makes the preceding item optional. Lazy, so the optional item is excluded in the match if possible. This construct is often excluded from documentation because of its limited use.

abc?? matches ab or abc

* (star) Repeats the previous item zero or more times. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is not matched at all.

".*" matches "def" "ghi" in abc "def" "ghi" jkl

*? (lazy star) Repeats the previous item zero or more times. Lazy, so the engine first attempts to skip the previous item, before trying permutations with ever increasing matches of the preceding item.

".*?" matches "def" in abc "def" "ghi" jkl

+ (plus) Repeats the previous item once or more. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is matched only once.

".+" matches "def" "ghi" in abc "def" "ghi" jkl

+? (lazy plus) Repeats the previous item once or more. Lazy, so the engine first matches the previous item only once, before trying permutations with ever increasing matches of the preceding item.

".+?" matches "def" in abc "def" "ghi" jkl

{n} where n is an integer >= 1

Repeats the previous item exactly n times. a{3} matches aaa

{n,m} where n >= 0 and m >= n

Repeats the previous item between n and m times. Greedy, so repeating m times is tried before reducing the repetition to n times.

a{2,4} matches aaaa, aaa or aa

{n,m}? where n >= 0 and m >= n

Repeats the previous item between n and m times. Lazy, so repeating n times is tried before increasing the repetition to m times.

a{2,4}? matches aa, aaa or aaaa

{n,} where n >= 0 Repeats the previous item at least n times. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is matched only n times.

a{2,} matches aaaaa in aaaaa

{n,}? where n >= 0

Repeats the previous item n or more times. Lazy, so the engine first matches the previous item n times, before trying permutations with ever increasing matches of the preceding item.

a{2,}? matches aa in aaaaa

Page 5: Regular Expresions

5

Advanced Syntax Reference

Grouping and Backreferences

Syntax Description Example

(regex) Round brackets group the regex between them. They capture the text matched by the regex inside them that can be reused in a backreference, and they allow you to apply regex operators to the entire grouped regex.

(abc){3} matches abcabcabc. First group matches abc.

(?:regex) Non-capturing parentheses group the regex so you can apply regex operators, but do not capture anything and do not create backreferences.

(?:abc){3} matches abcabcabc. No groups.

\1 through \9 Substituted with the text matched between the 1st through 9th pair of capturing parentheses. Some regex flavors allow more than 9 backreferences.

(abc|def)=\1 matches abc=abc or def=def, but not abc=def or def=abc.

Modifiers

Syntax Description Example

(?i) Turn on case insensitivity for the remainder of the regular expression. (Older regex flavors may turn it on for the entire regex.)

te(?i)st matches teST but not TEST.

(?-i) Turn off case insensitivity for the remainder of the regular expression. (?i)te(?-i)st matches TEst but not TEST.

(?s) Turn on "dot matches newline" for the remainder of the regular expression. (Older regex flavors may turn it on for the entire regex.)

(?-s) Turn off "dot matches newline" for the remainder of the regular expression.

(?m) Caret and dollar match after and before newlines for the remainder of the regular expression. (Older regex flavors may apply this to the entire regex.)

(?-m) Caret and dollar only match at the start and end of the string for the remainder of the regular expression.

(?x) Turn on free-spacing mode to ignore whitespace between regex tokens, and allow # comments.

(?-x) Turn off free-spacing mode.

(?i-sm) Turns on the options "i" and "m", and turns off "s" for the remainder of the regular expression. (Older regex flavors may apply this to the entire regex.)

Page 6: Regular Expresions

6

(?i-sm:regex) Matches the regex inside the span with the options "i" and "m" turned on, and "s" turned off. (?i:te)st matches TEst but not TEST.

Atomic Grouping and Possessive Quantifiers

Syntax Description Example

(?>regex) Atomic groups prevent the regex engine from backtracking back into the group (forcing the group to discard part of its match) after a match has been found for the group. Backtracking can occur inside the group before it has matched completely, and the engine can backtrack past the entire group, discarding its match entirely. Eliminating needless backtracking provides a speed increase. Atomic grouping is often indispensable when nesting quantifiers to prevent a catastrophic amount of backtracking as the engine needlessly tries pointless permutations of the nested quantifiers.

x(?>\w+)x is more efficient than x\w+x if the second x cannot be matched.

?+, *+, ++ and {m,n}+

Possessive quantifiers are a limited yet syntactically cleaner alternative to atomic grouping. Only available in a few regex flavors. They behave as normal greedy quantifiers, except that they will not give up part of their match for backtracking.

x++ is identical to (?>x+)

Lookaround

Syntax Description Example

(?=regex) Zero-width positive lookahead. Matches at a position where the pattern inside the lookahead can be matched. Matches only the position. It does not consume any characters or expand the match. In a pattern like one(?=two)three, both two and three have to match at the position where the match of one ends.

t(?=s) matches the second t in streets.

(?!regex) Zero-width negative lookahead. Identical to positive lookahead, except that the overall match will only succeed if the regex inside the lookahead fails to match.

t(?!s) matches the first t in streets.

(?<=regex) Zero-width positive lookbehind. Matches at a position if the pattern inside the lookahead can be matched ending at that position (i.e. to the left of that position). Depending on the regex flavor you're using, you may not be able to use quantifiers and/or alternation inside lookbehind.

(?<=s)t matches the first t in streets.

(?<!regex) Zero-width negative lookbehind. Matches at a position if the pattern inside the lookahead cannot be matched ending at that position.

(?<!s)t matches the second t in streets.

Page 7: Regular Expresions

7

Continuing from The Previous Match

Syntax Description Example

\G Matches at the position where the previous match ended, or the position where the current match attempt started (depending on the tool or regex flavor). Matches at the start of the string during the first match attempt.

\G[a-z] first matches a, then matches b and then fails to match in ab_cd.

Conditionals

Syntax Description Example

(?(?=regex)then|else) If the lookahead succeeds, the "then" part must match for the overall regex to match. If the lookahead fails, the "else" part must match for the overall regex to match. Not just positive lookahead, but all four lookarounds can be used. Note that the lookahead is zero-width, so the "then" and "else" parts need to match and consume the part of the text matched by the lookahead as well.

(?(?<=a)b|c) matches the second b and the first c in babxcac

(?(1)then|else) If the first capturing group took part in the match attempt thus far, the "then" part must match for the overall regex to match. If the first capturing group did not take part in the match, the "else" part must match for the overall regex to match.

(a)?(?(1)b|c) matches ab, the first c and the second c in babxcac

Comments

Syntax Description Example

(?#comment) Everything between (?# and ) is ignored by the regex engine. a(?#foobar)b matches ab

Page 8: Regular Expresions

8

Unicode Syntax Reference

Unicode Characters

Character Description Example

\X Matches a single Unicode grapheme, whether encoded as a single code point or multiple code points using combining marks. A grapheme most closely resembles the everyday concept of a "character".

\X matches à encoded as U+0061 U+0300, à encoded as U+00E0, ©, etc.

\uFFFF where FFFF are 4 hexadecimal digits

Matches a specific Unicode code point. Can be used inside character classes. \u00E0 matches à encoded as U+00E0 only. \u00A9 matches ©

\x{FFFF} where FFFF are 1 to 4 hexadecimal digits

Perl syntax to match a specific Unicode code point. Can be used inside character classes.

\x{E0} matches à encoded as U+00E0 only. \x{A9} matches ©

Unicode Properties, Scripts and Blocks

Character Description Example

\p{L} or \p{Letter} Matches a single Unicode code point that has the property "letter". See Unicode Character Properties in the tutorial for a complete list of properties. Each Unicode code point has exactly one property. Can be used inside character classes.

\p{L} matches à encoded as U+00E0; \p{S} matches ©

\p{Arabic} Matches a single Unicode code point that is part of the Unicode script "Arabic". See Unicode Scripts in the tutorial for a complete list of scripts. Each Unicode code point is part of exactly one script. Can be used inside character classes.

\p{Thai} matches one of 83 code points in Thai script, from � until �

\p{InBasicLatin} Matches a single Unicode code point that is part of the Unicode block "BasicLatin". See Unicode Blocks in the tutorial for a complete list of blocks. Each Unicode code point is part of exactly one block. Blocks may contain unassigned code points. Can be used inside character classes.

\p{InLatinExtended-A} any of the code points in the block U+100 until U+17F (Ā until ſ)

\P{L} or \P{Letter} Matches a single Unicode code point that does not have the property "letter". You can also use \P to match a code point that is not part of a particular Unicode block or script. Can be used inside character classes.

\P{L} matches ©

Page 9: Regular Expresions

9

Replacement Text Reference

Syntax Using Backslashes

Feature JGsoft .NET Java Perl ECMA Python Ruby Tcl PHP ereg

PHP preg

REAL basic Oracle Post

gres XPath R

\& (whole regex match) YES no no no no no YES no no no no no YES error no

\0 (whole regex match) YES no no no no no YES YES YES YES YES no no error no

\1 through \9 (backreference) YES no no depr. no YES YES YES YES YES YES YES YES error YES

\10 through \99 (backreference) YES no no no no YES no no no YES YES no no error no

\10 through \99 treated as \1 through \9 (and a literal digit) if fewer than 10 groups

YES n/a n/a n/a n/a no n/a n/a n/a no no n/a n/a error n/a

\g<group> (named backreference) YES no no no no YES no no no no no no no error no

\` (backtick; subject text to the left of the match) YES no no no no no YES no no no no no no error no

\' (straight quote; subject text to the right of the match) YES no no no no no YES no no no no no no error no

\+ (highest-numbered participating group) YES no no no no no YES no no no no no no error no

Backslash escapes one backslash and/or dollar YES no YES YES no YES YES YES YES YES YES YES YES YES YES

Unescaped backslash as literal text YES YES no YES YES YES YES YES YES YES no YES YES error no

Page 10: Regular Expresions

10

Character Escapes

Feature JGsoft .NET Java Perl ECMA Python Ruby Tcl PHP ereg

PHP preg

REAL basic Oracle Post

gres XPath R

\u0000 through \uFFFF (Unicode character) YES no string no string u

string no string no no no no no error string

\x{0} through \x{FFFF} (Unicode character) YES no no string no no no no no no no no no error no

\x00 through \xFF (ASCII character) YES no no string string non-r

string no string string string YES no no error string

Syntax Using Dollar Signs

Feature JGsoft .NET Java Perl ECMA Python Ruby Tcl PHP ereg

PHP preg

REAL basic Oracle Post

gres XPath R

$& (whole regex match) YES YES error YES YES no no no no no YES no no error no

$0 (whole regex match) YES YES YES no no no no no no YES YES no no YES no

$1 through $9 (backreference) YES YES YES YES YES no no no no YES YES no no YES no

$10 through $99 (backreference) YES YES YES YES YES no no no no YES YES no no YES no

$10 through $99 treated as $1 through $9 (and a literal digit) if fewer than 10 groups

YES no YES no YES n/a n/a n/a n/a no no n/a n/a YES n/a

${1} through ${99} (backreference) YES YES error YES no no no no no YES no no no error no

${group} (named backreference) YES YES error no no no no no no no no no no error no

$` (backtick; subject text to the left of the match) YES YES error YES YES no no no no no YES no no error no

$' (straight quote; subject text to the right of the match) YES YES error YES YES no no no no no YES no no error no

$_ (entire subject string) YES YES error YES IE only no no no no no no no no error no

Page 11: Regular Expresions

11

$+ (highest-numbered participating group) YES no error YES no no no no no no no no no error no

$+ (highest-numbered group in the regex) no YES error no IE and

Firefox no no no no no no no no error no

$$ (escape dollar with another dollar) YES YES error no YES no no no no no no no no error no

$ (unescaped dollar as literal text) YES YES error no YES YES YES YES YES YES buggy YES YES error YES

Tokens Without a Backslash or Dollar

Feature JGsoft .NET Java Perl ECMA Python Ruby Tcl PHP ereg

PHP preg

REAL basic Oracle Post

gres XPath R

& (whole regex match) no no no no no no no YES no no no no no no no

General Replacement Text Behavior

Feature JGsoft .NET Java Perl ECMA Python Ruby Tcl PHP ereg

PHP preg

REAL basic Oracle Post

gres XPath R

Backreferences to non-existent groups are silently removed YES no error YES no error YES YES no YES YES YES YES YES error

A flavor can have four levels of support (or non-support) for a particular token:

• A "YES" in the table below indicates the token will be substituted. • A "no" indicates the token will remain in the replacement as literal text. Note that languages that use variable interpolation in strings may still

replace tokens indicated as unsupported below, if the syntax of the token corresponds with the variable interpolation syntax. E.g. in Perl, $0 is replaced with the name of the script.

• The "string" label indicates that the syntax is supported by string literals in the language's source code. For languages like PHP that have interpolated (double quotes) and non-interpolated (single quotes) variants, you'll need to use the interpolated string style. String-level support also means that the character escape won't be interpreted for replacement text typed in by the user or read from a file.

• Finally, "error" indicates the token will result in an error condition or exception, preventing any replacements being made at all.

Page 12: Regular Expresions

12

• JGsoft: This flavor is used by the Just Great Software products, including PowerGREP, EditPad Pro and AceText. It is also used by the TPerlRegEx Delphi component and the RegularExrpessions and RegularExpressionsCore units in Delphi XE and C++Builder XE.

• .NET: This flavor is used by programming languages based on the Microsoft .NET framework versions 1.x, 2.0 or 3.0. It is generally also the regex flavor used by applications developed in these programming languages.

• Java: The regex flavor of the java.util.regex package, available in the Java 4 (JDK 1.4.x) and later. A few features were added in Java 5 (JDK 1.5.x) and Java 6 (JDK 1.6.x). It is generally also the regex flavor used by applications developed in Java.

• Perl: The regex flavor used in the Perl programming language, as of version 5.8.

• ECMA (JavaScript): The regular expression syntax defined in the 3rd edition of the ECMA-262 standard, which defines the scripting language commonly known as JavaScript. The VBscript RegExp object, which is also commonly used in VB 6 applications uses the same implementation with the same search-and-replace features. However, VBScript and VB strings don't support \xFF and \uFFFF escapes.

• Python: The regex flavor supported by Python's built-in re module.

• Ruby: The regex flavor built into the Ruby programming language.

• Tcl: The regex flavor used by the regsub command in Tcl 8.2 and 8.4, dubbed Advanced Regular Expressions in the Tcl man pages. wxWidgets uses the same flavor.

• PHP ereg: The replacement text syntax used by the ereg_replace and eregi_replace functions in PHP.

• PHP preg: The replacement text syntax used by the preg_replace function in PHP.

• REALbasic: The replacement text syntax used by the ReplaceText property of the RegEx class in REALbasic.

• Oracle: The replacement text syntax used by the REGEXP_REPLACE function in Oracle Database 10g.

• Postgres: The replacement text syntax used by the regexp_replace function in PostgreSQL.

• XPath: The replacement text syntax used by the fn:replace function in XQuery and XPath.

• R: The replacement text syntax used by the sub and gsub functions in the R language. Though R supports three regex flavors, it has only one replacement syntax for all three.

Page 13: Regular Expresions

13

Highest-Numbered Capturing Group

The $+ token is listed twice, because it doesn't have the same meaning in the languages that support it. It was introduced in Perl, where the $+ variable holds the text matched by the highest-numbered capturing group that actually participated in the match. In several languages and libraries that intended to copy this feature, such as .NET and JavaScript, $+ is replaced with the highest-numbered capturing group, whether it participated in the match or not.

E.g. in the regex a(\d)|x(\w) the highest-numbered capturing group is the second one. When this regex matches a4, the first capturing group matches 4, while the second group doesn't participate in the match attempt at all. In Perl, $+ will hold the 4 matched by the first capturing group, which is the highest-numbered group that actually participated in the match. In .NET or JavaScript, $+ will be substituted with nothing, since the highest-numbered group in the regex didn't capture anything. When the same regex matches xy, Perl, .NET and JavaScript will all store y in $+.

Also note that .NET numbers named capturing groups after all non-named groups. This means that in .NET, $+ will always be substituted with the text matched by the last named group in the regex, whether it is followed by non-named groups or not, and whether it actually participated in the match or not.

Page 14: Regular Expresions

14

Syntax Reference for Specific Regex Flavors

.NET Syntax for Named Capture and Backreferences

Syntax Description Example

(?<name>regex) Round brackets group the regex between them. They capture the text matched by the regex inside them that can be referenced by the name between the sharp brackets. The name may consist of letters and digits.

(?'name'regex) Round brackets group the regex between them. They capture the text matched by the regex inside them that can be referenced by the name between the single quotes. The name may consist of letters and digits.

\k<name> Substituted with the text matched by the capturing group with the given name. (?<group>abc|def)=\k<group> matches abc=abc or def=def, but not abc=def or def=abc.

\k'name' Substituted with the text matched by the capturing group with the given name. (?'group'abc|def)=\k'group' matches abc=abc or def=def, but not abc=def or def=abc.

(?(name)then|else) If the capturing group "name" took part in the match attempt thus far, the "then" part must match for the overall regex to match. If the capturing group "name" did not take part in the match, the "else" part must match for the overall regex to match.

(?<group>a)?(?(group)b|c) matches ab, the first c and the second c in babxcac

Python Syntax for Named Capture and Backreferences

Syntax Description Example

(?P<name>regex) Round brackets group the regex between them. They capture the text matched by the regex inside them that can be referenced by the name between the sharp brackets. The name may consist of letters and digits.

(?P=name) Substituted with the text matched by the capturing group with the given name. Not a group, despite the syntax using round brackets.

(?P<group>abc|def)=(?P=group) matches abc=abc or def=def, but not abc=def or def=abc.

XML Character Classes

Syntax Description Example

\i Matches any character that may be the first character of an XML name, i.e. [_:A-Za-z].

Page 15: Regular Expresions

15

\c \c matches any character that may occur after the first character in an XML name, i.e. [-._:A-Za-z0-9]

\i\c* matches an XML name like xml:schema

\I Matches any character that cannot be the first character of an XML name, i.e. [^_:A-Za-z].

\C Matches any character that cannot occur in an XML name, i.e. [^-._:A-Za-z0-9].

[abc-[xyz]] Subtracts character class "xyz" from character class "abc". The result matches any single character that occurs in the character class "abc" but not in the character class "xyz".

[a-z-[aeiou]] matches any letter that is not a vowel (i.e. a consonant).

POSIX Bracket Expressions

Syntax Description Example

[:alpha:] Matches one character from a POSIX character class. Can only be used in a bracket expression.

[[:digit:][:lower:]] matches one of 0 through 9 or a through z

[.span-ll.] Matches a POSIX collation sequence. Can only be used in a bracket expression. [[.span-ll.]] matches ll in the Spanish locale

[=x=] Matches a POSIX character equivalence. Can only be used in a bracket expression.

[[=e=]] matches e, é, è and ê in the French locale

Page 16: Regular Expresions

16

Regular Expression Flavor Comparison

The table below compares which regular expression flavors support which regex features and syntax. The features are listed in the same order as in the regular expression reference.

The comparison shows regular expression flavors rather than particular applications or programming languages implementing one of those regular expression flavors.

• JGsoft: This flavor is used by the Just Great Software products, including PowerGREP and EditPad Pro. • .NET: This flavor is used by programming languages based on the Microsoft .NET framework versions 1.x, 2.0 or 3.x. It is generally also the

regex flavor used by applications developed in these programming languages. • Java: The regex flavor of the java.util.regex package, available in the Java 4 (JDK 1.4.x) and later. A few features were added in Java 5 (JDK

1.5.x) and Java 6 (JDK 1.6.x). It is generally also the regex flavor used by applications developed in Java. • Perl: The regex flavor used in the Perl programming language, versions 5.6 and 5.8. Versions prior to 5.6 do not support Unicode. • PCRE: The open source PCRE library. The feature set described here is available in PCRE 5.x and 6.x. PCRE is the regex engine used by the

TPerlRegEx Delphi component and the RegularExrpessions and RegularExpressionsCore units in Delphi XE and C++Builder XE. • ECMA (JavaScript): The regular expression syntax defined in the 3rd edition of the ECMA-262 standard, which defines the scripting language

commonly known as JavaScript. • Python: The regex flavor supported by Python's built-in re module. • Ruby: The regex flavor built into the Ruby programming language. • Tcl ARE: The regex flavor developed by Henry Spencer for the regexp command in Tcl 8.2 and 8.4, dubbed Advanced Regular Expressions. • POSIX BRE: Basic Regular Expressions as defined in the IEEE POSIX standard 1003.2. • POSIX ERE: Extended Regular Expressions as defined in the IEEE POSIX standard 1003.2. • GNU BRE: GNU Basic Regular Expressions, which are POSIX BRE with GNU extensions, used in the GNU implementations of classic UNIX

tools. • GNU ERE: GNU Extended Regular Expressions, which are POSIX ERE with GNU extensions, used in the GNU implementations of classic UNIX

tools. • XML: The regular expression flavor defined in the XML Schema standard. • XPath: The regular expression flavor defined in the XQuery 1.0 and XPath 2.0 Functions and Operators standard.

Applications and languages implementing one of the above flavors are:

• AceText: Version 2 and later use the JGsoft engine. Version 1 did not support regular expressions at all. • ActionScript: ActionScript is the language for programming Adobe Flash (formerly Macromedia Flash). ActionScript 3.0 and later supports the

regex flavor listed as "ECMA" in the table below. • awk: The awk UNIX tool and programming language uses POSIX ERE. • C#: As a .NET programming language, C# can use the System.Text.RegularExpressions classes, listed as ".NET" below.

Page 17: Regular Expresions

17

• Delphi for .NET: As a .NET programming language, the .NET version of Delphi can use the System.Text.RegularExpressions classes, listed as ".NET" below.

• Delphi for Win32: Delphi for Win32 does not have built-in regular expression support. Many free PCRE wrappers are available. • EditPad Pro: Version 6 and later use the JGsoft engine. Earlier versions used PCRE, without Unicode support. • egrep: The traditional UNIX egrep command uses the "POSIX ERE" flavor, though not all implementations fully adhere to the standard. Linux

usually ships with the GNU implementation, which use "GNU ERE". • grep: The traditional UNIX grep command uses the "POSIX BRE" flavor, though not all implementations fully adhere to the standard. Linux

usually ships with the GNU implementation, which use "GNU BRE". • Emacs: The GNU implementation of this classic UNIX text editor uses the "GNU ERE" flavor, except that POSIX classes, collations and

equivalences are not supported. • Java: The regex flavor of the java.util.regex package is listed as "Java" in the table below. • JavaScript: JavaScript's regex flavor is listed as "ECMA" in the table below. • MySQL: MySQL uses POSIX Extended Regular Expressions, listed as "POSIX ERE" in the table below. • Oracle: Oracle Database 10g implements POSIX Extended Regular Expressions, listed as "POSIX ERE" in the table below. Oracle supports

backreferences \1 through \9, though these are not part of the POSIX ERE standard. • Perl: Perl's regex flavor is listed as "Perl" in the table below. • PHP: PHP's ereg functions implement the "POSIX ERE" flavor, while the preg functions implement the "PCRE" flavor. • PostgreSQL: PostgreSQL 7.4 and later uses Henry Spencer's "Advanced Regular Expressions" flavor, listed as "Tcl ARE" in the table below.

Earlier versions used POSIX Extended Regular Expressions, listed as POSIX ERE. • PowerGREP: Version 3 and later use the JGsoft engine. Earlier versions used PCRE, without Unicode support. • PowerShell: PowerShell's built-in -match and -replace operators use the .NET regex flavor. PowerShell can also use the

System.Text.RegularExpressions classes directly. • Python: Python's regex flavor is listed as "Python" in the table below. • R: The regular expression functions in the R language for statistical programming use either the POSIX ERE flavor (default), the PCRE flavor

(perl = true) or the POSIX BRE flavor (perl = false, extended = false). • REALbasic: REALbasic's RegEx class is a wrapper around PCRE. • RegexBuddy: Version 3 and later use a special version of the JGsoft engine that emulates all the regular expression flavors in this comparison.

Version 2 supported the JGsoft regex flavor only. Version 1 used PCRE, without Unicode support. • Ruby: Ruby's regex flavor is listed as "Ruby" in the table below. • sed: The sed UNIX tool uses POSIX BRE. Linux usually ships with the GNU implementation, which use "GNU BRE". • Tcl: Tcl's Advanced Regular Expression flavor, the default flavor in Tcl 8.2 and later, is listed as "Tcl ARE" in the table below. Tcl's Extended

Regular Expression and Basic Regular Expression flavors are listed as "POSIX ERE" and "POSIX BRE" in the table below. • VBScript: VBScript's RegExp object uses the same regex flavor as JavaScript, which is listed as "ECMA" in the table below. • Visual Basic 6: Visual Basic 6 does not have built-in support for regular expressions, but can easily use the "Microsoft VBScript Regular

Expressions 5.5" COM object, which implements the "ECMA" flavor listed below. • Visual Basic.NET: As a .NET programming language, VB.NET can use the System.Text.RegularExpressions classes, listed as ".NET" below. • wxWidgets: The wxRegEx class supports 3 flavors. wxRE_ADVANCED is the "Tcl ARE" flavor, wxRE_EXTENDED is "POSIX ERE" and

wxRE_BASIC is "POSIX BRE".

Page 18: Regular Expresions

18

• XML Schema: The XML Schema regular expression flavor is listed as "XML" in the table below. • XPath: The regex flavor used by XPath functions is listed as "XPath" in the table below. • XQuery: The regex flavor used by XQuery functions is listed as "XPath" in the table below.

Characters

Feature JGsoft .NET Java Perl PCRE ECMA Python Ruby Tcl ARE

POSIX BRE

POSIX ERE

GNU BRE

GNU ERE XML XPath

Backslash escapes one metacharacter YES YES YES YES YES YES YES YES YES YES YES YES YES YES YES

\Q...\E escapes a string of metacharacters YES no Java

6 YES YES no no no no no no no no no no

\x00 through \xFF (ASCII character) YES YES YES YES YES YES YES YES YES no no no no no no

\n (LF), \r (CR) and \t (tab) YES YES YES YES YES YES YES YES YES no no no no YES YES

\f (form feed) and \v (vtab) YES YES YES YES YES YES YES YES YES no no no no no no

\a (bell) YES YES YES YES YES no YES YES YES no no no no no no

\e (escape) YES YES YES YES YES no no YES YES no no no no no no

\b (backspace) and \B (backslash) no no no no no no no no YES no no no no no no

\cA through \cZ (control character) YES YES YES YES YES YES no no YES no no no no no no

\ca through \cz (control character) YES YES no YES YES YES no no YES no no no no no no

Character Classes or Character Sets [abc]

Feature JGsoft .NET Java Perl PCRE ECMA Python Ruby Tcl ARE

POSIX BRE

POSIX ERE

GNU BRE

GNU ERE XML XPath

[abc] character class YES YES YES YES YES YES YES YES YES YES YES YES YES YES YES

[^abc] negated character YES YES YES YES YES YES YES YES YES YES YES YES YES YES YES

Page 19: Regular Expresions

19

class

[a-z] character class range YES YES YES YES YES YES YES YES YES YES YES YES YES YES YES

Hyphen in [\d-z] is a literal YES YES YES YES YES no no no no no no no no no no

Hyphen in [a-\d] is a literal YES no no no YES no no no no no no no no no no

Backslash escapes one character class metacharacter

YES YES YES YES YES YES YES YES YES no no no no YES YES

\Q...\E escapes a string of character class metacharacters

YES no Java 6 YES YES no no no no no no no no no no

\d shorthand for digits YES YES ascii YES ascii ascii option ascii YES no no no no YES YES

\w shorthand for word characters YES YES ascii YES ascii ascii option ascii YES no no YES YES YES YES

\s shorthand for whitespace YES YES ascii YES ascii YES option ascii YES no no YES YES ascii ascii

\D, \W and \S shorthand negated character classes YES YES YES YES YES YES YES YES YES no no YES YES YES YES

[\b] backspace YES YES YES YES YES YES YES YES YES no no no no no no

Dot

Feature JGsoft .NET Java Perl PCRE ECMA Python Ruby Tcl ARE

POSIX BRE

POSIX ERE

GNU BRE

GNU ERE XML XPath

. (dot; any character except line break) YES YES YES YES YES YES YES YES YES YES YES YES YES YES YES

Anchors

Feature JGsoft .NET Java Perl PCRE ECMA Python Ruby Tcl ARE

POSIX BRE

POSIX ERE

GNU BRE

GNU ERE XML XPath

^ (start of string/line) YES YES YES YES YES YES YES YES YES YES YES YES YES no YES

$ (end of string/line) YES YES YES YES YES YES YES YES YES YES YES YES YES no YES

Page 20: Regular Expresions

20

\A (start of string) YES YES YES YES YES no YES YES YES no no no no no no

\Z (end of string, before final line break) YES YES YES YES YES no no YES YES no no no no no no

\z (end of string) YES YES YES YES YES no \Z YES no no no no no no no

\` (start of string) no no no no no no no no no no no YES YES no no

\' (end of string) no no no no no no no no no no no YES YES no no

Word Boundaries

Feature JGsoft .NET Java Perl PCRE ECMA Python Ruby Tcl ARE

POSIX BRE

POSIX ERE

GNU BRE

GNU ERE XML XPath

\b (at the beginning or end of a word) YES YES YES YES ascii ascii option ascii no no no YES YES no no

\B (NOT at the beginning or end of a word) YES YES YES YES ascii ascii option ascii no no no YES YES no no

\y (at the beginning or end of a word) YES no no no no no no no YES no no no no no no

\Y (NOT at the beginning or end of a word) YES no no no no no no no YES no no no no no no

\m (at the beginning of a word) YES no no no no no no no YES no no no no no no

\M (at the end of a word) YES no no no no no no no YES no no no no no no

\< (at the beginning of a word) no no no no no no no no no no no YES YES no no

\> (at the end of a word) no no no no no no no no no no no YES YES no no

Alternation

Feature JGsoft .NET Java Perl PCRE ECMA Python Ruby Tcl ARE

POSIX BRE

POSIX ERE

GNU BRE

GNU ERE XML XPath

| (alternation) YES YES YES YES YES YES YES YES YES no YES \| YES YES YES

Page 21: Regular Expresions

21

Quantifiers

Feature JGsoft .NET Java Perl PCRE ECMA Python Ruby Tcl ARE

POSIX BRE

POSIX ERE

GNU BRE

GNU ERE XML XPath

? (0 or 1) YES YES YES YES YES YES YES YES YES no YES \? YES YES YES

* (0 or more) YES YES YES YES YES YES YES YES YES YES YES YES YES YES YES

+ (1 or more) YES YES YES YES YES YES YES YES YES no YES \+ YES YES YES

{n} (exactly n) YES YES YES YES YES YES YES YES YES \{n\} YES \{n\} YES YES YES

{n,m} (between n and m) YES YES YES YES YES YES YES YES YES \{n,m\} YES \{n,m\} YES YES YES

{n,} (n or more) YES YES YES YES YES YES YES YES YES \{n,\} YES \{n,\} YES YES YES

? after any of the above quantifiers to make it "lazy" YES YES YES YES YES YES YES YES YES no no no no no YES

Grouping and Backreferences

Feature JGsoft .NET Java Perl PCRE ECMA Python Ruby Tcl ARE

POSIX BRE

POSIX ERE

GNU BRE

GNU ERE XML XPath

(regex) (numbered capturing group) YES YES YES YES YES YES YES YES YES \( \) YES \( \) YES YES YES

(?:regex) (non-capturing group) YES YES YES YES YES YES YES YES YES no no no no no no

\1 through \9 (backreferences) YES YES YES YES YES YES YES YES YES YES no YES YES no YES

\10 through \99 (backreferences) YES YES YES YES YES YES YES YES YES no n/a no no n/a YES

Forward references \1 through \9 YES YES YES YES YES no no YES no no n/a no no n/a no

Nested references \1 through \9 YES YES YES YES YES YES no YES no no n/a no no n/a no

Backreferences non-existent YES YES YES YES YES no YES no YES YES n/a YES YES n/a YES

Page 22: Regular Expresions

22

groups are an error

Backreferences to failed groups also fail YES YES YES YES YES no YES YES YES YES n/a YES YES n/a YES

Modifiers

Feature JGsoft .NET Java Perl PCRE ECMA Python Ruby Tcl ARE

POSIX BRE

POSIX ERE

GNU BRE

GNU ERE XML XPath

(?i) (case insensitive) YES YES YES YES YES /i only YES YES YES no no no no no flag

(?s) (dot matches newlines) YES YES YES YES YES no YES (?m) no no no no no no flag

(?m) (^ and $ match at line breaks) YES YES YES YES YES /m

only YES always on no no no no no no flag

(?x) (free-spacing mode) YES YES YES YES YES no YES YES YES no no no no no flag

(?n) (explicit capture) YES YES no no no no no no no no no no no no no

(?-ismxn) (turn off mode modifiers) YES YES YES YES YES no no YES no no no no no no no

(?ismxn:group) (mode modifiers local to group) YES YES YES YES YES no no YES no no no no no no no

Atomic Grouping and Possessive Quantifiers

Feature JGsoft .NET Java Perl PCRE ECMA Python Ruby Tcl ARE

POSIX BRE

POSIX ERE

GNU BRE

GNU ERE XML XPath

(?>regex) (atomic group) YES YES YES YES YES no no YES no no no no no no no

?+, *+, ++ and {m,n}+ (possessive quantifiers) YES no YES no YES no no no no no no no no no no

Lookaround

Feature JGsoft .NET Java Perl PCRE ECMA Python Ruby Tcl ARE

POSIX BRE

POSIX ERE

GNU BRE

GNU ERE XML XPath

(?=regex) (positive lookahead) YES YES YES YES YES YES YES YES YES no no no no no no

Page 23: Regular Expresions

23

(?!regex) (negative lookahead) YES YES YES YES YES YES YES YES YES no no no no no no

(?<=text) (positive lookbehind)

full regex

full regex

finite length

fixed length

fixed + alternation no fixed

length no no no no no no no no

(?<!text) (negative lookbehind)

full regex

full regex

finite length

fixed length

finite length no fixed

length no no no no no no no no

Continuing from The Previous Match

Feature JGsoft .NET Java Perl PCRE ECMA Python Ruby Tcl ARE

POSIX BRE

POSIX ERE

GNU BRE

GNU ERE XML XPath

\G (start of match attempt) YES YES YES YES YES no no YES no no no no no no no

Conditionals

Feature JGsoft .NET Java Perl PCRE ECMA Python Ruby Tcl ARE

POSIX BRE

POSIX ERE

GNU BRE

GNU ERE XML XPath

(?(?=regex)then|else) (using any lookaround) YES YES no YES YES no no no no no no no no no no

(?(regex)then|else) no YES no no no no no no no no no no no no no

(?(1)then|else) YES YES no YES YES no YES no no no no no no no no

(?(group)then|else) YES YES no no YES no YES no no no no no no no no

Comments

Feature JGsoft .NET Java Perl PCRE ECMA Python Ruby Tcl ARE

POSIX BRE

POSIX ERE

GNU BRE

GNU ERE XML XPath

(?#comment) YES YES no YES YES no YES YES YES no no no no no no

Free-Spacing Syntax

Feature JGsoft .NET Java Perl PCRE ECMA Python Ruby Tcl ARE

POSIX BRE

POSIX ERE

GNU BRE

GNU ERE XML XPath

Free-spacing syntax supported YES YES YES YES YES no YES YES YES no no no no no YES

Page 24: Regular Expresions

24

Character class is a single token YES YES no YES YES n/a YES YES YES n/a n/a n/a n/a n/a YES

# starts a comment YES YES YES YES YES n/a YES YES YES n/a n/a n/a n/a n/a no

Unicode Characters

Feature JGsoft .NET Java Perl PCRE ECMA Python Ruby Tcl ARE

POSIX BRE

POSIX ERE

GNU BRE

GNU ERE XML XPath

\X (Unicode grapheme) YES no no YES option no no no no no no no no no no

\u0000 through \uFFFF (Unicode character) YES YES YES no no YES u"string" no YES no no no no no no

\x{0} through \x{FFFF} (Unicode character) YES no no YES option no no no no no no no no no no

Unicode Properties, Scripts and Blocks

Feature JGsoft .NET Java Perl PCRE ECMA Python Ruby Tcl ARE

POSIX BRE

POSIX ERE

GNU BRE

GNU ERE XML XPath

\pL through \pC (Unicode properties) YES no YES YES option no no no no no no no no no no

\p{L} through \p{C} (Unicode properties) YES YES YES YES option no no no no no no no no YES YES

\p{Lu} through \p{Cn} (Unicode property) YES YES YES YES option no no no no no no no no YES YES

\p{L&} and \p{Letter&} (equivalent of [\p{Lu}\p{Ll}\p{Lt}] Unicode properties)

YES no no YES option no no no no no no no no no no

\p{IsL} through \p{IsC} (Unicode properties) YES no YES YES no no no no no no no no no no no

\p{IsLu} through \p{IsCn} (Unicode property) YES no YES YES no no no no no no no no no no no

\p{Letter} through \p{Other} YES no no YES no no no no no no no no no no no

Page 25: Regular Expresions

25

(Unicode properties)

\p{Lowercase_Letter} through \p{Not_Assigned} (Unicode property)

YES no no YES no no no no no no no no no no no

\p{IsLetter} through \p{IsOther} (Unicode properties)

YES no no YES no no no no no no no no no no no

\p{IsLowercase_Letter} through \p{IsNot_Assigned} (Unicode property)

YES no no YES no no no no no no no no no no no

\p{Arabic} through \p{Yi} (Unicode script) YES no no YES option no no no no no no no no no no

\p{IsArabic} through \p{IsYi} (Unicode script) YES no no YES no no no no no no no no no no no

\p{BasicLatin} through \p{Specials} (Unicode block) YES no no YES no no no no no no no no no no no

\p{InBasicLatin} through \p{InSpecials} (Unicode block)

YES no YES YES no no no no no no no no no no no

\p{IsBasicLatin} through \p{IsSpecials} (Unicode block)

YES YES no YES no no no no no no no no no YES YES

Part between {} in all of the above is case insensitive YES no no YES no no no no no no no no no no no

Spaces, hyphens and underscores allowed in all long names listed above (e.g. BasicLatin can be written as Basic-Latin or Basic_Latin or Basic Latin)

YES no Java 5 YES no no no no no no no no no no no

\P (negated variants of all \p YES YES YES YES option no no no no no no no no YES YES

Page 26: Regular Expresions

26

as listed above)

\p{^...} (negated variants of all \p{...} as listed above) YES no no YES option no no no no no no no no no no

Named Capture and Backreferences

Feature JGsoft .NET Java Perl PCRE ECMA Python Ruby Tcl ARE

POSIX BRE

POSIX ERE

GNU BRE

GNU ERE XML XPath

(?<name>regex) (.NET-style named capturing group) YES YES no no no no no no no no no no no no no

(?'name'regex) (.NET-style named capturing group) YES YES no no no no no no no no no no no no no

\k<name> (.NET-style named backreference) YES YES no no no no no no no no no no no no no

\k'name' (.NET-style named backreference) YES YES no no no no no no no no no no no no no

(?P<name>regex) (Python-style named capturing group YES no no no YES no YES no no no no no no no no

(?P=name) (Python-style named backreference) YES no no no YES no YES no no no no no no no no

multiple capturing groups can have the same name YES YES n/a n/a no n/a no n/a n/a n/a n/a n/a n/a n/a n/a

XML Character Classes

Feature JGsoft .NET Java Perl PCRE ECMA Python Ruby Tcl ARE

POSIX BRE

POSIX ERE

GNU BRE

GNU ERE XML XPath

\i, \I, \c and \C shorthand XML name character classes

no no no no no no no no no no no no no YES YES

[abc-[abc]] character class subtraction YES 2.0 no no no no no no no no no no no YES YES

POSIX Bracket Expressions

Page 27: Regular Expresions

27

Feature JGsoft .NET Java Perl PCRE ECMA Python Ruby Tcl ARE

POSIX BRE

POSIX ERE

GNU BRE

GNU ERE XML XPath

[:alpha:] POSIX character class YES no no YES ascii no no YES YES YES YES YES YES no no

\p{Alpha} POSIX character class YES no ascii no no no no no no no no no no no no

\p{IsAlpha} POSIX character class YES no no YES no no no no no no no no no no no

[.span-ll.] POSIX collation sequence no no no no no no no no YES YES YES YES YES no no

[=x=] POSIX character equivalence no no no no no no no no YES YES YES YES YES no no