Unicode Regular Expressions

Unicode Regular Expressions s/ / /g Nick Patch 23 January 2013

description

Unicode regular expression tutorial with examples in Perl, PHP, and JavaScript. Presented at: Shutterstock “Brown Bag Lunch” Tech Talk, 23 January 2013, New York, NY

Transcript of Unicode Regular Expressions

Page 1: Unicode Regular Expressions

Unicode Regular Expressions

s/ / /g

Nick Patch

23 January 2013

Page 2: Unicode Regular Expressions

Unicode Refresher

Unicode attempts to support the characters of the world, a massive task!

Page 3: Unicode Regular Expressions

Unicode Refresher

It's hard to attach a single meaning to the word “character” but most folks think of characters as the smallest stand-alone components of a writing system.

Page 4: Unicode Regular Expressions

Unicode Refresher

In Unicode, this sense of characters is represented by one or more code points, which are each stored in one or more bytes.
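A minimal JavaScript sketch of the three levels, assuming a newer engine that provides `Intl.Segmenter` and `TextEncoder` (neither is part of the original talk). The sample string is an illustrative choice: one user-perceived character written as four code points.

```javascript
// One user-perceived character, written as four code points (assumed sample).
const str = '\u03B1\u0313\u0300\u0345'; // alpha + psili + grave + ypogegrammeni

// Grapheme clusters: what a reader would call "characters".
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
const graphemes = [...segmenter.segment(str)].length;

// Code points: string iteration walks code points, not UTF-16 code units.
const codePoints = [...str].length;

// Bytes: the size of the UTF-8 encoding.
const bytes = new TextEncoder().encode(str).length;

console.log({ graphemes, codePoints, bytes }); // { graphemes: 1, codePoints: 4, bytes: 8 }
```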

Page 5: Unicode Regular Expressions

Unicode Refresher

However, programmers and programming languages tend to think of characters as individual code points, or worse, individual bytes.

We need to modernize our habits!

Page 6: Unicode Regular Expressions

Unicode Refresher

Unicode is not just a big set of characters. It also defines standard properties for each character and standard algorithms for operations such as collation, normalization, and segmentation.

Page 7: Unicode Regular Expressions

Normalization

NFD(ᾀ◌̀) = α◌̓◌̀◌ͅ

NFC(ᾀ◌̀) = ᾂ

Page 8: Unicode Regular Expressions

Normalization

NFD(Чю◌́рлёнис) = Чю◌́рле◌̈нис

NFC(Чю◌́рлёнис) = Чю◌́рлёнис

Page 9: Unicode Regular Expressions

Normalization

ᾂ ≡ ἂ◌ͅ ≡ ᾀ◌̀ ≡ ᾳ◌̓◌̀ ≡ α◌̓◌̀◌ͅ ≡ α◌̓◌ͅ◌̀ ≡ α◌ͅ◌̓◌̀

≠

ᾲ◌̓ ≡ ὰ◌̓◌ͅ ≡ ὰ◌ͅ◌̓ ≡ ᾳ◌̀◌̓ ≡ α◌̀◌̓◌ͅ ≡ α◌̀◌ͅ◌̓ ≡ α◌ͅ◌̀◌̓
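One way to check canonical equivalence in code is to compare normalized forms. The sketch below assumes an engine with the built-in `String.prototype.normalize` (ES2015); `canonicallyEquivalent` is a hypothetical helper name, not something from the talk.

```javascript
// Two strings are canonically equivalent when their NFD forms are identical.
function canonicallyEquivalent(a, b) {
  return a.normalize('NFD') === b.normalize('NFD');
}

const precomposed = '\u1F82';                 // single code point from the first group
const psiliFirst = '\u03B1\u0345\u0313\u0300'; // alpha + ypogegrammeni + psili + grave
const graveFirst = '\u03B1\u0300\u0313\u0345'; // alpha + grave + psili + ypogegrammeni

console.log(canonicallyEquivalent(precomposed, psiliFirst)); // true
console.log(canonicallyEquivalent(precomposed, graveFirst)); // false
```

The psili and the grave share a combining class, so their relative order is significant; the ypogegrammeni has a different class and may move freely.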

Page 10: Unicode Regular Expressions

Perl Normalization

use Unicode::Normalize;

say $str;      # ᾀ◌̀
say NFD($str); # α◌̓◌̀◌ͅ
say NFC($str); # ᾂ

Page 11: Unicode Regular Expressions

JavaScript Normalization

var unorm = require('unorm');

console.log($str);            // ᾀ◌̀
console.log(unorm.nfd($str)); // α◌̓◌̀◌ͅ
console.log(unorm.nfc($str)); // ᾂ
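Engines that implement ES2015 or later also ship a built-in `String.prototype.normalize`, so the same check can be sketched without a library. `describe` is a hypothetical helper that prints code points, since combining marks are hard to read in a terminal.

```javascript
// Sample from the slide: U+1F80 followed by a combining grave accent.
const str = '\u1F80\u0300';

// Hypothetical helper: list the code points of a string as U+XXXX.
const describe = (s) =>
  [...s]
    .map((ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'))
    .join(' ');

console.log(describe(str));                  // U+1F80 U+0300
console.log(describe(str.normalize('NFD'))); // U+03B1 U+0313 U+0300 U+0345
console.log(describe(str.normalize('NFC'))); // U+1F82
```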

Page 12: Unicode Regular Expressions

PHP Normalization

echo $str; # ᾀ◌̀

echo Normalizer::normalize($str, Normalizer::FORM_D); # α◌̓◌̀◌ͅ

echo Normalizer::normalize($str, Normalizer::FORM_C); # ᾂ

Page 13: Unicode Regular Expressions

Grapheme Clusters

regex: /^.$/

string 1: ᾂ

string 2: α◌̓◌̀◌ͅ

Page 14: Unicode Regular Expressions

Grapheme Clusters

regex: /^.$/

string 1: ᾂ ⇧

string 2: α◌̓◌̀◌ͅ ⇧

1. anchor beginning of string

Page 15: Unicode Regular Expressions

Grapheme Clusters

regex: /^.$/

string 1: ᾂ ⇧

string 2: α◌̓◌̀◌ͅ ⇧

1. anchor beginning of string
2. match code point (excl. \n)

Page 16: Unicode Regular Expressions

Grapheme Clusters

regex: /^.$/

string 1: ᾂ ⇧⇧

string 2: α◌̓◌̀◌ͅ

1. anchor beginning of string
2. match code point (excl. \n)
3. anchor at end of string

Page 17: Unicode Regular Expressions

Grapheme Clusters

regex: /^.$/

string 1: ᾂ ⇧⇧

string 2: α◌̓◌̀◌ͅ

1. anchor beginning of string
2. match code point (excl. \n)
3. anchor at end of string
4. 1 success but 1 failure: mixed results
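The mixed result can be reproduced in JavaScript. This is a sketch that assumes the `u` flag (ES2015), which makes `.` match a whole code point; the variable names are illustrative.

```javascript
const oneCodePoint = /^.$/u;

const composed = '\u1F82';                    // string 1: a single code point
const decomposed = composed.normalize('NFD'); // string 2: four code points

console.log(oneCodePoint.test(composed));   // true
console.log(oneCodePoint.test(decomposed)); // false
```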

Page 18: Unicode Regular Expressions

Grapheme Clusters

regex: /^\X$/

string 1: ᾂ

string 2: α◌̓◌̀◌ͅ

Page 19: Unicode Regular Expressions

Grapheme Clusters

regex: /^\X$/

string 1: ᾂ ⇧

string 2: α◌̓◌̀◌ͅ ⇧

1. anchor beginning of string

Page 20: Unicode Regular Expressions

Grapheme Clusters

regex: /^\X$/

string 1: ᾂ ⇧

string 2: α◌̓◌̀◌ͅ ⇧

1. anchor beginning of string
2. match grapheme cluster

Page 21: Unicode Regular Expressions

Grapheme Clusters

regex: /^\X$/

string 1: ᾂ ⇧⇧

string 2: α◌̓◌̀◌ͅ ⇧ ⇧

1. anchor beginning of string
2. match grapheme cluster
3. anchor at end of string

Page 22: Unicode Regular Expressions

Grapheme Clusters

regex: /^\X$/

string 1: ᾂ ⇧⇧

string 2: α◌̓◌̀◌ͅ ⇧ ⇧

1. anchor beginning of string
2. match grapheme cluster
3. anchor at end of string
4. success!

Page 23: Unicode Regular Expressions

Perl

use v5.12; # better yet: v5.14
use utf8;
use charnames qw( :full ); # unless v5.16
use open qw( :encoding(UTF-8) :std );

$str =~ /^\X$/;

$str =~ s/^(\X)$/->$1<-/;

Page 24: Unicode Regular Expressions

PHP

preg_match('/^\X$/u', $str);

preg_replace('/^(\X)$/u', '->$1<-', $str);

Page 25: Unicode Regular Expressions

JavaScript

[This slide intentionally left blank.]
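As far as I know, JavaScript regular expressions still have no `\X`. Newer engines do provide `Intl.Segmenter`, which can stand in for the `/^\X$/` check. The sketch below is not from the talk, and `isSingleGrapheme` is a hypothetical helper name.

```javascript
// Roughly equivalent to /^\X$/: true when the string is exactly one grapheme cluster.
function isSingleGrapheme(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  const iterator = segmenter.segment(str)[Symbol.iterator]();
  const first = iterator.next();
  // Exactly one segment: the first exists and there is no second.
  return !first.done && iterator.next().done;
}

console.log(isSingleGrapheme('\u1F82'));                   // true
console.log(isSingleGrapheme('\u03B1\u0313\u0300\u0345')); // true
console.log(isSingleGrapheme('ab'));                       // false
```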

Page 26: Unicode Regular Expressions

Match Any Character

two bytes (if byte mode): е..и
code point (excl. \n): е.и
code point (incl. \n): е\p{Any}и
grapheme cluster (incl. \n): е\Xи
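In JavaScript the code point variants can be sketched as follows, assuming an ES2018 engine (the `s` flag and `\p{Any}` both arrived in that edition). The Cyrillic letters are written as escapes so the code points are unambiguous.

```javascript
// Cyrillic е (U+0435), a newline, Cyrillic и (U+0438).
const withNewline = '\u0435\n\u0438';

const dot = /\u0435.\u0438/u;        // any code point except line terminators
const dotAll = /\u0435.\u0438/su;    // the s flag lets . match newlines too
const any = /\u0435\p{Any}\u0438/u;  // \p{Any} matches every code point

console.log(dot.test(withNewline));    // false
console.log(dotAll.test(withNewline)); // true
console.log(any.test(withNewline));    // true
```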

Page 27: Unicode Regular Expressions

Match Any Letter

letter code point: е\p{General_Category=Letter}и
letter code point: е\pLи
Cyrillic code point: е\p{Script=Cyrillic}и
Cyrillic code point: е\p{Cyrillic}и

letter grapheme cluster: е(?=\pL)\Xи
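A JavaScript sketch of the letter and script variants, assuming ES2018 property escapes with the `u` flag. The three sample strings are illustrative and not from the talk.

```javascript
// е (U+0435) and и (U+0438) are written as escapes to avoid look-alike Latin letters.
const anyLetter = /\u0435\p{L}\u0438/u;
const cyrillicOnly = /\u0435\p{Script=Cyrillic}\u0438/u;

const cyrillicMiddle = '\u0435\u043B\u0438';  // Cyrillic л in the middle
const latinMiddle = '\u0435' + 'z' + '\u0438'; // Latin z in the middle
const digitMiddle = '\u0435' + '1' + '\u0438'; // digit in the middle

console.log(anyLetter.test(cyrillicMiddle), cyrillicOnly.test(cyrillicMiddle)); // true true
console.log(anyLetter.test(latinMiddle), cyrillicOnly.test(latinMiddle));       // true false
console.log(anyLetter.test(digitMiddle), cyrillicOnly.test(digitMiddle));       // false false
```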

Page 28: Unicode Regular Expressions

regex: / о \p{Cyrillic} т /x

string 1: който

string 2: кои◌̆то

Page 29: Unicode Regular Expressions

regex: / о \p{Cyrillic} т /x

string 1: който

string 2: кои◌̆то

1. match letter о

Page 30: Unicode Regular Expressions

regex: / о \p{Cyrillic} т /x

string 1: който

string 2: кои◌̆то

1. match letter о
2. match Cyrillic letter (1 code point)

Page 31: Unicode Regular Expressions

regex: / о \p{Cyrillic} т /x

string 1: който

string 2: кои◌̆то

1. match letter о
2. match Cyrillic letter (1 code point)
3. match letter т

Page 32: Unicode Regular Expressions

regex: / о \p{Cyrillic} т /x

string 1: който

string 2: кои◌̆то

1. match letter о
2. match Cyrillic letter (1 code point)
3. match letter т
4. 1 success but 1 failure: mixed results
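The failure on the decomposed string can be reproduced in JavaScript. This sketch assumes ES2018 property escapes; JavaScript needs the explicit `Script=` name. All letters are written as escapes so the composed and decomposed forms are unambiguous.

```javascript
// о (U+043E), one Cyrillic code point, т (U+0442)
const pattern = /\u043E\p{Script=Cyrillic}\u0442/u;

const composed = '\u043A\u043E\u0439\u0442\u043E';         // string 1: й as one code point
const decomposed = '\u043A\u043E\u0438\u0306\u0442\u043E'; // string 2: и + combining breve

console.log(pattern.test(composed));   // true
console.log(pattern.test(decomposed)); // false: the breve sits between и and т
```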

Page 33: Unicode Regular Expressions

regex: / о (?= \p{Cyrillic} ) \X т /x

string 1: който

string 2: кои◌̆то

Page 34: Unicode Regular Expressions

regex: / о (?= \p{Cyrillic} ) \X т /x

string 1: който

string 2: кои◌̆то

1. match letter о

Page 35: Unicode Regular Expressions

regex: / о (?= \p{Cyrillic} ) \X т /x

string 1: който ⇧

string 2: кои◌̆то ⇧

1. match letter о
2. positive lookahead Cyrillic letter (1 code point)

Page 36: Unicode Regular Expressions

regex: / о (?= \p{Cyrillic} ) \X т /x

string 1: който ⇧

string 2: кои◌̆то ⇧

1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
3. match grapheme cluster (1+ code points)

Page 37: Unicode Regular Expressions

regex: / о (?= \p{Cyrillic} ) \X т /x

string 1: който ⇧

string 2: кои◌̆то ⇧

1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
3. match grapheme cluster (1+ code points)
4. match letter т

Page 38: Unicode Regular Expressions

regex: / о (?= \p{Cyrillic} ) \X т /x

string 1: който ⇧

string 2: кои◌̆то ⇧

1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
3. match grapheme cluster (1+ code points)
4. match letter т
5. success!

Page 39: Unicode Regular Expressions

Character Literals

[ یي ]

(?: ی|ي )

Page 40: Unicode Regular Expressions

Character Literals

[ یي ]

(?: ی|ي )

Page 41: Unicode Regular Expressions

Character Literals

[ یي ]

(?: ی|ي )

[\x{064A}\x{06CC}]

Page 42: Unicode Regular Expressions

Character Literals

[ یي ]

(?: ی|ي )

[\x{064A}\x{06CC}]

[\N{ARABIC LETTER YEH}\N{ARABIC LETTER FARSI YEH}]
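For comparison, a JavaScript sketch of the escaped form. With the `u` flag (ES2015) code points are written as `\u{...}`; to my knowledge JavaScript has no `\N{NAME}` escape, so names can only go in comments. The alef test character is an illustrative addition.

```javascript
// U+064A ARABIC LETTER YEH, U+06CC ARABIC LETTER FARSI YEH
const yeh = /[\u{064A}\u{06CC}]/u;

console.log(yeh.test('\u064A')); // true
console.log(yeh.test('\u06CC')); // true
console.log(yeh.test('\u0627')); // false (ARABIC LETTER ALEF)
```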

Page 43: Unicode Regular Expressions

Properties

\p{Script=Latin}

Name: Script
Value: Latin

Match any code point with the value “Latin” for the Script property.

Page 44: Unicode Regular Expressions

Properties

\P{Script=Latin}

Name: Script
Value: not Latin

Negated form: Match any code point without the value “Latin” for the Script property.

Page 45: Unicode Regular Expressions

Properties

\p{Latin}

Name: Script (implicit)
Value: Latin

The Script and General Category properties don't require the name because they're so common and their values don't conflict.

Page 46: Unicode Regular Expressions

Properties

\p{General_Category=Letter}

Name: General Category
Value: Letter

Match any code point with the value “Letter” for the General Category property.

Page 47: Unicode Regular Expressions

Properties

\p{gc=Letter}

Name: General Category (gc)
Value: Letter

Property names may be abbreviated.

Page 48: Unicode Regular Expressions

Properties

\p{gc=L}

Name: General Category (gc)
Value: Letter (L)

The General Category property is so commonly used that its values all have standard abbreviations.

Page 49: Unicode Regular Expressions

Properties

\p{L}

Name: General Category (implicit)
Value: Letter (L)

And the General Category values may even be used on their own, like the Script values. These two properties have distinct values.

Page 50: Unicode Regular Expressions

Properties

\pL

Name: General Category (implicit)
Value: Letter (L)

Single-character General Category values don't require curly braces.

Page 51: Unicode Regular Expressions

Properties

\PL

Name: General Category (implicit)
Value: not Letter (L)

Don't forget negation!
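The property forms above are Perl's. A JavaScript sketch, assuming ES2018 property escapes with the `u` flag, accepts the long, abbreviated, and bare General Category forms. It rejects the bare Script value and the brace-less `\pL`. `compiles` is a hypothetical helper that reports whether a pattern is accepted.

```javascript
// Four spellings of "any letter" that JavaScript accepts.
const letterForms = [
  /\p{General_Category=Letter}/u,
  /\p{gc=Letter}/u,
  /\p{gc=L}/u,
  /\p{L}/u,
];
console.log(letterForms.every((re) => re.test('a'))); // true
console.log(letterForms.some((re) => re.test('1')));  // false

// Negation uses an uppercase P.
console.log(/\P{L}/u.test('1'));            // true
console.log(/\P{Script=Latin}/u.test('a')); // false

// Hypothetical helper: does the engine accept this pattern in u mode?
function compiles(source) {
  try {
    new RegExp(source, 'u');
    return true;
  } catch {
    return false;
  }
}
console.log(compiles('\\p{Script=Latin}')); // true
console.log(compiles('\\p{Latin}'));        // false: Script needs its name
console.log(compiles('\\pL'));              // false: braces are required
```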

Page 52: Unicode Regular Expressions

s/ / /g