Unicode Regular Expressions

Nick Patch

23 January 2013

Unicode Refresher

Unicode attempts to support the characters of the world, a massive task!

It's hard to attach a single meaning to the word “character”, but most folks think of characters as the smallest stand-alone components of a writing system.

In Unicode, this sense of characters is represented by one or more code points, which are each stored in one or more bytes.
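To make the three levels concrete, here is a minimal JavaScript sketch (not from the talk) that counts one visible character at each level. It assumes a runtime that provides `Intl.Segmenter` and `TextEncoder`, such as a recent Node.js; the variable names are illustrative.

```javascript
// One user-perceived character: α followed by three combining marks.
const text = '\u03B1\u0313\u0300\u0345';

// Grapheme clusters: what most people would call "characters".
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemeCount = [...segmenter.segment(text)].length;

// Code points: spreading a string iterates by code point.
const codePointCount = [...text].length;

// Bytes: the size of the UTF-8 encoding.
const byteCount = new TextEncoder().encode(text).length;

console.log(graphemeCount, codePointCount, byteCount); // 1 4 8
```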

However, programmers and programming languages tend to think of characters as individual code points, or worse, individual bytes.

We need to modernize our habits!

Unicode is not just a big set of characters. It also defines standard properties for each character and standard algorithms for operations such as collation, normalization, and segmentation.

Normalization

NFD(ᾀ◌̀) = α◌̓◌̀◌ͅ
NFC(ᾀ◌̀) = ᾂ

NFD(Чюрлё◌́нис) = Чюрле◌̈◌́нис
NFC(Чюрлё◌́нис) = Чюрлё◌́нис
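The Cyrillic example shows that NFC does not always produce a single code point per character: there is no precomposed ё with an acute accent, so the stress mark stays separate. A small sketch using the built-in `String.prototype.normalize` (the variable names are illustrative):

```javascript
// "Чюрлёнис" with a stress mark (U+0301) on the precomposed ё (U+0451).
const name = '\u0427\u044E\u0440\u043B\u0451\u0301\u043D\u0438\u0441';

const nfd = name.normalize('NFD'); // ё splits into е (U+0435) + U+0308
const nfc = name.normalize('NFC'); // ё stays composed, U+0301 stays separate

console.log([...nfd].length); // 10
console.log([...nfc].length); // 9
console.log(nfc === name);    // true
```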

ᾂ ≡ ἂ◌ͅ ≡ ᾀ◌̀ ≡ ᾳ◌̓◌̀ ≡ α◌̓◌̀◌ͅ ≡ α◌̓◌ͅ◌̀ ≡ α◌ͅ◌̓◌̀

≠

ᾲ◌̓ ≡ ὰ◌̓◌ͅ ≡ ὰ◌ͅ◌̓ ≡ ᾳ◌̀◌̓ ≡ α◌̀◌̓◌ͅ ≡ α◌̀◌ͅ◌̓ ≡ α◌ͅ◌̀◌̓
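One way to check these equivalences is to normalize every sequence and compare the results. The sketch below is not from the talk; it writes the two groups as escapes (my reading of the code points on the slide) and uses the built-in `String.prototype.normalize`.

```javascript
// The two groups of sequences, written as escapes so that the source
// file's own normalization cannot change them.
const groupA = [
  '\u1F82',
  '\u1F02\u0345',
  '\u1F80\u0300',
  '\u1FB3\u0313\u0300',
  '\u03B1\u0313\u0300\u0345',
  '\u03B1\u0313\u0345\u0300',
  '\u03B1\u0345\u0313\u0300',
];
const groupB = [
  '\u1FB2\u0313',
  '\u1F70\u0313\u0345',
  '\u1F70\u0345\u0313',
  '\u1FB3\u0300\u0313',
  '\u03B1\u0300\u0313\u0345',
  '\u03B1\u0300\u0345\u0313',
  '\u03B1\u0345\u0300\u0313',
];

// Canonically equivalent strings share a single NFC form.
const nfcForms = (group) => new Set(group.map((s) => s.normalize('NFC')));

const formsA = nfcForms(groupA);
const formsB = nfcForms(groupB);

console.log(formsA.size, formsB.size);          // 1 1
console.log([...formsA][0] === [...formsB][0]); // false
```

The order of the breathing mark and the grave accent matters because both marks have the same combining class, so normalization never reorders them relative to each other.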

Perl Normalization

use Unicode::Normalize;

say $str;      # ᾀ◌̀
say NFD($str); # α◌̓◌̀◌ͅ
say NFC($str); # ᾂ

JavaScript Normalization

var unorm = require('unorm');

console.log($str);            // ᾀ◌̀
console.log(unorm.nfd($str)); // α◌̓◌̀◌ͅ
console.log(unorm.nfc($str)); // ᾂ

PHP Normalization

echo $str; # ᾀ◌̀

echo Normalizer::normalize($str, Normalizer::FORM_D); # α◌̓◌̀◌ͅ

echo Normalizer::normalize($str, Normalizer::FORM_C); # ᾂ

Grapheme Clusters

regex: /^.$/

string 1: ᾂ

string 2: α◌̓◌̀◌ͅ

1. anchor beginning of string
2. match code point (excl. \n)
3. anchor at end of string
4. 1 success but 1 failure (mixed results): string 1 is a single code point and matches; string 2 is four code points and fails
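The same mixed result can be reproduced in JavaScript, where the `u` flag makes `.` match one code point. This is a sketch with illustrative variable names, not code from the talk.

```javascript
const composed = '\u1F82';                     // ᾂ as one code point
const decomposed = '\u03B1\u0313\u0300\u0345'; // α plus three combining marks

// With the u flag, "." matches one code point (excluding line terminators).
const oneCodePoint = /^.$/u;

console.log(oneCodePoint.test(composed));   // true
console.log(oneCodePoint.test(decomposed)); // false
```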

regex: /^\X$/

string 1: ᾂ

string 2: α◌̓◌̀◌ͅ

1. anchor beginning of string
2. match grapheme cluster
3. anchor at end of string
4. success! Both strings are a single grapheme cluster.
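Where a regex engine has no `\X`, a similar check can be made by segmenting the string first. The sketch below assumes a JavaScript runtime with `Intl.Segmenter`; `isSingleGrapheme` is a hypothetical helper name, and the check is only roughly what `/^\X$/` does in Perl.

```javascript
// Hypothetical helper: true when the whole string is one grapheme cluster.
function isSingleGrapheme(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return [...segmenter.segment(str)].length === 1;
}

console.log(isSingleGrapheme('\u1F82'));                   // true
console.log(isSingleGrapheme('\u03B1\u0313\u0300\u0345')); // true
console.log(isSingleGrapheme('\u03B1\u03B2'));             // false (two letters)
```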

Perl

use v5.12; # better yet: v5.14
use utf8;
use charnames qw( :full ); # unless v5.16
use open qw( :encoding(UTF-8) :std );

$str =~ /^\X$/;

$str =~ s/^(\X)$/->$1<-/;

PHP

preg_match('/^\X$/u', $str);

preg_replace('/^(\X)$/u', '->$1<-', $str);

JavaScript

[This slide intentionally left blank.]
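JavaScript regular expressions have no `\X`. A rough stand-in, assuming an engine that supports Unicode property escapes under the `u` flag, is one non-mark code point followed by any number of combining marks. This is an approximation chosen for illustration, not the full grapheme cluster rule: it does not handle emoji sequences or Hangul jamo, for example.

```javascript
// Approximation of \X: a non-mark code point plus any following marks.
// This is narrower than a real grapheme cluster (no emoji or Hangul rules).
const singleCluster = /^(\P{M}\p{M}*)$/u;

const decomposed = '\u03B1\u0313\u0300\u0345'; // α plus three combining marks

console.log(singleCluster.test(decomposed)); // true

// Analogue of the Perl substitution s/^(\X)$/->$1<-/
const wrapped = decomposed.replace(singleCluster, '->$1<-');
console.log(wrapped === '->' + decomposed + '<-'); // true
```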

Match Any Character

two bytes (if byte mode): е..и
code point (excl. \n): е.и
code point (incl. \n): е\p{Any}и
grapheme cluster (incl. \n): е\Xи
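JavaScript has no byte mode, but its default mode behaves in a loosely comparable way: without the `u` flag, `.` matches one UTF-16 code unit rather than one code point. The sketch below is illustrative and uses escapes for the Cyrillic letters е (U+0435) and и (U+0438).

```javascript
const withNewline = '\u0435\n\u0438';      // е, newline, и
const withEmoji = '\u0435\u{1F600}\u0438'; // е, U+1F600, и

// Without the u flag, "." matches one UTF-16 code unit, so a code point
// outside the BMP needs two dots.
console.log(/^\u0435.\u0438$/.test(withEmoji));  // false
console.log(/^\u0435..\u0438$/.test(withEmoji)); // true

// With the u flag, "." matches one code point, but not a newline.
console.log(/^\u0435.\u0438$/u.test(withEmoji));   // true
console.log(/^\u0435.\u0438$/u.test(withNewline)); // false

// \p{Any} matches every code point, newline included.
console.log(/^\u0435\p{Any}\u0438$/u.test(withNewline)); // true
```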

Match Any Letter

letter code point: е\p{General_Category=Letter}и
letter code point: е\pLи
Cyrillic code point: е\p{Script=Cyrillic}и
Cyrillic code point: е\p{Cyrillic}и

letter grapheme cluster: е(?=\pL)\Xи
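A short JavaScript sketch of the code point patterns above. It uses the spellings that JavaScript accepts under the `u` flag, and the test strings are illustrative.

```javascript
// General_Category=Letter, in long and short spellings.
const anyLetterLong = /^\u0435\p{General_Category=Letter}\u0438$/u;
const anyLetterShort = /^\u0435\p{L}\u0438$/u;

// A code point whose Script property is Cyrillic.
const cyrillicLetter = /^\u0435\p{Script=Cyrillic}\u0438$/u;

const cyrillicMiddle = '\u0435\u0436\u0438'; // е ж и
const latinMiddle = '\u0435z\u0438';         // е z и

console.log(anyLetterLong.test(cyrillicMiddle));  // true
console.log(anyLetterShort.test(latinMiddle));    // true
console.log(cyrillicLetter.test(cyrillicMiddle)); // true
console.log(cyrillicLetter.test(latinMiddle));    // false
```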

regex: / о \p{Cyrillic} т /x

string 1: който

string 2: кои◌̆то

1. match letter о
2. match Cyrillic letter (1 code point)
3. match letter т
4. 1 success but 1 failure (mixed results): in string 2 the combining breve sits between и and т
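The failure is easy to reproduce. In this JavaScript sketch the two spellings are written with escapes so that the difference between й (U+0439) and и plus U+0306 is visible; the variable names are illustrative.

```javascript
const precomposed = '\u043A\u043E\u0439\u0442\u043E';     // който, й as U+0439
const combining = '\u043A\u043E\u0438\u0306\u0442\u043E'; // кои + U+0306 + то

// о, then exactly one Cyrillic code point, then т
const pattern = /\u043E\p{Script=Cyrillic}\u0442/u;

console.log(pattern.test(precomposed)); // true
console.log(pattern.test(combining));   // false: U+0306 sits between и and т
```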

regex: / о (?= \p{Cyrillic} ) \X т /x

string 1: който

string 2: кои◌̆то

1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
3. match grapheme cluster (1+ code points)
4. match letter т
5. success!
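The lookahead technique carries over to engines without `\X` if the cluster is approximated. The sketch below swaps `\X` for `\P{M}\p{M}*` (one non-mark code point plus its combining marks), which is an assumption that holds for these strings but is not a full grapheme cluster match.

```javascript
const precomposed = '\u043A\u043E\u0439\u0442\u043E';     // който, й as U+0439
const combining = '\u043A\u043E\u0438\u0306\u0442\u043E'; // кои + U+0306 + то

// о, lookahead for a Cyrillic code point, approximate cluster, т
const pattern = /\u043E(?=\p{Script=Cyrillic})\P{M}\p{M}*\u0442/u;

console.log(pattern.test(precomposed)); // true
console.log(pattern.test(combining));   // true
```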

Character Literals

[ یي ]

(?: ی|ي )

[\x{064A}\x{06CC}]

[\N{ARABIC LETTER YEH}\N{ARABIC LETTER FARSI YEH}]
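For comparison, a JavaScript sketch of the code point form. JavaScript writes the escape as `\u{...}` under the `u` flag and, to my knowledge, has no counterpart to `\N{...}` names, so the names appear only in a comment.

```javascript
// U+064A ARABIC LETTER YEH and U+06CC ARABIC LETTER FARSI YEH
const yeh = /^[\u{064A}\u{06CC}]$/u;

console.log(yeh.test('\u064A')); // true
console.log(yeh.test('\u06CC')); // true
console.log(yeh.test('\u0649')); // false (U+0649 ARABIC LETTER ALEF MAKSURA)
```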

Properties

\p{Script=Latin}

Name: Script
Value: Latin

Match any code point with the value “Latin” for the Script property.

\P{Script=Latin}

Name: Script
Value: not Latin

Negated form: match any code point without the value “Latin” for the Script property.

\p{Latin}

Name: Script (implicit)
Value: Latin

The Script and General Category properties don't require the name because they're so common and their values don't conflict.

\p{General_Category=Letter}

Name: General Category
Value: Letter

Match any code point with the value “Letter” for the General Category property.

\p{gc=Letter}

Name: General Category (gc)
Value: Letter

Property names may be abbreviated.

\p{gc=L}

Name: General Category (gc)
Value: Letter (L)

The General Category property is so commonly used that its values all have standard abbreviations.

\p{L}

Name: General Category (implicit)
Value: Letter (L)

And the General Category values may even be used on their own, like the Script values. These two properties have distinct values.

\pL

Name: General Category (implicit)
Value: Letter (L)

Single-character General Category values don't require curly braces.

\PL

Name: General Category (implicit)
Value: not Letter (L)

Don't forget negation!
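Not every engine accepts every spelling shown above. As a point of comparison, this JavaScript sketch exercises the forms that JavaScript supports under the `u` flag and checks that the two shortest Perl spellings are rejected there. `compiles` is a hypothetical helper written for this example.

```javascript
const isLatin = /^\p{Script=Latin}$/u;
const isNotLatin = /^\P{Script=Latin}$/u;
const isLetter = /^\p{gc=L}$/u; // same set as \p{General_Category=Letter} and \p{L}
const isNotLetter = /^\P{L}$/u;

console.log(isLatin.test('a'), isNotLatin.test('a'));        // true false
console.log(isLetter.test('\u0436'), isNotLetter.test('7')); // true true

// Hypothetical helper: does this pattern source compile with the u flag?
function compiles(source) {
  try {
    new RegExp(source, 'u');
    return true;
  } catch (error) {
    return false;
  }
}

console.log(compiles('\\p{Latin}')); // false: Script needs its name here
console.log(compiles('\\pL'));       // false: curly braces are required
```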
