Unicode Regular Expressions

Nick Patch

23 January 2013

Unicode Refresher

Unicode attempts to support the characters of the world, a massive task!

Unicode Refresher

It's hard to attach a single meaning to the word “character”, but most folks think of characters as the smallest stand-alone components of a writing system.

Unicode Refresher

In Unicode, this sense of characters is represented by one or more code points, which are each stored in one or more bytes.

Unicode Refresher

However, programmers and programming languages tend to think of characters as individual code points, or worse, individual bytes.

We need to modernize our habits!

Unicode Refresher

Unicode is not just a big set of characters. It also defines standard properties for each character and standard algorithms for operations such as collation, normalization, and segmentation.

Normalization

NFD(ᾀ◌̀) = α◌̓◌̀◌ͅ
NFC(ᾀ◌̀) = ᾂ

Normalization

NFD(Чюрлё◌́нис) = Чюрле◌̈◌́нис
NFC(Чюрлё◌́нис) = Чюрлё◌́нис

Normalization

ᾂ ≡ ἂ◌ͅ ≡ ᾀ◌̀ ≡ ᾳ◌̓◌̀ ≡ α◌̓◌̀◌ͅ ≡ α◌̓◌ͅ◌̀ ≡ α◌ͅ◌̓◌̀

≠

ᾲ◌̓ ≡ ὰ◌̓◌ͅ ≡ ὰ◌ͅ◌̓ ≡ ᾳ◌̀◌̓ ≡ α◌̀◌̓◌ͅ ≡ α◌̀◌ͅ◌̓ ≡ α◌ͅ◌̀◌̓
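The two equivalence classes above can be checked mechanically. What follows is a minimal sketch, not the speaker's code: it assumes a JavaScript runtime with the built-in `String.prototype.normalize`, and the names `psiliFirst`, `variaFirst`, and `canonicallyEquivalent` are hypothetical. It compares strings after decomposing them to NFD, so any two spellings in the same class should compare equal.

```javascript
// Combining marks written as escapes so the code points are unambiguous.
const ALPHA = '\u03B1';         // GREEK SMALL LETTER ALPHA
const PSILI = '\u0313';         // COMBINING COMMA ABOVE
const VARIA = '\u0300';         // COMBINING GRAVE ACCENT
const YPOGEGRAMMENI = '\u0345'; // COMBINING GREEK YPOGEGRAMMENI

// First class: psili precedes varia.
const psiliFirst = [
  '\u1F82',                              // fully precomposed
  '\u1F80' + VARIA,                      // alpha with psili and ypogegrammeni, plus varia
  ALPHA + PSILI + VARIA + YPOGEGRAMMENI,
  ALPHA + YPOGEGRAMMENI + PSILI + VARIA,
];

// Second class: varia precedes psili.
const variaFirst = [
  '\u1FB2' + PSILI,                      // alpha with varia and ypogegrammeni, plus psili
  ALPHA + VARIA + PSILI + YPOGEGRAMMENI,
  ALPHA + YPOGEGRAMMENI + VARIA + PSILI,
];

// Two strings are canonically equivalent when their NFD forms are identical.
function canonicallyEquivalent(a, b) {
  return a.normalize('NFD') === b.normalize('NFD');
}

console.log(psiliFirst.every((s) => canonicallyEquivalent(s, psiliFirst[0]))); // true
console.log(variaFirst.every((s) => canonicallyEquivalent(s, variaFirst[0]))); // true
console.log(canonicallyEquivalent(psiliFirst[0], variaFirst[0]));              // false
```

The psili and the varia share a combining class, so normalization keeps their relative order; only the ypogegrammeni is free to move. That is why swapping the first two marks produces a different class.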

Perl Normalization

use Unicode::Normalize;

say $str;      # ᾀ◌̀
say NFD($str); # α◌̓◌̀◌ͅ
say NFC($str); # ᾂ

JavaScript Normalization

var unorm = require('unorm');

console.log($str);            // ᾀ◌̀
console.log(unorm.nfd($str)); // α◌̓◌̀◌ͅ
console.log(unorm.nfc($str)); // ᾂ

PHP Normalization

echo $str; # ᾀ◌̀

echo Normalizer::normalize($str, Normalizer::FORM_D); # α◌̓◌̀◌ͅ

echo Normalizer::normalize($str, Normalizer::FORM_C); # ᾂ

Grapheme Clusters

regex: /^.$/

string 1: ᾂ

string 2: α◌̓◌̀◌ͅ

Grapheme Clusters

regex: /^.$/

string 1: ᾂ ⇧

string 2: α◌̓◌̀◌ͅ ⇧

1. anchor beginning of string

Grapheme Clusters

regex: /^.$/

string 1: ᾂ ⇧

1. anchor beginning of string
2. match code point (excl. \n)

Grapheme Clusters

regex: /^.$/

string 1: ᾂ ⇧⇧

1. anchor beginning of string
2. match code point (excl. \n)
3. anchor at end of string

Grapheme Clusters

regex: /^.$/

1. anchor beginning of string
2. match code point (excl. \n)
3. anchor at end of string
4. 1 success but 1 failure: mixed results
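The mixed result is easy to reproduce. A small sketch, assuming a JavaScript runtime that supports the `u` regex flag; the variable names are made up for illustration. Both strings are canonically equivalent, yet `.` only ever consumes one code point.

```javascript
const singleCodePoint = '\u1F82';                    // string 1: one precomposed code point
const fourCodePoints = singleCodePoint.normalize('NFD'); // string 2: alpha plus three combining marks

// With the u flag, "." matches exactly one code point (other than a line terminator).
const anyOneCodePoint = /^.$/u;

console.log(anyOneCodePoint.test(singleCodePoint)); // true
console.log(anyOneCodePoint.test(fourCodePoints));  // false
console.log([...fourCodePoints].length);            // 4
```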

Grapheme Clusters

regex: /^\X$/

string 1: ᾂ

Grapheme Clusters

regex: /^\X$/

string 1: ᾂ ⇧

1. anchor beginning of string

Grapheme Clusters

regex: /^\X$/

string 1: ᾂ ⇧

1. anchor beginning of string
2. match grapheme cluster

Grapheme Clusters

regex: /^\X$/

string 2: α◌̓◌̀◌ͅ ⇧ ⇧

1. anchor beginning of string
2. match grapheme cluster
3. anchor at end of string

Grapheme Clusters

regex: /^\X$/

string 2: α◌̓◌̀◌ͅ ⇧ ⇧

1. anchor beginning of string
2. match grapheme cluster
3. anchor at end of string
4. success!

Perl

use v5.12;                            # better yet: v5.14
use utf8;
use charnames qw( :full );            # unless v5.16
use open qw( :encoding(UTF-8) :std );

$str =~ /^\X$/;

$str =~ s/^(\X)$/->$1<-/;

PHP

preg_match('/^\X$/u', $str);

preg_replace('/^(\X)$/u', '->$1<-', $str);

JavaScript

[This slide intentionally left blank.]
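JavaScript regular expressions have no `\X`, but newer runtimes ship `Intl.Segmenter`, which can stand in for the `/^\X$/` test. The following is a rough sketch under that assumption (a runtime that provides `Intl.Segmenter`, such as a recent Node.js); the helper names `graphemeClusters` and `isSingleGraphemeCluster` are hypothetical.

```javascript
// Split a string into grapheme clusters using the runtime's segmentation support.
function graphemeClusters(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return Array.from(segmenter.segment(str), (part) => part.segment);
}

// Rough counterpart of /^\X$/: the whole string is exactly one grapheme cluster.
function isSingleGraphemeCluster(str) {
  return graphemeClusters(str).length === 1;
}

console.log(isSingleGraphemeCluster('\u1F82'));                   // true
console.log(isSingleGraphemeCluster('\u03B1\u0313\u0300\u0345')); // true
console.log(isSingleGraphemeCluster('ab'));                       // false
```

This is segmentation rather than pattern matching, so it covers the whole-string test but does not embed a cluster match inside a larger regex.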

Match Any Character

two bytes (if byte mode): е..и
code point (excl. \n): е.и
code point (incl. \n): е\p{Any}и
grapheme cluster (incl. \n): е\Xи

Match Any Letter

letter code point: е\p{General_Category=Letter}и
letter code point: е\pLи
Cyrillic code point: е\p{Script=Cyrillic}и
Cyrillic code point: е\p{Cyrillic}и

letter grapheme cluster: е(?=\pL)\Xи

regex: / о \p{Cyrillic} т /x

string 1: който

string 2: кои◌̆то

regex: / о \p{Cyrillic} т /x

1. match letter о

regex: / о \p{Cyrillic} т /x

1. match letter о
2. match Cyrillic letter (1 code point)

regex: / о \p{Cyrillic} т /x

1. match letter о
2. match Cyrillic letter (1 code point)
3. match letter т

regex: / о \p{Cyrillic} т /x

1. match letter о
2. match Cyrillic letter (1 code point)
3. match letter т
4. 1 success but 1 failure: mixed results

regex: / о (?= \p{Cyrillic} ) \X т /x

regex: / о (?= \p{Cyrillic} ) \X т /x

1. match letter о

regex: / о (?= \p{Cyrillic} ) \X т /x

string 1: който ⇧

string 2: кои◌̆то ⇧

1. match letter о
2. positive lookahead Cyrillic letter (1 code point)

regex: / о (?= \p{Cyrillic} ) \X т /x

string 2: кои◌̆то ⇧

1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
3. match grapheme cluster (1+ code points)

regex: / о (?= \p{Cyrillic} ) \X т /x

string 2: кои◌̆то ⇧

1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
3. match grapheme cluster (1+ code points)
4. match letter т

regex: / о (?= \p{Cyrillic} ) \X т /x

string 2: кои◌̆то ⇧

1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
3. match grapheme cluster (1+ code points)
4. match letter т
5. success!
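The same contrast can be sketched in JavaScript, with one loudly stated substitution: JavaScript has no `\X`, so the second pattern uses `\P{M}\p{M}*` (one non-mark followed by any marks) as an approximation of a grapheme cluster. That approximation is an assumption of this sketch, not the full segmentation rule, and the variable names are invented for illustration.

```javascript
const wordPrecomposed = 'ко\u0439то';      // string 1: short i as a single code point
const wordDecomposed = 'ко\u0438\u0306то'; // string 2: и followed by COMBINING BREVE

// One Cyrillic code point between о and т.
const oneCodePoint = /о\p{Script=Cyrillic}т/u;

// Look ahead for a Cyrillic code point, then consume a base character plus its marks.
// The combining breve is not itself in the Cyrillic script, hence the lookahead.
const oneCluster = /о(?=\p{Script=Cyrillic})\P{M}\p{M}*т/u;

console.log(oneCodePoint.test(wordPrecomposed)); // true
console.log(oneCodePoint.test(wordDecomposed));  // false
console.log(oneCluster.test(wordPrecomposed));   // true
console.log(oneCluster.test(wordDecomposed));    // true
```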

Character Literals

[ یي ]

(?: ی|ي )

Character Literals

[ یي ]

(?: ی|ي )

Character Literals

[ یي ]

(?: ی|ي )

[\x{064A}\x{06CC}]

Character Literals

[ یي ]

(?: ی|ي )

[\x{064A}\x{06CC}]

[\N{ARABIC LETTER YEH}\N{ARABIC LETTER FARSI YEH}]
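For comparison, here is a hedged JavaScript sketch of the code point form. It assumes the `u` flag, which enables the `\u{...}` escape; the character names appear only as comments, and the constant names are hypothetical.

```javascript
const YEH = '\u{064A}';       // ARABIC LETTER YEH
const FARSI_YEH = '\u{06CC}'; // ARABIC LETTER FARSI YEH

// Character class and alternation written with code point escapes
// instead of hard-to-distinguish literal characters.
const yehClass = /^[\u{064A}\u{06CC}]$/u;
const yehAlternation = /^(?:\u{06CC}|\u{064A})$/u;

console.log(yehClass.test(YEH), yehClass.test(FARSI_YEH));             // true true
console.log(yehAlternation.test(YEH), yehAlternation.test(FARSI_YEH)); // true true
```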

Properties

\p{Script=Latin}

Name: Script
Value: Latin

Match any code point with the value “Latin” for the Script property.

Properties

\P{Script=Latin}

Name: Script
Value: not Latin

Negated form: match any code point without the value “Latin” for the Script property.

Properties

\p{Latin}

Name: Script (implicit)
Value: Latin

The Script and General Category properties don't require the name because they're so common and their values don't conflict.

Properties

\p{General_Category=Letter}

Name: General Category
Value: Letter

Match any code point with the value “Letter” for the General Category property.

Properties

\p{gc=Letter}

Name: General Category (gc)
Value: Letter

Property names may be abbreviated.

Properties

\p{gc=L}

Name: General Category (gc)
Value: Letter (L)

The General Category property is so commonly used that its values all have standard abbreviations.

Properties

\p{L}

Name: General Category (implicit)
Value: Letter (L)

And the General Category values may even be used on their own, like the Script values. The values of these two properties never overlap, so the bare form is unambiguous.

Properties

\pL

Name: General Category (implicit)
Value: Letter (L)

Single-character General Category values don't require curly braces.

Properties

\PL

Name: General Category (implicit)
Value: not Letter (L)

Don't forget negation!
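Most of these property forms carry over to JavaScript regexes with the `u` flag, with two exceptions: the bare Script value (`\p{Latin}`) and the brace-less form (`\pL`) are rejected there. A minimal sketch under that assumption; `compiles` is a hypothetical helper that reports whether a pattern is accepted.

```javascript
const latin = /^\p{Script=Latin}$/u;
const notLatin = /^\P{Script=Latin}$/u;

// Four spellings of "any letter", from longest to shortest.
const letterForms = [
  /^\p{General_Category=Letter}$/u,
  /^\p{gc=Letter}$/u,
  /^\p{gc=L}$/u,
  /^\p{L}$/u,
];
const notLetter = /^\P{L}$/u;

// Returns true when the pattern source is accepted with the u flag.
function compiles(source) {
  try {
    new RegExp(source, 'u');
    return true;
  } catch (err) {
    return false;
  }
}

console.log(latin.test('a'), latin.test('я'));        // true false
console.log(notLatin.test('я'));                      // true
console.log(letterForms.every((re) => re.test('я'))); // true
console.log(notLetter.test('1'));                     // true
console.log(compiles('\\p{Latin}'), compiles('\\pL')); // false false
```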
