
Handling Non-English Text in Python

Constanza F. Schibber

Washington University in Saint Louis

November 12, 2014

Text (WUStL) 1 / 23

Motivating Example

Al Honorable Congreso de la Nación. Tengo el agrado de dirigirme a vuestra honorabilidad con el objeto de someter a su consideración un proyecto de ley que propicia la creación de una (1) sala en la Cámara Nacional de Apelaciones en lo Comercial de la Capital Federal con el propósito de aliviar el exceso de causas que tramitan ante este fuero judicial. Como es de conocimiento público y resulta de las estadísticas oficiales publicadas por la Corte Suprema de Justicia de la Nación, el desmedido incremento en el ingreso de causas en la justicia nacional en lo comercial de la Capital Federal ha sumido al fuero en una situación de abrumadora sobrecarga y eventual colapso, poniendo en serio riesgo el correcto cumplimiento de la labor jurisdiccional. El crecimiento de las transacciones comerciales, la globalización de la economía y las sucesivas crisis que sufrió nuestro país hicieron que también la Cámara Nacional de Apelaciones en lo Comercial de la Capital Federal sufriera una demanda a la cual su estructura no puede dar respuesta temporal adecuada, teniendo en cuenta, también, la marcada complejidad de algunos litigios promovidos o por promoverse. La última sala creada por la ley 22.189 en la Cámara Comercial (Sala E) data de marzo de 1980.


A Guide to Handling Non-English Text in Python

1 Encoding and Character Sets: ASCII, Unicode, UTF-8, etc.

2 Python Strings: Bytes and Unicode

3 Web-Mining & HTML Encoding

4 Reading and Saving Files

5 Unix

6 Text Processing


Encoding and Character Sets

In a computer, a text file is just a sequence of numbers (bytes) that represents text only when read with a specific encoding.
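To make this concrete, here is a minimal Python 3 sketch (the slides themselves use Python 2). The sample word is taken from the motivating example; the variable names are illustrative only.

```python
# Python 3 sketch: the same text becomes a different sequence of numbers
# depending on the encoding used to store it.
text = "Nación"

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

print(list(utf8_bytes))    # the numbers stored under UTF-8
print(list(latin1_bytes))  # the numbers stored under Latin-1

# Reading the UTF-8 numbers with the wrong encoding garbles the text.
misread = utf8_bytes.decode("latin-1")
print(misread)
```

The letter ó takes two bytes under UTF-8 and one under Latin-1, which is why guessing the wrong encoding produces garbage rather than an error.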


ASCII

Appears in the 1960s

Based on the English alphabet

Encodes 128 specified characters into 7-bit binary integers


Other Languages + What to Do with the Extra Bit? = Chaos

8-bit binary integers

For some languages, 1 byte is enough to add characters.

- But, without coordination, a plethora of coding schemes appears.

- With the personal computer came the ISO-8859 family: ISO-8859-1, also known as Latin-1 (French, Spanish, German, etc.), ISO-8859-7 (Greek), ISO-8859-8 (Hebrew), etc.

- Microsoft has its own character encoding: CP-1252 (Latin alphabet) is a superset of the printable characters of ISO-8859-1.

Asian Madness. DBCS (Double-Byte Character Set): some East Asian encodings need two bytes per character.


Unicode

Appears in the late 1980s

New paradigm: Unicode maps every character of every language to a unique number (a code point)

Upper-case D is U+0044

Lowercase d is U+0064

Hello → U+0048 U+0065 U+006C U+006C U+006F
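The mapping above can be checked from Python. This is a small Python 3 sketch; `code_points` is a hypothetical helper name, not something from the slides.

```python
def code_points(text):
    """Return the Unicode code point of each character in U+XXXX notation."""
    return ["U+{:04X}".format(ord(ch)) for ch in text]

print(code_points("D"))
print(code_points("d"))
print(code_points("Hello"))
```

The built-in `ord` returns the code point as an integer, and `chr` goes the other way.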


Unicode UTF-8

1 81.4% of all Web pages (November 2014)

2 Most popular East Asian encoding at 1.4% and all of them combined under 5% (November 2014)

3 Recommendation for UTF-8 to be the default encoding in XML and HTML

UTF+: UTF-16 & UTF-32

Note: Windows uses UTF-16 internally, and some Windows programs save text as UTF-16 (or in a legacy code page such as CP-1252) rather than UTF-8. Keep this in mind when using documents created on a Windows machine or saving files on a Windows machine.
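As a rough illustration of how the three UTF encodings differ, the Python 3 sketch below counts the bytes each one needs for a few sample strings. The helper name `encoded_sizes` and the samples are assumptions made for this example; the little-endian codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(text):
    """Number of bytes needed to store text under each UTF encoding."""
    # The "-le" variants do not prepend a byte order mark (BOM);
    # plain "utf-16" and "utf-32" would add 2 and 4 extra bytes.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}

for sample in ("Hello", "Nación", "αά"):
    print(sample, encoded_sizes(sample))
```

UTF-8 is the most compact for mostly-ASCII text, while UTF-32 always spends four bytes per code point.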


Character Sets over Time

Figure: character-set usage over time (chart not reproduced). Source: W3 Technology Surveys

Python Encoding Settings

Call Python through the command line and type:

    import sys
    sys.stdout.encoding
    > UTF-8  # My result

This tells you the encoding Python uses when printing to your terminal.
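A slightly fuller check, sketched for Python 3, gathers the related settings in one place. `report_encodings` is a hypothetical helper; the values it prints depend on your machine and terminal.

```python
import locale
import sys

def report_encodings():
    """Collect the encodings Python is currently using (values vary by system)."""
    return {
        # May be None if standard output has been replaced by another object.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Default used when encoding/decoding without an explicit argument.
        "default": sys.getdefaultencoding(),
        # Used for file names.
        "filesystem": sys.getfilesystemencoding(),
        # What the operating system locale prefers for text files.
        "locale": locale.getpreferredencoding(False),
    }

for name, value in report_encodings().items():
    print(name, value)
```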

Specifying the encoding of your Python file:

    # -*- coding: utf-8 -*-

Otherwise, Python 2 assumes ASCII. This line only affects how the characters in the source code are read.

Unix and Windows: https://wiki.python.org/moin/PrintFails


Potential Problems and Solutions

Byte and Unicode Strings

HTML encoding

Reading and saving files

Taking advantage of UNIX

Text Processing


Types of Strings: Bytes and Unicode

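The text below is αά (Greek alpha followed by alpha with an accent). We encode into byte strings and decode into Unicode.

    # -*- coding: utf-8 -*-
    # Python 2

    # Bytes
    a = u"αά".encode('utf-8')   # u prefix needed: a plain str would first be decoded as ASCII and fail
    b = 'αά'                    # byte string (UTF-8 bytes, given the coding line above)
    c = '\xce\xb1\xce\xac'      # the same four bytes written out explicitly

    print len(a)                # 4

    # Unicode
    d = a.decode('utf-8')       # unicode
    e = u'αά'                   # unicode

    print len(e)                # 2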

Conclusion: Python might print αά correctly for a, b, c, d, and e, but the objects do not have the same length. The length of a, b, and c is 4 (bytes); the length of d and e is 2 (characters). So a[1] is not equal to d[1].
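For comparison, here is how the same distinction looks in Python 3, where the two kinds of string are separate types (`bytes` and `str`). This is a sketch for readers on Python 3, not part of the original Python 2 example.

```python
text = "αά"                  # str: a sequence of Unicode characters
raw = text.encode("utf-8")   # bytes: the UTF-8 encoding of those characters

print(len(text))   # number of characters
print(len(raw))    # number of bytes

# Indexing bytes gives an integer; indexing str gives a one-character string.
second_byte = raw[1]
second_char = text[1]
print(second_byte, second_char)

# Decoding the bytes recovers the original text.
decoded = raw.decode("utf-8")
print(decoded == text)
```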


Common Python Errors

    # Trying to compare different encodings (see previous slide)
    print a[1] == e[1]
    UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

    # You are using a character not supported by ASCII
    UnicodeEncodeError: 'ascii' codec cannot encode character in position 0: ordinal not in range(128)

Potential solution: Assess the encoding. Open and save the file properly (more soon!). Then start working with the text.
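The second error is easy to reproduce and inspect. The Python 3 sketch below triggers it on purpose, shows two of the standard `errors` handlers as possible workarounds, and compares bytes with text only after decoding. Variable names are illustrative.

```python
text = "αά"

# Encoding non-ASCII text as ASCII fails.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    failure = (exc.encoding, exc.start)   # which codec failed, and where
    print("encode failed:", exc)

# The errors argument controls what happens instead of raising.
replaced = text.encode("ascii", errors="replace")
escaped = text.encode("ascii", errors="backslashreplace")
print(replaced, escaped)

# To compare bytes with text, decode the bytes first.
raw = b"\xce\xb1\xce\xac"
same = raw.decode("utf-8") == text
print(same)
```

Both handlers lose or alter information, so they are a fallback for display, not a substitute for knowing the correct encoding.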



Web Mining: HTML & XML Encoding

Some websites provide encoding information in an HTML meta tag:

    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Or in the XML declaration:

    <?xml version="1.0" encoding="ISO-8859-1"?>

Hence, parsers such as BeautifulSoup can usually detect the encoding of the text automatically. urllib2 itself only returns the raw bytes.
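When no parser is at hand, the declaration can be read directly from the raw bytes. The following Python 3 sketch is a simplified stand-in for what real parsers do; `sniff_declared_encoding`, its regular expressions, and the 2048-byte window are assumptions for illustration and will miss unusual markup.

```python
import re

# Simplified patterns for the two declarations shown above.
_XML_ENCODING = re.compile(rb"""<\?xml[^>]*encoding\s*=\s*["']([\w-]+)["']""", re.IGNORECASE)
_META_CHARSET = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)

def sniff_declared_encoding(raw, default="utf-8"):
    """Return the encoding a page declares for itself, or default if none is found."""
    head = raw[:2048]  # declarations are expected near the top of the document
    for pattern in (_XML_ENCODING, _META_CHARSET):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return default

page = b'<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
print(sniff_declared_encoding(page))
```

A declared encoding is only a claim by the page author, so the decoded text should still be checked by eye.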


Web Mining & Multiple Languages. Wikipedia Example.

    import urllib2
    from BeautifulSoup import BeautifulSoup
    import codecs
    import os

    def create_soup(url):
        # Send a browser-like User-Agent header with the request
        hdr = {'User-Agent': 'Mozilla/5.0'}
        req = urllib2.Request(url, headers=hdr)
        page = urllib2.urlopen(req)
        html = page.read()
        page.close()
        return BeautifulSoup(html)

    soup = create_soup('http://en.wikipedia.org/wiki/Political_science')

    # All <a> tags that have a lang attribute (links to other language editions)
    links = soup.findAll('a', lang=True)

    for link in links:
        print link['lang'] + link['title']
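The same link extraction can be tried offline with only the Python 3 standard library. This sketch swaps BeautifulSoup for `html.parser` and uses a small hand-written HTML fragment instead of the live Wikipedia page; the class name `LangLinkCollector` and the sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser

class LangLinkCollector(HTMLParser):
    """Collect (lang, title) pairs from <a> tags that carry a lang attribute."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        if "lang" in attributes:
            self.pairs.append((attributes["lang"], attributes.get("title", "")))

# Hand-written sample resembling Wikipedia's interlanguage links.
sample_html = (
    '<ul>'
    '<li><a href="#es" lang="es" title="Ciencia política">Español</a></li>'
    '<li><a href="#el" lang="el" title="Πολιτική επιστήμη">Ελληνικά</a></li>'
    '<li><a href="#main" title="Main Page">Main page</a></li>'
    '</ul>'
)

collector = LangLinkCollector()
collector.feed(sample_html)
for lang, title in collector.pairs:
    print(lang, title)
```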


Saving and Opening Files. Wikipedia Example

To write a file with the text, we have to specify the encoding,

    import codecs

    the_file = codecs.open('/Users/wikipedia_languagues.txt', 'w', encoding='utf-8')
    for item in links:
        the_file.write("%s\n" % item)
    the_file.close()

Note: In Python 3.X you use open('file', 'w', encoding='utf-8').

To read the file, we also have to specify the encoding,

    my_file = codecs.open('/Users/wikipedia_languagues.txt', 'r', 'utf-8', errors='strict')
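In Python 3 the built-in `open` takes the encoding directly. The sketch below writes a few titles and reads them back; the sample titles and the temporary file location are made up for the example and stand in for the scraped links.

```python
import os
import tempfile

# Stand-in data for the scraped links (illustrative only).
titles = ["es: Ciencia política", "de: Politikwissenschaft", "el: Πολιτική επιστήμη"]

# Write to a temporary directory so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "wikipedia_languages.txt")

with open(path, "w", encoding="utf-8") as handle:
    for item in titles:
        handle.write("%s\n" % item)

# errors="strict" (the default) raises if the bytes are not valid UTF-8.
with open(path, "r", encoding="utf-8", errors="strict") as handle:
    read_back = [line.rstrip("\n") for line in handle]

print(read_back == titles)
```

Using `with` closes the file even if writing fails, which the explicit `close()` call in the Python 2 version does not guarantee.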


Web Mining & Encoding Problems. Using HTMLParser.

    import HTMLParser
    import mechanize
    import nltk

    def create_soup(url):
        br = mechanize.Browser()
        br.set_handle_robots(False)  # ignore robots.txt
        br.addheaders = [('User-agent', 'Mozilla/5.0')]
        page = br.open(url)
        html = page.read()
        br.close()  # close before returning; after the return it would never run
        # The following 2 lines use HTMLParser
        html_parser = HTMLParser.HTMLParser()
        unescaped = html_parser.unescape(html)
        # clean_html is available in NLTK 2.x; NLTK 3 no longer implements it
        raw_text = nltk.clean_html(unescaped)
        return raw_text

    link = 'http://www3.hcdn.gov.ar/folio-cgi-bin/om_isapi.dll?advquery=0029-PE-05&infobase=tp.nfo&record=%7BA4F7%7D&recordswithhits=on&softpage=proyecto'
    raw_text = create_soup(link)
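The unescaping step is what turns HTML entities back into accented characters. In Python 3 the equivalent call is `html.unescape`; the sample string below is invented for the example.

```python
import html

# Accented characters are often written as named or numeric entities in HTML.
escaped = "C&aacute;mara Nacional de Apelaciones &amp; Naci&#243;n"
unescaped = html.unescape(escaped)

print(unescaped)
```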


Web Mining & Encoding Problems. Cont.

Writing the file required a bit of trial and error. The correct encoding ended up being 'utf-16'.

    import codecs
    import os

    # path_save and filename are the output directory and file name
    fileobj = codecs.open(os.path.join(path_save, filename), 'w', 'utf-16')
    fileobj.write(raw_text)
    fileobj.close()


Finding Out a File’s Encoding with Unix

Open the terminal and type,

    file 0078-D-07.txt --mime-encoding
    > 0078-D-07.txt: utf-16le  # answer

where 0078-D-07.txt is the file I created on the previous slide.
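Part of this check can be approximated in Python by looking for a byte order mark (BOM) at the start of the data. This is a Python 3 sketch of one heuristic only, not a reimplementation of the `file` command; `encoding_from_bom` is a hypothetical helper and returns `None` for files without a BOM, which includes most UTF-8 files.

```python
import codecs

# UTF-32 marks are tested before UTF-16 because the UTF-32-LE mark
# begins with the same two bytes as the UTF-16-LE mark.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def encoding_from_bom(raw):
    """Guess an encoding from a leading byte order mark, or return None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None

sample = codecs.BOM_UTF16_LE + "Nación".encode("utf-16-le")
print(encoding_from_bom(sample))
```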


Text Processing

The hard part is saving the text with the correct encoding.

If you have a corpus, you can open and save all the files in a few seconds. Highly recommended.

Natural Language Toolkit (NLTK).

Take into consideration characteristics of the language, such as whether the same word can be written in different ways (e.g., Tschüss and Tschüß).
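One way to treat such spelling variants as the same token, sketched in Python 3 with the standard library, is to normalize the Unicode form and then case-fold. `normalize_word` is a hypothetical helper; whether folding ß to ss is appropriate depends on the analysis.

```python
import unicodedata

def normalize_word(word):
    """Map spelling variants to one comparable form.

    NFC composes characters such as u + combining diaeresis into ü;
    casefold lowercases and also maps ß to ss.
    """
    return unicodedata.normalize("NFC", word).casefold()

variants = ["Tschüss", "Tschüß", "Tschu\u0308ss"]  # last one uses a combining diaeresis
print({normalize_word(word) for word in variants})
```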


A Guide to Handling Non-English Text in Python

Am I able to print the text? Does it look alright?
- Yes. → Keep working
- No. → Problem!

For a website:
- See if HTML or XML includes the encoding
- Try HTMLParser

For a file:
- Use codecs.open for Python 2.X
- Use open with the encoding argument for Python 3.X
- The errors option is very useful.

Always save the file with the appropriate encoding.
- How do I find the encoding? Start with UTF-8, then UTF-16.
- Trial & error.
- Try using UNIX.
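The trial-and-error step can be written as a loop over candidate encodings. This Python 3 sketch follows the order suggested above; `decode_by_trial` and its default candidate list are assumptions for illustration.

```python
def decode_by_trial(raw, candidates=("utf-8", "utf-16")):
    """Return (text, encoding) for the first candidate that decodes without error."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked: %r" % (candidates,))

# A decode that succeeds is not proof that the encoding is right:
# UTF-16 in particular accepts many byte sequences and returns nonsense.
# Always print the result and check that it looks right.
text, used = decode_by_trial("Nación".encode("utf-16"))
print(used, text)
```

Trying UTF-8 first is a sensible default because it is strict: bytes written in another encoding rarely happen to be valid UTF-8.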


References

David C. Zentgraf, What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text. http://kunststube.net/encoding/

Python, Unicode HOWTO. https://docs.python.org/2/howto/unicode.html

Steven Bird, Ewan Klein, and Edward Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit (O'Reilly or online version). http://www.nltk.org/book/ch03.html
