Handling Non-English Text in Python
Constanza F. Schibber
Washington University in Saint Louis
November 12, 2014
Text (WUStL) 1 / 23
Motivating Example
Al Honorable Congreso de la Nación. Tengo el agrado de dirigirme a vuestra honorabilidad con el objeto de someter a su consideración un proyecto de ley que propicia la creación de una (1) sala en la Cámara Nacional de Apelaciones en lo Comercial de la Capital Federal con el propósito de aliviar el exceso de causas que tramitan ante este fuero judicial. Como es de conocimiento público y resulta de las estadísticas oficiales publicadas por la Corte Suprema de Justicia de la Nación, el desmedido incremento en el ingreso de causas en la justicia nacional en lo comercial de la Capital Federal ha sumido al fuero en una situación de abrumadora sobrecarga y eventual colapso, poniendo en serio riesgo el correcto cumplimiento de la labor jurisdiccional. El crecimiento de las transacciones comerciales, la globalización de la economía y las sucesivas crisis que sufrió nuestro país hicieron que también la Cámara Nacional de Apelaciones en lo Comercial de la Capital Federal sufriera una demanda a la cual su estructura no puede dar respuesta temporal adecuada, teniendo en cuenta, también, la marcada complejidad de algunos litigios promovidos o por promoverse. La última sala creada por la ley 22.189 en la Cámara Comercial (Sala E) data de marzo de 1980.
A Guide to Handling Non-English Text in Python
1 Encoding and Character Sets: ASCII, Unicode, UTF-8, etc.
2 Python Strings: Bytes and Unicode
3 Web-Mining & HTML Encoding
4 Reading and Saving Files
5 Unix
6 Text Processing
Encoding and Character Sets
In a computer, a text file is just a row of numbers (and often letters) that may represent text in some specific encoding.
ASCII Appears in the 60's
Based on the English alphabet
Encodes 128 specified characters into 7-bit binary integers
Other Languages + What to do with the extra bit? = Chaos
8-bit binary integers
For some languages, 1 byte is enough to add the extra characters.
- But, without coordination, a plethora of coding schemes appears.
- With the personal computer come ISO-8859-1, also known as Latin-1 (French, Spanish, German, etc.), ISO-8859-7 (Greek), ISO-8859-8 (Hebrew), etc.
- Microsoft has its own character encoding: CP-1252 (Latin alphabet), a superset of the printable characters of ISO-8859-1.
East Asian languages need more than one byte per character: DBCS (Double-Byte Character Set).
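The "chaos" can be made concrete with a small sketch (Python 3 syntax; the variable names are illustrative and not from the slides). It decodes the same single byte under four of the encodings named above and prints the code point each one yields:

```python
# One byte, several meanings: 0xE1 decoded under four single-byte encodings.
raw = b"\xe1"
encodings = ["latin-1", "cp1252", "iso-8859-7", "iso-8859-8"]

# Map each encoding name to the character it assigns to the byte 0xE1.
decoded = {name: raw.decode(name) for name in encodings}

for name, char in decoded.items():
    # Print the code point rather than the glyph, so the output does not
    # depend on the terminal's own encoding.
    print("%-11s -> U+%04X" % (name, ord(char)))
```

The byte is identical in every case; only the assumed encoding changes which character it stands for.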
Unicode Appears in the late-80’s
New Paradigm: Unicode maps every letter of every language to a
unique number
Upper-case D is U+0044
Lowercase d is U+0064
Hello → U+0048 U+0065 U+006C U+006C U+006F
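The mapping above can be reproduced with a minimal sketch (Python 3 syntax; the helper name code_points is made up for illustration). It uses the built-in ord to look up each character's code point:

```python
# Map each character of a string to its Unicode code point (Python 3).
def code_points(text):
    """Return the code points of `text` in U+XXXX notation."""
    return ["U+%04X" % ord(ch) for ch in text]

print(" ".join(code_points("Hello")))
print(code_points("D"), code_points("d"))
```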
Unicode UTF-8
1 Used by 81.4% of all Web pages (November 2014)
2 The most popular East Asian encoding is at 1.4%, and all of them combined are under 5% (November 2014)
3 UTF-8 is the recommended default encoding for XML and HTML
UTF+: UTF-16 & UTF-32
Note: Windows does not use UTF-8 by default. Internally it uses UTF-16, and many Windows programs save text as UTF-16 or in a legacy code page such as CP-1252. Keep this in mind when using documents created on a Windows machine or saving files on a Windows machine.
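To see how UTF-8, UTF-16, and UTF-32 differ in practice, here is a small sketch (Python 3 syntax; byte_sizes is a hypothetical helper). It encodes one character at a time and counts the bytes each encoding needs. The explicit little-endian codec names are used so that no byte-order mark is added to the count:

```python
# Compare how many bytes one character takes in each UTF encoding.
def byte_sizes(char):
    """Return {encoding: number of bytes} for a single character."""
    return {enc: len(char.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# An ASCII letter, an accented Latin letter, a Greek letter, a CJK character.
for char in ["a", "\u00e1", "\u03b1", "\u4e2d"]:
    print("U+%04X" % ord(char), byte_sizes(char))
```

UTF-8 keeps ASCII text at one byte per character, which is one reason it became the dominant encoding on the Web.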
Character Sets over Time
Source: W3 Technology Surveys
Python Encoding Settings
Call Python through the command line and type...
    import sys
    sys.stdout.encoding
    > UTF-8  # My result
It will tell you the encoding Python uses when printing to your terminal.
Specifying the encoding of your Python file:
    # -*- coding: utf-8 -*-
Otherwise, Python 2 will assume ASCII (Python 3 assumes UTF-8). This line only affects the characters of the source code.
Unix and Windows: https://wiki.python.org/moin/PrintFails
Potential Problems and Solutions
Byte and Unicode Strings
HTML encoding
Reading and saving files
Taking advantage of UNIX
Text Processing
Types of Strings: Bytes and Unicode
The example string is αά. We encode into byte strings and decode into Unicode.
    # -*- coding: utf-8 -*-
    # Python 2

    # Bytes
    a = u"αά".encode('utf-8')  # encode a Unicode literal into UTF-8 bytes
    b = 'αά'                   # byte string, stored in the source file's encoding
    c = '\xce\xb1\xce\xac'     # the same four bytes, written out explicitly

    print len(a)

    # Unicode
    d = a.decode('utf-8')  # unicode
    e = u'αά'              # unicode

    print len(e)
Conclusion: Python might print αά correctly for a, b, c, d, and e, but the objects do not have the same length. The length of a, b, and c is 4 (bytes); the length of d and e is 2 (characters). So a[1] is not equal to d[1].
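For comparison, the same experiment can be sketched in Python 3, where the two kinds of string are distinct types (str for Unicode text, bytes for encoded data). The variable names are illustrative:

```python
# Python 3: str holds Unicode characters, bytes holds encoded data.
text = "\u03b1\u03ac"        # the two-character string from the slide
raw = text.encode("utf-8")   # four bytes: b'\xce\xb1\xce\xac'

print(len(text), len(raw))   # characters vs. bytes
print(raw)

# Indexing differs too: text[1] is a character, raw[1] is an integer byte value.
print(repr(text[1]), raw[1])

# Decoding with the right encoding restores the original text.
print(raw.decode("utf-8") == text)
```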
Common Python Errors
    # Python 2
    # Trying to compare different encodings (see previous slide)
    print a[1] == e[1]
    UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

    # You are using a character not supported by ASCII
    UnicodeEncodeError: 'ascii' codec can't encode character in position 0: ordinal not in range(128)
Potential Solution: Assess the encoding. Open and save the file properly (more soon!). Then start working with the text.
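The second error is easy to reproduce on purpose. The sketch below (Python 3 syntax; the sample string is borrowed from the motivating example, the variable names are made up) triggers the error and then shows three ways around it using the standard errors argument of str.encode:

```python
# Reproduce UnicodeEncodeError and show some ways to deal with it.
text = "C\u00e1mara Nacional"   # "Cámara Nacional"

try:
    text.encode("ascii")        # strict by default: fails on the accented letter
except UnicodeEncodeError as exc:
    print("strict ASCII failed:", exc.reason)

# Lossy options: substitute or escape what ASCII cannot represent.
lossy = text.encode("ascii", errors="replace")
escaped = text.encode("ascii", errors="backslashreplace")

# Usually the better option: pick an encoding that covers the characters.
utf8 = text.encode("utf-8")

print(lossy, escaped, utf8)
```

Replacing characters silences the error but damages the text, so choosing a suitable encoding such as UTF-8 is normally preferable.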
Web Mining: HTML & XML Encoding
Some websites provide encoding information in an HTML meta tag:
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Or in the XML declaration:
    <?xml version="1.0" encoding="ISO-8859-1"?>
urllib2 itself returns raw bytes and does not detect the encoding, but parsers such as BeautifulSoup read these declarations to detect the encoding of the text and convert it to Unicode.
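As a rough illustration of what "reading the declaration" means, here is a minimal sketch in Python 3. The function name declared_encoding, the regular expressions, and the sample page are all assumptions made for this example; real parsers are considerably more careful:

```python
import re

# Patterns for the two declarations shown above (simplified on purpose).
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)
XML_ENCODING = re.compile(rb'<\?xml[^>]+encoding=["\']([\w-]+)["\']', re.IGNORECASE)

def declared_encoding(raw, default="utf-8"):
    """Return the encoding named in an XML declaration or HTML meta tag."""
    head = raw[:2048]  # declarations are expected near the top of the document
    for pattern in (XML_ENCODING, META_CHARSET):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii")
    return default

# A made-up page, stored as Latin-1 bytes the way a server might send it.
page = ('<head><meta http-equiv="Content-Type" '
        'content="text/html; charset=ISO-8859-1"></head>'
        '<p>C\u00e1mara</p>').encode("latin-1")

encoding = declared_encoding(page)
text = page.decode(encoding)
print(encoding, text)
```

A declaration can be missing or wrong, so the decoded text still deserves a visual check.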
Web Mining & Multiple Languages. Wikipedia Example.
    # Python 2
    import urllib2
    from BeautifulSoup import BeautifulSoup
    import codecs
    import os

    def create_soup(url):
        hdr = {'User-Agent': 'Mozilla/5.0'}
        req = urllib2.Request(url, headers=hdr)
        page = urllib2.urlopen(req)
        html = page.read()
        page.close()
        return BeautifulSoup(html)

    soup = create_soup('http://en.wikipedia.org/wiki/Political_science')

    # Keep only the links that carry a lang attribute (the interlanguage links)
    links = soup.findAll('a', lang=True)

    for link in links:
        print link['lang'] + link['title']
Saving and Opening Files. Wikipedia Example
To write a file with the text, we have to specify the encoding,
    # Python 2
    import codecs

    the_file = codecs.open('/Users/wikipedia_languagues.txt', 'w', encoding='utf-8')
    for item in links:
        the_file.write(u"%s\n" % item)  # u"..." keeps the text Unicode so codecs can encode it
    the_file.close()
Note: In Python 3.X you do open('file', 'w', encoding='utf-8').
To read the file, we also have to specify the encoding,
    my_file = codecs.open('/Users/wikipedia_languagues.txt', 'r', 'utf-8', errors='strict')
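The Python 3 variant mentioned in the note can be sketched as a full round trip. The sample lines and the temporary file path below are made up for the example; only open with its encoding and errors arguments is the point:

```python
import os
import tempfile

# Made-up sample lines standing in for the scraped link titles.
lines = ["es: Ciencia pol\u00edtica",
         "de: Politikwissenschaft",
         "el: \u03a0\u03bf\u03bb\u03b9\u03c4\u03b9\u03ba\u03ae"]

# Write to a temporary location so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "languages.txt")

# Writing: state the encoding explicitly instead of relying on the platform default.
with open(path, "w", encoding="utf-8") as handle:
    for line in lines:
        handle.write(line + "\n")

# Reading: use the same encoding; errors="strict" raises on undecodable bytes.
with open(path, "r", encoding="utf-8", errors="strict") as handle:
    restored = handle.read().splitlines()

print(restored == lines)
```

Using a with block closes the file even if writing fails halfway, which the explicit close() call does not guarantee.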
Web Mining & Encoding Problems. Using HTMLParser.
    # Python 2
    import HTMLParser
    import mechanize
    import nltk

    def create_soup(url):
        br = mechanize.Browser()
        br.set_handle_robots(False)  # ignore robots.txt
        br.addheaders = [('User-agent', 'Mozilla/5.0')]
        page = br.open(url)
        html = page.read()
        br.close()  # close the browser before returning
        # The following 2 lines use HTMLParser
        html_parser = HTMLParser.HTMLParser()
        unescaped = html_parser.unescape(html)
        raw_text = nltk.clean_html(unescaped)  # nltk.clean_html is part of NLTK 2.x
        return raw_text

    link = 'http://www3.hcdn.gov.ar/folio-cgi-bin/om_isapi.dll?advquery=0029-PE-05&infobase=tp.nfo&record=%7BA4F7%7D&recordswithhits=on&softpage=proyecto'
    raw_text = create_soup(link)
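The unescaping step is what turns HTML character references back into real characters. In Python 3 the equivalent of HTMLParser's unescape is the standard-library function html.unescape; the sketch below applies it to a made-up fragment rather than to the page scraped above:

```python
import html

# A made-up fragment in which accented letters arrive as HTML character references.
raw = b"C&aacute;mara Nacional de Apelaciones &amp; Naci&oacute;n &#241;"

# Decode the bytes first, then convert the references into Unicode characters.
text = html.unescape(raw.decode("ascii"))
print(text)
```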
Web Mining & Encoding Problems. Cont.
Writing the file required a bit of trial and error. The correct encoding ended up being 'utf-16'.
    # Python 2
    import codecs
    import os

    # path_save and filename are assumed to be defined earlier
    file_obj = codecs.open(os.path.join(path_save, filename), 'w', 'utf-16')
    file_obj.write(raw_text)
    file_obj.close()
Finding Out a File’s Encoding with Unix
Open the terminal and type,
    file 0078-D-07.txt --mime-encoding
    > 0078-D-07.txt: utf-16le  # answer
where 0078-D-07.txt is the file I created on the previous slide.
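When the file command is not available, one cheap check that can be done from Python is to look for a byte-order mark (BOM) at the start of the file. This is only a sketch of that single heuristic, not a description of how file works internally; sniff_bom and the BOMS table are names invented for this example, and files without a BOM will simply return None:

```python
import codecs

# Known byte-order marks. UTF-32 comes before UTF-16 because the UTF-32
# little-endian BOM begins with the same two bytes as the UTF-16 one.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(raw):
    """Return a codec name if `raw` starts with a known BOM, else None."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None

# Python's 'utf-16' codec writes a BOM, so such data is recognised.
sample = "proyecto de ley".encode("utf-16")
print(sniff_bom(sample), sniff_bom(b"plain ascii"))
```

To apply it to a file, read the first few bytes in binary mode, for instance with open(path, 'rb').read(4), and pass them to the function.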
Text Processing
The hard part is saving the text with the correct encoding.
If you have a corpus, you can open and re-save all the files in a few seconds. Highly recommended.
Natural Language Toolkit (NLTK).
Take into consideration the characteristics of the language, such as whether the same word can be written in different ways (e.g., Tschüss and Tschüß).
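One possible way to treat such spelling variants as the same token is sketched below in Python 3. It combines Unicode normalization (so that a letter typed as one code point or as letter plus combining accent compares equal) with str.casefold, which among other things maps ß to ss. The helper name normalize_token is made up, and whether folding ß is appropriate depends on the analysis:

```python
import unicodedata

def normalize_token(token):
    """Compose combining accents (NFC), then apply aggressive case folding."""
    return unicodedata.normalize("NFC", token).casefold()

variant_a = normalize_token("Tsch\u00fc\u00df")   # spelled with ü and ß
variant_b = normalize_token("Tsch\u00fcss")       # spelled with ü and ss
variant_c = normalize_token("Tschu\u0308ss")      # u followed by a combining diaeresis

print(variant_a == variant_b == variant_c)
```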
A Guide to Handling Non-English Text in Python
Am I able to print the text? Does it look alright?
- Yes. → Keep working
- No. → Problem!
For a website:
- See if the HTML or XML includes the encoding
- Try HTMLParser
For a file:
- Use codecs.open for Python 2.X
- Use open with the encoding argument for Python 3.X
- The errors option is very useful.
Always save the file with the appropriate encoding.
- How do I find the encoding? Start with UTF-8, then UTF-16.
- Trial & Error.
- Try using UNIX.
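The "start with UTF-8, then UTF-16" advice can be automated with a short loop. The following is a minimal sketch in Python 3; guess_decode and its default candidate list are assumptions for illustration. Latin-1 is placed last because it accepts any byte sequence and therefore never fails:

```python
def guess_decode(raw, candidates=("utf-8", "utf-16", "latin-1")):
    """Try each candidate encoding in order; return (text, encoding)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the bytes; try the next one
    raise ValueError("none of the candidate encodings could decode the data")

word = "C\u00e1mara"
from_utf8 = guess_decode(word.encode("utf-8"))
from_utf16 = guess_decode(word.encode("utf-16"))
print(from_utf8, from_utf16)
```

A decode that raises no error is not proof that the guess was right: some byte sequences decode without complaint under the wrong encoding and produce nonsense. The first question on this slide, whether the printed text looks alright, remains the final check.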