UNICODE & CONTROL DAY 13 - 9/24/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane...
Unicode & control
Day 13 - 9/24/14
LING 3820 & 6820
Natural Language Processing
Harry Howard
Tulane University
Course organization
24-Sept-2014, NLP, Prof. Howard, Tulane University
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
Review of Unicode
ASCII characters
   0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
0  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –
1  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –
2  SP !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /
3  0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
4  @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
5  P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _
6  `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
7  p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~  –
Rows give the first hexadecimal digit of the code, columns the second. – marks a non-printing control character; SP is the space character.
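The grid can be regenerated from the code points themselves. The following is only a minimal sketch, assuming Python 3 rather than the Python 2 used on the slides; the helper name `ascii_row` is hypothetical and not part of the course material.

```python
def ascii_row(high):
    """Return the 16 characters whose code points are 16*high .. 16*high + 15.

    Non-printing characters (code points below 32, and 127) are shown
    as '–', following the convention of the table.
    """
    cells = []
    for low in range(16):
        code = 16 * high + low
        if code < 32 or code == 127:
            cells.append('–')
        else:
            cells.append(chr(code))   # chr() maps a number to its character
    return cells


# Row 4 of the table: '@' followed by the capital letters A through O.
row4 = ascii_row(4)
print(' '.join(row4))
```

Calling `ascii_row(2)` shows that the first cell of row 2 is the space character, which is easy to miss in a printed table.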
6.2.1. Character encoding in Python
Open Spyder
6. Non-English characters: one code to rule them all
6.2.2. What happens when you type a non-ASCII character into a Python console?
>>> import sys
>>> sys.getdefaultencoding()

>>> special = 'ó'
>>> special
'\xc3\xb3'
>>> print special
ó
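The console shows two bytes because ó lies outside the ASCII range and UTF-8 needs two bytes to represent it. A minimal sketch of that point, assuming Python 3 (where a string literal is already Unicode text and the bytes must be requested with `encode()`); the variable names `as_utf8` and `fits_in_ascii` are illustrative only.

```python
special = '\u00f3'   # the character ó, written as an escape

# UTF-8 represents this character with two bytes.
as_utf8 = special.encode('utf-8')

# ASCII has no slot for it, so encoding to ASCII fails.
try:
    special.encode('ascii')
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

print(as_utf8)
print(fits_in_ascii)
```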
6.2.3. How to translate into and out of Unicode with decode() and encode()
>>> S1 = 'ca\xc3\xb1\xc3\xb3n'
>>> uS1 = S1.decode('utf8')
>>> uS1
u'ca\xf1\xf3n'
>>> len(uS1)
5
>>> utf8S1 = uS1.encode('utf8')
>>> print utf8S1
cañón
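For comparison, here is a sketch of the same round trip under the assumption that Python 3 is being used. There the byte string has to be written as a `bytes` literal, and the decoded result is an ordinary `str`. The variable names follow the slide.

```python
S1 = b'ca\xc3\xb1\xc3\xb3n'    # UTF-8 bytes for "cañón"
uS1 = S1.decode('utf8')        # bytes -> Unicode text
utf8S1 = uS1.encode('utf8')    # Unicode text -> bytes again

print(len(S1), len(uS1))       # byte count versus character count
print(uS1)
print(utf8S1 == S1)            # the round trip loses nothing
```

The two lengths differ because ñ and ó each take two bytes in UTF-8 but count as one character once decoded.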
6.2.4.1. How to turn on non-ASCII character matching with re.UNICODE
>>> S1 = 'ca\xc3\xb1\xc3\xb3n' # same as before
>>> uS1 = S1.decode('utf8')
>>> uS1
u'ca\xf1\xf3n'
>>> import re
>>> lS1 = re.findall(r'\w{5}', uS1, re.U)
>>> lS1
[u'ca\xf1\xf3n']
>>> eS1 = ''.join(lS1)
>>> eS1
u'ca\xf1\xf3n'
>>> utf8S1 = eS1.encode('utf8')
>>> utf8S1
'ca\xc3\xb1\xc3\xb3n'
>>> print utf8S1
cañón
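In Python 3 the default is reversed: for `str` patterns `\w` is Unicode-aware unless told otherwise, and the `re.ASCII` flag restores the narrower behavior. The sketch below assumes Python 3 and contrasts the two; `unicode_match` and `ascii_match` are illustrative names.

```python
import re

uS1 = 'ca\xf1\xf3n'    # "cañón" as Unicode text

# With a str pattern, \w matches Unicode letters by default.
unicode_match = re.findall(r'\w{5}', uS1)

# re.ASCII restricts \w to [a-zA-Z0-9_], so ñ and ó interrupt the run of five.
ascii_match = re.findall(r'\w{5}', uS1, re.ASCII)

print(unicode_match)
print(ascii_match)
```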
6.2.5. How to translate between Unicode strings and numbers with ord() and unichr()
>>> 'ó'
'\xc3\xb3'
>>> 'ó'.decode('utf8')
u'\xf3'
>>> ord(u'\xf3')
243
>>> unichr(243)
u'\xf3'
>>> test = unichr(243).encode('utf8')
>>> print test
ó
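The number 243 is simply the decimal form of the hexadecimal escape `\xf3`. A short sketch of that correspondence, assuming Python 3, where `unichr()` no longer exists and `chr()` does its job:

```python
code_point = ord('\xf3')       # character -> number
as_hex = hex(code_point)       # the same number in hexadecimal
char = chr(code_point)         # number -> character (unichr in Python 2)
test = char.encode('utf8')     # character -> UTF-8 bytes

print(code_point, as_hex)
print(char)
print(test)
```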
Chapter numbering
I am going to fold the Unicode chapter into §1 & §2 and move the next chapter (§8) up a notch, so the chapter numbering will change.
8. Control
Up to now, your short programs have depended entirely on you to make decisions. That is fine for pieces of text that fit on a single line, but it is clearly insufficient for texts that run to hundreds of lines. You will want Python to make decisions for you. How to tell Python to do so is the topic of this chapter, and it falls under the rubric of control.
8.1. Conditions
The first step in making a decision is to distinguish the cases in which the decision applies from those in which it does not. In computer science, the test that draws this distinction is usually known as a condition.
8.1.1. How to check for the presence of an item with in
Perhaps the simplest condition in text processing is whether an item is present or not. Python handles this in a way that looks a lot like English:
>>> greeting = 'Yo!'
>>> 'Y' in greeting
>>> 'o' in greeting
>>> '!' in greeting
>>> 'o!' in greeting
>>> 'Yo!' in greeting
>>> 'Y!' in greeting
>>> 'n' in greeting
>>> '?' in greeting
>>> '' in greeting
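A few of these checks are worth working through, because for strings `in` looks for a contiguous substring rather than for individual characters. The sketch below collects four representative cases in a dictionary named `checks` (an illustrative name, not from the slides); it runs unchanged in Python 2.7 and Python 3.

```python
greeting = 'Yo!'

# For strings, `in` asks whether the left side occurs as a contiguous substring.
checks = {
    'o!': 'o!' in greeting,   # adjacent characters
    'Y!': 'Y!' in greeting,   # both characters occur, but not side by side
    'n': 'n' in greeting,     # character absent altogether
    '': '' in greeting,       # the empty string
}

for item, found in checks.items():
    print(repr(item), found)
```

Only `'o!'` and the empty string come out as present; the empty string counts as a substring of every string.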
in & lists
Lists behave much like strings, with the proviso that the string being asked about must match an item in the list exactly:
>>> fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']
>>> 'apple' in fruit
>>> 'peach' in fruit
>>> 'app' in fruit
>>> '' in fruit
>>> [] in fruit
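The difference from strings shows up with partial matches: a list is searched item by item, so a fragment of an item is not found. The sketch below illustrates this. The final line uses the built-in `any()`, which is not covered on the slides, as one possible way to look inside each item.

```python
fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']

# For lists, `in` compares the left side with each whole item.
whole_item = 'apple' in fruit    # matches an item exactly
partial = 'app' in fruit         # only part of an item
empty = '' in fruit              # no item is the empty string

# To look inside the items, test each string separately.
inside_any = any('app' in item for item in fruit)

print(whole_item, partial, empty, inside_any)
```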
Python can understand sequences of in conditions
>>> 'app' in 'apple' in fruit
# 'app' in 'apple' -> True
# 'apple' in fruit -> True
>>> 'aple' in 'apple' in fruit
>>> 'pea' in 'peach' in fruit
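Python reads such a sequence as two separate tests joined by `and`, not as one test nested inside another. The sketch below spells this out; the helper `chained_in` is hypothetical and exists only to make the expansion visible.

```python
fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']


def chained_in(a, b, c):
    """Hypothetical helper: what `a in b in c` means, written out."""
    return (a in b) and (b in c)


chained = 'app' in 'apple' in fruit
spelled_out = chained_in('app', 'apple', fruit)

# Grouping the left pair first gives a different question: is True in fruit?
left_grouped = ('app' in 'apple') in fruit

print(chained, spelled_out, left_grouped)
```

The chain fails as soon as either half fails, which is why `'pea' in 'peach' in fruit` is false even though `'pea'` is in `'peach'`.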
8.1.2. How to check for the absence of an item with not in
>>> not 'n' in greeting
>>> 'n' not in greeting
>>> 'Y' not in greeting
>>> 'Y!' not in greeting
>>> 'Yo' not in greeting
>>> '' not in greeting
>>> 'apple' not in fruit
>>> 'peach' not in fruit
>>> 'app' not in fruit
>>> '' not in fruit
>>> 'pee' not in 'peach' not in fruit
>>> 'pea' not in 'peach' not in fruit
>>> 'pea' not in 'apple' not in fruit
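Two points in this list are easy to miss: the first two lines ask the same question in different word orders, and a sequence of `not in` tests is joined by `and` just as a sequence of `in` tests is. A sketch of both points, with illustrative variable names:

```python
greeting = 'Yo!'
fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']

# `x not in y` and `not x in y` are the same test in two word orders.
same_test = ('n' not in greeting) == (not 'n' in greeting)

# A chain of `not in` tests means: both absences must hold.
chain = 'pee' not in 'peach' not in fruit
expanded = ('pee' not in 'peach') and ('peach' not in fruit)

# Here the first half holds but 'apple' is in fruit, so the chain fails.
failing_chain = 'pea' not in 'apple' not in fruit

print(same_test, chain, expanded, failing_chain)
```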
Next time
More on control