Pythonmarcinm/dyd/python_eng/regex.pdf · 2013-03-27 · Regular expressions Results grouping html...
Python: Stringology
Marcin Młotkowski
27th March, 2013
Regular expressions | Results grouping | html processing | XML processing
1 Regular expressions
2 Results grouping
3 html processing
4 XML processing
Marcin Młotkowski Python
Regular expressions in examples
MS Windows system
c:\WINDOWS\system32> dir *.exe
Result
accwiz.exe
actmovie.exe
ahui.exe
alg.exe
append.exe
arp.exe
asr_fmt.exe
asr_ldm.exe
...
Examples, cont.
?N*X, *BSD
$ rm *.tmp
Examples of regular expressions
reg. exp.      words
'alamakota'    { 'alamakota' }
'(hop!)*'      { '', 'hop!', 'hop!hop!', 'hop!hop!hop!', ... }
'br+um'        { 'brum', 'brrum', 'brrrum', ... }
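The three patterns above could be checked along these lines. This is a minimal sketch in Python 3 (the slides themselves use Python 2); the helper name `belongs` is ours, and `re.fullmatch` is used so that a word counts only when the whole of it is described by the pattern.

```python
import re

def belongs(pattern, word):
    """Return True if the whole word is described by the pattern."""
    return re.fullmatch(pattern, word) is not None

# 'br+um' requires at least one 'r' between 'b' and 'um'.
print(belongs(r'br+um', 'brrrum'))      # True
print(belongs(r'br+um', 'bum'))         # False

# '(hop!)*' allows any number of repetitions, including none.
print(belongs(r'(hop!)*', ''))          # True
print(belongs(r'(hop!)*', 'hop!hop!'))  # True
print(belongs(r'(hop!)*', 'hop'))       # False
```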
Searching and matching
re library
import re
matching
if re.match('brr+um', 'brrrrum!!!'):
    print 'matches'
searching
if re.search('brr+um', 'Automobile sounds brrrrum!!!'):
    print 'exists'
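The difference between the two calls shows up when the pattern occurs in the middle of the text. A small Python 3 sketch, reusing the sentence from the slide:

```python
import re

text = 'Automobile sounds brrrrum!!!'

# match() only tries to match at the very beginning of the text.
at_start = re.match(r'brr+um', text)

# search() scans the text and returns the first match found anywhere.
anywhere = re.search(r'brr+um', text)

print(at_start)          # None
print(anywhere.group())  # brrrrum
```

Both functions return None on failure, which is why they can be used directly in an `if` condition.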
Regular expression compilation
import re
automat = re.compile('brr+um')
automat.search('brrrrum')
automat.match('brrrrum')
Result interpretation
>>> re.search('brr+um', 'brrrum!!!')
MatchObject
.group(): the matched text
.start(): index at which the matched text begins
.end(): index just past the end of the matched text
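A short Python 3 sketch of reading these values from the match object; the variable names `text`, `m` and `missing` are ours.

```python
import re

text = 'brrrum!!!'
m = re.search(r'brr+um', text)

print(m.group())   # brrrum
print(m.start())   # 0
print(m.end())     # 6

# start() and end() are slice indices into the searched text.
print(text[m.start():m.end()] == m.group())   # True

# search() returns None when nothing matches, so test the result before using it.
missing = re.search(r'brr+um', 'bum')
print(missing is None)   # True
```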
Advanced example
Task
On an HTML page, find all references to other pages.
Examples
www.ii.uni.wroc.pl
www.gogole.com
Solution
Implementation
adres = r'([a-zA-Z]+\.)*[a-zA-Z]+'       # host name: groups of letters separated by dots
automat = re.compile('http://' + adres)
tekst = fh.read()                        # fh: an already opened file with the page

[ url.group() for url in automat.finditer(tekst) ]
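To try the solution without a file, the page content can be replaced by a string. In this Python 3 sketch the HTML fragment is made up for illustration and stands in for `fh.read()`:

```python
import re

# Same pattern as on the slide, written as a raw string.
adres = r'([a-zA-Z]+\.)*[a-zA-Z]+'
automat = re.compile('http://' + adres)

# Hypothetical page content standing in for fh.read().
tekst = '''
<a href="http://www.ii.uni.wroc.pl">Institute</a>
<a href="http://www.gogole.com/search">Search</a>
'''

urls = [url.group() for url in automat.finditer(tekst)]
print(urls)
# ['http://www.ii.uni.wroc.pl', 'http://www.gogole.com']
```

The pattern accepts only letters and dots, so a match stops at the first digit, hyphen or slash; the path `/search` in the second link is not part of the result.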
Metasymbols overview
symbol     description
w*         zero or more repetitions of w
w+         at least one repetition of w
w1|w2      alternative: w1 or w2
w{m,n}     w occurs at least m times and at most n times
.          any character except newline
w?         0 or 1 occurrence of w
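Each row of the table can be tried out directly. A minimal Python 3 sketch; the helper `full` and the sample patterns are ours:

```python
import re

def full(pattern, text):
    """True if the pattern describes the whole text."""
    return re.fullmatch(pattern, text) is not None

print(full(r'a*', ''))            # True: zero repetitions are allowed
print(full(r'a+', ''))            # False: at least one is required
print(full(r'cat|dog', 'dog'))    # True
print(full(r'a{2,3}', 'a'))       # False: fewer than 2
print(full(r'a{2,3}', 'aaa'))     # True
print(full(r'a{2,3}', 'aaaa'))    # False: more than 3
print(full(r'a.c', 'abc'))        # True
print(full(r'a.c', 'a\nc'))       # False: '.' does not match a newline by default
print(full(r'colou?r', 'color'))  # True
```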
Popular abbreviations
symbol     description
\d         any digit
\w         alphanumeric character or underscore (depends on LOCALE)
\Z         end of text
Problem with backslash
Role of backslash in Python
'Name\tSurname\n'
print 'Tabulator is a character \\t'
'c:\\WINDOWS\\win.ini'
Backslash in regular expressions
Searching for '['
re.match('\[', '[')
A puzzle
How to find '\['?
Approaches
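'\['
re.match('\[', '\[')       # result: None (the pattern matches '[', but the text starts with a backslash)
re.match('\[', '[')        # result: a match

'\\['
re.match('\\[', '\[')      # result: None (the same pattern as '\[')
re.match('\\[', '[')       # result: a match

re.match('\\\[', '\[')     # error of regexp compilation (unterminated '[')
re.match('\\\\[', '\[')    # error of regexp compilation (the same pattern as above)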
Ultimate solution
A solution
re.match('\\\\\[', '\[')
re.match(r'\\\[', '\[')
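Counting characters makes it easier to see why this works. The Python 3 sketch below spells every backslash out explicitly (doubled or in a raw string), so that Python's own escape processing cannot interfere; the names `text` and `pattern` are ours.

```python
import re

# The text to find: a backslash followed by '[' (two characters).
text = '\\['
print(len(text))                   # 2

# In a regex a literal backslash is written \\ and a literal bracket \[,
# so the pattern has four characters: \ \ \ [
pattern = r'\\\['
print(len(pattern))                # 4
print(pattern == '\\\\\\[')        # True: the same pattern without the r prefix

print(re.match(pattern, text).group() == text)   # True

# With only two backslashes the bracket stays unescaped and opens a
# character set that is never closed, so compilation fails.
try:
    re.compile(r'\\[')
except re.error as exc:
    print('compilation error:', exc)
```

The raw-string form is usually the easier one to read, because only the regex level of escaping remains.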
String processing
String processing by Python
string in Python    'true' character
'\n'                0x0A
'\t'                0x09
'\\'                0x5C
String processing by regular expressions
string in regex     'true' character
'\['                0x5B
A few words on groups
res = re.match('a(b*)a.*(a)', 'abbabbba')
print res.groups()
Result
('bb', 'a')
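Individual groups can also be read one at a time, numbered from 1 in the order of their opening parentheses. A Python 3 sketch using the same pattern and text:

```python
import re

res = re.match(r'a(b*)a.*(a)', 'abbabbba')

print(res.group(0))   # abbabbba   (group 0 is the whole match)
print(res.group(1))   # bb
print(res.group(2))   # a
print(res.groups())   # ('bb', 'a')
print(res.span(1))    # (1, 3)     (position of group 1 in the text)
```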
Grouping expression
(?P<name>regexp)
Task
From a date in the format '20061204', extract the day, month, and year.
A solution
Regular expression
wzor = r'(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})'
res = re.search(wzor, 'On 20110406 there is a Python lecture')
print res.group('year'), res.group('month')
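The same pattern can be wrapped in a small function. This is a Python 3 sketch; the function name `split_date` and the conversion to integers are our additions, and `groupdict()` returns all named groups at once.

```python
import re

wzor = re.compile(r'(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})')

def split_date(text):
    """Return (year, month, day) as integers, or None if no date is found."""
    res = wzor.search(text)
    if res is None:
        return None
    parts = res.groupdict()
    return int(parts['year']), int(parts['month']), int(parts['day'])

print(split_date('On 20110406 there is a Python lecture'))  # (2011, 4, 6)
print(split_date('no date here'))                           # None
```

The pattern accepts any eight digits in a row; it does not check that the month and day are in a valid range.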
html processing
An HTML file is a sequence of tags:
<html>
<title>Title</title>
<body bgcolor="red">
<div align="center">Text</div>
</body>
</html>
Opening tags
<html>, <body>, <div>
Closing tags
</body>, </div>, </html>
sgmllib
import sgmllib
class sgmllib.SGMLParser:
    def start_tag(self, attrs):    # "tag" stands for a tag name, e.g. start_a for <a>
    def end_tag(self):             # e.g. end_a for </a>
How to use sgmllib
Task
Find all 'href' references:
<a href="adres">Text</a>
class MyParser(sgmllib.SGMLParser):

    def start_a(self, attrs):
        for (atr, val) in attrs:
            if atr == 'href': print val

p = MyParser()
p.feed(dokument)
p.close()
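The sgmllib module exists only in Python 2; it was removed in Python 3. A comparable sketch for Python 3 can be built on `html.parser.HTMLParser`, which is a different class from the one on the slide: it calls a single `handle_starttag` method for every opening tag instead of one `start_...` method per tag. The class name `HrefCollector` and the sample document are ours.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, as in sgmllib's start_a.
        if tag == 'a':
            for atr, val in attrs:
                if atr == 'href':
                    self.hrefs.append(val)

# Hypothetical document used only for this example.
dokument = '<html><body><a href="adres">Text</a> <a name="x">no link</a></body></html>'

p = HrefCollector()
p.feed(dokument)
p.close()
print(p.hrefs)   # ['adres']
```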
XML
Example
<?xml version="1.0" encoding="UTF-8"?>
<biblioteka>
  <ksiazka egzemplarze="3">
    <autor>Ascher, Martelli, Ravenscroft</autor>
    <tytul>Python cookbook</tytul>
  </ksiazka>
  <ksiazka>
    <autor/>
    <tytul>Python for beginners</tytul>
  </ksiazka>
</biblioteka>
XML processing
processing consecutive elements one at a time (saxutils)
building a tree (DOM) that corresponds to the XML document
SAX: Simple API for XML
elements of the document are read one by one
for each element the corresponding handler method is called
Parser implementation
Default parser
from xml.sax import *
class saxutils.DefaultHandler:
    def startDocument(self): pass
    def endDocument(self): pass
    def startElement(self, name, attrs): pass
    def endElement(self, name): pass
    def characters(self, value): pass
Own parser implementation
class SaxReader(saxutils.DefaultHandler):

    def characters(self, value):
        print value

    def startElement(self, name, attrs):
        for x in attrs.keys():
            pass    # the loop body is not preserved in the source
How to use the parser
from xml.sax import make_parser
from xml.sax.handler import feature_namespaces
from xml.sax import saxutils

parser = make_parser()
parser.setFeature(feature_namespaces, 0)
dh = SaxReader()
parser.setContentHandler(dh)
parser.parse(fh)
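A self-contained variant for Python 3 might look as follows. It is a sketch under two assumptions of ours: the handler derives from `xml.sax.handler.ContentHandler` rather than the `saxutils.DefaultHandler` named on the slides, and the document is passed as a byte string to `xml.sax.parseString` instead of being read from `fh`. The handler records events in a list instead of printing them.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class SaxReader(ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # attrs behaves like a read-only mapping of attribute names to values.
        self.events.append(('start', name, dict(attrs.items())))

    def characters(self, content):
        # Skip whitespace-only chunks between elements.
        if content.strip():
            self.events.append(('text', content))

    def endElement(self, name):
        self.events.append(('end', name))

# Shortened version of the lecture's example document.
data = (b'<biblioteka><ksiazka egzemplarze="3">'
        b'<tytul>Python cookbook</tytul></ksiazka></biblioteka>')

dh = SaxReader()
xml.sax.parseString(data, dh)
for event in dh.events:
    print(event)
```

A parser is allowed to deliver the text of one element in several `characters` calls, so code that needs the complete text should concatenate the chunks.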
SAX: summary
Read-only processing;
processes the document part by part;
SAX is fast, with small memory requirements.
DOM: Document Object Model
A document is kept entirely as a tree;
A document (its tree) can be modified;
Processing needs time and memory, because the whole tree is kept in memory;
The DOM specification is maintained by the W3C.
Reminder
Example
<?xml version="1.0" encoding="UTF-8"?>
<biblioteka>
  <ksiazka egzemplarze="3">
    <autor>Ascher, Martelli, Ravenscroft</autor>
    <tytul>Python. Receptury</tytul>
  </ksiazka>
  <ksiazka>
    <autor/>
    <tytul>Python. Od podstaw</tytul>
  </ksiazka>
</biblioteka>
A picture
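[Figure: DOM tree of the example document. The Document node (<?xml version="1.0" encoding="UTF-8"?>) contains the Element <biblioteka>, whose children are two <ksiazka> Elements interleaved with Text nodes (""). A <ksiazka> Element contains the Elements <autor> and <tytul>, with the Text children "Ascher, ..." and "Python. Od ...".]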
Python libraries
xml.dom: DOM Level 2
xml.dom.minidom: lightweight DOM implementation, DOM Level 1
minidom implementation
The Node class
class attribute    example
.nodeName          library, book, author
.nodeValue         "Python cookbook"
.attributes        <book copies="3">
.childNodes        list of subnodes
Tree creation
XML file processing
import xml.dom.minidom

def wezel(node):                    # wezel: Polish for "node"
    print node.nodeName
    for n in node.childNodes:
        wezel(n)

doc = xml.dom.minidom.parse('content.xml')
wezel(doc)
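The traversal can be tried without a file by parsing a string. In this Python 3 sketch the extra parameters `depth` and `out` are our additions; they record how deep each node sits so that the tree can be printed with indentation. The document is a shortened version of the lecture's example.

```python
import xml.dom.minidom

data = ('<biblioteka><ksiazka egzemplarze="3">'
        '<autor/><tytul>Python cookbook</tytul></ksiazka></biblioteka>')

def wezel(node, depth=0, out=None):
    """Collect (depth, nodeName) pairs for the node and all its descendants."""
    if out is None:
        out = []
    out.append((depth, node.nodeName))
    for n in node.childNodes:
        wezel(n, depth + 1, out)
    return out

doc = xml.dom.minidom.parseString(data)
for depth, name in wezel(doc):
    print('  ' * depth + name)
# #document
#   biblioteka
#     ksiazka
#       autor
#       tytul
#         #text

# Attributes are read from the element node itself.
ksiazka = doc.getElementsByTagName('ksiazka')[0]
print(ksiazka.getAttribute('egzemplarze'))   # 3
```

The text "Python cookbook" is not the value of the `tytul` element; it is a separate child node named `#text`.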
DOM processing
Node manipulation
appendChild(newChild)
removeChild(oldChild)
replaceChild(newChild, oldChild)
New node creation
new = document.createElement('chapter')
new.setAttribute('number', '5')
document.documentElement.appendChild(new)

print document.toxml()
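All three manipulation methods can be seen together on a tiny document. A Python 3 sketch; the `<book>` document and the `preface` element are made up for illustration.

```python
import xml.dom.minidom

# Hypothetical starting document with a single chapter.
document = xml.dom.minidom.parseString('<book><chapter number="4"/></book>')
root = document.documentElement

# appendChild: add a new element at the end of the root's children.
new = document.createElement('chapter')
new.setAttribute('number', '5')
root.appendChild(new)

# replaceChild(newChild, oldChild): swap the first chapter for a preface.
preface = document.createElement('preface')
old = root.firstChild
root.replaceChild(preface, old)

print(root.toxml())
# <book><preface/><chapter number="5"/></book>

# removeChild: detach the preface again.
root.removeChild(preface)
print(root.toxml())
# <book><chapter number="5"/></book>
```

Calling `toxml()` on the root element, as here, omits the XML declaration that `document.toxml()` would put in front.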
Summary: DOM
processes the entire tree
needs a lot of time and memory for large files