

CSC 594 Topics in AI – Text Mining and Analytics

Fall 2015/16

4. Document Search and Regular Expressions

Information Retrieval vs. Text Mining
• Information Retrieval (IR) is the task of retrieving documents which are relevant to a query.
  – So the fundamental technique is document similarity.
• Text Mining/Analytics (TM/A) is to (a) examine a collection of documents, (b) learn decision criteria or a model for classification, and (c) apply those criteria or that model to new documents to classify them.
  – So the goal is to classify new documents for prediction (using a model derived from the collection of documents).
• But in TM/A, step (b), learning the decision criteria for classification, uses document similarity for determination. So many TM/A tasks (although not all) use the same techniques as IR.


Document Search
• ‘Information Retrieval (IR)’ implies a query (e.g. search terms).
  – For a given query, relevant or similar documents are returned.
• But the most basic document retrieval technique is keyword/search-term matching.
  – Retrieve all (or selected) documents which contain the search terms, by string matching.
  – Python example:

    >>> s1 = 'public'
    >>> s2 = 'public'
    >>> s2 == s1
    True

    # Print every line of the file that contains the search string.
    myword = "month python"
    with open("textfile.txt") as openfile:
        for line in openfile:
            if myword in line:
                print(line)
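The line-by-line search above extends naturally to retrieving whole documents. The following is a sketch only: the in-memory `documents` dictionary, its file names, and the choice to ignore case are assumptions made for illustration, not part of the slides.

```python
def find_matching_documents(search_term, documents):
    """Return the names of documents whose text contains search_term.

    Matching is a plain substring test; both sides are lower-cased so
    that 'Python' and 'python' are treated alike (an assumption).
    """
    needle = search_term.lower()
    return [name for name, text in documents.items() if needle in text.lower()]


# Hypothetical document collection: name -> full text.
documents = {
    "doc1.txt": "Monty Python is a British comedy group.",
    "doc2.txt": "Python is a programming language.",
    "doc3.txt": "Regular expressions describe patterns.",
}

hits = find_matching_documents("python", documents)
print(hits)
```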


String Matching Using Patterns
• Often, we wish to find a substring which matches a pattern.
• e.g. E-mail addresses:
  1. Any number of alphanumeric characters and/or dots (not a dot at the beginning or end)
  2. @
  3. Any number of alphanumeric characters and/or dots (not a dot at the beginning or end); must be at least one dot
• Examples:
  – Valid: [email protected], [email protected]
  – Invalid: [email protected], tomuro@depaul
• But if you want to specify search words by patterns, regular expressions are commonly used.
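As a rough sketch, the three e-mail rules above can be written as a single regular expression, using the notation summarized in the slides that follow. Two assumptions are made: "any number" is read as "at least one" character, and consecutive dots are rejected (slightly stricter than the rules as stated). The addresses other than tomuro@depaul are made-up examples.

```python
import re

# local part:  alphanumerics, optionally separated by single dots
# domain part: the same, but with at least one dot
EMAIL_PATTERN = re.compile(
    r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*"   # rule 1
    r"@"                                  # rule 2
    r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)+"   # rule 3
)


def is_valid_email(address):
    """True if the whole string satisfies the three rules."""
    return EMAIL_PATTERN.fullmatch(address) is not None


for candidate in ["jane.doe@example.edu", "tomuro@depaul", ".jane@example.com"]:
    print(candidate, is_valid_email(candidate))
```

This pattern only checks the simplified rules on the slide; real-world e-mail address syntax allows many more characters.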


Regular Expressions (1)

Regular expressions are an algebra for defining patterns. For example, the regular expression “a*b” matches the string “aaaab”.

Without going through the formal definitions, here is a (partial) summary.

1. Simple Patterns
  – Characters match themselves. Note that characters are case-sensitive.
  – Metacharacters are not to be used literally _as is_:
      . ^ $ * + ? { } [ ] \ | ( )
  – To use a metacharacter literally, a backslash has to be given before it:
      \. \^ \+ etc.
  – Other special characters:
      \t, \n, \r, \f etc.
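A short sketch of the points above: characters are case-sensitive, an unescaped metacharacter keeps its special meaning, and a backslash makes it literal. The sample strings are made up; `re.escape()` is a standard-library helper that adds the backslashes for you.

```python
import re

# Ordinary characters match themselves, case-sensitively.
case_miss = re.search(r"cat", "Cat")           # no match: 'c' is not 'C'

# An unescaped dot is a metacharacter and matches any character.
any_char = re.findall(r"3.99", "3.99 3x99")

# An escaped dot matches only a literal dot.
literal_dot = re.findall(r"3\.99", "3.99 3x99")

# re.escape() backslash-escapes every metacharacter in a literal string.
price_pattern = re.escape("$3.99")
price = re.search(price_pattern, "Total: $3.99 (approx.)").group()

print(case_miss, any_char, literal_dot, price)
```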



2. Character classes
  – [abc] – a, b, or c
  – [^abc] – any character except a, b, or c
  – [a-zA-Z] – a through z, or A through Z inclusive (range)

3. Predefined character classes
  – . (dot) – any character
  – \d – a digit ([0-9])
  – \D – a non-digit ([^0-9])
  – \s – a whitespace character (e.g. space, \t, \n, \r)
  – \S – a non-whitespace character
  – \w – a word character ([a-zA-Z_0-9])
  – \W – a non-word character ([^\w])

4. Boundary matchers
  – ^ – the beginning of a line
  – $ – the end of a line
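The character classes and boundary matchers above can be tried out with `re.findall()`. This is an illustrative sketch with made-up strings. One Python-specific detail: `^` and `$` match at the start and end of every line only when the `re.M` flag is given; by default `^` matches only at the start of the whole string.

```python
import re

sample = "Room 42B, floor_3\nExit: 7"

# 2. Character classes
vowels = re.findall(r"[aeiou]", "regular")        # any of a, e, i, o, u
non_vowels = re.findall(r"[^aeiou]", "regular")   # anything except those
letters = re.findall(r"[a-zA-Z]", "a1-B2")        # ranges

# 3. Predefined character classes
digits = re.findall(r"\d", sample)
words = re.findall(r"\w+", sample)

# 4. Boundary matchers
first_word = re.findall(r"^\w+", sample)               # start of string only
line_starts = re.findall(r"^\w+", sample, re.M)        # start of every line
line_ends = re.findall(r"\w+$", sample, re.M)          # end of every line

print(vowels, non_vowels, letters)
print(digits, words)
print(first_word, line_starts, line_ends)
```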



5. Greedy quantifiers
  – X? – X, once or not at all
  – X* – X, zero or more times
  – X+ – X, one or more times
  – X{n} – X, exactly n times
  – X{n,m} – X, at least n but no more than m times

6. Logical operators
  – XY – X followed by Y
  – X|Y – either X or Y
  – (X) – X, as a capturing group
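Each quantifier and operator above can be checked with a one-line experiment. The strings below are made up for illustration; the last example shows what "greedy" means in practice, namely that the quantifier takes as much text as it can.

```python
import re

# 5. Greedy quantifiers
optional = re.findall(r"colou?r", "color colour colouur")   # u once or not at all
zero_or_more = re.findall(r"ab*", "a ab abbb")
one_or_more = re.findall(r"ab+", "a ab abbb")
exactly_three = re.findall(r"\d{3}", "12 345 6789")
two_to_three = re.findall(r"\d{2,3}", "1 22 333 4444")

# 6. Logical operators
either = re.findall(r"cat|dog", "cat dog bird")
page_range = re.search(r"(\d+)-(\d+)", "pages 10-25").groups()  # two capturing groups

# Greedy: .+ runs to the last '>' rather than stopping at the first one.
greedy = re.search(r"<.+>", "<a><b>").group()

print(optional, zero_or_more, one_or_more)
print(exactly_three, two_to_three)
print(either, page_range, greedy)
```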


Regular Expression in Python (1)
• Regular expressions are in the ‘re’ module.
• Notation for patterns is slightly different from other languages: patterns are usually written as raw strings (r"...") as an alternative to regular strings, so that backslashes do not have to be doubled.
• First compile an expression (into an re object). Then match it against a string.

>>> import re
>>> p = re.compile('ab*')


Regular String        Raw string
"ab*"                 r"ab*"
"\\\\section"         r"\\section"
"\\w+\\s+\\1"         r"\w+\s+\1"
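The regular-string and raw-string pairs in the table denote exactly the same characters, which can be confirmed directly in Python. The sketch below also uses the `\\section` pattern to match a literal backslash; the LaTeX-like sample text and the doubled-word example (which adds a capturing group so that `\1` has something to refer to) are illustrations, not from the slides.

```python
import re

# Each pair from the table is the same string, written two ways.
same_1 = "ab*" == r"ab*"
same_2 = "\\\\section" == r"\\section"
same_3 = "\\w+\\s+\\1" == r"\w+\s+\1"

# The pattern r"\\section" means: one literal backslash, then 'section'.
latex_line = r"\section{Introduction}"
found = re.search(r"\\section", latex_line).group()

# \1 refers back to capturing group 1; here it finds a doubled word.
doubled = re.search(r"\b(\w+)\s+\1\b", "Paris in the the spring").group()

print(same_1, same_2, same_3)
print(found, "|", doubled)
```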

Regular Expression in Python (2)
• Matching a re object against a string is done in several ways.

Method/Attribute    Purpose
match()             Determine if the RE matches at the beginning of the string.
search()            Scan through a string, looking for any location where this RE matches.
findall()           Find all substrings where the RE matches, and return them as a list.
finditer()          Find all substrings where the RE matches, and return them as an iterator.

(iterator: https://docs.python.org/2/glossary.html#term-iterator)


>>> import re
>>> sent = "This book on tennis cost $3.99 at Walmart."
>>> p1 = re.compile("ten")
>>> m1 = p1.match(sent)
>>> m1
>>> p2 = re.compile(".*ten.*")
>>> m2 = p2.match(sent)
>>> m2
<_sre.SRE_Match object; span=(0, 42), match='This book on tennis cost $3.99 at Walmart.'>
>>> m3 = re.search(p1, sent)
>>> m3
<_sre.SRE_Match object; span=(13, 16), match='ten'>
>>> m4 = re.search(p2, sent)
>>> m4
<_sre.SRE_Match object; span=(0, 42), match='This book on tennis cost $3.99 at Walmart.'>
>>> pp1 = re.compile("is")
>>> m5 = re.findall(pp1, sent)
>>> m5
['is', 'is']
>>> pp2 = re.compile("\\d")
>>> m6 = re.search(pp2, sent)
>>> m6
<_sre.SRE_Match object; span=(26, 27), match='3'>
>>> pp3 = re.compile("\\d+")
>>> m7 = re.search(pp3, sent)
>>> m7
<_sre.SRE_Match object; span=(26, 27), match='3'>


>>> pp3 = re.compile("\\$\\d+\\.\\d\\d")
>>> m8 = re.search(pp3, sent)
>>> m8
<_sre.SRE_Match object; span=(25, 30), match='$3.99'>
>>> pp4 = re.compile(r"\$\d+\.\d\d")
>>> m9 = re.search(pp4, sent)
>>> m9
<_sre.SRE_Match object; span=(25, 30), match='$3.99'>
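The session above exercises match(), search() and findall(), but not finditer(). As a small sketch using the same sentence, finditer() yields one match object per occurrence, so the position of each match is available as well as its text. The variable names here are illustrative.

```python
import re

sent = "This book on tennis cost $3.99 at Walmart."
pattern = re.compile(r"is")

# finditer() returns an iterator of match objects, one per occurrence.
occurrences = [(m.start(), m.end(), m.group()) for m in pattern.finditer(sent)]

for start, end, text in occurrences:
    print(start, end, text)
```

Compared with findall(), which returns only the matched strings, this is the form to use when the location of each match matters.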

Regular Expression in Python (3)
• Grouping – You can retrieve the matched substrings using parentheses.
• Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:
  1. ((A)(B(C)))
  2. (A)
  3. (B(C))
  4. (C)
• Group zero always stands for the entire expression.
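The numbering rule can be verified by matching the expression ((A)(B(C))) against the string "ABC" and reading the groups back. This is a minimal sketch; the variable names are illustrative.

```python
import re

m = re.match(r"((A)(B(C)))", "ABC")

whole = m.group(0)        # group zero: the entire match
numbered = m.groups()     # groups 1..4, in order of their opening parentheses

print(whole)
for number, text in enumerate(numbered, start=1):
    print(number, text)
```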


>>> ppp1 = re.compile("(\\w+) cost (\\$\\d+\\.\\d\\d)")
>>> mm1 = re.search(ppp1, sent)
>>> mm1
<_sre.SRE_Match object; span=(13, 30), match='tennis cost $3.99'>
>>> mm1.group(0)
'tennis cost $3.99'
>>> mm1.group(1)
'tennis'
>>> mm1.group(2)
'$3.99'

Python ‘search()’ Example
(Source: TutorialsPoint, http://www.tutorialspoint.com/python/python_reg_expressions.htm)

#!/usr/bin/env python3
import re

line = "Cats are smarter than dogs"

# group(1): everything before " are "; group(2): the next word (non-greedy)
searchObj = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)

if searchObj:
    print("searchObj.group() :", searchObj.group())
    print("searchObj.group(1) :", searchObj.group(1))
    print("searchObj.group(2) :", searchObj.group(2))
else:
    print("Nothing found!!")

When the above code is executed, it produces the following result:

searchObj.group() : Cats are smarter than dogs
searchObj.group(1) : Cats
searchObj.group(2) : smarter


Regular Expression Modifiers: Option Flags
(Source: TutorialsPoint, http://www.tutorialspoint.com/python/python_reg_expressions.htm)

The matching functions accept an optional modifier, given as a flag, to control various aspects of matching. You can provide multiple modifiers by combining them with bitwise OR (|), as shown previously. The available modifiers are:

Modifier  Description
re.I      Performs case-insensitive matching.
re.L      Interprets words according to the current locale. This interpretation affects the alphabetic group (\w and \W), as well as word boundary behavior (\b and \B).
re.M      Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).
re.S      Makes a period (dot) match any character, including a newline.
re.U      Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B.
re.X      Permits "cuter" regular expression syntax. It ignores whitespace (except inside a set [] or when escaped by a backslash) and treats unescaped # as a comment marker.
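The following sketch shows a few of the flags in action on a made-up two-line string, including two flags combined with `|`. The re.X example rewrites the price pattern from the earlier slides in the commented, multi-line form that this flag permits.

```python
import re

text = "Cats are smart.\ncats sleep a lot."

# re.I: case-insensitive matching
ignore_case = re.findall(r"cats", text, re.I)

# re.M combined with re.I using |: ^ matches at the start of every line
line_starts = re.findall(r"^cats", text, re.M | re.I)

# re.S: by default the dot does not match a newline; with re.S it does
dot_default = re.search(r"smart\..cats", text)
dot_all = re.search(r"smart\..cats", text, re.S)

# re.X: whitespace in the pattern is ignored and # starts a comment
price = re.compile(r"""
    \$      # a literal dollar sign
    \d+     # dollars
    \.      # decimal point
    \d\d    # cents
""", re.X)
price_found = price.search("This book on tennis cost $3.99 at Walmart.").group()

print(ignore_case, line_starts)
print(dot_default, dot_all is not None, price_found)
```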

