Text operations
Prepared By : Loay Alayadhi 425121605
Supervised by: Dr. Mourad Ykhlef
2
Document Preprocessing
• Lexical analysis
• Elimination of stopwords
• Stemming of the remaining words
• Selection of index terms
• Construction of term categorization structures (thesaurus)
3
Logical View of a Document
[Figure: pipeline from a document to its logical views — structure recognition yields text + structure; normalization of accents and spacing, stopword removal, noun-group detection, and stemming reduce the full text to index terms, followed by automatic or manual indexing.]
4
1)Lexical Analysis of the Text
• Lexical analysis: convert an input stream of characters into a stream of words.
• The major objective is the identification of the words in the text. How? (a sketch follows the list)
– Digits: ignoring numbers is a common policy
– Hyphens: e.g., state-of-the-art
– Punctuation marks: remove them. Exception: 510B.C
– Case: usually folded to a single case
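A minimal sketch of these choices in Python (an assumed implementation, not one taken from the slides): tokens are split on non-alphanumeric characters, hyphenated forms are kept together, pure numbers are dropped, and case is folded.

```python
import re

def lexical_analysis(text):
    """Convert a character stream into a stream of normalized word tokens."""
    tokens = []
    # Split on whitespace/punctuation, but keep hyphenated forms together.
    for raw in re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text):
        token = raw.lower()          # case folding
        if token.isdigit():          # ignoring pure numbers is a common policy
            continue
        tokens.append(token)
    return tokens

print(lexical_analysis("State-of-the-art IR systems were rare in 510 B.C."))
# ['state-of-the-art', 'ir', 'systems', 'were', 'rare', 'in', 'b', 'c']
```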
5
2) Elimination of Stopwords
• Words that appear too often are not useful for IR.
• Stopwords: words that appear in more than 80% of the documents in the collection are treated as stopwords and are filtered out as potential index words.
• Problem: how do you search for "to be or not to be"? (see the sketch below)
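A minimal sketch of stopword filtering; the stopword set here is a tiny illustrative subset, not a real list. It also shows the problem noted above: the classic query disappears entirely.

```python
STOPWORDS = {"a", "an", "the", "to", "be", "or", "not", "of", "in", "is"}

def remove_stopwords(tokens):
    """Filter out tokens that are too frequent to be useful index terms."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))
# [] -- the whole query is filtered away
```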
6
3) Stemming
• A stem: the portion of a word which is left after the removal of its affixes (i.e., prefixes or suffixes).
• Example: connect, connected, connecting, connection, connections
• Removal strategies (a sketch of affix removal follows):
– affix removal: intuitive, simple
– table lookup
– successor variety
– n-gram
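A toy affix-removal stemmer as a sketch; real systems typically use the Porter stemmer, and the suffix list and minimum-stem-length check below are simplifying assumptions.

```python
def simple_stem(word):
    """Strip a few common English suffixes (a toy affix-removal stemmer)."""
    for suffix in ("ions", "ion", "ings", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["connect", "connected", "connecting", "connection", "connections"]
print([simple_stem(w) for w in words])
# ['connect', 'connect', 'connect', 'connect', 'connect']
```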
7
4) Index Terms Selection
• Motivation
– A sentence is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives.
– Most of the semantics is carried by the nouns.
• Identification of noun groups
– A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold (see the sketch below).
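A sketch of noun-group identification under the assumption that the text has already been POS-tagged (the tagger itself is not shown); the tag set and the distance threshold are illustrative.

```python
def noun_groups(tagged_tokens, max_distance=2):
    """Group nouns whose distance in the token stream stays within a threshold.

    tagged_tokens: list of (word, pos) pairs, assumed to come from some
    POS tagger; 'N' marks a noun here.
    """
    groups, current, last_noun_pos = [], [], None
    for i, (word, pos) in enumerate(tagged_tokens):
        if pos == "N":
            if last_noun_pos is not None and i - last_noun_pos > max_distance:
                groups.append(current)   # distance exceeded: close the group
                current = []
            current.append(word)
            last_noun_pos = i
    if current:
        groups.append(current)
    return groups

tagged = [("the", "D"), ("computer", "N"), ("science", "N"), ("students", "N"),
          ("are", "V"), ("writing", "V"), ("programs", "N")]
print(noun_groups(tagged))
# [['computer', 'science', 'students'], ['programs']]
```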
8
5) Thesaurus Construction
• Thesaurus: a precompiled list of important words in a given domain of knowledge; for each word in this list, there is a set of related words.
• A controlled vocabulary for indexing and searching. Why?
– normalization,
– concept indexing,
– reduction of noise,
– identification, etc.
9
The Purpose of a Thesaurus
• To provide a standard vocabulary for indexing and searching
• To assist users with locating terms for proper query formulation
• To provide classified hierarchies that allow the broadening and narrowing of the current query request
10
Thesaurus (cont.)
• Not like a common dictionary
– words with their explanations
• May contain all the words in a language, or only the words of a specific domain
• Carries a lot of other information, especially the relationships between words
– classification of words in the language
– word relationships such as synonyms and antonyms
11
Roget thesaurus example
cowardly, adjective (Arabic gloss: "coward")
Ignobly lacking in courage: cowardly turncoats
Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang)
• http://www.thesaurus.com
• http://www.dictionary.com/
12
Thesaurus Term Relationships
• BT: broader term
• NT: narrower term
• RT: related term (non-hierarchical, but related)
13
Use of Thesaurus
• Indexing: select the most appropriate thesaurus entries for representing the document.
• Searching: design the most appropriate search strategy.
– If the search does not retrieve enough documents, the thesaurus can be used to expand the query.
– If the search retrieves too many items, the thesaurus can suggest more specific search vocabulary (a query-expansion sketch follows).
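A minimal sketch of thesaurus-based query expansion; the tiny thesaurus and the query terms are made up for illustration.

```python
# A toy thesaurus: each entry lists broader (BT), narrower (NT), related (RT) terms.
THESAURUS = {
    "computer": {"BT": ["machine"], "NT": ["laptop", "mainframe"], "RT": ["software"]},
    "retrieval": {"BT": ["search"], "NT": ["text retrieval"], "RT": ["indexing"]},
}

def expand_query(terms, too_few_results=True):
    """Broaden a query with RT/BT entries, or narrow it with NT entries."""
    expanded = list(terms)
    for term in terms:
        entry = THESAURUS.get(term, {})
        if too_few_results:
            expanded += entry.get("RT", []) + entry.get("BT", [])
        else:
            expanded += entry.get("NT", [])
    return expanded

print(expand_query(["computer", "retrieval"]))
# ['computer', 'retrieval', 'software', 'machine', 'indexing', 'search']
```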
14
Document clustering
• Document clustering: the operation of grouping together similar documents into classes
• Global vs. local
• Global: whole collection (a sketch follows)
– At compile time, a one-time operation
• Local
– Cluster the results of a specific query
– At runtime, with each query
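A minimal sketch of global clustering over a small collection, assuming scikit-learn's TfidfVectorizer and KMeans as the toolchain (the slides do not prescribe a particular algorithm).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "text compression with huffman coding",
    "huffman and arithmetic coding methods",
    "stemming and stopword removal for indexing",
    "index terms and thesaurus construction",
]

# Global clustering: performed once over the whole collection.
vectors = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # one cluster id per document
```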
15
Text Compression
• Why is text compression important?
– Less storage space
– Less time for data transmission
– Less time to search (if the compression method allows direct search without decompression)
16
Terminology
• Symbol
– The smallest unit for compression (e.g., a character, a word, or a fixed number of characters)
• Alphabet
– The set of all possible symbols
• Compression ratio
– The size of the compressed file as a fraction of the size of the uncompressed file
17
Types of compression models
• Static models– Assume some data properties in
advance (e.g., relative frequencies of symbols) for all input text
– Allow direct (or random) access
– Poor compression ratios when the input text deviates from the assumption
18
Types of compression models
• Semi-static models– Learn data properties in a first pass– Compress the input data in a second pass
– Allow direct (or random) access– Good compression ratio
– Must store the learned data properties for decoding
– Must have whole data at hand
19
Types of compression models
• Adaptive models– Start with no information– Progressively learn the data properties
as the compression process goes on
– Need only one pass for compression
– Do not allow random access• Decompression cannot start in the middle
20
General approaches to text compression
• Dictionary methods
– (Basic) dictionary method
– Ziv-Lempel's adaptive method
• Statistical methods
– Arithmetic coding
– Huffman coding
21
Dictionary methods
• Replace a sequence of symbols with a pointer to a dictionary entry
[Figure: the input aaababbbaaabaaaaaaabaabb is compressed by replacing occurrences of dictionary entries (e.g., "aaa", "bb") with pointers into the dictionary.]
• A fixed dictionary may be suitable for one text but unsuitable for another
22
Adaptive Ziv-Lempel coding
• Instead of dictionary entries, pointers point to previous occurrences of symbols
Input: aaababbbaaabaaaaaaabaabb
a|a|b|b|b|a|a|a|b|b
23
Adaptive Ziv-Lempel coding
• Instead of dictionary entries, pointers point to the previous occurrences of symbols
Input:   aaababbbaaabaaaaaaabaabb
Phrases: a | aa | b | ab | bb | aaa | ba | aaaa | aab | aabb   (numbered 1-10)
Output:  0a | 1a | 0b | 1b | 3b | 2a | 3a | 6a | 2b | 9b   (an encoder sketch follows)
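A short LZ78-style encoder as a sketch of the parsing above; it reproduces the (pointer, symbol) pairs shown, though the handling of a trailing phrase is simplified.

```python
def lz78_encode(text):
    """Parse text into phrases; each phrase = longest known phrase + one new symbol."""
    dictionary = {"": 0}          # phrase -> index (0 is the empty phrase)
    phrase, output = "", []
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch          # keep extending the current phrase
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                    # simplified flush of a trailing phrase
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

print(lz78_encode("aaababbbaaabaaaaaaabaabb"))
# [(0, 'a'), (1, 'a'), (0, 'b'), (1, 'b'), (3, 'b'),
#  (2, 'a'), (3, 'a'), (6, 'a'), (2, 'b'), (9, 'b')]
```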
24
Adaptive Ziv-Lempel coding
• Good compression ratio (4 bits/character)
– Suitable for general data compression and widely used (e.g., zip, compress)
• Does not allow decoding to start in the middle of a compressed file: direct access is impossible without decompressing from the beginning
25
Arithmetic coding
• The input text (data) is converted to a real number between 0 and 1, such as 0.328701 (a sketch follows)
• Good compression ratio (2 bits/character)
• Slow
• Cannot start decoding in the middle of a file
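A minimal sketch of the interval-narrowing idea, assuming fixed symbol probabilities given up front; a practical arithmetic coder works with scaled integers and emits bits incrementally rather than returning one float.

```python
def arithmetic_encode(text, probs):
    """Narrow the interval [low, high) once per symbol; any number inside
    the final interval identifies the whole message."""
    # Cumulative probability ranges, e.g. {'a': (0.0, 0.7), 'b': (0.7, 1.0)}
    ranges, start = {}, 0.0
    for sym, p in probs.items():
        ranges[sym] = (start, start + p)
        start += p

    low, high = 0.0, 1.0
    for ch in text:
        lo, hi = ranges[ch]
        width = high - low
        low, high = low + width * lo, low + width * hi
    return (low + high) / 2   # any value inside [low, high) would do

print(arithmetic_encode("aab", {"a": 0.7, "b": 0.3}))
```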
26
Symbols and alphabet for textual data
• Words are more appropriate symbols for natural language text
• Example: "for each rose, a rose is a rose"
– Alphabet: {a, each, for, is, rose, ' ', ','}
– If we always assume a single space after a word unless there is another separator, the alphabet reduces to {a, each, for, is, rose, ','}
27
Huffman coding
• Assign shorter codes (bits) to more frequent symbols and longer codes (bits) to less frequent symbols
• Example: “for each rose, a rose is a rose”
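A minimal Huffman coder for the example sentence, built with Python's heapq; it yields an optimal prefix code with the same code lengths as the following slides (2 bits for "a" and "rose", 3 bits for the rest), though the exact bit patterns may differ because ties can be broken differently.

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build an optimal prefix code: frequent symbols get shorter codes."""
    freq = Counter(symbols)
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        # Prefix one subtree with 0 and the other with 1, then merge them.
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

tokens = ["for", "each", "rose", ",", "a", "rose", "is", "a", "rose"]
print(huffman_codes(tokens))
# 2-bit codes for 'a' and 'rose', 3-bit codes for each, ',', for, is
```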
28
Example
Symbol frequencies in "for each rose, a rose is a rose":
  symb: each  ,  for  is  a  rose
  freq:    1  1    1   1  2     3
[Figure: Huffman tree construction — the two lowest-frequency nodes are merged repeatedly, giving internal nodes of weight 2, 2, 4, and 5.]
29
Example
[Figure: the same Huffman tree with its edges labeled 0 and 1; the leaves are each, ',', for, is, a, rose with the frequencies above.]
30
Example
Reading the edge labels from root to leaf gives the code for each symbol:
  symb: each    ,  for   is    a  rose
  freq:    1    1    1    1    2     3
  code:  100  101  110  111   00    01
31
Canonical tree
[Figure: the canonical Huffman tree for the example, with leaves each, ',', for, is, a, rose.]
• The height of the left subtree of any node is never smaller than that of the right subtree
• All leaves are in increasing order of probabilities (frequencies) from left to right
32
Advantages of canonical tree
• Smaller data for decoding
– A non-canonical tree needs a mapping table between symbols and codes
– A canonical tree needs only:
  • a (sorted) list of symbols
  • for each level, a pair: the number of symbols and the numerical value of the first code
  – e.g., {(0, NA), (2, 2), (4, 0)}
• More efficient encoding/decoding
33
Byte-oriented Huffman coding
• Use whole bytes instead of binary coding
[Figure: comparison of a non-optimal and an optimal byte-oriented coding tree for 256 symbols; the trees differ in how the 254 empty nodes and the symbols (2 vs. 254 at the deeper level) are distributed across levels.]
34
Comparison of methods
35
Compression of inverted files
• Inverted file: composed of
– a vector containing all distinct words in the text collection, and
– for each word, a list of the documents in which that word occurs.
• Types of code for compressing these lists (a sketch follows):
– Unary
– Elias-gamma
– Elias-delta
– Golomb
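A minimal sketch of one of these codes: the document lists are first gap-encoded and each gap is then written with the Elias-gamma code; the posting list below is made up for illustration.

```python
def elias_gamma(n):
    """Elias-gamma code for n >= 1: (len-1) zeros, then n in binary."""
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def compress_postings(doc_ids):
    """Gap-encode a sorted posting list, then Elias-gamma code each gap."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return "".join(elias_gamma(g) for g in gaps)

postings = [3, 5, 20, 21, 23, 76]     # documents containing some word
print(compress_postings(postings))
# gaps 3, 2, 15, 1, 2, 53 -> '011' '010' '0001111' '1' '010' '00000110101'
```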
36
Conclusions
• Text transformation: meaning instead of strings
– Lexical analysis
– Stopwords
– Stemming
• Text compression
– Searchable
– Random access
– Model + coding
– Inverted files
37
Thanks… Any questions?