Text operations
Prepared By : Loay Alayadhi 425121605
Supervised by: Dr. Mourad Ykhlef
2
Document Preprocessing
• Lexical analysis
• Elimination of stopwords
• Stemming of the remaining words
• Selection of index terms
• Construction of term categorization structures (thesaurus)
3
Logical View of a Document
[Figure: pipeline from a document to its logical views — structure recognition yields text + structure; normalization of accents and spacing, stopword removal, noun-group detection, and stemming reduce the full text to index terms, followed by automatic or manual indexing.]
4
1)Lexical Analysis of the Text
• Lexical analysis: convert an input stream of characters into a stream of words.
• The major objective is the identification of the words in the text. How? (a sketch follows the list)
– Digits: ignoring numbers is a common policy
– Hyphens: e.g., state-of-the-art
– Punctuation marks: remove them. Exception: 510B.C
– Case: usually folded to a single case
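A minimal sketch of these choices in Python (an assumed implementation, not one taken from the slides): tokens are split on non-alphanumeric characters, hyphenated forms are kept together, pure numbers are dropped, and case is folded.

```python
import re

def lexical_analysis(text):
    """Convert a character stream into a stream of normalized word tokens."""
    tokens = []
    # Split on whitespace/punctuation, but keep hyphenated forms together.
    for raw in re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text):
        token = raw.lower()          # case folding
        if token.isdigit():          # ignoring pure numbers is a common policy
            continue
        tokens.append(token)
    return tokens

print(lexical_analysis("State-of-the-art IR systems were rare in 510 B.C."))
# ['state-of-the-art', 'ir', 'systems', 'were', 'rare', 'in', 'b', 'c']
```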
5
2) Elimination of Stopwords
• Words that appear too often are not useful for IR.
• Stopwords: words that appear in more than 80% of the documents in the collection are treated as stopwords and are filtered out as potential index words.
• Problem: how do you search for "to be or not to be"? (see the sketch below)
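A minimal sketch of stopword filtering; the stopword set here is a tiny illustrative subset, not a real list. It also shows the problem noted above: the classic query disappears entirely.

```python
STOPWORDS = {"a", "an", "the", "to", "be", "or", "not", "of", "in", "is"}

def remove_stopwords(tokens):
    """Filter out tokens that are too frequent to be useful index terms."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))
# [] -- the whole query is filtered away
```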
6
3) Stemming
• A stem: the portion of a word which is left after the removal of its affixes (i.e., prefixes or suffixes).
• Example: connect, connected, connecting, connection, connections
• Removal strategies (a sketch of affix removal follows):
– affix removal: intuitive, simple
– table lookup
– successor variety
– n-gram
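A toy affix-removal stemmer as a sketch; real systems typically use the Porter stemmer, and the suffix list and minimum-stem-length check below are simplifying assumptions.

```python
def simple_stem(word):
    """Strip a few common English suffixes (a toy affix-removal stemmer)."""
    for suffix in ("ions", "ion", "ings", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["connect", "connected", "connecting", "connection", "connections"]
print([simple_stem(w) for w in words])
# ['connect', 'connect', 'connect', 'connect', 'connect']
```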
7
4) Index Terms Selection
• Motivation
– A sentence is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives.
– Most of the semantics is carried by the nouns.
• Identification of noun groups
– A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold (see the sketch below).
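A sketch of noun-group identification under the assumption that the text has already been POS-tagged (the tagger itself is not shown); the tag set and the distance threshold are illustrative.

```python
def noun_groups(tagged_tokens, max_distance=2):
    """Group nouns whose distance in the token stream stays within a threshold.

    tagged_tokens: list of (word, pos) pairs, assumed to come from some
    POS tagger; 'N' marks a noun here.
    """
    groups, current, last_noun_pos = [], [], None
    for i, (word, pos) in enumerate(tagged_tokens):
        if pos == "N":
            if last_noun_pos is not None and i - last_noun_pos > max_distance:
                groups.append(current)   # distance exceeded: close the group
                current = []
            current.append(word)
            last_noun_pos = i
    if current:
        groups.append(current)
    return groups

tagged = [("the", "D"), ("computer", "N"), ("science", "N"), ("students", "N"),
          ("are", "V"), ("writing", "V"), ("programs", "N")]
print(noun_groups(tagged))
# [['computer', 'science', 'students'], ['programs']]
```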
8
5) Thesaurus Construction
• Thesaurus: a precompiled list of important words in a given domain of knowledge; for each word in this list, there is a set of related words.
• A controlled vocabulary for indexing and searching. Why?
– normalization,
– concept indexing,
– reduction of noise,
– identification, etc.
9
The Purpose of a Thesaurus
• To provide a standard vocabulary for indexing and searching
• To assist users with locating terms for proper query formulation
• To provide classified hierarchies that allow the broadening and narrowing of the current query request
10
Thesaurus (cont.)
• Not like a common dictionary
– words with their explanations
• May contain all the words in a language, or only the words of a specific domain
• Carries a lot of other information, especially the relationships between words
– classification of words in the language
– word relationships such as synonyms and antonyms
11
Roget thesaurus example
cowardly, adjective (Arabic gloss: "coward")
Ignobly lacking in courage: cowardly turncoats
Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang)
• http://www.thesaurus.com
• http://www.dictionary.com/
12
Thesaurus Term Relationships
• BT: broader term
• NT: narrower term
• RT: related term (non-hierarchical, but related)
13
Use of Thesaurus
• Indexing: select the most appropriate thesaurus entries for representing the document.
• Searching: design the most appropriate search strategy.
– If the search does not retrieve enough documents, the thesaurus can be used to expand the query.
– If the search retrieves too many items, the thesaurus can suggest more specific search vocabulary (a query-expansion sketch follows).
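A minimal sketch of thesaurus-based query expansion; the tiny thesaurus and the query terms are made up for illustration.

```python
# A toy thesaurus: each entry lists broader (BT), narrower (NT), related (RT) terms.
THESAURUS = {
    "computer": {"BT": ["machine"], "NT": ["laptop", "mainframe"], "RT": ["software"]},
    "retrieval": {"BT": ["search"], "NT": ["text retrieval"], "RT": ["indexing"]},
}

def expand_query(terms, too_few_results=True):
    """Broaden a query with RT/BT entries, or narrow it with NT entries."""
    expanded = list(terms)
    for term in terms:
        entry = THESAURUS.get(term, {})
        if too_few_results:
            expanded += entry.get("RT", []) + entry.get("BT", [])
        else:
            expanded += entry.get("NT", [])
    return expanded

print(expand_query(["computer", "retrieval"]))
# ['computer', 'retrieval', 'software', 'machine', 'indexing', 'search']
```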
14
Document clustering
• Document clustering: the operation of grouping together similar documents into classes
• Global vs. local
• Global: whole collection (a sketch follows)
– At compile time, a one-time operation
• Local
– Cluster the results of a specific query
– At runtime, with each query
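A minimal sketch of global clustering over a small collection, assuming scikit-learn's TfidfVectorizer and KMeans as the toolchain (the slides do not prescribe a particular algorithm).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "text compression with huffman coding",
    "huffman and arithmetic coding methods",
    "stemming and stopword removal for indexing",
    "index terms and thesaurus construction",
]

# Global clustering: performed once over the whole collection.
vectors = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # one cluster id per document
```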
15
Text Compression
• Why is text compression important?
– Less storage space
– Less time for data transmission
– Less time to search (if the compression method allows direct search without decompression)
16
Terminology
• Symbol
– The smallest unit for compression (e.g., a character, a word, or a fixed number of characters)
• Alphabet
– The set of all possible symbols
• Compression ratio
– The size of the compressed file as a fraction of the size of the uncompressed file
17
Types of compression models
• Static models– Assume some data properties in
advance (e.g., relative frequencies of symbols) for all input text
– Allow direct (or random) access
– Poor compression ratios when the input text deviates from the assumption
18
Types of compression models
• Semi-static models– Learn data properties in a first pass– Compress the input data in a second pass
– Allow direct (or random) access– Good compression ratio
– Must store the learned data properties for decoding
– Must have whole data at hand
19
Types of compression models
• Adaptive models– Start with no information– Progressively learn the data properties
as the compression process goes on
– Need only one pass for compression
– Do not allow random access• Decompression cannot start in the middle
20
General approaches to text compression
• Dictionary methods
– (Basic) dictionary method
– Ziv-Lempel's adaptive method
• Statistical methods
– Arithmetic coding
– Huffman coding
21
Dictionary methods
• Replace a sequence of symbols with a pointer to a dictionary entry
[Figure: the input aaababbbaaabaaaaaaabaabb is compressed by replacing occurrences of dictionary entries (e.g., "aaa", "bb") with pointers into the dictionary.]
• A fixed dictionary may be suitable for one text but unsuitable for another
22
Adaptive Ziv-Lempel coding
• Instead of dictionary entries, pointers point to previous occurrences of symbols
Input: aaababbbaaabaaaaaaabaabb
a|a|b|b|b|a|a|a|b|b
23
Adaptive Ziv-Lempel coding
• Instead of dictionary entries, pointers point to the previous occurrences of symbols
Input:   aaababbbaaabaaaaaaabaabb
Phrases: a | aa | b | ab | bb | aaa | ba | aaaa | aab | aabb   (numbered 1-10)
Output:  0a | 1a | 0b | 1b | 3b | 2a | 3a | 6a | 2b | 9b   (an encoder sketch follows)
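A short LZ78-style encoder as a sketch of the parsing above; it reproduces the (pointer, symbol) pairs shown, though the handling of a trailing phrase is simplified.

```python
def lz78_encode(text):
    """Parse text into phrases; each phrase = longest known phrase + one new symbol."""
    dictionary = {"": 0}          # phrase -> index (0 is the empty phrase)
    phrase, output = "", []
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch          # keep extending the current phrase
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                    # simplified flush of a trailing phrase
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

print(lz78_encode("aaababbbaaabaaaaaaabaabb"))
# [(0, 'a'), (1, 'a'), (0, 'b'), (1, 'b'), (3, 'b'),
#  (2, 'a'), (3, 'a'), (6, 'a'), (2, 'b'), (9, 'b')]
```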
24
Adaptive Ziv-Lempel coding
• Good compression ratio (4 bits/character)
– Suitable for general data compression and widely used (e.g., zip, compress)
• Does not allow decoding to start in the middle of a compressed file: direct access is impossible without decompressing from the beginning
25
Arithmetic coding
• The input text (data) is converted to a real number between 0 and 1, such as 0.328701 (a sketch follows)
• Good compression ratio (2 bits/character)
• Slow
• Cannot start decoding in the middle of a file
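A minimal sketch of the interval-narrowing idea, assuming fixed symbol probabilities given up front; a practical arithmetic coder works with scaled integers and emits bits incrementally rather than returning one float.

```python
def arithmetic_encode(text, probs):
    """Narrow the interval [low, high) once per symbol; any number inside
    the final interval identifies the whole message."""
    # Cumulative probability ranges, e.g. {'a': (0.0, 0.7), 'b': (0.7, 1.0)}
    ranges, start = {}, 0.0
    for sym, p in probs.items():
        ranges[sym] = (start, start + p)
        start += p

    low, high = 0.0, 1.0
    for ch in text:
        lo, hi = ranges[ch]
        width = high - low
        low, high = low + width * lo, low + width * hi
    return (low + high) / 2   # any value inside [low, high) would do

print(arithmetic_encode("aab", {"a": 0.7, "b": 0.3}))
```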
26
Symbols and alphabet for textual data
• Words are more appropriate symbols for natural language text
• Example: "for each rose, a rose is a rose"
– Alphabet: {a, each, for, is, rose, ' ', ','}
– If we always assume a single space after a word unless there is another separator, the alphabet reduces to {a, each, for, is, rose, ','}
27
Huffman coding
• Assign shorter codes (bits) to more frequent symbols and longer codes (bits) to less frequent symbols
• Example: “for each rose, a rose is a rose”
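A minimal Huffman coder for the example sentence, built with Python's heapq; it yields an optimal prefix code with the same code lengths as the following slides (2 bits for "a" and "rose", 3 bits for the rest), though the exact bit patterns may differ because ties can be broken differently.

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build an optimal prefix code: frequent symbols get shorter codes."""
    freq = Counter(symbols)
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        # Prefix one subtree with 0 and the other with 1, then merge them.
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

tokens = ["for", "each", "rose", ",", "a", "rose", "is", "a", "rose"]
print(huffman_codes(tokens))
# 2-bit codes for 'a' and 'rose', 3-bit codes for each, ',', for, is
```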
28
Example
Symbol frequencies in "for each rose, a rose is a rose":
  symb: each  ,  for  is  a  rose
  freq:    1  1    1   1  2     3
[Figure: Huffman tree construction — the two lowest-frequency nodes are merged repeatedly, giving internal nodes of weight 2, 2, 4, and 5.]
29
Example
[Figure: the same Huffman tree with its edges labeled 0 and 1; the leaves are each, ',', for, is, a, rose with the frequencies above.]
30
Example
Reading the edge labels from root to leaf gives the code for each symbol:
  symb: each    ,  for   is    a  rose
  freq:    1    1    1    1    2     3
  code:  100  101  110  111   00    01
31
Canonical tree
[Figure: the canonical Huffman tree for the example, with leaves each, ',', for, is, a, rose.]
• The height of the left subtree of any node is never smaller than that of the right subtree
• All leaves are in increasing order of probabilities (frequencies) from left to right
32
Advantages of canonical tree
• Smaller data for decoding
– A non-canonical tree needs a mapping table between symbols and codes
– A canonical tree needs only:
  • a (sorted) list of symbols
  • for each level, a pair: the number of symbols and the numerical value of the first code
  – e.g., {(0, NA), (2, 2), (4, 0)}
• More efficient encoding/decoding
33
Byte-oriented Huffman coding
• Use whole bytes instead of binary coding
[Figure: comparison of a non-optimal and an optimal byte-oriented coding tree for 256 symbols; the trees differ in how the 254 empty nodes and the symbols (2 vs. 254 at the deeper level) are distributed across levels.]
34
Comparison of methods
35
Compression of inverted files
• Inverted file: composed of
– a vector containing all distinct words in the text collection, and
– for each word, a list of the documents in which that word occurs.
• Types of code for compressing these lists (a sketch follows):
– Unary
– Elias-gamma
– Elias-delta
– Golomb
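A minimal sketch of one of these codes: the document lists are first gap-encoded and each gap is then written with the Elias-gamma code; the posting list below is made up for illustration.

```python
def elias_gamma(n):
    """Elias-gamma code for n >= 1: (len-1) zeros, then n in binary."""
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def compress_postings(doc_ids):
    """Gap-encode a sorted posting list, then Elias-gamma code each gap."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return "".join(elias_gamma(g) for g in gaps)

postings = [3, 5, 20, 21, 23, 76]     # documents containing some word
print(compress_postings(postings))
# gaps 3, 2, 15, 1, 2, 53 -> '011' '010' '0001111' '1' '010' '00000110101'
```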
36
Conclusions
• Text transformation: meaning instead of strings
– Lexical analysis
– Stopwords
– Stemming
• Text compression
– Searchable
– Random access
– Model + coding
– Inverted files
37
Thanks… Any questions?