Text operations Prepared By : Loay Alayadhi 425121605 Supervised by: Dr. Mourad Ykhlef.

  • date post

    22-Dec-2015


Text operations

Prepared By : Loay Alayadhi 425121605

Supervised by: Dr. Mourad Ykhlef


Document Preprocessing

• Lexical analysis
• Elimination of stopwords
• Stemming of the remaining words
• Selection of index terms
• Construction of term categorization structures (thesaurus)


Logical View of a Document

[Figure: logical view of a document. Stages shown: document, structure recognition (text + structure), accents and spacing etc., stopwords, noun groups, stemming, automatic or manual indexing. Resulting views: structure, text, full text, index terms.]


1)Lexical Analysis of the Text

• Lexical analysis: convert an input stream of characters into a stream of words.
– The major objective is the identification of the words in the text. How?
• Digits: ignoring numbers is a common approach.
• Hyphens, e.g., state-of-the-art.
• Punctuation marks: remove them. Exception: 510B.C.
• Case
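The four decisions above (digits, hyphens, punctuation, case) can be sketched as a small tokenizer. This is a minimal illustration, not a reference implementation: the function name `tokenize`, the `keep_digits` flag, and the choices to split on hyphens and fold to lower case are assumptions made for the example.

```python
import re

def tokenize(text, keep_digits=False):
    """Turn a character stream into a list of word tokens (illustrative rules only)."""
    text = text.lower()                      # case: fold everything to lower case
    text = text.replace("-", " ")            # hyphens: assumed policy is to split them
    tokens = re.findall(r"[a-z0-9]+", text)  # punctuation: anything else is a separator
    if not keep_digits:
        # digits: drop tokens that consist only of digits
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens

print(tokenize("State-of-the-art IR, since 1999!"))
print(tokenize("510B.C."))
```

The second call shows why the slide lists 510B.C. as an exception: blanket punctuation removal breaks it into the fragments "510b" and "c".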


2) Elimination of Stopwords

• Words that appear too often are not useful for IR.

• Stopwords: words that appear in more than 80% of the documents in the collection are stopwords and are filtered out as potential index words.

• Problem
– Search for “to be or not to be”?
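The 80% rule can be sketched by counting, for each word, the fraction of documents that contain it. The function names and the toy collection below are hypothetical; only the 80% threshold comes from the slide.

```python
from collections import Counter

def stopwords_by_document_frequency(docs, threshold=0.8):
    """docs: list of token lists. Returns words occurring in more than `threshold` of the documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word at most once per document
    return {w for w, c in doc_freq.items() if c / len(docs) > threshold}

def remove_stopwords(tokens, stopwords):
    return [t for t in tokens if t not in stopwords]

docs = [["the", "cat"], ["the", "dog"], ["the", "bird"], ["the", "fish"], ["a", "the"]]
stop = stopwords_by_document_frequency(docs)
print(stop)

# The problem case from the slide, with an assumed hand-made stoplist:
query = ["to", "be", "or", "not", "to", "be"]
print(remove_stopwords(query, {"to", "be", "or", "not"}))
```

With that assumed stoplist the whole query is filtered away, which is the problem the slide points at.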


3) Stemming

• A stem: the portion of a word which is left after the removal of its affixes (i.e., prefixes or suffixes).

• Example
– connect, connected, connecting, connection, connections

• Removing strategies
– affix removal: intuitive, simple
– table lookup
– successor variety
– n-gram
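The affix-removal strategy can be illustrated with a deliberately naive suffix stripper. The suffix list and the minimum stem length are assumptions chosen so that the slide's example works; this is not Porter's algorithm or any other published stemmer.

```python
# Suffixes tried from longest to shortest (assumed list, for illustration only).
SUFFIXES = ["ions", "ion", "ing", "ed", "s"]

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

words = ["connect", "connected", "connecting", "connection", "connections"]
print({w: naive_stem(w) for w in words})
```

All five variants map to the stem "connect". A rule set this small over-stems many other words, which is why practical stemmers use more careful conditions.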


4) Index Terms Selection

• Motivation
– A sentence is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives.
– Most of the semantics is carried by the nouns.

• Identification of noun groups
– A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold.
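A rough sketch of noun grouping follows. It assumes the text is already part-of-speech tagged (the tags below are written by hand) and that "syntactic distance" is measured as the number of words between two consecutive nouns; both are assumptions for illustration, as are the function name and the "NOUN" tag label.

```python
def noun_groups(tagged_words, threshold):
    """tagged_words: list of (word, tag). Groups nouns separated by at most `threshold` words."""
    groups, current, last_pos = [], [], None
    for pos, (word, tag) in enumerate(tagged_words):
        if tag != "NOUN":
            continue
        # number of words between this noun and the previous noun
        if last_pos is not None and pos - last_pos - 1 > threshold:
            groups.append(current)
            current = []
        current.append(word)
        last_pos = pos
    if current:
        groups.append(current)
    return groups

tagged = [("computer", "NOUN"), ("science", "NOUN"), ("is", "VERB"), ("a", "DET"),
          ("very", "ADV"), ("broad", "ADJ"), ("field", "NOUN")]
print(noun_groups(tagged, threshold=1))
```

With a threshold of 1, "computer science" forms one group and "field" another; a threshold of 4 merges them into a single group.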


5) Thesaurus Construction

• Thesaurus: a precompiled list of important words in a given domain of knowledge; for each word in this list, there is a set of related words.

• A controlled vocabulary for indexing and searching. Why?
– Normalization
– Indexing concepts
– Reduction of noise
– Identification, etc.


The Purpose of a Thesaurus

• To provide a standard vocabulary for indexing and searching

• To assist users with locating terms for proper query formulation

• To provide classified hierarchies that allow the broadening and narrowing of the current query request


Thesaurus (cont.)

• Not like a common dictionary
– Words with their explanations

• May contain words in a language
• Or only contain words in a specific domain
• With a lot of other information, especially the relationships between words
– Classification of words in the language
– Word relationships such as synonyms and antonyms


Roget thesaurus example

cowardly, adjective (Arabic: جبان)
Ignobly lacking in courage: cowardly turncoats
Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang)

• http://www.thesaurus.com
• http://www.dictionary.com/


Thesaurus Term Relationships

• BT: broader
• NT: narrower
• RT: non-hierarchical, but related


Use of Thesaurus

• Indexing: select the most appropriate thesaurus entries for representing the document.

• Searching: design the most appropriate search strategy.
– If the search does not retrieve enough documents, the thesaurus can be used to expand the query.
– If the search retrieves too many items, the thesaurus can suggest more specific search vocabulary.
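The two searching cases can be sketched with a toy thesaurus that stores BT, NT and RT links per term. The entries, the function names, and the "too few" and "too many" cut-offs are all hypothetical; the slides do not prescribe a particular data structure or thresholds.

```python
# Hypothetical thesaurus entry using the BT / NT / RT relationships.
TOY_THESAURUS = {
    "compression": {
        "BT": ["coding"],
        "NT": ["huffman coding", "arithmetic coding"],
        "RT": ["encryption"],
    },
}

def expand_query(terms, thesaurus, relations):
    """Add the terms linked by the given relations, keeping the original order."""
    expanded = list(terms)
    for term in terms:
        entry = thesaurus.get(term, {})
        for rel in relations:
            for related in entry.get(rel, []):
                if related not in expanded:
                    expanded.append(related)
    return expanded

def reformulate(terms, n_results, thesaurus, too_few=5, too_many=100):
    if n_results < too_few:    # not enough documents: broaden with BT and RT terms
        return expand_query(terms, thesaurus, ["BT", "RT"])
    if n_results > too_many:   # too many items: suggest narrower (NT) vocabulary
        narrower = [nt for t in terms for nt in thesaurus.get(t, {}).get("NT", [])]
        return narrower or list(terms)
    return list(terms)

print(reformulate(["compression"], 2, TOY_THESAURUS))
print(reformulate(["compression"], 500, TOY_THESAURUS))
```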


Document clustering

• Document clustering: the operation of grouping together similar documents in classes

• Global vs. local
• Global: whole collection
– At compile time, one-time operation
• Local
– Cluster the results of a specific query
– At runtime, with each query


Text Compression

• Why is text compression important?
– Less storage space
– Less time for data transmission
– Less time to search (if the compression method allows direct search without decompression)


Terminology

• Symbol
– The smallest unit for compression (e.g., character, word, or a fixed number of characters)

• Alphabet
– A set of all possible symbols

• Compression ratio
– The size of the compressed file as a fraction of the uncompressed file
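As a quick illustration of the compression ratio, the sketch below divides the compressed size by the uncompressed size. Python's standard `zlib` module stands in for "some compressor" here; the slides do not tie the definition to any particular tool, and the sample text is made up.

```python
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size as a fraction of the uncompressed size (smaller is better)."""
    return len(zlib.compress(data)) / len(data)

sample = b"for each rose, a rose is a rose " * 100
print(f"{compression_ratio(sample):.3f}")
```

Highly repetitive input such as this sample yields a ratio far below 1; very short or random input can yield a ratio above 1.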


Types of compression models

• Static models
– Assume some data properties in advance (e.g., relative frequencies of symbols) for all input text
– Allow direct (or random) access
– Poor compression ratios when the input text deviates from the assumption


Types of compression models

• Semi-static models
– Learn data properties in a first pass
– Compress the input data in a second pass
– Allow direct (or random) access
– Good compression ratio
– Must store the learned data properties for decoding
– Must have whole data at hand


Types of compression models

• Adaptive models
– Start with no information
– Progressively learn the data properties as the compression process goes on
– Need only one pass for compression
– Do not allow random access
• Decompression cannot start in the middle


General approaches to text compression

• Dictionary methods
– (Basic) dictionary method
– Ziv-Lempel’s adaptive method

• Statistical methods
– Arithmetic coding
– Huffman coding


Dictionary methods

• Replace a sequence of symbols with a pointer to a dictionary entry

[Figure: compression with a dictionary. Input: aaababbbaaabaaaaaaabaabb. Output: babbabaa. Dictionary: aaa, bb.]

• May be suitable for one text but may be unsuitable for another


Adaptive Ziv-Lempel coding

• Instead of dictionary entries, pointers point to the previous occurrences of symbols

[Figure: the input aaababbbaaabaaaaaaabaabb is compressed; the intermediate frame shows a|a|b|b|b|a|a|a|b|b.]


Adaptive Ziv-Lempel coding

• Instead of dictionary entries, pointers point to the previous occurrences of symbols

Input: aaababbbaaabaaaaaaabaabb

No.     1   2   3   4   5   6    7   8     9    10
Phrase  a   aa  b   ab  bb  aaa  ba  aaaa  aab  aabb
Code    0a  1a  0b  1b  3b  2a   3a  6a    2b   9b
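The parse on this slide matches the LZ78 variant of Ziv-Lempel coding: each new phrase is an earlier phrase (referenced by its number, 0 for the empty phrase) plus one extra symbol. The sketch below reproduces the pairs for the slide's input. The function names are chosen for this example, and the handling of a leftover phrase at the end of the input is one possible convention, not something the slide specifies.

```python
def lz78_encode(text):
    """Return a list of (phrase_number, symbol) pairs; phrase 0 is the empty phrase."""
    dictionary = {"": 0}
    output, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                               # keep extending a known phrase
        else:
            output.append((dictionary[phrase], ch))    # pointer to the prefix + new symbol
            dictionary[phrase + ch] = len(dictionary)  # number the new phrase
            phrase = ""
    if phrase:  # input ended inside a known phrase: emit its prefix and last symbol
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

def lz78_decode(pairs):
    phrases = [""]
    for index, ch in pairs:
        phrases.append(phrases[index] + ch)
    return "".join(phrases)

pairs = lz78_encode("aaababbbaaabaaaaaaabaabb")
print("|".join(f"{i}{c}" for i, c in pairs))
```

Decoding has to rebuild the phrase list from the first pair onward, so it cannot start in the middle of the compressed stream.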


Adaptive Ziv-Lempel coding

• Good compression ratio (4 bits/character)
– Suitable for general data compression and widely used (e.g., zip, compress)

• Does not allow decoding to start in the middle of a compressed file: direct access is impossible without decompression from the beginning


Arithmetic coding

• The input text (data) is converted to a real number between 0 and 1, such as 0.328701

• Good compression ratio (2 bits/character)

• Slow

• Cannot start decoding in the middle of a file
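To make the "real number between 0 and 1" concrete, here is a toy arithmetic coder that uses exact fractions and a semi-static model (symbol frequencies counted in a first pass). It is only a sketch under those assumptions: practical coders use fixed-precision integer arithmetic, and the function names and the sample string are invented for this example.

```python
from collections import Counter
from fractions import Fraction

def build_ranges(text):
    """Give each symbol a sub-interval of [0, 1) proportional to its frequency."""
    counts, start, ranges = Counter(text), Fraction(0), {}
    for sym in sorted(counts):
        width = Fraction(counts[sym], len(text))
        ranges[sym] = (start, start + width)
        start += width
    return ranges

def arithmetic_encode(text, ranges):
    """Narrow [low, high) once per symbol; any number in the final interval encodes the text."""
    low, high = Fraction(0), Fraction(1)
    for sym in text:
        span = high - low
        s_low, s_high = ranges[sym]
        low, high = low + span * s_low, low + span * s_high
    return low, high

def arithmetic_decode(value, ranges, length):
    out = []
    for _ in range(length):
        for sym, (s_low, s_high) in ranges.items():
            if s_low <= value < s_high:
                out.append(sym)
                value = (value - s_low) / (s_high - s_low)  # rescale to [0, 1)
                break
    return "".join(out)

text = "abracadabra"
ranges = build_ranges(text)
low, high = arithmetic_encode(text, ranges)
print(float(low), float(high))
print(arithmetic_decode(low, ranges, len(text)))
```

Each decoding step depends on the value left over from the previous step, which is why decoding cannot begin in the middle of a file.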


Symbols and alphabet for textual data

• Words are more appropriate symbols for natural language text

• Example: “for each rose, a rose is a rose”
– Alphabet
• {a, each, for, is, rose, ‘ ’, ‘,’}

– Always assume a single space after a word unless there is another separator
• {a, each, for, is, rose, ‘,’}


Huffman coding

• Assign shorter codes (bits) to more frequent symbols and longer codes (bits) to less frequent symbols

• Example: “for each rose, a rose is a rose”


Example

symb   freq
each   1
,      1
for    1
is     1
a      2
rose   3

[Figure: Huffman tree under construction. Leaves: each, ",", for, is, a, rose. Internal node weights: 2, 2, 4, 5.]


Example

[Figure: the Huffman tree for the symbols each, ",", for, is, a, rose (frequencies 1, 1, 1, 1, 2, 3), with every branch labeled 0 or 1.]


Example

[Figure: the labeled Huffman tree for the symbols each, ",", for, is, a, rose.]

symb   freq   code
each   1      100
,      1      101
for    1      110
is     1      111
a      2      00
rose   3      01
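A Huffman code for this example can be built by repeatedly merging the two least frequent nodes. The sketch below uses Python's `heapq` with word symbols; the function name and the tie-breaking counter are choices made for this example. The exact bit patterns depend on how ties are broken, so they may differ from the table, but the code lengths (2 bits for "a" and "rose", 3 bits for the rest) come out the same.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(symbols):
    """Return {symbol: bit string} for a sequence of string symbols."""
    tie = count()  # unique tie-breaker so the heap never compares tree nodes
    heap = [(f, next(tie), sym) for sym, f in Counter(symbols).items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least frequent nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):         # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

words = ["for", "each", "rose", ",", "a", "rose", "is", "a", "rose"]
codes = huffman_codes(words)
print(codes)
print(sum(len(codes[w]) for w in words), "bits in total")
```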


Canonical tree

[Figure: the Huffman tree for each, ",", for, is, a, rose, with every branch labeled 0 or 1.]

- Height of the left subtree of any node is never smaller than that of the right subtree
- All leaves are in increasing order of probabilities (frequencies) from left to right


Advantages of canonical tree

• Smaller data for decoding
– Non-canonical tree needs:
• Mapping table between symbols and codes
– Canonical tree needs:
• (Sorted) list of symbols
• A pair of number of symbols and numerical value of the first code for each level
– E.g., {(0, NA), (2, 2), (4, 0)}

• More efficient encoding/decoding
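The pairs {(0, NA), (2, 2), (4, 0)} can be reproduced from the code lengths alone. The sketch below assumes the common convention in which the longest codes receive the numerically smallest values and codes of the same length are consecutive integers; the function name and the alphabetical ordering within a level are assumptions, so the resulting bit patterns are one valid canonical assignment rather than the codes from the earlier table.

```python
def canonical_codes(lengths):
    """lengths: {symbol: code length in bits} taken from a Huffman tree.
    Returns (per-level (count, first_code) pairs, {symbol: bit string})."""
    max_len = max(lengths.values())
    num = [0] * (max_len + 2)                # num[l] = number of codes of length l
    for length in lengths.values():
        num[length] += 1
    first = [0] * (max_len + 2)              # first[l] = value of the first code of length l
    for length in range(max_len - 1, 0, -1):
        first[length] = (first[length + 1] + num[length + 1]) // 2
    levels = [(num[l], first[l] if num[l] else None) for l in range(1, max_len + 1)]
    codes, next_code = {}, first[:]
    for sym in sorted(lengths):              # symbols in sorted order within each level
        l = lengths[sym]
        codes[sym] = format(next_code[l], "0{}b".format(l))
        next_code[l] += 1
    return levels, codes

lengths = {"each": 3, ",": 3, "for": 3, "is": 3, "a": 2, "rose": 2}
levels, codes = canonical_codes(lengths)
print(levels)
print(codes)
```

Here `None` plays the role of NA on the slide: level 1 holds no symbols, level 2 holds 2 symbols starting at value 2 (binary 10), and level 3 holds 4 symbols starting at value 0 (binary 000).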


Byte-oriented Huffman coding

• Use whole bytes instead of binary coding

[Figure: two byte-oriented trees. Non-optimal tree: node labels 256 symbols, 256 symbols, 254 empty nodes. Optimal tree: node labels 256 symbols, 2 symbols with 254 empty nodes, 254 symbols.]


Comparison of methods


Compression of inverted files

• Inverted file: composed of
– A vector containing all distinct words in the text collection.
– For each word, a list of documents in which that word occurs.
– Types of code:
• Unary
• Elias-γ
• Elias-δ
• Golomb
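Three of the listed codes can be sketched in a few lines. The sketch assumes the convention in which unary(n) is n-1 one-bits followed by a zero, and it assumes that document lists are stored as gaps between successive document numbers before coding (a common practice that the slide does not state). The function names and the sample list are invented for the example; Golomb coding is omitted.

```python
def unary(n):
    """n >= 1: (n - 1) ones followed by a terminating zero."""
    return "1" * (n - 1) + "0"

def elias_gamma(n):
    """Unary code of the bit length of n, then n without its leading 1 bit."""
    binary = bin(n)[2:]
    return unary(len(binary)) + binary[1:]

def elias_delta(n):
    """Like gamma, but the bit length itself is gamma-coded."""
    binary = bin(n)[2:]
    return elias_gamma(len(binary)) + binary[1:]

def gaps(doc_ids):
    """Differences between successive document numbers (first gap counts from 0)."""
    return [cur - prev for prev, cur in zip([0] + doc_ids[:-1], doc_ids)]

postings = [3, 7, 8, 15]  # hypothetical document list for one word
print(gaps(postings))
print([elias_gamma(g) for g in gaps(postings)])
```

Small gaps get short codes, so words that occur in many documents, and therefore have small gaps, compress well.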


Conclusions

• Text transformation: meaning instead of strings
– Lexical analysis
– Stopwords
– Stemming

• Text compression
– Searchable
– Random access
– Model-coding

• Inverted files


Thanks. Any questions?