
Transcript of "A pTree organization for a corpus of documents"

Page 1: A pTree organization for a corpus of documents

[Slide figure: a pTree organization for doc 1 of the corpus, indexed over the 7-term TermArray (= ContentWord/Stem?): a, or, on, no, in, it, to.]

TermFreqArray (tf), doc=1:       0 7 6 0 3 0 0
DocFreqArray (df):               3 2 3 1 2 2 3
P_te,doc=1 (term existence):     0 1 1 0 1 0 0

Bit slices of the tf values for doc 1:
P_tf,doc=1,2:  0 1 1 0 1 0 0
P_tf,doc=1,1:  0 1 1 0 0 0 0
P_tf,doc=1,0:  0 1 0 0 1 0 0

First-occurrence-position array, A_fop,doc=1:  0 9 3 0 6 0 0

Level-0 term-position bit maps (term positions 1-9) mark where each term occurs in doc 1, e.g.:
0 0 0 0 0 0 0 0 0
0 1 0 0 1 0 0 1 0
1 0 0 1 0 0 0 0 0

Level-1 pTrees (per document):
P_te,doc=1 (SetPred = sum > 0):  0 1 0 0 ...
tf:                              0 3 0 0 ...
tf2:                             0 1 0 0 ...
tf1:                             0 0 0 0 ...
tf0:                             0 1 0 0 ...

Level-2 pTrees (per corpus):
df (sum over te's):              1 2 2 3 ...
df2:                             0 0 0 1 ...
df1:                             0 1 1 0 ...
df0:                             1 0 0 1 ...

Best pTree organization? Level-1 gives te and tf (term level); level-2 gives df (document level); level-3? (corpus level).

Gives the Comprehensive Text? Rooted by all digital media, then libraries, then corpora, then documents, then texts, then terms...
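The bit-slice construction above is mechanical enough to sketch in code. Below is a minimal Python sketch, assuming the encoding described on this slide (function names and docs 2-4 are illustrative, not from the slides): a per-document tf vector is decomposed into bit-slice pTrees, te is the predicate tf > 0, and df is the sum of the te vectors over documents.

```python
# Minimal sketch of level-1 / level-2 pTree construction (illustrative names).

def tf_bit_slices(tf, width=3):
    """Decompose a term-frequency vector into bit slices,
    ordered high bit to low bit (P_tf,2 then P_tf,1 then P_tf,0)."""
    return [[(f >> b) & 1 for f in tf] for b in range(width - 1, -1, -1)]

def term_existence(tf):
    """Level-1 te pTree: SetPred = (sum > 0), i.e. tf > 0 per term."""
    return [1 if f > 0 else 0 for f in tf]

def doc_frequency(te_vectors):
    """Level-2 df array: sum the per-document te pTrees over the corpus."""
    return [sum(bits) for bits in zip(*te_vectors)]

# The 7-term TermArray from the slide: a, or, on, no, in, it, to
tf_doc1 = [0, 7, 6, 0, 3, 0, 0]
print(tf_bit_slices(tf_doc1))   # binary expansion of the tf values
print(term_existence(tf_doc1))  # [0, 1, 1, 0, 1, 0, 0] = P_te,doc=1

# Hypothetical te vectors for docs 2-4, chosen so that the df array
# matches the slide's DocFreqArray; only doc 1 comes from the slide.
te_all = [term_existence(tf_doc1),
          [1, 1, 1, 1, 1, 1, 1],
          [1, 0, 1, 0, 0, 1, 1],
          [1, 0, 0, 0, 0, 0, 1]]
print(doc_frequency(te_all))    # [3, 2, 3, 1, 2, 2, 3] = df
```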

Page 2:

book1: I Heard a Little Baa (22 words). This lively rhyming book gives readers a chance to practice animal sounds, identify animals, and guess what animal is coming up next.

book2: Mr. Grumpy's Outing (32 words). On a perfect summer day Mr. Grump decides to take a ride on the river in his small boat; a goat's kick causes a cumulative effect and everyone lands in the water.

book3: The Very Hungry Caterpillar (28 words). A very hungry caterpillar works his way through holes cut into the pages as he eats a variety of foods, spins his cocoon, and emerges as a beautiful butterfly.

book4: Pat The Bunny (32 words). The classic touch-and-feel book in which a boy and girl experience the world around them by patting a bunny, looking in a mirror, smelling flowers, and touching daddy's scratchy face.

book5: Goodnight Moon (14 words). A young rabbit says goodnight to everything in his room before bedtime.

book6: Moo Baa La La La (24 words). The animals in this rhyming board book demonstrate the very different and amusing sounds that they make; the humor is just right for toddlers.

book7: Clap Hands (18 words). Clap hands, dance and spin, open wide and pop it in: sweet babies go through various rhymed activities.

Page 3:

book8: Spots, Feathers and Curly Tails (31 words). Who does that curly tail belong to? What animal has black and white spots? Just turn the page to reveal the familiar farm animals that match the verbal and visual clues.

book9: Baby Bathtime (13 words). This lift-the-flap board book helps babies explore the world around them.

book10: Higher, Higher (29 words). When her dad pushes a young girl in a swing she goes higher, higher, until she reaches fantastical heights and meets an airplane, a rocket, and even an alien.

book11: Rhymes Round the World (19 words). Rhymes and songs from across the globe will delight young children; full-page illustrations dramatize the sense of play.

book12: Shades of People (18 words). Beautiful full-color photographs of children along with a simple text highlight the many shades of skin color.

book13: The Sleepy Little Alphabet (32 words). Parents try to round up the letters of the alphabet as they skitter, scatter, helter-skelter; rhyming text and humorous illustrations highlight each letter until they are all finally asleep in bed.

book14: Ten Tiny Babies (14 words). Babies enjoy a bouncy, noisy day until they are finally fast asleep at night.
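To tie this toy corpus to the term-document matrix defined on the next pages, here is a minimal Python sketch using two of the shorter blurbs above and naive whitespace tokenization (both simplifying assumptions):

```python
from collections import Counter

# Two of the shorter blurbs from the corpus above.
docs = {
    "book5":  "a young rabbit says goodnight to everything in his room before bedtime",
    "book14": "babies enjoy a bouncy noisy day until they are finally fast asleep at night",
}

# Naive tokenization: split on whitespace and count occurrences per document.
counts = {d: Counter(text.split()) for d, text in docs.items()}

# Term-document matrix A: one row per term, one column per document, a_ij = tf_ij.
terms = sorted(set().union(*counts.values()))
A = [[counts[d][t] for d in docs] for t in terms]

for t, row in zip(terms, A):
    print(f"{t:12s} {row}")   # e.g. the term 'a' appears once in each book
```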

Page 4:

Latent semantic indexing (LSI) is an indexing and retrieval method that uses singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.

LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key LSI feature is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts. [1] It is called Latent Semantic Indexing because of its ability to correlate semantically related terms that are latent in a collection of text. It uncovers the

underlying latent semantic structure in the usage of words in a body of text, and that structure can be used to extract the meaning of the text in response to user queries, commonly referred to as concept searches. Queries against a set of docs that have undergone LSI will return results that are conceptually similar in meaning to the search criteria even if the results don't share a specific word or words with the search criteria.
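A concept search of this kind is typically implemented by truncating the SVD of the weighted term-document matrix and folding the query into the reduced space; the sketch below is one standard way to do that (the matrix and query values are placeholders, not data from this deck):

```python
import numpy as np

# Placeholder weighted term-document matrix (4 terms x 3 docs).
A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

k = 2                                    # number of retained concepts
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k].T

doc_vecs = Vk @ Sk                       # document coordinates in concept space
q = np.array([1., 0., 0., 1.])           # query as a raw term vector
q_k = Uk.T @ q                           # fold the query into the same space

# Rank documents by cosine similarity to the folded query.
cos = doc_vecs @ q_k / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_k))
print(np.argsort(-cos))                  # indices of best-matching docs first
```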

Benefits of LSI: LSI overcomes two of the most problematic constraints of Boolean keyword queries: multiple words that have similar meanings (synonymy) and words that have more than one meaning (polysemy). Synonymy and polysemy are often the cause of mismatches in the vocabulary used by authors of docs and users of info retrieval systems. [3] As a result, Boolean keyword queries often return irrelevant results and miss information that is relevant.

LSI is also used to perform automated document categorization. Doc categorization is the assignment of docs to one or more predefined categories based on their similarity to the conceptual content of the categories. [5] LSI uses example documents to establish the conceptual basis for each category. During categorization, the concepts contained in the docs being categorized are compared to the concepts contained in the example items, and a category (or categories) is assigned to the docs based on similarities between the concepts they contain and the concepts contained in the example docs.

Dynamic clustering based on the conceptual content of docs also uses LSI. Clustering is a way to group docs based on their conceptual similarity to each other without using example docs to establish the conceptual basis for each cluster, which is useful with an unknown collection of unstructured text.
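One common realization of the categorization step, consistent with the description above but not spelled out in it: represent each category by the centroid of its example documents in the LSI concept space and assign new docs to the most similar centroid. A minimal Python sketch with placeholder vectors:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two concept-space vectors."""
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def categorize(doc_vec, category_examples):
    """Assign doc_vec to the category whose example-doc centroid is most similar."""
    centroids = {c: ex.mean(axis=0) for c, ex in category_examples.items()}
    return max(centroids, key=lambda c: cosine(doc_vec, centroids[c]))

# Toy 2-D concept vectors (placeholder values, not derived from this corpus).
examples = {
    "animal sounds": np.array([[0.9, 0.1], [0.8, 0.3]]),
    "bedtime":       np.array([[0.1, 0.9], [0.2, 0.8]]),
}
print(categorize(np.array([0.15, 0.85]), examples))  # -> "bedtime"
```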

Because it uses a strictly mathematical approach, LSI is inherently independent of language. This enables LSI to elicit the semantic content of information written in any language without requiring the use of auxiliary structures, such as dictionaries and thesauri.

LSI is not restricted to working only with words. It can also process arbitrary character strings. Any object that can be expressed as text can be represented in an LSI vector space. [6] For example, tests with MEDLINE abstracts have shown that LSI is able to effectively classify genes based on conceptual modeling of the biological information contained in the titles and abstracts of the MEDLINE citations. [7]

LSI automatically adapts to new and changing terminology, and has been shown to be very tolerant of noise (i.e., misspelled words, typographical errors, unreadable characters, etc.).[8] This is especially important for applications using text derived from Optical Character Recognition (OCR) and speech-to-text conversion. LSI also deals effectively with sparse, ambiguous, and contradictory data.

The text does not need to be in sentence form: LSI can work with lists, free-form notes, email, Web-based content, etc. As long as a collection of text contains multiple terms, LSI can be used to identify patterns in the relationships between the important terms and concepts contained in the text.

LSI has proven to be a useful solution to a number of conceptual matching problems. [9][10] It has been used to capture key relationship information, including causal, goal-oriented, and taxonomic information. [11]

LSI timeline:
Mid-1960s – Factor analysis technique first described and tested (H. Borko and M. Bernick)
1988 – Seminal paper on the LSI technique published (Deerwester et al.)
1989 – Original patent granted (Deerwester et al.)
1992 – First use of LSI to assign articles to reviewers [12] (Dumais and Nielsen)
1994 – Patent granted for the cross-lingual application of LSI (Landauer et al.)
1995 – First use of LSI for grading essays (Foltz et al., Landauer et al.)
1999 – First implementation of LSI technology for the intelligence community for analyzing unstructured text (SAIC)
2002 – LSI-based product offering to intelligence-based government agencies (SAIC)
2005 – First vertical-specific application – publishing – EDB (EBSCO, Content Analyst Company)

Page 5:

Mathematics of LSI (linear algebra techniques to learn the conceptual correlations in a collection of text): construct a weighted term-document matrix, perform singular value decomposition on it, and use the result to identify the concepts contained in the text.

Term-document matrix, $A$: each of the $m$ terms is represented by a row, and each of the $n$ docs is represented by a column, with each matrix cell, $a_{ij}$, initially representing the number of times the associated term appears in the indicated document, $\mathrm{tf}_{ij}$. This matrix is usually large and very sparse.
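Written out in standard LSI notation (the slide names the decomposition step but the formula did not survive extraction), the SVD of $A$ and its rank-$k$ truncation are

$$A = U \Sigma V^{T}, \qquad A_k = U_k \Sigma_k V_k^{T},$$

where $U_k$ and $V_k$ hold the first $k$ left and right singular vectors and $\Sigma_k$ the $k$ largest singular values; the rows of $U_k \Sigma_k$ give term vectors and the rows of $V_k \Sigma_k$ give document vectors in the $k$-dimensional concept space.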

Once a term-document matrix is constructed, local and global weighting functions can be applied to it to condition the data. The weighting functions transform each cell, $a_{ij}$ of $A$, to be the product of a local term weight, $l_{ij}$, which describes the relative frequency of a term in a document, and a global weight, $g_i$, which describes the relative frequency of the term within the entire collection of documents: $a_{ij} = g_i \, l_{ij}$.

Some common local weighting functions [13]:
Binary: $l_{ij} = 1$ if term $i$ exists in document $j$, else 0
TermFrequency: $l_{ij} = \mathrm{tf}_{ij}$, the number of occurrences of term $i$ in document $j$
Log: $l_{ij} = \log(\mathrm{tf}_{ij} + 1)$
Augnorm: $l_{ij} = \dfrac{(\mathrm{tf}_{ij} / \max_i \mathrm{tf}_{ij}) + 1}{2}$

Some common global weighting functions:
Binary: $g_i = 1$
Normal: $g_i = \dfrac{1}{\sqrt{\sum_j \mathrm{tf}_{ij}^2}}$
GfIdf: $g_i = \mathrm{gf}_i / \mathrm{df}_i$, where $\mathrm{gf}_i$ is the total number of times term $i$ occurs in the whole collection, and $\mathrm{df}_i$ is the number of documents in which term $i$ occurs
Idf: $g_i = \log_2 \dfrac{n}{1 + \mathrm{df}_i}$
Entropy: $g_i = 1 + \sum_j \dfrac{p_{ij} \log p_{ij}}{\log n}$, where $p_{ij} = \mathrm{tf}_{ij} / \mathrm{gf}_i$

Empirical studies with LSI report that the log-entropy weighting functions work well, in practice, with many data sets. [14] In other words, each entry $a_{ij}$ of $A$ is computed as:

$$a_{ij} = \left(1 + \sum_j \frac{p_{ij} \log p_{ij}}{\log n}\right) \log(\mathrm{tf}_{ij} + 1).$$
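As a sanity check on the log-entropy definition, here is a minimal numpy sketch (the function name is illustrative) that applies $a_{ij} = g_i \log(\mathrm{tf}_{ij} + 1)$ with the entropy global weight:

```python
import numpy as np

def log_entropy(tf):
    """Log (local) x entropy (global) weighting of a term-document count matrix."""
    tf = np.asarray(tf, dtype=float)
    n = tf.shape[1]                          # number of documents
    gf = tf.sum(axis=1, keepdims=True)       # gf_i: total occurrences of term i
    with np.errstate(divide="ignore", invalid="ignore"):
        p = np.where(gf > 0, tf / gf, 0.0)   # p_ij = tf_ij / gf_i
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    g = 1.0 + plogp.sum(axis=1) / np.log(n)  # entropy global weight g_i
    return g[:, None] * np.log(tf + 1.0)     # a_ij = g_i * log(tf_ij + 1)

# A term spread evenly over all docs gets weight ~0 (last row);
# a term concentrated in few docs keeps most of its weight.
A = [[0, 7, 6, 0],
     [3, 0, 0, 1],
     [2, 2, 2, 2]]
print(log_entropy(A).round(3))
```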