Transcript of "A pTree organization for a corpus of documents"
[Slide 1]
[Figure: A pTree organization for a corpus of documents.
TermArray (= content word / stem?): a, or, on, no, in, it, to
Term frequency array A_tf,doc=1: 0 7 6 0 3 0 0
DocFreqArray (df): 3 2 3 1 2 2 3
Term existence P_te,doc=1: 0 1 1 0 1 0 0
Bit slices of A_tf,doc=1 (bit 2, bit 1, bit 0 of each count):
P_tf,doc=1,2: 0 1 1 0 0 0 0
P_tf,doc=1,1: 0 1 1 0 1 0 0
P_tf,doc=1,0: 0 1 0 0 1 0 0
First-occurrence-position array A_fop,doc=1: 0 9 3 0 6 0 0
Level-1 pTrees are bit maps over term positions 0..9 within a document; P_te,doc=1 is derived with the set predicate sum > 0, and tf (with bit slices tf2, tf1, tf0) is the per-term position count.
Level-2 pTrees aggregate over documents: df is the sum of the te vectors across documents, with bit slices df2, df1, df0.]
Best pTree organization? Level-1 gives te and tf (term level); level-2 gives df (document level); level-3? (corpus level).
This gives the Comprehensive Text, rooted by all digital media, then libraries, then corpora, then documents, then texts, then terms...
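The level-1/level-2 scheme above can be sketched in code. This is a minimal illustration under my own naming (bit_slices, term_existence, doc_frequency are not from the slides), not the authors' implementation:

```python
# Each document's term-frequency vector is stored as vertical bit slices:
# slice k holds bit k of every term's count. Term existence (te) follows
# from the set predicate sum > 0, and document frequency (df) sums te
# vectors across documents.

def bit_slices(tf, width=3):
    """Decompose a term-frequency vector into `width` bit-slice vectors,
    most significant slice first (P_tf,doc,2 .. P_tf,doc,0 for width=3)."""
    return [[(f >> k) & 1 for f in tf] for k in range(width - 1, -1, -1)]

def term_existence(slices):
    """te[t] = 1 iff any bit of term t's count is set, i.e. tf > 0."""
    return [int(any(s[t] for s in slices)) for t in range(len(slices[0]))]

def doc_frequency(te_vectors):
    """df[t] = number of documents in which term t occurs
    (a sum over the per-document te vectors)."""
    return [sum(te[t] for te in te_vectors) for t in range(len(te_vectors[0]))]

# Terms from the slide: a, or, on, no, in, it, to
tf_doc1 = [0, 7, 6, 0, 3, 0, 0]    # A_tf for doc 1
slices1 = bit_slices(tf_doc1)      # three bit-slice pTrees
te1 = term_existence(slices1)      # [0, 1, 1, 0, 1, 0, 0]
```

Note how te can be computed purely from the bit slices, without reconstructing the counts; that is the point of the vertical organization.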
[Slide 2]
book1 - I Heard a Little Baa (22 words): This lively rhyming book gives readers a chance to practice animal sounds, identify animals, and guess what animal is coming up next.
book2 - Mr. Grumpy's Outing (32 words): On a perfect summer day, Mr. Grump decides to take a ride on the river in his small boat. A goat's kick causes a cumulative effect and everyone lands in the water.
book3 - The Very Hungry Caterpillar (28 words): A very hungry caterpillar works his way through holes cut into pages as he eats a variety of foods, spins his cocoon, and emerges as a beautiful butterfly.
book4 - Pat the Bunny (32 words): The classic touch-and-feel book in which a boy and girl experience the world around them by patting a bunny, looking in a mirror, smelling flowers, and touching daddy's scratchy face.
book5 - Goodnight Moon (14 words): A young rabbit says goodnight to everything in his room before bedtime.
book6 - Moo Baa La La La (24 words): The animals in this rhyming board book demonstrate the very different and amusing sounds that they make. The humor is just right for toddlers.
book7 - Clap Hands (18 words): Clap hands, dance and spin, open wide and pop it in: sweet babies go through various rhymed activities.
[Slide 3]
book8 - Spots, Feathers and Curly Tails (31 words): Who does that curly tail belong to? What animal has black and white spots? Just turn the page to reveal the familiar farm animals that match the verbal and visual clues.
book9 - Baby Bathtime (13 words): This lift-the-flap board book helps babies explore the world around them.
book10 - Higher, Higher (29 words): When her dad pushes a young girl in a swing, she goes higher, higher until she reaches fantastical heights and meets an airplane, a rocket, and even an alien.
book11 - Rhymes Round the World (19 words): Rhymes and songs from across the globe will delight young children. Full-page illustrations dramatize the sense of play.
book12 - Shades of People (18 words): Beautiful full-color photographs of children along with a simple text highlight the many shades of skin color.
book13 - The Sleepy Little Alphabet (32 words): Parents try to round up the letters of the alphabet as they skitter, scatter, helter-skelter. Rhyming text and humorous illustrations highlight each letter until they are all finally asleep in bed.
book14 - Ten Tiny Babies (14 words): Babies enjoy a bouncy, noisy day until they are finally fast asleep at night.
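Treating each book blurb above as a short document, a term-document count matrix can be built as a sketch. The helper names and the two-document subset here are my own simplifications, not part of the slides:

```python
from collections import Counter
import re

# Two of the short book descriptions, used as toy documents.
docs = {
    "book5": "a young rabbit says goodnight to everything in his room before bedtime",
    "book9": "this lift the flap board book helps babies explore the world around them",
}

def tokenize(text):
    """Lowercase word tokens; a stand-in for real stemming/stop-wording."""
    return re.findall(r"[a-z']+", text.lower())

counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
terms = sorted(set().union(*counts.values()))

# Rows = terms, columns = documents: cell (i, j) is tf_ij,
# the count of term i in document j.
A = [[counts[d][t] for d in docs] for t in terms]
```

Even on this tiny corpus the matrix is sparse: most terms occur in only one of the two documents.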
[Slide 4]
Latent semantic indexing (LSI) is an indexing and retrieval method that uses singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.
LSI is based on the principle that words used in the same contexts tend to have similar meanings. A key LSI feature is its ability to extract the conceptual content of a body of text by establishing associations between terms that occur in similar contexts. [1] It is called Latent Semantic Indexing because of its ability to correlate semantically related terms that are latent in a collection of text. It uncovers the underlying latent semantic structure in the usage of words in a body of text and uses it to extract the meaning of the text in response to user queries, commonly referred to as concept searches. Queries against a set of docs that have undergone LSI will return results that are conceptually similar in meaning to the search criteria, even if the results don't share specific words with the search criteria.
Benefits of LSI: LSI overcomes two of the most problematic constraints of Boolean keyword queries: multiple words that have similar meanings (synonymy) and words that have more than one meaning (polysemy). Synonymy and polysemy are often the cause of mismatches between the vocabulary used by authors of docs and that of users of info retrieval systems. [3] As a result, Boolean keyword queries often return irrelevant results and miss information that is relevant.
LSI is also used to perform automated document categorization. Doc categorization is the assignment of docs to one or more predefined categories based on their similarity to the conceptual content of the categories. [5] LSI uses example documents to establish the conceptual basis for each category. During categorization, the concepts contained in the docs being categorized are compared to the concepts contained in the example items, and a category (or categories) is assigned to the docs based on the similarities between the concepts they contain and the concepts contained in the example docs.
Dynamic clustering based on the conceptual content of docs also uses LSI. Clustering is a way to group docs based on their conceptual similarity to each other, without using example docs to establish the conceptual basis for each cluster - useful with an unknown collection of unstructured text.
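Both categorization and clustering rest on comparing documents by conceptual similarity. A minimal sketch of similarity-based categorization, using cosine similarity over concept-space vectors (the function and category names are hypothetical, not from any LSI library):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def categorize(doc_vec, category_examples):
    """Assign the category whose example vectors are, on average,
    closest to doc_vec in concept space."""
    return max(
        category_examples,
        key=lambda c: sum(cosine(doc_vec, ex) for ex in category_examples[c])
        / len(category_examples[c]),
    )
```

In real LSI the vectors would come from projecting documents into the reduced SVD space; here any numeric vectors stand in for them.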
Because it uses a strictly mathematical approach, LSI is inherently independent of language. This enables LSI to elicit the semantic content of information written in any language without requiring the use of auxiliary structures, such as dictionaries and thesauri.
LSI is not restricted to working only with words. It can also process arbitrary character strings. Any object that can be expressed as text can be represented in an LSI vector space. [6] For example, tests with MEDLINE abstracts have shown that LSI is able to effectively classify genes based on conceptual modeling of the biological information contained in the titles and abstracts of the MEDLINE citations. [7]
LSI automatically adapts to new and changing terminology, and has been shown to be very tolerant of noise (i.e., misspelled words, typographical errors, unreadable characters, etc.).[8] This is especially important for applications using text derived from Optical Character Recognition (OCR) and speech-to-text conversion. LSI also deals effectively with sparse, ambiguous, and contradictory data.
Text does not need to be in sentences. It can work with lists, free-form notes, email, Web-based content, etc. As long as a collection of text contains multiple terms, LSI can be used to identify patterns in relationships between important terms and concepts contained in the text.
LSI has proven to be a useful solution to a number of conceptual matching problems. [9][10] It has been used to capture key relationship information, including causal, goal-oriented, and taxonomic information. [11]
LSI Timeline:
Mid-1960s - Factor analysis technique first described and tested (H. Borko and M. Bernick)
1988 - Seminal paper on LSI technique published (Deerwester et al.)
1989 - Original patent granted (Deerwester et al.)
1992 - First use of LSI to assign articles to reviewers [12] (Dumais and Nielsen)
1994 - Patent granted for the cross-lingual application of LSI (Landauer et al.)
1995 - First use of LSI for grading essays (Foltz et al., Landauer et al.)
1999 - First implementation of LSI technology for the intelligence community for analyzing unstructured text (SAIC)
2002 - LSI-based product offering to intelligence-based government agencies (SAIC)
2005 - First vertical-specific application - publishing - EDB (EBSCO, Content Analyst Company)
[Slide 5]
Mathematics of LSI (linear algebra techniques to learn the conceptual correlations in a collection of text): construct a weighted term-document matrix, perform singular value decomposition on it, and use the result to identify the concepts contained in the text.
Term-Document Matrix A: each of the m terms is represented by a row, and each of the n docs is represented by a column, with each matrix cell a_ij initially representing the number of times the associated term appears in the indicated document, tf_ij. This matrix is usually large and very sparse.
Once a term-document matrix is constructed, local and global weighting functions can be applied to it to condition the data. The weighting functions transform each cell a_ij of A into the product of a local term weight l_ij, which describes the relative frequency of a term in a document, and a global weight g_i, which describes the relative frequency of the term within the entire collection of documents.
Some common local weighting functions [13]:
Binary: l_ij = 1 if term i exists in document j, else 0
TermFrequency: l_ij = tf_ij, the number of occurrences of term i in document j
Log: l_ij = log(tf_ij + 1)
Augnorm: l_ij = ( tf_ij / max_i(tf_ij) + 1 ) / 2
Some common global weighting functions:
Binary: g_i = 1
Normal: g_i = 1 / sqrt( sum_j tf_ij^2 )
GfIdf: g_i = gf_i / df_i, where gf_i is the total number of times term i occurs in the whole collection, and df_i is the number of documents in which term i occurs
Idf: g_i = log2( n / (1 + df_i) )
Entropy: g_i = 1 + sum_j ( p_ij log p_ij ) / log n, where p_ij = tf_ij / gf_i
Empirical studies with LSI report that the Log and Entropy weighting functions work well, in practice, with many data sets. [14] In other words, each entry a_ij of A is computed as:
a_ij = g_i * l_ij = ( 1 + sum_j ( p_ij log p_ij ) / log n ) * log( tf_ij + 1 )
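The log-entropy weighting and the SVD step above can be sketched as follows. Function names are mine, and this is a minimal illustration under NumPy, not a full LSI implementation:

```python
import numpy as np

def log_entropy_weight(tf):
    """Apply a_ij = g_i * log(tf_ij + 1) with the entropy global weight
    g_i = 1 + sum_j (p_ij log p_ij) / log n, p_ij = tf_ij / gf_i.
    Rows of tf are terms, columns are documents."""
    tf = np.asarray(tf, dtype=float)
    n = tf.shape[1]                          # number of documents
    gf = tf.sum(axis=1, keepdims=True)       # global frequency of each term
    p = np.divide(tf, gf, out=np.zeros_like(tf), where=gf > 0)
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    g = 1.0 + plogp.sum(axis=1) / np.log(n)  # entropy global weight g_i
    return g[:, None] * np.log(tf + 1.0)     # times local Log weight l_ij

def truncated_svd(A, k):
    """Rank-k factors U_k, s_k, V_k^T of A, as used by LSI."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

A = log_entropy_weight([[0, 7, 6], [0, 3, 0], [2, 0, 1]])
U, s, Vt = truncated_svd(A, k=2)
```

Note the entropy weight downweights terms spread evenly across the collection: a term with identical counts in every document gets g_i = 0, while a term confined to one document gets g_i = 1.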