Course on Data Mining (581550-4)
Transcript of Course on Data Mining (581550-4)
Mika Klemettinen and Pirjo Moen, University of Helsinki/Dept of CS, Autumn 2001
Page 1/70
Course on Data Mining (581550-4)
• Intro/Ass. Rules: 24./26.10.
• Episodes: 30.10.
• Text Mining: 7.11.
• Clustering: 14.11.
• KDD Process: 21.11.
• Appl./Summary: 28.11.
• Home Exam
Today 07.11.2001
• Today's subject:
  – Text Mining, focus on maximal frequent phrases or maximal frequent sequences (MaxFreq)
• Next week's program:
  – Lecture: Clustering, Classification, Similarity
  – Exercise: Text Mining
  – Seminar: Text Mining
Text Mining
• Background
• What is Text Mining?
• MaxFreq Sequences
• MaxFreq Algorithms
• MaxFreq Experiments
Text Databases and Information Retrieval
• Text databases (document databases)
  – Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, etc.
• Information retrieval (IR)
  – Information is organized into (a large number of) documents
  – Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents
Basic Measures for Text Retrieval
• Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses)

    precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

• Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved

    recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

(Figure: a Venn diagram over all documents, with the overlapping sets "Relevant" and "Retrieved"; the intersection is "Relevant & Retrieved".)
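As a quick sketch, both measures can be computed directly from sets of document IDs (the IDs below are invented for illustration):

```python
# Precision and recall following the set definitions above.
def precision(relevant, retrieved):
    """Fraction of retrieved documents that are relevant."""
    return len(relevant & retrieved) / len(retrieved)

def recall(relevant, retrieved):
    """Fraction of relevant documents that were retrieved."""
    return len(relevant & retrieved) / len(relevant)

relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d3", "d4", "d5"}
print(precision(relevant, retrieved))  # 2 of 3 retrieved are relevant -> 0.666...
print(recall(relevant, retrieved))     # 2 of 4 relevant are retrieved -> 0.5
```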
Keyword/Similarity-Based Retrieval
• A document is represented by a string, which can be identified by a set of keywords
• Find similar documents based on a set of common keywords
• The answer should be based on the degree of relevance, considering the nearness of the keywords, the relative frequency of the keywords, etc.
• In the following, some basic techniques related to preprocessing and retrieval are briefly mentioned
Keyword/Similarity-Based Retrieval
• Basic techniques (1): Remove irrelevant words with a stop list
  – A set of words that are deemed "irrelevant", even though they may appear frequently
  – E.g., a, the, of, for, with, etc.
  – Stop lists may vary when the document set varies
• Basic techniques (2): Reduce words to their basic forms with word stemming
  – Several words are small syntactic variants of each other, since they share a common word stem (basic form)
  – E.g., drug, drugs, drugged
Keyword/Similarity-Based Retrieval
• Basic techniques (3): Record occurrences of terms in a term frequency table
  – Each entry frequent_table(i, j) = number of occurrences of the word t_i in document d_j (or just "0" or "1")
• Basic techniques (4): Similarity metrics: measure the closeness of a document to a query (a set of keywords)
  – Cosine distance:

      sim(v1, v2) = (v1 · v2) / (|v1| |v2|)

  – Relative term occurrences
• This is all nice to know, but where is the text mining, and how does it relate to this?
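A minimal sketch of the term-frequency table and the cosine measure above, using two invented toy documents (the function names are mine, not from the slides):

```python
import math

def term_vector(words, vocabulary):
    """Term-frequency vector of a word list over a fixed vocabulary."""
    return [words.count(t) for t in vocabulary]

def cosine_sim(v1, v2):
    """sim(v1, v2) = (v1 . v2) / (|v1| |v2|)"""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm

vocab = ["data", "mining", "text", "retrieval"]
d1 = term_vector(["data", "mining", "data"], vocab)  # [2, 1, 0, 0]
d2 = term_vector(["data", "mining", "text"], vocab)  # [1, 1, 1, 0]
print(cosine_sim(d1, d2))  # 3 / sqrt(15) ~ 0.7746
```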
Text Mining
• Background
• What is Text Mining?
• MaxFreq Sequences
• MaxFreq Algorithms
• MaxFreq Experiments
What is Text Mining?
• Data mining in text: find something useful and surprising from a text collection
• Text mining vs. information retrieval is like data mining vs. database queries
Different Views on Text
• For example, we might have the following text:
  Documents are an interesting application field for data mining techniques.
• Remember the market basket data?
  – The text can then be considered as a shopping transaction, i.e., a row in the database
  – The words occurring in the text can be considered as the items bought

    Transaction ID   Items bought
    100              A, B, C
    200              A, C

    Document ID      Words occurring
    100              an, application, ...
    200              ...
Different Views on Text
• Recall the event sequence from episode rules:
  (Figure: a time axis from 0 to 90 with the event sequence D, C, A, B, D, A, B, C.)
• Now we can consider the text as a sequence of words!
  (Documents, 1) (are, 2) (an, 3) (interesting, 4) (application, 5) (field, 6) (for, 7) (data, 8) (mining, 9) (techniques, 10)
Text Preprocessing
• So, suppose that we have the following example text:
  Documents are an interesting application field for data mining techniques.
• To this text, we might apply the following preprocessing operations:
  1. Find the basic forms of the words (stemming)
  2. Use a stop list to remove uninteresting words
  3. Select, e.g., only the wanted word classes (e.g., nouns)
Text Preprocessing
Original text as (word, position) pairs:
  (Documents, 1) (are, 2) (an, 3) (interesting, 4) (application, 5) (field, 6) (for, 7) (data, 8) (mining, 9) (techniques, 10) (., 11)
After step 1 (basic forms, with morphological analysis):
  (document_N_PL, 1) (be_V_PRES_PL, 2) (an_DET, 3) (interesting_A_POS, 4) (application_N_SG, 5) (field_N_SG, 6) (for_PP, 7) (data_N_SG, 8) (mining_N_SG, 9) (technique_N_PL, 10) (STOP, 11)
Morphological information: N = noun, PL = plural, V = verb, PRES = present form, DET = determiner, A = adjective, POS = positive, SG = singular, PP = preposition
Text Preprocessing
After step 2 (stop words removed):
  (document_N_PL, 1) (interesting_A_POS, 4) (application_N_SG, 5) (field_N_SG, 6) (data_N_SG, 8) (mining_N_SG, 9) (technique_N_PL, 10)
Text Preprocessing
After step 3 (only nouns selected):
  (document_N_PL, 1) (application_N_SG, 5) (field_N_SG, 6) (data_N_SG, 8) (mining_N_SG, 9) (technique_N_PL, 10)
Text Preprocessing
• Now we have a preprocessed sequence of words:
  (document, 1) (application, 5) (field, 6) (data, 8) (mining, 9) (technique, 10)
• We might also just throw away the stop words etc., and put the words in consecutive "time slots" (1, 2, 3, …)
• Preprocessing can be applied to transaction-based text data in a similar fashion
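The preprocessing steps above can be sketched as a small pipeline. The tiny lexicon and stop classes below are invented for the example sentence; a real system would use a full morphological analyzer:

```python
# Toy preprocessing: stemming via a hand-made lexicon, stop-word
# removal by word class, and optional word-class selection.
LEXICON = {  # surface form -> (basic form, word class); invented example data
    "documents": ("document", "N"), "are": ("be", "V"),
    "an": ("an", "DET"), "interesting": ("interesting", "A"),
    "application": ("application", "N"), "field": ("field", "N"),
    "for": ("for", "PP"), "data": ("data", "N"),
    "mining": ("mining", "N"), "techniques": ("technique", "N"),
}
STOP_CLASSES = {"V", "DET", "PP"}  # classes dropped by the stop list

def preprocess(text, keep_classes=None):
    """Return (basic form, position) pairs after stemming and filtering."""
    out = []
    for pos, token in enumerate(text.rstrip(".").lower().split(), start=1):
        stem, cls = LEXICON[token]
        if cls in STOP_CLASSES:
            continue                      # step 2: stop list
        if keep_classes and cls not in keep_classes:
            continue                      # step 3: word-class selection
        out.append((stem, pos))           # step 1 already applied via LEXICON
    return out

text = "Documents are an interesting application field for data mining techniques."
print(preprocess(text, keep_classes={"N"}))
# [('document', 1), ('application', 5), ('field', 6),
#  ('data', 8), ('mining', 9), ('technique', 10)]
```

Note how the original positions are kept, so gaps left by removed words remain visible to the sequence miner.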
Types of Text Mining
• Keyword (or term) based association analysis
• Automatic document classification
• Similarity detection
  – Cluster documents by a common author
  – Cluster documents containing information from a common source
• Sequence analysis: predicting a recurring event, discovering trends
• Anomaly detection: find information that violates usual patterns
Term-Based Assoc. Analysis
• Collect sets of keywords or terms that occur frequently together, and then find the association relationships among them
• First preprocess the text data by parsing, stemming, removing stop words, etc.
• Then invoke association mining algorithms
  – Consider each document as a transaction
  – View the set of keywords/terms in the document as the set of items in the transaction
Term-Based Assoc. Analysis
• For example, we might find frequent sets such as:
  2%: application, field
  5%: data, mining
• … and association rules like:
  application => field (2%, 52%)
  data => mining (5%, 75%)
• These kinds of frequent sets etc. might help in expanding user queries, or in describing the documents better than simple keywords do
• Sometimes it would be nice to discover new descriptive phrases directly from the actual text - what then?
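As a sketch of the document-as-transaction idea, frequent term pairs and rules of the form A => B (with support and confidence) can be mined like this; the four "documents" are invented keyword sets, and the function names are mine:

```python
from itertools import combinations
from collections import Counter

docs = [
    {"data", "mining", "application"},
    {"data", "mining", "knowledge"},
    {"data", "mining"},
    {"application", "field"},
]

def frequent_pairs(docs, min_support):
    """Term pairs whose support (fraction of documents) meets the threshold."""
    counts = Counter(pair for d in docs for pair in combinations(sorted(d), 2))
    n = len(docs)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}

def rules(docs, pairs):
    """For each frequent pair, both rules lhs => rhs as (support, confidence)."""
    n = len(docs)
    out = {}
    for (a, b), support in pairs.items():
        for lhs, rhs in ((a, b), (b, a)):
            lhs_support = sum(1 for d in docs if lhs in d) / n
            out[(lhs, rhs)] = (support, support / lhs_support)
    return out

pairs = frequent_pairs(docs, min_support=0.5)
print(pairs)                     # {('data', 'mining'): 0.75}
print(rules(docs, pairs)[("data", "mining")])  # (0.75, 1.0)
```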
Term-Based Episode Analysis
• Now, we want to find words/terms that occur frequently close to each other in the actual text
• Take the preprocessed sequential text data and find relationships among the words/terms by invoking episode mining algorithms (WINEPI or MINEPI)
• For example, we might find frequent episodes such as:
  data, mining, knowledge, discovery
• … and MINEPI style episode rules like:
  data, mining => knowledge, discovery [4] [8] (2%, 81%)
Problems
• Quite often, it could be interesting to try to find very long descriptive phrases to describe the documents…
• … but the discovery of long descriptive phrases might be tedious, especially if and when you have to create all the shorter phrases in order to get the longest ones
• One answer can be maximal frequent sequences or maximal frequent phrases (note: by the concepts "sequence" and "phrase" we mean basically the same thing)
Text Mining
• Background
• What is Text Mining?
• MaxFreq Sequences
• MaxFreq Algorithms
• MaxFreq Experiments
Frequent Word Sequences
• Assume: S is a set of documents; each document consists of a sequence of words
• A phrase is a sequence of words
• A sequence p occurs in a document d if all the words of p occur in d, in the same order as in p
• A sequence p is frequent in S if p occurs in at least σ documents of S, where σ is a given frequency threshold
• A maximal gap n can be given: the original locations of any two consecutive words of a sequence can have at most n words between them
Frequent Word Sequences
1: (The,70) (Congress,71) (subcommittee,72) (backed,73) (away,74) (from,75) (mandating,76) (specific,77) (retaliation,78) (against,79) (foreign,80) (countries,81) (for,82) (unfair,83) (foreign,84) (trade,85) (practices,86)
2: (He,105) (urged,106) (Congress,107) (to,108) (reject,109) (provisions,110) (that,111) (would,112) (mandate,113) (U.S.,114) (retaliation,115) (against,116) (foreign,117) (unfair,118) (trade,119) (practices,120)
3: (Washington,407) (charged,408) (France,409) (West,410) (Germany,411) (the,412) (U.K.,413) (Spain,414) (and,415) (the,416) (EC,417) (Commission,418) (with,419) (unfair,420) (practices,421) (on,422) (behalf,423) (of,424) (Airbus,425)
Frequent Word Sequences
Examples from the previous slide:
• The phrase (retaliation, against, foreign, unfair, trade, practices) occurs in the first two documents, in the locations (78, 79, 80, 83, 85, 86) and (115, 116, 117, 118, 119, 120).
• The phrase (unfair, practices) occurs in all the documents, namely in the locations (83, 86), (118, 120), and (420, 421).
Note that we only count one occurrence of a sequence per document!
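The occurrence test defined above can be sketched as follows, using the tail of document 1; note that this version matches greedily and does not backtrack, which suffices for this example (a full implementation would need to consider alternative matches):

```python
# A phrase p occurs in a document d (a list of (word, location) pairs)
# if its words appear in order; an optional maximal gap bounds the
# number of words between consecutive phrase words.
def occurs(phrase, doc, max_gap=None):
    positions = []
    i = 0
    for word, loc in doc:
        if i < len(phrase) and word == phrase[i]:
            if positions and max_gap is not None and loc - positions[-1] - 1 > max_gap:
                continue  # too far from the previously matched word
            positions.append(loc)
            i += 1
    return positions if i == len(phrase) else None

doc1 = [("retaliation", 78), ("against", 79), ("foreign", 80),
        ("countries", 81), ("for", 82), ("unfair", 83),
        ("foreign", 84), ("trade", 85), ("practices", 86)]
phrase = ("retaliation", "against", "foreign", "unfair", "trade", "practices")
print(occurs(phrase, doc1, max_gap=2))   # [78, 79, 80, 83, 85, 86]
print(occurs(("unfair", "practices"), doc1))  # [83, 86]
```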
Maximal Frequent Sequences
• Maximal frequent sequence:
  – A sequence p is a maximal frequent (sub)sequence in S if there does not exist any other sequence p' in S such that p is a subsequence of p' and p' is frequent in S
• In short, a maximal frequent sequence is a sequence of words that
  – appears frequently in the document collection
  – is not included in any other, longer frequent sequence
Maximal Frequent Sequences
• Usually, it makes sense to concentrate on the maximal frequent sequences or maximal frequent phrases
  – Subsequences or subphrases usually do not have a meaning of their own
  – However, sometimes subsequences or subphrases may also be interesting, if they are much more frequent
A Maximal Seq. with Subseq.s
• Example (maximal sequence + subsequences):
  dow jones industrial average
  dow jones
  dow industrial
  dow average
  jones industrial
  jones average
  industrial average
  dow jones industrial
  dow jones average
  jones industrial average
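The subsequence list above can be generated mechanically as every proper subsequence of length at least 2 (a sketch; the function name is mine):

```python
from itertools import combinations

def proper_subsequences(seq):
    """All subsequences of length 2 .. len(seq)-1, in document order."""
    return [
        sub
        for length in range(2, len(seq))
        for sub in combinations(seq, length)
    ]

maximal = ("dow", "jones", "industrial", "average")
for sub in proper_subsequences(maximal):
    print(" ".join(sub))
```

This enumerates all 10 proper subsequences of a 4-word maximal sequence: 6 pairs and 4 triples (the 9 listed above plus "dow industrial average").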
Examples of Meaningful Subseqs
• Interesting subsequences can be distinguished by the characteristic that they are more frequent than the maximal sequences
  – A subsequence has its OWN occurrences in the text
  – A subsequence might be joint to MANY maximal sequences
  – A TOO FREQUENT subsequence might NOT be interesting
Examples of Meaningful Subseqs
• Maximal sequences:
  prime minister Lionel Jospin
  prime minister Paavo Lipponen
• Subsequences:
  prime minister
  Lionel Jospin
  Paavo Lipponen
Text Mining
• Background
• What is Text Mining?
• MaxFreq Sequences
• MaxFreq Algorithms
• MaxFreq Experiments
Discovery of Frequent Sequences
• The frequency of a sequence cannot be decided locally: all the instances in the collection have to be counted
• However: already a document of length 20 (words) contains over one million sequences
• Only a small fraction of the sequences are frequent
  – There are many sequences that have only very few occurrences
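The "over one million" count follows from simple combinatorics: every non-empty subset of the 20 word positions, kept in document order, gives a candidate subsequence (assuming all words are distinct):

```python
from math import comb

# Number of non-empty candidate subsequences of a 20-word document:
# one per subset of positions, i.e. 2**20 - 1.
n = 20
candidates = sum(comb(n, k) for k in range(1, n + 1))
print(candidates)  # 1048575, i.e. over one million
```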
Naïve Discovery Approach
• Basic idea: the "standard" bottom-up approach
  – Collect all the pairs from the documents, count them, and select the frequent ones
  – Build sequences of length p+1 from the frequent sequences of length p
  – Select the sequences that are frequent
  – Iterate
• Finally: select the maximal sequences (by checking, for each phrase, whether it is contained in some other phrase)
Problems in the Naïve Approach
• Problem: frequent sequences in text can be long
  – In our experiments: the longest phrase had 22 words (Reuters-21578 newswire data, 19000 documents, frequency threshold 15, max gap 2)
  – Processing all the subphrases of all lengths is not possible
  – The straightforward bottom-up approach does not work
  – Restricting the length would produce a large number of slightly differing subphrases of any phrase that is longer than the threshold
Combining Bottom-Up and Greedy Approaches: MaxFreq
Initial phase
• First, frequent pairs are collected
Discovery phase
• Longer sequences are constructed from shorter sequences (k-grams) as in the bottom-up approach
Expansion step
• Maximal sequences are discovered directly, starting from a k-gram that is not a subsequence of any known maximal sequence
Combining Bottom-Up and Greedy Approaches: MaxFreq
Pruning step
• Each maximal sequence has at least one unique subsequence that distinguishes it from the other maximal sequences. A maximal sequence is discovered, at the latest, on level k, where k is the length of its shortest unique subsequence.
• Grams that cannot be used to construct any new maximal sequences are pruned away after each level, before the length of the grams is increased
• Let's take a closer look at these phases and steps!
Algorithm: Initial Phase
Input: a set of documents S, a frequency threshold, and a maximal gap
Output: a gram set Grams_2 containing the frequent pairs

For all documents d ∈ S
    collect all the ordered pairs of words (A, B) within d such that
    A and B occur in this order (wrt the maximal gap)
Grams_2 := all the ordered pairs that are frequent in the set S
    (wrt the frequency threshold)
Return Grams_2
Algorithm: Initial Phase
Document 1: (A,11) (B,12) (C,13) (D,14) (E,15)
Document 2: (P,21) (B,22) (C,23) (D,24) (K,25)
Document 3: (A,31) (B,32) (C,33) (H,34) (D,35) (K,36)
Document 4: (P,41) (B,42) (C,43) (D,44) (E,45) (N,46)
Document 5: (P,51) (B,52) (C,53) (K,54) (E,55) (L,56) (M,57)
Document 6: (R,61) (H,62) (K,63) (L,64) (M,65)

• The following pairs of words, with their document counts, can be found (with max gap = 2). E.g., AB occurs in doc 1 ([11-12]) and in doc 3 ([31-32]), while AE does not qualify in doc 1 ([11-15] exceeds the max gap).

  AB 2  BE 3  CK 3  EL 1  HM 1  PC 3
  AC 2  BH 1  CL 1  EM 1  KE 1  PD 2
  AD 1  BK 2  CN 1  EN 1  KL 2  PK 1
  AH 1  CD 4  DE 2  HD 1  KM 2  RH 1
  BC 5  CE 3  DK 2  HK 2  LM 2  RK 1
  BD 4  CH 1  DN 1  HL 1  PB 3  RL 1
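The initial phase on the example documents above can be sketched as follows (frequency threshold 2, max gap 2; the function name is mine, not from the slides):

```python
from collections import Counter

docs = [
    [("A",11),("B",12),("C",13),("D",14),("E",15)],
    [("P",21),("B",22),("C",23),("D",24),("K",25)],
    [("A",31),("B",32),("C",33),("H",34),("D",35),("K",36)],
    [("P",41),("B",42),("C",43),("D",44),("E",45),("N",46)],
    [("P",51),("B",52),("C",53),("K",54),("E",55),("L",56),("M",57)],
    [("R",61),("H",62),("K",63),("L",64),("M",65)],
]

def initial_phase(docs, sigma, max_gap):
    """Collect ordered word pairs within the maximal gap; keep pairs
    occurring in at least sigma documents (one occurrence per doc)."""
    counts = Counter()
    for doc in docs:
        pairs = set()
        for i, (a, pa) in enumerate(doc):
            for b, pb in doc[i + 1:]:
                if pb - pa - 1 <= max_gap:
                    pairs.add((a, b))
        counts.update(pairs)
    return {p for p, c in counts.items() if c >= sigma}

grams2 = initial_phase(docs, sigma=2, max_gap=2)
print(sorted("".join(p) for p in grams2))
# ['AB', 'AC', 'BC', 'BD', 'BE', 'BK', 'CD', 'CE', 'CK', 'DE',
#  'DK', 'HK', 'KL', 'KM', 'LM', 'PB', 'PC', 'PD']
```

This reproduces exactly the 18 frequent pairs used in the expansion example on the later slide.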
Algorithm: Discovery Phase
Input: a gram set Grams_2 containing the frequent pairs (A, B)
Output: the set Max of maximal frequent phrases

k := 2; Max := ∅
While Grams_k is not empty
    For all grams g ∈ Grams_k
        If the gram g is not a subphrase of some m ∈ Max
            If the gram g is frequent
                max := Expand(g)
                Max := Max ∪ {max}
                If max = g
                    Remove {g} from Grams_k
            Else
                Remove {g} from Grams_k
    Prune(Grams_k)
    Join the grams of Grams_k to form Grams_(k+1)
    k := k + 1
Return Max
Algorithm: Expansion Step
Input: a phrase p
Output: a maximal frequent phrase p' such that p is a subphrase of p'

Repeat
    Let l be the length of the sequence p.
    Find a sequence p' such that the length of p' is l+1,
        and p is a subsequence of p'.
    If p' is frequent
        p := p'
Until there exists no frequent p'
Return p

Note! All the possibilities to expand have to be checked: tail, front, middle!
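A greedy sketch of the expansion step, run on the running example (frequency threshold 2, maximal gap 2). Documents are written as strings of one-letter "words" in consecutive positions, and the function names are mine; note that `occurs_in` matches greedily without backtracking, which is enough for this example:

```python
DOCS = ["ABCDE", "PBCDK", "ABCHDK", "PBCDEN", "PBCKELM", "RHKLM"]
WORDS = sorted(set("".join(DOCS)))

def occurs_in(phrase, doc, max_gap=2):
    """True if phrase occurs in doc in order, within the maximal gap."""
    i, last = 0, None
    for pos, w in enumerate(doc):
        if i < len(phrase) and w == phrase[i]:
            if last is not None and pos - last - 1 > max_gap:
                continue
            last, i = pos, i + 1
    return i == len(phrase)

def frequent(phrase, sigma=2):
    return sum(occurs_in(phrase, d) for d in DOCS) >= sigma

def expand(p):
    """Grow p by inserting one word at a time (front, middle, tail)
    while the result stays frequent; return the expanded phrase."""
    grew = True
    while grew:
        grew = False
        for i in range(len(p) + 1):
            for w in WORDS:
                cand = p[:i] + w + p[i:]
                if frequent(cand):
                    p, grew = cand, True
                    break
            if grew:
                break
    return p

print(expand("AB"))  # ABCD, i.e. AB => ABC => ABCD as on the next slide
```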
Algorithm: Expansion Step
1: (A,11) (B,12) (C,13) (D,14) (E,15)
2: (P,21) (B,22) (C,23) (D,24) (K,25)
3: (A,31) (B,32) (C,33) (H,34) (D,35) (K,36)
4: (P,41) (B,42) (C,43) (D,44) (E,45) (N,46)
5: (P,51) (B,52) (C,53) (K,54) (E,55) (L,56) (M,57)
6: (R,61) (H,62) (K,63) (L,64) (M,65)

Freq: AB BD CD DE KL PB
      AC BE CE DK KM PC
      BC BK CK HK LM PD

Exp: AB => ABC => ABCD (not ABCDE or ABCDK, which are not frequent)
     BE => BCE => BCDE
Example
• Maximal frequent sequences after the first expansion step:
  AB => ABC => ABCD
  BE => BCE => BCDE
  BK => BDK => BCDK
  KL => KLM
  PD => PBD => PBCD
  HK
• 3-grams after join:
ABC ACK CDE PCD BKM
ABD BCD CDK PCE CKL
ABE BCE PBC PCK CKM
ABK BCK PBD PDE DKL
ACD BDE PBE PDK DKM
ACE BDK PBK BKL KLM
(italics + underlined = already found maximal phrase)
• New maximal frequent sequences:
PBE => PBCE
PBK => PBCK
Example
• 3-grams after the second expansion step:
ABC BCE CDE PBE PCK
ABD BCK CDK PBK
ACD BDE PBC PCD
BCD BDK PBD PCE
• 4-grams after join:
ABCD ABDK BCDK PBDE
ABCE ACDE PBCD PBDK
ABCK ACDK PBCE PCDE
ABDE BCDE PBCK PCDK
Example
• After the expansion step, every gram is a subsequence of some maximal sequence
• Any maximal sequence m not yet found has to contain grams from two or more other maximal sequences, or grams from one sequence m' in a different order than they occur in m'
• For each gram g: check if g can join grams of maximal sequences in a new way
=> extract sequences that are frequent and not yet included in any maximal sequence; mark the grams
• Remove grams that are not marked
Algorithm: Pruning Step
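The check described above can be sketched as follows for a single gram: collect the prefixes and suffixes that surround the gram inside the known maximal sequences, form every new prefix+gram+suffix combination, and keep the strings not already covered by a maximal sequence. The sketch assumes the gram occurs contiguously in each maximal sequence; frequency testing of the surviving candidates is a separate step.

```python
def is_subseq(small, big):
    """True if `small` occurs in `big` as a (possibly gapped) subsequence."""
    it = iter(big)
    return all(ch in it for ch in small)

def join_candidates(gram, maximal):
    """Candidate strings the pruning step must check for `gram`:
    all prefix/suffix combinations drawn from the maximal sequences
    containing the gram, minus strings already covered."""
    prefixes, suffixes = set(), set()
    for m in maximal:
        i = m.find(gram)                  # contiguous occurrence (simplification)
        if i >= 0:
            prefixes.add(m[:i])
            suffixes.add(m[i + len(gram):])
    cands = {p + gram + s for p in prefixes for s in suffixes}
    return {c for c in cands if not any(is_subseq(c, m) for m in maximal)}
```

For the gram BC and the maximal sequences found after the first expansion step (ABCD, BCDE, BCDK, KLM, PBCD, HK), this yields exactly the strings checked on the next slide: ABCDE, ABCDK, PBCDE and PBCDK.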
• BC: ABCD, BCDE, BCDK, PBCD
• Prefixes: A, P
• Suffixes: D, DE, DK
• Check the strings ABCDE, ABCDK, PBCDE, PBCDK: is there a subsequence that is frequent and not included in any maximal sequence?
ABCDE - ABC - ABCD (maximal)
            - ABCE (not frequent)
      - BCD - BCDE (maximal)
            - ABCD (known)
      - BCE - ABCE (known)
Pruning After the 1st Exp. Step
PBCDE - PBC - PBCD (maximal)
            - PBCE (frequent, not in maximal)
      - BCD - BCDE (maximal)
            - PBCD (known)
      - BCE - PBCE (known)
PBCDK - PBC - PBCD (maximal)
            - PBCK (frequent, not in maximal)
...
Marked: PB, BC, CE, CK
All the other grams are removed.
Pruning After the 1st Exp. Step
Data structures:
• Table: for each pair its exact occurrences in text
• Table: for each prefix the grams that have this prefix
• Table: for each suffix the grams that have this suffix
• Table: for each pair the indexes of maximal sequences within which it is a subsequence
• An array of maximal sequences
• Document identifiers are attached to the grams and occurrences
Algorithm: Implementation
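The first of these tables can be sketched as a dictionary mapping each word pair to its exact occurrence list, built in one pass over position-annotated sequences. The max-gap handling (at most two intervening words by default) is an assumption chosen to match the earlier example; frequency filtering would happen afterwards.

```python
from collections import defaultdict

def pair_occurrences(seqs, max_gap=2):
    """Table: for each pair of words, its exact occurrences in the text.
    `seqs` are lists of (word, position) pairs with increasing positions;
    a pair occurrence may have at most `max_gap` words in between."""
    occ = defaultdict(list)
    for seq in seqs:
        for i, (w1, p1) in enumerate(seq):
            for w2, p2 in seq[i + 1:]:
                if p2 - p1 > max_gap + 1:
                    break                # positions increase: stop early
                occ[(w1, w2)].append((p1, p2))
    return occ

# The six example sequences used earlier:
SEQS = [
    [("A", 11), ("B", 12), ("C", 13), ("D", 14), ("E", 15)],
    [("P", 21), ("B", 22), ("C", 23), ("D", 24), ("K", 25)],
    [("A", 31), ("B", 32), ("C", 33), ("H", 34), ("D", 35), ("K", 36)],
    [("P", 41), ("B", 42), ("C", 43), ("D", 44), ("E", 45), ("N", 46)],
    [("P", 51), ("B", 52), ("C", 53), ("K", 54), ("E", 55), ("L", 56), ("M", 57)],
    [("R", 61), ("H", 62), ("K", 63), ("L", 64), ("M", 65)],
]
```

On this data the table reproduces entries such as AB: [11-12][31-32] and the five occurrences of BC.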
• The occurrences of frequent pairs are stored:
AB: [11-12][31-32]
AC: [11-13][31-33]
BC: [12-13][22-23][32-33][42-43][52-53]
• The occurrences of longer sequences are computed from the occurrences of pairs
• All the occurrences computed are stored
– The computation for ABC may help to compute later the frequency for ABCD
Testing Frequency
– ABCD can only occur in places where ABC has occurred
• NOTE:
– Already calculated occurrences can be used while adding elements to the front or to the tail
– ABCD may occur in more documents than ABD, since the distance of B and D might be greater than the maximal gap
Testing Frequency
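A sketch of how the stored occurrences are reused: an occurrence of ABCD is an occurrence of ABC whose end position can be extended by a D within the gap limit, so the list computed for ABC saves rescanning the text for ABCD. The (start, end) representation and the function name are illustrative assumptions.

```python
def extend(occs, word_positions, max_gap=2):
    """Occurrences of the sequence s+w, given the stored occurrences
    (start, end) of s and the sorted positions of the word w: w must
    follow the end of an occurrence of s within the gap limit."""
    out = []
    for start, end in occs:
        for p in word_positions:
            if end < p <= end + max_gap + 1:
                out.append((start, p))
                break                    # earliest extension suffices
    return out
```

Extending the occurrences [11-13] and [31-33] of ABC with the D positions 14, 24, 35 and 44 gives the ABCD occurrences [11-14] and [31-35].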
Background
MaxFreq Algorithms
What is Text Mining?
MaxFreq Sequences
MaxFreq Experiments
Text Mining
• Data: Reuters-21578 newswire collection (year 1987)
• Around 19,000 documents (average length 135 words)
• Originally 2.5 million words; after stopword pruning (400 stopwords) 1.3 million words
– Stopwords: single letters, pronouns, prepositions, some abbreviations (e.g., pct, dlr, cts, shr), etc.
• 50,000 distinct words (stemming was not used)
• Frequency threshold 15, max gap 2 (stopwords pruned)
• Prototype implementation in Perl
• Sun Enterprise 450, with 1 GB of main memory
Experiments
• Amounts of maximal frequent sequences of different lengths:
Len 2 3 4 5 6 7 8 9 10 11 12
f:15 7,664 1,320 353 146 65 17 8 4 13 12 13
Len 13 14 15 16 17 18 19 20 21 22 23
f:15 5 - 1 1 - 1 - - - 2 -
Experiments
• Solid, established phrases:
bundesbank president karl otto poehl
european monetary system ems
• Verb phrases:
bank england provided money market assistance
board declared stock split payable april
boost domestic demand
• Short phrases:
expects higher
expects complete
Examples of MaxFreq Sequences
• The following phrases are extracted from one document belonging to the Reuters data set
• The phrases contain both maximal phrases and subphrases that are more frequent than the maximal ones
• The document describes a situation where the persons monitoring the nuclear power plant operation were caught asleep during their shift, and the Nuclear Regulatory Commission ordered the power plant to be closed
• As you can see, the phrases do not actually reveal what happened; they only indicate the subject matter
Phrases Extracted from "Doc A"
power station 11
immediately after 26
co operations 11
effective april 63
company's operations 20
unit nuclear 12
unit power 16
early week 42
senior management 28
nuclear regulatory commission 14
- regulatory commission 34
nuclear power plant 26
- power plant 55
- nuclear power 42
- nuclear plant 42
electric co 143
Phrases Extracted from "Doc A"
• Maximal frequent sequence (frequency = 15):
federal reserve entered u.s. government securities market arrange repurchase agreements fed dealers federal funds trading fed began temporary supply reserves banking system
• One occurrence of the phrase:
The Federal Reserve entered the U.S. Government securities market to arrange 1.5 billion dlrs of customer repurchase agreements, a Fed spokesman said. Dealers said Federal funds were trading at 6-3/16 pct when the Fed began its temporary and indirect supply of reserves to the banking system.
Phrases Extracted from "Doc B"
• The frequency of the sequence is 13, and it contains the following subsequences that are more frequent:
arrange repurchase 23      banking system 66
fed federal 25             trading fed 22
fed funds 23               trading system 25
fed temporary 23           reserve u.s. 43
market arrange 23          supply reserves 36
market trading 41          supply system 25
u.s. government 160        dealers federal 30
u.s. dealers 32            dealers funds 27
u.s. trading 35            dealers trading 33
u.s. supply 26             federal u.s. 28
reserves system 36         federal trading 30
securities arrange 23      funds trading 43
securities trading 32      reserve u.s. government 31
government arrange 23      reserves banking system 25
Phrases Extracted from "Doc B"
• Goal: rich computational representation for documents
– Feature sets for analysis
– Human-readable description
• Applications
– Key phrases in information retrieval
– Overview of the collection: clustering
– Summary of the content
– Automatic generation of hypertext links
– Associations between documents
– Browsing of document collections
Use of Frequent Phrases
• Example: suppose that a query "agricultur*" has been made
• The user is then given a "middle-level" list of phrases that tells something more about the context around the words in the query
Use of Frequent Phrases
QUERY: agricultur*
agricultural exports
agricultural production
agricultural products
agricultural stabilization conservation service
agricultural subsidies
agricultural trade
u.s. agriculture
agriculture department usda
agriculture department wheat
agriculture minister
agriculture officials
agriculture undersecretary daniel amstutz
common agricultural policy
ec agriculture ministers
european community agriculture
Use of Frequent Phrases
• Suppose that the user is interested in the subject "agricultural subsidies" and selects it from the list
• As an answer to the query, one might now return all the sentences containing the phrase "agricultural subsidies" (e.g., the ones on the next pages)
• Alternatively, the user might want to see directly the whole documents in which the phrase appears, or the other phrases that occur together with the phrase "agricultural subsidies" in the documents
Use of Frequent Phrases
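Returning the matching sentences can be sketched as a per-sentence subsequence test (a minimal illustration: the input `sentences` is assumed to be pre-split text, and the gap limit of the full model is ignored):

```python
def sentences_with_phrase(sentences, phrase):
    """Sentences containing every word of the phrase, in order."""
    def contains(words):
        it = iter(words)
        return all(w in it for w in phrase)
    return [s for s in sentences if contains(s.lower().split())]
```

For example, querying two toy sentences for the phrase "agricultural subsidies" returns only the sentence that contains both words in order.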
• Text mining:
– The "roots" are in text databases and information retrieval
– Data mining techniques can complement or help the existing database/information retrieval techniques
• In this lecture, only a few methods based on association and episode style algorithms were given:
– Naïve approaches are applicable to some extent; maximal frequent phrases might be useful in some cases
– Many clustering, classification, and similarity techniques, to be presented in the next lectures, are useful for going a few steps further
Summary
• Helena Ahonen-Myka: Finding All Frequent Maximal Sequences in Text. In ICML-99 Workshop on Machine Learning in Text Data Analysis, pp. 11-17, J. Stefan Institute, Ljubljana, 1999. Electronic version at http://www.cs.helsinki.fi/u/hahonen/ham_icml99.ps
• J. Han and M. Kamber: Data Mining: Concepts and Techniques, Section 9.5. Also available at http://www.cs.sfu.ca/~han/DM_Book.html
• Helena Ahonen, Oskari Heinonen, Mika Klemettinen, and Inkeri Verkamo: Applying Data Mining Techniques for Descriptive Phrase Extraction in Digital Document Collections. In Advances in Digital Libraries '98, April 1998. Electronic version at http://www-db.informatik.uni-tuebingen.de/forschung/papers/adl98.ps
References
Next Week
• Lecture 14.11.: Clustering, Classification, Similarity
– Pirjo gives the lecture
• Exercise 15.11.: Text mining
– Pirjo takes care of you! :-)
• Seminar 9.11.: Text mining
– Mika gives the lecture
– 2 group presentations (groups 5-6)
Course Organization
Seminar Presentations/Groups 5-6
Feldman et al.
Lent, Agrawal, Srikant
R. Feldman et al.: "Knowledge Management: A Text Mining Approach", PAKM 1998.
B. Lent, R. Agrawal, R. Srikant: "Discovering Trends in Text Databases", KDD 1997.
• Remember:
– Try to understand the "message" in the article
– Try to present the basic ideas as clearly as possible, use examples
– Do not present detailed mathematics or algorithms
– Test: do you understand your own presentation?
– In the presentation, use PowerPoint or conventional slides
Seminar Presentations
• Requirements:
– Articles are given on the previous week's Wednesday
– Presentation (around 3-5 printed pages) is due before the seminar starts:
• Can be either an HTML page or a printable document in PostScript/PDF format
– 30 minutes of presentation
– 5-15 minutes of discussion
– Active participation
Thank you for your attention!
Thanks to Helena Ahonen-Myka and Jiawei Han for their slides which greatly helped in preparing this lecture!
Text Mining