Visualizing sequential Patterns for Text Mining Pak Chung Wong, Wendy Cowley, Harlan Foote,...

28
Visualizing Visualizing sequential Patterns sequential Patterns for Text Mining for Text Mining Pak Chung Wong, Wendy Cowley, Harlan Foote, Elizabeth Jurrus, Jim Thomas Pacific Northwest National Laboratory Proceedings IEEE Information Visualization 2000
  • date post

    20-Dec-2015
  • Category

    Documents

  • view

    214
  • download

    0

Transcript of Visualizing sequential Patterns for Text Mining Pak Chung Wong, Wendy Cowley, Harlan Foote,...

Visualizing sequential Visualizing sequential Patterns for Text Patterns for Text

MiningMining

Pak Chung Wong, Wendy Cowley, Harlan Foote, Elizabeth Jurrus, Jim

Thomas

Pacific Northwest National Laboratory

Proceedings IEEE Information Visualization 2000

IntroductionIntroduction

• The mining of sequential patterns is The mining of sequential patterns is designed to find patterns of discrete designed to find patterns of discrete events that frequently happen in the events that frequently happen in the same arrangement along a timeline. same arrangement along a timeline.

• Like association and clustering, the Like association and clustering, the mining of sequential patterns is among mining of sequential patterns is among the most popular knowledge discovery the most popular knowledge discovery techniques that apply statistical techniques that apply statistical measures to extract useful information measures to extract useful information from large datasets. from large datasets.

IntroductionIntroduction

• The task of sequential patterns in The task of sequential patterns in knowledge discovery and data mining is knowledge discovery and data mining is to identivy the item that frequently to identivy the item that frequently precedes another item.precedes another item.

• A sequential pattern in data mining is a A sequential pattern in data mining is a finite series of elements such as A →B finite series of elements such as A →B →C →D where A, B, C, and D are →C →D where A, B, C, and D are elements of the same domain.elements of the same domain.

Minimum SupportMinimum Support

• Each sequential pattern in data mining Each sequential pattern in data mining comes with a minimum support value, comes with a minimum support value, which indicates the percentage of total which indicates the percentage of total records that contain the pattern.records that contain the pattern.

ExampleExample

• 90% of the die-hard fans who saw the 90% of the die-hard fans who saw the movie Titanic went on to buy the movie movie Titanic went on to buy the movie sound track CD, followed by the sound track CD, followed by the videotape when it was released.videotape when it was released.

Primary GoalPrimary Goal

• The primary goal of sequential pattern The primary goal of sequential pattern discovery is to access the evolution of discovery is to access the evolution of events against a measured timeline and events against a measured timeline and detect changes that might occur detect changes that might occur coincidentally.coincidentally.

ApplicationApplication

• Detect medical fraud in insurance claimsDetect medical fraud in insurance claims

• Evaluate drug performance in Evaluate drug performance in pharmaceutical industrypharmaceutical industry

• Determine risk factors in military operationsDetermine risk factors in military operations

Association rule vs. sequential Association rule vs. sequential patternpattern

• An association rule is an implication of An association rule is an implication of the form X →Y where X is a set of the form X →Y where X is a set of antecedent items and Y is the antecedent items and Y is the consequent item. consequent item.

• Given a domain set of elements, for Given a domain set of elements, for example {A, B, C, D}, A →B →C →D is a example {A, B, C, D}, A →B →C →D is a sequential pattern while A + B + C →D sequential pattern while A + B + C →D is an association rule. is an association rule.

Association rule vs. sequential Association rule vs. sequential patternpattern

(Continued)(Continued)

• A sequential pattern is the study of A sequential pattern is the study of “ordering” or “arrangement” of “ordering” or “arrangement” of elements, whereas an association rule is elements, whereas an association rule is the study of “togetherness” of elements.the study of “togetherness” of elements.

Data DescriptionData Description

• Medium-sized (~1MB) corpus is Medium-sized (~1MB) corpus is stored as an ASCII file with about stored as an ASCII file with about 1170 articles collected from 1991 to 1170 articles collected from 1991 to 19971997

• Has a strong theme associated with Has a strong theme associated with nuclear smuggling news throughout nuclear smuggling news throughout the 90’sthe 90’s

Topic ExtractionTopic Extraction

• The first step in the corpus is to The first step in the corpus is to identify an interesting set of content-identify an interesting set of content-bearing words from the articlesbearing words from the articles

• Co-occurrence or lack of co-Co-occurrence or lack of co-occurrence of these interesting occurrence of these interesting words in document is used to words in document is used to evaluate the strengths of the wordsevaluate the strengths of the words

Topic Extraction (Cont.)Topic Extraction (Cont.)

• Commonly appearing words that do Commonly appearing words that do not directly contribute to the content not directly contribute to the content – such as prepositions, pronouns, – such as prepositions, pronouns, adjectives, and gerunds – are ignoredadjectives, and gerunds – are ignored

• The result is a set of content-bearing The result is a set of content-bearing words that represent the entire major words that represent the entire major topics of the corpustopics of the corpus

Multiresolution BinningMultiresolution Binning

• Since the goal is to study sequential Since the goal is to study sequential patterns of the daily events recorded patterns of the daily events recorded in a corpus, every topic word that in a corpus, every topic word that appears in articles with the same time appears in articles with the same time stamp is binned together in a topic stamp is binned together in a topic subset for the mining tasksubset for the mining task

• (Similarly) can bin the topics by weeks, (Similarly) can bin the topics by weeks, months, or years to show different months, or years to show different resolutions of sequential patternsresolutions of sequential patterns

Multiresolution Binning (cont.)Multiresolution Binning (cont.)

• The idea is to capture the topic The idea is to capture the topic patterns of the news stories that patterns of the news stories that span different time intervals such as span different time intervals such as days, months, or years.days, months, or years.

Discovery of Sequential PatternsDiscovery of Sequential Patterns

• Example : A plot of topic Example : A plot of topic combinations of interest versus time combinations of interest versus time from July to September 1990from July to September 1990

Discovery of Sequential Patterns Discovery of Sequential Patterns (cont.)(cont.)

• Strength Strength

- can quickly obtain an overall - can quickly obtain an overall structural view of topic patterns and structural view of topic patterns and their distribution.their distribution.

- can see not only the frequency of - can see not only the frequency of the patterns but also the occurrence the patterns but also the occurrence dates of individual events dates of individual events

Discovery of Sequential Patterns Discovery of Sequential Patterns (cont.)(cont.)

• Weakness (the precision of the pattern)Weakness (the precision of the pattern)

- Can not know the exact connections of a Can not know the exact connections of a patternpattern

- Lack of statistical support of individual Lack of statistical support of individual patternspatterns

- These problems can be solved by including These problems can be solved by including a data mining step ahead of the a data mining step ahead of the visualizationvisualization

Discovery by Data MiningDiscovery by Data Mining

• The pattern is valid pattern if its The pattern is valid pattern if its support value is larger than a pre-support value is larger than a pre-defined threshold valuedefined threshold value

• The support value is calculated as The support value is calculated as the number of occurrences of the the number of occurrences of the pattern in the datasetpattern in the dataset

• Each node of the tree represents a Each node of the tree represents a topic of the pattern sequences.topic of the pattern sequences.

Discovery by Data MiningDiscovery by Data Mining

• A basic example of mining sequential patterns of a A basic example of mining sequential patterns of a corpus with a topic domain of {A, B, C} withn a 6-corpus with a topic domain of {A, B, C} withn a 6-day period.day period.

• During the first phase, build all the patterns with 2 During the first phase, build all the patterns with 2 topics.topics.

Discovery by Data MiningDiscovery by Data Mining

• Match the qualified 2-topic patterns Match the qualified 2-topic patterns with the input data and generate with the input data and generate the new 3-topic patterns followed by the new 3-topic patterns followed by another round of threshold pruninganother round of threshold pruning

Discovery by Data MiningDiscovery by Data Mining

• The pattern A -> C happens in Day 1 to 2 The pattern A -> C happens in Day 1 to 2 and Day 2 to 3 : the support is 2/6 = 0.33 and Day 2 to 3 : the support is 2/6 = 0.33 while the pattern A -> C -> B only while the pattern A -> C -> B only happens once : the support is 1/6 = 0.167happens once : the support is 1/6 = 0.167

System OverviewSystem Overview

• Overview of visual data Overview of visual data mining system in the mining system in the discovery of sequential discovery of sequential patterns of a corpuspatterns of a corpus

• A corpus of narrative A corpus of narrative text is fed into a text text is fed into a text engine for topic engine for topic extractionextraction

• Selected topics in each Selected topics in each article are binned article are binned together for the mining together for the mining engine to generate engine to generate sequential patterns with sequential patterns with their support informationtheir support information

Performance and StatisticsPerformance and Statistics

Number Number of Binsof Bins

Support Support ThresholThreshol

dd

Number Number of of

PatternsPatterns

DailyDaily 766766 1%1% 9191

MonthlyMonthly 7676 10%10% 444,617444,617

MonthlyMonthly 7676 25%25% 509509

YearlyYearly 77 45%45% 148,667148,667

• 1170 articles collected from 1991 to 1997• This corpus has a strong theme associated with nuclear smuggling news throughout the 90s.

Snapshot of the visualization Snapshot of the visualization systemsystem

• The four white dashed circles mark the four appearances of the same two patterns within an 8-month period from Jan 92 to Aug 92.

Snapshot of the visualization Snapshot of the visualization systemsystem

• A visualization snapshot of sequential patterns using the monthly binning topic sets• A minimum threshold value of 25% applied

ConclusionsConclusions

• Presents data mining and visualization Presents data mining and visualization techniques for discovery of sequential techniques for discovery of sequential patterns from large datasetspatterns from large datasets

• Introduce a powerful visual data mining Introduce a powerful visual data mining environment that contains a data mining environment that contains a data mining engine to discover the patterns and their engine to discover the patterns and their support values and a visualization front-end support values and a visualization front-end to show the distribution and locality of the to show the distribution and locality of the patternspatterns

• We can learn more and more quickly in an We can learn more and more quickly in an integrated visual data mining environmentintegrated visual data mining environment

Discovery of Sequential PatternsDiscovery of Sequential Patterns

• The idea is to capture the topic The idea is to capture the topic patterns of the news stories that patterns of the news stories that span different time intervals such as span different time intervals such as days, months, or years.days, months, or years.