A
SEMINAR REPORT
ON
“Topic Mining Over Asynchronous Text Sequences”
SUBMITTED BY
Arvind R Kolhe
UNDER GUIDANCE OF
Prof. S.Y.Raut
DEPARTMENT OF COMPUTER ENGINEERING
PRAVARA RURAL ENGINEERING COLLEGE, Loni - 413736
Tal. Rahata, Dist. Ahmednagar, (M.S.), India
2013 – 2014
Pravara Rural Education Society’s
Pravara Rural Engineering College, Loni
Department of Computer Engineering
Affiliated To University Of Pune, Pune
CERTIFICATE
This is to certify that this Seminar Report
entitled
“Topic Mining Over Asynchronous Text Sequences”
Submitted by
Mr. Arvind R Kolhe
Roll No.03
Student of T.E. Computer Engineering
during the academic year 2013-2014. This
report embodies the work carried out by the
candidate, towards partial fulfillment of Third
Year Computer Engineering conferred by the
University of Pune.
Prof. S.Y.Raut Prof. N. B. Kadu Prof. S. D. Jondhale
(Guide) (Seminar Co-ordinator) (Head of Department)
ACKNOWLEDGEMENT
I wish to express a true sense of gratitude to my seminar guide Prof. S.Y. Raut and
the H.O.D. of the Computer Department, Prof. S.D. Jondhale, who at every step
of this seminar contributed their valuable guidance and helped to solve each
and every problem that occurred during the seminar.
This is a nice opportunity for me to present a seminar titled “TOPIC
MINING OVER ASYNCHRONOUS TEXT SEQUENCES ”.
I would extend my sincere thanks to all staff members for extending
their kind support and encouragement during the preparation steps of this
seminar.
I also express my thanks to all my friends who directly or indirectly
supported me during the preparation of this seminar.
Kolhe Arvind R
T.E [Computer Engg.]
INDEX
SR.NO. TITLE PAGE NO.
TITLE PAGE I
CERTIFICATE II
ACKNOWLEDGEMENT III
LIST OF FIGURE IV
LIST OF TABLES V
LIST OF ACRONYMS / ABBREVIATIONS VI
ABSTRACT VII
1. INTRODUCTION 1
2. LITERATURE SURVEY 3
3. OBJECTIVES / SCOPE OF SYSTEM 4
3.1 PROBLEM DEFINITION 4
3.2 OBJECTIVES 4
4. ARCHITECTURE 6
4.1 Extraction Module 6
4.2 Mapping Module 7
4.3 Optimization Module and Main Topic 7
5. ALGORITHM 8
5.1 Topic Extraction 8
5.1.1 Natural Language Processing Major Task 10
5.2 Time Synchronization 10
5.3 Algorithm Steps 11
5.3.1 Split the text into sentences 11
5.3.2 Parse the sentences 11
5.3.3 Select the candidate parts 11
5.3.4 Calculate the weight for each candidate topic 11
5.3.5 Select the final topic 11
6 ADVANTAGES 14
7 DISADVANTAGES 15
8 APPLICATION 16
CONCLUSION 17
FUTURE SCOPE 18
APPENDIX 19
APPENDIX A:SOME IMPORTANT DEFINITIONS 19
APPENDIX B: HOW TO USE WEKA AS TOPIC MINING TOOL 21
REFERENCES 22
LIST OF FIGURES
Figure Number Figure Name Page Number
Fig 1 General overview of topic mining system 6
Fig 2 Parsing output of parser 12
Fig 3 Percentage of different results for topic mining algorithm 12
Fig 4 Detail of topic mining algorithm experiment 13
LIST OF TABLES
Table Number Table Name Page Number
Table 1 Word occurrence in a news report 9
Table 2 Notations used to define the term 19
ABSTRACT
Time stamped texts, or text sequences, are ubiquitous in real-world applications. Multiple
text sequences are often related to each other by sharing common topics. However, it is
nontrivial to explore this correlation in the presence of asynchronism among multiple
sequences, i.e., documents from different sequences about the same topic may have
different time stamps. Our algorithm consists of two alternate steps: the first step extracts
common topics from multiple sequences based on the adjusted time stamps provided by the
second step; the second step adjusts the time stamps of the documents according to the time
distribution of the topics discovered by the first step. We perform these two steps
alternately, and after iterations a monotonic convergence of our objective function can be
guaranteed. By optimizing over the corresponding concepts, we pick a single node among
the concept nodes which we believe is the topic of the target text. However, a limited-vocabulary
problem is encountered while mapping the keywords onto their corresponding concepts.
“Topic Mining Over Asynchronous Text Sequences”
CHAPTER 1
INTRODUCTION
The growing amount of information available on the Internet has attracted many
researchers to focus their work on text analysis and the processing of web documents. One
of the important processes is to find the main topic of a particular web document and
of related documents. Using related concepts, we can capture the semantic
relations among the words in the text. For example, suppose the extracted words from a
web document are computer and security. The mapping process will retrieve the concepts
Computer, Security, and Encryption. The ontology hierarchy helps us identify that
the security mentioned in the web document is most probably computer
security, which is related to hackers rather than to computer robbery. As the amount of text
available online keeps growing, it becomes increasingly difficult for people to keep
track of and locate the information of interest to them. To remedy the problem of
information overload, a robust and automated text summarizer or information extractor is
needed. Topic identification is one of two very important steps in the process of
summarizing a text; the second step is summary text generation. To discover valuable
knowledge from a text sequence, the first step is usually to extract topics from the
sequence with both semantic and temporal information, which are described by two
distributions, respectively: a word distribution describing the semantics of the topic and
a time distribution describing the topic’s intensity over time. In many real-world
applications, we are facing multiple text sequences that are correlated with each other
by sharing common topics. The method proposed therein relied on a fundamental
assumption that different sequences are always synchronous in time, or in their own
terms, coordinated, which means that the common topics share the same time
distribution over different sequences. In practice, however, asynchronism among multiple
sequences, i.e., documents from different sequences on the same topic having different
time stamps, is actually very common. For instance, in news feeds, there is no guarantee
that news articles covering the same topic are indexed by the same time stamps. There
can be hours of delay for news agencies, days for newspapers, and even weeks for
1 P.R.E.C, Computer Engg. 2013-14
periodicals, because some sources try to provide first-hand flashes shortly after the
incidents, while others provide more comprehensive reviews afterward. Another
example is research paper archives, where the latest research topics are closely followed
by newsletters and communications within weeks or months, then the full versions may
appear in conference proceedings, which are usually published annually, and at last in
journals, which may sometimes take more than a year to appear after submission. To
visualize this, one can plot the relative frequency of occurrences of two terms, such as
warehouse and mining, over time in different sequences.
We do not assume that given text sequences are always synchronous. Instead,
we deal with text sequences that share common topics yet are temporally asynchronous.
CHAPTER 2
LITERATURE SURVEY
Topic mining has been extensively studied in the literature, starting with the
Topic Detection and Tracking, which aimed to find and track topics (events) in news
sequences with clustering-based techniques. In many real applications, text collections
carry generic temporal information and, thus, can be considered as text sequences. To
capture the temporal dynamics of topics, various methods have been proposed to
discover topics over time in text sequences. However, these methods were designed to
extract topics from a single sequence. A very recent work by Wang et al. first proposed a
topic mining method that aimed to discover common (bursty) topics over multiple text
sequences. Their approach is different from ours because they tried to find topics that
shared common time distribution over different sequences by assuming that the
sequences were synchronous, or coordinated. Based on this premise, documents with
same time stamps are combined together over different sequences so that the word
distributions of topics in individual sequences can be discovered. In contrast,
in our work, we aim to find topics that are common in semantics, while having
asynchronous time distributions in different sequences.
In the literature of topic mining, time stamped text sequences are also referred to
as text streams. In this paper, we use the term sequence to distinguish it from the
concept of data stream in the theory literature.
CHAPTER 3
OBJECTIVES
The aim is to use data fusion, data mining, and knowledge discovery processes to
detect anomalies.
3.1 Problem Identification:
In several applications, text collections carry generic temporal information and
hence can be treated as text sequences. To capture the temporal dynamics of
topics, various methods have been proposed to discover topics over time in text
sequences. This report addresses the problem of mining common topics from multiple
asynchronous text sequences and proposes an effective method to solve it. We formally
define the problem by introducing a principled probabilistic framework, based on which a
unified objective function can be derived. Time stamped texts, or text sequences, are
ubiquitous in real-world applications. Multiple text sequences are often related to
each other by sharing common topics. The correlation among these sequences
provides more meaningful and comprehensive clues for topic mining than those
from each individual sequence. However, it is nontrivial to explore the correlation
with the existence of asynchronism among multiple sequences, i.e., documents from
different sequences about the same topic may have different time stamps. More
and more text sequences are being generated in various forms, such as news
streams, weblog articles, emails, instant messages, research paper archives, web
forum discussion threads, and so forth. To discover valuable knowledge from a text
sequence, the first step is usually to extract topics from the sequence with both
semantic and temporal information.
3.2 Objectives:
To discover valuable knowledge from a text sequence, the first step is usually to
extract topics from the sequence with both semantic and temporal information,
which are described by two distributions, respectively: a word distribution
describing the semantics of the topic and a time distribution describing the topic’s
intensity over time. In many real-world applications, we are facing multiple text
sequences that are correlated with each other by sharing common topics. Intuitively,
the interactions among these sequences could provide clues to derive more
meaningful and comprehensive topics than those found by using information from
each individual stream.
To address the problem of mining common topics from multiple asynchronous
text sequences. To the best of our knowledge, this is the first attempt to solve
this problem.
To formalize our problem by introducing a principled probabilistic framework
and propose an objective function for our problem.
To develop a novel alternate optimization algorithm to maximize the objective
function with a theoretically guaranteed (local) optimum.
To validate the effectiveness and advantages of the method through an extensive
empirical study on two real-world data sets.
CHAPTER 4
Architecture
Generally, the automatic topic mining system has three main
components: the extraction module, the mapping module, and the optimization module.
The input of the system is a document, and the output is a concept node, which
is also the predicted topic of the target document. The concept node can consist
of one word or more. Fig. 1 shows the general overview of our topic mining
system.
Fig. 1: General overview of the topic mining system
4.1 Extraction Module
The extraction module handles the process of extracting important sentences
from the document. Our method of extraction is based on HTML tags, because we
believe some HTML tags indicate locations where authors emphasise their ideas.
For example, an author may choose the best words to describe his web page in the
title tag. Therefore, we choose sentences or words that act as pointers to other
documents, words which are highlighted, and words located in the title tag.
However, some web documents may lack HTML tags, in which case our extraction
technique is not appropriate. In such cases, alternative ways of extracting from a
web document, such as word frequency and positional policy, are more applicable.
Since in this report we are interested in extracting information from web documents
based on HTML tags, we consider extracting information from non-structured
documents as future work. The final output of this module is a list of keywords.
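The tag-based extraction described above can be sketched with Python's standard `html.parser`; the set of emphasis tags and the sample page below are illustrative assumptions, not taken from the report.

```python
from html.parser import HTMLParser

class KeywordExtractor(HTMLParser):
    """Collect text found inside tags where an author tends to
    emphasise ideas (title, anchors, highlights, headings)."""
    EMPHASIS = {"title", "a", "b", "strong", "em", "h1", "h2"}

    def __init__(self):
        super().__init__()
        self._depth = 0          # >0 while inside an emphasised tag
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag in self.EMPHASIS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.EMPHASIS and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:          # keep only emphasised text
            self.keywords.extend(data.lower().split())

def extract_keywords(html):
    parser = KeywordExtractor()
    parser.feed(html)
    return parser.keywords

page = ("<html><head><title>Computer Security</title></head>"
        "<body>Plain text. <b>encryption</b> matters.</body></html>")
print(extract_keywords(page))   # ['computer', 'security', 'encryption']
```

Plain body text is discarded; only the title and highlighted words survive, matching the module's intent.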
4.2 Mapping Module
The mapping module takes the output of the extraction module as its input.
The keywords are mapped onto the words of ontology concepts. However, a
keyword may fail to map onto its corresponding concept because no such concept
is available in the ontology. This situation requires an alternative way to map the
keyword onto a concept: use an extended concept as a “middle man” so that the
mapping between the Yahoo concept and the keyword becomes possible.
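A minimal sketch of this mapping step, assuming a toy ontology and a synonym ("extended concept") table; the names `ONTOLOGY`, `EXTENDED`, and `map_to_concept` are invented for illustration and are not from the report.

```python
# Direct ontology lookup with an "extended concept" fallback: when a keyword
# has no concept of its own, a synonym table acts as the middle man.
ONTOLOGY = {"computer": "Computer", "security": "Security",
            "encryption": "Encryption"}
EXTENDED = {"crypto": "encryption", "pc": "computer"}  # keyword -> known keyword

def map_to_concept(keyword):
    if keyword in ONTOLOGY:
        return ONTOLOGY[keyword]
    if keyword in EXTENDED:                # extended concept as middle man
        return ONTOLOGY[EXTENDED[keyword]]
    return None                            # the limited-vocabulary case

print([map_to_concept(k) for k in ["computer", "crypto", "banana"]])
# ['Computer', 'Encryption', None]
```

The `None` branch corresponds to the limited-vocabulary problem mentioned in the abstract.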
4.3 Optimization Module and Main Topic
The optimization process shrinks the ontology tree into an optimized tree in
which only active concepts and the intermediate active concepts are chosen. The
optimized tree is then reduced to a single path. This path is retrieved using the
Maximal Spanning Tree algorithm (analogous to the Minimum Spanning Tree
algorithm), which finds the path with the heaviest nodes. The node weight that the
algorithm uses as the criterion for choosing the heaviest node is the accumulated
mixture weight of the node.
The foundation of the topic identification process is frequent itemsets. In TopCat, a
frequent itemset is a group of named entities that occur together in multiple articles.
Co-occurrence of words has been shown to carry useful information. What this
information really gives us is correlated items rather than a topic. However, we
found that correlated named entities frequently occur within a recognizable topic,
and these interesting correlations enable us to identify a topic.
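The reduction to a single heaviest path can be sketched as a depth-first search over a node-weighted tree; the tree shape, the weights, and the function name below are assumptions for illustration, not the report's implementation.

```python
# Find the root-to-leaf path whose accumulated node weights are largest.
def heaviest_path(tree, weights, root):
    """tree: node -> list of children; weights: node -> mixture weight."""
    children = tree.get(root, [])
    if not children:
        return [root], weights[root]
    best_path, best_w = max(
        (heaviest_path(tree, weights, c) for c in children),
        key=lambda pw: pw[1])
    return [root] + best_path, weights[root] + best_w

tree = {"Top": ["Computer", "Crime"], "Computer": ["Security"],
        "Security": ["Encryption"]}
weights = {"Top": 0.0, "Computer": 0.6, "Crime": 0.3,
           "Security": 0.5, "Encryption": 0.4}
path, w = heaviest_path(tree, weights, "Top")
print(path)   # ['Top', 'Computer', 'Security', 'Encryption']
```

On a tree, each node is visited once, so this is linear in the number of concepts, which matches the spirit of a spanning-tree pass over the optimized tree.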
CHAPTER 5
Algorithm
In this section, we show how to optimize our objective function
through an alternate (constrained) optimization scheme.
5.1 Topic Extraction
Topic extraction from a text corpus is the foundation of many topic analysis tasks,
such as topic trend prediction, opinion extraction, etc. Since hierarchical structure is
characteristic of topics, it is preferable for a topic extraction algorithm to output
topic descriptions with this kind of structure. The hierarchical topic structure
extracted by most current topic analysis algorithms cannot provide a
meaningful description for all subtopics in the hierarchical tree. A document may
describe a topic from different aspects; some are general, while others are detailed.
For example, a China Daily news report titled “Economic hubs face tough times
amid crisis” states that Guangdong and Shanghai, the two economic powerhouses of
China, have suffered from the global financial crisis and forecasts even worse
prospects. To support this statement, it provides several details about services,
markets, jobs, etc. In this article, some of the word occurrences are counted as in
Table 1. We can see that words describing the general aspect of the topic,
such as ‘Guangdong’, ‘Shanghai’, ‘Economy’, ‘Finance’, and ‘Crisis’, appear at a higher
frequency, whereas words describing the detailed aspects of the topic, such as
“Service”, “Product”, and “Job”, appear at a lower frequency.
word        occurrences
Guangdong   8
Economy     7
Shanghai    4
City        4
Year        4
Finance     2
Crisis      2
China       2
Service     2
Product     1
Market      1
Industry    1
Job         1
Table 1: Word occurrences in a news report
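Counts like those in Table 1 can be reproduced with a simple tokenizer and `collections.Counter`; the sample report text below is invented for illustration.

```python
from collections import Counter

report = ("Guangdong and Shanghai face crisis. "
          "Guangdong economy slows; Shanghai economy too.")
# Strip punctuation and lowercase before counting occurrences.
tokens = [w.strip(".,;").lower() for w in report.split()]
counts = Counter(tokens)
print(counts["guangdong"], counts["shanghai"], counts["economy"])  # 2 2 2
```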
Although the example is simple, it provides a clear way to recognize topic
grain. For a text corpus containing a large number of documents, we can suppose
that word document frequency represents the topic grain of the corpus to a
degree. However, this kind of topic grain is still poor at reflecting the actual semantics
in text. Hence, topic granularity, which is closely related to semantic characteristics,
should be explored further to provide a semantic discrimination indicator for topics.
Topics Extraction is Textalytics' solution for extracting the different elements
present in sources of information. This detection process is carried out by
combining a number of complex natural language processing techniques that
obtain morphological, syntactic, and semantic analyses of a text and use them to
identify different types of significant elements. The identified elements are classified
according to the following predefined categories:
Named entities: people, organizations, places, etc.
Concepts: significant keywords in the text
Time expressions
Money expressions
URIs
Phone number expressions
Other expressions: alphanumeric patterns
5.1.1 Major tasks in Natural Language Processing
The following is a list of some of the most commonly researched NLP tasks
relevant to topic extraction:
Natural language generation: Convert information from computer databases
into readable human language.
Natural language understanding: The identification of the intended
semantics from the multiple possible semantics that can be derived from a
natural language expression, which usually takes the form of organized
notations of natural language concepts.
Parsing: Determine the parse tree (grammatical analysis) of a given sentence.
The grammar for natural languages is ambiguous and typical sentences have
multiple possible analyses.
Sentence breaking: Find the sentence boundaries.
Topic segmentation: Given a chunk of text, separate it into segments, each
devoted to a topic, and identify the topic of each segment.
Information extraction (IE): This is concerned in general with the
extraction of semantic information from text.
5.2 Time Synchronization
Once the common topics are extracted, we match documents in all sequences to
these topics and adjust their time stamps to synchronize the sequences.
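A naive sketch of this synchronization idea, not the report's actual update rule: assign each document its topic, then replace its time stamp with the mean stamp of that topic's documents, pulling the sequences into alignment. The data below are invented.

```python
# Each document is a (topic, stamp) pair; adjusted stamps are the per-topic means.
def adjust_stamps(docs):
    by_topic = {}
    for topic, stamp in docs:
        by_topic.setdefault(topic, []).append(stamp)
    means = {t: sum(s) / len(s) for t, s in by_topic.items()}
    return [means[topic] for topic, _ in docs]

docs = [("crisis", 1), ("crisis", 3), ("mining", 10), ("mining", 14)]
print(adjust_stamps(docs))  # [2.0, 2.0, 12.0, 12.0]
```

The report's actual method alternates this step with topic extraction until the objective converges; this sketch shows only a single synchronization pass.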
5.3 Algorithm Steps
Our algorithm consists of five steps:
1. Split the text into sentences
2. Parse the sentences
3. Select the candidate parts
4. Calculate the weight for each candidate topic
5. Select the final topic
5.3.1 Split the text document into sentences
The first step in our algorithm is splitting the given text into sentences. In fact, the proposed algorithm is a “divide and conquer” approach; therefore, the first step should divide the problem until it cannot be divided further. A sentence is the smallest text unit capable of having a topic. Hence, we split the document into its sentences using a text-splitter tool, which yields a set of sentences.
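The sentence-splitting step might be sketched with a simple regular expression; a real text-splitter tool would also handle abbreviations and quotations, which this sketch ignores.

```python
import re

def split_sentences(text):
    # Split after sentence-final punctuation followed by whitespace.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

text = "My dog likes bananas. It also likes apples! Does it like pears?"
print(split_sentences(text))
# ['My dog likes bananas.', 'It also likes apples!', 'Does it like pears?']
```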
5.3.2 Parse the sentences
Earlier approaches calculate a weight for each noun and verb and then create all possible pairs, which causes overhead due to calculating weights for unimportant terms. Our proposed algorithm instead parses the sentences and determines the candidate terms first, to avoid useless calculation. We believe that syntactic parts such as the Noun Phrase (NP) and Verb Phrase (VP) play the most important roles in presenting the meaning of a sentence, and therefore we should consider them instead of grammatical roles such as noun and verb to identify the candidate topic for each sentence. For example, in the sentence “My dog also likes eating bananas”, the parser recognizes “my dog” as the NP subject and “likes eating bananas” as the VP.
5.3.3 Select the candidate parts
We select the noun phrase (NP) and the head of the verb phrase (VP) instead of just noun pairs and noun-verb pairs. We assume that the most important parts of a sentence are the NPs that function as subject or complement and the head of the VP. To illustrate: in the sentence “My dog also likes eating bananas”, the phrase “my dog” is selected as the NP, “likes” is selected as the head of the VP, and “bananas” as an NP complement. The combination of these three segments is considered a candidate topic. Hence, the topic for this sentence is identified as “My dog likes bananas”. At the end of this step, we have a set of candidate topics.
5.3.4 Calculate the weight for each candidate topic
At this point we calculate a TF-IDF (term frequency-inverse document frequency) weight for only the required topic concepts. The term frequency of a word is its count in a particular document divided by the total number of words in that document; the inverse document frequency is the logarithm of the total number of documents divided by the number of documents containing the word. The weight of a candidate topic combines these two factors.
5.3.5 Select the final topic
Once we have determined the candidate topic and its associated weight for each sentence, we select the most heavily weighted one and consider it the main topic for the whole document. In case there is more than one candidate topic with the greatest weight, we consider all of them as the main topic.
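The tie-aware final selection can be sketched in a few lines; the candidate strings and weights below are invented for illustration.

```python
# Every candidate sharing the greatest weight is kept, per step 5.3.5.
def select_final_topics(candidates):
    """candidates: dict topic -> weight; returns all topics of max weight."""
    top = max(candidates.values())
    return [t for t, w in candidates.items() if w == top]

cands = {"My dog likes bananas": 0.37, "It likes apples": 0.52,
         "Does it like pears": 0.52}
print(select_final_topics(cands))  # ['It likes apples', 'Does it like pears']
```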
Example: “topic” stands for the stream of terms that carries the semantics and meaning of
the text inside the document. However, it is not necessarily the same as the title
displayed at the top of the document. Therefore, one proper method to evaluate the
accuracy of topic mining could be comparing the topic identified by the topic mining
algorithm with the document's title.
Fig. 2: Parsing result with the parser tool
Fig. 3: Percentage of different results for topic mining algorithm
Fig. 4: Detail of automatic topic identification algorithm experiment
CHAPTER 6
Advantages
We tackle the problem of mining common topics from multiple asynchronous text
sequences. We propose a novel method which can automatically discover and fix
potential asynchronism among sequences and consequently extract better
common topics. The key idea of our method is to introduce a self-refinement
process by utilizing the correlation between the semantic and temporal information in
the sequences. Following are some of the advantages of this technique:
It performs topic extraction and time synchronization alternately to optimize a
unified objective function.
A local optimum is guaranteed by our algorithm.
Its effectiveness has been justified on two real-world data sets, in comparison
to a baseline method.
It is able to find meaningful and discriminative topics from asynchronous text
sequences.
It significantly outperforms the baseline method, evaluated both qualitatively
and quantitatively.
The performance of our method is robust and stable against different parameter
settings and random initialization.
CHAPTER 7
Disadvantages
Following are some cases where the topic mining over asynchronous text sequences
method may not work well:
There is no correlation between the semantic and temporal information of
topics, i.e., the time distribution of any topic is random (no bursty behavior).
The temporal order of documents as given by their original time stamps varies
greatly from the temporal order of the underlying topics, e.g., Topic A appears
before Topic B in one sequence, but after B in another.
In either case, the better choice would be to discard the original temporal
information and treat the text sequences as a collection of documents.
CHAPTER 8
Applications
The technology is now broadly applied in a wide variety of web applications.
Applications can be sorted into a number of categories by analysis type or by
business function. Application categories include:
Web search engine: One possible application of topic mining is to utilize it for
web search. For example, the incremental document algorithm can be applied to
a stream of web pages returned by a search engine. Since topic mining can build
a document cluster hierarchy incrementally, a user can browse a document
cluster hierarchy instead of examining a flat list of documents. In addition, topic
ontologies can be used to suggest alternative query terms to refine the query.
Monitoring: Especially monitoring and analysis of online plain text sources
such as Internet news, blogs.
Biomedical applications: GoPubMed is a knowledge-based search engine for
biomedical texts.
Online media applications: Text mining is used by large media
companies, such as the Tribune Company, to clarify information and to provide
readers with better search experiences.
Conclusion
This report addressed the problem of mining common topics from multiple asynchronous text sequences.
We propose a novel method which can automatically discover and fix potential
asynchronism among sequences and consequentially extract better common topics.
The key idea of our method is to introduce a self-refinement process by utilizing
correlation between the semantic and temporal information in the sequences. It
performs topic extraction and time synchronization alternately to optimize a unified
objective function. A local optimum is guaranteed by our algorithm. We justified
the effectiveness of our method on two real-world data sets, with comparison to a
baseline method. Empirical results suggest that the method is able to find meaningful
and discriminative topics from asynchronous text sequences; that it significantly
outperforms the baseline method, evaluated both qualitatively and quantitatively; and
that its performance is robust and stable against different parameter settings
and random initialization.
Future Scope
In the future, we plan to further reduce the computational complexity of our time
synchronization algorithm so that our method can be applied to real-time text stream
processing. Some cases remain where the topic mining over asynchronous text
sequences method may not work well: when there is no correlation between the
semantic and temporal information of topics, i.e., the time distribution of any topic is
random (no bursty behavior), and when the temporal order of documents as given by
their original time stamps varies greatly from the temporal order of the underlying
topics.
Appendix
Appendix A: Some Important Definitions Related To Topic Mining
Here we discuss the terms related to topic mining. The following are the notations used
to define the terms, shown in the table.
Table 2: Notations used to define the term
Mining: Mining is the extraction of something valuable from a source.
Data Mining: The analysis step of the “Knowledge Discovery in
Databases” process. Data mining, an interdisciplinary subfield of computer science, is
the computational process of discovering patterns in large data sets and turning them
into information that can be used to increase revenue, cut costs, or both. Data
mining software is one of a number of analytical tools for analyzing data. It
allows users to analyze data from many different dimensions or angles,
categorize it, and summarize the relationships identified. Technically, data
mining is the process of finding correlations or patterns among dozens of
fields in large relational databases.
Data Warehouse: A large store of data accumulated from a wide range of
sources within a company and used to guide management decisions; the
electronic storage of a large amount of information by a business.
Warehoused data must be stored in a manner that is secure, reliable, easy to
retrieve, and easy to manage. The concept of data warehousing originated in
1988 with the work of IBM researchers Barry Devlin and Paul Murphy. The
need to warehouse data evolved as computer systems became more complex
and handled increasing amounts of data.
Topic Warehouse: As the web continues to grow as a vehicle for the
distribution of information, many news organizations provide
newswire services through the Internet. Web news articles are composed of
hyperlinks, audio, video, images, and text. However, since not all news
stories have corresponding multimedia data, text can be a rich source of
information about the news. A topic warehouse is a large collection of text
streams and documents that are correlated with each other.
Text Mining: Text mining, also referred to as text data mining, roughly
equivalent to text analytics, refers to the process of deriving high-quality
information from text.
Text Sequence: S is a sequence of N documents. Each document d is a
collection of words over a vocabulary V and is indexed by a unique time stamp t.
Common Topic: A common topic Z over text sequences is defined by a
word distribution over vocabulary V and a time distribution over time
stamps.
Asynchronism: Given M text sequences in which documents are indexed by
time stamps, asynchronism means that the time stamps of the documents
sharing the same topic in different sequences are not properly aligned.
Data Sets: A dataset is a collection of data. Most commonly a dataset
corresponds to the contents of a single database table, or a single statistical
data matrix.
Ontology: In computer science and information science, an ontology
formally represents knowledge as a set of concepts within a domain, using a
shared vocabulary to denote the types, properties and interrelationships of
those concepts.
Appendix B: How to use WEKA Tool as Topic Mining Tool
Weka is a collection of machine learning algorithms for data mining tasks. Weka
supports several standard data mining tasks, more specifically, data preprocessing,
clustering, classification, regression, visualization, and feature selection. All of
Weka's techniques are predicated on the assumption that the data is available as a
single flat file or relation, where each data point is described by a fixed number of
attributes (normally, numeric or nominal attributes, but some other attribute types
are also supported). Weka provides access to SQL databases using Java Database
Connectivity and can process the result returned by a database query. It is not
capable of multi-relational data mining, but there is separate software for converting
a collection of linked database tables into a single table that is suitable for
processing using Weka. Another important area that is currently not covered by the
algorithms included in the Weka distribution is sequence modeling. The algorithms
can either be applied directly to a dataset or called from your own Java code. Weka
contains tools for data pre-processing, classification, regression, clustering,
association rules, and visualization. It is also well-suited for developing new
machine learning schemes. We can use Weka as a topic mining tool by treating
the data contained in the data warehouse as text sequences, text streams, and
documents containing synchronous and asynchronous text sequences that are
correlated with each other and share common topics.
References
1. D.M. Blei and J.D. Lafferty, “Dynamic Topic Models” Proc. Int’l Conf.
Machine Learning (ICML), pp. 113-120, 2006.
2. Z. Li, B. Wang, M. Li, and W.-Y. Ma, “A Probabilistic Model for
Retrospective News Event Detection” Proc. Ann. Int’l ACM SIGIR Conf.
Research and Development in Information Retrieval (SIGIR), pp. 106-113,
2005.
3. Q. Mei and C. Zhai, “Discovering Evolutionary Theme Patterns from Text:
An Exploration of Temporal Text Mining,” Proc. ACM SIGKDD Int’l Conf.
Knowledge Discovery and Data Mining (KDD), pp. 198-207, 2005.
4. A. Asuncion, P. Smyth, and M. Welling, “Asynchronous Distributed Learning of
Topic Models,” pp. 81-88, 2008.