
Page 1: Documentation

Catching the Trend: A Framework for Clustering Concept-Drifting Categorical Data

Abstract:

Data clustering is an important technique for exploratory data analysis and has been the focus of substantial research in several domains for decades. Among the techniques studied, sampling has been recognized as an important way to improve the efficiency of clustering. However, when sampling is applied, the points that are not sampled do not receive cluster labels during the normal clustering process. Although there is a straightforward approach in the numerical domain, the problem of how to allocate these unlabeled data points to proper clusters remains a challenging issue in the categorical domain. In this paper, a mechanism named Maximal Resemblance Data Labeling (abbreviated as MARDL) is proposed to allocate each unlabeled data point to the appropriate cluster based on a novel categorical clustering representative, namely, the N-Nodeset Importance Representative (abbreviated as NNIR), which represents clusters by the importance of the combinations of attribute values. MARDL has two advantages:

1) MARDL exhibits high execution efficiency.

2) MARDL can achieve high intracluster similarity and low intercluster similarity, which are regarded as the most important properties of clusters, thus benefiting the analysis of cluster behaviors.

MARDL is empirically validated on real and synthetic data sets and is shown to be significantly more efficient than prior works while attaining results of high quality.


Introduction

Data clustering is an important technique for exploratory data analysis and has been the focus of substantial research in several domains for decades. The problem of clustering is defined as follows: given a set of data objects, partition them into groups in such a way that objects in the same group are similar while objects in different groups are dissimilar, according to a predefined similarity measure. Clustering analysis can therefore help us gain insight into the distribution of data. However, a difficulty with learning in many real-world domains is that the concept of interest may depend on some hidden context that is not given explicitly in the form of predictive features. In other words, the concepts that we try to learn from the data drift with time. For example, the buying preferences of customers may change with time, depending on the day of the week, the availability of alternatives, the discounting rate, and so on. As the concepts behind the data evolve with time, the underlying clusters may also change considerably. Performing clustering on the entire time-evolving data set not only decreases the quality of the clusters but also disregards the expectations of users, who usually require recent clustering results.

The problem of clustering time-evolving data in the numerical domain has been explored in previous works. However, this problem has not been widely discussed in the categorical domain, with the exception of Web log transactions. Categorical attributes also prevalently exist in real data with drifting concepts: buying records of customers, Web logs that record the browsing history of users, and Web documents often evolve with time. Previous works on clustering categorical data focus on clustering the entire data set and do not take drifting concepts into consideration. Therefore, the problem of clustering time-evolving data in the categorical domain remains a challenging issue. As a result, a framework for performing clustering on categorical time-evolving data is proposed in this paper. Instead of designing a specific clustering algorithm, we propose a generalized clustering framework that utilizes existing clustering algorithms and detects whether there is a drifting concept in the incoming data. Fig. 1 shows our entire framework for performing clustering on categorical time-evolving data. In order to detect drifting concepts, the sliding window technique is adopted. Sliding windows conveniently eliminate outdated records, and the technique has been utilized in several previous works on clustering time-evolving data in the numerical domain.

Therefore, based on the sliding window technique, we can test whether the cluster characteristics of the latest data points in the current window are similar to those of the last clustering result. A strategy similar to our framework, which utilizes clustering results to analyze drifting concepts, has been proposed in the numerical domain. There, an online clustering algorithm is performed on each time frame, and numerical characteristics such as the mean and standard deviation of cluster centers are used to represent clustering results and detect changes between time frames. After a change is detected, an offline voting-based classification algorithm is performed to associate each change with its corresponding event. However, in the categorical domain, this procedure is infeasible because the numerical characteristics of clusters are difficult to define. Therefore, to capture the characteristics of clusters, an effective cluster representative that summarizes the clustering results is required. In this paper, a practical categorical clustering representative, named the "Node Importance Representative" (abbreviated as NIR), is utilized. NIR represents clusters by measuring the importance of each attribute value in the clusters. Based on NIR, we propose the "Drifting Concept Detection" (abbreviated as DCD) algorithm. In DCD, the incoming categorical data points in the current sliding window are first allocated to the corresponding proper clusters of the last clustering result, and the number of outliers that cannot be assigned to any cluster is counted. After that, the cluster and outlier distributions of the last clustering result and the current temporal clustering result are compared with each other. If the distribution has changed (exceeding some criteria), the concepts are said to drift. In a concept-drifting window, the data points are reclustered, and the last clustering representative is dumped out. On the contrary, if the concept is steady, the clustering representative (NIR) is updated. Moreover, the framework presented in this paper not only detects drifting concepts in categorical data but also explains them by analyzing the relationship between clustering results at different times. The analysis algorithm is named "Cluster Relationship Analysis" (CRA). When a drifting concept is detected by DCD, the last clustering representative is dumped out; therefore, each recorded clustering representative represents a successive stable clustering result, and different dumped-out representatives describe different concepts in the data set. By analyzing the relationships between clustering results, we may capture the time-evolving trend that explains why the clustering results change in the data set.
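To make the idea concrete, here is a minimal Python sketch. The importance measure, the resemblance score, and the drift thresholds below are simplified stand-ins for the paper's actual NIR and DCD definitions, not the published formulas:

```python
from collections import Counter

def build_nir(clusters):
    """Illustrative NIR: for each cluster, the relative frequency of each
    (attribute index, value) pair among its points. The paper's actual
    importance measure is more elaborate."""
    nir = []
    for cluster in clusters:
        counts = Counter()
        for point in cluster:
            counts.update(enumerate(point))  # count (attribute, value) pairs
        nir.append({pair: c / len(cluster) for pair, c in counts.items()})
    return nir

def label(point, nir, threshold=0.0):
    """Allocate a point to the cluster it resembles most; return None
    (outlier) if no cluster exceeds the resemblance threshold."""
    scores = [sum(rep.get(pair, 0.0) for pair in enumerate(point))
              for rep in nir]
    best = max(range(len(nir)), key=lambda i: scores[i])
    return best if scores[best] > threshold else None

def concept_drifts(window, nir, last_ratios, outlier_limit=0.3, diff_limit=0.3):
    """DCD-style test: label the new window with the old NIR, then compare
    the outlier ratio and per-cluster distribution with the last result."""
    labels = [label(p, nir) for p in window]
    outlier_ratio = labels.count(None) / len(window)
    ratios = [labels.count(i) / len(window) for i in range(len(nir))]
    diff = sum(abs(a - b) for a, b in zip(ratios, last_ratios))
    return outlier_ratio > outlier_limit or diff > diff_limit
```

A window whose cluster distribution matches the last result is judged steady (the NIR would then be updated); a window whose points pile into different clusters, or become outliers, triggers reclustering.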


Scope of the project:

Our contributions in this paper can be summarized as follows:

- A generalized framework for performing clustering on categorical time-evolving data is proposed. In particular, this framework is independent of clustering algorithms, and any categorical clustering algorithm can be utilized within it.

- Based on the sliding window technique and the categorical clustering representative NIR, the DCD algorithm is proposed. In DCD, the distributions of clustering results are tested to determine whether the concepts drift.

- The CRA algorithm, which analyzes the relationships between clustering results generated by different concepts, is also presented. We may capture the time-evolving trend in the data set by analyzing the evolving clustering results.

Module Description:

Collection of Data (Data Set):
A collection of data extracted from the database that we are going to cluster. The data from the database is time-evolving categorical data (not arranged in a sequential manner).

Initializing the Sliding Window:
The sliding window is used to form subsets of the data set of a specified size, i.e., to collect data from the database and transfer it to the next module.

Formation of Clustering:
The result of the sliding window step is clustered data, containing groups of data subsets with a common relationship.

Cluster Distribution Comparator:
The Drifting Concept Detection (DCD) algorithm detects the difference in cluster distribution between the current data subset and the last clustering result.

Re-Clustering and Outliers:
If the difference in cluster distributions is large enough, the sliding window is considered a concept-drifting window, and re-clustering is performed. A data point that does not belong to any proper cluster is called an outlier.
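The sliding-window step above can be sketched as follows; the window size and the sample records are illustrative choices, not values from the paper:

```python
def sliding_windows(records, window_size):
    """Split the incoming stream of categorical records into fixed-size,
    non-overlapping windows; each window is handed in turn to the
    clustering and DCD modules."""
    for start in range(0, len(records), window_size):
        yield records[start:start + window_size]

stream = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"), ("c", "z")]
windows = list(sliding_windows(stream, 2))  # the last window may be smaller
```

Consecutive, non-overlapping windows are what let the framework drop outdated records: once a window has been tested against the last clustering result, its raw points are no longer needed.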

Literature Survey:

Data Mining:

Data mining is the process of extracting patterns from data. As more data are gathered, with the amount of data doubling every three years,[1] data mining is becoming an increasingly important tool for transforming these data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery.

While data mining can be used to uncover patterns in data samples, it is important to be aware that the use of non-representative samples of data may produce results that are not indicative of the domain. Similarly, data mining will not find patterns that may be present in the domain if those patterns are not present in the sample being "mined". There is a tendency for insufficiently knowledgeable "consumers" of the results to attribute "magical abilities" to data mining, treating the technique as a sort of all-seeing crystal ball. Like any other tool, it only functions in conjunction with the appropriate raw material: in this case, indicative and representative data that the user must first collect. Further, the discovery of a particular pattern in a particular set of data does not necessarily mean that pattern is representative of the whole population from which that data was drawn. Hence, an important part of the process is the verification and validation of patterns on other samples of data.

The term data mining has also been used in a related but negative sense, to mean the deliberate searching for apparent but not necessarily representative patterns in large numbers of data. To avoid confusion with the other sense, the terms data dredging and data snooping are often used. Note, however, that dredging and snooping can be (and sometimes are) used as exploratory tools when developing and clarifying hypotheses.

Humans have been "manually" extracting patterns from data for centuries, but the increasing volume of data in modern times has called for more automated approaches. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have increased data collection and storage. As data sets have grown in size and complexity, direct hands-on data analysis has increasingly been augmented with indirect, automatic data processing. This has been aided by other discoveries in computer science, such as neural networks, clustering, genetic algorithms (1950s), decision trees (1960s) and support vector machines (1980s). Data mining is the process of applying these methods to data with the intention of uncovering hidden patterns.[2] It has been used for many years by businesses, scientists and governments to sift through volumes of data such as airline passenger trip records, census data and supermarket scanner data to produce market research reports. (Note, however, that reporting is not always considered to be data mining.)

A primary reason for using data mining is to assist in the analysis of collections of observations of behaviour. Such data are vulnerable to collinearity because of unknown interrelations. An unavoidable fact of data mining is that the (sub-)set(s) of data being analysed may not be representative of the whole domain, and therefore may not contain examples of certain critical relationships and behaviours that exist across other parts of the domain. To address this sort of issue, the analysis may be augmented using experiment-based and other approaches, such as choice modelling for human-generated data. In these situations, inherent correlations can be either controlled for, or removed altogether, during the construction of the experimental design.

There have been some efforts to define standards for data mining, for example the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). These are evolving standards; later versions of these standards are under development. Independent of these standardization efforts, freely available open-source software systems like RapidMiner, Weka, KNIME, and the R Project have become an informal standard for defining data-mining processes. Most of these systems are able to import and export models in PMML (Predictive Model Markup Language) which provides a standard way to represent data mining models so that these can be shared between different statistical applications. PMML is an XML-based language developed by the Data Mining Group (DMG)[3], an independent group composed of many data mining companies. PMML version 4.0 was released in June 2009.[3][4][5]

Research and evolution

In addition to industry driven demand for standards and interoperability, professional and academic activity have also made considerable contributions to the evolution and rigour of the methods and models; an article published in a 2008 issue of the International Journal of Information Technology and Decision Making summarises the results of a literature survey which traces and analyses this evolution.[6]

The premier professional body in the field is the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). Since 1989 it has hosted an annual international conference and published its proceedings,[7] and since 1999 it has published a biannual academic journal titled "SIGKDD Explorations".[8] Other computer science conferences on data mining include:

DMIN - International Conference on Data Mining;[9]
DMKD - Research Issues on Data Mining and Knowledge Discovery;
ECML-PKDD - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases;
ICDM - IEEE International Conference on Data Mining;[10]
MLDM - Machine Learning and Data Mining in Pattern Recognition;
SDM - SIAM International Conference on Data Mining.

Knowledge Discovery in Databases (KDD) is the name coined by Gregory Piatetsky-Shapiro in 1989 to describe the process of finding interesting, interpreted, useful and novel data. There are many nuances to this process, but roughly the steps are to preprocess raw data, mine the data, and interpret the results.[11]

Pre-processing

Once the objective for the KDD process is known, a target data set must be assembled. As data mining can only uncover patterns already present in the data, the target dataset must be large enough to contain these patterns while remaining concise enough to be mined in an acceptable timeframe. A common source for data is a datamart or data warehouse.

The target set is then cleaned. Cleaning removes the observations with noise and missing data. The clean data is reduced into feature vectors, one vector per observation. A feature vector is a summarised version of the raw data observation. For example, a black and white image of a face which is 100px by 100px would contain 10,000 bits of raw data. This might be turned into a feature vector by locating the eyes and mouth in the image. Doing so would reduce the data for each vector from 10,000 bits to three codes for the locations, dramatically reducing the size of the dataset to be mined, and hence reducing the processing effort. The feature(s) selected will depend on what the objective(s) is/are; obviously, selecting the "right" feature(s) is fundamental to successful data mining.

The feature vectors are divided into two sets, the "training set" and the "test set". The training set is used to "train" the data mining algorithm(s), while the test set is used to verify the accuracy of any patterns found.

Data mining

Data mining commonly involves four classes of task:[11]

Classification - Arranges the data into predefined groups. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include decision tree learning, nearest neighbour, naive Bayesian classification and neural networks.


Clustering - Like classification, but the groups are not predefined, so the algorithm tries to group similar items together.

Regression - Attempts to find a function which models the data with the least error.

Association rule learning - Searches for relationships between variables. For example a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as "market basket analysis".

Results validation

The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the data mining algorithms are necessarily valid. It is common for data mining algorithms to find patterns in the training set which are not present in the general data set; this is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learnt patterns are applied to this test set and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish spam from legitimate emails would be trained on a training set of sample emails. Once trained, the learnt patterns would be applied to the test set of emails on which it had not been trained; the accuracy of these patterns can then be measured by how many emails they correctly classify. A number of statistical methods may be used to evaluate the algorithm, such as ROC curves.

If the learnt patterns do not meet the desired standards, then it is necessary to reevaluate and change the preprocessing and data mining. If the learnt patterns do meet the desired standards then the final step is to interpret the learnt patterns and turn them into knowledge.
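The train/test procedure described above can be sketched as follows; the toy classifier and the sample emails are illustrative stand-ins, not a real spam filter:

```python
import random

def holdout_accuracy(data, train_fn, predict_fn, test_fraction=0.3, seed=0):
    """Split labelled data into training and test sets, train on one,
    and measure the accuracy of the learnt patterns on the other."""
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    train, test = data[:cut], data[cut:]
    model = train_fn(train)
    correct = sum(predict_fn(model, x) == y for x, y in test)
    return correct / len(test)

# Toy stand-in: "spam" if the message contains the word "offer".
def train_fn(train):
    return "offer"          # the (trivial) learnt pattern

def predict_fn(model, text):
    return "spam" if model in text else "ham"

emails = [("buy this offer now", "spam"), ("great offer inside", "spam"),
          ("meeting at noon", "ham"), ("lunch tomorrow", "ham"),
          ("offer expires soon", "spam"), ("project update", "ham")]
acc = holdout_accuracy(emails, train_fn, predict_fn)
```

The key point is that `train_fn` never sees the held-out test emails, so a pattern that only fits the training set (overfitting) would be exposed by a low test accuracy.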

Notable uses

Games

Since the early 1960s, with the availability of oracles for certain combinatorial games, also called tablebases (e.g., for 3x3 chess with any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex), a new area for data mining has been opened up: the extraction of human-usable strategies from these oracles. Current pattern recognition approaches do not seem to attain the required high level of abstraction in order to be applied successfully. Instead, extensive experimentation with the tablebases, combined with an intensive study of tablebase answers to well-designed problems and with knowledge of prior art, i.e., pre-tablebase knowledge, is used to yield insightful patterns. Berlekamp in dots-and-boxes and John Nunn in chess endgames are notable examples of researchers doing this work, though they were not and are not involved in tablebase generation.

Business

Data mining in customer relationship management applications can contribute significantly to the bottom line. Rather than randomly contacting a prospect or customer through a call center or sending mail, a company can concentrate its efforts on prospects that are predicted to have a high likelihood of responding to an offer. More sophisticated methods may be used to optimise resources across campaigns so that one may predict which channel and which offer an individual is most likely to respond to — across all potential offers. Additionally, sophisticated applications could be used to automate the mailing. Once the results from data mining (potential prospect/customer and channel/offer) are determined, this "sophisticated application" can either automatically send an e-mail or regular mail. Finally, in cases where many people will take an action without an offer, uplift modeling can be used to determine which people will have the greatest increase in responding if given an offer. Data clustering can also be used to automatically discover the segments or groups within a customer data set.

Businesses employing data mining may see a return on investment, but they also recognise that the number of predictive models can quickly become very large. Rather than one model to predict which customers will churn, a business could build a separate model for each region and customer type. Then, instead of sending an offer to all people that are likely to churn, it may only want to send offers to customers that are likely to accept the offer. Finally, it may also want to determine which customers are going to be profitable over a window of time and only send the offers to those that are likely to be profitable. In order to maintain this quantity of models, they need to manage model versions and move to automated data mining.


Data mining can also be helpful to human-resources departments in identifying the characteristics of their most successful employees. Information obtained, such as universities attended by highly successful employees, can help HR focus recruiting efforts accordingly. Additionally, Strategic Enterprise Management applications help a company translate corporate-level goals, such as profit and margin share targets, into operational decisions, such as production plans and workforce levels.[12]

Another example of data mining, often called the market basket analysis, relates to its use in retail sales. If a clothing store records the purchases of customers, a data-mining system could identify those customers who favour silk shirts over cotton ones. Although some explanations of relationships may be difficult, taking advantage of it is easier. The example deals with association rules within transaction-based data. Not all data are transaction based and logical or inexact rules may also be present within a database. In a manufacturing application, an inexact rule may state that 73% of products which have a specific defect or problem will develop a secondary problem within the next six months.

Market basket analysis has also been used to identify the purchase patterns of the Alpha consumer. Alpha consumers are people that play a key role in connecting with the concept behind a product, then adopting that product, and finally validating it for the rest of society. Analyzing the data collected on this type of user has allowed companies to predict future buying trends and forecast supply demands.

Data Mining is a highly effective tool in the catalog marketing industry. Catalogers have a rich history of customer transactions on millions of customers dating back several years. Data mining tools can identify patterns among customers and help identify the most likely customers to respond to upcoming mailing campaigns.

Related to an integrated-circuit production line, an example of data mining is described in the paper "Mining IC Test Data to Optimize VLSI Testing."[13] In this paper, the application of data mining and decision analysis to the problem of die-level functional test is described. Experiments mentioned in the paper demonstrate the ability to mine historical die-test data to create a probabilistic model of patterns of die failure, which is then utilised to decide in real time which die to test next and when to stop testing. This system has been shown, based on experiments with historical test data, to have the potential to improve profits on mature IC products.

Science and engineering

In recent years, data mining has been widely used in areas of science and engineering such as bioinformatics, genetics, medicine, education and electrical power engineering.

In human genetics, an important goal is to understand the mapping relationship between the inter-individual variation in human DNA sequences and variability in disease susceptibility. In lay terms, it is to find out how the changes in an individual's DNA sequence affect the risk of developing common diseases such as cancer. This is very important to help improve the diagnosis, prevention and treatment of such diseases. The data mining technique that is used to perform this task is known as multifactor dimensionality reduction.[14]

In the area of electrical power engineering, data mining techniques have been widely used for condition monitoring of high-voltage electrical equipment. The purpose of condition monitoring is to obtain valuable information on the health status of the equipment's insulation. Data clustering techniques such as the self-organizing map (SOM) have been applied to the vibration monitoring and analysis of transformer on-load tap-changers (OLTCs). Using vibration monitoring, it can be observed that each tap-change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms. Obviously, different tap positions will generate different signals. However, there was considerable variability amongst normal-condition signals for the exact same tap position. SOM has been applied to detect abnormal conditions and to estimate the nature of the abnormalities.

Data mining techniques have also been applied to dissolved gas analysis (DGA) on power transformers. DGA, as a diagnostic for power transformers, has been available for many years. Data mining techniques such as SOM have been applied to analyse the data and to determine trends which are not obvious to standard DGA ratio techniques such as the Duval Triangle.[15]

In educational research, data mining has been used to study the factors leading students to choose to engage in behaviors which reduce their learning[16] and to understand the factors influencing university student retention.[17] A similar example of the social application of data mining is its use in expertise finding systems, whereby descriptors of human expertise are extracted, normalised and classified so as to facilitate the finding of experts, particularly in scientific and technical fields. In this way, data mining can facilitate institutional memory.

Other examples of data mining applications include the analysis of biomedical data facilitated by domain ontologies,[18] mining clinical trial data,[19] and traffic analysis using SOM.[20]

In adverse drug reaction surveillance, the Uppsala Monitoring Centre has, since 1998, used data mining methods to routinely screen for reporting patterns indicative of emerging drug safety issues in the WHO global database of 4.6 million suspected adverse drug reaction incidents[21]. Recently, similar methodology has been developed to mine large collections of electronic health records for temporal patterns associating drug prescriptions to medical diagnoses[22].

Spatial Data mining

Spatial data mining is the application of data mining techniques to spatial data. It follows the same general approach as data mining, with the end objective of finding patterns in geography. So far, data mining and Geographic Information Systems (GIS) have existed as two separate technologies, each with its own methods, traditions and approaches to visualisation and data analysis. In particular, most contemporary GIS have only very basic spatial analysis functionality. The immense explosion in geographically referenced data occasioned by developments in IT, digital mapping, remote sensing, and the global diffusion of GIS emphasises the importance of developing data-driven inductive approaches to geographical analysis and modeling.

Data mining, which is the partially automated search for hidden patterns in large databases, offers great potential benefits for applied GIS-based decision-making. Recently, the task of integrating these two technologies has become critical, especially as various public and private sector organisations possessing huge databases with thematic and geographically referenced data begin to realise the huge potential of the information hidden there. Among those organisations are:

offices requiring analysis or dissemination of geo-referenced statistical data;
public health services searching for explanations of disease clusters;
environmental agencies assessing the impact of changing land-use patterns on climate change;
geo-marketing companies doing customer segmentation based on spatial location.

Challenges

Geospatial data repositories tend to be very large. Moreover, existing GIS datasets are often splintered into feature and attribute components that are conventionally archived in hybrid data management systems. Algorithmic requirements differ substantially for relational (attribute) data management and for topological (feature) data management.[23] Related to this is the range and diversity of geographic data formats, which also presents unique challenges. The digital geographic data revolution is creating new types of data formats beyond the traditional "vector" and "raster" formats. Geographic data repositories increasingly include ill-structured data such as imagery and geo-referenced multi-media.[24]

There are several critical research challenges in geographic knowledge discovery and data mining. Miller and Han [25] offer the following list of emerging research topics in the field:

Developing and supporting geographic data warehouses - Spatial properties are often reduced to simple aspatial attributes in mainstream data warehouses. Creating an integrated GDW requires solving issues in spatial and temporal data interoperability, including differences in semantics, referencing systems, geometry, accuracy and position.

Better spatio-temporal representations in geographic knowledge discovery - Current geographic knowledge discovery (GKD) techniques generally use very simple representations of geographic objects and spatial relationships. Geographic data mining techniques should recognise more complex geographic objects (lines and polygons) and relationships (non-Euclidean distances, direction, connectivity and interaction through attributed geographic space such as terrain). Time needs to be more fully integrated into these geographic representations and relationships.

Geographic knowledge discovery using diverse data types - GKD techniques should be developed that can handle diverse data types beyond the traditional raster and vector models, including imagery and geo-referenced multimedia, as well as dynamic data types (video streams, animation).

Surveillance

Previous U.S. government programs that applied data mining to stopping terrorists include the Total Information Awareness (TIA) program, Secure Flight (formerly known as the Computer-Assisted Passenger Prescreening System, CAPPS II), Analysis, Dissemination, Visualization, Insight, Semantic Enhancement (ADVISE [26]), and the Multistate Anti-Terrorism Information Exchange (MATRIX).[27] These programs were discontinued due to controversy over whether they violate the US Constitution's 4th Amendment, although many programs that were formed under them continue to be funded by different organisations, or under different names.[28]

Two plausible data mining techniques in the context of combating terrorism include "pattern mining" and "subject-based data mining".

Pattern mining

"Pattern mining" is a data mining technique that involves finding existing patterns in data. In this context, patterns often mean association rules. The original motivation for searching for association rules came from the desire to analyze supermarket transaction data, that is, to examine customer behaviour in terms of the purchased products. For example, an association rule "beer => crisps (80%)" states that four out of five customers who bought beer also bought crisps.
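The 80% in the example above is the rule's confidence: the fraction of transactions containing the antecedent that also contain the consequent. A minimal sketch of that computation follows; the class and method names are illustrative, not from any cited system.

```java
import java.util.List;
import java.util.Set;

// Toy illustration of association-rule confidence:
// confidence(A => B) = support(A and B) / support(A).
public class RuleConfidence {
    public static double confidence(List<Set<String>> transactions,
                                    String antecedent, String consequent) {
        // Transactions containing both items.
        long both = transactions.stream()
                .filter(t -> t.contains(antecedent) && t.contains(consequent))
                .count();
        // Transactions containing the antecedent at all.
        long ant = transactions.stream()
                .filter(t -> t.contains(antecedent))
                .count();
        return ant == 0 ? 0.0 : (double) both / ant;
    }
}
```

With four of five beer-buying transactions also containing crisps, the method returns 0.8, matching the "beer => crisps (80%)" rule above.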

In the context of pattern mining as a tool to identify terrorist activity, the National Research Council provides the following definition: "Pattern-based data mining looks for patterns (including anomalous data patterns) that might be associated with terrorist activity — these patterns might be regarded as small signals in a large ocean of noise."[29][30][31] Pattern mining includes new areas such as Music Information Retrieval (MIR), where patterns seen in both the temporal and non-temporal domains are imported to classical knowledge discovery search techniques.

Subject-based data mining


"Subject-based data mining" is a data mining technique involving the search for associations between individuals in data. In the context of combatting terrorism, the National Research Council provides the following definition: "Subject-based data mining uses an initiating individual or other datum that is considered, based on other information, to be of high interest, and the goal is to determine what other persons or financial transactions or movements, etc., are related to that initiating datum."[30]

Privacy concerns and ethics

Some people believe that data mining itself is ethically neutral.[32] However, the ways in which data mining can be used can raise questions regarding privacy, legality, and ethics.[33] In particular, data mining government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.[34][35]

Data mining requires data preparation, which can uncover information or patterns that may compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation. Data aggregation is when the data are accrued, possibly from various sources, and put together so that they can be analyzed.[36] This is not data mining per se, but a result of the preparation of data before and for the purposes of the analysis. The threat to an individual's privacy comes into play when the data, once compiled, enable the data miner, or anyone who has access to the newly compiled data set, to identify specific individuals, especially when the data were originally anonymous.

It is recommended that an individual is made aware of the following before data are collected:

the purpose of the data collection and any data mining projects, how the data will be used, who will be able to mine the data and use them, the security surrounding access to the data, and in addition, how collected data can be updated.[36]

Privacy concerns have also been somewhat addressed by Congress via the passage of regulatory controls such as HIPAA. The Health Insurance Portability and Accountability Act (HIPAA) requires individuals to be given "informed consent" regarding any information that they provide and its intended future uses by the facility receiving that information. According to an article in Biotech Business Week, "In practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena, says the AAHC. More importantly, the rule's goal of protection through informed consent is undermined by the complexity of consent forms that are required of patients and participants, which approach a level of incomprehensibility to average individuals." [40] This underscores the necessity for data anonymity in data aggregation practices.

One may additionally modify the data so that they are anonymous, so that individuals may not be readily identified.[36] However, even de-identified data sets can contain enough information to identify individuals, as occurred when journalists were able to find several individuals based on a set of search histories that were inadvertently released by AOL. [37]

Clustering

A computer cluster is a group of linked computers, working together closely so that in many respects they form a single computer. The components of a cluster are commonly, but not always, connected to each other through fast local area networks. Clusters are usually deployed to improve performance and/or availability over that of a single computer, while typically being much more cost-effective than single computers of comparable speed or availability.[1]

Cluster categorizations

High-availability (HA) clusters

High-availability clusters (also known as Failover Clusters) are implemented primarily for the purpose of improving the availability of services which the cluster provides. They operate by having redundant nodes, which are then used to provide service when system components fail. The most common size for an HA cluster is two nodes, which is the minimum requirement to provide redundancy. HA cluster implementations attempt to use redundancy of cluster components to eliminate single points of failure.


There are many commercial implementations of High-Availability clusters for many operating systems. The Linux-HA project is one commonly used free software HA package for the Linux operating system.

Load-balancing clusters

Load balancing is when multiple computers are linked together to share computational workload or function as a single virtual computer. Logically, from the user's side, they are multiple machines, but they function as a single virtual machine. Requests initiated by the user are managed by, and distributed among, all the standalone computers that form the cluster. This results in balanced computational work among the different machines, improving the performance of the cluster system.
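The front end of a load-balancing cluster needs a policy for choosing which machine handles each request; round-robin is the simplest such policy. The sketch below is a hedged illustration, with all class and node names invented, not a description of any particular load balancer.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Minimal round-robin request distribution: each incoming request is handed
// to the next node in turn, so work spreads evenly across the machines that
// together form the "single virtual computer".
public class RoundRobinBalancer {
    private final List<String> nodes;
    private final AtomicLong counter = new AtomicLong();

    public RoundRobinBalancer(List<String> nodes) {
        this.nodes = nodes;
    }

    // Returns the node that should serve the next request.
    public String nextNode() {
        int i = (int) (counter.getAndIncrement() % nodes.size());
        return nodes.get(i);
    }
}
```

Real balancers refine this with health checks and weighting, but the cycling behaviour is the core idea.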

Compute clusters

Often clusters are used primarily for computational purposes, rather than handling IO-oriented operations such as web service or databases. For instance, a cluster might support computational simulations of weather or vehicle crashes. The primary distinction within compute clusters is how tightly coupled the individual nodes are. For instance, a single compute job may require frequent communication among nodes - this implies that the cluster shares a dedicated network, is densely located, and probably has homogeneous nodes. This cluster design is usually referred to as a Beowulf cluster. The other extreme is where a compute job uses one or few nodes and needs little or no inter-node communication. This latter category is sometimes called "grid" computing. Tightly coupled compute clusters are designed for work that might traditionally have been called "supercomputing". Middleware such as MPI (Message Passing Interface) or PVM (Parallel Virtual Machine) permits compute clustering programs to be portable to a wide variety of clusters.

Grid computing

Grids are usually computer clusters, but more focused on throughput like a computing utility rather than running fewer, tightly-coupled jobs. Often, grids will incorporate heterogeneous collections of computers, possibly distributed geographically, sometimes administered by unrelated organizations.


Grid computing is optimized for workloads which consist of many independent jobs or packets of work, which do not have to share data between the jobs during the computation process. Grids serve to manage the allocation of jobs to computers which will perform the work independently of the rest of the grid cluster. Resources such as storage may be shared by all the nodes, but intermediate results of one job do not affect other jobs in progress on other nodes of the grid.

An example of a very large grid is the Folding@home project. It analyzes data that is used by researchers to find cures for diseases such as Alzheimer's and cancer. Another large project is the SETI@home project, which may be the largest distributed grid in existence. It uses approximately three million home computers all over the world to analyze data from the Arecibo Observatory radiotelescope, searching for evidence of extraterrestrial intelligence. In both of these cases, there is no inter-node communication or shared storage. Individual nodes connect to a main, central location to retrieve a small processing job. They then perform the computation and return the result to the central server. In the case of the @home projects, the software is generally run when the computer is otherwise idle. UC Berkeley has developed BOINC, an open-source application that allows individual users to contribute to the above and other projects, such as LHC@home (Large Hadron Collider), from a single manager, which can then be set to allocate a percentage of idle time to each of the projects a node is signed up for. The software and a project list are available from the BOINC site.

The grid setup means that nodes can take however many jobs they are able to process in one session, return the results, and acquire new jobs from a central project server.
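The pull-based workflow described above can be modelled as a central job queue that nodes poll independently. The following toy sketch uses invented names and is not BOINC's actual protocol; it only shows the fetch-compute-report loop with no inter-node communication.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model of an @home-style grid server: nodes pull independent jobs,
// compute them locally, and report results back. No job depends on another.
public class GridServer {
    private final Queue<Integer> jobs = new ArrayDeque<>();
    private final List<Integer> results = new ArrayList<>();

    public void submit(int job) { jobs.add(job); }

    // A node asks for work; null means the queue is empty.
    public Integer fetchJob() { return jobs.poll(); }

    // A node returns its computed answer.
    public void report(int result) { results.add(result); }

    public List<Integer> results() { return results; }
}
```

A simulated node would loop: fetch a job, perform the computation during idle time, and report the result, exactly the cycle the @home projects use.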

Implementations

The TOP500 organization's semiannual list of the 500 fastest computers usually includes many clusters. TOP500 is a collaboration between the University of Mannheim, the University of Tennessee, and the National Energy Research Scientific Computing Center at Lawrence Berkeley National Laboratory. As of June 18, 2008, the top supercomputer is the Department of Energy's IBM Roadrunner system with performance of 1026 TFlops measured with High-Performance LINPACK benchmark.


Clustering can provide significant performance benefits versus price. The System X supercomputer at Virginia Tech, the 28th most powerful supercomputer on Earth as of June 2006[2], is a 12.25 TFlops computer cluster of 1100 Apple XServe G5 2.3 GHz dual-processor machines (4 GB RAM, 80 GB SATA HD) running Mac OS X and using InfiniBand interconnect. The cluster initially consisted of Power Mac G5s; the rack-mountable XServes are denser than desktop Macs, reducing the aggregate size of the cluster. The total cost of the previous Power Mac system was $5.2 million, a tenth of the cost of slower mainframe computer supercomputers. (The Power Mac G5s were sold off.)

The central concept of a Beowulf cluster is the use of commercial off-the-shelf (COTS) computers to produce a cost-effective alternative to a traditional supercomputer. One project that took this to an extreme was the Stone Souper computer.

However, it is worth noting that FLOPS (floating-point operations per second) is not always the best metric for supercomputer speed. Clusters can achieve very high FLOPS, but the nodes cannot all access the cluster's entire data set at once. This makes clusters excellent for parallel computation, but much poorer than traditional supercomputers at non-parallel computation.

JavaSpaces is a specification from Sun Microsystems that enables clustering computers via a distributed shared memory.

Consumer game consoles

Due to the increasing computing power of each generation of game consoles, a novel use has emerged where they are repurposed into HPC clusters. Some examples of game console clusters are Sony PlayStation clusters and Microsoft Xbox clusters. It has been suggested that countries which are restricted from buying supercomputing technologies may be obtaining game systems to build computer clusters for military use.

Existing system:


We propose a generalized clustering framework that utilizes existing clustering algorithms and detects whether or not there is a drifting concept in the incoming data.

Existing system advantage:

We use the NIR algorithm, which has two advantages:

1. A node is important in a cluster when the frequency of the node is high in that cluster.

2. A node is important in a cluster if the node appears prevalently in that cluster rather than in other clusters.

Proposed system:

We test the accuracy of DCD on both synthetic and real data sets. First, we test the accuracy of the drifting concepts that are detected by DCD. Then, in order to evaluate the results of the clustering algorithms, we adopt the following two widely used methods.

In order to detect the drifting concepts at different sliding windows, we propose the algorithm DCD to compare the cluster distributions between the last clustering result and the temporal current clustering result.

In order to observe the relationship between different clustering results, we propose the algorithm CRA to analyze and show the changes between different clustering results.

Proposed system advantage:


We propose a framework to perform clustering on categorical time-evolving data. The framework detects the drifting concepts at different sliding windows, generates the clustering results based on the current concept, and also shows the relationship between clustering results by visualization.

Project Description

The problem of clustering categorical time-evolving data is formulated as follows: Suppose that a series of categorical data points D is given, where each data point is a vector of q attribute values, i.e., p_j = (p_j^1, p_j^2, ..., p_j^q). Let A = {A_1, A_2, ..., A_q}, where A_a is the a-th categorical attribute, 1 ≤ a ≤ q. In addition, suppose that the window size N is also given. The data set D is separated into several continuous subsets S^t, where the number of data points in each S^t is N. The superscript number t is the identification number of the sliding window, and t is also called the time stamp in this paper. For example, the first N data points in D are located in the first subset S^1. Based on the foregoing, the objective of the framework is to perform clustering on the data set D, consider the drifting concepts between S^t and S^(t+1), and also analyze the relationship between different clustering results.

For ease of presentation, several notations are defined as follows: In our framework, several clustering results at different time stamps will be reported. Each clustering result C^[t1,t2] is formed by one stable concept that persists for a period of time, i.e., the sliding windows from t1 to t2. The clustering result C^[t1,t2] contains k^[t1,t2] clusters, i.e., C^[t1,t2] = {c_1^[t1,t2], c_2^[t1,t2], ..., c_(k^[t1,t2])^[t1,t2]}, where c_i^[t1,t2], 1 ≤ i ≤ k^[t1,t2], is the i-th cluster in C^[t1,t2]. If t1 = t2 = t, we simplify the superscript to t. For example, the first clustering result, obtained from the initial clustering step, is C^1. Moreover, if we do not point out a specific time stamp, the superscript is omitted for ease of presentation. In addition, when the DCD algorithm is performed, a temporal clustering result, which is utilized to detect the drifting concept at each sliding window, is obtained. The notation C'^t is used to represent the temporal clustering result at time stamp t. Fig. 2 shows an example of a data set D with 15 data points, three attributes, and sliding window size N = 5. The initial clustering is performed on the first sliding window S^1, and the clustering result C^1, which contains two clusters, c_1^1 and c_2^1, is obtained. All of the symbols utilized in this paper are summarized in Table.
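The partitioning of D into consecutive sliding windows S^1, S^2, ... of N data points each can be sketched as a small helper. The class and method names below are assumptions for illustration, not the paper's code.

```java
import java.util.ArrayList;
import java.util.List;

// Splits a data set into consecutive sliding windows of size n; a final
// partial window is kept if the data size is not a multiple of n.
public class WindowSplitter {
    public static <T> List<List<T>> split(List<T> d, int n) {
        List<List<T>> windows = new ArrayList<>();
        for (int start = 0; start < d.size(); start += n) {
            int end = Math.min(start + n, d.size());
            windows.add(new ArrayList<>(d.subList(start, end)));
        }
        return windows;
    }
}
```

Applied to the paper's running example of 15 data points with N = 5, this yields the three windows S^1, S^2, S^3 of five points each.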

Algorithm Used:


DRIFTING CONCEPT DETECTION

In this section, we introduce our DCD algorithm. The objective of the DCD algorithm is to detect the difference of cluster distributions between the current data subset S^t and the last clustering result C^[te, t-1] and to decide whether reclustering is required in S^t. Therefore, the incoming categorical data points in S^t should be allocated efficiently into the corresponding proper cluster of the last clustering result. We name this allocating process "data labeling" [7]. In this paper, we modify our previous work on the labeling process in order to detect outliers in S^t. A data point that does not belong to any proper cluster is called an outlier. After labeling, the last clustering result C^[te, t-1] and the current temporal clustering result C'^t obtained by data labeling are compared with each other. If the difference of cluster distributions is large enough, the sliding window t is considered a concept-drifting window, and reclustering is performed on S^t. The flowchart of the DCD algorithm is shown in Figure.


Data Labeling and Outlier Detection

The goal of data labeling is to decide the most appropriate cluster label for each incoming data point. Specifically, suppose that a data point p_j is given. The similarity S(c_i, p_j) between p_j and cluster c_i, 1 ≤ i ≤ k, is measured, and the cluster that obtains max(S(c_i, p_j)) is considered the most appropriate cluster. In this paper, the clusters are represented by an effective clustering representative, named NIR.
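The max-similarity labeling step can be illustrated with a simplified stand-in similarity: here each cluster representative is approximated as a set of frequent attribute values, and similarity is the fraction of the point's values the representative contains. This is an assumption for illustration only; the paper's NIR-based measure weights attribute-value combinations by importance and is more elaborate.

```java
import java.util.List;
import java.util.Set;

// Simplified data labeling: assign a point to the cluster whose
// representative shares the most attribute values with it; points whose best
// similarity falls below a threshold are treated as outliers (label -1).
public class DataLabeler {
    public static int label(List<Set<String>> representatives,
                            List<String> point, double outlierThreshold) {
        int best = -1;
        double bestSim = -1.0;
        for (int i = 0; i < representatives.size(); i++) {
            long hits = point.stream()
                    .filter(representatives.get(i)::contains)
                    .count();
            double sim = (double) hits / point.size();
            if (sim > bestSim) {
                bestSim = sim;
                best = i;
            }
        }
        return bestSim >= outlierThreshold ? best : -1;
    }
}
```

The outlier case (-1) corresponds to the modified labeling step described above, where points that fit no cluster are flagged for the drift test.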

Cluster Distributions Comparison

In the Cluster Distribution Comparison step, the last clustering result and the current temporal clustering result obtained by data labeling are compared with each other to detect the drifting concept. The clustering results are said to be different according to the following two criteria:

1. The clustering results are different if quite a large number of outliers are found by data labeling.

2. The clustering results are different if quite a large number of clusters vary in their ratio of data points.

Since the idea of data labeling is to apply the original clustering characteristics to the incoming data points [7], the outliers that cannot be allocated to any cluster may be generated based on different concepts. Therefore, if too many outliers are detected by the data labeling step, a drifting concept may be occurring in the current sliding window. As a result, a threshold named the outlier threshold is set in this step. If the ratio of outliers in the current sliding window is larger than the outlier threshold, the clustering results are said to be different, and the concept is said to drift.

Moreover, another type of drifting concept is also detected in this step. The ratio of data points in a cluster may be changed dramatically by a drifting concept, e.g., a cluster that contains half of the data points in the last clustering result may suddenly disappear in the current clustering result. In order to detect this change, we adopt a double-threshold method. One threshold, named the cluster variation threshold, is utilized to determine whether the variation of the ratio of data points in a cluster is big enough; a cluster that exceeds the cluster variation threshold is seen as a different cluster. Then, the number of different clusters is counted, and the ratio of different clusters is compared with the other threshold, named the cluster difference threshold. If the ratio of different clusters is larger than the cluster difference threshold, the concept is said to drift in the current sliding window.
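The double-threshold test described above can be sketched as follows. The method and parameter names are illustrative (the paper denotes the thresholds by symbols not preserved in this copy), and the cluster ratios are assumed to be aligned by cluster index.

```java
// Drift test: concept drift is declared when the outlier ratio exceeds the
// outlier threshold, or when the fraction of clusters whose share of data
// points changed by more than the variation threshold exceeds the cluster
// difference threshold.
public class DriftCheck {
    public static boolean conceptDrifted(double outlierRatio,
                                         double outlierThreshold,
                                         double[] lastRatios,
                                         double[] currRatios,
                                         double variationThreshold,
                                         double differenceThreshold) {
        // Criterion 1: too many outliers in the current sliding window.
        if (outlierRatio > outlierThreshold) {
            return true;
        }
        // Criterion 2: count clusters whose ratio of data points varied
        // by more than the cluster variation threshold.
        int changed = 0;
        for (int i = 0; i < lastRatios.length; i++) {
            if (Math.abs(currRatios[i] - lastRatios[i]) > variationThreshold) {
                changed++;
            }
        }
        return (double) changed / lastRatios.length > differenceThreshold;
    }
}
```

For example, if a cluster holding half the data points vanishes between windows, criterion 2 fires and reclustering is triggered even when few outliers appear.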


System Architecture:


System Design

Use case Diagram:

Sliding Window

Data Source

Cluster Representatives Analyser

Generating Cluster Representatives

Cluster Representative

Drifting Data


Class Diagram

Cluster Analyser

clusterName, clusterState

analyseData(), getResult(), getUpdate()

Cluster Info

clusterName, clusterType, clusterSize

getClusterInfo(), getClusterData(), updateClusterData(), insertDriftData()

Sliding Windows

windowSize, windowState

updateWindow(), getWindowState(), getDriftingData(), insertDriftingData()

Data Store

dataType, dataSize

updateData(), insertData(), formCluster()


Sequence diagram

Datapoints, Sliding Window, Cluster Representatives, Cluster, Cluster Analyser

Get the data

Send the recent data

Form the new cluster

Analyse the cluster

Append to the appropriate cluster


Collaboration Diagram

Sliding Window

Datapoints

Cluster Representatives

Cluster

Cluster Analyser

3: Append with appropriate cluster

2: Send the recent data

1: Get the data

4: Form the new cluster

5: Analyse the cluster


State Diagram:

Received new data

Updated the Sliding Window

Updated the Cluster Representatives

Cluster Analysed with Updated Data


Activity Diagram:

Receiving the new data

Updating the Sliding Window

Updating the cluster Representatives

Updating the Clusters

Analysing the Cluster with new datas


Dataflow Diagram:

Data Points

Drifting Data

Sliding Window

Generating Cluster Representatives

Updating Cluster Representatives

Re-Clustering

Cluster Relationship Analysis


Login Page:


Admin Page:


Node Search page:


Forming Cluster according to node:


Link to get Sliding window:


Sliding Window Transacted output module:


Formation of Cluster:


Updating the data


Successful transacted page:


<%@ page language="java" contentType="text/html; charset=ISO-8859-1"

pageEncoding="ISO-8859-1"%>


<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"

"http://www.w3.org/TR/html4/loose.dtd">

<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

<title>Banking login page</title>

</head>

<body bgcolor="silver">

<form name="loginform" method="get" action="loginvalidator">

<br><br><br><br><br><br><br><br><br><br><br><br>

<center>

<h1>Welcome to Clustering algorithm</h1>

</center>

<table align="center" border="2">

<tr><td>UserName</td><td><input type="text" name="username"></td></tr>

<tr><td>Password</td><td><input type="password" name="password"></td></tr>

</table>

<center><input type="submit" name="Submit" value="Login"></center>

</form>

</body>

</html>


..

package ser;

import java.io.IOException;

import java.io.PrintWriter;

import javax.servlet.RequestDispatcher;

import javax.servlet.ServletException;

import javax.servlet.http.HttpServlet;

import javax.servlet.http.HttpServletRequest;

import javax.servlet.http.HttpServletResponse;

import javax.servlet.http.HttpSession;

import Accessinformation.Accessinformation;

import Helperview.Helperview;

public class loginvalidator extends HttpServlet {

    private static final long serialVersionUID = 1L;

    public loginvalidator() {
        super();
    }

    String username, password;

    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        username = request.getParameter("username");
        password = request.getParameter("password");
        PrintWriter out = response.getWriter();
        Helperview view = new Helperview();
        view.setUsername(username);
        view.setPassword(password);
        HttpSession session = request.getSession();
        Accessinformation dao = new Accessinformation();
        try {
            view = dao.getTransactioninformation(username, password);
            // Check for a missing account before touching the result.
            if (view == null) {
                out.println("please provide information");
            } else {
                System.out.println("From Servlet " + view.getUsername());
                System.out.println("From Servlet " + view.getPassword());
                System.out.println("From Servlet " + view.getRole());
                if ("admin".equalsIgnoreCase(view.getRole())) {
                    System.out.println("Inside admin " + view.getUsername());
                    session.setAttribute("admin", view);
                    // RequestDispatcher dis = request.getRequestDispatcher("/admin.jsp");
                    response.sendRedirect("admin.jsp");
                    System.out.println("After dispatching the request");
                } else {
                    session.setAttribute("accholder", view);
                    // RequestDispatcher dis = request.getRequestDispatcher("/accountholder.jsp");
                    response.sendRedirect("accountholder.jsp");
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}


..

package Helperview;

public class Helperview {

String username, password, role;

String accountnumber, transaction, amount, branch;

// Table - 2

public String getAccountnumber() {

return accountnumber;

}

public void setAccountnumber(String accountnumber) {

this.accountnumber = accountnumber;

}

public String getTransaction() {

return transaction;

}

public void setTransaction(String transaction) {


this.transaction = transaction;

}

public String getAmount() {

return amount;

}

public void setAmount(String amount) {

this.amount = amount;

}

public String getBranch() {

return branch;

}

public void setBranch(String branch) {

this.branch = branch;

}

// Table - 1

public String getRole() {

return role;


}

public void setRole(String role) {

this.role = role;

}

public String getUsername() {

return username;

}

public void setUsername(String username) {

this.username = username;

}

public String getPassword() {

return password;

}

public void setPassword(String password) {

this.password = password;

}

}


..

package AdminDBConnector;

import java.sql.Connection;

import java.sql.DriverManager;

import java.sql.PreparedStatement;

import java.sql.ResultSet;

import java.sql.SQLException;

public class AdminDBConnector {

public static Connection getConnection()

{

Connection connection = null;

try

{

Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");

connection=DriverManager.getConnection("jdbc:odbc:ADMIN");

}


catch(SQLException e)

{

e.printStackTrace();

System.out.println(e);

}

catch(ClassNotFoundException e)

{

e.printStackTrace();

System.out.println(e);

}

return connection;

}

public static void close(Connection conn, PreparedStatement pst, ResultSet rs)

{

try

{

if(pst!=null)

{

pst.close();


}

}

catch(SQLException e)

{

e.printStackTrace();

System.out.println(e);

}

try

{

if(rs!=null)

{

rs.close();

}

}

catch(Exception e)

{

e.printStackTrace();

}

try


{

if(conn!=null && !conn.isClosed())

{

conn.close();

}

}

catch(SQLException e)

{

e.printStackTrace();

System.out.println(e);

}

}

}

..

package DataSource;

import java.sql.*;

import java.util.ArrayList;

import java.util.Collection;

import AdminDBConnector.AdminDBConnector;

import Helperview.Helperview;

public class DataSource {

public static Collection<Helperview> getDataSource(String node)

{

Connection conn =null;

ResultSet rs = null;

PreparedStatement pst = null;

Helperview view =null;

Collection<Helperview> nodeColl = new ArrayList<Helperview>();

try

{

conn = AdminDBConnector.getConnection();


String query = "select * from transaction where Branch=?";

pst=conn.prepareStatement(query);

pst.setString(1, node);

rs=pst.executeQuery();

while(rs.next())

{

System.out.println("AAAAAAA");

view = new Helperview();

view.setAccountnumber(rs.getString(1));

view.setTransaction(rs.getString(2));

view.setAmount(rs.getString(3));

view.setBranch(rs.getString(4));

nodeColl.add(view);

// System.out.println("collection"+nodeColl);

System.out.println("From Data Source:" + view.getAccountnumber());

System.out.println("From Data Source:" + view.getTransaction());

System.out.println("From Data Source:" + view.getAmount());

System.out.println("From Data Source:" + view.getBranch());

}

}

catch(SQLException e)

{

e.printStackTrace();

System.out.println(e);

}

finally

{

AdminDBConnector.close(conn, pst, rs);

}

return nodeColl;


}

}

..

<%@ page language="java" contentType="text/html; charset=ISO-8859-1"

pageEncoding="ISO-8859-1"%>

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"

"http://www.w3.org/TR/html4/loose.dtd">

<%@page import="Helperview.Helperview"%><html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

<title>Welcome Admin Page</title>

</head>

<body bgcolor="silver">

<br><br><br><br><br><br><br><br><br><br><br>

<%Helperview view = (Helperview)session.getAttribute("admin"); %>

<center>

<h1>Welcome <%out.println(view.getUsername()); %> !!!</h1>


<a href="clusterinformation.jsp">Click here to view cluster</a>

</center>

</body>

</html>



..

<%@ page language="java" contentType="text/html; charset=ISO-8859-1"

pageEncoding="ISO-8859-1"%>

<%@ page language="java" import="java.util.*" %>

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"

"http://www.w3.org/TR/html4/loose.dtd">

<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

<title>Transacted information</title>

</head>

<body bgcolor="silver">

<table border="0" bgcolor="silver" align="center">

<tr><td><img src="qw.PNG"/></td></tr>

<tr><td><img src="er1.png"/></td></tr>

</table>


<h2>The recent transacted accounts are listed below:</h2>

<table align="center" border="1">

<%
    ArrayList<String> set = (ArrayList<String>) session.getAttribute("mapValue");
    for (String element : set) {
%>
<tr align="center"><td align="center"><h4 align="center"><%= element %></h4></td></tr>
<%
    }
%>

</table>

</body>

</html>

package transactionservlet;

import java.sql.Connection;

import java.sql.DriverManager;

import java.sql.ResultSet;

import java.sql.Statement;

import java.util.ArrayList;

import java.util.List;

public class Trigger_SlidingWindow

{

public ArrayList<String> getBranch() // Returns the account values stored in triggertable.

{

String url = "jdbc:odbc:ADMIN";


Connection con;

String query = "select * from triggertable";

Statement stmt;

ArrayList<String> columnSet = new ArrayList<String>();

// HashSet<Integer> columnamt = new HashSet<Integer>();

try

{

Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");

con = DriverManager.getConnection(url);

stmt = con.createStatement();

ResultSet rs = stmt.executeQuery(query);

while(rs.next())

{

columnSet.add(rs.getString(1));

// columnamt.add(rs.getInt(3));

}

}

catch (Exception e)

{

System.err.println("Database access failed: " + e);

}

System.out.println("account: "+columnSet);

// System.out.println("amount remaining:"+columnamt);

return columnSet;

// return columnamt;

}

}

….

-- Note: the scalar SELECT @var = col FROM inserted assignments below assume a
-- single-row UPDATE; a multi-row UPDATE would capture only one of the rows.
create trigger transactiondbtrigger on transactiondb after update as begin

declare @accountnumb numeric(10)

declare @amttransaction varchar(20)

declare @amt numeric(10)

declare @branch varchar(20)

declare @state varchar(20)

declare @accounttype varchar(20)

declare @phone varchar(20)

declare @fax varchar(20)

select @accountnumb = accountnumber from inserted

select @amttransaction = amounttransaction from inserted


select @amt = amount from inserted

select @branch = branch from inserted

select @state = state from inserted

select @accounttype = accounttype from inserted

select @phone = phonenumber from inserted

select @fax = faxnumber from inserted

insert into triggertable(account,transactiontype, amount, branch, state, accounttype,

phone, fax) values

(@accountnumb,@amttransaction,@amt,@branch,@state,@accounttype,@phone,@fax)

end

create table triggertable(account numeric(10),transactiontype varchar(20), amount

numeric(10), branch varchar(20), state varchar(20), accounttype varchar(20), phone

varchar(20), fax varchar(20));

select * from triggertable

update transactiondb set amounttransaction='deposit', amount=2000, branch='mumbai',

state='maharastra', accounttype='instant', phonenumber='066-223456',

faxnumber='066-569856' where accountnumber=1000


<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"

"http://www.w3.org/TR/html4/loose.dtd">

<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

<title>Transaction</title>

</head>

<body bgcolor="silver">

<table border="0" align="center">

<tr>

<td><img src="qw.PNG" /></td>

</tr>

<tr>

<td><img src="er1.png" /></td>

</tr>

</table>

<form method="get" action="updation"><br>

<br>

<br>


<br>

<br>

<br>

<br>

<table align="center" border="2">

<tr>

<td>Account number:</td>

<td><input type="text" size="5" name="accountnumber"

align="center"></td>

</tr>

<tr>

<td>Transaction:</td>

<td><select name="transaction">

<option value="withdrawal">Withdrawal</option>

<option value="deposit">Deposit</option>

</select></td>

</tr>

<tr>

<td>Amount:</td>

<td><input type="text" name="amount"></td>

</tr>

<tr>

<td>City:</td>


<td><input type="text" name="city"></td>

</tr>

<tr>

<td>State:</td>

<td><input type="text" name="state"></td>

</tr>

<tr>

<td>Account type:</td>

<td><select name="acctype">

<option value="saving">Saving</option>

<option value="instant">Instant</option>

<option value="current">Current</option>

</select></td>

</tr>

<tr>

<td>Phone number:</td>

<td><input type="text" name="phone"></td>

</tr>

<tr>

<td>Fax number:</td>

<td><input type="text" name="fax"></td>

</tr>

</table>


<center><input type="submit" value="Submit" /></center>

</form>

</body>

</html>

….

import java.sql.Connection;

import java.sql.DriverManager;

import java.sql.PreparedStatement;

import java.sql.ResultSet;

import java.util.ArrayList;

import AdminDBConnector.AdminDBConnector;

public class AllTransactedNode {

public static ArrayList<String> getClusterinformationofall(String fromdate, String todate, String branch, String transtype, String acctype) {

Connection conn = null;

PreparedStatement pst = null;

ResultSet rs = null;

ArrayList<String> arr = new ArrayList<String>();


try {

// conn = AdminDBConnector.getConnection();

Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");

conn = DriverManager.getConnection("jdbc:odbc:ADMIN");

// System.out.println("1");

String query = "select * from transactiondb where dateoftransaction between ? and ? and branch = ? and amounttransaction=? and accounttype=?";

System.out.println("2");

pst = conn.prepareStatement(query);

System.out.println("3");

pst.setString(1, fromdate);

pst.setString(2, todate);

pst.setString(3, branch);

pst.setString(4, transtype);

pst.setString(5, acctype);

rs = pst.executeQuery();

System.out.println("4");


while (rs.next()) {

// Each row contributes its ten columns, followed by an "&" row delimiter.
for (int i = 1; i <= 10; i++) {
    arr.add(rs.getString(i));
}
arr.add("&");

}

}

catch (Exception e) {

e.printStackTrace();

System.out.println(e);

}

finally {

AdminDBConnector.close(conn, pst, rs);

}


System.out.println(arr);

return arr;

}

}

….

import java.sql.Connection;

import java.sql.DriverManager;

import java.sql.PreparedStatement;

import java.sql.ResultSet;

import java.util.ArrayList;

import AdminDBConnector.AdminDBConnector;

public class AllTransaction {

public static ArrayList<String> getClusterinformation(String fromdate, String todate, String branch) {

Connection conn = null;

PreparedStatement pst = null;

ResultSet rs = null;


ArrayList<String> arr = new ArrayList<String>();

try {

// conn = AdminDBConnector.getConnection();

Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");

conn = DriverManager.getConnection("jdbc:odbc:ADMIN");

// System.out.println("1");

String query = "select * from transactiondb where dateoftransaction between ? and ? and branch=?";

System.out.println("2");

pst = conn.prepareStatement(query);

System.out.println("3");

pst.setString(1, fromdate);

pst.setString(2, todate);

pst.setString(3, branch);

rs = pst.executeQuery();

System.out.println("4");

while (rs.next()) {

// Each row contributes its ten columns, followed by an "&" row delimiter.
for (int i = 1; i <= 10; i++) {
    arr.add(rs.getString(i));
}
arr.add("&");

}

}

catch (Exception e) {

e.printStackTrace();

System.out.println(e);

}

finally {

AdminDBConnector.close(conn, pst, rs);

}

System.out.println(arr);


return arr;

}

}
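The two classes above return each transaction as ten column values followed by an "&" delimiter in one flat list. A small helper can regroup that flat list into one list per transaction row; the class and method names below are illustrative and not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: regroups the flat "&"-delimited list produced by
// getClusterinformation / getClusterinformationofall into per-row lists.
public class RowSplitter {

    public static List<List<String>> split(List<String> flat) {
        List<List<String>> rows = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String value : flat) {
            if ("&".equals(value)) {   // "&" marks the end of one transaction row
                rows.add(current);
                current = new ArrayList<>();
            } else {
                current.add(value);
            }
        }
        return rows;
    }
}
```

A page that receives the flat list can then iterate over `split(arr)` and render one table row per inner list instead of tracking the "&" markers by hand.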

Hardware requirements:

Processor      : Pentium IV 2.6 GHz
RAM            : 512 MB DDR RAM
Monitor        : 15" color
Hard disk      : 20 GB
Floppy drive   : 1.44 MB
CD drive       : LG 52x
Keyboard       : Standard 102 keys
Mouse          : 3 buttons


Software requirements:

Front End        : JSP, Servlet
Back End         : SQL Server 2000
Tools Used       : Dreamweaver
Operating System : Windows XP

Conclusion:

Here we proposed a framework for clustering categorical time-evolving data. The framework detects drifting concepts at different sliding windows, generates clustering results based on the current concept, and visualizes the relationships between clustering results.

To detect drifting concepts at different sliding windows, we proposed the algorithm DCD, which compares the cluster distribution of the last clustering result with the temporal cluster distribution of the current sliding window. If the two distributions are quite different, the last clustering result is dumped out and the data in the current sliding window is reclustered. In addition, to observe the relationships between different clustering results, we proposed the algorithm CRA, which analyzes and shows the changes between them.

The experimental evaluation shows that performing DCD is faster than clustering the entire data set once, and that DCD provides high-quality clustering results with correctly detected drifting concepts on both synthetic and real data. These results demonstrate that our framework is practical for detecting drifting concepts in time-evolving categorical data.
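The DCD decision step described above can be sketched in a few lines: compare the cluster distribution of the last clustering result with the temporal distribution of the current sliding window, and trigger reclustering when they differ by more than a threshold. The class name, the use of total variation distance, and the threshold are illustrative assumptions, not the paper's exact formulation.

```java
import java.util.List;

// Hypothetical sketch of the DCD drift test: each list holds the fraction
// of points assigned to each cluster (same cluster order in both lists).
public class DriftCheck {

    public static boolean conceptDrifted(List<Double> lastDist,
                                         List<Double> currentDist,
                                         double threshold) {
        double diff = 0.0;
        for (int i = 0; i < lastDist.size(); i++) {
            diff += Math.abs(lastDist.get(i) - currentDist.get(i));
        }
        // Total variation distance: half the sum of absolute differences.
        // A large value suggests the concept has drifted, so the current
        // sliding window should be reclustered.
        return diff / 2.0 > threshold;
    }
}
```

With a threshold of 0.2, a window whose cluster proportions shift from (0.5, 0.5) to (0.9, 0.1) would be flagged as drifted, while a shift to (0.55, 0.45) would not.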
