Data Mining over Hidden Data Sources
Tantan Liu
Advisor: Gagan Agrawal
Dept. of Computer Science & Engineering
Ohio State University
July 23, 2012
Outline
• Introduction
  – Deep Web
  – Data Mining on the Deep Web
• Contributions
  – Stratified K-means Clustering over a Deep Web Data Source (SIGKDD, 2012)
  – Two-phase Sampling Based Outlier Detection over a Deep Web Data Source (submitted to ICDM, 2012)
  – Stratification Based Hierarchical Clustering on a Deep Web Data Source (SDM, 2012)
  – An Active Learning Based Frequent Itemset Mining (ICDE, 2011)
  – Differential Rule Mining (ICDM Workshops, 2010)
  – Stratified Sampling for Deep Web Mining (ICDM, 2010)
• Conclusion and Future Work
Deep Web
• Data sources hidden from standard search engines
  – Online query interface vs. database
  – The database is accessible only through the online interface
  – Input attributes vs. output attributes
• An example of the Deep Web
Data Mining over the Deep Web
• High-level summaries of the data
  – Scenario 1: a user wants to relocate to a county
    • Summary of the residences in the county: age, price, square footage
    • The county property assessor's web site only allows simple queries
  – Scenario 2: a user is thinking about his or her career path
    • High-level knowledge about the job posts in the market: job type, salary, education, experience, skills, ...
    • Job web sites, e.g., LinkedIn and MSN Careers, provide millions of job posts
Challenges
• Databases cannot be accessed directly
  – Sampling methods for deep web mining
• Obtaining data is time consuming
  – Efficient sampling methods
  – High accuracy with low sampling cost
Contributions
• Stratified K-means Clustering over a Deep Web Data Source (SIGKDD, 2012)
• Two-phase Sampling Based Outlier Detection over a Deep Web Data Source (submitted to ICDM, 2012)
• Stratification Based Hierarchical Clustering on a Deep Web Data Source (SDM, 2012)
• An Active Learning Based Frequent Itemset Mining (ICDE, 2011)
• Differential Rule Mining (ICDM Workshops, 2010)
• Stratified Sampling for Deep Web Mining (ICDM, 2010)
Roadmap
• Introduction
  – Deep Web
  – Data Mining on the Deep Web
• Contributions
  – Stratified K-means Clustering over a Deep Web Data Source
  – Two-phase Sampling Based Outlier Detection over a Deep Web Data Source
  – Stratification Based Hierarchical Clustering on a Deep Web Data Source
  – An Active Learning Based Frequent Itemset Mining
  – Differential Rule Mining
  – Stratified Sampling for Deep Web Mining
• Conclusion and Future Work
An Example of Deep Web for Real-Estate
K-means Clustering over a Deep Web Data Source
• Goal: estimate k centers for the underlying clusters, so that the k centers estimated from the sample are close to the k true centers in the whole population.
Overview of Method
[Diagram: the population is divided into sub-populations (strata) 1 through n; a sample is drawn from each; the combined sample is fed to stratification based k-means clustering to produce the clusters. Key components: stratification and sample allocation.]
Stratification on the deep web
• Partitioning the entire population into strata
  – Stratification is performed on the query space of the input attributes
  – Goal: homogeneous query subspaces
  – Radius of a query subspace: measures how widely its data records are spread in the output space
  – Rule: choose the input attribute that most decreases the radius of a node
  – For an input attribute, the decrease of radius is the node's radius minus the weighted average radius of the children created by splitting on that attribute (see the sketch after this list)
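A minimal sketch of this greedy choice, assuming the radius of a subspace is the average distance of its pilot-sample records to their centroid in the output space (that definition, and all names below, are illustrative rather than the paper's):

```python
import numpy as np

def radius(points):
    """Assumed radius: mean Euclidean distance of a subspace's
    output-attribute vectors to their centroid."""
    center = points.mean(axis=0)
    return np.linalg.norm(points - center, axis=1).mean()

def radius_decrease(pilot, attr_values):
    """Decrease of radius when splitting a node on one input attribute:
    the parent's radius minus the size-weighted radius of the children."""
    r_children = 0.0
    for v in np.unique(attr_values):
        child = pilot[attr_values == v]
        r_children += (len(child) / len(pilot)) * radius(child)
    return radius(pilot) - r_children

def best_split_attribute(pilot, input_attrs):
    """Pick the input attribute whose split most reduces the radius;
    input_attrs maps attribute name -> per-record values."""
    return max(input_attrs, key=lambda a: radius_decrease(pilot, input_attrs[a]))
```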
[Figure: stratification tree on the input attributes; the root node (Year of construction, with NULL for missing values) splits into Y=1980, Y=1990, Y=2000, ..., Y=2008, and those nodes split further on Bedroom (B=3, B=4).]
Partition on Space of Output Attributes
[Figure: sampled data records plotted in the space of the output attributes (Price vs. Square Feet), labeled by year of construction (1980, 1990, 2000, 2008) and partitioned into subspaces.]
Sampling Allocation Methods
• We create c·k partitions and c·k subspaces
  – A pilot sample is drawn first
  – Running k-means clustering with c·k clusters on the pilot sample generates the c·k partitions
• Representative sampling
  – Good estimation of the statistics of the c·k subspaces
    • Centers
    • Proportions
Representative Sampling-Centers
• Center of a subspace
  – The mean vector of all data points belonging to the subspace
• Let the sample be S = {DR_1, DR_2, ..., DR_n}
  – For the i-th subspace, the estimated center is the mean of the output-attribute vectors of the sampled records that fall in it:

    c_i = (1 / m_i) · Σ_{j=1}^{m_i} O(DR_{i,j})

  where O(DR_{i,j}) is the output-attribute vector of the j-th sampled record in subspace i, and m_i is the number of such records.
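A direct numpy sketch of this estimator (all names illustrative):

```python
import numpy as np

def subspace_centers(sample_outputs, subspace_ids, num_subspaces):
    """Estimate each subspace's center as the mean of the
    output-attribute vectors of the sampled records assigned to it."""
    centers = np.zeros((num_subspaces, sample_outputs.shape[1]))
    for i in range(num_subspaces):
        members = sample_outputs[subspace_ids == i]
        if len(members) > 0:
            centers[i] = members.mean(axis=0)  # c_i = (1/m_i) * sum O(DR_ij)
    return centers
```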
Distance Function
• For the c·k estimated centers ĉ_1, ..., ĉ_{c·k} and the true centers c_1, ..., c_{c·k}
• Using Euclidean distance, the distance function is Φ = Σ_i ||ĉ_i − c_i||²
  – Its expectation is an integrated variance
    • Expressed in terms of subspaces, strata, and output attributes
    • Computed based on the pilot sample
  – Writing V_j for the variance contributed by the j-th stratum, E[Φ] can be expressed as Σ_j V_j / n_j, where n_j is the number of samples drawn from the j-th stratum
Optimized Sample Allocation
• Goal: choose the n_j to minimize the integrated variance Σ_j V_j / n_j, subject to a total budget Σ_j n_j = n
• Using Lagrange multipliers, the optimal allocation is proportional to the square root of each stratum's variance: n_j = n · √V_j / Σ_l √V_l
• More samples are drawn from strata with large variance
  – There the data are spread over a wide area, and more samples are needed to represent the population (see the sketch below)
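A minimal implementation of this closed-form allocation (illustrative; the V_j would come from the pilot sample):

```python
import numpy as np

def optimized_allocation(stratum_variances, budget):
    """Allocate the sample budget across strata proportionally to the
    square root of each stratum's variance:
    n_j = n * sqrt(V_j) / sum_l sqrt(V_l)."""
    w = np.sqrt(np.asarray(stratum_variances, dtype=float))
    alloc = budget * w / w.sum()
    # Round down, then hand leftover samples to the largest remainders.
    n = np.floor(alloc).astype(int)
    for i in np.argsort(alloc - n)[::-1][: budget - n.sum()]:
        n[i] += 1
    return n

# Example: variances 4.0, 1.0, 0.25 with a budget of 70 -> [40, 20, 10]
print(optimized_allocation([4.0, 1.0, 0.25], 70))
```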
Active Learning Based Sampling Method
• In machine learning
  – Passive learning: data are chosen randomly
  – Active learning: particular data are selected to help build a better model, which matters when obtaining data is costly and/or time-consuming
• When one more sample is drawn from stratum i, the estimated decrease of the distance function is Δ_i = V_i / n_i − V_i / (n_i + 1)
• Iterative sampling process (sketched below)
  – At each iteration, the stratum with the largest decrease of the distance function is selected for sampling
  – The integrated variance is updated
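A sketch of this loop, assuming a hypothetical `draw_from_stratum(i)` query primitive and one output attribute per record:

```python
import numpy as np

def active_learning_sampling(V, n_init, budget, draw_from_stratum):
    """Iteratively sample the stratum whose next sample most decreases
    the estimated distance function E[Phi] = sum_j V_j / n_j."""
    V = np.asarray(V, dtype=float)
    n = np.asarray(n_init, dtype=int)
    samples = [[] for _ in V]
    for _ in range(budget):
        delta = V / n - V / (n + 1)      # estimated decrease per stratum
        i = int(np.argmax(delta))
        samples[i].append(draw_from_stratum(i))
        n[i] += 1
        if len(samples[i]) > 1:
            # Refresh the variance estimate (simplified here: from the
            # newly drawn samples only, not the pilot sample).
            V[i] = np.var(samples[i], ddof=1)
    return samples, n
```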
Representative Sampling-Proportion
• Proportion of a subspace
  – The fraction of data records belonging to the subspace
  – Depends on the proportion of the subspace within each stratum: if p_{ij} is the proportion of the i-th subspace in the j-th stratum, the overall proportion is the stratum-weighted sum of the p_{ij} (see the sketch below)
• Risk function
  – The distance between the estimated fractions and their true values
• Iterative sampling process
  – At each iteration, the stratum with the largest decrease of the risk function is chosen for sampling
  – The parameters are updated
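A sketch of the stratified proportion estimate, assuming the usual stratum weights W_j = N_j / N (my assumption):

```python
import numpy as np

def subspace_proportion(per_stratum_fractions, stratum_sizes):
    """Stratified estimate of a subspace's overall proportion:
    P_hat = sum_j (N_j / N) * p_ij."""
    p = np.asarray(per_stratum_fractions, dtype=float)
    N = np.asarray(stratum_sizes, dtype=float)
    return float(np.sum(N / N.sum() * p))

# Two strata of sizes 6000 and 2000 with subspace fractions 0.30 and 0.10:
print(subspace_proportion([0.30, 0.10], [6000, 2000]))  # 0.25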
Stratified K-means Clustering
• Weight for data records in the i-th stratum: w_i = N_i / n_i, where N_i is the size of the stratum's population and n_i the size of its sample
• The algorithm is similar to k-means clustering, except that the center of the i-th cluster is the weighted mean of the sampled records assigned to it (see the sketch below)
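A compact sketch of stratified (weighted) k-means; the weighting is the only change from the standard algorithm, and all names are illustrative:

```python
import numpy as np

def weighted_kmeans(points, weights, k, iters=50, seed=0):
    """K-means in which each sampled record carries its stratified
    sampling weight; centers are weighted means of their members."""
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Weighted-mean update of each center.
        for i in range(k):
            mask = labels == i
            if mask.any():
                w = weights[mask][:, None]
                centers[i] = (w * points[mask]).sum(axis=0) / w.sum()
    return centers, labels
```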
Contribution
• Sampling methods for solving the problem of k-means clustering over a deep web data source
• Representative sampling
  – Partition on the space of output attributes
  – Centers
    • Optimized sampling method
    • Active learning based sampling method
  – Proportions
    • Active learning based sampling method
Experiment Result
• Data sets
  – Noisy synthetic data set:
    • 4,000 data records with 4 input attributes and 2 output attributes
    • 400 noise data points added
  – Yahoo! data set:
    • Data on used cars
    • 8,000 data records
• Metric: average distance (AvgDist) between the estimated centers and the true centers
Representative Sampling-Noisy Data Set
• Benefit of stratification
  – Compared with rand, the decreases of AvgDist are 35.5%, 37.4%, 38.6%, and 26.9%
• Benefit of representative sampling
  – Compared with rand_st, the decreases of AvgDist are 11.8%, 14.4%, and 16.1%
• Center based sampling methods perform better
• The optimized sampling method performs better in the long run
Representative Sampling-Yahoo! Data set
• Benefit of stratification
  – Compared with rand, the decreases of AvgDist are 7.2%, 13.2%, 15.0%, and 16.8%
• Benefit of representative sampling
  – Compared with rand_st, the decreases of AvgDist are 6.6%, 8.5%, and 10.5%
• Center based sampling methods perform better
• The optimized sampling method performs better in the long run
Scalability
• The execution time of each method is linear in the size of the data set
Roadmap
• Introduction
  – Deep Web
  – Data Mining on the Deep Web
• Contributions
  – Stratified K-means Clustering over a Deep Web Data Source
  – Two-phase Sampling Based Outlier Detection over a Deep Web Data Source
  – Stratification Based Hierarchical Clustering on a Deep Web Data Source
  – An Active Learning Based Frequent Itemset Mining
  – Differential Rule Mining
  – Stratified Sampling for Deep Web Mining
• Conclusion and Future Work
Outlier Detection
• Outlier
  – An observation that deviates greatly from the other observations
• DB(p, D) outlier
  – A data object for which at least a fraction p of the objects lie at a distance greater than D (see the sketch below)
• Challenges for outlier detection over a deep web data source
  – Recall: finding as large a fraction of the outliers as possible
  – Precision: accurately identifying outliers among the sampled data
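To make the definition concrete, a direct quadratic-time check (illustrative):

```python
import numpy as np

def is_db_outlier(x, data, p, D):
    """x is a DB(p, D) outlier if at least a fraction p of all data
    objects lie at a distance greater than D from x."""
    dists = np.linalg.norm(data - x, axis=1)
    return np.mean(dists > D) >= p

# A point far away from a tight Gaussian cluster is flagged:
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 1.0, (99, 2)), [[10.0, 10.0]]])
print(is_db_outlier(np.array([10.0, 10.0]), data, p=0.9, D=3.0))  # True
```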
Two-phase Stratified Sampling Method
• Neighborhood sampling
  – Aims at improving recall
  – Query subspaces with a high probability of containing outliers are explored
• Uncertain driven sampling
  – Aims at improving precision
Outliers in Stratified Sampling
• Stratified sampling has good performance
• Stratification
  – Similar to the stratification used for k-means clustering over a deep web data source
  – Controls the number of strata
• Outlier detection
  – For a data object x, let f(x) denote the fraction of data objects at a distance greater than D from x
  – f(x) is estimated from the stratified sample (see the sketch below)
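A sketch of the stratified estimator, again assuming stratum weights W_j = N_j / N:

```python
import numpy as np

def estimate_f(x, stratum_samples, stratum_sizes, D):
    """Stratified estimate of f(x): combine each stratum's sample
    fraction of objects farther than D from x, weighted by W_j = N_j/N."""
    N = np.asarray(stratum_sizes, dtype=float)
    f_hat = 0.0
    for w, sample in zip(N / N.sum(), stratum_samples):
        dists = np.linalg.norm(np.asarray(sample, dtype=float) - x, axis=1)
        f_hat += w * np.mean(dists > D)
    return f_hat
```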
Neighbor Nodes
• Similar data objects tend to come from the same query subspace or from neighboring query subspaces
• Neighbor nodes of a node
  – Its left and right cousins under the same parent node
Neighborhood Sampling
[Figure: stratification tree used for neighborhood sampling; the root splits on Year (Y=1980, Y=1990, Y=2000, Y=2010), each year node splits on Bedroom (B=1 through B=4), and leaves split further on Ba=1 and Ba=2; the neighbor nodes of a promising leaf are sampled as well.]
Post-Stratification
• The original strata are further stratified after the additional sampling
• New stratum: leaf nodes with the same sample rate under the same original stratum
• Each data record has an estimate f̂ and its variance for
  – The fraction of data objects at a distance greater than D
  – The probability of being an outlier
Uncertain Driven Sampling
• For a sampled data record with estimate f̂ and thresholds θ1 > θ2
  – Outlier: f̂ > θ1
  – Normal data object: f̂ < θ2
  – Otherwise: uncertain data object (see the sketch below)
• Task: obtain a sample for identifying the uncertain data objects
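The three-way labeling that drives the additional sampling, with the threshold names θ1 and θ2 taken from the reconstruction above:

```python
def classify_record(f_hat, theta1, theta2):
    """Label a record from its estimated f_hat: clearly above theta1 is
    an outlier, clearly below theta2 is normal, in between is uncertain
    and needs further sampling."""
    if f_hat > theta1:
        return "outlier"
    if f_hat < theta2:
        return "normal"
    return "uncertain"

print(classify_record(0.95, theta1=0.9, theta2=0.5))  # outlier
print(classify_record(0.70, theta1=0.9, theta2=0.5))  # uncertain
```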
Sample Allocation
• The uncertain data objects have estimates f̂ that are not yet reliable
• To obtain better estimates of f̂, minimize their total estimation variance subject to the sampling budget
• Using a Lagrange multiplier, the resulting allocation draws more samples from the strata that contribute more of that variance, as in the optimized allocation for clustering
Outlier in Stratified Sampling
• For a sampled data record with estimate f̂
  – Outlier: f̂ > p
  – Otherwise: normal data object
• The distance between each pair of sampled data objects is computed
• Equivalently, in terms of g, the fraction of neighbors within the D-neighborhood (so f = 1 − g):
  – An outlier: g < 1 − p
  – A normal data object: g ≥ 1 − p
Efficient Outlier Detection
• It can be shown that a data object is normal as soon as more than a fraction (1 − p) of the objects are found within distance D of it
• Sufficient condition
  – If the running neighbor count exceeds this bound during the scan, the object is a normal data object and its scan stops early
  – Else, once the scan completes without reaching the bound, the object is an outlier (see the sketch below)
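A sketch of such an early-terminating scan (the classic nested-loop optimization; illustrative, not necessarily the paper's exact procedure):

```python
import numpy as np

def db_outliers(data, p, D):
    """Find DB(p, D) outliers; an object is declared normal (and its
    scan stopped early) once more than (1 - p) * n objects are found
    within distance D of it."""
    n = len(data)
    neighbor_budget = (1.0 - p) * n
    outliers = []
    for i in range(n):
        neighbors = 0
        is_outlier = True
        for j in range(n):
            if np.linalg.norm(data[i] - data[j]) <= D:
                neighbors += 1
                if neighbors > neighbor_budget:
                    is_outlier = False  # normal: stop scanning early
                    break
        if is_outlier:
            outliers.append(i)
    return outliers
```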
Experiment Result
• Data set
  – Yahoo! data set:
    • Data on used cars
    • 8,000 data records
• Evaluation
  – Precision: the fraction of the records identified as outliers in the sample that are true outliers
  – Recall: the fraction of all outliers that are sampled
Recall
• Benefit of stratification
  – Increases over SRS: 108.2%, 116.7%, and 74.7%
• Benefit of neighborhood sampling
  – Increases over SSTS: 19.1% and 28.1%
• Uncertain sampling decreases recall by 3.7%
Precision
• All four methods have good performance
  – The average precision is over 0.9
• Stratified sampling methods have lower precision
  – Compared with SRS, the decreases are 1.7%, 4.3%, and 0.68%
• Benefit of uncertain sampling
  – Compared with NS, the increase is 2.7%
Trade-off between Precision and Recall
• Benefit of stratification
  – TPS, NS, and SSTS improve recall for precision in 0.75-0.975
• Benefit of neighborhood sampling
  – TPS and NS improve recall for precision in 0.75-0.975
• Benefit of uncertain sampling
  – TPS improves recall for precision in 0.92-1.0
Roadmap
• Introduction
  – Deep Web
  – Data Mining on the Deep Web
• Contributions
  – Stratified K-means Clustering over a Deep Web Data Source
  – Two-phase Sampling Based Outlier Detection over a Deep Web Data Source
  – Stratification Based Hierarchical Clustering on a Deep Web Data Source
  – An Active Learning Based Frequent Itemset Mining
  – Differential Rule Mining
  – Stratified Sampling for Deep Web Mining
• Conclusion and Future Work
Stratification Based Hierarchical Clustering on a Deep Web Data Source
• Hierarchical clustering based on stratified sampling
  – Stratification
  – Sample allocation
• Representative sampling
  – The mean values of the output attributes are kept close to their true values
• Uncertain sampling
  – Samples heavily on the boundaries between clusters
An Active Learning Based Frequent Itemset Mining
• Frequent itemset mining
  – Estimating the support of itemsets
  – The number of itemsets can be huge, so 1-itemsets are considered
• Bayesian network
  – Models the relationship between the input attributes and the output attributes
  – A risk function is defined on the estimated parameters
• Active learning based sampling
  – Data records are selected step by step
  – Sample the query subspaces with the greatest decrease of the risk function
Differential Rule Mining
• Different data sources give different values for the same data object
  – e.g., prices of commodities
• Goal: analyzing the differences between data sources
• Differential rule
  – Left hand side: a frequent itemset
  – Right hand side: the behavior of the differential attribute
• Differential rule mining
  – Apriori algorithm
  – Statistical hypothesis test (see the sketch below)
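As one concrete illustration of the hypothesis-test step (my choice of a one-sample t-test on synthetic price differences; the paper's exact test may differ):

```python
import numpy as np
from scipy import stats

# Synthetic price differences for the records matching one candidate
# frequent itemset (illustrative numbers only).
rng = np.random.default_rng(1)
price_diffs = rng.normal(loc=5.0, scale=2.0, size=40)

# Test whether the differential attribute's mean differs from zero;
# a significant result supports emitting the differential rule.
t_stat, p_value = stats.ttest_1samp(price_diffs, popmean=0.0)
if p_value < 0.05:
    print(f"rule supported: mean diff {price_diffs.mean():.2f}, p = {p_value:.4f}")
```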
Stratified Sampling for Association Rule Mining and Differential Rule Mining
• Data mining
  – Association rule mining and differential rule mining
• Stratified sampling
• Stratification
  – Combines estimation variance and sampling cost
  – A tree is recursively built on the query space
• Sample allocation
  – An optimized method for minimizing the integrated cost over variance and sampling cost
Conclusion
• Data mining on the deep web is challenging
• We proposed methods for data mining on the deep web
  – A stratified k-means clustering method
  – A two-phase sampling based outlier detection method
  – A stratified hierarchical clustering method
  – An active learning based frequent itemset mining method
  – A stratified sampling method for data mining on the deep web
  – Differential rule mining
• The experimental results show the efficiency of our methods
Future Work
• Outlier detection over a deep web data source
  – Consider the problem of statistical distribution based outlier detection
• Mining multiple deep web data sources
  – Instance-based schema matching
    • Efficiently sampling instances from the deep web to facilitate schema matching
  – Mining the data coverage of multiple deep web data sources
    • Efficient sampling methods for estimating the data coverage of multiple data sources
Questions?