Data Mining over Hidden Data Sources
Tantan Liu
Advisor: Gagan Agrawal
Dept. of Computer Science & Engineering
Ohio State University
July 23, 2012
Outline
• Introduction
  – Deep Web
  – Data Mining on the Deep Web
• Contributions
  – Stratified K-means Clustering over a Deep Web Data Source (SIGKDD, 2012)
  – Two-phase Sampling Based Outlier Detection over a Deep Web Data Source (submitted to ICDM, 2012)
  – Stratification Based Hierarchical Clustering on a Deep Web Data Source (SDM, 2012)
  – An Active Learning Based Frequent Itemset Mining (ICDE, 2011)
  – Differential Rule Mining (ICDM Workshops, 2010)
  – Stratified Sampling for Deep Web Mining (ICDM, 2010)
• Conclusion and Future Work
Deep Web
• Data sources hidden from standard search engines
  – Online query interface vs. database
  – The database is accessible only through the online interface
  – Input attributes vs. output attributes
• An example of the Deep Web
Data Mining over the Deep Web
• High-level summaries of the data
  – Scenario 1: a user wants to relocate to a county
    • Summary of the residences in the county: age, price, square footage
    • The county property assessor's web site only allows simple queries
  – Scenario 2: a user is thinking about his or her career path
    • High-level knowledge about the job posts in the market: job type, salary, education, experience, skills, ...
    • Job web sites, e.g., LinkedIn and MSN Careers, provide millions of job posts
Challenges
• Databases cannot be accessed directly
  – Sampling methods for deep web mining
• Obtaining data is time consuming
  – Efficient sampling methods
  – High accuracy with low sampling cost
Contributions
• Stratified K-means Clustering over a Deep Web Data Source (SIGKDD, 2012)
• Two-phase Sampling Based Outlier Detection over a Deep Web Data Source (submitted to ICDM, 2012)
• Stratification Based Hierarchical Clustering on a Deep Web Data Source (SDM, 2012)
• An Active Learning Based Frequent Itemset Mining (ICDE, 2011)
• Differential Rule Mining (ICDM Workshops, 2010)
• Stratified Sampling for Deep Web Mining (ICDM, 2010)
Roadmap
• Introduction
  – Deep Web
  – Data Mining on the Deep Web
• Contributions
  – Stratified K-means Clustering over a Deep Web Data Source
  – Two-phase Sampling Based Outlier Detection over a Deep Web Data Source
  – Stratification Based Hierarchical Clustering on a Deep Web Data Source
  – An Active Learning Based Frequent Itemset Mining
  – Differential Rule Mining
  – Stratified Sampling for Deep Web Mining
• Conclusion and Future Work
An Example of Deep Web for Real-Estate
K-means Clustering over a Deep Web Data Source
• Goal: estimate k centers for the underlying clusters, so that the k centers estimated from the sample are close to the k true centers in the whole population.
Overview of Method
[Diagram: the population is divided into sub-populations (strata) 1 through n; a sample is drawn from each; the combined sample is fed to stratification based k-means clustering to produce the clusters. Key components: stratification and sample allocation.]
Stratification on the deep web
• Partitioning the entire population into strata
  – Stratification is performed on the query space of the input attributes
  – Goal: homogeneous query subspaces
  – Radius of a query subspace: measures how widely its data records are spread in the output space
  – Rule: choose the input attribute that most decreases the radius of a node
  – For an input attribute, the decrease of radius is the node's radius minus the weighted average radius of the children created by splitting on that attribute (see the sketch after this list)
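A minimal sketch of this greedy choice, assuming the radius of a subspace is the average distance of its pilot-sample records to their centroid in the output space (that definition, and all names below, are illustrative rather than the paper's):

```python
import numpy as np

def radius(points):
    """Assumed radius: mean Euclidean distance of a subspace's
    output-attribute vectors to their centroid."""
    center = points.mean(axis=0)
    return np.linalg.norm(points - center, axis=1).mean()

def radius_decrease(pilot, attr_values):
    """Decrease of radius when splitting a node on one input attribute:
    the parent's radius minus the size-weighted radius of the children."""
    r_children = 0.0
    for v in np.unique(attr_values):
        child = pilot[attr_values == v]
        r_children += (len(child) / len(pilot)) * radius(child)
    return radius(pilot) - r_children

def best_split_attribute(pilot, input_attrs):
    """Pick the input attribute whose split most reduces the radius;
    input_attrs maps attribute name -> per-record values."""
    return max(input_attrs, key=lambda a: radius_decrease(pilot, input_attrs[a]))
```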
[Figure: stratification tree on the input attributes; the root node (Year of construction, with NULL for missing values) splits into Y=1980, Y=1990, Y=2000, ..., Y=2008, and those nodes split further on Bedroom (B=3, B=4).]
Partition on Space of Output Attributes
[Figure: sampled data records plotted in the space of the output attributes (Price vs. Square Feet), labeled by year of construction (1980, 1990, 2000, 2008) and partitioned into subspaces.]
Sampling Allocation Methods
• We create c·k partitions and c·k subspaces
  – A pilot sample is drawn first
  – Running k-means clustering with c·k clusters on the pilot sample generates the c·k partitions
• Representative sampling
  – Good estimation of the statistics of the c·k subspaces
    • Centers
    • Proportions
Representative Sampling-Centers
• Center of a subspace
  – The mean vector of all data points belonging to the subspace
• Let the sample be S = {DR_1, DR_2, ..., DR_n}
  – For the i-th subspace, the estimated center is the mean of the output-attribute vectors of the sampled records that fall in it:

    c_i = (1 / m_i) · Σ_{j=1}^{m_i} O(DR_{i,j})

  where O(DR_{i,j}) is the output-attribute vector of the j-th sampled record in subspace i, and m_i is the number of such records.
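A direct numpy sketch of this estimator (all names illustrative):

```python
import numpy as np

def subspace_centers(sample_outputs, subspace_ids, num_subspaces):
    """Estimate each subspace's center as the mean of the
    output-attribute vectors of the sampled records assigned to it."""
    centers = np.zeros((num_subspaces, sample_outputs.shape[1]))
    for i in range(num_subspaces):
        members = sample_outputs[subspace_ids == i]
        if len(members) > 0:
            centers[i] = members.mean(axis=0)  # c_i = (1/m_i) * sum O(DR_ij)
    return centers
```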
Distance Function
• For the c·k estimated centers ĉ_1, ..., ĉ_{c·k} and the true centers c_1, ..., c_{c·k}
• Using Euclidean distance, the distance function is Φ = Σ_i ||ĉ_i − c_i||²
  – Its expectation is an integrated variance
    • Expressed in terms of subspaces, strata, and output attributes
    • Computed based on the pilot sample
  – Writing V_j for the variance contributed by the j-th stratum, E[Φ] can be expressed as Σ_j V_j / n_j, where n_j is the number of samples drawn from the j-th stratum
Optimized Sample Allocation
• Goal: choose the n_j to minimize the integrated variance Σ_j V_j / n_j, subject to a total budget Σ_j n_j = n
• Using Lagrange multipliers, the optimal allocation is proportional to the square root of each stratum's variance: n_j = n · √V_j / Σ_l √V_l
• More samples are drawn from strata with large variance
  – There the data are spread over a wide area, and more samples are needed to represent the population (see the sketch below)
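A minimal implementation of this closed-form allocation (illustrative; the V_j would come from the pilot sample):

```python
import numpy as np

def optimized_allocation(stratum_variances, budget):
    """Allocate the sample budget across strata proportionally to the
    square root of each stratum's variance:
    n_j = n * sqrt(V_j) / sum_l sqrt(V_l)."""
    w = np.sqrt(np.asarray(stratum_variances, dtype=float))
    alloc = budget * w / w.sum()
    # Round down, then hand leftover samples to the largest remainders.
    n = np.floor(alloc).astype(int)
    for i in np.argsort(alloc - n)[::-1][: budget - n.sum()]:
        n[i] += 1
    return n

# Example: variances 4.0, 1.0, 0.25 with a budget of 70 -> [40, 20, 10]
print(optimized_allocation([4.0, 1.0, 0.25], 70))
```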
Active Learning Based Sampling Method
• In machine learning
  – Passive learning: data are chosen randomly
  – Active learning: particular data are selected to help build a better model, which matters when obtaining data is costly and/or time-consuming
• When one more sample is drawn from stratum i, the estimated decrease of the distance function is Δ_i = V_i / n_i − V_i / (n_i + 1)
• Iterative sampling process (sketched below)
  – At each iteration, the stratum with the largest decrease of the distance function is selected for sampling
  – The integrated variance is updated
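A sketch of this loop, assuming a hypothetical `draw_from_stratum(i)` query primitive and one output attribute per record:

```python
import numpy as np

def active_learning_sampling(V, n_init, budget, draw_from_stratum):
    """Iteratively sample the stratum whose next sample most decreases
    the estimated distance function E[Phi] = sum_j V_j / n_j."""
    V = np.asarray(V, dtype=float)
    n = np.asarray(n_init, dtype=int)
    samples = [[] for _ in V]
    for _ in range(budget):
        delta = V / n - V / (n + 1)      # estimated decrease per stratum
        i = int(np.argmax(delta))
        samples[i].append(draw_from_stratum(i))
        n[i] += 1
        if len(samples[i]) > 1:
            # Refresh the variance estimate (simplified here: from the
            # newly drawn samples only, not the pilot sample).
            V[i] = np.var(samples[i], ddof=1)
    return samples, n
```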
Representative Sampling-Proportion
• Proportion of a subspace
  – The fraction of data records belonging to the subspace
  – Depends on the proportion of the subspace within each stratum: if p_{ij} is the proportion of the i-th subspace in the j-th stratum, the overall proportion is the stratum-weighted sum of the p_{ij} (see the sketch below)
• Risk function
  – The distance between the estimated fractions and their true values
• Iterative sampling process
  – At each iteration, the stratum with the largest decrease of the risk function is chosen for sampling
  – The parameters are updated
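A sketch of the stratified proportion estimate, assuming the usual stratum weights W_j = N_j / N (my assumption):

```python
import numpy as np

def subspace_proportion(per_stratum_fractions, stratum_sizes):
    """Stratified estimate of a subspace's overall proportion:
    P_hat = sum_j (N_j / N) * p_ij."""
    p = np.asarray(per_stratum_fractions, dtype=float)
    N = np.asarray(stratum_sizes, dtype=float)
    return float(np.sum(N / N.sum() * p))

# Two strata of sizes 6000 and 2000 with subspace fractions 0.30 and 0.10:
print(subspace_proportion([0.30, 0.10], [6000, 2000]))  # 0.25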
Stratified K-means Clustering
• Weight for data records in the i-th stratum: w_i = N_i / n_i, where N_i is the size of the stratum's population and n_i the size of its sample
• The algorithm is similar to k-means clustering, except that the center of the i-th cluster is the weighted mean of the sampled records assigned to it (see the sketch below)
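A compact sketch of stratified (weighted) k-means; the weighting is the only change from the standard algorithm, and all names are illustrative:

```python
import numpy as np

def weighted_kmeans(points, weights, k, iters=50, seed=0):
    """K-means in which each sampled record carries its stratified
    sampling weight; centers are weighted means of their members."""
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Weighted-mean update of each center.
        for i in range(k):
            mask = labels == i
            if mask.any():
                w = weights[mask][:, None]
                centers[i] = (w * points[mask]).sum(axis=0) / w.sum()
    return centers, labels
```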
Contribution
• Sampling methods for solving the problem of k-means clustering over a deep web data source
• Representative sampling
  – Partition on the space of output attributes
  – Centers
    • Optimized sampling method
    • Active learning based sampling method
  – Proportions
    • Active learning based sampling method
Experiment Result
• Data sets
  – Noisy synthetic data set:
    • 4,000 data records with 4 input attributes and 2 output attributes
    • 400 noise data points added
  – Yahoo! data set:
    • Data on used cars
    • 8,000 data records
• Metric: average distance (AvgDist) between the estimated centers and the true centers
Representative Sampling-Noisy Data Set
• Benefit of stratification
  – Compared with rand, the decreases of AvgDist are 35.5%, 37.4%, 38.6%, and 26.9%
• Benefit of representative sampling
  – Compared with rand_st, the decreases of AvgDist are 11.8%, 14.4%, and 16.1%
• Center based sampling methods perform better
• The optimized sampling method performs better in the long run
Representative Sampling-Yahoo! Data set
• Benefit of stratification
  – Compared with rand, the decreases of AvgDist are 7.2%, 13.2%, 15.0%, and 16.8%
• Benefit of representative sampling
  – Compared with rand_st, the decreases of AvgDist are 6.6%, 8.5%, and 10.5%
• Center based sampling methods perform better
• The optimized sampling method performs better in the long run
Scalability
• The execution time of each method is linear in the size of the data set
Roadmap
• Introduction
  – Deep Web
  – Data Mining on the Deep Web
• Contributions
  – Stratified K-means Clustering over a Deep Web Data Source
  – Two-phase Sampling Based Outlier Detection over a Deep Web Data Source
  – Stratification Based Hierarchical Clustering on a Deep Web Data Source
  – An Active Learning Based Frequent Itemset Mining
  – Differential Rule Mining
  – Stratified Sampling for Deep Web Mining
• Conclusion and Future Work
Outlier Detection
• Outlier
  – An observation that deviates greatly from the other observations
• DB(p, D) outlier
  – A data object for which at least a fraction p of the objects lie at a distance greater than D (see the sketch below)
• Challenges for outlier detection over a deep web data source
  – Recall: finding as large a fraction of the outliers as possible
  – Precision: accurately identifying outliers among the sampled data
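To make the definition concrete, a direct quadratic-time check (illustrative):

```python
import numpy as np

def is_db_outlier(x, data, p, D):
    """x is a DB(p, D) outlier if at least a fraction p of all data
    objects lie at a distance greater than D from x."""
    dists = np.linalg.norm(data - x, axis=1)
    return np.mean(dists > D) >= p

# A point far away from a tight Gaussian cluster is flagged:
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 1.0, (99, 2)), [[10.0, 10.0]]])
print(is_db_outlier(np.array([10.0, 10.0]), data, p=0.9, D=3.0))  # True
```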
Two-phase Stratified Sampling Method
• Neighborhood sampling
  – Aims at improving recall
  – Query subspaces with a high probability of containing outliers are explored
• Uncertain driven sampling
  – Aims at improving precision
Outliers in Stratified Sampling
• Stratified sampling has good performance
• Stratification
  – Similar to the stratification used for k-means clustering over a deep web data source
  – Controls the number of strata
• Outlier detection
  – For a data object x, let f(x) denote the fraction of data objects at a distance greater than D from x
  – f(x) is estimated from the stratified sample (see the sketch below)
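A sketch of the stratified estimator, again assuming stratum weights W_j = N_j / N:

```python
import numpy as np

def estimate_f(x, stratum_samples, stratum_sizes, D):
    """Stratified estimate of f(x): combine each stratum's sample
    fraction of objects farther than D from x, weighted by W_j = N_j/N."""
    N = np.asarray(stratum_sizes, dtype=float)
    f_hat = 0.0
    for w, sample in zip(N / N.sum(), stratum_samples):
        dists = np.linalg.norm(np.asarray(sample, dtype=float) - x, axis=1)
        f_hat += w * np.mean(dists > D)
    return f_hat
```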
Neighbor Nodes
• Similar data objects tend to come from the same query subspace or from neighboring query subspaces
• Neighbor nodes of a node
  – Its left and right cousins under the same parent node
Neighborhood Sampling
[Figure: stratification tree used for neighborhood sampling; the root splits on Year (Y=1980, Y=1990, Y=2000, Y=2010), each year node splits on Bedroom (B=1 through B=4), and leaves split further on Ba=1 and Ba=2; the neighbor nodes of a promising leaf are sampled as well.]
Post-Stratification
• The original strata are further stratified after the additional sampling
• New stratum: leaf nodes with the same sample rate under the same original stratum
• Each data record has an estimate f̂ and its variance for
  – The fraction of data objects at a distance greater than D
  – The probability of being an outlier
Uncertain Driven Sampling
• For a sampled data record with estimate f̂ and thresholds θ1 > θ2
  – Outlier: f̂ > θ1
  – Normal data object: f̂ < θ2
  – Otherwise: uncertain data object (see the sketch below)
• Task: obtain a sample for identifying the uncertain data objects
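The three-way labeling that drives the additional sampling, with the threshold names θ1 and θ2 taken from the reconstruction above:

```python
def classify_record(f_hat, theta1, theta2):
    """Label a record from its estimated f_hat: clearly above theta1 is
    an outlier, clearly below theta2 is normal, in between is uncertain
    and needs further sampling."""
    if f_hat > theta1:
        return "outlier"
    if f_hat < theta2:
        return "normal"
    return "uncertain"

print(classify_record(0.95, theta1=0.9, theta2=0.5))  # outlier
print(classify_record(0.70, theta1=0.9, theta2=0.5))  # uncertain
```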
Sample Allocation
• The uncertain data objects have estimates f̂ that are not yet reliable
• To obtain better estimates of f̂, minimize their total estimation variance subject to the sampling budget
• Using a Lagrange multiplier, the resulting allocation draws more samples from the strata that contribute more of that variance, as in the optimized allocation for clustering
Outlier in Stratified Sampling
• For a sampled data record with estimate f̂
  – Outlier: f̂ > p
  – Otherwise: normal data object
• The distance between each pair of sampled data objects is computed
• Equivalently, in terms of g, the fraction of neighbors within the D-neighborhood (so f = 1 − g):
  – An outlier: g < 1 − p
  – A normal data object: g ≥ 1 − p
Efficient Outlier Detection
• It can be shown that a data object is normal as soon as more than a fraction (1 − p) of the objects are found within distance D of it
• Sufficient condition
  – If the running neighbor count exceeds this bound during the scan, the object is a normal data object and its scan stops early
  – Else, once the scan completes without reaching the bound, the object is an outlier (see the sketch below)
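A sketch of such an early-terminating scan (the classic nested-loop optimization; illustrative, not necessarily the paper's exact procedure):

```python
import numpy as np

def db_outliers(data, p, D):
    """Find DB(p, D) outliers; an object is declared normal (and its
    scan stopped early) once more than (1 - p) * n objects are found
    within distance D of it."""
    n = len(data)
    neighbor_budget = (1.0 - p) * n
    outliers = []
    for i in range(n):
        neighbors = 0
        is_outlier = True
        for j in range(n):
            if np.linalg.norm(data[i] - data[j]) <= D:
                neighbors += 1
                if neighbors > neighbor_budget:
                    is_outlier = False  # normal: stop scanning early
                    break
        if is_outlier:
            outliers.append(i)
    return outliers
```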
Experiment Result
• Data set
  – Yahoo! data set:
    • Data on used cars
    • 8,000 data records
• Evaluation
  – Precision: the fraction of the records identified as outliers in the sample that are true outliers
  – Recall: the fraction of all outliers that are sampled
Recall
• Benefit of stratification
  – Increases over SRS: 108.2%, 116.7%, and 74.7%
• Benefit of neighborhood sampling
  – Increases over SSTS: 19.1% and 28.1%
• Uncertain sampling decreases recall by 3.7%
Precision
• All four methods have good performance
  – The average precision is over 0.9
• Stratified sampling methods have lower precision
  – Compared with SRS, the decreases are 1.7%, 4.3%, and 0.68%
• Benefit of uncertain sampling
  – Compared with NS, the increase is 2.7%
Trade-off between Precision and Recall
• Benefit of stratification
  – TPS, NS, and SSTS improve recall for precision in 0.75-0.975
• Benefit of neighborhood sampling
  – TPS and NS improve recall for precision in 0.75-0.975
• Benefit of uncertain sampling
  – TPS improves recall for precision in 0.92-1.0
Roadmap
• Introduction
  – Deep Web
  – Data Mining on the Deep Web
• Contributions
  – Stratified K-means Clustering over a Deep Web Data Source
  – Two-phase Sampling Based Outlier Detection over a Deep Web Data Source
  – Stratification Based Hierarchical Clustering on a Deep Web Data Source
  – An Active Learning Based Frequent Itemset Mining
  – Differential Rule Mining
  – Stratified Sampling for Deep Web Mining
• Conclusion and Future Work
Stratification Based Hierarchical Clustering on a Deep Web Data Source
• Hierarchical clustering based on stratified sampling
  – Stratification
  – Sample allocation
• Representative sampling
  – The mean values of the output attributes are kept close to their true values
• Uncertain sampling
  – Samples heavily on the boundaries between clusters
An Active Learning Based Frequent Itemset Mining
• Frequent itemset mining
  – Estimating the support of itemsets
  – The number of itemsets can be huge, so 1-itemsets are considered
• Bayesian network
  – Models the relationship between the input attributes and the output attributes
  – A risk function is defined on the estimated parameters
• Active learning based sampling
  – Data records are selected step by step
  – Sample the query subspaces with the greatest decrease of the risk function
Differential Rule Mining
• Different data sources give different values for the same data object
  – e.g., prices of commodities
• Goal: analyzing the differences between data sources
• Differential rule
  – Left hand side: a frequent itemset
  – Right hand side: the behavior of the differential attribute
• Differential rule mining
  – Apriori algorithm
  – Statistical hypothesis test (see the sketch below)
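As one concrete illustration of the hypothesis-test step (my choice of a one-sample t-test on synthetic price differences; the paper's exact test may differ):

```python
import numpy as np
from scipy import stats

# Synthetic price differences for the records matching one candidate
# frequent itemset (illustrative numbers only).
rng = np.random.default_rng(1)
price_diffs = rng.normal(loc=5.0, scale=2.0, size=40)

# Test whether the differential attribute's mean differs from zero;
# a significant result supports emitting the differential rule.
t_stat, p_value = stats.ttest_1samp(price_diffs, popmean=0.0)
if p_value < 0.05:
    print(f"rule supported: mean diff {price_diffs.mean():.2f}, p = {p_value:.4f}")
```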
Stratified Sampling for Association Rule Mining and Differential Rule Mining
• Data mining
  – Association rule mining and differential rule mining
• Stratified sampling
• Stratification
  – Combines estimation variance and sampling cost
  – A tree is recursively built on the query space
• Sample allocation
  – An optimized method for minimizing the integrated cost over variance and sampling cost
Conclusion
• Data mining on the deep web is challenging
• We proposed methods for data mining on the deep web
  – A stratified k-means clustering method
  – A two-phase sampling based outlier detection method
  – A stratified hierarchical clustering method
  – An active learning based frequent itemset mining method
  – A stratified sampling method for data mining on the deep web
  – Differential rule mining
• The experimental results show the efficiency of our methods
Future Work
• Outlier detection over a deep web data source
  – Consider the problem of statistical distribution based outlier detection
• Mining multiple deep web data sources
  – Instance-based schema matching
    • Efficiently sampling instances from the deep web to facilitate schema matching
  – Mining the data coverage of multiple deep web data sources
    • Efficient sampling methods for estimating the data coverage of multiple data sources
Questions?