An Overview of Methods for Detecting and Handling Anomalies
Laure Berti-Équille
IRD
AAFD 2012
In Search of Data Quality Problems…
"Dirty Data":
– Malformed data
– Aberrant data (outliers)
– Duplicates
– Inconsistent data
– Obsolete data
– False, incorrect, erroneous data
– Incomplete, truncated, censored data
– Missing data
AAFD'12, Univ. Paris 13, Institut Galilée, 28-29 June 2012
Outline
1. Motivating Example
2. Generic Guidelines
3. Methods for Anomaly Detection
4. Techniques for Cleaning Dirty Data
5. Summary and Conclusions
IP Data Streams: A Picture
• 10 Attributes, every 5 minutes, over four weeks
• Axes transformed for plotting
* L. Berti-Équille, T. Dasu, D. Srivastava. Discovery of Complex Glitch Patterns: A Novel Approach to Quantitative Data Cleaning. Proc. of ICDE 2011, pp. 733-744, Hannover, Germany, 2011.
Detection of Patterns of Anomalies
[Figure: missing values, outliers, and duplicates detected across the ten attributes: Interfaces, Utilization_Out, Utilization_In, Bytes_Out, Bytes_In, Memory, CPU, Latency, Syslog_Events, CPU_Poll]
Detection: Main Issues
• A large variety of detection methods with conflicting results
• No benchmark
• DQ problems are not necessarily rare events
• DQ problems may be (partially) correlated
• Mutual masking effects impair detection, e.g.:
– missing values affect the detection of duplicates
– duplicate records affect the detection of outliers
– imputation methods may mask the presence of duplicates
• Classical assumptions won't hold (e.g., MCAR/MAR, normality, symmetry, uni-modality)
Cleaning: What Can Be Done?
• Cleaning strategies (ad hoc)
– Impute missing values → component-wise median?
– De-duplicate → retain a random record?
– Handle outliers → identify and remove? So many methods, but with contradictory results?
– Drop all records that have any imperfection
– Add special categories and analyze singularities in isolation
• Almost all existing approaches are one-shot approaches to univariate glitches. Why?
• Cleaning may introduce new errors!
So Many Choices…
– Missing values: Deletion, Imputation, Modeling
– Duplicates: Deletion, Fusion, Random selection
– Outliers: Deletion, Winsorization, Trimming
Guidelines
Step 1 – Explore the data distributions
Goal
– Detect and count missing, extreme, and aberrant data values
– Decide not to consider some values or variables
– Decide the transformation and corrective actions to apply
For continuous variables
– Discretization
– Test for normality (essential for small datasets) and normalization
– Optional test for homoscedasticity (equality of variance-covariance matrices)
– Detect non-linearity and non-monotonicity
For discrete variables
– Group the variables with small populations
– Create new relevant aggregates
Step 1 – Data Distribution Characteristics
• Dispersion
– Standard deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2}$
– Coefficient of Variation (CV), a normalized measure of the dispersion of a probability distribution: $CV = \frac{\sigma}{\mu}$
– IQR: $Q_3 - Q_1$
– Homoscedasticity: equality of variances for a variable on different subsets, using the Levene, Bartlett, or Fisher tests (if $p < .05$ ⇒ heteroscedasticity)
• Skewness, a measure of the asymmetry of the probability distribution of a real-valued random variable: $S = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i - \bar{x}}{\sigma_x}\right)^3$
– $= 0$: the distribution is symmetrical
– $> 0$: the mass of the distribution is concentrated on the left
– $< 0$: the mass of the distribution is concentrated on the right
• Kurtosis, a measure of the peakedness of the distribution: $K = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i - \bar{x}}{\sigma_x}\right)^4$
– $= 3$: like the normal distribution
– $> 3$: more concentrated (peaked) than the Gaussian
– $< 3$: flatter than the Gaussian
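As an illustration, the dispersion and shape measures above can be computed directly from their definitions. A minimal sketch in Python with NumPy; the function name `distribution_profile` is ours, not from the tutorial:

```python
import numpy as np

def distribution_profile(x):
    """Population moments used to screen a variable in Step 1."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mu = x.mean()
    sigma = np.sqrt(((x - mu) ** 2).sum() / n)     # population standard deviation
    cv = sigma / mu                                 # coefficient of variation
    skew = ((x - mu) ** 3).sum() / (n * sigma**3)   # 0 for a symmetric distribution
    kurt = ((x - mu) ** 4).sum() / (n * sigma**4)   # 3 for a Gaussian
    return sigma, cv, skew, kurt
```

For instance, `[1, 2, 3, 4, 5]` is symmetric, so its skewness is 0 and its kurtosis well below 3 (it is flatter than a Gaussian).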
Step 1 – Test for Normality
• Many DM methods assume multivariate normal distributions
• Multivariate normality can be detected by inspecting the indices of multivariate skewness and kurtosis
• Lack of univariate normality occurs when the skewness index > 3.0 and the kurtosis index > 10
• Non-normal distributions can sometimes be corrected by transforming variables
• Tests:
– Kolmogorov-Smirnov test: non-parametric test that quantifies the maximum distance between the empirical distribution function of the variable and the CDF of the normal distribution
– Anderson-Darling test: variant of the K-S test weighting the tails of the distributions
– Lilliefors test: variant of the K-S test for unknown mean and standard deviation
– Shapiro-Wilk test: orders the sample values in ascending order and uses the correlation to detect small departures from normality; not suitable for very large sample sizes (SAS proc UNIVARIATE)
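A quick sketch of two of these tests, assuming SciPy is available (the seed and sample size are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=500)

# Shapiro-Wilk: correlation-based, best suited to small/medium samples
w_stat, w_p = stats.shapiro(x)

# Lilliefors-style check: standardize with the sample mean/std,
# then run Kolmogorov-Smirnov against N(0, 1)
z = (x - x.mean()) / x.std(ddof=1)
ks_stat, ks_p = stats.kstest(z, "norm")
```

Because the mean and standard deviation are estimated from the same data, the plain K-S p-value is optimistic here; the Lilliefors correction exists precisely for that case.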
Guidelines
Step 2 – Analyze data relationships
Goal
– Detect inconsistencies between two or more variables
– Determine relationships between one target variable and one or more variables contributing to its explanation, in order to eliminate variables with no effect
– Determine relationships between explanatory variables in order to avoid multicollinearity, which may cause regression techniques to fail
– Quantify the strength of the relationship and its sensitivity in the presence of outliers
– Detect spurious correlations
Methods
– Bivariate statistics measuring pair-wise correlations
– Discover functional dependencies (FDs)
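The pair-wise correlation screen can be sketched as follows (the `correlation_screen` helper and its 0.9 threshold are illustrative choices, not from the tutorial):

```python
import numpy as np

def correlation_screen(X, names, threshold=0.9):
    """Flag variable pairs whose |correlation| suggests multicollinearity."""
    R = np.corrcoef(X, rowvar=False)            # variables in columns
    flagged = []
    for i in range(R.shape[0]):
        for j in range(i + 1, R.shape[0]):
            if abs(R[i, j]) >= threshold:
                flagged.append((names[i], names[j], R[i, j]))
    return flagged
```

On a matrix where column b is exactly 2 × column a, the pair (a, b) is flagged with correlation 1.0, while an unrelated column passes.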
Guidelines
Steps 1 & 2 – Use the toolbox for detection
Univariate (UV) statistics:
– Distributional techniques: skewness, kurtosis; goodness-of-fit tests (normality, chi-square, analysis of residuals, Kullback-Leibler divergence); control charts (X-Bar, CUSUM, R)
– Visualization: graphics, Q-Q plots, confusion matrix
Multivariate (MV) statistics:
– Model-based methods: linear and logistic regression; probabilistic methods; MCD, MVE, and other robust estimators
– Clustering: distance-based, density-based, and subspace-based techniques
– Classification: rule-based techniques; SVMs, neural networks, Bayesian networks; information-theoretic measures; kernel-based methods
– Rule & pattern discovery: association rule discovery; FD, AFD, CFD mining
Ultimate research goals: benchmarking, optimization, refinement, scalability, tuning, real-time, interactivity
Guidelines
Step 3 – Data Preparation: Major Tasks
• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration: integration of multiple databases, data cubes, or files
• Data transformation: normalization and aggregation
• Data reduction: obtains a reduced representation in volume that produces the same or similar analytical results
• Data discretization: part of data reduction, but of particular importance, especially for numerical data
Outline
1. Motivating Example
2. Methods for Anomaly Detection
– Non-standardized, misfielded/misformatted
– Duplicates
– Outliers
– Inconsistencies
– Missing, truncated
– Out-of-date
– Erroneous, contradicting, false
Example: ICDM Steering Committee (misfielded values and non-standard representations)
Name | Affiliation | City, State, Zip, Country | Phone
Piatetsky-Shapiro G., PhD | U. of Massachusetts | | 617-264-9914
David J. Hand | Imperial College | London, UK |
Benjamin W. Wah | Univ. of Illinois | IL 61801, USA | (217) 333-6903
Hand D.J. | | |
Vippin Kumar | U. of Minnesota | MI, USA |
Xindong Wu | U. of Vermont | Burlington-4000 USA | NULL
Philip S. Yu | U. of Illinois Chicago | IL, USA | 999-999-9999
Osmar R. Zaiiane | U. of Alberta | CA | 111-111-1111
Extract-Transform-Load (1/4)
Goals
• Format detection, verification, and conversion
• Standardization of values with loose or predictable structure, e.g., addresses, names, bibliographic entries [Christen et al., 2002]
• Abbreviation enforcing
• Data consolidation based on dictionaries and constraints
Approaches
• Declarative language extensions
• Machine learning and HMMs for field and record segmentation
• Constraint-based methods [Fan et al., 2008]
ETL Operators
Operators | Category | Application
Mapping, Convert, Select, Drop, Add, Merge, Format | Row-level | Locally applied to a single row
Copy, Filter, Split, Switch | Router | Locally decides, for each row, which of the many (output) destinations it should be sent to
Pivot/Unpivot, Aggregate, Clustering | Unary Grouper | Transforms a set of rows into a single row
Union, Merge, Join, Look-up, Compare, Divide | Binary or N-ary | Combines many inputs into one output
Sort | Unary Holistic | Performs a transformation on the entire dataset
[Vassiliadis et al. 2007]
Open Source ETL: 2 of Many
• Kettle (PDI): http://www.pentaho.com/
• Febrl: http://cs.anu.edu.au/~Peter.Christen/Febrl/febrl-0.3/
Extract-Transform-Load (4/4)
Limitations
• Design of ad hoc scenarios
• Performance/scalability issues due to dependencies among ETL jobs and sequential processing
• DB bottleneck for bulk ETL operators
• Mainly for structured (relational) data
Research Directions
• Optimization of ETL workflows*
• Active data warehousing
• Cleaning of data streams
* A. Simitsis, P. Vassiliadis, T. K. Sellis. State-Space Optimization of ETL Workflows. IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 10, pp. 1404-1419, October 2005.
Record Linkage (RL)
1. Blocking: reduce the search space by partitioning the dataset into mutually exclusive blocks to compare
• Hashing, sorted keys, sorted nearest neighbors, (multiple) windowing, clustering
2. Comparison: select and compute a comparison function measuring the similarity distance between pairs of records
• Token-based: N-gram comparison, Jaccard, TF-IDF, cosine similarity
• Edit-based: Jaro distance, edit distance, Levenshtein, Soundex
• Domain-dependent: data types, ad hoc rules, relationship-aware similarity measures
3. Classification: select a decision model to classify pairs of records as matching, non-matching, or potentially matching
4. Fusion: select the deduplication method
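The comparison step can be sketched with one of the token-based measures listed above, Jaccard similarity over character bigrams (the helper names are ours):

```python
def ngrams(s, n=2):
    """Character n-grams of a lowercased string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b, n=2):
    """Token-based similarity: |intersection| / |union| of n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)
```

For example, `jaccard("AT&T", "ATT")` compares the bigram sets {"at", "t&", "&t"} and {"at", "tt"}, giving 1/4; a decision model (step 3) would then threshold such scores.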
Record Linkage (RL): References
• ELMAGARMID, AHMED K., IPEIROTIS, PANAGIOTIS G., & VERYKIOS, VASSILIOS S. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng., 19(1), 1–16, 2007.
• SimMetrics: Similarity Metric Java Library, http://sourceforge.net/projects/simmetrics/
• KOUDAS, NICK, SARAWAGI, SUNITA, SRIVASTAVA, DIVESH. Record Linkage: Similarity Measures and Algorithms. Tutorial of SIGMOD 2006.
• DONG, LUNA, NAUMANN, FELIX. Data Fusion: Resolving Data Conflicts for Integration. Tutorial of VLDB 2009.
Chaining or Spurious Linkage
ID | Name | Address
1 | AT&T | 180 Park. Av Florham Park
2 | ATT | 180 park Ave. Florham Park NJ
3 | AT&T Labs | 180 Park Avenue Florham Park
4 | ATT | Park Av. 180 Florham Park
5 | TAT | 180 park Av. NY
6 | ATT | 180 Park Avenue. NY NY
7 | ATT | Park Avenue, NY No. 180
8 | ATT | 180 Park NY NY
[Figure: pairwise similarity chains the records into clusters, spuriously linking distinct Florham Park and New York entities]
Limitations:
• Expertise required for method selection and parameterization
• No benchmark
Outlier Taxonomy
Anomaly Detection:
• Point Anomaly Detection
– Classification-based: rule-based, neural-network-based, SVM-based
– Nearest-neighbor-based: density-based, distance-based
– Statistical: parametric, non-parametric
– Clustering-based
– Others: information-theory-based, spectral-decomposition-based, visualization-based
• Contextual Anomaly Detection
• Collective Anomaly Detection
• Online Anomaly Detection
• Distributed Anomaly Detection
Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly Detection: A Survey. ACM Computing Surveys, 41(3), 1–58.
Example
• N1 and N2 are normal regions
• o1, o2, and O4 are point anomalies
• Region O3 is a collective anomaly
[Figure: scatter plot showing normal regions N1 and N2, point anomalies o1, o2, O4, and the collective anomaly region O3]
So many detection methods…
• Bivariate analysis: rejection area is the data space excluding the region between the 2% and 98% quantiles for X and Y
• Multivariate analysis: rejection area based on Mahalanobis_dist(cov(X,Y)) > χ²(.98, 2)
• Legitimate outliers or data quality problems?
[Figure: comparison of the bivariate and multivariate rejection areas on the same data]
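The multivariate rejection rule above can be sketched as follows, assuming SciPy is available for the chi-square quantile:

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.98):
    """Flag rows whose squared Mahalanobis distance exceeds chi2(alpha, d)."""
    mu = X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    d = X - mu
    d2 = np.einsum("ij,jk,ik->i", d, inv_cov, d)   # squared distances
    return d2 > chi2.ppf(alpha, df=X.shape[1])
```

Note that the classical mean and covariance estimates are themselves distorted by the outliers, which is why robust estimators such as MCD and MVE are preferred in practice.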
Contextual Anomaly
• aka "conditional anomalies"*
[Figure: the same value is normal in one context and anomalous in another]
* Xiuyao Song, Mingxi Wu, Christopher Jermaine, Sanjay Ranka. Conditional Anomaly Detection. IEEE Transactions on Knowledge and Data Engineering, 2006.
Collective Anomaly
• A collection of abnormal observations
• Requires the existence of a certain type of relationship between the observations:
– Sequential
– Spatial
– Connectivity (graph)
• Each individual instance of a collective anomaly is not abnormal by itself
[Figure: subsequence anomaly in a time series]
Outlier Detection (1/4)
• Detection by inspecting frequency distributions and univariate measures of skewness and kurtosis
• Numerous detection techniques:
– Distributional univariate technique: values more than 3σ away from the mean
– Goodness-of-fit tests: tests for normality, χ² test, analysis of residuals, Q-Q plots, Kullback-Leibler divergence
– Control charts (X-Bar, R, CUSUM), error bounds, tolerance limits
– Regression-based techniques: measure the outlyingness of a model, not of an individual data point
– Geometric techniques: define layers of increasing depth; the outer layers contain the outlying points
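The first technique in the list, the 3σ rule, is a one-liner (a sketch; flagged points still need review, since legitimate extremes also trip the rule):

```python
import numpy as np

def three_sigma_outliers(x):
    """Distributional univariate rule: flag points more than 3 sigma from the mean."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return np.abs(x - mu) > 3 * sigma
```

The mean and σ are themselves inflated by the outliers, so robust variants (median and MAD) are often substituted.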
Outlier Detection Methods (2/4)
• Popular methods: LOF, INFLO, LOCI; see the tutorial of [Kriegel et al., 2009] and ELKI: http://elki.dbs.ifi.lmu.de/wiki
• Mixture distribution: anomaly detection over noisy data using learned probability distributions [Eskin, 2000]
• Entropy: discovering cluster-based local outliers [He, 2003]
• Projection into higher-dimensional space: kernel methods for pattern analysis [Shawe-Taylor, Cristianini, 2005]
Distance-based Outliers (3/4)
Nearest-neighbor-based approaches:
• A point O in a dataset is a DB(p,d)-outlier if at least a fraction p of the points in the dataset lies at a distance greater than d from O. [Knorr, Ng, 1998]
• Outliers are the top n points whose distance to the k-th nearest neighbor is greatest. [Ramaswamy et al., 2000]
Limitations
– When normal points do not have a sufficient number of neighbors
– In high-dimensional spaces, due to data sparseness
– When datasets have modes with varying density
– Computationally expensive
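A brute-force sketch of the [Ramaswamy et al., 2000] criterion (fine for small datasets; real implementations use index structures):

```python
import numpy as np

def knn_distance_outliers(X, k=3, n_out=1):
    """Rank points by distance to their k-th nearest neighbor; return top-n indices."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(D, np.inf)               # a point is not its own neighbor
    kth = np.sort(D, axis=1)[:, k - 1]        # distance to the k-th nearest neighbor
    return np.argsort(kth)[::-1][:n_out]
```

On four clustered points plus one far-away point, the far point's k-th-neighbor distance dominates and it ranks first.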
Density-based Outliers (4/4)
Goal: compute the local densities of particular regions and declare data points in low-density regions as potential anomalies
Methods
• Local Outlier Factor (LOF) [Breunig et al., 2000]
• Connectivity Outlier Factor (COF) [Tang et al., 2002]
• Multi-Granularity Deviation Factor [Papadimitriou et al., 2003]
Example: an NN-based approach flags O2 as an outlier but not O1; LOF flags O1 but not O2
Limitations
• Difficult choice between methods with contradictory results
• In high-dimensional spaces, factor values tend to cluster because density is defined in terms of distance
How to Handle Missing Data?
– Inclusion (applicable for less than 15%)
• Anomalies are treated as a specific category
– Deletion
• List-wise deletion omits the complete record (for less than 2%)
• Pair-wise deletion excludes only the anomalous value from a calculation
– Substitution (applicable for less than 15%)
• Single imputation based on mean, mode, or median replacement
• Linear regression imputation
• Multiple imputation (MI)
• Full Information Maximum Likelihood (FIML)
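Single median imputation, the simplest substitution above, can be sketched as follows (NaN marks a missing value):

```python
import numpy as np

def impute_median(x):
    """Single imputation: replace NaN entries with the median of the observed values."""
    x = np.asarray(x, dtype=float)
    filled = x.copy()
    filled[np.isnan(filled)] = np.nanmedian(x)
    return filled
```

Single imputation shrinks the variance and distorts correlations, which is why multiple imputation and FIML are preferred for inference.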
How to Handle Dirty Data?
• Binning / smoothing
– First sort the data and partition it into bins
– Then smooth by bin means, bin medians, bin boundaries, etc.
• Clustering
– Detect and remove outliers
• Combined computer and human inspection
– Detect suspicious values and have them checked by a human
• Regression
– Smooth by fitting the data with regression functions
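Smoothing by bin means, as described above, can be sketched as follows (equal-frequency bins on the sorted data; `np.array_split` handles ragged splits):

```python
import numpy as np

def smooth_by_bin_means(x, n_bins):
    """Sort, partition into equal-depth bins, replace every value by its bin mean."""
    x = np.sort(np.asarray(x, dtype=float))
    bins = np.array_split(x, n_bins)
    return np.concatenate([np.full(len(b), b.mean()) for b in bins])
```

For instance, sorting [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34] into 3 bins of 4 replaces each bin by its mean (9, 22.75, 29.25).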
Discretization (Binning) (1/3)
Goal
Transform continuous variables into a set of ranges treated as (ordered) categories
Advantages
– Simultaneous analysis of quantitative and qualitative variables
– Ability to capture non-linear correlations between continuous variables
– Neutralize extreme values
– Handle missing values with the creation of a specific category
– Cardinality reduction
Discretization (Binning) (2/3)
Recommendations
– Avoid large differences between the numbers of distinct values (categories) per variable
– Avoid categories with small populations
– The appropriate number of categories for a discrete or categorical variable is 4 or 5
– Remember:
• the weight of a variable is proportional to its number of distinct values
• the weight of a category is inversely proportional to its population
– Cardinality reduction applies to observations, variables, and categories:
• very few variables implies possible information loss
• too many variables implies very small populations and less interpretable results
Binning Methods (3/3)
• Equal-width (distance) partitioning:
– Divides the range into N intervals of equal size: a uniform grid
– If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B-A)/N
– The most straightforward approach
– But outliers may dominate the presentation
– Skewed data is not handled well
• Equal-depth (frequency) partitioning:
– Divides the range into N intervals, each containing the same number of samples
– Good data scaling
– Managing categorical attributes can be tricky
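The two partitioning schemes can be sketched as follows (edge computation only; assigning values to bins is then a matter of `np.digitize`):

```python
import numpy as np

def equal_width_edges(x, n):
    """W = (B - A) / N: N intervals of equal size between the min A and max B."""
    a, b = float(min(x)), float(max(x))
    w = (b - a) / n
    return [a + i * w for i in range(n + 1)]

def equal_depth_edges(x, n):
    """Quantile edges: each interval holds roughly the same number of samples."""
    return list(np.quantile(np.asarray(x, dtype=float), np.linspace(0.0, 1.0, n + 1)))
```

On skewed data, the equal-width edges are dominated by the extremes, while the quantile edges adapt to where the mass actually lies, which is exactly the trade-off described above.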
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
• Attribute/feature construction
– New attributes constructed from the given ones
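Two of the normalizations listed above, sketched (population standard deviation for the z-score; some libraries use the sample standard deviation instead):

```python
import numpy as np

def min_max(x, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly rescale to [new_min, new_max]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

def z_score(x):
    """Z-score normalization: zero mean, unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()
```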
Summary
• Data preparation is a big issue for both data warehousing and mining
• Data preparation includes:
– Anomaly detection
– Data cleaning
– Data transformation
– Discretization
– Data reduction and feature selection
• A lot of methods have been developed: an extremely active area of research
Conclusions
• Still a lot needs to be done to offer:
– An iterative process with performance and quality guarantees
– Benchmarks
– Optimization
– Formalized guidelines and rigorous methodologies
– User assistance
• Iterative detection and cleaning, exploiting patterns and dependencies among anomalies:
– Duplicates: deduplication
– Outliers: uni- and multivariate detection
– Missing data: imputation
– Inconsistent data: constraints
[Figure: iterative Detection / Explanation / Cleaning loop]
Any questions?
Limited Bibliography
References
• Tutorials
– BATINI, CARLO, CATARCI, TIZIANA, & SCANNAPIECO, MONICA. 2004. A Survey of Data Quality Issues in Cooperative Systems. Tutorial of the 23rd International Conference on Conceptual Modeling, ER 2004.
– KOUDAS, NICK, SARAWAGI SUNITA, SRIVASTAVA DIVESH. Record Linkage: Similarity Measures and Algorithms.Tutorial of SIGMOD 2006.
• Books
– NAUMANN, FELIX. Quality-Driven Query Answering for Integrated Information Systems. Lecture Notes in Computer Science, vol. 2261. Springer-Verlag, 2002.
– BATINI, CARLO, & SCANNAPIECO, MONICA. Data Quality: Concepts, Methodologies and Techniques. Data-Centric Systems and Applications. Springer-Verlag, 2006.
– DASU, TAMRAPARNI, & JOHNSON, THEODORE. Exploratory Data Mining and Data Cleaning. John Wiley, 2003.
– WANG, RICHARD Y., ZIAD, MOSTAPHA, & LEE, YANG W. Data Quality. Advances in Database Systems, vol. 23. Kluwer Academic Publishers, 2002.
• Data Profiling
– DASU, TAMRAPARNI, JOHNSON, THEODORE, MUTHUKRISHNAN, S., SHKAPENYUK, V. Mining Database Structure; Or, How to Build a Data Quality Browser. Proc. SIGMOD Conf., 2002.
– CARUSO, FRANCESCO, COCHINWALA, MUNIR, GANAPATHY, UMA, LALK, GAIL, & MISSIER, PAOLO. 2000. Telcordia's Database Reconciliation and Data Quality Analysis Tool. Pages 615–618 of: Proceedings of 26th International Conference on Very Large Data Bases, VLDB 2000. Cairo, Egypt.
References
• ETL
– CHRISTEN, PETER. Febrl: an open source data cleaning, deduplication and record linkage system with a graphical user interface. KDD 2008: 1065-1068, 2008.
– CHRISTEN, PETER, CHURCHES, TIM, ZHU, XI. Probabilistic name and address cleaning and standardization. Australasian Data Mining Workshop 2002.
– RAHM, E., DO, H.H., Data Cleaning: Problems and Current Approaches, Data Engineering Bulletin 23(4) 3-13, 2000.
– GALHARDAS, HELENA, FLORESCU, DANIELA, SHASHA, DENNIS, SIMON, ERIC, SAITA, CRISTIAN-AUGUSTIN. Declarative Data Cleaning: Language, Model, and Algorithms, Proc. VLDB Conf., pp. 371-380, 2001.
– JOHNSON THEODORE, MARATHE, AMIT, DASU TAMRAPARNI. Database Exploration and Bellman. IEEE Data Eng. Bull. 26(3): 34-39,2003.
– VASSILIADIS, PANOS, VAGENA Z., SKIADOPOULOS S., KARAYANNIDIS N. and SELLIS, T. ARKTOS: A Tool For Data Cleaning and Transformation in Data Warehouse Environments. Bulletin of the Technical Committee on Data Engineering, vol. 23, no. 4, pp. 42-47, December 2000.
– VASSILIADIS, PANOS, KARAGIANNIS ANASTASIOS, TZIOVARA, VASILIKI, SIMITSIS, ALKIS. Towards a Benchmark for ETL Workflows. QDB 2007: 49-60, 2007.
– ELFEKY, MOHAMED G., ELMAGARMID, AHMED K., & VERYKIOS, VASSILIOS S. TAILOR: A Record Linkage Tool Box. Pages 17–28 of: Proceedings of the 18th International Conference on Data Engineering, ICDE 2002. San Jose, CA, USA, 2002.
– ELMAGARMID, AHMED K., IPEIROTIS, PANAGIOTIS G., & VERYKIOS, VASSILIOS S. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng., 19(1), 1–16, 2007.
References
• ETL (continued)
– LIM, EE-PENG, SRIVASTAVA, JAIDEEP, PRABHAKAR, SATYA, & RICHARDSON, JAMES. 1993. Entity Identification in Database Integration. Pages 294–301 of: Proceedings of the 9th International Conference on Data Engineering, ICDE 1993. Vienna, Austria.
– LOW, WAI LUP, LEE, MONG-LI, & LING, TOK WANG. 2001. A Knowledge-Based Approach for Duplicate Elimination in Data Cleaning. Inf. Syst., 26(8), 585–606.
– SIMITSIS, ALKIS, VASSILIADIS, PANOS, & SELLIS, TIMOS K. 2005. Optimizing ETL Processes in Data Warehouses. Pages 564–575 of: Proceedings of the 21st International Conference on Data Engineering, ICDE 2005. Tokyo, Japan.
– TEJADA, SHEILA, KNOBLOCK, CRAIG A., & MINTON, STEVEN. 2002. Learning Domain-Independent String Transformation Weights for High Accuracy Object Identification. Pages 350–359 of: Proceedings of the 8thACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2002. Edmonton, AL, Canada.
– A. Simitsis, P. Vassiliadis, T. K. Sellis. State-Space Optimization of ETL Workflows. IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE) vol. 17, no. 10, pp. 1404-1419, October 2005.
• Approximate String Matching
– NAVARRO, GONZALO. 2001. A Guided Tour to Approximate String Matching. ACM Comput. Surv., 33(1), 31–88.
– GRAVANO, LUIS, IPEIROTIS, PANAGIOTIS G., JAGADISH, H. V., KOUDAS, NICK, MUTHUKRISHNAN, S., PIETARINEN, LAURI, & SRIVASTAVA, DIVESH. 2001. Using q-grams in a DBMS for Approximate String Processing. IEEE Data Eng. Bull., 24(4), 28–34.
References
• Record Linkage
– ANANTHAKRISHNA, ROHIT, CHAUDHURI, SURAJIT, & GANTI, VENKATESH. Eliminating Fuzzy Duplicates in Data Warehouses. pp. 586–597, Proc. of VLDB 2002.
– BAXTER, ROHAN A., CHRISTEN, PETER, & CHURCHES, TIM. A Comparison of Fast Blocking Methods for Record Linkage. Pages 27–29 of: Proceedings of the KDD’03 Workshop on Data Cleaning, Record Linkage and Object Consolidation, 2003.
– BILENKO, MIKHAIL, BASU, SUGATO, & SAHAMI, MEHRAN. 2005. Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping. Pages 58–65 of: Proceedings of the 5th IEEE International Conference on Data Mining, ICDM 2005. Houston, TX, USA, 2005.
– BHATTACHARYA, INDRAJIT, & GETOOR, LISE. Iterative Record Linkage for Cleaning and Integration. Pages 11–18 of: Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, DMKD, 2004.
– FELLEGI, IVAN P., & SUNTER, A.B. A Theory for Record Linkage. Journal of the American Statistical Association, 64, 1183–1210, 1969.
– WINKLER, WILLIAM E. The State of Record Linkage and Current Research Problems. Tech. Rept. Statistics of Income Division, Internal Revenue Service Publication R99/04. U.S. Bureau of the Census, Washington, DC, USA, 1999.
– WINKLER, WILLIAM E. Methods for Evaluating and Creating Data Quality.Inf. Syst., 29(7), 531–550, 2004.
– WINKLER, WILLIAM E., & THIBAUDEAU, YVES. An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census. Tech. Rept. Statistical Research Report Series RR91/09. U.S. Bureau of the Census,Washington,DC, USA, 1991.
References
• Duplicate Detection
– HERNANDEZ, M., STOLFO, S., The Merge/Purge Problem for Large Databases, Proc. SIGMOD Conf pg 127-135, 1995.
– HERNANDEZ, M., STOLFO, S., Real-World Data is Dirty: Data Cleansing and the Merge/Purge Problem, Data Mining and Knowledge Discovery, 2(1)9-37, 1998.
– BILENKO, MIKHAIL, & MOONEY, RAYMOND J. Adaptive Duplicate Detection Using Learnable String Similarity Measures. Proc. of the 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pp. 39–48, Washington, DC, USA, 2003.
– BILKE, ALEXANDER, BLEIHOLDER, JENS, BÖHM, CHRISTOPH, DRABA, KARSTEN, NAUMANN, FELIX, &WEIS, MELANIE. 2005. Automatic Data Fusion with HumMer. of: Proc. of the 31st Intl. Conf. on Very Large Data Bases, VLDB 2005, pp. 1251–1254 Trondheim, Norway.
– CHAUDHURI, SURAJIT, GANTI, VENKATESH, &KAUSHIK, RAGHAV. 2006. A Primitive Operator for Similarity Joins in Data Cleaning. Page 5 of: Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006. Atlanta, GA, USA.
– GRAVANO, LUIS, IPEIROTIS, PANAGIOTIS G., KOUDAS, NICK, & SRIVASTAVA, DIVESH. Text Joins for Data Cleansing and Integration in an RDBMS. Proc.of the 19th Intl. Conf. on Data Engineering, ICDE 2003, pp. 729–731, Bangalore, India, 2003.
– MCCALLUM, ANDREW, NIGAM, KAMAL, &UNGAR, LYLE H. 2000. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. Proc. of the 6th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, KDD 2000, pp. 169–178. Boston, MA, USA.
– MONGE, ALVARO E. 2000. Matching Algorithms within a Duplicate Detection System. IEEE Data Eng. Bull., 23(4), 14–20.
– WEIS, MELANIE, & NAUMANN, FELIX. 2004. Detecting Duplicate Objects in XML.
– WEIS, MELANIE, NAUMANN, FELIX, & BROSY, FRANZISKA. 2006. A Duplicate Detection Benchmark for XML (and Relational) Data. Proc. of the 3rd Intl. ACM SIGMOD 2006 Workshop on Information Quality in Information Systems, IQIS 2006. Chicago, IL, USA.
References
• Data Preparation
– STATNOTES: Topics in Multivariate Analysis. Retrieved 10/17/2008 from http://www2.chass.ncsu.edu/garson/pa765/statnote.htm
– KLINE, R.B., Data Preparation and Screening, Chapter 3. in Principles and Practice of Structural Equation
Modeling, NY: Guilford Press, pp. 45-62, 2005.
– BANSAL, NIKHIL, BLUM, AVRIM, and CHAWLA, SHUCHI. Correlation clustering. Machine Learning, 56(1-3):89–113, 2004.
– PARSONS, SIMON. Current Approaches to Handling Imperfect Information in Data and Knowledge Bases. IEEE Trans. Knowl. Data Eng., 8(3), 353–372, 1996.
– PEARSON, RONALD K. The problem of disguised missing data. SIGKDD Explorations 8(1): 83-92, 2006.
– PEARSON, RONALD K. Surveying Data for Patchy Structure. SDM 2005
– PEARSON, RONALD K. Mining Imperfect Data: Dealing with Contamination and Incomplete Records. Philadelphia: SIAM 2005.
References
• Geometric Outliers
– PREPARATA, F. P., & SHAMOS, M. I. Computational Geometry: An Introduction. Springer-Verlag, 1988.
• Distributional Outliers
– KNORR, EDWIN M., & NG, RAYMOND T. Algorithms for Mining Distance-Based Outliers in Large Datasets. Proc. of the 24th International Conference on Very Large Data Bases, VLDB 1998, pp. 392–403. New York City, NY, USA, 1998.
– BREUNIG, MARKUS M., KRIEGEL, HANS-PETER, NG, RAYMOND T., & SANDER, JÖRG. LOF: Identifying Density-Based Local Outliers. Proc. of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104. Dallas, TX, USA, 2000.
• Missing Value Imputation
– SCHAFER, J. L., Analysis of Incomplete Multivariate Data. New York: Chapman and Hall, 1997.
– LITTLE, R. J. A. and RUBIN, D. B., Statistical Analysis with Missing Data. New York: John Wiley & Sons, 1987.
– McKNIGHT, P. E., FIGUEREDO, A. J., SIDANI, S., Missing Data: A Gentle Introduction. Guilford Press, 2007.
– DEMPSTER, ARTHUR PENTLAND, LAIRD, NAN M., & RUBIN, DONALD B. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, 39, 1–38, 1977.