Big Data Profiling
![Page 1: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/1.jpg)
Big Data Profiling Fribourg May 2014
Felix Naumann
![Page 2: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/2.jpg)
The Hasso Plattner Institute
■ Founded in 1998 as a Public Private Partnership
■ Hasso Plattner, co-founder of SAP, endowed more than 200 million euros.
■ Affiliated with the University of Potsdam
■ 500 students
□ BA, MA, PhD
■ Enterprise Platform and Integration Concepts
■ Internet Technologies and Systems
■ Human Computer Interaction
■ Computer Graphics Systems
■ Operating Systems and Middleware
■ Business Process Technology
■ Software Architecture
■ Information Systems
■ System Engineering and Modeling
■ School of Design Thinking
Felix Naumann | Data Profiling | CUSO 2014
![Page 3: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/3.jpg)
Research Topics
■ Data Profiling and Analytics
■ Data Quality and Data Cleansing
■ Similarity Search and ETL Management
■ Knowledge Discovery and Text Extraction
■ (Linked) Open Data Integration
■ For more information on research topics and on teaching, please
see http://www.hpi.uni-potsdam.de/naumann/home.html
![Page 4: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/4.jpg)
Profiling in Spreadsheets
![Page 5: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/5.jpg)
![Page 6: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/6.jpg)
![Page 7: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/7.jpg)
![Page 8: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/8.jpg)
![Page 9: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/9.jpg)
![Page 10: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/10.jpg)
![Page 11: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/11.jpg)
Many interesting questions remain
■ What are possible keys and foreign keys?
□ Phone
□ firstname, lastname, street
■ Are there any functional dependencies?
□ zip -> city
□ race -> voting behavior
■ Which columns correlate?
□ county and first name
□ DoB and last name
■ What are frequent patterns in a column?
□ ddddd
□ dd aaaa St
![Page 12: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/12.jpg)
Definition Data Profiling
■ Data profiling is the process of examining the data available in an
existing data source [...] and collecting statistics and information
about that data.
Wikipedia 09/2013
■ Data profiling refers to the activity of creating small but
informative summaries of a database.
Ted Johnson, Encyclopedia of Database Systems
■ A fixed set of data profiling tasks / results
![Page 13: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/13.jpg)
„Big“ Data Profiling or How big is „Big“?
Data profiling = measuring the „Vs“
■ Volume
□ Row counts, etc.
■ Velocity
□ Temporal profiling
■ Variability
□ How difficult to integrate and analyse
■ Veracity
□ How good is it?
■ …
[Figure: Big Data and its Vs: Volume, Velocity, Variety, Veracity, Viscosity, Virality]
![Page 14: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/14.jpg)
Use Cases for Profiling
■ Query optimization
□ Counts and histograms
■ Data cleansing
□ Patterns, rules, and violations
■ Data integration
□ Cross-DB inclusion dependencies
■ Scientific data management
□ Handle new datasets
■ Data inspection, analytics, and mining
□ Profiling as preparation to decide on models and questions
■ Database reverse engineering
■ Data profiling as preparation for any other data management task
![Page 15: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/15.jpg)
Classification of Traditional Profiling Tasks
■ Data profiling
□ Single column
◊ Cardinalities
◊ Patterns and data types
◊ Value distributions
□ Multiple columns
◊ Uniqueness
● Key discovery
● Conditional
● Partial
◊ Inclusion dependencies
● Foreign key discovery
● Conditional
● Partial
◊ Functional dependencies
● Conditional
● Partial
![Page 16: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/16.jpg)
Single-column vs. multi-column
■ Single column profiling
□ Most basic form of data profiling
□ Often part of the basic statistics gathered by DBMS
□ Discovery complexity: Number of values/rows
■ Multicolumn profiling
□ Discover joint properties
□ Discover dependencies
□ Discovery complexity: Number of columns and number of
values
![Page 17: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/17.jpg)
Scalable profiling
■ Scalability in number of rows
■ Scalability in number of columns
□ “Small” table with 100 columns:
2^100 – 1 = 1,267,650,600,228,229,401,496,703,205,375
≈ 1.3 nonillion column combinations
◊ Impossible to check or even enumerate
■ Possible solutions
□ Scale up: More RAM, faster CPUs
◊ Expensive
□ Scale in: More cores
◊ More complex (threading)
□ Scale out: More machines
◊ Communication overhead
□ Intelligent enumeration and aggressive pruning
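The column-combination count above can be checked directly. This is a small sketch; the function name `column_combinations` is chosen here for illustration and is not from the slides.

```python
def column_combinations(num_columns: int) -> int:
    """Number of non-empty column subsets of a table with num_columns columns."""
    return 2 ** num_columns - 1


# Python integers are arbitrary precision, so the exact value is printed.
print(f"{column_combinations(100):,}")
```

For the four-column lattice (A, B, C, D) used later in the slides, the same formula gives 15 nodes.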
![Page 18: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/18.jpg)
Challenges of (Big) Data Profiling
■ Computational complexity
□ Number of rows
□ Number of columns (and column combinations)
■ Large solution space
■ New data types (beyond strings and numbers)
■ New data models (beyond relational): RDF, XML, etc.
■ New requirements
□ User-oriented
□ Interactive
□ Streaming data
![Page 19: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/19.jpg)
Agenda
■ Basic statistics
■ Functional dependencies
■ Keys and foreign keys
■ Data profiling tools
■ Advanced profiling
![Page 20: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/20.jpg)
Cardinalities, distributions, and patterns
Profiling tasks by category (task: description)
■ Cardinalities
□ num-rows: Number of rows
□ value length: Measurements of value lengths (min, max, median, and average)
□ null values: Number or percentage of null values
□ distinct: Number of distinct values; aka “cardinality”
□ uniqueness: Number of distinct values divided by number of rows
■ Value distributions
□ histogram: Frequency histograms (equi-width, equi-depth, etc.)
□ constancy: Frequency of most frequent value divided by number of rows
□ quartiles: Three points that divide the (numeric) values into four equal groups
□ soundex: Distribution of soundex codes
□ first digit: Distribution of first digit in numeric values; to check Benford's law
■ Patterns, data types, and domains
□ basic type: Generic data type: numeric, alphabetic, date, time
□ data type: Concrete DBMS-specific data type: varchar, timestamp, etc.
□ decimals: Maximum number of decimal places in numeric values
□ precision: Maximum number of digits in numeric values
□ patterns: Histogram of value patterns (Aa9…)
□ data class: Semantic, generic data type: code, indicator, text, date/time, quantity, identifier, etc.
□ domain: Classification of semantic domain: credit card, first name, city, phenotype, etc.
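A few of these single-column tasks can be sketched in a handful of lines. The sketch below is illustrative only: the function names, the use of `None` for nulls, and the choice to exclude nulls from the distinct count are assumptions, not definitions from the slides.

```python
from collections import Counter


def value_pattern(value: str) -> str:
    """Map uppercase letters to 'A', lowercase to 'a', digits to '9'; keep the rest."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)


def profile_column(values):
    """Basic statistics for one column; values are strings, None marks a null."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    num_rows = len(values)
    return {
        "num_rows": num_rows,
        "null_values": num_rows - len(non_null),
        "distinct": len(counts),  # assumption: nulls are not counted as a value
        "uniqueness": len(counts) / num_rows if num_rows else 0.0,
        "constancy": counts.most_common(1)[0][1] / num_rows if counts else 0.0,
        "patterns": Counter(value_pattern(v) for v in non_null),
    }
```

All of these need only one pass over the column plus a hash table of distinct values, which is why discovery complexity for single-column profiling depends on the number of rows and values rather than on the number of columns.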
![Page 21: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/21.jpg)
Data types and value patterns
■ String vs. number
■ String vs. number vs. date
■ Categorical vs. continuous
■ SQL data types
□ CHAR, INT, DECIMAL, TIMESTAMP, BIT, CLOB, …
■ Domains
□ VARCHAR(12) vs. VARCHAR (13)
■ XML data types
□ More fine grained
■ Regular expressions (\d{3})-(\d{3})-(\d{4})-(\d+)
■ Semantic domains
□ Address, phone, email, first name
[Arrow annotation along the list: increasing semantics]
![Page 22: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/22.jpg)
An Aside: Benford Law Frequency (“first digit law”)
■ Statement about the distribution of first digits d in (many)
naturally occurring numbers:
□ $P(d) = \log_{10}(d + 1) - \log_{10}(d) = \log_{10}\left(1 + \frac{1}{d}\right)$
□ Holds if log(x) is uniformly distributed
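The first-digit check can be sketched as follows. The helper names are hypothetical; the expected distribution is the formula P(d) = log10(1 + 1/d) from the slide, and the observed distribution is taken from the leading significant digit of each non-zero number.

```python
import math
from collections import Counter


def benford_expected(d: int) -> float:
    """Expected share of numbers whose first digit is d (1..9)."""
    return math.log10(1 + 1 / d)


def first_digit_distribution(numbers):
    """Observed share of each first digit 1..9; zeros are skipped."""
    digits = []
    for x in numbers:
        if x == 0:
            continue
        # Scientific notation puts the leading significant digit first.
        digits.append(int(f"{abs(x):.15e}"[0]))
    total = len(digits)
    if total == 0:
        return {}
    counts = Counter(digits)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}
```

A profiling tool could compare the two distributions per numeric column and flag columns that deviate strongly; how to threshold that deviation is left open here.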
[Figure: bar chart of first-digit frequencies for the digits 1 to 9; y-axis from 0 to 40]
![Page 23: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/23.jpg)
Examples for Benford's Law
■ Surface areas of 335 rivers
■ Sizes of 3259 US populations
■ 104 physical constants
■ 1800 molecular weights
■ 5000 entries from a mathematical handbook
■ 308 numbers contained in an issue of Reader's Digest
■ Street addresses of the first 342 persons listed in American Men of Science
Heights of the 60 tallest structures
http://en.wikipedia.org/wiki/List_of_tallest_buildings_and_structures_in_the_world#Tallest_structure_by_category
![Page 24: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/24.jpg)
Agenda
■ Basic statistics
■ Functional dependencies
■ Keys and foreign keys
■ Data profiling tools
■ Advanced profiling
![Page 25: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/25.jpg)
Naive Discovery Approach
■ Functional dependency „X → A“: whenever two records have the
same X values, they also have the same A values.
■ Given relation R, detect all minimal, non-trivial FDs X → A.
■ For each column combination X
□ For each pair of tuples (t1,t2)
◊ If t1[X\A] = t2[X\A] and t1[A] ≠ t2[A]: Break (X\A → A does not hold)
■ Complexity
□ Exponential in number of attributes
□ times number of rows squared
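The naive approach can be written down almost literally. This is a sketch under assumptions: rows are dictionaries keyed by column name, left-hand sides are non-empty, and the function names are made up for illustration.

```python
from itertools import combinations


def fd_holds_naive(rows, lhs, rhs):
    """Check lhs -> rhs by comparing all tuple pairs (quadratic in the rows)."""
    for t1, t2 in combinations(rows, 2):
        if all(t1[c] == t2[c] for c in lhs) and t1[rhs] != t2[rhs]:
            return False  # counter-example found
    return True


def discover_fds_naive(rows, columns):
    """Enumerate minimal, non-trivial FDs; exponential in the number of columns."""
    found = []
    for rhs in columns:
        others = [c for c in columns if c != rhs]  # non-trivial: rhs not in lhs
        for size in range(1, len(others) + 1):
            for lhs in combinations(others, size):
                # Skip supersets of an lhs that already determines rhs.
                if any(a == rhs and set(prev) <= set(lhs) for prev, a in found):
                    continue
                if fd_holds_naive(rows, lhs, rhs):
                    found.append((lhs, rhs))
    return found
```

Enumerating left-hand sides by increasing size is what makes the subset test sufficient for minimality: any smaller valid left-hand side has already been recorded when a superset is reached.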
![Page 26: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/26.jpg)
Tane – General Idea [HKPT99]
■ Two elements of approach
1. Reduce column combinations through pruning
◊ Reasoning over FDs
2. Reduce tuple sets through partitioning
◊ Partition tuple IDs according to attribute values
◊ Level-wise increase of size of attribute set
● Consider sets of tuples whose values agree on that set
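The partitioning idea can be illustrated with a simplified check: X → A holds exactly when grouping the tuples by X yields as many groups as grouping them by X ∪ {A}. The sketch below recomputes partitions from scratch; it does not reproduce the partition products or the pruning rules of the published algorithm, and the function names are assumptions.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group tuple IDs by their values on attrs."""
    groups = defaultdict(list)
    for tid, row in enumerate(rows):
        groups[tuple(row[a] for a in attrs)].append(tid)
    return list(groups.values())


def fd_holds(rows, lhs, rhs):
    """lhs -> rhs holds iff adding rhs to lhs does not split any group."""
    return len(partition(rows, list(lhs))) == len(partition(rows, list(lhs) + [rhs]))
```

Compared with the pairwise test, each check is linear in the number of rows, and partitions computed for one lattice level can in principle be reused on the next.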
![Page 27: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/27.jpg)
Discovery strategy
■ Bottom up traversal through lattice
□ only minimal dependencies
□ Pruning
□ Re-use results from previous level
■ For a set X, test all X\A → A, A ∈ X
□ only non-trivial dependencies
□ Test on efficient data structure
A B C D
AB AC AD BC BD CD
ABC ABD ACD BCD
ABCD
![Page 28: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/28.jpg)
Functional Dependencies: State of the Art
![Page 29: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/29.jpg)
Partial and conditional dependencies
■ Partial dependency: dependencies that do not perfectly hold
□ For all but 10 of the tuples
□ Only for 90% of the tuples
□ Only for 1% of the tuples
■ Partiality also for patterns, types, uniques, and other constraints
■ Given a partial dependency: For which part does it hold?
□ Expressed as a condition over the attributes of the relation
■ Problems:
□ Infinite possibilities of conditions
□ Interestingness:
◊ Many distinct values: less interesting
◊ Few distinct values: surprising condition – high coverage
■ Useful for
□ Integration: cross-source conditional inclusion dependencies
![Page 30: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/30.jpg)
Agenda
■ Basic statistics
■ Functional dependencies
■ Keys and foreign keys
■ Data profiling tools
■ Advanced profiling
![Page 31: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/31.jpg)
Uniqueness, keys, and foreign keys
■ Uniqueness and keys
□ Unique column: Only unique values
□ Unique column combination: Only unique value combinations
◊ Minimality: No subset is unique
□ Key candidate: No null values
◊ Uniqueness and non-null in one instance does not imply key: only a human can specify keys (and foreign keys)
■ Inclusion dependencies and foreign keys
□ A ⊆ B: All values in A are also present in B
□ A1,…,Ai ⊆ B1,…,Bi: All value combinations in A1,…,Ai are also present in B1,…,Bi
□ Prerequisite for foreign key
◊ Across relations and across databases
◊ Again: Discovery on a given instance, only user can specify for schema
![Page 32: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/32.jpg)
Uniqueness and keys
■ Unique column
□ Only unique values
■ Unique column combination
□ Only unique value combinations
□ Minimality: No subset is unique
■ Uniques: {A, AB, AC, BC, ABC}
■ Minimal uniques: {A, BC}
■ (Maximal) Non-uniques: {B, C}
A B C
a 1 x
b 2 x
c 2 y
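The three-row example can be verified by brute force. This sketch enumerates every column combination, which is only feasible for tiny tables; the function names are illustrative.

```python
from itertools import combinations


def is_unique(rows, cols):
    """True if no two rows agree on all columns in cols."""
    projected = [tuple(r[c] for c in cols) for r in rows]
    return len(set(projected)) == len(projected)


def classify(rows, columns):
    """Return (uniques, minimal uniques, maximal non-uniques) as frozensets."""
    uniques, non_uniques = [], []
    for size in range(1, len(columns) + 1):
        for cols in combinations(columns, size):
            target = uniques if is_unique(rows, cols) else non_uniques
            target.append(frozenset(cols))
    minimal = [u for u in uniques if not any(v < u for v in uniques)]
    maximal = [n for n in non_uniques if not any(n < m for m in non_uniques)]
    return uniques, minimal, maximal


rows = [
    {"A": "a", "B": 1, "C": "x"},
    {"A": "b", "B": 2, "C": "x"},
    {"A": "c", "B": 2, "C": "y"},
]
uniques, minimal, maximal = classify(rows, ["A", "B", "C"])
```

On this table the result matches the slide: five uniques, the minimal uniques A and BC, and the maximal non-uniques B and C.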
![Page 33: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/33.jpg)
Null values
■ Null values have a wide range of interpretations.
□ Unknown (date of birth)
□ Non-applicable (driver license number for kids)
□ Undefined (result of integration/outer join)
■ What are minimal uniques for the following data set?
■ Primary key {A}; Some unusual uniques: {C} and {CD}
■ Distinct: {A, BC} but not {CD}
A B C D
a 1 x 1
b 2 y 2
c 3 z 5
d 3 ⊥ 5
e ⊥ ⊥ 5
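The difference between the two readings can be made concrete. The sketch below assumes that ⊥ is represented as `None`, that the "unique" reading follows the common SQL UNIQUE behaviour in which rows containing a null never collide, and that the "distinct" reading treats two nulls as equal; both function names are hypothetical.

```python
def is_unique_sql(rows, cols):
    """UNIQUE-style semantics: a row with a null in cols cannot cause a violation."""
    seen = set()
    for r in rows:
        key = tuple(r[c] for c in cols)
        if None in key:
            continue
        if key in seen:
            return False
        seen.add(key)
    return True


def is_unique_distinct(rows, cols):
    """DISTINCT-style semantics: null equals null."""
    keys = [tuple(r[c] for c in cols) for r in rows]
    return len(set(keys)) == len(keys)


rows = [
    {"A": "a", "B": 1, "C": "x", "D": 1},
    {"A": "b", "B": 2, "C": "y", "D": 2},
    {"A": "c", "B": 3, "C": "z", "D": 5},
    {"A": "d", "B": 3, "C": None, "D": 5},
    {"A": "e", "B": None, "C": None, "D": 5},
]
```

Under the first reading C and CD count as unique, which is the "unusual" result named on the slide; under the second reading they do not, while A and BC remain unique.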
![Page 34: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/34.jpg)
Pruning effect of a pair
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
[Legend: minimal unique; unique]
![Page 35: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/35.jpg)
Pruning with uniques
■ Pruning: inferring the type of a combination without actual
verification
■ If A is unique, supersets must be unique
■ Finding a unique column prunes half of the lattice
□ Remove column from initial data set and restart
■ Finding a unique column pair removes a quarter of the lattice
□ In general, the lattice over the combination is removed
■ The pruning power of a combination is reduced by prior findings
□ AB prunes a quarter
□ BC additionally prunes only one eighth
□ The lattice above ABC (one eighth) was already pruned by AB
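The pruning fractions can be reproduced by counting supersets. This sketch assumes columns are numbered 0..n-1 (A = 0, B = 1, and so on) and counts the fraction relative to all 2^n subsets, which is the convention under which one unique column prunes exactly half; the function name is illustrative.

```python
from itertools import combinations


def pruned_fraction(num_columns, known_uniques):
    """Fraction of all column subsets that contain some known unique."""
    known = [frozenset(u) for u in known_uniques]
    pruned = 0
    for size in range(num_columns + 1):
        for combo in combinations(range(num_columns), size):
            if any(u <= frozenset(combo) for u in known):
                pruned += 1
    return pruned / 2 ** num_columns
```

With five columns, `pruned_fraction(5, [{0, 1}])` is one quarter, and adding BC as `{1, 2}` raises the total to three eighths, so BC contributes only one additional eighth.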
![Page 36: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/36.jpg)
Pruning both ways
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
[Legend: minimal unique; unique; maximal non-unique; non-unique]
![Page 37: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/37.jpg)
TPCH – Uniques and Non-Uniques
[Figure: non-unique vs. unique column combinations for 8, 9, and 10 columns]
![Page 38: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/38.jpg)
Unique Column Combination Discovery
■ DUCC
□ Basic idea: random walk through lattice
□ Pick random superset if current combination is non-unique
□ Pick random subset otherwise
□ Lazy prune with previously visited nodes
Approaches: row-based, column-based, hybrid
Algorithms: Gordian [SBHR06], Apriori [GW99], HCA [AN11], DUCC [HQA+14], SWAN [AQN14]
![Page 39: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/39.jpg)
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
[Figure: DUCC random walk through the lattice above]
Legend: Minimum unique column combination candidate; Minimum unique column combination; Maximum non-unique column combination candidate; Maximum non-unique column combination; Pruned
Visited nodes: 10 out of 26
![Page 40: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/40.jpg)
Scaling the number of columns
■ NCVoter, 100k rows
![Page 41: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/41.jpg)
Scaling the number of rows
■ NCVoter, 15 columns
![Page 42: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/42.jpg)
Analysis of DUCC
■ Runtime mainly depends on size of solution set
■ Worst case: solution set in the middle
![Page 43: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/43.jpg)
Uniques and non-uniques in NC-voter data
■ A minimal unique: voter_reg_num, zip_code, race_code
■ A maximal non-unique: voter_reg_num, status_cd, voter_status_desc, reason_cd, voter_status_reason_desc, absent_ind, name_prefx_cd, name_sufx_cd, half_code, street_dir, street_type_cd, street_sufx_cd, unit_designator, unit_num, state_cd, mail_addr2, mail_addr3, mail_addr4, mail_state, area_cd, phone_num, full_phone_number, drivers_lic, race_code, race_desc, ethnic_code, ethnic_desc, party_cd, party_desc, sex_code, sex, birth_place, precinct_abbrv, precinct_desc, municipality_abbrv, municipality_desc, ward_abbrv, ward_desc, cong_dist_abbrv, cong_dist_desc, super_court_abbrv, super_court_desc, judic_dist_abbrv, judic_dist_desc, nc_senate_abbrv, nc_senate_desc, nc_house_abbrv, nc_house_desc, county_commiss_abbrv, county_commiss_desc, township_abbrv, township_desc, school_dist_abbrv, school_dist_desc, fire_dist_abbrv, fire_dist_desc, water_dist_abbrv, water_dist_desc, sewer_dist_abbrv, sewer_dist_desc, sanit_dist_abbrv, sanit_dist_desc, rescue_dist_abbrv, rescue_dist_desc, munic_dist_abbrv, munic_dist_desc, dist_1_abbrv, dist_1_desc, dist_2_abbrv, dist_2_desc, confidential_ind, age, vtd_abbrv, vtd_desc
![Page 44: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/44.jpg)
Dynamic Data: Challenges
■ Inserts may create new duplicate combinations
□ Minimal uniques (mUCs) might become non-unique
□ Maximal non-uniques (mNUCs) might lose maximality
■ Deletes remove duplicate value combinations
□ NUCs might become unique
□ mUCs might lose minimality
■ Idea
□ Leverage the knowledge of previously discovered mUCs and
mNUCs
□ Create appropriate indices
![Page 45: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/45.jpg)
SWAN Architecture [AQN14]
[Figure: SWAN architecture. Components: database (input dataset), repository (MUCS and MNUCS), inserts handler, uniqueness checker, deletes handler, duplicate checker. Indexes: MUCS-index, data-index, duplicate-index. Inputs: inserts and deletes; the repository is updated.]
![Page 46: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/46.jpg)
Scaling the Number of Columns
■ 100k rows and 10k inserts
[Figure: execution time (s) vs. number of columns (10 to 60) for Ducc, Gordian-Inc, and Swan]
![Page 47: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/47.jpg)
Stressing the Number of Inserts
■ TPCH with 16 columns and 5 million rows
■ Swan/Ducc combination is able to process larger datasets than Ducc on a static dataset
[Figure: execution time (s) vs. insert size relative to the initial dataset size (10% to 100%) for Ducc and Swan]
![Page 48: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/48.jpg)
Next steps
■ Finding primary keys
□ Uniqueness is a necessary criterion
□ No null values
□ Include other features
◊ Name includes “id”, number of columns
■ Partial uniques
□ 99.9% of the data unique
□ Useful to detect data errors
□ Gordian, HCA, and DUCC can be easily modified
■ Incremental discovery
![Page 49: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/49.jpg)
Inclusion Dependencies: Definition
■ INDs involve more than one relation.
■ Let D be a relational schema and let I be an instance of D.
■ R[A1, …, An] denotes the projection of I on attributes A1, …, An of relation R: R[A1, …, An] = π_{A1, …, An}(R)
■ IND: R[A1, …, An] ⊆ S[B1, …, Bn], where R, S are (possibly identical) relations of D.
□ The projections on R and S must have the same number of attributes.
■ An instance I of D satisfies the IND if I(R)[A1, …, An] ⊆ I(S)[B1, …, Bn]
■ Values of R: “dependent values”
■ Values of S: “referenced values”
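The definition translates into a set-containment test on projections. In this sketch, relations are lists of dictionaries and the function name `satisfies_ind` is an assumption made for illustration.

```python
def satisfies_ind(dependent_rows, dep_attrs, referenced_rows, ref_attrs):
    """True if every projected dependent tuple also occurs among the referenced ones."""
    if len(dep_attrs) != len(ref_attrs):
        raise ValueError("both sides of an IND need the same number of attributes")
    referenced = {tuple(r[a] for a in ref_attrs) for r in referenced_rows}
    return all(tuple(r[a] for a in dep_attrs) in referenced for r in dependent_rows)
```

For example, with an orders relation whose `cust` values all occur as `id` in a customers relation, `satisfies_ind(orders, ["cust"], customers, ["id"])` holds while the reverse direction typically does not.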
![Page 50: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/50.jpg)
IND types
■ Unary INDs
□ INDs on single attributes: R[A] ⊆ S[B]
■ n-ary INDs
□ INDs on multiple attributes: R[X] ⊆ S[Y]
■ Partial INDs
□ IND R[A] ⊆ S[B] is satisfied for x% of all tuples in R
□ IND R[A] ⊆ S[B] is satisfied for all but x tuples in R
■ Approximate INDs
□ IND R[A] ⊆ S[B] is satisfied with probability p.
□ Based on sampling or other heuristics
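Both formulations of a partial IND reduce to counting violating tuples. The sketch below works on plain value lists for a unary IND; the function name and the return shape are assumptions.

```python
def partial_ind(dependent_values, referenced_values):
    """Return (fraction of satisfied tuples, number of violating tuples)."""
    referenced = set(referenced_values)
    violations = sum(1 for v in dependent_values if v not in referenced)
    total = len(dependent_values)
    fraction = 1.0 if total == 0 else (total - violations) / total
    return fraction, violations
```

A caller would then compare the fraction against a threshold such as x%, or the violation count against an absolute bound of x tuples.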
![Page 51: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/51.jpg)
Motivation for IND discovery
■ General insight into data
■ Detect unknown foreign keys
■ Example
□ PDB: Protein Data Bank
□ OpenMMS provides relational schema
◊ Parses protein and nucleic acid macromolecular structure data from the standard mmCIF format.
□ 175 tables with primary key constraints
□ 2705 attributes
□ But: Not a single foreign key constraint!
![Page 52: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/52.jpg)
Motivation for IND discovery
■ Ensembl – genome database
□ shipped as MySQL dump files
□ more than 200 tables
□ Not a single foreign key constraint!
■ Why are FKs missing?
□ Lack of support for checking foreign key constraints in the
host system
◊ Example: Oracle did not support FKs up to v6
□ Fear that checking such constraints would impede database
performance
□ Lack of database knowledge within the development team
![Page 53: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/53.jpg)
SPIDER: Single Pass Inclusion DEpendency
Recognition [BLNT07]
■ Main ideas
□ Test all IND-candidate pairs in parallel.
□ Read attribute values only once.
□ Stop test of an IND-candidate after first counter-example.
□ Reduce number of value comparisons by specialized data structure.
□ No need to build inverted index.
■ Two steps:
□ Sort and deduplicate each attribute's values and write them to disk
◊ For each attribute: SELECT DISTINCT A FROM R ORDER BY A
□ Test all IND candidate pairs in parallel
![Page 54: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/54.jpg)
SPIDER by example
■ In each step: Intersect „attributes to process“ with each refs list of
previous step
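Sorted distinct values of attributes A, B, C:

| A | B | C |
|---|---|---|
| s |   | s |
| t | t | t |
| x |   |   |
| y | y | y |
|   |   | z |

| Step   | Attributes to process | dep A: refs | dep B: refs | dep C: refs |
|--------|-----------------------|-------------|-------------|-------------|
| Init   |                       | B,C         | A,C         | A,B         |
| Step 1 | A,C                   | C           | A,C         | A           |
| Step 2 | A,B,C                 | C           | A,C         | A           |
| Step 3 | A                     | ∅           | A,C         | A           |
| Step 4 | A,B,C                 | ∅           | A,C         | A           |
| Step 5 | C                     | ∅           | A,C         | ∅           |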
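The stepwise intersection can be sketched in memory as a merge over the sorted value lists. This is a simplified illustration, not the disk-based implementation of [BLNT07]: values are held in Python lists instead of sorted files, and the function name `spider` and its input shape are assumptions.

```python
import heapq


def spider(columns):
    """columns: dict mapping attribute name to its values.
    Returns the set of unary INDs as (dependent, referenced) pairs."""
    # Step 1: sort and deduplicate each attribute's values.
    sorted_values = {a: sorted(set(vals)) for a, vals in columns.items()}
    # Initially every other attribute is a candidate referenced attribute.
    refs = {a: set(columns) - {a} for a in columns}
    # Step 2: read all lists once, smallest value first.
    heap = [(vals[0], a, 0) for a, vals in sorted_values.items() if vals]
    heapq.heapify(heap)
    while heap:
        value = heap[0][0]
        group = set()  # the "attributes to process" for this value
        while heap and heap[0][0] == value:
            _, a, i = heapq.heappop(heap)
            group.add(a)
            if i + 1 < len(sorted_values[a]):
                heapq.heappush(heap, (sorted_values[a][i + 1], a, i + 1))
        for a in group:
            refs[a] &= group  # drop candidates that lack this value
    return {(dep, ref) for dep, cands in refs.items() for ref in cands}
```

Run on the example values, this yields B ⊆ A and B ⊆ C, which matches the refs that remain for B after the last step.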
![Page 55: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/55.jpg)
Problem: Automatic Determination of Foreign Keys
■ Given
□ Relational schema
□ Database instance of that schema
□ Complete set of (observed) inclusion dependencies
◊ Attributes A and B with R[A] ⊆ S[B] (in short A ⊆ B)
■ Find
□ All foreign key constraints: attributes A and B where A ⊆ B is a foreign key
■ Difficulty
□ Foreign keys are not intrinsic to data, but defined by humans
□ Discover semantics
■ Machine learning approach based on syntactic features [RAB+09]
![Page 56: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/56.jpg)
Features
■ DependentAndReferenced
□ Counts how often the dependent attribute A appears as referenced attribute in the set of all INDs.
□ Usually, a foreign key is not also a primary key that is referenced as foreign key by other tables.
■ MultiDependent
□ Counts how often A appears as dependent attribute in the set of all INDs.
□ If s(A) is contained in the set of values of many other attributes, the likelihood for each of these INDs being a FK is decreased.
■ MultiReferenced
□ Counts how often B appears as referenced attribute in the set of all INDs.
□ Often, primary keys are referenced by more than one foreign key.
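These three features are simple counts over the set of discovered INDs. In the sketch below, INDs are (dependent, referenced) pairs, the candidate foreign key is A ⊆ B, and the counts include the candidate itself; that last choice and the function name are assumptions.

```python
def ind_count_features(inds, a, b):
    """Count-based features for the candidate foreign key a ⊆ b."""
    inds = list(inds)
    return {
        # How often the dependent attribute a is itself referenced.
        "DependentAndReferenced": sum(1 for _, ref in inds if ref == a),
        # How often a appears as dependent attribute.
        "MultiDependent": sum(1 for dep, _ in inds if dep == a),
        # How often b appears as referenced attribute.
        "MultiReferenced": sum(1 for _, ref in inds if ref == b),
    }
```

The resulting dictionary could serve as part of a feature vector for a classifier; which learner to use is not fixed by this sketch.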
[Figure, shown once per feature: example columns A = (a), B = (a, b), C = (a), D = (a), with a candidate foreign key marked “?”]
![Page 57: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/57.jpg)
Features
■ DistinctDependentValues
□ The cardinality of s(A).
□ Usually, attributes that are foreign keys
contain at least some different values.
■ ValueLengthDiff
□ Difference between the average value length
(as string) in s(A) and s(B).
□ Usually, average length of the values is similar
whenever foreign keys reference a non-biased
sample of the primary keys.
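Both features can be computed from the value sets alone. The sketch assumes that s(A) denotes the set of distinct values of A and that lengths are measured on the string form of each value; the function names are illustrative.

```python
def distinct_dependent_values(dep_values):
    """Cardinality of s(A)."""
    return len(set(dep_values))


def value_length_diff(dep_values, ref_values):
    """Absolute difference of the average string lengths in s(A) and s(B)."""

    def avg_len(values):
        distinct = set(values)
        if not distinct:
            return 0.0
        return sum(len(str(v)) for v in distinct) / len(distinct)

    return abs(avg_len(dep_values) - avg_len(ref_values))
```

A dependent column that contains a single repeated value gets a DistinctDependentValues of 1, which makes it a weak foreign key candidate under the reasoning above.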
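[Example for DistinctDependentValues: A = (a, a, a, a, a), B = (a, b, c, d, e); is A ⊆ B a foreign key?]
[Example for ValueLengthDiff: A = (abab, abab, abab, c, d), B = (abab, b, c, d, e); is A ⊆ B a foreign key?]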
![Page 58: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/58.jpg)
Features
■ Coverage
□ The ratio of values in s(B) that are covered by s(A)
compared to all values in s(B).
□ Usually, foreign keys cover a considerable number of
primary key values.
◊ 60% of FK-attribute values cover all ref-values
◊ Each covers at least 10%
■ OutOfRange
□ Percentage of values in s(B) that are not within
[ min(s(A)), max(s(A)) ].
□ Usually, the dependent values should be evenly
distributed over the referenced values.
□ Mostly, less than 5% of values outside of range
■ TableSizeRatio
□ Ratio of number of tuples in A and number of tuples in B.
□ Usually in life sciences databases, table sizes do not
differ wildly
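The three features can be sketched as follows, again treating s(A) and s(B) as sets of distinct values. The function names are assumptions, and both value lists are expected to be non-empty and mutually comparable.

```python
def coverage(dep_values, ref_values):
    """Share of the values in s(B) that also occur in s(A)."""
    dep, ref = set(dep_values), set(ref_values)
    return len(dep & ref) / len(ref)


def out_of_range(dep_values, ref_values):
    """Share of the values in s(B) outside [min(s(A)), max(s(A))]."""
    dep, ref = set(dep_values), set(ref_values)
    lo, hi = min(dep), max(dep)
    return sum(1 for v in ref if not lo <= v <= hi) / len(ref)


def table_size_ratio(num_dependent_tuples, num_referenced_tuples):
    """Ratio of the tuple counts of the two tables."""
    return num_dependent_tuples / num_referenced_tuples
```

For a dependent column (b, c, b, c) and a referenced column (a, b, c, d, e, f, g), coverage is 2/7 and five of the seven referenced values fall outside the dependent range, so this pair would look unlikely to be a foreign key.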
[Example: A = (b, c, b, c), B = (a, b, c, d, e, f, g); is A ⊆ B a foreign key?]
![Page 59: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/59.jpg)
Features
■ ColumnName
□ Similarity between name(A) and
name(B), also considering the
name of the table of which B is
an attribute.
■ TypicalNameSuffix
□ Checks whether name(A) ends
with a substring that indicates a
foreign key.
□ „id“, „key“, and „nr“
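The suffix check is a one-liner. The sketch uses the three suffixes named on the slide and compares case-insensitively; the function name and the case handling are assumptions.

```python
def has_typical_fk_suffix(column_name, suffixes=("id", "key", "nr")):
    """True if the column name ends with a suffix that hints at a foreign key."""
    return column_name.lower().endswith(suffixes)
```

Names such as `C_NATIONKEY`, `FILMTEXTTYPNR`, and `ENT_OID` pass this check, while `STUDENT` does not.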
Example attribute pairs:
□ FILMTEXTE.FILMTEXTTYPNR / FILMTEXTTYPEN.FILMTEXTTYPNR
□ CUSTOMER.C_NATIONKEY / NATION.N_NATIONKEY
□ SG_SEQFEATURE.ENT_OID / SG_COMMENT.ENT_OID
□ COURSE.STUDENT / STUDENT.ID
□ SG_BIOENTRY.TAX_OID / SG_TAXON.OID
![Page 60: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/60.jpg)
Agenda
■ Basic statistics
■ Functional dependencies
■ Keys and foreign keys
■ Data profiling tools
■ Advanced profiling
![Page 61: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/61.jpg)
Tools have very long feature lists
■ Num rows
■ Min value length
■ Median value length
■ Max value length
■ Avg value length
■ Precision of numeric values
■ Scale of numeric values
■ Quartiles
■ Basic data types
■ Num distinct values ("cardinality")
■ Percentage null values
■ Data class and data type
■ Uniqueness and constancy
■ Single-column frequency histogram
■ Multi-column frequency histogram
■ Pattern discovery (Aa9)
■ Soundex frequencies
■ Benford Law Frequency
■ Single column primary key discovery
■ Multi-column primary key discovery
■ Single column IND discovery
■ Inclusion percentage
■ Single-column FK discovery
■ Multi-column IND discovery
■ Multi-column FK discovery
■ Value overlap (cross domain analysis)
■ Single-column FD discovery
■ Multi-column FD discovery
■ Text profiling
![Page 62: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/62.jpg)
Oracle Data Profiling and Quality Control Center
![Page 63: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/63.jpg)
Screenshots from IBM Information Analyzer
![Page 64: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/64.jpg)
Typical Shortcomings of Tools (and methods from research)
■ Usability
□ Complex to configure
□ Results complex to view and interpret
■ Scalability
□ Main-memory based
□ SQL based
■ Efficiency
□ Coffee, Lunch, Overnight
■ Functionality
□ Restricted to simplest tasks
□ Restricted to individual columns or small column sets
◊ “Realistic” key candidates vs. further use-cases
□ “Checking” vs. “discovery”
■ Interpretation of profiling results
That's the big one
![Page 65: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/65.jpg)
Metanome – Profiling your Datanome
[Figure: Metanome architecture. Labeled components: algorithm configuration, algorithm execution, result management, result presentation. Algorithms shown as jar files: SPIDER, DUCC, SWAN. Data sources shown: txt, xml, and csv files, DB2, MySQL. Further labels: configuration, measurements, results.]
![Page 66: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/66.jpg)
Agenda
■ Basic statistics
■ Functional dependencies
■ Keys and foreign keys
■ Data profiling tools
■ Advanced profiling
![Page 67: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/67.jpg)
Online Profiling
■ Profiling is a long procedure
□ Boring for developers
□ Expensive for machines (I/O and CPU)
■ Challenge: Display intermediate results
□ … of improving/converging accuracy
□ Allows early abort of profiling run
■ Gear algorithms toward that goal
□ Allow intermediate output
□ Enable early output: “progressive” profiling
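The idea of intermediate output can be sketched with a generator that emits a snapshot after every batch. This is only an illustration under simple assumptions: the name `progressive_profile` and the snapshot keys are hypothetical, and the statistics are limited to row, null, and distinct counts.

```python
from itertools import islice


def progressive_profile(rows, batch_size=1000):
    """Yield an intermediate profile after every batch of values.

    The distinct count is exact for the values seen so far and therefore a
    lower bound for the whole column; it can only grow as more data arrives.
    """
    seen = set()
    count = 0
    nulls = 0
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        for v in batch:
            count += 1
            if v is None:
                nulls += 1
            else:
                seen.add(v)
        yield {"rows_seen": count, "nulls": nulls, "distinct_so_far": len(seen)}
```

A caller can stop consuming the generator as soon as a snapshot answers the question at hand. For a uniqueness check, for example, the run can be aborted once `distinct_so_far` falls below `rows_seen - nulls`, because a duplicate has already been found.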
![Page 68: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/68.jpg)
Incremental Profiling
■ Data is dynamic
□ Inserts (batch or tuple-based)
□ Updates
□ Deletes
■ Problem: Keep profiling results up-to-date…
□ … without re-profiling the entire data set.
□ Easy examples: SUM, MIN, MAX, COUNT, AVG
□ Difficult examples: MEDIAN, uniqueness (see earlier slides), dependencies
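The easy cases can be sketched as a small class that keeps running aggregates under inserts. The class name `IncrementalStats` is hypothetical. Note one assumption that goes beyond the slide: the sketch handles inserts only. COUNT and SUM are just as easy under deletes, but MIN and MAX are not, because deleting the current extreme leaves the new one unknown without extra state or a rescan.

```python
class IncrementalStats:
    """Maintain COUNT, SUM, AVG, MIN, MAX under inserts without rescanning."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.minimum = None
        self.maximum = None

    def insert(self, value):
        """Tuple-based insert: update every aggregate in constant time."""
        self.count += 1
        self.total += value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)

    def insert_batch(self, values):
        """Batch insert: apply the tuple-based update to each value."""
        for v in values:
            self.insert(v)

    @property
    def average(self):
        """AVG is derived from SUM and COUNT, so it needs no state of its own."""
        return self.total / self.count if self.count else None
```

MEDIAN has no such constant-size summary: in general the exact median after an insert depends on the full set of values seen so far.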
![Page 69: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/69.jpg)
Piggyback Profiling
■ Goal: Determine metadata for query results
■ Challenge: With as little query processing overhead as possible
□ Baseline: Run second SQL query
□ Piggybacking: profile along query plan (using base statistics)
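A rough way to picture piggybacking is a pass-through operator that collects statistics while result rows flow to the consumer, instead of issuing a second query. The sketch below is only that picture: `piggyback_profile` is a hypothetical name, rows are assumed to be dictionaries, and the use of base statistics mentioned on the slide is not modeled.

```python
def piggyback_profile(rows, column, stats):
    """Pass rows through unchanged while updating `stats` for one column.

    `stats` is a plain dict that is filled as a side effect, so the consumer
    of the query result pays only for one extra pass-through operator.
    """
    stats.update({"count": 0, "nulls": 0, "min": None, "max": None})
    for row in rows:
        value = row[column]
        stats["count"] += 1
        if value is None:
            stats["nulls"] += 1
        else:
            stats["min"] = value if stats["min"] is None else min(stats["min"], value)
            stats["max"] = value if stats["max"] is None else max(stats["max"], value)
        yield row
```

Because this is a generator, the statistics are complete only after the consumer has read the whole result.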
![Page 70: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/70.jpg)
Profiling for Integration
■ Profile multiple sources simultaneously
■ Schema matching/mapping
□ What constitutes the “difficulty” of matching/mapping?
■ Duplicate detection
□ Estimate data overlap
□ Estimate fusion effort
■ Create measures to estimate integration (and cleansing) effort
□ Schema and data overlap
□ Severity of heterogeneity
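As a crude starting point for such measures, schema and data overlap can be approximated with the Jaccard coefficient. This is an assumption made for illustration, not a measure proposed in the talk: the function names are hypothetical, and real schema matching has to cope with synonyms and structural differences that plain name equality misses.

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def schema_overlap(schema_a, schema_b):
    """Overlap of attribute names, compared case-insensitively."""
    return jaccard((n.lower() for n in schema_a), (n.lower() for n in schema_b))


def data_overlap(values_a, values_b):
    """Overlap of the distinct non-null values of two columns."""
    return jaccard((v for v in values_a if v is not None),
                   (v for v in values_b if v is not None))
```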
![Page 71: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/71.jpg)
Profiling new Types of Data
■ Traditional data profiling: Single table or multiple tables
■ More and more data in other models
□ XML / nested relational / JSON
□ RDF triples
□ Textual data: Blogs, Tweets, News
□ Multimedia data
■ Different models offer new dimensions to profile
□ XML: Nestedness, measures at different nesting levels
□ RDF: Graph structure, in- and outdegrees
□ Multimedia: Color, video-length, volume, etc.
□ Text: Sentiment, sentence structure, complexity, and other linguistic measures
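For the RDF case, in- and outdegrees can be counted in one pass over the triples. A minimal sketch, assuming triples arrive as plain `(subject, predicate, object)` tuples; `degree_profile` is a hypothetical name, and literal objects are counted like any other node.

```python
from collections import Counter


def degree_profile(triples):
    """Count in-degree (as object) and out-degree (as subject) per node.

    `triples` is an iterable of (subject, predicate, object) tuples. Literal
    objects are counted like any other node in this simplified sketch.
    """
    in_degree = Counter()
    out_degree = Counter()
    for s, _p, o in triples:
        out_degree[s] += 1
        in_degree[o] += 1
    return in_degree, out_degree
```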
![Page 72: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/72.jpg)
Average Sentence Length
“Literature Fingerprinting: A New Method for Visual Literary Analysis” by Daniel A. Keim and Daniela Oelke
![Page 73: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/73.jpg)
Hapax Legomena
“Literature Fingerprinting: A New Method for Visual Literary Analysis” by Daniel A. Keim and Daniela Oelke
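Both text measures, average sentence length and hapax legomena (words that occur exactly once in a text), are simple to compute. The sketch below is not the method of the cited paper; it assumes a deliberately naive tokenizer that only recognizes ASCII letters and apostrophes, and the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercased word tokens; a deliberately naive tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def average_sentence_length(text):
    """Mean number of word tokens per sentence, splitting on . ! ?"""
    sentences = [s for s in re.split(r"[.!?]+", text) if tokenize(s)]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


def hapax_legomena(text):
    """Words that occur exactly once in the text, in sorted order."""
    counts = Counter(tokenize(text))
    return sorted(w for w, c in counts.items() if c == 1)
```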
![Page 74: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/74.jpg)
News Statistics
Master's thesis by Matthias Kohnen
![Page 75: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/75.jpg)
Summary
■ Basic statistics
■ Functional dependencies
■ Keys and foreign keys
■ Data profiling tools
■ Advanced profiling
![Page 76: Big Data Profiling](https://reader030.fdocuments.net/reader030/viewer/2022020115/554e8d4fb4c905fc368b4a53/html5/thumbnails/76.jpg)
Summary
Data Profiling
■ Single source
□ Single column
◊ Cardinalities
◊ Uniqueness and keys
◊ Patterns and data types
◊ Distributions
□ Multiple columns
◊ Uniqueness and keys
◊ Inclusion and foreign key dependencies
◊ Functional dependencies
◊ Conditional and approximate dependencies
■ Multiple sources
□ Topical overlap
◊ Topic discovery
◊ Topical clustering
□ Schematic overlap
◊ Schema matching
◊ Cross-schema dependencies
□ Data overlap
◊ Duplicate detection
◊ Record linkage