Systematic Development of Data Mining-Based Data Quality Tools
Introduction | Test Data Generator | Data Auditing Tool | Evaluation | Literature
Systematic Development of Data Mining-Based Data Quality Tools
Dominik Luebbers, Udo Grimmer, Matthias Jarke
Seminar Data Mining, Prof. Dr. Thomas Hofmann
Steffen Hartmann, Xu Jia
12 Jul 2005
Overview
1 Introduction
2 Test Data Generator
3 Data Auditing Tool
4 Evaluation
5 Literature
Introduction
What is this about?
Introduction
Motivation
41% of data warehousing projects failed!
  Reason: poor data quality (garbage in, garbage out)
Manual inspection is almost impossible.
  Reason: the data was collected over a long period, across several generations of database technology
Solution: (semi-)automatic data auditing tools.
Data Quality
What is data quality?
Data quality is goal-oriented ⇒ there is no formal definition. The literature speaks of
fitness for use or meeting end-user expectations.
Quality dimensions:
accuracy or correctness
completeness
consistency
actuality (timeliness)
relevance
Data Auditing
What is data auditing?
The application of data mining algorithms to measure and (possibly interactively) improve data quality.
Important: the data mining algorithm must suit the application domain.
Idea
Data mining algorithms search for regularities in the data,
e.g. price > 100 Euro ⇒ shipping costs = 0.
Deviations from these regularities are treated as errors.
Subtasks
Structure induction
Deviation detection
The two subtasks can be executed asynchronously. What is the advantage?
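The deviation detection step can be illustrated with a minimal Python sketch. It assumes that structure induction has already produced the shipping-cost regularity from the slide; the `Rule` class, the attribute names `price` and `shipping_cost`, and the sample orders are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import Callable

Record = dict  # attribute name -> value


@dataclass
class Rule:
    """An induced regularity: whenever the premise holds, the conclusion should hold."""
    name: str
    premise: Callable[[Record], bool]
    conclusion: Callable[[Record], bool]


def find_deviations(records, rule):
    """Return the records that satisfy the premise but violate the conclusion."""
    return [r for r in records if rule.premise(r) and not rule.conclusion(r)]


# Hypothetical encoding of the slide's example rule (attribute names are made up).
free_shipping = Rule(
    name="price > 100 => shipping_cost == 0",
    premise=lambda r: r["price"] > 100,
    conclusion=lambda r: r["shipping_cost"] == 0,
)

orders = [
    {"id": 1, "price": 120.0, "shipping_cost": 0.0},
    {"id": 2, "price": 35.0, "shipping_cost": 4.9},
    {"id": 3, "price": 250.0, "shipping_cost": 4.9},  # deviates from the rule
]

suspicious = find_deviations(orders, free_shipping)
print([r["id"] for r in suspicious])
```

Because the rule is an ordinary value, it can be induced once and then applied to new records at a later time, which is one way to read the remark that the two subtasks may run asynchronously.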
Data Auditing Tool Development Process
Test Environment
Why a test environment?
Generate data that simulates the characteristics of the database.
Pollute the data ⇒ compare the clean and the polluted test data to evaluate the data auditing tool.
Generating Test Data
Rule-pattern-based data generation process
1 Determine the database schema (number and types of the attributes)
2 Generate the TDG rule set
3 Generate the data records
Inductive Definition of Rule Patterns
Definition 1 (atomic TDG-formulae)
Let A and B be numerical or nominal attributes and let a1 be a numerical or nominal domain value. Furthermore, let N and M be numerical attributes and let n be a numerical domain value. Then
A = a1, A ≠ a1, N < n, N > n, A isnull, A isnotnull (propositional)
A = B, A ≠ B, N < M, N > M (relational)
are called atomic TDG-formulae.
Definition 2 (TDG-formulae)
Each atomic TDG-formula is a TDG-formula.
Let n ∈ ℕ and let α1, ..., αn be TDG-formulae. Then α1 ∨ ... ∨ αn and α1 ∧ ... ∧ αn are TDG-formulae.
Definition 3 (TDG-rule)
Let α and β be TDG-formulae. Then α → β is a TDG-rule.
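Definitions 1 to 3 map naturally onto a small data structure. The following Python sketch is one possible encoding, not the authors' implementation: the class names `Atom`, `And`, `Or`, `Rule`, the functions `holds` and `satisfies`, and the treatment of null values in comparisons are assumptions made for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Any

# Comparison operators of the atomic formulae (Definition 1).
COMPARATORS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt, ">": operator.gt}


@dataclass(frozen=True)
class Atom:
    attribute: str            # left-hand attribute, e.g. "price"
    op: str                   # "=", "!=", "<", ">", "isnull" or "isnotnull"
    operand: Any = None       # a constant, or an attribute name if relational
    relational: bool = False  # True for atoms of the form A = B, N < M, ...


@dataclass(frozen=True)
class And:
    parts: tuple


@dataclass(frozen=True)
class Or:
    parts: tuple


@dataclass(frozen=True)
class Rule:
    premise: Any
    conclusion: Any


def holds(formula, record):
    """Evaluate a formula on a record (dict: attribute -> value, None means null)."""
    if isinstance(formula, And):
        return all(holds(p, record) for p in formula.parts)
    if isinstance(formula, Or):
        return any(holds(p, record) for p in formula.parts)
    left = record.get(formula.attribute)
    if formula.op == "isnull":
        return left is None
    if formula.op == "isnotnull":
        return left is not None
    right = record.get(formula.operand) if formula.relational else formula.operand
    if left is None or right is None:
        return False  # assumption: a comparison involving a null value is false
    return COMPARATORS[formula.op](left, right)


def satisfies(rule, record):
    """A record satisfies alpha -> beta unless alpha holds and beta does not."""
    return not holds(rule.premise, record) or holds(rule.conclusion, record)


# Hypothetical rule: price > 100 -> shipping_cost = 0
free_shipping = Rule(Atom("price", ">", 100), Atom("shipping_cost", "=", 0))
print(satisfies(free_shipping, {"price": 150, "shipping_cost": 5}))
```

Frozen dataclasses keep formulae immutable and hashable, which makes it easy to store them in sets when a rule set is assembled.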
Inductive Definition of Rule Patterns
Meaningless rules
A = Val1 → A = Val2
A = Val1 ∧ A = Val2 → B = Val1
A = Val1 → A ≠ Val2
⇒ Such rules should be avoided.
⇒ Natural TDG-formulae and -rules
Definition 4 (Natural TDG-formulae)
Let α be a TDG-formula. α is a natural TDG-formula iff one of the following holds:
α is an atomic TDG-formula and α is satisfiable.
α = α1 ∧ α2 ∧ ... ∧ αn and the following holds:
  ∀i: αi is a natural TDG-formula,
  α is satisfiable, and
  $\forall i:\ \alpha_i \nLeftarrow \bigwedge_{j,\, i \neq j} \alpha_j$ (no αi is implied by the conjunction of the others).
Analogously for disjunction.
Inductive Definition of Rule Patterns
Definition 5 (Natural TDG-rule)
A TDG-rule α → β is called a natural TDG-rule iff
α and β are natural TDG-formulae,
α ∧ β is satisfiable, and
α ⇏ β.
Contradiction and redundancy
Contradictory: A = Val1 → B = Val1 and A = Val1 → B = Val2
Redundant: A = Val1 ∧ B = Val2 → C = Val1 given A = Val1 → C = Val1
⇒ A rule R = α → β should be added to a given rule set ℛ only if:
ℛ ⊭ R
ℛ ∪ {α} is satisfiable
Inductive Definition of Rule Patterns
Definition 6 (Natural rule set)
Let R = {α1 → β1, α2 → β2, ..., αn → βn} be a set of natural TDG-rules αi → βi.
R is called a natural rule set iff for any two different rules αi → βi and αj → βj with αj ⇒ αi the following holds:
αj ∧ βi ∧ βj is satisfiable, and
(αj ∧ βi) ⇏ βj
Idea: Satisfiability test for TDG-formulae
Transform the TDG-formula α into disjunctive normal form.
α is satisfiable if one of the disjuncts of this normal form is satisfiable.
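The satisfiability idea can be sketched in Python for a deliberately restricted case. The sketch assumes only propositional atoms of the form A = v and A ≠ v over nominal attributes with unbounded domains (so inequalities alone never exhaust a domain); the tuple encoding and the function names are hypothetical, and numeric or relational atoms would need an additional consistency check.

```python
from itertools import product


def atom(attribute, op, value):
    """Build an atomic formula; op is "=" or "!=" in this restricted sketch."""
    return ("atom", attribute, op, value)


def to_dnf(formula):
    """Return the formula as a list of conjunctions, each a list of atoms."""
    kind = formula[0]
    if kind == "atom":
        return [[formula]]
    if kind == "or":
        # The DNF of a disjunction is the union of the parts' disjuncts.
        return [conj for part in formula[1:] for conj in to_dnf(part)]
    if kind == "and":
        # Distribute: pick one disjunct from every part and merge them.
        return [sum(choice, []) for choice in product(*(to_dnf(p) for p in formula[1:]))]
    raise ValueError(f"unknown node type: {kind}")


def conjunction_satisfiable(conjunction):
    """A conjunction of =/!= atoms is satisfiable unless two atoms clash."""
    equal, unequal = {}, {}
    for _, attribute, op, value in conjunction:
        if op == "=":
            if attribute in equal and equal[attribute] != value:
                return False  # A = v1 and A = v2 with v1 != v2
            equal[attribute] = value
        elif op == "!=":
            unequal.setdefault(attribute, set()).add(value)
        else:
            raise ValueError(f"unsupported operator: {op}")
    # A = v together with A != v is a contradiction.
    return all(equal[a] not in unequal.get(a, ()) for a in equal)


def satisfiable(formula):
    """A formula is satisfiable if at least one disjunct of its DNF is."""
    return any(conjunction_satisfiable(c) for c in to_dnf(formula))


# Premise of the meaningless rule A = Val1 AND A = Val2 -> B = Val1
print(satisfiable(("and", atom("A", "=", "Val1"), atom("A", "=", "Val2"))))
```

Note that the DNF can grow exponentially in the size of the formula, so this approach is only practical for the small formulae a test data generator typically produces.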
Data Corruption
Different kinds of data pollution
Wrong value polluter
Null-value polluter
Limiter
Switcher
Duplicator
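The slide only names the polluters, so the following Python sketch is a guess at their behavior based on the names alone: every function body, parameter, and the sample record are assumptions and may differ from the tool described in the talk.

```python
import random


def wrong_value_polluter(record, attribute, domain, rng):
    """Assumed behavior: replace the value with a different value from the domain."""
    candidates = [v for v in domain if v != record[attribute]]
    polluted = dict(record)
    polluted[attribute] = rng.choice(candidates)
    return polluted


def null_value_polluter(record, attribute):
    """Assumed behavior: replace the value with a null (None)."""
    polluted = dict(record)
    polluted[attribute] = None
    return polluted


def limiter(record, attribute, low, high):
    """Assumed behavior: clamp a numerical value into the range [low, high]."""
    polluted = dict(record)
    polluted[attribute] = max(low, min(high, record[attribute]))
    return polluted


def switcher(record, attribute_a, attribute_b):
    """Assumed behavior: swap the values of two attributes."""
    polluted = dict(record)
    polluted[attribute_a], polluted[attribute_b] = record[attribute_b], record[attribute_a]
    return polluted


def duplicator(records, index):
    """Assumed behavior: insert a copy of one record right after the original."""
    return records[: index + 1] + [dict(records[index])] + records[index + 1:]


clean = {"outlook": "sunny", "humidity": "high", "temperature": 30}
rng = random.Random(0)  # fixed seed so that a pollution run can be repeated
polluted = wrong_value_polluter(clean, "outlook", ["sunny", "overcast", "rain"], rng)
print(polluted["outlook"] != clean["outlook"])
```

Each polluter returns a new record and leaves the clean one untouched, so the clean and polluted data sets remain available side by side for the later comparison.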
Error detection by a data auditing tool
Specificity and sensitivity
Specificity (true negative rate) := TN/(TN + FP), e.g. the probability that a healthy person is diagnosed as not ill.
Sensitivity (true positive rate) := TP/(TP + FN), e.g. the probability that an ill person is diagnosed as ill.
Both values = 1 ⇒ perfect data auditing tool
False negative: e.g. an ill person diagnosed as not ill
False positive: e.g. a healthy person diagnosed as ill
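Applied to data auditing, both measures follow from comparing which records were actually polluted with which records the tool flagged. The Python sketch below shows the arithmetic; the two boolean lists are made-up illustration data, not results from the talk.

```python
def confusion_counts(is_polluted, is_flagged):
    """Count TP, FN, FP, TN from ground truth and tool output (parallel lists)."""
    pairs = list(zip(is_polluted, is_flagged))
    tp = sum(1 for p, f in pairs if p and f)          # polluted and flagged
    fn = sum(1 for p, f in pairs if p and not f)      # polluted but missed
    fp = sum(1 for p, f in pairs if not p and f)      # clean but flagged
    tn = sum(1 for p, f in pairs if not p and not f)  # clean and not flagged
    return tp, fn, fp, tn


def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical ground truth (from the polluter) and tool output for 8 records.
is_polluted = [True, True, False, False, False, True, False, False]
is_flagged = [True, False, False, True, False, True, False, False]

tp, fn, fp, tn = confusion_counts(is_polluted, is_flagged)
print(sensitivity(tp, fn), specificity(tn, fp))
```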
A Small Example
Terms
Class attribute
Base attributes
Training set
Test set
Entropy
$$\mathrm{Entropy}(S) = -\sum_{I} p(I)\,\log_2 p(I)$$
$$\mathrm{Entropy}(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.940$$
When is the entropy 0? When is the entropy 1?
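The entropy value on this slide can be checked with a few lines of Python. The function name `entropy` and its interface (a list of class counts) are choices made for this sketch. It also answers the question on the slide for the two-class case: the entropy is 0 when all records belong to one class and 1 when both classes are equally frequent.

```python
from math import log2


def entropy(class_counts):
    """Entropy of a set given the number of records per class."""
    total = sum(class_counts)
    # Empty classes contribute nothing (the limit of p * log2(p) for p -> 0 is 0).
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)


example = entropy([9, 5])   # 9 records of one class, 5 of the other
pure = entropy([14, 0])     # all records in one class
balanced = entropy([7, 7])  # both classes equally frequent
print(round(example, 3))
```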
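Entropy and Gain
$$\mathrm{Entropy}(\mathit{Outlook}, S) = \tfrac{5}{14}\,\mathrm{Entropy}(S_{sunny}) + \tfrac{5}{14}\,\mathrm{Entropy}(S_{rain}) + \tfrac{4}{14}\,\mathrm{Entropy}(S_{overcast}) = 0.694$$
$$\mathrm{Entropy}(S_{sunny}) = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5}$$
$$\mathrm{Gain}(\mathit{Outlook}, S) = \mathrm{Entropy}(S) - \mathrm{Entropy}(\mathit{Outlook}, S)$$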
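The gain for Outlook can be reproduced numerically. In the Python sketch below, the class counts for sunny (2 and 3) come from the slide; the counts for rain (3 and 2) and overcast (4 and 0) are an assumption taken from the commonly used 14-record play-ball example, chosen because they reproduce the stated value of 0.694.

```python
from math import log2


def entropy(class_counts):
    """Entropy of a set given the number of records per class."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)


def conditional_entropy(partitions):
    """Weighted entropy after splitting; partitions maps value -> class counts."""
    total = sum(sum(counts) for counts in partitions.values())
    return sum(sum(counts) / total * entropy(counts) for counts in partitions.values())


# Class counts as [yes, no] per Outlook value; rain and overcast are assumed.
outlook = {"sunny": [2, 3], "rain": [3, 2], "overcast": [4, 0]}

entropy_outlook = conditional_entropy(outlook)   # Entropy(Outlook, S)
gain_outlook = entropy([9, 5]) - entropy_outlook  # Gain(Outlook, S)
print(round(entropy_outlook, 3))
```

Computed without intermediate rounding, the gain is about 0.2467; subtracting the rounded values 0.940 and 0.694 gives the 0.246 used in the slides.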
The Basic Algorithm - ID3
Choosing the attribute
Gain(Outlook, S) = 0.246, Gain(Temperature, S) = 0.029
Gain(Humidity, S) = 0.151, Gain(Wind, S) = 0.048
⇒ Choose the attribute with the largest gain as the root of the decision tree.
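The attribute selection step of ID3 can be sketched in Python. The training table is not part of this transcript, so the sketch assumes the widely used 14-record play-ball data set; it appears to match the slides because it reproduces the four gains up to rounding. The names `DATA`, `ATTRIBUTES`, and `gain` are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

ATTRIBUTES = ("outlook", "temperature", "humidity", "wind")

# Assumed training set: (outlook, temperature, humidity, wind, playball).
DATA = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rain", "mild", "high", "false", "yes"),
    ("rain", "cool", "normal", "false", "yes"),
    ("rain", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rain", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rain", "mild", "high", "true", "no"),
]


def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def gain(rows, attribute_index):
    """Information gain of splitting the rows on one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute_index]].append(row[-1])  # class label is the last field
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder


gains = {name: gain(DATA, i) for i, name in enumerate(ATTRIBUTES)}
root = max(gains, key=gains.get)  # attribute with the largest gain
print(root)
```

ID3 then repeats the same selection recursively on each partition of the chosen attribute until a partition contains only one class.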
The Basic Algorithm - ID3
The Basic Algorithm - ID3
Decision tree ⇒ Rules
outlook = sunny ∧ humidity = high → playball = no
outlook = sunny ∧ humidity = normal → playball = yes
outlook = overcast → playball = yes
outlook = rain ∧ wind = true → playball = no
outlook = rain ∧ wind = false → playball = yes
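The five rules above can be written down as data and used both to predict the class attribute and to flag deviating records. This Python sketch encodes exactly the rules from the slide; the functions `predict` and `is_deviation` and the sample records are hypothetical.

```python
# Each rule: (premise as attribute -> required value, predicted playball value).
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "sunny", "humidity": "normal"}, "yes"),
    ({"outlook": "overcast"}, "yes"),
    ({"outlook": "rain", "wind": "true"}, "no"),
    ({"outlook": "rain", "wind": "false"}, "yes"),
]


def predict(record):
    """Return the playball value of the first rule whose premise matches, else None."""
    for premise, playball in RULES:
        if all(record.get(attribute) == value for attribute, value in premise.items()):
            return playball
    return None


def is_deviation(record):
    """A record deviates if a rule applies and its playball value differs."""
    predicted = predict(record)
    return predicted is not None and record.get("playball") != predicted


suspect = {"outlook": "overcast", "humidity": "high", "wind": "false", "playball": "no"}
print(predict(suspect), is_deviation(suspect))
```

The premises of rules read off a decision tree are mutually exclusive, so the order in which the rules are tried does not change the prediction.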
The Problem with Information Gain
Improvement - C4.5
Information gain ratio
ID3's information gain favors attributes that have many values.
If attribute A has only distinct values (every record has its own value), then Entropy(A, S) = 0 ⇒ Gain(A, S) is maximal.
Improvement: use the information gain ratio
GainRatio(A, S) = Gain(A, S)/SplitInfo(A, S)
Example:
$$\mathrm{SplitInfo}(\mathit{Outlook}, S) = -\tfrac{5}{14}\log_2\tfrac{5}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} - \tfrac{4}{14}\log_2\tfrac{4}{14}$$
SplitInfo is large when the data is spread over many branches and small when all data belongs to a single branch, so dividing by it penalizes attributes with many values.
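The SplitInfo example can be evaluated with a short Python sketch. The partition sizes 5, 5, and 4 and the gain of 0.246 are taken from the slides; the function names are choices made for this illustration, and the 14-way split stands in for a hypothetical attribute with only distinct values.

```python
from math import log2


def split_info(partition_sizes):
    """SplitInfo(A, S): entropy of the partition sizes produced by attribute A."""
    total = sum(partition_sizes)
    return -sum(n / total * log2(n / total) for n in partition_sizes if n > 0)


def gain_ratio(gain, partition_sizes):
    """GainRatio(A, S) = Gain(A, S) / SplitInfo(A, S).

    Undefined (division by zero) if all records fall into one partition.
    """
    return gain / split_info(partition_sizes)


outlook_split = split_info([5, 5, 4])          # sunny, rain, overcast
outlook_ratio = gain_ratio(0.246, [5, 5, 4])   # gain value from the slides
id_like_split = split_info([1] * 14)           # 14 distinct values: log2(14)
print(round(outlook_split, 3), round(outlook_ratio, 3))
```

An attribute whose 14 values are all distinct gets a SplitInfo of log2(14), roughly 3.81, so even a maximal gain is divided by a much larger number than for Outlook.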
Attributes with unknown values
When building a decision tree: simply ignore the record.
When using a decision tree: estimate the probabilities of the possible outcomes.
Improvement - C4.5
Pruning decision trees
Purpose: to avoid overfitting
Method: subtree replacement, i.e. replace a subtree with a single leaf
Example: test data with 3 (blue, success) and 2 (red, failure) ⇒ replace the subtree with a leaf labeled <failure>
Error Correction
Using predicted values
The predicted values can be used directly as corrections.
Interactive error correction
Sometimes the error lies in the base attributes.
The predicted values then help in the search for the source of the error.
Adapting C4.5 for Data Auditing
Error confidence
How trustworthy is a detected error?
It depends on the number of records.
A low error confidence value is useless ⇒ require a minimal error confidence ⇒ require a minimal number of records
Adjustments of C4.5
A minimal number of instances per partition is required, to avoid unnecessary subtrees.
The pessimistic classification error used in the C4.5 pruning criterion is replaced by the expected error confidence: if the expected error confidence is larger after pruning, the subtree is replaced by a single leaf.
The decision tree is transformed into an equivalent rule set, and the rules that are irrelevant for error detection are deleted.
Evaluation
Number of records vs. sensitivity
Number of rules vs. sensitivity
Pollution factor vs. sensitivity
Evaluation
Auditing evaluation
Database that describes all industry engines manufactured by Mercedes-Benz
Contains 8 attributes and about 200,000 records
Ran for 21 minutes on a 900 MHz Athlon
Found about 6000 suspicious records, which were ranked by their error confidence
For example
The following dependency between the 2 attributes BRV and GBM was induced:
BRV = 404 → GBM = 901
Based on 16118 records
However, 1 record had a value of 911 for GBM
The data auditing tool gave this record an error confidence of 99.95%
References
Literature and links
Building Classification Models: ID3 and C4.5 (http://www.cis.temple.edu/~ingargio/cis587/readings/id3-c45.html)
The ID3 Algorithm (http://www.cise.ufl.edu/~ddd/cap6635/Fall-97/Short-papers/2.htm)
Knowledge Discovery and Data Mining: Techniques and Practice (http://www.netnam.vn/unescocourse/knowlegde/knowlegd.htm)
Decision Trees (http://dms.irb.hr/tutorial/tut_dtrees.php)
End
:-)
Thank you for your attention!
?!
Questions and discussion...