Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing...
-
Upload
leanna-wiggington -
Category
Documents
-
view
265 -
download
0
Transcript of Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing...
![Page 1: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/1.jpg)
Data Preprocessing
![Page 2: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/2.jpg)
Relational Databases - Normalization Denormalization
Data Preprocessing Missing Data Missing values and the 3VL approach Problems with 3VL approach Special Values
![Page 3: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/3.jpg)
Remember: Relational Databases Model entities and relationships Entities are the things in the real world
Information about employees and the department they work for Employee and department are entities
Relationships are the links between these entities Employee works for a department
![Page 4: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/4.jpg)
Relation or Tables Relation: a table of data
• table = relation,(set theory, based on predicate logic)
emplyeeID name job departmentID
7513 Nora Programmer 128
9842 Ben DBA 42
6651 Alex Programmer 128
9006 Claudia System-Administrator
128
![Page 5: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/5.jpg)
Columns and Rows
Each column or attribute describes some piece of data that each record in the table has
Each row in a table represents a record Rows, records or tupels
![Page 6: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/6.jpg)
Keys A superkey is a column (or a set of columns) that can
be used to identify a row in as table a key is a minimal superkey
There are different possible keys candidate keys
We chose form the candidate keys the primary key Primary key is used to identify a single row (record) Foreign keys represents links between tables
![Page 7: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/7.jpg)
Keys primary key foreign key
emplyeeID name job departmentID
7513 Nora Programmer 128
9842 Ben DBA 42
6651 Alex Programmer 128
9006 Claudia System-Administrator
128
![Page 8: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/8.jpg)
Functional Dependencies
If there is a functional dependency between columns A and B in a given table which may be written
Then the value of column A determines the value of column B
employeeID functionally determines the name
€
A→ B
![Page 9: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/9.jpg)
Schema
Database schema
Structure or design of the database Database without any data in it
employee(employeeID,name,job,departmentID)
![Page 10: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/10.jpg)
Design
Minimize redundancy Redundancy: data is repeated in different
rows
employee(employeeID,name,job,departmentID,departmentName)
emplyeeID name job departmentID departamentName
7513 Nora Programmer 128 Research and Development
9842 Ben DBA 42 Finance
6651 Alex Programmer 128 Research and Development
9006 Claudia System-Administrator 128 Research and Development
![Page 11: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/11.jpg)
Reduce redudancy
employee(employeeID,name,job,departmentID,departmentName)
employee(employeeID,name,job,departmentID)
employee(departmentID,name)
![Page 12: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/12.jpg)
Insert Anomalies
Insert data into flawed table Data does not match what is already in
the table It is not obvious which of the rows in the
database is correct
![Page 13: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/13.jpg)
Deletion Anomalies
Delete data from a flawed schema
When we delete all the employees of Department 128, we no longer have any record, that the Department 128 exists
![Page 14: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/14.jpg)
Update Anomalies
Change data in a flawed schema
We do not change the data for every row correctly
![Page 15: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/15.jpg)
Null Values
Avoid schema designs that have large numbers of empty attributes
![Page 16: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/16.jpg)
Normalization
Remove design flaws from a database Normal forms, which are a set of rules
describing what we should and should not do in our table structures
Breaking tables into smaller tables that form a better design
![Page 17: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/17.jpg)
Normal Forms
1 Forma Normal
2 Forma Normal
3 Forma Normal
5 Forma Normal
4 Forma Normal
Forma Normal Boyce Codd
![Page 18: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/18.jpg)
First Normal Form (1NF)
Each attribute or column value must be atomic
Each attribute must contain a single value
![Page 19: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/19.jpg)
emplyeeID name job departmentID
skills
7513 Nora Programmer 128 C, Perl, Java
9842 Ben DBA 42 DB2
6651 Alex Programmer 128 VB, Java
9006 Claudia System-Administrator 128 NT, Linux
![Page 20: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/20.jpg)
1NF
emplyeeID name job departmentID skills
7513 Nora Programmer 128 C
7513 Nora Programmer 128 Perl
7513 Nora Programmer 128 java
9842 Ben DBA 42 DB2
6651 Alex Programmer 128 VB
6651 Alex Programmer 128 java
9006 Claudia System-Administrator 128 NT
9006 Claudia System-Administrator 128 Linux
![Page 21: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/21.jpg)
Second Normal Form (2NF)
All attributes that are no part of the primary key are fully dependent on the primary key Each non key attribute must be functionally
dependent on the key Is already in 1NF
![Page 22: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/22.jpg)
2NF ?
employee(employeeID,name,job,departmentID,skill)
emplyeeID name job departmentID skills
7513 Nora Programmer 128 C
7513 Nora Programmer 128 Perl
7513 Nora Programmer 128 java
9842 Ben DBA 42 DB2
6651 Alex Programmer 128 VB
6651 Alex Programmer 128 java
9006 Claudia System-Administrator 128 NT
9006 Claudia System-Administrator 128 Linux
![Page 23: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/23.jpg)
Functional dependencies
employeeID,skill name, job, deparmentID employeeID name, job, deparmentID
Partially functionally dependent on the primary key
Not fully functionally dependent on the primary key
![Page 24: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/24.jpg)
2NF
Decompose the table into tables which all the non-key attributes are fully functionally dependent on the key
Breaking the table into two tables employee(employeeID,name,job,departmentID) employeeSkills(employeeID,skill)
![Page 25: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/25.jpg)
Third Normal Form (3NF)
Remove all transitive dependencies Be in 2NF
![Page 26: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/26.jpg)
employee(employeeID,name,job,departmentID,departmentName) employeeID name,job,departmentID,departmentName departmentID departmentName
employeID name job departmentID departmentName
7513 Nora Programmer 128 Research
9842 Ben DBA 42 Finance
6651 Ajay Programmer 128 Research
9006 Candy SYS 128 Research
![Page 27: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/27.jpg)
Transitive dependency
employeeID departmentName employeeID deparmtentID
departmentID departmentName
![Page 28: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/28.jpg)
3NF
Remove transitive dependency Decompose into multiple tables
emploee(employeeID,name,jop,departmentID) deparment(deparmentID,deparmtentName)
![Page 29: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/29.jpg)
3NF
The left side of the functional dependency is a superkey (that is, a key that is not necessarily minimal) Boyce-Codd Normal Form
or
The right side of the functional dependency is a part of any key of the table
![Page 30: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/30.jpg)
BCNF
All attributes must be functionally determined by a superkey
![Page 31: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/31.jpg)
Full normalization means lots of logically seperate relations
Lots of logically separate relations means a lot of physically separate files
Lots of physically separate files means a lot of I/O
Difficulties in finding dimensions for dimensional schema, star schema (dimension tables, fact table)
![Page 32: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/32.jpg)
What is Denormalization? Normalizing a relational variable R means replacing
R by a set of projections R1,R2,..,Rn such that R is equal to the join R1,R2,..,Rn Reduce redundancy, each projections R1,R2,..,Rn is at
the highest possible value of normalization Denormalizing the relational variables means
replacing them by their join R Increase redundancy, by ensuring that R is a lower level
of normalization than R1,R2,..,Rn Problems
Once we start to denormalize, it is not clear when to stop?
![Page 33: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/33.jpg)
Dimensional Schema Array cells often empty
The more dimensions, there more empty cells Empty cell Missing information How to treat not present information ? How does the system support
• Information is unknown• Has been not captured• Not applicable• ....
Solution?
![Page 34: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/34.jpg)
Why Data Preprocessing?
Data in the real world is dirty incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data• e.g., occupation=“ ”
noisy: containing errors or outliers• e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”• e.g., Was rating “1,2,3”, now rating “A, B, C”• e.g., discrepancy between duplicate records
![Page 35: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/35.jpg)
Why Is Data Dirty?
Incomplete data may come from “Not applicable” data value when collected Different considerations between the time when the data was
collected and when it is analyzed. Human/hardware/software problems
Noisy data (incorrect values) may come from Faulty data collection instruments Human or computer error at data entry Errors in data transmission
Inconsistent data may come from Different data sources Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
![Page 36: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/36.jpg)
Why Is Data Preprocessing Important?
No quality data, no quality mining results! Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even misleading statistics.
Data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse
![Page 37: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/37.jpg)
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view: Accuracy Completeness Consistency Timeliness Believability Value added Interpretability Accessibility
![Page 38: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/38.jpg)
Major Tasks in Data Preprocessing Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration Integration of multiple databases, data cubes, or files
Data transformation Normalization and aggregation
Data reduction Obtains reduced representation in volume but produces the
same or similar analytical results
Data discretization Part of data reduction but with particular importance, especially
for numerical data
![Page 39: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/39.jpg)
Forms of Data Preprocessing
![Page 40: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/40.jpg)
Data Cleaning
Importance “Data cleaning is one of the three biggest problems in data
warehousing”—Ralph Kimball “Data cleaning is the number one problem in data
warehousing”—DCI survey
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
![Page 41: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/41.jpg)
Missing Data
Data is not always available E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of
entry
not register history or changes of the data
Missing data may need to be inferred
![Page 42: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/42.jpg)
Missing Values
The approach of the problem of missing values adopted in SQL is based on nulls and three-valued logic (3VL)
null corresponds to UNK for unknown 3VL a mistake?
![Page 43: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/43.jpg)
Boolean Operators
In scalar comparison in which either of the compared is UNK evaluates the unknown truth value
AND t u f
t t u f
u u u f
f f f f
OR t u f
t t t t
u t u u
f t u f
NOT
t f
u u
f t
![Page 44: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/44.jpg)
MAYBE
Another important Boolean operator is MAYBE
MAYBE
t f
u t
f f
![Page 45: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/45.jpg)
Example
Consider the query “Get employees who may be- but are not definitely known to be- programmers born before January 18, 1971, with salary less then €40.000
EMP WHERE MAYBE ( JOB = ‘PROGRAMMER’ AND
DOB < DATE (‘1971-1-18’) AND
SALLARY < 40000 )
![Page 46: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/46.jpg)
Without maybe we assume the existence of another operator called IS_UKN which takes a single scalar operand and returns true if operand evaluates UNK otherwise false
EMP WHERE ( JOB = ‘PROGRAMMER’ OR IS_UKN (JOB) ) AND ( DOB < DATE (‘1971-1-18’) OR IS_UKN (DOB) ) AND ( SALLARY < 40000 OR IS_UKN (SALLARY) ) AND NOT ( JOB = ‘PROGRAMMER’ AND DOB < DATE (‘1971-1-18’) AND SALLARY < 40000 )
![Page 47: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/47.jpg)
Numeric expression WEIGHT * 454
If WEIGHT is UKN, then the result is also UKN
Any numeric expression is considered to evaluate UNK if any operands of that expression is itself UNK
Anomalies WEIGHT-WEIGHT=UNK (0) WEIGHT/0=UNK (“zero divide”)
![Page 48: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/48.jpg)
UNK is not u (unk) UNK (the value-unknown null) u (unk) (unknown truth value) ...are not the same thing
u is a value, UNK not a value at all!
Suppose X is BOOLEAN Has tree values: t (true),f (false), u ukn X is ukn, X is known to be unk X is UKN, X is not known!
![Page 49: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/49.jpg)
Some 3VL Consequences
The comparison x=x does not give true In 3VL x is not equal to itself it is happens to
be UNK The Boolean expression p OR NOT(p)
does not give necessarily true unk OR NOT (unk) = unk
![Page 50: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/50.jpg)
Example Get all suppliers in Porto and take the union
with get all suppliers not in Porto We do not get all suppliers!
We need to add maybe in Porto In 2 VL p OR NOT(p) corresponds to p OR NOT(p) OR MAYBE(p) in 3VL
While two cases my exhaust full range of possibilities in the real world, the database does not contain the real world - instead it contains only knowledge about real world
![Page 51: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/51.jpg)
Some 3VL Consequences The expression r JOIN r does not
necessarily give r A=B and B=C together does not imply
A=C .... Many equivalences that are valid in 2VL
break down in 3VL We will get wrong answers
![Page 52: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/52.jpg)
Special Values
Drop the idea of null and UNK,unk 3VL Use special values instead to represent
missing information
Special values are used in the real world In the real world we might use the special
value „?“ to denote hours worked by a certain employee if actual value is unknown
![Page 53: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/53.jpg)
Special Values General Idea:
Use an appropriate special value, distinct from all regular values of the attribute in question, when no regular value can be used
The special value must be of the applicable attribute is not just integers, but integers integers plus whatever the special value is
Approach is not very elegant, but without 3VL problems, because it is in 2VL
![Page 54: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/54.jpg)
How to Handle Missing Data?
Ignore the tuple: usually done when class label is missing (assuming
the tasks in classification—not effective when the percentage of
missing values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Fill in it automatically with
a global constant : e.g., “unknown”, a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based such as Bayesian formula or
decision tree
![Page 55: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/55.jpg)
Relational Databases - Normalization Denormalization
Data Preprocessing Missing Data Missing values and the 3VL approach Problems with 3VL approach Special Values
![Page 56: Data Preprocessing. Relational Databases - Normalization Denormalization Data Preprocessing Missing Data Missing values and the 3VL approach Problems.](https://reader036.fdocuments.net/reader036/viewer/2022081418/551b542a550346d41a8b6169/html5/thumbnails/56.jpg)
Next..
Data Preprocessing
Visual inspection Noise Reduction Data Reduction Data Discretization Data Integration