
K236: Basis of Data Science
Lecture 2. Data and Databases

Lecturer: Tu Bao Ho and Hieu Chi Dam
TA: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai

Schedule of K236

1. Introduction to data science 6/9
2. Introduction to data science 6/13
3. Data and databases 6/16
4. Review of univariate statistics 6/20
5. Review of linear algebra 6/23
6. Data mining software 6/27
7. Data preprocessing 6/30
8. Classification and prediction (1) 7/4
9. Knowledge evaluation 7/7
10. Classification and prediction (2) 7/11
11. Classification and prediction (3) 7/14
12. Mining association rules (1) 7/18
13. Mining association rules (2) 7/21
14. Cluster analysis 7/25
15. Review and Examination (the date is not fixed) 7/27

Outline

1. Much more data around us than before
2. Data management
3. Data quality problems

This lecture aims to give you an idea of how data are collected, represented, and organized.

Data collection, representation, organization and inference

• How are data collected, represented, and organized?
  - Collection: a sample or all available data
  - Representation: vectors, sequences, lists, graphs, etc.
  - Organization: databases, warehouses, etc.
• Inference
  - Induction vs. deduction

[Figure: generalization (inductive learning) leads from data, at a low level of abstraction, to knowledge, at a high level of abstraction.]

Astronomical data

Astronomy is facing a major data avalanche: multi-terabyte sky surveys and archives (soon multi-petabyte), billions of detected sources, and hundreds of measured attributes per source.

Earthquake data

[Figures: earthquake maps, 1932-1996, including 04/25/92 Cape Mendocino, CA; Japanese earthquakes, 1961-1994]

Explosion of biological data

10,267,507,282 bases in 9,092,760 records.

• Genomics: 25,000 genes
• Proteomics: 2,000,000 proteins
• Metabolomics: 3,000 metabolites

What do biological data look like?

A portion of a DNA sequence consisting of 1.6 million characters is shown below (about 350 characters, roughly 4,570 times smaller):

…TACATTAGTTATTACATTGAGAAACTTTATAATTAAAAAAGATTCATGTAAATTTCTTATTTGTTTATTTAGAGGTTTTAAATTTAATTTCTAAGGGTTTGCTGGTTTCATTGTTAGAATATTTAACTTAATCAAATTATTTGAATTTTTGAAAATTAGGATTAATTAGGTAAGTAAATAAAATTTCTCTAACAAATAAGTTAAATTTTTAAATTTAAGGAGATAAAAATACTACTCTGTTTTATTATGGAAAGAAAGATTTAAATACTAAAGGGTTTATATATATGAAGTAGTTACCCTTAGAAAAATATGGTATAGAAAGCTTAAATATTAAGAGTGATGAAGTATATTATGT…

Many other kinds of biological data


Text: huge sources of knowledge

• Approximately 80% of the world's data is held in unstructured formats (source: Oracle Corporation).
• Example: MEDLINE is a source of life sciences and biomedical information, with nearly eleven million records.
  - About 60,000 abstracts on hepatitis

36003: Biomed Pharmacother. 1999 Jun;53(5-6):255-63. Pathogenesis of autoimmune hepatitis. Institute of Liver Studies, King's College Hospital, London, United Kingdom.

Autoimmune hepatitis (AIH) is an idiopathic disorder affecting the hepatic parenchyma. There are no morphological features that are pathognomonic of the condition but the characteristic histological picture is that of an interface hepatitis without other changes that are more typical of other liver diseases. It is associated with hypergammaglobulinaemia, high titres of a wide range of circulating auto-antibodies, often a family history of other disorders that are thought to have an autoimmune basis, and a striking response to immunosuppressive therapy. The pathogenetic mechanisms are not yet fully understood but there is now considerable circumstantial evidence suggesting that: (a) there is an underlying genetic predisposition to the disease; (b) this may relate to several defects in immunological control of autoreactivity, with consequent loss of self-tolerance to liver auto-antigens; (c) it is likely that an initiating factor, such as a hepatotropic viral infection or an idiosyncratic reaction to a drug or other hepatotoxin, is required to induce the disease in susceptible individuals, …

Web link data

Outline

1. Much more data around us than before
2. Data management
  - Data models
  - Data types
  - Structures of data
  - Various kinds of databases
3. Data quality problems

Data models

• Model: a simplified description or abstraction of reality.
• Data model: a description of data by a set of concepts covering:
  - The structure of a database, which typically includes
    - elements (e.g., data types),
    - groups of elements (e.g., entity, record, table), and
    - relationships among such groups.
  - The operations for manipulating these structures, specifying database retrievals and updates:
    - basic model operations (e.g., insert and delete operations)
    - user-defined operations (e.g., compute_student_average_score)
  - Certain constraints (restrictions on valid data) that the database should obey.

Approaches to data models

• External model (views): describes how users see the data for a particular purpose
  - Course_info(cid: string, enrollment: integer)
• Conceptual model: defines the logical structure*
  - Students(sid: string, name: string, login: string, age: integer, gpa: real)
  - Courses(cid: string, cname: string, credits: integer)
  - Enrolled(sid: string, cid: string, grade: string)
• Internal (physical) model: describes how data are stored in the computer
  - Relations stored as unordered files.
  - Index on the first column of Students.

[Figure: three levels. External level: View 1, View 2. Conceptual level: conceptual model. Physical level: physical model.]

* A conceptual model is an underlying model that is capable of supporting any valid (and perhaps changing) external view that falls within its scope. https://en.wikipedia.org/wiki/Data_model#cite_noteUMW99U3
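The three levels above can be tried out with Python's built-in sqlite3 module. This is only a minimal sketch, not part of the original slides: the sample rows and the index name idx_students_sid are made up, and defining Course_info as a count over Enrolled is an assumption about what the view is meant to compute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Conceptual level: the logical structure of the database.
    CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL);
    CREATE TABLE Courses  (cid TEXT, cname TEXT, credits INTEGER);
    CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);

    -- Internal (physical) level: an index on the first column of Students.
    -- The index name is hypothetical.
    CREATE INDEX idx_students_sid ON Students (sid);

    -- External level: a view for one particular purpose.
    -- Assumption: enrollment is the number of students enrolled per course.
    CREATE VIEW Course_info AS
        SELECT cid, COUNT(sid) AS enrollment
        FROM Enrolled
        GROUP BY cid;
""")

# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO Enrolled VALUES (?, ?, ?)",
    [("s1", "K236", "A"), ("s2", "K236", "B"), ("s1", "K228", "A")],
)

# Users of the view never see the underlying tables or the index.
course_info = conn.execute(
    "SELECT cid, enrollment FROM Course_info ORDER BY cid"
).fetchall()
print(course_info)
```

The point of the layering is that the view stays valid even if the physical level changes, for example if the index is dropped or the tables are stored differently.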

Types of data models

• Flat model: a single, two-dimensional array of data elements.
• Hierarchical model: data are organized into a tree-like structure, implying a single upward link in each record to describe the nesting.
• Network model: uses two constructs: records contain fields, and sets define one-to-many relationships between records.
• Relational model: a database is a collection of predicates over a finite set of predicate variables, describing constraints on the possible values and combinations of values.
• Object-relational model: a relational database model in which objects, classes, and inheritance are directly supported in database schemas and in the query language.
• Star schema: the simplest style of data warehouse schema.

Data types

• SYMBOLIC
  - Indexing: e.g., names, tags, case numbers, or serial numbers that identify a respondent or group of respondents.
  - Binary: two values, e.g., YES or NO, SUCCESS or FAILURE, MALE or FEMALE, WHITE or NON-WHITE, FOR or AGAINST, and so on.
  - Boolean: two values, TRUE or FALSE; may also take the value UNKNOWN.
  - Nominal: character-string values (green, blue, red, …)
  - Ordinal: character-string values that are linearly ordered (Small, Middle, Large, …)
• NUMERIC
  - Integer: values are integer numbers.
  - Continuous: values are real numbers.

Why care about data types?

Symbols or numbers:
• Symbolic data: combinatorial search in hypothesis spaces (machine learning)
• Numerical data: often matrix-based computation (multivariate data analysis)

Attributes by structure:
• No structure (only = and ≠ apply): nominal or categorical symbolic attributes, including binary and Boolean. Examples: Places, Color.
• Ordinal structure (order comparisons also apply): ordinal symbolic attributes. Examples: Rank, Resemblance.
• Ring structure (arithmetic operations also apply): measurable numerical attributes. Integer examples: Age, Temperature. Continuous examples: Income, Length.

The possible analysis operations (and thus the methods and algorithms) depend on the data types.

Advances: Data Transformation
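As a small illustration of how the admissible operations depend on the data type, the sketch below summarizes a nominal, an ordinal, and a numeric attribute differently. The sample values are made up; only the value names (green/blue/red, Small/Middle/Large) come from the slides.

```python
from collections import Counter
from statistics import mean

# Made-up sample values for three attributes of different types.
colors = ["green", "blue", "red", "blue"]               # nominal
sizes = ["Small", "Large", "Middle", "Small", "Large"]  # ordinal
ages = [21, 35, 28, 40]                                 # numeric (integer)

# Nominal: only equality is meaningful, so the mode is a valid summary.
mode_color = Counter(colors).most_common(1)[0][0]

# Ordinal: values can also be ranked, so the median is meaningful,
# but it needs the order to be stated explicitly.
SIZE_ORDER = ["Small", "Middle", "Large"]
sorted_sizes = sorted(sizes, key=SIZE_ORDER.index)
median_size = sorted_sizes[len(sorted_sizes) // 2]

# Numeric: arithmetic is meaningful, so the mean can be computed.
mean_age = mean(ages)

print(mode_color, median_size, mean_age)
```

Computing a mean of colors or of sizes would not be meaningful, which is why the choice of method follows from the data type.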

Structures of data

• Structured data
  - Can be stored in an SQL database, in tables with rows and columns.
  - Only about 5-10% of all available data.
• Semi-structured data
  - Does not reside in a relational database but has some organizational properties that make it easier to analyze.
  - XML documents and documents in NoSQL databases are semi-structured.

[Figure: articles in a LaTeX database]

Structures of data

• Unstructured data
  - Unstructured data represent around 80% of all data. They often include text and multimedia content. Examples: e-mail messages, Word documents, videos, photos, audio files, web pages, and many other kinds of business documents.
  - A key issue in data science is how to represent unstructured data. Example: the DNA sequence "…TACATTAGTTATTACATTGAGAAACTTTATAATTAAAAAAGATTC…" can be represented in different ways for computation, such as sliding windows, motifs, or kernel functions. Another example is the web link representation.
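The sliding-window representation mentioned for the DNA example can be sketched as follows. This is one possible reading, assuming fixed-length overlapping windows that are then counted; the window length k = 3 and the function name sliding_windows are illustrative choices, not taken from the slides.

```python
from collections import Counter


def sliding_windows(sequence, k):
    """Return all overlapping length-k substrings of the sequence, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# The DNA fragment quoted on the slide.
dna = "TACATTAGTTATTACATTGAGAAACTTTATAATTAAAAAAGATTC"

windows = sliding_windows(dna, 3)

# Counting the windows turns the unstructured string into a
# fixed vocabulary of features with numeric values.
kmer_counts = Counter(windows)

print(windows[:5])
print(kmer_counts.most_common(3))
```

Once the sequence is expressed as window counts, it can be placed in a table like any other structured data.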

Databases

• The most popular format for organizing data in a database is the rectangular table (also called a data array or data matrix).
  - Each row represents the values of all variables for a single multivariate observation.
  - Each column represents the values of a single variable across all observations.
• A typical database table holding n multivariate observations taken on r variables therefore has n rows and r columns, i.e., it is represented by an (n × r)-matrix.
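A tiny data matrix makes the row and column roles concrete. The values and the choice of three variables (age, income, credit info) are made up for illustration.

```python
# Each inner list is one multivariate observation (a row of the table).
# Hypothetical variables, in column order: age, income, credit_info.
observations = [
    [21, 27000.0, 1],
    [35, 41000.0, 2],
    [28, 33000.0, 1],
    [40, 52000.0, 3],
]

n = len(observations)      # number of observations (rows)
r = len(observations[0])   # number of variables (columns)

# A row holds all variables of one observation.
first_observation = observations[0]

# A column holds one variable across all observations.
age_column = [row[0] for row in observations]

# Transposing swaps the roles of rows and columns.
transposed = [list(column) for column in zip(*observations)]

print(n, r, age_column)
```

Some textbooks work with the transposed layout, so it is worth checking which orientation a given method expects.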

Elements of database systems

• A database management system (DBMS) is a software system that manages data and provides controlled access to the database.
• A database system (consisting of databases, a DBMS, and application programs) is typically used for managing large quantities of data. It can be regarded as two entities:
  - a server (or backend), which holds the DBMS, and
  - a set of clients (or frontend), each consisting of a hardware and a software component, including application programs.

Big data landscape

[Figure: big data landscape, arranged along structured (RDBMS) vs. unstructured (NoSQL DB) and commercial vs. open source. Source: Cisco]

Structured query language (SQL)

• Users communicate with a DBMS through a declarative query language, typically SQL (Structured Query Language).
• SQL has two main sublanguages:
  - a data definition language (DDL), used by the database administrator to define data structures by creating, altering, or destroying database objects;
  - a data manipulation language (DML), an interactive language that allows users to retrieve, delete, and update existing data in the database and to add new data to it.
• Examples
  - create table <table name> (<table elements>);
  - select <columns> from <table name> where <condition>;
  - select max(<column>) as max, min(<column>) as min from <table name> where <condition>;
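The three statement templates above can be run against an in-memory SQLite database using Python's built-in sqlite3 module. This is a minimal sketch: the table layout is a simplified, assumed version of the item table shown later in the lecture, reusing two of its prices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define a data structure.
conn.execute("CREATE TABLE item (item_id TEXT, name TEXT, price REAL)")

# DML: add new data.
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?)",
    [("I3", "high-res TV", 988.00), ("I8", "multidisc CD player", 369.00)],
)

# DML: select <columns> from <table name> where <condition>
expensive = conn.execute(
    "SELECT name FROM item WHERE price > ?", (500,)
).fetchall()

# DML: select max(<column>), min(<column>) from <table name> where <condition>
price_range = conn.execute(
    "SELECT MAX(price) AS max_price, MIN(price) AS min_price "
    "FROM item WHERE price > 0"
).fetchone()

print(expensive, price_range)
```

The queries state what is wanted rather than how to find it, which is what "declarative" means here.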

Flat model: labeled data

[Figure: eight cells, H1-H4 and C1-C4]

Supervised data (labeled)

ID | color | #nuclei | #tails | status
H1 | light | 1 | 1 | healthy
H2 | dark | 1 | 1 | healthy
H3 | light | 1 | 2 | healthy
H4 | light | 2 | 1 | healthy
C1 | dark | 1 | 2 | cancerous
C2 | dark | 2 | 1 | cancerous
C3 | light | 2 | 2 | cancerous
C4 | dark | 2 | 2 | cancerous

Descriptive attributes: color: {dark, light}, #nuclei: {1, 2}, #tails: {1, 2}
Class attribute: status: {cancerous, healthy}

Flat model: unlabeled data

[Figure: eight cells, H1-H4 and C1-C4]

Unsupervised data (unlabeled)

ID | color | #nuclei | #tails | status
H1 | light | 1 | 1 | healthy
H2 | dark | 1 | 1 | healthy
H3 | light | 1 | 2 | healthy
H4 | light | 2 | 1 | healthy
C1 | dark | 1 | 2 | cancerous
C2 | dark | 2 | 1 | cancerous
C3 | light | 2 | 2 | cancerous
C4 | dark | 2 | 2 | cancerous

Descriptive attributes: color: {dark, light}, #nuclei: {1, 2}, #tails: {1, 2}
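One way to read the two flat-model slides is that the same table counts as labeled when the status column is used and as unlabeled when it is withheld. The sketch below shows that split on the eight cells; the variable names are illustrative.

```python
from collections import Counter

# The eight cells from the slides: (ID, color, #nuclei, #tails, status).
labeled = [
    ("H1", "light", 1, 1, "healthy"),
    ("H2", "dark", 1, 1, "healthy"),
    ("H3", "light", 1, 2, "healthy"),
    ("H4", "light", 2, 1, "healthy"),
    ("C1", "dark", 1, 2, "cancerous"),
    ("C2", "dark", 2, 1, "cancerous"),
    ("C3", "light", 2, 2, "cancerous"),
    ("C4", "dark", 2, 2, "cancerous"),
]

# Supervised setting: descriptive attributes plus the class attribute.
features = [row[1:4] for row in labeled]
labels = [row[4] for row in labeled]

# Unsupervised setting: the class attribute is withheld.
unlabeled = [row[:4] for row in labeled]

class_counts = Counter(labels)
print(features[0], class_counts)
```

A classifier would be trained on features together with labels, while a clustering method would see only the unlabeled rows.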

Relational0databases

A relational database is a collection of tables, each of which is assigned a unique name, and consists of a set of attributes and a set of tuples.

customer
cust_ID | name | address | age | income | credit_info | …
C1 | Smith, Sandy | 5463 E Hasting, Burnaby, BC V5A 459, Canada | 21 | $27000 | 1 | …
… | … | … | … | … | … | …

item
item_ID | name | brand | category | type | price | place_made | supplier | cost
I3 | high-res-TV | Toshiba | high resolution | TV | $988.00 | Japan | NIkoX | $600.00
I8 | multidisc-CDplayer | Sanyo | multidisc | CD player | $369.00 | Japan | MusicFont | $120.00
… | … | … | … | … | … | … | … | …

employee
empl_ID | name | category | group | salary | commission
E35 | Jones, Jane | home entertainment | manager | $18,000 | 2%
… | … | … | … | … | …

branch
branch_ID | name | address
B1 | City square | 369 Cambie St., Vancouver, BC V5L 3A2, Canada
… | … | …

purchases
trans_ID | cust_ID | empl_ID | date | time | method_paid | amount
T100 | C1 | E55 | 01/21/98 | 15:45 | Visa | $1357.00
… | … | … | … | … | … | …

items_sold
trans_ID | item_ID | qty
T100 | I3 | 1
T100 | I8 | 2
… | … | …

works_at
empl_ID | branch_ID
E55 | B1
… | …
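Because the tables share key columns (trans_ID, item_ID), information spread over several relations can be recombined with joins. The following sketch loads a reduced, assumed subset of the columns into SQLite and lists the items sold in transaction T100; the derived column line_total is introduced here for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reduced versions of three of the relations shown above.
    CREATE TABLE item       (item_ID TEXT, name TEXT, price REAL);
    CREATE TABLE purchases  (trans_ID TEXT, cust_ID TEXT, amount REAL);
    CREATE TABLE items_sold (trans_ID TEXT, item_ID TEXT, qty INTEGER);

    INSERT INTO item VALUES ('I3', 'high-res-TV', 988.00);
    INSERT INTO item VALUES ('I8', 'multidisc-CDplayer', 369.00);
    INSERT INTO purchases VALUES ('T100', 'C1', 1357.00);
    INSERT INTO items_sold VALUES ('T100', 'I3', 1);
    INSERT INTO items_sold VALUES ('T100', 'I8', 2);
""")

# Follow the keys: purchases -> items_sold -> item.
rows = conn.execute("""
    SELECT p.trans_ID, i.name, s.qty, i.price * s.qty AS line_total
    FROM purchases AS p
    JOIN items_sold AS s ON s.trans_ID = p.trans_ID
    JOIN item AS i ON i.item_ID = s.item_ID
    ORDER BY i.item_ID
""").fetchall()

for row in rows:
    print(row)
```

Storing each fact once and joining on demand is what distinguishes the relational model from a single flat table.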

Data warehouses

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.

[Figure: data sources in Chicago, New York, Vancouver, and Toronto are cleaned, transformed, integrated, and loaded into the data warehouse, which clients access through query and analysis tools.]
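The clean, transform, integrate, and load steps can be sketched on two tiny sources. Everything in this example is hypothetical: the record layouts, the field names, and the currency conversion rate are assumptions chosen only to show why a unified schema is needed.

```python
# Two hypothetical sources with different field names, formats, and currencies.
chicago = [
    {"item": "TV", "amount_usd": "988.00"},
    {"item": "tv ", "amount_usd": "369.00"},
]
toronto = [
    {"product": "CD player", "amount_cad": 500.0},
]

CAD_TO_USD = 0.75  # assumed, purely illustrative conversion rate


def clean(name):
    """Clean: normalize whitespace and letter case in item names."""
    return name.strip().lower()


def from_chicago(record):
    """Transform a Chicago record into the unified schema."""
    return {
        "site": "Chicago",
        "item": clean(record["item"]),
        "amount_usd": float(record["amount_usd"]),
    }


def from_toronto(record):
    """Transform a Toronto record into the unified schema."""
    return {
        "site": "Toronto",
        "item": clean(record["product"]),
        "amount_usd": record["amount_cad"] * CAD_TO_USD,
    }


# Integrate and load: one collection under a single schema.
warehouse = [from_chicago(r) for r in chicago] + [from_toronto(r) for r in toronto]

# A simple analysis query over the integrated data.
total_by_item = {}
for row in warehouse:
    total_by_item[row["item"]] = total_by_item.get(row["item"], 0.0) + row["amount_usd"]

print(total_by_item)
```

Without the cleaning step, "TV" and "tv " would be counted as two different items.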

Transactional databases

• A transactional database consists of a file in which each record represents a transaction.
• A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction.

trans_ID | list of item_IDs
T100 | beer, cake, onigiri
T200 | beer, cake
T300 | beer, onigiri
T400 | beer, onigiri
T500 | cake
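A first step toward mining such data is counting how often items and item pairs occur across transactions. The sketch below does this for the five transactions above; it is a plain counting example, not a full association-rule algorithm.

```python
from collections import Counter
from itertools import combinations

# The five transactions from the slide.
transactions = {
    "T100": ["beer", "cake", "onigiri"],
    "T200": ["beer", "cake"],
    "T300": ["beer", "onigiri"],
    "T400": ["beer", "onigiri"],
    "T500": ["cake"],
}

# Number of transactions containing each single item.
item_counts = Counter(
    item for items in transactions.values() for item in items
)

# Number of transactions containing each pair of items.
# Sorting makes ("beer", "cake") and ("cake", "beer") the same pair.
pair_counts = Counter(
    pair
    for items in transactions.values()
    for pair in combinations(sorted(items), 2)
)

print(item_counts)
print(pair_counts)
```

These counts are the raw material for the association rules covered in lectures 12 and 13.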

Advanced database systems

• Object-oriented databases
• Object-relational databases
• Spatial databases
• Temporal databases and time-series databases
• Text databases and multimedia databases
• Heterogeneous databases and legacy databases
• The World Wide Web

Spatial databases

• Spatial databases contain spatial-related information: geographic databases, VLSI chip design databases, medical and satellite image databases, etc.
• Data mining may uncover patterns describing the characteristics of houses located near a specified kind of location, the climate of mountainous areas located at various altitudes, etc.

[Figure: Japanese earthquakes, 1961-1994]

Temporal and time-series databases

• These databases store time-related data. A temporal database stores relational data that include time-related attributes (timestamps with different semantics). A time-series database stores sequences of values that change with time (e.g., stock exchange data).
• Data analytics finds the characteristics of object evolution and trends of change for objects: e.g., stock exchange data can be mined to uncover trends in investment strategies.
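A very simple way to expose a trend in a time series is a moving average. The sketch below uses a made-up price series and a window of 3; both are illustrative assumptions, and real trend analysis would use more careful methods.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical daily closing prices.
prices = [10.0, 11.0, 13.0, 12.0, 15.0, 17.0]

smoothed = moving_average(prices, 3)

# The raw series dips on day 4, but the smoothed series rises throughout.
is_rising = all(later > earlier for earlier, later in zip(smoothed, smoothed[1:]))

print(smoothed, is_rising)
```

Smoothing trades detail for clarity: short fluctuations disappear so that the longer-term direction becomes visible.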

Text and multimedia databases

• Text databases contain documents, usually highly unstructured or semi-structured. Mining them can uncover general descriptions of object classes, keywords, content associations, clustering behavior of text objects, etc.
• Multimedia databases store image, audio, and video data. Applications include picture content-based retrieval, voice-mail systems, video-on-demand systems, speech-based user interfaces, etc.

The World Wide Web

The Web provides an enormous source of explicit and implicit knowledge that people can navigate and search for what they need.

Example: When examining the data collected from Internet Mart, heavily trodden paths gave BT hints about the regions of the site that were of key interest to its visitors.

Outline

1. Much more data around us than before
2. Data management
3. Data quality problems

Noise, inconsistencies, outliers

Common properties of large real-world databases:

• Incomplete: lacking attribute values or certain attributes of interest
• Noisy: containing errors or outliers
• Inconsistent: containing discrepancies in codes or names

No quality data, no quality data mining results!
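The three problems above can each be checked with a few lines of code. The sketch below is illustrative only: the records, the plausible age range used as a simple range check, and the country-name mapping are all assumptions.

```python
# Hypothetical customer records showing all three quality problems.
records = [
    {"id": "C1", "age": 21, "country": "Canada"},
    {"id": "C2", "age": None, "country": "CA"},     # incomplete: age missing
    {"id": "C3", "age": 210, "country": "canada"},  # noisy: implausible age
    {"id": "C4", "age": 35, "country": "Japan"},
]

AGE_RANGE = (0, 120)  # assumed plausible range for a simple range check
COUNTRY_CODES = {"canada": "Canada", "ca": "Canada", "japan": "Japan"}  # assumed mapping

# Incomplete: records lacking an attribute value.
incomplete = [r["id"] for r in records if r["age"] is None]

# Noisy: values outside the plausible range.
noisy = [
    r["id"]
    for r in records
    if r["age"] is not None and not AGE_RANGE[0] <= r["age"] <= AGE_RANGE[1]
]

# Inconsistent: the same country written in different ways.
raw_countries = {r["country"] for r in records}
clean_countries = {COUNTRY_CODES[r["country"].lower()] for r in records}

print(incomplete, noisy, sorted(raw_countries), sorted(clean_countries))
```

Checks like these belong to data preprocessing, which is the topic of lecture 7.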

KDD nuggets

www.kdnuggets.com is a website of the data mining community.

Homework for K236-L2

• Study the slides carefully. You can consult the book chapter "Data and Databases" provided on the website. Raise questions about anything you have not yet clearly understood.
• Choose 4 datasets from www.statsci.org/datasets.html and summarize each of them (the area where the data were collected, data types, number of features and objects, etc.). The datasets you select must relate to different kinds of data (categorical, ordinal, integer, real number, etc.) and different data representations (vector, sequence, list, graph, etc.).
• The report for this homework must be submitted no later than one week after the class (June 23).