What is Data Science?

DS123

January 7, 2019

Netflix (Recommender System)

        Harry Potter   Iron Man   Saw   Inception   Titanic   Aquaman
Ann     3              5          1     5           5         ?
Ben     2              3          5     4           5         2
Cat     5              4          3     4           5         5
Dan     3              4          1     3           4         4
Emma    4              3          2     4           5         4

How can we predict Ann’s rating on Aquaman?

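Let d(Ann,u) be the sum, over the movies both users have rated, of the absolute difference between Ann's rating and user u's rating. For example, d(Ann,Ben) = |3-2| + |5-3| + |1-5| + |5-4| + |5-5| = 8.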

        Harry Potter   Iron Man   Saw   Inception   Titanic   Aquaman   d(Ann,u)
Ann     3              5          1     5           5         ?
Ben     2              3          5     4           5         2         8
Cat     5              4          3     4           5         5         6
Dan     3              4          1     3           4         4
Emma    4              3          2     4           5         4

In this example, we used the understanding that "people with similar tastes give similar ratings" to construct d(Ann,u), which measures how similar Ann is to each of the other users.
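The distance d(Ann,u) can be computed directly from the rating table. The sketch below is one possible reading, not the course's official method: it assumes absolute differences (which reproduces the values 8 and 6 given for Ben and Cat), skips Aquaman because Ann has not rated it, and uses a hypothetical prediction rule that simply copies the Aquaman rating of the user closest to Ann. The names `distance` and `predict_from_nearest` are illustrative.

```python
# Ratings copied from the table; None marks Ann's unknown Aquaman rating.
MOVIES = ["Harry Potter", "Iron Man", "Saw", "Inception", "Titanic", "Aquaman"]
RATINGS = {
    "Ann":  [3, 5, 1, 5, 5, None],
    "Ben":  [2, 3, 5, 4, 5, 2],
    "Cat":  [5, 4, 3, 4, 5, 5],
    "Dan":  [3, 4, 1, 3, 4, 4],
    "Emma": [4, 3, 2, 4, 5, 4],
}


def distance(a, b):
    """Sum of absolute rating differences over movies both users have rated."""
    return sum(abs(x - y) for x, y in zip(a, b) if x is not None and y is not None)


def predict_from_nearest(target, movie):
    """Assumed rule: copy the rating of the user with the smallest distance."""
    col = MOVIES.index(movie)
    others = {u: r for u, r in RATINGS.items() if u != target and r[col] is not None}
    nearest = min(others, key=lambda u: distance(RATINGS[target], others[u]))
    return nearest, others[nearest][col]


distances = {u: distance(RATINGS["Ann"], r) for u, r in RATINGS.items() if u != "Ann"}
print(distances)
print(predict_from_nearest("Ann", "Aquaman"))
```

Other rules are equally plausible, for example averaging the Aquaman ratings of the few closest users; the choice of rule is part of the modelling.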

Definition of Data Science

Data Science is the discipline of using an understanding of data to build a process or algorithm that extracts structure or new knowledge from that data.

Why are people turning their attention to Data Science?

1. The rapid growth of data in the internet era

- In 2012, roughly 2.5 billion GB of new data was being created per day (IBM)

- This growth comes from the spread of data-recording devices such as mobile phones, CCTV cameras, and satellites

2. The falling price of data storage devices

- A mobile phone can store more than 32 GB, and a computer can store more than 1 TB

Credit: Data Never Sleeps 5.0 — Domo

Types of data

Table data

Titanic data

• A single record is called a data point

• The whole collection is called a dataset

Time Series data

Alcohol Sales data from 2000-2017

• The data has a single variable, Alcohol Sales, which changes with time and season

Graph data

Example of Facebook social network

Image data

Hand-written image data (MNIST)

Image data

Image read by a computer program

• Each pixel takes a value from 0 to 255
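To make the "image read by a computer program" idea concrete, here is a minimal sketch of a grayscale image as a grid of integers. The 4x4 grid is hypothetical and much smaller than a real MNIST digit (28x28 pixels); it also assumes the common convention that 0 is black and 255 is white.

```python
# A hypothetical 4x4 grayscale image (assumption: 0 is black, 255 is white).
image = [
    [0,   0, 255,   0],
    [0, 255, 255,   0],
    [0,   0, 255,   0],
    [0, 255, 255, 255],
]

height = len(image)
width = len(image[0])

# Every pixel must lie in the range 0-255.
in_range = all(0 <= p <= 255 for row in image for p in row)

# Flatten the grid row by row into one list of numbers,
# a common way to hand an image to a learning algorithm.
flat = [p for row in image for p in row]

# Crude text rendering: bright pixels become '#', dark pixels become '.'.
for row in image:
    print("".join("#" if p > 127 else "." for p in row))
```

Flattened this way, a 28x28 image becomes a list of 784 numbers, which is why image data can be treated like one row of a table.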

Extracting structure and knowledge from data

Basic ideas

Concept/Class description

• What are characteristics of people who survived the Titanic incident?

Data statistics

• Central Tendency Measure – Mean, Mode, Median

• Dispersion Measure – Standard deviation, Variance

All of these ideas must generalize well on unseen data!
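The central tendency and dispersion measures listed above can be computed with Python's standard `statistics` module. As a small illustration, the sketch below uses the five Inception ratings from the recommender example; the choice of the population variants (`pvariance`, `pstdev`) rather than the sample variants is an assumption made here, not something the slides specify.

```python
import statistics

# Inception ratings from the example table (Ann, Ben, Cat, Dan, Emma).
ratings = [5, 4, 4, 3, 4]

# Central tendency measures.
mean = statistics.mean(ratings)
median = statistics.median(ratings)
mode = statistics.mode(ratings)

# Dispersion measures (population versions; use variance/stdev for samples).
variance = statistics.pvariance(ratings)
std_dev = statistics.pstdev(ratings)

print(mean, median, mode, variance, std_dev)
```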

Association rules

Frequent patterns

• What items are frequently purchased together at 7-11?

Association, correlation vs causality

• Diaper → Beer

• Fried chicken → Sticky rice

• Coke → Dimsum

We can find “efficient rules” using Weka, more in Chapter 4

How can we make use of these rules?
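One common way to judge a rule such as Diaper → Beer is by its support (how often the items appear together) and its confidence (how often the right-hand side appears given the left-hand side). The slides do not name these measures, so treat the sketch below as an assumption about what "efficient rules" might mean; the baskets are invented purely for illustration.

```python
# Hypothetical shopping baskets, invented purely for illustration.
baskets = [
    {"diaper", "beer", "coke"},
    {"diaper", "beer"},
    {"fried chicken", "sticky rice"},
    {"diaper", "sticky rice"},
    {"fried chicken", "sticky rice", "beer"},
]


def support(items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """Fraction of baskets with the antecedent that also hold the consequent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


print(support({"diaper", "beer"}))
print(confidence({"diaper"}, {"beer"}))
print(confidence({"fried chicken"}, {"sticky rice"}))
```

A high confidence only says the items co-occur; it does not show that buying one causes buying the other, which is the association versus causality distinction raised above.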

Classification

Basic idea

• Construct models based on some labeled training examples.

• Goal: Predicting correct label of future unseen data.

• E.g., classify numbers based on images from MNIST, or classify cars

based on gas mileage.

Typical methods

• Decision trees, naïve Bayesian classification, support vector machines,

neural networks, nearest neighbours, logistic regression, ...

Typical applications: Credit card fraud detection, diseases, ratings, ...
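As a rough sketch of the classification workflow, the example below applies nearest neighbours, one of the methods listed above, to the "classify cars based on gas mileage" idea. The training examples, the two labels, and the function name `classify` are all hypothetical; a real task would use many more examples and features.

```python
from collections import Counter

# Hypothetical labeled training examples: (gas mileage in km per litre, label).
training = [
    (8.0, "low efficiency"),
    (9.5, "low efficiency"),
    (11.0, "low efficiency"),
    (16.0, "high efficiency"),
    (18.5, "high efficiency"),
    (21.0, "high efficiency"),
]


def classify(mileage, k=3):
    """Return the majority label among the k training examples closest in mileage."""
    nearest = sorted(training, key=lambda ex: abs(ex[0] - mileage))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Predict labels for two unseen cars.
print(classify(10.0))
print(classify(19.0))
```

The model here is just the stored training set plus a voting rule; what matters is how well its predictions hold up on cars it has never seen.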

Cluster analysis

Basic idea

• Construct models based on some unlabeled training examples.

• Goal: Group similar inputs to form new categories.

• E.g., clustering online customers, clustering English words.

Typical methods

• K-means clustering, Hierarchical clustering, Gaussian Mixture Model.
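To show how grouping works without labels, here is a minimal one-dimensional K-means sketch. The spending figures and the starting centroids are hypothetical, and the fixed starting centroids are a simplification; practical implementations usually pick them at random and work in more than one dimension.

```python
def k_means(points, centroids, iterations=10):
    """Plain 1-D k-means: alternate between assignment and centroid update."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical unlabeled data: monthly spending of online customers.
spending = [10, 12, 11, 50, 52, 49, 13, 51]
centroids, clusters = k_means(spending, centroids=[10, 50])
print(centroids)
print(clusters)
```

The two resulting groups were never given names; interpreting them (for instance as light and heavy spenders) is left to the analyst.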

Learning path for Data Science
