effective data analysis with R


Page 1: effective data analysis with R

Effective Data Analysis: A Comprehensive Workflow with TTVM Process & R

呂奕

Page 2: effective data analysis with R

About

These slides provide introductory material to help you improve your skills in manipulating data, modeling efficiently, and drawing insights from that process.

The tool used in these slides is R, a popular open-source tool that is both statistical software and a programming language.

Page 3: effective data analysis with R

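Do any of these problems sound familiar?

• You don't know what data is available to use
• The data you receive is as unreadable as hieroglyphics and impossible to organize
• When you finally decide to clean the data, you don't know where to start
• Cleaning by eyeballing gives no guarantee against omissions or slips of the hand
• Once the data goes wrong there is no way back, and your folder fills up with "xxx_backup" files
• After the cleaning is finally done, very few analysis methods are open to you
• The next time the same task comes up, you start from scratch again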

Page 4: effective data analysis with R

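Do any of these problems sound familiar?

• Nobody else can follow your processing steps, so collaboration is hard
• Once you pick a method, carrying it out is painful: you start digging through a thick stack of reference material
• Plotting is painful
• Modeling is painful
• You have a pile of models and don't know how to interpret or choose among them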

Page 5: effective data analysis with R

What is data analysis?

Page 6: effective data analysis with R

Data analysis is the process by which data becomes understanding, knowledge, and insight.

Page 7: effective data analysis with R


Why R?

Page 8: effective data analysis with R

http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?pagewanted=all


Page 9: effective data analysis with R

Why R?

• It is free
• It has a comprehensive set of packages for data access, data cleaning, analysis, and data reporting
• It has one of the best development environments: RStudio (http://www.rstudio.com/)
• It has an amazing ecosystem of developers
• Packages are easy to install and "play nicely together"

Page 10: effective data analysis with R

Why NOT SPSS?

Because the concepts covered in the following slides are all hard to put into practice in SPSS, and SPSS is expensive.

Page 11: effective data analysis with R


TTVM Process

Page 12: effective data analysis with R


Before thinking outside the box

We have to look inside the black box and figure out how it works.

Not until we understand the mechanism of (quantitative) data analysis do we really master the (quantitative) analysis skill.

Page 13: effective data analysis with R

Workflow stages (diagram modified from Hadley Wickham): Acquiring Data, Tidy, Transform, Visualize, Model, Interpret

Page 14: effective data analysis with R

It used to be: Computation time >> Cognition time

Page 15: effective data analysis with R

https://www.flickr.com/photos/mutsmuts/4695658106

It should be: Cognition time >> Computation time

Page 16: effective data analysis with R


Tidy Data

Page 17: effective data analysis with R


Page 18: effective data analysis with R

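Data you meet in the real world…

• Usually comes from sources that have never been cleaned
• Is scattered across many places, stored in formats that almost never match
• Must be combined with various other data sources before it yields useful information
• Arrives dynamically and keeps growing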

Page 19: effective data analysis with R


Data analysis

It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data (Dasu and Johnson 2003).

Page 20: effective data analysis with R

Defining tidy data

Like families, tidy datasets are all alike but every messy dataset is messy in its own way.

1. Each variable forms a column.

2. Each observation forms a row.

3. Each type of observational unit forms a table.

This is Codd's 3rd normal form (Codd 1990)

Page 21: effective data analysis with R

Messy dataset

religion                 <$10k  $10-20k  $20-30k  $30-40k  $40-50k  $50-75k
Agnostic                    27       34       60       81       76      137
Atheist                     12       27       37       52       35       70
Buddhist                    27       21       30       34       33       58
Catholic                   418      617      732      670      638     1116
Don't know/refused          15       14       15       11       10       35
Evangelical Prot           575      869     1064      982      881     1486
Hindu                        1        9        7        9       11       34
Historically Black Prot    228      244      236      238      197      223
Jehovah's Witness           20       27       24       24       21       30
Jewish                      19       19       25       25       30       95

Page 22: effective data analysis with R

Tidy data

religion  income              freq
Agnostic  <$10k                 27
Agnostic  $10-20k               34
Agnostic  $20-30k               60
Agnostic  $30-40k               81
Agnostic  $40-50k               76
Agnostic  $50-75k              137
Agnostic  $75-100k             122
Agnostic  $100-150k            109
Agnostic  >150k                 84
Agnostic  Don't know/refused    96
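
As a rough illustration of the step from the messy table to the tidy one, here is a minimal sketch in base R. It is not taken from the slides: the names `messy`, `tidy`, and `income_cols` are made up for this example, and only the first three religions and the first three income brackets are typed in to keep it short.

```r
# Messy layout: one row per religion, one column per income bracket.
# Values copied from the first three rows of the messy dataset.
messy <- data.frame(
  religion  = c("Agnostic", "Atheist", "Buddhist"),
  `<$10k`   = c(27, 12, 27),
  `$10-20k` = c(34, 27, 21),
  `$20-30k` = c(60, 37, 30),
  check.names = FALSE,       # keep the bracket labels as column names
  stringsAsFactors = FALSE
)

# Every column except `religion` holds a value of the variable "income".
income_cols <- setdiff(names(messy), "religion")

# Stack the income columns: one row per (religion, income) pair.
tidy <- data.frame(
  religion = rep(messy$religion, times = length(income_cols)),
  income   = rep(income_cols, each = nrow(messy)),
  freq     = unlist(messy[income_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Sort so that each religion's rows sit together, as in the tidy table.
tidy <- tidy[order(tidy$religion), ]
rownames(tidy) <- NULL
print(tidy)
```

After this, each variable (religion, income, freq) is a column and each observation is a row. Packages such as tidyr provide the same reshaping in a single call (for example `pivot_longer()`); base R is used here only so that the sketch runs without installing anything.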

Page 23: effective data analysis with R

If you can get things done in one go, don't spend dozens of attempts on it!

Page 24: effective data analysis with R


Page 25: effective data analysis with R


Transformation

Page 26: effective data analysis with R

Split, Apply, Combine

Original data:

name  n
Al    2
Bo    4
Bo    0
Bo    5
Ed    5
Ed   10

Split by name:

name  n        name  n        name  n
Al    2        Bo    4        Ed    5
               Bo    0        Ed   10
               Bo    5

Apply (sum n within each group):

total          total          total
2              9              15

Combine:

name  n
Al    2
Bo    9
Ed   15
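
The split, apply, combine steps can be sketched in base R as follows. This is one possible way to write it, not the code from the original slides; the object names (`df`, `pieces`, `totals`, `result`) are illustrative.

```r
# Data from the Split, Apply, Combine example.
df <- data.frame(
  name = c("Al", "Bo", "Bo", "Bo", "Ed", "Ed"),
  n    = c(2, 4, 0, 5, 5, 10),
  stringsAsFactors = FALSE
)

# Split: one vector of n values per name.
pieces <- split(df$n, df$name)

# Apply: sum the values within each group.
totals <- vapply(pieces, sum, numeric(1))

# Combine: put the group totals back into a single data frame.
result <- data.frame(
  name = names(totals),
  n    = unname(totals),
  stringsAsFactors = FALSE
)
print(result)

# The same three steps folded into one call.
result2 <- aggregate(n ~ name, data = df, FUN = sum)
```

Both `result` and `result2` hold the totals Al 2, Bo 9, Ed 15. Writing the steps out separately makes the pattern visible; the one-line `aggregate()` form is what you would more likely keep in a script.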

Page 27: effective data analysis with R

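Simple and comprehensible code makes your work replicable and easy to debug

Most importantly: you never have to scrap everything and rebuild from nothing; if something goes wrong, you can simply run it again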

Page 28: effective data analysis with R


Re-level

Page 29: effective data analysis with R


Recode

Page 30: effective data analysis with R


Manipulating variables
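
The transcript keeps only the titles of the re-level, recode, and manipulating-variables slides, so the following is a hedged sketch of what these three operations might look like in base R. The `survey` data frame, its columns, and the category labels are all hypothetical and were invented for this illustration.

```r
# Hypothetical survey data, for illustration only.
survey <- data.frame(
  edu = c("high", "low", "mid", "low", "high"),
  age = c(25, 41, 33, 58, 47),
  stringsAsFactors = FALSE
)

# Re-level: fix the order of the levels, then change the reference level.
survey$edu <- factor(survey$edu, levels = c("low", "mid", "high"))
survey$edu_ref_high <- relevel(survey$edu, ref = "high")

# Recode: map old values to new labels with a named lookup vector.
lookup <- c(low = "basic", mid = "basic", high = "advanced")
survey$edu2 <- unname(lookup[as.character(survey$edu)])

# Manipulating variables: derive a new variable from an existing one.
# Intervals are (0, 40] and (40, Inf), so age 40 falls in the first group.
survey$age_group <- cut(survey$age, breaks = c(0, 40, Inf),
                        labels = c("40 or under", "over 40"))
print(survey)
```

Because every step is a line of code rather than a manual edit, the original `edu` and `age` columns stay untouched and the whole transformation can be rerun or corrected at any time.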

Page 31: effective data analysis with R


Visualization

Page 32: effective data analysis with R

How does the data need to be processed before it can be plotted?

Page 33: effective data analysis with R

Running tables until your eyes glaze over

Page 34: effective data analysis with R

Drawing charts by hand until you drop

Page 35: effective data analysis with R

FA and clustering seem hard

Page 36: effective data analysis with R


Correspondence Analysis

Page 37: effective data analysis with R


Modeling

Page 38: effective data analysis with R


Flexible / Learning by doing

Page 39: effective data analysis with R

Plenty of packages that others have generously written for you

Page 40: effective data analysis with R

http://www.slideshare.net/ckliu/z-b-38495724 | http://gene.speaking.tw/2014/10/28.html

A workflow with a clear thread makes problems easy to spot

Page 41: effective data analysis with R


Data Product

Page 42: effective data analysis with R

Hi, I'm the Easter egg. Aloha~~

Page 43: effective data analysis with R

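A brief look at big data

Never mind whether it is big or not: do you know how the analysis methods differ? → In fact, they hardly differ: hypothesize, validate, predict (learn)

Besides, any file that still fits on your hard drive basically doesn't count as big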

Page 44: effective data analysis with R

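http://www.bnext.com.tw/article/view/id/34692

A brief look at big data

Big data is not a new concept; it is just a topic that has been inexplicably hyped in recent years. Big data really is a management problem: many companies still treat it as data mining and merely run their transaction data, and even more don't believe in data at all. Too much unstructured data and machine data goes unused, and even open structured data is not being used at all. Then there is the question of how far a company takes data-driven decisions: is it only done in marketing? Strategically accumulating your data, training your models, and the degree of data automation will all become competitive advantages over your rivals. Applications built by combining big data + machine learning + cloud will largely replace the business models of the past.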