Data Analysis with Python - Pandas | WeiYuan

Data Analysis with Python - Pandas: Python Programming for Beginners | WeiYuan


site: v123582.github.io
line: weiwei63

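§ Full-stack engineer + data scientist. Knows a little about front-end and back-end web development, and has picked up the basics of data mining and machine learning. Enjoys attending tech community meetups and contributing to open-source software.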

Outline

§ What is Pandas?
§ Series
§ DataFrame
§ Descriptive statistics and statistical functions
§ Merging and grouping data
§ Missing data and sparse data

What is Pandas?

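§ Pandas is a data analysis library built on top of NumPy. It provides a rich set of high-level data structures and data-processing methods, with the goal of making data analysis efficient. Its two main data structures are Series and DataFrame, both of which are built on NumPy's ndarray. A DataFrame can be thought of as a container of Series: each column of a DataFrame is a Series.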
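The relationship between DataFrame, Series, and ndarray can be checked directly. The following is a minimal sketch, assuming pandas and NumPy are installed; the table and its column names `A` and `B` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column table, used only for illustration.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})

col = df["A"]          # selecting one column yields a Series
arr = col.to_numpy()   # the values of a Series are exposed as a NumPy ndarray

print(type(col).__name__)          # Series
print(type(arr).__name__)          # ndarray
print(col.index.equals(df.index))  # True: the column shares the table's row index
```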

Import pandas into Python

import numpy as np
import pandas as pd

Series and DataFrame

s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

# 0    1.0
# 1    3.0
# 2    5.0
# 3    NaN
# 4    6.0
# 5    8.0
# dtype: float64

Series and DataFrame

d = pd.DataFrame(np.random.randn(6, 4), index=np.arange(6), columns=['A', 'B', 'C', 'D'])
print(d)

# Sample output (the values are random and differ on every run):
#           A         B         C         D
# 0  0.358221 -0.870112 -1.393456 -0.902327
# 1  1.210681 -0.484630  1.551892 -1.747265
# 2  0.587932 -0.433354 -0.742197 -0.128311
# 3 -0.100495 -0.742343 -0.356780 -0.346326
# 4 -0.789095  0.494642 -0.368307  0.614529
# 5  1.689294 -1.468678  2.886471  1.076100

Outline

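§ What is Pandas?
§ Series
§ DataFrame
§ Descriptive statistics and statistical functions
§ Merging and grouping data
§ Missing data and sparse data

§ Series Definition
§ Basic usage of a Series
§ Create a Series
§ Access a Series
§ Reindex
§ Inserting or dropping data
§ Arithmetic and data alignment
§ Sorting and ranking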

Series

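§ A Series is a one-dimensional array container, similar to a one-dimensional NumPy array. Besides a set of values it also holds a set of index labels, so it can be understood as an array with an index. It is a one-dimensional labelled array that can hold data of any type (integers, strings, floating-point numbers, and so on); the labels are called the index.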

Series Definition

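pandas.Series(data, index, dtype, copy)

• data => the values, in any of several forms: an ndarray, a list, a dict, or a scalar constant.
• index => the row labels. They must be hashable, and there must be as many labels as values. When no index is passed, the labels default to 0, 1, ..., n-1. Labels are usually unique, but pandas does not require it.
• dtype => the data type of the values. It is inferred from the data when omitted.
• copy => whether to copy the input data. Defaults to False.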
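The parameters above can be tried one at a time. This is a small sketch rather than a complete tour of the constructor; all variable names and sample values are illustrative assumptions.

```python
import pandas as pd

# data: a plain list; with no index argument the labels are 0, 1, 2.
from_list = pd.Series([1, 2, 3])
print(from_list.index.tolist())   # [0, 1, 2]

# index: one label per value.
labelled = pd.Series([1, 2, 3], index=["x", "y", "z"])
print(labelled["y"])              # 2

# dtype: force a type instead of letting pandas infer it.
as_float = pd.Series([1, 2, 3], dtype="float64")
print(as_float.dtype)             # float64

# Labels do not have to be unique; a duplicated label selects every match.
dup = pd.Series([10, 20, 30], index=["a", "a", "b"])
print(dup["a"].tolist())          # [10, 20]
```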

Series vs Dictionary


Series vs NdArray
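The content of these two comparison slides did not survive in the transcript. As a rough sketch of the usual comparison (an assumption, not the author's original material), a Series answers label lookups like a dictionary and supports vectorised arithmetic like an ndarray:

```python
import numpy as np
import pandas as pd

s = pd.Series({"a": 1, "b": 2, "c": 3})

# Like a dict: look up by key and test membership on the labels.
print(s["b"])       # 2
print("b" in s)     # True
print("z" in s)     # False

# Like an ndarray: vectorised arithmetic and boolean masks.
print((s * 10).tolist())           # [10, 20, 30]
print(s[s > 1].tolist())           # [2, 3]
print(np.sqrt(s).index.tolist())   # ['a', 'b', 'c'] (NumPy functions keep the labels)

# Unlike a dict: the entries are ordered and can be sliced by position.
print(s.iloc[:2].tolist())         # [1, 2]
```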


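Basic usage of a Series

§ axes: returns a list of the row axis labels.
§ dtype: returns the data type of the values.
§ empty: returns True if the Series is empty.
§ ndim: returns the number of dimensions of the underlying data, which is 1 by definition.
§ size: returns the number of elements in the underlying data.
§ values: returns the Series as an ndarray.
§ head(): returns the first n rows (n defaults to 5).
§ tail(): returns the last n rows (n defaults to 5).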
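The attributes and methods listed above can be exercised on a small Series. The sample values below are arbitrary and chosen only for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60])

print(len(s.axes))          # 1: a Series has a single axis, its row index
print(s.dtype)              # int64
print(s.empty)              # False
print(s.ndim)               # 1
print(s.size)               # 6
print(s.values)             # [10 20 30 40 50 60]
print(s.head(2).tolist())   # [10, 20]
print(s.tail(2).tolist())   # [50, 60]
```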


Create a Series

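import pandas as pd

# Pass dtype explicitly: since pandas 2.0 an empty Series defaults to
# dtype object rather than float64.
s = pd.Series(dtype="float64")
print(s)
# Series([], dtype: float64)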

Create a Series

import pandas as pd
import numpy as np

data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data)
print(s)

# 0    a
# 1    b
# 2    c
# 3    d
# dtype: object

Create a Series

import pandas as pd
import numpy as np

data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data, index=[100, 101, 102, 103])
print(s)

# 100    a
# 101    b
# 102    c
# 103    d
# dtype: object

Create a Series

import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(data)
print(s)

# a    0.0
# b    1.0
# c    2.0
# dtype: float64

Create a Series

import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}
# the index picks and orders the dict entries; 'd' is not a key, so it becomes NaN
s = pd.Series(data, index=['b', 'c', 'd', 'a'])
print(s)

# b    1.0
# c    2.0
# d    NaN
# a    0.0
# dtype: float64

Create a Series

import pandas as pd

# a scalar is repeated once for every index label
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)

# 0    5
# 1    5
# 2    5
# 3    5
# dtype: int64

Try it !

§ #Exercise: Write a Python program to create and display a one-dimensional array-like object containing an array of data using the pandas module.

Try it !

§ #Exercise: Write a Python program to convert a pandas Series to a Python list and display its type.
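One possible solution is sketched here, on the assumption that `Series.tolist()` is the intended tool; the sample values are arbitrary.

```python
import pandas as pd

ds = pd.Series([2, 4, 6, 8, 10])
print(type(ds).__name__)        # Series

as_list = ds.tolist()           # convert to a plain Python list
print(as_list)                  # [2, 4, 6, 8, 10]
print(type(as_list).__name__)   # list
```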

Try it !

§ #Exercise: Write a Python program to convert a Python dictionary to a pandas Series and display its type.

Access a Series

import pandas as pd
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# inspect the labels and the values
print(s.index)    # the labels: 'a', 'b', 'c', 'd', 'e'
print(s.values)   # [1 2 3 4 5]

Access a Series

import pandas as pd
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# retrieve the first element
print(s.iloc[0])    # 1, by position (s[0] on a string index is deprecated)
print(s['a'])       # 1, by label
print(s.get('a'))   # 1, by label

Access a Series

import pandas as pd
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# retrieve the first three elements
print(s.iloc[:3])   # by position; the end position is excluded
print(s[:'c'])      # by label; the end label is included

# a    1
# b    2
# c    3
# dtype: int64

Access a Series

import pandas as pd
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# retrieve the last three elements
print(s.iloc[-3:])   # by position
print(s['c':])       # by label

# c    3
# d    4
# e    5
# dtype: int64

Access a Series

import pandas as pd
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# retrieve multiple elements
print(s.iloc[[0, 2, 3]])      # by a list of positions
print(s[['a', 'c', 'd']])     # by a list of labels

Access a Series

import pandas as pd
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# a label that does not exist raises an error
print(s['f'])
# KeyError: 'f'
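When a label may be absent, the `KeyError` can be avoided or handled. The sketch below uses `Series.get` with a default and a plain `try`/`except`; the default value `0` is an arbitrary choice for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

print(s.get("f"))      # None: get() returns a default instead of raising
print(s.get("f", 0))   # 0
print("f" in s)        # False: membership is tested against the labels

try:
    s["f"]
except KeyError as err:
    print("missing label:", err)   # missing label: 'f'
```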

Reindex

import pandas as pd
s = pd.Series([1, 2, 3], index=['c', 'b', 'a'])

# assigning to .index replaces the labels in place; the values keep their order
s.index = ['a', 'b', 'c']
print(s)

# a    1
# b    2
# c    3
# dtype: int64

Reindex

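import pandas as pd
s = pd.Series([1, 2, 3], index=['c', 'b', 'a'])

# reindex() rearranges the values to follow the new label order;
# fill_value is used only for labels missing from the original index
s = s.reindex(['a', 'b', 'c'], fill_value=0)
print(s)

# a    3
# b    2
# c    1
# dtype: int64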
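The two Reindex slides are easy to confuse, so here is a side-by-side sketch of the difference. The extra label `d` and the variable names are assumptions added for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["c", "b", "a"])

# Assigning to .index relabels positions: the values stay where they are.
relabelled = s.copy()
relabelled.index = ["a", "b", "c"]
print(relabelled.to_dict())   # {'a': 1, 'b': 2, 'c': 3}

# reindex() looks each value up by its label and rearranges accordingly.
reordered = s.reindex(["a", "b", "c"])
print(reordered.to_dict())    # {'a': 3, 'b': 2, 'c': 1}

# fill_value only applies to labels absent from the original index.
extended = s.reindex(["a", "b", "c", "d"], fill_value=0)
print(extended.to_dict())     # {'a': 3, 'b': 2, 'c': 1, 'd': 0}
```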

Reindex

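import pandas as pd
s = pd.Series([1, 2, 3])

# new labels 3, 4, 5 have no data; 'ffill' copies the last known value forward
ffill = s.reindex(range(6), method='ffill')
print(ffill)   # values: 1, 2, 3, 3, 3, 3

# 'bfill' copies the next known value backward; none exists after label 2
bfill = s.reindex(range(6), method='bfill')
print(bfill)   # values: 1.0, 2.0, 3.0, NaN, NaN, NaN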

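Inserting or dropping data

import pandas as pd
ds = pd.Series([1, 2, 3, 4, 5])

# Series.append() was removed in pandas 2.0; pd.concat() is the replacement.
# A bare scalar cannot be concatenated:
pd.concat([ds, 6])

# TypeError: cannot concatenate object of type '<class 'int'>'; only Series and DataFrame objs are valid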

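Inserting or dropping data

ds = pd.Series([1, 2, 3, 4, 5])
pd.concat([ds, pd.Series([6])])   # wrap the new value in a Series first

# 0    1
# 1    2
# 2    3
# 3    4
# 4    5
# 0    6
# dtype: int64

# the original labels are kept, so label 0 appears twice;
# pass ignore_index=True to renumber from 0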

Inserting or dropping data

# Series.set_value() was removed in pandas 1.0; assign through .loc instead
ds.loc[max(ds.index) + 1] = 6

# 0    1
# 1    2
# 2    3
# 3    4
# 4    5
# 5    6
# dtype: int64

Inserting or dropping data

import numpy as np

# rebuild the Series from its raw values; the index restarts at 0
pd.Series(np.concatenate((ds.values, [6])))

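Inserting or dropping data

import pandas as pd
s = pd.Series([1, 2, 3])

del s[0]    # remove label 0 in place
s.pop(1)    # remove label 1 in place and return its value (2)
s.drop(2)   # return a new Series without label 2; s itself is unchanged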
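The three removal styles differ in whether they modify the Series or return a new one. A short sketch that prints the state after each step, with variable names chosen for illustration:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

dropped = s.drop(2)       # returns a new Series; s is unchanged
print(dropped.tolist())   # [1, 2]
print(s.tolist())         # [1, 2, 3]

popped = s.pop(1)         # removes label 1 in place and returns its value
print(popped)             # 2
print(s.tolist())         # [1, 3]

del s[0]                  # removes label 0 in place
print(s.tolist())         # [3]
```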

Arithmetic and data alignment

import pandas as pd
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([3, 2, 1])

# arithmetic is element-wise; a scalar is applied to every element
s1 + s2
s1 + 1
s1 - s2
s1 - 1
s1 * s2
s1 * 2
s1 / s2
s1 / 2

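Arithmetic and data alignment

import pandas as pd
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([3, 2, 1], index=['d', 'e', 'f'])

# operands are aligned by label first; these two indexes share no labels,
# so every Series-with-Series result is NaN for all of 'a'..'f'
s1 + s2
s1 + 1
s1 - s2
s1 - 1
s1 * s2
s1 * 2
s1 / s2
s1 / 2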
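Alignment is easier to see when the indexes overlap only partly. The sketch below uses made-up labels and values, and shows `Series.add` with `fill_value` as one way to treat a missing label as zero instead of producing NaN.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

total = s1 + s2   # labels present on only one side become NaN
print(total)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
# dtype: float64

filled = s1.add(s2, fill_value=0)   # treat a missing operand as 0
print(filled.tolist())              # [1.0, 12.0, 23.0, 30.0]
```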

Try it !

§ #Exercise: Write a Python program to add, subtract, multiply and divide two given pandas Series.
• Sample Input Given: [2, 4, 6, 8, 10], [1, 3, 5, 7, 9]

Try it !

§ #Exercise: Write a Python program to add, subtract, multiply and divide two pandas Series from user input.
• Sample Input: [2, 4, 6, 8, 10], [1, 3, 5, 7, 9]

Try it !

§ #Exercise: Write a Python program to compare the elements of two pandas Series.
• Sample Input: [2, 4, 6, 8, 10], [1, 3, 5, 7, 9]
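One possible approach to the comparison exercise, sketched with the sample input above: the comparison operators work element-wise and return a boolean Series. The variable names are illustrative.

```python
import pandas as pd

ds1 = pd.Series([2, 4, 6, 8, 10])
ds2 = pd.Series([1, 3, 5, 7, 9])

equal = ds1 == ds2     # element-wise comparison, returns a boolean Series
greater = ds1 > ds2
less = ds1 < ds2

print(equal.tolist())     # [False, False, False, False, False]
print(greater.tolist())   # [True, True, True, True, True]
print(less.tolist())      # [False, False, False, False, False]
print(greater.all())      # True: every element of ds1 is larger
```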

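Sorting and ranking

import pandas as pd
s = pd.Series([1, 2, 3])

s.sort_index()                   # sort by label, ascending
s.sort_index(ascending=False)    # sort by label, descending
s.sort_values()                  # sort by value, ascending
s.sort_values(ascending=False)   # sort by value, descending
s.rank()                         # rank of each value; the smallest gets rank 1
s.rank(ascending=False)          # the largest value gets rank 1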
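Sorting and ranking are more informative on unsorted data with ties. In the sketch below the values and labels are invented for illustration; `method="min"` is one of several tie-breaking options that `rank` accepts.

```python
import pandas as pd

s = pd.Series([30, 10, 20, 10], index=["d", "b", "c", "a"])

print(s.sort_index().index.tolist())   # ['a', 'b', 'c', 'd']
print(s.sort_values().tolist())        # [10, 10, 20, 30]

# By default tied values share the average of the ranks they span.
print(s.rank().tolist())                  # [4.0, 1.5, 3.0, 1.5]
print(s.rank(method="min").tolist())      # [4.0, 1.0, 3.0, 1.0]
print(s.rank(ascending=False).tolist())   # [1.0, 3.5, 2.0, 3.5]
```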

Thanks for listening.
2017/08/03 (Thu.) Scientific Computing with Python – NumPy
Wei-Yuan [email protected]