Data Analysis with Python - Pandas | WeiYuan
wei-yuan-chang -
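site: v123582.github.io
line: weiwei63
§ Full-stack engineer + data scientist. Knows a little about front-end and back-end web development and has picked up the basics of data mining and machine learning. Enjoys taking part in tech community meetups and contributing to open source.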
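What is Pandas?
§ Pandas is a data analysis library built on top of NumPy. It provides a large set of high-level data structures and data-processing methods aimed at efficient data analysis. Its two main data structures are Series and DataFrame, both built on NumPy's ndarray. A DataFrame can be thought of as a container of Series; in other words, a DataFrame is made up of Series.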
Series and DataFrame
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
# 0    1.0
# 1    3.0
# 2    5.0
# 3    NaN
# 4    6.0
# 5    8.0
# dtype: float64
Series and DataFrame
import numpy as np
import pandas as pd

d = pd.DataFrame(np.random.randn(6, 4), index=np.arange(6), columns=['A', 'B', 'C', 'D'])
print(d)
# Example output (the values are random and differ on every run):
#           A         B         C         D
# 0  0.358221 -0.870112 -1.393456 -0.902327
# 1  1.210681 -0.484630  1.551892 -1.747265
# 2  0.587932 -0.433354 -0.742197 -0.128311
# 3 -0.100495 -0.742343 -0.356780 -0.346326
# 4 -0.789095  0.494642 -0.368307  0.614529
# 5  1.689294 -1.468678  2.886471  1.076100
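The idea that a DataFrame is a container of Series can be checked directly. The following is a minimal sketch; the small frame `df` and the names `col` and `row` are made up for illustration.

```python
import pandas as pd

# A tiny DataFrame with two columns (illustrative data).
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0]})

col = df['A']       # selecting one column
row = df.iloc[0]    # selecting one row by position

print(type(col).__name__)           # Series
print(type(row).__name__)           # Series
print(col.index.equals(df.index))   # True: the column shares the DataFrame's row index
print(row.index.tolist())           # ['A', 'B']: a row is labeled by the column names
```

Both a column and a row come back as a Series; only the labels differ (row labels for a column, column names for a row).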
Outline
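§ What is Pandas?
§ Sequences: Series
§ Tables: DataFrame
§ Descriptive statistics and statistical functions
§ Merging and grouping data
§ Missing data and sparse data

§ Series Definition
§ Basic usage of Series
§ Create a Series
§ Access a Series
§ Reindex
§ Inserting or dropping data
§ Arithmetic operations and data alignment
§ Sorting and ranking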
Series
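§ A Series is a one-dimensional array container, similar to a one-dimensional NumPy array. Besides a set of values it also holds a set of index labels, so it can be understood as an array with an index. It is a one-dimensional labeled array that can hold any type of data (integers, strings, floating-point numbers, and so on); the labels are collectively called the index.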
Series Definition
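pandas.Series(data, index, dtype, copy)
• data => the data, which can take various forms, such as an ndarray, a list, or a constant (scalar).
• index => index labels must be hashable and the same length as the data. Duplicate labels are allowed, though unique labels keep lookups unambiguous. If no index is passed, it defaults to 0, 1, ..., n-1, as in np.arange(n).
• dtype => the data type. If omitted, the data type is inferred from the data.
• copy => whether to copy the input data; defaults to False.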
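As a rough illustration of these parameters, the sketch below builds one Series with the defaults and one with `index` and `dtype` passed explicitly. The variable names `default_s` and `labeled_s` are invented for this example.

```python
import pandas as pd

# Defaults: the index is 0..n-1 and the dtype is inferred from the data.
default_s = pd.Series([1, 2, 3])

# Explicit index labels and an explicit dtype.
labeled_s = pd.Series([1, 2, 3], index=['x', 'y', 'z'], dtype='float64')

print(default_s.index.tolist())   # [0, 1, 2]
print(default_s.dtype)            # int64
print(labeled_s.index.tolist())   # ['x', 'y', 'z']
print(labeled_s.dtype)            # float64
```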
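Basic usage of Series
§ axes: returns a list of the row axis labels.
§ dtype: returns the data type (dtype) of the object.
§ empty: returns True if the Series is empty.
§ ndim: returns the number of dimensions of the underlying data, which is 1 by definition.
§ size: returns the number of elements in the underlying data.
§ values: returns the Series as an ndarray.
§ head(): returns the first n rows.
§ tail(): returns the last n rows.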
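The attributes and methods listed above can be tried on a small Series. This is only a sketch with made-up data, not part of the original slides.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60])

print(s.axes)      # a list holding the single row index
print(s.dtype)     # int64
print(s.empty)     # False
print(s.ndim)      # 1
print(s.size)      # 6
print(s.values)    # the underlying NumPy array
print(s.head(2))   # the first 2 rows
print(s.tail(2))   # the last 2 rows
```

When called without an argument, `head()` and `tail()` return 5 rows.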
Create a Series
import pandas as pd

# Passing dtype explicitly avoids the version-dependent default dtype of an empty Series
s = pd.Series(dtype='float64')
print(s)
# Series([], dtype: float64)
Create a Series
import pandas as pd
import numpy as np

data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data)
print(s)
# 0    a
# 1    b
# 2    c
# 3    d
# dtype: object
Create a Series
import pandas as pd
import numpy as np

data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data, index=[100, 101, 102, 103])
print(s)
# 100    a
# 101    b
# 102    c
# 103    d
# dtype: object
Create a Series
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(data)
print(s)
# a    0.0
# b    1.0
# c    2.0
# dtype: float64
Create a Series
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}
# the index picks keys from the dict; 'd' is not a key, so its value is NaN
s = pd.Series(data, index=['b', 'c', 'd', 'a'])
print(s)
# b    1.0
# c    2.0
# d    NaN
# a    0.0
# dtype: float64
Create a Series
import pandas as pd

# a scalar is repeated for every index label
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)
# 0    5
# 1    5
# 2    5
# 3    5
# dtype: int64
Try it !
§ Exercise: Write a Python program to create and display a one-dimensional array-like object containing an array of data using the Pandas module.
Try it !
§ Exercise: Write a Python program to convert a Pandas Series to a Python list and display its type.
Try it !
§ Exercise: Write a Python program to convert a Python dictionary to a Pandas Series and display its type.
Access a Series
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# inspect the index labels and the underlying values
print(s.index)   # an Index object holding the labels 'a' through 'e'
print(s.values)  # [1 2 3 4 5]
Access a Series
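import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# retrieve the first element
print(s.iloc[0])   # 1 (by position; plain s[0] on a labeled Series is deprecated in recent pandas)
print(s['a'])      # 1 (by label)
print(s.get('a'))  # 1 (by label, without raising if the label is missing)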
Access a Series
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# retrieve the first three elements
print(s.iloc[:3])  # positional slice: the end position is excluded
print(s[:'c'])     # label slice: the end label is included
# a    1
# b    2
# c    3
# dtype: int64
Access a Series
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# retrieve the last three elements
print(s.iloc[-3:])  # by position
print(s['c':])      # by label
# c    3
# d    4
# e    5
# dtype: int64
Access a Series
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# retrieve multiple elements
print(s.iloc[[0, 2, 3]])   # by position
print(s[['a', 'c', 'd']])  # by label
Access a Series
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# access a label that does not exist
print(s['f'])
# KeyError: 'f'
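When a label might be missing, the KeyError can be avoided. The sketch below shows two options that exist on Series, a membership test and `get()` with a default; the variable names are made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

has_f = 'f' in s           # membership tests look at the index labels
missing = s.get('f')       # returns None instead of raising
fallback = s.get('f', 0)   # returns the supplied default instead

print(has_f, missing, fallback)  # False None 0
```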
Reindex
import pandas as pd

s = pd.Series([1, 2, 3], index=['c', 'b', 'a'])

# Assigning to .index relabels in place; the values keep their positions
s.index = ['a', 'b', 'c']
print(s)
# a    1
# b    2
# c    3
# dtype: int64
Reindex
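import pandas as pd

s = pd.Series([1, 2, 3], index=['c', 'b', 'a'])

# reindex() rearranges the values to follow the new label order;
# fill_value is used only for labels that are missing from the original index
s = s.reindex(['a', 'b', 'c'], fill_value=0)
print(s)
# a    3
# b    2
# c    1
# dtype: int64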
Reindex
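import pandas as pd

s = pd.Series([1, 2, 3])

# forward fill: the new labels 3..5 take the last preceding value (3)
ffill = s.reindex(range(6), method='ffill')
print(ffill)
# backward fill: no later value exists for labels 3..5, so they become NaN
bfill = s.reindex(range(6), method='bfill')
print(bfill)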
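To see what `reindex` does with a label that is not in the original index, here is a small sketch. The extra label 'd' and the names `with_nan` and `with_zero` are made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['c', 'b', 'a'])

# 'd' does not exist in s, so it needs a value from somewhere
with_nan = s.reindex(['a', 'b', 'c', 'd'])                 # default: NaN
with_zero = s.reindex(['a', 'b', 'c', 'd'], fill_value=0)  # explicit fill value

print(with_nan.tolist())   # [3.0, 2.0, 1.0, nan]
print(with_zero.tolist())  # [3, 2, 1, 0]
```

Note that introducing NaN converts the integer values to floats, while filling with 0 keeps them as integers.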
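Inserting or dropping data
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
# Series.append was removed in pandas 2.0; pd.concat is its replacement.
# As with the old append, a bare scalar cannot be added directly:
pd.concat([s, 6])
# raises TypeError: only Series and DataFrame objects can be concatenated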
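Inserting or dropping data
import pandas as pd

ds = pd.Series([1, 2, 3, 4, 5])
# wrap the new value in a Series first (pd.concat replaces the removed Series.append)
pd.concat([ds, pd.Series([6])])
# 0    1
# 1    2
# 2    3
# 3    4
# 4    5
# 0    6
# dtype: int64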
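Inserting or dropping data
# set_value was removed in pandas 1.0; assigning to a new label with .loc adds the value in place
ds.loc[max(ds.index) + 1] = 6
print(ds)
# 0    1
# 1    2
# 2    3
# 3    4
# 4    5
# 5    6
# dtype: int64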
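The slides here cover inserting; the sketch below adds the dropping side with `Series.drop`, plus a way to avoid a repeated index label when combining two Series. It is an illustration with made-up variable names, not taken from the original deck.

```python
import pandas as pd

ds = pd.Series([1, 2, 3, 4, 5])

# ignore_index=True renumbers the result 0..n-1 instead of keeping both indexes
appended = pd.concat([ds, pd.Series([6])], ignore_index=True)

# drop() returns a new Series without the given labels; ds itself is unchanged
dropped = ds.drop([0, 1])

print(appended.index.tolist())  # [0, 1, 2, 3, 4, 5]
print(dropped.tolist())         # [3, 4, 5]
print(len(ds))                  # 5
```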
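Arithmetic operations and data alignment
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([3, 2, 1])

# both Series share the index 0, 1, 2, so values are paired label by label
s1 + s2  # element-wise
s1 + 1   # the scalar is applied to every element
s1 - s2
s1 - 1
s1 * s2
s1 * 2
s1 / s2
s1 / 2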
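Arithmetic operations and data alignment
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([3, 2, 1], index=['d', 'e', 'f'])

# the two indexes share no labels, so every Series-with-Series result
# is NaN over the union index a..f; the scalar operations are unaffected
s1 + s2
s1 + 1
s1 - s2
s1 - 1
s1 * s2
s1 * 2
s1 / s2
s1 / 2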
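Alignment is easiest to see when the two indexes overlap only partly. In the sketch below (illustrative data), labels present in just one Series produce NaN with `+`, while the `add` method with `fill_value` treats the missing side as 0.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

plain = s1 + s2                    # 'a' and 'd' exist on one side only -> NaN
filled = s1.add(s2, fill_value=0)  # the missing side counts as 0

print(plain.index.tolist())   # ['a', 'b', 'c', 'd']
print(plain.tolist())         # [nan, 12.0, 23.0, nan]
print(filled.tolist())        # [1.0, 12.0, 23.0, 30.0]
```

The methods `sub`, `mul`, and `div` accept `fill_value` in the same way.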
Try it !
§ Exercise: Write a Python program to add, subtract, multiply and divide two given Pandas Series.
• Sample Input Given: [2, 4, 6, 8, 10], [1, 3, 5, 7, 9]
Try it !
§ Exercise: Write a Python program to add, subtract, multiply and divide two Pandas Series from user input.
• Sample Input: [2, 4, 6, 8, 10], [1, 3, 5, 7, 9]
Try it !
§ Exercise: Write a Python program to compare the elements of two Pandas Series.
• Sample Input: [2, 4, 6, 8, 10], [1, 3, 5, 7, 9]
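Sorting and ranking
import pandas as pd

s = pd.Series([1, 2, 3])

s.sort_index()                  # sort by index label, ascending
s.sort_index(ascending=False)   # sort by index label, descending
s.sort_values()                 # sort by value, ascending
s.sort_values(ascending=False)  # sort by value, descending
s.rank()                        # rank 1 goes to the smallest value
s.rank(ascending=False)         # rank 1 goes to the largest value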
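With distinct values, ranking looks the same as sorting, so the sketch below uses a Series with a tie (made-up data) to show how `rank` differs. By default tied values share the average of their ranks; `method='min'` gives them the lowest rank of the group instead.

```python
import pandas as pd

s = pd.Series([7, 3, 7, 1], index=['a', 'b', 'c', 'd'])

avg_rank = s.rank()              # ties share the average rank
min_rank = s.rank(method='min')  # ties share the lowest rank
by_value = s.sort_values(ascending=False)
by_label = s.sort_index(ascending=False)

print(avg_rank.tolist())        # [3.5, 2.0, 3.5, 1.0]
print(min_rank.tolist())        # [3.0, 2.0, 3.0, 1.0]
print(by_value.tolist())        # [7, 7, 3, 1]
print(by_label.index.tolist())  # ['d', 'c', 'b', 'a']
```

`rank` keeps the original order and returns positions in the sorted order, whereas `sort_values` and `sort_index` reorder the Series itself.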
Thanks for listening.
2017/08/03 (Thu.) Scientific Computing with Python – NumPy
Wei-Yuan [email protected]