Introduction to Pandas in Python


Transcript of Introduction to Pandas in Python


Pandas is an open-source library designed for working with relational or labeled data easily and intuitively. It provides various data structures and operations for manipulating numerical data and time series. The library is built on top of NumPy, and it offers speed, high performance and productivity for its users. Key features:

Fast and efficient manipulation and analysis of data.
Data can be loaded from different file objects.
Easy handling of missing data (represented as NaN) in both floating-point and non-floating-point data.
Size mutability: columns can be inserted into and deleted from DataFrames and higher-dimensional objects.
Data set merging and joining.
Flexible reshaping and pivoting of data sets.
Time-series functionality.
Powerful group-by functionality for performing split-apply-combine operations on data sets.

import pandas as pd


Here, pd is an alias for Pandas. It is not necessary to import the library under an alias; it simply means writing less code every time a method or property is called.

Pandas generally provides two data structures for manipulating data:

Series
DataFrame

Series:

A Pandas Series is a one-dimensional labelled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. A Pandas Series is essentially like a single column in an Excel sheet. Labels need not be unique but must be of a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index.
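As a quick illustration of the integer- and label-based indexing just mentioned, here is a minimal sketch (the labels and values are invented for the example):

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# label-based access
print(s['b'])      # 20
print(s.loc['c'])  # 30

# integer-position access
print(s.iloc[0])   # 10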


Creating a Series

In the real world, a Pandas Series is usually created by loading a dataset from existing storage; the storage can be a SQL database, a CSV file or an Excel file. A Pandas Series can also be created from lists, dictionaries, a scalar value, etc.

import pandas as pd
import numpy as np

# Creating an empty series
# (recent pandas versions default an empty Series to dtype object and may
# warn; older versions used float64, as in the output shown below)
ser = pd.Series()
print(ser)


# simple array
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print(ser)

The empty series prints as:

Series([], dtype: float64)

and the series built from the array prints as:

0    g
1    e
2    e
3    k
4    s
dtype: object
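The creation from a dictionary or from a scalar value mentioned above looks like this; a minimal sketch with invented values:

import pandas as pd

# from a dictionary: the keys become the index
ser_dict = pd.Series({'a': 1, 'b': 2, 'c': 3})
print(ser_dict)

# from a scalar: the value is repeated for every index label
ser_scalar = pd.Series(5, index=[0, 1, 2])
print(ser_scalar)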

DataFrame

A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns); that is, the data is aligned in a tabular fashion in rows and columns. A Pandas DataFrame consists of three principal components: the data, the rows and the columns.

Creating a DataFrame:

In the real world, a Pandas DataFrame is usually created by loading a dataset from existing storage; the storage can be a SQL database, a CSV file or an Excel file. A Pandas DataFrame can also be created from lists, dictionaries, a list of dictionaries, etc.


import pandas as pd

# Calling DataFrame constructor
df = pd.DataFrame()
print(df)

# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is',
       'portal', 'for', 'Geeks']

# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)
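The list-of-dictionaries case mentioned above is a one-liner as well; a minimal sketch with invented keys and values:

import pandas as pd

# each dictionary becomes one row; the keys become column names
records = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(records)
print(df)  # keys missing from a row are filled with NaN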

Why Pandas is used for Data Science

Pandas is generally used for data science, but have you wondered why? This is because pandas is used in conjunction with the other libraries common in data science. It is built on top of the NumPy library, which means that a lot of NumPy structures are used or replicated in Pandas. The data produced by Pandas is often used as input for plotting functions in Matplotlib, statistical analysis in SciPy, and machine learning algorithms in Scikit-learn.

A Pandas program can be run from any text editor, but it is recommended to use Jupyter Notebook, since Jupyter gives the ability to execute the code in a particular cell rather than executing the entire file. Jupyter also provides an easy way to visualize pandas data frames and plots.

Read Data

In the following examples, the data frame used contains data on some NBA players. (The original document shows an image of the data frame before any operations here.)


In this example, the top 5 rows of the data frame are returned and stored in a new variable. No parameter is passed to the .head() method, since its default is 5.

# importing pandas module
import pandas as pd

# making data frame
data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")

# calling head() method
# storing in new variable
data_top = data.head()

# display
data_top


In this example, the .head() method is called on a series with a custom value for the n parameter to return the top 9 rows of the series.

# importing pandas module
import pandas as pd

# making data frame
data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")

# number of rows to return
n = 9

# creating series
series = data["Name"]

# returning top n rows
top = series.head(n=n)

# display
top

In this example, the bottom 5 rows of the data frame are returned and stored in a new variable. No parameter is passed to the .tail() method, since its default is 5.

# importing pandas module
import pandas as pd

# making data frame
data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")

# calling tail() method
# storing in new variable
data_bottom = data.tail()

# display
data_bottom


In this example, the .tail() method is called on a series with a custom value for the n parameter to return the bottom 12 rows of the series.

# importing pandas module
import pandas as pd

# making data frame
data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")

# number of rows to return
n = 12

# creating series
series = data["Salary"]

# returning bottom n rows
bottom = series.tail(n=n)

# display
bottom


In this example, the data frame is described; ['object'] is included in the include parameter to see the description of object series, and [.20, .40, .60, .80] is passed to the percentiles parameter to view the respective percentiles of the numeric series.

# importing pandas module
import pandas as pd

# making data frame
data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")

# removing null values to avoid errors
data.dropna(inplace=True)

# percentile list
perc = [.20, .40, .60, .80]

# list of dtypes to include
include = ['object', 'float', 'int']

# calling describe method
desc = data.describe(percentiles=perc, include=include)

# display
desc

In this example, the describe method is called on the Name column to see its behaviour with an object data type.

# importing pandas module
import pandas as pd

# making data frame
data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")

# removing null values to avoid errors
data.dropna(inplace=True)

# calling describe method
desc = data["Name"].describe()

# display
desc

Dealing with Columns

In order to deal with columns, we perform basic operations on columns like selecting, deleting, adding and renaming (deleting and renaming are sketched after the two examples below).

To select a column in a Pandas DataFrame, we can access the columns by calling them by their column names.

# Import pandas package
import pandas as pd

# Define a dictionary containing employee data
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age': [27, 24, 22, 32],
        'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification': ['Msc', 'MA', 'MCA', 'Phd']}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data)

# select two columns
print(df[['Name', 'Qualification']])

To add a column to a Pandas DataFrame, we can declare a new list as a column and add it to an existing DataFrame.

# Import pandas package
import pandas as pd

# Define a dictionary containing Students data
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data)

# Declare a list that is to be converted into a column
address = ['Delhi', 'Bangalore', 'Chennai', 'Patna']

# Using 'Address' as the column name
# and equating it to the list
df['Address'] = address

# Observe the result
print(df)
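Deleting and renaming columns, also mentioned above, are one-liners with the standard drop and rename methods; a minimal sketch continuing from the df just built:

# delete a column (axis=1 selects columns)
df = df.drop('Height', axis=1)

# rename a column via a mapping
df = df.rename(columns={'Qualification': 'Degree'})
print(df)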

Dealing with Rows:

In order to deal with rows, we can perform basic operations on rows like selecting, deleting, adding and renaming (deleting rows is sketched after the concat example below).

Pandas provides a unique method to retrieve rows from a data frame: the DataFrame.loc[] method is used to retrieve rows from a Pandas DataFrame by label. Rows can also be selected by passing an integer location to the iloc[] indexer, as shown after the loc example.


# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")

# retrieving rows by loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]

print(first, "\n\n\n", second)
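And the integer-location counterpart mentioned above; a minimal sketch with iloc on the same data frame:

# retrieving rows by integer position
row0 = data.iloc[0]    # first row
rows = data.iloc[1:4]  # second through fourth rows
print(row0, "\n\n", rows)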

To add a row to a Pandas DataFrame, we can concatenate the old data frame with a new one.

# importing pandas module
import pandas as pd

# making data frame
# (read without index_col so that 'Name' stays a column
# and lines up with the new row below)
df = pd.read_csv("nba.csv")
df.head(10)

new_row = pd.DataFrame({'Name': 'Geeks', 'Team': 'Boston', 'Number': 3,
                        'Position': 'PG', 'Age': 33, 'Height': '6-2',
                        'Weight': 189, 'College': 'MIT', 'Salary': 99999},
                       index=[0])

# simply concatenate both dataframes
df = pd.concat([new_row, df]).reset_index(drop=True)
df.head(5)
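Deleting rows, also mentioned above, is a drop() call on index labels; a minimal sketch that removes the row just prepended (index label 0 after reset_index):

# drop a row by its index label
df = df.drop(0)
print(df.head())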

Python Pandas - Sorting

There are two kinds of sorting available in Pandas. They are −

By label
By actual value

Let us consider an example with an output.

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10, 2),
                           index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
                           columns=['col2', 'col1'])
print(unsorted_df)

Using the sort_index() method, the DataFrame can be sorted by passing the axis argument and the order of sorting. By default, sorting is done on row labels in ascending order.

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10, 2),
                           index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
                           columns=['col2', 'col1'])

sorted_df = unsorted_df.sort_index()
print(sorted_df)


By passing a Boolean value to the ascending parameter, the order of the sorting can be controlled. Let us consider the following example to understand the same.

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10, 2),
                           index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
                           columns=['col2', 'col1'])

sorted_df = unsorted_df.sort_index(ascending=False)
print(sorted_df)

By passing the axis argument a value of 0 or 1, the sorting can be done on the row or the column labels. By default axis=0, i.e., sort by row labels; axis=1 sorts the column labels, as in the following example.

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10, 2),
                           index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
                           columns=['col2', 'col1'])

sorted_df = unsorted_df.sort_index(axis=1)
print(sorted_df)

Like index sorting, sort_values() is the method for sorting by values. It accepts a 'by' argument, which takes the name of the column of the DataFrame by which the values are to be sorted.

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})

sorted_df = unsorted_df.sort_values(by='col1')
print(sorted_df)


Observe that the col1 values are sorted, and the respective col2 values and the row index move along with col1; thus they look unsorted.

The 'by' argument also takes a list of column names.

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})

sorted_df = unsorted_df.sort_values(by=['col1', 'col2'])
print(sorted_df)

sort_values() also provides a provision to choose the algorithm from mergesort, heapsort and quicksort; mergesort is the only stable one of the three.

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})

sorted_df = unsorted_df.sort_values(by='col1', kind='mergesort')
print(sorted_df)

Python Pandas - Aggregations

Applying Aggregations on DataFrame

Let us create a DataFrame and apply aggregations on it.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2000', periods=10),
                  columns=['A', 'B', 'C', 'D'])
print(df)

r = df.rolling(window=3, min_periods=1)
print(r)  # prints the Rolling object; aggregations are applied to it below


We can aggregate by passing a function to the entire DataFrame, or select a single column via the standard get-item method.

Apply Aggregation on a Whole Dataframe

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2000', periods=10),
                  columns=['A', 'B', 'C', 'D'])
print(df)

r = df.rolling(window=3, min_periods=1)
print(r.aggregate(np.sum))

Apply Aggregation on a Single Column of a Dataframe

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2000', periods=10),
                  columns=['A', 'B', 'C', 'D'])
print(df)

r = df.rolling(window=3, min_periods=1)
print(r['A'].aggregate(np.sum))

Apply Aggregation on Multiple Columns of a DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2000', periods=10),
                  columns=['A', 'B', 'C', 'D'])
print(df)

r = df.rolling(window=3, min_periods=1)
print(r[['A', 'B']].aggregate(np.sum))

Apply Multiple Functions on a Single Column of a DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2000', periods=10),
                  columns=['A', 'B', 'C', 'D'])
print(df)

r = df.rolling(window=3, min_periods=1)
print(r['A'].aggregate([np.sum, np.mean]))

Apply Multiple Functions on Multiple Columns of a DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2000', periods=10),
                  columns=['A', 'B', 'C', 'D'])
print(df)

r = df.rolling(window=3, min_periods=1)
print(r[['A', 'B']].aggregate([np.sum, np.mean]))

Apply Different Functions to Different Columns of a Dataframe

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 4),
                  index=pd.date_range('1/1/2000', periods=3),
                  columns=['A', 'B', 'C', 'D'])
print(df)

r = df.rolling(window=3, min_periods=1)
print(r.aggregate({'A': np.sum, 'B': np.mean}))

Python Pandas - Missing Data

When and Why Is Data Missing?

Consider an online survey for a product. Many times, people do not share all the information related to them. A few share their experience but not how long they have been using the product; a few share how long they have been using the product and their experience, but not their contact information. Thus, in one way or another, a part of the data is almost always missing, and this is very common in real applications.

Let us now see how we can handle missing values (say NA or NaN) using Pandas.

# import the pandas library
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)

Using reindexing, we have created a DataFrame with missing values. In the output, NaN means Not a Number.

Check for Missing Values

To make detecting missing values easier (and across different array dtypes), Pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects −

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df['one'].isnull())


import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df['one'].notnull())
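A common follow-up, not in the original text but worth knowing, is to count the missing values per column by chaining isnull() with sum():

# count the NaN values in each column
print(df.isnull().sum())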

Calculations with Missing Data

When summing data, NA is treated as zero. If the data are all NA, the sum is 0 by default in current pandas versions (older versions returned NA; pass min_count=1 to sum() to get NA back).

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df['one'].sum())

import pandas as pd
import numpy as np

df = pd.DataFrame(index=[0, 1, 2, 3, 4, 5], columns=['one', 'two'])
print(df['one'].sum())

Cleaning / Filling Missing Data

Pandas provides various methods for cleaning the missing values. The fillna function can "fill in" NA values with non-null data in a couple of ways, which we have illustrated in the following sections.


Replace NaN with a Scalar Value

The following program shows how you can replace "NaN" with "0".

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c'])
print(df)

print("NaN replaced with '0':")
print(df.fillna(0))

Here we are filling with the value zero; instead we can also fill with any other value, including a different value per column, as sketched below.
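For instance, fillna also accepts a dict mapping column names to fill values; a minimal sketch continuing from the df above:

# fill each column with its own value
print(df.fillna({'one': 0, 'two': df['two'].mean(), 'three': -1}))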

Fill NA Forward and Backward

Using the concepts of filling discussed in the ReIndexing chapter, we will fill the missing values.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# forward fill; equivalent to df.ffill(), which recent pandas versions prefer
print(df.fillna(method='pad'))

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# backward fill; equivalent to df.bfill()
print(df.fillna(method='backfill'))

Drop Missing Values

If you want to simply exclude the missing values, use the dropna function along with the axis argument. By default axis=0, i.e., along rows, which means that if any value within a row is NA then the whole row is excluded.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna())

Replace Missing (or) Generic Values

Many times, we have to replace a generic value with some specific value. We can achieve this by applying the replace method.

Replacing NA with a scalar value is equivalent behaviour to the fillna() function.

import pandas as pd
import numpy as np

df = pd.DataFrame({'one': [10, 20, 30, 40, 50, 2000],
                   'two': [1000, 0, 30, 40, 50, 60]})

print(df.replace({1000: 10, 2000: 60}))

Python Pandas - GroupBy

Any groupby operation involves one of the following operations on the original object. They are −

Splitting the object
Applying a function
Combining the results

In many situations, we split the data into sets and apply some functionality to each subset. With the apply functionality, we can perform the following operations −

Aggregation − computing a summary statistic
Transformation − performing some group-specific operation
Filtration − discarding the data under some condition (a sketch appears after the Transformations section below)

Let us now create a DataFrame object and perform all of these operations on it −

# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}

df = pd.DataFrame(ipl_data)
print(df)

Split Data into Groups

A Pandas object can be split into groups in multiple ways, such as −

obj.groupby('key')
obj.groupby(['key1','key2'])
obj.groupby(key, axis=1)

Let us now see how grouping can be applied to the DataFrame object.

# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}

df = pd.DataFrame(ipl_data)
print(df.groupby('Team'))

View Groups

# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}

df = pd.DataFrame(ipl_data)
print(df.groupby('Team').groups)


Group by with multiple columns −

# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}

df = pd.DataFrame(ipl_data)
print(df.groupby(['Team', 'Year']).groups)

Iterating through Groups

With the groupby object in hand, we can iterate through it, much like with itertools, yielding (name, group) pairs.

# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}

df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')
for name, group in grouped:
    print(name)
    print(group)

By default, the groupby object has the same label name as the group name (here, the Year value).

Select a Group

Using the get_group() method, we can select a single group.

# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}

df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')
print(grouped.get_group(2014))

Aggregations

An aggregation function returns a single aggregated value for each group. Once the group-by object is created, several aggregation operations can be performed on the grouped data.

An obvious one is aggregation via the aggregate or the equivalent agg method −

# import the pandas library
import pandas as pd
import numpy as np

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}

df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')
print(grouped['Points'].agg(np.mean))

Another way to see the size of each group is by applying the size function (here via agg) −

import pandas as pd
import numpy as np

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}

df = pd.DataFrame(ipl_data)

grouped = df.groupby('Team')
print(grouped.agg(np.size))

Applying Multiple Aggregation Functions at Once

With a grouped Series, you can also pass a list or a dict of functions to aggregate with, generating a DataFrame as output −

# import the pandas library
import pandas as pd
import numpy as np

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}

df = pd.DataFrame(ipl_data)

grouped = df.groupby('Team')
print(grouped['Points'].agg([np.sum, np.mean, np.std]))

Transformations

A transformation on a group or a column returns an object that is indexed the same (same size) as the object being grouped. Thus, the transform function should return a result that is the same size as that of the group chunk.

# import the pandas library
import pandas as pd
import numpy as np

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}

df = pd.DataFrame(ipl_data)

grouped = df.groupby('Team')
score = lambda x: (x - x.mean()) / x.std() * 10
print(grouped.transform(score))
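Filtration, the third operation listed at the start of this chapter, has no example above; here is a minimal sketch using the filter() method on the same df (the threshold of 3 is arbitrary):

# keep only the teams that appear at least 3 times in the data
print(df.groupby('Team').filter(lambda x: len(x) >= 3))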

Python Pandas - Merging/Joining

Pandas has full-featured, high-performance, in-memory join operations idiomatically very similar to those of relational databases like SQL.

Pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects −

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True)

Here, we have used the following parameters −

left − A DataFrame object.
right − Another DataFrame object.
on − Columns (names) to join on. Must be found in both the left and right DataFrame objects.
left_on − Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
right_on − Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
left_index − If True, use the index (row labels) from the left DataFrame as its join key(s). In the case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame.
right_index − Same usage as left_index for the right DataFrame.
how − One of 'left', 'right', 'outer', 'inner'. Defaults to 'inner'. Each method is described below.
sort − Sort the result DataFrame by the join keys in lexicographical order. Defaults to True; setting it to False will substantially improve performance in many cases.


Let us now create two different DataFrames and perform merging operations on them.

# import the pandas library
import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})

right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

print(left)
print(right)

Merge Two DataFrames on a Key

import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})

right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

print(pd.merge(left, right, on='id'))

Merge Two DataFrames on Multiple Keys

import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})

right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

print(pd.merge(left, right, on=['id', 'subject_id']))

Merge Using 'how' Argument

The how argument to merge specifies how to determine which keys are to be included in the resulting table. If a key combination does not appear in either the left or the right table, the values in the joined table will be NA.

Here is a summary of the how options and their SQL equivalent names:

Merge Method    SQL Equivalent      Description
left            LEFT OUTER JOIN     Use keys from left object
right           RIGHT OUTER JOIN    Use keys from right object
outer           FULL OUTER JOIN     Use union of keys
inner           INNER JOIN          Use intersection of keys

Left Join

import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})

right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

print(pd.merge(left, right, on='subject_id', how='left'))

Its output is as follows −

   Name_x  id_x subject_id Name_y  id_y
0    Alex     1       sub1    NaN   NaN
1     Amy     2       sub2  Billy   1.0
2   Allen     3       sub4  Brian   2.0
3   Alice     4       sub6  Bryce   4.0
4  Ayoung     5       sub5  Betty   5.0

Right Join

import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})

right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

print(pd.merge(left, right, on='subject_id', how='right'))

Its output is as follows −

   Name_x  id_x subject_id Name_y  id_y
0     Amy   2.0       sub2  Billy     1
1   Allen   3.0       sub4  Brian     2
2   Alice   4.0       sub6  Bryce     4
3  Ayoung   5.0       sub5  Betty     5
4     NaN   NaN       sub3   Bran     3

Outer Join

import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})

right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

print(pd.merge(left, right, how='outer', on='subject_id'))

Its output is as follows −

   Name_x  id_x subject_id Name_y  id_y
0    Alex   1.0       sub1    NaN   NaN
1     Amy   2.0       sub2  Billy   1.0
2   Allen   3.0       sub4  Brian   2.0
3   Alice   4.0       sub6  Bryce   4.0
4  Ayoung   5.0       sub5  Betty   5.0
5     NaN   NaN       sub3   Bran   3.0

Inner Join

A related note: unlike merge, the DataFrame.join method joins on the index, and the join operation honors the object on which it is called, so a.join(b) is not equal to b.join(a) (a sketch of join follows the example below). An inner merge keeps only the intersection of the keys:

import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})

right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

print(pd.merge(left, right, on='subject_id', how='inner'))

Its output is as follows −

   Name_x  id_x subject_id Name_y  id_y
0     Amy     2       sub2  Billy     1
1   Allen     3       sub4  Brian     2
2   Alice     4       sub6  Bryce     4
3  Ayoung     5       sub5  Betty     5
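As a minimal sketch of the index-based join just mentioned (the frames are trimmed to two rows, and suffixes are required because the column names overlap):

import pandas as pd

left = pd.DataFrame({'Name': ['Alex', 'Amy'], 'subject_id': ['sub1', 'sub2']})
right = pd.DataFrame({'Name': ['Billy', 'Brian'], 'subject_id': ['sub2', 'sub4']})

# joins on the (default RangeIndex) index; note a.join(b) != b.join(a)
print(left.join(right, lsuffix='_x', rsuffix='_y'))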

Demo

1 - Project Overview

1.1 - Project Introduction

Chipotle's full name is Chipotle Mexican Grill, i.e., Chipotle Mexican-style grilled food; "chipotle" originally refers to a type of Mexican dried chili pepper. The fast-food chain was founded in 1993 in Denver, USA, by a young man named Steve Ells. Its products are mainly Mexican-style food, with ingredients that are mostly fresh beef, chicken, pork, assorted vegetables, various beans, rice and so on.


Chipotle is a leader in the American fast-casual dining industry, ranking 54th on Fortune magazine's 2011 list of the 100 fastest-growing companies, and it has maintained a remarkable growth rate. In the fiscal year ending June 30, 2011, Chipotle's revenue exceeded 2 billion USD, up 23.5% over the previous year. Since 2006 the company's revenue has nearly tripled, while the number of its restaurants has doubled. In the first half of 2011, sales at Chipotle restaurants open at least one year grew by 11%. Chipotle's restaurant profit margin has stayed between 25% and 26%, which is outstanding within the fast-food industry.

As the company grows, data-driven operations are increasingly needed to support the business. The company therefore wants its data analysts to analyze the overall operating data and provide a basis for sales decisions.

In this project we will use Python's data analysis module, pandas, to perform exploratory analysis of the Chipotle data.


1.2 - Teaching Goals

This project mainly aims to make you proficient with the following parts of the Pandas data analysis toolkit:

An introduction to the Pandas data analysis toolkit
Installing the Pandas environment and data analysis tools
Pandas data analysis modules
Pandas data analysis functions, etc.
The basic data analysis workflow: the basic process and methods of data analysis

1.3 - Prerequisites

To get the most out of this project, you should ideally have the following skills:

Python programming basics
Basic knowledge of statistics
Setting up a Python environment
The basics of data analysis
Basic Pandas operations

1.4 - Study Schedule

Total time    Task 1    Task 2    Task 3    Task 4    Task 5
3h            1h        0.5h      0.5h      0.5h      0.5h

1.5 - Supporting Materials

Official Pandas documentation: https://pandas.pydata.org
Python basics: https://www.runoob.com/python3/python3-tutorial.html
Experiment data download: https://pan.baidu.com/s/1BItW8kYNU4xp2cw8rF5ojA (password: fvtn)
Reference for installing the Python and Pandas development environment:
https://blog.csdn.net/weixin_42526141/article/details/84141157

2 - Project Breakdown

2.1 - Understanding the Project

Data analysis is the process of studying and summarizing collected data in detail with appropriate statistical methods, extracting useful information, and forming conclusions; it is also a supporting process of a quality management system. In practice, data analysis helps people make judgments so that appropriate actions can be taken.

The Chipotle fast-food data is an entry-level dataset for data analysis, containing 5 features and 4,622 samples. The data is simple and easy to understand, while still being rich enough to practice the key Pandas functions and various descriptive analysis methods.

To understand how the Chipotle business is doing, we can use the functions in Pandas to analyze the data from multiple angles.

2.2 - Skill Requirements

Proficiency with Pandas
Python programming basics
Basic knowledge of statistics

2.3 - Project Breakdown

This project is divided into 5 tasks:

Task 1: Install and prepare the development environment
Set up the Python data analysis environment and prepare the experiment data, laying the foundation for the analysis tasks that follow.

Task 2: Import the data and understand its fields
Use the Pandas data-loading API to load the restaurant data for analysis. The first step of data analysis is usually loading the data, and data sources come in many different formats, e.g. csv and txt, so learning how to load data is very important for us.

Task 3: The most frequently ordered item
Use the Pandas data analysis API, groupby etc., to analyze the most frequently ordered item.

Task 4: Sales analysis of the different items
Use the Pandas data analysis API, groupby etc., to analyze the sales of the different items.

Task 5: Total revenue and total sales analysis
Use the Pandas data analysis API, value_counts, groupby etc., to analyze the total revenue and total sales.

3 - Task Details

3.1 - Installing and Preparing the Development Environment, Based on Anaconda

Pandas is a tool built on NumPy, created to support data analysis tasks. Pandas incorporates a large number of libraries and several standard data models, providing the tools needed to operate on large datasets efficiently, along with a large set of functions and methods that let us process data quickly and conveniently. You will soon find that it is one of the key factors that make Python a powerful and efficient data analysis environment.

In this task, we use Python's Anaconda distribution to set up the data analysis environment.

Task requirements:

1. Set up a data analysis environment based on the latest version of Anaconda
2. Set up the Jupyter and Spyder development environments


3. Download the practice data

Task hints:

Anaconda is an open-source Python distribution that includes conda, Python, and more than 180 scientific packages and their dependencies. Because it contains so many packages, the Anaconda download is fairly large (about 531 MB); if you only need certain packages, or need to save bandwidth or storage space, you can use Miniconda instead, a smaller distribution containing only conda and Python.

Installation documentation:
https://baijiahao.baidu.com/s?id=1616120886763657106&wfr=spider&for=pc

Anaconda download link: https://www.anaconda.com/distribution. Download the 3.7 version matching your computer's operating system.


(The original document shows a screenshot of the environment after a successful setup here.)

Development environment setup:

We can use either Jupyter or Spyder; below we use Jupyter as the example. (The original document shows a screenshot of the Jupyter interface here.)


Jupyter is an excellent Python data analysis development tool with a clean, comfortable programming interface; for detailed usage documentation you can refer to
http://baijiahao.baidu.com/s?id=1601883438842526311&wfr=spider&for=pc

Test data download

Next, we need to download the test data; the link is given below.


Experiment data download:
https://pan.baidu.com/s/1BItW8kYNU4xp2cw8rF5ojA (password: fvtn)

The data is stored on Baidu Netdisk and can be downloaded directly. The fields used in this project are described as follows:

Field name            Meaning
item_name             product name
quantity              quantity sold
choice_description    description of the choice made
order_id              order ID
item_price            product price

3.2 - Importing the Data and Understanding Its Fields

Having finished the environment setup and data download exercise in 3.1, we now begin the actual data analysis work. Data analysis is part of operations and plays a very key role: through it we can make correct judgments and take appropriate actions. Data analysis is the process by which an organization purposefully collects data, analyzes it, and turns it into information.

In any data analysis process, most of the attention should go to the data itself; reading the data in is simple, yet it is the most important step. In this first task our main job is to read in the dataset and get an overall picture of the information it contains.

In this task, we will load the dataset and do some basic exploration of it.

Task requirements:


1. Read the Chipotle dataset and name it chipo
2. View the first 100 rows of the data
3. View the dataset information
4. View the columns of the dataset

Task hints:

The first step is to import the Pandas package. Pandas is a data analysis package built on NumPy, created mainly for data analysis; it provides a large number of high-level data structures and methods for processing data.

import pandas as pd

The next step is to import the dataset as a DataFrame named chipo. A DataFrame is a data structure in Python's Pandas library: a two-dimensional table, much like an Excel sheet. It may look a bit like a MATLAB matrix, but a MATLAB matrix can hold only numeric values (unless cells are used to store mixed types), whereas DataFrame cells can hold numbers, strings and so on, just as in Excel. A DataFrame also lets you set column names (columns) and row labels (index), so data can be located by position, as in MATLAB, or by column and row name.

We use Pandas' read_csv() to read the data. The code is shown below, where path is the path where the dataset is stored and the sep parameter specifies the delimiter; if the parameter is not given, a comma is assumed. The data file in this project is a TSV file, and the difference between TSV and CSV files is that the former uses \t as the delimiter while the latter uses a comma, so sep is set to '\t'.

path = "/Users/用户数据.tsv"

# read the tab-separated data; the key function is read_csv
chipo = pd.read_csv(path, sep='\t')

With the dataset imported as chipo, after loading we generally check whether the data has any problems and get a rough sense of its contents. Since the full dataset may be large, viewing part of it is enough, so we call head() to view the first 100 rows:

# the key function is head
print(chipo.head(100))

Next, to get a deeper understanding of the dataset, we call info() to view the dataset information:

# the key function is info
chipo.info()

The result contains the index range, the column information (column name, count of non-null entries per column, dtype of each column), the data types and the memory usage. (The original document shows the info() output as an image here.)

When we only want to see which columns the data contains, we use the columns attribute:

# the key attribute is columns
print(chipo.columns)


3.3 - Analysis of the Most Frequently Ordered Item

After the exercise in task 3.2, we have imported the data and gained some familiarity with its contents. From this task onward we begin the analysis work.

First, with the item as the unit of analysis, we look for the most frequently ordered items. Generally, the most-ordered items are the flagship products and account for a large share of revenue, so analyzing them is important.

In this task, we will analyze the most frequently ordered item.

Task requirements:

1. Extract the two columns: item name and order quantity
2. Group by item name, and compute the sum and the mean of the order quantity within each group
3. Sort the grouped data by the summed order quantity

Task hints:

The first step is to take the two columns we need, item name and order quantity, out of the DataFrame chipo and create a new DataFrame c. When selecting by column, the method for one column is a = df['column'], and for several columns it is b = df[['column1', 'column2', 'column3']], so the code is:

c = chipo[['item_name', 'quantity']]


Next, group by item name and compute the sum and the mean of the order quantity within each group. First call groupby() to group c by item name. Its as_index parameter (bool, default True) returns the aggregated output with the group labels as the index (relevant only for DataFrame input); as_index=False gives an 'SQL-style' grouped output. Pandas also provides the agg function for column-based aggregation, whereas groupby can be seen as a row-based (index-based) operation. Intuitively, grouping works row by row: each record is examined and assigned to its group (say, splitting a class roster into groups of people above 180 cm, between 160 and 180 cm, and below 160 cm), while computing each group's total or average works down a column: the values are added from top to bottom and divided by the group size. To make such column computations convenient, we call agg() on the grouped data, which by default aggregates the remaining columns:

# key functions: groupby, agg
c1 = c.groupby(['item_name'], as_index=False).agg({'quantity': [sum, 'mean']})

Finally, sort the grouped data by the summed order quantity. Here we call sort_values(). Pandas' sort_values() works like ORDER BY in SQL: it sorts the dataset by the data in some field, using either a specified column or a specified row. The code is shown below; the ascending parameter controls whether the specified column is sorted in ascending order (default True), and inplace controls whether the sorted data replaces the original (default False, i.e., no replacement):

# the key function is sort_values
c1.sort_values([('quantity', 'sum')], ascending=False, inplace=True)

3.4 - Sales Analysis of the Different Items

Having completed the exercise in 3.3, we can see from how many items were ordered whether any items went unsold, which helps adjust what is offered. Next, we can analyze the sales of the different items to find the factors that affect sales volume.

In this task, we will analyze the sales of the different items.

Task requirements:

1. Analyze how many kinds of items were ordered
2. Analyze the sales of the different items
3. Analyze how many items were ordered in total

Task hints:

The first step is to see how many kinds of items were ordered, i.e., to count the distinct item names. We use the nunique() function, which returns the count of unique values.

# the key function is nunique
print(chipo['item_name'].nunique())

Next we analyze the sales of the different items, i.e., count the frequency of the distinct 'choice_description' values. We do this with value_counts(), a quick way to see how many distinct values a column contains and how often each of those values occurs in the column.

# the key function is value_counts
chipo['choice_description'].value_counts().head(10)

Then we analyze how many items were ordered in total, i.e., sum the 'quantity' field:

# the key function is sum
total_items_orders = chipo.quantity.sum()

3.5 - Total Revenue and Total Order Count Analysis

Finally, we can aggregate the revenue and the number of orders to analyze the overall performance of the store, and compute the mean total per order to gauge the store's spending level.

In this task, we will analyze the total revenue and the total number of orders.

Task requirements:

1. For the period covered by this dataset, what is the total revenue?
2. For the period covered by this dataset, how many orders are there in total?
3. What is the average total price per order?

Task hints:

First analyze the total revenue, i.e., the sum of item_price. Before summing item_price, we know from task 3.2 that its data type is object, so it must be converted to a numeric type first. Each value in chipo['item_price'] is a string ('$' followed by a number), so calling .astype(float) directly would fail, because such a string cannot be converted to a float. Instead, we can process each value with a callable and pass that function to apply:

# strip the leading '$' (and the trailing character) and convert to float
dollarizer = lambda x: float(x[1:-1])
chipo['item_price'] = chipo['item_price'].apply(dollarizer)

After this step we can sum the item_price field with sum():

# the key function is sum
chipo.item_price.sum()

Next analyze the total number of orders. We first call value_counts() to see how many distinct order IDs there are and how often each repeats, then call count() on that result to count them:

# key functions: value_counts().count()
chipo.order_id.value_counts().count()

Finally, we compute the average total price per order. First call groupby() to group the data by order ID and sum each group with sum(); then take the mean of the item_price field of the grouped data:

# key functions: groupby, sum, mean
# numeric_only=True keeps only the numeric columns (needed in recent pandas,
# where summing the string columns would get in the way of mean())
order_grouped = chipo.groupby(by=['order_id']).sum(numeric_only=True)
order_grouped.mean()['item_price']