Session 05 cleaning and exploring
Uploaded by: bodaceacat
Category: Data & Analytics
![Page 1: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/1.jpg)
Cleaning and Exploring Data
Datascience session 5
![Page 2: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/2.jpg)
Lab 5: your 5-7 things
Data cleaning
Basic data cleaning with Python
Using OpenRefine
Exploring Data
The Pandas library
The Seaborn library
The R language
![Page 3: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/3.jpg)
Data Cleaning
![Page 4: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/4.jpg)
Algorithms want their data to be:
Machine-readable
Consistent format (e.g. text is all lowercase)
Consistent labels (e.g. use M/F, Male/Female, 0/1/2, but not *all* of these)
No whitespace hiding in number or text cells
No junk characters
No strange outliers (e.g. 200 year old living people)
In vectors and matrices
Normalised
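A minimal sketch of a few of these requirements (consistent labels, no hidden whitespace, no strange outliers), using plain Python. The records, the field names, the M/F coding and the age limit of 120 are all assumptions made up for illustration.

```python
# Hypothetical raw records; field names and values are made up for illustration.
raw_rows = [
    {"name": "  Alice ", "sex": "F", "age": "34"},
    {"name": "BOB", "sex": "male", "age": " 200 "},
    {"name": "carol", "sex": "Female", "age": "29"},
]

# Map every label variant onto one consistent scheme (an assumed M/F coding).
SEX_LABELS = {"m": "M", "male": "M", "f": "F", "female": "F"}


def clean_row(row, max_age=120):
    """Return a cleaned copy of one record; max_age is an assumed plausibility limit."""
    age = int(row["age"].strip())  # strip hidden whitespace before converting
    return {
        "name": row["name"].strip().lower(),            # consistent format
        "sex": SEX_LABELS[row["sex"].strip().lower()],  # consistent labels
        "age": age if 0 <= age <= max_age else None,    # implausible ages become missing
    }


clean_rows = [clean_row(r) for r in raw_rows]
```

Marking the 200-year-old as missing rather than deleting the row is one possible choice; whether to drop, correct or keep an outlier depends on the dataset.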
![Page 5: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/5.jpg)
Cleaning with Python
![Page 6: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/6.jpg)
Cleaning Strings
Removing capitals and whitespace:
mystring = " CApiTalIsaTion Sucks "
mystring.lower().strip()
original text is - CApiTalIsaTion Sucks -
lowercased text is - capitalisation sucks -
Text without whitespace is -capitalisation sucks-
![Page 7: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/7.jpg)
Regular Expressions: repeated spaces
There’s a repeated space in capitalisation sucks
import re
re.sub(r'\s', '.', 'this is  a string')
'this.is..a.string'
re.sub(r'\s+', '.', 'this is  a string')
'this.is.a.string'
![Page 8: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/8.jpg)
Regular Expressions: junk
import re
string1 = "This is a! sentence&& with junk!@"
cleanstring1 = re.sub(r'[^\w ]', '', string1)
This is a sentence with junk
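The string-cleaning steps from the last few slides can be combined into one helper. This is a sketch rather than a general-purpose cleaner: the function name `clean_text` is hypothetical, and note that `\w` keeps digits, underscores and non-ASCII letters.

```python
import re


def clean_text(text):
    """Remove junk characters, collapse repeated spaces, lowercase and trim."""
    text = re.sub(r"[^\w ]", "", text)  # drop anything that is not a word character or a space
    text = re.sub(r"\s+", " ", text)    # collapse runs of whitespace into one space
    return text.lower().strip()         # consistent case, no leading/trailing whitespace


cleaned = clean_text("  This is a!  sentence&& with junk!@ ")
```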
![Page 9: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/9.jpg)
Converting Date/Times
European vs American? Name of month vs number? Python comes with a bunch of date reformatting libraries that can convert between these. For example:
import datetime
date_string = "14/03/48"
# Parse as European day/month/year, then write out as American month/day/year
datetime.datetime.strptime(date_string, '%d/%m/%y').strftime('%m/%d/%Y')
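When a column mixes several date layouts, one approach is to try a list of candidate formats in turn. The sketch below assumes three formats and a fixed order; the list `FORMATS` and the function `to_iso` are hypothetical names, and the order decides how ambiguous strings such as "03/04/48" are read.

```python
import datetime

# Candidate input layouts, tried in order (an assumption, adjust to your data).
FORMATS = ["%d/%m/%y", "%m/%d/%y", "%d %B %Y"]


def to_iso(date_string):
    """Return the date as YYYY-MM-DD, or None if no candidate format matches."""
    for fmt in FORMATS:
        try:
            parsed = datetime.datetime.strptime(date_string.strip(), fmt)
        except ValueError:
            continue  # this layout did not fit, try the next one
        return parsed.date().isoformat()
    return None
```

Two-digit years are a further trap: Python's `%y` reads 00-68 as 2000-2068 and 69-99 as 1969-1999, so "48" becomes 2048, which may not be what a historical dataset means.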
![Page 10: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/10.jpg)
Cleaning with Open Refine
![Page 11: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/11.jpg)
Our input file
![Page 12: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/12.jpg)
Getting started
![Page 13: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/13.jpg)
Inputting data
![Page 14: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/14.jpg)
Cleaning up the import
![Page 15: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/15.jpg)
The imported data
![Page 16: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/16.jpg)
Cleaning up columns
![Page 17: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/17.jpg)
Facets
![Page 18: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/18.jpg)
Exploring Data
![Page 19: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/19.jpg)
Exploring Data
Eyeball your data
Plot your data - visually look for trends and outliers
Get the basic statistics (mean, sd etc) of your data
Create pivot tables to help understand how columns interact
Do more cleaning if you need to (e.g. those outliers)
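A rough sketch of that loop with Pandas (plotting left out), on a made-up table. The column names and the age limit of 120 are assumptions for illustration only.

```python
import pandas as pd

# Made-up data for illustration; the column names are hypothetical.
people = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "age": [34, 29, 41, 200, 38],
})

print(people.head())              # eyeball the first rows
stats = people["age"].describe()  # count, mean, std, min, quartiles, max

# The maximum of 200 stands out, so filter it with an assumed plausibility limit.
cleaned = people[people["age"] <= 120]

# Basic statistics per group, after cleaning.
mean_by_region = cleaned.groupby("region")["age"].mean()
```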
![Page 20: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/20.jpg)
Exploring with Pandas
![Page 21: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/21.jpg)
Reading in data files with Pandas
read_csv
read_excel
read_sql
read_json
read_html
read_stata
read_clipboard
import pandas as pd
df = pd.read_stata('example_data/AG_SEC12A.dta')
![Page 22: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/22.jpg)
Eyeballing rows
How many rows are there in this dataset?
len(df)
What do my data rows look like?
df.head(5)
df.tail()
df[10:20]
![Page 23: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/23.jpg)
Eyeballing columns
What’s in these columns?
df['sourceid']
df[['sourceid', 'ag12a_01', 'ag12a_02_2']]
What’s in the rows where these conditions are true?
df[df.ag12a_01 == 'YES']
df[(df.ag12a_01 == 'YES') & (df.ag12a_02_1 == 'NO')]
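The same selections on a small made-up frame, so they can be run without the survey file. The frame `survey` and the columns `q1` and `q2` are hypothetical stand-ins for the real survey columns.

```python
import pandas as pd

# Made-up stand-in for the survey data; q1 and q2 are hypothetical columns.
survey = pd.DataFrame({
    "sourceid": [1, 2, 3, 4],
    "q1": ["YES", "YES", "NO", "YES"],
    "q2": ["NO", "YES", "NO", "NO"],
})

one_column = survey["sourceid"]         # a single column (a Series)
two_columns = survey[["sourceid", "q1"]]  # a list of columns (a DataFrame)

# Each comparison gives a True/False Series; combine them with & (and) or | (or).
# The parentheses are required because & binds more tightly than ==.
yes_then_no = survey[(survey["q1"] == "YES") & (survey["q2"] == "NO")]
```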
![Page 24: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/24.jpg)
Summarising columns
What are my column names and types?
df.columns
df.dtypes
Which labels do I have in this column?
df['ag12a_03'].unique()
df['ag12a_03'].value_counts()
What are my columns’ mean, standard deviation etc?
df.describe()
![Page 25: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/25.jpg)
Pivot Tables: Combining data from one dataframe
● pd.pivot_table(df, index=['sourceid', 'ag12a_03'])
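A small runnable illustration of `pd.pivot_table`, using made-up sales data rather than the survey file (the frame and its column names are hypothetical). By default the function averages every numeric column for each index group; here `values`, `columns` and `aggfunc` are set explicitly to get a cross-tabulated total instead.

```python
import pandas as pd

# Made-up data for illustration; the column names are hypothetical.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "product": ["a", "b", "a", "a"],
    "units": [10, 5, 7, 3],
})

# One row per region, one column per product, cells hold the summed units.
table = pd.pivot_table(
    sales, index="region", columns="product", values="units", aggfunc="sum")
```

Combinations that never occur in the data (here, product "b" in the south) come out as NaN unless `fill_value` is given.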
![Page 26: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/26.jpg)
Merge: Combining data from multiple frames
longnames = pd.DataFrame({
    'country': pd.Series(['United States of America', 'Zaire', 'Egypt']),
    'longname': pd.Series([True, True, False])})
merged_data = pd.merge(
    left=popstats, right=longnames,
    left_on='Country/territory of residence', right_on='country')
merged_data[['Year', 'Country/territory of residence', 'longname', 'Total population', 'Origin / Returned from']]
![Page 27: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/27.jpg)
Left Joins: Keep everything from the left table…
longnames = pd.DataFrame({
    'country': pd.Series(['United States of America', 'Zaire', 'Egypt']),
    'longname': pd.Series([True, True, False])})
merged_data = pd.merge(
    left=popstats, right=longnames, how='left',
    left_on='Country/territory of residence', right_on='country')
merged_data[['Year', 'Country/territory of residence', 'longname', 'Total population', 'Origin / Returned from']]
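To see the difference between the default (inner) merge and a left join without the population file, here is a sketch with a tiny made-up left table. The frame `populations` and its numbers are invented for illustration; `longnames` follows the slide.

```python
import pandas as pd

# Made-up left table; the values are not real population figures.
populations = pd.DataFrame({
    "country": ["Egypt", "Kenya", "Zaire"],
    "total": [10, 20, 30],
})
longnames = pd.DataFrame({
    "country": ["United States of America", "Zaire", "Egypt"],
    "longname": [True, True, False],
})

# Default how='inner': only countries present in both tables survive.
inner = pd.merge(populations, longnames, on="country")

# how='left': every row of the left table is kept; unmatched rows get NaN.
left_join = pd.merge(populations, longnames, how="left", on="country")
```

Kenya has no match in `longnames`, so it disappears from `inner` and appears in `left_join` with a missing `longname`.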
![Page 28: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/28.jpg)
Normalising
Use df.stack()
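`stack` is a method on a DataFrame, not a top-level Pandas function. A minimal sketch of what it does, assuming a made-up wide table with one column per year:

```python
import pandas as pd

# Made-up wide table: one row per country, one column per year.
wide = pd.DataFrame(
    {"2014": [1, 2], "2015": [3, 4]},
    index=["Egypt", "Kenya"],
)

# stack() moves the column labels into the row index, giving one value per
# (country, year) pair: the "long" layout with one observation per row.
long = wide.stack()

# reset_index() turns the index levels back into ordinary columns.
tidy = long.reset_index()
tidy.columns = ["country", "year", "value"]
```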
![Page 29: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/29.jpg)
The Seaborn Library
![Page 30: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/30.jpg)
The Iris dataset
import seaborn as sns
iris = sns.load_dataset('iris')
![Page 31: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/31.jpg)
Visualising Iris data with Seaborn
# Note: the 'size' argument was renamed to 'height' in seaborn 0.9
sns.pairplot(iris, hue='species', height=2)
![Page 32: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/32.jpg)
Exploring with R
![Page 33: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/33.jpg)
R
Matrix analysis (similar to Pandas)
Good at:
Rapid statistical analysis (4000+ R libraries)
Rapidly-created static graphics
Not so good at:
Non-statistical things (e.g. GIS data analysis)
![Page 34: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/34.jpg)
Running R code
● Running R files:
○ From the terminal window: “R --no-save < myscript.r”
○ From inside another R program: source('myscript.r')
● Writing your own R code:
○ iPython notebooks: create “R” notebook (instead of python3)
○ Terminal window: type “R” (and “q()” to quit)
○ RStudio: click on RStudio tool
![Page 35: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/35.jpg)
Exercises
![Page 36: Session 05 cleaning and exploring](https://reader033.fdocuments.net/reader033/viewer/2022042908/58ecd8d01a28ab0e278b469d/html5/thumbnails/36.jpg)
Code
Try running the Python and R code in the 5.x set of notebooks