Data analysis with pandas

Post on 11-Apr-2017

Data Analysis with Pandas

When you think of Python...

Meet Jupyter Notebook

And me

job_title != "Developer"

I’m a Consultant at Distilled (since September 2015)

I do build some software in Python

But I mainly use it for data analysis

Getting Started

Python for scientific computing

Huge community

Fantastic ecosystem of packages other people have written

Can be tedious to actually install everything

Just use this! (https://continuum.io/downloads)

What is Anaconda?

Essentially a large (~400 MB) Python installation

But contains everything* you need for data analysis

Unless you have a special reason not to, you should just install and use this

*OK, technically not true, but it has everything you’re likely to need

You need the command line (but only for a minute)

On Windows, open PowerShell

On Mac, open Terminal or iTerm2

Just one line, though:

1. Just type “jupyter notebook”

2. Wait

3. ...

Back to safety

Open a new Notebook

Your very own data analysis environment

So that was fairly easy...

but why is it better than Excel?

There’s not enough room to list everything, but:

1. Handle larger data sets—no set limit on rows

2. Combine multiple files and data sources together instantaneously. Pull data straight from APIs or scraping

3. Everything is completely customisable—if you can imagine a query, it can be done (though not always easily)

4. It’s a safe place to mess things up

5. Keeps a record of your workflow—retrace your steps

...and it’s the perfect playground for learning Python

Side note: don’t know any Python?

Can’t cover it all today, so go here:

1. Learn Python the Hard Way (free)

2. Real Python ($60, but good)

3. Writing Idiomatic Python (~$15)

Unless you’re building applications:

1. Stick with the small building blocks

2. Learn how to write a function (we’ll do this today)

3. Learn about loops, conditional statements, and handling data

4. Probably no need to learn about managing projects and apps
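
As a rough idea of what those small building blocks look like together, here is a minimal sketch of one function containing a loop and a conditional statement. The function name and the sample keywords are invented for illustration.

```python
def count_long_words(words, min_length=5):
    """Return how many words have at least min_length characters."""
    count = 0
    for word in words:               # loop over the data
        if len(word) >= min_length:  # conditional statement
            count += 1
    return count


# Hypothetical sample data.
keywords = ["hotel", "spa", "cheap", "inn", "luxury"]
long_words = count_long_words(keywords)
print(long_words)
```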

Jupyter Notebook

Save notebooks for later

Run and re-run Python code

Really cool features like post-mortem debugging if you make a mistake

Cells

1. Type all the code you want

2. Shift+Enter to run it

3. View the result

Now that we have our Jupyter Notebook up and running, you can start playing around with almost any Python code

We’re going to look at Pandas, though—a data analysis library written in Python

Started its life in finance

Great for fast, flexible computation

The Star of the Show

A little setup, first

You’ll do this more or less at the beginning of each session

It’ll become second nature; just import the workhorse libraries we always use: numpy, pandas, pyplot.
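
The setup cell itself is not in this transcript. A typical version looks like the sketch below; the aliases np, pd and plt are the widely used community conventions, assumed here rather than taken from the slide.

```python
import numpy as np               # fast numerical arrays
import pandas as pd              # DataFrames and data analysis
import matplotlib.pyplot as plt  # plotting

# Optional: print the versions to confirm everything imported correctly.
print(np.__version__, pd.__version__)
```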

The DataFrame

If you’re used to spreadsheets, the DataFrame isn’t too difficult to understand

It’s the fundamental, flexible building block in Pandas

At its simplest, it looks rather like a spreadsheet would

The only obvious difference from Excel is the column labels, which default to numbers (0, 1, 2...) instead of A, B, C...
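
To see the numeric labels for yourself, here is a small sketch that builds a DataFrame by hand. The second DataFrame, with its column names and values, is hypothetical and only shows that columns can be named too.

```python
import numpy as np
import pandas as pd

# A 4x3 grid of numbers; no column names are given,
# so pandas labels the columns 0, 1, 2.
df = pd.DataFrame(np.arange(12).reshape(4, 3))
print(df)

# Columns can be named instead (hypothetical names and values).
named = pd.DataFrame({"keyword": ["hotel", "spa"], "volume": [1000, 250]})
print(named)
```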

You’ll usually create them from some other source:

The Pandas library provides some nice functions for importing from common file formats, so you won’t usually be building “by hand”:

1. pd.read_csv()

2. pd.read_table()

3. pd.read_sql()
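
As one illustration of these import functions, the sketch below uses pd.read_sql() against a throwaway in-memory SQLite database. The table name, columns and rows are made up so that the example runs on its own.

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (url TEXT, active INTEGER)")
conn.executemany(
    "INSERT INTO links VALUES (?, ?)",
    [("http://example.com/a", 1), ("http://example.com/b", 0)],
)
conn.commit()

# pd.read_sql() runs the query and returns the result as a DataFrame.
df = pd.read_sql("SELECT url, active FROM links ORDER BY url", conn)
conn.close()
print(df)
```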

We have so much data stored in CSVs

Our first function call will just read some data into the DataFrame, where we can analyse it

Reading a CSV

Get help at any time with Shift+Tab

1. pd.read_csv() will read in the data

2. Fields are separated by tabs

3. The encoding is UTF-16 (don’t ask…)

4. The whole result is assigned to the variable ‘df’
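
The four points above can be sketched as follows. The original file is not available, so this example first writes a tiny tab-separated, UTF-16 file to a temporary folder; the file name and its contents are assumptions.

```python
import os
import tempfile

import pandas as pd

# Write a tiny tab-separated, UTF-16 file so there is something to read.
# (File name and contents are made up for this sketch.)
path = os.path.join(tempfile.mkdtemp(), "links.csv")
with open(path, "w", encoding="utf-16", newline="") as f:
    f.write("URL\tLink Active?\n")
    f.write("http://example.com/a\tTrue\n")
    f.write("http://example.com/b\tFalse\n")

# sep="\t" says fields are separated by tabs; encoding matches the file.
df = pd.read_csv(path, sep="\t", encoding="utf-16")
print(df)
```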

Get a quick sense of the data (658k rows, here)

See the columns
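
The exact calls on these slides are not in the transcript. A few common ways to get a first look at a DataFrame are sketched below, on a small made-up DataFrame rather than the 658k-row file.

```python
import pandas as pd

# Hypothetical stand-in for the real data.
df = pd.DataFrame({
    "URL": ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
    "Link Active?": [True, False, True],
})

print(df.shape)             # (number of rows, number of columns)
print(df.head(2))           # the first two rows
print(df.columns.tolist())  # the column names as a plain list
df.info()                   # column types and non-null counts
```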

Filtering

What’s happening there?

df['Link Active?'] is:

1. Checking that whole column for values that are True or False

2. Returning a Series (a column-shaped array) of True/False values

3. This is fast, and lets us filter in an amazing variety of ways
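
A minimal sketch of that idea, assuming the 'Link Active?' column already holds True/False values. The URLs are made up.

```python
import pandas as pd

# Hypothetical stand-in for the real data.
df = pd.DataFrame({
    "URL": ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
    "Link Active?": [True, False, True],
})

mask = df["Link Active?"]  # one True/False value per row
active = df[mask]          # keep only the rows where the mask is True
inactive = df[~mask]       # ~ flips the mask, keeping the False rows
print(active)
```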

Filtering (again)

We’re probably ready for this one, now:
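
The code from this slide is not part of the transcript. One plausible next step, sketched here, is combining several conditions with & (and) and | (or). The "Score" column and all values are hypothetical.

```python
import pandas as pd

# Hypothetical data; "Score" is an invented column.
df = pd.DataFrame({
    "URL": ["a", "b", "c", "d"],
    "Link Active?": [True, True, False, True],
    "Score": [70, 30, 40, 55],
})

# Each condition goes in its own parentheses.
active_and_strong = df[df["Link Active?"] & (df["Score"] >= 50)]
active_or_strong = df[df["Link Active?"] | (df["Score"] >= 50)]
print(active_and_strong)
```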

Example project: Getting data from SEMRush

Writing your own function

Call our function, get a DataFrame!
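
The function from the talk is not in the transcript, and no real API call is shown here. As a sketch only, this assumes the API returns delimited text and turns it into a DataFrame; the function name, the separator and the sample text are all assumptions.

```python
import io

import pandas as pd


def text_to_frame(raw_text, sep=";"):
    """Parse delimited text (e.g. an API response body) into a DataFrame."""
    return pd.read_csv(io.StringIO(raw_text), sep=sep)


# Made-up stand-in for a response body; no network request is made.
sample = "Keyword;Search Volume\nhotels in texas;1000\ncheap hotels ny;500\n"
df = text_to_frame(sample)
print(df)
```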

Write to disk in case anything goes wrong
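
A minimal sketch of saving a DataFrame and loading it back. The file name and the data are made up, and a temporary folder is used so nothing important gets overwritten.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data standing in for the downloaded results.
df = pd.DataFrame({
    "Keyword": ["hotels in texas", "cheap hotels ny"],
    "Search Volume": [1000, 500],
})

path = os.path.join(tempfile.mkdtemp(), "keywords_backup.csv")
df.to_csv(path, index=False)  # index=False leaves out the row numbers

restored = pd.read_csv(path)  # load it back later if anything goes wrong
print(restored)
```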

Reading in multiple files
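
One common pattern, sketched below, is to find the files with glob, read each one, and stack them with pd.concat(). The example writes two small made-up files first so that it runs on its own.

```python
import glob
import os
import tempfile

import pandas as pd

# Create two small CSV files to read back (made-up contents).
folder = tempfile.mkdtemp()
pd.DataFrame({"Keyword": ["hotels", "motels"], "Search Volume": [900, 300]}).to_csv(
    os.path.join(folder, "part1.csv"), index=False
)
pd.DataFrame({"Keyword": ["inns"], "Search Volume": [100]}).to_csv(
    os.path.join(folder, "part2.csv"), index=False
)

# Find every CSV in the folder, read each one, then stack them together.
paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
frames = [pd.read_csv(p) for p in paths]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```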

Apply custom filters
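
The filters used in the talk are not in the transcript. As an illustration, this sketch combines a text match with a small custom function; the keywords and the "three words or more" rule are invented.

```python
import pandas as pd

# Hypothetical keyword data.
df = pd.DataFrame({
    "Keyword": [
        "hotels in texas",
        "cheap hotels ny",
        "motel prices",
        "luxury hotel austin",
    ]
})


def is_long_tail(keyword):
    """Custom rule: treat phrases of three or more words as long tail."""
    return len(keyword.split()) >= 3


has_hotel = df["Keyword"].str.contains("hotel")  # text match on each row
long_tail = df["Keyword"].apply(is_long_tail)    # run our function on each row
filtered = df[has_hotel & long_tail]
print(filtered)
```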

Drill down into individual words:

Counter() will save you a huge amount of work

Here we wanted to home in on modifier words
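
A minimal sketch of counting individual words with Counter from the standard library. The keyword phrases are made up.

```python
from collections import Counter

import pandas as pd

# Hypothetical keyword phrases.
keywords = pd.Series(["cheap hotels ny", "best hotels texas", "cheap hotels texas"])

counts = Counter()
for phrase in keywords:
    counts.update(phrase.split())  # add one for every word in the phrase

# The most frequent words, with their counts.
print(counts.most_common(3))
```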

More detailed questions

How local are the searches?

Do people search by state code or full name?

Do people search by hotel category?
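
As a sketch of how one of these questions could be approached, the example below compares search volume for a state code against the full state name. The keywords and volumes are invented, and "tx" / "texas" are just one example pair.

```python
import pandas as pd

# Hypothetical keyword data.
df = pd.DataFrame({
    "Keyword": ["hotels in tx", "hotels in texas", "texas hotels", "cheap hotels"],
    "Search Volume": [100, 400, 300, 200],
})

# \b marks a word boundary, so "tx" will not match inside a longer word.
by_code = df["Keyword"].str.contains(r"\btx\b", regex=True)
by_name = df["Keyword"].str.contains(r"\btexas\b", regex=True)

code_volume = df.loc[by_code, "Search Volume"].sum()
name_volume = df.loc[by_name, "Search Volume"].sum()
print(code_volume, name_volume)
```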

Second example: Custom Rank Tracking Charts
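
The charts from this example are not in the transcript. A rough sketch of the general approach: reshape the rankings so each keyword is a column, then plot them with the y-axis flipped so that rank 1 sits at the top. All data here is invented, and the plot call needs matplotlib installed (it is included in Anaconda).

```python
import pandas as pd

# Hypothetical rank-tracking data: one row per keyword per date.
ranks = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2017-01-01", "2017-01-01", "2017-02-01",
         "2017-02-01", "2017-03-01", "2017-03-01"]
    ),
    "Keyword": ["hotels", "motels"] * 3,
    "Rank": [8, 15, 5, 12, 3, 14],
})

# One row per date, one column per keyword.
wide = ranks.pivot(index="Date", columns="Keyword", values="Rank")

ax = wide.plot(marker="o")  # one line per keyword
ax.invert_yaxis()           # rank 1 should appear at the top
ax.set_ylabel("Rank")
```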

Where to begin?

If you don’t know Python, start with those books I shared earlier.

If you do, check out Python for Data Analysis

Keep Jupyter Notebook open at all times

Experiment!

Questions?