Transforming Data to Unlock Its Latent Value

TRANSFORMING DATA TO UNLOCK ITS LATENT VALUE PyData Carolinas 9/15/2016

Transcript of Transforming Data to Unlock Its Latent Value

Page 1: Transforming Data to Unlock Its Latent Value

TRANSFORMING DATA TO UNLOCK ITS LATENT VALUE

PyData Carolinas 9/15/2016

Page 2: Transforming Data to Unlock Its Latent Value

How many of you consider yourselves data scientists?

Page 3: Transforming Data to Unlock Its Latent Value

How many of you spend a lot of time exploring data?

Page 4: Transforming Data to Unlock Its Latent Value

How many of you have a formal process for data exploration?

Page 5: Transforming Data to Unlock Its Latent Value

ABOUT ME - TONY OJEDA

Founder of District Data Labs

Education and research company

Business & finance background

Self-taught programmer (R & Python)

Page 6: Transforming Data to Unlock Its Latent Value

HOW I THINK ABOUT DATA

Page 7: Transforming Data to Unlock Its Latent Value

PUT THINGS IN ORDER

Categories, Classifications, Taxonomies, Ontologies.

Page 8: Transforming Data to Unlock Its Latent Value

EXPLORATION FRAMEWORK

Prep Phase

Identify: Types of Information, Entities in Data Set

Review: Transformation Methods, Visualization Methods

Create: Category Aggregations, Continuous Bins, Cluster Categories

Explore Phase

Filter + Aggregate, Field Relationships, Entity Relationships

Insights, Over Time, Visualization

Page 9: Transforming Data to Unlock Its Latent Value

THE DATA: EPA VEHICLE FUEL ECONOMY

http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip
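A minimal sketch of loading this file with pandas. The column names listed here (`year`, `make`, `model`, `VClass`, `trany`, `displ`, `comb08`) are assumptions about the EPA schema, and `load_vehicles` is a hypothetical helper; the two sample rows are made up so the sketch runs without a network connection.

```python
import io

import pandas as pd

# URL from the slide; pandas can read a single-file zip archive directly.
VEHICLES_URL = "http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip"

# Assumed column names; adjust if the downloaded file differs.
ASSUMED_COLUMNS = ["year", "make", "model", "VClass", "trany", "displ", "comb08"]


def load_vehicles(source=VEHICLES_URL, columns=None):
    """Read the vehicles table from a URL, a path, or a file-like object."""
    df = pd.read_csv(source, low_memory=False)
    if columns is not None:
        df = df[list(columns)]
    return df


# Tiny made-up sample standing in for the real download.
sample_csv = io.StringIO(
    "year,make,model,VClass,trany,displ,comb08\n"
    "2016,Ford,Mustang,Subcompact Cars,Automatic (S6),2.3,25\n"
    "1985,Toyota,Corolla,Compact Cars,Manual 5-spd,1.6,27\n"
)
vehicles = load_vehicles(sample_csv, columns=ASSUMED_COLUMNS)
print(vehicles.shape)
```

Calling `load_vehicles()` with no arguments would fetch the real file from the URL above.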

Page 10: Transforming Data to Unlock Its Latent Value

IDENTIFY

Identify: Types of Information, Entities in Data Set

Review: Transformation Methods, Visualization Methods

Create: Category Aggregations, Continuous Bins, Cluster Categories

Page 11: Transforming Data to Unlock Its Latent Value

IDENTIFY TYPES OF INFORMATION

Page 12: Transforming Data to Unlock Its Latent Value

IDENTIFY ENTITIES IN THE DATA

Page 13: Transforming Data to Unlock Its Latent Value

ENTITIES IN OUR DATA SET

Level 4

Level 3

Level 2

Level 1

Year + Model

Year + Model Type

Year + Make

Year

Year + Vehicle Class

Vehicle Class

Model Type

Make

Page 14: Transforming Data to Unlock Its Latent Value

ENTITIES IN OUR DATA SET

Level 4

Level 3

Level 2

Level 1

2016 Ford Mustang 2.3L V4 Automatic Rear-wheel Drive

2016 Ford Mustang

2016 Fords

All 2016 Vehicles

2016 Subcompacts

All Subcompacts

Ford Mustangs

All Ford Vehicles
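One way to make these entities concrete is to treat each one as a set of grouping columns. The sketch below assumes column names `year`, `make`, `model`, and `VClass` and uses made-up rows; `entity_counts` is a hypothetical helper, not code from the talk.

```python
import pandas as pd

# Made-up rows; each row stands for one individual vehicle configuration
# (the "Year + Model" level on the slide).
vehicles = pd.DataFrame(
    {
        "year": [2016, 2016, 2016, 1985],
        "make": ["Ford", "Ford", "BMW", "Ford"],
        "model": ["Mustang", "Mustang", "M3", "Mustang"],
        "VClass": ["Subcompact Cars"] * 4,
    }
)

# Each higher-level entity is defined by the columns that identify it.
ENTITY_KEYS = {
    "Year": ["year"],
    "Make": ["make"],
    "Vehicle Class": ["VClass"],
    "Model Type": ["make", "model"],
    "Year + Make": ["year", "make"],
    "Year + Vehicle Class": ["year", "VClass"],
    "Year + Model Type": ["year", "make", "model"],
}


def entity_counts(df, keys):
    """Count how many rows fall under each distinct entity."""
    return df.groupby(keys).size().reset_index(name="vehicles")


for name, keys in ENTITY_KEYS.items():
    print(name, len(entity_counts(vehicles, keys)))
```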

Page 15: Transforming Data to Unlock Its Latent Value

REVIEW

Identify: Types of Information, Entities in Data Set

Review: Transformation Methods, Visualization Methods

Create: Category Aggregations, Continuous Bins, Cluster Categories

Page 16: Transforming Data to Unlock Its Latent Value

TRANSFORMATION METHODS

Filtering

Aggregation/Disaggregation

Pivoting

Graph Transformation
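The first three methods map naturally onto pandas operations. A small sketch with made-up rows follows; the column names (`category`, `comb08`) are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; column names are assumptions, not a confirmed schema.
vehicles = pd.DataFrame(
    {
        "year": [2016, 2016, 2016, 1985],
        "make": ["Ford", "Ford", "BMW", "Ford"],
        "category": ["Small Cars", "Pickup Trucks", "Small Cars", "Small Cars"],
        "comb08": [25, 19, 28, 22],
    }
)

# Filtering: keep only the rows of interest.
recent = vehicles[vehicles["year"] == 2016]

# Aggregation: collapse rows to one summary value per group.
mean_mpg = recent.groupby("make")["comb08"].mean()

# Pivoting: spread one field across columns to compare two fields at once.
counts = recent.pivot_table(
    index="make", columns="category", values="comb08", aggfunc="count", fill_value=0
)

print(mean_mpg)
print(counts)
```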

Page 17: Transforming Data to Unlock Its Latent Value

VISUALIZATION METHODS

Bar charts

Multi-line graphs

Scatter plots/matrices

Heatmaps

Network visualizations

Page 18: Transforming Data to Unlock Its Latent Value

CREATE

Identify: Types of Information, Entities in Data Set

Review: Transformation Methods, Visualization Methods

Create: Category Aggregations, Continuous Bins, Cluster Categories

Page 19: Transforming Data to Unlock Its Latent Value

CATEGORY AGGREGATIONS

Transmission: Automatic 3-spd, Automatic 4-spd, Manual 5-spd, Automatic (S5), Manual 6-spd, Automatic 5-spd, Auto(AM8), Auto(AV-S7), Automatic (S6), Automatic (S9), Manual 4-spd, + 33 more

Transmission Type: Automatic, Manual
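A sketch of how the detailed transmission labels might be collapsed into the two types. It assumes every automatic label starts with "Auto" and every manual label with "Manual", which holds for the examples on the slide but is not verified against the full data set; `transmission_type` is a hypothetical helper.

```python
def transmission_type(trany):
    """Collapse a detailed transmission label into 'Automatic' or 'Manual'."""
    label = str(trany).strip().lower()
    if label.startswith("auto"):
        return "Automatic"
    if label.startswith("manual"):
        return "Manual"
    return "Unknown"  # missing or unexpected labels stay visible


for raw in ["Automatic 3-spd", "Auto(AM8)", "Auto(AV-S7)", "Manual 6-spd"]:
    print(raw, "->", transmission_type(raw))
```

With a pandas column this could be applied as `vehicles["trany"].map(transmission_type)`, where `trany` is the assumed column name.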

Page 20: Transforming Data to Unlock Its Latent Value

CATEGORY AGGREGATIONS

Vehicle Class: Special Purpose Vehicle 2WD, Midsize Cars, Subcompact Cars, Compact Cars, Sport Utility Vehicle - 4WD, Small Sport Utility Vehicle 2WD, Small Sport Utility Vehicle 4WD, Two Seaters, Small Station Wagons, Minicompact Cars, Minivan - 4WD, + 23 more

Vehicle Category: Small Cars, Midsize Cars, Large Cars, Station Wagons, Pickup Trucks, Special Purpose, Sport Utility, Vans & Minivans
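The slide does not show which class maps to which category, so the keyword rules below are an illustrative guess (for example, placing Two Seaters under Small Cars is an assumption). The sketch shows the general technique: ordered rules where the first match wins.

```python
# Ordered keyword rules: the first match wins, so specific terms come before
# generic ones (e.g. "Small Station Wagons" must hit the wagon rule first).
# These rules are a guess for illustration, not the mapping used in the talk.
CATEGORY_RULES = [
    ("special purpose", "Special Purpose"),
    ("sport utility", "Sport Utility"),
    ("pickup", "Pickup Trucks"),
    ("van", "Vans & Minivans"),
    ("station wagon", "Station Wagons"),
    ("midsize", "Midsize Cars"),
    ("large", "Large Cars"),
    ("compact", "Small Cars"),
    ("two seater", "Small Cars"),
]


def vehicle_category(vclass):
    """Map a detailed vehicle class label to a broad category."""
    label = str(vclass).lower()
    for keyword, category in CATEGORY_RULES:
        if keyword in label:
            return category
    return "Other"


for raw in ["Minicompact Cars", "Minivan - 4WD", "Small Station Wagons"]:
    print(raw, "->", vehicle_category(raw))
```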

Page 21: Transforming Data to Unlock Its Latent Value

CATEGORIES FROM CONTINUOUS

Quintile labels: Very Low, Low, Moderate, High, Very High

Combined MPG → Fuel Efficiency Quintiles

Engine Displacement → Engine Size Quintiles

CO2 Emission → Emission Quintiles

Fuel Cost → Fuel Cost Quintiles
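Quintile binning of a continuous field can be sketched with `pandas.qcut`, which splits values into equal-frequency groups. The MPG values are made up and `to_quintiles` is a hypothetical helper.

```python
import pandas as pd

QUINTILE_LABELS = ["Very Low", "Low", "Moderate", "High", "Very High"]


def to_quintiles(values):
    """Bin a numeric series into five equal-frequency groups with readable labels."""
    return pd.qcut(values, q=5, labels=QUINTILE_LABELS)


# Made-up combined MPG values, only to show the call.
mpg = pd.Series([10, 11, 12, 13, 14, 15, 16, 17, 18, 19], name="comb08")
efficiency = to_quintiles(mpg)
print(efficiency.value_counts(sort=False))
```

Heavily tied values can make two bin edges coincide, in which case `qcut` raises an error; real data may need that case handled before binning.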

Page 22: Transforming Data to Unlock Its Latent Value

CLUSTER CATEGORIES

Takes multiple fields into consideration together.

Groups things in ways you may not have thought of.

Come up with descriptive names for clusters.

Number of clusters? Looking for relatively clear boundaries.

Automatically creates new categories (saves time).
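The slides do not name the clustering algorithm, so the sketch below uses k-means as an assumption. It is a small NumPy version for illustration (a library implementation such as scikit-learn's `KMeans` would be the usual choice), with a deterministic farthest-point start so repeated runs agree. The returned inertia (within-cluster sum of squares) is one rough way to compare different numbers of clusters.

```python
import numpy as np


def _nearest(X, centres):
    """Index of the closest centre for every row of X."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def kmeans(points, k, n_iter=100):
    """Plain k-means on standardized features; returns (labels, inertia)."""
    X = np.asarray(points, dtype=float)
    std = X.std(axis=0)
    X = (X - X.mean(axis=0)) / np.where(std == 0, 1.0, std)

    # Farthest-point initialisation: start from the first row, then keep
    # adding the row farthest from the centres chosen so far.
    centres = [X[0]]
    for _ in range(1, k):
        dist = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[np.argmax(dist)])
    centres = np.array(centres)

    for _ in range(n_iter):
        labels = _nearest(X, centres)
        # Move each centre to the mean of its rows (keep it if it has none).
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new, centres):
            break
        centres = new

    labels = _nearest(X, centres)
    inertia = float(((X - centres[labels]) ** 2).sum())
    return labels, inertia


# Made-up (engine displacement, combined MPG) pairs forming two loose groups.
data = [[1.5, 35], [1.6, 33], [1.8, 31], [5.0, 15], [5.7, 14], [6.2, 13]]
for k in (2, 3):
    labels, inertia = kmeans(data, k)
    print(k, labels.tolist(), round(inertia, 3))
```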

Page 23: Transforming Data to Unlock Its Latent Value

VEHICLE CLUSTERS = 8

Page 24: Transforming Data to Unlock Its Latent Value

VEHICLE CLUSTERS = 4

Page 25: Transforming Data to Unlock Its Latent Value

ASSIGN DESCRIPTIVE NAMES

Cluster 0 → Small Very Efficient

Cluster 1 → Large Inefficient

Cluster 2 → Midsized Balanced

Cluster 3 → Small Moderately Efficient

Page 26: Transforming Data to Unlock Its Latent Value

EXPLORE PHASE

Insights

Over Time

Visualization

Filter + Aggregate, Field Relationships, Entity Relationships

Page 27: Transforming Data to Unlock Its Latent Value

VEHICLE CATEGORY COUNTS (2016)

Page 28: Transforming Data to Unlock Its Latent Value

VEHICLE CATEGORY COUNTS (1985)

Page 29: Transforming Data to Unlock Its Latent Value

ENGINE SIZE COUNTS (2016)

Page 30: Transforming Data to Unlock Its Latent Value

FUEL EFFICIENCY COUNTS (2016)

Page 31: Transforming Data to Unlock Its Latent Value

VEHICLE CLUSTER COUNTS (2016)

Page 32: Transforming Data to Unlock Its Latent Value

MANUFACTURER VEHICLE COUNTS (2016)

Page 33: Transforming Data to Unlock Its Latent Value
Page 34: Transforming Data to Unlock Its Latent Value

MORE DETAIL

Page 35: Transforming Data to Unlock Its Latent Value

FUEL EFFICIENCY VS. ENGINE SIZE (2016)

Page 36: Transforming Data to Unlock Its Latent Value

FUEL EFFICIENCY VS. ENGINE SIZE (1985)

Page 37: Transforming Data to Unlock Its Latent Value

ENGINE SIZE & EFFICIENCY VS. CATEGORY

Page 38: Transforming Data to Unlock Its Latent Value

PIVOT COUNTS BY MAKE & CATEGORY

Page 39: Transforming Data to Unlock Its Latent Value
Page 40: Transforming Data to Unlock Its Latent Value

CHANGES OVER TIME

Insights

Over Time

Visualization

Filter + Aggregate, Field Relationships, Entity Relationships

Page 41: Transforming Data to Unlock Its Latent Value

CATEGORIES OVER TIME
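A sketch of the aggregation behind a categories-over-time view: count vehicles per year and category, then convert the counts to shares so years with different totals are comparable. The rows are made up and the `category` column stands in for the aggregated Vehicle Category field.

```python
import pandas as pd

# Made-up rows; "category" is an assumed column name.
vehicles = pd.DataFrame(
    {
        "year": [1985, 1985, 1985, 2016, 2016, 2016],
        "category": [
            "Small Cars", "Small Cars", "Pickup Trucks",
            "Small Cars", "Sport Utility", "Sport Utility",
        ],
    }
)

# One row per year, one column per category, cell = number of vehicles.
by_year = vehicles.groupby(["year", "category"]).size().unstack(fill_value=0)

# Convert counts to each category's share of that year's vehicles.
share = by_year.div(by_year.sum(axis=1), axis=0)
print(share.round(2))
```

With matplotlib installed, `by_year.plot()` would draw one line per category.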

Page 42: Transforming Data to Unlock Its Latent Value
Page 43: Transforming Data to Unlock Its Latent Value

BMW OVER TIME

Page 44: Transforming Data to Unlock Its Latent Value
Page 45: Transforming Data to Unlock Its Latent Value

TOYOTA OVER TIME

Page 46: Transforming Data to Unlock Its Latent Value
Page 47: Transforming Data to Unlock Its Latent Value

EXPLORE PHASE

Insights

Over Time

Visualization

Filter + Aggregate, Field Relationships, Entity Relationships

Page 48: Transforming Data to Unlock Its Latent Value

SCATTER MATRICES & PLOTS

Page 49: Transforming Data to Unlock Its Latent Value
Page 50: Transforming Data to Unlock Its Latent Value

SCATTER MATRIX WITH CATEGORIES

Page 51: Transforming Data to Unlock Its Latent Value
Page 52: Transforming Data to Unlock Its Latent Value

ENGINE SIZE VS. EFFICIENCY

Page 53: Transforming Data to Unlock Its Latent Value
Page 54: Transforming Data to Unlock Its Latent Value

ENGINE SIZE VS. FUEL COST

Page 55: Transforming Data to Unlock Its Latent Value
Page 56: Transforming Data to Unlock Its Latent Value

EXPLORE PHASE

Insights

Over Time

Visualization

Filter + Aggregate, Field Relationships, Entity Relationships

Page 57: Transforming Data to Unlock Its Latent Value

GRAPH ANALYSIS

Relationships between entities.

Attributes entities have in common.

Actions one entity takes involving another.

Changes in relationships over time.

Page 58: Transforming Data to Unlock Its Latent Value

RELATIONAL TO GRAPH TRANSFORMATION

Page 59: Transforming Data to Unlock Its Latent Value

RELATIONAL TO GRAPH TRANSFORMATION
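One plausible reading of this transformation, sketched with the standard library only: start from relational (make, attribute) rows and connect two makes whenever they share an attribute, weighting the edge by how many attributes they share. The rows and the `shared_attribute_edges` helper are made up for illustration; the exact edge definition used in the talk is not shown in the transcript.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Made-up rows: each says a make offers a vehicle with that attribute.
rows = [
    ("Ford", "Small Cars"), ("Ford", "Pickup Trucks"), ("Ford", "Sport Utility"),
    ("Nissan", "Small Cars"), ("Nissan", "Pickup Trucks"),
    ("Jeep", "Sport Utility"),
    ("Ram", "Pickup Trucks"),
]


def shared_attribute_edges(rows):
    """Return {(make_a, make_b): weight}, weight = number of shared attributes."""
    makes_by_attribute = defaultdict(set)
    for make, attribute in rows:
        makes_by_attribute[attribute].add(make)

    edges = Counter()
    for makes in makes_by_attribute.values():
        # Every pair of makes sharing this attribute gains one unit of weight.
        for a, b in combinations(sorted(makes), 2):
            edges[(a, b)] += 1
    return dict(edges)


edges = shared_attribute_edges(rows)
for pair, weight in sorted(edges.items()):
    print(pair, weight)
```

The resulting weighted edge list can be handed to a graph library for layout and analysis.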

Page 60: Transforming Data to Unlock Its Latent Value

MANUFACTURER NETWORK (2016)

Page 61: Transforming Data to Unlock Its Latent Value
Page 62: Transforming Data to Unlock Its Latent Value

EGO GRAPH FOR NISSAN
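An ego graph keeps one node, its direct neighbours, and the edges among them. A minimal standard-library sketch over a weighted edge dictionary follows; the edges are made up, and `ego_graph` here is a hypothetical helper (NetworkX offers a function of the same name for its own graph objects).

```python
def ego_graph(edges, ego):
    """Keep the ego, its direct neighbours, and every edge among those nodes."""
    neighbours = {ego}
    for a, b in edges:
        if a == ego:
            neighbours.add(b)
        elif b == ego:
            neighbours.add(a)
    return {
        (a, b): w for (a, b), w in edges.items()
        if a in neighbours and b in neighbours
    }


# Made-up weighted edges between makes (weight = shared attributes).
edges = {
    ("Ford", "Nissan"): 2,
    ("Ford", "Ram"): 1,
    ("Nissan", "Ram"): 1,
    ("Ford", "Jeep"): 1,
}
nissan = ego_graph(edges, "Nissan")
print(sorted(nissan))
```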

Page 63: Transforming Data to Unlock Its Latent Value
Page 64: Transforming Data to Unlock Its Latent Value

COMMUNITY GRAPH (2016)

Page 65: Transforming Data to Unlock Its Latent Value
Page 66: Transforming Data to Unlock Its Latent Value

EDGE WEIGHTS OVER TIME

Page 67: Transforming Data to Unlock Its Latent Value
Page 68: Transforming Data to Unlock Its Latent Value

FILTER FOR SPECIFIC MAKES

Page 69: Transforming Data to Unlock Its Latent Value
Page 70: Transforming Data to Unlock Its Latent Value

RECAP

Page 71: Transforming Data to Unlock Its Latent Value

EXPLORATION FRAMEWORK

Prep Phase

Identify: Types of Information, Entities in Data Set

Review: Transformation Methods, Visualization Methods

Create: Category Aggregations, Continuous Bins, Cluster Categories

Explore Phase

Filter + Aggregate, Field Relationships, Entity Relationships

Insights, Over Time, Visualization

Page 72: Transforming Data to Unlock Its Latent Value

SO MANY INSIGHTS!

Page 73: Transforming Data to Unlock Its Latent Value

NO REALLY… SO MANY!

Significantly more small cars than other types in 2016.

Sport Utility Vehicles are currently next most popular.

Midsize Cars currently third most popular.

Other vehicle types currently not as popular.

Sport Utility Vehicles didn't exist in 1985.

Small Cars were even more popular in 1985.

Pickup Trucks were also more popular in 1985.

Special Purpose Vehicles were more popular as well.

Most vehicles today have very small or moderate sized engines.

Most vehicles today are very fuel efficient.

Few vehicles today have large engines.

Even fewer have low fuel efficiency.

BMW currently makes the most vehicle models, followed by Chevy and Ford.

Pagani and Alfa Romeo make the fewest vehicle models.

Smaller cars with smaller engines are most fuel efficient.

Currently no small engine vehicles with low efficiency.

Even Vans and Station Wagons are relatively fuel efficient these days.

Each vehicle category has varying engine sizes.

BMW is doubling down on small cars. So is Porsche.

Ford, Chevrolet, & Nissan are going for breadth.

Jeep and Land Rover are focused solely on SUVs.

Ram is focused solely on Pickup Trucks.

A few companies are focused only on small cars.

Toyota used to make a lot of small cars, but now makes fewer in favor of SUVs and Pickup Trucks.

Overall surge in small efficient engine vehicles over last 10 years.

Mostly at expense of moderate/large inefficient engines.

Even Large Cars are relatively fuel efficient these days.

Linear relationships between engine size and both fuel cost and emissions.

Exponential relationships between efficiency and both engine size and fuel cost.

Clustering into 4 groups results in relatively clear boundaries in the data.

Manufacturers with both depth and breadth of vehicle attributes have most connections.

Manufacturers that specialize are positioned toward edge of network.

Four distinct communities detected, and connections over time have converged.

Page 74: Transforming Data to Unlock Its Latent Value

THANK YOU!

[email protected]

linkedin.com/in/tonyojeda

@tonyojeda3

http://districtdatalabs.com

http://bit.ly/PyDataNC (code)