Data Visualization in Data Science


Maloy Manna

biguru.wordpress.com linkedin.com/in/maloy twitter.com/itsmaloy

Synopsis

Having data is not enough. Adding context to data is essential to understand the data, find patterns and engage audiences. Data visualization is a key element of data science, the interdisciplinary field which deals with finding insights from data.

• In this webinar, we explore the roles of data visualization at different stages of the data science process, and why it is essential.

• We also look at how data is encoded visually with shape, size, color and other variables, and how the basic principles of visual encoding can be applied to build better visualizations.

• We cover narratives, types of bias and maps.

• Finally, we look at how various tools, both open-source and off-the-shelf software, are used in data science to build effective data visualizations.

Speaker profile

Maloy Manna
Project Manager - Engineering, AXA Data Innovation Lab

• Over 14 years of experience building data-driven products and services
• Previous organizations: Thomson Reuters, Saama, Infosys, TCS


Contents

Defining Data visualization
Data science process
Data visualization
Visual encoding of data
Narrative structures
Dataviz Technology & Tools

Defining Data visualization

• Visual display of quantitative information
• Mapping data to visual elements
• Encoding data with size, shape, color...
• Storytelling / narrative elements

Defining Data Visualization

Exploratory
• Find insights
• Conversation between data and “you”

Explanatory
• Present insights

Data science project life-cycle

• Acquire data
• Prepare data
• Analysis & Modeling
• Evaluation & Interpretation
• Deployment
• Operations & Optimization

Data science process

• Data wrangling
• EDA: Exploratory Data Analysis
• Data visualization: exploratory and explanatory

Source: Computational Information Design | Ben Fry

Exploratory data visualization

Data analysis approaches:
• Classical: Problem > Data > Model > Analysis > Conclusions
• EDA (Exploratory Data Analysis): Problem > Data > Analysis > Model > Conclusions
• Bayesian: Problem > Data > Model > Prior distribution > Analysis > Conclusions

EDA is an approach, not a set of techniques.

Exploratory data visualization

Statistical approaches:
• Quantitative
  • Hypothesis testing
  • Analysis of variance (ANOVA)
  • Point estimates and confidence intervals
  • Least squares regression
• Graphical
  • Scatter plots
  • Histograms
  • Probability plots
  • Residual plots
  • Box plots
  • Block plots


Exploratory data visualization

Graphical analysis procedures:
• Testing assumptions
• Model selection
• Model validation
• Estimator selection
• Relationship identification
• Factor effect determination
• Outlier detection

Graphical analysis is a must for deriving insights from data.

Exploratory data analysis

Anscombe's quartet: four datasets with nearly identical summary statistics that look very different when plotted

N=11

Mean of X = 9.0

Mean of Y = 7.5

Intercept = 3

Slope = 0.5

Residual standard deviation = 1.237

Correlation = 0.816
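
The summary statistics above can be reproduced with a short script. This is a minimal sketch using only the Python standard library. The data values are the commonly published ones for Anscombe's quartet and are an assumption here, since the slides do not list them. The names ANSCOMBE, summarize and SUMMARIES are illustrative.

```python
import math

# Commonly published values for Anscombe's quartet (not listed in the slides).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics listed on the slide for one dataset."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    # Ordinary least squares fit of y = intercept + slope * x.
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Pearson correlation coefficient.
    correlation = sxy / math.sqrt(sxx * syy)

    # Residual standard deviation with n - 2 degrees of freedom.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    residual_sd = math.sqrt(sse / (n - 2))

    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "intercept": intercept,
        "slope": slope,
        "residual_sd": residual_sd,
        "correlation": correlation,
    }


SUMMARIES = {name: summarize(x, y) for name, (x, y) in ANSCOMBE.items()}

for name, stats in SUMMARIES.items():
    line = ", ".join(
        f"{key}={value:.3f}" for key, value in stats.items() if key != "n"
    )
    print(f"Dataset {name}: n={stats['n']}, {line}")
```

All four datasets should print nearly the same numbers, which is the point of the quartet: the differences only show up once the data is plotted.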


Explanatory data visualization

Design

Engineering

Journalism

Explanatory data visualization

Visualization is both an art and a science
• Harry Beck's subway map of London

Visual encoding of data

Data Types
• Quantitative
  • Continuous, Discrete
• Categorical
  • Nominal, Ordered, Interval

Visual encoding of data

Categorical scales and graph design

Visual encoding of data

Bandwidth of our senses [Tor Norretranders]

Visual encoding of data

Data → visual display elements

• Position x

• Position y

• Retinal variables
  • Size, Orientation (ordered data)
  • Color Hue, Shape (nominal data)

• Animation
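
As a rough illustration of mapping data to the display elements listed above, the sketch below encodes a few records as circle marks: two quantitative fields become position x and position y, a third becomes size, and a nominal field becomes color hue. The records, field names, pixel ranges and helper names (linear_scale, area_radius, HUES) are all hypothetical and chosen only for this example.

```python
import math


def linear_scale(domain, output_range):
    """Return a function that maps values in `domain` linearly onto `output_range`."""
    d0, d1 = domain
    r0, r1 = output_range
    if d0 == d1:
        raise ValueError("domain must span a non-zero interval")

    def scale(value):
        t = (value - d0) / (d1 - d0)
        return r0 + t * (r1 - r0)

    return scale


def area_radius(value, max_value, max_radius):
    """Radius chosen so that the circle's area, not its radius, is proportional to value."""
    return max_radius * math.sqrt(value / max_value)


# Hypothetical records: three quantitative fields and one nominal field.
records = [
    {"income": 20, "life_exp": 60, "population": 25, "region": "A"},
    {"income": 40, "life_exp": 70, "population": 100, "region": "B"},
    {"income": 60, "life_exp": 80, "population": 50, "region": "A"},
]

# Nominal data -> color hue (one arbitrary hue per category).
HUES = {"A": "blue", "B": "orange"}

# Quantitative data -> position, in pixels.
x = linear_scale((20, 60), (0, 400))
y = linear_scale((60, 80), (300, 0))  # inverted: screen y grows downward

marks = [
    {
        "cx": x(r["income"]),
        "cy": y(r["life_exp"]),
        "radius": area_radius(r["population"], max_value=100, max_radius=20),
        "hue": HUES[r["region"]],
    }
    for r in records
]

for mark in marks:
    print(mark)
```

Sizing circles by area rather than by radius is a deliberate choice in this sketch: scaling the radius directly would make a value that is twice as large look four times as large.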

Visual encoding of data

Ranking visual display elements (framework):
1. Position along a common scale, e.g. scatter plots
2. Position on identical but non-aligned scales, e.g. multiple scatter plots
3. Length, e.g. bar chart
4. Angle & slope, e.g. pie chart
5. Area, e.g. bubbles
6. Volume, density & color saturation, e.g. heat map
7. Color hue, e.g. highlights

Ref.: Graphical Perception and Graphical Methods for Analyzing Scientific Data, William Cleveland & Robert McGill (1985)

Design principles

Choose the right type of chart:
• Trends / change over time → Line charts
• Distributions → Histograms
• Summary information → Table
• Relationships → Scatter plots

• Get it right in black & white (before adding color)
• Prefer 2D to 3D for statistical charts
• Use color to highlight
• Avoid rainbow palettes
• Avoid chartjunk: “less is more”
• Aim for a high data-ink ratio

Design principles

Choose the right type of chart:
• Ranking
• Time-series
• Deviation
• Correlation
• Nominal comparison

Narrative structures

Data Journalism

Traditional journalism         Data journalism
• Data around narrative        • Narrative around data
• Linear flow                  • Complex, often non-linear flow
• Physical static media        • Online interactive media


Narrative structures

Bias (and ethics: Don’t lie with data)

• Bar charts must have a zero baseline
• Present data in its context
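
The zero-baseline rule can be checked with simple arithmetic. The sketch below compares the ratio of two drawn bar lengths with the true ratio of the values when the axis starts above zero. The function names and the example values (50 and 55, with the axis starting at 45) are hypothetical, and the values are assumed to be positive.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of drawn bar lengths for values a and b when the axis starts at `baseline`."""
    if min(a, b) <= baseline:
        raise ValueError("baseline must lie below both values")
    return (b - baseline) / (a - baseline)


def distortion(a, b, baseline):
    """How many times larger the drawn ratio is than the true ratio b / a."""
    return apparent_ratio(a, b, baseline) / apparent_ratio(a, b, 0.0)


# Hypothetical values: 55 is 10% larger than 50.
true_ratio = apparent_ratio(50, 55)            # axis starts at zero
truncated_ratio = apparent_ratio(50, 55, 45)   # axis starts at 45

print(f"true ratio of values:       {true_ratio:.2f}")
print(f"ratio of bars, baseline 45: {truncated_ratio:.2f}")
print(f"distortion factor:          {distortion(50, 55, 45):.2f}")
```

With the truncated axis the second bar is drawn twice as long as the first even though the value is only 10% larger, which is why a bar chart, where length encodes the value, needs a zero baseline.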

Narrative structures

Bias: Misleading with data

Selective presentation with line charts

• Author Bias

• Data Bias

• Reader Bias

Narrative structures

Bias and errors (statistics):
• Selection bias, e.g. in sampling
• Omitted-variable bias

Errors in hypothesis testing:
• Null hypothesis (H0) = default / no-effect state

                              H0 valid                             H0 invalid
Reject H0                     Type I error (false positive)        Correct inference (true positive)
Accept (fail to reject) H0    Correct inference (true negative)    Type II error (false negative)
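
The four outcomes of a hypothesis test can be made concrete with a small simulation. This is a sketch under stated assumptions, not something taken from the slides: it uses a two-sided z-test of H0 "mean = 0" with known standard deviation 1, a significance level of 0.05, and samples of size 30. The function names, the seed and the alternative mean of 0.5 are all hypothetical.

```python
import random
from statistics import NormalDist, fmean


def classify(null_is_true, reject_null):
    """Name the outcome of one test, following the Type I / Type II error table."""
    if null_is_true:
        return "Type I error (false positive)" if reject_null else "Correct (true negative)"
    return "Correct (true positive)" if reject_null else "Type II error (false negative)"


def z_test_rejects(sample, sigma=1.0, alpha=0.05):
    """Two-sided z-test of H0: mean = 0, assuming a known standard deviation."""
    n = len(sample)
    z = fmean(sample) / (sigma / n ** 0.5)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > critical


def outcome_rates(true_mean, trials=2000, n=30, seed=0):
    """Simulate repeated experiments and return the share of each outcome."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = {}
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        outcome = classify(true_mean == 0.0, z_test_rejects(sample))
        counts[outcome] = counts.get(outcome, 0) + 1
    return {outcome: count / trials for outcome, count in counts.items()}


# H0 is valid: any rejection is a Type I error, at a rate close to alpha.
print(outcome_rates(true_mean=0.0))

# H0 is invalid: any failure to reject is a Type II error.
print(outcome_rates(true_mean=0.5))
```

When H0 is valid, the share of Type I errors should land near the chosen significance level. When H0 is invalid, the share of Type II errors depends on the effect size and the sample size.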

Narrative structures

Storytelling: Visual narratives have moved from author-driven to viewer-driven with the use of highly interactive media for data visualization.

Author-driven                  Viewer-driven
Strong ordering                Exploratory
Heavy messaging                Ability to ask questions
Need for clarity and speed     Build own story

DataViz Technologies & Tools

Off-the-shelf: Tableau, QlikView

Tools:
• Predefined charts: Raw, Chartio, Plotly
• Google Fusion Tables, Excel, Gephi

Code & JavaScript libraries:
• R: ggplot2, ggvis, rCharts + shiny (interactive apps)
• Python: matplotlib
• JavaScript: D3.js, Dimple.js, Leaflet, Rickshaw (use JSON data)
• Linux: gnuplot

DataViz Technologies & Tools

Tableau data viz

DataViz Technologies & Tools

Chart in R ggplot2

References

Visual display of Quantitative Information: Edward Tufte http://goo.gl/qb5ej
Exploratory Data Analysis: John Tukey http://goo.gl/tV57HP
Data Science Life cycle: Maloy Manna http://www.datasciencecentral.com/profiles/blogs/the-data-science-project-lifecycle
Selecting right graph for your message: Stephen Few www.perceptualedge.com/articles/ie/the_right_graph.pdf
Practical rules for using color in charts: Stephen Few www.perceptualedge.com/articles/visual.../rules_for_using_color.pdf
OpenIntro Statistics: https://www.openintro.org/stat/
Misleading with statistics: Eric Portelance https://medium.com/i-data/misleading-with-statistics-c63780efa928
Computational Information Design: Ben Fry http://benfry.com/phd/dissertation-050312b-acrobat.pdf