Data Wrangling and Statistical Analysis
Practicals
Ana Maria Heilman, Ph.D
Breeding Pipeline DB Mgr
2018
[Figure: Data Wrangling and Statistical Analysis workflow; labels: Collect, Clean, Selection, Extraction, Model, Visualize, Interpret. Source: Trevor Bihl, 2017]
OBJECTIVES
• Identify different descriptive statistics and visualizations
used to explore and interpret your data
• Identify the different steps of the data management cycle
• Understand the importance of QA/QC in the cleaning of
messy data
Data Wrangling: Data Collection
– “Data wrangling involves:
• Taking raw data
• Extracting it
• Cleaning it
• Developing data features for analysis”
Source: Trevor Bihl, 2017
Data Wrangling: Data Collection
Raw Data
• Real-world data is rarely orderly
and clean (Bihl, 2017)
• Must establish standard definitions
and protocols
Data collection
Data entry
QA vs QC
QA                                                    QC
Process oriented to eliminate errors                  Product oriented
Proactive process                                     Reactive process
Define SOP (methods, standards, audits, checklist)    Follow SOP steps to correct errors
QA – Process oriented
QC – Product oriented
QA = “Set of processes or steps that
ensure protocols developed are
followed to minimize errors in the
data” (Campbell et al. 2013)
QC = “protective process to identify
suspect data after it has been
generated” (Campbell et al. 2013)
Quality Assurance/ Quality Control
https://passel.unl.edu/communities/index.php?idinformationmodule=1130447290&topicorder=4&maxto=6&minto=1&idcollectionmodule=1130274258
Data Life Cycle
[Figure: data life cycle with stages Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze]
• Plan:
– Description of the data
– Management steps (SOPs)
– Accessibility
• Collect:
– Field books
– Tablets/iPads
– Sensors
• Assure:
– QA through automatic checks
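As a rough illustration of what an automatic check in the Assure step could look like, the sketch below flags missing, non-numeric, and out-of-range values in field records. The trait names, limits, and records are hypothetical and are not taken from the slides; a real pipeline would take its limits from the SOP.

```python
def check_record(record, limits):
    """Return a list of problems found in one field record."""
    problems = []
    for field, (low, high) in limits.items():
        value = record.get(field)
        if value is None or value == "":
            problems.append(f"{field}: missing")
            continue
        try:
            number = float(value)
        except (TypeError, ValueError):
            problems.append(f"{field}: not numeric ({value!r})")
            continue
        if not low <= number <= high:
            problems.append(f"{field}: {number} outside [{low}, {high}]")
    return problems


# Hypothetical acceptable ranges per trait (assumption, not from the slides)
LIMITS = {"plant_height_cm": (10, 400), "yield_kg": (0, 50)}

# Hypothetical records as they might arrive from a tablet or field book
records = [
    {"plot": "A1", "plant_height_cm": "185", "yield_kg": "12.4"},
    {"plot": "A2", "plant_height_cm": "1850", "yield_kg": ""},
    {"plot": "A3", "plant_height_cm": "n/a", "yield_kg": "9.1"},
]

report = {r["plot"]: check_record(r, LIMITS) for r in records}
for plot, problems in report.items():
    print(plot, problems or "ok")
```

Running checks like these at entry time is the QA side; reviewing the flagged records afterwards is the QC side.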
Descriptive Statistics
• Descriptive statistics and visualizations should be the
first methods you use to understand your data
• Descriptive Statistics:
– Compute basic quantitative information
• Means, variances
• Histograms and Pareto Charts (distribution of data)
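Outside JMP and Excel, the same basic quantities can be computed with a few lines of Python. This is only a sketch using the standard library and a small made-up list of values; note that `statistics.variance` is the sample variance (divides by n - 1), while `statistics.pvariance` is the population variance.

```python
import statistics

# Hypothetical measurements, for illustration only
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = statistics.mean(values)
sample_variance = statistics.variance(values)       # divides by n - 1
population_variance = statistics.pvariance(values)  # divides by n

print(f"N = {n}, mean = {mean}")
print(f"sample variance = {sample_variance:.3f}")
print(f"population variance = {population_variance:.3f}")
```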
Descriptive Statistics
Practicals using JMP and Excel
Descriptive Statistics: Histograms
• Give an immediate, subjective impression of a
data set
• A histogram reflects:
– The distribution of the data, based on counting the number of
observations that fall within each range bin
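The counting behind a histogram can be sketched in Python as follows. The weights are hypothetical, the bins are assumed to be half-open intervals [start, start + bin_width), and empty bins are simply not listed.

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Count observations per bin; each bin is [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical weights, for illustration only
weights = [64, 71, 79, 84, 85, 91, 95, 99, 105, 105, 112, 128]

counts = histogram_counts(weights, bin_width=10)
for start, n in counts.items():
    # One '#' per observation gives a quick text histogram
    print(f"{start:>3}-{start + 10:<3} {'#' * n}")
```

Changing `bin_width` changes how coarse the picture is, which is why the shape of a histogram should be read as a subjective first impression.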
Descriptive Statistics: Histograms
• Open Big Class.jmp
• Select Analyze > Distribution
• Select weight and age as the Y
columns (this tells JMP
which columns of data to
analyze)
• Click OK
Big Class.jmp
Descriptive Statistics: Histograms
• Below the histogram is the
quantile information, including the
minimum, maximum, and median values
for this data set
– Quantiles give us an idea of
how the data is distributed
Big Class.jmp
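For readers who want to reproduce a quantile summary outside JMP, here is a minimal Python sketch. The data are made up, and the "inclusive" interpolation method is an assumption; JMP may define quantiles slightly differently, so values on real data need not match exactly.

```python
import statistics

# Hypothetical data, for illustration only
data = [60, 70, 80, 90, 100, 110, 120, 130, 140]

# Quartiles split the sorted data into four parts
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

summary = {
    "minimum": min(data),
    "25% (Q1)": q1,
    "50% (median)": median,
    "75% (Q3)": q3,
    "maximum": max(data),
}
for label, value in summary.items():
    print(f"{label:>14}: {value}")
```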
Descriptive Statistics: Histograms
• Histogram Bin:
– Click on the red triangle next to
Weight
– Select Histogram Options
– Set Bin Width
– Enter 10 as the new bin width
Big Class.jmp
Descriptive Statistics: Histograms
• Below the quantiles we have
the Summary Statistics
– Mean: average of a set of numbers
– Std Dev: standard deviation
– N: number of observations
Big Class.jmp
Descriptive Statistics: Histograms
• Distribution of categorical
data:
– Example using Grocery
Purchases.jmp
– For categorical data, a frequency table
takes the place of the quantiles
• It shows how often each category
appears
Grocery Purchases.jmp
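A frequency table for categorical data is just a count per category, optionally with each category's share of the total. The sketch below uses made-up category labels, not the contents of Grocery Purchases.jmp.

```python
from collections import Counter

# Hypothetical categorical observations, for illustration only
purchases = ["dairy", "produce", "dairy", "bakery", "produce", "dairy"]

counts = Counter(purchases)
total = sum(counts.values())

# One row per category: (level, count, proportion), most frequent first
table = [(level, n, n / total) for level, n in counts.most_common()]

for level, n, share in table:
    print(f"{level:<10} {n:>3} {share:6.1%}")
```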
Descriptive Statistics: Histograms
• Box plots:
– Presented immediately next to the
histogram is a box plot
• A box plot displays a representation of
the distribution whereby:
– Box = location of the 1st and 3rd
quartiles
– Line inside = median
– Whiskers (above/below) = extend to the
furthest data points within 1.5 × the
interquartile range of the box
– Diamond = location of the mean and of the upper
and lower 95% confidence limits
about the mean
– Bracket (red) = densest 50% of the
data
Big Class.jmp
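The box and whisker positions can be worked out by hand, which helps when reading the plot. The following Python sketch covers only the box, median, whiskers, and outliers; it does not compute the confidence diamond or the densest-50% bracket. The data and the "inclusive" quartile method are assumptions for illustration.

```python
import statistics


def box_plot_summary(values):
    """Compute box, whisker, and outlier positions for an outlier box plot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range = height of the box
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers stop at the furthest points still inside the fences
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical data with one extreme value
summary = box_plot_summary([60, 70, 80, 90, 100, 110, 120, 130, 140, 250])
for name, value in summary.items():
    print(f"{name:>12}: {value}")
```

Points beyond the whiskers are drawn individually, which is what makes this plot useful for spotting suspect values during QC.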
Descriptive Statistics: Histograms
• Quantile Box Plot:
– The outlier box plot conveys
information about the
distribution of the data BUT does not
label the quantiles
– To add a quantile box plot:
• Click on the red triangle next to
weight
• Select Quantile Box Plot
– Q1 = 91.25 (first quartile)
– Q2 = 105 (median)
– Q3 = 115.75 (third quartile)
Big Class.jmp
Descriptive Statistics: Histograms
• Stem and Leaf Plots:
– Another approach to visualize
the distribution of the data
– Uses the same frequency bins
– Retains the quantifiable
information
– To add a stem and leaf plot:
• Click on the red triangle next to
weight > select Stem and Leaf
• The entry 6 | 4 has a stem of 6 (that is, 60)
and a leaf of 4, so it represents
the value 64
Big Class.jmp
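To make the stem and leaf idea concrete, here is a small Python sketch that splits each value into a tens stem and a units leaf. It assumes non-negative whole numbers and uses made-up weights; the row order JMP uses in its own display may differ.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers by tens digit(s) (stem) and units digit (leaf)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    return dict(stems)


# Hypothetical weights, for illustration only
weights = [64, 67, 74, 79, 81, 84, 84, 95, 105, 112]

plot = stem_and_leaf(weights)
for stem in sorted(plot):
    leaves = "".join(str(leaf) for leaf in plot[stem])
    print(f"{stem:>2} | {leaves}")
```

Unlike a histogram, every original value can be read back from the plot: stem 8 with leaves 1, 4, 4 stands for 81, 84, 84.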
Descriptive Statistics: Histograms
• Pareto Charts:
– “Problem solving tools that show
causes and quantiles in an ordered
manner” (Burr, 1990 cited by Bihl,
2017)
– Data is organized by group and
quantity from the largest to the smallest
– It includes a cumulative count that
represents the overall contribution to
the total number of occurrences (Bihl,
2017)
Failure2.jmp
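The ordering and cumulative count described above can be sketched in Python as below. The cause names and counts are hypothetical and are not the contents of Failure2.jmp.

```python
def pareto_table(counts):
    """Sort causes from largest to smallest and add a cumulative percent."""
    total = sum(counts.values())
    rows = []
    running = 0
    for cause, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += n
        rows.append((cause, n, round(100 * running / total, 1)))
    return rows


# Hypothetical failure counts, for illustration only
failures = {"corrosion": 2, "contamination": 14, "doping": 1, "oxide defect": 8}

rows = pareto_table(failures)
for cause, n, cumulative in rows:
    print(f"{cause:<14} {n:>3} {cumulative:>6}%")
```

The cumulative column shows how few causes account for most of the occurrences, which is the point of the chart.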
Descriptive Statistics: Histograms
• Pareto Charts:
– Select the file called
Failure2.jmp
– Click Analyze > Quality and
Process > Pareto Plot
– Select failure and click Y, Cause.
– Select clean and click X,
Grouping.
– Select N and click Freq.
– Click OK.
Failure2.jmp
Descriptive Statistics: Histograms
• Rearrange the order of the
plots by clicking the title (after)
in the first tile and dragging it
to the title of the next tile
(before)
– Order of the causes changes to
reflect the order based on the
first cell
• A reduction in the oxide
defects is clear after cleaning
Failure2.jmp
Data Visualization Tools
Practicals using JMP and Excel
Data Visualization Tools
Types of Visualizations
• Scatter Plots
• Charts
• Multidimensional plots
– Parallel Plots
– Cell Plots
• Multivariate and Correlations Tool
– Correlations Table
– Heat maps
– Simple Statistics
• Graph Builder and Custom Figure
Questions?