MSc in Computing (Data Analytics)
Probability & Statistical Inference Lecture 1
Lecture Outline
Introduction
General Info
Questionnaire
Introduction to Statistics
Statistics at work
The Analytics Process
Descriptive Statistics & Distributions
Graphs and Visualisation
Introduction
Name: Aoife D’Arcy
Email: [email protected]
Bio: Managing Director and Chief Consultant at the Analytics Store, with degrees in statistics, computer science, and financial & industrial mathematics. With over 12 years of experience in analytics consultancy with major national and international companies in banking, finance, insurance, manufacturing and gaming, I have developed particular expertise in risk analytics, fraud analytics, and customer insight analytics.
Lecture Notes: Will be available online at www.comp.dit.ie/bmacnamee and later on webcourses
Course Outline
Week      Topic
1         Introduction to Statistics
2 & 3     Probability Theory
4         Introduction to SAS Enterprise Guide
5         Probability Distributions
6         Confidence Intervals
7 & 8     Hypothesis testing
9         Assignment
10 - 12   Regression Analysis
13        Revision
Exam & Assignment
Exam
The end of term exam accounts for 60% of the overall mark.
Assignment
The assignment is worth 40% of the overall mark.
The assignment will be handed out in week 5.
Week 9’s class will be dedicated to working on the assignment.
Software
SAS Enterprise Guide will be the software used during the course.
Recommended Reading
Applied Statistics and Probability for Engineers. Douglas C. Montgomery. John Wiley & Sons.
Probability and Statistics for Engineers and Scientists. R.E. Walpole, R.H. Myers, S.L. Myers, K. Ye. Pearson Education.
Modelling Binary Data. David Collett. Chapman & Hall.
Probability and Random Processes. G. Grimmett & D. Stirzaker. Oxford University Press.
Statistical Inference. George Casella. Brooks/Cole.
Questionnaire
Section 1: Statistics are everywhere
We are bombarded with Statistics
http://www.irishtimes.com/newspaper/frontpage/2012/0918/1224324122326.html
http://www.irishtimes.com/newspaper/world/2012/0914/1224324008884.html
http://www.independent.ie/business/world/survey-names-oslo-the-worlds-priciest-city-ireland-ranks-27th-3229426.html
The internet is full of interesting statistics
http://www.usatoday.com/news/politics/twitter-election-meter
Statistics can be misleading
An ad claimed: “9 Out of 10 Dentists prefer Colgate”
What is wrong with this statement?
Consider these complaints about airlines published in US News and World Report on February 5, 2001
Can we conclude that United Airlines has the worst customer service?
Statistics in Everyday Life
With the increase in the amount of data available and advancements in the power of computers, statistics are being used more and more frequently.
Question: Is it good that statistics are used so much
and what happens when statistics are misused?
Statistics can be misleading
Misinterpreted Statistics can be Devastating
In 1999 Sally Clark was wrongly convicted of the murder of two of her sons. The case was widely criticised because of the way statistical evidence was misrepresented in the original trial, particularly by paediatrician Professor Sir Roy Meadow.
He claimed that, for an affluent non-smoking family like the Clarks, the probability of a single cot death was 1 in 8,543, so the probability of two cot deaths in the same family was around "1 in 73 million" (8543 × 8543).
What is wrong with this assumption?
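The arithmetic behind the "1 in 73 million" figure can be reproduced in a few lines. This is only a sketch with illustrative variable names: squaring the single-death odds is valid only if the two deaths are independent events, which is exactly the assumption the question above asks you to challenge.

```python
# Odds of a single cot death quoted at the trial: 1 in 8,543.
odds_single = 8543

# Multiplying the two probabilities assumes the deaths are independent,
# i.e. P(A and B) = P(A) * P(B).
odds_double_if_independent = odds_single * odds_single

print(f"1 in {odds_double_if_independent:,}")  # 1 in 72,982,849
```

If shared genetic or environmental factors make a second death more likely once a first has occurred, the events are not independent and the true probability is larger than this product suggests.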
Video
http://www.youtube.com/watch?v=4TKbIidbyhk&feature=fvwrel
Challenges
As an Analytics practitioner you will face a number of challenges:
Create insight from all available data (and there is lots of it)
Interpret statistics correctly
Communicate statistically driven insight in a way that is clearly understood
Objective of this course
Give you a set of statistical skills that allow you, as an analytics practitioner, to turn data into insight!
The Analytics Process & Statistics
Section Overview
Statistics and Analytics
Introduction to CRISP
Data Analytics Is Multidisciplinary
[Diagram: data analytics draws on Databases, Statistics, Pattern Recognition, KDD, Machine Learning, AI, Neurocomputing, Predictive Analytics and Data Warehousing]
Analytics Process
Data Insight Business Decision
Analytics Is A Lot Of Things
[Chart: levels of analytics plotted as competitive advantage against degree of intelligence, listed here from highest to lowest]
Predictive Analytics
Optimization: What’s the best that can happen?
Predictive modelling: What will happen next?
Forecasting/extrapolation: What if these trends continue?
Statistical analysis: Why is this happening?
Access & reporting
Alerts: What actions are needed?
Query/drill down: Where exactly is the problem?
Ad hoc reports: How many, how often, where?
Standard reports: What happened?
For this course we will concentrate on Statistical Analysis
CRISP-DM Evolution
Over 200 members of the CRISP-DM SIG worldwide
DM Vendors: SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, etc.
System Suppliers/Consultants: Cap Gemini, ICL Retail, Deloitte & Touche, etc.
End Users: BT, ABB, Lloyds Bank, AirTouch, Experian, etc.
Crisp-DM 2.0 is due…
Complete information on CRISP-DM is available at: http://www.crisp-dm.org/
CRISP-DM
Features of CRISP-DM:
Non-proprietary
Application/Industry neutral
Tool neutral
Focus on business issues as well as technical analysis
Framework for guidance
Experience base
Templates for Analysis
[Diagram: the CRISP-DM cycle arranged around the Data: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment]
Phases & Generic Tasks
Phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
Generic tasks for Business Understanding:
Determine Business Objectives
Assess Situation
Determine Data Mining Goals
Produce Project Plan
Business Understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives
Phases & Generic Tasks
Generic tasks for Data Understanding:
Collect Initial Data
Describe Data
Explore Data
Verify Data Quality
Data Understanding
The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information.
Phases & Generic Tasks
Generic tasks for Data Preparation:
Select Data
Clean Data
Construct Data
Integrate Data
Format Data
Data Preparation
The data preparation phase covers all activities to construct the data that will be fed into the modelling tools from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record and attribute selection as well as transformation and cleaning of data for modelling tools.
Phases & Generic Tasks
Generic tasks for Modelling:
Select Modeling Technique
Generate Test Design
Build Model
Assess Model
Modelling
In this phase, various modelling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often necessary.
Phases & Generic Tasks
Generic tasks for Evaluation:
Evaluate Results
Review Process
Determine Next Steps
Evaluation
Before proceeding to final deployment of a model, it is important to thoroughly evaluate it and review the steps executed to construct it to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
Phases & Generic Tasks
Generic tasks for Deployment:
Plan Deployment
Plan Monitoring & Maintenance
Produce Final Report
Review Project
Deployment
Creation of a model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.
CRISP-DM
Business Understanding
Data Understanding
Data Preparation
Modelling
Evaluation
Deployment
Crisp – DM – Areas covered in this course
Business Understanding
Data Understanding
Data Preparation
Modelling
Evaluation
Deployment
Section 2: Descriptive Statistics & Distributions
Topics
1. Introduction to Statistics
2. The Basics
3. Measures of location: Mean, Median & Mode.
4. Measures of location & Skew.
5. Measures of dispersion: range, standard deviation (variance) & interquartile range.
Introduction to Statistics
According to The Random House College Dictionary, statistics is “the science that deals with the collection, classification, analysis and interpretation of numerical facts or data.” In short, statistics is the science of data.
There are two main branches of Statistics:
The branch of statistics devoted to the organisation, summarisation and description of data sets is called Descriptive Statistics.
The branch of statistics concerned with using sample data to make an inference about a large set of data is called Inferential Statistics.
Process of Data Analysis
[Diagram: Population, Representative Sample and Sample Statistic, linked by the steps Describe and Make Inference]
A statistical population is a data set that is our target of interest.
A sample is a subset of data selected from the target population.
If your sample is not representative then it is referred to as being biased.
Types of Data: Numeric Data
Numeric data can be of two types:
Continuous Data: Data is continuous if it has an interval of real numbers for its range
Example: The number of centimetres of rain that fell in March
Discrete Data: Data is discrete if its range is a finite or countable set of values
Example: The number of correct answers in a 10 question quiz
Types of Data: Categorical Data
Data that is broken into discrete categories is referred to as categorical data
Categorical data has two main types:
Nominal: A nominal variable has a discrete number of categories or levels with no logical order
Gender: Male, Female
Working Status: Employed, Unemployed, Home-maker, Student, Retired
Ordinal: An ordinal variable has a discrete number of categories or levels with a logical order
Income Level: Low, Medium, High
Places in a race: 1st, 2nd, 3rd, 4th, 5th, 6th
Class Task
Task: Classify the type of data in each of the following examples:
The profit margin made from customers of an online clothing company
The type of interest rate you can be charged on a mortgage, i.e. Fixed rate, Adjustable rate
The number of dependents associated with a loan applicant
Let’s Start at the Very Beginning
When learning to read and write we start with A-B-C, when starting to count we start with 1-2-3 and of course the von Trapp Family Singers started with Do-Re-Mi!
When learning statistics you start with the arithmetic mean or a simple average
The Arithmetic Mean
The table below shows the total medals won and gold medals won by each country in the last 5 Olympic games

      Canada          China           Germany*        Russia**        United Kingdom  United States
Year  Total   Gold    Total   Gold    Total   Gold    Total   Gold    Total   Gold    Total   Gold
1992  18      7       54      16      82      33      112     45      20      5       108     37
1996  22      3       50      16      65      20      63      26      15      1       101     44
2000  14      3       59      28      56      13      88      32      28      11      92      37
2004  12      3       63      32      49      13      92      27      30      9       103     36
2008  18      3       100     51      41      16      72      23      47      19      110     36
Mean  17      4       65      28      58      19      85      30      28      9       102     38

* Germany combines East and West Germany prior to reunification
** Russia or The Soviet Union
Data source: http://www.databaseolympics.com/index.htm
Arithmetic Mean – The Formula
The formula for calculating the sample arithmetic mean of n data points $x_1, x_2, \ldots, x_n$:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
$\bar{x}$ is referred to as x-bar
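As a quick check of the formula, the sketch below computes x-bar for Canada's total medal counts from the table above. The variable names are illustrative.

```python
# Canada's total medals for 1992, 1996, 2000, 2004 and 2008.
canada_totals = [18, 22, 14, 12, 18]

n = len(canada_totals)
x_bar = sum(canada_totals) / n  # sum of the x_i divided by n

print(x_bar)         # 16.8
print(round(x_bar))  # 17, the value shown in the Mean row of the table
```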
Attributes of the Arithmetic Mean
It is straightforward to calculate
It is easy to interpret the mean
It gives us a good estimate of where a set of numbers is centred
This is referred to as the central tendency of a sample
It is sensitive to outliers
Other Measures of Central Tendency
Median: The middle value of an ordered set of values, i.e. 50% higher and 50% lower
Mode: The most commonly occurring value in a distribution
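Calculating the Median
Year  Medals
1964  90
1968  107
1972  94
1976  94
1980  0
1984  174
1988  94
1992  108
1996  101
2000  92
2004  103
2008  110
Sort the data
Medals (Sorted): 174, 110, 108, 107, 103, 101, 94, 94, 94, 92, 90, 0
Median = 97.5 (with 12 values, the average of the two middle values, 101 and 94)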
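The same calculation can be sketched in Python with the standard library's `statistics.median`. The list holds the twelve medal totals from 1964 to 2008 and the variable names are illustrative.

```python
from statistics import median

# Medal totals for 1964 to 2008, in year order.
medals = [90, 107, 94, 94, 0, 174, 94, 108, 101, 92, 103, 110]

ordered = sorted(medals)
mid = len(ordered) // 2

# With an even number of values the median is the average of the middle pair.
middle_pair = (ordered[mid - 1], ordered[mid])

print(middle_pair)     # (94, 101)
print(median(medals))  # 97.5
```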
Calculating the Mode
Count the frequencies of the medal totals
Medals  Count
174     1
110     1
108     1
107     1
103     1
101     1
94      3
92      1
90      1
0       1
Mode = 94
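Counting frequencies is a natural fit for `collections.Counter`. The sketch below rebuilds the frequency table for the same twelve medal totals and picks out the most common value.

```python
from collections import Counter

# Medal totals for 1964 to 2008, in year order.
medals = [90, 107, 94, 94, 0, 174, 94, 108, 101, 92, 103, 110]

counts = Counter(medals)  # maps each medal total to how often it occurs

# most_common(1) returns a list holding the single (value, count) pair
# with the highest count.
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)  # 94 3
```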
When to Use Each Central Tendency Value?
Question: When and why would you use the median over the mean?
Let’s Look at the Variation in our Data
[Histogram: Distribution of the Total Olympic Medals won by any Country from 1964 - 2008. x-axis: medal-count bins 0, 1 - 24, 24 - 48, 48 - 74, 74 - 97, 97 - 121, 121 - 146, 146 - 170; y-axis: Count, 0 to 20]
Let’s Look at the Variation in our Data
[Histogram: Distribution of the Total Olympic Medals won by any Country from 1964 - 2008, annotated with Central Tendency / Location and Spread/Variation]
Measures of Spread or Variation
Range
Variance
Standard Deviation
Inter-quartile Range
Calculating the Range
The range is calculated by subtracting the minimum value in a data set from the maximum value
The main advantage of using the range is the ease with which it is calculated
The major disadvantage of the range is that it is highly sensitive to outliers
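A short sketch using the medal totals from the median example shows both the calculation and the sensitivity to outliers. Dropping the two extreme years (0 and 174) is done here purely for illustration.

```python
# Medal totals for 1964 to 2008, in year order.
medals = [90, 107, 94, 94, 0, 174, 94, 108, 101, 92, 103, 110]

data_range = max(medals) - min(medals)
print(data_range)  # 174

# Remove the two extreme values to see how much they drive the range.
without_extremes = [m for m in medals if m not in (0, 174)]
trimmed_range = max(without_extremes) - min(without_extremes)
print(trimmed_range)  # 20
```

Two observations out of twelve account for most of the range, which is why it is usually reported alongside a more robust measure of spread.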
Calculating the Variance
As an example of variance consider the following data. Subtract the mean from each observation to get its deviation, then square each deviation:

OBS   Data  Mean  Deviation  (Deviation)^2
1     3     5     -2         4
2     4     5     -1         1
3     8     5     3          9
Sum   15    15    0          14
Mean  5     5     0          4.67
Variance – The Formula
Square the deviations around the mean before summing. For n data points $x_1, x_2, \ldots, x_n$:
$$\sum_{i=1}^{n} (x_i - \bar{x})^2$$
Divide by n-1 (?) to get the average of squared deviations:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Standard Deviation – The Formula
Take the square root of the variance. The value is in the original unit
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
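The variance and standard deviation formulas can be traced step by step on the three-point example (3, 4, 8). This is a minimal sketch with illustrative variable names. It also computes the plain mean of the squared deviations, dividing by n, to show how that differs from the sample variance, which divides by n - 1.

```python
from math import sqrt

data = [3, 4, 8]
n = len(data)

x_bar = sum(data) / n                                 # 5.0
squared_deviations = [(x - x_bar) ** 2 for x in data]  # [4.0, 1.0, 9.0]

mean_squared_deviation = sum(squared_deviations) / n  # divide by n
s2 = sum(squared_deviations) / (n - 1)                # sample variance, divide by n - 1
s = sqrt(s2)                                          # sample standard deviation

print(round(mean_squared_deviation, 2))  # 4.67
print(s2)                                # 7.0
print(round(s, 3))                       # 2.646
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 divisor, so they should agree with `s2` and `s` here.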
Standard Deviation
Question: Why might it be useful to have the value in the original unit?
Percentiles
The nth percentile is a value such that a proportion $\frac{n}{100}$ of the sample takes values at or lower than it, and a proportion $\frac{100 - n}{100}$ takes values larger than it
Example: if your grade in an industrial engineering class was located at the 84th percentile, then 84% of the grades were equal to or lower than your grade and 16% were higher
Inter-quartile Range
The median is the 50th percentile
The 25th percentile and the 75th percentile are called the lower quartile and upper quartile respectively (or 1st and 3rd)
The difference between the lower and upper quartile is called the inter-quartile range
Quartiles Example
Sort the data
Medals (Sorted): 174, 110, 108, 107, 103, 101, 94, 94, 94, 92, 90, 0
25th Percentile = 1st Quartile = 93
50th Percentile = Median = 97.5
75th Percentile = 3rd Quartile = 107.5
Inter-quartile Range = 107.5 – 93 = 14.5
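The quartiles in this example are consistent with a "median of halves" convention: split the sorted data in two and take the median of each half. The helper below is a sketch of that convention. The function name `quartiles` is hypothetical, and the slide does not state which convention was actually used.

```python
from statistics import median

# Medal totals for 1964 to 2008, in year order.
medals = [90, 107, 94, 94, 0, 174, 94, 108, 101, 92, 103, 110]

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # For an odd number of points the middle value is left out of both halves.
    upper = ordered[half:] if len(ordered) % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

q1, q2, q3 = quartiles(medals)
iqr = q3 - q1

print(q1, q2, q3)  # 93.0 97.5 107.5
print(iqr)         # 14.5
```

Other conventions interpolate between neighbouring values (for example the methods offered by `statistics.quantiles`), so software packages may report slightly different quartiles for the same data.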
Proportions
The proportion, p, of items in a population that belong to a certain class, for example:
The proportion of your customers that are female
The proportion of voters that will vote for Labour in the next election
A proportion is calculated as:
$$p = \frac{C}{N}$$
where C is the number of items in a population of size N that belong to the class of interest
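A small sketch of the proportion calculation follows. The customer list is hypothetical data made up for illustration, with "F" marking the class of interest.

```python
# Hypothetical population of customers (illustrative data only).
customers = ["F", "M", "F", "F", "M", "F", "M", "F"]

N = len(customers)                            # population size
C = sum(1 for g in customers if g == "F")     # items in the class of interest
p = C / N

print(C, N, p)  # 5 8 0.625
```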
Skew – The Shape of a Distribution
There are a number of ways of describing the shape of a distribution.
We will consider only one – skew.
Skew is a measure of how asymmetric a distribution is.
Symmetric Distributions = skew is zero
Positive Skew
There are a few very large data points which create a 'tail' going to the right (i.e. up the number line)
Note: No axis of symmetry here - skew > 0 (i.e. it is positive)
Examples: House prices, reaction times for drivers
Negative Skew
There are a few very small data points which create a 'tail' going to the left (i.e. down the number line)
Note: No axis of symmetry here - skew < 0 (i.e. it is negative)
Examples: Examination scores, lifetime of people
Skew & Measures of Location - Symmetry
Mean, Median & Mode are the same and are found in the middle
         6
         6
      5  6  7
   4  5  6  7  8
3  4  5  6  7  8  9
Mean = 102/17 = 6
Median = 6
Mode = 6
Positive Skew
[Plot markers: Mode, Median, Mean]
6
6
5  6  7
5  6  7  8  9
5  6  7  8  9  10  11
Mean = 121/17 = 7.12
Median = 7
Mode = 6
In general: Mode < Median < Mean
Negative Skew
[Plot markers: Mode, Median, Mean]
               6
               6
            5  6  7
      3  4  5  6  7
1  2  3  4  5  6  7
Mean = 83/17 = 4.88
Median = 5
Mode = 6
In general: Mode > Median > Mean
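The three stacked-number plots above can be checked directly. The sketch below rebuilds each 17-point data set from its plot, using illustrative names, and prints the mean, median and mode so the ordering of the three measures under each kind of skew is visible.

```python
from statistics import median, mode

# Each list is read off one of the three plots (17 points each).
datasets = {
    "symmetric": [3] + [4] * 2 + [5] * 3 + [6] * 5 + [7] * 3 + [8] * 2 + [9],
    "positive skew": [5] * 3 + [6] * 5 + [7] * 3 + [8] * 2 + [9] * 2 + [10, 11],
    "negative skew": [1, 2] + [3] * 2 + [4] * 2 + [5] * 3 + [6] * 5 + [7] * 3,
}

summary = {}
for name, values in datasets.items():
    mean_value = sum(values) / len(values)
    summary[name] = (round(mean_value, 2), median(values), mode(values))
    print(f"{name}: mean={mean_value:.2f} median={median(values)} mode={mode(values)}")

# symmetric: mean=6.00 median=6 mode=6
# positive skew: mean=7.12 median=7 mode=6
# negative skew: mean=4.88 median=5 mode=6
```

In these examples the mean is pulled furthest towards the tail, the median less so, and the mode stays at the peak.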
Section 3: Graphs and Visualisation
Graphical Displays
A way of letting people get a 'picture' of relationships in the data set.
The simpler the better should be a rule in graphical display.
People can remember pictures better.
A good graph should show something that is not easy to ‘see’ using tables.
Bar Charts
Used to display categorical data or discrete data with a modest number of values.
A Bar is drawn to represent each category.
The Bar height represents the frequency or % in each category.
Allows for visual comparison of relative frequencies.
Need to draw up a frequency distribution table first.
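The frequency distribution table that a bar chart needs can be sketched with `collections.Counter`. The responses below are hypothetical, made up for illustration, and reuse the Working Status categories from the nominal data example. The text bars stand in for a real chart.

```python
from collections import Counter

# Hypothetical survey responses (illustrative data only).
responses = [
    "Employed", "Student", "Employed", "Retired", "Unemployed",
    "Employed", "Student", "Home-maker", "Employed", "Retired",
]

freq = Counter(responses)   # category -> frequency
total = sum(freq.values())

# Print the frequency table with a percentage and a simple text bar per category.
for category, count in freq.most_common():
    pct = 100 * count / total
    print(f"{category:<12}{count:>3} {pct:5.1f}%  {'#' * count}")
```

The bar length here is the raw frequency. Plotting the percentage instead gives the same shape and makes categories comparable across samples of different sizes.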
Core Statistical Plots
[Histogram: Points Scored by any Team in Six Nations Championship 2000 - 2011. x-axis: bins at 42, 65.375, 88.75, 112.125, 135.5, 158.875, 182.25, 205.625, More; y-axis: 0 to 25]
Core Statistical Plots
Comparisons: Column Charts, Box Plots
Correlations: Scatter Plots
Trends (time): Line Charts
Proportions: Pie Chart, Column Chart
Some Hans Inspiration to Finish Up
http://www.youtube.com/watch?v=fTznEIZRkLg