Use of Quantile Functions

58
Use of Quantile Use of Quantile Functions Functions in Data Analysis in Data Analysis

description

Use of Quantile Functions. in Data Analysis. In general, Quantile Functions (sometimes referred to as Inverse Density Functions or Percent Point Functions) return the Value X at which P(X) = [specified cumulative probability] given that particular distribution. - PowerPoint PPT Presentation

Transcript of Use of Quantile Functions

Page 1: Use of Quantile Functions

Use of Quantile Use of Quantile FunctionsFunctions

in Data Analysisin Data Analysis

Page 2: Use of Quantile Functions

In general, Quantile Functions (sometimes referred to as Inverse Density Functions or Percent Point Functions) return the Value X at which

P(X) = [specified cumulative probability]

given that particular distribution.

Page 3: Use of Quantile Functions

Recall the normal distribution

Page 4: Use of Quantile Functions

So for the specified cumulative probability of 0.025, the value of X turns out to be -1.96.

In other words,

Ф(-1.96) = 0.025, so Ф-1(0.025) =-1.96

Notice, then, that Ф-1 is producing the quantile function, Q

Page 5: Use of Quantile Functions
Page 6: Use of Quantile Functions

Idealised SamplesIdealised Samples

i – ½

n

i = 1,2,…….n

Note that these are evenly spaced between 0 and 1

pi =

These are produced by first defining a sequence of n probability points using

Page 7: Use of Quantile Functions

For example, for n = 20, we have p1, p2, ……….p20 equal to

0.025, 0.075, ………… 0.975

Probability points may be produced with the R function ppoints:

Page 8: Use of Quantile Functions
Page 9: Use of Quantile Functions

Now suppose that a distribution has quantile function Q. Then, if the random variable Z has qf Q, from the original definition,

P(Z ≤ Q(pi) )= pi i =1….. n

Hence, Q(p1)…………….Q(pn)

may be regarded as an idealised sample of size n.

Page 10: Use of Quantile Functions

The Uniform DistributionThe Uniform Distribution

0 1

1

Page 11: Use of Quantile Functions

The Uniform DistributionThe Uniform Distribution

0 1

1

Area

0.2

0.2

The quantile function is easy here

e.g. Q(0.2)=0.2

Page 12: Use of Quantile Functions

The Uniform DistributionThe Uniform Distribution

0 1

1

Area

p

p

In general,

Q(p)=p

Page 13: Use of Quantile Functions

Hence p1…………. pn may be regarded as an idealised sample of size n from the U(0, 1) distribution.

Page 14: Use of Quantile Functions

The Normal DistributionThe Normal DistributionWe have already seen that the normal distribution has quantile function given by Ф-1.

Thus an idealised N(0, 1) sample of size 20, will be

Ф-1(0.025), Ф-1(0.075), Ф-1(0.125), …..

Page 15: Use of Quantile Functions

In R the result is found with qnorm

Page 16: Use of Quantile Functions

A histogram can be plotted using hist(z)

Page 17: Use of Quantile Functions

Exponential DistributionExponential Distribution

The Exp(1) distribution has quantile function Q given by

Q(p) = - ln(1 - p) (exercise!)

In R it is given by the function qexp.

Thus an idealised Exp(1) sample of size 20, together with its histogram are as follows:

Page 18: Use of Quantile Functions
Page 19: Use of Quantile Functions
Page 20: Use of Quantile Functions

Empirical Quantile FunctionsEmpirical Quantile FunctionsSuppose that we have a set of n observations y1, …… yn of a variable y. Let y(1) ≤......... ≤ y(n)

be their order statistics.

Analogously to earlier work we define the empirical quantile function Qe of these data by

Qe(pi) = y( i )

with linear interpolation for other values of p in the interval (0, 1).

Page 21: Use of Quantile Functions

Thus, for any p, Qe(p) is the essentially the value of the variable, y, below which a fraction, p, of the observations lie.

Example: the three quartiles of the distribution of the data are given by Qe(1/4), Qe(1/2) and Qe(3/4).

Page 22: Use of Quantile Functions

ExampleExampleMaximum daily ozone concentrations.

The R data frame ozone is made available on the module web pages. It gives maximum daily ozone concentrations, in parts per billion, measured at Stamford, Connecticut and Yonkers, New York, during the period May - September one year.

Data are given only for the 132 days (out of a total of 152) on which readings were successfully obtained at both locations.

Page 23: Use of Quantile Functions
Page 24: Use of Quantile Functions
Page 25: Use of Quantile Functions
Page 26: Use of Quantile Functions
Page 27: Use of Quantile Functions

The sorted Stamford data are given by

Page 28: Use of Quantile Functions

The R code

> hist(stamford, xlab="ozone concentration",main="Stamford ozone concentrations")

> plot(density(stamford), main="density estimate for Stamford ozone concentrations")

produces the histogram and (kernel) density estimate shown in the next slides. (For details of the latter, see 5.6 of Venables and Ripley (2002).)

Page 29: Use of Quantile Functions
Page 30: Use of Quantile Functions
Page 31: Use of Quantile Functions

A simple summary is given by

> summary(stamford)Min. 1st Qu. Median Mean 3rd Qu. Max.14.00 50.50 80.00 89.87 119.80 240.00

while much the same is achieved with the use of the R (empirical) quantile function, which by default evaluates this atp = (0.00, 0.25, 0.50, 0.75, 1.00)

> quantile(stamford)0% 25% 50% 75% 100%14.00 50.50 80.00 119.75 240.00

Page 32: Use of Quantile Functions

Both the histogram and the summaries show the data to be quite positively skewed. This is quite typical of data constrained to be positive.

We can specify any vector of points at which to evaluate the quantile function:> p = seq(0,1,0.1)> p[1]0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

Page 33: Use of Quantile Functions
Page 34: Use of Quantile Functions

Q-Q PlotsQ-Q PlotsWe wish to investigate whether observations y1 ……. yn may reasonably be regarded as a random sample from some theoretical distribution.

A Q-Q plot compares the empirical quantile function Qe of the data with the theoretical quantile function Q of the distribution.

Page 35: Use of Quantile Functions

We plot y(i) (= Qe(pi)) against Q(pi)

for the probability points p1,p2 etc. as defined earlier.

If the data follow an approximately straight line with slope 1 and intercept 0, the observed values are said to be compatible with the theoretical distribution.

i.e. y (i) ≈ Q(pi)

Page 36: Use of Quantile Functions

RefinementRefinement

If the data do not lie on the specific line just mentioned, but instead lie on a line with slope b and intercept a,

i.e. y(i) ≈ a + bQ(pi)

This indicates compatibility with the appropriate distribution scaled by b and shifted by a.

Page 37: Use of Quantile Functions

For example if Q is the quantile function of the N(0,1) distribution, then the relation suggests compatibility with the N(a,b2) distribution.

Page 38: Use of Quantile Functions

Stamford Ozone ConcentrationsStamford Ozone Concentrations

We check for normality.

A normal Q-Q plot compares empirical qf of the data with the qf of the N(0,1) distribution.

The R function qqnorm constructs the plot.

Page 39: Use of Quantile Functions
Page 40: Use of Quantile Functions
Page 41: Use of Quantile Functions

The function qqline adds the straight line y = a +bx corresponding to the normal distribution N(a,b2) whose lower and upper quartiles match those of the data. (This is a resistant fit, unaffected by one or two extreme observations).

Page 42: Use of Quantile Functions
Page 43: Use of Quantile Functions

The lack of linearity in the plotted data suggests that they are not well modelled by a normal distribution. Rather, the convex shape indicates positive skewness (as was seen in the histogram).

We investigate whether a transformation would help. A check is carried out as to whether the square roots of the Stamford ozone concentrations can be modelled by a normal distribution.

Page 44: Use of Quantile Functions

(Notice that, when using R, x^y gives the yth power of x, and, of course, a power of 0.5 gives a square root).

Page 45: Use of Quantile Functions
Page 46: Use of Quantile Functions

The plotted data are reasonably well fitted by the straight line which has slope 2.844 and intercept 9.0. Thus the data are reasonably well modelled by the N(9, 2.8442) distribution (whose lower and upper quartiles match those of the data).

Note that the sample mean and standard deviation are 9.1 and 2.715 respectively.

Page 47: Use of Quantile Functions

Note: The equation of the straight line can be obtained by inspection, or more accurately by using the two points that it is defined by, i.e.

(Q(1/4), Qe(1/4)), (Q(3/4), Qe(3/4))

Appropriate R commands give these as

(-0.675, 7.106) and (0.675, 10.940)

Page 48: Use of Quantile Functions

ExampleExample

The R vector abbey in the package MASS gives 31 determinations of nickel content (μg g-1) in a Canadian syenite rock.

Page 49: Use of Quantile Functions

We check whether the data are reasonably modelled by an exponential distribution. There is no predefined function in R to construct an exponential Q-Q plot, so we have to work from first principles.

Page 50: Use of Quantile Functions

qexp(ppoints(31)) gives the theoretical quantile values at 31 probability points.

sort(abbey) gives the sorted experimental values

Page 51: Use of Quantile Functions

The following command produces a Q-Q plot with axes labelled accordingly.

Page 52: Use of Quantile Functions
Page 53: Use of Quantile Functions

If we ignore the highest observation, the data appear to be reasonably compatible with an exponential distribution with mean around 12.5 (exp(0.08)).

Page 54: Use of Quantile Functions
Page 55: Use of Quantile Functions

However, the probability that the highest of 31 independent observations from an exponential distribution of mean 12.5 is as great as 125 is only 0.0014, so there must be considerable doubt about the suitability of an exponential model here.

A further possibility is that some transformation of the data may be modelled by a normal distribution. Try this out in the practicals.

Page 56: Use of Quantile Functions

BoxplotsBoxplots

These are useful for comparing distributions of the same variable.

The R code:

produces the following boxplot

Page 57: Use of Quantile Functions
Page 58: Use of Quantile Functions

Recall that the lower and upper ends of each box give the first and third quartiles of the corresponding distribution and the centre line indicates the median.