1 Data Formats (CDF, ASCII, FLATFILES) Error Analysis Probability Distributions Binning and...

1

Data Formats (CDF, ASCII, FLATFILES)

Error Analysis

Probability Distributions

Binning and Histograms

Examples using Kp and Dst indices

ESS 265 Spring Quarter 2009

Lecture 1

March 30, 2009

2

Formats of Data Files

• Time series data are stored in a variety of formats. These include:

– ASCII (American Standard Code for Information Interchange) and binary tables – the most common formats for time series data. Data in other formats are frequently converted to tables.

– Common Data Format (CDF) – developed by the National Space Science Data Center (NSSDC) for all types of data. Used for the International Solar Terrestrial Physics program. Requires NSSDC-provided software.

– Flexible Image Transport System (FITS) – the only format accepted by the astronomy community and widely used for images. It has been tried for time series data without success.

– Hierarchical Data Format (HDF) – developed at the National Center for Supercomputing Applications (NCSA). Frequently used for results from simulations. Free software libraries are provided for reading and writing it.

– Standard Formatted Data Units (SFDU) – used by all spacefaring nations to label raw telemetry data. An international standard that is rarely used for processed scientific data.

• Storing data in binary is the most common form of data compression, with a savings of about a factor of 3 over ASCII data.

• Lossless compression (zip, gzip) is often applied to ASCII data for fast transfer.
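The size difference between ASCII and binary tables can be checked with a short experiment. The sketch below is illustrative only: the sample values are made up, and the factor it prints depends on the number of digits written per value (here an assumed 24-character field, which happens to give a ratio close to 3).

```python
import gzip
import random
import struct

random.seed(0)
# Made-up sample: 10,000 field values in nT (not real data).
values = [random.uniform(-100.0, 100.0) for _ in range(10_000)]

# ASCII table: one value per line, 24 characters wide plus a newline.
ascii_bytes = "".join(f"{v:24.16e}\n" for v in values).encode("ascii")

# Binary table: the same values stored as 8-byte IEEE doubles.
binary_bytes = struct.pack(f"<{len(values)}d", *values)

ratio = len(ascii_bytes) / len(binary_bytes)
gzipped = gzip.compress(ascii_bytes)
lossless = gzip.decompress(gzipped) == ascii_bytes

print(f"ASCII : {len(ascii_bytes)} bytes")
print(f"binary: {len(binary_bytes)} bytes (ASCII/binary = {ratio:.3f})")
print(f"gzip  : {len(gzipped)} bytes, lossless round trip: {lossless}")
```

Writing fewer digits per value shrinks the ASCII file but discards precision, which is why the binary form is usually preferred for archived data.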

3

Tables, Flat Files and Relations

• Tables are the simplest way to represent time series data.

• A table is defined as “a compact arrangement of related facts, figures, values in an orderly sequence usually in rows and columns” – McPherron

• If all records in a file are identical and are simply a series of rows in a table, the file is called a flat file.

• In some formats the file may have a variable sequence of records of different types, in which one must read each record in sequence to determine what records are coming next. Such files can be exceedingly difficult to read.

• The dependent variable y is shown as a function of several independent variables x1, x2, ... xm. If y is the Dst index and x1 is time then the table would contain a time series. A flat file is also called a relation. The table displays information about the connection between the various quantities contained in the table. A column of a table is normally a sequence of samples of a single variable. In contrast, a row of the table is called a tuple, a set of simultaneous measurements of a set of variables. Tuple is an abstraction of the sequence: Single, Double, Triple, Quadruple, Quintuple, N-tuple. A complex number is a 2-tuple, or pair. A Quaternion is a 4-tuple or Quadruple. Note that a time series is a specific type of table or relation in which the order of values is important.

4

A Relation

•Assume n sets of observations of a dependent variable y which is a function of m independent variables x1, x2, …, xm.

•The relation can be represented by a flat file.

•Each column is a variable and each row is a tuple.

•Model the relation with a regression equation that combines the m variables.
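As a minimal sketch of this idea, the code below stores a small relation as a list of tuples and fits a regression equation to it. The linear form y = c0 + c1·x1 + c2·x2, the made-up data, and the helper names `solve` and `fit_linear` are assumptions for illustration; the slides do not specify the form of the regression equation.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_linear(rows):
    """Least-squares fit of y = c0 + c1*x1 + ... + cm*xm via the normal equations."""
    design = [[1.0, *row[:-1]] for row in rows]  # prepend a constant column
    y = [row[-1] for row in rows]
    k = len(design[0])
    ata = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    aty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(ata, aty)


# Flat file: each row is a tuple (x1, x2, y); each column is one variable.
# Made-up values generated from y = 1 + 2*x1 + 0.5*x2.
relation = [
    (0.0, 1.0, 1.5),
    (1.0, 0.0, 3.0),
    (1.0, 1.0, 3.5),
    (2.0, 1.0, 5.5),
    (2.0, 3.0, 6.5),
]

coefficients = fit_linear(relation)
print(coefficients)
```

For real data sets a library least-squares routine would normally replace the hand-written solver; the explicit version is shown only to keep the example self-contained.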

5

Tables and Metadata

• The simplest way to store a table in a computer is as an ASCII file containing a sequence of identical records. Such files are easy to read since every record has the same format. They are also simple to view since they may be opened and edited in any text editor. A more compact version of the same flat file would be in binary format. While these files are still flat, they cannot be viewed or edited without first converting to ASCII format.

• Time is usually represented in seconds or milliseconds since a certain date. UCLA practice has time in seconds since 1966-Jan-01, ignoring leap seconds. IDL time is in seconds since 1970, which is the same as UNIX time.

• One must know the format of the data record. This includes the number of columns, the widths of the columns, how the values are represented, the names of the columns, the units of the variables, etc. Such information is called metadata.

• Binary data tables also are used.

– Much data at UCLA is in the form of binary flat files.

– Lower flat files contain embedded metadata (header information).

• Lower.ffd contains binary data; Lower.ffh contains ASCII headers.

– Upper flat files are completely flat with detached metadata.

• Upper.DAT contains binary data; Upper.DES contains the data description.

• Upper.HED contains header information on the data; Upper.ABS is an abstract of the data.
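The two time conventions mentioned above differ by a fixed offset, which a short sketch can compute. The function names here are hypothetical; the only inputs taken from the slides are the two epochs (1966-Jan-01 and 1970-Jan-01) and the statement that leap seconds are ignored, which is also how Python's `datetime` behaves.

```python
from datetime import datetime, timedelta, timezone

UCLA_EPOCH = datetime(1966, 1, 1, tzinfo=timezone.utc)
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Seconds between the two epochs: 1461 days (1968 is a leap year).
EPOCH_OFFSET = (UNIX_EPOCH - UCLA_EPOCH).total_seconds()


def ucla_to_unix(ucla_seconds):
    """Convert seconds since 1966-01-01 to seconds since 1970-01-01."""
    return ucla_seconds - EPOCH_OFFSET


def ucla_to_datetime(ucla_seconds):
    """Convert seconds since 1966-01-01 to a calendar date (no leap seconds)."""
    return UCLA_EPOCH + timedelta(seconds=ucla_seconds)


# Round trip for an example date.
example = datetime(1995, 10, 18, tzinfo=timezone.utc)
example_ucla = (example - UCLA_EPOCH).total_seconds()
print(EPOCH_OFFSET, example_ucla, ucla_to_datetime(example_ucla))
```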

6

UCLA Lower Flatfile Header (Metadata) Example

DATA = SDT.Export.BZGSE.UnNamed.ffh
CDATE = Wed Jun 5 10:48:14 19960
RECL = 12
NCOLS = 2
NROWS = 3826
OPSYS = SUN/UNIX
#   NAME   UNITS  SOURCE          TYPE  LOC
001 UT     SECS   UNIVERSAL TIME  T     0
002 BZGSE  nT     BMag_Angles     R     8
#########
SDT EXPORT FLAT FILE ABSTRACT
FileName: SDT.Export.BZGSE.UnNamed
Format: UCLA Flatfile
Date/Time: Wed Jun 5 10:48:14 19960
SDT Version:2.3
Comment: test_comment
#########
Name: BMag_Angles
Time: 1995/10/18/00:00:00
Points: 3826
Components: 7
Component Depths: 1 1 1 1 1 1 1
#########
FLAT FILE MAKER:SDT Export Flatfile
INPUT FROM: Geotail Minute Survey

Annotations: RECL = 12 gives 12 bytes per record; NCOLS = 2 gives 2 columns; NROWS = 3826 gives 3826 rows.
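A header like this is what makes the binary file readable. The sketch below shows how records of this shape might be unpacked, under assumptions the header does not state explicitly: that type T at LOC 0 is an 8-byte float, that type R at LOC 8 is a 4-byte float, and that the byte order is big-endian (plausible for SUN/UNIX). The record values are synthetic.

```python
import io
import struct

# Assumed record layout: big-endian 8-byte time + 4-byte real = RECL of 12 bytes.
RECORD = struct.Struct(">df")


def read_records(stream, nrows):
    """Read up to nrows fixed-length (time, value) records from a binary stream."""
    rows = []
    for _ in range(nrows):
        chunk = stream.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break  # file is shorter than the header claims
        rows.append(RECORD.unpack(chunk))
    return rows


# Synthetic stand-in for a data file: two made-up (UT, BZGSE) records.
fake_rows = [(940204800.0, -1.5), (940204860.0, 2.25)]
fake_file = io.BytesIO(b"".join(RECORD.pack(t, b) for t, b in fake_rows))

rows = read_records(fake_file, nrows=2)
print(RECORD.size, rows)
```

If the assumed types or byte order are wrong, the unpacked numbers will be nonsense, which is a quick practical check when the metadata is ambiguous.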

7

An ASCII Flat File- Galileo Magnetometer Data during the G8 Flyby

1997-05-07T15:36:55.133 -8.36 -25.04 -85.24 89.23 -1.57 -3.68 0.65

1997-05-07T15:36:55.467 -8.38 -25.16 -85.22 89.25 -1.57 -3.68 0.65

1997-05-07T15:36:55.800 -8.41 -25.09 -85.24 89.25 -1.57 -3.67 0.65

1997-05-07T15:36:56.133 -8.44 -25.08 -85.27 89.28 -1.57 -3.67 0.65

1997-05-07T15:36:56.467 -8.50 -25.16 -85.19 89.23 -1.57 -3.67 0.65

1997-05-07T15:36:56.800 -8.47 -25.18 -85.18 89.23 -1.57 -3.67 0.65

1997-05-07T15:36:57.133 -8.60 -25.20 -85.18 89.25 -1.57 -3.67 0.65

1997-05-07T15:36:57.467 -8.47 -25.04 -85.12 89.13 -1.57 -3.67 0.65

1997-05-07T15:36:57.800 -8.44 -25.04 -85.17 89.17 -1.57 -3.67 0.65

1997-05-07T15:36:58.133 -8.41 -25.30 -85.06 89.14 -1.57 -3.67 0.65

1997-05-07T15:36:58.467 -8.39 -25.27 -85.00 89.08 -1.57 -3.67 0.65

1997-05-07T15:36:58.800 -8.37 -25.01 -85.09 89.09 -1.57 -3.66 0.65

1997-05-07T15:36:59.133 -8.37 -24.98 -85.12 89.11 -1.57 -3.66 0.65

1997-05-07T15:36:59.467 -8.35 -24.93 -85.24 89.20 -1.57 -3.66 0.65

1997-05-07T15:36:59.800 -8.36 -24.71 -85.26 89.16 -1.57 -3.66 0.65

1997-05-07T15:37:00.133 -8.31 -24.78 -85.30 89.21 -1.57 -3.66 0.65
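Because every record has the same layout, an ASCII flat file like this can be parsed line by line. The sketch below uses the first two rows of the listing. The column meanings are not given on the slide; treating columns 2 to 4 as field components and column 5 as the field magnitude is an inference that the numbers happen to support.

```python
import math
from datetime import datetime

SAMPLE = """\
1997-05-07T15:36:55.133 -8.36 -25.04 -85.24 89.23 -1.57 -3.68 0.65
1997-05-07T15:36:55.467 -8.38 -25.16 -85.22 89.25 -1.57 -3.68 0.65
"""


def parse_line(line):
    """Split one record into a timestamp and a list of float columns."""
    fields = line.split()
    time = datetime.strptime(fields[0], "%Y-%m-%dT%H:%M:%S.%f")
    return time, [float(f) for f in fields[1:]]


records = [parse_line(line) for line in SAMPLE.splitlines() if line.strip()]

# Sampling interval between the first two records, in seconds.
dt = (records[1][0] - records[0][0]).total_seconds()

# Check the assumed column meaning: sqrt(c1^2 + c2^2 + c3^2) against column 4.
c1, c2, c3, magnitude = records[0][1][:4]
computed = math.sqrt(c1 * c1 + c2 * c2 + c3 * c3)

print(dt, magnitude, round(computed, 3))
```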

8

ERROR ANALYSIS: Some Nomenclature

• Systematic errors – Reproducible errors that result from calibration errors or bias on the part of the observer. Sometimes data can be corrected for these errors, but in other cases we must estimate these errors and combine them with errors from statistical fluctuations.

• Accuracy – otherwise called “Absolute Accuracy”, is a measure of how close an observation comes to the true value, i.e. how well we compensate for systematic errors. For example, magnetometer accuracy is how far the measurement is from the absolute value of the B-field in nT; it is of order 1 nT for fluxgates (including long-term drifts) and 0.01 nT for vector helium magnetometers. Relative inter-spacecraft accuracy is the systematic difference in measurement between two nearby spacecraft.

• Precision – a measure of how well a result has been determined, i.e. how reproducible it is. How well we overcome random errors.

• Uncertainty – Refers to the difference between a result and a true value. Often we don't know what the "true" value is, so we must estimate the error. Repeated measurements of the same thing will differ and we can only talk about the discrepancy between these measurements; this is the uncertainty.

• Probable error – A measure of the magnitude of the error we estimate. For two identical measurements it is a measure of the probable discrepancy.

9

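ERROR ANALYSIS: Some Definitions

• Parent population – Set of data points from which experimental data are assumed to be a random sample.

• Parent distribution – Probability distribution P(x) determining the choice of sample data from the parent population. Usually normalized to 1.

• Expectation value – $\langle f(x) \rangle = \lim_{N\to\infty} \frac{1}{N} \sum_{i=1}^{N} f(x_i) = \sum_{j=1}^{n} f(x_j) P(x_j) = \int f(x) P(x)\, dx$

• Median $\mu_{1/2}$ is defined such that $P(x_i \le \mu_{1/2}) = P(x_i \ge \mu_{1/2}) = 1/2$.

• Most probable value $\mu_{max}$ is defined such that $P(\mu_{max}) \ge P(x \ne \mu_{max})$.

• Mean – $\mu = \langle x \rangle$

• Average deviation – $\alpha = \langle |x_i - \mu| \rangle$

• Variance – $\sigma^2 = \langle (x_i - \mu)^2 \rangle = \langle x^2 \rangle - \mu^2$

• Standard deviation – $\sigma = \sqrt{\sigma^2}$

• Sample mean – $\bar{x} = \frac{1}{N} \sum_{i} x_i$

• Sample variance – $s^2 = \frac{1}{N-1} \sum_{i} (x_i - \bar{x})^2$, the best estimate of the parent variance $\sigma^2$.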
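A small numerical sketch may help fix these definitions. The data values are made up; the only assumption beyond the definitions themselves is the use of Python's `statistics` module, whose `variance` uses the N−1 (sample) normalization and whose `pvariance` uses the N (population) normalization.

```python
import statistics

# Made-up sample of N = 8 measurements.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)

sample_mean = statistics.fmean(data)

# Average deviation: mean of |x_i - mean|.
average_deviation = sum(abs(x - sample_mean) for x in data) / n

# Sample variance s^2 with the 1/(N-1) normalization.
sample_variance = statistics.variance(data)
sample_std = statistics.stdev(data)

# For comparison: the 1/N normalization (mean of the squared deviations).
mean_square_deviation = statistics.pvariance(data)

print(sample_mean, average_deviation, sample_variance, mean_square_deviation)
```

The two normalizations differ by the factor N/(N−1), which matters for a handful of points and is negligible for the long index records used later in the lecture.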

10

ERROR ANALYSIS: Useful Probability Distributions: The Binomial Distribution

• Measures the probability of observing x successes in n tries when the probability of success in each try is p (not to be confused with a bimodal distribution):

$P_B(x; n, p) = \frac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x}$

• The mean is given by

$\mu = \sum_{x=0}^{n} \left[ x\, \frac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x} \right] = np$

• For a binomial distribution the average number of successes approaches the mean value given by the product of the probability of success of each item and the number of items.

• The variance is given by

$\sigma^2 = \sum_{x=0}^{n} \left[ (x-\mu)^2\, \frac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x} \right] = np(1-p)$

• For the case of a coin toss p = 1/2, the distribution is symmetric about the mean, and the median and most probable value are equal to the mean. The variance is half of the mean.

• In probability theory a random variable x has a binomial distribution B(n, p), where n is the number of tries. It can be approximated by the normal distribution N when n is large. The binomial distribution converges towards the Poisson distribution when the number of trials n goes to infinity and the product μ = np remains fixed.
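The standard binomial results, mean np and variance np(1−p), can be verified by summing over the distribution directly. The sketch below does this for the coin-toss case; the choice n = 10 is arbitrary and the function name is hypothetical.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of x successes in n tries with success probability p."""
    return comb(n, x) * p**x * (1.0 - p) ** (n - x)


n, p = 10, 0.5  # ten coin tosses
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(pmf)
mean = sum(x * q for x, q in enumerate(pmf))
variance = sum((x - mean) ** 2 * q for x, q in enumerate(pmf))

print(total, mean, variance)  # compare with 1, n*p and n*p*(1-p)
```

For p = 1/2 the variance n/4 is half of the mean n/2, as stated for the coin toss.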

11

ERROR ANALYSIS: Useful Probability Distributions: The Poisson Distribution

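• A Poisson distribution occurs when p << 1 and μ = np is constant:

$P_P(x; \mu) = \lim_{p \to 0} P_B(x; n, p) = \frac{\mu^x}{x!}\, e^{-\mu}$

• It is frequently useful for counting experiments such as particle detectors.

• It describes the probability of observing x events per unit time out of n possible events, each of which has a probability p of occurring.

• The mean of the Poisson distribution must be the parameter μ in the above equation:

$\langle x \rangle = \sum_{x=0}^{\infty} x\, \frac{\mu^x}{x!}\, e^{-\mu} = \mu e^{-\mu} \sum_{x=1}^{\infty} \frac{\mu^{x-1}}{(x-1)!} = \mu e^{-\mu} \sum_{y=0}^{\infty} \frac{\mu^y}{y!} = \mu$

• The variance is

$\sigma^2 = \langle (x-\mu)^2 \rangle = \sum_{x=0}^{\infty} (x-\mu)^2\, \frac{\mu^x}{x!}\, e^{-\mu} = \mu$

• The standard deviation is the square root of the mean.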
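The limit that takes the binomial distribution to the Poisson distribution can be watched numerically. In this sketch μ = np is held at 2 (an arbitrary choice) while n grows, and the largest difference between the two distributions is recorded; the mean and variance of the Poisson distribution are then summed directly.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, p):
    """Binomial probability of x successes in n tries."""
    return comb(n, x) * p**x * (1.0 - p) ** (n - x)


def poisson_pmf(x, mu):
    """Poisson probability of x events when the mean is mu."""
    return mu**x * exp(-mu) / factorial(x)


mu = 2.0
max_diff = {}
for n in (10, 100, 10_000):
    p = mu / n  # keep mu = n*p fixed while p becomes small
    max_diff[n] = max(
        abs(binomial_pmf(x, n, p) - poisson_pmf(x, mu)) for x in range(11)
    )

# Mean and variance of the Poisson distribution (sum truncated at x = 59).
xs = range(60)
mean = sum(x * poisson_pmf(x, mu) for x in xs)
variance = sum((x - mean) ** 2 * poisson_pmf(x, mu) for x in xs)

print(max_diff)
print(mean, variance)
```

For a counting experiment this is the origin of the familiar rule that the uncertainty on N counts is roughly the square root of N.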

12

ERROR ANALYSIS: Useful Probability Distributions: The Gaussian Distribution

• The Gaussian distribution results from the case where the number of possible different observations (n) becomes infinitely large while the probability of success p remains finite, so that np >> 1.

• It works for many physical systems. It is also called the normal distribution.

• The Gaussian distribution is a continuous function describing the probability that a random observation x will occur from a parent distribution with mean μ and standard deviation σ:

$P_G(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right]$

• The probability function is defined so that the probability $dP_G(x; \mu, \sigma)$ that a random observation will fall in an interval dx about x is $dP_G(x; \mu, \sigma) = P_G(x; \mu, \sigma)\, dx$.

• The width of a Gaussian is usually expressed as the full width at half maximum; it is given by Γ = 2.354σ.

• The probable error (P.E.) is defined so that half the observations of an experiment are expected to fall within ±P.E. of the mean (the probability that any deviation is smaller than the P.E. is ½). P.E. = 0.6745σ.
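The two numerical factors quoted for the Gaussian, 2.354 for the full width at half maximum and 0.6745 for the probable error, follow from the shape of the distribution and can be recomputed. The sketch below assumes a unit Gaussian (μ = 0, σ = 1) and uses `statistics.NormalDist` from the Python standard library for the cumulative distribution.

```python
import math
from statistics import NormalDist


def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


mu, sigma = 0.0, 1.0

# Full width at half maximum: solve pdf(mu + w/2) = pdf(mu)/2 for w.
fwhm = 2.0 * math.sqrt(2.0 * math.log(2.0)) * sigma
half_ratio = gaussian_pdf(mu + fwhm / 2.0, mu, sigma) / gaussian_pdf(mu, mu, sigma)

# Probable error: half of all observations fall within +/- P.E. of the mean,
# so mu + P.E. is the 75th percentile.
dist = NormalDist(mu, sigma)
probable_error = dist.inv_cdf(0.75) - mu
fraction_inside = dist.cdf(mu + probable_error) - dist.cdf(mu - probable_error)

print(round(fwhm, 4), round(probable_error, 4), fraction_inside)
```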

14

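Propagation of Errors

• In general we do not know the actual errors in the determinations of parameters. Instead we use some estimate (e.g. σ) of the error in each parameter.

• Assume that $x = f(u, v, \ldots)$ and that $\bar{x} = f(\bar{u}, \bar{v}, \ldots)$.

• The uncertainty in x can be found by considering the spread in $x_i$ resulting from the spread in the individual measurements $u_i, v_i, \ldots$

• The variance is given by $\sigma_x^2 = \lim_{N\to\infty} \frac{1}{N} \sum_i (x_i - \bar{x})^2$.

• Expand to first order: $x_i - \bar{x} \simeq (u_i - \bar{u}) \frac{\partial x}{\partial u} + (v_i - \bar{v}) \frac{\partial x}{\partial v} + \ldots$, so that

$\sigma_x^2 \simeq \lim_{N\to\infty} \frac{1}{N} \sum_i \left[ (u_i - \bar{u}) \frac{\partial x}{\partial u} + (v_i - \bar{v}) \frac{\partial x}{\partial v} + \ldots \right]^2$

$= \lim_{N\to\infty} \frac{1}{N} \sum_i \left[ (u_i - \bar{u})^2 \left( \frac{\partial x}{\partial u} \right)^2 + (v_i - \bar{v})^2 \left( \frac{\partial x}{\partial v} \right)^2 + 2 (u_i - \bar{u})(v_i - \bar{v}) \frac{\partial x}{\partial u} \frac{\partial x}{\partial v} + \ldots \right]$

• The first two terms can be expressed in terms of the variances $\sigma_u^2$ and $\sigma_v^2$.

• The third term is related to the covariance $\sigma_{uv}^2 = \lim_{N\to\infty} \frac{1}{N} \sum_i (u_i - \bar{u})(v_i - \bar{v})$.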

15

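Propagation of Errors 2

• The variance of x (the square of its standard deviation) is given by

$\sigma_x^2 \simeq \sigma_u^2 \left( \frac{\partial x}{\partial u} \right)^2 + \sigma_v^2 \left( \frac{\partial x}{\partial v} \right)^2 + 2 \sigma_{uv}^2 \frac{\partial x}{\partial u} \frac{\partial x}{\partial v} + \ldots$

• If u and v are uncorrelated then $\sigma_{uv}^2 = 0$.

• Specific combinations

$x = au \pm bv$: $\sigma_x^2 = a^2 \sigma_u^2 + b^2 \sigma_v^2 \pm 2ab\, \sigma_{uv}^2$

$x = \pm auv$: $\frac{\sigma_x^2}{x^2} = \frac{\sigma_u^2}{u^2} + \frac{\sigma_v^2}{v^2} + 2 \frac{\sigma_{uv}^2}{uv}$

$x = \pm \frac{au}{v}$: $\frac{\sigma_x^2}{x^2} = \frac{\sigma_u^2}{u^2} + \frac{\sigma_v^2}{v^2} - 2 \frac{\sigma_{uv}^2}{uv}$

$x = a u^{\pm b}$: $\frac{\sigma_x}{x} = b \frac{\sigma_u}{u}$

$x = a e^{\pm bu}$: $\frac{\sigma_x}{x} = b\, \sigma_u$

$x = a \ln(\pm bu)$: $\sigma_x = a \frac{\sigma_u}{u}$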
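The first-order propagation formula for a product x = a·u·v, namely σ_x²/x² = σ_u²/u² + σ_v²/v² + 2σ_uv²/(uv), can be compared against a brute-force simulation. In the sketch below all numbers are made up, u and v are drawn as independent Gaussian variables (so the covariance term is zero), and the helper name `sigma_product` is hypothetical.

```python
import math
import random


def sigma_product(a, u, v, sigma_u, sigma_v, cov_uv=0.0):
    """First-order standard deviation of x = a*u*v; cov_uv is the covariance."""
    x = a * u * v
    rel_var = (sigma_u / u) ** 2 + (sigma_v / v) ** 2 + 2.0 * cov_uv / (u * v)
    return abs(x) * math.sqrt(rel_var)


a, u0, v0 = 2.0, 10.0, 5.0
sigma_u, sigma_v = 0.2, 0.1  # 2 percent errors on both inputs

predicted = sigma_product(a, u0, v0, sigma_u, sigma_v)

# Monte Carlo check with independent Gaussian errors on u and v.
random.seed(1)
n = 100_000
xs = [a * random.gauss(u0, sigma_u) * random.gauss(v0, sigma_v) for _ in range(n)]
mean_x = sum(xs) / n
simulated = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))

print(round(predicted, 4), round(simulated, 4))
```

The agreement is close here because the relative errors are small; the first-order expansion becomes less reliable as the relative errors grow.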

16

DISCRETE DISTRIBUTIONS: Measures of Central Tendency: Mean, Median and Mode

• There are several common quantitative measures of the tendency for a variable to cluster around a central value including the mean, median, and mode.

–The mean of a set of Ntot observations of a discrete variable xi is defined as $\bar{x} = \frac{1}{N_{tot}} \sum_{i=1}^{N_{tot}} x_i$.

–The median of a probability distribution function (pdf) p(x) is the value xmed for which larger and smaller values are equally probable. For discrete values, sort the samples xi into ascending order and, if Ntot is odd, find the value of xi that has equal numbers of points above and below it. If Ntot is even this is not possible, so instead take the average of the two central values of the sorted distribution.

–The mode is defined as the value of xi corresponding to the maximum of the pdf. For a quantized variable like the Kp index this corresponds to the discrete value of Kp that occurs most frequently. More generally it is taken to be the value at the center of the bin containing the largest number of values. For continuous variables the definition depends on the width of the bins used in determining the histogram. If the bins are too narrow there will be large fluctuations in the estimated pdf from bin to bin. If the bins are too wide the location of the mode will be poorly resolved.
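The odd and even cases of the median, and the mode of a quantized variable, are easy to see on a tiny example. The values below are made up and are stored as integer multiples of 1/3 (an assumption made here so that equal Kp values compare exactly); they are not real Kp data.

```python
import statistics

# Made-up Kp samples stored in units of 1/3 (e.g. 4 means Kp = 1.333).
kp_thirds = [3, 4, 4, 5, 6, 7, 4, 10]

mean_kp = statistics.fmean(kp_thirds) / 3.0

# Even number of samples: average of the two central sorted values.
median_even = statistics.median(kp_thirds)

# Odd number of samples: the single central sorted value.
median_odd = statistics.median(kp_thirds[:-1])

# Mode: the discrete value that occurs most frequently.
mode_thirds = statistics.mode(kp_thirds)

print(round(mean_kp, 3), median_even / 3.0, median_odd / 3.0, mode_thirds / 3.0)
```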

17

More on the Mode

• It is not necessary to create a histogram to obtain the mode of a distribution [Press et al., 1986, page 462]. It can be calculated directly from the data in the following manner.

• Sort the data in ascending order.

• Choose a window width of J samples (J >= 3).

• For every i = 1, 2, …, Ntot–J estimate the pdf by using the formula

$p\left( \frac{1}{2} [x_i + x_{i+J}] \right) \approx \frac{J}{N_{tot}\, (x_{i+J} - x_i)}$

• Take as the mode the value of [xi + xi+J]/2 corresponding to the largest estimate of the pdf.

• A section in Press et al. (1986) describes a complex procedure for choosing the most appropriate value of J.
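The steps above can be sketched directly: sort the data, slide a window of J samples along it, and estimate the pdf at each window center as J / (Ntot · (x[i+J] − x[i])). The function name, the made-up data, and the handling of zero-width windows (treated as infinite density) are assumptions of this sketch; the procedure for choosing J is not attempted.

```python
def estimate_mode(samples, j):
    """Return (mode, density) from the window of j samples with the highest density."""
    if j < 3 or j >= len(samples):
        raise ValueError("need 3 <= j < number of samples")
    xs = sorted(samples)
    n = len(xs)
    best_density, best_mode = -1.0, None
    for i in range(n - j):
        width = xs[i + j] - xs[i]
        # Repeated values give a zero-width window; treat that as infinite density.
        density = float("inf") if width == 0 else j / (n * width)
        if density > best_density:
            best_density = density
            best_mode = 0.5 * (xs[i] + xs[i + j])
    return best_mode, best_density


# Made-up sample with a cluster of values near 2.5.
samples = [0.0, 1.0, 2.0, 2.4, 2.5, 2.6, 3.5, 5.0, 9.0]
mode, density = estimate_mode(samples, j=3)
print(mode, density)
```

A larger J smooths the estimate at the cost of resolution, which mirrors the bin-width trade-off for histograms.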

18

The Probability Distribution Function 1

• Probability is the statistical concept that describes the likelihood of the occurrence of a specific event. It is estimated as the ratio of the number of ways the specific event might occur to the total number of all possible occurrences, i.e. P(x) = N(x)/Ntot. Suppose we have a random variable X with values lying on the x axis. The probability density p(x) for X is related to probability through an integral

$P(a \le X \le b) = \int_a^b p(x)\, dx$

• Suppose we have a sample set of Ntot observations of the variable X. The probability distribution function (pdf) for this variable at the point xi is defined as

$p(x_i) = \lim_{\Delta x \to 0} \frac{1}{\Delta x} \frac{N[x_i, x_i + \Delta x]}{N_{tot}}$

• Here Δx is the interval (or bin) of x over which occurrences of different values of X are accumulated, N[xi, xi+Δx] is the number of events found in the bin between xi and xi+Δx, and Ntot is the total number of samples in the set of observations of X.

19

The Probability Distribution Function 2

• Usually the sample set is not large enough to allow the limit to be achieved, so the pdf is approximated over a set of equal-width bins defined by the bin edges {xi} = {x0, x0+Δx, x0+2Δx, x0+3Δx, …, x0+mΔx}.

• Normally x0 and x0+mΔx are chosen so that all points in the sample set fall between these two limits.

• A plot of the quantity N[xi, xi+Δx] calculated for all values of xi with a fixed Δx is called a frequency histogram. The plot is called a probability histogram when the frequency of occurrence in each bin is normalized by the total number of occurrences, Ntot. The sum of all values of a probability histogram is 1.0.

• If the bin width is changed the occurrence probabilities will also change. To compensate for this the probability histogram is additionally normalized by the width of the bin to obtain the probability density function, which we refer to as the probability distribution function. The sum of all values of the probability density distribution equals 1/Δx. The bin width Δx is usually fixed, but in cases where some bins have very low occurrence probability it may be necessary to increase Δx as a function of x.
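The three normalizations (frequency histogram, probability histogram, probability density) differ only by constant factors, as a small sketch shows. The data, bin edges, and function name are made up for illustration; bins are taken as half-open intervals, so a sample exactly at the upper limit x0 + mΔx would not be counted.

```python
def frequency_histogram(samples, x0, dx, m):
    """Count samples in m half-open bins [x0 + k*dx, x0 + (k+1)*dx)."""
    counts = [0] * m
    for x in samples:
        k = int((x - x0) // dx)
        if 0 <= k < m:
            counts[k] += 1
    return counts


samples = [0.1, 0.4, 0.6, 0.7, 0.9, 1.2, 1.3, 1.8]
dx = 0.5

counts = frequency_histogram(samples, x0=0.0, dx=dx, m=4)
n_tot = sum(counts)

probability = [c / n_tot for c in counts]   # sums to 1
density = [p / dx for p in probability]     # sums to 1/dx

print(counts)
print(probability, sum(probability))
print(density, sum(density))
```

Multiplying each density value by Δx and summing recovers 1, which is the discrete form of the normalization integral of the pdf.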

20

Kp and Dst

•The Kp index is a measure of the strength of geomagnetic variations with period shorter than 3 hours caused mainly by magnetospheric substorms. The index is roughly proportional to the logarithm of the range of deviation of the most disturbed horizontal component of the magnetic field from a quiet day in a 3-hr interval.

•Kp is available continuously from the beginning of 1932. It is dimensionless and quantized in multiples of 1/3. Its range is finite and limited to the interval [0, 9]. In the following section Kp is one of the time series we use to illustrate some commonly used statistical techniques.

•The Dst (disturbance storm time) index is a measure of the strength of the ring current created by the drifts of charged particles in the earth’s magnetic field.

•A rapid decrease in Dst is an indication that the ring current is growing and that a magnetic storm is in progress. Ideally Dst is linearly proportional to the total energy of the drifting particles. The Sym-H index is a similar measure with higher time resolution.

21

A Histogram of Kp

• The Kp index is what is called a "categorical variable". It can only take on a limited number of discrete values. By definition Kp ranges from a value of 0.0, meaning geomagnetic activity is very quiet, to 9.0, meaning that it is extremely disturbed. In this limited range it can assume 28 values corresponding to bins of width 1/3. Kp has no units because the numbers refer to classes of activity.

•The values of 0 and 9 are quite rare, since most of the time activity is slightly disturbed. A useful way to visualize the distribution of values assumed by the Kp index is to create a histogram.

•A histogram consists of a set of equal-width bins that span the dynamic range of a variable.

•If the number of occurrences in each bin is normalized by the total number of samples of Kp, one obtains the probability of occurrence of a given value.

•If in addition we divide by the width of the bin we obtain the probability density function (pdf).

22

Measures of Dispersion

• It is obvious from the Kp histogram that values of this variable are spread around a central value. Three standard measures of this dispersion are the mean absolute deviation, the standard deviation, and the interquartile range.

• The mean absolute deviation (mad) is defined by the formula $mad = \frac{1}{N_{tot}} \sum_{i=1}^{N_{tot}} |x_i - \bar{x}|$.

• The standard deviation (root mean square) is given by $std = \sqrt{ \frac{1}{N_{tot} - 1} \sum_{i=1}^{N_{tot}} (x_i - \bar{x})^2 }$.

• The upper and lower quartiles are defined in the same way as the median except that the values ¼ and ¾ are used instead of ½.

• The interquartile range (iqr) is the difference between the upper and lower quartiles (Q3 and Q1): $iqr = Q_3 - Q_1$.

• For variables with a Gaussian pdf, 68% of all data values will lie within ±1 std of the mean. Similarly, by definition 50% of the data values fall within the interquartile range. Note that the standard deviation is more sensitive to values far from the mean than is the mean absolute deviation.
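The different sensitivity of these measures to outliers shows up clearly in a small experiment. The sketch below computes the mean absolute deviation, the standard deviation (with an assumed N−1 normalization), and the interquartile range for a made-up sample, then repeats the calculation after replacing one value with a distant outlier. Quartiles come from `statistics.quantiles`, whose interpolation rule is one of several conventions in use.

```python
import math
import statistics


def dispersion(data):
    """Return (mean absolute deviation, standard deviation, interquartile range)."""
    n = len(data)
    mean = sum(data) / n
    mad = sum(abs(x - mean) for x in data) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    q1, _, q3 = statistics.quantiles(data, n=4)
    return mad, std, q3 - q1


clean = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
with_outlier = clean[:-1] + [80.0]  # replace the largest value by an outlier

mad_c, std_c, iqr_c = dispersion(clean)
mad_o, std_o, iqr_o = dispersion(with_outlier)

print(mad_c, round(std_c, 3), iqr_c)
print(mad_o, round(std_o, 3), iqr_o)
```

In this example the outlier inflates the standard deviation by a larger factor than the mean absolute deviation, while the interquartile range does not change.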

23

Measures of Asymmetry and Shape

• The standard measure of asymmetry of a pdf is called skewness. It is defined by the third moment of the probability distribution. For discrete data the definition reduces to

$skew = \frac{1}{N_{tot}} \sum_{i=1}^{N_{tot}} \left( \frac{x_i - \bar{x}}{\sigma} \right)^3$

• Because of the standard deviation in the denominator, skewness is a dimensionless quantity.

• Probability distribution functions can have wide variations in shape, from completely flat to very sharply peaked about a single value. A measure of this characteristic is kurtosis, defined as

$kurt = \frac{1}{N_{tot}} \sum_{i=1}^{N_{tot}} \left( \frac{x_i - \bar{x}}{\sigma} \right)^4 - 3$

• The factor 3 is chosen so that kurtosis for a variable with a Gaussian distribution is zero. Negative kurtosis indicates a flat distribution with little clustering relative to a Gaussian, while positive kurtosis indicates a sharply peaked distribution.
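A minimal sketch of the third and fourth moments, assuming the common discrete forms skew = (1/N) Σ[(x − mean)/σ]³ and kurt = (1/N) Σ[(x − mean)/σ]⁴ − 3 with σ taken as the sample standard deviation. The two test samples are made up: one is symmetric and flat, the other has a long tail towards positive values.

```python
import math


def skewness_kurtosis(data):
    """Return (skewness, kurtosis) using the sample standard deviation."""
    n = len(data)
    mean = sum(data) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    skew = sum(((x - mean) / sigma) ** 3 for x in data) / n
    kurt = sum(((x - mean) / sigma) ** 4 for x in data) / n - 3.0
    return skew, kurt


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]      # evenly spread, no tail
right_tail = [1.0, 1.0, 1.0, 2.0, 10.0]    # one large positive value

skew_sym, kurt_sym = skewness_kurtosis(symmetric)
skew_tail, kurt_tail = skewness_kurtosis(right_tail)

print(round(skew_sym, 6), round(kurt_sym, 3))
print(round(skew_tail, 3), round(kurt_tail, 3))
```

The evenly spread sample gives zero skewness and negative kurtosis, while the tailed sample gives positive skewness, in line with the sign conventions described above.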

24

Statistical Properties of Kp and Dst

Quantity             Kp           Dst
Ntot                 198,696      376,128
min                  0.0          -589
max                  9            92
mean                 2.317        -16.49
median               2.000        -12
mode                 1.3 to 1.7   -10 to 0
Ave. deviation       1.7173       17.04
Standard deviation   1.463        24.86
Lower quartile       1.333        -26
Upper quartile       3.333        -1
skewness             0.744        -2.737
skewstd              0.0055       0.0040
kurtosis             3.511        22.009
kurtstd              0.011        0.0080

• The center of the Kp distribution is ~2.

• Dispersion about the central value is about 1.

•Skewness for Kp is +0.744 indicating that the pdf is skewed in the direction of positive values.

• If the pdf were Gaussian, the standard deviation of the skewness would depend only on the total number of points used in calculating the pdf and would be skewstd ~ sqrt(6/Ntot).

• For Kp this value is 0.0055 indicating a highly significant departure from a symmetric distribution.

• The corresponding values for Dst are –2.737 and 0.0040 indicating very significant asymmetry towards negative values.

25

The Shape of the Kp and Dst Distributions

• Negative kurtosis indicates a flat distribution with little clustering relative to a Gaussian while positive kurtosis indicates a sharply peaked distribution.

– For Gaussian variables the standard deviation of the kurtosis also depends only on the total number of points used in calculating the pdf and is approximately kurtstd ~ sqrt(24/Ntot).

– Both distributions exhibit positive kurtosis, the Dst pdf to a greater extent than the Kp distribution. Thus the distributions for both indices are more sharply peaked than would be a Gaussian distribution.
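The quoted standard errors follow from the sample sizes alone, so they can be recomputed from the table. The sketch below evaluates sqrt(6/Ntot) and sqrt(24/Ntot) for both indices and expresses the measured skewness and kurtosis in units of those standard errors; the dictionary layout and function names are just for illustration.

```python
import math


def skew_std(n_tot):
    """Approximate standard deviation of the skewness for Gaussian data."""
    return math.sqrt(6.0 / n_tot)


def kurt_std(n_tot):
    """Approximate standard deviation of the kurtosis for Gaussian data."""
    return math.sqrt(24.0 / n_tot)


# (Ntot, skewness, kurtosis) as listed in the table of statistical properties.
table = {
    "Kp": (198_696, 0.744, 3.511),
    "Dst": (376_128, -2.737, 22.009),
}

significance = {}
for name, (n_tot, skew, kurt) in table.items():
    s_err, k_err = skew_std(n_tot), kurt_std(n_tot)
    significance[name] = (abs(skew) / s_err, abs(kurt) / k_err)
    print(f"{name}: skewstd={s_err:.4f} kurtstd={k_err:.4f} "
          f"skew/err={significance[name][0]:.0f} kurt/err={significance[name][1]:.0f}")
```

Both indices depart from Gaussian values by more than a hundred standard errors, which is what "highly significant" means here.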
