ExcelTM and Statistical Analysisfaculty.fiu.edu/~bidarkot/eco4421/excel_info_4421.doc · Web...

Excel and Statistical AnalysisJonathan Hill

Florida International University

In this document, basic principles for analyzing data with Microsoft’s Excel software will be detailed. Many assignments with require the information provided in the document. It is assumed that you know how to boot-up Excel from your personal computer’s Start Menu, as well as open existing Excel files and save files.

CONTENTS

Topic Page

1. Data Arrangement: Rows, Columns, Observations 21.1 Rows and Columns 21.2 Observations 2

2. Creating News Variables as Functions of Existing Variables 32.1 Creating One New Variable for One Observation 32.2 Creating One New Variable for All Observations 42.3 Examples

3. Statistical Analysis: Mean, Variance, Standard Deviation 53.1 Sample Mean, Mode and Median 53.2 Sample Variance and Standard Deviation 73.3 Descriptive Statistics: Tool Package 8

4. Relative Frequency and Histograms 104.1 Frequency 104.2 Relative Frequency 114.3 Histogram 12

5. Probability Distribution Functions 145.1 Discreet Distribution Functions 14

Binomial and Poisson5.2 Continuous Distribution Functions 17

Normal and Student's t

6. Bivariate Analysis: Sample Covariance and Correlation 206.1 Sample Covariance 206.2 Sample Correlation 21

7. Interval Estimation 23

8. Graphics: Charts, Tables and Figures 258.1 Scatter Plots 258.2 Time-Series: Line Plots 278.3 Editing Graphs and Figures 29

1

1. Data Arrangement: Rows, Columns, Observations

1.1 Rows and ColumnsWhen you boot-up Excel, a spread sheet of “cells” appears with column labels in letters (e.g. A, B, etc..) and row labels in Arabic numerals (e.g. 1, 2, etc..). For example, the upper-left corned of the spread-sheet will appear as follows:

A B C1234

A vertical stack of cells is a “column”. For example, column “A” is the stackA

A horizontal layer of cells is a “row”. For example, the 3rd row is the layer

A B C12

3

4

1.2 Observations [Individual People/Time Period in a Sample of Data]Your data assignments will include Excel data files. For example, consider data on work-hours, hourly wage, education and age of 4 people:

A B C D1 hours wage ed age2 34 7.5 12 203 50 20.8 18 374 40 5.75 16 505 30 12.90 14 42

The data files will always contain column headers with variable names. The data is read as follows. Clearly, the columns contain individual variables. For example, columns C contains information on the years of education the 4 people have. Moreover, in the 2nd row is the first person in our sample of data. This person works 34 hours/week, earns $7.50/hour, has 12 years of education and is 20 years old. In this manner, the rows contain information for one person. We call one such set of information for one person an “observation”. Thus, the fourth observation comprises 30 work hours, a wage of $12.90, 14 years of education, and an age of 42. Our sample, therefore, has n = 4 observations.

Similarly, we could have data that changes over time, like GNP, the CPI and inflation for 1988, 1989 and 1990

A B C D1 YEAR GNP CPI INFL2 19883 19894 1990

Thus, one year, say 1988, and its GNP, CPI and inflation rate, represents one observation.

2

2. Creating News Variables as Functions of Existing Variables

2.1 Creating One New Variable for One ObservationOften, we want to create new columns of data based on the existing data in our data file. For example, if our file contains information on one firm’s revenue [REV] and cost-per-unit [COST] of production for 4 separate years (1998, 1999, 2000, 2001), we can use an empty column to create a new profit variable, PROFIT= REV – COST.

A B C D1 YEAR REV COST2 1998 100 873 1999 76 754 2000 490 5005 2001 80 40

2.2.1 First, we want to create a new column label, in column D, for the PROFIT variable. Simply place the cursor over the cell in row 1, column D (denoted as cell "D1"), click, and type “PROFIT”:

A B C D1 YEAR REV COST PROFIT2 1998 100 873 1999 76 754 2000 490 5005 2001 80 40

2.2.2 Next, we want to use the data in the 2nd row (the first year in our sample) to create the new profit variable for the year 1998. Point and click-on column D, row 2 (i.e. cell D2), just below the new label, PROFIT. We will use Excel’s ability to perform mathematical operations. Type “=”, then type in the cell coordinates containing the first year’s revenues, “B2” (always letters/columns first), then type “-”, then the coordinates containing cost, “C2”:

A B C D1 YEAR REV COST PROFIT2 1998 100 87 =B2-C23 1999 76 754 2000 490 5005 2001 80 40

Finally, hit Enter. Excel will perform the operation for you:

A B C D1 YEAR REV COST PROFIT2 1998 100 87 133 1999 76 754 2000 490 5005 2001 80 40

2.2 Creating One New Variable for All ObservationsExcel beautifully allows us to perform the same operation for all other years in the sample simply by clicking and dragging the cursor. Click-on the cell containing the first year’s profit. The cell will be surrounded by black with a small box in the lower-right corner:

A B C D1 YEAR REV COST PROFIT

3

2 1998 100 87 13

3 1999 76 754 2000 490 5005 2001 80 40

Point the cursor over small blacken box in the lower right corner of the cell: Excel will change the shape of the cursor to a small cross. Then, click, hold, and drag down: Excel will automatically perform the same arithmetic operation for the first cell in all subsequent cells. Once you reach the 5 th row (i.e. cell D5), stop dragging, and release the mouse-button. The result is

A B C D1 YEAR REV COST PROFIT2 1998 100 87 133 1999 76 75 14 2000 490 500 -105 2001 80 40 40

2.3 ExamplesWe want to the use labor data from above to create a yearly wage income variable. Mathematically, yearly wage income is

(thus, the units are $'s)

A B C D E1 hours wage ed age2 34 7.5 12 203 50 20.8 18 374 40 5.75 16 505 30 12.90 14 42

In column E, row 1 (cell E1), type “income_wage”, or whatever variable name you prefer, as long as it logically describes the content of the data:

A B C D E1 hours wage ed age income_wage2 34 7.5 12 203 50 20.8 18 374 40 5.75 16 505 30 12.90 14 42

Click-on column E, row 2 (cell E2), type “=A2*B2*52”, then enter. Notice that multiplication is performed with the asterisk “*”. The result is a yearly wage income of $13260/year:

A B C D E1 hours wage ed age income_wage2 34 7.5 12 20 132603 50 20.8 18 374 40 5.75 16 505 30 12.90 14 42

Place the cursor above the small black box in the lower right corned of E2, click, hold, and drag down to the 5th row (cell E5). The outcome is

A B C D E1 hours wage ed age income_wage2 34 7.5 12 20 13260

4

3 50 20.8 18 37 540804 40 5.75 16 50 119605 30 12.90 14 42 20124

3. Statistical Analysis: Mean, Variance, Standard Deviation

In the present sub-section, we will study how to use Excel for basic statistical analysis of data. The theory required for the subsequent discourse is contained in Newbold, Chapter 2.

The data that we inspect in any situation or environment is necessarily a sample taken from a larger population: we cannot (typically) have the entire population of data. Thus, all data analysis is of the sample statistic form.

3.1 Sample Mean, Mode and MedianConsider the sample of data on work-hours, wage, education, age, and newly created variable, yearly wage income = income_wage:

A B C D E F1 hours wage ed age income_wage2 34 7.5 12 20 132603 50 20.8 18 37 540804 40 5.75 16 50 119605 30 12.90 14 42 2012467

For ease of reference, we can use the 7th row to contain a label for the "mean", as well as the sample means of each variable. Thus, type in cell A7 the title "MEAN".

A B C D E F1 hours wage ed age income_wage2 34 7.5 12 20 132603 50 20.8 18 37 540804 40 5.75 16 50 119605 30 12.90 14 42 2012467 MEAN

3.1.1 Sample Mean for one Variable (e.g. work-hours)For the sample mean (i.e. average) of work-hours, click-on cell B7, and type the command

=average(B2:B5)

A B C D E F1 hours wage ed age income_wage2 34 7.5 12 20 132603 50 20.8 18 37 540804 40 5.75 16 50 119605 30 12.90 14 42 2012467 MEAN =average(B2:B5)

This command will tell Excel to derive the sample mean of the numbers contained in cells B2 - through - B5, which identically represents the work hours data. Once you type the above command in, hit ENTER, and Excel will fill the cell with the sample average:

A B C D E F1 hours wage ed age income_wage2 34 7.5 12 20 132603 50 20.8 18 37 54080

5

4 40 5.75 16 50 119605 30 12.90 14 42 2012467 MEAN 38.5

3.1.2 Sample Mean for All Variables Recall the point, click and drag method for easily creating observations of a newly created variable. We will employ the same technique. Click-on cell B7, in which the sample mean of "hours" is contained. The cell will be highlighted with black around the edges, and a small box in the lower right corner:

A B C D E F1 hours wage ed age income_wage2 34 7.5 12 20 132603 50 20.8 18 37 540804 40 5.75 16 50 119605 30 12.90 14 42 2012467 MEAN 38.5

Point the cursor over the small black box until it changes into a small black cross, click and drag to the right over cells C7 - F7. Excel will not copy the numerical value "38.5" into the subsequent cells. Rather, Excel will copy the formula for the average in those cells, updating the column of data used as you move right. The result is sample averages for all variables:

A B C D E F1 hours wage ed age income_wage2 34 7.5 12 20 132603 50 20.8 18 37 540804 40 5.75 16 50 119605 30 12.90 14 42 2012467 MEAN 38.5 11.7375 15 37.25 24856

3.1.3 Formatting Cells for Decimal PlacesThe above results are somewhat messy in the sense that the sample mean of hours has one decimal place, the sample mean of wages has 4 decimal places, etc. We would like to standardize the representation by displaying all numbers with only two decimal places. Click-on cell B7. Place the cursor above the center of the cell, and NOT the small box in the corner. Click and drag over cells B7 - F7. The cells with become highlighted.

A B C D E F1 hours wage ed age income_wage2 34 7.5 12 20 132603 50 20.8 18 37 540804 40 5.75 16 50 119605 30 12.90 14 42 2012467 MEAN 38.5 11.7375 15 37.25 24856

Go to the task-bar on the top of the screen, click-on FORMAT, CELL, NUMBER. Once done, Excel will have displayed a pop-up box with a choice of the number of decimal places. Choose "2". Click-on "ok" on the bottom of the pop-up box.

A B C D E F1 hours wage ed age income_wage2 34 7.5 12 20 13260

6

3 50 20.8 18 37 540804 40 5.75 16 50 119605 30 12.90 14 42 2012467 MEAN 38.50 11.74 15.00 37.25 24856.00

Excel will round values in the case when the true numerical value has more than the prescribed number of decimal places (in this care, two).

3.1.4 Sample Mode and Median

For sample modes and medians, use the command “=mode(…)”and “=median(…)”. As a separate example, consider two columns of data labeled simply X and Y. Using the methods describe above, we can calculate the samples of the two variables:

A B C1 X Y2 1 23 1 74 0 05 -1 067 MEAN .25 2.2589

For clarity, type “MODE” and “MEDIAN” in cells A8 and A9, respectively. In cell B8, type

=mode(B2:B5)

Then, hit ENTER. The mode of variable X is 1.00, the most frequently occurring value in the sample of 4 observations. For the sake of speed, click on cell B8, place the cursor above the small black box in the lower right corner, click, hold, and drag over cell C8. The mode of variable Y is 0.00. Formatting the cells so numerical values have 2 decimal places, we now have

A B C1 X Y2 1 23 1 74 0 05 -1 067 MEAN .25 2.258 MODE 1.00 0.009 MEDIAN

Next, in cell B9, type “=median(B2:B5)”, then ENTER. Click and drag for variable Y. The result, after formatting, is

A B C1 X Y2 1 23 1 74 0 05 -1 067 MEAN .25 2.258 MODE 1.00 0.009 MEDIAN .50 1.00

7

3.2 Sample Variance and Standard DeviationAny subsequent statistical operation in Excel requires only knowledge of the proper command. All

further steps are identical to those outlined for the sample mean. For the sample variance, type "VARIANCE" in cell A8, and in cell B8 use the command

=var(B2:B5)

Then type ENTER, and click-and-drag over the remaining cells, C8 - F8.For the sample standard deviation, type "STAND. DEV." in cell A9, then type in cell B9

=stdev(B2:B5)

Then type ENTER, and click-and-drag over the remaining cells, C9 - F9. After the numerical values in the cells have been standardized to two decimal places, we obtain

A B C D E F1 hours wage ed age income_wage2 34 7.5 12 20 132603 50 20.8 18 37 540804 40 5.75 16 50 119605 30 12.90 14 42 2012467 MEAN 38.50 11.74 15.00 37.25 24856.008 VARIANCE 75.67 45.76 6.67 160.92 392402677.339 STAND. DEV. 8.70 6.76 2.58 12.69 19809.16

3.3 “Descriptive Statistics” Excel Package

3.3.1 Deriving All Descriptive StatisticsExcel provides a very simple mechanism for obtaining an array of sample descriptive statistics of

your data. Consider the spread-sheet


In the main Excel tool-bar, click-on TOOLS, DATA ANALYSIS, DESCRIPTIVE STATISTCS, then OK.

In the area to the right of Input Range, type in the cell range in which your data is stored. In this case, we type

B1: F5

Notice that we include the 1st row with the variable labels. Click-on Labels in First Row. Click-on Summary Statistics for standard sample statistics. Click-on Confidence Level for Mean in order to generate the CI-length. Excel sends the output to another “sheet” which can be accessed by clicking-on the sheet with the highest number at the bottom of the screen (e.g. Sheet 4). The result is

hours wage ed age income_wage Mean 38.5 Mean 11.7375 Mean 15 Mean 37.25 Mean 24856Standard Error 4.349329 Standard Error 3.382392 Standard Error 1.290994 Standard Error 6.342647 Standard Error 9904.578Median 37 Median 10.2 Median 15 Median 39.5 Median 16692Mode #N/A Mode #N/A Mode #N/A Mode #N/A Mode #N/AStandard Deviation 8.698659 Standard Deviation 6.764783 Standard Deviation 2.581989 Standard Deviation 12.68529 Standard Deviation 19809.16Sample Variance 75.66667 Sample Variance 45.76229 Sample Variance 6.666667 Sample Variance 160.9167 Sample Variance 3.92E+08Kurtosis -0.03656 Kurtosis -0.20184 Kurtosis -1.2 Kurtosis 1.365516 Kurtosis 3.290698

8

Skewness 0.829537 Skewness 0.980108 Skewness 0 Skewness -0.96447 Skewness 1.813812Range 20 Range 15.05 Range 6 Range 30 Range 42120Minimum 30 Minimum 5.75 Minimum 12 Minimum 20 Minimum 11960Maximum 50 Maximum 20.8 Maximum 18 Maximum 50 Maximum 54080Sum 154 Sum 46.95 Sum 60 Sum 149 Sum 99424Count 4 Count 4 Count 4 Count 4 Count 4

3.3.2 Descriptive Statistics Output

“NA”: Denotes “not applicable” in the event a unique answer does not exist.

Standard Error: Denotes the sample mean divided by the estimated standard deviation of the sample mean:

Range: Denotes the difference between the maximum and minimum values.

Count: Denotes the number of actual data points.

Skew: Measures the degree of skew in the distribution of X. A large positive skew implies a skew to the right (disproportionate probability of relatively small/negative values); a large negative skew implies a skew to the right (disproportionate probability of relatively large/positive values).

Kurtosis: Measures the thickness of the tails of the distribution. Random variables with a high variance have a greater odds of taking on relatively large values, hence the middle realm of the distribution is flat, and the tails are thick: this denotes a large kurtosis. Random variables with a low variance have a smaller odds of taking on relatively large values, hence the middle realm of the distribution is tall, and the tails are thin: this denotes a small kurtosis.

9

4. Relative Frequency and Histograms

Excel provides a simple method for calculating frequencies, relative frequencies and histograms. Consider the following column of data representing the number of divorces in a sample of n = 10 people:

A B C D1 divorces2 03 04 15 46 27 18 19 010 011 2

4.1 FrequencyWe would like to derive the frequency for each divorce value. Recall that we define frequency as the number of times a value occurs in a sample. Notice that the only values that appear in the sample are (0,1,2,4), thus the relative frequencies of values 3,5,6,… will be zero.

In cell B2, type FREQUENCY. In column D, we need to create the possible values which we want to include for the frequencies: unfortunately, Excel does not understand that we simply want the values between, say 0 and 5; rather, we need to type the numerical values to be counted in a separate column. Create the following:

A B C D1 divorces FREQUENCY2 0 03 0 14 1 25 4 36 2 47 1 58 19 010 011 2

Now, click on cell B2, hold the mouse-key down, and drag down to cell B7. This will highlight cells B2 - B7, exactly parallel to cells D2-D7, which contain the possible values we want to count in our sample. Next, in the task-bar at the top of the screen, click on the fx key: this brings up a pop-up box within which we can employ of Excel's statistical and mathematical formulas. On the left-side of the pop-up box, click-on "statistical", on the right-side, scroll down to "frequency", and click. A frequency pop-up box appears. In the Data-Array, type the cells in which the column of data is contained, in this case A2:A11 (be sure to use a colon between cell values). In the Bins-Array, type the cells which contain the possible values we want to count, in this case cells D2:D7. Again, be sure to use a colon. DO NOT TYPE ENTER. In order to tell Excel that you want this operation to occur for all possible values in D2-D7, relative to the

10

sample of data A2-A11, we need to hit simultaneously CTRL and SHIFT and ENTER. Once done, Excel will display the number of times each value in cells D2-D7 occurs in the sample contained in A2-A11.

In summary:

i. Highlight cells

A B C D1 divorces FREQUENCY

2 0 0

3 0 1

4 1 2

5 4 3

6 2 4

7 1 5

8 1

9 0

10 0

11 2

ii. Click on the fx key, click on statistical, then frequency. Type in A2:A11 in the Data-Array, and type in D2:D7 in the Bins-Array.

iii. Hit simultaneously CTRL and SHIFT and ENTER. The result:

A B C D1 divorces FREQUENCY2 0 4 03 0 3 14 1 2 25 4 0 36 2 1 47 1 0 58 19 010 011 2

Thus, the value 0 occurs 4 times, 1 occurs 2 times, 2 occurs twice, 3 never occurs, 4 occurs onces, zero 5's, and of course, zero 6's, etc.

4.2 Relative FrequencyRecall that the relative frequency is the frequency relative to the sample size. Thus, for example, the relative frequency of the value of 0 divorces represents the percent of the sample which has this value of divorces. In this case, the relative frequency of 0 is 4/10 = .4 0, or 40% of the people in our sample have a value of 0 divorces. To derive the relative frequency, simply use column C, and perform a basic new-variable derivation, as detailed in Section 1.

Type "REL. FREQ." in cell C1. Click on cell C2, type

11

=B2/10

then hit ENTER. Excel will derive 4/10 for you, and present .40. To derive the rest of the relative frequencies, click on cell C2, hold the cursor over the small black box in the lower right corner, click, hold and drag down over cells C3 - C7. The result, after formatting the numerical values to have 2 decimal places, is

A B C D1 divorces FREQUENCY REL. FREQ.2 0 4 0.40 03 0 3 0.30 14 1 2 0.20 25 4 0 0.00 36 2 1 0.10 47 1 0 0.00 58 19 010 011 2

Clearly, the relative frequencies must sum to 1.00 (one).

4.3 HistogramA simple method of graphically representing an array of relative frequencies if the histogram. We will use Excel's ability to create graphs to build a histogram: further details on graphs and figures is presented in Section 8.

4.3.1 Create the Rough Figure ShapeFirst, highlight the cells that contain the relative frequency information, as well as the label "REL.

FREQ.":

A B C D

1 divorces FREQUENCY REL. FREQ.

2 0 4 0.40 0

3 0 3 0.30 1

4 1 2 0.20 2

5 4 0 0.00 3

6 2 1 0.10 4

7 1 0 0.00 5

8 19 010 011 2

Next, go to the task bar at the top of the screen, click on the little colored bar-chart that is two buttons to the left of the screen percent size [e.g. 100%]. A pop-up box appears. Click on "Standard Types", then "Column". Then, click on "Next": Excel will show you the basic set-up of the resulting chart. Click on "Next" again: a pop-up box allowing for substantial alterations of the basic outlook of the bar-chart appears. Type in the Title as "Relative Frequency of Divorces", type is Category X as "Number of Divorces", and type in Category Y as "f(x)/n". If you want to remove the grid-lines, click-on Gridlines, and click-on any box with a check-mark in it: this will remove the mark, and remove the associated grid-lines. Because the legend (the graph box that describes the variable) is redundant, click-on Legend, then click off

12

the mark on "Show Legend". Finally, click on Finish at the bottom of the pop-up box. The result is the following histogram:

4.3.2 Refine the X-axisObserve that unless we tell Excel otherwise, it will automatically display X-axis values starting at

1: For our histogram purposes, however, this is in error: the smallest number of divorces possible is, of course, zero.

In order to specify to Excel the correct X-axis array of values, in our case 0 – 5, perform the following task. First, recall that column D contains the possible values, 0 – 5. Now, point the cursor over the white-area of the chart, and click once: this highlight the chart with small black dots around the border. Go to the task-bar at the top of the screen, click on CHART, SOURCE DATA, then SERIES. At the bottom of the pop-up box, there is a white-area next to “Category (X) Axis”. Click once in this area: this places Excel’s cursor in the box. Now, move the mouse cursor over the cell that contains the first possible value, 0: in our case, the cell is D2. Click once and hold, dragging the cursor down over all cells that contain possible, cells D2 – D7. Once you reach cell D7, let go of the mouse button: Excel will automatically fill in the cell range D2 – D7 into the white area to the right of “Category (X) Axis”. Once this is done, click on OK at the bottom of the pop-up box.

The result is as follows:

See Section 8 for further information about graph/chart/figure formatting.

13

5. Probability Distribution Functions

Excel provides a simple means for deriving the probability distribution functions or cumulative distribution functions for a variety of discreet and continuous random variables. In this section, we will study Excel’s built in mathematical operations relating to the pdf’s of the binomial, the poission, normal and Student’s t distributions.

5.1 Discreet Distribution Functions

5.1.1 Binomial DistributionConsider the following set-up. We have 5 people drawn from a population in which p = 35% are smokers. We want to tabulate the probability distribution function of the number of smokers X in our group of n = 5 people. The easiest way to derive the probability distribution function is by first creating a column containing all possible values of X, in this case (0,1,2,3,4,5):

A B C1 X2 03 14 25 36 47 5

Now, in cell B1 type P(X). In cell B2, type the command

=binomdist(A2, 5, .35, false)

Then hit ENTER. The command “false” indicates that we want a probability distribution function, and NOT a cumulative distribution function. The result is

A B C1 x PX(X=x)2 0 0.1160293 1 0.3123864 2 0.3364165 3 0.1811476 4 0.0487707 5 0.005252

Consider, next, deriving the cumulative distribution function, F(X < x). Type FXX < x) in cell C1, click on cell C2, and type the command

=binomdist(A2, 5, .35, true)

Next, click on cell C2, the hold the cursor above the small black box, click and drag over cells C3-C7. The result, after formatting the cells to show 4 decimal places, is the complete cdf for x = (0,1,2,3,4,5):

14

A B C1 x PX(X=x) FX( x)2 0 0.1160 0.11603 1 0.3124 0.42844 2 0.3364 0.76485 3 0.1811 0.94606 4 0.0488 0.99477 5 0.0053 1.0000

Graphically, we can employ the techniques discussed in Sections 4 and 8 for creating a bar-chart of the pdf and cdf. Consider the pdf contained in cells C2-C7. Highlight the cells C2-C7, and create a bar-chart using said techniques. Remember to change the X-axis values: our random variable’s possible values start at 0, and the cells that contain the possible values, in our case, are A2-A7. The result is:

5.1.2 PoissonThe poisson distribution is helpful for modeling the probability of discreet occurrences of some event, x. In particular, the random variable must be zero or positive and discreet. For example, the number of times a volcano erupts in a year can be assumed to be a poisson random variable.

Recall the probability distribution function of a poisson random variable of x occurrences:

where it can be shown the parameter satisfies

Excel provides a means for deriving P(x) based on x and the mean, . In this case, you can the values of x and contained in cells, or you can simply use Excel to derive a probability by typing the values in directly. Consider the latter quick method, and consider a blank grid spread:

A B1234

Suppose you want to derive P(x = 5) for a poisson random variable x with mean = 2. Click on any cell, say A1, and type

=poisson(5,2,false)

where "false" dictates that we want the probability at the point x = 5, and NOT the cumulative distribution up to x = 5. Hit ENTER, and Excel displays the odds of such an event:

15

A B1 .036089234

If you have an array of values x from the same poisson distribution (i.e. with the same mean and variance, ), then type in cell values instead of numerical values. For example, consider the following array of x's from a poisson distribution where = 2:

A B1 x2 03 14 25 3

Click on cell B2, and type the command=poisson(A2,2,false)

Then hit ENTER. Subsequently, point and click on cell B2, aim the cursor above the small black box, click and drag down over the remaining cells, B3-B5. The result is P(x) for x = 0,1,2,…,9:

A B1 x2 0 0.1353352833 1 0.2706705664 2 0.2706705665 3 0.1804470446 4 0.0902235227 5 0.0360894098 6 0.0120298039 7 0.00343708710 8 0.00085927211 9 0.000190949

We can use the graphical methods described in Section 4 on histograms to create a figure for the above probability distribution. Highlight the cells containing the probabilities, B2-B11, and using the methods described in Section 4 (for the title, and x-axis and y-axis labels), create a bar graph. We obtain:

Intuitively, for a poisson random variable x with a mean of 2, there will be great odds that the actual value of x will be near 1 (i.e. 0,1,2,3,4). Hence, the pdf is stacked near 2 (high probability values at 0,1,2,3,4).

For a cumulative distribution value, simply use "true" instead of "false":

=poisson(A2,2,true)The result is:

A B1 x

16

2 0 0.1353353 1 0.4060064 2 0.6766765 3 0.8571236 4 0.9473477 5 0.9834368 6 0.9954669 7 0.99890310 8 0.99976311 9 0.999954

A simple bar-chart plot of the CDF, above, follows:

5.2 Continuous Distribution FunctionsFor all subsequent probability derivations, the only changes from the above methods in the particular command required. We can derive continuous probabilities P(x a) for normally or t-distributed random variables x by typing in the value a directly, or by using cells that contain various a's.

5.2.1 Normal Distribution

5.2.1.1 Probabilities from any Normal Distribution

Let

and suppose we want to derive

For example, let = 2 ,2 = 16 and a = 3. For direct computation, directly type in any cell

=normdist(3,2,16,true)

where "true" denotes that we want the cumulative distribution up to the value a. The result is .524918.Using the same procedure as with Binomial or Poisson random variables, we can instead use cell

values that contain a rather than directly typing in a.

5.2.1.2 Probabilities from The Standard Normal

Let

and suppose we want to derive

For example, let a = 1.23. For direct computation, directly type in any cell

17

=normsdist(1.23)

The result is .89065.

5.2.1.3 Cut-off Values of the Standard Normal

Suppose instead of a probability, for given a, we want some a for a given probability p. For example, suppose we want some a such that

In any cell in Excel, type

=normsinv(.95)

The result is 1.644853.

5.2.2 Student's t Distribution

5.2.2.1 Probabilities from the t-Distribution

Excel provides a built-in operation for deriving "tail" probabilities of t-distributed random variables, x. We define a "one-tail" probability as, for example,

or

and a "two-tailed" probability as

Suppose we want to derive the two-tailed probability

which is identical to

for a t distributed random variable, x, with 5 degrees of freedom:

For a cumulative probability derivation of a t distributed random variable, x, use the following command:

18

=tdist(1.75,5,2)

where the final "2" denotes that we want a two-tailed derivation rather than a one-tailed probability. The result is .140522.

For a one-tailed probability,

use

=tdist(1.75,5,1)

and we obtain half of the two-tailed probability, .070261.

5.2.2.2 Two-Tailed Cut-off Values from the t-Distribution

As in the normal case, we have an inverse function for t-random variables that returns some a for a given probability p. For example, suppose we want some a such that

In Excel, type

=tinv(.01, 15)

The result is 2.787438.

19

6. Sample Bivariate Analysis: Sample Covariance and Correlation

A primary concern in the analysis of data is not measures of central tendency and dispersion, per se. Rather, we are interested in how economic phenomena relate to each other. For example, we might ask whether income tax increases will diminish net tax revenues collected and therefore threaten the solvency of certain welfare programs. This topic is ultimately left for Econ. 120B and 120C, Econometrics 1 and 2, however the initial step entails the analysis of simple linear relationships by measures of covariance and correlation. Details on the theory of covariance and correlation can be found in Newbold, Chapter 4.

6.1 Sample CovarianceConsider the labor data again, without means, variances and standard deviations present:

A B C D E F1 hours wage ed age income_wage2 34 7.5 12 20 132603 50 20.8 18 37 540804 40 5.75 16 50 119605 30 12.90 14 42 2012467 MEAN 38.50 11.74 15.00 37.25 24856.008 VARIANCE 75.67 45.76 6.67 160.92 392402677.339 STAND. DEV. 8.70 6.76 2.58 12.69 19809.16

We would like to measure the degree of linear of dependence between work hours, wages education and age.

For the sample covariance between the relevant variables, we need to set up an area in the spread-sheet that can contain this information. Here, we will place the covariance information below the mean and variance information. Type in cell A11 the title "COVARIANCE", and re-type the variable names in cells C11 - F11 and B11 - B14:

A B C D E F1 hours wage ed age income_wage2 34 7.5 12 20 132603 50 20.8 18 37 540804 40 5.75 16 50 119605 30 12.90 14 42 2012467 MEAN 38.50 11.74 15.00 37.25 24856.008 VARIANCE 75.67 45.76 6.67 160.92 392402677.339 STAND. DEV. 8.70 6.76 2.58 12.69 19809.161011 COVARIANCE hours wage ed age12 hours13 wage14 ed15 age

We will fill out the table completely in order to make a point below. Click-on cell C12 in order to derive the sample covariance between hours and hours. Type the command

=covar(B2:B5, B2:B5)

20

Excel will use the data in cells B2 - B5 (i.e. work hours) to derive the covariance between hours and itself. Type ENTER to complete the task. Note that the sample covariance between a variable and it self is simply a sample estimate of the variance. However, it is not an unbiased sample estimate of the variance. Specifically, Excel uses the following formula for sample covariance:

Thus, the sample between, say, x and x is

which is a biased estimator of the variance. The unbiased variance estimator that Excel uses is

Next, perform the task for the covariance between hours and wages. Click-on cell D12, and type

=covar(B2:B5, C2:C5)

Then type ENTER. Excel will use the data for work hours (cells B2 - B5) and the data for wages (cells C2 - C5) to derive the sample covariance. Repeat this task for all other combinations of variables (hours, ed), (hours,age), (wage,wage), (wage,ed), (wage,age), (ed,ed), (ed,age) and (age,age). We find, after formatting the numbers to have four decimal places

A B C D E F1 hours wage ed age income_wage2 34 7.5 12 20 132603 50 20.8 18 37 540804 40 5.75 16 50 119605 30 12.90 14 42 2012467 MEAN 38.50 11.74 15.00 37.25 24856.008 VARIANCE 75.67 45.76 6.67 160.92 392402677.339 STAND. DEV. 8.70 6.76 2.58 12.69 19809.161011 COVARIANCE hours wage ed age12 hours 56.7500 26.1063 14.5000 13.385013 wage 34.3217 8.1875 0.003114 ed 5.0000 14.750015 age 120.6875

Now, recall that the covariance is symmetric, hence

which means we can ignore the rest of the cells. For example, the sample covariance between hours and wages in cell D12 is 26.1063. Likewise, the sample covariance between wages and hours would be

21

26.1063, which could redundantly be placed in cell C13. Thus, the above covariance "table" is considered complete.

6.2 Sample CorrelationConsider the correlation between hours and wages. Deriving the sample correlation is performed in a manner identical to the derivation of the covariance, except we use the command in cell D18 (see the presentation, below)

=correl(B2:B5, C2:C5)

Like covariance, the sample correlation is symmetric, thus we do need to find the correlation between wages and hours. Recall that the correlation between a variable and itself is simple 1, thus you only need to perform the operation for couples of different variables. Consider using the cells beneath the covariance cells. We can type "CORRELATION" in cell A17, and use the cells to the right to fill in the correlation derivations. After formatting the cells to allow for 5 decimal places, we obtain the following:

A B C D E F1 hours wage ed age income_wage2 34 7.5 12 20 132603 50 20.8 18 37 540804 40 5.75 16 50 119605 30 12.90 14 42 2012467 MEAN 38.50 11.74 15.00 37.25 24856.008 VARIANCE 75.67 45.76 6.67 160.92 392402677.339 STAND. DEV. 8.70 6.76 2.58 12.69 19809.161011 COVARIANCE hours wage ed age12 hours 56.7500 26.1063 14.5000 13.385013 wage 34.3217 8.1875 0.003114 ed 5.0000 14.750015 age 120.68751617 CORRELATION hours wage ed age18 hours .59153 .86080 .1616119 wage .62500 .0000520 ed .6004521 age

22

7. Interval Estimation

Consider the task of estimating a 95% Confidence Interval for the true mean of some variable. Recall that the confidence interval itself is

where K satisfies

Excel assumes the variable in question is either normally distributed, or the sample size is large enough so that, by the Central Limit Theorem (see Newbold, Chapter 6), the sample mean will be nearly normally distributed .

After some simplification (see Newbold, Chapter 8), we can solve directly for K based on Excel's distributional assumptions:

Of course, we do not know the true standard deviation, thus we must give Excel a sample estimate of .In order to calculate a 95% confidence interval length, K, perform the following. Suppose the

sample size is 4, as it is in the previous section, and suppose an estimate of the standard deviation is contained in cell B9, as it is for the variable hours, above. Click-on an empty cell (say, cell B24 for the variable hours), the type

=confidence(.05, B9, 4)

then hit ENTER. Notice that the above command is specifically contoured to our needs. The general command

format is

=confidence(, , n)

where satisfies the (1 - )% confidence interval size, denotes the true standard deviation, and n denotes the number of individuals in our group.

Excel will derive K and place it in the chosen cell. For clarity, we can type "95% CI: K" in cell A24, “95% Lower” in A25 and “95% Upper” in A26. Notice that we re-typed the variable names for ease of reference. The result is


23

7 MEAN 38.50 11.74 15.00 37.25 24856.008 VARIANCE 75.67 45.76 6.67 160.92 392402677.339 STAND. DEV. 8.70 6.76 2.58 12.69 19809.1610…2223 hours wage ed age income_wage24 95% CI: K 8.52451625 95% Lower26 95% Upper

Thus, the 95% CI for the sample of hours is

We can directly create the lower bound by typing in cell B25

=B7 – B24

and we can directly create the upper bound by typing in cell B26

=B7 + A24

We can now proceed as usual to perform the same CI operation on the rest of the cells, C24 - F24. As usual, click-on cell B24, place the cursor above the small black-box in the lower right corner, click, hold and drag over C24 - F24. The result, after formatting the cells to 4 decimal places, is

A B C D E F1 hours wage ed age income_wage2 34 7.5 12 20 132603 50 20.8 18 37 540804 40 5.75 16 50 119605 30 12.90 14 42 2012467 MEAN 38.50 11.74 15.00 37.25 24856.008 VARIANCE 75.67 45.76 6.67 160.92 392402677.339 STAND. DEV. 8.70 6.76 2.58 12.69 19809.1610…23 hours wage ed age income_wage24 95% CI: K 8.5245 6.6294 2.5303 12.4313 19412.587825 95% Lower 29.9755 5.1106 12.4697 24.8187 5443.412226 95% Upper 47.0245 18.3694 17.5303 49.6813 44268.587827

Comment:For a 99% CI, replace the ".05" with ".01". For a 90% CI, replace the ".05" with ".10".

24

8. Graphics: Charts, Tables and Figures

8.1 Scatter Plots: Two Variable Relationships

A scatter-plot present points (X,Y) in two-dimensional Euclidean space. For example, in the (WAGE,ED) space, a point would represent an individuals hourly wages and level of education:

ED 15 14 13 12 11 10 9 8

4 5 6 7 8 9 10 11 12 13 14 15 16 WAGE

Consider a sample of n = 10 years of aggregate U.S. health care expenditure, EXP_HEALTH, and real income, INCOME.

A B1 INCOME EXP_HEALTH2 9.3 0.9983 11.2 1.4994 17.1 4.2855 13.8 1.5736 10.9 2.0217 15.3 2.268 12.8 1.9539 14.6 2.10310 21.2 3.42811 19.3 2.277

We want to create a scatter-plot with income on the X-axis and health care expenditures on the income on the X-axis and health care expenditures on the Y-axis.

25

This point denotes an individual's (WAGE, ED) = ($5.25/hour, 13 years)

1. Click on cell A1 (the label INCOME), hold and drag over cells A1-A11 and B1-B11. That is, highlight the labels and data for expenditure and income:

A B

1 INCOME EXP_HEALTH

2 9.3 0.998

3 11.2 1.499

4 17.1 4.285

5 13.8 1.573

6 10.9 2.021

7 15.3 2.26

8 12.8 1.953

9 14.6 2.103

10 21.2 3.428

11 19.3 2.277

2. Next, go to the task bar at the top of the screen, click on INSERT, then CHART. A pop-up bax appears: for a scatter-plot, on the left side of the pop-up box, click on “standard types”, then “XY (Scatter)”. Click on NEXT at the bottom of the box: Excel will show you the graph with its default style. We will add a title, axes labels and remove the grid-lines, next.

3. Click on NEXT again. Click on TITLE, and type “Income versus Health Expenditure”. Under Category (X) axis, type “Income”; under Value (Y) axis, type “Health Expenditure”. Click on GRIDLINES, and remove any check-marks by clicking on them: this will turn off all gridlines. Finally, click on LEGEND, and remove the check-mark in the legend box by clicking on it.

The result is the following polished scatter plot:

Comment

26

1. Notice that a very clear linear relationship seems to exist: as real income increases, consumers can afford more health care and subsequently purchase more health care. Indeed, the sample correlation between the two economic variates is .74, a strong positive value.

2. For editing (point colors, X-axis, background color, etc.), see the discourse under Time Series, below).

8.2 Time-Series: Line Plots: Multiple Variable Representations, PDF's, CFD'sOften, or data exists over time. Examples include GNP, the CPI, production indices, inflation rate, exchange rates, asset returns and stock market indices, etc. It is often usual to present such information in a concise manner on the same graph.

Consider data U.S. GDP (in billions $), the money supply MS (in billions $) and gross private domestic investment GPDI:

A B C D1 Year GDP MS GPDI2 1970 1010.7 628.1 150.33 1971 1097.2 717.2 175.54 1972 1207 805.2 205.65 1973 1349.6 861 243.16 1974 1458.6 908.6 245.87 1975 1585.9 1023.3 2268 1976 1768.4 1163.7 286.49 1977 1974.1 1286.6 358.310 1978 2232.7 1388.7 43411 1979 2488.6 1496.7 480.212 1980 2708 1629.5 467.613 1981 3030.6 1792.9 55814 1982 3149.6 1951.9 503.415 1983 3405 2186.1 546.716 1984 3777.2 2374.3 718.917 1985 4038.7 2569.4 714.518 1986 4268.6 2811.1 717.619 1987 4539.9 2910.8 749.320 1988 4900 3071.1 793.621 1989 5250.8 3227.3 832.322 1990 5522.2 3339 799.523 1991 5677.5 3439.8 721.1

In the following discourse, we will look at various graphical presentations of the above data.

8.2.1 Smooth Lines

Creating the FigureHighlight the data and labels by clicking on cell A1, hold and drag over cells A1-D23. The "year" variable has already been formatted to be understood by Excel to be, in fact, a year. When we include this variable in a graph, Excel will not plot it: rather, Excel will use the year-formatted data for a time-line on the X-axis.

Next, go to the task bar, click-on INSERT, then CHART, then COSTUM TYPES, scroll down, and finally click-on Smooth Lines.

Click on NEXT: Excel shows you the default style of the figure. Click on NEXT again: a pop-up box appears for title and label editing. Click on TITLE, and type under Chart Title, "U.S. Macro Data: 1970-1991", or whatever you deem appropriate. Under Value (Y) Axis, type "Billions of $": this will provide a Y-axis label explaining that the data is in billions of U.S. dollars (i.e. GDP = 1010.7 denotes 1010.7billion = 1.0107 trillion dollars).

27

Finally, click on FINISH. Excel will present you the following figure:

8.3 Editing Graphs and FiguresConsider the figure we created for Gross Domestic Product and Investment data:

The year orientation is difficult to read, and we might want to change the yellow color of the line representing GDPI over time to some darker hue. Also, for style (according to your personal interests as well as readability for the viewer of your figure), we may want to remove the dark background color, or eved remove the outline boundary line of the entire figure, as well as re-locate the legend, and change the font of the test used.

Line ColorIn order to change a line's color, double click on the line itself, and a pop-up box appears. Click on PATTERN, then COLOR, and chose a darker hue: here, we will choose a shade of green.

Background AreaIn order to remove the dark background, double click on the dark background itself, go to AREA and click on NONE.

28

BordersI prefer figures without outer boundaries. To create this effect, double click-on the white-area between the outer box and the inner box. A pop-up box appears: go to BORDER, and click on NONE. Also, we can remove the border around the legend: double-click on the legend, in the pop-up box, under BORDER, and click on NONE,

X-Axis Label and OrientationFinally, to orient the year denominations, double click on the X-axis itself, then click on ALIGNMENT, and notice that a little line appears sprouting from the word Text . Click on this line, hold and drag the line to an upward orientation. If you want the numerical values on the Y-axis to display dollar signs, double click on the Y-axis, click on Number, Currency, go to Symbol, and click on $. Next to Decimal Places, because we like round numbers on the axes, type 0. We obtain:

FontsIf you prefer a different than the default font (usually 12-point), click ONCE on the white-area between the outer and inner boxes. Excel will "highlight" the figure by placing small black boxes in the outer corners. Next, at the top of the screen, in the task bar, change the font size to 10, or 8, or what-have-you. I prefer 10-point font.

Using the Left and Right Y-Axes

29

Finally, some data is difficult to see because we represent it on the same Y-axis as data with a very different size structure. For example, above, clearly GDP is much larger than GDPI, and therefore some of the small year-to-year nuances of GDPI is lost in the large Y-axis used for GDP. There are, however, two Y-axes, the left and standard Y-axis, and the right Y-axis. Consider placing GDPI on its own Y-axis, to the right. Click on the green, GDPI, line: a pop-up box appears. Click on Axis, then click on Secondary. Excel will orient the GDPI data on the right-side Y-axis, using a more convenient range of numerical values that befits the smaller values of the GDPI. Now, we can see specific year-to-year changes that were somewhat over-shadowed, above. As before, we can double click on the right Y-axis, click on Number, Currency, go to Symbol, and click on $.

Figure TitleIf you want to enhance the figure title (increase font size, use italics, bold, etc.), click on the title

once, place the cursor anywhere above the title characters (the cursor should change in shape to a little line), and click. Excel will now allow you to edit the title. You can highlight the title, and using the task bar at the top of the screen, change the characters to italics, bold, change the font, etc. We will change the font to 12-point and use bold:

30

ExcelTM and Statistical Analysisfaculty.fiu.edu/~bidarkot/eco4421/excel_info_4421.doc · Web...

Documents

Transcript of ExcelTM and Statistical Analysisfaculty.fiu.edu/~bidarkot/eco4421/excel_info_4421.doc · Web...