ExcelHandbk

Copyright © by Todd Easton & Stela Dhami 2007

Excel Skills for Bus. 500, “Statistical and Quantitative Analysis”

by Todd Easton & Stela Dhami

8 April 2023

Please send suggested improvements to [email protected] .

Table of Contents

1 CREATING FREQUENCY DISTRIBUTIONS AND HISTOGRAMS

1.1 Absolute frequency distribution
1.1.1 Find the minimum, maximum, and the range
1.1.2 Establish the classes
1.1.3 Create the absolute frequency distribution table and histogram

1.2 Relative frequency distributions
1.2.1 Create the table
1.2.2 Create the chart

2 CREATING PIVOT TABLES

2.1 A two-way table with counts

2.2 Column percentages

2.3 Row percentages

2.4 Grouping rows together

3 FIGURING PROBABILITIES WITH EXCEL FUNCTIONS

3.1 BINOMDIST

3.2 NORMDIST

3.3 TDIST

4 PERFORMING LINEAR REGRESSION

4.1 Simple Linear Regression

4.2 Multiple Linear Regression

A format note: In what follows, italic script always indicates the labels on Excel commands that you need to select or click on.

We wrote this document to help students grasp and review crucial Excel skills for Bus. 500. Because this course is meant to be about applied statistics, not Excel, each Excel skill is usually taught only once. If you have trouble learning a skill, or if you learn a skill and then forget it, this handbook can be a valuable resource.

For most students, it will be best to learn (or review) these skills actively. To make that easier, there is an Excel workbook to accompany this document. It contains the raw data you can use to create the frequency distributions, pivot tables, and linear regressions in sections 1, 2, and 4 of the document. You can find the workbook at teaching.up.edu/BUS500/ExcelHandbkData.xls.

The data in the first worksheet of the workbook are from the 2000 US Census. The wage income data in Column E of this worksheet are used to create the tables and graphs in Section 1. The race data in Column D, along with the county of residence data in Column F, are used to create the pivot tables in Section 2. The data in the workbook’s second sheet represent a random sample of the cars produced in the US in 1997. Section 4 uses these data to estimate linear regressions predicting fuel efficiency.

1 Creating Frequency Distributions and Histograms

A frequency distribution is a table that describes the distribution of a numerical variable. For example, a table might show how many people in a city earn low incomes compared to the number earning high incomes. To do this, a frequency distribution table divides the range of incomes in the city into a number of contiguous classes, for example $0 to $9,999, $10,000 to $19,999, et cetera. If a table provides counts of the number of people in each class, it is called an absolute frequency distribution. If it provides the percentage of the population that falls in each class, it is called a relative frequency distribution. If a bar graph is created from a frequency distribution, with the bars’ heights illustrating the number of cases in each class (or the percentage of all cases in each class), we call it a histogram.

To illustrate how to create frequency distributions and histograms, we use data from the 2000 Census for Portland, Oregon. These data are the result of sampling approximately 7.5% of the records from Portland generated by the 2000 Census. The universe sampled was all people aged 16 to 65 in the densely populated portions of the Portland metropolitan area. Because the Census truncates incomes at $200,000, the 23 people who reported incomes over $200,000 were eliminated from the sample. The Census is a stratified random sample (not a simple random sample), but that complication is ignored here.

1.1 Absolute frequency distribution

To create an absolute frequency distribution, along with the corresponding histogram, one needs to take three steps:

1) find the minimum, maximum, and the range for a data set,
2) establish the classes the data set will be sorted into, and
3) use Excel’s Histogram procedure to create a table and a graph describing the data’s distribution.

What follows explains how to take these steps and shows you what results each should produce.

1.1.1 Find the minimum, maximum, and the range

To create an absolute frequency distribution, open the Excel workbook where the data are located (ExcelHandbkData.xls). Select the data of interest, including the data’s label. In the Portland Census data, the variable of interest is WageInc (wage and salary income). After you have done that, you should see something like the following on your screen:

To keep the resulting output simple, copy the column and paste it into the next sheet, where you will perform the relative and absolute frequency distribution analysis. A quick way of doing this is by clicking on cell E1, and then depressing the Control, Shift, and down-arrow keys simultaneously. This will allow you to select the entire column of data E1:E2603. Copy the data and paste into cell A1 of the next sheet.

Prior to building a histogram, you need some information to help build it: the minimum, maximum, and range for the data. To have Excel calculate these descriptive statistics, click on Tools, Data Analysis, and then on Descriptive Statistics. At that point, you should see this:

Now, click on OK. You should see a new dialog box:

Fill out the box in four steps:

1) Click on the small button with the little red arrow to the right of “Input Range”. Then, select the input data (cells A1:A2603).
2) Click in the box to the left of “Labels in First Row,” to let Excel know you included a label when you selected the data in Step 1).
3) Click in the circle to the left of “Output Range,” and then click on the tiny box to its right. After that, click in an area of the spreadsheet where the descriptive statistics can be displayed. Select one cell. That cell will become the upper-left-hand corner of the area in which the descriptive statistics will be displayed.
4) Click in the box to the left of the label “Summary statistics.”

After doing these four things, you should see the following:

Now click on OK. Excel will calculate the descriptive statistics for the data you selected. You can beautify this table in two steps.

1) Get rid of distracting decimals. Select the numbers Excel calculated, click on Format, and then click on Cells. Now, select Number, and then set decimal places to zero.
2) Avoid excessive column width. First, select both the labels and numbers. Second, click on Format, Column, and then on Autofit Selection.

That should give you this descriptive statistics table:

1.1.2 Establish the classes

At this point, you have enough information to establish the classes you will use for your table. Generally, you should use between 5 and 15 classes. For small data sets, pick a number of classes at the small end of that range. For large data sets, pick a number of classes at the large end. The class width should be the same for all classes, and should be an easy-to-interpret number (e.g. $1,000 would be better than $940).

To determine class width, take the range and divide it by the desired number of classes. In this case, suppose you use 10 classes. That would give you a class width of $17,000. Using that width, the following classes would make sense: incomes less than or equal to $0; incomes more than $0, but less than or equal to $17,000; incomes more than $17,000, but less than or equal to $34,000, et cetera.
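The class-width arithmetic can also be checked outside Excel. The Python sketch below is only an illustration: the function name make_bins, the rounding rule, and the minimum and maximum of 0 and 170,000 are assumptions made for the example, not values read from the Descriptive Statistics output.

```python
import math

def make_bins(minimum, maximum, n_classes, round_to=1000):
    """Return (class_width, bins); each bin is the top value of one class."""
    raw_width = (maximum - minimum) / n_classes
    # Round the width up to an easy-to-interpret multiple of round_to.
    width = math.ceil(raw_width / round_to) * round_to
    # The first bin holds values <= minimum; the rest step up by one width.
    bins = [minimum + i * width for i in range(n_classes + 1)]
    return width, bins

# Assumed minimum and maximum, for illustration only.
width, bins = make_bins(0, 170_000, 10)
print(width)     # 17000
print(bins[:3])  # [0, 17000, 34000]
```

Rounding the raw width up to a multiple of 1,000 is one way to get the "easy-to-interpret number" recommended above; other rounding rules would work just as well.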

You communicate these classes to Excel using what Excel calls “bins.” Each bin is the top value of its class, so the bins corresponding to the classes named above would be $0, $17,000, $34,000, et cetera. Type the label, Bins, and the relevant bin values into a column on your spreadsheet, as is demonstrated below:

1.1.3 Create the absolute frequency distribution table and histogram

Next, click on Tools, Data Analysis, Histogram, and then OK. The following box will appear:

Before Excel can create the table and histogram, the input range and the bin range must be selected. Use the same method you used to select the data and output ranges for the Descriptive Statistics Tool. After you have done that, you should see this:

Make sure that the “Chart Output” option is selected. This will direct Excel to create both a table and a graph of the absolute frequencies of the incomes in the data.
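For readers who want to check the tool's counts by hand, here is a minimal Python sketch of the counting rule described earlier: a value is counted in the first bin it is less than or equal to, and anything above the last bin goes in a final slot that Excel labels "More". The income values and the function name histogram_counts are hypothetical.

```python
from bisect import bisect_left

def histogram_counts(values, bins):
    """Count values per bin; bins are inclusive upper bounds in ascending order."""
    counts = [0] * (len(bins) + 1)  # one extra slot for values above the last bin
    for v in values:
        # bisect_left finds the first bin that is >= v.
        counts[bisect_left(bins, v)] += 1
    return counts

# Hypothetical incomes, not taken from the Census workbook.
incomes = [0, 0, 12_000, 17_000, 17_001, 30_000, 34_000, 52_500]
bins = [0, 17_000, 34_000, 51_000]
counts = histogram_counts(incomes, bins)
for label, n in zip(bins + ["More"], counts):
    print(label, n)
```

Note that 17,000 is counted in the $17,000 bin while 17,001 falls in the $34,000 bin, which matches the "more than ... but less than or equal to" wording of the classes.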

Click on “OK” and the following table and histogram should appear:

Left-clicking on the histogram and dragging its lower edge down will make it more legible. Right-clicking on the “Frequency” legend, and then left-clicking on “Clear,” will get rid of the legend. Left-clicking on the chart title (Histogram), and then left-clicking again right after the letter “m,” will allow you to create a more descriptive title. At that point, you might see the following:

1.2 Relative frequency distributions

The graph above would certainly be useful, but what if we want a table or graph displaying relative frequencies (percentages) rather than absolute frequencies (counts)? To do this, we would need to:

1) create the table by calculating relative frequencies from the counts made by the Histogram Tool, and
2) use Excel’s chart wizard to create a chart from these relative frequencies.

Below we describe how to create the table and the chart.

1.2.1 Create the table

We can compute the relative frequencies in three steps:

1) Drag the histogram to the right three columns (to create some room to work in).
2) Paste a copy of the Bins to the right of the absolute frequency distribution table.
3) Divide the first absolute frequency by the total (from the “Count” produced by the Descriptive Statistics Tool), to get a relative frequency, placing the relative frequency to the right of the first bin (the one having the value of zero).

After following these steps, we should see this:

To complete the relative frequency table:

1) Copy the formula down the column,
2) Type the word “Total” below the last bin and the words “Relative Frequency” just above the column,
3) Sum the relative frequencies into the cell to the right of the word “Total,”
4) Select all the relative frequencies and the total, right-click on the selected area, left-click on Format Cells, left-click on Percentage, set “Decimal places” to one, and left-click on OK.
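The calculation behind these steps is simply each count divided by the total count. A short Python sketch, using hypothetical counts rather than the ones the Histogram Tool produces for the Census data:

```python
def relative_frequencies(counts):
    """Divide each absolute frequency by the total number of cases."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical absolute frequencies for five classes.
counts = [500, 900, 700, 400, 100]
rel = relative_frequencies(counts)
for c, r in zip(counts, rel):
    print(f"{c:>5}  {r:.1%}")          # one decimal place, like the Percentage format
print(f"Total  {sum(rel):.1%}")
```

As in the Excel table, the relative frequencies should sum to 100%, which is a quick check that no class was left out.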

That should leave us with the following:

1.2.2 Create the chart

After calculating the relative frequencies, you can create a graph of them as follows.

1) Select the relative frequency data (but not “100%”).
2) Click on the Chart Wizard icon. (If you have the Standard Toolbar installed, you will see its icon toward the right end of the toolbar, labeled with a miniature bar graph.)* Next, choose the first “Column” option.
3) Click Next.
4) Go to the Series tab, click on the small button with the little red arrow to the right of “Category (X) axis labels,” select the numbers “0” through “170000,” and then click again on the small button.

At that point, you should see the following:

* If you do not see the icon, click on Insert and then on Chart.

Then left-click on Finish. Pretty the graph up and you will see something like this:

Congratulations! You have now created tables and charts for both absolute and relative frequency distributions!

2 Creating Pivot Tables

Excel’s PivotTable tool quickly creates what are variously called pivot tables, two-way tables, contingency tables, or crosstabulation tables. A pivot table allows you to explore the relationship between two categorical variables.†

To illustrate the use of pivot tables, we use the same data set used previously to create frequency distributions of incomes in Portland, Oregon. This time we use the data to explore the ethnic composition of the three metropolitan counties.

The following instructions will show you how to create a simple pivot table, and then modify it in three useful ways. The first table summarizes the number of people in each ethnic group in each county.

An explanation of each variable in the Portland dataset

Variable     Explanation
PUMA         Public Use Microdata Area: a metropolitan area subset that contains about 100,000 people
Age          Person's age in years
Sex          Equal to 1 for males and 2 for females
RaceG        Designates which general racial/ethnic group the person belongs to:
             1 White
             2 African American
             3 American Indian
             4 Chinese
             5 Japanese
             6 Other Asian or Pacific Islander
             7 Other race, n.e.c.
             8 Two major races
             9 Three or more major races
WageInc      Wage and salary income reported for 1999
MetCounty    The county the person lives in:
             1 Multnomah
             2 Clackamas
             3 Washington

† Through the use of Page Fields, one can use the PivotTable tool to explore the relationship between three or four categorical variables, but this document does not introduce the use of Page Fields.

2.1 A two-way table with counts

To create a pivot table, the first step is to select all the data you will include in the table. Begin with the Ethnicity by County worksheet. Select all data in columns A and B. Click on Data, and then on PivotTable and PivotChart Report. The PivotTable and PivotChart Wizard will open:

Leaving the defaults in place, click Next. Excel will automatically use the data you selected as the range to be entered into the PivotTable tool, so you should see the following:

Click Next, and Finish, and you will see this:

This screen allows you to quickly select the variables that will be included in your pivot table. Beginning in the PivotTable Field List box, drag the variable label RaceG and drop it in the “Drop Row Fields Here” box. Do the same with MetCounty, but drop it where you see “Drop Column Fields Here.” That should get you this:

At this point you have established the two variables that will be used to crosstabulate the cases in your sample. To actually have Excel do the crosstabulation, you need to drop a variable label in the center of the table, where it says, “Drop Data Items Here.” It does not matter which variable you drop, but suppose you select MetCounty. After dropping it in the center of the table, you will likely see:

Notice how the text in cell A3 says, “Sum of MetCounty.” Excel’s default is to sum the codes for the variable that was dropped in the center of the table. That is fine if the variable measures the dollar cost of something, and you want to display the total. However, we wish to count cases, not sum variable values. To replace sums with counts, double left-click on the “Sum of MetCounty” label. A new dialog box should open up:

In the “Summarize By” box, click on “Count,” and then on OK. You should see this:

Note that the values in the column labeled 2 (for MetCounty 2) are half as big as before, and that the values in the column labeled 3 are one-third as big as before.
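The halving and thirding follows from the county codes themselves: every case in county 2 adds 2 to a sum but only 1 to a count. The Python sketch below illustrates this with a handful of invented (RaceG, MetCounty) records; it is not the PivotTable tool's own procedure, just the same tallying done by hand.

```python
from collections import defaultdict

# Hypothetical records: (RaceG code, MetCounty code).
records = [(1, 1), (1, 1), (1, 2), (1, 2), (4, 3), (4, 3), (4, 3), (2, 1)]

sum_table = defaultdict(int)    # what "Sum of MetCounty" shows
count_table = defaultdict(int)  # what "Count of MetCounty" shows
for race, county in records:
    sum_table[(race, county)] += county  # adds the county code itself
    count_table[(race, county)] += 1     # adds one per case

for cell in sorted(count_table):
    print(cell, "sum:", sum_table[cell], "count:", count_table[cell])
```

In every cell the sum equals the count multiplied by the county code, so only the column for county 1 looks the same under both summaries.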

To make the table easier to interpret, replace the numerical variable codes with their corresponding labels. To do this, type each label into its corresponding box. Doing that will get you this final pivot table, showing how the number from each ethnic group in the sample differs among the three counties.

2.2 Column percentages

For each county, the simple pivot table we just completed tallied the number of people in the sample in each ethnic category. What if your goal was to compare the ethnic composition of the three counties? This count table would not be an ideal tool, because the total number of people differs greatly among the counties. To make the table more helpful, you could have Excel calculate column percentages.

To do that, double left-click on “Count of MetCounty,” to get the following:

Next, click on Options, and then find “% of column” in the “Show data as” window. Now you should see this:

Finally, click on OK to see the following:

2.3 Row percentages

The table we just finished allows us to compare the ethnic compositions of the three counties in our sample. For example, examine the first row. It tells us that 92% of the sample individuals from Clackamas County are white, while only 81% of the Multnomah County sample is white.
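A column percentage is a cell count divided by its column (county) total. The Python sketch below shows the calculation on a tiny two-row table. The counts are invented so that the White shares come out at 81% and 92%, echoing the figures quoted above; they are not the sample's actual counts, and the function name column_percentages is hypothetical.

```python
def column_percentages(table):
    """Divide each cell by the total of its column."""
    col_totals = {}
    for row in table.values():
        for col, n in row.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        group: {col: n / col_totals[col] for col, n in row.items()}
        for group, row in table.items()
    }

# Invented counts for illustration only.
counts = {
    "White":            {"Multnomah": 81, "Clackamas": 46},
    "All other groups": {"Multnomah": 19, "Clackamas": 4},
}
pct = column_percentages(counts)
for group, row in pct.items():
    print(group, {county: f"{share:.0%}" for county, share in row.items()})
```

Because each column is divided by its own total, counties of very different sizes can be compared directly.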

What if you were interested in opening a market in Portland that catered to Chinese people, so that you wanted to find where most Chinese people lived? If that were your goal, you would want to calculate row percentages rather than column percentages. Luckily, it is easy to get Excel to make the switch.

First, you would double click on “Count of MetCounty.” Next, you would click on Options, and then find “% of row” in the “Show data as” window. Finally, you would click on OK. Having done all of that, you would see the following:

Notice that the row to the right of the “Chinese” label tells us that 48% of the Chinese people in the sample live in Washington County. Maybe you should investigate locations within Washington County for your market!
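A row percentage divides each cell by its row total instead of its column total. In the Python sketch below the counts are invented so that the Washington County share comes out at 48%, the figure quoted above; they are not the sample's actual counts, and row_percentages is a hypothetical helper name.

```python
def row_percentages(table):
    """Divide each cell by the total of its row."""
    result = {}
    for group, row in table.items():
        row_total = sum(row.values())
        result[group] = {col: n / row_total for col, n in row.items()}
    return result

# Invented counts for illustration only.
counts = {"Chinese": {"Multnomah": 20, "Clackamas": 6, "Washington": 24}}
pct = row_percentages(counts)
for county, share in pct["Chinese"].items():
    print(county, f"{share:.0%}")
```

Each row now sums to 100%, so the table answers "where do members of this group live?" rather than "who lives in this county?".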

2.4 Grouping rows together

When you create a pivot table, it will often provide more detail than you think is necessary. If you want a reader to quickly see the point of a table, show only the necessary information. For example, suppose you decided that your market should cater to all Asians, and also to Pacific Islanders. It might be useful, in that case, to combine three groups (“Chinese,” “Japanese,” and “Other Asian or Pacific Islander”) into one.‡

‡ To see labels for these three groups, look in rows 8, 9, and 10 of the illustration above.

To do this, begin by highlighting the three labels you wish to combine. After that, right click the selected area with your mouse, select Group and Show Detail, and select Group. That should give you this:

If you now left click, these dialog boxes will close. They may be replaced by a message that begins, “Do you want to replace the contents of the destination cells in…” If you get that message, click on No. That will leave you with the following:

To see row percentages combined for “Group 1”, rather than for each of its ethnic groups separately, right click in the rectangle labeled “Group 1.” Then, select Group and Show Detail, then Hide Detail. You should see a table that looks like this:

To neaten things up, delete “Group 1” and type the label “Asian and Pacific Islander” in its place. Next, left-click on “B” at the top of the second column, then right-click and select Hide. Your final pivot table should look like the following:

The fourth row of the table suggests your Pan-Asian market should be in Multnomah County!

3 Figuring Probabilities With Excel Functions

We will use three Excel functions to calculate probabilities: BINOMDIST, NORMDIST, and TDIST. BINOMDIST figures probabilities using the binomial distribution, NORMDIST figures probabilities using the normal distribution, and TDIST figures probabilities using the t distribution.

3.1 BINOMDIST

The binomial distribution is a discrete probability distribution. Our text suggests we think of a binomial distribution as being generated by a sampling process with the following characteristics:

• The sample consists of a fixed number of observations, n
• Each observation is classified into success or failure
• The probability that an observation is classified as a success is p, while the probability it is classified as a failure is 1-p
• Each observation is randomly selected either from an infinite population without replacement or from a finite population with replacement

Here is an example of a situation you could describe using a binomial distribution: you wish to analyze reports of phone problems to the call center of a telephone company. In particular, you are interested in the first three calls of the day, and in the likelihood that the company successfully resolves exactly two of those problems before the day is over.

Suppose this situation can be analyzed with the binomial distribution. That is another way of saying that:

a) we have three observations (the first call, the second call, and the third), b) we can classify each observation as a success or a failure (the company repairs the phone by the end of the day or it does not), and c) the probability of success (the chance of repairing a phone by day’s end) is the same for each observation.

To be concrete, suppose .7 is the probability of success. Given that, we could use the Excel function BINOMDIST to figure this probability as follows:

One can use the function by typing the necessary text, =BINOMDIST(2,3,.7,FALSE), into a cell and pressing the Enter key.§ Doing so will cause .441 to be displayed. If you type the same text preceded by a single quotation mark, '=BINOMDIST(2,3,.7,FALSE), Excel will display the text itself. The single quotation mark tells Excel to treat what follows it as text, rather than as a formula to be calculated.
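The .441 can be verified from the binomial formula, which multiplies the number of ways to choose which observations succeed by the probability of any one such arrangement. A minimal Python sketch (the function name binom_pmf is ours, not Excel's):

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x successes in n observations."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Three calls, probability .7 of resolving each, exactly two resolved.
print(round(binom_pmf(2, 3, 0.7), 3))  # 0.441
```

Here comb(3, 2) is 3, so the result is 3 × .7 × .7 × .3 = .441, the same value BINOMDIST displays.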

One can also invoke BINOMDIST by using the function wizard. To do this, click on Insert and then on Function. Then, select “Statistical” in the second window of the Insert Function dialog box. Finally, select “BINOMDIST” in the third window of the dialog box.

At this point, you should see something like this:

§ This notation means that Excel is calculating the probability that 2 out of 3 observations are successes, when the probability of success is .7. The FALSE tells Excel to calculate the probability of exactly two successes.

Next, click on OK and fill in the blanks in the Insert Function dialog box, using the same values as before (2, 3, .7, FALSE).

You should see this:

If you then click OK, you can see the answer is the same as above: .441. (The text of the formula in cell B18 is reproduced in C18 to make it visible.)

Let us look at one more example. If we switch the last argument of the BINOMDIST function to TRUE, then it calculates the probability using a cumulative distribution function. Instead of calculating the probability of exactly x successes, it calculates the probability of x or fewer successes. Here is the example:
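To make the difference between FALSE and TRUE concrete, the Python sketch below adds up the probabilities of 0, 1, ..., x successes. It reuses the call-center numbers (n = 3, p = .7) as an assumption for illustration; the name binom_cdf is ours.

```python
from math import comb

def binom_cdf(x, n, p):
    """Probability of x or fewer successes in n observations."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

# Probability that two or fewer of the three problems are resolved.
print(round(binom_cdf(2, 3, 0.7), 3))  # 0.657
```

The result equals .027 + .189 + .441, and it is also 1 minus the probability that all three problems are resolved (1 - .343).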

3.2 NORMDIST

NORMDIST finds (for a particular normal distribution) the probability that a normally distributed random variable takes on a value less than its first argument.

The second argument of NORMDIST is the mean of the particular normal distribution being evaluated.

The third argument is the standard deviation of the particular normal distribution being evaluated.

The fourth argument should be TRUE (or 1) if you want to find the probability that a normally distributed random variable is less than the first argument.**

As an example, consider the following problem from Levine, Stephan, Krehbiel, and Berenson (our text). Suppose the fees mutual funds charge are normally distributed. In a particular year, the mean fee for a fund was .93% of the value of a fund’s assets. The standard deviation of the fees in that year was .30%.

Suppose we select a fund at random. To find the probability that the fund selected charged less than 1% of the value of its assets, we type the following text into a cell: =NORMDIST(1,.93,.3,1). Pressing the Enter key will give the desired result (.592).

mean = 0.93%
standard deviation = 0.30%

a) NORMDIST(1,0.93,0.3,1) = 0.592 is the probability that a fund’s expense fee is less than 1%.
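The same probability can be reproduced with Python's standard library, which may help in checking a formula typed into Excel. This is only a cross-check, not part of the Excel procedure; the variable names are ours.

```python
from statistics import NormalDist

# Fees are assumed normal with mean .93 and standard deviation .30 (in percent).
fees = NormalDist(mu=0.93, sigma=0.30)

p_below = fees.cdf(1.0)   # like NORMDIST(1, .93, .3, TRUE): cumulative probability
density = fees.pdf(1.0)   # like NORMDIST(1, .93, .3, FALSE): height of the curve
print(round(p_below, 3))  # 0.592
```

The cdf call corresponds to the fourth argument being TRUE, and the pdf call to it being FALSE.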

One can also invoke the NORMDIST function using the function wizard. To do this, click on Insert and then on Function. Then, select “Statistical” in the second window of the Insert Function dialog box. Finally, select “NORMDIST” in the third window of the dialog box, and click on OK.

At this point, you should see the following:

** If you make it FALSE (or 0), then it will give you the value of the normal probability density function at X (the height of the function above the X-axis).

3.3 TDIST

We use the t distribution when we are testing a hypothesis about a population mean and we do not know the standard deviation of the population. In particular, we use it when we utilize the p-value approach to testing the hypothesis.

When we know the population standard deviation, we implement the p-value approach with the standard normal distribution. However, if we must estimate the population standard deviation from the sample, the standard normal distribution will not give us an accurate p-value. This is because it fails to account for the additional uncertainty introduced into the hypothesis testing process by that estimate. We use the t distribution, rather than the standard normal distribution, to take this additional uncertainty into account.

For example, suppose we have a sample of 49 two-liter bottles from a soft drink bottler’s production line. Our null hypothesis is that the population mean amount of soda in a bottle is 2 liters. For our 49-bottle sample, suppose the mean content is 2.01 liters, with a standard deviation of .114 liters.

To calculate the p-value to test this null hypothesis, we figure out how likely it is that we get a sample mean as extreme as—or more extreme than—2.01 liters if the null hypothesis is true. Since we are working with the t-distribution, rather than the standard normal distribution, we cannot find the relevant probability in a table. The standard normal table gives us probabilities corresponding to a huge number of possible values for z. The table of "Critical Values of t" can't do the same, because there is not just one t-distribution. There is one t-distribution for each possible value the degrees of freedom can take on.

Luckily, Excel's TDIST function allows us to easily find the probability we need. We only need to provide it with 3 arguments: the test statistic, the degrees of freedom, and the number of tails in our test. Suppose we wish to use the function wizard to access TDIST. We begin by choosing the relevant function using the Insert menu:

Then, we click on OK and fill in the relevant arguments:††

Finally, we click on OK to see the following:

TDIST is telling us that, if the null hypothesis were true, there would be little surprise in seeing a sample mean of 2.01 liters from a sample of 49 bottles. To be more precise, it tells us there would be a 54% chance we would get a sample mean .01 liters or more from 2 liters, if the true population mean was 2 liters. Since that’s a pretty high probability, larger than any typical alpha, this sample gives us little reason to reject our null hypothesis.
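The arithmetic behind this result can be sketched in Python using only the standard library. The sketch assumes the usual one-sample setup, which the text does not spell out: the test statistic is (sample mean - hypothesized mean) / (s / sqrt(n)), the degrees of freedom are n - 1 = 48, and the test is two-tailed. The tail area is found by numerically integrating the t density with Simpson's rule, which is a stand-in for whatever method Excel uses internally.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + t * t / df))

def t_two_tailed_p(t_stat, df, steps=2000):
    """Two-tailed p-value: 1 minus the area between -|t| and +|t|."""
    b = abs(t_stat)
    if b == 0:
        return 1.0
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_b = total * h / 3
    return 1 - 2 * area_0_to_b

n, sample_mean, sample_sd, mu0 = 49, 2.01, 0.114, 2.0
t_stat = (sample_mean - mu0) / (sample_sd / math.sqrt(n))
p_value = t_two_tailed_p(t_stat, n - 1)
print(round(t_stat, 3), round(p_value, 2))  # 0.614 0.54
```

A test statistic of about .614 with 48 degrees of freedom and two tails gives a p-value of about 54%, in line with the value discussed above.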

†† We switch to Excel 2007 for the next two screenshots. They were inserted to correct an error in the previous edition of the handbook. Though the appearance of the program has changed, TDIST is exactly the same.

4 Performing Linear Regression

Linear regression involves fitting a line, or a plane, to a data set. This can allow us to seek evidence for a relationship between a dependent variable of interest and one or more independent variables. If the relationship we want to investigate is between a single independent variable and a dependent variable, then we use simple linear regression. Excel can perform a simple linear regression in two ways: with the Chart Wizard and with the Regression Tool on the Data Analysis menu. If the relationship of interest is between two or more independent variables and a dependent variable, then we use multiple linear regression. Excel performs multiple linear regression only with the Regression Tool.

The sections that follow work through one example of simple regression and one example of multiple regression.

4.1 Simple Linear Regression

Suppose we suspect a relationship exists between the weight of a car (independent variable) and its fuel efficiency (dependent variable). One way to gather evidence to support (or undercut) the existence of this relationship in a population would be to collect a random sample of cars from that population, recording the weight and fuel efficiency of each one.

Suppose you did this, selecting your sample from the population of all cars produced in 1997. To get a visual sense of the strength of this relationship, and to see if it is positive or negative, you could use Excel to make a scatter plot of the two variables. Here’s the data set you collected:

To make the scatter plot, you should select the Weight and MPG columns, including their labels. Click on the Chart Wizard icon and then on XY (Scatter). Select the first chart sub-type (a scatter diagram with no lines), and you should see a dialog box that looks like this:

If you then click on Next >, you should see the following:

Now take the following four steps: again click on Next >, enter the chart title, enter the label for the X-axis, and enter the label for the Y-axis. That will leave you with the following:

Now, click on Next > one final time. Neatening the resulting graph up will leave you with this:

To have Excel perform a simple linear regression, we can work with this graph a bit more. Begin by clicking on Chart to see this:

Click on Add Trendline. That should leave you with this:

To have an unobstructed view of the graph, just click on OK.

Now, suppose you want to know the equation of the line Excel just fit to your data. To display this equation, along with its R², right click the trendline and select Format Trendline. In the dialog box that appears, click on the Options tab. Now, choose Show Equation and R². After that, you should see the following:

If you click on OK, that should get you the equation for the linear regression you desired:
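A linear trendline is a least-squares line, so its slope, intercept, and R² can be reproduced from the usual formulas. The Python sketch below does this for five invented cars; the weights and MPG values are hypothetical and are not the handbook's 1997 sample.

```python
def simple_regression(x, y):
    """Least-squares slope, intercept, and R-squared for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical data: weight in pounds, fuel efficiency in miles per gallon.
weight = [2000, 2500, 3000, 3500, 4000]
mpg = [36, 31, 27, 24, 20]
slope, intercept, r_squared = simple_regression(weight, mpg)
print(f"MPG = {intercept:.1f} + ({slope:.4f}) * Weight, R2 = {r_squared:.3f}")
```

A negative slope, as in this made-up sample, would indicate that heavier cars tend to get fewer miles per gallon.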

4.2 Multiple Linear Regression

You just analyzed the relationship between fuel efficiency and weight. In this model, fuel efficiency is the dependent variable. The model supposes that fuel efficiency depends on weight.

What if you think one variable depends on two (or more) other variables? In that case, you can use multiple regression analysis to describe the sample relationship between the variables, or to test for the existence of a population relationship.

For example, suppose you think that fuel efficiency depends not just on a car’s weight, but also on the power of its engine. In that case, you might wish to estimate a multiple regression with fuel efficiency as the dependent variable, using both weight and engine horsepower as independent variables. To do this, first paste the data into a new worksheet. Now, select Tools, Data Analysis, and then Regression. At that point, you should see the following:

To continue, select OK. When the Regression dialog box appears, enter the range for your dependent variable (MPG) in the box labeled “Input Y Range.” (To do this, use the same technique described in the section above titled “Find the minimum, maximum, and the range.”) Then, enter the ranges for your independent variables (Weight and Hpower) in the box labeled “Input X Range.” Select Labels, Residual Plots, and Normal Probability Plots. Select the output range (as you did above in the “Find the minimum, maximum, and the range” section). At this point you should see this:

Clicking on OK will get Excel to perform the multiple linear regression. In addition, doing this will provide you with three plots that can be used to see how appropriate it is to use linear regression to analyze this data set.
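For readers curious about what the coefficients represent, the Python sketch below estimates a two-variable regression by solving the least-squares normal equations. It is an illustration only: the data are invented (generated exactly from MPG = 50 - 5 × weight - 0.05 × horsepower, with weight in thousands of pounds), the function names are ours, and the sketch omits the standard errors, plots, and other output the Regression Tool provides.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def multiple_regression(predictors, y):
    """Return [intercept, b1, b2, ...] minimizing the sum of squared residuals."""
    rows = [[1.0] + list(r) for r in predictors]  # leading 1 for the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical cars: (weight in thousands of pounds, horsepower) and MPG.
cars = [(2.0, 90), (2.5, 120), (3.0, 110), (3.5, 160), (4.0, 200), (3.2, 150)]
mpg = [35.5, 31.5, 29.5, 24.5, 20.0, 26.5]
coef = multiple_regression(cars, mpg)
print([round(c, 4) for c in coef])
```

Because the invented data lie exactly on a plane, the sketch recovers the generating coefficients; with real data the fit would leave residuals, which is what the residual plots are for.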

Congratulations on completing this Excel handbook! At this point, you should be able to create frequency distributions and their corresponding histograms, use three valuable statistical functions (BINOMDIST, NORMDIST, and TDIST), make two-way pivot tables, and perform both simple and multiple regressions.
