
An R Companion to Quantifying Archaeology by Stephen Shennan

David L. Carlson
Anthropology Department

Texas A&M University
Version 0.6 - 04/02/12

http://people.tamu.edu/~dcarlson/quant/Shennan/

Chapter 1. Introduction

This guide shows you how to use R, a free software environment for statistical computing and graphics. The guide is intended to be used in conjunction with Stephen Shennan's book, Quantifying Archaeology. I suggest that you read each chapter first and then go back through the chapter with this guide while using R to produce the analyses, figures, and tables in the book. There are a number of excellent guides to R for new users, but they do not necessarily include all of the topics that archaeologists would find useful and they start with the basics and build up from there. This guide starts with the commands needed to produce the analyses in each chapter in Quantifying Archaeology. At first they will seem much like magical incantations, but in each chapter we will talk about basic concepts and how they apply to the kinds of analyses we are doing. If you want to know a lot about the design of R and how it works before using it, this is not the guide for you. If you want to jump into using R and gradually learn the background, you may find this guide to be a productive way to learn.

As a platform for quantitative analysis, R has several advantages over commercial statistical packages:

1. R is completely free. You can take it with you wherever you go and you can update it whenever you want without paying any licensing fees. You obtain R by downloading it from one of over 70 websites worldwide. If you have an internet connection, you can obtain R.

2. R is available for computers running Windows, Mac OS X, and Linux.

3. R is continually being updated and extended. Because it is an open source program, people around the world continually add extensions to the basic system. Currently there are about 3000 packages that extend the basic functions in R.

4. R commands can easily be combined to perform specialized analyses and saved so that those analyses can easily be repeated with new or corrected data.

5. Using the command approach to data analysis can shift your orientation from what is convenient to what makes sense since you can easily design your own ways of summarizing, plotting, and analyzing your data.

6. Online R documentation is extensive and forums provide places for new R users to ask questions.

7. Books on various aspects of R are readily available. Springer has a growing series called “Use R” and Wiley has a series called “Statistics Using R” each with about two dozen titles covering R and its closely related, commercial ancestor S-Plus.

To obtain R, visit the main website: http://www.r-project.org/. You will download R from CRAN (The Comprehensive R Archive Network) http://cran.r-project.org/. R is a command language, meaning that you type commands and the R system responds to those commands. The advantage of this approach is that it is easy to record a series of commands in a file and it is easy to add new commands without overhauling a menu system each time. Typing commands does mean that R comes with a somewhat longer learning curve than a graphical program such as SPSS or Statgraphics. To ease you into the command interface we will install a graphical user interface written for R that gives you easy access to many commands in R. It will also give you an opportunity to see what the commands look like since they are printed as you select them from the menus. You can do a great deal with the Rcmdr interface and you may find that it provides everything you need. However, not every R command is available from the menus, and for those commands that are included in the menus, not every possible option is included. The command interface gives you access to every R command and every option. This limitation is not unique to R. The graphical menus used by commercial programs such as SAS and SPSS are based on an underlying command language. The SPSS menus also do not include every SPSS command, and some menu commands do not include every option.

For more details, see the document “Getting R” on my website which includes instructions on Installing R and Rcmdr. Once you have installed R and Rcmdr, look at the “R resources on the web.” I recommend reading one of the manuals for beginners to get you started (for example “Using R for Data Analysis and Graphics” by J. H. Maindonald) and “R Commander: An Introduction” by Natasha Karp.

To get the most out of the guide you should first read the book chapter. Then go back through the chapter with this guide and your computer with R and Rcmdr loaded. There are no computer printouts or figures in this guide, but I will talk about them assuming that you have just run the analyses or generated the plots on your computer. My description won't make much sense if you are not working along with me. Generally you should type the commands into the Rcmdr Script Window, select them, and click Submit, or type them into the R Console. The Script Window lets you type multiple commands, select them, and then run them all by clicking Submit. The R Console runs each command when you press the Enter key on the keyboard. If a command is not working in the Script Window, try typing it in the R Console window. It still won't run, but the Console will print the complete error message.

If you are selecting and pasting commands, they should work as I have tested them while writing this guide. But sometimes things happen. Most likely the problem will be conversion of special characters, principally the quotation marks and the dash. Word processors sometimes convert the simple quotation marks to open quote/close quote marks that R doesn't recognize, or convert a hyphen to a dash. Type the command directly and it should work. Let me know about the errant command and I'll correct it in any subsequent editions.

As you are working through the chapter (or after you have finished it), take the time to look at the help page for every new command. For example, the command ?mean brings up a help page that describes all of the options you can use with the mean() function. Always do this since even simple commands may have unexpected features. In the case of the mean() function you will discover that it can also compute trimmed means. In addition to getting help on single commands, you can browse the available functions in packages from The R Language help page (Help | Html help on the R Console or Help | Start R help system on Rcmdr). In addition to information about the packages you have installed on your computer, there are frequently asked questions, search functions, and other resources.
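For instance, typing the following in the R Console shows the help page and then the trimmed-mean feature just mentioned (the numbers are made up for illustration):

?mean                                 # opens the help page for mean()
mean(c(10, 12, 14, 90))               # ordinary mean: 31.5
mean(c(10, 12, 14, 90), trim=0.25)    # drops the lowest and highest value: 13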

Now that we've covered how to use the guide, let's talk about the first chapter. Shennan describes the role of quantitative methods in archaeology. I agree or I would not have created this guide.


Chapter 2. Quantifying Description

Shennan discusses issues relating to the measurement of archaeological data. In general, we record observations on objects. Those objects can be ceramic vessels (as in Table 2.1 and in Exercise 2.1) or graves (as in Exercise 2.2). They can also be archaeological sites, bones, bronze fibulae, stone tools, house pits, storage pits, a volume of excavated sediment, or just about anything else archaeologists recover or record from an archaeological site. As in Table 2.1 each object is usually represented as a row in a table. Each column of the table is an attribute or property that we have recorded and the intersection of a row with a column is the specific value that we have recorded for that item on that attribute. The values can be numeric measurements (15mm, 4gm, or 5 decorative bands) or they can be non-numeric descriptions (red, cord-marked, jar-shaped, lipped, conoidal base, large). In R, this kind of table is referred to as a data.frame. A data.frame has the rows labeled by rownames (e.g. Vessel 1, Vessel 2 in Table 2.1) although in many cases these are just row numbers. The columns are labeled with colnames (length, width, weight, color, temper, etc). You can refer to a specific data value in terms of its rowname and colname or in terms of its row number and column number.
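A small sketch (with made-up values, not data from the book) shows both ways of picking out a single value:

Vessels <- data.frame(Length=c(15, 12), Width=c(8, 9),
   row.names=c("Vessel 1", "Vessel 2"))
Vessels["Vessel 1", "Length"]   # by rowname and colname: 15
Vessels[1, 1]                   # by row number and column number: 15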

Shennan describes these different kinds of measurements as nominal, ordinal, interval, and ratio. Each type requires somewhat different procedures depending on the kind of analysis and the statistical package you are using. In R, interval and ratio measurements are called numeric. Nominal variables are referred to as factors and ordinal variables are ordered factors.
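As a quick illustration (hypothetical values, not from the book), a nominal variable becomes a factor and an ordinal variable becomes an ordered factor:

Temper <- factor(c("shell", "grit", "shell"))                  # nominal
Size <- factor(c("small", "large", "medium"),
   levels=c("small", "medium", "large"), ordered=TRUE)         # ordinal
str(Temper)
str(Size)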

You will not need to use R for either of the exercises in this chapter. The challenge is to think about ways of recording and measuring archaeological data. The first exercise requires you to think about how to record qualitative information from a group of artifacts. The second requires you to think about how to record qualitative and quantitative information from a feature that contains several different kinds of artifacts.

Chapter 3. Picture Summaries of a Single Variable

Shennan introduces the concept that data consist of “smooth” and “rough” components. In searching for significant patterns in the data (the “smooth”), we try to remove the individual variability (the “rough”). One way to do this is to simplify the data by representing it with charts and graphs. The first example is an examination of the number of bone fragments of different domestic animal species from a hypothetical British Iron Age site. Although the exact data are not presented, we can reconstruct them from the information provided in the pie chart in Figure 3.2:

Species       Percent   Calculation    Frequency
Sheep/Goat    30        330 x 0.30     99
Cattle        46        330 x 0.46     152
Pig           21        330 x 0.21     69
Horse         3         330 x 0.03     10

We only need the first and last columns to create a data.frame in R. Start R and Rcmdr (using the command library(Rcmdr)) and select New Data Set on the Data menu tab. Call the new data set Bones and click OK. Rcmdr opens up a spreadsheet window for you to type in the data. Click on var1 and change it to Species, then click on var2, change it to Freq, and click on numeric next to the type label. Then enter the data and close the window. In Rcmdr the word Bones now appears next to Data set: indicating which data set we are currently using. The command you just executed is listed in the Script Window and in the Output Window. In the Messages Window, Rcmdr confirms that you have created the data set: “NOTE: The dataset Bones has 4 rows and 2 columns.”

As our first attempt to create the bar chart in Figure 3.1, select Graph | Bar Graph on the Rcmdr menu. There will be only one option, Species, in the variable list, so just click OK. Rcmdr will open a graph window with a bar chart. It does not look much like the figure, however. This is because Rcmdr wants the barplot data organized differently. Instead of the data table we created, Rcmdr assumes that there is a row of data for each bone, so that there would be 99 lines with Species = “Sheep/Goat” followed by 152 lines with Species = “Cattle” and so on. We have two choices. We can edit the graph command that Rcmdr produced to create the graph we want, or we can create the data set that Rcmdr expects and generate the graph. Either way we have to use the Script Window.

To use the first approach, click in the Script Window and find the command:

barplot(table(Bones$Species), xlab="Species", ylab="Frequency")

Change this command to the following:

barplot(Bones$Freq, names.arg=Bones$Species, xlab="Species", ylab="Frequency")

Make sure the cursor is on the same line as the command and click on the Submit button. The graph will now resemble Figure 3.1.

The second approach is easier than it sounds since we really have only a single column of data (Species). Click the cursor in the Script Window below the last command. Type in the following command exactly as shown (upper/lower case matters in R) and then click Submit (without pressing Enter). Submit runs the command that is on the same line as the cursor or a selected (highlighted) group of lines:

BonesLong <- data.frame(Species=rep(Bones$Species,Bones$Freq))

The command creates a new data.frame called BonesLong with a single column labeled Species (the <-, the “less than” symbol followed by a dash, assigns the value of the data.frame() command to BonesLong). We have used the rep() (repeat) command to create as many rows for each species as the Freq column in our original data set indicates. Click on the Data set button in Rcmdr and select BonesLong. Click on the View data set button and scroll through the data set just to see what it looks like. Now click on the Graphs menu tab and select Bar graph. Click OK on the Bar Graph dialog (there is only one variable so you don't have any choice) and the bar chart will look like Figure 3.1 except that the order of the species is now alphabetical.
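You can confirm that the expansion worked by tabulating the new data set; the counts should match the Freq column and the total should be 330:

table(BonesLong$Species)
nrow(BonesLong)   # 99 + 152 + 69 + 10 = 330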

Regardless of which approach you use, you can change the order of the species by selecting Data | Manage Variables in Active Data Set | Reorder Factor Levels. Click OK on the Reorder Factor Levels dialog and number the factor levels (the different species) so that Sheep/Goat is number 1, Cattle is number 2, Pig is number 3, and Horse is number 4. Then create the bar chart again.
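If you prefer a command, the same reordering can be done with factor() by listing the levels in the order you want (a sketch for the BonesLong version):

BonesLong$Species <- factor(BonesLong$Species,
   levels=c("Sheep/Goat", "Cattle", "Pig", "Horse"))
barplot(table(BonesLong$Species), xlab="Species", ylab="Frequency")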

If you are using BonesLong, select the Pie option in the Graph menu tab to get a pie chart like the one in Figure 3.2. If you are using the Bones data set, you will have to change the “table(Bones$Species)” in the pie() command to “Bones$Freq.” The default organization for a pie chart in R follows trigonometry so that we begin on the right side of the pie (3 o'clock or due east) and work counterclockwise. If you need a pie chart that starts at the top (12 o'clock or due north) and runs clockwise you have to insert “clockwise=TRUE” into the pie command (options must be separated by commas).
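Putting those pieces together, a pie chart from the Bones data set that starts at 12 o'clock and runs clockwise might look like this:

pie(Bones$Freq, labels=Bones$Species, clockwise=TRUE)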

These examples highlight several features of R. First, the variables in a data.frame can be referenced by the name of the data.frame followed by a $ and then the variable name. Since you can have multiple data.frames loaded at one time, this ensures that the correct variable is being referenced. Rcmdr takes care of this by having you specify which data set you are using. There are several ways to do this using R commands, but we won't cover that just yet. For now, if you use commands, use the data.frame$name approach. Second, R includes commands that can manipulate data sets and create new ones. Third, R grew over a period of time by the volunteer efforts of many people, so similar options in different commands can be labeled differently. For example, in the barplot command the bar labels are set using the names.arg= option, while in the pie command the slice labels are set using the labels= option.

R also has an extensive help system that documents all of the options of each function (command). To find out about the options of the barplot() command, use the command help(barplot) or ?barplot. Either command will open your web browser to a page describing the barplot() command. The help page provides a brief description of the command followed by a list of all the options (arguments) and their default values. Read over the pages for the barplot() and pie() commands.

Another feature of R to consider at this point is that R never saves anything automatically. If you do not save a data set, it disappears when you quit R. Rcmdr makes it simple to save your data sets. First use File | Change working directory to set the directory you will use to save your data files. Then select Data | Active data set | Save active data set to save the data set. Rcmdr will give the file the same name as the Data set name and will add .RData as a file extension to indicate that this is an R data file. Save the Bones and BonesLong data sets.
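If you want to do the same thing from the command line, save() and load() are the underlying functions; a sketch (the file names are just examples):

save(Bones, file="Bones.RData")           # writes Bones.RData in the working directory
save(BonesLong, file="BonesLong.RData")
load("Bones.RData")                       # restores the Bones data set in a later session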

When the data are ordinal or interval/ratio, the category ordering is set. Figure 3.3 provides an example of discrete ratio data. The number of types of grave goods must be a whole number (there could not be 1.5 types of artifacts in a grave). The data for Figure 3.3 is provided in Table 3.2. We only need to enter the first two columns since R can fill in the third column. You could use Rcmdr as we did in the previous example, but instead we will use a single command to create this data set:

GraveGoods <- data.frame(NoTypes=0:5,NoGraves=c(17,30,26,17,13,6))

Type this line into the Script Window and click Submit (while the cursor is still on the line with the command). If you click the Dataset button, the GraveGoods dataset is now available. This command uses a few new features of R. Within the data.frame() command we name the columns by placing a label on the left side of an equal sign. The expression on the right side of the equals sign creates the data. The 0:5 to the right of NoTypes causes R to create a vector (a batch of numbers) from 0 to 5 (i.e., 0, 1, 2, 3, 4, 5). The c() function creates another batch of numbers. Make sure that the number of values is the same or R will repeat part of the shorter sequence, which is probably not what you want. In this case, each variable has 6 values.
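To see the recycling behavior for yourself, try the following (an illustration only, not data from the book):

data.frame(NoTypes=0:5, NoGraves=c(17, 30, 26))   # the three counts are silently recycled to fill six rows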

The command to create a bar plot that matches Figure 3.3 is

plot(GraveGoods$NoTypes,GraveGoods$NoGraves,type="h", xlab="No. of types of goods in the grave", ylab="No. of graves")

You will have to type this command in the Script Window since this type of plot is not offered via the Rcmdr menus. Since this command spans two lines you will have to select both lines before you click the Submit button. This plot matches Figure 3.3 very closely.

Figure 3.4 illustrates a histogram but does not include the data so that figure cannot be regenerated. However, the next section of the chapter provides data on postholes from a Neolithic henge monument which we can use to create a histogram. Use the following command to create a data set for the posthole diameter data:

Postholes<-data.frame(Diameter=c(48,48,43,48,38,57,49,40,53,35,66,48,44,43,30, 48,47,40,43,38,50,57,34,25,38,58,40,42,45,28,47,50,47,39,27))

Type or paste this command into the Rcmdr Script Window, select the lines, and click Submit. Then select the Postholes data set. The Graph | Histogram menu option will let you make a histogram. R will divide the continuous variable into a reasonable number of groups and plot the result. You can experiment with the various options to see how the histogram changes.

Figure 3.5 plots the number of graves against the number of types. We can generate that easily by selecting the GraveGoods data set, selecting Graphs | Line graph, and then specifying NoTypes as the x-variable and NoGraves as the y-variable. Rcmdr uses matplot() instead of plot() so that multiple groups can be plotted on a single graph. Changing the pch=1 option to pch=4 and adding x and y axis labels with xlab="No. of types of goods in the grave", ylab="No. of graves" will match Figure 3.5 exactly.

The next section of the chapter focuses on the stem-and-leaf diagrams first proposed by John Tukey in his book, Exploratory Data Analysis (1977) which developed methods for data analysis that did not require calculators or computers. Rcmdr can produce stem-and-leaf plots. Select the Postholes data set and then Graph | Stem-and-leaf display. Diameter is the only variable in this data set so click OK and Rcmdr will generate a stem-and-leaf plot in the Output Window. To make it look like Figure 3.7 select “1” under Parts per stem.

The next section is an optional presentation of kernel smoothing. R can easily generate kernel densities (but not through Rcmdr). To produce a kernel density plot for the Postholes data, type the following command:

plot(density(Postholes$Diameter),type="l",ylim=c(0,.06),col="red",lwd=2)

The option type="l" (lowercase letter "l") draws a line (instead of plotting symbols), col="red" specifies a red line, and lwd=2 produces a thicker line than the default value (1). Finally ylim=c(0, 0.06) sets the y axis limits. The default bandwidth for these data is 3.134. To plot bandwidths of 2 and 4, we add the following commands:

points(density(Postholes$Diameter, bw=4), type="l", col="blue", lwd=2)
points(density(Postholes$Diameter, bw=2), type="l", col="green", lwd=2)
legend("topright", c("bw = 3.13", "bw = 2", "bw = 4"), lwd=2, col=c("red", "green", "blue"))

These commands add kernel density plots for bandwidths of 4 (blue) and 2 (green) to show how the bandwidth parameter affects the smoothness of the kernel density curve. The last command adds a legend in the upper right corner of the graph.

The last section of the chapter discusses cumulative frequency distributions. Table 3.2 illustrates the grave good data. We have already created this data set. Select the GraveGoods data set. It does not include the third column of percentages so we need to create that first. In Rcmdr select Data | Manage variables in active data set | Compute new variable. First put Percents in the “New variable name” box and round(GraveGoods$NoGraves/sum(GraveGoods$NoGraves)*100,1) in the “Expression to compute” box. Then select the Compute new variable menu item again and use CPct as the variable name and cumsum(Percents) as the expression. Alternatively you can use the following two commands in the Script Window:

Percents <- round(GraveGoods$NoGraves/sum(GraveGoods$NoGraves)*100, 1)
GraveGoods <- data.frame(GraveGoods, Percents, CPct=cumsum(Percents))

These two variables are attached to GraveGoods so that the new data set has four columns (use the View data set button to confirm this). If you use the script command, you may need to select another data set (e.g. Postholes) and then select GraveGoods. Otherwise the menus may not include the new variables. Now select Graph | Line graph and select NoTypes as the x-variable and CPct as the y-variable. The graph will closely match Figure 3.10. You can change to pch=4 and add xlab="No. of types of goods in the grave" and ylab="Cumulative percentage" to match Figure 3.10 more closely. Notice that the command identifies the cumulative percentage as GraveGoods[, c("CPct")] instead of GraveGoods$CPct. Either one is correct and the second way involves less typing, but for listing multiple columns the first way is easier. Don't forget to save any data sets before you exit Rcmdr and R.

Chapter 4. Numerical Summaries of a Single Variable

Shennan discusses measures of central tendency, dispersion, shape-symmetry, and shape-tails. Central tendency can be described by the mean of the data which is the sum of the values divided by the number of values. You can follow along with the text by typing the following commands in the Script Window in Rcmdr and clicking Submit. Since the commands are pretty simple, you could also just type them in the R Console window. The example uses the first seven posthole diameters. To compute the sum we can use the sum() function in the Rcmdr Script Window:

Posts7 <- c(48, 57, 66, 48, 50, 58, 47)
sum(Posts7)


Posts7 is an array of numbers (not a data frame). The mean is just the sum divided by the number of values (length(Posts7) returns the number of values in Posts7):

sum(Posts7)/length(Posts7)

The mean is 53.42857 which is easily computed with

mean(Posts7)

The deviations from the mean sum to zero:

sum(Posts7-mean(Posts7))

The result, -1.421085e-14, is essentially zero: -1.4 divided by 100,000,000,000,000 (one followed by 14 zeros).

The median is the middle value (when there is an odd number of values) or the average of the two middle values (when there is an even number of values). The R command is

median(Posts7)

The mode is not uniquely defined for many data sets. For Posts7, the mode would be 48 since that is the only value that occurs more than once. Another way to define the mode is in terms of the kernel density plot. The kernel density function produces a smooth curve by estimating the density at 512 points between the minimum and the maximum value. The following commands save that density curve data to KernPosts and then identify the x value at which the y value is a maximum.

KernPosts <- density(Posts7)
KernPosts$x[which(KernPosts$y==max(KernPosts$y))]

The value, 48.89251, is an estimate of the mode for these data. Changing the bandwidth for the density function might change the value, so it is best seen as an approximation. The first command saves the results of the kernel density to KernPosts. The second command locates the maximum density value which we take as the mode. KernPosts is not an array or a data frame, but a list. In R a list is a collection of things. In this case the list includes a variable x that stores the 512 values of possible post diameters from 34.03 to 78.97 and a variable y that stores the 512 values of the kernel density. Also included in the list is the number of values used in computing the kernel density and the bandwidth. You can see what is included in a list with the str() command:

str(KernPosts)
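For comparison, tabulating the raw values shows that 48 is the only diameter that occurs twice, which is the conventional mode:

table(Posts7)
names(which.max(table(Posts7)))   # "48"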

Taken together the fact that mean>median>mode indicates that the distribution is asymmetrical with a tail to the right. Plotting the density function makes this conclusion clear. For example,

plot(density(Posts7))
abline(v=c(mean(Posts7), median(Posts7), 48.89), col=c("red", "blue", "green"))
legend(70, .055, c("Mean", "Median", "Mode"), lty=1, col=c("red", "blue", "green"))

These three commands plot the kernel density for the seven posthole diameters and add vertical lines to identify the mean, median, and mode.

Measures of dispersion include the standard deviation, the coefficient of variation, and the interquartile range. The standard deviation is the square root of the sum of the squared deviations from the mean divided by the number of values minus one. In R that is

sqrt(sum((Posts7-mean(Posts7))^2)/(length(Posts7)-1))

or more simply

sd(Posts7)

The coefficient of variation is just

sd(Posts7)/mean(Posts7)*100

The quartile statistics are used to construct a box-and-whiskers plot of the 35 Mount Pleasant postholes. The following commands will generate the data.frame. If you saved it when we created it in the previous chapter just load it into Rcmdr and select Statistics | Summaries | Active data set. This will generate summary statistics for every variable in the data set.

Postholes <- data.frame(Diameter=c(48, 48, 43, 48, 38, 57, 49, 40, 53, 35, 66, 48, 44, 43, 30, 48, 47, 40, 43, 38, 50, 57, 34, 25, 38, 58, 40, 42, 45, 28, 47, 50, 47, 39, 27))
summary(Postholes$Diameter)

You can use these values to construct a box-and-whiskers plot or you can use Rcmdr by selecting Graphs | Boxplot. Diameter is the only variable so just click OK. The boxplot will resemble Figure 4.6 but will not match it exactly. You can get closer to the figure by editing the command in the Script Window to make three changes. The original command is

boxplot(Postholes$Diameter, ylab="Diameter")

Change this to

boxplot(Postholes$Diameter, horizontal=TRUE, range=1, xlab="Diameter")

and then Submit the changed command. Now the plot is horizontal, the range indicating where outlier points will be plotted has been changed to 1 interquartile range (the default is 1.5) and the x-axis is labeled. This is much closer to the figure. You can make the height of the box smaller by inserting the option boxwex=.25 into the command.
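With that addition the full command becomes:

boxplot(Postholes$Diameter, horizontal=TRUE, range=1, boxwex=.25, xlab="Diameter")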

Chapter 5. An Introduction to Statistical Inference


After introducing the concepts of statistical inference, the differences between samples and populations, and the role of hypothesis testing, Shennan illustrates these concepts with the data presented in Table 5.1 on the age distribution of burials distinguished between those that are “rich” and those that are “poor” in terms of associated grave goods. You can create this data set in Rcmdr by typing the following commands into the Script Window, selecting all of the lines, and clicking Submit:

Bronze <- data.frame(Age=c("Infans I", "Infans II", "Juvenilis", "Adultus", "Maturus", "Senilis"), Rich=c(6, 8, 11, 29, 19, 3), Poor=c(23, 21, 25, 36, 27, 4))
Bronze$Age <- factor(Bronze$Age, c("Infans I", "Infans II", "Juvenilis", "Adultus", "Maturus", "Senilis"), ordered=TRUE)

The second command specifies the order of the age categories so they will not be listed alphabetically. These data are used for the Kolmogorov-Smirnov test. We can use an R procedure to conduct this test if we re-organize the data, but for now we will build the various tables to follow the example in the book. To add proportions to the table (as in Table 5.2) run the following commands:

RichP <- Bronze$Rich/sum(Bronze$Rich)
PoorP <- Bronze$Poor/sum(Bronze$Poor)
RichC <- cumsum(RichP)
PoorC <- cumsum(PoorP)
Bronze <- data.frame(Bronze, RichP, PoorP, RichC, PoorC)
Bronze

The result includes the information in Tables 5.2 and 5.3. Now run these commands:

Diff <- abs(Bronze$RichC - Bronze$PoorC)
max(Diff)
print(cbind(Bronze$Age, round(Bronze[,6:7], 3), Diff=round(Diff, 3)))

The first command subtracts the cumulative proportions for each age category and takes the absolute value (eliminates the negative sign). The second identifies the maximum difference and the third produces Table 5.4, showing that the maximum difference occurs in the Juvenilis category. The command uses cbind() to tack the various columns together and round() to round to 3 decimal places.

Another approach is to use the ks.test() in R, but we have to modify the data:

Rich <- rep(as.numeric(Bronze$Age), Bronze$Rich)
Poor <- rep(as.numeric(Bronze$Age), Bronze$Poor)
ks.test(Rich, Poor)

As before, we have created a long version of the data with each grave on its own line. R gives a warning that since there are many ties (multiple individuals in the same age class), it cannot compute exact p values, but we can see the conclusion is the same: the two groups are not significantly different. With ties, the p values for the KS test tend to be conservative, so we are likely to miss a significant relationship rather than inferring a relationship is significant when it is not.

With either approach, you may want to try the following commands to produce Figure 5.1 as an introduction to some new graphics commands:

plot(Bronze$Age, Bronze$RichC, xlab="Age categories", ylab="Cumulative proportion", type="n")
lines(as.numeric(Bronze$Age), Bronze$RichC)
lines(as.numeric(Bronze$Age), Bronze$PoorC)
lines(c(3,3), c(Bronze$RichC[3], Bronze$PoorC[3]))
lines(c(3.05, 4.4), c(.45, .55))
text(1.9, .38, "poor")
text(2.5, .20, "rich")
text(4.6, .55, expression(D[max]))

The plot() command sets up the axes, but does not plot any data because of type="n". Then we use lines() commands to add the data lines to the plot and to draw the vertical line and the pointer to the vertical line. Finally, text() commands add the text labels and expression() makes the Dmax label with max as a subscript. The coordinates for the lines and text commands are based on the plot area. The y-axis ranges from 0 to 1.0 (cumulative proportion) and R numbers each factor on the x-axis with consecutive integers (i.e. Infans I is 1 and Senilis is 6). A little trial and error lets you place the labels exactly where you want them within a few tries. Then save the commands as a .R file to reproduce the figure whenever you need it, or save it as a bitmap or vector graphic file (File | Save As in the graph window if you are using the Windows operating system, or Rcmdr Graphs | Save graph to file on Windows, Mac OS, or Linux).

It is relatively easy to use randomization to test the burial distribution as described by Shennan with a few R commands.

Combined <- rep(as.numeric(Bronze$Age), Bronze$Rich + Bronze$Poor)
old <- getOption("warn")
options(warn=-1)
val <- array(dim=1000)
for(i in 1:1000) {
   test <- sample(Combined)
   val[i] <- ks.test(test[1:76], test[77:212])$statistic
}
hist(val)
quantile(val, c(.95))
sum(val > .178)/1000
options(warn=old)

The first statement constructs a new data set that consists only of the numeric value of the age categories repeated as many times as they are represented in the cemetery (e.g. Infans I is repeated 29 times since there are 6 rich and 23 poor Infans I graves). The next two lines and the last line just save the current setting of the warn option and reset it for the simulation to -1. This suppresses the warning message that ks.test() generates when there are ties in the data. Since we are running the ks.test() command 1000 times, we don't really need to see the message 1000 times, and we are not using the probability that ks.test() calculates anyway. The val <- array(dim=1000) creates a vector called val that will store the results. The next four lines set up a loop to execute the commands 1000 times. First we use sample() to draw a random sample of 212 from the combined data set since 212 is the number of graves. This command essentially creates a random permutation of the 212 graves in Combined. Second, we run ks.test() taking the first 76 graves as the Rich group and the rest of the graves as the Poor group. The ks.test() produces a list containing a summary of the test, but we are only interested in the statistic value, which is the maximum absolute difference between the cumulative proportions. That value is assigned to val[i], and i changes from 1 to 1000 so we are saving 1000 values of the statistic. The closing brace matches the open brace on the line that begins with "for(i in 1:1000)", indicating that only the two commands between the braces should be repeated 1000 times. The next three commands summarize the results in various ways. The hist() command shows the distribution of different values of the statistic (maximum difference) and the quantile() command shows the 95th percentile. If you are testing at the .05 significance level, your statistic (maximum difference) should be larger than this number to reject the null hypothesis. Finally, the sum() command shows you the probability of getting a value larger than 0.178 (the actual value for the data in the example). If that number is less than .05 we can reject the null hypothesis. You may get slightly different results since they are based on random samples.

This randomization is based only on shuffling the actual data. If we assume that the data represent a sample of a larger cemetery, we may want to treat the proportions of the various age classes as a random variable. In other words, instead of drawing Infans I exactly 29 times (but the number who are Poor or Rich varies), we could let the probability of drawing an Infans I grave be 29/212 = 0.1368 so that the actual number varies around 29 but we do not always draw 29 Infans I graves. Although this sounds complicated, it requires just one insertion in the above lines. Instead of

test <- sample(Combined)

use

test <- sample(Combined, replace=TRUE)

With either approach, the distribution appears to be significant and we should reject the null hypothesis.

The runs test is available in the tseries package if you want to install it (using the same procedure you used in installing Rcmdr) and try using it (after reading about it with help(runs.test) and creating a data set). The Mann-Whitney test is also known as the Wilcoxon test and is available through the wilcox.test() function. The function is available in Rcmdr, but the format does not follow the data sets we have created so far. For the grave data that we created to do the ks.test(), the command would be

wilcox.test(Rich, Poor)

We would reject the null hypothesis at the p<.05 level (as we did with the randomization tests).

Chapter 6. Estimation and Testing with the Normal Distribution

R provides functions to estimate probability distribution functions and to draw random samples from those distributions. Rcmdr provides access to many of them on the Distributions menu tab. The normal distribution is listed with the other continuous distributions. Rcmdr provides four options. For now select the Plot normal distribution option. Leave the mean and standard deviation settings at 0 and 1 and the type of plot as density function and you will create a plot that is very similar to Figure 6.3. By changing the mean to 110 and the standard deviation to 20 you can create Figure 6.4. The other option, distribution function, plots a cumulative normal distribution (an S-shaped curve that represents the accumulated area under the density curve).

Z-scores are a way of standardizing variables measured at different scales to the standard normal distribution with mean=0 and standard deviation=1. Rcmdr provides an easy way to compute Z-scores on the Data menu tab: Data | Manage variables in active data set | Standardize variables. Select the variable or variables you want to standardize and click OK. Rcmdr will add new variables to the data set with “Z.” preceding the original variable names. If you want to keep these variables, use Data | Active data set | Save active data set to save the data set.
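If you would rather standardize from the command line, the base R scale() function does the same calculation (subtract the mean, divide by the standard deviation); a sketch using the Postholes data set, with the Z. prefix mimicking Rcmdr's naming:

Postholes$Z.Diameter <- as.numeric(scale(Postholes$Diameter))
summary(Postholes$Z.Diameter)   # the standardized values have mean 0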

For a particular sample, you can use R to compute the standard error of the mean as follows:

sem <- sd(varname)/sqrt(length(varname))

The function sd() computes the standard deviation of the variable named varname. The function sqrt() computes the square root and the function length() returns the length of the variable (the number of observations).
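For example, using the seven posthole diameters (Posts7) from Chapter 4:

sem <- sd(Posts7)/sqrt(length(Posts7))
sem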

Shennan uses an example of tossing dice to show that a sampling distribution will often be normal even when the population distribution is not. You can conduct your own experiments using R (no dice required) with the following command:

hist(sapply(1:200, function(x) mean(sample(1:6, 2, replace=TRUE))))

The easiest way to see what the command does is to read it from the inside out. The sample() function draws a sample of two from the integers 1 through 6 with replacement. This is the equivalent of throwing two dice. The replacement option is necessary because when you throw two dice you can get two ones or two sixes. Then we use mean() to take the mean value of the two draws. The sapply() function performs this operation 200 times. The function(x) inside sapply() is needed to indicate that the following command is the function that should be performed. Finally hist() draws a histogram of the 200 draws. Once you have the command in the Script Window you can submit it over and over and watch how the histogram changes. Once that gets old, try changing the number 2 to a larger number, say 5 or 15 to see how the histogram becomes smoother. Even though the population distribution (the numbers 1 through 6 with each one having an equal probability) is uniform, the sampling distribution of means of random draws from that distribution approximates a normal distribution and does so more closely as the number of draws we use for each mean increases. Increasing the number of means we calculate (increasing the value of 200 to a larger number) will also make the histogram smoother.

Figure 6.7 illustrates the relationship between a population mean and a set of sample means. We can recreate that figure in such a way that you can explore how sample size affects your ability to estimate the population mean.

plot(0:10, rep(0,11), ylim=c(-2,2), type="l")
for (i in 1:9) {
   n <- 15
   s <- rnorm(n)
   m <- mean(s)
   se <- sd(s)/sqrt(n)
   points(i, m)
   lines(c(i, i), c(m-1.96*se, m+1.96*se))
}

First we create a plot space with a horizontal line at 0 (the mean of a normal population that has a mean of zero and a standard deviation of one). Then we create nine samples, each consisting of 15 numbers drawn from a random normal distribution. The mean and the standard error are calculated and plotted for each of the 9 samples. Once you have the lines in the Script Window, select them all and click Submit. You can repeatedly click Submit as long as they are selected to run the procedure over again. Given the 1.96 standard errors that we are using, about one sample in 20 will completely miss the population mean with its confidence interval. You can make the confidence intervals smaller by changing the sample size from 15 to a larger number.

To get the confidence limits, we need to find the Z-score (or the Student's t value). If we are interested in the 99% confidence interval, we will leave out .005 (.5%) at each end of the distribution (i.e. our range will be .005 to .995) and if we are interested in the 95% confidence interval, we will leave out .025 (2.5%) at each end of the distribution (i.e. our range will be .025 to .975). With these values, R can give us the number of standard errors we need. In Rcmdr select Distributions | Continuous distributions | Normal distribution | Normal quantiles. Put the two probabilities in the Probabilities window (e.g. .025 and .975 separated by a space). Now click OK and you should get -1.959964 and 1.959964. If you look at the Script Window you will see the command

qnorm(c(.025,.975), mean=0, sd=1, lower.tail=TRUE)

which we used to compute the Z-score values for the lower and upper end of the confidence interval. To replicate the example on page 82, click in front of this command and insert 22.6+.594* and click Submit while the cursor is still on the line with the command. R will compute the upper and lower confidence intervals as 21.43578 to 23.76422. You should be able to repeat this process to get the 99% confidence intervals.
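Typed directly in the Script Window or Console, the edited command and its result look like this:

22.6 + .594*qnorm(c(.025, .975), mean=0, sd=1, lower.tail=TRUE)   # 21.43578 23.76422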

To get the t distribution values, use Distributions | Continuous distributions | t distribution | t quantiles. The probabilities are the same as before (.025 and .975 for a 95% confidence interval). The degrees of freedom is one less than the number of cases (in this case 50-1 = 49). The result illustrates the difference between the normal and t distribution. While the Z-score value was 1.96 for the 95% confidence interval, for the t distribution (and a sample size of 50), the t-value is 2.01 so the interval will be slightly wider.
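The command behind that menu choice is qt(); typing it directly shows the slightly larger multiplier:

qt(c(.025, .975), df=49)   # approximately -2.01 and 2.01, compared with -1.96 and 1.96 for the normal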

You can easily compute t-tests (or the non-parametric version, the Wilcoxon test) using Rcmdr. First test the distribution to see if it is normal by selecting Statistics | Summaries | Shapiro-Wilk test of normality. The null hypothesis is that the data are normally distributed, so failing to reject the null hypothesis means the distribution is normal. You may want to conduct this test separately for the two groups in the data set by using Data | Active data set | Subset active data set to create separate data sets for each group. Next, to compare the variances of the two groups, go to Statistics | Variances | Two-variances F-test to see if the variances for the two groups are the same. If the variances are equal, we can use one form of the t-test (pooled variances), but if they are not we need to use a different form (separate variances). Third, select Statistics | Means | Independent samples t-test and select the Group and the Response. Select one of the choices under Assume equal variances depending on the result of the F-test. If you want to compare your results to the Wilcoxon (Mann-Whitney) test, which does not assume a normal distribution (or even interval scale data), select Statistics | Nonparametric tests | Two-sample Wilcoxon test and select the Group and the Response as with the t-test.

If the data do not follow a normal distribution, it may be that you can transform them suitably to obtain a normal distribution. A common approach is to take the common logarithm of each value. In R you use log10() for this purpose. Note that the logarithms are not defined for 0 or for negative numbers so you will not be able to use logarithms to transform count data where you have 0's (e.g. 0 tools in a square, 0 sites in a survey block, 0 of a particular type of ceramic in a site) unless you modify the data by adding .5 or 1 to each observation.
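A quick illustration with made-up counts shows why the adjustment is needed:

log10(0)                  # -Inf, so zero counts cannot be used directly
log10(c(0, 3, 12) + 1)    # adding 1 to every value first gives usable numbers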

Chapter 7. The Chi-Squared Test and Measures of Association

The chapter begins with an example looking at the distribution of settlements by soil type to illustrate the one-variable chi-square test. To work with these data in Rcmdr we need to create a long version by typing the following commands in the Script Window, selecting, and Submitting them:

Neo <- data.frame(Soil=c("Rendzina", "Alluvium", "Brown earth"), Freq=c(26, 9, 18))
Neo$Soil <- factor(Neo$Soil, c("Rendzina", "Alluvium", "Brown earth"))
Neolithic <- data.frame(Soil=rep(Neo$Soil, Neo$Freq))

These commands produce a data set called Neo that has the grouped version of the data (Table 7.1) and Neolithic, the ungrouped or long version of the data. In between is a command that sets the order for the soil types. Run these commands in the Script Window and then select the Neolithic data set. The Rcmdr command for a one-variable chi-square test is at Statistics | Summaries | Frequency distributions. The only variable is Soil so it should be selected for you. Then click the Chi-square goodness of fit box and OK. Rcmdr prints the table of sites in each soil type (both counts and percentages) and then prompts you for hypothesized probabilities. The default is to divide the cases equally between the groups, in this case 1/3 (one third) each. But we want the distribution to reflect the area covered by each soil type (Table 7.2), so we insert .32, .25, and .43 and then click OK.

The results match the example in the book within some rounding error and Rcmdr lists the chi-square value as 7.1885 and the probability of obtaining this result if the sites were distributed randomly across the soil types as .02748.
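If you want to skip the menus, the same goodness-of-fit test can be run directly on the grouped Neo data set with base R's chisq.test():

chisq.test(Neo$Freq, p=c(.32, .25, .43))   # X-squared = 7.1885, df = 2, p-value = 0.02748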

Table 7.3 introduces a new data set with two nominal variables. It is relatively easy to compute chi-square for a simple table by using the Rcmdr command Statistics | Contingency tables | Enter and analyze two-way table. You can use the sliders to change the rows and columns, but the default 2 x 2 table is what we want. Type in the counts and label the rows and columns. Make sure Chi-square test of independence is checked and check Print expected frequencies. Then click OK. The expected frequencies are in a separate table, but the chi-square value is similar (the difference is in the rounding of the expected frequencies). Rcmdr does not produce a table with marginal frequencies (row and column totals). If you want them, it takes two steps. First you have to recreate the table since Rcmdr deleted the table when it finished. Scroll up in the Script Window looking for the following commands:

.Table <- matrix(c(29,14,11,33), 2, 2, byrow=TRUE)
rownames(.Table) <- c('RHS', 'LHS')
colnames(.Table) <- c('M', 'F')

Select these lines and submit them. Then type the following commands in the Script Window and submit them:

names(dimnames(.Table)) <- list("Side", "Sex")
addmargins(.Table)

The first command is not essential, but it lets us name the table dimensions. These will be used whenever we print the table. To get several of the association statistics that Shennan describes add the following commands:

library(vcd)
assocstats(.Table)

This will produce a summary table that includes the Pearson chi-square (the one Shennan describes), the likelihood ratio chi-square (described in Chapter 10), the phi-coefficient (square it to get phi-squared), the contingency coefficient, and Cramer's V. We can create a simple function to compute Yule's Q:

YulesQ <- function(x) {(x[1,1]*x[2,2]-x[1,2]*x[2,1])/(x[1,1]*x[2,2]+x[1,2]*x[2,1])}

When you run this command, there will be no output. But we have now created a simple function. To compute Yule's Q we use the new command YulesQ() with a valid 2x2 table (for example, YulesQ(.Table)). The function is very simple: it does not check to make sure the input is a valid 2x2 table and it does not format the output nicely, but it is enough to illustrate the process of creating functions in R. If you exit from R, you will have to re-create the function when you start the program again unless you have R create it automatically every time you start the program.

In the section on causal inference, Shennan breaks down a 2 x 2 table on the basis of a third nominal variable to see if controlling for that variable changes the association between the first two. Rcmdr can create the two tables (from a long data set) but it does not compute the separate chi-square values, so the easiest way to follow this example is to enter each table separately and compute the chi-square values and the measures of association. Typing in the following commands lets you do the analysis on a 2 x 2 x 2 table. We'll use this approach in Chapter 10 for the log-linear analysis. First create the table with a single (long) command:

GPit<-array(c(17, 29, 4, 6, 5, 4, 43, 20), dim=c(2, 2, 2), dimnames = list(Sex=c("Male", "Female"), 'Grave Volume'=c("<1.5 m3",">1.5 m3"), 'Est Height'=c("<1.55 m",">1.55 m")))

We enter the values for each cell by row for column 1, table 1, then the rows for column 2, table 1, then the rows for column 1, table 2, and finally, the rows for column 2, table 2. First we collapse the table across “Est Height” to create Table 7.11 and compute the chi-square:


SxGV <- ftable(GPit, row.vars="Sex", col.vars="Grave Volume")
SxGV
assocstats(SxGV)
YulesQ(SxGV)

We use ftable() to collapse the 2x2x2 table by summing across the "Est Height" dimension. We can also produce Table 7.11, but we have to produce the statistics separately:

ftable(GPit, row.vars=c("Est Height", "Sex"), col.vars="Grave Volume")
assocstats(GPit[,,1])
YulesQ(GPit[,,1])
assocstats(GPit[,,2])
YulesQ(GPit[,,2])

This time we use ftable() to stack the 2x2 tables showing "Sex" by "Grave Volume." For the statistics we select out each 2x2 table separately. GPit[,,1] takes both rows of the variable "Sex" and both columns of the variable "Grave Volume," but only the first value of "Est Height" (which happens to be "<1.55 m"). As before, you will have to square the phi-coefficient to get the phi-square value listed in Table 7.11.
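If all you need are the collapsed counts (summed over "Est Height"), base R's margin.table() is another way to get them; a sketch:

margin.table(GPit, margin=c(1, 2))   # sums over Est Height, leaving the Sex by Grave Volume table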

R does not have built-in functions to compute Goodman and Kruskal's tau or the lambda coefficient. Although it would not be too difficult to write functions to compute them, they are not widely used in the archaeological literature.

Finally, you may want to print your tables in a format that allows you to use the formatting features of Microsoft Word or something similar. The best solution is a package called xtable. Install the package and then try the following commands:

library(xtable)
a <- ftable(GPit, row.vars=c("Est Height", "Sex"), col.vars="Grave Volume")
print(xtable(gsub("\\\"", "", format(a))), type="html")

This will produce html code in the Output Window. Select the code, copy it, and paste it into an Excel spreadsheet. Now you can save the table or copy and paste into Microsoft Word. If you do not have Excel, paste the code into a text editor and save it with an html extension. Then import it into your word processing program.

Chapter 8. Relationships Between Two Numeric Variables: Correlation and Regression

Table 8.1 introduces some data on pottery density by distance from a kiln. By now you should be able to create this data set in Rcmdr on your own. Call it NewForest with variables Site, Distance, and Quantity (all numeric). We can now create a scatter diagram in Rcmdr with the Graphs | Scatterplot command. Select Distance as the x-axis variable and Quantity as the y-axis variable. Uncheck Smooth Line and Show Spread (since there are too few points), but leave the other options alone. Click OK and a graph window will display the graph. The scatterplot procedure plots the individual points as open circles. It puts box-and-whisker plots along each axis so you can check the distributions of each variable. A solid line is the regression line for the data.

To make the scatterplot look more like Figure 8.1, run it again but this time uncheck the boxes that produce the Marginal Boxplots, the Least Squares Line, Smooth Line, and Show Spread. Then put the number 4 in the plot characters box and add x and y axis labels that match the ones in the book. The limits of the y-axis will still not match, but you can edit the command that Rcmdr produces to add ylim=c(0,100) and grid=FALSE to get closer to the figure in the book.
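
If you prefer typing commands, a minimal base-R sketch produces roughly the same plot; the axis labels below are placeholders, so substitute the ones used in the book:

plot(Quantity ~ Distance, data=NewForest, pch=4, ylim=c(0, 100), xlab="Distance", ylab="Quantity")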

In the section talking about the shape of a relationship, Figure 8.6 shows three examples. We can illustrate those examples with the following commands:

op <- par(mfrow=c(2, 3))
curve(1*x, 0, 2, ylim=c(0, 4))
curve(2*x, 0, 2, ylim=c(0, 4))
curve(x^2, 0, 2, ylim=c(0, 4))
curve(sqrt(x), 0, 2, ylim=c(0, 4))
curve(-x+2, 0, 2, ylim=c(0, 4))
curve(-2*x+4, 0, 2, ylim=c(0, 4))
par(op)

We set the parameter mfrow to divide the graph window into 2 rows by 3 columns and save the default value in the variable op so we can reset it after we finish. We could have used a single row, but then the plots would be tall and narrow (although we could have opened a graph window with different dimensions), so we added three more functions for a total of six, all drawn with the curve() function.

To compute a linear regression for the New Forest pottery data, use Statistics | Fit models | Linear model and select Quantity as the Response and Distance as the Explanatory variable. Click OK and the results appear in the Output window. Look at the commands produced in the Script Window. The command for a linear regression is RegModel.1 <- lm(Quantity~Distance, data=NewForest). The name of the regression is arbitrary. The lm() function computes the linear regression, and the formula for the regression, Quantity~Distance, places the dependent variable on the left of the tilde (~) and the independent variables on the right.
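
If you are typing commands rather than using the menus, the model and the summary that Rcmdr displays can be reproduced directly (the model name is arbitrary):

RegModel.1 <- lm(Quantity ~ Distance, data=NewForest)
summary(RegModel.1)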

In the Output Window the Residuals are the deviations between the actual data points and the line. The Intercept is the “a” value in Shennan's equation and the Distance value is the “b” or slope value. The probability that each coefficient is actually zero is given in the last column. Only the slope probability is relevant to judging the significance of the regression line, and in this case it is .0053, much less than .05, so we reject the null hypothesis that the slope is zero.

The correlation coefficient between Distance and Quantity is obtained by selecting Statistics | Summaries | Correlation test. Select Distance and Quantity (use Ctrl-Click to select two or more values from a single window). The value is -.97 indicating a very strong relationship in which Quantity decreases as Distance increases.

To get the rank correlations, follow the same procedure but select the Spearman rank-order test or Kendall's tau instead of the Pearson product-moment correlation.
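
Typed directly, the three correlation tests are a single function with different method arguments (a hedged equivalent of the Rcmdr dialog):

with(NewForest, cor.test(Distance, Quantity))                      # Pearson (the default)
with(NewForest, cor.test(Distance, Quantity, method="spearman"))   # Spearman rank-order
with(NewForest, cor.test(Distance, Quantity, method="kendall"))    # Kendall's tau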

You should save the NewForest data set as you will use it in the next chapter. In Rcmdr use Data | Active data set | Save active data set. This will save the data in R's binary format, which can be read only by R. If you want to be able to read it with another program, use Data | Active data set | Export active data set. Then select the options you want. You might want to uncheck “Write row names” and change the “Field Separator” to Tabs or Commas. Then click OK and save the file. This will be a text file that can be opened in many different programs including Excel and Word.
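
The equivalent typed commands are roughly as follows (the file names are whatever you choose):

save(NewForest, file="NewForest.RData")                                  # R's binary format
write.table(NewForest, file="NewForest.txt", sep="\t", row.names=FALSE)  # tab-delimited text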

Chapter 9. When the Regression Doesn't Fit

Load your NewForest data set if you saved it. Otherwise regenerate it from Table 8.1. Use Statistics | Fit models | Linear regression to fit the model with Quantity as the Response and Distance as the Explanatory variable. You can add the predicted values to the data set by selecting Models | Add observation statistics to data. Uncheck all except Fitted values and Residuals since we are not using the others. Click on the “View data set” button to see the results. Notice that the Model: button now says “RegModel.1” which is the name Rcmdr has given to the regression model you just created.

In Table 9.1, the first column is the same as the Quantity variable in your data set, the second column is the Fitted value, and the third column is the Residual value squared. There are some slight differences between the Fitted values in R and those in the book because of rounding. You can generate column 3 and the total with the following commands:

NewForest$residuals.RegModel.1^2
sum(NewForest$residuals.RegModel.1^2)

In the output from the regression, R lists the residual standard error (the standard error of the regression) as 5.9, in contrast to Shennan's estimate of 4.57. The reason for the difference is explained in the footnote on page 152. R divides the sum of the squared errors (104.42) by 3 (n-2) instead of 5 because the statistical tests of the regression equation assume that the data represent a sample from a larger population of pottery. You can convert R's sample estimate to the population estimate Shennan gets by multiplying 5.9 by the square root of 3/5 (sqrt(3/5)).
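
You can check the conversion directly in R:

5.9 * sqrt(3/5)   # approximately 4.57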

It is possible to create a plot like the one in Figure 9.1 for the NewForest data, but it requires typing some commands into the Script Window. First use Rcmdr to produce the basic plot. Graphs | Scatterplot gets you to the dialog box. The x-variable is Distance and the y-variable is Quantity. Uncheck Marginal boxplots, Smooth line, and Show spread. To get one-standard-error bands (assuming the 5.9 standard error value) for the regression, use the following commands:

abline(a=RegModel.1$coefficients[1]+5.9, b=RegModel.1$coefficients[2], lty=2)
abline(a=RegModel.1$coefficients[1]-5.9, b=RegModel.1$coefficients[2], lty=2)

The abline() commands plot dashed lines above and below the regression line to define the one-standard-error band. The abline() procedure has a number of options. One is to plot a line given the a (intercept) and b (slope) values. Another is to produce horizontal or vertical lines on an existing plot. RegModel.1 is a list variable that contains the results of the regression analysis that was performed using the lm() function. Elements of the list have names. We are using coefficients to plot these lines since they have the same slope (coefficients[2]) as the regression but different intercepts (coefficients[1] ± 5.9). The bands are parallel to the original regression, which reflects the fact that they assume the error is in the estimate of the intercept, not the slope. Later in the chapter (pp 173-5) Shennan shows how to take this error into account.

The standardized residuals are easily computed in R because the equation presented on page 155 can compute all of them at once:

with(NewForest, (Quantity-fitted.RegModel.1)/4.57)

Using with() saves us from typing the data set name before each variable. Since Quantity and fitted.RegModel.1 each have 5 values, R automatically performs the calculation for each value.

R provides a number of diagnostic plots to detect violations of the regression assumptions. Using the RegModel.1 that we created earlier with the New Forest pottery data, select Models | Graphs | Basic diagnostic plots. Four plots are displayed. The one in the top left corner is similar to the plot shown in Figure 9.6b. The residuals show no trend (Assumption 4). The plot in the upper right shows that the residuals are normally distributed; that is, they follow the dashed line in the plot (Assumption 3). The lower left plot shows that the variance of the residuals does not increase or decrease with the fitted values (Assumption 5). The plot estimates variance by plotting the square root of the absolute value of the standardized residual. The bottom right plot helps to identify individual cases that might overly influence the results. Only the first observation seems to be a potential problem, but an examination of the scatterplot indicates that we can ignore it, at least until we have a larger sample.
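
If you prefer typing commands, plot() applied to the model object produces the same four diagnostic plots, which is essentially what the Rcmdr menu item generates:

op <- par(mfrow=c(2, 2))   # divide the graph window into a 2 x 2 grid
plot(RegModel.1)           # the four standard diagnostic plots for a linear model
par(op)                    # restore the default layout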

When the data suggest that the assumption of a linear relationship is violated, transformations of the variables may allow us to use the linear model on the transformed data. For example, the function y = ax^b can be estimated by taking the logarithms of x and y and then fitting the linear model using the log-transformed variables (log y = log a + b log x). The model y = a10^(bx) can be estimated by taking the logarithm of y only and then fitting the linear model (log y = log a + bx). The second equation assumes that we are using common logarithms (base 10) rather than natural logarithms (base e, where e is approximately 2.718282). The quality of the fit (as measured by r2) will not be affected by which logarithm we use, but the coefficients will be.

To fit the models, create the data set shown in Table 9.2 with the following command:

Obsidian <- data.frame(Distance=c(5, 12, 17, 25, 31, 36, 44, 49, 56, 63, 75), Density=c(5.01, 1.91, 1.91, 2.24, 1.20, 1.10, 0.447, 0.347, 0.239, 0.186, 0.126))

To calculate the regression model in Rcmdr use Statistics | Fit models | Linear regression and then select Density as the Response variable and Distance as the Explanatory variable. The results appear in the Output Window. The value listed in the first row of the Estimate column (labeled as (Intercept) in the output) is equivalent to a in the linear model and the second row (labeled as Distance) is equivalent to b (the slope). R does not list r directly, but the Multiple R-squared value is shown at the bottom. The significance tests indicate that the intercept is significantly different from zero (not surprisingly, since we expect the highest density to occur at the source, where distance equals zero). The slope (-0.055) is also significantly different from zero and indicates that density decreases by .055 g/m3 for every increase of one km in distance. You can plot the relationship using Scatterplot in the Graphs menu. Just select Distance as the x-variable, Density as the y-variable, unselect Marginal boxplots, Smooth Line, and Show spread, and put 4 in the Plotting characters field. The plot should resemble Figure 9.9 (a).

To create Figure 9.9 (b) we need to compute the standardized residuals. We can add the unstandardized residuals using Models | Add observation statistics to data. Unselect everything except Residuals. Then we compute the standardized residuals following the procedure described on pages 152-153. First compute the standard error of the regression as follows:

ser<-sqrt(sum(Obsidian$residuals.RegModel.2^2)/11)

We sum the squares of the residuals, divide by the number of observations, and then take the square root of that number. To compute the standardized residuals and add them to the Obsidian data.frame use the following command:

Obsidian<-data.frame(Obsidian,StdRes=scale(Obsidian$residuals.RegModel.2,0,ser))

The scale() command computes z-scores given a mean (in this case, 0) and a standard deviation (in this case, ser, the standard error of the regression). You may have to unselect and reselect the Obsidian data set in Rcmdr so that it recognizes the new variable. Now we can produce Figure 9.9 (b) with the following two commands (or use Rcmdr to produce the scatterplot and then run abline(h=0) to get the horizontal line):

with(Obsidian, plot(Distance, StdRes, ylab="Standardized residuals", xlab="Distance (km)", ylim=c(-3, 3), pch=4, las=1))
abline(h=0)

This is not identical to the plot since the x-axis is at the bottom and a solid horizontal reference line is drawn at y=0. Just to show the flexibility that R provides, we can match the plot more closely with the following commands:

with(Obsidian, plot(Distance, StdRes, ylim=c(-3, 3), xlim=c(0, 80), ylab="Standardized residuals", xlab="Distance (km)", xaxt="n", yaxt="n", bty="n", pch=4))
axis(1, at=seq(0, 80, 10), labels=c(NA, 10, 20, 30, 40, NA, 60, 70, 80), pos=0)
axis(2, las=1, pos=0)

Most of the complexity in these commands comes from suppressing the default axes and replacing them with axes that intersect at (0, 0). In addition, we suppressed the x-axis labels at 0 and 50 by setting them to missing values (NA is used to specify a missing value in R).

After looking at the plots, we conclude that the linear model is not appropriate for these data. Shennan recommends fitting a linear model of the logarithm of Density against Distance. To compute the logarithm of density, select Data | Manage variables in active data set | Compute new variable. In the dialog box put LogDensity as the New variable name and log10(Density) as the Expression. The new variable will be added to the data set. Use View data set to see the new variable and compare the values to Table 9.3.

You should be able to create a new Regression model (RegModel.3 since this is the third one we have created, but you can change the name to anything else you want) with LogDensity as the Response variable and Distance as the Explanatory variable. Now use the Scatterplot command to plot the LogDensity (y-axis) against Distance (x-axis) and show the linear regression line. To plot the standardized residuals use the commands:

plot(Obsidian$Distance, rstandard(RegModel.3))
abline(h=0)

This does not match Figure 9.11 exactly since R uses a slightly different formula to compute the standardized residuals, but it is close enough to examine the fit of the regression.

To plot the original Density values and the regression equation on a single plot (like Figure 9.12) use the following commands:

plot(Obsidian$Distance, Obsidian$Density, xlab="Distance (km)", ylab=expression(paste("Obsidian density (g/", m^3, ")", sep="")), pch=4)
curve(10^(.737212-0.022852*x), 0, 75, add=TRUE)

The rather complicated y label (the ylab= expression) is needed to produce the superscript to indicate that density is measured in grams per cubic meter. The curve() command lets us define an equation with a single unknown (x) and indicate the range over which to draw the equation. By computing the linear regression between the log of Density and Distance we are fitting the equation Density = 10^(a + b*Distance).

In discussing autocorrelation, Shennan introduces a new data set in Table 9.4. Put these data into R as follows:

Plog<-data.frame(Time=seq(25, 525, 50),Sites=c(.25, .25, .55, .60, .95, 1.00, 1.05, 1.00, 1.15, 1.30, 1.65))

Select this data set in Rcmdr and use Scatterplot to create a plot similar to Figure 9.13. Use 4 as the Plotting character, turn off Marginal boxplots, Smooth line, and Show spread. Use “Date” as the x-axis label and “Site density” as the y-axis label (unless you want to try modifying the y-axis label we used in the previous plot). Now compute the regression statistics by creating RegModel.4 with Sites as the Response variable and Time as the Explanatory variable. Shennan raises the possibility of autocorrelation and mentions the Durbin-Watson statistic. You can compute that statistic by selecting Models | Numerical diagnostics | Durbin-Watson test for autocorrelation. As he reports, the results are not significant, so we cannot reject the null hypothesis that there is no autocorrelation. To illustrate how to remove a significant autocorrelation effect, Shennan creates a new data set containing the differences between all possible pairs of points. We can generate that data set using a programming loop in R:

alldiffs <- NULL
for (i in 1:10) {diffs <- diff(as.matrix(Plog), i); alldiffs <- rbind(alldiffs, diffs)}
alldiffs <- data.frame(alldiffs)

The first line creates an empty variable to hold the results and the last line converts that variable to a data set so that Rcmdr will find it. The line in the middle is a loop that runs 10 times; each time through, it takes the differences between points separated by i positions. The diff() command subtracts a point from an earlier point: if i is 1 it subtracts the previous point, if i is 2 it subtracts the point before the previous point, and so on. As we process each batch of differences we append the result to alldiffs. The curly braces enclose the commands to be repeated each time and the semicolon separates commands that are on the same line. Once the command has run, you can click on the Data set button in Rcmdr and select alldiffs. Then plot the variables and compute the regression model. When you use Scatterplot, turn off the line options except the linear regression line and select Jitter x variable. This will add a small value to the x values so that they do not overplot. Your R2 should match the .79 result reported in the book and the slopes of both regressions are virtually identical.

In the Statistical Inference section of the chapter, Shennan discusses the difference between the confidence band for the regression line and the confidence band for a prediction using the regression equation. R can generate both types of confidence intervals. If you created the HandAxe data set presented in the Exercises for Chapter 8, you can load it using Rcmdr. If not, create it now using the commands:

Scars <- c(18, 19, 33, 28, 24, 36, 45, 56, 47, 37, 72, 57, 53, 46, 78, 68, 63, 82)
Wgt <- c(210, 300, 195, 285, 410, 375, 295, 415, 500, 620, 510, 565, 650, 740, 690, 710, 840, 900)
HandAxe <- data.frame(Scars, Wgt)

Using Rcmdr, create the regression (RegModel.5) with Scars as the Response variable and Wgt as the Explanatory variable. Use scatterplot to examine the fit around the line. To create Figure 9.15 use the following commands:

plot(Wgt, Scars, xlab="Weight", ylab="Number of scars", xlim=c(0, 1000), ylim=c(0, 100), pch=16)
abline(RegModel.5)
PlotData <- data.frame(Wgt=seq(0, 1000, 10))
Conf <- predict(RegModel.5, PlotData, interval="confidence")
Pred <- predict(RegModel.5, PlotData, interval="prediction")
matplot(PlotData, Conf[,-1], type="l", lty=2, col="black", add=TRUE)
matplot(PlotData, Pred[,-1], type="l", lty=3, col="black", add=TRUE)
legend(0, 100, c("Regression line", "Confidence band (95%)", "Prediction band (95%)"), col="black", lty=c(1, 2, 3))

The first command generates the basic plot using a solid circle for each point. The abline() command plots the regression line (assuming the regression is stored in RegModel.5). The next line generates 101 points from 0 to 1000 (i.e. 0, 10, 20, 30 . . . 1000) to use in generating the points for the confidence and prediction intervals. The lines with the predict() commands generate the data for the confidence and prediction bands. The matplot() commands plot these bands using dashed lines (lty=2) for the confidence band and dotted lines (lty=3) for the prediction band. The two data sets, Conf and Pred, each include predicted values for the regression line as the first column, so we delete that column before plotting. At this point, the plot closely matches Figure 9.15. The legend() command adds a descriptive legend to identify the different lines.

Figure 9.16 adds several more lines to the plot. To get the mean lines just add the following commands:

abline(h=mean(HandAxe$Scars))
abline(v=mean(HandAxe$Wgt))

To add the additional regression lines, add the following commands:

b1 <- .07022 + 2*.01479
b2 <- .07022 - 2*.01479
a1 <- mean(HandAxe$Scars) - b1*mean(HandAxe$Wgt)
a2 <- mean(HandAxe$Scars) - b2*mean(HandAxe$Wgt)
abline(a=a1, b=b1, lty=6)
abline(a=a2, b=b2, lty=6)

To produce the other two lines we need to compute the two slopes, one that is the original value plus twice the standard error and one that is the original value minus twice the standard error (b1 and b2). We can get the original value and standard error from the printout for RegModel.5. Assuming you still have RegModel.5 selected you can reproduce this information with the menu selection: Models | Summarize model. Since the two new regression lines pass through the mean of both values, we have the necessary information to compute the intercept values for each line (a1 and a2). The variables a1, b1 define one of the lines and a2, b2 define the other line. We put those values into abline() and get the regression lines shown in Figure 9.16.
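
Rather than retyping the printed values, you can pull the slope and its standard error out of the model object; a hedged sketch, assuming the regression is still stored as RegModel.5:

cf <- summary(RegModel.5)$coefficients
b1 <- cf["Wgt", "Estimate"] + 2*cf["Wgt", "Std. Error"]
b2 <- cf["Wgt", "Estimate"] - 2*cf["Wgt", "Std. Error"]
a1 <- mean(HandAxe$Scars) - b1*mean(HandAxe$Wgt)
a2 <- mean(HandAxe$Scars) - b2*mean(HandAxe$Wgt)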

The last section of the chapter presents a way of fitting a linear model that is more robust (it reduces the influence of a small number of extreme values), based on the exploratory data analysis methods developed by John Tukey, and illustrates the Tukey line with an example. Base R's line() function fits Tukey's resistant line, which is closely related to the procedure Shennan describes, and other methods of robust regression are also available in R. The function rlm(), in the MASS package, will compute a robust estimate of the slope and intercept of a linear equation. You could try it out with the HandAxe data set, but there is very little difference between the robust estimates and those from the least squares model (RegModel.5).
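
A minimal sketch of trying rlm() out on the hand axe data (a hedged illustration of one robust alternative, not the Tukey line itself):

library(MASS)                           # rlm() lives in the MASS package
summary(rlm(Scars ~ Wgt, data=HandAxe))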

Chapter 10. Facing Up to Complexity: Multiple Regression and Correlation

After presenting the multiple regression model, Shennan discusses partial correlation. The data set used to illustrate partial correlation and multiple regression is presented in Table 10.1. Create that data set in R and call it Formative. Rcmdr makes it simple to follow the analysis presented on pages 187-201. First create a scatterplot of Available agricultural land (x) by Site size (y) (Figure 10.3) and the corresponding regression model with Site size as the Response variable and Available agricultural land as the Explanatory variable. Next create the scatterplot of Land productivity index by Site size (Figure 10.4) and the corresponding regression model with Site size as the Response variable and Land productivity as the Explanatory variable. Finally create the scatterplot of Available agricultural land and Land productivity index (Figure 10.5) and the corresponding regression model with Land productivity as the Response variable and Available agricultural land as the Explanatory variable. To compare the various plots side by side, it may be useful to use the Scatterplot matrix command in the Graphs menu. Select all three variables and get a matrix of scatterplots along with density plots for each variable.

For the discussion of partial correlations, it will be useful to have the correlations between the pairs of variables. Use Statistics | Summaries | Correlation matrix and select the three variables to get a correlation matrix showing the correlations between each pair of variables. With these correlations you can compute the partial correlations between each variable pair while holding the third variable constant. After computing the partial correlations using the data in the correlation matrix, go back to Statistics | Summaries | Correlation matrix and select the three variables again, but this time select Partial correlation under Type of correlation to get the results directly.
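
As a worked example of the hand calculation, the partial correlation between the first two variables controlling for the third can be computed from the correlation matrix. This is a hedged sketch: it assumes Formative contains only the three numeric variables, in the order Site size, Available agricultural land, Land productivity (adjust the indices if your columns are ordered differently):

r <- cor(Formative)
r12.3 <- (r[1,2] - r[1,3]*r[2,3]) / sqrt((1 - r[1,3]^2) * (1 - r[2,3]^2))
r12.3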

In the section on Multiple Correlation, the two correlation matrices you just created give you the numbers to replicate the computation of the multiple correlation. Including the regression models you have already computed, you also have the necessary numbers to compute the multiple regression coefficients. Now use Rcmdr to generate the multiple regression of Site size (Response) by Available agricultural land and Land productivity (Explanatory variables). The Multiple R-squared matches the number you computed from the correlation matrix and the partial correlation coefficients. To look at the multiple regression fit, use Graphs | 3D graph. Select Site size as the Response and the other two variables as the Explanatory variables. The graph window is small. Grab the lower corner with your mouse and drag it to make it larger. By clicking on the graph and holding the right mouse button down you can rotate the graph in any direction to see the data from different perspectives.

You can compute the beta coefficients in two ways. By using Statistics | Summaries | Numerical summaries, you can select the three variables and get their descriptive statistics. The standard deviations give you the values you need to convert the b (slope) values from the multiple regression into beta values. The second is to create three new variables that represent the z-scores of each of the original values and then generate the multiple regression again using the z-score versions of each variable. Rcmdr makes this very easy. Just select Data | Manage variables in active data set | Standardize variables and select the three variables. Rcmdr will add three new variables to the data set with the prefix Z. to identify them. Use these standardized variables to run the multiple regression again and your slope values will be the beta coefficients.
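
A hedged command-line version of the z-score approach, using the same assumed column order as the sketch above (Rcmdr's Z.-prefixed variables will give the same answer):

FormativeZ <- data.frame(scale(Formative))                 # convert every column to z-scores
lm(FormativeZ[, 1] ~ FormativeZ[, 2] + FormativeZ[, 3])    # the slopes are the beta coefficients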

To check some of the assumptions of multiple regression, add the fitted values and the Studentized residuals to the data set and plot them as shown in Figure 10.6. The Studentized residuals are similar to the Standardized residuals. For each point, the regression is computed leaving that point out of the analysis and then computing the residual for that point. These residual values are then standardized.

In the section on log-linear modeling, Shennan introduces a method of analyzing multiple categorical variables at one time. R provides several ways to analyze multivariate categorical data. Log-linear modeling involves fitting multiple variables in terms of their interactions with one another. The fitting process uses Iterative Proportional Fitting to produce expected values that are compared with the actual data. This is the method described in the chapter and it is not available through Rcmdr so we will type in the commands directly. The second method uses a generalized linear model to estimate the maximum likelihood parameter values of a model that predicts the counts of a response variable by using explanatory variables that can be interval or categorical variables. This method is slightly more time-consuming, but provides regression coefficients and standard errors. This method is directly available through Rcmdr so it will be illustrated briefly after working through the log-linear approach.

To follow the analysis you need to create the data set presented in Table 10.2 (very similar, but not identical, to Table 7.12). First create the table using the following command. Then use ftable() to display a table that closely matches Table 10.2 in the book.

GPit <- array(c(18, 30, 4, 6, 4, 3, 43, 20), dim=c(2, 2, 2), dimnames = list(Sex=c("Male", "Female"), 'Grave Volume'=c("<1.5 m3", ">1.5 m3"), 'Est Height'=c("<1.55 m", ">1.55 m")))
ftable(GPit, row.vars=c("Est Height", "Sex"), col.vars="Grave Volume")

If all of the variables are independent of one another we can compute the expected frequencies for the table in the same way we did in Chapter 7, except that we have three variables instead of two. The expected frequency for short males in small grave pits is the probability of being male (69/128) times the probability of being short (58/128) times the probability of being in a small grave (55/128) times the number of burials (128). The result is 13.43. If you want to compute the other expected values, it will be easier to get the numbers if you type the command addmargins(GPit) to get the table with the row and column sums. We can rearrange the formula to (69 x 58 x 55)/(128 x 128) since the 128 in the numerator cancels out one of the 128's in the denominator. Since the log-linear method works on logarithms, we can take the logarithm of both sides to get

log(expected) = log(69) + log(58) + log(55) - 2 x log(128)

The result is 2.5978 and the antilog (e^2.5978 or exp(2.5978)) is 13.43. We are using natural (or base e, where e = 2.71828...) logarithms for the calculations throughout this section. Rather than work through the calculation of G2 by hand, we can let R compute it for us:

loglin(GPit,list(1,2,3),fit=TRUE)

The value of lrt (Likelihood Ratio Test statistic) is 87.46. This is equivalent to G2. The value in the text is slightly larger due to rounding errors. The Pearson Chi Square value is 87.01. The number of degrees of freedom is 4 and the significance of G2 can be computed with the command: 1-pchisq(87.46, 4). The result is a very small number indicating that we must reject the null hypothesis of no associations between the variables. The section of the output labeled “$fit” shows the expected or fitted values for the table assuming no association between any of the variables.

Now we can proceed to follow the analysis presented in the book. Recall that the variables are numbered 1 = Sex, 2 = Grave Volume, and 3 = Est Height (p 208). In loglin() we can use either the numbers or the names we assigned earlier. The model specification in loglin() is handled by passing a list to the function. Within the list terms that are associated are enclosed with c() (the concatenation function). When Shennan writes [12], we will put c(1,2) and so on. Terms that are unassociated are just included in the list so the specification [1][23] becomes list(1,c(2,3)). It is straightforward to compute the expected values for the models shown in Tables 10.5 to 10.7. The comment before each command indicates which model and the comment after each command provides an alternate way of specifying the model.

# Table 10.5, model 2a, sex and grave-volume ([12][3])
loglin(GPit, list("Est Height", c("Sex", "Grave Volume")), fit=TRUE)
# loglin(GPit, list(c(1, 2), 3), fit=TRUE)

# Table 10.6, model 2b, est height and grave-volume ([1][23])
loglin(GPit, list("Sex", c("Est Height", "Grave Volume")), fit=TRUE)
# loglin(GPit, list(1, c(2, 3)), fit=TRUE)

# Table 10.7, model 2c, sex and est height ([2][13])
loglin(GPit, list("Grave Volume", c("Sex", "Est Height")), fit=TRUE)
# loglin(GPit, list(2, c(1, 3)), fit=TRUE)

The Likelihood Ratio Test statistic (labeled as lrt) or G2 values are shown in the output of each command. They support Shennan's conclusion that including an association between Grave Volume and Est Height reduces the statistic greatly (i.e. improves the fit), but the expected values are still significantly different from the observed values. The next stage of the analysis looks at models containing two association pairs. The following commands compute the models shown in Tables 10.9 to 10.11.

# Table 10.9, model 3a, (Sex and Est Height) and (Est Height and Grave Volume) ([13][23])
loglin(GPit, list(c("Sex", "Est Height"), c("Est Height", "Grave Volume")), fit=TRUE)
# loglin(GPit, list(c(1, 3), c(2, 3)), fit=TRUE)

# Table 10.10, model 3b, (Sex and Est Height) and (Sex and Grave Volume) ([12][13])
loglin(GPit, list(c("Sex", "Grave Volume"), c("Sex", "Est Height")), fit=TRUE)
# loglin(GPit, list(c(1, 2), c(1, 3)), fit=TRUE)

# Table 10.11, model 3c, (Sex, Grave Volume) and (Est Height, Grave Volume) ([12][23])
loglin(GPit, list(c("Sex", "Grave Volume"), c("Est Height", "Grave Volume")), fit=TRUE)
# loglin(GPit, list(c(1, 2), c(2, 3)), fit=TRUE)

Examining the lrt values shows that model 3a is the best. Just for fun we can compute the models for three two-way associations ([12][23][13]) and for the three-way association ([123], this is the saturated model).

loglin(GPit, list(c("Sex", "Grave Volume"), c("Est Height", "Grave Volume"), c("Sex", "Est Height")), fit=TRUE)
# loglin(GPit, list(c(1, 2), c(2, 3), c(1, 3)), fit=TRUE)

loglin(GPit, list(c("Sex", "Grave Volume", "Est Height")), fit=TRUE)
# loglin(GPit, list(c(1, 2, 3)), fit=TRUE)

Notice that lrt is not much smaller when we add the Sex and Grave Volume association ([12]) to the model in the first example. In the second case, the saturated model, lrt and df drop to zero because the saturated model always fits the observed data perfectly but does so by including as many terms as there are categories. In this case 2x2x2=8 categories and the saturated model includes the following terms:

[1], [2], [3], [12], [13], [23], [123]. This is because whenever we add an association, for example [12], we must also include the separate terms [1] and [2], and including a three-way association term requires including the three two-way associations as well. Seven terms plus the sample size (128) uses up all 8 degrees of freedom. While in this example we built the final model by working from complete independence to the completely saturated model, it is also possible to work from the saturated model to a simpler model by eliminating interactions that do not cause the chi-square value to increase significantly.
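
A hedged sketch of testing formally whether adding the [12] association improves model 3a, by comparing the change in G2 to the change in degrees of freedom between the two nested models (the object names m3a and mall are arbitrary):

m3a <- loglin(GPit, list(c(1, 3), c(2, 3)), print=FALSE)
mall <- loglin(GPit, list(c(1, 2), c(2, 3), c(1, 3)), print=FALSE)
1 - pchisq(m3a$lrt - mall$lrt, m3a$df - mall$df)   # test the drop in G2 on the drop in df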

The second way of analyzing the data is to use Poisson regression, a generalized linear model that is fitted with the glm() function and is available through Rcmdr. First we need to convert the GPit table into a data frame (and display the result to see how it changes):

Gpit2 <- data.frame(ftable(GPit, row.vars=c("Sex", "Est Height"), col.vars="Grave Volume"))
Gpit2

If you have Rcmdr loaded, you should be able to select Gpit2 by clicking the button next to "Data set:". Then select Statistics | Fit models | Generalized linear model. The Model Formula is Freq ~ Grave.Volume*Est.Height+Sex*Est.Height. You can type this into the two boxes or use the mouse to select variables and the buttons to select operators (e.g. +, *, etc.). Now double-click “poisson” under Family; the Link function “log” should now be highlighted. The results represent the same analysis that we did for model 3a, but the output is quite different. The Residual deviance is our G2 with 2 degrees of freedom. Running the command 1-pchisq(.36139, 2) indicates that the significance is p=.8347, so we do not reject the null hypothesis that the expected values capture the variability in the observed data. The rest of the output resembles a standard linear regression with an intercept and coefficients, but the coefficients represent specific categories within each variable. Examining the Pr(>|z|) column makes it clear that all of them are significant (with the possible exception of Sex(T.Female)). Since the interaction term including Sex is significant, we must keep this term in the model.
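
The same model can also be fitted by typing the command directly; a hedged sketch (the model name is arbitrary, and Rcmdr generates a very similar command):

GLM.1 <- glm(Freq ~ Grave.Volume*Est.Height + Sex*Est.Height, family=poisson(link="log"), data=Gpit2)
summary(GLM.1)
1 - pchisq(deviance(GLM.1), df.residual(GLM.1))   # significance of the residual deviance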

The intercept represents the log of the expected number of short males in small graves (exp(2.9018) = 18.21). If you have the output from model 3a computed by loglin(), you will see that this value matches the fitted value we computed with that model. The coefficients show how the expected number changes as we change characteristics (from male to female, from short to tall, or from small to large grave). For example, a tall female in a large grave requires us to add the entire column of values: exp(2.9018 + .4925 - 1.3542 - 1.5686 - 1.2071 + 3.7658) = 20.70. The Poisson regression model has the advantage of giving clear information about how each variable changes the predicted number of cases in a particular combination. It also has the advantage of indicating clearly which variables and interactions do not contribute significantly to the model.

Chapter 11. Classification and Cluster Analysis

R has a number of packages for cluster analysis covering all of the options described in the chapter. Rcmdr offers access to hierarchical clustering for interval scale data and to K-means partitioning. If you need Gower's coefficient or any of the dichotomous measures of association, you will have to use one of the packages that provide these options. Package cluster provides many different cluster analysis methods and implements Gower's coefficient as a distance option. Package vegan provides Jaccard's coefficient and other dichotomous and metric measures. Package ade4 provides ten different distance measures for dichotomous data.

Beginning on page 235, Shennan presents a discussion of hierarchical methods of cluster analysis. To illustrate them using Rcmdr create the following data set of two measurements on 18 objects that fall into three distinct clusters (not included in Shennan):

Cluster <- data.frame(group=factor(c(rep(1,6),rep(2,6),rep(3,6))), x=c(54,40,52,41,48,46,61,68,56,59,57,61,42,56,48,58,47,47), y=c(92,95,94,90,85,97,70,67,64,63,78,69,58,49,51,42,42,55))

Select Cluster in Rcmdr and create a scatterplot of x and y by group. First look at the data using Graphs | Scatterplot. Pick x for the x-variable and y for the y-variable and uncheck all of the options. Now look at the plot. You should see three clusters, each composed of six observations. Run Scatterplot again, following the same instructions as before, but this time click Plot by groups. The only grouping variable is group, so click OK. Now you get the same plot, but each group is identified. Now let's see if cluster analysis can find these groups.

Select Statistics | Dimensional analysis | Cluster analysis | Hierarchical cluster analysis. Select x and y as your variables, Average linking as the Clustering Method, and Euclidean as the Distance Measure. Make sure Plot Dendrogram is checked. Then click OK. Notice that the options include all of the alternative methods that Shennan mentions, as well as both Euclidean Distance and the City Block metric. You should explore how the results change when you select different options. On these data, the results will not change much. On real data, they could vary considerably.
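
If you want to see what the menu choices do, a rough typed equivalent is shown below (Rcmdr's generated command differs in some details, but the tree is the same; the name HClust.1 matches the one Rcmdr assigns and is assumed in the commands that follow):

HClust.1 <- hclust(dist(Cluster[, c("x", "y")]), method="average")
plot(HClust.1)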

The dendrogram shows that the cluster analysis has “discovered” the three groups of six objects each. We know they match the groups because the row numbers show that each group consists of 6 consecutive numbers. Now select Statistics | Dimensional analysis | Cluster analysis | Summarize hierarchical clustering and select 3 clusters. The means on each variable for each group are displayed in the Output window and the plot window plots the objects by the first two principal components along with the vectors representing each variable. More details on principal components and biplots will be presented in the next chapter. If you want to include the cluster assignments in your data set, select Statistics | Dimensional analysis | Cluster analysis | Add hierarchical clustering to data set. Pick three clusters and click OK and a new variable, hclus.label, will be added to the data set. If you View the new data set, you will see that the cluster assignments match the original group assignments. With real data, we would not know what the original or “real” group assignments were.
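
Typed directly, cutree() returns the same cluster assignments; a hedged equivalent of the menu item that adds them to the data set:

cutree(HClust.1, k=3)   # cluster membership for each of the 18 objects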

There is other information about the cluster analysis that we can access by typing commands. For example, the command HClust.1$merge will describe the sequence of clustering (assuming your cluster analysis was named HClust.1). The two columns define what was combined at each step. If two single objects were combined, the row names of those objects are listed with minus signs preceding them. Line one shows that the first step in the analysis was to combine objects 7 and 12 and the second step was to combine objects 1 and 3. The first five steps all show the combination of pairs of objects. In step 6 the group created in step 1 (objects 7 and 12) is combined with the group created in step 3 (objects 9 and 10). You may want to follow the process through all 17 steps by connecting the pairs and groups on a plot. The following commands produce a plot of the points labeled by their row numbers:

plot(Cluster$x, Cluster$y, asp=1, pch="", las=1)
text(Cluster$x, Cluster$y, rownames(Cluster))

The asp=1 option ensures that the x and y axes use the same scale so that distances on the graph reflect Euclidean distances.

The y axis on the dendrogram shows the height at which each join takes place. The height reflects the increasing distance between objects and groups as the clustering process proceeds. The exact height values depend on the clustering method being used. One line of evidence supporting the validity of the groups identified by the cluster analysis is that the height value should bend upward sharply when the ideal number of clusters is reached. For this example the following commands plot the height by the number of clusters:

plot(17:1, HClust.1$height, type="l", xlab="Clusters", ylab="Distance", xaxp=c(1, 18, 17), las=1)
abline(v=3, lty=3)

The 17:1 in the plot command lets us plot the height value against the current number of clusters, ranging from 17 after the first pair of objects are joined to one when all the objects are combined into a single cluster. The line shows how the height value increases as the number of clusters decreases. The vertical line at 3 clusters marks the actual number of clusters for these data. The height line increases steeply after this point. We can compare this line to another data set where we have randomly shuffled the x and y columns. We call this new data set Random and generate it with the following commands:

Random <- data.frame(x=sample(Cluster$x), y=sample(Cluster$y))
HClust.2 <- hclust(dist(Random), method="average")
plot(HClust.2, main="Cluster Dendrogram for Solution HClust.2", xlab="Observation Number in Data Set Cluster", sub="Method=average; Distance=euclidian")

Each time you run these commands you will get a different permutation. The dendrogram does not clearly indicate that the data are a random permutation of the original observations. If you run the Summarize command, the plot will show groups that appear to be separated in space. The algorithm for hierarchical cluster analysis will combine points that are close together to form clusters so the existence of clusters is not evidence that they are real. The following lines put the height line for the original data on the same graph as the randomized data:

plot(17:1, HClust.2$height, type="l", xlab="Clusters", ylab="Distance", xaxp=c(1, 18, 17))
points(17:1, HClust.1$height, type="l", lty=2)
abline(v=3, lty=3)
legend(13, 35, c("Random", "Data"), lty=c(1, 2))

The increase in the height line after three clusters should be steeper for the original data than the randomized data. The difference may be slight because the randomized data has the same distribution as the original. If you plot the histograms for x and y in the original and the randomized data, they will be the same and both show distinct bimodal distributions. The bimodal distributions will give even the randomized data the appearance of genuine clusters, but not necessarily three evenly-sized clusters.

Another way to compare the observed data to randomly generated data is to generate random normal variates with means and standard deviations that match the original data. The following commands do this:

Random <- data.frame(x=rnorm(18, mean(Cluster$x), sd(Cluster$x)), y=rnorm(18, mean(Cluster$y), sd(Cluster$y)))
HClust.3 <- hclust(dist(Random), method="average")
plot(HClust.3, main="Cluster Dendrogram for Solution HClust.3", xlab="Observation Number in Data Set Cluster", sub="Method=average; Distance=euclidian")

After examining the dendrogram, plot height by cluster for the original data and the randomized data:

plot(17:1, HClust.3$height, type="l", xlab="Clusters", ylab="Distance", xaxp=c(1, 18, 17), las=1)
points(17:1, HClust.1$height, type="l", lty=2)
abline(v=3, lty=3)
legend(13, 35, c("Random", "Data"), lty=c(1, 2))

Run these commands several times to see how the plots change with each new randomized data set. Shennan provides several ways of looking at the results of a cluster analysis. Discriminant analysis will be described later, but it provides only a way to identify which variables are important in defining the clusters. It does not test the validity of the cluster analysis. Randomization methods like those just described can help. Finally, the cophenetic correlation can be helpful in evaluating cluster results. It compares the cophenetic distances (the heights at which pairs of observations are first joined in the dendrogram) to the Euclidean distances between observations based on the original variables. The following commands compute this correlation for the original data and for the random normal data:

cor(dist(Cluster[, 2:3]), cophenetic(HClust.1))
cor(dist(Random), cophenetic(HClust.3))

The correlation should be higher for the original data, but the difference may not be that great. In general, multiple lines of evidence will be needed to evaluate the value of the results of cluster analysis.

To examine monothetic divisive clustering, create this data set from Table 11.14:

Graves <- data.frame(Type1=c(0, 0, 1, 1, 0, 1, 1, 1, 0, 0), Type2=c(0, 0, 1, 1, 1, 1, 1, 0, 0, 0), Type3=c(1, 1, 1, 1, 1, 0, 1, 0, 0, 0), Type4=c(1, 1, 0, 0, 1, 0, 1, 0, 1, 1))

We need to use package cluster which has a procedure for monothetic division called mona(). Install cluster and then load it with the library command:

install.packages("cluster")
library(cluster)
div <- mona(Graves)
div
plot(div)

This runs a monothetic divisive analysis on the data and then prints and plots the results. Divisions are based on the association between variables. The plot indicates the sequence of divisions. At separation step 1, Type 1 is used to divide the graves into two groups, (1, 2, 5, 9, 10) vs (3, 4, 6, 7, 8). From here on, the different types are used to split each group. At split 2, Type 2 is used to split the first group into two groups, (1, 2, 9, 10) vs (5), and Type 3 is used to split the second group into two groups, (3, 4, 7) vs (6, 8). At step 3, Type 3 splits (1, 2, 9, 10) into (1, 2) vs (9, 10), Type 4 splits (3, 4, 7) into (3, 4) vs (7), and Type 2 splits (6, 8) into (6) and (8). No more splits are possible since each remaining group is homogeneous.

Rcmdr also provides the ability to run a K-Means cluster analysis (p. 249 under Partitioning methods). On the menu bar select Statistics | Dimensional analysis | Cluster analysis | k-means cluster analysis. Select the variables you need and the number of clusters. The other default settings should be fine. For the Cluster data set, we would specify 3 clusters. The results include descriptive statistics on the clusters, within and between sums of squares, and a biplot for the first two principal components. You could also compare the results to randomly generated data as we did for the hierarchical cluster analysis.
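
A minimal typed sketch of K-means on the same two variables (the result depends on the random starting configuration, so nstart asks for several starts and keeps the best):

km <- kmeans(Cluster[, c("x", "y")], centers=3, nstart=10)
km$centers                           # the cluster means
table(Cluster$group, km$cluster)     # compare the solution to the original groups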

Chapter 12. Multidimensional Spaces and Principal Components Analysis

Shennan introduces the concepts of multidimensional spaces and the use of geometric representations of correlations between two or more variables. Principal components analysis extracts vectors of components that summarize the variability of a group of variables. The goal is to reduce the number of variables (dimensions) to simplify the analysis and to identify patterns in the data.

Rcmdr makes it easy to compute a principal components analysis, view the results and save principal component scores in your data set for further analysis. But it saves the results (loadings, eigenvalues) only temporarily so you will have to repeat the analysis if you want to work with them. Also R standardizes the component loadings differently from the way Shennan presents them (see the paragraph that begins on the middle of page 279). We can convert to his way of doing things if we save the results of the analysis.

On page 288 Shennan begins an extended example using twenty-two Bell Beakers. The data presented in Table 12.13 (p 294) are available on this website (the file is called Beakers.RData) if you do not want to type or scan them yourself. Load the data set using Rcmdr with Data | Load data set. To get the correlation matrix of all the variables use Statistics | Summaries | Correlation matrix and select all 12 variables. The resulting table wraps because the correlations are presented to seven decimal places (unlike the three used in Table 12.9). To get something closer to the table, drag the edge of the Rcmdr window to make it wider (so the results do not wrap) and then type the following command in the Script Window and click Submit:

round(cor(Beakers), 3)

This is simpler than the command Rcmdr generates because we take advantage of the fact that the data.frame contains only measurements, so we don't have to name each variable, and it has no missing data, so we don't have to specify how to handle them. The cor() function computes the correlation matrix. The default method is "pearson", but "kendall" and "spearman" are also available (use the command help(cor) for the details). The round() command rounds each correlation to three decimal places to match Table 12.9. This works fine if we want to use all the columns of the data set in the analysis, but what if the first column is a catalog number or a variable that we do not want in the analysis? By using Beakers[,2:12] we can specify columns 2 through 12 as the variables to use (we could get the same result with Beakers[,-1], which drops the first column). The comma is necessary to indicate that all rows of the data set should be used. For example, Beakers[-22,2:12] would exclude row 22 (the last row) and the first column.

To compute the principal components using Rcmdr, select Statistics | Dimensional analysis | Principal components analysis. Select all the variables. Make sure Analyze correlation matrix is selected. R will compute all of the principal components, not just the first three shown in Table 12.11.

Scroll the Output Window up to the .PC<-princomp(. . .) command. The component loadings do not resemble Table 12.11 at all. There are two issues. First, the eigenvectors (also known as principal component loadings) are standardized so that the sum of the squared loadings equals one, instead of being multiplied by the square root of the respective eigenvalue so that the sum of squares equals the eigenvalue (Shennan, page 279). When the loadings are rescaled in the latter way, they represent the correlations between each variable and the principal component. Second, the results in R invert the loadings of the first and third components so that every positive loading in Table 12.11 is negative and vice versa. Mathematically the results are identical, but we'll rerun the analysis using commands so that we can transform the results to match the text.

The next two sections of the output provide the information in Table 12.10. First the eigenvalues for each component (but they are labeled “.PC$sd^2 # component variances”). Recall that the sum of the eigenvalues is 12 (the number of variables). The first three eigenvalues are all greater than one. A general rule of thumb in deciding how many components to use is to select those with eigenvalues greater than one (i.e. components that summarize more than a single variable's variance). From the “Importance of the components” section you can see that the first three components summarize a bit more than 90% of the variance in the original 12 variables. Notice in the Output Window that Rcmdr displays a remove(.PC) command which deletes the results of the analysis. If we type the commands directly, we have more control over the output and the format of the results. To run the analysis again, type the following commands into the Script Window. Select them and click Submit.

PC <- princomp(Beakers, cor=TRUE)
AdjLoadings <- sweep(PC$loadings, 2, PC$sdev, "*")
round(AdjLoadings[,1:3], 2)
# round(cbind(-AdjLoadings[,1], AdjLoadings[,2], -AdjLoadings[,3]), 2)
round(AdjLoadings[,1:3]^2, 2)
AdjScores <- round(PC$scores[,1:3]/4.7, 2)

The first command reproduces the principal components analysis and saves it to a variable called PC. PC is a list, a variable that contains multiple variables (use str(PC) to examine the contents). The results of the analysis include the standard deviations (PC$sdev, which are just the square roots of the eigenvalues), the principal component loadings (PC$loadings), the means of the original variables (PC$center), the standard deviations of the original variables (PC$scale), the number of observations (PC$n.obs), and the principal component scores (PC$scores). The second line creates adjusted loadings by multiplying each column of the original loadings by the square root of the corresponding eigenvalue. The sweep() command handles this and is useful for many data management operations (?sweep to find out more). The third line prints the first three columns so that you can compare the results to Table 12.11.

The signs are still inverted on the first and third components, but you can ignore this. The next line begins with #, which means that it is a comment that R will ignore. The command on that line does the same thing as the third line, but it multiplies the first and third columns of loadings by -1 so they will match Table 12.11. Just type the command without the # if you want to run it. The fifth line alters the third command slightly to print out the squared loadings, which are also listed in Table 12.11.

Finally, the last command computes the component scores shown in Table 12.12, except that the signs are inverted on components 1 and 3. This command requires some explanation. The scores computed by R are exactly proportional to the ones in Table 12.12 except that they are 4.7 times larger. The computer program Shennan used must have scaled the principal component scores, but I arrived at the 4.7 figure by trial and error. Normally you will not be comparing your results to another program's output. Scaling constants and inversion of the signs on one or more components do not affect your ability to use the results.

The next four commands produce Figures 12.18a and b.

plot(-AdjLoadings[,1]*1000, AdjLoadings[,2]*1000, xlab="Component 1", ylab="Component 2", pch=16, las=1)
text(-AdjLoadings[,1]*1000, AdjLoadings[,2]*1000, as.character(1:12), cex=.75, pos=3)
plot(-AdjScores[,1], AdjScores[,2]*1000, xlab="Component 1", ylab="Component 2", pch=16, las=1)
text(-AdjScores[,1], AdjScores[,2]*1000, as.character(1:22), cex=.75, pos=3)

Look at the first two commands. Both sets of loadings are multiplied by 1000 as in the figure, and the component 1 loadings are multiplied by -1 to invert them so that the figures will be similar. The second command labels the points by generating the numbers 1 to 12 for the variables and placing them above each point (pos=3). The second two commands plot the scores. Figure 12.18 plots the scores for component 1 against the scores for component 2 multiplied by 1000. The scaling does not really matter and the figures will look the same (except for the axis labeling) if you delete “*1000” everywhere it occurs in the four commands.

Principal components can also be plotted on a biplot, which shows the loadings as vectors and the cases as points. To produce a biplot from the principal component results (remember we stored them in the list PC), just use the following command:

biplot(PC)

Notice that the variables (the red arrows) fall into the same clusters as shown in Figure 12.18a except that they are flipped left to right. Vessels 13, 21, and 22 are relatively squat vessels while 7, 9, and 10 are relatively slim vessels (Shennan's discussion of Component 1, p. 293). Vessels 9 and 18 have relatively larger rim diameters and relatively low neck and belly heights while vessels 11 and 12 are the opposite.

It is possible to rotate the results of the analysis as discussed on pages 301-303.

RPC <- varimax(AdjLoadings[,1:3])
RPC

The variable RPC now holds the results of rotating the adjusted component loadings. The default print method for data matrices classed as “loadings” blanks out small loadings to emphasize the larger ones. You will see the same thing if you type PC and click Submit. This will summarize the original principal component analysis. You can force display of all of the loadings with unclass(PC$loadings).
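
For example, to see all of the rotated loadings without the blanking, strip the "loadings" class just as we did for the unrotated loadings:

round(unclass(RPC$loadings), 2)    # every rotated loading, rounded for readability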

There is another function for computing principal components in R called prcomp(). While princomp() extracts eigenvalues and eigenvectors to compute the principal components, prcomp() uses singular value decomposition. This is generally accepted to be more stable and reliable, especially in cases where the variables are highly correlated. In general you should use prcomp().

PC2 <- prcomp(Beakers, scale.=TRUE)

The results in this case are identical. To make things a bit more confusing, prcomp() labels the results somewhat differently so that the loadings are called rotation and the scores are called x.
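
As a quick sketch of where prcomp() keeps its pieces (using the PC2 object created above):

round(PC2$rotation[,1:3], 2)    # the loadings, which prcomp() calls "rotation"
head(round(PC2$x[,1:3], 2))     # the scores, which prcomp() calls "x"
round(PC2$sdev^2, 2)            # the eigenvalues, as before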

Finally, Shennan briefly mentions factor analysis. In R the procedure factanal() does a basic factor analysis and is available through Rcmdr. Packages FAiR and psych have procedures for performing various kinds of factor analysis.
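
As a minimal illustration (not an analysis Shennan presents), a basic maximum-likelihood factor analysis of the beaker data might look like the following. The choice of three factors simply mirrors the three components retained above, and with only 22 vessels the procedure may complain about the small sample:

FA <- factanal(Beakers, factors=3, rotation="varimax")
print(FA$loadings, cutoff=0.3)    # blank out loadings smaller than 0.3 for readability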

Chapter 13. Correspondence Analysis and Other Multivariate Techniques

We can use R to follow along with Shennan's discussion of correspondence analysis, but we cannot use the Rcmdr menus. You can either open Rcmdr and type the following commands in the Script Window and then select and Submit them to be executed or you can open just a Script window in R for the same purpose. Whichever you select, you can save the Script window as a file for future reference.

First we have to generate the data in Table 13.1. We enter the counts as a single vector, column by column (all the microlith counts, then the scrapers, then the burins), and use array() to arrange them into a table with five rows and three columns, supplying dimension names for the assemblages and the artifact types at the same time. We display the table so far; it lacks the margins (row and column totals). Finally we add the marginal values and display the result, which should look like Table 13.1.

TMeso <- array(c(68, 136, 41, 690, 78, 37, 95, 0, 181, 165, 8, 3, 3, 26, 19), dim=c(5, 3),
    dimnames=list(Assemblage=1:5, Type=c("Microlith", "Scraper", "Burin")))
TMeso
addmargins(TMeso)

The next two commands produce Tables 13.2 and 13.3.

round(prop.table(addmargins(TMeso,1),1),3)
round(prop.table(addmargins(TMeso,2),2),3)

Notice that we use addmargins() to add just row (1) or column (2) totals and then we pass the results to prop.table() to construct proportions by row or column (again, 1 specifies rows and 2 specifies columns). Then we pass the results to round() to round them to 3 decimal places. R processes the innermost function first and then works its way out. The margins are labeled “Sum,” but because we added them before we computed the proportions, they match the Averages in the tables. To save these tables so you can incorporate them into an Excel spreadsheet or other document, use the xtable package:

library(xtable)
print(xtable(addmargins(TMeso)), type="html")
print(xtable(prop.table(addmargins(TMeso,1),1), digits=3), type="html")
print(xtable(prop.table(addmargins(TMeso,2),2), digits=3), type="html")

Select the HTML output between <TABLE border=1> and </TABLE> (including those tags) and paste it into an Excel spreadsheet. If you do not have Excel, paste the output into a plain text editor and save the file with an “.html” extension. Then open or import the table file into a spreadsheet or word processing program.
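
If you prefer to skip the copy-and-paste step, print.xtable() can write the HTML directly to a file (the file name here is just an example):

print(xtable(addmargins(TMeso)), type="html", file="Table13_1.html")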

To look at the distribution of the five assemblages, Shennan uses a tripolar (ternary) plot. There are several packages in R that contain functions to do this. One is in the package vcd (Visualizing Categorical Data, based on the book of the same name by Michael Friendly). You already have the package installed because Rcmdr uses some of its functions. To load the package and create the ternary plot use the following commands. Run the first plot command, examine the result, and then run the second.

library(vcd)
ternaryplot(TMeso, id=dimnames(TMeso)$Assemblage)
ternaryplot(TMeso[, c(2,3,1)], id=dimnames(TMeso)$Assemblage, main="")

We load the vcd package and then run the ternary plots. The first plot adds the default label “ternary plot” and the order of the variables around the triangle does not match that in Figure 13.1. The second one puts the types in the same order around the triangle as the figure although the labels are positioned differently. To add the Average to the plot, you just need to remember that we are plotting the rows of a table. If we add the summary row to the bottom of the table, it will be plotted as well:

TMesoMod <- addmargins(TMeso, 1)
ternaryplot(TMesoMod[,c(2, 3, 1)], id=dimnames(TMesoMod)$Assemblage, main="")

To get the Average shown as an asterisk and the label shown as “Ave” we need a couple more commands:

dimnames(TMesoMod)$Assemblage <- sub("Sum", "Ave", dimnames(TMesoMod)$Assemblage)
ternaryplot(TMesoMod[,c(2, 3, 1)], id=dimnames(TMesoMod)$Assemblage, pch=c(rep(16, 5), 8), main="")

The “Ave” label is printed on top of the symbol for the first assemblage, but there is no simple way to change that. Other ternary plot functions have more options. The one in the ade4 package is particularly extensive. The ternary plot shows that the main variation between the assemblages is the relative abundance of scrapers vs microliths. There is little variation in burins.

Table 13.4 compares the observed frequencies for the five assemblages to the expected frequencies if there were no differences in the relative abundance. This is the same as the model of no association between the rows and columns that we use as the null hypothesis when computing a Chi-Square test. The easiest way to get the expected frequencies is to compute a Chi-Square test:

Results <- chisq.test(TMeso)
Results$observed
round(Results$expected, 1)

These commands display the observed and expected frequencies as separate tables. The first line creates a new variable, Results, that contains several values and tables that will be useful in following Shennan's discussion of correspondence analysis. We can generate a summary table for the row and column Chi-Square values with the following commands:

ChiSq <- Results$residuals^2
round(addmargins(ChiSq), 2)

Squaring the residuals and summing over all the cell entries gives the Chi-Square value in the lower right cell of the table (also available as Results$statistic). Using the round() command makes it easier to read the values. To get the transformed coordinates shown in Table 13.5 we start with Table 13.2 and divide each entry by the square root of the value in the bottom row of the same column:

num <- prop.table(addmargins(TMeso, 1), 1)
denom <- sqrt(num[6,])
Trans <- sweep(num, 2, denom, "/")
round(Trans, 2)

The first command creates a variable num that contains Table 13.2. The second command takes the square root of each value in the bottom row of Table 13.2 (row 6) and saves the result as denom. The third command uses the function sweep() to sweep down the columns of num (2 specifies columns instead of rows), dividing by the value of denom for that column. The last command prints a table that looks like Table 13.5 (within rounding error and recognizing the typo for burins in assemblage 5). You can plot these results using the scatterplot3d() command in the scatterplot3d package to match Figure 13.2.

library(scatterplot3d)
scatterplot3d(Trans, type="h", angle=50, scale.y=.6, pch=as.character(c(1:5, 9)), lty.hplot=2, cex.symbols=1.5)

We plot the assemblage numbers 1 through 5 and use 9 for the average as in Figure 13.2. The angle and orientation of the axes are controlled by angle and scale.y. The easiest way to learn what each does is to run the command repeatedly, changing one of the parameters slightly to see how the figure changes. If you want an interactive 3d plot, use the following commands:

library(rgl)
plot3d(Trans, type="s", size=1.5, col=c(rainbow(5), "black"))

The interactive graphics commands tend to open a very small window. Use your mouse to drag the corner of the window to make it as large as you want. Put the mouse cursor in the graph box and hold the left mouse button while you drag left, right, up, or down to rotate the graph. Hold the right mouse button and drag to zoom in or out on the graph. If you have a mouse wheel, you can use that to zoom the graph. We cannot label the points in this interactive graph, but we can give each point a different color. The average is black. The five assemblages are coded with the rainbow() function, which selects colors along the rainbow beginning with red (then orange, yellow, green, blue, indigo, violet). We are using only 5 colors along the spectrum so they are easy to distinguish: red is the first assemblage, then yellow, green, and blue, with violet for the fifth.

To compute a correspondence analysis of the Mesolithic data set (the basis for Figures 13.3 – 13.8 and Tables 13.6 and 13.7) we will use the ca() function in package ca which was developed by Michael Greenacre and Oleg Nenadic. There are a number of different correspondence analysis procedures in R, but most of them focus on the coordinates and graphical results and do not provide the mass, inertia, correlation, and contribution values that Shennan discusses. This is not surprising since Shennan's description of correspondence analysis follows Greenacre's text on the subject. First we will generate Table 13.6, the cell contributions to inertia. This is simple to do with the ChiSq table that we produced earlier:

round(addmargins(ChiSq)/sum(ChiSq)*100,1)

The values agree reasonably well, but Shennan's table has a number of rounding errors. The row and column sums in the table we just computed agree with the “Inr” columns in Table 13.7 although the values there have been multiplied by 1000 (per mil) while ours were multiplied by 100 (percent). To compute a correspondence analysis of the Mesolithic assemblages, use the following commands:

install.packages("ca")  # to install the package if you have not done so
library(ca)
CAMeso <- ca(TMeso)
summary(CAMeso)

The first command installs the ca package (you only need to do this once), the second loads it, and the third computes the correspondence analysis. The fourth command displays the results in a manner very similar to Table 13.7. The main difference is in the second component for the artifact types. The numbers match exactly, but the values for microliths and scrapers are negative while the value for burins is positive in Table 13.7. In our table they are the opposite. Correspondence analysis describes the spatial relationships of the data and the location of the components, but not their direction, so two programs can differ in the signs of the component scores (just as is the case with principal components). But they should do so consistently: within a component and within the rows or columns of the data, all the values should be flipped so that negative values become positive and vice versa. Also note that the summary report abbreviates the variable names to provide a compact table. The procedure does not plot a single dimension at a time so we cannot produce Figures 13.3-13.5.
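
If you want to flip the signs yourself so the numbers can be compared directly with the book, a small sketch using the standard column coordinates stored in the ca object follows; multiplying by the singular values (CAMeso$sv) converts the standard coordinates to the principal coordinates used in the symmetric map:

Flipped <- cbind(CAMeso$colcoord[,1], -CAMeso$colcoord[,2])    # negate dimension 2
round(Flipped, 2)                                              # standard coordinates
round(sweep(Flipped, 2, CAMeso$sv[1:2], "*"), 2)               # principal coordinates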

Now let's plot the results:

plot(CAMeso)

This command reproduces Figure 13.8 except for the fact that the second axis is reversed for the types. For some reason Figure 13.8 plots the y-coordinates of the assemblages with the signs reversed. Figures 13.6 and 13.7 can be produced with slight modifications of the same command. The terminology is a bit confusing. The program generates two sets of scores for the rows and the columns, “principal” and “standard.” When Shennan refers to “space defined by the variables” he is referring to the column standard scores. In the ca() program this is map="rowprincipal" because the rows use principal scores. The default map is symmetric using row and column principal scores (Figure 13.8).

plot(CAMeso, map="rowprincipal")
plot(CAMeso, map="colprincipal")

The next section of the chapter is an extended example using Janet Levy's data on Bronze Age hoards from Denmark (Levy, Janet. 1982. Social and Religious Organization in Bronze Age Denmark: An Analysis of Ritual Hoard Finds. British Archaeological Reports, International Series 124). The data are presented in Table 13.8. Getting those data into a format to duplicate the analysis is a bit of a challenge, but you can download Hoards.RData and HoardsWide.RData from the website. If you want to create them yourself, the Appendix gives step-by-step instructions. To run the correspondence analysis you need to install the ca package and load it.

Results <- ca(HoardsWide, nd=4)
summary(Results)

The results match Table 13.10, but there are some discrepancies that probably relate to rounding errors in computing the results and possibly to typos in the transcription of the data. The discrepancies are larger for the fourth component. To produce the displays in Figures 13.9-13.11 use the following plot commands:

plot(Results)
plot(Results, dim=c(1, 3))
plot(Results, dim=c(1, 4))

The ca package also gives you the ability to plot three components at a time:

plot3d(Results, labels=c(1, 1))

As before, you can drag the corner of the window to make it larger and then drag with the left mouse button to rotate the graph or drag with the right mouse button (or the wheel) to zoom in and out.

Seriation

By now you should not be surprised to learn that R has a package for seriation. Shennan does not provide an extensive discussion of seriation methods, but he does point out that the first axis in a correspondence analysis often captures chronological differences between assemblages. Figures 13.12 and 13.13 use the scores from the correspondence analysis of Danish Hoards to show that the sequence resembles what one would expect of a chronological ordering, but the clustering of the variables (artifact types) suggests that the axis may relate more to a male/female dichotomy than to chronology. It is relatively easy to generate figures similar to those in the book using the seriation package.

install.packages("seriation")library(seriation)HoardO <- HoardsWide[order(Results$rowcoord[,1]), order(Results$colcoord[,1])]bertinplot(as.matrix(HoardO))

You will only need to install the package once. The next line loads the seriation package. We create a new data.frame with the rows and columns ordered according to the scores for the first axis of the correspondence analysis. Then we use bertinplot() to plot the data. The bertinplot() function uses bars to indicate the number of each type in each hoard; if the number of artifacts is greater than the mean, the bar is highlighted (black). The labeling of the hoards across the top is a bit crowded. Grab one side of the window with your mouse and drag it across to make it wider. This is not exactly like the figures, but it has the advantage of indicating the quantity of each type in each hoard. It is not much additional work to get closer to the figure. We need to change the data so that instead of showing the count for each type, we just indicate presence (1) or absence (0). We create that data set with a single command and then generate the two plots.

HoardPA <- apply(HoardO, 2, function(x) ifelse(x>0, 1, 0))
bertinplot(as.matrix(HoardPA), options=list(reverse=TRUE))
bertinplot(as.matrix(HoardPA))

The first plot matches Figure 13.12 and the second one matches Figure 13.13.

The chapter concludes with brief summaries of three other techniques: principal coordinates analysis, multidimensional scaling, and discriminant analysis. The first two are related and begin with a distance matrix (like cluster analysis, chapter 11). Discriminant analysis starts with data collected from samples of known groups (e.g. obsidian sources) and constructs an equation to predict the source of any unknown samples. Since they are not addressed in detail, we will only identify the key programs for performing these analyses.

Principal coordinates analysis is also called classical or metric multidimensional scaling. It is provided by the function cmdscale(), which operates on a distance matrix. The dist() function can create a distance matrix using one of six ways of measuring distance, including Euclidean distance. The cmdscale() function returns the coordinates of the points, one for each row of the original data.
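
As a brief sketch (assuming the Beakers data frame from Chapter 12 is still loaded), a metric multidimensional scaling of the 22 vessels takes only a few lines:

D <- dist(Beakers)           # Euclidean distances between the vessels
Coords <- cmdscale(D, k=2)   # one row of coordinates per vessel
plot(Coords, pch=16, xlab="Axis 1", ylab="Axis 2", las=1)
text(Coords, labels=rownames(Beakers), cex=.75, pos=3)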

Non-metric multidimensional scaling is provided by two functions in the MASS package. Kruskal's classic non-metric multidimensional scaling is obtained with isoMDS(), and sammon() provides an alternative that fits the smaller distances better. If isoMDS() produces a plot with a tight cluster of points surrounded by a few outlying points, sammon() may give better results. Both start with a distance matrix (dist()) and both require that none of the distances are 0 (no identical points in the original data). The solution is only determined up to rotations and reflections. The results include the coordinates for the number of dimensions specified (two by default) and the stress value.
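
A similar sketch for the non-metric version, again assuming the Beakers data are loaded; standardizing the variables first keeps those measured in large units from dominating the distances:

library(MASS)
D <- dist(scale(Beakers))    # no two rows may be identical (no zero distances)
NMDS <- isoMDS(D, k=2)       # Kruskal's non-metric MDS; reports the stress as it runs
plot(NMDS$points, pch=16, xlab="Axis 1", ylab="Axis 2", las=1)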

Discriminant analysis is performed using the lda() (linear discriminant analysis) and qda() (quadratic discriminant analysis) functions, also in the MASS package. Once the discriminant functions have been computed, predict() allows you to predict group membership for the original data (to check the accuracy of the classification) or for new data.
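
As a purely hypothetical sketch (the data frames Sources and Unknowns and the column source are invented for illustration; nothing like them appears in the book), a sourcing study might look like this:

library(MASS)
# Sources: element readings for specimens of known origin, plus a column 'source'
# Unknowns: the same measurement columns for unprovenanced specimens
Fit <- lda(source ~ ., data=Sources)
predict(Fit)$class                     # how well do the known specimens classify?
predict(Fit, newdata=Unknowns)$class   # predicted sources for the unknowns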

Chapter 14. Probabilistic Sampling in Archaeology

In the previous chapters, we have used R functions to perform various computations. Here we will construct our own simple functions to illustrate how easy it is to build custom tools for analyses that you need to run regularly.

On page 365, Shennan gives a formula for estimating the required sample size to discriminate between two populations given a tolerance (± factor). The formula requires three numbers: Zα (or tα,df), the standard normal score for a two-tailed significance level of α; d, the tolerance; and s, an estimate of the standard deviation. We can create a function with a single expression (shown here across several lines for readability):

samsize <- function(stdev, tol, alpha=.05) {
  (qnorm(alpha/2, lower.tail=FALSE) * stdev/tol)^2
}
samsize(5, 1)
1/(1/samsize(5, 1) + 1/2000)

When you run the function definition, nothing happens unless you make a typo, in which case you'll get an error message. We give the function a name (samsize) and use function() to show what variables the function will need. For two of the variables, we just listed names (stdev, tol). That means that they must be provided or the function will fail. The third variable (alpha) is given a default value of .05, so if you are computing the sample size for an alpha value of .05, you do not need to specify it when calling the function. Between the curly braces {} go the commands that perform the work; in this case it is a single equation and the result is returned. Once you have the function, you can run the second command and you should get the result shown at the bottom of page 367. The third command adjusts the sample size using the finite population correction as shown at the top of page 368.

You should be able to create functions for the other equations in this section.
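
For example, the finite population correction used above can be wrapped in its own small function (a sketch; the argument names n and N are just mnemonics for the uncorrected sample size and the population size):

fpc <- function(n, N) {
  1/(1/n + 1/N)
}
fpc(samsize(5, 1), 2000)    # same result as the third command above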

R has some simple functions to help you select random samples. The example of selecting a random sample from 2000 projectile points (for detailed use-wear analysis for instance) can be handled simply in R with a single command:

sample(2000, 25)

This will draw 25 specimens from 2000 without replacement. If you had a data set with the catalog numbers for the 2000 points (along with variables characterizing each one), you could use the variable (column) containing the catalog numbers and get a list of the catalog numbers of the specimens selected. Likewise, if you had a listing of the quadrats in a survey area, sample() would give you the quadrats in your random sample. You can use the function for stratified sampling by drawing a separate sample from the list of members of each stratum.
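
A sketch of both ideas, using made-up object names (a data frame Points with a CatNo column, and two vectors of quadrat numbers, UplandQuads and FloodplainQuads, standing in for the strata):

Points$CatNo[sample(nrow(Points), 25)]                     # catalog numbers of the sampled points
c(sample(UplandQuads, 15), sample(FloodplainQuads, 10))    # stratified sample of quadrats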

Appendix. Creating the Danish Hoards Datasets

If you want to try creating the data set yourself, it is easiest to use the new data set editor in Rcmdr via Data|New data set. Name it Hoards. Create three columns: Hoard (for the hoard number, numeric), Type (for the artifact type, character), and Freq (for the number of artifacts, numeric). Each artifact type in each hoard gets its own row, so Hoard 1 will have five rows. There is a discrepancy between the data (Table 13.8) and the results (Table 13.10). Leave out Hoard 25 completely, the fibula in Hoard 6, and the saw in Hoard 22, as they are not included in the analysis. For Hoard 31, add 3 belt plates, 17 tutuli, 1 celt, and 5 sickles. Use 1 as the frequency for tubes since no number is given. You should have 153 rows. Use Rcmdr to save the file.

Now we have to replace the Type names with the shorter labels in Table 13.10 and create a table with the types as columns and the hoards as rows. Type these commands to create a list of the different type names.

Labels <- unique(Hoards$Type)
Labels

The second command lists the types. There should be 20. If there are more, then you mistyped a type name. If there are fewer, you missed one. Use the Edit data set button to correct any errors. The variable Type is a factor, so changing an entry in the data set will add a new factor level but not delete the old one. To drop the unused levels, type the following commands:

Hoards$Type <- factor(Hoards$Type)
Labels <- unique(Hoards$Type)
Labels

Check again to make sure there are 20 types. Now we need to make shorter labels for each of the types, which will become the column headings of the table. We'll use the labels from Table 13.10.

Labels <- data.frame(Type=Labels, Label=substr(gsub(" ", "", Labels), 1, 8))
Labels <- Labels[order(Labels$Label),]
fix(Labels)

These commands create a new Labels data set which contains two columns. The first column (Type) is the original Type name and the second column (Label) is created by stripping the blanks from the type names and taking the first 8 characters as the label. That gets us pretty close, but you'll have to edit the following labels by hand to match Table 13.10. Change “beltplat” to “beltpla”, “neckcoll” to “neckcol”, “spiralar” to “spirarm”,“spiralfi” to “spirfin”, and “weaponpa” to “weapals.” Now, let's put the labels in alphabetical order to make it easier to find things in the results.

ord <- levels(Labels$Label)[order(levels(Labels$Label))]
Labels$Label <- factor(Labels$Label, levels=ord)

Finally, we merge Hoards and Labels and then create the table we will analyze with correspondence analysis.

Hoards <- merge(Hoards, Labels, by="Type")
HoardsWide <- xtabs(Freq~Hoard+Label, data=Hoards)
HoardsWide <- data.frame(unclass(HoardsWide))

The first command merges the two data sets using Type as the link; this adds a new column, Label, to Hoards. The second command creates a "wide" form of the data so that each label is a column and each row is a single hoard. The last command changes the table produced by xtabs() to a data.frame. This step is not necessary to run the correspondence analysis, but Rcmdr only displays, loads, and saves data.frames, so converting the table to a data.frame lets you save it for future use. Check to be certain that HoardsWide contains 44 rows (hoards) and then save it using Rcmdr.
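
If you prefer the command line, save() writes the data frame to an .RData file (this file name matches the one distributed on the website):

save(HoardsWide, file="HoardsWide.RData")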
