NOMOR 3 UTS

STAC50H3: Data Collection
Assignment 2

Due: Oct 30, 2013 in class
All relevant work must be shown for credit.

Note: In any question, if you are using R, all R code and R output must be given in an appendix, and your written answers must be given separately in the main part of the answers. You should assume that the reader is not familiar with R output and explain all your findings from the output in simple English.

1. The data set packetdata.csv (on the course webpage) was obtained from the Journal of Statistics Education Data Archive (but I have made some changes to the data). The data set gives one hour's data on internet network traffic between a company and the rest of the world on March 8th, 1995. (Paxson and Floyd (1995) and Sanchez and He (2005) also analyzed this data set.) The actual data set is large. In the data file packetdata.csv, I have included only a subset of that large data set, but let's treat that subset as our population. This data set contains six variables. In this question we will focus on the variable databytes (the last column of the data set).

Note 1: Before selecting your SRSWOR, use set.seed(124) to make sure that you get the same sample as mine.

(a) (5 points) Make a histogram of databytes and describe its main features.

(b) (13 points) Use R to take a SRSWOR of size n = 300 from this population and use that sample to estimate the population mean and its standard error. You may use the sampling and survey packages. Also calculate an approximate 95% confidence interval for the population mean. Comment on the theoretical validity and practical appropriateness (as a measure of center) of your estimates.

(c) (2 points) Calculate the population mean (i.e. the mean of databytes in the population). Is that close enough to your estimate in part (b)?

Solution:

> #If there are non-numerical data in the variable of interest, we can
> #remove the non-numerical data first and then use the sampling and
> #survey packages as in this code.
> pop=read.csv("C:/Users/Mihinda/Desktop/packetdata.csv", header=1)
> x=pop$databytes # Variable to be analyzed
> #------------------------------------------------------------
> # The following two lines will remove non-numerical data, if any


> pop$x pop # variable is interest,
> #-------------------------------------------------------------
> n=300 # sample size
> library(sampling)
Loading required package: MASS
Loading required package: lpSolve
> set.seed(124) #This gives the same sample
> s=srswor(n,length(pop))
> srsdata = getdata(pop,s)
> srsdata

  ID_unit data
1      23  256
2     338  512
3     452  512
4     938    0
5    1390    0
6    1535    5
...
> # R assigns a name called data but we can change it as shown below.
> hist(pop)

(a) The histogram of databytes is shown below. It is not symmetric, but for this sample size the CLT means that the effect of non-Normality will not be a big problem. It looks like there are a few outliers. It would be better to remove these outliers and redo the analysis.

[Figure: "Histogram of pop". x-axis: pop (ticks from 0 to 1500); y-axis: Frequency.]

> #-----------------------------------------------------------
> library (survey)


Attaching package: survey

The following object is masked from package:graphics:

dotchart

> srsdata$x=srsdata$data
> srsdesign m m

    mean     SE
x 180.46 13.857
> confint(m , level = 0.95)

     2.5 %   97.5 %
x 153.2997 207.6203
> t confint(t , level = 0.95)

    2.5 %   97.5 %
x 7640919 10348416
> summary(pop)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0     0.0    15.0   185.3   512.0  1460.0

(b) The estimated population mean is 180.46 with standard error 13.857, and the 95 percent confidence interval for the population mean is (153.2997, 207.6203). Even though the distribution of databytes is skewed to the right, this confidence interval is still valid (theoretically approximately correct) since our sample size (n = 300) is large (CLT). However, the median is a more appropriate measure of center for skewed distributions.

(c) The population mean is 185.3, which lies inside the confidence interval we calculated, (153.2997, 207.6203), so yes, it is close enough to our estimate.
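The estimate, standard error, and confidence interval in part (b) can also be computed by hand in base R, which may help a reader see what the survey package reports. The following is a minimal sketch under stated assumptions: the population `pop_bytes` is a simulated, hypothetical stand-in (packetdata.csv is not read here), and `sample()` is used in place of `srswor()`, so it will not reproduce the sample or the numbers shown above even with the same seed.

```r
# Hypothetical stand-in population (NOT the real packetdata.csv values):
# a right-skewed variable with mean near 185, loosely resembling databytes.
set.seed(124)
N <- 5000
pop_bytes <- round(rexp(N, rate = 1 / 185))

n <- 300
idx <- sample(N, n)        # sample() draws without replacement by default (SRSWOR)
y <- pop_bytes[idx]

est <- mean(y)                               # estimate of the population mean
se  <- sqrt((1 - n / N) * var(y) / n)        # SE with finite population correction
ci  <- est + c(-1, 1) * qnorm(0.975) * se    # approximate 95% confidence interval

print(c(estimate = est, se = se, lower = ci[1], upper = ci[2]))
print(mean(pop_bytes))                       # population mean, for comparison
```

Comparing the printed interval with the population mean mirrors the check asked for in part (c).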

2. (10 points) Forest data. The data in file forest.dat are from kdd.ics.uci.edu/databases/covertype/covertype.data.html (Blackard, 1998). This link also provides a description of the data set (please read it). This data set consists of a subset of the measurements from 581,012 30 × 30 m cells from Region 2 of the U.S. Forest Service Resource Information System. These cells are classified into 7 cover types (denoted by 1 to 7). The original data were used in a data mining application, predicting forest cover type from covariates. Data-mining methods are often used to explore relationships in very large data sets; in many cases, the data sets are so large that statistical software packages cannot analyze them. Many data-mining problems, however, can be alternatively approached by analyzing probability samples from the population. In this exercise, we treat forest.csv as a population. Select an SRSWOR of size 1000 from this population. Using your SRSWOR, estimate the percentage of cells in the forest of cover type 2 along with a 95% CI.


Note: Before selecting your SRSWOR, use set.seed(124) to make sure that you get the same sample as mine.

Solution:

> pop=read.csv("C:/Users/Mahinda/Desktop/forest.csv", header=0)
> cover = pop[,15]
> pop$x = (cover > 1 )*(cover < 3)
> # pop$x = (cover == 2 ) is OK but the summary looks a bit different
> # but that is OK.
> n=1000 # sample size
> library(sampling)
Loading required package: MASS
Loading required package: lpSolve
> set.seed(124) #This gives the same sample
> s=srswor(n,nrow(pop))
> srsdata = getdata(pop,s)
> head(srsdata) # To see the first few data lines

     ID_unit   V1  V2 V3  V4  V5   V6  V7  V8  V9  V10 V11 V12 V13 V14 V15 x
143      143 2837 112  8 272  16 3649 235 231 128 6221   1   0   0   0   2 1
262      262 3168  25  4  30   0 4426 218 231 150 3180   1   0   0   0   1 0
406      406 2656 110 23  85  35 1350 252 208  72  636   1   0   0   0   5 0
1417    1417 2779  78 47 134 118  150 227 111   0 1318   1   0   0   0   5 0
2270    2270 2041 112 25  30   2  510 253 203  62  685   0   0   0   1   4 0
2516    2516 2097  26 21   0   0  300 204 188 113  601   0   0   0   1   3 0
> summary(srsdata$x)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.000   0.000   0.498   1.000   1.000

> #-----------------------------------------------------------
> library (survey)

Attaching package: survey

The following object is masked from package:graphics:

dotchart

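> srsdesign m m

   mean     SE
x 0.498 0.0158
> confint(m , level = 0.95)

      2.5 %    97.5 %
x 0.4670217 0.5289783

The estimated percentage of cells of cover type 2 is 49.8%, with standard error 1.58 percentage points. The 95% confidence interval is approximately (46.7%, 52.9%).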
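As a rough cross-check of the reported standard error and interval, the usual SRS formula for a proportion can be evaluated directly. This is a sketch only: it takes the sample proportion 0.498 and n = 1000 from the output above, and it assumes the sampling fraction n/N is small enough that the finite population correction can be dropped (the population size is not stated in the solution), so the result matches the survey output only to about three decimal places.

```r
p_hat <- 0.498   # sample proportion of cover type 2, taken from the output
n <- 1000        # sample size

# SE of a sample proportion under SRS, using the (n - 1) divisor.
# Assumption: the finite population correction (1 - n/N) is treated as 1.
se <- sqrt(p_hat * (1 - p_hat) / (n - 1))
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se

print(round(se, 4))        # standard error of the proportion
print(round(100 * ci, 1))  # 95% CI expressed as percentages
```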

3. Consider the hypothetical population below:

Unit number   Stratum   y
1             1         1
2             1         2
3             1         4
4             2         4
5             2         7
6             2         7



(a) (5 points) Write out all possible SRSs of size 2 from stratum 1. Do the same for stratum 2.

Solution:
All possible samples from stratum 1: (1, 2), (1, 3), (2, 3)
All possible samples from stratum 2: (4, 5), (4, 6), (5, 6)

(b) (5 points) Using your work in (a), find the sampling distribution of $\bar{y}_{str}$.

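Solution:

The SRSs         Measures on the samples   Sample means                            $\bar{y}_{str}$                                    p
(1, 2), (4, 5)   (1, 2), (4, 7)            $\bar{y}_1 = 1.5$, $\bar{y}_2 = 5.5$    $\tfrac{1}{2}(1.5) + \tfrac{1}{2}(5.5) = 3.50$     $\tfrac{1}{3} \cdot \tfrac{1}{3} = \tfrac{1}{9}$
(1, 2), (4, 6)   (1, 2), (4, 7)            $\bar{y}_1 = 1.5$, $\bar{y}_2 = 5.5$    $\tfrac{1}{2}(1.5) + \tfrac{1}{2}(5.5) = 3.50$     $\tfrac{1}{3} \cdot \tfrac{1}{3} = \tfrac{1}{9}$
(1, 2), (5, 6)   (1, 2), (7, 7)            $\bar{y}_1 = 1.5$, $\bar{y}_2 = 7.0$    $\tfrac{1}{2}(1.5) + \tfrac{1}{2}(7.0) = 4.25$     $\tfrac{1}{3} \cdot \tfrac{1}{3} = \tfrac{1}{9}$
(1, 3), (4, 5)   (1, 4), (4, 7)            $\bar{y}_1 = 2.5$, $\bar{y}_2 = 5.5$    $\tfrac{1}{2}(2.5) + \tfrac{1}{2}(5.5) = 4.00$     $\tfrac{1}{3} \cdot \tfrac{1}{3} = \tfrac{1}{9}$
(1, 3), (4, 6)   (1, 4), (4, 7)            $\bar{y}_1 = 2.5$, $\bar{y}_2 = 5.5$    $\tfrac{1}{2}(2.5) + \tfrac{1}{2}(5.5) = 4.00$     $\tfrac{1}{3} \cdot \tfrac{1}{3} = \tfrac{1}{9}$
(1, 3), (5, 6)   (1, 4), (7, 7)            $\bar{y}_1 = 2.5$, $\bar{y}_2 = 7.0$    $\tfrac{1}{2}(2.5) + \tfrac{1}{2}(7.0) = 4.75$     $\tfrac{1}{3} \cdot \tfrac{1}{3} = \tfrac{1}{9}$
(2, 3), (4, 5)   (2, 4), (4, 7)            $\bar{y}_1 = 3.0$, $\bar{y}_2 = 5.5$    $\tfrac{1}{2}(3.0) + \tfrac{1}{2}(5.5) = 4.25$     $\tfrac{1}{3} \cdot \tfrac{1}{3} = \tfrac{1}{9}$
(2, 3), (4, 6)   (2, 4), (4, 7)            $\bar{y}_1 = 3.0$, $\bar{y}_2 = 5.5$    $\tfrac{1}{2}(3.0) + \tfrac{1}{2}(5.5) = 4.25$     $\tfrac{1}{3} \cdot \tfrac{1}{3} = \tfrac{1}{9}$
(2, 3), (5, 6)   (2, 4), (7, 7)            $\bar{y}_1 = 3.0$, $\bar{y}_2 = 7.0$    $\tfrac{1}{2}(3.0) + \tfrac{1}{2}(7.0) = 5.00$     $\tfrac{1}{3} \cdot \tfrac{1}{3} = \tfrac{1}{9}$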
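Although the question is meant to be done by hand, the enumeration can be checked mechanically. The sketch below lists every pairing of a stratum 1 sample with a stratum 2 sample and tabulates the resulting values of the stratified mean. The variable names are illustrative, and the stratum weights of 1/2 each follow from both strata having 3 units.

```r
y  <- c(1, 2, 4, 4, 7, 7)   # y values for units 1..6
s1 <- combn(1:3, 2)         # all SRSs of size 2 from stratum 1 (units 1-3)
s2 <- combn(4:6, 2)         # all SRSs of size 2 from stratum 2 (units 4-6)

ybar1 <- apply(s1, 2, function(u) mean(y[u]))   # stratum 1 sample means
ybar2 <- apply(s2, 2, function(u) mean(y[u]))   # stratum 2 sample means

# Stratified mean for each of the 3 x 3 = 9 equally likely combinations,
# with weights W1 = W2 = 1/2.
ybar_str <- as.vector(outer(ybar1, ybar2, function(a, b) 0.5 * a + 0.5 * b))

# Sampling distribution: each distinct value with its probability.
dist <- table(ybar_str) / length(ybar_str)
print(dist)
```

Collapsing the nine rows this way gives probabilities 2/9, 2/9, 3/9, 1/9, and 1/9 for the values 3.5, 4, 4.25, 4.75, and 5.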

(c) (5 points) Use your answer in part (b) to find the mean and variance of $\bar{y}_{str}$ (i.e. $E(\bar{y}_{str})$ and $Var(\bar{y}_{str})$).

Note: Do not use R in this question. Show all your work clearly.

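Solution:

$$E(\bar{y}_{str}) = \sum \bar{y}_{str}\, p = 3.5 \cdot \tfrac{1}{9} + 3.5 \cdot \tfrac{1}{9} + 4.25 \cdot \tfrac{1}{9} + 4.00 \cdot \tfrac{1}{9} + 4.00 \cdot \tfrac{1}{9} + 4.75 \cdot \tfrac{1}{9} + 4.25 \cdot \tfrac{1}{9} + 4.25 \cdot \tfrac{1}{9} + 5.00 \cdot \tfrac{1}{9} = 4.166666667$$

$$E(\bar{y}_{str}^2) = \sum \bar{y}_{str}^2\, p = 3.5^2 \cdot \tfrac{1}{9} + 3.5^2 \cdot \tfrac{1}{9} + 4.25^2 \cdot \tfrac{1}{9} + 4.00^2 \cdot \tfrac{1}{9} + 4.00^2 \cdot \tfrac{1}{9} + 4.75^2 \cdot \tfrac{1}{9} + 4.25^2 \cdot \tfrac{1}{9} + 4.25^2 \cdot \tfrac{1}{9} + 5.00^2 \cdot \tfrac{1}{9} = 17.58333333$$

$$Var(\bar{y}_{str}) = E(\bar{y}_{str}^2) - \left(E(\bar{y}_{str})\right)^2 = 17.58333333 - 4.166666667^2 = 0.22222$$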
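These two results can be cross-checked against the standard stratified-sampling formulas, which give the same mean and variance without enumerating samples. The sketch below assumes the usual textbook expressions, with the mean as the weighted sum of stratum means and the variance as the sum over strata of $W_h^2 (1 - n_h/N_h) S_h^2 / n_h$. The variable names are illustrative.

```r
y1 <- c(1, 2, 4)            # stratum 1 values
y2 <- c(4, 7, 7)            # stratum 2 values
N_h <- c(3, 3)              # stratum sizes
n_h <- c(2, 2)              # sample sizes per stratum
W_h <- N_h / sum(N_h)       # stratum weights (1/2 each)

S2_h <- c(var(y1), var(y2)) # stratum variances S_h^2 (divisor N_h - 1)

# The stratified mean is unbiased, so its expectation is the population mean.
mean_str <- sum(W_h * c(mean(y1), mean(y2)))

# Variance of the stratified mean with finite population corrections.
var_str <- sum(W_h^2 * (1 - n_h / N_h) * S2_h / n_h)

print(c(mean = mean_str, variance = var_str))
```

Both values agree with the enumeration: 25/6 = 4.1667 for the mean and 2/9 = 0.2222 for the variance.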

4. (S. L. Lohr) Suppose that a city has 80,000 dwelling units, of which 30,000 are houses, 40,000 are apartments, and 10,000 are condominiums.

(a) (5 points) You believe that the mean electricity usage is about twice as much for houses as for apartments or condominiums, and that the standard deviation is proportional to the mean so that $S_1 = 2S_2 = 2S_3$. The cost of taking an observation is the same for all dwelling units. We want to take a stratified sample of n = 800 dwelling units in order to estimate the mean electricity consumption per unit in the population. How would you allocate a stratified sample of 800 observations if we wanted to minimize the variance of the estimate?



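Solution: Use $n_h = \dfrac{N_h S_h}{\sum_{l=1}^{H} N_l S_l}\, n$ for optimal allocation with equal costs. Writing $S_2 = S_3 = k$ and $S_1 = 2k$, and measuring $N_h$ in units of 10,000, we get $n_1 = \dfrac{3 \cdot 2k}{3 \cdot 2k + 4k + k} \times 800 = \dfrac{6}{11} \times 800 \approx 436$, $n_2 = \dfrac{4}{11} \times 800 \approx 291$, and $n_3 = \dfrac{1}{11} \times 800 \approx 73$.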
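The allocation arithmetic is easy to reproduce. The sketch below applies the optimal (Neyman) allocation formula with equal costs. Only the relative sizes of the standard deviations matter, so `S_rel` uses 2, 1, 1 in place of the unknown constant k. The variable names are illustrative.

```r
N_h   <- c(houses = 30000, apartments = 40000, condos = 10000)  # stratum sizes
S_rel <- c(2, 1, 1)   # relative standard deviations: S1 = 2*S2 = 2*S3
n     <- 800          # total sample size

# Neyman allocation: n_h proportional to N_h * S_h.
n_h <- n * N_h * S_rel / sum(N_h * S_rel)

print(round(n_h, 2))  # exact allocation before rounding
print(round(n_h))     # rounded allocation
```

The rounded allocation sums to 800, so no adjustment is needed after rounding.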

(b) (8 points) Now suppose that you take a stratified random sample with proportional allocation and want to estimate the overall proportion of households in which energy conservation is practiced. If 40% of house dwellers, 20% of apartment dwellers, and 5% of condominium residents practice energy conservation, what is the value of $p$ for this population? What gain would the stratified sample with proportional allocation offer over an SRSWOR? i.e., what is $V_{prop}(\hat{p}_{str}) / V(\hat{p}_{SRSWOR})$?

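Solution:

$$p = \frac{30000 \times 0.40 + 40000 \times 0.20 + 10000 \times 0.05}{80000} = \frac{20500}{80000} = 0.25625.$$

$$V(\hat{p}_{str}) = \sum_{h=1}^{H} W_h^2 (1 - f_h) \frac{p_h (1 - p_h)}{n_h}, \quad W_h = \frac{N_h}{N}, \quad n_h = W_h n, \quad f_h = \frac{n_h}{N_h} = \frac{n}{N},$$

with $n = 800$, $N_1 = 30000$, $N_2 = 40000$, $N_3 = 10000$, $p_1 = 0.40$, $p_2 = 0.20$, $p_3 = 0.05$.

$$V(\hat{p}_{str}) = \sum_{h=1}^{H} W_h^2 (1 - f_h) \frac{p_h (1 - p_h)}{n_h} = \left(1 - \frac{n}{N}\right) \frac{1}{n} \sum_{h=1}^{H} W_h\, p_h (1 - p_h)$$

$$= \left(1 - \frac{800}{80000}\right) \frac{1}{800} \left( \frac{3}{8} \times 0.4 \times (1 - 0.4) + \frac{4}{8} \times 0.2 \times (1 - 0.2) + \frac{1}{8} \times 0.05 \times (1 - 0.05) \right) = 0.0002177226563.$$

$$V(\hat{p}_{SRSWOR}) = \left(\frac{N - n}{N - 1}\right) \frac{p (1 - p)}{n} = \left(\frac{80000 - 800}{80000 - 1}\right) \frac{0.25625\,(1 - 0.25625)}{800} = 0.0002358530458.$$

$$\frac{V_{prop}(\hat{p}_{str})}{V(\hat{p}_{SRSWOR})} = \frac{0.0002177226563}{0.0002358530458} \approx 0.92.$$

The variance of the estimator with stratified sampling is about 92% of that with SRSWOR.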
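The numbers in this solution can be reproduced with a few lines of R. The sketch below follows the same two variance formulas used in the solution: the proportional-allocation form with the factor $(1 - n/N)$ for the stratified estimator, and the factor $(N - n)/(N - 1)$ for SRSWOR. The variable names are illustrative.

```r
N_h <- c(30000, 40000, 10000)   # houses, apartments, condominiums
p_h <- c(0.40, 0.20, 0.05)      # stratum proportions practicing conservation
n   <- 800                      # total sample size

N   <- sum(N_h)
W_h <- N_h / N                  # stratum weights 3/8, 4/8, 1/8
p   <- sum(W_h * p_h)           # overall population proportion

# Stratified sampling with proportional allocation (n_h = W_h * n).
v_str <- (1 - n / N) * sum(W_h * p_h * (1 - p_h)) / n

# Simple random sampling without replacement.
v_srs <- (N - n) / (N - 1) * p * (1 - p) / n

ratio <- v_str / v_srs
print(c(p = p, v_str = v_str, v_srs = v_srs, ratio = ratio))
```

The ratio is below 1 because the stratum proportions differ, so stratification removes the between-stratum component of the variance.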
