NOMOR 3 UTS



  • STAC50H3: Data Collection Assign 2

    Due: Oct 30, 2013 in class

    All relevant work must be shown for credit.

    Note: In any question, if you are using R, all R code and R output must be given in an appendix and your written answers must be given separately in the main part of the answers. You should assume that the reader is not familiar with R output and explain all your findings from the output in simple English.

    1. The data set packetdata.csv (on course webpage) was obtained from the Journal of Statistics Education Data Archive (but I have made some changes to the data). The data set gives one hour's data on internet network traffic between a company and the rest of the world on March 8th, 1995. (Paxson and Floyd (1995) and Sanchez and He (2005) also analyzed this data set.) The actual data set is large. In the data file packetdata.csv, I have only included a subset of that large data set, but let's assume that subset is our population. This data set contains six variables. In this question we will focus on the variable databytes (the last column of the data set).

    Note 1: Before selecting your SRSWOR, use set.seed(124) to make sure that you get the same sample as mine.

    (a) (5 points) Make a histogram of databytes and describe its main features.

    (b) (13 points) Use R to take a SRSWOR of size n = 300 from this population and use that sample to estimate the population mean and its standard error. You may use the sampling and survey packages. Also calculate an approximate 95% confidence interval for the population mean. Comment on the theoretical validity and practical appropriateness (as a measure of center) of your estimates.

    (c) (2 points) Calculate the population mean (i.e. the mean of databytes in the population). Is it close enough to your estimate in part (b)?

    Solution:

    > # If there are non-numerical data in the variable of interest, we can
    > # remove the non-numerical data first and then use the sampling and
    > # survey packages as in this code.
    > pop=read.csv("C:/Users/Mihinda/Desktop/packetdata.csv", header=1)
    > x=pop$databytes # Variable to be analyzed
    > #------------------------------------------------------------
    > # The following two lines will remove non-numerical data, if any


  • Page 2 of 6

    > pop$x pop # variable is interest,
    > #-------------------------------------------------------------
    > n=300 # sample size
    > library(sampling)
    Loading required package: MASS
    Loading required package: lpSolve
    > set.seed(124) #This gives the same sample
    > s=srswor(n,length(pop))
    > srsdata = getdata(pop,s)
    > srsdata

      ID_unit data
    1      23  256
    2     338  512
    3     452  512
    4     938    0
    5    1390    0
    6    1535    5
    ...
    > # R assigns a name called data but we can change it as shown below.
    > hist(pop)

    a) The histogram of databytes is shown below. It is not symmetric (it is skewed to the right), but for this sample size the CLT means the effect of non-normality will not be a big problem. There appear to be a few outliers. It would be better to remove these outliers and redo the analysis.

    [Figure: Histogram of pop. Horizontal axis: pop, with ticks at 0, 500, 1000, 1500. Vertical axis: Frequency, with ticks at 0, 10000, 25000.]
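The shape described in (a) can also be checked numerically. The following is only a sketch under stated assumptions: `fake_bytes` is a made-up right-skewed vector standing in for the databytes column, since packetdata.csv is not reproduced here.

```r
# Hypothetical stand-in for databytes: many zeros plus a long right tail.
set.seed(1)
fake_bytes <- c(rep(0, 600), round(rexp(400, rate = 1 / 400)))

# hist() with plot = FALSE returns the bin counts without drawing a plot.
h <- hist(fake_bytes, breaks = 20, plot = FALSE)
print(h$counts)

# For right-skewed data the mean sits above the median.
skewed_right <- mean(fake_bytes) > median(fake_bytes)
print(c(mean = mean(fake_bytes), median = median(fake_bytes)))
print(skewed_right)
```

A mean well above the median, as in the `summary(pop)` output for the real data (mean 185.3, median 15.0), points the same way as the histogram.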

    > #-----------------------------------------------------------
    > library (survey)

  • Page 3 of 6

    Attaching package: survey

    The following object is masked from package:graphics:

    dotchart

    > srsdata$x=srsdata$data
    > srsdesign m m
        mean     SE
    x 180.46 13.857
    > confint(m , level = 0.95)
         2.5 %   97.5 %
    x 153.2997 207.6203
    > t confint(t , level = 0.95)
        2.5 %   97.5 %
    x 7640919 10348416
    > summary(pop)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
        0.0     0.0    15.0   185.3   512.0  1460.0

    b) The estimated population mean is 180.46 with a standard error of 13.857. The 95 percent confidence interval for the population mean is (153.2997, 207.6203). Even though the distribution of databytes is skewed to the right, this confidence interval is still valid (theoretically, approximately correct) since our sample size (n = 300) is large enough for the CLT to apply. However, the median is a more appropriate measure of center for skewed distributions.
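The estimator behind part (b) can be sketched in base R. This is not the session's own survey-package code: `pop_demo` is a hypothetical skewed population, and because the draw uses `sample()` rather than `srswor()`, it will not reproduce the assignment's sample even with the same seed.

```r
# Hypothetical population; in the assignment this would be the databytes vector.
set.seed(124)
pop_demo <- round(rexp(5000, rate = 1 / 185))
N <- length(pop_demo)
n <- 300

# sample() without replacement draws a simple random sample without replacement.
y <- pop_demo[sample(N, n)]

ybar <- mean(y)
# Standard error of the mean with the finite population correction (1 - n/N).
se <- sqrt((1 - n / N) * var(y) / n)
# Approximate 95% confidence interval from the normal distribution.
ci <- ybar + c(-1, 1) * qnorm(0.975) * se

print(c(estimate = ybar, se = se, lower = ci[1], upper = ci[2]))
```

The finite population correction matters little when n is a small fraction of N, which is why the interval is driven mostly by the sample variance and the sample size.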

    (c) The population mean is 185.3, which lies inside the confidence interval we calculated, (153.2997, 207.6203), so yes, it is close enough to our estimate.
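As a quick arithmetic check of (b) and (c), the interval can be rebuilt from the rounded mean and standard error printed in the output. Because the inputs are rounded, the endpoints agree with the reported ones only to about two decimal places.

```r
# Values copied from the R output above (rounded as printed).
est      <- 180.46
se       <- 13.857
pop_mean <- 185.3

# Normal-approximation 95% interval.
ci <- est + c(-1, 1) * qnorm(0.975) * se
print(round(ci, 2))

# Is the population mean inside the interval, and how many SEs from the estimate?
inside <- pop_mean > ci[1] && pop_mean < ci[2]
z <- (pop_mean - est) / se
print(inside)
print(round(z, 2))
```

The population mean is roughly a third of a standard error above the estimate, which is well within ordinary sampling variation.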

    2. (10 points) Forest data. The data in file forest.dat are from kdd.ics.uci.edu/databases/covertype/covertype.data.html (Blackard, 1998). This link also provides a description of the data set (please read it). This data set consists of a subset of the measurements from 581,012 30 × 30 m cells from Region 2 of the U.S. Forest Service Resource Information System. These cells are classified into 7 cover types (denoted by 1 to 7). The original data were used in a data mining application, predicting forest cover type from covariates. Data-mining methods are often used to explore relationships in very large data sets; in many cases, the data sets are so large that statistical software packages cannot analyze them. Many data-mining problems, however, can be alternatively approached by analyzing probability samples from the population. In this exercise, we treat forest.csv as a population. Select an SRSWOR of size 1000 from this population. Using your SRSWOR, estimate the percentage of cells in the forest of cover type 2, along with a 95% CI.

  • Page 4 of 6

    Note: Before selecting your SRSWOR, use set.seed(124) to make sure that you get the same sample as mine.

    Solution:

    > pop=read.csv("C:/Users/Mahinda/Desktop/forest.csv", header=0)
    > cover = pop[,15]
    > pop$x = (cover > 1 )*(cover < 3)
    > # pop$x = (cover == 2 ) is OK but the summary looks a bit different
    > # but that is OK.
    > n=1000 # sample size
    > library(sampling)
    Loading required package: MASS
    Loading required package: lpSolve
    > set.seed(124) #This gives the same sample
    > s=srswor(n,nrow(pop))
    > srsdata = getdata(pop,s)
    > head(srsdata) # To see the first few data lines

         ID_unit   V1  V2 V3  V4  V5   V6  V7  V8  V9  V10 V11 V12 V13 V14 V15 x
    143      143 2837 112  8 272  16 3649 235 231 128 6221   1   0   0   0   2 1
    262      262 3168  25  4  30   0 4426 218 231 150 3180   1   0   0   0   1 0
    406      406 2656 110 23  85  35 1350 252 208  72  636   1   0   0   0   5 0
    1417    1417 2779  78 47 134 118  150 227 111   0 1318   1   0   0   0   5 0
    2270    2270 2041 112 25  30   2  510 253 203  62  685   0   0   0   1   4 0
    2516    2516 2097  26 21   0   0  300 204 188 113  601   0   0   0   1   3 0
    > summary(srsdata$x)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0.000   0.000   0.000   0.498   1.000   1.000

    > #-----------------------------------------------------------
    > library (survey)

    Attaching package: survey

    The following object is masked from package:graphics:

    dotchart

    > srsdesign m m
       mean     SE
    x 0.498 0.0158
    > confint(m , level = 0.95)
          2.5 %    97.5 %
    x 0.4670217 0.5289783
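The output above can be turned into the percentage that the question asks for. This is a small sketch using only the printed values. The second part assumes the finite population correction is negligible, because the population size is not shown in the output.

```r
m  <- 0.498    # estimated proportion of cover type 2, copied from the output
se <- 0.0158   # its standard error, copied from the output
n  <- 1000     # sample size

# Normal-approximation 95% interval, expressed in percent.
ci  <- m + c(-1, 1) * qnorm(0.975) * se
pct <- round(100 * c(estimate = m, lower = ci[1], upper = ci[2]), 1)
print(pct)

# Assumption: ignoring the fpc, the SE of a sample proportion is
# sqrt(p(1 - p) / (n - 1)), which should land close to the printed SE.
se_nofpc <- sqrt(m * (1 - m) / (n - 1))
print(round(se_nofpc, 4))
```

Read in plain English: about 49.8% of the cells are estimated to be of cover type 2, with a 95% confidence interval of roughly 46.7% to 52.9%.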

    3. Consider the hypothetical population below:

    Unit number   Stratum   y
    1             1         1
    2             1         2
    3             1         4
    4             2         4
    5             2         7
    6             2         7


  • Page 5 of 6

    (a) (5 points) Write out all possible SRSs of size 2 from stratum 1. Do the same for stratum 2.

    Solution:
    All possible samples from stratum 1: (1, 2), (1, 3), (2, 3)
    All possible samples from stratum 2: (4, 5), (4, 6), (5, 6)
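Although the question is meant to be done by hand, the listing can be verified with `combn()`. This is offered purely as a check, not as part of the required answer.

```r
stratum1 <- c(1, 2, 3)  # unit numbers in stratum 1
stratum2 <- c(4, 5, 6)  # unit numbers in stratum 2

# combn() returns one sample per column; t() puts one sample per row.
s1 <- t(combn(stratum1, 2))
s2 <- t(combn(stratum2, 2))
print(s1)
print(s2)
```

Each stratum has choose(3, 2) = 3 possible samples, so there are 3 × 3 = 9 possible stratified samples in total.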

    (b) (5 points) Using your work in (a), find the sampling distribution of $\bar{y}_{str}$.

    Solution:

    SRSs (units)     y values on the samples   $\bar{y}_1$   $\bar{y}_2$   $\bar{y}_{str}$                     p
    (1,2), (4,5)     (1,2), (4,7)              1.5           5.5           (1/2)(1.5) + (1/2)(5.5) = 3.50      (1/3)(1/3) = 1/9
    (1,2), (4,6)     (1,2), (4,7)              1.5           5.5           (1/2)(1.5) + (1/2)(5.5) = 3.50      (1/3)(1/3) = 1/9
    (1,2), (5,6)     (1,2), (7,7)              1.5           7.0           (1/2)(1.5) + (1/2)(7.0) = 4.25      (1/3)(1/3) = 1/9
    (1,3), (4,5)     (1,4), (4,7)              2.5           5.5           (1/2)(2.5) + (1/2)(5.5) = 4.00      (1/3)(1/3) = 1/9
    (1,3), (4,6)     (1,4), (4,7)              2.5           5.5           (1/2)(2.5) + (1/2)(5.5) = 4.00      (1/3)(1/3) = 1/9
    (1,3), (5,6)     (1,4), (7,7)              2.5           7.0           (1/2)(2.5) + (1/2)(7.0) = 4.75      (1/3)(1/3) = 1/9
    (2,3), (4,5)     (2,4), (4,7)              3.0           5.5           (1/2)(3.0) + (1/2)(5.5) = 4.25      (1/3)(1/3) = 1/9
    (2,3), (4,6)     (2,4), (4,7)              3.0           5.5           (1/2)(3.0) + (1/2)(5.5) = 4.25      (1/3)(1/3) = 1/9
    (2,3), (5,6)     (2,4), (7,7)              3.0           7.0           (1/2)(3.0) + (1/2)(7.0) = 5.00      (1/3)(1/3) = 1/9
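The nine rows can be collapsed into a sampling distribution by grouping equal values of the stratified mean. The sketch below does this by brute-force enumeration, again only as a check on the hand calculation. The stratum weights of 1/2 come from N_1 = N_2 = 3 out of N = 6.

```r
y <- c(1, 2, 4, 4, 7, 7)   # y values for units 1 to 6

# Sample means for every SRS of size 2 within each stratum.
means1 <- combn(1:3, 2, function(u) mean(y[u]))
means2 <- combn(4:6, 2, function(u) mean(y[u]))

# Every pairing of a stratum 1 sample with a stratum 2 sample has probability 1/9.
grid <- expand.grid(m1 = means1, m2 = means2)
ybar_str <- 0.5 * grid$m1 + 0.5 * grid$m2

# Collapse equal values to get the sampling distribution.
dist <- table(ybar_str) / length(ybar_str)
print(dist)
```

Grouping shows that 4.25 is the most likely value, since three of the nine equally likely samples produce it.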

    (c) (5 points) Use your answer in part (b) to find the mean and variance of $\bar{y}_{str}$ (i.e. $E(\bar{y}_{str})$ and $Var(\bar{y}_{str})$).

    Note: Do not use R in this question. Show all your work clearly.

    Solution:

    $$E(\bar{y}_{str}) = \sum \bar{y}_{str}\, p = \frac{1}{9}\left(3.5 + 3.5 + 4.25 + 4.00 + 4.00 + 4.75 + 4.25 + 4.25 + 5.00\right) = 4.166666667$$

    $$E(\bar{y}_{str}^2) = \sum \bar{y}_{str}^2\, p = \frac{1}{9}\left(3.5^2 + 3.5^2 + 4.25^2 + 4.00^2 + 4.00^2 + 4.75^2 + 4.25^2 + 4.25^2 + 5.00^2\right) = 17.58333333$$

    $$Var(\bar{y}_{str}) = E(\bar{y}_{str}^2) - \left(E(\bar{y}_{str})\right)^2 = 17.58333333 - 4.166666667^2 = 0.22222$$
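These two numbers can be cross-checked against the usual stratified-sampling results: the estimator is unbiased for the population mean, and its variance is the sum over strata of W_h^2 (1 - n_h/N_h) S_h^2 / n_h. The sketch below is only a check and uses R, which the question itself does not allow.

```r
y1 <- c(1, 2, 4)   # stratum 1 values
y2 <- c(4, 7, 7)   # stratum 2 values

# All nine equally likely values of the stratified mean (W1 = W2 = 1/2).
ybar_str <- as.vector(outer(combn(y1, 2, mean), combn(y2, 2, mean),
                            function(a, b) 0.5 * a + 0.5 * b))

e_ybar   <- mean(ybar_str)                   # E(ybar_str) by enumeration
var_ybar <- mean(ybar_str^2) - e_ybar^2      # Var(ybar_str) by enumeration

# Formula-based values: population mean and the stratified variance formula.
pop_mean    <- mean(c(y1, y2))
var_formula <- sum(0.5^2 * (1 - 2 / 3) * c(var(y1), var(y2)) / 2)

print(c(e_ybar = e_ybar, pop_mean = pop_mean))
print(c(var_ybar = var_ybar, var_formula = var_formula))
```

Both routes give a mean of 25/6 and a variance of 2/9, matching the hand calculation.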

    4. (S. L. Lohr) Suppose that a city has 80,000 dwelling units, of which 30,000 are houses, 40,000 are apartments, and 10,000 are condominiums.

    (a) (5 points) You believe that the mean electricity usage is about twice as much for houses as for apartments or condominiums, and that the standard deviation is proportional to the mean, so that $S_1 = 2S_2 = 2S_3$. The cost of taking an observation is the same for all dwelling units. We want to take a stratified sample of n = 800 dwelling units in order to estimate the mean electricity consumption per unit in the population. How would you allocate a stratified sample of 800 observations if we wanted to minimize the variance of the estimate?


  • Page 6 of 6

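    Solution: Use $n_h = \frac{N_h S_h}{\sum_{l} N_l S_l}\, n$ for optimal allocation with equal costs. Writing $S_2 = S_3 = k$, $S_1 = 2k$, and the stratum sizes in units of 10,000, we get $n_1 = \frac{3 \cdot 2k}{3 \cdot 2k + 4k + k} \times 800 = 436$, $n_2 = \frac{4}{11} \times 800 = 291$ and $n_3 = \frac{1}{11} \times 800 = 73$.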
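The allocation is easy to reproduce in a few lines. In this sketch only the ratios of the standard deviations matter, so `S_h` is set to the relative values 2, 1, 1. The stratum labels are added for readability and are not from the original.

```r
N_h <- c(houses = 30000, apartments = 40000, condos = 10000)
S_h <- c(2, 1, 1)   # relative standard deviations: S1 = 2*S2 = 2*S3
n   <- 800

# Optimal (Neyman) allocation with equal costs: n_h proportional to N_h * S_h.
alloc <- n * N_h * S_h / sum(N_h * S_h)
print(round(alloc, 2))
print(round(alloc))
```

The rounded sizes add up to 800, so no adjustment is needed after rounding.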

    (b) (8 points) Now suppose that you take a stratified random sample with proportional allocation and want to estimate the overall proportion of households in which energy conservation is practiced. If 40% of house dwellers, 20% of apartment dwellers, and 5% of condominium residents practice energy conservation, what is the value of p for this population? What gain would the stratified sample with proportional allocation offer over an SRSWOR? That is, what is $V_{prop}(\hat{p}_{str}) / V(\hat{p}_{SRSWOR})$?

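    Solution:

    $$p = \frac{30000(0.40) + 40000(0.20) + 10000(0.05)}{80000} = \frac{20500}{80000} = 0.25625.$$

    $$V(\hat{p}_{str}) = \sum_{h=1}^{H} W_h^2 (1 - f_h) \frac{p_h(1 - p_h)}{n_h}, \quad W_h = \frac{N_h}{N}, \quad n_h = W_h n, \quad f_h = \frac{n_h}{N_h} = \frac{n}{N},$$

    with $n = 800$, $N_1 = 30000$, $N_2 = 40000$, $N_3 = 10000$, $p_1 = 0.40$, $p_2 = 0.20$, $p_3 = 0.05$.

    $$V(\hat{p}_{str}) = \sum_{h=1}^{H} W_h^2 (1 - f_h) \frac{p_h(1 - p_h)}{n_h} = \left(1 - \frac{n}{N}\right) \frac{1}{n} \sum_{h=1}^{H} W_h\, p_h(1 - p_h)$$

    $$= \left(1 - \frac{800}{80000}\right) \frac{1}{800} \left(\frac{3}{8}\, 0.4\,(1 - 0.4) + \frac{4}{8}\, 0.2\,(1 - 0.2) + \frac{1}{8}\, 0.05\,(1 - 0.05)\right) = 0.0002177226563.$$

    $$V(\hat{p}_{SRSWOR}) = \left(\frac{N - n}{N - 1}\right) \frac{p(1 - p)}{n} = \left(\frac{80000 - 800}{80000 - 1}\right) \frac{0.25625\,(1 - 0.25625)}{800} = 0.0002358530458.$$

    $$\frac{V_{prop}(\hat{p}_{str})}{V(\hat{p}_{SRSWOR})} = \frac{0.0002177226563}{0.0002358530458} \approx 0.92.$$

    The variability of the estimator with stratified sampling is about 92% of that with SRSWOR.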
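The arithmetic in this solution can be reproduced directly from the formulas used above. This is a sketch for checking the numbers, with variable names chosen here for illustration.

```r
N_h <- c(30000, 40000, 10000)   # houses, apartments, condominiums
p_h <- c(0.40, 0.20, 0.05)      # stratum proportions practicing conservation
N   <- sum(N_h)
n   <- 800
W_h <- N_h / N                  # stratum weights

# Overall population proportion.
p <- sum(W_h * p_h)

# Variance under proportional allocation, using the simplified form above.
v_str <- (1 - n / N) * sum(W_h * p_h * (1 - p_h)) / n

# Variance under SRSWOR.
v_srs <- (N - n) / (N - 1) * p * (1 - p) / n

ratio <- v_str / v_srs
print(c(p = p, v_str = v_str, v_srs = v_srs, ratio = ratio))
```

The gain is modest because most of the variability in a 0/1 variable lies within strata rather than between them.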