Lecture1: Introduction Bohyung Han CSE, POSTECH [email protected] CSED233: Data Structures (2014F)
Lecture1 Data
Syllabus
• Purpose of this course:
– To introduce counting techniques
– To introduce the concept of probability
– To introduce the basic elements of probability
– To make students aware of the use of probability in statistics
• Contents:
– Counting Techniques, Concept of Probability, Probability Function, Probability Density Function, Bernoulli, Binomial and Poisson Distributions, Exponential, Gamma and Normal Density Functions, Random Variables of Multiple Dimensions, The Concept of Estimator and Properties of Estimators, Maximum Likelihood Function, Test of Hypothesis, Chi-Square Test, t-test, F-test.
• Textbooks:
– Şaşmaz, D. Ali, İstatistik ve Olasılık, İTÜ Kimya-Metalürji Fakültesi, Kimya Mühendisliği Bölümü, 2011
– Freund, J. E., Modern Elementary Statistics, 7th Edition, Prentice Hall, 1988
• Technology Resources:
– Matlab
– MS Excel
• Grading:
– Midterm I 25%
– Midterm II 25%
– Homeworks 10%
– Final 40%
• Attendance: minimum 70%
– The closer your attendance is to 100%, the better your chance of raising your letter grade; attendance below 70% lowers your letter grade.
Statistics is the science of
o Collecting, organizing, analyzing and interpreting data to draw valid conclusions
o Predicting and forecasting using data and statistical models
Widely used in natural sciences, economics, business, psychology, medicine.
• Weather forecast,
• Polls for election,
• Accident statistics,
• Health statistics.
• Stock markets.
Definitions
• Data: collections of observations (measurements, genders, survey responses)
• Population (or Universe): complete collection of all individuals (scores, people, measurements etc) to be studied.
• Sample: a subcollection of members selected from a population.
– Ex: A survey on elections has 5 million respondents --> sample
– Ex: People eligible to vote in Turkey in 2011: 52 million --> population
• Data is useful for:
– Investigating a (scientific) subject
– Observing the efficiency of a production facility
– Meeting existing standards
– Trying different approaches in the phase of decision
– Personal curiosity
Two groups of statistical methods according to their purpose:
– Descriptive statistics are used to describe the basic features of the data in a study: they simply describe what the data shows, using charts and tables.
– Inferential statistics are used to reach conclusions and make generalizations that extend beyond the immediate data alone.
A statistical study involves:
– A context (description of the study, ex: weight of 100 male students interested in sports in high school A)
– Collecting data from the source
– Analyzing data with a sampling method
– Deducing a result
Methods for collecting data
• Survey
– Questions to randomly selected people (sample) from a population.
– Questions, people, locations should be carefully determined
– Wrong data leads to wrong statistics
• Observation
– Collecting data without interfering
• Experiments
– Investigating the relationship between the input-output.
– The observer should not interfere
• Projection
– Mostly used in psychology: it usually involves the personal thoughts, behaviors and emotions of patients.
Although there are different ways of collecting data, all majors analyze the data using the same statistical methods. Statistics is not customized for different majors; it consists of general methods.
Example 1:
• Education: In a random sample of 200 high school seniors in a large city, 137 said that they will go on to college. At the 0.05 level of significance, does this refute the claim that 60% of all the high school seniors in this city will go on to college?
• Engineering: In a random sample of 200 transistors made by a given manufacturer, 137 passed an accelerated performance test. At the 0.05 level of significance, does this refute the claim that 60% of all transistors made by the manufacturer will pass the test?
• Food: In a random sample of 200 citrus trees exposed to a -7 °C frost, 137 showed damage to their fruit. At the 0.05 level of significance, does this refute the claim that 60% of all citrus trees exposed to a -7 °C frost will show some damage to their fruit?
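All three questions reduce to the same calculation, which can be sketched in Python. This is a minimal sketch, assuming a two-sided one-sample z-test for a proportion with the normal approximation; the slides do not say which test or which alternative hypothesis is intended, and the function names are illustrative.

```python
import math

def normal_cdf(x):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def one_sample_proportion_z_test(successes, n, p0):
    """Return (z, two-sided p-value) for H0: p = p0 (normal approximation)."""
    p_hat = successes / n
    # Standard error under the null hypothesis p = p0.
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / standard_error
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
    return z, p_value

# Numbers from the three examples: 137 out of 200, claimed proportion 60%.
z, p_value = one_sample_proportion_z_test(137, 200, 0.60)
reject = p_value < 0.05  # assumed two-sided test at the 0.05 level
print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject H0: {reject}")
```

Under these assumptions z is about 2.45, which exceeds the two-sided critical value of 1.96, so the 60% claim would be rejected at the 0.05 level in all three settings.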
Should you believe a statistical study?
Statistical Reasoning for Everyday Life, by Jeff Bennett and Mario Triola, gives eight guidelines for critically evaluating a statistical study:
(1) Identify the goal of the study, the population considered and the type of study
(2) Consider the source, with regard to a possibility of bias
(3) Analyze the sampling method
(4) Look for problems in defining or measuring the variables
(5) Watch out for confounding variables that could invalidate conclusions
(6) Consider the setting and wording of any survey
(7) Check that graphs represent data fairly
(8) Consider whether the conclusions achieve the goals of the study, and whether they have practical significance.
Example 2: What is wrong with this survey?
In the USA, Newsweek magazine ran a survey about the Napster Web site (a former pirate site for free mp3 downloads). Readers were asked: “Will you still use Napster if you have to pay a fee?” Readers could register their responses on the magazine’s Web site.
Among the 1873 responses received, 19% said yes, it is still cheaper than buying CDs. Another 5% said yes, they felt more comfortable using it with a charge.
Voluntary response sample is one in which the respondents themselves decide whether to be included. This sample is biased and may not represent the reality, because people with strong opinions are more likely to participate.
Nature of statistical data
• Continuous data can be measured. It is related to physical measurement and can have units such as kg, m etc.
– Ex: The amount of milk a cow produces can assume any value over a continuous span (any value between 0 and 7000 L/year).
• Discrete data can only be counted. Finite number of values are possible.
– Ex: ‘head’ and ‘tail’ of a coin; number of eggs that hens lay.
Ordinary arithmetic (e.g. addition, subtraction) can be applied to both continuous and discrete data.
• Nominal data (in Latin, nomen means ‘name’) consists of categories identified by name only.
– Ex: “What is your marital status?”
a) Married b) Single c) Divorced d) Widowed
– Nominal items can be made numerical for statistical analysis.
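Making nominal items numerical can be sketched as follows. This is a minimal illustration: the list of responses is made up, and the integer codes are an arbitrary assumption that serves only as labels.

```python
from collections import Counter

# Hypothetical survey responses (invented for illustration only).
responses = ["Married", "Single", "Married", "Divorced", "Widowed", "Single", "Married"]

# Arbitrary integer codes: they are labels, so they carry no order or magnitude.
codes = {"Married": 1, "Single": 2, "Divorced": 3, "Widowed": 4}
coded = [codes[r] for r in responses]

# Counting is meaningful for nominal data; averaging the codes would not be.
counts = Counter(responses)
mode_category, mode_count = counts.most_common(1)[0]

print(coded)
print(dict(counts))
print(f"Most frequent category: {mode_category} ({mode_count} responses)")
```

Because the codes are only labels, frequencies and the most frequent category are sensible summaries, whereas the mean of the coded values has no interpretation.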
Interpretation of data
• Example 1: Analyze the overall performance of senior students Ayşe, Ahmet, Mehmet and Gül by ranking their scores.
Student   History   Mathematics   English
Ayşe      89        51            40
Ahmet     61        56            54
Mehmet    40        70            55
Gül       13        77            72
Case 1: Ranking based on total sum of grades
• This approach does not reflect the success of the students
Student   History   Mathematics   English   Total   Rank
Ayşe      89        51            40        180     1
Ahmet     61        56            54        171     2
Mehmet    40        70            55        165     3
Gül       13        77            72        162     4
Case 2: Ranking based on each subject
Student   History   Mathematics   English   Total   Rank
Ayşe      1         4             4         9       4
Ahmet     2         3             3         8       3
Mehmet    3         2             2         7       2
Gül       4         1             1         6       1
Case 3: Ranking based on weights of subjects
• Same ranking as in Case 2.
Student   History   Mathematics   English   Total   Rank
Weight    3         6             5
Ayşe      1*3=3     4*6=24        4*5=20    47      4
Ahmet     2*3=6     3*6=18        3*5=15    39      3
Mehmet    3*3=9     2*6=12        2*5=10    31      2
Gül       4*3=12    1*6=6         1*5=5     23      1
Case 4: Ranking based on weights of subjects (2)
• Same ranking as in Cases 2 and 3.
Student   History    Mathematics   English    Total   Rank
Weight    3          6             5
Ayşe      89*3=267   51*6=306      40*5=200   773     4
Ahmet     61*3=183   56*6=336      54*5=270   789     3
Mehmet    40*3=120   70*6=420      55*5=275   815     2
Gül       13*3=39    77*6=462      72*5=360   861     1
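The four ranking schemes can be reproduced with a short Python sketch. It recomputes every total from the raw scores and the weights 3, 6 and 5 given above; the helper names are illustrative, and ties are not handled because none occur in this data.

```python
scores = {
    "Ayşe":   {"History": 89, "Mathematics": 51, "English": 40},
    "Ahmet":  {"History": 61, "Mathematics": 56, "English": 54},
    "Mehmet": {"History": 40, "Mathematics": 70, "English": 55},
    "Gül":    {"History": 13, "Mathematics": 77, "English": 72},
}
weights = {"History": 3, "Mathematics": 6, "English": 5}
subjects = list(weights)

def rank_students(totals, higher_is_better):
    """Map each student to a rank (1 = best) based on their total."""
    ordered = sorted(totals, key=totals.get, reverse=higher_is_better)
    return {name: position for position, name in enumerate(ordered, start=1)}

def subject_ranks(subject):
    """Rank students within one subject; rank 1 is the highest score."""
    subject_scores = {name: scores[name][subject] for name in scores}
    return rank_students(subject_scores, higher_is_better=True)

per_subject = {s: subject_ranks(s) for s in subjects}

# Case 1: sum of raw scores (higher total is better).
case1 = {n: sum(scores[n].values()) for n in scores}
# Case 2: sum of per-subject ranks (lower total is better).
case2 = {n: sum(per_subject[s][n] for s in subjects) for n in scores}
# Case 3: weighted sum of per-subject ranks (lower total is better).
case3 = {n: sum(per_subject[s][n] * weights[s] for s in subjects) for n in scores}
# Case 4: weighted sum of raw scores (higher total is better).
case4 = {n: sum(scores[n][s] * weights[s] for s in subjects) for n in scores}

cases = [("Case 1", case1, True), ("Case 2", case2, False),
         ("Case 3", case3, False), ("Case 4", case4, True)]
for label, totals, higher in cases:
    print(label, totals, rank_students(totals, higher))
```

Only Case 1 places Ayşe first; the other three schemes all place Gül first, which shows how strongly the conclusion depends on the chosen scoring rule.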
• Example 2: “Drought in Turkey!” A newspaper claims that precipitation in Turkey has significantly diminished in the last 6 months.
Center          Average precipitation in 6 months (mm)   Average over years (mm)   Average for last year (mm)
Balıkesir       476                                      419                       363
Bursa           461                                      471                       574
Tekirdağ        241                                      389                       434
Edirne          399                                      346                       268
Kırklareli      298                                      349                       270
Trabzon         385                                      469                       574
Giresun         928                                      727                       843
Samsun          566                                      411                       551
Rize            1539                                     1315                      1497
Konya           111                                      201                       197
Karaman         136                                      223                       233
Kırşehir        218                                      239                       145
Niğde           205                                      194                       176
Afyonkarahisar  180                                      234                       239
Aydın           459                                      505                       609
Denizli         282                                      406                       435
İzmir           682                                      579                       477
Manisa          537                                      603                       507
Adana           361                                      515                       582
Gaziantep       492                                      451                       537
Şanlıurfa       356                                      392                       357
Iğdır           123                                      104                       152
Malatya         207                                      260                       278
Muş             508                                      536                       720
• On closer inspection, we cannot claim that precipitation has significantly diminished. When the figures from the three data sets are analyzed with the ANOVA technique (ANalysis Of VAriance), there is no indication of a decrease in precipitation. Moreover, to draw conclusions about all regions of Turkey, the data should be expanded to include all cities.
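The ANOVA comparison can be sketched in plain Python. This is a minimal sketch, assuming a one-way ANOVA that treats the three columns as independent groups; the slides do not say which ANOVA design was actually used, and the function name is illustrative.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Rows: (last 6 months, average over years, last year) in mm, from the table.
rows = [
    (476, 419, 363), (461, 471, 574), (241, 389, 434), (399, 346, 268),
    (298, 349, 270), (385, 469, 574), (928, 727, 843), (566, 411, 551),
    (1539, 1315, 1497), (111, 201, 197), (136, 223, 233), (218, 239, 145),
    (205, 194, 176), (180, 234, 239), (459, 505, 609), (282, 406, 435),
    (682, 579, 477), (537, 603, 507), (361, 515, 582), (492, 451, 537),
    (356, 392, 357), (123, 104, 152), (207, 260, 278), (508, 536, 720),
]
six_months, over_years, last_year = (list(column) for column in zip(*rows))

f_stat, df_between, df_within = one_way_anova([six_months, over_years, last_year])
print(f"F({df_between}, {df_within}) = {f_stat:.3f}")
```

The printed F value is then compared with the critical value for (2, 69) degrees of freedom from an F table at the chosen significance level; an F below that value gives no evidence that the three column means differ. Since the same cities appear in all three columns, a design that accounts for the city (for example, a repeated-measures analysis) may be more appropriate than this independent-groups sketch.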