ANALYSIS OF BIOLOGICAL DATA BIOL4062/5062 Hal Whitehead.
Introduction
• Instructors
• Purpose of class
• Related classes
• Books
• Computer programs
http://myweb.dal.ca/~hwhitehe/BIOL4062/handout4062.htm
• Instructor: Hal Whitehead
– LSC3076 (Ph 3723; email [email protected])
– Best times: 8:00-9:00 a.m.
• Teaching Assistant: ?
• Other instructors
– Dr David Lusseau
Why “Analysis of Biological Data”?
• Biologists
– increasingly using quantitative techniques
– to analyze larger and larger data sets
– need skills in data analysis
  • especially in broad area of ecology
• BIOL4062/5062
– introduce techniques for analysis of biological data
– emphasis will be on the practical use and abuse of techniques, not derivations or mathematical formulae
– in assignments students explore real and realistic data sets
Related classes
• Design of Biological Experiments (BIOL4061/5061)
– most useful for those who work with systems that can be manipulated
• Courses in Statistics
– more emphasis on mathematical sides
Some books (on reserve)
• Legendre, L. and P. Legendre. Numerical Ecology (2nd edition). Elsevier (1998)
• Manly, B.F.J. Multivariate statistical methods: a primer (2nd edition). Chapman & Hall (1994)
• Other books:
– Many, do not need to be right up to date
Computer programs
• MINITAB
• SPSS
• SYSTAT
• SAS
• MATLAB (Statistics toolbox)
• S-plus
• R
Good, comprehensive packages, can do analyses for this class
More sophisticated and powerful, harder to use
Computer programs
• MINITAB * †
• SPSS * †
• SYSTAT †
• SAS * †
• MATLAB (Statistics toolbox)
• S-plus (freely available at Dal.?)
• R † (freely available on the web)
* on GS.DAL.CA
† in Biology-Earth Sciences computer lab
Support from ?
Support from Hal
Assignments
• Type 1
– artificial data sets for trying different techniques
• Type 2
– real data set to try a real analysis
Type 1 assignments
• Five assignments, sent by email (next few days)
• Each worth 10% of final mark
• Artificial but realistic data sets
– Different data sets to each student, but structurally similar
– More analyses expected for graduate students (BIOL5062)
• Analyze using a computer statistical package
Type 1 assignments
• Hand in a short write-up, explaining clearly:
– what you did
– what you found
– what you think the results might mean biologically
• Beware of:
– Rubbish!
  • Check the results against patterns in the original data to make sure they make sense.
– Over-interpreting the results
– Not answering the questions posed
Type 1 assignments
• Five assignments:
– Multiple regression 10%
– Log-linear models 10%
– Principal components analysis 10%
– Discriminant function analysis 10%
– Cluster analysis, multidimensional scaling, network analysis 10%
Type 2 assignment
• Find a biological data set, and then analyze it
• The analysis should not be:
– part of a past, present, or future Honours, MSc or PhD thesis, or used for another class: self-plagiarism
– the same as, or a repeat of, an analysis done by someone else: plagiarism
Type 2 assignment
• The analysis can
– use same data as in thesis or another course, but totally different analysis
– use data collected by your supervisor, or someone else, but you should ask them
– use a data set that you find on the web, or somewhere else, but you should check that it is OK
– be submitted for publication, but you must check that you have all necessary permissions
Type 2 assignment
• Minimum sizes of data set (ask Hal for exceptions or in case of uncertainty):
– For undergraduates (BIOL4062):
  • >50 units x >3 variables
– For graduates (BIOL5062)
  • >50 units x >5 variables
  • either, two types of variables
    – e.g. “Dependent; Independent”; “Species; Environment”
  • or, link two data sets with one at least as large as the undergraduate data set
• Must address at least 3 biological questions (BIOL4062), or 4 questions (BIOL5062)
Type 2 assignment (4 steps)
• a) Short meeting with Hal or *** to discuss your proposed data set and proposed analysis: feedback
– bring draft of 2b assignment
• b) Description of data set and proposed analysis.
– where it came from
– its structure(s) (number of variables, units, names of variables, types of variables, ...)
– proposed biological questions
– proposed analytical methods
– possible problems
– Example on web
Type 2 assignment (4 steps)
• c (i) Presentation of results to the class by graduate students
– biological questions being addressed
– brief description of the data set
– how you analyzed it
– conclusions
– Example in Class
• c (ii) Undergraduate students should go to graduate presentations and will be tested on general issues arising from them on last day
Type 2 assignment (4 steps)
• d) Write-up of your analysis as for a scientific journal paper
– Max 5 pages (4062) or 7 pages (5062) single-spaced
  • excluding references, tables, figures
– Explain biological question, methods in sufficient detail for someone to replicate them, problems, and biological conclusions
– Show graphically, or in tables, the major effects
  • Do not just present summaries of ordinations or significance levels of hypothesis tests
– Introduction and Discussion can be shorter and less detailed than in published paper
  • sufficient to give a good feel for biological issue being examined and the potential biological significance of the results
Example on web
Type 2 assignment
• Marks
• 2b Description of data set and proposed analysis 5%
• 2c 15%
– (i) Presentation of results by graduate students (BIOL5062)
– (ii) Test on general principles from graduate student presentations (BIOL4062)
• 2d Write-up of results 30%
SYSTAT demo. at end of lectures
Date | Topic | Who | Examples | Type 1 Assignments
6-Sep Thurs | Introduction to data analysis and the course | HW | |
11-Sep Tues | Modes of statistical analysis | HW | TREE |
13-Sep Thurs | Plotting and tabulating data and results | HW | SYSTAT |
18-Sep Tues | Introduction to S-plus and R (optional) | | S-Plus |
20-Sep Thurs | Correlation | HW | SYSTAT |
25-Sep Tues | Linear regression | HW | SYSTAT |
27-Sep Thurs | Multiple linear regression, path analysis | HW | SYSTAT | 1a give
2-Oct Tues | General linear models | HW | SYSTAT |
4-Oct Thurs | Introduction to likelihood | HW | SYSTAT |
9-Oct Tues | Logistic regression | HW | SYSTAT | 1a due
11-Oct Thurs | Categorical data and log-linear models | HW | SYSTAT | 1b give
16-Oct Tues | Introduction to multivariate analysis and multivariate distances | HW | SYSTAT |
18-Oct Thurs | Principal Components Analysis | HW | SYSTAT | 1c give
23-Oct Tues | Network analysis-1 | DL | | 1e give
25-Oct Thurs | Network analysis-2 | DL | | 1b due
30-Oct Tues | Discriminant Function Analysis and Canonical Variate Analysis | HW | SYSTAT | 1d give
1-Nov Thurs | Canonical Correlation Analysis, Redundancy Analysis and Canonical Correspondence Analysis | HW | SYSTAT | 1c due
6-Nov Tues | Principal Coordinate Analysis, Correspondence Analysis and Multidimensional Scaling | HW | SYSTAT | 1e give
8-Nov Thurs | Cluster analyses | HW | SYSTAT | 1e give; 1d due
13-Nov Tues | Bootstraps and Jackknives | HW | SYSTAT |
15-Nov Thurs | Permutation tests, Mantel tests and matrix correlations | HW | SYSTAT | 1e due
20-Nov Tues | Graduate presentations | HW | |
22-Nov Thurs | Graduate presentations | HW | |
27-Nov Tues | Graduate presentations | HW | |
29-Nov Thurs | Test for undergraduates (BIOL4062) on grad. student projects | HW | |
Analysis of Biological Data
• Types of biological data
• History (very abbreviated!)
• The process of biological data analysis
– why garbage may come out
• Hypothesis testing and data analysis
– assumptions
– other issues
Types of biological data
• Morphometric
• Community ecology
– organism distribution and environmental variation
• Genetic data for ecological and evolutionary questions
• Population data for management, conservation, evolutionary questions
• Behavioural, physiological, ...
Development of biological data analysis
• >~1850 Displays
• >~1900 ANOVAs, regression, correlation
– without computers
• >~1930 Non-parametric methods
• >~1970 Multiple regression and multivariate analysis
– matrix algebra using computers
• >~1980 Robust methods: bootstraps, jackknives, permutations
– need powerful computers
[Flow diagram. Elements: Real Biological System; Sampling process; Stochastic error; Measurement error; Data; Model + Assumptions; Data Analysis; Inferences about Biological System]
Garbage in => Garbage out
• Good data + Errors => Garbage in => Garbage out
– Check data entry
• Good data + Errors in routine => Garbage out
– Check results, run routines on data with known answer,
– run on 2 routines
• Good data + Wrong model => Garbage out
– Think about, read about and discuss model
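A minimal Python sketch of "run routines on data with known answer" and "run on 2 routines". It is an illustration only, not part of the course materials: the function names fit_line and fit_line_sums and the data are hypothetical, and simple linear regression is used just because the answer is easy to know in advance.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x (deviations from means)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_line_sums(xs, ys):
    """The same fit written a second way (raw sums), as an independent routine."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    return (sum_y - slope * sum_x) / n, slope


# Data with a known answer: generated exactly from y = 2 + 3x, with no noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 + 3.0 * x for x in xs]

intercept_a, slope_a = fit_line(xs, ys)
intercept_b, slope_b = fit_line_sums(xs, ys)

# Both routines should recover the generating values and agree with each other.
print("routine 1:", intercept_a, slope_a)
print("routine 2:", intercept_b, slope_b)
```

If either routine fails to return the intercept and slope used to build the data, or the two disagree, the problem is in the routine rather than in the data.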
Hypothesis Testing
• Hypothesis => Experimental Design => Experiment => Analysis => Conclusion
• [ANOVA, T-test]
• Agriculture, Experimental ecology, Physiology, Animal behaviour
Data Analysis
• Data Collection => Data Analysis => Hypothesis
• [scatter plots, box plots, most multivariate analyses]
• Fisheries, Community ecology, Paleontology
Some assumptions
• Normality
– can only be properly examined on large data sets
– mainly a problem on small ones
– an important issue for hypothesis testing
– normality desirable in data analysis
• Linearity
– makes hypothesis testing easier
– makes data analysis easier
• Independence
– major problem for hypothesis testing
– no problem, or advantage, in data analysis
Transform data or use non-linear or non-parametric methods
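A short Python sketch of the two remedies named above when a relationship is not linear: transform the data, or use a non-parametric method. Everything here is an assumption made for illustration: the data are artificial (an exact exponential curve), the log transform and the rank (Spearman) correlation are only one example of each remedy, and the helper names pearson_r, ranks and spearman_r are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks 1..n of the values (this simple version assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman_r(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


# Artificial non-linear data: y grows exponentially with x.
xs = [float(x) for x in range(1, 11)]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

r_raw = pearson_r(xs, ys)                          # linear measure on curved data
r_log = pearson_r(xs, [math.log(y) for y in ys])   # after a log transform
r_rank = spearman_r(xs, ys)                        # non-parametric alternative

print(round(r_raw, 3), round(r_log, 3), round(r_rank, 3))
```

On the raw scale the linear correlation understates a relationship that is in fact perfect; the log transform makes it linear, and the rank method does not need linearity at all.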
Other issues in data analysis
• Missing data
– Often present in ecological data
• Outliers
– What do we do with apparent outliers?
– Remove them?
• Multiple comparisons
– Major issue with hypothesis testing
– Not an issue with data analysis
  • although: Patterns appear in random data
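"Patterns appear in random data" can be seen in a small simulation. The Python sketch below is an illustration under assumed settings, not something from the course: 20 unrelated random variables measured on 10 units, all pairwise correlations computed, and the strongest ones counted. The variable names, sample sizes and seed are hypothetical choices.

```python
import itertools
import math
import random


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)   # fixed seed so the run is repeatable
n_units, n_vars = 10, 20

# Pure noise: every variable is independent of every other one.
data = [[rng.gauss(0.0, 1.0) for _ in range(n_units)] for _ in range(n_vars)]

# All pairwise correlations: 20 * 19 / 2 = 190 comparisons.
rs = [pearson_r(a, b) for a, b in itertools.combinations(data, 2)]

largest = max(abs(r) for r in rs)
# 0.632 is approximately the two-tailed 5% critical value of r for n = 10.
n_large = sum(abs(r) > 0.632 for r in rs)

print("comparisons:", len(rs))
print("largest |r|:", round(largest, 2))
print("pairs with |r| > 0.632:", n_large)
```

With this many comparisons, some pairs of unrelated variables will usually look strongly correlated, which is why apparent patterns from exploratory analysis still need checking.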
Next class:
• Inference in ecology and evolution:
– Null hypothesis statistical tests
– Effect size statistics
– Bayesian statistics
– Information theoretic model comparisons