Golden Rules of Bioinformatics

An Introduction to Bioinformatics Tools Part 1: Golden Rules of Bioinformatics Leighton Pritchard and Peter Cock


Golden Rules of Bioinformatics. Presented as part of a full-day introductory bioinformatics course - the example data and source for the slides can be found at https://github.com/widdowquinn/Teaching-Intro-to-Bioinf

Transcript of Golden Rules of Bioinformatics

Page 1: Golden Rules of Bioinformatics

An Introduction to Bioinformatics Tools, Part 1: Golden Rules of Bioinformatics

Leighton Pritchard and Peter Cock

Page 2: Golden Rules of Bioinformatics

On Confidence

“Ignorance more frequently begets confidence than does knowledge: it is those who know little, not those who know much, who so positively assert...” - Charles Darwin

Page 3: Golden Rules of Bioinformatics

Table of Contents

Rule 0

Rule 1

Rule 2

Rule 3

Conclusions

Page 4: Golden Rules of Bioinformatics

Zeroth Golden Rule of Bioinformatics

• No-one knows everything about everything - talk to people!

  • local bioinformaticians, mailing lists, forums, Twitter, etc.

• Keep learning - there are lots of resources

• There is no free lunch - no method works best on all data

• The worst errors are silent - share worries, problems, etc.

• Share expertise (see first item)

Page 6: Golden Rules of Bioinformatics

Subgroups

• You are in group A, B, C or D - this decides your dataset: expnA.tab, expnB.tab, expnC.tab, expnD.tab

• You will use R at the command-line to analyse your data

Page 7: Golden Rules of Bioinformatics

The biological question

• Your dataset expn?.tab describes (log) expression data for two genes: gene1 and gene2

• Expression measured at eleven time points (including control)

• Q: Are gene1 and gene2 coregulated?

• How do we answer this question?

Page 8: Golden Rules of Bioinformatics

Reformulating the biological question

• Q: Are gene1 and gene2 coregulated?

• A: We cannot determine this from expression data alone

• Reformulate the question:

• NewQ: Is there evidence that gene1 and gene2 expression profiles are correlated? (is expression gene1 ∝ gene2)

• How do we answer this new question?

Page 10: Golden Rules of Bioinformatics

Starting the analysis

• Change directory to where the Exercise 1 data is located, and start R.

$ cd ../../data/ex1_expression/
$ R

Page 11: Golden Rules of Bioinformatics

Load and inspect data in R

> data = read.table("expnA.tab", sep="\t", header=TRUE)
> head(data)
  gene1 gene2
1    10  8.04
2     8  6.95
3    13  7.58
4     9  8.81
5    11  8.33
6    14  9.96

Page 12: Golden Rules of Bioinformatics

Load and inspect data in R

> mean(data$gene1)
[1] 9
> mean(data$gene2)
[1] 7.500909
> sd(data$gene1)
[1] 3.316625
> sd(data$gene2)
[1] 2.031568
> cor(data)
          gene1     gene2
gene1 1.0000000 0.8164205
gene2 0.8164205 1.0000000

Page 13: Golden Rules of Bioinformatics

Results

measure       expnA   expnB   expnC   expnD
mean(gene1)   9
mean(gene2)   7.5
sd(gene1)     3.3
sd(gene2)     2.0
cor(data)     0.816

Page 14: Golden Rules of Bioinformatics

Results

measure       expnA   expnB   expnC   expnD
mean(gene1)   9       9       9       9
mean(gene2)   7.5     7.5     7.5     7.5
sd(gene1)     3.3     3.3     3.3     3.3
sd(gene2)     2.0     2.0     2.0     2.0
cor(data)     0.816   0.816   0.816   0.816

• r = 0.816 (P < 0.005) in every experiment

• Can we conclude that gene1 and gene2 are coexpressed in each experiment?
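The quoted significance level can be checked by hand. The sketch below converts r = 0.816 with n = 11 time points into a t statistic and compares it with a tabulated critical value. It assumes a two-sided test of zero correlation (the slide does not say which test was used), and the names `t_statistic` and `T_CRIT` are illustrative only.

```python
import math


def t_statistic(r: float, n: int) -> float:
    """t statistic for testing whether a Pearson correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


r, n = 0.816, 11          # values reported on the slide
t = t_statistic(r, n)     # n - 2 = 9 degrees of freedom

# Tabulated two-sided critical value of Student's t for alpha = 0.005, df = 9
# (assumption: the slide's P value is two-sided).
T_CRIT = 3.690

print(f"t = {t:.2f} on {n - 2} degrees of freedom")
print("P < 0.005" if abs(t) > T_CRIT else "P >= 0.005")
```

A small P value only says that a linear association this strong is unlikely under the null hypothesis; it says nothing about whether a straight line is a sensible description of the data.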

Page 16: Golden Rules of Bioinformatics

Plot the data in R

> plot(data)

Page 17: Golden Rules of Bioinformatics

Always plot the data

Which gene pairs are coexpressed?

Page 18: Golden Rules of Bioinformatics

Always plot the data

Is the matrix of (Pearson) correlation values potentially misleading?

> data = anscombe
> cor(data)[1:4, 5:8]
           y1         y2         y3         y4
x1  0.8164205  0.8162365  0.8162867 -0.3140467
x2  0.8164205  0.8162365  0.8162867 -0.3140467
x3  0.8164205  0.8162365  0.8162867 -0.3140467
x4 -0.5290927 -0.7184365 -0.3446610  0.8165214
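R's built-in `anscombe` data frame is Anscombe's quartet: four x/y pairs constructed to share almost identical summary statistics. The following Python sketch recomputes the matched correlations (x1/y1 to x4/y4) without R. The values are typed in from the standard quartet, and the set names and helper function are illustrative rather than part of the course material.

```python
import math

# Anscombe's quartet, as stored in R's `anscombe` data frame.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "set1": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "set2": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "set3": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "set4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
             [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


correlations = {name: pearson(x, y) for name, (x, y) in QUARTET.items()}
for name, (x, y) in QUARTET.items():
    # Same r everywhere, yet set4 has only two distinct x values.
    print(f"{name}: r = {correlations[name]:.4f}, distinct x values = {len(set(x))}")
```

The correlation coefficient alone cannot distinguish a linear trend from a curve, a single outlier, or one high-leverage point, which is why the slide asks for a plot.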

Page 19: Golden Rules of Bioinformatics

Sometimes real correlation doesn’t mean anything

Page 20: Golden Rules of Bioinformatics

First Golden Rule of Bioinformatics

• Always inspect the raw data (trends, outliers, clustering)

• What is the question? Can the data answer it?

• Communicate with data collectors! (don’t be afraid of pedantry)

  • Who? When? How?
  • You need to understand the experiment to analyse it (easier if you helped design it).
  • Be wary of block effects (experimenter, time, batch, etc.)

Page 22: Golden Rules of Bioinformatics

Exercise 2

• You are in group A, B, C or D - this decides your database: dbA, dbB, dbC, dbD

• You will use BLAST at the command-line to analyse your data

• You will use script at the command-line to record your work

Page 23: Golden Rules of Bioinformatics

Exercise 2

• Start recording your actions by entering script at the command line

$ script
Script started, output file is typescript

Page 24: Golden Rules of Bioinformatics

Exercise 2

• Change directory to the ex2_blast directory

• Run BLAST with the appropriate database

• Exit script

$ cd ../ex2_blast
$ blastp -num_alignments 1 -num_descriptions 1 -query query.fasta -db dbA
$ exit
exit
Script done, output file is typescript

Page 25: Golden Rules of Bioinformatics

Exercise 2

• You can view the typescript file with cat

$ cat typescript
Script started on Fri May 9 10:45:12 2014
lpritc@lpmacpro:$ cd ../ex2_blast
[...]

Page 26: Golden Rules of Bioinformatics

Exercise 2

Query= query protein sequence

Length=400
                                                                      Score
Sequences producing significant alignments:                          (Bits)

PITG_08491T0 Phytophthora infestans T30-4 choline transporter-l...    34.3

> PITG_08491T0 Phytophthora infestans T30-4 choline transporter-like
protein (441 aa)
Length=486

 Score = 34.3 bits (77), Method: Compositional matrix adjust.
 Identities = 22/69 (32%), Positives = 38/69 (55%), Gaps = 4/69 (6%)

Query  106  EVILPMMYQFALKPSFADVINDYKPYSKHTAGVSDQELKGEATTWMLADKNSRMKAFLSQ  165
            E+++PM+Y   L   F   ++ Y P   HTA ++  EL+G   T ++A+  S +  F ++
Sbjct  40   ELMVPMLYSLYLVVLFHLPVSAYYP---HTASMTAHELQGAVITILVAETPSIIIQF-AK  95

Query  166  IKTKSNSSE  174
              T SN S+
Sbjct  96   CHTSSNISQ  104

Page 27: Golden Rules of Bioinformatics

Exercise 2

• What is a reasonable E-value threshold to call a ‘match’?

  • 1e-05, 0.001, 0.1, 10?

          dbA    dbB     dbC     dbD
E-value

Page 28: Golden Rules of Bioinformatics

Exercise 2

• What is a reasonable E-value threshold to call a ‘match’?

  • 1e-05, 0.001, 0.1, 10?

          dbA    dbB     dbC     dbD
E-value   0.45   0.002   4e-06   0.019

• Five orders of magnitude difference in E-value, depending on database choice - Why?

Page 29: Golden Rules of Bioinformatics

Exercise 2

• E-values depend on database size

• Bit score and alignment do not depend on database size

            dbA          dbB       dbC     dbD
E-value     0.45         0.002     4e-06   0.019
Bit score   34.3         34.3      34.3    34.3
Sequences   100,001      501       1       5,001
Letters     48,650,486   210,866   486     2,066,510
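The dependence on database size follows from the Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total number of letters in the database. The sketch below applies that relation to the figures in the table. It deliberately uses raw lengths, whereas BLAST applies length corrections, so the absolute values come out somewhat larger than the reported ones; only the scaling between databases should be read from it. The function and variable names are illustrative.

```python
import math


def expected_hits(bit_score: float, query_len: int, db_letters: int) -> float:
    """Karlin-Altschul expectation E = m * n * 2**(-S') using raw (uncorrected) lengths."""
    return query_len * db_letters * 2.0 ** (-bit_score)


QUERY_LEN = 400    # "Length=400" in the BLAST report
BIT_SCORE = 34.3   # identical for every database
DB_LETTERS = {"dbA": 48_650_486, "dbB": 210_866, "dbC": 486, "dbD": 2_066_510}

evalues = {db: expected_hits(BIT_SCORE, QUERY_LEN, n) for db, n in DB_LETTERS.items()}
for db, e in evalues.items():
    print(f"{db}: {DB_LETTERS[db]:>12,} letters  ->  E ~ {e:.1e}")

# The ratio of two E-values equals the ratio of the database sizes.
print(f"dbA / dbC = {evalues['dbA'] / evalues['dbC']:.0f}")
```

Because the alignment and bit score are unchanged, the same hit can look significant or insignificant purely as a function of what else is in the database.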

Page 30: Golden Rules of Bioinformatics

Exercise 2

• E-values differ, but the query matches a choline transporter-like protein quite well...

• Doesn’t it?

• After all, a biological match is a biological match...

• Isn’t it?

Page 32: Golden Rules of Bioinformatics

Exercise 2

Query= query protein sequence

Length=400
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

PITG_08491T0 Phytophthora infestans T30-4 choline transporter-l...    34.3    4e-06

> PITG_08491T0 Phytophthora infestans T30-4 choline transporter-like
protein (441 aa)
Length=486

 Score = 34.3 bits (77), Expect = 4e-06, Method: Compositional matrix adjust.
 Identities = 22/69 (32%), Positives = 38/69 (55%), Gaps = 4/69 (6%)

Query  106  EVILPMMYQFALKPSFADVINDYKPYSKHTAGVSDQELKGEATTWMLADKNSRMKAFLSQ  165
            E+++PM+Y   L   F   ++ Y P   HTA ++  EL+G   T ++A+  S +  F ++
Sbjct  40   ELMVPMLYSLYLVVLFHLPVSAYYP---HTASMTAHELQGAVITILVAETPSIIIQF-AK  95

Query  166  IKTKSNSSE  174
              T SN S+
Sbjct  96   CHTSSNISQ  104

Page 33: Golden Rules of Bioinformatics

Exercise 2

• Sequence accessions (PITG_?????T0) are correct in the databases

• Sequence functional descriptions are randomly shuffled: lengths do not match in BLAST output

• dbA contains only three different sequences: two are repeated 50,000 times

• query.fasta is random sequence, not a real protein

  • Shuffled from all P. infestans proteins
  • No nr or Pfam matches

Page 37: Golden Rules of Bioinformatics

Second Golden Rule of Bioinformatics

• Do not trust the software: it is not an authority

  • Software does not distinguish meaningful from meaningless data
  • Software has bugs
  • Algorithms have assumptions, conditions, and applicable domains
  • Some problems are inherently hard, or even insoluble

• You must understand the analysis/algorithm

• Always sanity test

• Test output for robustness to parameter (including data) choice
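One way to make the robustness check concrete is to tabulate the conclusion under every combination of parameter and data choice. This sketch reuses the E-values and candidate thresholds from Exercise 2 and reports, for each threshold, whether the 'match' call survives a change of database. The function name `calls` is illustrative.

```python
# E-values reported for the same query and the same alignment in Exercise 2.
E_VALUES = {"dbA": 0.45, "dbB": 0.002, "dbC": 4e-06, "dbD": 0.019}
# Candidate thresholds offered on the slide.
THRESHOLDS = [1e-05, 0.001, 0.1, 10]


def calls(e_values, threshold):
    """Map each database to True if the hit would be called a match at this threshold."""
    return {db: e <= threshold for db, e in e_values.items()}


for threshold in THRESHOLDS:
    row = calls(E_VALUES, threshold)
    stable = len(set(row.values())) == 1
    matched = ", ".join(db for db, ok in row.items() if ok) or "none"
    verdict = "stable" if stable else "depends on database"
    print(f"threshold {threshold:<7g} match in: {matched:<22} {verdict}")
```

A conclusion that flips when an arbitrary choice changes deserves a closer look. Agreement across choices is not proof either: the only threshold that gives a consistent call here is the most permissive one, and the query is random sequence.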

Page 39: Golden Rules of Bioinformatics

Exercise 3

• Rule: If there is a vowel on one side of the card, there must be an even number on the other side.

• Which cards must be turned over to determine if this rule (if a card shows a vowel on one face, the opposite face is even) holds true?

Page 40: Golden Rules of Bioinformatics

Exercise 3

This is the Wason Selection Task

• If you chose E and 4

  • You are in the typical majority group
  • You are not correct
  • You have been a victim of confirmation bias (System 1 thinking)

• If you chose E and 7

  • Congratulations!
  • Your choice was capable of falsifying the rule.

Page 44: Golden Rules of Bioinformatics

Exercise 3

Rule: If there is a vowel on one side of the card, there must be an even number on the other side.

Card   Outcome     Rule
E      Even        Can be true even if rule false
E      Odd         violated
K      Even        na
K      Odd         na
4      Vowel       Can be true even if rule false
4      Consonant   na
7      Vowel       violated
7      Consonant   na
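The table can be reproduced by brute force: a card is worth turning over only if some possible hidden face would violate the rule. The sketch below assumes, as in the usual form of the task, that every card has a letter on one side and a single digit on the other. The helper names are illustrative.

```python
import string

VOWELS = set("AEIOU")
CARDS = ["E", "K", "4", "7"]  # visible faces from the table above


def violates(letter: str, number: int) -> bool:
    """The rule is broken only by a vowel paired with an odd number."""
    return letter in VOWELS and number % 2 != 0


def can_falsify(visible: str) -> bool:
    """True if at least one possible hidden face would make this card break the rule."""
    if visible.isalpha():
        return any(violates(visible, n) for n in range(10))
    return any(violates(letter, int(visible)) for letter in string.ascii_uppercase)


must_turn = [card for card in CARDS if can_falsify(card)]
print(must_turn)  # ['E', '7']
```

Turning over the 4 can never reveal a violation, whatever is on the other side, so it is an uninformative experiment.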

Page 45: Golden Rules of Bioinformatics

Exercise 3

• This is equivalent to functional classification, e.g.:

• Rule: If there is a CRN/RxLR/T3SS domain, the protein must be an effector.

Page 46: Golden Rules of Bioinformatics

Exercise 3

• Confirmation Bias (Wason Selection Task)

  • An uninformative experiment is performed
  • http://en.wikipedia.org/wiki/Wason_selection_task

• Affirming the Consequent (a related formal fallacy)

  1. If P, then Q
  2. Q
  3. Therefore, P

  • Experimental results are misinterpreted
  • http://en.wikipedia.org/wiki/Affirming_the_consequent
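That the three-step argument is invalid can be shown with a truth table: it is enough to find one assignment where both premises hold and the conclusion does not. A minimal sketch, with illustrative names:

```python
from itertools import product


def implies(p: bool, q: bool) -> bool:
    """Material implication: 'if P, then Q'."""
    return (not p) or q


# Assignments where both premises (P -> Q, and Q) hold but the conclusion P fails.
counterexamples = [
    (p, q)
    for p, q in product([False, True], repeat=2)
    if implies(p, q) and q and not p
]
print(counterexamples)  # [(False, True)]
```

The single counterexample, P false and Q true, corresponds to observing the predicted result for some reason other than the hypothesis being tested.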

Page 47: Golden Rules of Bioinformatics

Third Golden Rule of Bioinformatics

• Everyone has expectations of their data/experiment

  • Beware cognitive errors, such as confirmation bias!
  • System 1 vs. System 2 ≈ intuition vs. reason

• Think statistically!

  • Large datasets can be counterintuitive and appear to confirm a large number of contradictory hypotheses
  • Always account for multiple tests.
  • Avoid “data dredging”: intensive computation is not an adequate substitute for expertise

• Use test-driven development of analyses and code

  • Use examples that pass and fail
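The slides do not name a particular correction for multiple tests. As an illustration only, the sketch below contrasts an uncorrected threshold with two widely used adjustments, Bonferroni and Benjamini-Hochberg, on a small set of made-up P values. The P values and function names are hypothetical.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject only P values at or below alpha divided by the number of tests."""
    threshold = alpha / len(pvalues)
    return [p <= threshold for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Step-up procedure controlling the false discovery rate at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected


# Hypothetical P values from ten tests (not taken from the course data).
P_VALUES = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

n_uncorrected = sum(p <= 0.05 for p in P_VALUES)
n_bonferroni = sum(bonferroni(P_VALUES))
n_bh = sum(benjamini_hochberg(P_VALUES))
print(f"uncorrected: {n_uncorrected}, Bonferroni: {n_bonferroni}, Benjamini-Hochberg: {n_bh}")

# With 20,000 independent null tests at alpha = 0.05, about 1,000 'hits' are expected by chance.
print(f"expected false positives: {20_000 * 0.05:.0f}")
```

Which correction is appropriate depends on the question being asked. The point is only that the number of 'significant' results changes substantially once the number of tests is taken into account.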

Page 49: Golden Rules of Bioinformatics

In Conclusion

• Always communicate!

  • worst errors are silent

• Don’t trust the data

  • formatting/validation/category errors - check!
  • suitability for scientific question

• Don’t trust the software

  • software is not an authority
  • always benchmark, always validate

• Don’t trust yourself

  • beware cognitive errors
  • think statistically
  • biological “stories” can be constructed from nonsense