On the importance (and absence) of annotation in Next Generation Sequencing Data
![Page 1: On the importance (and absence) of annotation in Next Generation Sequencing Data](https://reader034.fdocuments.net/reader034/viewer/2022042701/55adcd4a1a28ab34168b4894/html5/thumbnails/1.jpg)
The importance (and absence) of annotation in Next Generation Sequencing Data
Hugh Shanahan & Jamie Alnasir
@hughshanahan
Results to be published in GigaScience
![Page 2: On the importance (and absence) of annotation in Next Generation Sequencing Data](https://reader034.fdocuments.net/reader034/viewer/2022042701/55adcd4a1a28ab34168b4894/html5/thumbnails/2.jpg)
It was the best of times
• Many exciting experiments based on gathering huge amounts of data.
• 100,000 Genomes in the UK, many others
• Elixir - Exabytes of biomedical data in the next decade
• Large experiments - SKA, LHC
• Opening up of Government data
• Up ahead - Sensor networks and Monitoring Cities
• Machine Learning is now a widely accepted tool in analysing data and in making decisions.
• Evidence-based policy becoming the norm.
![Page 3: On the importance (and absence) of annotation in Next Generation Sequencing Data](https://reader034.fdocuments.net/reader034/viewer/2022042701/55adcd4a1a28ab34168b4894/html5/thumbnails/3.jpg)
It was the worst of times
• Leaks appearing in the Scientific process.
• In domains with many possible relationships, most published results are wrong (Ioannidis, PLoS Medicine, 2005).
• Only about 1/4 of 67 published experiments on drug targets could be reproduced (Prinz et al., Nat. Rev. Drug Disc., 2011)
• Only 39% of key Psychology experiments could be reproduced (Nature News, 2015).
![Page 4: On the importance (and absence) of annotation in Next Generation Sequencing Data](https://reader034.fdocuments.net/reader034/viewer/2022042701/55adcd4a1a28ab34168b4894/html5/thumbnails/4.jpg)
Poor statistics?
• Naive use of p-value calculations across fields.
• Banning use of Null Hypothesis Significance Test Procedure in Basic and Applied Social Psychology (Trafimow and Marks, BASP, 2015)
• Not the end of the story…more like the tip of the iceberg (Leek and Peng, Nature 2015)
![Page 5: On the importance (and absence) of annotation in Next Generation Sequencing Data](https://reader034.fdocuments.net/reader034/viewer/2022042701/55adcd4a1a28ab34168b4894/html5/thumbnails/5.jpg)
Lessons learnt
• Results from individual experiments are probably wrong.
• Bias in your data means your conclusions are even more likely to be wrong.
• Meta-analyses help.
• Understand how you got the data you have.
![Page 6: On the importance (and absence) of annotation in Next Generation Sequencing Data](https://reader034.fdocuments.net/reader034/viewer/2022042701/55adcd4a1a28ab34168b4894/html5/thumbnails/6.jpg)
Sequence Read Archive
• Central repository of sequence data.
• Nearly 30,000 genomic and transcriptomics experiments stored and freely available.
• 2 x 10^15 nucleotides stored
![Page 7: On the importance (and absence) of annotation in Next Generation Sequencing Data](https://reader034.fdocuments.net/reader034/viewer/2022042701/55adcd4a1a28ab34168b4894/html5/thumbnails/7.jpg)
![Page 8: On the importance (and absence) of annotation in Next Generation Sequencing Data](https://reader034.fdocuments.net/reader034/viewer/2022042701/55adcd4a1a28ab34168b4894/html5/thumbnails/8.jpg)
• Based on Next Generation Sequencing
• Step reduction in cost of sequencing
• ~$thousands for a human genome
• Potentially an enormous resource
• But how do you get that data?
![Page 9: On the importance (and absence) of annotation in Next Generation Sequencing Data](https://reader034.fdocuments.net/reader034/viewer/2022042701/55adcd4a1a28ab34168b4894/html5/thumbnails/9.jpg)
Good news
• SRA data is open
• Stored in a sensible way (uses SQL)
• API and documentation to access it
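To make the "uses SQL" point concrete, here is a minimal sketch of the kind of query that becomes possible once experiment metadata sits in a relational table. The table name, column names and accession values are hypothetical stand-ins invented for illustration, not the real SRA schema, and the data is a three-row toy set held in memory.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only,
# not the real SRA schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiment ("
    "accession TEXT PRIMARY KEY, "
    "library_strategy TEXT, "
    "protocol_text TEXT)"
)

# Toy records (made up): one with a protocol description, two without.
rows = [
    ("EXP0001", "RNA-Seq",
     "Total RNA extracted; poly-A selection; random hexamer priming."),
    ("EXP0002", "RNA-Seq", None),
    ("EXP0003", "WGS", ""),
]
conn.executemany("INSERT INTO experiment VALUES (?, ?, ?)", rows)


def count_with_protocol(connection):
    """Return (records with non-empty protocol text, total records)."""
    total = connection.execute(
        "SELECT COUNT(*) FROM experiment"
    ).fetchone()[0]
    annotated = connection.execute(
        "SELECT COUNT(*) FROM experiment "
        "WHERE protocol_text IS NOT NULL AND TRIM(protocol_text) <> ''"
    ).fetchone()[0]
    return annotated, total


annotated, total = count_with_protocol(conn)
print(f"{annotated} of {total} toy records carry any protocol text")
```

The same counting query could in principle be pointed at a real metadata dump, provided the table and column names are replaced with whatever that dump actually uses.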
![Page 10: On the importance (and absence) of annotation in Next Generation Sequencing Data](https://reader034.fdocuments.net/reader034/viewer/2022042701/55adcd4a1a28ab34168b4894/html5/thumbnails/10.jpg)
Mucky business
• Data stored in SRA are short reads.
• ~100 nucleotide-long fragments which are then assembled.
• Very long pipeline to get from a sample to this step.
• Pipeline (Protocol in their lingo) is VARIABLE
![Page 11: On the importance (and absence) of annotation in Next Generation Sequencing Data](https://reader034.fdocuments.net/reader034/viewer/2022042701/55adcd4a1a28ab34168b4894/html5/thumbnails/11.jpg)
![Page 12: On the importance (and absence) of annotation in Next Generation Sequencing Data](https://reader034.fdocuments.net/reader034/viewer/2022042701/55adcd4a1a28ab34168b4894/html5/thumbnails/12.jpg)
Obvious question
• Is there any evidence of bias in the data due to varying the protocol?
![Page 13: On the importance (and absence) of annotation in Next Generation Sequencing Data](https://reader034.fdocuments.net/reader034/viewer/2022042701/55adcd4a1a28ab34168b4894/html5/thumbnails/13.jpg)
Even More Obvious Question
• Where is the metadata on the pipeline (protocol)?
![Page 14: On the importance (and absence) of annotation in Next Generation Sequencing Data](https://reader034.fdocuments.net/reader034/viewer/2022042701/55adcd4a1a28ab34168b4894/html5/thumbnails/14.jpg)
4% of experiments describe all of the steps
![Page 15: On the importance (and absence) of annotation in Next Generation Sequencing Data](https://reader034.fdocuments.net/reader034/viewer/2022042701/55adcd4a1a28ab34168b4894/html5/thumbnails/15.jpg)
What’s more…
• Metadata are stored as text fields.
• Hugely difficult task to parse.
• Submitters are not obliged to fill this data in.
• Confusion about what level to enter data in.
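A rough sketch of why free-text metadata is so hard to work with: to ask "does this record describe every protocol step?" one ends up writing fragile keyword patterns. The step names and regular expressions below are assumptions chosen for illustration, not the method used to obtain the 4% figure, and the example records are invented.

```python
import re

# Hypothetical protocol steps and the keywords taken as evidence for each.
# Real submissions use far more varied wording than these patterns cover.
STEP_PATTERNS = {
    "fragmentation": re.compile(r"fragment|sonicat|shear", re.IGNORECASE),
    "priming": re.compile(
        r"random hexamer|oligo[- ]?dT|prim(?:er|ing)", re.IGNORECASE
    ),
    "amplification": re.compile(r"\bPCR\b|amplif", re.IGNORECASE),
}


def steps_described(protocol_text):
    """Return the set of step names whose keywords appear in the text."""
    if not protocol_text:
        return set()
    return {
        step
        for step, pattern in STEP_PATTERNS.items()
        if pattern.search(protocol_text)
    }


def fraction_fully_described(records):
    """Fraction of records whose text mentions every step in STEP_PATTERNS."""
    if not records:
        return 0.0
    all_steps = set(STEP_PATTERNS)
    complete = sum(1 for text in records if steps_described(text) == all_steps)
    return complete / len(records)


# Invented example records, including an empty field.
toy_records = [
    "RNA was sheared by sonication, primed with random hexamers, "
    "then PCR amplified.",
    "Standard Illumina protocol.",
    None,
    "Fragmented and amplified.",
]
print(fraction_fully_described(toy_records))
```

A record such as "Standard Illumina protocol." may well correspond to a fully specified procedure, but nothing in the text lets a program recover it, which is the parsing problem in miniature.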
![Page 16: On the importance (and absence) of annotation in Next Generation Sequencing Data](https://reader034.fdocuments.net/reader034/viewer/2022042701/55adcd4a1a28ab34168b4894/html5/thumbnails/16.jpg)
Bottom line
• For much of the SRA data, there is a “known unknown” about biases due to preparation.
• It’s very unlikely we’ll ever be able to figure that out.
![Page 17: On the importance (and absence) of annotation in Next Generation Sequencing Data](https://reader034.fdocuments.net/reader034/viewer/2022042701/55adcd4a1a28ab34168b4894/html5/thumbnails/17.jpg)
Why should you be paying attention?
• As a member of the public - it’s your money down the drain ($10^8-$10^9)
• As a researcher - all of this undermines confidence in Science as a whole.
• If you work with big (and more particularly) complex data - the same issues will crop up for you.
![Page 18: On the importance (and absence) of annotation in Next Generation Sequencing Data](https://reader034.fdocuments.net/reader034/viewer/2022042701/55adcd4a1a28ab34168b4894/html5/thumbnails/18.jpg)
Answers?
• Understand how you got your data - even if it’s a step for modelling.
• Metadata is crucial.
• Organising your data is crucial.
• Use Ontologies
• Use discrete keywords
• Get people to use it
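As a sketch of what "use discrete keywords" could look like in practice, the snippet below checks a submission against a small controlled vocabulary instead of accepting free text. The field names and allowed terms are hypothetical; a real system would presumably draw them from an agreed ontology rather than a hard-coded dictionary.

```python
# Hypothetical controlled vocabulary: field names and terms are
# illustrative assumptions, not taken from any existing ontology.
ALLOWED_TERMS = {
    "fragmentation_method": {"sonication", "enzymatic", "nebulization", "none"},
    "priming_method": {"random_hexamer", "oligo_dt", "none"},
}


def validate_record(record):
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for field, allowed in ALLOWED_TERMS.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif value not in allowed:
            problems.append(
                f"{field}: '{value}' is not in the controlled vocabulary"
            )
    return problems


good = {"fragmentation_method": "sonication", "priming_method": "oligo_dt"}
bad = {"fragmentation_method": "we sonicated it"}

print(validate_record(good))
print(validate_record(bad))
```

Rejecting a submission at entry time, with a message naming the offending field, is one way to address both the "not obliged to fill this in" and the "hugely difficult to parse" problems at once.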
![Page 19: On the importance (and absence) of annotation in Next Generation Sequencing Data](https://reader034.fdocuments.net/reader034/viewer/2022042701/55adcd4a1a28ab34168b4894/html5/thumbnails/19.jpg)
In summary: we want to do all the clever stuff…
![Page 20: On the importance (and absence) of annotation in Next Generation Sequencing Data](https://reader034.fdocuments.net/reader034/viewer/2022042701/55adcd4a1a28ab34168b4894/html5/thumbnails/20.jpg)
Most of the time we need to deal with a ton of pitchblende to find the milligram of radium.