Page 1: Samuel O’Malley oymsj001@mymail.unisa.edu.au Supervisor: Prof. Jiuyong Li jiuyong.li@unisa.edu.au Associate Supervisor: Dr. Jixue Liu jixue.liu@unisa.edu.au.

Samuel O’Malley
oymsj001@mymail.unisa.edu.au

Supervisor: Prof. Jiuyong Li
jiuyong.li@unisa.edu.au

Associate Supervisor: Dr. Jixue Liu
jixue.liu@unisa.edu.au

Information Retrieval of microRNA Research from Biomedical Literature

Motivation Background Research Question Contribution Implementation References

Do not remove this notice.

Copyright Notice

COMMONWEALTH OF AUSTRALIA

Copyright Regulations 1969

WARNING

This material has been produced and communicated to you by or on behalf of the University of South Australia pursuant to Part VB of the Copyright Act 1968 (the Act).

The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act.

Do not remove this notice.

Overview

DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. DO NOT REMOVE THIS NOTICE.

Motivation
Background
Research Question
Contribution
Implementation
Examples
References

Motivation

microRNA research is increasing exponentially
Databases cannot be curated fast enough
A researcher cannot be “current” in the field of microRNA
Automatic curation tools exist for other areas of biomedical research

microRNA – What are they?

microRNA are small non-coding lengths of RNA

They inhibit the creation of proteins

Video from rossettagenomics.com

miRBase

A database of microRNA sequences and annotations.

Human microRNA 150 is also called MIR150, hsa-mir-150, MIRN150 etc.

miRBase provides a human-readable name as well as a machine-readable ID

Example: hsa-mir-150 has an ID of MI0000479 and HGNC:MIR150

A. Kozomara and S. Griffiths-Jones, “miRBase: integrating microRNA annotation and deep-sequencing data”, Nucleic Acids Research, vol. 39, no. suppl 1, pp. D152-D157, 2011.
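
A lookup of this kind can be sketched in Python (the slides themselves use Perl for their snippets). The alias table below holds only the names given on this slide; the dictionary layout and the `normalise` helper are assumptions made for illustration and are not miRBase's own interface.

```python
# Hypothetical alias table built only from the names on this slide.
ALIASES = {
    "hsa-mir-150": "MI0000479",
    "mir150": "MI0000479",
    "mirn150": "MI0000479",
}


def normalise(name):
    """Return the miRBase ID for a known alias, or None if it is unknown."""
    return ALIASES.get(name.strip().lower())


if __name__ == "__main__":
    for alias in ("MIR150", "hsa-miR-150", "MIRN150"):
        print(alias, "->", normalise(alias))
```

Lower-casing before the lookup is a design choice of this sketch; it lets differently capitalised spellings share one entry.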

Disease Related Enzymes

Finds occurrences of an Enzyme and a Disease mentioned in the same sentence

Classifies their relationship using a Support Vector Machine

Uses a training set of pre-classified sentences.

Example: “Chronic granulomatous disease (CGD) results from mutations of phagocyte NADPH oxidase.” Classified as “Causal Interaction”

C. Sohngen, A. Chang, and D. Schomburg, “Development of a classification scheme for disease-related enzyme information”, BMC Bioinformatics, vol. 12, no. 1, p. 329, 2011.
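
The first step described on this slide, finding an enzyme and a disease in the same sentence, might look roughly like the Python sketch below. The two term lists, the naive sentence splitter and the function name are assumptions for illustration; the Support Vector Machine classification step is not shown.

```python
import re

# Hypothetical term lists; a real system would use curated dictionaries.
ENZYMES = ["NADPH oxidase"]
DISEASES = ["Chronic granulomatous disease"]


def cooccurrences(text):
    """Return (enzyme, disease, sentence) for each sentence naming both."""
    hits = []
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        lowered = sentence.lower()
        for enzyme in ENZYMES:
            for disease in DISEASES:
                if enzyme.lower() in lowered and disease.lower() in lowered:
                    hits.append((enzyme, disease, sentence))
    return hits
```

Each returned sentence would then be handed to the trained classifier to label the relationship.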

Gene Name Disambiguation

Genes can have many different names or variations

Humans can understand “context”; for machines this is a challenge

Example: Five sentences in the paper refer to different genes. Four of these refer to a human gene; the fifth, however, is ambiguous between a human gene and a fly gene.

C.J. Sun, X.L. Wang, L. Lin, and Y.-C. Liu, “A multi-level disambiguation framework for gene name normalization”, Acta Automatica Sinica, vol. 35, no. 2, pp. 193-197, 2009.

LINNAEUS – Species Identification

LINNAEUS uses a set of simple regular expressions to find indicators of what species a text is referring to.

In my research I use a modified list to incorporate the specific microRNA domain knowledge.

Example: These words can all be used when talking about humans (ID: 9606):
[hH]umans? [pP]atients? [pP]articipants? [wW]oman [wW]omen [mM]en [gG]irls? [bB]oys? [pP]eoples? [Cc]hild(ren)? [Ii]nfants? [Pp]ersons?

Gerner, M, Nenadic, G & Bergman, C 2010, 'LINNAEUS: A species name identification system for biomedical literature', BMC Bioinformatics, vol. 11, no. 1, p. 85.
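
As a rough illustration, the human indicator patterns listed on this slide can be combined into one Python regular expression (the slides use Perl for their snippets). This is a sketch and not LINNAEUS itself: the `\b` word boundaries, the tuple layout and the function name are assumptions added here.

```python
import re

HUMAN_ID = 9606  # species ID for humans, as given on the slide

HUMAN_PATTERNS = [
    r"[hH]umans?", r"[pP]atients?", r"[pP]articipants?",
    r"[wW]oman", r"[wW]omen", r"[mM]en", r"[gG]irls?", r"[bB]oys?",
    r"[pP]eoples?", r"[Cc]hild(ren)?", r"[Ii]nfants?", r"[Pp]ersons?",
]

# Assumption: word boundaries are added so that "men" does not match
# inside longer words such as "treatment".
HUMAN_RE = re.compile(r"\b(?:" + "|".join(HUMAN_PATTERNS) + r")\b")


def species_mentions(text):
    """Return (species_id, matched_word, character_offset) for each hit."""
    return [(HUMAN_ID, m.group(0), m.start()) for m in HUMAN_RE.finditer(text)]
```

Other species would need their own pattern lists and IDs, which are not given on the slide.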

Research Question

What is the most suitable technique for discovering and classifying microRNA-gene relationships from biomedical literature?

Contribution

1. A normalisation and disambiguation technique for gene names will be adapted to fit the unique microRNA ontology.

2. Automatic curation of microRNA and gene relationships in biomedical literature. (Not completed yet)

MySQL Database Backend

Table Name          Columns
Abstracts           ID, Abstract, Title
Stop_Abstracts      ID, Abstract, Title
Species             ID, Name
Micro_Prefix        Prefix, Species_ID
Species_Mentions    Abstract_ID, Species_ID, Sentence_Num, Word_Num
MicroRNA_Mentions   Abstract_ID, Micro_ID, Sentence_Num, Word_Num
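
To make the layout concrete, the tables can be sketched with Python's built-in `sqlite3` module as a stand-in for MySQL. Only the table and column names come from the slide; the column types, the primary keys and the `create_db` helper are assumptions.

```python
import sqlite3

# Table and column names follow the slide; the types are assumed.
SCHEMA = """
CREATE TABLE Abstracts (ID INTEGER PRIMARY KEY, Abstract TEXT, Title TEXT);
CREATE TABLE Stop_Abstracts (ID INTEGER PRIMARY KEY, Abstract TEXT, Title TEXT);
CREATE TABLE Species (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Micro_Prefix (Prefix TEXT, Species_ID INTEGER);
CREATE TABLE Species_Mentions (
    Abstract_ID INTEGER, Species_ID INTEGER,
    Sentence_Num INTEGER, Word_Num INTEGER);
CREATE TABLE MicroRNA_Mentions (
    Abstract_ID INTEGER, Micro_ID TEXT,
    Sentence_Num INTEGER, Word_Num INTEGER);
"""


def create_db():
    """Create an in-memory database holding the six tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


if __name__ == "__main__":
    db = create_db()
    # The "hsa" prefix denotes a human sequence (species ID 9606).
    db.execute("INSERT INTO Micro_Prefix VALUES (?, ?)", ("hsa", 9606))
    print(db.execute("SELECT * FROM Micro_Prefix").fetchall())
```

Storing a sentence number and word number per mention is what later allows mentions of species and microRNA to be compared by position.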

Full Example – Original Abstract

microRNA profiling in Epstein-Barr virus-associated B-cell lymphoma.

The Epstein-Barr virus (EBV) is an oncogenic human Herpes virus found in ~15% of diffuse large B-cell lymphoma (DLBCL). EBV encodes miRNAs and induces changes in the cellular miRNA profile of infected cells. MiRNAs are small, non-coding RNAs of ~19-26 nt which suppress protein synthesis by inducing translational arrest or mRNA degradation. Here, we report a comprehensive miRNA-profiling study and show that hsa-miR-424, -223, -199a-3p, -199a-5p, -27b, -378, -26b, -23a, -23b were upregulated and hsa-miR-155, -20b, -221, -151-3p, -222, -29b/c, -106a were downregulated more than 2-fold due to EBV-infection of DLBCL. All known EBV miRNAs with the exception of the BHRF1 cluster as well as EBV-miR-BART15 and -20 were present. A computational analysis indicated potential targets such as c-MYB, LATS2, c-SKI and SIAH1. We show that c-MYB is targeted by miR-155 and miR-424, that the tumor suppressor SIAH1 is targeted by miR-424, and that c-SKI is potentially regulated by miR-155. Downregulation of SIAH1 protein in DLBCL was demonstrated by immunohistochemistry. The inhibition of SIAH1 is in line with the notion that EBV impedes various pro-apoptotic pathways during tumorigenesis. The down-modulation of the oncogenic c-MYB protein, although counter-intuitive, might be explained by its tight regulation in developmental processes.

Full Example – Stopwords Removed

Epstein-Barr virus EBV oncogenic human Herpes virus found 15 diffuse large B-cell lymphoma DLBCL

… MiRNAs small non-coding RNAs 19-26 nt suppress protein synthesis inducing translational arrest mRNA degradation . we report comprehensive miRNA-profiling study show hsa-miR-424 223 199a-3p 199a-5p 27b 378 26b 23a 23b upregulated hsa-miR-155 20b 221 151-3p 222 29b c 106a downregulated 2-fold due EBV-infection DLBCL

Full Example – Stopwords Removed

First replace all full stops with “ . ” and remove the final full stop:
◦ $abstract =~ s/([^\s])\.\s+/$1 . /gm;
◦ $abstract =~ s/([^\s])\.\s*\Z/$1/gm;
◦ “Ph.D” will not be affected by this

Then split the words into the following chunks:
◦ $abstract =~ /(([a-zA-Z0-9']+-)*[a-zA-Z0-9'\.]+)/g
◦ And remove the word if it matches Lingua’s stopword list (James 2002).
◦ Essentially this algorithm splits each word up but still keeps hyphens, apostrophes and numbers.
◦ Most stopword algorithms remove numbers and hyphens but they are essential for microRNA detection.
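
The two substitutions and the tokenising pattern above can be tried out with the following Python sketch (the slides use Perl). The stopword set is a small hypothetical stand-in for Lingua's list, and the function name is invented for illustration.

```python
import re

# Hypothetical stand-in for Lingua's stopword list.
STOPWORDS = {"a", "an", "and", "are", "by", "in", "is", "of", "or",
             "that", "the", "to", "were", "which"}

# Python form of the slide's pattern: hyphen-joined chunks are kept whole,
# and apostrophes, digits and inner full stops survive.
TOKEN_RE = re.compile(r"(?:[a-zA-Z0-9']+-)*[a-zA-Z0-9'.]+")


def remove_stopwords(abstract):
    """Tokenise an abstract and drop stopwords, keeping numbers and hyphens."""
    # Turn sentence-ending full stops into a standalone " . " token.
    text = re.sub(r"(\S)\.\s+", r"\1 . ", abstract)
    # Drop the final full stop of the abstract.
    text = re.sub(r"(\S)\.\s*\Z", r"\1", text)
    tokens = TOKEN_RE.findall(text)
    return [t for t in tokens if t.lower() not in STOPWORDS]
```

Because the first substitution requires whitespace after the full stop, a token such as “Ph.D” passes through unchanged, and a shorthand number such as the 223 in “hsa-miR-424, -223” survives as its own token.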

Full Example – Analysis

These two lines from the text specify 17 different MicroRNAs:

hsa-miR-424 223 199a-3p 199a-5p 27b 378 26b 23a 23b

hsa-miR-155 20b 221 151-3p 222 29b c 106a

The “hsa-” prefix confirms to us that this is a human sequence.

If there are competing species in the same document, we use a distance function to choose which one to use and keep the others as backups.
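
The slides do not define the distance function, so the Python sketch below is only one plausible reading. It assumes positions are (sentence number, word number) pairs as stored in the mention tables, ranks species by sentence distance first, and uses word distance only within the same sentence. The function name and the mouse ID 10090 (NCBI Taxonomy) in the usage example are additions for illustration.

```python
def rank_species(mirna_pos, species_mentions):
    """Rank species IDs from nearest to farthest from a microRNA mention.

    mirna_pos: (sentence_num, word_num) of the microRNA mention.
    species_mentions: iterable of (species_id, sentence_num, word_num).
    The first ID returned is the preferred species; the rest are backups.
    """
    best = {}
    for species_id, sentence, word in species_mentions:
        sentence_gap = abs(sentence - mirna_pos[0])
        # Word positions are only comparable inside the same sentence.
        word_gap = abs(word - mirna_pos[1]) if sentence_gap == 0 else 0
        distance = (sentence_gap, word_gap)
        if species_id not in best or distance < best[species_id]:
            best[species_id] = distance
    return sorted(best, key=lambda sid: (best[sid], sid))


if __name__ == "__main__":
    mentions = [(9606, 1, 4), (10090, 3, 20), (9606, 5, 2)]
    print(rank_species((3, 6), mentions))
```

In the usage example the mouse mention shares a sentence with the microRNA, so it ranks ahead of the human mentions that sit two sentences away.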

Full Example – Detection

This regular expression captures all microRNA written in the standard format:
◦ m/^((([a-zA-Z]+-)?(mir|let)-?)[\d][\d\-a-z]*$)/mi

For example:
◦ hsa-miR-27b
◦ hsa-miR-29b-1
◦ let-7b
◦ MIR298A

It does not capture the following string:
◦ hsa-miR-424 -223
◦ It would only see the first microRNA, but miss 223
◦ My algorithm appends each number to the last seen microRNA prefix if the number occurs immediately after a valid microRNA
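
The carry-over rule can be sketched in Python as follows (the slides use Perl). The main pattern is the one from this slide; the suffix pattern, the function name and the decision to reset the prefix at the first non-matching token are assumptions. The sketch does not handle the “29b c” shorthand from the example abstract.

```python
import re

# The slide's pattern; group 2 is the prefix, e.g. "hsa-miR-".
MIRNA_RE = re.compile(r"^((([a-zA-Z]+-)?(mir|let)-?)[\d][\d\-a-z]*$)",
                      re.I | re.M)
# Assumption: a bare suffix looks like the numeric tail of a microRNA name.
SUFFIX_RE = re.compile(r"^\d[\d\-a-z]*$", re.I)


def detect_mirnas(tokens):
    """Return microRNA names, expanding bare numbers after a valid name."""
    found = []
    prefix = None  # prefix of the most recent microRNA, if still active
    for token in tokens:
        match = MIRNA_RE.match(token)
        if match:
            prefix = match.group(2)
            found.append(token)
        elif prefix is not None and SUFFIX_RE.match(token):
            found.append(prefix + token)
        else:
            prefix = None  # chain broken by an unrelated word
    return found
```

Resetting the prefix keeps a later number such as “2-fold” from being mistaken for a microRNA suffix.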

Full Example – Real Detection

Abstract_ID  Micro_ID   Sentence  Word  Micro_Name
21062812     MI0000079  3         13    hsa-mir-23a
21062812     MI0000084  3         12    hsa-mir-26b
21062812     MI0000298  3         18    hsa-mir-221
21062812     MI0000299  3         20    hsa-mir-222
21062812     MI0000300  3         7     hsa-mir-223
21062812     MI0000439  3         14    hsa-mir-23b
21062812     MI0000440  3         10    hsa-mir-27b
21062812     MI0000113  3         11    hsa-mir-106a
21062812     MI0000681  3         16    hsa-mir-155
21062812     MI0001446  3         6     hsa-mir-424
21062812     MI0000105  3         8     hsa-mir-29b-1
21062812     MI0000105  3         8     hsa-mir-29b-2
21062812     MI0000735  3         9     hsa-mir-29c
21062812     MI0001519  3         17    hsa-mir-20b

Missing Entries:
mir-199a-3p   New Terminology
mir-199a-5p   New Terminology
mir-378       Ambiguous Entries
mir-151-3p    New Terminology

Full Example – Review

To review the effectiveness of this algorithm:

1. We will manually annotate a random selection of abstracts with correct microRNA information.
◦ Pros: Accurate, wide selection of different types of writing
◦ Cons: Slow and laborious

2. We will do a reverse lookup from miRBase (which references PubMed IDs) and assume that each referenced abstract contains the microRNA listed in miRBase.
◦ Pros: Fast and automated
◦ Cons: The microRNA might not be mentioned at all in the abstract (false negatives). The microRNA are likely to be specified with their fully qualified names and may not fully represent the target population.
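
The slides do not say which measure will be used for either review. One common choice is precision and recall over sets of microRNA IDs per abstract, sketched below in Python. The function name and the example sets are illustrative assumptions and are not results from this work.

```python
def precision_recall(detected, annotated):
    """Compare detected microRNA IDs with a reference annotation.

    Returns (precision, recall); an empty set yields 0.0 for its measure.
    """
    detected, annotated = set(detected), set(annotated)
    true_positives = len(detected & annotated)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall


if __name__ == "__main__":
    # Illustrative sets only; the IDs are reused from the detection table.
    detected = {"MI0000079", "MI0000084", "MI0000298"}
    annotated = {"MI0000079", "MI0000084", "MI0000300", "MI0000439"}
    print(precision_recall(detected, annotated))
```

The same function would serve both review methods, with the reference set coming either from manual annotation or from the miRBase reverse lookup.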

Some Statistics

There are 18,314 entries in my Abstracts table
◦ Of those, there are 17,231 with usable abstracts
48% of these abstracts contain species indicators.
When the abstracts finished downloading (after 2 hours) there were already 16 new abstracts available.
My database has 21,222 unique microRNA listed from miRBase.
There are 62,036 microRNA with no ambiguity in the abstracts.
53% of total detections were improved by the species detection.

References

J. Imig, N. Motsch, J. Y. Zhu, S. Barth, M. Okoniewski, T. Reineke, M. Tinguely, A. Faggioni, P. Trivedi, G. Meister, C. Renner, and F. A. Grasser, “microRNA profiling in Epstein-Barr virus-associated B-cell lymphoma”, Nucleic Acids Res, vol. 39, no. 5, pp. 1880-1893, 2011.

M. Gerner, G. Nenadic, and C. Bergman, “LINNAEUS: A species name identification system for biomedical literature”, BMC Bioinformatics, vol. 11, no. 1, p. 85, 2010.

L. J. Jensen, J. Saric, and P. Bork, “Literature mining for the biologist: from information retrieval to biological discovery”, Nat Rev Genet, vol. 7, no. 2, pp. 119-129, 2006.

A. Kozomara and S. Griffiths-Jones, “miRBase: integrating microRNA annotation and deep-sequencing data”, Nucleic Acids Research, vol. 39, no. suppl 1, pp. D152-D157, 2011.

C. Sohngen, A. Chang, and D. Schomburg, “Development of a classification scheme for disease-related enzyme information”, BMC Bioinformatics, vol. 12, no. 1, p. 329, 2011.

C.J. Sun, X.L. Wang, L. Lin, and Y.-C. Liu, “A multi-level disambiguation framework for gene name normalization”, Acta Automatica Sinica, vol. 35, no. 2, pp. 193-197, 2009.

H. C. Wang, Y. H. Chen, H. Y. Kao, and S. J. Tsai, “Inference of transcriptional regulatory network by bootstrapping patterns”, Bioinformatics (Oxford, England), vol. 27, no. 10, pp. 1422-1428, 2011.

Questions

Any Questions?