MilkER – a milk informatics resource Stephen Edwards BSc. University of Edinburgh BioNLP meeting...


Page 1: MilkER – a milk informatics resource Stephen Edwards BSc. University of Edinburgh BioNLP meeting 6th June 2005.

milkER – a milk informatics resource

Stephen Edwards BSc., University of Edinburgh

BioNLP meeting, 6th June 2005

Page 2:

Overview

- Aims of milkER
- milkER database
- Text-mining
- Potential targets

Page 3:

milkER aims

To amalgamate dispersed milk information into one resource, allowing more focused analysis of milk proteins in relation to dairy issues, health and disease.

Page 4:

A milk database

- Knowledge on milk affects many industries
- UniProt, GenBank excellent resources
- Marsupial genomics database (New Zealand)
- Glasgow genomics data
- Chinese database
- Polish bioactive peptide database
- Food property database (commercial)

Page 5:

Milk components

- Fat, carbohydrates, proteins, minerals
- Growth factors, enzymes, enzyme inhibitors, immunoglobulins, allergens, disease factors, anti-bacterial proteins, opioids

1. Deliberate
2. Leakage from blood
3. Result of disease conditions
4. Engineered
5. Bacterial origin

Page 6:

milkER database

- Database using BioSQL which allows incorporation of UniProt, EMBL, GenBank entries

Page 7:

LOCUS       NM_173929    790 bp    mRNA    linear   MAM 27-OCT-2004
DEFINITION  Bos taurus lactoglobulin, beta (LGB), mRNA.
ACCESSION   NM_173929
VERSION     NM_173929.2  GI:31343239
KEYWORDS    .
SOURCE      Bos taurus (cow)
  ORGANISM  Bos taurus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Cetartiodactyla; Ruminantia; Pecora; Bovidae;
            Bovinae; Bos.
REFERENCE   1  (bases 1 to 790)
  AUTHORS   Jayat,D., Gaudin,J.C., Chobert,J.M., Burova,T.V., Holt,C.,
            McNae,I., Sawyer,L. and Haertle,T.
  TITLE     A recombinant C121S mutant of bovine beta-lactoglobulin is more
            susceptible to peptic digestion and to denaturation by reducing
            agents and heating
  JOURNAL   Biochemistry 43 (20), 6312-6321 (2004)
   PUBMED   15147215
  REMARK    GeneRIF: Results suggest that the stability of beta-lactoglobulin
            arising from the hydrophobic effect is reduced by the C121S
            mutation so that unfolded or partially unfolded states are more
            favored.
ORIGIN
        1 actccactcc ctgcagagct cagaagcgtg atcccggctg cagccatgaa gtgcctcctg
       61 cttgccctgg ccctcacctg tggcgcccag gccctcatcg tcacccagac catgaagggc
…..
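To illustrate what incorporating such an entry involves, the following minimal sketch pulls a few header fields out of a GenBank-style record. It is not the milkER loader: the function name `parse_header` is hypothetical, the excerpt is shortened from the record above, and continuation lines are deliberately ignored. A real BioSQL setup would more likely rely on an existing loader (for example from BioPerl or Biopython) than on hand-written parsing.

```python
import re

# Shortened excerpt of the record shown on the slide, one field per line.
RECORD = """\
LOCUS       NM_173929    790 bp    mRNA    linear   MAM 27-OCT-2004
DEFINITION  Bos taurus lactoglobulin, beta (LGB), mRNA.
ACCESSION   NM_173929
VERSION     NM_173929.2  GI:31343239
SOURCE      Bos taurus (cow)
REFERENCE   1  (bases 1 to 790)
  JOURNAL   Biochemistry 43 (20), 6312-6321 (2004)
   PUBMED   15147215
"""


def parse_header(record: str) -> dict:
    """Map each upper-case field keyword to the text that follows it.

    Assumption: every field fits on one line. Only the first occurrence
    of a keyword is kept, and lines without a keyword are skipped.
    """
    fields = {}
    for line in record.splitlines():
        match = re.match(r"\s*([A-Z]+)\s+(.*\S)\s*$", line)
        if match:
            fields.setdefault(match.group(1), match.group(2))
    return fields


fields = parse_header(RECORD)
print(fields["ACCESSION"], fields["PUBMED"])
```

The PUBMED identifier is the field that ties a sequence entry to the literature, which is what the later text-mining steps work from.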

Page 8:

milkER population

[Diagram. Labels: milkER; EMBL; UniProt; Other Databases; Other Sources (e.g. published tables); Web Query; Information retrieval; Information extraction]

Page 9:

milkER database

- Database using BioSQL which allows incorporation of UniProt, EMBL, GenBank entries
- Library of literature on milk
- User interface (www.milker.org.uk)

Page 10:
Page 11:
Page 12:
Page 13:

Text-mining

- Machine ‘reading’ of text
- Many techniques involved:
  – Tokenisation
  – Stemming (Activation → Activat)
  – POS tagging (Protein → noun)
  – Abbreviation expansion (CN → Casein)
  – Entity identification (Casein → protein)
  – Dictionary
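Several of the techniques listed above can be sketched in a few lines of Python. This is a toy illustration only: the suffix list is not a real stemming algorithm such as Porter's, and the abbreviation table and entity dictionary are hypothetical one-entry stand-ins built from the slide's own examples.

```python
import re

ABBREVIATIONS = {"CN": "casein"}           # hypothetical abbreviation table
ENTITY_DICTIONARY = {"casein": "protein"}  # hypothetical entity dictionary
SUFFIXES = ("ion", "ing", "ed", "s")       # toy suffix list, not a real stemmer


def tokenise(text):
    """Split text into word tokens, keeping hyphenated names such as B-LG whole."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)


def stem(token):
    """Lower-case the token and strip the first matching suffix, if any."""
    lower = token.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: -len(suffix)]
    return lower


def expand(token):
    """Replace a known abbreviation with its long form."""
    return ABBREVIATIONS.get(token, token)


def annotate(text):
    """Return (token, stem of expanded token, entity type or None) triples."""
    result = []
    for token in tokenise(text):
        full = expand(token)
        result.append((token, stem(full), ENTITY_DICTIONARY.get(full.lower())))
    return result


print(annotate("Activation of CN"))
```

Real systems replace each of these pieces with trained or curated components, but the order of the steps is the same.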

Page 14:

“Increased levels of IgA antibodies to B-LG were found and were shown to be an independent risk marker for type 1 diabetes.”

- Tokeniser / POS tagger: Increased [past participle] levels [plural noun] of [preposition]
- Entity identification: IgA [antibody], B-LG [protein], Diabetes [disease]
- Parser: [IgA antibodies to B-LG] ‘MARKER’ [type 1 diabetes]

Page 15:

Information extraction

- Rule based
  – ‘interact’ ‘bind’ ‘activate’
  – [protein] (0-5 words) [verbs] (0-5 words) [protein] (Blaschke and Valencia, 2002)
- Machine-learning
  – Statistical methods, Hidden Markov Models
  – Learn interfillers, text lying between tagged entities (Bunescu et al, 2004)
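The rule-based frame above, [protein] (0-5 words) [verb] (0-5 words) [protein], can be sketched as a simple token scan. This is an illustration of the pattern, not the implementation from Blaschke and Valencia; the protein list, the verb stems and the example sentence are all made up for the demonstration.

```python
import re

PROTEINS = {"lactoferrin", "casein"}          # hypothetical protein dictionary
VERB_STEMS = ("interact", "bind", "activat")  # stems of the verbs on the slide
MAX_GAP = 5                                   # at most 5 words between slots


def extract(sentence):
    """Return (protein, verb, protein) triples matching the frame."""
    tokens = re.findall(r"[\w-]+", sentence)
    hits = []
    for i, first in enumerate(tokens):
        if first.lower() not in PROTEINS:
            continue
        # The verb may follow the first protein after 0 to MAX_GAP words.
        for j in range(i + 1, min(i + MAX_GAP + 2, len(tokens))):
            if not tokens[j].lower().startswith(VERB_STEMS):
                continue
            # The second protein may follow the verb after 0 to MAX_GAP words.
            for k in range(j + 1, min(j + MAX_GAP + 2, len(tokens))):
                if tokens[k].lower() in PROTEINS:
                    hits.append((first, tokens[j], tokens[k]))
    return hits


print(extract("Lactoferrin strongly binds to bovine casein in vitro"))
```

A scan like this ignores negation and co-ordination entirely, so on its own it would be prone to false positives.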

Page 16:

Difficulties

- Synonyms
- Proteins and genes with same name
- Funny names e.g. ERK-1/2, ‘and’ gene!
- Variability of natural language
- Compounded names
- Co-ordination, negatives, speeling errors

Page 17:

Evaluation

- Precision (P) - how correct is output
- Recall (R) - how often does it pick
- F-measure - combines P and R
- IE systems can achieve high results, but not enough to populate databases automatically
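As a worked example of these three measures, the sketch below uses the standard definitions: precision is the fraction of extracted items that are correct, recall is the fraction of correct items that were extracted, and the F-measure is their weighted harmonic mean. The function name and the toy interaction pairs are assumptions for illustration, not results from milkER.

```python
def precision_recall_f(predicted, gold, beta=1.0):
    """Return (precision, recall, F-measure) for two collections of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    beta_sq = beta * beta
    f_measure = (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
    return precision, recall, f_measure


# Made-up protein pairs: the system proposes four, two of which are in the gold set.
predicted = {("A", "B"), ("A", "C"), ("D", "E"), ("F", "G")}
gold = {("A", "B"), ("A", "C"), ("H", "I"), ("J", "K")}
print(precision_recall_f(predicted, gold))
```

With `beta=1.0` precision and recall are weighted equally; a database curator who cares more about correctness than coverage could use a smaller `beta`.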

Page 18:

Text-mining uses

- Aim to extract interactions and diseases
- Swanson (Fish oil)
- Srinivasan (Turmeric)

Page 19:

General model for discovering implicit links between topics

- Starting topic: Turmeric (inhibits)
- Intermediate topic: Nuclear factor-kappa B (involved in)
- Terminal topic: Crohn’s disease

Diagram taken from Srinivasan et al, 2004
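The starting, intermediate and terminal topic model can be sketched as a search over topic co-occurrence. This is a simplified illustration rather than the method of Srinivasan et al: the three "documents" below are made-up topic sets, and real systems weight and rank the intermediate topics instead of treating every co-occurrence as a link.

```python
from collections import defaultdict

# Made-up topic sets, one per document, for illustration only.
DOCS = [
    {"turmeric", "nuclear factor-kappa B"},
    {"nuclear factor-kappa B", "Crohn's disease"},
    {"turmeric", "curcumin"},
]


def implicit_links(docs, start):
    """Map each terminal topic to the intermediate topics linking it to `start`.

    A terminal topic never co-occurs with `start` directly; it is reached
    only through an intermediate topic that co-occurs with both.
    """
    cooccurs = defaultdict(set)
    for topics in docs:
        for topic in topics:
            cooccurs[topic] |= topics - {topic}
    direct = cooccurs[start]
    links = defaultdict(set)
    for intermediate in direct:
        for terminal in cooccurs[intermediate]:
            if terminal != start and terminal not in direct:
                links[terminal].add(intermediate)
    return dict(links)


print(implicit_links(DOCS, "turmeric"))
```

Each returned link is only a candidate hypothesis that would still need checking against the literature and experiment.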

Page 20:

Targets for text mining

- Many milk relationships still require further investigation
- Positive reasons
  - nutritional benefits
  - neonatal growth
  - antimicrobial activity
  - bioactive peptides

Page 21:

Targets for text mining (cont.)

- Negative reasons
  - recent link with Alzheimer's
  - diabetes link
  - asthma
  - human reactions to cow hormones (e.g. Acne, Danby 2005)
  - drug transfer to milk and effects
  - allergic reactions/intolerance
  - toxic contaminants

Page 22:

milkER process

- 897 proteins, 772 DNA, 1232 RNA
- Analyze references (1465 MEDLINE refs)
  – MeSH terms, GO terms etc
- POS tag
- UMLS standardisation
- Gene/protein dictionary
- Extract relations

Page 23:

Milk literature

                     Articles   English   Abstracts
Milk                   38,097    32,207      23,201
Diabetes mellitus     174,498   133,844     103,868
Milk and diabetes         210       191         132

Page 24:

milkER interactions

- Table of interacting proteins
  – Store as queryable XML strings?
- Discover links between proteins and disease
- Create hypotheses
- Confirm experimentally
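One way to picture the "queryable XML strings" idea is sketched below with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`interaction`, `protein`, `relation`, `source`) are hypothetical and do not describe a milkER schema; the example relation is taken from the sentence quoted earlier in the talk.

```python
import xml.etree.ElementTree as ET


def interaction_to_xml(protein_a, protein_b, relation, source):
    """Serialise one extracted relation as an XML string (hypothetical layout)."""
    element = ET.Element("interaction", {"relation": relation, "source": source})
    ET.SubElement(element, "protein").text = protein_a
    ET.SubElement(element, "protein").text = protein_b
    return ET.tostring(element, encoding="unicode")


def involves(xml_string, protein):
    """Return True if the stored interaction mentions the given protein."""
    root = ET.fromstring(xml_string)
    return any(p.text == protein for p in root.findall("protein"))


stored = interaction_to_xml("IgA", "B-LG", "antibodies to", "slide example")
print(stored)
print(involves(stored, "B-LG"))
```

Storing the string in a table column keeps the relational schema simple, at the cost of parsing each string at query time.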

Page 25:

Diabetes

- Pancreas secretes hormones
  – Glucagon, increases conversion glycogen → glucose
  – Insulin, increases conversion glucose → glycogen. Allows glucose into cells.
- “Condition where the amount of glucose in the blood is abnormally high as the body cannot use it adequately as fuel”

Page 26:

Diabetes

- Affects 3-5% of industrialised populations
- Type 1 (~10%)
  – Genetic and environmental factors (e.g. diet)
  – Decreased insulin production
  – Mostly develops < age 20
- Type 2 (~90%)
  – Resistance of body to insulin
  – Normally develops > age 40
  – Often associated with high B.P., cholesterol and arterial disease

Page 27:

Milk and diabetes

Page 28:

Selected quotes

“More research is needed on all aspects of lactation in women with diabetes.”
– Reader D. et al, Curr Diab Rep. 2004

“The effect of high protein intakes from different sources on glucose-insulin metabolism needs further study”
– Hoppe et al, European Journal of Clinical Nutrition 2005

“American children also tend to be heavier than those from European countries, skewing the [growth] charts further.”
– The Scotsman Sat 5 Feb 2005

The government currently recommends that babies should be fed breast milk alone for the first six months - the WHO recommends two years.

Page 29:

Conclusions

- Knowledge of milk vital in many areas
- milkER aims to bring disparate milk data together
- Text-mining can wade through large amounts of data to retrieve and discover vital information

Page 30:

Future work

- Relation extraction of milk literature
- Extend content of milkER to include interaction data
- Create hypotheses for experimental work

Page 31:

Acknowledgements

- Prof. Lindsay Sawyer
- Dr. Carl Holt (Hannah Research Institute, Ayr)
- Prof. Bonnie Webber (Informatics)
- Dr. Alistair Kerr and Dr. Douglas Armstrong for technical support

Page 32:

References

Acne/milk
– Acne and milk, the diet myth, and beyond (Danby, 2005)

Diabetes/milk
– Milk and diabetes (Schrezenmeir et al, 2000) REVIEW
– The role of β-casein variants in the induction of insulin-dependent diabetes (Elliott et al, 1997)

Text-mining
– Natural language processing and systems biology (Cohen et al, 2004) REVIEW
– Mining MEDLINE for implicit links between dietary substances and diseases (Srinivasan et al, 2004)
– Learning to extract proteins and their interactions from MEDLINE abstracts (Bunescu et al, 2003)