Semantic phenotyping for disease diagnosis and discovery
CBIIT SPEAKER SERIES
1.21.15
SEMANTIC
PHENOTYPING FOR
DISEASE DIAGNOSIS
AND DISCOVERY
@monarchinit
www.monarchinitiative.org
Matchmaker Exchange
Melissa
Haendel
@ontowonka
TODAY’S TALK
The computable phenotypic profile
Exome analysis for disease diagnosis
Crossing the species divide
What is GOOD phenotyping?
Chronological considerations
http://anthro.palomar.edu/abnormal/abnormal_4.htm
http://www.pyroenergen.com/articles07/downs-syndrome.htm
http://www.theguardian.com/commentisfree/2009/oct/27/downs-syndrome-increase-terminations
YOU ALL KNOW THIS PRESENTATION
BUT A COMPUTER DOES NOT
“Phenotypic Profile”
Often free text or checkboxes
Dysmorphic features:
• df
• dysmorphic
• dysmorphic faces
• dysmorphic features
Congenital malformation/anomaly:
• congenital anomaly
• congenital malformation
• congenital anamoly
• congenital anomly
• congential anomaly
• congentital anomaly
• cong. m.
• cong. Mal
• cong. malfor
• congenital malform
• congenital m.
• multiple congenital anomalies
• multiple congenital abormalities
• multiple congenital abnormalities
Examples of lists:
* dd. cong. malfor. behav. pro.
* dd. mental retardation
* df< delayed puberty
* df<
* dd df mr
* mental retar.short stature
CLINICAL PHENOTYPING
6% OF THE GENERAL POPULATION SUFFERS FROM
A RARE DISORDER
6% of patients contacting the NIH Office of
Rare Disorders do not have a diagnosis
THE YET-TO-BE DIAGNOSED PATIENT
Known disorders not recognized during
prior evaluations?
Atypical presentation of known
disorders?
Combinations of several disorders?
Novel, unreported disorder?
THE CHALLENGE: INTERPRETATION OF
DISEASE CANDIDATES
?
What’s in the box?
How are
candidates
identified?
How do they
compare?
Prioritized
Candidates,
functional validation
C1
C2
C3
C4
...
Phenotypes
P1
P2
P3
…
Genotype
G1
G2
G3
G4
…
Pathogenicity, frequency, protein interactions, gene expression, gene networks, epigenomics, metabolomics…
Environments
E1, E2, E3, E4 …
MATCHING PATIENTS TO DISEASES
Patient
Disease X
Differential diagnosis with similar but non-matching phenotypes is difficult
Flat back of head Hypotonia
Abnormal skull morphology Decreased muscle mass
SEARCHING FOR PHENOTYPES USING
TEXT ALONE IS INSUFFICIENT
OMIM Query # Records
“large bone” 785
“enlarged bone” 156
“big bone” 16
“huge bones” 4
“massive bones” 28
“hyperplastic bones” 12
“hyperplastic bone” 40
“bone hyperplasia” 134
“increased bone growth” 612
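The counts above differ because each query is matched as literal text. A minimal sketch of the alternative, mapping every wording to one ontology term before searching, might look like this. The synonym table, the records, and the identifiers `EX:0001` and `EX:0002` are made up for illustration; they are not real HPO terms or OMIM entries.

```python
# Toy synonym table: every phrase maps to one made-up term ID (not a real HPO ID).
SYNONYMS = {
    "large bone": "EX:0001",
    "enlarged bone": "EX:0001",
    "big bone": "EX:0001",
    "huge bones": "EX:0001",
    "massive bones": "EX:0001",
    "hyperplastic bones": "EX:0001",
    "hyperplastic bone": "EX:0001",
    "bone hyperplasia": "EX:0001",
    "increased bone growth": "EX:0001",
}

# Hypothetical records, each annotated with term IDs rather than free text.
RECORDS = {
    "record A": {"EX:0001"},
    "record B": {"EX:0002"},
    "record C": {"EX:0001", "EX:0002"},
}


def normalize(phrase):
    """Return the term ID for a free-text phrase, or None if it is unknown."""
    return SYNONYMS.get(phrase.strip().lower())


def search(phrase):
    """Find records annotated with the term that the phrase maps to."""
    term = normalize(phrase)
    if term is None:
        return []
    return sorted(name for name, terms in RECORDS.items() if term in terms)


# Different wordings now retrieve the same records.
hits_a = search("large bone")
hits_b = search("Bone hyperplasia")
```

With the phrases normalized first, the number of hits no longer depends on which wording the searcher happened to choose.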
Phenotypes.
THINK GRAPHICALLY
Each node is a different phenotype, classified by anatomical system
DISEASE X IS A COLLECTION OF NODES
Each disease is associated with different phenotype nodes in the graph
Disease X
EACH DISEASE IS ANNOTATED WITH A
PHENOTYPIC PROFILE
Chromosome 21 Trisomy
Failure to thrive
Umbilical hernia
Broad hands
Abnormal ears
Flat head
Down’s Syndrome
PHENOTYPE “BLAST”: WHICH PHENOTYPIC
PROFILE IS GRAPHICALLY MOST SIMILAR?
Disease X
Patient
Disease Y
FINDING THE PHENOTYPE GRAPH IN
COMMON
Disease X
Patient
Disease Y
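One way to picture the "graph in common" is to expand each profile to include all ancestor terms and then measure the overlap. The sketch below does this with a Jaccard score over a toy hierarchy. The parent links, the Disease Y profile, and the choice of Jaccard are assumptions made for illustration; they are not HPO's actual structure or the algorithm used in the talk.

```python
# Toy is_a hierarchy (child -> parents). The arrangement is illustrative only.
PARENTS = {
    "Flat back of head": ["Abnormal skull morphology"],
    "Abnormal skull morphology": ["Abnormality of the skeletal system"],
    "Hypotonia": ["Abnormality of the musculature"],
    "Decreased muscle mass": ["Abnormality of the musculature"],
    "Umbilical hernia": ["Abnormality of the abdomen"],
    "Abnormality of the skeletal system": ["Phenotypic abnormality"],
    "Abnormality of the musculature": ["Phenotypic abnormality"],
    "Abnormality of the abdomen": ["Phenotypic abnormality"],
    "Phenotypic abnormality": [],
}


def closure(terms):
    """Return the given terms plus all of their ancestors."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS[term])
    return seen


def graph_similarity(profile_a, profile_b):
    """Jaccard overlap of the two ancestor-expanded profiles."""
    a, b = closure(profile_a), closure(profile_b)
    return len(a & b) / len(a | b)


patient = ["Flat back of head", "Hypotonia"]
diseases = {
    "Disease X": ["Abnormal skull morphology", "Decreased muscle mass"],
    "Disease Y": ["Umbilical hernia"],  # hypothetical profile
}

# Rank diseases by how much of the graph they share with the patient.
ranked = sorted(
    diseases,
    key=lambda name: graph_similarity(patient, diseases[name]),
    reverse=True,
)
```

The patient and Disease X share no identical terms, yet Disease X ranks first because the two profiles meet at shared ancestors.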
THE HUMAN PHENOTYPE ONTOLOGY
Used to annotate:
• Patients
• Disorders/Diseases
• Genotypes
• Genes
• Sequence variants
In human
Reduced pancreatic beta cells
Abnormality of pancreatic islet cells
Abnormality of endocrine pancreas physiology
Pancreatic islet cell adenoma
Insulinoma
Multiple pancreatic beta-cell adenomas
Abnormality of exocrine pancreas physiology
Köhler et al. Nucleic Acids Res. 2014 Jan 1;42(1):D966-74.
WHY DO WE NEED THE HUMAN
PHENOTYPE ONTOLOGY?
Winnenburg and Bodenreider, ISMB PhenoDay, 2014
How does HPO relate to other clinical vocabularies?
EXOME ANALYSIS
Recessive, de novo filters
Remove off-target, common variants,
and variants not in known disease
causing genes
http://compbio.charite.de/PhenIX/
Target panel of 2741 known
Mendelian disease genes
Compare
phenotype
profiles using
data from:
HGMD, Clinvar,
OMIM, Orphanet
Zemojtel et al. Sci Transl Med 3 September 2014: Vol. 6,
Issue 252, p.252ra123
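The filtering and ranking steps on this slide can be sketched roughly as follows. All variant records, gene names, the frequency cutoff, and the phenotype scores are invented for illustration, and the recessive check is simplified to homozygous calls only (compound heterozygotes are ignored). This is not PhenIX's implementation.

```python
# Hypothetical variant records; every value here is made up.
VARIANTS = [
    {"gene": "GENE_A", "on_target": True, "pop_freq": 0.0001, "genotype": "het"},
    {"gene": "GENE_B", "on_target": True, "pop_freq": 0.20, "genotype": "hom"},
    {"gene": "GENE_C", "on_target": False, "pop_freq": 0.0002, "genotype": "hom"},
    {"gene": "GENE_D", "on_target": True, "pop_freq": 0.0003, "genotype": "hom"},
    {"gene": "GENE_E", "on_target": True, "pop_freq": 0.0001, "genotype": "hom"},
]

# Stand-in for the panel of known Mendelian disease genes.
DISEASE_GENE_PANEL = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}

MAX_POP_FREQ = 0.01  # assumed cutoff for "common"; not taken from the talk

# Assumed gene-level phenotype similarity to the patient profile (0..1).
PHENOTYPE_SCORE = {"GENE_A": 0.4, "GENE_D": 0.9}


def keep(variant):
    """Drop off-target variants, common variants, and genes outside the panel."""
    return (
        variant["on_target"]
        and variant["pop_freq"] <= MAX_POP_FREQ
        and variant["gene"] in DISEASE_GENE_PANEL
    )


def fits_recessive(variant):
    """Simplified recessive filter: homozygous calls only."""
    return variant["genotype"] == "hom"


candidates = [v for v in VARIANTS if keep(v)]
recessive = [v for v in candidates if fits_recessive(v)]

# Rank the surviving candidates by phenotype similarity, best first.
ranked = sorted(
    candidates,
    key=lambda v: PHENOTYPE_SCORE.get(v["gene"], 0.0),
    reverse=True,
)
```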
CONTROL PATIENTS WITH KNOWN
MUTATIONS
Inheritance | Genes | Average rank
AD | ACVR1, ATL1, BRCA1, BRCA2, CHD7 (4), CLCN7, COL1A1, COL2A1, EXT1, FGFR2 (2), FGFR3, GDF5, KCNQ1, MLH1 (2), MLL2/KMT2D, MSH2, MSH6, MYBPC3, NF1 (6), P63, PTCH1, PTH1R (2), PTPN11 (2), SCN1A, SOS1, TRPS1, TSC1, WNT10A | 1.7
AR | ATM, ATP6V0A2, CLCN1 (2), LRP5, PYCR1, SLC39A4 | 5
X | EFNB1, MECP2 (2), DMD, PHF6 | 1.8
52 patients with diagnosed rare diseases
PHENIX HELPED DIAGNOSE 11/40 PATIENTS
global developmental delay (HP:0001263)
delayed speech and language development (HP:0000750)
motor delay (HP:0001270)
proportionate short stature (HP:0003508)
microcephaly (HP:0000252)
feeding difficulties (HP:0011968)
congenital megaloureter (HP:0008676)
cone-shaped epiphysis of the phalanges of the hand (HP:0010230)
sacral dimple (HP:0000960)
hyperpigmented/hypopigmented macules (HP:0007441)
hypertelorism (HP:0000316)
abnormality of the midface (HP:0000309)
flat nose (HP:0000457)
thick lower lip vermilion (HP:0000179)
thick upper lip vermilion (HP:0000215)
full cheeks (HP:0000293)
short neck (HP:0000470)
WHAT ABOUT THE PATIENTS WE CAN’T
SOLVE?
HOW DO WE UNDERSTAND RARE
DISEASE ETIOLOGY AND DISCOVER
TREATMENTS?
B6.Cg-Alms1foz/foz/J
increased weight,
adipose tissue volume,
glucose homeostasis altered
ALMS1 (NM_015120.4)
[c.10775delC] + [-]
GENOTYPE
PHENOTYPE
obesity,
diabetes mellitus,
insulin resistance
increased food intake,
hyperglycemia,
insulin resistance
kcnj11c14/c14; insrt143/+(AB)
MODELS RECAPITULATE VARIOUS
PHENOTYPIC ASPECTS
???
HOW MUCH PHENOTYPE DATA?
Human genes have poor phenotype coverage
GWAS
+
ClinVar
+
OMIM
What else can we leverage? …animal models
Orthology via PANTHER v9
WHY WE NEED ALL THE MODELS
Combined, human and model phenotypes can be linked to
>75% human genes.
Orthology via PANTHER v9
PHENOTYPIC DIVERSITY ACROSS SPECIES
=> May need different models to recapitulate different
aspects of the disease
PROBLEM: CLINICAL AND MODEL
PHENOTYPES ARE DESCRIBED DIFFERENTLY
lung
lung
lobular organ
parenchymatous organ
solid organ
pleural sac
thoracic cavity organ
thoracic cavity
abnormal lung morphology
abnormal respiratory system morphology
Mammalian Phenotype
Mouse Anatomy
FMA
abnormal pulmonary acinus morphology
abnormal pulmonary alveolus morphology
lung alveolus
organ system
respiratory system
lower respiratory tract
alveolar sac
pulmonary acinus
organ system
respiratory system
Human development
lung
lung bud
respiratory primordium
pharyngeal region
PROBLEM: EACH ORGANISM USES
DIFFERENT VOCABULARIES
develops_from
part_of
is_a (SubClassOf)
surrounded_by
SOLUTION: BRIDGING SEMANTICS
Mungall et al. (2012). Genome Biology, 13(1), R5. doi:10.1186/gb-2012-13-1-r5
anatomical structure
endoderm of foregut
lung bud
lung
respiration organ
organ
foregut
alveolus
alveolus of lung
organ part
FMA:lung
MA:lung
endoderm
GO: respiratory gaseous exchange
MA:lung alveolus
FMA:pulmonary alveolus
is_a (taxon equivalent)
develops_from
part_of
is_a (SubClassOf)
capable_of
NCBITaxon: Mammalia
EHDAA:lung bud
only_in_taxon
pulmonary acinus
alveolar sac
lung primordium
swim bladder
respiratory primordium
NCBITaxon:Actinopterygii
Köhler et al. (2014) F1000Research 2:30
Haendel et al. (2014) JBMS 5:21 doi:10.1186/2041-1480-5-21
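The bridging idea can be sketched as a lookup from species-specific anatomy terms to a shared, species-neutral class, after which terms from different vocabularies become comparable. The mappings below follow the labels on the slide, but the dictionaries and the `comparable` helper are a toy built for illustration, not the actual bridging ontology.

```python
# Species-specific anatomy terms mapped to a shared, species-neutral class.
# The labels follow the slide; the table itself is a toy subset.
BRIDGE = {
    "FMA:lung": "lung",
    "MA:lung": "lung",
    "FMA:pulmonary alveolus": "alveolus of lung",
    "MA:lung alveolus": "alveolus of lung",
}

# part_of links between shared classes (toy subset).
PART_OF = {"alveolus of lung": "lung"}


def shared_class(term):
    """Return the species-neutral class for a term, or None if unmapped."""
    return BRIDGE.get(term)


def comparable(term_a, term_b):
    """True if both terms bridge to the same class, or one is part_of the other."""
    a, b = shared_class(term_a), shared_class(term_b)
    if a is None or b is None:
        return False
    return a == b or PART_OF.get(a) == b or PART_OF.get(b) == a


human_vs_mouse = comparable("FMA:lung", "MA:lung")
part_vs_whole = comparable("MA:lung alveolus", "FMA:lung")
unmapped = comparable("FMA:lung", "ZFA:swim bladder")  # not in the toy table
```

Without the bridge, "FMA:lung" and "MA:lung" are just two unrelated strings. With it, a mouse alveolus phenotype can be related to a human lung phenotype through the shared classes.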
=> Web application for model phenotyping and G2P validation
PROBLEM: EACH SPECIES MAKES DIFFERENT
G2P ASSOCIATIONS
INTEGRATED GENOTYPE-2-PHENOTYPE
DATA IN MONARCH
Also in the system: Rat; IMPC; GO annotations; Coriell cell lines; OMIA; MPD; Yeast; CTD; GWAS;
Panther, Homologene orthologs; BioGrid interactions; Drugbank; AutDB; Allen Brain …157 sources
Coming soon: Animal QTLs for pig, cattle, chicken, sheep, trout, dog, horse
Species | Data source | Genes | Genotypes | Variants | Phenotype annotations | Diseases
mouse MGI 13,433 59,087 34,895 271,621
fish ZFIN 7,612 25,588 17,244 81,406
fly Flybase 27,951 91,096 108,348 267,900
worm Wormbase 23,379 15,796 10,944 543,874
human HPOA 112,602 7,401
human OMIM 2,970 4,437 3,651
human ClinVar 19,694 111,294 252,838 4,056
human KEGG 2,509 3,927 1,159
human ORPHANET 3,113 5,690 3,064
human CTD 7,414 23,320 4,912
EXOMISER: DIAGNOSING UDP_930 USING
A PHENOTYPICALLY SIMILAR MOUSE
Chronic acidosis
Neonatal hypoglycemia
Osteopenia
Short stature
decreased circulating
potassium level
Decreased circulating
glucose level
Decreased bone
mineral density
decreased body length
abnormal ion
homeostasis
Decreased
circulating
glucose level
Decreased
bone mineral
density
Short stature
UDP_930/29 phenotypes
Sms tm1a(EUCOMM)Wtsi
Robinson et al. (2013). Genome Res, doi:10.1101/gr.160325.113
EXOMISER: COMBINING PHENOTYPIC
SIMILARITY WITH OTHER DATA
MED21
MAU2
MED8
MED26
Recurrent otitis media
Spasticity
Esotropia
Cerebral palsy
Conductive hearing
impairment
Limitation of joint mobility
Strabismus
Hypertonia
Abnormality of
the middle ear
Abnormal joint
mobility
Strabismus
Abnormality of
central motor
function
UDP_2146/56
phenotypes
Brachmann-de Lange syndrome
NIPBL
MED23
?
CCNC
Contractures of the joints of the
lower limbs
Hypertonicity
CDK8
UDP CASES ANALYZED WITH
EXOMISER
=> Using genotype, phenotype, PPI, and inheritance
together provides the best prioritization
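As a rough illustration of combining these evidence types, the sketch below scores each candidate gene from a variant score, a direct phenotype score, a discounted phenotype score borrowed from protein-protein interaction neighbors, and an inheritance check. The formula, the discount, and all numbers are assumptions chosen for the example; this is not Exomiser's actual scoring.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    gene: str
    variant_score: float             # pathogenicity/frequency evidence, 0..1
    phenotype_score: float           # direct phenotypic similarity, 0..1
    neighbor_phenotype_score: float  # best score among PPI neighbors, 0..1
    fits_inheritance: bool


PPI_DISCOUNT = 0.5  # assumed penalty for indirect (neighbor) evidence


def combined_score(candidate):
    """Average variant evidence with the best available phenotype evidence."""
    if not candidate.fits_inheritance:
        return 0.0
    phenotype_evidence = max(
        candidate.phenotype_score,
        PPI_DISCOUNT * candidate.neighbor_phenotype_score,
    )
    return (candidate.variant_score + phenotype_evidence) / 2


# Hypothetical candidates with made-up scores.
candidates = [
    Candidate("GENE_1", 0.9, 0.1, 0.0, True),
    Candidate("GENE_2", 0.8, 0.0, 0.9, True),    # rescued by a PPI neighbor
    Candidate("GENE_3", 0.95, 0.9, 0.0, False),  # fails the inheritance model
]

ranked = [c.gene for c in sorted(candidates, key=combined_score, reverse=True)]
```

GENE_2 has no phenotype data of its own but rises to the top because an interacting partner matches the patient's phenotypes, which is the situation the MED23 example on the earlier slide depicts.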
ANALYSIS OF UNSOLVED UDP CASES
4 families now have a diagnosis, including one novel
disease-gene association: York platelet
syndrome and STIM1
Strong candidates identified for 19 families that are
now undergoing functional validation through mouse
and zebrafish modeling
Several hundred UDP cases now being analyzed
using Exomiser and cross-species phenotype data
HOW DOES THE CLINICIAN KNOW THEY’VE
PROVIDED ENOUGH PHENOTYPING?
How many annotations…?
How many different categories?
How many within each?
Image credit: Viljoen and Beighton, J Med Genet. 1992
Schwartz-Jampel Syndrome, Type I
Caused by mutation in HSPG2, which encodes a proteoglycan
~100 phenotype annotations
EVALUATION METHOD
Create a variety of “derived” diseases
More general (depth)
Remove subset(s) (breadth)
Introduce noise
Assess the change in similarity between the derived
disease and its parent.
Ask questions:
Is the derived disease considered similar to original?
…or more similar to a different disease?
Is it distinguishable beyond random?
Are there any specific factors that influence similarity?
FINDING THE PHENOTYPE GRAPH IN
COMMON
The most specific phenotypic profile in common
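A common way to find the most specific term two phenotypes share is to pick their most informative common ancestor, where rarer terms carry more information. The sketch below assumes that approach. The hierarchy reuses term labels from the HPO slide, but the parent links, the four diseases D1 to D4, and their annotations are toy data, and the code assumes every term has at least one annotated disease.

```python
import math

# Toy is_a hierarchy (child -> parents); illustrative, not HPO's full structure.
PARENTS = {
    "Insulinoma": ["Pancreatic islet cell adenoma"],
    "Multiple pancreatic beta-cell adenomas": ["Pancreatic islet cell adenoma"],
    "Pancreatic islet cell adenoma": ["Abnormality of pancreatic islet cells"],
    "Reduced pancreatic beta cells": ["Abnormality of pancreatic islet cells"],
    "Abnormality of pancreatic islet cells": ["Phenotypic abnormality"],
    "Phenotypic abnormality": [],
}

# Hypothetical disease annotations used to estimate term frequency.
ANNOTATIONS = {
    "D1": ["Insulinoma"],
    "D2": ["Multiple pancreatic beta-cell adenomas"],
    "D3": ["Reduced pancreatic beta cells"],
    "D4": ["Phenotypic abnormality"],
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    found = {term}
    for parent in PARENTS[term]:
        found |= ancestors(parent)
    return found


def information_content(term):
    """-log of the fraction of diseases annotated to the term or a descendant."""
    count = sum(
        any(term in ancestors(t) for t in terms)
        for terms in ANNOTATIONS.values()
    )
    return -math.log(count / len(ANNOTATIONS))


def mica(term_a, term_b):
    """Most informative common ancestor of two terms."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(common, key=information_content)


shared = mica("Insulinoma", "Multiple pancreatic beta-cell adenomas")
```

Here the two adenoma terms meet at "Pancreatic islet cell adenoma" rather than at the uninformative root, which every disease shares.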
METHOD: DERIVE BY CATEGORY
REMOVAL
Remove annotations that are subclasses of a
single high-level node, to test the influence of a single
phenotypic category
Repeat for each 1° subclass
Example: Schwartz-Jampel Syndrome, Type I derivations
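The category-removal derivation can be sketched as follows: for each top-level category present in a disease profile, build a derived profile with that category's annotations removed. The hierarchy and the profile below are toy data, with each term given a single parent for simplicity; they are not the real annotations of any disease.

```python
ROOT = "Phenotypic abnormality"

# Toy single-parent hierarchy (child -> parent); illustrative only.
PARENT = {
    "Abnormality of the skeletal system": ROOT,
    "Abnormality of the musculature": ROOT,
    "Abnormality of the eye": ROOT,
    "Osteopenia": "Abnormality of the skeletal system",
    "Limitation of joint mobility": "Abnormality of the skeletal system",
    "Hypertonia": "Abnormality of the musculature",
    "Strabismus": "Abnormality of the eye",
}


def category_of(term):
    """Walk up to the direct subclass of the root that contains the term."""
    while PARENT[term] != ROOT:
        term = PARENT[term]
    return term


def derive_by_category_removal(profile):
    """Map each category in the profile to the profile without that category."""
    categories = sorted({category_of(term) for term in profile})
    return {
        category: [t for t in profile if category_of(t) != category]
        for category in categories
    }


# Hypothetical disease profile.
profile = ["Osteopenia", "Limitation of joint mobility", "Hypertonia", "Strabismus"]
derived = derive_by_category_removal(profile)
```

Each derived profile would then be compared back to the original and to every other disease to see whether the original is still the best match.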
SEMANTIC SIMILARITY ALGORITHMS ARE ROBUST
IN THE FACE OF MISSING INFORMATION
On average, 92% of derived diseases are most similar to
the original disease
Severity of impact follows proportion of
phenotype
[Plots: Similarity of Derived Disease to Original; Derived Disease Profile Rank]
METHOD: DERIVE BY LIFTING
Iteratively map each class to its direct
superclass(es)
Keep only leaf nodes
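A minimal sketch of the lifting derivation, assuming "keep only leaf nodes" means dropping any lifted term that is an ancestor of another term in the same profile. The hierarchy and the starting profile are toy data chosen for illustration.

```python
# Toy is_a hierarchy (child -> parents); illustrative only.
PARENTS = {
    "Insulinoma": ["Pancreatic islet cell adenoma"],
    "Pancreatic islet cell adenoma": ["Abnormality of pancreatic islet cells"],
    "Abnormality of pancreatic islet cells": ["Phenotypic abnormality"],
    "Hypotonia": ["Abnormality of the musculature"],
    "Abnormality of the musculature": ["Phenotypic abnormality"],
    "Phenotypic abnormality": [],
}


def ancestors(term):
    """Return all strict ancestors of a term."""
    found = set()
    for parent in PARENTS[term]:
        found.add(parent)
        found |= ancestors(parent)
    return found


def lift_once(profile):
    """Replace each term by its direct superclass(es), then keep only leaves."""
    lifted = set()
    for term in profile:
        lifted.update(PARENTS[term] or [term])  # the root has nowhere to go
    redundant = set()
    for term in lifted:
        redundant |= ancestors(term)
    return lifted - redundant


def derive_by_lifting(profile):
    """Return the successive, increasingly general profiles."""
    derived = []
    current = set(profile)
    while True:
        lifted = lift_once(current)
        if lifted == current:
            return derived
        derived.append(lifted)
        current = lifted


levels = derive_by_lifting({"Insulinoma", "Hypotonia"})
```

Each entry in `levels` is one step more general than the last, ending at the root, so the similarity of each level to the original profile shows how quickly matches degrade as specificity is lost.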
SEMANTIC SIMILARITY ALGORITHMS ARE
SENSITIVE TO SPECIFICITY OF INFORMATION
Severity of impact increases with more-general
phenotypes
[Plots: Similarity of Derived Disease to Original; Derived Disease Profile Rank]
ANNOTATION SUFFICIENCY SCORE
http://www.phenotips.org
http://www.monarchinitiative.org
CONSIDERING TIME
PATIENT 1
Lower back pain
Motor weakness
Unpleasant muscle twitching
40 yrs old
PATIENT 2
Unpleasant muscle twitching
65 yrs old
Stumbling
Leg weakness
PATIENT 1
Diagnosis: Degenerative disc disease with L3 nerve root
radiculopathy causing muscle weakness.
More recent onset of benign fasciculation syndrome, a
non-progressive disease.
PATIENT 2
Diagnosis: Amyotrophic lateral sclerosis
(Lou Gehrig disease)
ADDING CHRONOLOGY TO THE
ALGORITHM
ADDING EXPOSURE TO THE ALGORITHM
Patient 1
Disease/condition
Drug/chemical
ADDING NEGATION TO THE ALGORITHM
Patient
Disease X
CONCLUSIONS
Phenotypic data can be represented using ontologies to support improved comparisons within and across species.
For known disease-gene associations, comparison to human phenotype data is effective at variant prioritization.
For unknown disease-gene associations, expanding phenotypic coverage with model organisms greatly improves variant prioritization.
Phenotype breadth is recommended to buffer missing information, and very specific phenotypes are also necessary to ensure quality matches.
FUTURE WORK
Add additional variables to semantic similarity algorithm – e.g. negation, environment, chronology
Validate existing animal models for recapitulation of disease
Further characterization of organism-specific phenotypes
Adding many more non-model organisms to the analysis
ACKNOWLEDGMENTS
NIH-UDP
William Bone
Murat Sincan
David Adams
Amanda Links
Joie Davis
Neal Boerkoel
Cyndi Tifft
Bill Gahl
OHSU
Nicole Vasilevsky
Matt Brush
Bryan Laraway
Shahim Essaid
Kent Shefchek
Garvan
Tudor Groza
Lawrence Berkeley
Nicole Washington
Suzanna Lewis
Chris Mungall
UCSD
Jeff Grethe
Chris Condit
Anita Bandrowski
Maryann Martone
U of Pitt
Chuck Borromeo
Vincent Agresti
Becky Boes
Harry Hochheiser
Sanger
Anika Oellrich
Jules Jacobsen
Damian Smedley
Toronto
Marta Girdea
Sergiu Dumitriu
Heather Trang
Bailey Gallinger
Orion Buske
Mike Brudno
JAX
Cynthia Smith
Charité
Sebastian Köhler
Sandra Doelken
Sebastian Bauer
Peter Robinson
Funding:
NIH Office of Director: 1R24OD011883
NIH-UDP: HHSN268201300036C, HHSN268201400093P