The Cure: A Game with the Purpose of Gene Selection for Breast Cancer Survival Prediction
Benjamin Good*, Salvatore Loguercio, Max Nanis, Andrew Su
The Scripps Research Institute
http://genegames.org/cure/
Rocky 2013
THE CURE: A GAME WITH THE PURPOSE OF GENE SELECTION FOR BREAST CANCER SURVIVAL PREDICTION
A QUESTION
How would you get 150 PhD level scientists to work together on the same problem?
Without any money?
TRAIL MAP
Games Survival Prediction
The Cure
WHY GAMES?
It is estimated that 9 billion hours are spent playing Solitaire every year
Luis von Ahn, Google Tech Talk: Human Computation, 2006 (shortly after receiving a $500,000 ‘Genius Grant’ for this work)
Seven million hours of human labor
Empire State Building
ONE YEAR SOLITAIRE = 1,285 EMPIRE STATE BUILDINGS
McGonigal J. Reality is broken : why games make us better and how they can change the world. New York: Penguin Press; 2011.
What if we could use a tiny fraction of that human effort to achieve another purpose?
[Bar chart: hours of human effort. Empire State Building: 7M; one year of Solitaire: 9B; one year of games: 150B]
150 billion hours gaming each year
PURPOSES
Label all images on the Web
Find objects inside images
Teach computers English
Tag songs
Rate image quality
Computer science
Build ontologies
Tag Malaria parasites in blood smears
Map connections between neurons
Align DNA and protein sequences
Assemble genomes
Design RNA molecules
Figure out how proteins fold
Biology
Link genes with diseases
Develop better treatments for breast cancer
GAMES WITH A PURPOSE
The Cure
MOLT
TRAIL MAP
Games Survival Prediction
The Cure
INFERRING SURVIVAL PREDICTORS
[Figure: samples labeled by 10 year survival (Yes/No); find patterns, then make predictions on new samples]
van't Veer, Laura J., et al. "Gene expression profiling predicts clinical outcome of breast cancer." Nature 415.6871 (2002): 530-536.
INFERRING SURVIVAL PREDICTORS
1) select genes
2) infer predictor from data (e.g. decision tree, SVM, etc.)
Out of the 25,000+ genes, which small set works together the best?
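The two steps above can be sketched on toy data. This is a minimal illustration, not the method used in the talk: the data are synthetic, every name is hypothetical, and a nearest-centroid classifier is swapped in for the decision tree or SVM to keep the example dependency-free.

```python
import random
from itertools import combinations

def make_data(n_samples=60, n_genes=8, informative=(1, 5), seed=0):
    """Synthetic expression matrix; only the `informative` genes track the label."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        label = rng.randint(0, 1)            # 1 = survived 10 years (toy label)
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        for g in informative:
            row[g] += 2.0 * label            # shift informative genes by class
        X.append(row)
        y.append(label)
    return X, y

def centroid(rows, genes):
    """Mean expression of each selected gene over the given rows."""
    return [sum(r[g] for r in rows) / len(rows) for g in genes]

def fit_predict(X_train, y_train, X_test, genes):
    """Step 2: infer a nearest-centroid predictor using only `genes`."""
    c0 = centroid([r for r, l in zip(X_train, y_train) if l == 0], genes)
    c1 = centroid([r for r, l in zip(X_train, y_train) if l == 1], genes)
    preds = []
    for r in X_test:
        d0 = sum((r[g] - c) ** 2 for g, c in zip(genes, c0))
        d1 = sum((r[g] - c) ** 2 for g, c in zip(genes, c1))
        preds.append(1 if d1 < d0 else 0)
    return preds

def accuracy(preds, y):
    return sum(p == t for p, t in zip(preds, y)) / len(y)

def best_gene_set(X, y, size=2):
    """Step 1: pick the gene subset with the best held-out accuracy."""
    half = len(X) // 2
    Xtr, ytr, Xte, yte = X[:half], y[:half], X[half:], y[half:]
    scored = [
        (accuracy(fit_predict(Xtr, ytr, Xte, genes), yte), genes)
        for genes in combinations(range(len(X[0])), size)
    ]
    return max(scored)

X, y = make_data()
best_score, best_genes = best_gene_set(X, y)
print(best_genes, round(best_score, 2))
```

Exhaustive search is only feasible here because the toy problem has 8 genes. With 25,000+ genes the number of subsets is far too large to enumerate, which is why gene selection needs heuristics or prior knowledge.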
PROBLEM: GENE SELECTION INSTABILITY
instability: different methods, different datasets produce different gene sets for the same phenotype [1]
[1] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine 5.10 (2013).
PROBLEM: THE VALIDATION GAP
training data, test data
validation
validation: predictive signatures often perform worse on independent data created for validation.
Photograph by Richard Hallman, National Geographic Adventure Blog
ADDING PRIOR KNOWLEDGE TO THE DISCOVERY ALGORITHM
[Figure: find patterns, make predictions; classes: <10 yr survival, >10 yr survival]
EX.) NETWORK GUIDED FORESTS
Use network to find good gene combinations
Dutkowski & Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology
BUT MOST KNOWLEDGE IS NOT STRUCTURED
[Chart: number of articles added to PubMed per year, 2000 to 2012; y-axis 500,000 to 1,000,000]
112 publications/hour (37 more by the end of this talk)
>160,000 publications linked to “breast cancer” since 2000 http://tinyurl.com/brsince2000
HOW CAN WE USE UNSTRUCTURED KNOWLEDGE FOR GENE SELECTION?
Need an intelligent system that is good at reading and hypothesizing
Like you
TRAIL MAP
Games Survival Prediction
The Cure
THE CURE: http://genegames.org/cure/
education level?
cancer knowledge?
biologist?
PLAY = GENE SELECTION
Alternate turns picking a gene from a “board” of 25
Your hand
Opponent's hand
SCORING
Cure Server
Score reflects accuracy of decision tree created with just the selected genes on real training data
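A minimal sketch of this kind of scoring, assuming a small greedy threshold tree. The slides do not describe the Cure server's actual tree algorithm, and the gene names, values, and labels below are hypothetical.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def best_split(rows, labels, genes):
    """Return (errors, gene, threshold) for the best single threshold split."""
    best = None
    for g in genes:
        for t in sorted({r[g] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[g] <= t]
            right = [l for r, l in zip(rows, labels) if r[g] > t]
            if not left or not right:
                continue
            # Misclassified samples if each side predicts its majority class.
            errors = (len(left) - Counter(left).most_common(1)[0][1]
                      + len(right) - Counter(right).most_common(1)[0][1])
            if best is None or errors < best[0]:
                best = (errors, g, t)
    return best

def build_tree(rows, labels, genes, depth=2):
    """Grow a tree that may only split on the genes in the player's hand."""
    can_split = depth > 0 and len(set(labels)) > 1
    split = best_split(rows, labels, genes) if can_split else None
    if split is None:
        return majority(labels)              # leaf: predict the majority class
    _, g, t = split
    lo = [(r, l) for r, l in zip(rows, labels) if r[g] <= t]
    hi = [(r, l) for r, l in zip(rows, labels) if r[g] > t]
    return (g, t,
            build_tree([r for r, _ in lo], [l for _, l in lo], genes, depth - 1),
            build_tree([r for r, _ in hi], [l for _, l in hi], genes, depth - 1))

def predict(tree, row):
    while isinstance(tree, tuple):           # internal nodes are tuples
        g, t, left, right = tree
        tree = left if row[g] <= t else right
    return tree

def score_hand(rows, labels, hand, depth=2):
    """Training-set accuracy of a tree restricted to the genes in `hand`."""
    tree = build_tree(rows, labels, hand, depth)
    return sum(predict(tree, r) == l for r, l in zip(rows, labels)) / len(labels)

# Hypothetical training data: expression per gene, label = 10 year survival.
rows = [
    {"GENE_A": 5.1, "GENE_B": 1.0}, {"GENE_A": 4.8, "GENE_B": 2.9},
    {"GENE_A": 5.5, "GENE_B": 0.7}, {"GENE_A": 1.2, "GENE_B": 1.1},
    {"GENE_A": 0.9, "GENE_B": 3.0}, {"GENE_A": 1.6, "GENE_B": 0.8},
]
labels = ["yes", "yes", "yes", "no", "no", "no"]
print(score_hand(rows, labels, ["GENE_A"]), score_hand(rows, labels, ["GENE_B"], depth=1))
```

Because the score is accuracy on the training data, a hand of uninformative genes can still score well by overfitting, especially with deeper trees.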
PLAY WITH KNOWLEDGE: GENE ONTOLOGY
PLAY WITH KNOWLEDGE: GENE RIFS
YOU WIN!
COMMUNITY BOARD VIEW, CHOOSE OPEN BOARD
You beat this one
The community finished this board (e.g. 11 different players completed it)
This board is still open
BOARDS
• 25 genes each
• randomly selected from 1,250 genes that passed an unsupervised filter for minimum expression level and variance for a particular dataset [1],[2]
• 4 different 100-board rounds completed, each with some overlap
• 3,731 distinct genes used in the game
[1] Curtis, Christina, et al. "The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups." Nature (2012)
[2] Griffith, Obi L., et al. "A robust prognostic signature for hormone-positive node-negative breast cancer." Genome Medicine (2013)
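Board construction as described above could look roughly like the following sketch. The cutoffs, the expression table, and the independent-sampling scheme are all assumptions for illustration. The slides give only the board size, the number of boards per round, and that the filter was unsupervised.

```python
import random
from statistics import mean, pvariance

def filter_genes(expression, min_mean, min_var):
    """Unsupervised filter: keep genes whose mean and variance clear the
    cutoffs. Survival labels are never consulted."""
    return sorted(
        g for g, values in expression.items()
        if mean(values) >= min_mean and pvariance(values) >= min_var
    )

def make_boards(genes, n_boards, board_size=25, seed=0):
    """Draw each board independently, so a gene may appear on several boards."""
    rng = random.Random(seed)
    return [rng.sample(genes, board_size) for _ in range(n_boards)]

# Hypothetical expression table: gene -> values across 20 samples.
data_rng = random.Random(1)
expression = {
    f"GENE_{i}": [data_rng.gauss(i % 10, 0.2 + (i % 5)) for _ in range(20)]
    for i in range(200)
}

kept = filter_genes(expression, min_mean=3.0, min_var=1.0)
boards = make_boards(kept, n_boards=100)
distinct = {g for board in boards for g in board}
print(len(kept), len(boards), len(distinct))
```

Sampling boards independently is one way to get the overlap mentioned above: the same gene can be judged in several different contexts.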
PLAYERS
[Chart: new player registrations per month, Sep-12 to Aug-13, by degree (Other, Did not state, none, BA, MSc, MD, PhD); y-axis 0 to 250]
[Chart: %PhD among new registrations per month, Sep-12 to Aug-13; y-axis 0 to 0.4]
http://io9.com/these-cool-games-let-you-do-real-life-science-486173006
1,077 Players registered (one year)
Sage DREAM7 challenge, game announcement
PLAYER DEMOGRAPHICS
[Bar chart: Cancer knowledge? (no, ns, yes); y-axis 0 to 700]
[Bar chart: Are you a biologist? (no, ns, yes); y-axis 0 to 800]
[Bar chart: Most recent degree (bachelors, masters, md, none, ns, other, phd), grouped as graduate_degree, undergraduate, none; y-axis 0 to 350]
GAMES PLAYED
• 9,904 games (non training)
[Chart: total games played per player; x-axis: player, 0 to 800; y-axis: total games played, log scale, 1 to 1000]
[Chart: games played, top 20 players; y-axis 0 to 800; players labeled by degree (PhD, MD, MS)]
GENE RANKINGS FROM GAMES
[Figure: find patterns, make predictions; classes: <10 yr survival, >10 yr survival]
• For each gene:
1. O = number of times it appeared in a game (some genes occur on multiple boards, all boards are played multiple times, all occurrences are counted)
2. S = number of times it was selected by a player
3. F = S/O
• Games can be filtered based on player data
• We can estimate an empirical P value for each pair of values (O, S)
• P reflects the probability of a gene being selected S or more times by chance, given O appearances
Examples (all games):
• B-cell lymphoma 2 gene:
O = 13, S = 10, F = 10/13 = 0.77, P < 0.0001
• Alanine and arginine rich domain containing protein:
O = 33, S = 3, F = 3/33 = 0.09, P = 0.91
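The ranking statistics can be sketched as follows. The null model is an assumption: the slides do not say how the empirical P value was simulated, so this sketch treats each appearance as picked independently with a fixed chance rate `q`, and the value `Q_NULL = 0.2` is illustrative only.

```python
import random

def frequency(O, S):
    """F = S / O: fraction of a gene's appearances in which it was picked."""
    return S / O

def empirical_p(O, S, q, trials=20000, seed=0):
    """Estimate P(picked >= S times | O appearances) under a null model in
    which each appearance is picked independently with probability q."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        picks = sum(rng.random() < q for _ in range(O))
        if picks >= S:
            hits += 1
    return (hits + 1) / (trials + 1)   # add-one so the estimate is never 0

Q_NULL = 0.2  # assumed chance selection rate; not stated in the slides
for name, O, S in [("BCL2", 13, 10), ("AARD", 33, 3)]:
    print(name, round(frequency(O, S), 2), empirical_p(O, S, Q_NULL))
```

With a simulated null, the smallest P that can be reported is bounded by the number of trials, so thresholds such as P < 0.0001 need correspondingly many simulations.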
GENES SELECTED BY ALL PLAYERS
9,904 GAMES, P<0.001, 60 GENES
[Table: top 10 enriched disease annotations, with n genes each; adj. P < 2.43e-06; background = 3,731 genes used in any game]
[Table: top 10 genes]
Wang, Jing, et al. "WEB-based GEne SeT AnaLysis Toolkit (WebGestalt): update 2013." Nucleic acids research (2013).
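Over-representation of an annotation in a gene set, relative to a background, is commonly tested with a hypergeometric tail probability. The sketch below shows that calculation only. It is not necessarily the exact procedure behind the adjusted P values on the slide, it omits multiple-testing adjustment, and the `ANNOTATED` and `OVERLAP` counts are hypothetical.

```python
from math import comb

def hypergeom_tail(N, K, n, k):
    """P(X >= k) when drawing n genes without replacement from a background
    of N genes, K of which carry the annotation."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

BACKGROUND = 3731   # genes used in any game (from the slide)
SELECTED = 60       # genes with P < 0.001 across all players (from the slide)
ANNOTATED = 150     # hypothetical: background genes tagged with a disease term
OVERLAP = 12        # hypothetical: selected genes carrying that term

p = hypergeom_tail(BACKGROUND, ANNOTATED, SELECTED, OVERLAP)
print(f"{p:.3g}")
```

Using the 3,731 game genes as the background, rather than the whole genome, matters: it asks whether players favored annotated genes among the genes they were actually shown.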
GENES SELECTED BY PEOPLE WITH PHDS AND KNOWLEDGE OF CANCER
2,373 GAMES, P<0.001, 82 GENES
[Table: top 10 genes]
[Table: top 10 enriched disease annotations, with n genes each; adj. P < 5.76e-08]
“Expert Gene Set”
GENES SELECTED BY PEOPLE: WITHOUT PHDS, WITH NO KNOWLEDGE OF CANCER, THAT ARE NOT BIOLOGISTS
3,607 GAMES, P<0.001, 10 GENES
• Gene set not significantly enriched with any disease annotations
Top 10 genes
SELF REPORTING SEEMED TO WORK...
EVEN WITHOUT FILTERING, THE DATA CONTAINS THE KNOWLEDGE
• “All Players” still contained significant cancer signal.
PROBLEM: GENE SELECTION INSTABILITY
GENE SET OVERLAPS, SOME BUT NOT MUCH
http://bioinformatics.psb.ugent.be/webtools/Venn/
“Expert Gene Set”
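Overlap between gene sets can be quantified pairwise, for example with the Jaccard index. The sets below are hypothetical placeholders, since the slides do not list the actual signatures.

```python
from itertools import combinations

def jaccard(a, b):
    """Shared genes divided by genes in either set (0 = disjoint, 1 = identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical gene sets; the real signatures are not listed in the slides.
gene_sets = {
    "expert_game_set": {"G1", "G2", "G3", "G4", "G5"},
    "published_signature_a": {"G4", "G5", "G6", "G7"},
    "published_signature_b": {"G7", "G8", "G9"},
}

overlaps = {
    (x, y): (sorted(gene_sets[x] & gene_sets[y]), jaccard(gene_sets[x], gene_sets[y]))
    for x, y in combinations(gene_sets, 2)
}
for pair, (shared, score) in overlaps.items():
    print(pair, shared, round(score, 2))
```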
PROBLEM: THE VALIDATION GAP
CLASSIFIER PERFORMANCE WITH DIFFERENT GENE GROUPS, DIFFERENT DATASETS
X-axis: test set performance, Griffith 2013 data
Y-axis: test set performance, Metabric training, Oslo test
The only difference between points is the set of genes used to build the SVM classifier
10 year survival: Yes / No
“Expert Gene Set”
SUMMARY
Pluses
• 1 year
• 1,000 players, 150 PhDs
• 10,000 games
• “expert knowledge” captured through an open game
• New gene ranking method with results competitive with established approaches
• Game is now in use in an undergraduate class
Minuses
• Did not make a significantly better breast cancer survival predictor
• Game could have been better in many ways
• no beginning, middle or end
• random guessing can win
• easy to cheat
NEXT STEPS
• More fun
• More learning for novices
• More control for experts
• More data
THE END
More information at: http://genegames.org/cure/, [email protected], @bgood
Thanks to:
Players!!!!
Andrew Su
Salvatore Loguercio
Max Nanis
Karthik Gangavarapu
We are hiring! Looking for postdocs, programmers interested in crowdsourcing and bioinformatics. Contact: [email protected]
GAMES WITH A PURPOSE
The Cure
MOLT
Loguercio, Salvatore, et al. "Dizeez: an online game for human gene-disease annotation." PloS One (2013)
Khatib, Firas, et al. "Algorithm discovery by protein folding game players." Proceedings of the National Academy of Sciences (2011)
of collecting expert level knowledge
HUMAN GUIDED FOREST (HGF)
http://i9606.blogspot.com/2012/04/human-guided-forests-hgf.html
Let CURE players build decision modules
WHY DID YOU SIGN UP? (83 RESPONSES)
[Bar chart: Why did you sign up for The Cure? (select all that apply). Options: To help breast cancer research; To learn something; To have fun playing a game; y-axis 0% to 90%]
WAS THE GAME FUN?
[Bar chart: Yes, it was very fun; A little bit entertaining; No, not at all; y-axis: percent, 0 to 0.8]
DO YOU KNOW ANYONE THAT HAS OR HAD BREAST CANCER?
[Chart: Have you known or do you currently know anyone that has or has had breast cancer? Yes / No]
DID YOU LEARN ANYTHING FROM PLAYING?
[Bar chart: Yes, I felt like I learned a lot; Yes, I learned a little bit; No, I did not learn anything; y-axis 0 to 60]
MY KNOWLEDGE OF BREAST CANCER IS:
[Bar chart, y-axis 0 to 0.6. Response options:]
• I am an expert in breast cancer
• I have helped conduct cancer research as part of my job
• I know some biology and have some understanding of what cancer is
• I know a little biology, but nothing specific to cancer
• Nothing, I do not know a thing about it
AGE?
Which category below includes your age?
[Chart: 17 or younger; 18-20; 21-29; 30-39; 40-49; 50-59; 60 and above]
GENDER?
What is your gender?
[Chart: Female; Male]
TRAINING LEVELS
the decision tree created using the feature “makes milk” is 100% correct on training data, you win!
TRAINING INTERFACE
Choose the feature that best distinguishes mammals from other creatures
TRAINING INTERFACE
the decision tree created using the feature “has hair” is 94% correct on training data, you win!
OVERLAP OF SIGNIFICANT GENE SETS FROM DIFFERENT CURE GAME FILTERS
No Expertise (3,607 games)
PhD & Cancer Knowledge (2,373 games)
Biologist (4,913 games)
PhD or MD (3,070 games)
Cancer Knowledge (4,660 games)
MOST RANDOM GENE EXPRESSION SIGNATURES ARE SIGNIFICANTLY ASSOCIATED WITH BREAST CANCER OUTCOME
Venet et al.(2011). PLoS Comp. Bio.
• Still need to pick gene sets
• Feature selection challenge still relevant
• Very useful grain of salt in interpreting these results
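One way to apply that grain of salt is to compare a candidate gene set against random gene sets of the same size. The sketch below is a toy: `toy_score` stands in for real classifier performance, and the gene weights are randomly generated, hypothetical values.

```python
import random

def random_set_p(score_fn, candidate, pool, n_random=1000, seed=0):
    """Fraction of random gene sets (same size as `candidate`) that score at
    least as well as the candidate, with an add-one correction."""
    rng = random.Random(seed)
    target = score_fn(candidate)
    as_good = sum(
        score_fn(rng.sample(pool, len(candidate))) >= target
        for _ in range(n_random)
    )
    return (as_good + 1) / (n_random + 1)

# Toy stand-in for classifier performance: each gene has a hypothetical
# "association with outcome" weight, and a set scores its mean weight.
pool = [f"GENE_{i}" for i in range(500)]
weight_rng = random.Random(42)
weights = {g: weight_rng.random() for g in pool}

def toy_score(genes):
    return sum(weights[g] for g in genes) / len(genes)

strong = sorted(pool, key=weights.get, reverse=True)[:20]   # top-weighted genes
typical = pool[:20]                                         # arbitrary genes
print(random_set_p(toy_score, strong, pool), random_set_p(toy_score, typical, pool))
```

A signature is only interesting if it beats this random baseline, not merely if it is associated with outcome.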