Welcoming to incoming bioinformatics students at UCSF

Biological & Medical Informatics : the beginning Daniel Himmelstein September 24, 2014 Hand Drawn Map of SF by Jenni Sparks Before the Money Came Bettye LaVette

Transcript of Welcoming to incoming bioinformatics students at UCSF

Page 1: Welcoming to incoming bioinformatics students at UCSF

Biological & Medical Informatics: the beginning

Daniel Himmelstein, September 24, 2014

Hand Drawn Map of SF by Jenni Sparks

Before the Money Came, Bettye LaVette

Page 2: Welcoming to incoming bioinformatics students at UCSF

challenge

Hand Drawn Map of SF by Jenni Sparks

Page 3: Welcoming to incoming bioinformatics students at UCSF

review article

The New England Journal of Medicine

N Engl J Med 369;5, nejm.org, August 1, 2013, p. 448

Global Health

Measuring the Global Burden of Disease

Christopher J.L. Murray, M.D., D.Phil., and Alan D. Lopez, Ph.D.

From the Institute for Health Metrics and Evaluation, University of Washington, Seattle (C.J.L.M.); and the University of Melbourne, School of Population and Global Health, Carlton, VIC, Australia (A.D.L.). Address reprint requests to Dr. Murray at the Institute for Health Metrics and Evaluation, 2301 Fifth Ave., Suite 600, Seattle, WA 98121, or at [email protected].

N Engl J Med 2013;369:448-57. DOI: 10.1056/NEJMra1201534. Copyright © 2013 Massachusetts Medical Society.

It is difficult to deliver effective and high-quality care to patients without knowing their diagnoses; likewise, for health systems to be effective, it is necessary to understand the key challenges in efforts to improve population health and how these challenges are changing. Before the early 1990s, there was no comprehensive and internally consistent source of information on the global burden of diseases, injuries, and risk factors. To close this gap, the World Bank and the World Health Organization launched the Global Burden of Disease (GBD) Study in 1991.1 Although assessments of selected diseases, injuries, and risk factors in selected populations are published each year (e.g., the annual assessments of the human immunodeficiency virus [HIV] epidemic2), the only comprehensive assessments of the state of health in the world have been the various revisions of the GBD Study for 1990, 1999–2002, and 2004.1,3-10 The advantage of the GBD approach is that consistent methods are applied to critically appraise available information on each condition, make this information comparable and systematic, estimate results from countries with incomplete data, and report on the burden of disease with the use of standardized metrics.

The most recent assessment of the global burden of disease is the 2010 study (GBD 2010), which provides results for 1990, 2005, and 2010. Several hundred investigators collaborated to report summary results for the world and 21 epidemiologic regions in December 2012.11-18 Regions based on levels of adult mortality, child mortality, and geographic contiguity were defined. GBD 2010 addressed a number of major limitations of previous analyses, including the need to strengthen the statistical methods used for estimation.11 The list of causes of the disease burden was broadened to cover 291 diseases and injuries. Data on 1160 sequelae of these causes (e.g., diabetic retinopathy, diabetic neuropathy, amputations due to diabetes, and chronic kidney disease due to diabetes) have been evaluated separately. The mortality and burden attributable to 67 risk factors or clusters of risk factors were also assessed.

GBD 2010, which provides critical information for guiding prevention efforts, was based on data from 187 countries for the period from 1990 through 2010. It includes a complete reassessment of the burden of disease for 1990 as well as an estimation for 2005 and 2010 based on the same definitions and methods; this facilitated meaningful comparisons of trends. The prevalence of coexisting conditions was also estimated according to the year, age, sex, and country. Detailed results from global and regional data have been published previously.11-18

The internal validity of the results is an important aspect of the GBD approach. For example, demographic data on all-cause mortality according to the year, country, age, and sex were combined with data on cause-specific mortality to ensure that the sum of the number of deaths due to each disease and injury equaled the number of deaths from all causes. Similar internal-validity checks were used for


Global Burden of Disease (2010)

Disease                   Years Lost (million)
ischemic heart disease    129.8
HIV-AIDS                  81.5
Respiratory Cancers       46.9

disability-adjusted life year (DALY) is a measure of overall disease burden, expressed as the number of years lost due to ill-health, disability or early death

Murray et al. NEJM. 2013. DOI: 10.1056/NEJMra1201534
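As a rough illustration of the DALY definition, the sketch below uses the standard decomposition of a DALY into years of life lost to early death plus years lived with disability. The function name and the example inputs are hypothetical; only the three disease totals come from the slide.

```python
def daly(years_of_life_lost, years_lived_with_disability):
    """DALYs as the sum of years lost to early death and years lived with disability."""
    return years_of_life_lost + years_lived_with_disability

# Hypothetical inputs, for illustration only (not from Murray et al.).
example = daly(years_of_life_lost=3.0, years_lived_with_disability=1.5)

# Disease burden from the slide, in millions of years lost (GBD 2010).
burden_millions = {
    "ischemic heart disease": 129.8,
    "HIV-AIDS": 81.5,
    "Respiratory Cancers": 46.9,
}
total_millions = sum(burden_millions.values())

print(f"example DALYs: {example}")
print(f"combined burden of the three listed diseases: {total_millions:.1f} million years")
```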

100 Million Pennies

http://www.kokogiak.com/megapenny

Page 4: Welcoming to incoming bioinformatics students at UCSF

US Life Expectancy

Gregg Easterbrook (September 17, 2014) What Happens When We All Live to 100? The Atlantic

Calico: 500 Million USD

Page 5: Welcoming to incoming bioinformatics students at UCSF

Increasing R&D Spending per New Drug Approval

[Figure, two panels, x-axis: Year (1950 to 2010). One panel plots Spending per Drug* with an exponential model (Td = 9.44). The other plots log10(Spending per Drug*) with a linear model (R2 = 0.95), its confidence interval, and its prediction interval.]

*Spending in Billions of 2008 Dollars; data from doi:10.1038/nrd3681

Himmelstein, Daniel; Baranzini, Sergio (2014): Increasing R&D Spending per New Drug Approval. figshare. http://dx.doi.org/10.6084/m9.figshare.937004
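The figure reports a doubling time (Td = 9.44) from an exponential model and a linear fit to log10 spending. The sketch below shows one way such a doubling time can be derived from the slope of a log10 fit. It assumes Td is measured in years and uses a synthetic, noise-free series generated for illustration; it is not the data or code behind the figure, and the function name is hypothetical.

```python
import math

def doubling_time(years, values):
    """Fit log10(value) = intercept + slope * year by ordinary least squares.

    Returns (slope, doubling time), where doubling time = log10(2) / slope.
    """
    logs = [math.log10(v) for v in values]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    slope = sxy / sxx
    return slope, math.log10(2) / slope

# Synthetic series that doubles every 9.44 years (illustrative only).
ASSUMED_TD = 9.44
years = list(range(1950, 2011, 5))
spending = [0.01 * 2 ** ((year - 1950) / ASSUMED_TD) for year in years]

slope, td = doubling_time(years, spending)
print(f"slope = {slope:.4f} log10 units per year, doubling time = {td:.2f} years")
```

Fitting on the log scale turns exponential growth into a straight line, which is why the second panel can use a linear model.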


Page 6: Welcoming to incoming bioinformatics students at UCSF

Physarum polycephalum: Slime Mold

Plasmodium:

• vegetative state

• acellular

• multinuclear

• protoplasmic veins (tubules)

• locomotion by pulsation
  - surface tension

http://youtu.be/MX2Fo4k6pxE

Signature habitat: shady, cool, moist

Page 7: Welcoming to incoming bioinformatics students at UCSF

the present

Hand Drawn Map of SF by Jenni Sparks

Page 8: Welcoming to incoming bioinformatics students at UCSF

The exponential rise of ‘omics’

Andrew Su on Twitter

‘omics’ — collective characterization and quantification of biomolecules

Page 9: Welcoming to incoming bioinformatics students at UCSF

Data Scientist:

Data Scientist: The Sexiest Job of the 21st Century

Meet the people who can coax treasure out of messy, unstructured data. by Thomas H. Davenport and D.J. Patil

ARTWORK Tamar Cohen, Andrew J Buboltz 2011, silk screen on a page from a high school yearbook, 8.5" x 12"

When Jonathan Goldman arrived for work in June 2006 at LinkedIn, the business networking site, the place still felt like a start-up. The company had just under 8 million accounts, and the number was growing quickly as existing members invited their friends and colleagues to join. But users weren’t seeking out connections with the people who were already on the site at the rate executives had expected. Something was apparently missing in the social experience. As one LinkedIn manager put it, “It was like arriving at a conference reception and realizing you don’t know anyone. So you just stand in the corner sipping your drink—and you probably leave early.”

Harvard Business Review, October 2012, p. 70 (Spotlight on Big Data)


Artwork: Tamar Cohen, Andrew J Buboltz, 2011

Definition (wikipedia): the study of the generalizable extraction of knowledge from data

Page 10: Welcoming to incoming bioinformatics students at UCSF

comparison with maternal grandmother

The Dawn of Personalized Genomics

NHGRI GWAS Catalog

Page 11: Welcoming to incoming bioinformatics students at UCSF

Open Source Explosion

Audio from: Let’s Talk Bitcoin! #134 Disruptive Leaps Andreas Antonopoulos & Jeffrey Tucker

Science graphic from: http://nisd.net/academics/elementary-science

Page 12: Welcoming to incoming bioinformatics students at UCSF

the past

Hand Drawn Map of SF by Jenni Sparks

Page 13: Welcoming to incoming bioinformatics students at UCSF

• Aggregate microbial rDNA content of a seawater sample

• richness of operational taxonomic units (OTUs)

• species distribution modeling

Diversity of the Marine Metagenome

Ladau et al. (2013) ISME doi:10.1038/ismej.2013.37

Katie Pollard

[Map of sampling locations, longitude -180° to 180°, latitude -90° to 90°; legend: MICROBIS, FUHRMAN2008, POMMIER2007, GOS]

Figure S1: Sampling locations for data used in constructing maps. Models with zero to eight parameters were fitted using MICROBIS data. Predictive performance of the models was evaluated using both internal measures of model performance (AIC, BIC, and PRESS) and three independent data sets, collected at the locations shown in red, green, and yellow (see Table S1). Analyses were based on 377 samples (234 MICROBIS, 30 GOS, 9 POMMIER2007, 103 FUHRMAN2008) collected from 164 distinct locations.


Page 14: Welcoming to incoming bioinformatics students at UCSF

Diversity in June

Ladau et al. (2013) ISME doi:10.1038/ismej.2013.37

linear model at a rarefaction depth of 4266 sequences, with de novo sequence classification. To estimate ranges of individual taxa, we used SDMs with a logistic regression model (Franklin and Miller, 2009). Data used for model fitting are available in Supplementary File 3.

We performed 15 analyses, labeled Analyses I–XV, to check the robustness of the diversity maps and to model the distributions of different taxa and groups of taxa (Supplementary Tables S1 and S5). For Analyses I–XI, we log-transformed richness and Shannon diversity.

Robustness analyses

Analyses I–V checked the robustness of overall diversity patterns that we report. Analysis I used a linear model, with OTUs identified using de novo clustering, and a rarefaction depth of 4266 sequences. Analysis II checked whether the patterns are affected by the classification method. It was the same as Analysis I, but used OTUs identified by the Ribosomal Database Project (RDP) classifier, a reference-based procedure. We ran the RDP classifier with and without a 50% bootstrap threshold. Using the bootstrap threshold introduces significant bias to the data set, because sequences with high similarity to known bacterial genera are not evenly distributed across latitudes. Without a bootstrap threshold, the relative diversity patterns of RDP classified genera are very similar to those from de novo OTUs (Supplementary Figure S2). Analysis III checked for effects of rarefaction depth. It was the same as Analysis I, but used a rarefaction depth of 150 sequences rather than 4266 sequences. Analysis IV checked whether using a linear model affected our results. It implemented a nonlinear, multiple adaptive regression splines model (MARS) in lieu of the linear model, but was otherwise like Analysis I. Analysis V checked whether our patterns were dependent on the diversity metric used. It was the same as Analysis I, but used Shannon diversity instead of OTU richness. The results of all five analyses were qualitatively alike (Figure 1, Supplementary Figures S2–5), so in the main text we focus on the results from Analysis I.

Additional diversity maps

Analyses VI–XI mapped the distribution of richness of OTUs within certain phyla. Analyses XII–XV mapped the distributions of select genera of marine bacteria.

[Figure 1 panels: maps with a Log10(OTU Richness) color scale from 2.05 to 2.65; marginal plots of Log10(OTU Richness) against Latitude, -90 to 90]

Figure 1 Maps of predicted global marine bacterial diversity. Color scale shows relative richness of marine surface waters as predicted by SDM. Samples were rarefied to 4266 rDNA sequences to enable accurate estimation of relative richness patterns on a global scale from data sets with different sequencing depths. True richness is expected to exceed estimated values. (a) In December, OTU richness peaks in temperate and higher latitudes in the Northern Hemisphere. (b) In June, OTU richness peaks in temperate latitudes in the Southern Hemisphere. Predicted richness during the spring and fall is intermediate, with roughly globally uniform richness near the equinoxes (movie available in Supplementary File 2). Predicted richness patterns remain qualitatively the same regardless of the taxonomic classification method (Supplementary Figure S2), modeling method (Supplementary Figure S3), choice of environmental predictors (Supplementary Figure S4) and sequencing depth (Supplementary Figure S5). Error rates for the predictions are generally low, as indicated by 95% confidence intervals on the marginal plots (right panels, shaded gray) and maps of standard errors (Supplementary Figure S6). Grayed regions on the maps are areas where environmental raster data and, hence, predictions are unavailable. Richness estimates in most regions are interpolated rather than extrapolated (Supplementary Figure S7).

Global marine bacterial diversity. J Ladau et al. The ISME Journal, p. 1671
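The excerpt relies on three quantities: rarefaction to a fixed sequencing depth, OTU richness, and Shannon diversity. The sketch below illustrates them on a hypothetical sample. It assumes one OTU label per sequence and the natural-log form of the Shannon index; the function names and the toy data are illustrative and are not taken from Ladau et al.

```python
import math
import random
from collections import Counter

def rarefy(otu_labels, depth, seed=0):
    """Randomly subsample `depth` sequences (without replacement) from one sample."""
    if depth > len(otu_labels):
        raise ValueError("sample is shallower than the rarefaction depth")
    return random.Random(seed).sample(otu_labels, depth)

def otu_richness(otu_labels):
    """Number of distinct OTUs observed."""
    return len(set(otu_labels))

def shannon_diversity(otu_labels):
    """Shannon index H = -sum(p * ln p) over OTU relative abundances."""
    counts = Counter(otu_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical sample: one OTU label per sequenced read.
sample = ["otu_a"] * 500 + ["otu_b"] * 300 + ["otu_c"] * 150 + ["otu_d"] * 50

# 150 is the shallower rarefaction depth mentioned for Analysis III.
subsample = rarefy(sample, depth=150)

print("richness:", otu_richness(subsample))
print("log10(richness):", math.log10(otu_richness(subsample)))
print("Shannon diversity:", shannon_diversity(subsample))
```

Rarefying every sample to the same depth keeps deeply sequenced samples from looking richer simply because more reads were drawn.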


Page 15: Welcoming to incoming bioinformatics students at UCSF

Diversity in December


Ladau et al. (2013) ISME doi:10.1038/ismej.2013.37


Page 16: Welcoming to incoming bioinformatics students at UCSF

Slime Mold & the Greater Tokyo Rail System

Tero et al (2010) Science DOI: 10.1126/science.1177894

http://youtu.be/GwKuFREOgmo

• 17 cm (7 in) agar-filled petri dish

• plasmodium for Tokyo

• quaker oats for cities

• vegetate for a day

• decentralized, distributed planning

Page 17: Welcoming to incoming bioinformatics students at UCSF

Tero et al (2010) Science DOI: 10.1126/science.1177894

aftermath: no illumination

aftermath: geographic constraint using illumination

Page 18: Welcoming to incoming bioinformatics students at UCSF

The SlimeNet was comparable or preferable to the RealNet in terms of:

• efficiency

• fault tolerance

• cost

Actual Rail Network

Slime Tubule Network

Tero et al (2010) Science DOI: 10.1126/science.1177894
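One way to make efficiency, fault tolerance, and cost concrete is to compute them on a small graph. The sketch below is an assumption-laden illustration, not the procedure from Tero et al.: it takes cost as total link length, efficiency as the mean shortest-path distance between node pairs (lower is better), and fault tolerance as the fraction of single-link failures that leave the network connected. The four-node layout and all function names are hypothetical.

```python
import heapq
import math
from itertools import combinations

# Hypothetical toy layout: node -> (x, y) position.
coords = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (0, 1)}

def length(u, v):
    (x1, y1), (x2, y2) = coords[u], coords[v]
    return math.hypot(x2 - x1, y2 - y1)

def adjacency(edges):
    adj = {node: [] for node in coords}
    for u, v in edges:
        adj[u].append((v, length(u, v)))
        adj[v].append((u, length(u, v)))
    return adj

def shortest_paths(adj, source):
    """Dijkstra's algorithm; returns distances to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue
        for v, w in adj[u]:
            if d + w < dist.get(v, math.inf):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def is_connected(edges):
    return len(shortest_paths(adjacency(edges), next(iter(coords)))) == len(coords)

def cost(edges):
    return sum(length(u, v) for u, v in edges)

def mean_distance(edges):
    """Mean shortest-path distance over all node pairs (assumes a connected network)."""
    adj = adjacency(edges)
    pairs = list(combinations(coords, 2))
    return sum(shortest_paths(adj, u)[v] for u, v in pairs) / len(pairs)

def fault_tolerance(edges):
    """Fraction of single-link removals after which the network stays connected."""
    survived = sum(is_connected([e for e in edges if e != removed]) for removed in edges)
    return survived / len(edges)

tree = [("A", "B"), ("B", "C"), ("C", "D")]
ring = tree + [("D", "A")]
for name, net in (("tree", tree), ("ring", ring)):
    print(name, cost(net), round(mean_distance(net), 3), fault_tolerance(net))
```

The toy comparison shows the trade-off: the ring costs one extra link but shortens average travel and survives any single failure, while the tree is cheapest and breaks whenever a link fails.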

Page 19: Welcoming to incoming bioinformatics students at UCSF

Human Evolution & Population Genetics

John Novembre

Ryan Hernandez

• 3,192 Europeans

• 500,568 SNPs

• Reduced to 2d (PCA)

Veeramah & Hammer (2014) Nat Rev Genet doi:10.1038/nrg3625

out-of-Africa bottleneck

• Europeans have less genetic diversity than Africans

Novembre et al (2008) Nature doi:10.1038/nature07331

Page 20: Welcoming to incoming bioinformatics students at UCSF

Genes mirror geography within Europe

Novembre et al (2008) Nature doi:10.1038/nature07331

• Despite the low diversity in Europeans, 500 thousand common variants discriminate population diversity with high resolution.
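To illustrate how a genotype matrix can be reduced to two dimensions with PCA, the sketch below runs a small pure-Python PCA on simulated data. Everything here is an assumption for illustration: the two simulated populations, their allele frequencies, the 0/1/2 genotype coding, and the power-iteration routine. It is not the pipeline used by Novembre et al.

```python
import math
import random

def center_columns(matrix):
    """Subtract each SNP's mean genotype so every column has mean zero."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]

def gram(matrix):
    """Individual-by-individual matrix of dot products."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in matrix] for r1 in matrix]

def top_eigenpairs(sym, k, iters=200, seed=0):
    """Top-k eigenpairs of a symmetric matrix by power iteration with deflation."""
    rng = random.Random(seed)
    n = len(sym)
    mat = [row[:] for row in sym]
    pairs = []
    for _ in range(k):
        vec = [rng.gauss(0, 1) for _ in range(n)]
        for _ in range(iters):
            new = [sum(m * v for m, v in zip(row, vec)) for row in mat]
            norm = math.sqrt(sum(x * x for x in new))
            vec = [x / norm for x in new]
        value = sum(v * sum(m * w for m, w in zip(row, vec)) for v, row in zip(vec, mat))
        pairs.append((value, vec))
        # Remove the component just found before searching for the next one.
        mat = [[mat[i][j] - value * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return pairs

def pca_scores(genotypes, k=2):
    """Return k lists of principal component scores, one value per individual."""
    g = gram(center_columns(genotypes))
    return [[math.sqrt(max(val, 0.0)) * x for x in vec] for val, vec in top_eigenpairs(g, k)]

def simulate_population(n_people, allele_freq, n_snps, rng):
    """Genotypes coded as 0, 1, or 2 copies of an allele at each SNP."""
    return [[int(rng.random() < allele_freq) + int(rng.random() < allele_freq)
             for _ in range(n_snps)] for _ in range(n_people)]

rng = random.Random(42)
pop_a = simulate_population(15, 0.2, 100, rng)  # hypothetical population A
pop_b = simulate_population(15, 0.8, 100, rng)  # hypothetical population B
pc1, pc2 = pca_scores(pop_a + pop_b)
print("PC1, population A:", [round(x, 1) for x in pc1[:15]])
print("PC1, population B:", [round(x, 1) for x in pc1[15:]])
```

With allele frequencies this far apart, the first component separates the two simulated groups; real European samples differ far more subtly, which is why hundreds of thousands of SNPs are needed.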

Page 21: Welcoming to incoming bioinformatics students at UCSF

Medical Informatics - An invited segment by Antoine Lizée -

How to build intelligence around patient medical records

Adriana Karembeu & Antoine Lizee at Sandler Neurosciences Center, UCSF

Page 22: Welcoming to incoming bioinformatics students at UCSF

4500 visits - 600 patients – 10th year (UCSF EPIC STUDY)

Images (~200MB/visit)

• Brain MRI: T1, T2, proton density

• Processed MRI: Cortical Thickness, Myelin

• Overlays: CT, Myelin, Anatomical labels

Genotypes (~1MB/patient)

• GWAS: 500,000+ SNPs

• HLA: A, B, C, DRB1, DQB1

(Para-)Clinical Data (~250 variables/visit)

• Patient data: Age, sex, history, etc.

• Clinical data: Clinical Scores, treatments

• Patient reported: Quality of Life questionnaires

• Processed data: MRI-based

Reference Data
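The per-visit and per-patient figures on this slide imply rough totals for the whole study. The back-of-the-envelope sketch below multiplies them out, assuming the approximate sizes apply uniformly to every visit and patient and using decimal units (1 GB = 1000 MB). The variable names are illustrative.

```python
VISITS = 4500
PATIENTS = 600
IMAGE_MB_PER_VISIT = 200            # approximate, from the slide
GENOTYPE_MB_PER_PATIENT = 1         # approximate, from the slide
CLINICAL_VARIABLES_PER_VISIT = 250  # approximate, from the slide

image_gb = VISITS * IMAGE_MB_PER_VISIT / 1000
genotype_mb = PATIENTS * GENOTYPE_MB_PER_PATIENT
clinical_values = VISITS * CLINICAL_VARIABLES_PER_VISIT
visits_per_patient = VISITS / PATIENTS

print(f"imaging: about {image_gb:.0f} GB")
print(f"genotypes: about {genotype_mb} MB")
print(f"clinical values: about {clinical_values:,}")
print(f"visits per patient: about {visits_per_patient:.1f}")
```

Under these assumptions imaging dominates storage by orders of magnitude, while the clinical variables dominate in the number of distinct fields to model.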

Page 23: Welcoming to incoming bioinformatics students at UCSF

Visits

Page 24: Welcoming to incoming bioinformatics students at UCSF
Page 25: Welcoming to incoming bioinformatics students at UCSF
Page 26: Welcoming to incoming bioinformatics students at UCSF

Cluster labels: hometown, kin, college, debate, camp, Guatemala, UCSF, research, Dartmouth

The Friendship Network of Daniel Himmelstein

Learn more online at: http://dhimmel.com

• 1,278 nodes • 40,255 edges

Page 27: Welcoming to incoming bioinformatics students at UCSF

Facebook Friends (http://dhimmel.com)

Cluster labels: Highschool, Camp, College, Kin, UCSF, Research, Debate

1,278 nodes (1 type), 40,255 edges (1 type)

Complex Diseases (http://het.io)

Node types: Genes, Diseases, Pathophysiologies, Tissues, Genomic Positions, Perturbations, Canonical Pathways, BioCarta, KEGG, Reactome, miRNA, TFBS, Cancer Hoods, Cancer Modules, GO: BP, GO: MF, GO: CC, Oncogenic, Immunologic

29,241 nodes (19 types), 1,608,168 edges (20 types)

Page 28: Welcoming to incoming bioinformatics students at UCSF

Network

MetaGraph
• Node types: Gene, Disease, Tissue, Pathophysiology, MSigDB Collection
• Edge types: interaction, association, expression, localization, membership

Graph Subset
• Nodes: IRF1, IL2RA, IRF8, CXCR4, STAT3, ITCH, SUMO1, Leukocyte, Lung, Multiple Sclerosis, Crohn’s Disease
• Edge labels: interaction, association, expression, localization

Feature Computation

MetaPaths: GeTlD, GiGaD, GaDaGaD, GaD, GaDmPmD, GiGiGaD, GaDlTlD, GeTeGaD, GmMmGaD (one per MSigDB collection)

metapath   path                                  metaedge-specific degrees   path degree product
GiGaD      IRF1–IL2RA–Multiple Sclerosis         4, 1, 1, 4                  0.25
GiGaD      IRF1–IRF8–Multiple Sclerosis          4, 1, 1, 4                  0.25
GiGaD      IRF1–CXCR4–Multiple Sclerosis         4, 2, 1, 4                  0.177
GeTlD      IRF1–Leukocyte–Multiple Sclerosis     2, 1, 1, 1                  0.707

degree weighted path count: GiGaD = 0.677, GeTlD = 0.707

PDP(path) = \prod_{d \in D_{path}} d^{-w}

DWPC(metapath) = \sum_{path \in Paths} PDP(path)
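
The path degree product (PDP) and degree-weighted path count (DWPC) can be checked against the numbers on the slide. The sketch below takes the metaedge-specific degrees for the four IRF1 to Multiple Sclerosis paths directly from the slide. The damping exponent w = 0.5 is an assumption: the slide does not state it, but it reproduces the slide's values. Function and variable names are illustrative.

```python
from math import prod

def path_degree_product(degrees, w=0.5):
    """PDP: product of each metaedge-specific degree raised to the power -w."""
    return prod(d ** -w for d in degrees)

def degree_weighted_path_count(paths, w=0.5):
    """DWPC: sum of the PDPs of all paths that follow one metapath."""
    return sum(path_degree_product(degrees, w) for degrees in paths.values())

# Degrees along each path, as listed on the slide.
gigad_paths = {
    "IRF1-IL2RA-Multiple Sclerosis": [4, 1, 1, 4],
    "IRF1-IRF8-Multiple Sclerosis": [4, 1, 1, 4],
    "IRF1-CXCR4-Multiple Sclerosis": [4, 2, 1, 4],
}
getld_paths = {
    "IRF1-Leukocyte-Multiple Sclerosis": [2, 1, 1, 1],
}

gigad_dwpc = degree_weighted_path_count(gigad_paths)  # w = 0.5 is assumed
getld_dwpc = degree_weighted_path_count(getld_paths)

for name, degrees in {**gigad_paths, **getld_paths}.items():
    print(name, round(path_degree_product(degrees), 3))
print("GiGaD DWPC", round(gigad_dwpc, 3))
print("GeTlD DWPC", round(getld_dwpc, 3))
```

Paths that run through high-degree nodes contribute less, so a single specific path (GeTlD, 0.707) can outweigh three paths through better-connected nodes (GiGaD, 0.677).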

Machine Learning

[Figure: standardized coefficient of each feature. Features: {Cancer Hood}, {Positional}, GeTeGaD, GiGeTlD, GeTlD, {GO Function}, {GO Component}, {miRNA Target}, {BioCarta}, {Oncogenic}, {TF Target}, GaD (any gene), {Cancer Module}, GiGiGaD, {GO Process}, GiGaD, {KEGG}, {Immunologic}, {Reactome}, {Perturbation}, GaDmPmD, GaD (any disease), GaDlTlD, GaDaGaD. x-axis: Standardized Coefficient. Method (AUROC): ridge (0.829), lasso (0.823).]

Performance

[Figure: Ridge model. ROC curve: Recall versus False Positive Rate; Partition (AUROC): Testing (0.829), Training (0.810). Precision-recall curve: Precision versus Recall, colored by Prediction Threshold; AUPRC = 0.062.]

Combine Predictions & Statistical Evidence

[Figure: density of Meta2.5 P-values. x-axis: P-value; y-axis: Density.]

Discover Novel Susceptibility Genes

Gene     Meta2.5   HNLP    WTCCC2
JAK2     0.047     0.102   0.0015
REL      0.001     0.040   0.0003
SH2B3    0.012     0.034   0.0130
RUNX3    0.016     0.025   0.0073

Table 5. Multiple sclerosis gene discovery.

Statistic                   Value
Prediction Threshold        0.024
False Positive Rate         0.001
Recall                      0.108
Precision                   0.133
Lift                        68.4
Novel & Meta2.5-nominal
  Total                     1211
  Discovered                4
  Bonferroni Cutoff         0.0125
  Discovered < Bonferroni   3
  Total < Bonferroni        199
  Replication p-value       0.015

Table 6. Multiple sclerosis gene discovery statistics.
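
The Bonferroni figures in Table 6 can be reproduced from Table 5 under two assumptions that the slide does not state: that the cutoff is a significance level of 0.05 divided by the four discovered genes, and that the WTCCC2 column holds the validation p-values. The sketch below is a consistency check under those assumptions, not a description of the original analysis code.

```python
# WTCCC2 p-values for the four discovered genes, copied from Table 5.
wtccc2_p_values = {
    "JAK2": 0.0015,
    "REL": 0.0003,
    "SH2B3": 0.0130,
    "RUNX3": 0.0073,
}

alpha = 0.05  # assumed family-wise significance level (not stated on the slide)

# Bonferroni correction: divide alpha by the number of tests performed.
bonferroni_cutoff = alpha / len(wtccc2_p_values)

# Genes whose validation p-value falls below the corrected cutoff.
validated = sorted(
    gene for gene, p in wtccc2_p_values.items() if p < bonferroni_cutoff
)

print("cutoff:", bonferroni_cutoff)
print("validated:", validated)
```

Under these assumptions the cutoff matches the 0.0125 in Table 6, and three of the four genes pass it, matching "Discovered < Bonferroni: 3"; SH2B3 (0.0130) narrowly misses.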

Page 29: Welcoming to incoming bioinformatics students at UCSF

Interactive Web Browser - http://het.io

Page 30: Welcoming to incoming bioinformatics students at UCSF

Mechanisms of Pathogenesis

[Figure: Gene–{MSigDB Collection}–Gene–Disease DWPC Model. y-axis: AUROC (0.4 to 1.0). x-axis: Positional, Cancer Hood, BioCarta, GO Component, miRNA Target, GO Function, Reactome, Oncogenic, TF Target, KEGG, GO Process, Cancer Module, Immunologic, Perturbation, Lasso, Ridge.]

[Figure: Degree-Weighted Path Count Path Count Model. y-axis: AUROC (0.4 to 1.0). x-axis: GiGaD, GeTeGaD, GeTlD, GiGeTlD, GaDaGaD, GaDmPmD, GiGiGaD, GaDlTlD, GaD (any gene), GaD (any disease), Lasso, Ridge.]

Legend (Pathophysiology): degenerative, immunologic, metabolic, neoplastic, psychiatric, unspecific

Page 31: Welcoming to incoming bioinformatics students at UCSF


Page 32: Welcoming to incoming bioinformatics students at UCSF


Sergio Baranzini

Ryan Hernandez

John Witte

Andrej Sali

Katie Pollard

Patsy Babbitt

Page 33: Welcoming to incoming bioinformatics students at UCSF

Decreased Lung Cancer at High Elevations

[Figure A, Bivariate Plot: Lung Cancer Incidence (age-adjusted cases per 100,000) versus Elevation (km). Four panels: β = −9.167, R² = 0.202; β = −7.781, R² = 0.109; β = −2.072, R² = 0.059; β = 3.545, R² = 0.011.]

[Figure B, Partial Regression Plot: Lung Cancer Incidence Residual versus Residual Elevation. Four panels: β = −7.234, R² = 0.252; β = −4.056, R² = 0.04; β = 0.648, R² = 0.006; β = 4.861, R² = 0.015.]

• Counties of the American West

• Lung cancer versus elevation

• Publicly-available data
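
A partial regression plot differs from a bivariate plot in that both variables are first residualized on the other covariates. The sketch below illustrates the idea with one covariate. Everything in it is a labeled assumption: the six data points are invented, "smoking" is a hypothetical confounder chosen for illustration, and incidence is generated without noise as 100 − 7 × elevation + 2 × smoking so the expected answer is known. It is not the analysis behind the slide.

```python
def simple_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residuals(x, y):
    """Part of y that a linear fit on x does not explain."""
    slope, intercept = simple_fit(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

# Invented toy data (not from the study): elevation in km, smoking in percent.
elevation = [0.1, 0.5, 0.9, 1.3, 1.7, 2.1]
smoking = [22.0, 25.0, 18.0, 20.0, 15.0, 17.0]
incidence = [100.0 - 7.0 * e + 2.0 * s for e, s in zip(elevation, smoking)]

# Bivariate slope: ignores smoking, so it absorbs part of the smoking effect.
bivariate_slope, _ = simple_fit(elevation, incidence)

# Partial regression: residualize both variables on the covariate,
# then regress the incidence residual on the elevation residual.
elevation_residual = residuals(smoking, elevation)
incidence_residual = residuals(smoking, incidence)
partial_slope, _ = simple_fit(elevation_residual, incidence_residual)

print("bivariate slope:", bivariate_slope)
print("partial slope:", partial_slope)
```

Because smoking falls with elevation in the toy data, the bivariate slope overstates the elevation effect, while the partial slope recovers the coefficient used to generate the data.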

Page 34: Welcoming to incoming bioinformatics students at UCSF

[Figure: Standardized Elevation Coefficient versus Subset Size, colored by Model BIC. Panels: Lung, Breast, Colorectal, Prostate.]

Association specific to lung cancer

Kamen Simeonov

• Inhaled carcinogen

• Oxygen concentration decreases by ~11% for every 1000 meter rise in elevation


Page 35: Welcoming to incoming bioinformatics students at UCSF

the future

Hand Drawn Map of SF by Jenni Sparks

Page 36: Welcoming to incoming bioinformatics students at UCSF

Subscription Publishing

Health science journal subscription costs are skyrocketing

© Association of Research Libraries, 2013

[Figure: Library and University Expenditure Trends (Time-Series), 1982 to 2011. Series: Library Expenditure as % of Total University Expenditure (Average of Select US ARL Libraries), axis 1.50% to 3.70%; Total University Expenditure (Average of Select US ARL Libraries), axis $0 to $20,000, in ten thousands.]

[Figure: % of university budgets for libraries by year, 1982 to 2011; axis 1.7% to 3.7%.]

Library budgets are nosediving

http://www.library.ucsf.edu/services/scholpub/journalcosts

• Libraries are canceling subscriptions
• Research is paywalled, inaccessible to those who could benefit
• Scientists desire their findings to be widely-applied
• Research funding is public
• Very small percentage of individuals have institutional access
• Academia doesn’t succeed in a vacuum; innovation grows from diverse and plentiful inputs

Audio from: Let’s Talk Bitcoin! #134 Disruptive Leaps, Andreas Antonopoulos

Page 37: Welcoming to incoming bioinformatics students at UCSF

Per Article Cost from "Open Access: Market Size, Share, Forecast, and Trends" Outsell. January 31, 2013
Subscription: $4,000.00
Open Access: $950.00

UCSF Open Access Fund http://www.library.ucsf.edu/services/scholpub/oa/fund/eligibility

Fully OA Journal: $2,000 Hybrid OA: $1,000

• PeerJ — Lifetime publishing plan for $99

• eLife — currently no APC, “pain free publication”

• PLOS, BMC, Specialty Pubs
• F1000 Research, pre-review publication
• preprints, arXiv & bioRxiv

Page 38: Welcoming to incoming bioinformatics students at UCSF

Article-level metrics

doi:10.1371/journal.pone.0013636.g005

Open Access increases Citations

Gargouri et al. PLOS One. 2010

• Alternative to journal impact factor

• Citations, downloads, views, social media

• Accelerates science — impact factor = rejection

• Expands the audience evaluating article importance and quality

• Already used: h-index

Grow in importance

Page 39: Welcoming to incoming bioinformatics students at UCSF

Public Data increases Citations

citations

Piwowar & Vision (2013) DOI: 10.7717/peerj.175

• 10,555 microarray studies

• Classified studies by data availability

• 8 categories of covariates

Page 40: Welcoming to incoming bioinformatics students at UCSF

Availability & Reuse

• only applies to original research articles

• journals often withhold the typeset version

• does not affect reuse

Creative Commons Attribution Alone

Mandatory Archiving
NIH: PubMed Central
UC: eScholarship

• subscription journals require the transfer of article ownership

• enforce the article copyright

• require licensing for reuse

Page 41: Welcoming to incoming bioinformatics students at UCSF

Tools for Efficiency & Reproducibility

Version control:

Online code repositories:

Interactive programming environments:

ipython notebook

Page 42: Welcoming to incoming bioinformatics students at UCSF

Personal Website

Clint Cario clintcario.com

Daniel Himmelstein dhimmel.com

Brian O’Donovan iambrianodonovan.com

Kieran Mace mace.co

Andrew Sczesnak andrewsczesnak.com

Kyle Barlow kylebarlow.com

Page 43: Welcoming to incoming bioinformatics students at UCSF

your beginning

Hand Drawn Map of SF by Jenni Sparks