PGC bioinformatics

PGC bioinformatics PF Sullivan 20 September 2016


Page 1: PGC bioinformatics

PGC bioinformatics

PF Sullivan

20 September 2016

Page 2: PGC bioinformatics

Apologies

To the whole of the US for having this at an inaccessible time

Then again, US presidential election has polluted global consciousness

Next one will be more US-friendly

Page 3: PGC bioinformatics

Motivation

• The PGC is no longer a “one-and-done” organization. Not just a handful of gwas mega-analyses, but dozens

• Interpretation of results & downstream analyses should be routine but is often cumbersome, incomplete, and error-prone

• This seems pretty clear, obvious, unobjectionable

Page 4: PGC bioinformatics

Goal

• Develop, test, & deploy command line tools to interpret gwas results

• Integrated with ricopili

• Implement on LISA, initially for use within PGC (open later), /home/pgcbioif (thanks Danielle)

• Open-source, community effort, full documentation

• Some databases will need to be updated

• Two types:

  • Mature, best-in-class

  • Experimental/investigative (to be used cautiously)

Page 5: PGC bioinformatics

Proposal, mature: for a set of gwas results

Type | Content | Status
Lookups, general | Genes, OMIM, gwas catalog, CNV, ID, DD, ASD | TIEFIghter java, beta test
Lookups, vs all PGC findings | Find SNP results all prior PGC studies | Available now, gwasLibrary on LISA
SNP-h2 | Use LDSR to compute SNP-h2 for range of K (0.001 to 0.01 by 0.001, 0.01 to 0.15 by 0.01) | Need a script
Local SNP h2 | Bogdan Pasaniuc, HESS | Available now
rg vs PGC & vs LD-Hub | Genetic correlations vs all PGC & LD-Hub (need both b/c LD-Hub update status unclear) | Available now
Partitioned LDSR | Hilary Finucane (pmid 26414678) | Available now, but input data
TWAS | Eg, Sasha Gusev, Alkes Price (submitted): impute brain expression levels, case-control; other methods exist | Available now, but input data
MAGMA | de Leeuw, Posthuma | Available now

Others? SMR. GCTA. Popcorn. Credible SNPs. eQTLs. I personally wouldn’t do much with ENCODE/RoadMap – better data coming.
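The SNP-h2 row is marked "Need a script". As a rough sketch of what such a script would sweep over, the Python below builds the K grid named in the table and applies the standard observed-to-liability scale transformation at each K. It is not the LDSR software. The function names, the observed-scale h2 of 0.40, and the sample case proportion P = 0.5 are hypothetical placeholders, not PGC results.

```python
from statistics import NormalDist


def prevalence_grid():
    """K values named in the table: 0.001 to 0.01 by 0.001, then 0.01 to 0.15 by 0.01."""
    fine = [i / 1000 for i in range(1, 11)]    # 0.001 ... 0.010
    coarse = [i / 100 for i in range(2, 16)]   # 0.02 ... 0.15 (0.01 is already in `fine`)
    return fine + coarse


def liability_h2(h2_obs, K, P):
    """Convert observed-scale SNP-h2 to the liability scale.

    K is the assumed population prevalence, P the proportion of cases in the sample.
    """
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1 - K)      # liability threshold implied by K
    z = std_normal.pdf(threshold)              # normal density at that threshold
    return h2_obs * (K * (1 - K)) ** 2 / (P * (1 - P) * z ** 2)


if __name__ == "__main__":
    h2_obs, P = 0.40, 0.5                      # hypothetical inputs for illustration only
    for K in prevalence_grid():
        print(f"K={K:.3f}  h2_liability={liability_h2(h2_obs, K, P):.3f}")
```

Reporting the whole grid rather than a single K makes it visible how strongly the liability-scale estimate depends on the assumed prevalence.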

Page 6: PGC bioinformatics

Discussion

Correct current mature set? Others?

Page 7: PGC bioinformatics

geneMatrix

Sven Stringer, VU Amsterdam

Page 8: PGC bioinformatics

Goal

• Pipeline to automatically create psychiatric genetics-focused annotated geneMatrix

• Usable in PGC and COSYN project

• Gene matrix should be

  • useful for most of the people most of the time (not 100%)

  • easy to update

  • well-documented

  • directly usable in Excel as well as other analytic environments (R, matlab, python, linux, etc.)

• Housed on LISA /home/pgcbioif (thanks Danielle)

Page 9: PGC bioinformatics

Plan

• Original by PF Sullivan beginning 2004 (not general)

• Create flexible geneMatrix pipeline

• Pipeline will be

  • fully portable across linux environments

  • run from lisa cluster on its own account (/home/pgcbioif)

  • easy to configure

  • well-documented

  • create a gene matrix suitable for human and computer consumption (probably .csv format)

• Implementation mostly in R

• Update and distribution policy for gene matrix will be put in place

Page 10: PGC bioinformatics

Design

[Flow diagram. Box labels, in extraction order: HGNC data; GENCODE V24 back-mapped to hg19; preprocess (twice); merge and format; core matrix; create gene translation table (GTT); GTT; output settings; external annotations; merge and format; output matrix.]
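The stages named in the design (core matrix, gene translation table, external annotations, output matrix) can be illustrated with a minimal sketch. The slides say the real pipeline is mostly in R; the Python below is only an illustration of the data flow. Every gene name, column name, and function name in it is made up for the example and is not taken from the actual geneMatrix.

```python
import csv
import io

# Hypothetical, made-up rows standing in for preprocessed HGNC and GENCODE input.
HGNC = [
    {"hgnc_id": "HGNC:1", "symbol": "GENE1", "aliases": "G1|OLD1"},
    {"hgnc_id": "HGNC:2", "symbol": "GENE2", "aliases": ""},
]
GENCODE = [
    {"symbol": "GENE1", "chr": "chr1", "start": 1000, "end": 5000},
    {"symbol": "GENE2", "chr": "chr2", "start": 200, "end": 900},
]
# Hypothetical external annotation, keyed by whatever name its source used.
EXTERNAL = {"OLD1": {"flag": "yes"}}


def build_gtt(hgnc_rows):
    """Gene translation table: every known name (symbol or alias) -> official symbol."""
    gtt = {}
    for row in hgnc_rows:
        gtt[row["symbol"]] = row["symbol"]
        for alias in filter(None, row["aliases"].split("|")):
            gtt.setdefault(alias, row["symbol"])
    return gtt


def build_core(hgnc_rows, gencode_rows):
    """Merge names and coordinates into one record per official symbol."""
    coords = {row["symbol"]: row for row in gencode_rows}
    core = {}
    for row in hgnc_rows:
        if row["symbol"] in coords:
            core[row["symbol"]] = {**row, **coords[row["symbol"]]}
    return core


def add_annotation(core, gtt, annotation, columns):
    """Attach an external annotation after translating its gene names via the GTT."""
    for name, values in annotation.items():
        symbol = gtt.get(name)
        if symbol in core:
            core[symbol].update(values)
    for record in core.values():
        for column in columns:
            record.setdefault(column, "")      # keep the matrix rectangular
    return core


def to_csv(core):
    """Serialize the matrix as .csv text, one gene per row, sorted by symbol."""
    rows = [core[s] for s in sorted(core)]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


gtt = build_gtt(HGNC)
matrix = add_annotation(build_core(HGNC, GENCODE), gtt, EXTERNAL, ["flag"])
print(to_csv(matrix))
```

Routing every external source through the translation table means an annotation that uses an outdated alias still lands on the official symbol.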

Page 11: PGC bioinformatics

Main annotations

• Gene names (official HUGO symbol and aliases)

• Location on hg19 (GENCODE v24)

• Information about LD and SNP density

• ExAC constraint score (pLI)

• Associated OMIM diseases & NHGRI/EBI GWAS catalog traits

• Gene-based p-values from large psychiatric GWAS

• Disease-specific manually curated annotations (ID, DD, ASD, brain expression, community flags)

Page 12: PGC bioinformatics

Manual curation

• Extracting information from important disease-specific papers

• Distribute tasks/responsibilities across stakeholders

• Data from curators will need to conform to specific format to be included automatically in pipeline

• Policies and conventions will be put in place to make this manual curation work
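To make "conform to specific format" concrete, here is a minimal sketch of an automatic check a pipeline could run on a curator's file before merging it. The slides do not define the format, so the required column names and the two rules below are purely hypothetical, and the PMID in the sample row is a placeholder.

```python
import csv
import io

# Hypothetical required columns; the real convention is still to be agreed.
REQUIRED = ("gene_symbol", "annotation", "value", "source_pmid")


def validate_curation(text):
    """Return a list of problems found in a curator-supplied .csv file (given as text)."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED if c not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):    # line 1 is the header
        if not (row["gene_symbol"] or "").strip():
            problems.append(f"line {line_no}: empty gene_symbol")
        if not (row["source_pmid"] or "").strip().isdigit():
            problems.append(f"line {line_no}: source_pmid is not numeric")
    return problems


good = "gene_symbol,annotation,value,source_pmid\nGENE1,ASD,1,12345678\n"
bad = "gene_symbol,annotation,value,source_pmid\n,ASD,1,unknown\n"
print(validate_curation(good))
print(validate_curation(bad))
```

Returning a list of problems, rather than stopping at the first one, lets a curator fix a file in one pass.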

Page 13: PGC bioinformatics

Limitations

• Quality of information obviously depends on data sources used (GENCODE, HGNC, etc.)

• Gene matrix is provided “as-is”, no guarantees

• However

  • care is taken to ensure quality as much as possible

  • sanity checks are performed

  • pipeline is transparent and documented

Page 14: PGC bioinformatics

We need a core team to take responsibility

• Suggest based in PGC Stats Group, need steering group – two leaders to be responsible

• PGC has employed analysts & data wranglers

• Use our paid PGC consultants (advice only)

• Implementation / update data / add new data & features

• Interface with PGC Data Access Committee, Pathway group

• Simple standard formats (please, let’s not get fancy)

(PGC liberalizing results access policy – in progress)

Page 15: PGC bioinformatics

How can I get involved with the PGC?

Much of PGC leadership is 55+. Turnover is good for an organization.

The people who are in leadership roles in PGC stepped up:

• Volunteered, followed through, did tasks well

• Took on small roles, did them well, got more to do

• Volunteered to write parts of papers

• Were consistently on calls

PGC FAQ

Page 16: PGC bioinformatics

Let’s get phase 1 done

Then we can move from there.

Particularly working with psychENCODE on functional genomic data.

I can’t do this too…if people don’t step up, won’t happen