DKFZ PhD Retreat Poster 20190703 - EMBL Blogs · 1st d e Celltype Morula Blastocyst ICM Regulatory...

1
Single-cell transcriptome and chromatin accessibility data integration reveals cell specific signatures 1 Division of Neuroblastoma Genomics, German Cancer Research Center, Heidelberg, Germany. 2 Health Data Science Unit, Medical Faculty Heidelberg and BioQuant. *Correspondence: [email protected] Andrés Quintero 1,2,° , Anne-Claire Kröger 2 and Carl Herrmann 2,* recover cell specific signatures between different species 1 2 The ability to integrate multiple layers of omics data plays an essential role in understanding the complex interplay of different molecular mechanisms that give rise to cellular diversity. To address this challenge we implemented Integrative Iterative Non-negative Matrix Factorization (i2NMF), a computational method to dissect genomic signatures from multi-omics data sets. i2NMF was implemented as an extension of the R package Brat- wurst available in Github. We applied i2NMF to : https://github.com/wurst- theke/bratwurst identify rare cell populations F o r e v e r y i n i t ia l m a t r i x X n Variable number of signatures for every data set Signatures are shared across data sets Stage 1 Stage 2 i2NMF workflow: The shared effect is recovered in the H s matrix, and the exposure of the features explaining this effect are contained in the W s matrices (Yang and Michailidis, 2016). The explained variance of the decomposed model can be estimated for both stages by: This is useful to compare the performance between stages and the overall decomposition Common features should be shared across columns, e.g. gene or sample IDs. i2NMF advantages ● The feature exposure matrices W s and W r are different, recovering unique signatures between stage 1 and 2 ● The number of inferred signatures in stage 2 can vary across matrices, allowing a better resolution of specific effects. ● All solvers were implemented on TensorFlow, allowing scalability between platforms. H rn H r1 H r2 + × = × X₁ X₂ X n X rn X rn X n H s W sn H s W s1 W s2 W sn 1. Starting from two or more non- negative matrices, i2NMF initially decomposes the shared effect across them, using integrative NMF (iNMF). Solving the following problem: × W r1 W r2 W rn Residual input matrix X r is defined by: and decomposed into: 2. On a second iteration, i2NMF decomposes the residual effect, which was not explained by the shared decomposition × H rn W rn Sig. 5 Sig. 4 Sig. 1 Sig. 6 Sig. 3 Sig. 2 Sig. 7 Oligodendrocyte Neuron Endothelial Astrocyte Mural OPCs Microglia Ependymal Polydendrocyte 012345 -log 10 (p-value) Gene Set Lein astrocyte markers Lein oligodendrocyte markers Lein neuron markers GO endothelium development GO regulation of immune response GO regulation of neuron differentiation Gene set enrichment Cell type enrichment Enrichment Z-score −100 0 100 200 0.00 0.05 0.10 0.15 0.20 Human Mouse i2NMF stage 2nd 1st Explained variance −10 0 10 UMAP1 −10 0 10 −10 0 10 UMAP1 UMAP2 Cell type Astrocyte Endothelial Ependymal Microglia Mural Neuron Oligodendrocyte OPCs Polydendrocyte Species Human Mouse −10 −5 0 5 −5 0 5 UMAP1 UMAP2 Mouse Polydendrocyte Sub-types Tnr Bmp4 Tnr Cspg5 Tnr Cspg5 Dad1 Tnr Opalin Tnr Tmem2 23 18 27 24 4 3 21 25 41 CSPG5 Mouse residual signatures OPALIN BMP4 TNR Exposure Hr mouse Low High c. Cell type and gene set enrichment analysis revealed that each shared signature corresponds to cell types. Human & Mouse substantia nigra (SN) scRNA-seq data a. The human and mouse SN data sets were integrated over the set of shared genes using i2NMF. b. The shared Signatures identified in the first integrative step, were able to combine human and mouse cells (left) and resolve groups of the most relevant cell types in the SN. d. The second stage of i2NMF recovered species-specific Signatures, that helped to resolve cellular sub-types (top) and were defined by marker genes (bottom). 40,453 Cells Welch et al., 2019 5,127 genes in common 51,912 Cells Saunders et al., 2018 Shared Sig. 1 Shared Sig. 2 ATAC-seq Sig. 3 Cell type Exposure H shared Low High Exposure H ATACseq Low High Cell type Morula Blastocyst 0.0 0.2 0.4 0.6 0.8 ATAC-seq RNA-seq i2NMF stage 2nd 1st Explained variance Cell type Morula Blastocyst ICM Regulatory relationship Regulatory relationships Present Not present Morula Blastocyst ICM 0% 50% 100% Relative Expression KLF17 NANOG OCT4 b. The shared H matrix was able to recover two cell specific signatures. On the second iteration for the ATAC-seq data, a defined signature was decomposed for two cells. Human embryos Morula and blastocyst scCAT-seq data a. The human embryo scCAT- seq data set was integrated over all 72 cells. For the gene expression data, the majority of the explained variance was captured in the first stage of i2NMF, interestingly for the chromatin accessibility the second stage also recovered a a considerable fraction of the variance. c. The decomposed shared signatures where stable across a range of factorization ranks, showing a clear separation between morula and blastocyst cells. d. The set of chromatin accessible regions associated with the ATAC-seq Sign. 3 and its targets genes showed a specific pattern for two blastocyst cells. These also showed higher expression in marker genes for cells of the inner cell mass (ICM). Thus, allowing the identification of this rare cell type. ATAC-seq K2 K3 K4 K5 K6 Factorization rank Morula Blastocyst Blastocyst Morula RNA-seq 72 Cells gene expression and chromatin accessibility for every cell Liu et al., 2019 16,501 expressed genes (RNA-seq) & 42,713 identified peaks (ATAC-seq) Cell type

Transcript of DKFZ PhD Retreat Poster 20190703 - EMBL Blogs · 1st d e Celltype Morula Blastocyst ICM Regulatory...

Page 1: DKFZ PhD Retreat Poster 20190703 - EMBL Blogs · 1st d e Celltype Morula Blastocyst ICM Regulatory relationship s y Present Notpresent Morula ... DKFZ_PhD_Retreat_Poster_20190703

Single-cell transcriptome and chromatin accessibility dataintegration reveals cell specific signatures

1Division of Neuroblastoma Genomics, German Cancer Research Center, Heidelberg, Germany.2Health Data Science Unit, Medical Faculty Heidelberg and BioQuant.*Correspondence: [email protected]

Andrés Quintero1,2,°, Anne-Claire Kröger2 and Carl Herrmann2,*

recover cell specific signatures between different species1

2

The ability to integrate multiplelayers of omics data plays anessential role in understanding thecomplex interplay of differentmolecular mechanisms that giverise to cellular diversity.

To address this challenge weimplemented Integrative IterativeNon-negative Matrix Factorization(i2NMF), a computational method todissect genomic signatures frommulti-omics data sets.

i2NMF was implemented as anextension of the R package Brat-wurst available in Github.

We applied i2NMF to :

https://github.com/wurst- theke/bratwurst

identify rare cell populations

For ev

ery initial matrix Xn

Variable number of signaturesfor every data set

Signatures are sharedacross data sets

Stage 1 Stage 2i2NMF workflow:

The shared effect is recovered in the Hsmatrix, and the exposure of the featuresexplaining this effect are contained in theWs matrices (Yang and Michailidis, 2016).

The explained variance of thedecomposed model can beestimated for both stages by:

This is useful to compare theperformance between stages and the

overall decomposition

Common featuresshould be sharedacross columns,e.g. gene orsample IDs.

i2NMF advantages

● The feature exposure matricesWs and Wr are different,

recovering unique signaturesbetween stage 1 and 2

● The number of inferredsignatures in stage 2 can varyacross matrices, allowing a betterresolution of specific effects.

● All solvers were implementedon TensorFlow, allowing

scalability between platforms.

Hrn

Hr1

Hr2

+×≈

= ×

X₁

X₂Xn

Xrn

Xrn

−Xn HsWsn

Hs

Ws1

Ws2

Wsn

1. Starting from two or more non-negative matrices, i2NMF initially

decomposes the shared effect across them,using integrative NMF (iNMF).Solving the following problem:

×Wr1

Wr2

Wrn

Residual input matrixXr is defined by:

and decomposed into:

2. On a second iteration, i2NMFdecomposes the residual effect, which wasnot explained by the shared decomposition

≈ × HrnWrn

Sig. 5

Sig. 4

Sig. 1

Sig. 6

Sig. 3

Sig. 2

Sig. 7

Oligodendrocyte

Neuron

Endothelial

AstrocyteMuralOPCs

Microglia

Ependymal

Polydendrocyte 0 1 2 3 4 5

-log10(p-value)

Gene SetLein astrocytemarkersLein oligodendrocytemarkersLein neuronmarkersGO endotheliumdevelopmentGO regulation ofimmune responseGO regulation ofneuron differentiation

Gene setenrichmentCell type enrichment

Enrichment Z-score

−1000100200

0.000.050.100.150.20

Human Mouse

i2NMFstage2nd1stEx

plained

variance

−10 0 10UMAP1

−10

0

10

−10 0 10UMAP1

UMAP2

Cell type●●●●

AstrocyteEndothelialEpendymalMicroglia

●●●●●

MuralNeuronOligodendrocyteOPCsPolydendrocyte

Species●

HumanMouse

−10

−5

0

5

−5 0 5UMAP1

UMAP2 Mouse

PolydendrocyteSub-types●●●●●

Tnr Bmp4Tnr Cspg5Tnr Cspg5 Dad1TnrOpalinTnr Tmem2

2318272443212541

CSPG5

Mouse residualsignatures

OPALINBMP4TNR

ExposureHrmouse

LowHigh

c. Cell type and gene set enrichment analysisrevealed that each shared signature

corresponds to cell types.

Human & Mousesubstantia nigra (SN)scRNA-seq data

a. The human andmouse SN data setswere integrated overthe set of sharedgenes using i2NMF.

b. The shared Signatures identified in the firstintegrative step, were able to combine humanand mouse cells (left) and resolve groups of the

most relevant cell types in the SN.

d. The second stage of i2NMFrecovered species-specificSignatures, that helped to resolvecellular sub-types (top) and weredefined by marker genes (bottom).

40,453Cells

Welchet al., 2019

5,127 genesin common

51,912Cells

Saunderset al., 2018

Shared Sig. 1

Shared Sig. 2

ATAC-seq Sig. 3

Cell type

Exposure Hshared

LowHigh

Exposure HATACseq

LowHigh

Cell typeMorulaBlastocyst

0.0

0.2

0.4

0.6

0.8

ATAC-seq RNA-seq

i2NMFstage2nd1stEx

plained

variance

Cell typeMorulaBlastocystICM

Regulatoryrelationship

Regulatory

relationships

PresentNot present

Morula

Blastocyst

ICM

●●

●●

●●

0%

50%

100%

Relative Expression

● KLF17 ● NANOG● OCT4

b. The shared H matrix was able to recovertwo cell specific signatures. On the seconditeration for the ATAC-seq data, a definedsignature was decomposed for two cells.

Human embryosMorula and blastocyst

scCAT-seq data

a. The human embryo scCAT-seq data set was integratedover all 72 cells. For the geneexpression data, the majority ofthe explained variance wascaptured in the first stage ofi2NMF, interestingly for thechromatin accessibility thesecond stage also recovered aa considerable fraction of thevariance.

c. The decomposedshared signatureswhere stable across arange of factorizationranks, showing a clearseparation betweenmorula and blastocystcells.

d. The set of chromatin accessible regionsassociated with the ATAC-seq Sign. 3 and itstargets genes showed a specific pattern fortwo blastocyst cells. These also showed higherexpression in marker genes for cells of theinner cell mass (ICM). Thus, allowing theidentification of this rare cell type.

ATAC-seqK2

K3

K4

K5

K6Factorization

rank

Morula

Blastocyst

BlastocystMorula

RNA-seq72 Cells

gene expression andchromatin accessibility

for every cellLiu

et al., 201916,501 expressed genes

(RNA-seq)&

42,713 identified peaks(ATAC-seq)

Celltype