ChIP-Seq Data Processing and QC - Babraham Bioinf · ChIP-Seq Data Processing and QC Simon Andrews...

17
ChIP-Seq Data Processing and QC Simon Andrews [email protected] @simon_andrews v2020-05

Transcript of ChIP-Seq Data Processing and QC - Babraham Bioinf · ChIP-Seq Data Processing and QC Simon Andrews...

Page 1: ChIP-Seq Data Processing and QC - Babraham Bioinf · ChIP-Seq Data Processing and QC Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2020-05

ChIP-Seq Data Processing and QC

Simon [email protected]

@simon_andrewsv2020-05

Page 2: ChIP-Seq Data Processing and QC - Babraham Bioinf · ChIP-Seq Data Processing and QC Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2020-05

Data Creation and Processing

Starting DNA Fragmented DNA ChIPped DNA

Sequence LibraryFastQ Sequence

FileMapped BAM File

Page 3: ChIP-Seq Data Processing and QC - Babraham Bioinf · ChIP-Seq Data Processing and QC Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2020-05

A typical ChIP Library

• Potential technical problems

– Adapter contamination

– PCR Duplication

• Potential biological problems

– Lack of enrichment

– Other selection biases

ChIP FragmentAdapterBarcode Adapter Barcode

Page 4: ChIP-Seq Data Processing and QC - Babraham Bioinf · ChIP-Seq Data Processing and QC Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2020-05

QC of raw sequenceBase Call Quality

Page 5: ChIP-Seq Data Processing and QC - Babraham Bioinf · ChIP-Seq Data Processing and QC Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2020-05

QC of raw sequenceSequence Composition

Page 6: ChIP-Seq Data Processing and QC - Babraham Bioinf · ChIP-Seq Data Processing and QC Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2020-05

QC of raw sequenceSequence Composition

Page 7: ChIP-Seq Data Processing and QC - Babraham Bioinf · ChIP-Seq Data Processing and QC Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2020-05

QC of raw sequenceAdapter Contamination

ChIP FragmentAdapterBarcode Adapter Barcode

Read

Trim Galore!Quality and Adapter Trimming

Page 8: ChIP-Seq Data Processing and QC - Babraham Bioinf · ChIP-Seq Data Processing and QC Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2020-05

Mapping ChIP Data

• All regions should be linear genomic stretches

• Standard genomic aligners are fine

– Bowtie2 http://bowtie-bio.sourceforge.net/bowtie2/

– BWA http://bio-bwa.sourceforge.net/

Page 9: ChIP-Seq Data Processing and QC - Babraham Bioinf · ChIP-Seq Data Processing and QC Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2020-05

Example Bowtie2 Mapping

• Create Genome Index (once - slow!)

• Map a single FastQ file

bowtie2-build yeast_genome.fa yeast_index

bowtie2 \

-x yeast_index \

-U data.fastq.gz \

| samtools view \

-bS \

-o data.bam

Page 10: ChIP-Seq Data Processing and QC - Babraham Bioinf · ChIP-Seq Data Processing and QC Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2020-05

Post Alignment QCMapping Statistics

sample1: 2264052 (14.66%) aligned exactly 1 time

sample2: 2698005 (18.79%) aligned exactly 1 time

sample3: 13434392 (67.08%) aligned exactly 1 time

sample4: 1108477 (6.70%) aligned exactly 1 time

sample5: 2143911 (17.58%) aligned exactly 1 time

sample6: 2980154 (13.98%) aligned exactly 1 time

Page 11: ChIP-Seq Data Processing and QC - Babraham Bioinf · ChIP-Seq Data Processing and QC Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2020-05

Post Alignment ProcessingMAPQ Filtering

• ChIP-Seq relates sequences to positions in a reference genome

• You need to be confident that the reported position is correct

• Filtering on MAPQ value (likelihood of reported position being incorrect) is an easy way to do this

• MAPQ filtering should be performed in most cases

samtools view -q 20 -b -o filtered.bam data.bam

Page 12: ChIP-Seq Data Processing and QC - Babraham Bioinf · ChIP-Seq Data Processing and QC Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2020-05

Post Alignment ProcessingDeduplication

java -jar picard.jar MarkDuplicates \

INPUT=sorted.bam \

OUTPUT=dedup.bam \

METRICS_FILE=metrics.txt

java -jar picard.jar SortSam \

INPUT=filtered.bam \

OUTPUT=sorted.bam \

SORT_ORDER=coordinate

DO NOT DEDUPLICATE AS A MATTER OF COURSE! THINK FIRST!

Page 13: ChIP-Seq Data Processing and QC - Babraham Bioinf · ChIP-Seq Data Processing and QC Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2020-05

To Deduplicate or Not?

• Deduplication can make enrichment visually clearer and help to spot truly enriched regions

• Why not just deduplicate everything?

– Quantitation compression

Page 14: ChIP-Seq Data Processing and QC - Babraham Bioinf · ChIP-Seq Data Processing and QC Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2020-05

Good Deduplication

Page 15: ChIP-Seq Data Processing and QC - Babraham Bioinf · ChIP-Seq Data Processing and QC Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2020-05

Quantitation Compression from Deduplication

Number of reads per peak

QuantitationDifference

(Dedup – Normal)

Page 16: ChIP-Seq Data Processing and QC - Babraham Bioinf · ChIP-Seq Data Processing and QC Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2020-05

Assessing Duplication

Read Density

Ob

serv

ed D

up

licat

ion

(%

)

Read DensityO

bse

rved

Du

plic

atio

n (

%)

Page 17: ChIP-Seq Data Processing and QC - Babraham Bioinf · ChIP-Seq Data Processing and QC Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2020-05

MultiQC Report

Standard Processing Workflow

Mapping Stats

Mapping Stats

Mapping Stats

FastQ File

FastQ File

FastQ File

FastQC Report

FastQC Report

FastQC Report

Trimmed FQ File

Trimmed FQ File

Trimmed FQ File

FastQC Report

FastQC Report

FastQC Report

BAM File

BAM File

BAM File

Filtered BAM

Filtered BAM

Filtered BAM

Visualisation and

Assessment

FastQC FastQC

TrimGalore

BowtieBWA

SAMTools