GimmeMotifs Documentation · 2019-11-21 · CHAPTER FOUR FULL CONTENTS 4.1Installation GimmeMotifs...

GimmeMotifs DocumentationRelease 0+untagged.80.g7f81a8a.dirty

Simon van Heeringen

Nov 21, 2019

CONTENTS

1 What is GimmeMotifs? 1

2 Getting started 3

3 Get help 5

4 Full contents 74.1 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114.3 Simple examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124.4 Tutorials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134.5 Command-line reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224.6 API documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354.7 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354.8 Auto-generated . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454.9 FAQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454.10 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

i

CHAPTER

ONE

WHAT IS GIMMEMOTIFS?

GimmeMotifs is an analysis framework for transcription factor motif analysis written in Python. It contains command-line scripts to predict de novo motifs, scan for known motifs, identify differential motifs, calculate motif enrichmentstatistics, plot sequence logos and more. In addition, all this functionality is available from a Python API.

GimmeMotifs is free and open source research software. If you find it useful please cite our paper:

• Bruse N and van Heeringen SJ, GimmeMotifs: an analysis framework for transcription factor motif anal-ysis, bioRxiv, 2018. https://www.biorxiv.org/content/10.1101/474403v1.full. doi: 10.1101/474403.

• van Heeringen SJ and Veenstra GJC, GimmeMotifs: a de novo motif prediction pipeline for ChIP-sequencing experiments, Bioinformatics. 2011 Jan 15;27(2):270-1. doi: 10.1093/bioinformatics/btq636.

1

https://www.biorxiv.org/content/10.1101/474403v1.full

https://doi.org/10.1101/474403

http://dx.doi.org/10.1093/bioinformatics/btq636

GimmeMotifs Documentation, Release 0+untagged.80.g7f81a8a.dirty

2 Chapter 1. What is GimmeMotifs?

CHAPTER

TWO

GETTING STARTED

• The easiest way to install GimmeMotifs is using bioconda on Linux or Mac. From version 0.13.0 only Python3 (>= 3.4) is supported.

• Have a look at these simple examples to get a taste of what is possible.

• Check out the more detailed tutorials.

• Full command-line reference can be found here.

• There’s also an API documentation.

3

https://bioconda.github.io/


4 Chapter 2. Getting started

CHAPTER

THREE

GET HELP

• First, check the FAQ for common issues.

• The preferred way to get support is through the Github issues page.

• Finally, you can reach me by mail or via twitter.

5

https://github.com/simonvh/gimmemotifs/issues/

mailto:[email protected]

https://twitter.com/svheeringen


6 Chapter 3. Get help

CHAPTER

FOUR

FULL CONTENTS

4.1 Installation

GimmeMotifs runs on Linux. On Windows 10 it will run fine using the Windows Subsystem for Linux. Mac OSXshould work and is included in the build test. However, as I don’t use it myself, unexpected issues might pop up. Letme know, so I can try to fix it.

4.1.1 The easiest way to install

The preferred way to install GimmeMotifs is by using conda. Activate the bioconda channel if you haven’t usedbioconda before. You only have to do this once.

$ conda config --add channels defaults$ conda config --add channels bioconda$ conda config --add channels conda-forge

You can install GimmeMotifs with one command. In the current environment:

$ conda install gimmemotifs

Or create a specific environment:

$ conda create -n gimme python=3 gimmemotifs

# Activate the environment before you use GimmeMotifs$ conda activate gimme

GimmeMotifs only supports Python 3. Don’t forget to activate the environment with source activate gimmewhenever you want to use GimmeMotifs.

Installation successful? Good. Have a look at the configuration section.

Important note on upgrading from 0.11.1

The way genomes are installed and used has been changed from 0.11.1 to 0.12.0. Basically, we have switched to thefaidx index used and supported by many other tools. This means that the old (<=0.11.1) GimmeMotifs index cannotbe used by GimmeMotifs 0.12.0 and higher. You can re-install genomes using genomepy, which is now the preferredtool for genome management for GimmeMotifs. However, because of this change you can now also directly supply agenome FASTA instead of a genome name. Pre-indexing is not required anymore.

7

https://docs.microsoft.com/en-us/windows/wsl/install-win10

https://docs.continuum.io/anaconda

https://bioconda.github.io/

https://github.com/simonvh/genomepy


4.1.2 Alternative installation

These are the prerequisites for a full GimmeMotifs installation.

• bedtools http://bedtools.readthedocs.io

• UCSC genePredToBed http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed

• UCSC bigBedToBed http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigBedToBed

• Perl + Algorithm::Cluster

In addition many of the motif tools (such as MEME) will need to be installed separately. Instructions for doing so arenot included here.

Installation from PyPI with pip is a relatively straightforward option. Install with pip as follows:

$ sudo pip install gimmemotifs

Or the (unstable) develop branch with the newest bells, whistles and bugs:

$ sudo pip install git+https://github.com/vanheeringen-lab/gimmemotifs.git@develop

If you don’t have root access, see the option below.

Ubuntu prerequisites

To install GimmeMotifs in a virtualenv, several Python packages need to be built from source.

Install the necessary packages to build numpy, scipy, matplotlib and GimmeMotifs:

sudo apt-get install python-pip python-dev build-essential libatlas-base-dev \gfortran liblapack-dev libatlas-base-dev cython libpng12-dev libfreetype6-dev \libgsl0-dev

Install via pip

Create a virtualenv and activate it according to the documentation.

Install numpy:

$ pip install numpy

Now you can install GimmeMotifs using pip. Latest stable release:

$ pip install gimmemotifs

Installation from source

Did I mention conda?

You know bioconda is amazing, right?

So. . .

These instructions are not up-to-date! Basically, you’re on your own!

Make sure to install all required dependencies.

8 Chapter 4. Full contents

http://bedtools.readthedocs.io

http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed

http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigBedToBed

https://virtualenv.readthedocs.org/en/latest/userguide.html#usage


You can download the lastest stable version of GimmeMotifs at:

https://github.com/simonvh/gimmemotifs/releases

Start by unpacking the source archive

tar xvzf gimmemotifs-0.11.0.tar.gzcd gimmemotifs-0.11.0

You can build GimmeMotifs with the following command:

python setup.py build

Run the tests to check if the basics work correctly:

python run_tests.py

If you encounter no errors, go ahead with installing GimmeMotifs (root privileges required):

sudo python setup.py install

On first run GimmeMotifs will try to locate the tools you have installed. If you have recently installed them, runningan updatedb will be necessary. Using this option GimmeMotifs will create a configuration file, the default is:

~/.config/gimmemotifs/gimmemotifs.cfg

This is a personal configuration file.

It is also possible to run the setup.py install command with the --prefix, --home, or --install-dataoptions, to install in GimmeMotifs in a different location (for instance, in your own home directory). This should befine, however, these alternative methods of installing GimmeMotifs have not been extensively tested.

4.1.3 Configuration

You will need genome FASTA files for a lot of the tools that are included with GimmeMotifs.

Download genomes automatically

The most straightforward way to download and index a genome is to use the genomepy tool, which is installed withGimmeMotifs.

$ genomepy install hg19 UCSC --annotation

Here, the hg19 genome and accompanying gene annotation will be downloaded from UCSC to the directory ~/.local/share/genomes/hg19. You can change this default location by creating/editing the file ~/.config/genomepy/genomepy.yaml and change the following line:

genome_dir: /data/genomes

Please note: in contrast to earlier versions of GimmeMotifs it is no longer necessary to index a genome.

4.1. Installation 9

https://github.com/simonvh/gimmemotifs/releases


Adding gene files

Note: If you used the genomepy command, annotation will be included automatically.

For some applications a gene file is used. This is a file containing gene annotation in BED12 format. It shouldbe located in the gene_dir, which is defined in the configuration file (see below). The file needs to be named<index_name>.bed, so for instance hg19.bed.

Other configuration options

All of GimmeMotifs configuration is stored in ~/.config/gimmemotifs/gimmemotifs.cfg. The configu-raton file is created at first run with all defaults set, but you can always edit it afterwards. It contains two sectionsmain and params that take care of paths, file locations, parameter settings etc. Additionally, every motif tool has it’sown section. Let’s have a look at the options.

[main]template_dir = /usr/share/gimmemotifs/templatesseqlogo = /usr/local/bin/seqlogoscore_dir = /usr/share/gimmemotifs/score_distsmotif_databases = /usr/share/gimmemotifs/motif_databasesgene_dir = /usr/share/gimmemotifs/genestools_dir = /usr/share/gimmemotifs/tools

• template_dir The location of the jinja2 html templates, used to generate the reports.

• seqlogo The seqlogo executable.

• score_dir To generate p-values, a pre-calculated file with mean and sd of score distributions is needed. Theseare located here.

• motif_databases For now contains only the JASPAR motifs.

• gene_dir Directory with bed-files containing gene locations. This is needed to create promoter backgroundsequences.

• tools_dir Here all tools included with GimmeMotifs are stored.

[params]fraction = 0.2use_strand = Falseabs_max = 1000analysis = mediumenrichment = 1.5width = 200lwidth = 500genome = hg19background = gc,randomcluster_threshold = 0.95available_tools = MDmodule,MEME,Weeder,GADEM,MotifSampler,trawler,Improbizer,→˓BioProspector,Posmo,ChIPMunk,JASPAR,AMD,HMS,Homertools = MDmodule,MEME,Weeder,MotifSampler,trawler,Improbizer,BioProspector,Posmo,→˓ChIPMunk,JASPAR,AMD,HMS,Homerpvalue = 0.001max_time = Nonencpus = 2motif_db = gimme.vertebrate.v3.1.pwmscan_cutoff = 0.9

(continues on next page)



(continued from previous page)

use_cache = Falsemarkov_model = 1

This section specifies all the default GimmeMotifs parameters. Most of these can also be specified at the command-linewhen running GimmeMotifs, in which case they will override the parameters specified

Configuration of MotifSampler

If you want to use MotifSampler there is one more step that you’ll have to take after installation of GimmeM-otifs. For every organism, you will need a MotifSampler background. Note that human (hg19, hg38) andmouse (mm9, mm10) background models are included, so for these organisms MotifSampler will work out ofthe box. For other organisms the necessary background files can be created with CreateBackgroundModel(which is included with GimmeMotifs or can be downloaded from the same site as MotifSampler). The back-ground model file needs to be saved in the directory /usr/share/gimmemotifs/MotifSampler and itshould be named <organism_index_name>.bg. So, for instance, if I downloaded the human epd background(epd_homo_sapiens_499_chromgenes_non_split_3.bg), this file should be saved as /usr/share/gimmemotifs/MotifSampler/hg19.bg. here.

4.2 Overview

4.2.1 Motif databases

By default GimmeMotifs uses a non-redundant, clustered database of known vertebrate motifs:gimme.vertebrate.v5.0. These motifs come from CIS-BP (http://cisbp.ccbr.utoronto.ca/) and other sources.Large-scale benchmarks using ChIP-seq peaks show that this database shows good performance and should be a gooddefault choice.

In addition, many other motif databases come included with GimmeMotifs:

• CIS-BP - All motifs from the CIS-BP database (version 1.02).

• ENCODE - ENCODE motifs from Kheradpour & Kellis (2013).

• factorbook - All non-redundant motifs from Factorbook.

• HOCOMOCOv11_HUMAN - All human motifs from HOCOMOCO version 11.

• HOCOMOCOv11_MOUSE - All mouse motifs from HOCOMOCO version 11.

• HOMER - All motifs from HOMER (downloaded Oct. 2018).

• IMAGE - The motif database from Madsen et al. (2018).

• JASPAR2018 - All CORE motifs from JASPAR 2018.

• JASPAR2018_vertebrates - CORE vertebrates motifs from JASPAR 2018.

• JASPAR2018_plants - CORE plants motifs from JASPAR 2018.

• JASPAR2018_insects - CORE insects motifs from JASPAR 2018.

• JASPAR2018_fungi -CORE fungi motifs from JASPAR 2018.

• JASPAR2018_nematodes - CORE nematodes motifs from JASPAR 2018.

• JASPAR2018_urochordata - CORE urochordata motifs from JASPAR 2018.

• JASPAR2020 - All CORE motifs from JASPAR 2020.

4.2. Overview 11

http://cisbp.ccbr.utoronto.ca/

http://cisbp.ccbr.utoronto.ca/

http://compbio.mit.edu/encode-motifs/

https://dx.doi.org/10.1093/nar/gkt1249

http://www.factorbook.org/human/chipseq/tf/

http://hocomoco11.autosome.ru/

http://hocomoco11.autosome.ru/

http://homer.ucsd.edu/homer/motif/

https://dx.doi.org/10.1101/gr.227231.117

http://jaspar.genereg.net









• JASPAR2020_vertebrates - CORE vertebrates motifs from JASPAR 2020.

• JASPAR2020_plants - CORE plants motifs from JASPAR 2020.

• JASPAR2020_insects - CORE insects motifs from JASPAR 2020.

• JASPAR2020_fungi -CORE fungi motifs from JASPAR 2020.

• JASPAR2020_nematodes - CORE nematodes motifs from JASPAR 2020.

• JASPAR2020_urochordata - CORE urochordata motifs from JASPAR 2020.

• SwissRegulon - The SwissRegulon motifs.

You can specify any of these motif databases by name in any GimmeMotifs tool. For instance:

$ gimme scan TAp73alpha.fa -p JASPAR2018_vertebrates -g hg38

or

$ gimme motifs TAp73alpha.fa TAp73alpha.motifs -p HOMER -g hg38 --known

4.3 Simple examples

4.3.1 Install a genome

Any genome on UCSC, Ensembl or NCBI can be installed automatically using genomepy. The genomepy commandtool comes installed with gimmemotifs. For instance, to download the hg38 genome from UCSC:


4.3.2 Predict de novo motifs

For a quick analysis, to see if it works:

$ gimme motifs TAp73alpha.bed -g hg19 -n gimme.p73 -a small -t Homer,MDmodule,→˓BioProspector

For a full analysis:

$ gimme motifs TAp73alpha.bed -g hg19 -n gimme.p73

Open the file gimme.p73/gimme.denovo.html in your web browser to see the results.

4.3.3 Compare motifs between data sets

$ gimme maelstrom hg19.blood.most_variable.1k.txt hg19 maelstrom.out/

The output scores of gimme maelstrom represent the combined result of multiple methods. The individual resultsfrom different methods are ranked from high-scoring motif to low-scoring motif and then aggregated using rankaggregation. The score that is shown is the -log10(p-value), where the p-value (from the rank aggregation) is correctedfor multiple testing. This procedure is then repeated with the ranking reversed. These are shown as negative values.








http://swissregulon.unibas.ch/sr/



4.3.4 Create sequence logos

$ gimme logo -i GM.5.0.p53.0004

This will generate GM.5.0.p53.0004.png.

Use the -p argument for a different pwm file. The following command will generate a logo for every motif incustom.pfm.

$ gimme logo -p custom.pfm

4.4 Tutorials

While GimmeMotifs was originally developed to predict de novo motifs in ChIP-seq peaks, it is now a full-fledgedsuite of TF motif analysis tools. You can still identify new motifs, but also scan for known motifs, find differentialmotifs in multiple sets of sequences, create sequence logos, calculate all kinds of enrichment statistics, and more!

For this tutorial I’ll assume you use bioconda. If you haven’t already done so, install GimmeMotifs.

$ conda create -n gimme python=3 gimmemotifs

And activate it!

$ source activate gimme

To locate the example files mentioned in the tutorial, locate the examples/ directory of your GimmeMotifs instal-lation. When using conda:

$ echo `conda info | grep default | awk '{print $4}'`/share/gimmemotifs/examples/home/simon/anaconda3/share/gimmemotifs/examples

Alternatively, the example data is also available from figshare and you can download it from there.

$ curl -L -o gimme.example_data.tgz https://ndownloader.figshare.com/files/8834965$ tar xvzf gimme.example_data.tgz

4.4. Tutorials 13

https://doi.org/10.6084/m9.figshare.5182897.v1


4.4.1 Find de novo motifs

As a simple example, let’s predict the CTCF motif based on ChIP-seq data from ENCODE. Download the peaks:

$ wget http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/integration_data_jan2011/→˓byDataType/peaks/jan2011/spp/optimal/hub/spp.optimal.→˓wgEncodeBroadHistoneGm12878CtcfStdAlnRep0_VS_→˓wgEncodeBroadHistoneGm12878ControlStdAlnRep0.bb

Convert the bigBed file to a BED file using bigBedToBed:

$ bigBedToBed spp.optimal.wgEncodeBroadHistoneGm12878CtcfStdAlnRep0_VS_→˓wgEncodeBroadHistoneGm12878ControlStdAlnRep0.bb Gm12878.CTCF.narrowPeak

Select the top 500 peaks, based on the signalValue column of the narrowPeak format, as input:

$ sort -k7gr Gm12878.CTCF.narrowPeak | head -n 500 > Gm12878.CTCF.top500.narrowPeak

Note that the top 500 peaks are just for the sake of the tutorial. Normally you would use a much larger sample (or allpeaks) as input for gimme motifs.

Now, the ENCODE peak coordinates are based on hg19 so we need to install the hg19 genome. For a UCSC genome,this is just a matter of running genomepy.


This will take some time. The genome sequence will be downloaded and indexed, ready for use with GimmeMotifs.

Having both an index genome and an input file, we can run gimme motifs.

$ gimme motifs Gm12878.CTCF.top500.narrowPeak gimme.CTCF -g hg19 --denovo

Once again, this will take some time. By specifying the –denovo flag, we will only look for de novo motifs. If this flagis not used, the sequences will also be scanned with known motifs. When gimme motifs is finished you can viewthe results in a web browser. gimme.CTCF/gimme.denovo.html should look a lot like this. This is what an almostperfect motif looks like, with a ROC AUC close to 1.0. The gimme motifs command also selects a minimal set ofmotifs that best explain the data using recursive feature elimination. This file is called gimme.motifs.html.

4.4.2 Motif enrichment statistics

You can use gimme motifs to compare motifs or to identify relevant known motifs for a specific input file.

Let’s evaluate known motifs for one of the example files, TAp73alpha.fa.

$ gimme motifs TAp73alpha.fa TAp73alpha.motifs --known -g hg19

You can also specify other motif files with the -p argument, for instance -p my_motifs.pfm, -p HOMER or-p JASPAR2020_vertebrates. The command will create an output directory with several output files and twodirectories. One contain the motif logos and the other the motif scan results.

$ ls TAp73alpha.motifsgenerated_background.gc.fa gimme.motifs.redundant.html logosgimme.motifs.html gimme.roc.report.txt motif_scan_results

The file generated_background.gc.fa is the FASTA file used as background. This is automatically generatedand contains sequences with the same GC% frequencies as your input sequences. The file gimme.motifs.htmlis a graphical report that can be opened in your web browser. It should look something like this:


https://genome.ucsc.edu/FAQ/FAQformat.html#format12

gimme.CTCF/gimme.denovo.html

gimme.CTCF/gimme.motifs.html


The columns are sortable (click on the header) and the full list of factors that can bind to this motif can be obtainedby hovering over the text. This file contains a non-redundant set of motifs. The full report is present in gimme.motifs.redundant.html. This report will most likely contain many very similar motifs.

The file gimme.roc.report.txt is a text report of the same results.

$ head -n 1 TAp73alpha.motifs/gimme.roc.report.txt | tr \\t \\nMotif# matches# matches backgroundP-valuelog10 P-valueROC AUCEnr. at 1% FPRRecall at 10% FDR

The motif ID, the number of matches in the sample and in the background file, followed by five statistics: the en-richment p-value (hypergeometric/Fisher’s exact), the log-transformed p-value, the ROC area under curve (AUC), theenrichment compared to background set at 1% FPR and the recall at 10% FDR.

The ROC AUC is widely used, however, it might not always be the most informative. In situations where the back-ground set is very large compared to the input set, it might give a more optimistic picture than warranted.

Let’s sort on the last statistic:

$ sort -k8g TAp73alpha.motifs/gimme.roc.report.txt | cut -f1,6,8 | tailGM.5.0.p53.0010 0.794 9.11GM.5.0.p53.0008 0.812 9.21GM.5.0.Grainyhead.0001 0.761 11.00GM.5.0.Unknown.0179 0.739 12.40GM.5.0.p53.0005 0.862 26.60GM.5.0.p53.0011 0.853 31.19GM.5.0.p53.0007 0.868 32.00GM.5.0.p53.0003 0.884 37.40GM.5.0.p53.0004 0.905 42.87GM.5.0.p53.0001 0.920 52.70

Not surprisingly, the p53 family motif is the most enriched. The Grainyhead motif somewhat resembles the p53 motif,which could explain the enrichment. Let’s visualize this. This command will create two sequence logos in PNGformat:

4.4. Tutorials 15


$ gimme logo -i GM.5.0.p53.0001,GM.5.0.Grainyhead.0001

The p53 motif, or p73 motif in this case, GM.5.0.p53.0001:

And the Grainyhead motif, GM.5.0.Grainyhead.0001:

The resemblance is clear. This also serves as a warning to never take the results from a computational tool (includingmine) at face value. . .

4.4.3 Scan for known motifs

Note: gimme scan can be used to identify motif locations. If you’re just interested in identifying enriched motifsin a data set, try gimme motifs.

To scan for known motifs, you will need a set of input sequences and a file with motifs. By default, gimme scanuses the motif database that comes included, which is based on clustered, non-redundant motifs from CIS-BP andother sources. For input sequences you can use either a BED, FASTA, narrowPeak file or a file with regions inchr:start-end format. You will also need to specify the genome, which can either be a genome installed withgenomepy or a FASTA file. The genome sequence will be used to retrieve sequences, if you have specified a BEDor region file, but also to determine a reasonable motif-specific threshold for scanning. The default genome can bespecified in the configuration file.



We will use the file Gm12878.CTCF.top500.narrowPeak that was used for de novo motif search above forknown motifs. While gimme motifs automatically extends regions from the center of the input regions (or thesummit if it is a narrowPeak file), gimme scan uses the regions as specified in the file. This means we will have tochange the size of the regions to 200 nucleotides. Depending on the type and quality of your input data, you can ofcourse make this smaller or larger.

$ cat Gm12878.CTCF.top500.narrowPeak | awk ' {print $1 "\t" $2 + $10 - 100 "\t" $2 +→˓$10 + 100}' > Gm12878.CTCF.top500.w200.bed

Note that we use the summit as the center of the peak. If you have summit information available, always use this! OK,let’s scan:

$ gimme scan Gm12878.CTCF.top500.w200.bed -g hg19 > result.scan.gff

The first time you run gimme scan for a specific combination of motif database, genome, input sequence lengthand FPR (which is 0.01 by default) it will determine a motif-specific cutoff based on random genome backgroundsequences. This will take a while. However, results will be cached for future scanning.

To get a BED file with the genomic location of motif matches add the -b argument. You can specify the motif databasewith the -p argument. This can be either one of the databases included with GimmeMotifs or a PFM file. For instance,to scan with the vertebrate motifs from JASPAR and output the results in BED format:

$ gimme scan Gm12878.CTCF.top500.w200.bed -g hg19 -b -p JASPAR2020_vertebrates >→˓result.scan.bed

By default, gimme scan gives at most one match per sequence for each motif, if the score of the match reaches thethreshold determined by the FPR cutoff.

For a very simple summary, we can just have a look at the most abundant motifs:

$ cut -f4 result.scan.bed | sort | uniq -c | sort -n | tail -n 5114 UN0322.1_ZNF417213 MA1102.2_CTCFL230 UN0310.1_HMGXB4395 UN0311.1_ZBTB2450 MA0139.1_CTCF

In this case, the most abundant motif is the CTCF motif.

The specified false positive rate (FPR), with a default of 0.01, determines the motif-specific threshold that is used forscanning. This means that the expected rate of occurrence, determined by scanning random genomic sequences, is1%. Based on the FPR, you can assume that any motif with more than 1% matches is enriched. However, for a morerobust measure of enrichment and significance of known motifs use gimme motifs with the --known argument.This command will give the enrichment, but also the ROC AUC and recall at 10% FDR and other useful statistics.

For many applications, it is useful to have motif occurrences as a table.

$ gimme scan Gm12878.CTCF.top500.w200.bed -g hg19 -t > table.count.txt

This will result in a tab-separated table with counts. Same defaults as above, at most one match per sequence permotif. Alternatively, gimme scan can report the score of best match, regardless of the value of this score.

$ gimme scan Gm12878.CTCF.top500.w200.bed -g hg19 -T > table.score.txt$ head table.score.txt | cut -f1-10

# GimmeMotifs version 0.14.0# Input: Gm12878.CTCF.top500.w200.bed# Motifs: /home/simon/anaconda3/envs/gimme/lib/python3.6/site-packages/gimmemotifs-0.→˓14.0-py3.6-linux-x86_64.egg/gimmemotifs/../data/motif_databases/gimme.vertebrate.v5.→˓0.pfm


4.4. Tutorials 17

http://jaspar.genereg.net/



# Scoring: logodds scoreGM.5.0.Sox.0001 GM.5.0.Homeodomain.0001 GM.5.0.Mixed.0001 GM.5.0.Nuclear_→˓receptor.0001 GM.5.0.Mixed.0002 GM.5.0.Nuclear_receptor.0002 GM.5.0.→˓bHLH.0001 GM.5.0.Myb_SANT.0001 GM.5.0.C2H2_ZF.0001chr11:190037-190237 2.954744 6.600900 4.930669 -3.541198 -→˓2.137985 0.544322 2.067236 -0.004395 6.256473chr14:106873577-106873777 2.433545 5.643687 5.517376 -3.351354→˓ 1.466310 0.339341 1.419619 -1.566716 4.527884chr14:106765204-106765404 3.063547 2.256005 5.517376 -4.264769→˓ 0.574826 -0.948136 1.419619 -3.344676 4.626366chr15:22461178-22461378 1.680438 2.256005 5.517376 -0.306294→˓ -3.518806 4.715836 1.077683 -3.288322 4.527884chr14:107119996-107120196 0.473710 2.256005 5.517376 -7.013300→˓ -3.518806 -0.948136 1.352120 -5.136550 4.952816

4.4.4 Find differential motifs

The gimme maelstrom command can be used to compare two or more different experiments. For instance, ChIP-seq peaks for multiple factors, ChIP-seq peaks of the same factor in different cell lines or tissues, ATAC-seq peaks orexpression data.

The input can be in one two possible formats. In both cases the genomic location should be present aschrom:start-end in the first column. The first option is a two-column format and looks like this:

loc clusterchr15:49258903-49259103 NKchr10:72370313-72370513 NKchr4:40579259-40579459 Monocyteschr10:82225678-82225878 T-cellschr5:134237941-134238141 B-cellschr5:58858731-58858931 B-cellschr20:24941608-24941808 NKchr5:124203116-124203316 NKchr17:40094476-40094676 Erythroblastchr17:28659327-28659527 T-cells

This can be the result of a clustering analysis, for instance.

The second option looks like this:

loc NK Monocytes T-cells B-cellschr12:93507547-93507747 3.11846121722 2.52277241968 1.93320358405 0.→˓197177179733chr7:38236460-38236660 1.0980120443 0.502311376556 0.200701906431 0.→˓190757068752chr10:21357147-21357347 0.528935300354 -0.0669540487727 -1.04367733597 -0.→˓34370315226chr6:115521512-115521712 0.406247786632 -0.37661318381 -0.480209252108 -0.→˓667499767004chr2:97359808-97360008 1.50162092566 0.905358101064 0.719059595262 0.→˓0313480230265chr16:16684549-16684749 0.233838577502 -0.362675820232 -0.837804056065 -0.→˓746483496024chrX:138964544-138964744 0.330000689312 -0.29126319574 -0.686082532015 -0.→˓777470189034





chr2:186923973-186924173 0.430448401897 -0.258029531121 -1.16410548462 -0.→˓723913541425chrX:113834470-113834670 0.560122313347 -0.0366707259833 -0.686082532015 -→˓0.692926848415

This is a tab-separated table, with a header describing the experiments. In case of sequencing data, such as ChIP-seq, ATAC-seq or DNaseI seq, we recommend to use log-transformed read counts which are mean-centered perrow. For optimal results, it is recommended to normalize between experiments (columns), for instance by quantilenormalization or scaling.

By default, gimme maelstrom will run in ensemble mode, where it will combine the results from different classi-fication and regression methods and statistical tests through rank aggregation. The only arguments necessary are theinput file, the genome and an output directory.

Here, we will run maelstrom on a dataset that is based on Corces et al.. The example file hg19.blood.most_variable.1k.txt contains normalized ATAC-seq read count data for several hematopoietic cell types:Monocytes, CD4+ and CD8+ T cells, NK cells, B cells and erythrocytes. This is a subset of the data and containsonly the 1000 most variable peaks (highest standard deviation). There is also a larger file, that contains more regionshg19.blood.most_variable.10k.txt and that will also take longer to run.

$ gimme maelstrom hg19.blood.most_variable.1k.txt hg19 maelstrom.blood.1k.out

There output directory contains several files:

$ ls maelstrom.blood.1k.out

The two motif files, motif.count.txt.gz and motif.score.txt.gz contain the motif scan results. Theactivity.*.out.txt files are tables with the results of the individual methods. The main result is final.out.txt, which integrates all individual methods in a final score. This score represents the combined result ofmultiple methods. The individual results from different methods are ranked from high-scoring motif to low-scoringmotif and then aggregated using the rank aggregation method from Kolde, 2012. The score that is shown is the-log10(p-value). This procedure is then repeated with the ranking reversed. These are shown as negative values.

The file gimme.maelstrom.report.html contains a graphical summary of this file that can be opened in yourweb browser.

4.4. Tutorials 19

https://dx.doi.org/10.1038/ng.3646

https://www.ncbi.nlm.nih.gov/pubmed/22247279


You can sort on the different columns by clicking on them.

The following Python snippet will create a heatmap of the results.

from gimmemotifs.maelstrom import MaelstromResultimport matplotlib.pyplot as plt

mr = MaelstromResult("maelstrom.blood.1k.out/")mr.plot_heatmap(threshold=6)plt.savefig("maelstrom.blood.1k.out/heatmap.png", dpi=300)

This will show a heatmap like this:



We see that the expected motifs for different cell types are identified. GATA/TAL1 for Erythrocytes, CEBP formonocytes, LEF/TCF for T cells (ie. Wnt signaling), SPIB and PAX5 for B cells and so on. Keep in mind thatthis shows only the most relevant motifs (-log10 p-value cutoff of 6), there are more relevant motifs. This example

4.4. Tutorials 21


was run only on 1,000 variable enhancer. A file with more regions, hg19.blood.most_variable.10k.txtfor this example, will usually yield better results.

The Jupyter notebook example maelstrom.ipynb shows a more extensive example on how to work with maelstromresults in Python.

4.5 Command-line reference

In addition to gimme motifs the GimmeMotifs package contains several other tools that can perform the varioussubsteps of GimmeMotifs, as well as other useful tools. Run them to see the options.

4.5.1 List of tools

• gimme motifs

• gimme scan

• gimme maelstrom

• gimme background

• gimme logo

• gimme match

• gimme cluster

• gimme threshold

• gimme location

• gimme diff

4.5.2 Input formats

Most tools in this section take a file in PFM format as input. This is actually a file with Position Specific ScoringMatrices (PSSMs) containing frequencies. It looks like this:

>motif10.3611 0.0769 0.4003 0.16640.2716 0.0283 0.5667 0.13810.6358 0.0016 0.3344 0.03300.0016 0.9859 0.0016 0.01570.8085 0.0063 0.0502 0.1397>motif20.2276 0.0157 0.0330 0.72840.0031 0.0016 0.9984 0.00160.0377 0.3799 0.0016 0.58560.0816 0.7096 0.0173 0.19620.1350 0.4035 0.0675 0.3987

The frequencies are separated by tabs, and in the order A,C,G,T.

4.5.3 Command: gimme motifs

The gimme motifs command can be used for known and/or de novo motif analysis. By default it runs both.


https://github.com/vanheeringen-lab/gimmemotifs/blob/master/docs/notebooks/maelstrom.ipynb


Quick example of de novo motif analysis

You can try GimmeMotifs with a small example dataset included in the examples directory, included with GimmeM-otifs. This example does not require any additional configuration if GimmeMotifs is installed correctly.

Change to a directory where you have write permissions and run the following command (substitute the filename withthe location of the file on your system):

gimme motifs /usr/share/gimmemotifs/examples/TAp73alpha.fa p73 --denovo -g hg19

The first argument is the name of the input file and the second argument defines the name of the output directory thatis created. All output files are stored in this directory. The genome is set to the hg38 genome. This requires you tohave installed hg38 using genomepy. Alternatively, you can also supply the path to a genome FASTA file with the-g option.

Depending on your computer, this analysis will take around 15-20 minutes. By default, the three top-performingde novo motif tools will be used: MEME, Homer and BioProspector. Once GimmeMotifs is finished you can openp73/p73_motif_report.html in your browser.

You can also run the same analysis with a BED file or narrowPeak file as input, or a text file with regions inchrom:start-end format

Best practices and tips

GimmeMotifs is multi-threaded

GimmeMotifs runs multi-threaded and by default uses 12 threads. All the de novo programs will be run in parallel asmuch as possible. Of course some programs are still single-threaded, and will not really benefit from multithreading.You can change the number of threads that are used using the -N parameter.

Running time

The running time of the de novo part of GimmeMotifs largely depends on three factors:

• the size of the input dataset;

• the motif prediction tools you use;

• the size of the motifs to be identified.

Size of input dataset

While GimmeMotifs is developed specifically for ChIP-seq datasets, most motif prediction tools are not. In practicethis means that it does not make much sense to predict motifs on a large amount of sequences, as this will usuallynot result in higher quality motifs. Therefore GimmeMotifs uses an absolute limit for the prediction set. By default20% of the sequences are used as input for motif prediction, but with an absolute maximum. This is controlled by theabs_max parameter in the configuration file, which is set to 1000 by default. In general, if you have a large amountof peaks, you can also consider to run GimmeMotifs on the top sequences of your input, for instance the 5000 highestpeaks.

Motif prediction tools

By default, gimme motifs uses three de novo motif prediction tools: MEME, BioProspector and Homer. These wefound to be the best performing programs for ChIP-seq data (Bruse & van Heeringen, 2018). You can include a largevariety of other tools by using the -t parameter. This will result in an increased running time and some tools, such asGADEM, will take a very long time. The following tools are supported:

• AMD

4.5. Command-line reference 23


https://www.biorxiv.org/content/10.1101/474403v1

p73/p73_motif_report.html

https://www.biorxiv.org/content/10.1101/474403v1.full

https://dx.doi.org/10.1371%2Fjournal.pone.0024576


• BioProspector

• ChIPMunk

• DiNAMO

• GADEM

• DREME

• HMS

• Homer

• Improbizer

• MDmodule

• MEME

• MotifSampler

• POSMO

• ProSampler

• RPMCMC

• Trawler

• Weeder

• XXmotif

• YAMDA

With the exception of RPMCMC and YAMDA, all tools come installed with GimmeMotifs when using the biocondapackage. AMD, HMS, Improbizer, MotifSampler and DiNAMO are not supported on OSX.

Please note: all these programs include their own license and many are free for academic or non-commercial useonly. For commercial use of any of these programs, please consult the respective author! GimmeMotifs itself can befreely used commercially.

Motif size

The default setting for motif size is -a xl, which searches for motifs with a length of up to 20. You can use differentanalysis sizes: small (up to 8), medium (up to 10) or large (up to 14). The running time can be significantlyshorter for shorter motifs. However, keep in mind that the xl analysis setting results in the best motifs in general.

Small input sets

GimmeMotifs is developed for larger datasets, such as ChIP-seq peaks, where you have the luxury to use a largefraction of your input for validation. This means that at least several hundred sequences would be optimal. If you wantto run GimmeMotifs on a small input dataset, it might be worthwile to increase the fraction used for motif predictionwith the -f argument, for instance -f 0.5.

Running on FASTA files

It is possible to run GimmeMotifs on a FASTA file as input instead of a BED file. This is detected automatically if yourinputfile is correctly formatted according to FASTA specifications. Please note that for best results, all the sequencesshould be of the same length. This is not necessary for motif prediction, but the statistics and positional preferenceplots will be wrong if sequences have different lengths.


http://ai.stanford.edu/~xsliu/BioProspector/

http://autosome.ru/ChIPMunk/

https://github.com/bonsai-team/DiNAMO

https://dx.doi.org/10.1089%2Fcmb.2008.16TT

http://meme-suite.org/index.html

https://doi.org/10.1093/nar/gkr1135

http://homer.ucsd.edu/homer/motif/

https://doi.org/10.1126/science.1102216

http://people.math.umass.edu/~conlon/mr.html

http://meme-suite.org/index.html

http://bioinformatics.intec.ugent.be/MotifSuite/motifsampler.php

https://dx.doi.org/10.1093%2Fnar%2Fgkr1135

https://github.com/zhengchangsulab/ProSampler

http://daweb.ism.ac.jp/yoshidalab/motif/

https://trawler.erc.monash.edu.au/

http://159.149.160.51/modtools/

https://github.com/soedinglab/xxmotif

https://github.com/daquang/YAMDA


Intermediate results

GimmeMotifs produces a lot of intermediate results, such as all predicted motifs, FASTA files used for validation andso on. These are deleted by default (as they can get quite large), but if you are interested in them, you can specify the-k option.

Detailed options for gimme motifs

Positional arguments

• INPUT

The inputfile needs to be in BED, FASTA, narrowPeak or region format. By default gimme motifs willtake the center of these features, and extend those to the size specified by the -s or --size argument. Bydefault this is 200 bp. Keep in mind that the smaller the regions are, the better motif discovery will work.BED-fomatted files need to contain at least three tab-separated columns describing chromosome name, startand end. The fourth column is optional. If it is specified it will be used by some motif prediction tools to sortthe features before motif prediction. FASTA files can be used as input for motif prediction. For best resultsit is recommended to use sequences of the same size. Peak files in narrowPeak format, such as produced byMACS2, can also directly be used as input. With these files, gimme motifs will use the summit of the peakand create regions of size 200 centered at this summit. Use the -s parameter to change this size. Finally, regionfiles can be used. These contain one column, with regions specified in chrom:start-end format.

• OUTDIR

The name of the output directory. All output files will be saved in this directory. If the directory already existsfiles will be overwritten.

Optional arguments

• -b BACKGROUND, --background BACKGROUND

Type of background to use. There are five options: gc, genomic, random, promoter or the path to filewith background sequences (FASTA, BED or regions). By default gc is used, which generates random regionsfrom the genome with a similar GC% as your input sequences. The genomic background will select randomgenomic regions without taking the sequence composition into account. The random background will createartificial sequences with a similar nucleotide distribution as your input sequences. The promoter backgroundwill select random promoters. For this option, your genome needs to be installed with genomepy using the--annotation option. Finally, you can select your own custom background by supplying the path to a file.

• -g GENOME

Name of the genome to use. This can be the name of a genome installed with genomepy or the path to a FASTAfile.

• --denovo

Only run de novo motif analysis. By default, the analysis includes known motifs. For specific arguments for denovo motif analysis, see below.

• --known

Only run known motif analysis. By default, the analysis includes de novo motifs. For specific arguments forknown motif analysis, see below.

• --noreport

Don’t create a HTML report, only plain text output files.





• --rawscore

Don’t use z-score normalization for motif scores. The raw logodds motif score are dependent on motif length.This means that the same logodds score will mean different things for motifs with a different length. By default,GimmeMotifs uses the scores in a set of genomic background regions to determine the a background distributionof scores. The logodds score is then scaled using this distribution.

• --nogc

By default GimmeMotifs calculates the motif logodds score distribution for regions with a different GC%. Thescore is then normalized according to the GC% bin per input sequence. Use this argument to turn this off.

• -N INT, --threads INT

Number of threads to use (default is 12).

Optional arguments for known motif analysis

• -p PFMFILE

PFM file with motifs to use for known motif analysis. You can use a custom PFM file, or use any ofthe databases included with GimmeMotifs such as, for instance, JASPAR2020_vertebrates, HOMER,HOCOCOMOv11_HUMAN or CIS-BP. By default, a database of clustered vertebrate motifs is used, gimme.vertebrate.v5.0. This database has a limited motif redundancy.

Optional arguments for de novo motif analysis

• -t TOOLS, --tools TOOLS

The de novo motif prediction tools to use, separated by commas. This can be any combination of the follow-ing: AMD, BioProspector, ChIPMunk, DiNAMO, GADEM, DREME, HMS, Homer, Improbizer, MDmodule,MEME, MEMEW, MotifSampler, POSMO, ProSampler, RPMCMC, Trawler, Weeder, XXmotif, YAMDA. Bydefault TOOLS is BioProspector,Homer,MEME. Note that some tools may not be installed. Runninggimme motifs -h will always list the tools that are supported on your installation of GimmeMotifs.

• -a, --analysis

The size of motifs to look for: small (5-8), medium (5-12), large (6-15) or xl (6-20). The larger the motifs, thelonger the de novo motif prediction will take. By default, xl will be used as this generally yields the best motifs.However, some prediction tools take a very long time in combination with the xl setting.

• k, --keepintermediate

Keep intermediate files.

• -s, --singlestrand

Only use the forward strand for prediction. By default both strands are used.

• -f FRACTION, --fraction FRACTION

This parameter controls the fraction of the sequences used for prediction. This 0.2 by default, so in this case arandomly chosen 20% of the sequences will be used for prediction. The remaining sequences will be used forvalidation (enrichment, ROC curves etc.). If you have a large set of sequences (ie. most ChIP-seq peak sets), thisis fine. However, if your set is smaller, it might be worthwile to increase this prediction fraction. The number ofsequences that is used is also influenced by the abs_max parameter in the configuration file. Regardless of the-f parameter, the total number of sequences used for motif prediction will never exceed the number specifiedby abs_max.

• -s N, --size N

This is the size of the sequences used for motif prediction. Smaller sequences will result in a faster analysis, butyou are of course limited by the accuracy of your data. For the tested ChIP-seq data sets 200 performs fine. Ifthis parameter is set to 0, the original size of the regions in the input file will be used.



4.5.4 Command: gimme maelstrom

This command can be used to identify differential motifs between two or more data sets. See the maelstrom tutorialfor more details.

Positional arguments:

INPUTFILE file with regions and clustersGENOME genomeDIR output directory

Optional arguments:

-h, --help show this help message and exit-p PFMFILE, --pfmfile PFMFILE

PFM file with motifs (default:gimme.vertebrate.v5.0.pwm)

-m NAMES, --methods NAMESRun with specific methods

Input file formats

The input can be in one of two possible formats. In both cases the genomic location should be present aschrom:start-end in the first column. The first option is a two-column format and looks like this:

loc clusterchr15:49258903-49259103 NKchr10:72370313-72370513 NKchr4:40579259-40579459 Monocyteschr10:82225678-82225878 T-cellschr5:134237941-134238141 B-cellschr5:58858731-58858931 B-cellschr20:24941608-24941808 NKchr5:124203116-124203316 NKchr17:40094476-40094676 Erythroblastchr17:28659327-28659527 T-cells

This can be the result of a clustering analysis, for instance.

The second option looks like this:

loc NK Monocytes T-cells B-cellschr12:93507547-93507747 3.118 2.522 1.933 0.197chr7:38236460-38236660 1.098 0.502 0.201 0.190chr10:21357147-21357347 0.528 -0.066 -1.04 -0.343chr6:115521512-115521712 0.406 -0.376 -0.480 -0.667chr2:97359808-97360008 1.501 0.905 0.719 0.031chr16:16684549-16684749 0.233 -0.362 -0.837 -0.746chrX:138964544-138964744 0.330 -0.291 -0.686 -0.777chr2:186923973-186924173 0.430 -0.258 -1.164 -0.723chrX:113834470-113834670 0.560 -0.036 -0.686 -0.692

This is a tab-separated table, with a header describing the experiments. In case of sequencing data, such as ChIP-seq,ATAC-seq or DNaseI seq, we recommend to use log-transformed read counts which are mean-centered per row.For optimal results, it is recommended to normalize between experiments (columns) after the log-transformatiion step,for instance by quantile normalization or scaling.

The second input format generally gives better results than the first one and would be the recommended format.



The output scores of gimme maelstrom represent the combined result of multiple methods. The individual resultsfrom different methods are ranked from high-scoring motif to low-scoring motif and then aggregated using the rankaggregation method from Kolde, 2012. The score that is shown is the -log10(p-value), where the p-value comes fromthe rank aggregation. This procedure is then repeated with the ranking reversed. These are shown as negative values.

4.5.5 Command: gimme scan

Scan a set of sequences with a set of motifs, and get the resulting matches in GFF, BED or table format. If the FASTAheader includes a chromosome location in chrom:start-end format, the BED output will return the genomiclocation of the motif match. The GFF file will always have the motif location relative to the input sequence.

A basic command would look like this:

$ gimme scan peaks.bed -g hg38 -b > motifs.bed

The threshold that is used for scanning can be specified in a number of ways. The default threshold is set to a motif-specific 1% FPR by scanning random genomic sequences. You can change the FPR with the -f option and/or the setof sequences that is used to determine the FPR with the -B option.

For instance, this command would scan with thresholds based on 5% FPR with random genomic mouse sequences.

$ gimme scan input.fa -g mm10 -f 0.05 -b > gimme.scan.bed

And this command would base a 10% FPR on the input file hg38.promoters.fa:

$ gimme scan input.fa -f 0.1 -B hg38.promoters.fa -b > gimme.scan.bed

Alternatively, you can specify the threshold as a single score. This score is relative and is based on the maximumand minimum possible score for each motif. For example, a score of 0.95 means that the score of a motif should beat least 95% of the (maximum score - minimum score). This should probably not be set much lower than 0.8, andshould be generally at least 0.9-0.95 for good specificity. Generally, as the optimal threshold might be different foreach motif, the use of the FPR-based threshold is preferred. One reason to use a single score as threshold is when youwant a match for each motif, regardless of the score. This command would give one match for every motif for everysequence, regardless of the score.

$ gimme scan input.bed -g hg38 -c 0 -n 1 -b > matches.bed

Finally, gimme scan can return the scanning results in table format. The -t will yield a table with number ofmatches, while the -T will have the score of the best match.


• INPUT

The inputfile needs to be in BED, FASTA or region format. BED-fomatted files need to contain at least threetab-separated columns describing chromosome name, start and end. Region files can also be used. Thesecontain one column, with regions specified in chrom:start-end format.

Optional arguments

• -g GENOME

Name of the genome to use. This can be the name of a genome installed with genomepy or the path to a FASTAfile.

• -p PFMFILE, --pfmfile PFMFILE

PFM file with motifs to use for known motif analysis. You can use a custom PFM file, or use any ofthe databases included with GimmeMotifs such as, for instance, JASPAR2020_vertebrates, HOMER,


https://www.ncbi.nlm.nih.gov/pubmed/22247279



HOCOCOMOv11_HUMAN or CIS-BP. By default, a database of clustered vertebrate motifs is used, gimme.vertebrate.v5.0. This database has a limited motif redundancy.

• -f, --fpr

Base the motif score threshold on this FPR. By default this is set to 1%, equivalent to -f 0.01. The scorethreshold is based on scanning random genomic regions with the same size and the same GC% distribution.This threshold is calculated once for a specific sequence size and cached. Therefore, scanning will take longerthe first time you use a specific FPR with a specific input sequence size.

• -B, --bgfile

Specify a FASTA file to use for FPR calculation, instead of taking random genomic regions.

• -c, --cutoff

Use this score cutoff instead of an FPR-based threshold. This score is relative and is based on the maximumand minimum possible score for each motif. For example, a score of 0.95 means that the score of a motif shouldbe at least 95% of the (maximum score - minimum score). This should probably not be set much lower than0.8, and should be generally at least 0.9-0.95 for good specificity. Generally, as the optimal threshold might bedifferent for each motif, the use of the FPR-based threshold is preferred.

• -n, --nreport

Maximum number of matches to report per motif per sequence. By default this is set to 1.

• -r, --norc

Don’t scan the reverse complement of the sequence. By default both strands will be scanned.

• -b, --bed

Output motif matches in BED format.

• -t, --table

Ouput number of matches in a table format, where columns represent motifs and rows represent input sequences.

• -T, --score_table

Ouput maximum motif score in a table format, where columns represent motifs and rows represent input se-quences. The score will be reported for each motif, regardless if it is a good match or not.

• -z, --zscore

Use z-score normalization for motif scores. The raw logodds motif score are dependent on motif length. Thismeans that the same logodds score will mean different things for motifs with a different length. By default,GimmeMotifs uses the scores in a set of genomic background regions to determine the a background distributionof scores. The logodds score is then scaled using this distribution.

• --gc

Use this option to calculate the motif logodds score distribution based on regions with a similar GC%.

• -N INT, --threads INT

Number of threads to use (default is 12).

4.5.6 Command: gimme background

Generate random sequences according to one of several methods:

• random - randomly generated sequence with the same dinucleotide distribution as the input sequences accord-ing to a 1st order Markov model



• genomic - sequences randomly chosen from the genome

• gc - sequences randomly chosen from the genome with the same GC% as the input sequences

• promoter - random promoter sequences

The background types gc and random need a set of input sequences in BED or FASTA format. If the input sequencesare in BED format, the genome version needs to be specified with -g.


FILE outputfileTYPE type of background sequences to generate

(random,genomic,gc,promoter)

Optional arguments:

-h, --help show this help message and exit-i FILE input sequences (BED or FASTA)-f TYPE output format (BED or FASTA-l INT length of random sequences-n NUMBER number of sequence to generate-g GENOME genome version (not for type 'random')-m N order of the Markov model (only for type 'random', default 1)

4.5.7 Command: gimme logo

Convert one or more motifs in a PFM file to a sequence logo. Most of these logos are made possible by the excellentLogomaker package. You can optionally supply a PFM file, otherwise gimme logo uses the default gimme.vertebrate.v5.0. With the -i option, you can choose one or more motifs to convert.

This will convert all the motifs in CTCF.pfm to a sequence logo:

$ gimme logo -p CTCF.pfm

This will create a logo for GM.5.0.Ets.0026 from the default database.

$ gimme logo -i GM.5.0.Ets.0026

You can specify four types of sequence logos:


https://logomaker.readthedocs.io/en/latest/


information

frequency

energy

ensembl

You can leave the motif title out with the --notitle argument.

$ gimme logo JASPAR2020_vertebrates -i MA1115.1_POU5F1 -k energy --notitle




• pfmfile

PFM file with motifs. You can use a custom PFM file, or use any of the databases included with GimmeMotifssuch as, for instance, JASPAR2020_vertebrates, HOMER, HOCOCOMOv11_HUMAN or CIS-BP.

Optional arguments:

• -i IDS, --ids IDS

Comma-separated list of motif ids (default is all ids).

• -k TYPE, --kind TYPE

Type of motif (information, frequency, energy or ensembl). The default is information.

• --notitle

Don’t include motif ID as title.

• -h, --help

Show help message.

4.5.8 Command: gimme match

Find the the best match of every motif in a PFM file with input motif(s) to a database of reference motifs. By defaultthe gimme.vertebrate.v5.0 database is used, however, other databases can be specified using the -d argu-ment. This can be a custom PFM file, or any of the databases included with GimmeMotifs such as, for instance,JASPAR2020_vertebrates, HOMER, HOCOCOMOv11_HUMAN or CIS-BP. If an ouput file is specified, a graph-ical output with aligned motifs will be created. However, this is slow for many motifs and can consume a lot ofmemory (see issue). It works fine for a few motifs at a time.


PFMFILE File with input pfms

Optional arguments:


https://github.com/simonvh/gimmemotifs/issues/5


-h, --help show this help message and exit-d DBFILE File with pfms to match against (default:

gimme.vertebrate.v5.0.pfm)-n INT Number of top matches to report-o FILE Output file with graphical report (png, svg, ps, pdf)

4.5.9 Command: gimme cluster

Cluster a set of motifs with the WIC metric.


INPUTFILE Inputfile (PFM format)OUTDIR Name of output directory

Optional arguments:

-h, --help show this help message and exit-s Don't compare reverse complements of motifs-t THRESHOLD Cluster threshold

4.5.10 Command: gimme threshold

Create a file with motif-specific thresholds based on a specific background file and a specific FPR. The FPR should bespecified as a float between 0.0 and 1.0. You can use this threshold file with the -c argument of gimme scan. Notethat gimme scan by default determines an FPR based on random genomic background sequences. You can use thiscommand to create the threshold file explicitly, or when you want to determine the threshold based on a different typeof background. For instance, this command would create a file with thresholds for the motifs in custom.pwm with aFPR of 1%, based on the sequences in promoters.fa.

$ gimme threshold custom.pwm 0.05 promoters.fa > custom.threshold.txt


PFMFILE File with pwmsFAFILE FASTA file with background sequencesFPR Desired fpr

4.5.11 Command: gimme location

Create the positional preference plots for all the motifs in the input PWM file. This will give best results if all thesequences in the FASTA-formatted inputfile have the same length. Keep in mind that this only makes sense if thesequences are centered around a similar feature (transcription start site, highest point in a peak, etc.). The defaultthreshold for motif scanning is 0.95, see gimme scan for more details.


PFMFILE File with pwmsFAFILE Fasta formatted file

Optional arguments:



-h, --help show this help message and exit-w WIDTH Set width to W (default: determined from fastafile)-i IDS Comma-separated list of motif ids to plot (default is all ids)-c CUTOFF Cutoff for motif scanning (default 0.95)

4.5.12 Command: gimme diff

This is a simple command to visualize differential motifs between different data sets. You are probably better of usinggimme maelstrom, however, in some cases this visualization might still be informative. The input consists of a numberof FASTA files, separated by a comma. These are compared to a background file. The last two arguments are a filewith pwms and and output image. The gimme diff command then produces two heatmaps (enrichment and frequency)of all enriched, differential motifs. Reported motifs are at least 3 times enriched compared to the background (changewith the -e argument) and have a minimum frequency in at least one of the input data sets of 1% (change with the-f argument). You can specify motif threshold with the -c argument (which can be a file generated with gimmethreshold).

For a command like this. . .

$ gimme diff VEGT_specific.summit.200.fa,XBRA_specific.summit.200.fa,XEOMES_specific.→˓summit.200.fa random.w200.fa gimme_diff_tbox.png -p tbox.pwm -f 0.01 -c threshold.0.→˓01.txt

. . . the output will look like this (based on ChIP-seq peaks of T-box factors from Gentsch et al. 2013):

The image layout is not always optimal. If you want to customize the image, you can either save it as a .svg file,or use the numbers that are printed to stdout. The columns are in the same order as the image, the row order may bedifferent as these are clustered before plotting.

Note that the results might differ quite a lot depending on the threshold that is chosen! Compare for instance an FPRof 1% vs an FPR of 5%.


https://doi.org/10.1016/j.celrep.2013.08.012



FAFILES FASTA-formatted inputfiles OR a BED file with anidentifier in the 4th column, for instance a clusternumber.

BGFAFILE FASTA-formatted background filePNGFILE outputfile (image)

Optional arguments:

-h, --help show this help message and exit-p PFMFILE, --pfmfile PFMFILE

PWM file with motifs (default:gimme.vertebrate.v3.1.pwm)

-c , --cutoff motif score cutoff or file with cutoffs (default 0.9)-e MINENR, --enrichment MINENR

minimum enrichment in at least one of the datasetscompared to background

-f MINFREQ, --frequency MINFREQminimum frequency in at least one of the datasets

-g VERSION, --genome VERSIONGenome version. Only necessary in combination with aBED file with clusters as inputfile.

4.6 API documentation

4.7 Examples

You can follow the interactive API example documentaion in a Jypyter notebook using mybinder: .. image:: https://mybinder.org/badge_logo.svg :target: https://mybinder.org/v2/gh/vanheeringen-lab/gimmemotifs/develop?filepath=docs/api_examples.ipynb

4.7.1 Working with motifs

The Motif class stores motif information. There are several ways to create a Motif instance.

from gimmemotifs.motif import Motif,read_motifs

# Read from filemotifs = read_motifs("example.pfm")

for motif in motifs:print(motif)

AP1_nTGAGTCAyCTCF_CCAsyAGrkGGCr

# Create from scratchm = Motif([[0,1,0,0],[0,0,1,0]])m.id = "CpG"print(m)

4.6. API documentation 35

https://mybinder.org/badge_logo.svg

https://mybinder.org/badge_logo.svg

https://mybinder.org/v2/gh/vanheeringen-lab/gimmemotifs/develop?filepath=docs/api_examples.ipynb

https://mybinder.org/v2/gh/vanheeringen-lab/gimmemotifs/develop?filepath=docs/api_examples.ipynb


CpG_CG

# Or from a consensus sequencefrom gimmemotifs.motif import motif_from_consensusap1 = motif_from_consensus("TGASTCA")print(ap1.to_pwm())

>TGASTCA0.0001 0.0001 0.0001 0.99980.0001 0.0001 0.9998 0.00010.9998 0.0001 0.0001 0.00010.0001 0.4999 0.4999 0.00010.0001 0.0001 0.0001 0.99980.0001 0.9998 0.0001 0.00010.9998 0.0001 0.0001 0.0001

Read motifs from files in other formats.

motifs = read_motifs("MA0099.3.jaspar", fmt="jaspar")print(motifs[0])

You can convert a motif to several formats.

motifs = read_motifs("example.pfm")

# pwmprint(motifs[0].to_pwm())

>AP10.5558 0.1469 0.2734 0.02400.0020 0.0015 0.0017 0.99480.0039 0.0019 0.9502 0.04390.9697 0.0220 0.0018 0.00650.0377 0.3311 0.6030 0.02830.0033 0.0031 0.0043 0.98930.0188 0.9775 0.0023 0.00140.9951 0.0021 0.0012 0.00150.0121 0.3096 0.1221 0.5561

# pfmprint(motifs[0].to_pfm())

>AP1555.8 146.9 273.4 24.02.0 1.5 1.7 994.80000000000013.9 1.9 950.2 43.9969.7 22.0 1.8 6.537.699999999999996 331.1 603.0 28.2999999999999973.3 3.1 4.3 989.318.8 977.5 2.3 1.4995.1 2.1 1.2 1.512.1 309.59999999999997 122.1 556.1

# consensus sequenceprint(motifs[0].to_consensus())



ATGAsTCAy

# TRANSFACprint(motifs[0].to_transfac())

DE AP1 unknown0 555 146 273 24 A1 2 1 1 994 T2 3 1 950 43 G3 969 22 1 6 A4 37 331 603 28 s5 3 3 4 989 T6 18 977 2 1 C7 995 2 1 1 A8 12 309 122 556 yXX

# MEMEprint(motifs[0].to_meme())

MOTIF AP1BL MOTIF AP1 width=0 seqs=0letter-probability matrix: alength= 4 w= 9 nsites= 1000.1 E= 00.5558 0.1469 0.2734 0.0240.002 0.0015 0.0017 0.99480.0039 0.0019 0.9502 0.04390.9697 0.022 0.0018 0.00650.0377 0.3311 0.603 0.02830.0033 0.0031 0.0043 0.98930.0188 0.9775 0.0023 0.00140.9951 0.0021 0.0012 0.00150.0121 0.3096 0.1221 0.5561

Some other useful tidbits.

m = motif_from_consensus("NTGASTCAN")print(len(m))

9

# Trim by information contentm.trim(0.5)print(m.to_consensus(), len(m))

TGAsTCA 7

# Slicesprint(m[:3].to_consensus())

TGA

# Shufflerandom_motif = motif_from_consensus("NTGASTGAN").randomize()print(random_motif)

4.7. Examples 37


random_snCTAGTAn

To convert a motif to an image, use plot_logo(). Many different file formats are supported, such as png, svg andpdf.

m = motif_from_consensus("NTGASTCAN")m.plot_logo(fname="ap1.png")

4.7.2 Motif scanning

For very simple scanning, you can just use a Motif instance. Let’s say we have a FASTA file called test.fa thatlooks like this:

>seq1AAAAAAAAAAAAAAAAAAAAAA>seq2CGCGCGTGAGTCACGCGCGCGCG

TGASTCAAAAAAAAAATGASTCA

Now we can use this file for scanning.

from gimmemotifs.motif import motif_from_consensusfrom gimmemotifs.fasta import Fasta

f = Fasta("test.fa")m = motif_from_consensus("TGAsTCA")

m.pwm_scan(f)

{'seq1': [], 'seq2': [6, 6], 'seq3': [0, 16, 0, 16]}

This return a dictionary with the sequence names as keys. The value is a list with positions where the motif matches.Here, as the AP1 motif is a palindrome, you see matches on both forward and reverse strand. This is more clear whenwe use pwm_scan_all() that returns position, score and strand for every match.

m.pwm_scan_all(f)



{'seq1': [],'seq2': [(6, 9.02922042678255, 1), (6, 9.02922042678255, -1)],'seq3': [(0, 8.331251500673487, 1),(16, 8.331251500673487, 1),(0, 8.331251500673487, -1),(16, 8.331251500673487, -1)]}

The number of matches to return is set to 50 by default, you can control this by setting the nreport argument. Usescan_rc=False to only scan the forward orientation.

m.pwm_scan_all(f, nreport=1, scan_rc=False)

{'seq1': [],'seq2': [(6, 9.02922042678255, 1)],'seq3': [(0, 8.331251500673487, 1)]}

While this functionality works, it is not very efficient. To scan many motifs in potentially many sequences, use thefunctionality in the scanner module. If you only want the best match per sequence, is a utility function calledscan_to_best_match, otherwise, use the Scanner class.

from gimmemotifs.motif import motif_from_consensusfrom gimmemotifs.scanner import scan_to_best_match

m1 = motif_from_consensus("TGAsTCA")m1.id = "AP1"m2 = motif_from_consensus("CGCG")m2.id = "CG"motifs = [m1, m2]

print("motif\tpos\tscore")result = scan_to_best_match("test.fa", motifs)for motif, matches in result.items():

for match in matches:print("{}\t{}\t{}".format(motif, match[1], match[0]))

motif pos scoreCG 0 -18.26379789133924CG 0 5.554366880674296CG 0 -7.743307225501047AP1 0 -20.052563923836903AP1 6 9.029486018303187AP1 0 8.331550321011443

The matches are in the same order as the sequences in the original file.

While this function can be very useful, a Scanner instance is much more flexible. You can scan different inputformats (BED, FASTA, regions), and control the thresholds and output.

As an example we will use the file Gm12878.CTCF.top500.w200.fa that contains 500 top CTCF peaks. Wewill get the CTCF motif and scan this file in a number of different ways.

from gimmemotifs.motif import read_motifsfrom gimmemotifs.scanner import Scannerfrom gimmemotifs.fasta import Fastaimport numpy as np

# Input file


4.7. Examples 39



fname = "Gm12878.CTCF.top500.w200.fa"

# Select the CTCF motif from the default motif databasemotifs = [m for m in read_motifs() if "CTCF" in m.factors['direct']][:1]

# Initialize the scanners = Scanner()s.set_motifs(motifs)

Now let’s get the best score for the CTCF motif for each sequence.

scores = [r[0] for r in s.best_score("examples/Gm12878.CTCF.top500.w200.fa")]print("{}\t{:.2f}\t{:.2f}\t{:.2f}".format(

len(scores),np.mean(scores),np.min(scores),np.max(scores)))

500 11.00 1.45 15.07

In many cases you’ll want to set a threshold. In this example we’ll use a 1% FPR threshold, based on scanningrandomly selected sequences from the ghg38 genome. The first time you run this, it will take a while. However, thetresholds will be cached. This means that for the same combination of motifs and genome, the previously generatedthreshold will be used.

# Set a 1% FPR threshold based on random hg38 sequences.set_genome("hg38")s.set_threshold(fpr=0.01)

# get the number of sequences with at least one matchcounts = [n[0] for n in s.count("Gm12878.CTCF.top500.w200.fa", nreport=1)]print(counts[:10])

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# or the grand total of number of sequences with 1 matchprint(s.total_count("examples/Gm12878.CTCF.top500.w200.fa", nreport=1))

[404]

# Scanner.scan() just gives all informationseqs = Fasta("Gm12878.CTCF.top500.w200.fa")[:10]for i,result in enumerate(s.scan(seqs)):

seqname = seqs.ids[i]for m,matches in enumerate(result):

motif = motifs[m]for score, pos, strand in matches:

print(seqname, motif, score, pos, strand)

chr11:190037-190237 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.4959558370929 143 -1chr11:190037-190237 C2H2_ZF_Average_200_CCAsyAGrkGGCr 10.821440417077262 22 -1chr11:190037-190237 C2H2_ZF_Average_200_CCAsyAGrkGGCr 10.658439190070851 82 -1chr14:106873577-106873777 C2H2_ZF_Average_200_CCAsyAGrkGGCr 14.16061638444734 120 -1





chr14:106873577-106873777 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.72460285196088 83 -1chr14:106873577-106873777 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.450778540447134 27 -1chr14:106873577-106873777 C2H2_ZF_Average_200_CCAsyAGrkGGCr 10.037330832055455 7 -1chr14:106873577-106873777 C2H2_ZF_Average_200_CCAsyAGrkGGCr 8.998038360035828 159 -1chr14:106873577-106873777 C2H2_ZF_Average_200_CCAsyAGrkGGCr 8.668660161058972 101 -1chr14:106765204-106765404 C2H2_ZF_Average_200_CCAsyAGrkGGCr 14.16061638444734 145 -1chr14:106765204-106765404 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.848270770440264 185 -1chr14:106765204-106765404 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.668171128367552 165 -1chr14:106765204-106765404 C2H2_ZF_Average_200_CCAsyAGrkGGCr 12.785329839873164 27 -1chr14:106765204-106765404 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.886792072933595 126 -1chr14:106765204-106765404 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.25063146496227 67 -1chr15:22461178-22461378 C2H2_ZF_Average_200_CCAsyAGrkGGCr 14.16061638444734 28 -1chr15:22461178-22461378 C2H2_ZF_Average_200_CCAsyAGrkGGCr 14.16061638444734 185 -1chr15:22461178-22461378 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.261096435278661 67 -1chr15:22461178-22461378 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.450778540447134 147 -1chr15:22461178-22461378 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.022594547749485 126 -1chr15:22461178-22461378 C2H2_ZF_Average_200_CCAsyAGrkGGCr 10.194691222675097 7 -1chr14:107119996-107120196 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.886792072933595 37 -1chr14:107119996-107120196 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.886792072933595 95 -1chr14:107119996-107120196 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.886792072933595 153 -1chr14:107119996-107120196 C2H2_ZF_Average_200_CCAsyAGrkGGCr 9.972530270193543 75 -1chr14:107119996-107120196 C2H2_ZF_Average_200_CCAsyAGrkGGCr 9.949273408029276 17 -1chr14:107119996-107120196 C2H2_ZF_Average_200_CCAsyAGrkGGCr 9.949273408029276 133 -1chr14:107238917-107239117 C2H2_ZF_Average_200_CCAsyAGrkGGCr 14.16061638444734 92 -1chr14:107238917-107239117 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.25063146496227 34 -1chr14:107238917-107239117 C2H2_ZF_Average_200_CCAsyAGrkGGCr 9.246743494108388 15 -1chr6:53036754-53036954 C2H2_ZF_Average_200_CCAsyAGrkGGCr 8.764279993851783 62 1chr14:107147705-107147905 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.697109967765122 33 -1chr14:107147705-107147905 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.204664711685334 149 -1chr14:107147705-107147905 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.222131525579154 92 -1chr14:107147705-107147905 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.222131525579154 130 -1chr14:50328834-50329034 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.148765667117496 133 1chr1:114889205-114889405 C2H2_ZF_Average_200_CCAsyAGrkGGCr 9.752478102244137 124 1

4.7.3 Finding de novo motifs

Let’s take the Gm12878.CTCF.top500.w200.fa file as example again. For a basic example we’ll just use twomotif finders, as they’re quick to run.

from gimmemotifs.denovo import gimme_motifs

peaks = "Gm12878.CTCF.top500.w200.fa"outdir = "CTCF.gimme"params = {

"tools": "Homer,BioProspector",}

motifs = gimme_motifs(peaks, outdir, params=params)

2017-06-30 07:37:00,079 - INFO - starting full motif analysis2017-06-30 07:37:00,082 - INFO - preparing input (FASTA)2017-06-30 07:37:32,949 - INFO - starting motif prediction (medium)2017-06-30 07:37:32,949 - INFO - tools: BioProspector, Homer2017-06-30 07:37:40,540 - INFO - BioProspector_width_5 finished, found 5 motifs


4.7. Examples 41



2017-06-30 07:37:41,308 - INFO - BioProspector_width_7 finished, found 5 motifs2017-06-30 07:37:41,609 - INFO - BioProspector_width_6 finished, found 5 motifs2017-06-30 07:37:42,003 - INFO - BioProspector_width_8 finished, found 5 motifs2017-06-30 07:37:44,054 - INFO - Homer_width_5 finished, found 5 motifs2017-06-30 07:37:45,201 - INFO - Homer_width_6 finished, found 5 motifs2017-06-30 07:37:48,246 - INFO - Homer_width_7 finished, found 5 motifs2017-06-30 07:37:50,503 - INFO - Homer_width_8 finished, found 5 motifs2017-06-30 07:37:54,649 - INFO - BioProspector_width_9 finished, found 5 motifs2017-06-30 07:37:56,169 - INFO - BioProspector_width_10 finished, found 5 motifs2017-06-30 07:37:56,656 - INFO - Homer_width_9 finished, found 5 motifs2017-06-30 07:37:59,313 - INFO - Homer_width_10 finished, found 5 motifs2017-06-30 07:37:59,314 - INFO - all jobs submitted2017-06-30 07:39:21,298 - INFO - predicted 60 motifs2017-06-30 07:39:21,326 - INFO - 53 motifs are significant2017-06-30 07:39:21,410 - INFO - clustering significant motifs.2017-06-30 07:39:47,031 - INFO - creating reports2017-06-30 07:40:41,024 - INFO - finished2017-06-30 07:40:41,024 - INFO - output dir: CTCF.gimme2017-06-30 07:40:41,024 - INFO - report: CTCF.gimme/motif_report.html

This will basically run the same pipeline as the gimme motifs command. All output files will be stored in outdirand gimme_motifs returns a list of Motif instances. If you only need the motifs but not the graphical report,you can decide to skip it by setting create_report to False. Additionally, you can choose to skip clustering(cluster=False) or to skip calculation of significance (filter_significant=False). For instance, thefollowing command will only predict motifs and cluster them.

motifs = gimme_motifs(peaks, outdir,params=params, filter_significant=False, create_report=False)

All parameters for motif finding are set by the params argument.

Although the gimme_motifs() function is probably the easiest way to run the de novo finding tools, you can alsorun any of the tools directly. In this case you would also have to supply the background file if the specific tool requiresit.

from gimmemotifs.tools import get_toolfrom gimmemotifs.background import MatchedGcFasta

m = get_tool("homer") # tool name is case-insensitive

# Create a background fasta file with a similar GC%fa = MatchedGcFasta("TAp73alpha.fa", number=1000)fa.writefasta("bg.fa")

# Run motif predictionparams = {

"background": "bg.fa","width": "20","number": 5,

}

motifs, stdout, stderr = m.run("TAp73alpha.fa", params=params)print(motifs[0].to_consensus())

nnnCnTGynnnGrCwTGyyn



4.7.4 Motif statistics

With some motifs, a sample file and a background file you can calculate motif statistics. Let’s say I wanted to knowwhich of the p53-family motifs is most enriched in the file TAp73alpha.fa.

First, we’ll generate a GC%-matched genomic background. Then we only select p53 motifs.

from gimmemotifs.background import MatchedGcFastafrom gimmemotifs.fasta import Fastafrom gimmemotifs.stats import calc_statsfrom gimmemotifs.motif import default_motifs

sample = "TAp73alpha.fa"bg = MatchedGcFasta(sample, genome="hg19", number=1000)

motifs = [m for m in default_motifs() if any(f in m.factors['direct'] for f in ["TP53→˓", "TP63", "TP73"])]

stats = calc_stats(motifs=motifs, fg_file=sample, bg_file=bg, genome="hg19")

print("Stats for", motifs[0])for k, v in stats[str(motifs[0])].items():

print(k,v)

print()

best_motif = sorted(motifs, key=lambda x: stats[str(x)]["recall_at_fdr"])[-1]print("Best motif (recall at 10% FDR):", best_motif)

Stats for GM.5.0.p53.0001_rCATGyCCnGrCATGyrecall_at_fdr 0.833fraction_fpr 0.416score_at_fpr 9.05025905735enr_at_fpr 41.6max_enrichment 55.5phyper_at_fpr 3.33220067463e-132mncp 1.85474606318roc_auc 0.9211925roc_auc_xlim 0.0680115pr_auc 0.927368602993max_fmeasure 0.867519181586ks_pvalue 0.0ks_significance inf

Best motif (recall at 10% FDR): GM.5.0.p53.0001_rCATGyCCnGrCATGy

A lot of statistics are generated and you will not always need all of them. You can choose one or more specific metricswith the additional stats argument.

metrics = ["roc_auc", "recall_at_fdr"]stats = calc_stats(motifs=motifs, fg_file=sample, bg_file=bg, stats=metrics, genome=→˓"hg19")

for metric in metrics:for motif in motifs:

print("{}\t{}\t{:.2f}".format(motif.id, metric, stats[str(motif)][metric]))

4.7. Examples 43


p53_M5923_1.01 roc_auc 0.63p53_M5922_1.01 roc_auc 0.64p53_Average_10 roc_auc 0.83p53_Average_8 roc_auc 0.93p53_M3568_1.01 roc_auc 0.83p53_M5923_1.01 recall_at_fdr 0.00p53_M5922_1.01 recall_at_fdr 0.00p53_Average_10 recall_at_fdr 0.24p53_Average_8 recall_at_fdr 0.83p53_M3568_1.01 recall_at_fdr 0.42

4.7.5 Motif comparison

Compare two motifs.

from gimmemotifs.comparison import MotifComparerfrom gimmemotifs.motif import motif_from_consensusfrom gimmemotifs.motif import read_motifsm1 = motif_from_consensus("RRRCATGYYY")m2 = motif_from_consensus("TCRTGT")

mc = MotifComparer()score, pos, orient = mc.compare_motifs(m1, m2)

if orient == -1:m2 = m2.rc()

pad1, pad2 = "", ""if pos < 0:

pad1 = " " * -poselif pos > 0:

pad2 =" " * posprint(pad1 + m1.to_consensus())print(pad2 + m2.to_consensus())

rrrCATGyyyACAyGA

Find closest match in a motif database.

motifs = [motif_from_consensus("GATA"),motif_from_consensus("NTATAWA"),motif_from_consensus("ACGCG"),

]

mc = MotifComparer()results = mc.get_closest_match(motifs, dbmotifs=read_motifs("HOMER"), metric="seqcor")

# Load motifsdb = read_motifs("HOMER", as_dict=True)

for motif in motifs:match, scores = results[motif.id]print("{}: {} - {:.3f}".format(motif.id, match, scores[0]))dbmotif = db[match]





orient = scores[2]if orient == -1:

dbmotif = dbmotif.rc()padm, padd = 0, 0if scores[1] < 0:

padm = -scores[1]elif scores[1] > 0:

padd = scores[1]print(" " * padm + motif.to_consensus())print(" " * padd + dbmotif.to_consensus())print()

GATA: AGATAASR_GATA3(Zf)/iTreg-Gata3-ChIP-Seq(GSE20898)/Homer - 0.823GATA

AGATAAnr

NTATAWA: NGYCATAAAWCH_CDX4(Homeobox)/ZebrafishEmbryos-Cdx4.Myc-ChIP-Seq(GSE48254)/→˓Homer - 0.747nTATAwA

nnynrTAAAnnn

ACGCG: NCCACGTG_c-Myc(bHLH)/LNCAP-cMyc-ChIP-Seq(Unpublished)/Homer - 0.744ACGCG

CACGTGGn

4.8 Auto-generated

This part of the API documentation is not yet complete.

4.8.1 The Motif class

4.8.2 Prediction of de novo motifs

4.8.3 Motif scanning

4.8.4 Maelstrom

4.8.5 Motif activity prediction

4.8.6 Motif statistics

4.8.7 Motif comparison

4.9 FAQ

4.9.1 ‘i’ format requires -2147483648 <= number <= -2147483646

If you get the following error with gimme maelstrom:

4.8. Auto-generated 45


` 'i' format requires -2147483648 <= number <= -2147483646 `

this means that you should decrease the amount of input sequences. It is a bug that should be solved with a newerversion of Python. However, it might be bestg anyway to use a limited set of sequences (< ~100k) as input.

4.9.2 Sorry, motif prediction tool [X] is not supported

If gimme motifs does not recognize a motif tools that should be installed, you can update the configuration file. Thisfile is likely located at ~/.config/gimmemotifs/gimmemotifs.cfg.

Edit this file and update the following lines under the [params] section:

available_tools = MDmodule,MEME,MEMEW,Weeder,GADEM,MotifSampler,Trawler,Improbizer,→˓BioProspector,Posmo,ChIPMunk,AMD,HMS,Homertools = MDmodule,MEME,Weeder,MotifSampler,trawler,Improbizer,BioProspector,Posmo,→˓ChIPMunk,AMD,Homer

Add the tool that you want to the available_tools parameter. Keep in mind the exact upper-/lower-case combinationthat GimmeMotifs uses. By updating the tools parameter you can set the tools that gimme motifs uses by default. Thiscan always be changed at the command-line.

In addition, you might also have to update the binary location. Either update this section if it exists or add it. Forinstance, to set this information for MEME:

[MEME]bin = /home/simon/anaconda2/envs/gimme/gimmemotifs/tools/meme.bindir = /home/simon/anaconda2/envs/gimme/gimmemotifs/tools

The dir variable ususally doesn’t matter and you can set it the same directory as where the binary is located.

4.9.3 I get motifs that have differential scores in gimme maelstrom, however, thenumber is not different across clusters

The different methods use different ways to rank the motifs. The hypergeometric test is the only one that uses motifcounts. All the other methods use the PWM logodds score of the best match. While the counts may not be differentacross clusters, the scores most likely are.

4.9.4 I have upgraded GimmeMotifs and now it doesn’t find my genome

The genome index in GimmeMotifs has changed, see upgradegenome_.

4.9.5 I cannot run gimme index anymore

The genome index in GimmeMotifs has changed, see upgradegenome_.

4.9.6 I get ‘RuntimeError: Invalid DISPLAY variable’

The default matplotlib configuration expects a display. Probably you are running GimmeMotifs on a server withoutan X server. There are several ways to solve it.



Option 1

Change the matplotlib configuration. Find the path of the matplotlib installation of your current environment (makesure to activate the environment you use for GimmeMotifs first).

$ python -c "import matplotlib; print(matplotlib.__file__)"/home/simon/anaconda2/envs/gimme3/lib/python3.5/site-packages/matplotlib/__init__.py

So, matplotlib is in /home/simon/anaconda2/envs/gimme3/lib/python3.5/site-packages/matplotlib/. Now you can edit <MATPLOT_DIR>/mpl-data/matplotlibrc. In my example case thiswould be:

/home/simon/anaconda2/envs/gimme3/lib/python3.5/site-packages/matplotlib/mpl-data/matplotlibrc

Change the line

backend : Qt5Agg

to

backend : Agg

You can also put a matplotlibrc file in $HOME/.config/matplotlib.

Option 2

Run GimmeMotifs via xvfb-run. If this program is installed, you can simply run GimmeMotifs in a virtual X serverenvironment.

For example:

$ xvfb-run gimme motifs [args]

4.9.7 I get a KeyError when running gimme maelstrom

You get an error like this:

File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc→˓(pandas/_libs/index.c:5280)File "pandas/_libs/index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc→˓(pandas/_libs/index.c:5126)File "pandas/_libs/hashtable_class_helper.pxi", line 1210, in pandas._libs.hashtable.→˓PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)File "pandas/_libs/hashtable_class_helper.pxi", line 1218, in pandas._libs.hashtable.→˓PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)KeyError: '5'

This a bug in gimme maelstrom. The column headers can’t be numbers. Change this to a word, for instancecluster5 or col5.

4.9. FAQ 47


4.10 Acknowledgments

We are grateful to Waseem Akhtar, Robert Akkers, Max Koeppel, Evelyn Kouwenhoven, Marion Lohrum, LeonieSmeenk and Jo Zhou for providing data and feedback during GimmeMotifs development. Also we would like to thankStefanie Bartels, Adalberto Costessi, Joost Martens and Nagesha Rao for testing and helpful discussion. Of courseGimmeMotifs by itself wouldn’t be able to do anything, if there wasn’t such a number of excellent tools available.Therefore, a big thanks to all the authors of the motif prediction programs for making their software publicly availableand allowing me to distribute them with GimmeMotifs! In addition, I would like to thank Wolfgang Lugmayr andAaron Statham for various bugfixes.


GimmeMotifs Documentation · 2019-11-21 · CHAPTER FOUR FULL CONTENTS 4.1Installation GimmeMotifs...

Documents

Transcript of GimmeMotifs Documentation · 2019-11-21 · CHAPTER FOUR FULL CONTENTS 4.1Installation GimmeMotifs...