Thesis projects in the Bioinformatics Group

Last updated: 12-03-2018

Unraveling bacterial lifestyles by comparative genomics

Pangenomic QTL analysis

Systematic exploration of the biosynthetic capacity of a collection of well-characterized human colonic bacteria

Identification of high-quality SNPs in allopolyploid crops

Pangenomics for crops

Machine learning to predict protein-protein interactions in natural product biosynthetic assembly lines

Integration of kinase specificity knowledge in the analysis of peptide microarrays

Integrative QTL analysis

Protein function prediction

RDF for QTL candidate gene prioritization

Transcriptional networks from eQTL data

Novel enzymes for fragrance and flavour

Biosynthetic Gene Cluster Prediction Using Phylogenomic Synteny Networks

MADS-box transcription factor - DNA interaction specificity

Mining the dark genome for ‘known unknowns’

Integrative analysis of Arabidopsis BRC1 ChIP- and RNA-seq data

Combining methods for prediction of protein interfaces

Prediction of homomeric protein interfaces

Arabidopsis root development cell trajectories

Protein contact prediction by biclustering

Inferring Gene Ontology annotation from -omics data

Calling repetitive sequences in nanopore data

Haplotyping using 10X Genomics synthetic long reads

Predicting pathogen-host protein-protein interactions

Metagenomic analysis of soil samples using nanopore sequencing of ribosomal operons

Analysis of variations in event duration of nanopore reads

Mapping chemical substructures in the natural product space

In silico generation of mass-based natural product substructures

Phylogenetic distribution and evolution of microbial terpene synthases

Phylogeny-corrected discovery of gene modules encoding for specialized molecule substructures

Topic modelling to discover gene modules encoding for specialized molecule substructures

Prioritizing genomic variants

Unraveling bacterial lifestyles by comparative genomics

Supervisors Mauricio Dimitrov, Victor Carrion, Marnix Medema and Jos Raaijmakers Type Data analysis, Comparative genomics, Phylogenetics Requirements Programming in Python, Advanced Bioinformatics Skills Genomics, Programming (some R skills are desirable) Timestamp August 15, 2017 Description Mammals and plants are externally and internally colonized by diverse microbial communities, comprising bacteria, archaea, fungi and protists [1, 2]. These microorganisms play crucial roles in plant development, growth, fitness and diversification [3-5]. In this project, you will explore genomes and evolutionary relationships of bacterial species and strains that belong to the same genus, but present completely different lifestyles and occupy different niches in their respective hosts. The main goal of this project is to identify genes or genetic markers that define specific bacterial lifestyles. Within the species belonging to the Burkholderia and Pseudomonas genera [6], obligate and opportunistic human pathogens, plant pathogens and beneficial bacteria can be found. In order to define bacterial lifestyles in these genera, phylogenetic analyses and comparative genomics will be carried out using pipelines adapted and developed at NIOO-KNAW [7]. After identification of clusters of orthologous groups (COGs) and BGCs (e.g., using antiSMASH [8]), you will be able to determine i) whether particular genes or gene clusters are specific for certain bacterial lifestyle(s), ii) if genes defining a particular bacterial lifestyle are shared by phylogenetically distinct bacteria, i.e. Burkholderia and Pseudomonas spp., and iii) which BGCs are associated with a certain lifestyle and what is their putative function. Moreover, these analyses will provide insights into the evolutionary histories of related bacteria that adapted to different environments. 
Finally, this project will reveal genetic markers that could potentially be used to rapidly discriminate bacterial lifestyles and to pinpoint pathogenic strains. Such rapid and reliable identification may be applied to various research fields, such as medicine and agriculture. Key references 1. Mendes, R., et al., ISME J, 2015. 2. Hacquard, S., et al., Cell Host & Microbe, 2015. 17(5): p. 603-616. 3. Truyens, S., et al., Environmental Microbiology Reports, 2015. 7(1): p. 40-50. 4. Mendes, R., et al., FEMS Microbiology Reviews, 2013. 37(5): p. 634-663. 5. Hardoim, P.R., et al., Microbiology and Molecular Biology Reviews, 2015. 79(3): p. 293-320. 6. Eberl, L., et al., F1000Research, 2016. 7. Bai, Y., et al., Nature, 2015. 528(7582): p. 364-369. 8. Weber, T., et al., Nucleic acids research, 2015. 43(W1): p. W237-W243.
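Point i) in the description above, whether particular genes are specific to a lifestyle, can be illustrated with a small presence/absence comparison. This is a minimal sketch, assuming orthologous groups have already been computed and each genome carries exactly one lifestyle label. The strain names, group identifiers and the function `lifestyle_specific_groups` are hypothetical and are not part of the NIOO-KNAW pipelines.

```python
from collections import defaultdict

def lifestyle_specific_groups(presence, lifestyle):
    """Return {lifestyle: groups found in every genome with that lifestyle
    and in no genome with any other lifestyle}."""
    genomes_by_lifestyle = defaultdict(list)
    for genome, label in lifestyle.items():
        genomes_by_lifestyle[label].append(genome)

    specific = {}
    for label, members in genomes_by_lifestyle.items():
        # Groups shared by all genomes of this lifestyle.
        core = set.intersection(*(presence[g] for g in members))
        # Groups seen in at least one genome of a different lifestyle.
        others = set()
        for genome, groups in presence.items():
            if lifestyle[genome] != label:
                others |= groups
        specific[label] = core - others
    return specific

# Toy input (illustrative only): orthologous groups present per genome.
presence = {
    "strain_A": {"OG1", "OG2", "OG3"},
    "strain_B": {"OG1", "OG2", "OG4"},
    "strain_C": {"OG1", "OG5"},
}
lifestyle = {
    "strain_A": "pathogen",
    "strain_B": "pathogen",
    "strain_C": "beneficial",
}

result = lifestyle_specific_groups(presence, lifestyle)
print(result)
```

Requiring presence in every genome of a lifestyle is a strict criterion chosen for brevity; a real analysis would probably tolerate missing genes and correct for phylogenetic relatedness.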

Pangenomic QTL analysis

Supervisors Sandra Smit, Aalt-Jan van Dijk Type Algorithm development Requirements Adv. bioinformatics, Algorithms in bioinformatics Skills Programming, genomics, algorithm development Timestamp 8/6/2018

Description For a trait of interest, such as flowering time or yield, researchers and plant breeders often want to identify the most likely causal gene(s). To this end they analyze Quantitative Trait Loci (QTLs), which describe associations between genome regions and traits. However, a QTL region on a genome may contain hundreds of genes. Comparing the genomes of the parental species used to generate the QTL data can help to indicate likely causal genes [1]. Further strategies for gene prioritization are under development [2][3], using enrichment of certain gene annotations in these regions. In this project, the idea is to add further power to these prioritization methods by incorporating multiple annotated genomes. This is particularly relevant because QTL data are, by design, obtained from individuals with different genomes. We will use so-called “pangenomes” for this. A pangenome is a data structure that contains multiple annotated genomes and facilitates comparative analyses across these genomes [4]. The data structure is stored in a Neo4j graph database, which can incorporate heterogeneous data types, e.g. QTL annotations. Specific activities within this project would be:

1. Annotation of QTL regions in pangenomes

2. Extraction of homologous regions and annotations from multiple assemblies, and analysis of these for potentially relevant sequence differences

3. Development of a method to incorporate genes and annotations from multiple genomes in gene prioritization

4. Optionally, incorporation of GO terms in the pangenome to facilitate GO-centric navigation

Test cases: pangenome and QTLs on rice, tomato, and/or brassica

References

[1] Lim et al. (2014) Quantitative Trait Locus Mapping and Candidate Gene Analysis for Plant Architecture Traits Using Whole Genome Re-Sequencing in Rice. Mol Cells. 37(2): 149–160

[2] Kourmpetis et al. (2010) Bayesian Markov Random Field analysis for protein function prediction based on network data. PLoS One 5(2):e9293.

[3] Bargsten et al. (2014) BMC Plant Biology 14:330.

[4] Sheikhizadeh S, Schranz ME, Akdel M, de Ridder D, Smit S (2016) PanTools: representation, storage and exploration of pan-genomic data. Bioinformatics 32(17):i487-i49.
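The annotation-enrichment idea mentioned in the description can be sketched with a one-sided hypergeometric test: is an annotation more frequent among the genes in a QTL region than expected by chance? This is only an illustration under simplified assumptions; the function name and all counts are hypothetical, and the methods in [2] and [3] are more elaborate.

```python
from math import comb

def hypergeom_enrichment_p(genome_genes, annotated_genes, region_genes, region_annotated):
    """P(X >= region_annotated) when region_genes are drawn without replacement
    from genome_genes, of which annotated_genes carry the annotation."""
    total = comb(genome_genes, region_genes)
    upper = min(annotated_genes, region_genes)
    tail = sum(
        comb(annotated_genes, k) * comb(genome_genes - annotated_genes, region_genes - k)
        for k in range(region_annotated, upper + 1)
    )
    return tail / total

# Toy numbers: 1000 genes in the genome, 20 annotated with the term,
# 100 genes in the QTL region, 6 of which carry the term (2 expected).
p = hypergeom_enrichment_p(
    genome_genes=1000, annotated_genes=20, region_genes=100, region_annotated=6
)
print(f"enrichment p-value: {p:.4f}")
```

With several genomes in a pangenome, the same test could in principle be repeated per genome or on pooled homologous regions; how to combine them is one of the open questions of the project.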

Systematic exploration of the biosynthetic capacity of a collection of well-characterized human colonic bacteria

Supervisors Marnix Medema, Victoria Pascal Andreu, Peter van Baarlen, Jerry Wells Type Data analysis, Computational genomics Requirements Advanced Bioinformatics Skills Genomics, programming, chemical biology Timestamp August 23, 2017

Description Numerous comparative microbiome studies of disease vs healthy human cohorts have correlated microbiome markers to disease or impaired health. Increasingly, these studies are backed up by preclinical studies suggesting the existence of broadly occurring ecological defects in different disease states and the existence of specific “beneficial” commensals that improve health in multiple (disease) contexts. In a recent project aimed at exploiting the potential utilization of microbiota as a potential therapeutic option to prevent or treat human diseases we have gained access to a collection of more than 100 human colonic isolates and their respective genome sequences. In vitro screening of these bacterial strains and their secreted products has identified several isolates with strong anti-inflammatory activity on human immune cells (unpublished). Diffusible and cell-associated small molecules often mediate host-microbe interactions in complex environments [1]. For example, polysaccharides from specific strains of Faecalibacterium prausnitzii or Bacteroides fragilis have been shown to attenuate inflammatory colitis in mice [2, 3, 4]. Additionally, specific microbial strains may provide direct colonisation resistance against pathogenic species through the production of antimicrobial peptides or specialized metabolites, thereby conferring a benefit to the host [5]. In this project, you will systematically explore the biosynthetic capacity of the strain collection first using a systematic identification and analysis of biosynthetic gene clusters (BGCs), including comparisons among different isolates of the same species. You will classify these BGCs into families and explore their architectural diversity. Subsequently, you will use sequence analysis and phylogenetics of various types of enzymes and enzymatic domains encoded in the identified BGCs to predict their functions. 
The work may lead to the design of experiments to experimentally isolate and characterize the products of BGCs of interest. This project will contribute to the future discovery of novel mediators of microbe-microbe and host-microbe interactions. Key references [1] Donia et al. (2014) A systematic analysis of biosynthetic gene clusters in the human microbiome reveals a common family of antibiotics. Cell 158, 1402–1414. [2] Mazmanian et al. (2005) An immunomodulatory molecule of symbiotic bacteria directs maturation of the host immune system. Cell 122, 107–118. [3] Rossi et al. (2015) Faecalibacterium prausnitzii Strain H.TF-F and its extracellular polymeric matrix attenuate clinical parameters in DSS-induced colitis. PLoS One 10(4):e0123013 [4] Rossi et al. (2016) Faecalibacterium prausnitzii A2-165 has a high capacity to induce IL-10 in human and murine dendritic cells and modulates T cell responses. Sci Rep. 6:1850 [5] Buffie & Pamer. (2013) Microbiota-mediated colonization resistance against intestinal pathogens. Nat Rev Immunol 13, 790-801, doi:10.1038/nri3535.

Identification of high-quality SNPs in allopolyploid crops

Supervisors Ehsan Motazedi, Dick de Ridder, Chris Maliepaard Type Tool development, Genome analysis, Computational genomics Requirements Advanced Bioinformatics, Genomics Skills Programming, basic knowledge of plant genetics/genomics Timestamp 9/8/2016 Description Single Nucleotide Polymorphism (SNP) markers are the most widely used molecular markers in genetics and genomic studies as they are ubiquitous in the genome and high-throughput measurement methods, such as Genotyping by Sequencing (GBS) and SNP arrays, are available. However, detection of the SNPs becomes complicated in allopolyploid crops, i.e. crops like strawberry, peanut and wheat which have more than two copies of each chromosome from different ancestral subgenomes. In particular, false positive SNPs are often detected with Genotyping by Sequencing (GBS), i.e. SNPs whose alleles do not belong to the same subgenome. Such SNPs are therefore not good candidates to design genotyping arrays. One strategy to filter out such false positive SNPs, called homoeologous SNPs, has been to map the GBS reads with stringent criteria to separate references for each subgenome and hence detect subgenome specific SNPs. This strategy has been applied successfully, e.g. for wheat, when these separate references have been obtained from the ancestral diploid species. However, such distinct subgenome references are usually not available for other crops, or the subgenomes might be too similar in case the polyploidization event has been a recent one, e.g. in peanut and strawberry. The aim of this project will be to develop a new tool for filtering true allelic SNPs from the abundant homoeologous SNPs by estimating the haplotypes from GBS data, using simulation data to evaluate the methods. The methods will also be applied to real data sets to identify haplotype tagging SNPs, which will be validated by comparing to the already available SNP sets. Key references: 1. 
Clevenger, Josh P., and Peggy Ozias-Akins (2015) "SWEEP: a tool for filtering high-quality SNPs in polyploid crops." G3: Genes|Genomes|Genetics 5.9:1797-1803.

2. Bassil, Nahla V., et al. (2015) "Development and preliminary evaluation of a 90K Axiom® SNP array for the allo-octoploid cultivated strawberry Fragaria × ananassa." BMC Genomics 16.1:1.

Pangenomics for crops

Supervisor Sandra Smit, Dick de Ridder Type Algorithm development Requirements Adv. bioinformatics, Algorithms in bioinformatics Skills Programming, genomics, algorithm development Timestamp 8/6/2018 Description In recent years, the number of genomes has grown rapidly. Thus, many species and phylogenetic groups are no longer represented by a single reference genome but by numerous related genomes. In this situation, analyzing hundreds of genomes by individual comparison to a single reference genome becomes inefficient and misses genomic content not present in the reference. Likewise, pairwise comparison of hundreds of linear genomes is also far from practical. Hence, to capitalize on the genomic diversity in large collections of genomes, we need to transition from a reference-centric approach to a pangenome approach. Therefore computational pangenomics is currently a hot topic and a challenging field of research [1]. In the Bioinformatics Group we are developing a pangenome solution which compresses multiple annotated sequences into a single graph representation [2]. The graph is constructed, stored, and annotated in a Neo4j graph database. We are looking for students to catalyze the development of various algorithms to increase the pangenome applications for crop research. Application areas are: visualization, variation mining, annotation and gene space exploration. If you are interested in one of these directions, talk to us for more information. Two specific projects at the moment:

● Annotation transfer: how can you use a pangenome to transfer annotated features from one genome to another?

● Haplotyping: haplotype annotation and reconstruction in polyploid species

[1] The Computational Pan-Genomics Consortium (2016) Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics, in press; http://biorxiv.org/content/early/2016/03/12/043430

[2] Sheikhizadeh S, Schranz ME, Akdel M, de Ridder D, Smit S (2016) PanTools: representation, storage and exploration of pan-genomic data. Bioinformatics 32(17):i487-i49.
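To make the idea of compressing several sequences into one graph concrete, the toy example below builds a k-mer graph in which shared sequence is stored once and labelled with the genomes that contain it. It is a sketch under strong assumptions: it uses plain Python dictionaries instead of Neo4j, it is not the PanTools data model, and the function name and sequences are invented for illustration.

```python
from collections import defaultdict

def build_kmer_graph(genomes, k):
    """Nodes are k-mers labelled with the genomes containing them; edges link
    k-mers that are adjacent in at least one genome."""
    node_genomes = defaultdict(set)
    edges = defaultdict(set)
    for name, seq in genomes.items():
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for kmer in kmers:
            node_genomes[kmer].add(name)
        for left, right in zip(kmers, kmers[1:]):
            edges[left].add(right)
    return node_genomes, edges

# Two toy genomes that differ at a single position.
genomes = {"g1": "ACGTAC", "g2": "ACGTTC"}
nodes, edges = build_kmer_graph(genomes, k=3)

# k-mers present in every genome form the shared part of the graph.
shared = {kmer for kmer, names in nodes.items() if len(names) == len(genomes)}
print(sorted(shared))
print(sorted(edges["CGT"]))  # the graph branches where the genomes diverge
```

Because each node records its genomes of origin, features annotated on one genome can be looked up on the nodes it shares with another, which is the starting point for the annotation-transfer project listed above.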

Machine learning to predict protein-protein interactions in natural product biosynthetic assembly lines

Supervisors Marnix Medema, Aalt-Jan van Dijk Type Algorithm development Requirements Advanced Bioinformatics, Machine learning Skills Genomics, programming, machine learning Timestamp April 13, 2017

Description Polyketides are a class of natural products that comprise a wide variety of bioactive molecules used in the clinic as immunosuppressants, antibiotics, cholesterol-lowering agents and anticancer drugs. Enzymatically, many polyketides are produced by large multi-protein complexes of polyketide synthases (PKSs), which effectively function as assembly lines: each PKS module (comprised of multiple protein domains) selects a defined short acyl-CoA chemical moiety with which the growing polyketide chain is elongated [1,2]. The order of the PKS modules in such an assembly line therefore governs the molecular nature of the final product, and is itself determined through protein-protein interactions between subsequent PKS modules. Predicting functional interactions between PKS modules would make it possible to predict polyketide chemical structures from sequence information. Moreover, synthetic biology engineering of novel PKS assembly lines also depends on predicting which sequences will lead to the specific protein-protein interactions required to obtain functional enzyme complexes [3]. In this project, you will apply state-of-the-art machine learning techniques to improve our understanding of PKS protein-protein interactions. In particular, this will involve further development of a correlated mutation algorithm. Correlated mutation algorithms predict residue contacts from multiple sequence alignments [4,5]. We recently developed a combination of correlated mutation analysis with an Expectation-Maximization step. In this approach, prediction of protein interaction status and prediction of residue contacts are alternated until convergence.
A key task in this project will be to tailor this algorithm towards specific enzymological characteristics of PKS assembly lines: e.g. the fact that a limited number of combinations is possible between PKS components encoded in a given genome. Key references [1] Dutta et al. (2014) Structure of a modular polyketide synthase. Nature 510: 512–517. [2] Robbins, Liu, Cane and Khosla (2016) Structure and mechanism of assembly line polyketide synthases. Curr. Opin. Struct. Biol. 41, 10-18. [3] Klaus et al. (2016) Protein-protein interactions, not substrate recognition, dominate the turnover of chimeric assembly line polyketide synthases. J. Biol Chem. 291, 16404-16415. [4] de Juan et al. (2013) Nat Rev Genet, 14, 249-61. [5] Sreekumar et al. (2011) BMC Bioinformatics, 12, 444.
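The simplest form of correlated mutation analysis scores a pair of alignment columns by their mutual information: columns that change together score high, conserved or independent columns score zero. The sketch below shows only that basic score on an invented four-sequence alignment. It is not the group's Expectation-Maximization method, and published approaches add corrections (for example for phylogenetic bias) that are omitted here.

```python
from collections import Counter
from math import log2

def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment
    given as a list of equal-length sequences."""
    n = len(alignment)
    pairs = Counter((seq[i], seq[j]) for seq in alignment)
    col_i = Counter(seq[i] for seq in alignment)
    col_j = Counter(seq[j] for seq in alignment)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi

# Toy alignment: columns 0 and 1 co-vary (A with K, G with R),
# column 2 is conserved, column 3 varies independently of column 0.
alignment = ["AKLE", "AKLD", "GRLE", "GRLD"]

mi_covarying = column_mutual_information(alignment, 0, 1)
mi_conserved = column_mutual_information(alignment, 0, 2)
mi_independent = column_mutual_information(alignment, 0, 3)
print(mi_covarying, mi_conserved, mi_independent)
```

For inter-protein contacts, the two columns would come from the alignments of two different PKS components, which requires knowing which sequences pair up; that pairing is what the alternating step described above is meant to estimate.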

Integration of kinase specificity knowledge in the analysis of peptide microarrays

Supervisors Aalt-Jan van Dijk, Dick de Ridder Type Data analysis Requirements Programming in Python, Adv. bioinformatics, Adv. statistics / Modern statistics for the life sciences / Machine learning Skills Programming, statistics Timestamp 14/4/2017

Description Protein kinases are a class of enzymes heavily involved in cellular signal transduction. Their activity consists of phosphorylating amino acid residues of proteins. Aberrant cellular signaling involving kinase activity is implicated in many diseases. Kinase activity profiling using peptide microarrays can be used to study signal transduction in cell lines and clinical samples. Typical experiments consist of comparing lysates of different experimental groups, e.g. patients of different phenotypes or cell lines perturbed with different treatments. Various such datasets are available for analysis. The goal of this project is to infer upstream kinase activities based on peptide microarray data. For this we would like to address the following points:

● Literature knowledge about kinase-substrate relations is available, as well as in-silico predictions. We would like to combine both types of information, while e.g. assigning a higher confidence to literature than to predictions.

● The result of the analysis should be a hypothesis of which kinases are modulated in an experiment.

● Possible extensions / follow-up could include in silico design of sets of peptides that can be used to optimally discriminate between particular (families of) kinases.

The analysis could be inspired by considering a related problem for which solutions have been proposed: inference of transcription factor (TF) activity based on knowledge of TF-target gene interactions and measured gene expression levels. Algorithms developed for this analysis [1-2] might provide a good start for analysing the peptide microarray data available in this project.

[1] Gao et al., BMC Bioinformatics 2004, 5:31 http://www.biomedcentral.com/1471-2105/5/31

[2] Pournara & Wernisch, BMC Bioinformatics 2007, 8:61 http://www.biomedcentral.com/1471-2105/8/61
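As a starting point for the first two bullet points above, kinase activity can be approximated by a confidence-weighted average of the changes measured for each kinase's substrate peptides. The sketch below is a deliberately simple baseline, not one of the algorithms in [1-2]; the weights (1.0 for literature, 0.5 for predictions), the kinase and peptide names, and the function `kinase_scores` are all assumptions made for illustration.

```python
def kinase_scores(log_fold_changes, kinase_substrates, weights):
    """Confidence-weighted mean log fold change of each kinase's substrates.
    kinase_substrates maps kinase -> list of (peptide, evidence_type) pairs."""
    scores = {}
    for kinase, substrates in kinase_substrates.items():
        weighted_sum = 0.0
        total_weight = 0.0
        for peptide, evidence in substrates:
            if peptide not in log_fold_changes:
                continue  # peptide is not on the array
            w = weights[evidence]
            weighted_sum += w * log_fold_changes[peptide]
            total_weight += w
        if total_weight > 0:
            scores[kinase] = weighted_sum / total_weight
    return scores

# Literature-derived relations count more than in silico predictions.
weights = {"literature": 1.0, "predicted": 0.5}
log_fold_changes = {"pep1": 2.0, "pep2": 1.0, "pep3": -1.0}
kinase_substrates = {
    "kinase_X": [("pep1", "literature"), ("pep2", "predicted")],
    "kinase_Y": [("pep3", "literature"), ("pep4", "predicted")],
}

scores = kinase_scores(log_fold_changes, kinase_substrates, weights)
print(scores)
```

A ranked list of these scores is one possible form of the "hypothesis of which kinases are modulated"; a statistical model would additionally account for peptides shared between kinases.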

Integrative QTL analysis

Supervisors Harm Nijveen, Wilco Ligterink, Dick de Ridder Type Algorithm development Requirements Programming in Python, Adv. bioinformatics, Adv. statistics / Modern statistics for the life sciences Skills Genomics, programming, statistics, machine learning Timestamp 18/4/2017

Description In a recent national study on maternal effects on seed quality, 165 homozygous recombinant lines of Arabidopsis thaliana grouped in a number of different growth conditions were genotyped based on 69 markers and transcript levels were measured. These lines were also extensively phenotyped, with the goal of performing generalized genetical genomics [1], i.e. correlating genotype with phenotype (expression) under a range of conditions. Levels of a number of primary metabolites were measured as well. In this project, the goal is to develop methods to learn which genes influence which phenotype, extending the QTL approach by incorporating expression and metabolic pathway information [2]. Prior knowledge on metabolic regulation and the relation between condition and metabolic activation can be used to refine the search and zoom in on possible mechanistic explanations of the observed phenotypes. The desired outcome is a method to optimally combine genetical genomics data with prior knowledge.

[1] Y. Li et al. (2008) Generalizing genetical genomics: getting added value from environmental perturbation. Trends Genetics 24(10):518-24.

[2] R.C. Jansen et al. (2009) Defining gene and QTL networks. Current Opinion in Plant Biology 2009, 12:1–6.

[3] Nijveen, H. et al. (2017) AraQTL – workbench and archive for systems genetics in Arabidopsis thaliana. Plant J, 89: 1225–1235. doi:10.1111/tpj.13457

Protein function prediction

Supervisors Aalt-Jan van Dijk, Dick de Ridder Type Algorithm development Requirements Programming in Python, Adv. bioinformatics, Adv. statistics / Par. estimation / Modern statistics for the life sciences / Machine Learning Skills Programming, statistics Timestamp 14/4/2017

Description Protein function information is available in the Gene Ontology (GO). While reliable, the GO is far from complete, and machine learning algorithms are applied to predict novel annotations based on additional (measurement) data. This is challenging, as: (a) the data sources available are noisy, biased and incomplete; (b) the GO contains only positive labels, i.e. there is no information on functions that proteins do not have; and (c) there is inherent structure between proteins and between functions that should be exploited. In recent years, a successful protein function prediction tool called BMRF [1] has been developed at Wageningen University. The goal of this project is to extend BMRF by taking into account additional available data, such as various known (tissue-specific) networks and/or QTL data [2]; investigate the use of learning methods tailored to deal with only positive labels, such as PU learning [3]; and model relationships between annotations, such as the hierarchical nature of the Gene Ontology [4], for example by structured output learning of protein/function modules rather than individual proteins.

[1] Y.A.I. Kourmpetis et al. (2010) Bayesian Markov Random Field analysis for protein function prediction based on network data. PLoS ONE 5(2):e9293.

[2] J.W. Bargsten et al. (2014) Prioritization of candidate genes in QTL regions based on associations between traits and biological processes. BMC Plant Biology 14:330.

[3] B. Calvo (2008) Positive unlabeled learning with applications in computational biology. PhD thesis, Dept. of Computer Science and Artificial Intelligence, University of the Basque Country.

[4] A. Sokolov and A. Ben-Hur (2010). Hierarchical classification of Gene Ontology terms using the GOstruct method. Journal of Bioinformatics and Computational Biology 8(2):357-76.
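One concrete consequence of the hierarchical nature of the Gene Ontology mentioned in the description is that an annotation to a term implies annotation to all of its ancestors, so predictions should respect that structure. The sketch below propagates annotations upward through a small hierarchy. The term names are placeholders rather than real GO identifiers, and the function is an illustration, not part of BMRF.

```python
def propagate_annotations(direct_terms, parents):
    """Expand a set of terms with all of their ancestors in the hierarchy.
    parents maps each term to a list of its direct parent terms."""
    expanded = set()
    stack = list(direct_terms)
    while stack:
        term = stack.pop()
        if term in expanded:
            continue  # already visited via another path
        expanded.add(term)
        stack.extend(parents.get(term, ()))
    return expanded

# Placeholder hierarchy: term_c -> term_b -> term_a, and term_d -> term_a.
parents = {
    "term_c": ["term_b"],
    "term_b": ["term_a"],
    "term_d": ["term_a"],
}

annotations = propagate_annotations({"term_c"}, parents)
print(sorted(annotations))
```

The same structure matters for the positive-only labels of point (b): a protein lacking an annotation to `term_d` here is unlabeled for that term, not a confirmed negative.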

RDF for QTL candidate gene prioritization

Supervisors Aalt-Jan van Dijk, Maria Suarez-Diez, Dick de Ridder Type Tool development Requirements Programming in Python, Molecular systems biology, Adv. bioinformatics Skills Programming, elementary statistics; no prior knowledge of RDF is required Timestamp April 14, 2017

Description Quantitative Trait Locus (QTL) data describe associations between genome regions and traits such as flowering time or yield. These genome regions can contain hundreds of genes. It is often important to be able to indicate the most likely causal candidate gene(s). We recently demonstrated that the integration of gene function predictions with QTL data resulted in a significant prioritization of known candidate genes [1,2]. Besides gene function predictions, various alternative labels for genes are available, e.g. InterPro protein domains, transcription factor binding sites and Genome-Wide Association results. To integrate QTLs with these datasets, an easy way to integrate, query and analyse the available data sources is essential.

The goal of this project is to generate a resource integrating all available data, and apply it to QTL candidate gene prioritization. We will use RDF, which is one of the enabling technologies of the Semantic Web [3]. In the RDF data model, resources are described as self-descriptive subject, predicate and object triples, for example <ProteinX, has_domain, IPR0001>. These triples are linked in an RDF graph, and SPARQL [4] can be used to query the graph. One of the main advantages of RDF is that new data types can be added without changing the structure of the database, which is not the case for relational (SQL) databases. Therefore, when using RDF we do not have to establish at the beginning of the project which types of gene characteristics can be linked to the observed QTLs. We recently developed a framework that hosts bacterial genomic information in RDF and have started to adapt it to contain plant data. The activities will include:

1. Establishment of an RDF resource containing genomic information for Arabidopsis thaliana

2. Sequential introduction of new datasets: Pfam, InterPro, GO annotation and QTL data

3. Candidate gene prioritization from QTL data

4. Generalization to other organisms (rice and tomato)
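The triple model described above can be mimicked in a few lines of plain Python to show why adding a new data type needs no schema change. This sketch stores triples as tuples and matches simple patterns; it does not use an RDF library or SPARQL, and every predicate and identifier other than the `<ProteinX, has_domain, IPR0001>` example from the text is hypothetical.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard,
    similar in spirit to a variable in a query pattern."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

triples = [
    ("ProteinX", "has_domain", "IPR0001"),
    ("ProteinY", "has_domain", "IPR0001"),
    ("ProteinX", "located_in_qtl", "QTL_1"),
]

# A new kind of information is just more triples; nothing else changes.
triples.append(("ProteinY", "has_go_term", "GO_term_1"))

# Which proteins have domain IPR0001 and lie in some QTL region?
with_domain = {s for s, _, _ in match(triples, predicate="has_domain", obj="IPR0001")}
candidates = {s for s in with_domain if match(triples, subject=s, predicate="located_in_qtl")}
print(candidates)
```

In a real RDF resource the two lookups would be written as a single SPARQL graph pattern, and the identifiers would be URIs rather than bare strings.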

[1] Kourmpetis et al., (2010) Bayesian Markov Random Field analysis for protein function prediction based on network data. PLoS One 5(2):e9293.

[2] Bargsten et al. (2014) BMC Plant Biology 14:330.

[3] Antezana E, Mironov V and Kuiper M (2013) The emergence of Semantic Systems Biology, New Biotechnology 30(3):25.

[4] http://www.w3.org/TR/rdf-sparql-query/

Transcriptional networks from eQTL data

Supervisors Harm Nijveen, Basten Snoek Type Tool development, Algorithm development Requirements Programming in Python, Adv. bioinformatics Skills Programming, statistics, molecular biology Timestamp 18/4/2017

Description In eQTL studies, combining variation in gene expression with genetic variation allows us to find transcriptional regulators and, ideally, to infer gene regulatory networks (GRNs). A main limitation of eQTL analysis is its low resolution, which typically leads to hundreds of potential regulators for a gene. Fine mapping is the main approach to identify the actual regulator, but this is very laborious. We are developing an in silico fine mapping approach to speed up the elucidation of GRNs by combining eQTL data with other experimental data to prioritise the genes identified by eQTL analysis. To this end we are setting up a web-based eQTL analysis platform to collect the publicly available Arabidopsis thaliana eQTL data and include various other data, such as gene annotation, co-expression and cis-regulatory sequences. Target genes will be connected to all of their potential regulators in a densely connected network, and the edges will then be scored based on the additional evidence, using weighted scoring. Low-scoring edges will be pruned and the resulting network will be evaluated against known regulatory interactions. Weights for the different data types can then be adjusted to improve the predictions. The eQTL analysis platform should also allow biologists to inspect the eQTL data and, for instance, find new genes involved in specific biological processes based on predicted interactions.

Nijveen, H. et al. AraQTL – workbench and archive for systems genetics in Arabidopsis thaliana. Plant J, 89: 1225–1235. doi:10.1111/tpj.13457 (2017).

Keurentjes, J. J. B. et al. Regulatory network construction in Arabidopsis by using genome-wide gene expression quantitative trait loci. Proceedings of the National Academy of Sciences 104, 1708-1713, doi:10.1073/pnas.0610429104 (2007).

Snoek, L. B. et al. WormQTL—public archive and analysis web portal for natural variation data in Caenorhabditis spp. Nucleic Acids Research 41, D738-D743, doi:10.1093/nar/gks1124 (2013).
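The weighted scoring and pruning of regulator-target edges described in this project can be sketched as follows. The evidence types, weights, threshold and gene names are invented for illustration, and a simple weighted sum is only one possible scoring scheme; it is not the scoring used in the platform under development.

```python
def score_edges(candidate_edges, weights, threshold):
    """Score each (regulator, target) edge as the weighted sum of its evidence
    values and keep only the edges whose score reaches the threshold."""
    kept = {}
    for edge, evidence in candidate_edges.items():
        score = sum(weights[kind] * value for kind, value in evidence.items())
        if score >= threshold:
            kept[edge] = score
    return kept

# Hypothetical weights per data type; these are the values one would tune
# against known regulatory interactions.
weights = {"coexpression": 0.5, "cis_motif": 0.3, "annotation": 0.2}

# Every gene under the eQTL starts as a candidate regulator of the target.
candidate_edges = {
    ("reg1", "target1"): {"coexpression": 0.9, "cis_motif": 1.0, "annotation": 1.0},
    ("reg2", "target1"): {"coexpression": 0.2, "cis_motif": 0.0, "annotation": 0.0},
}

network = score_edges(candidate_edges, weights, threshold=0.5)
print(network)
```

Evaluating the pruned network against known interactions and then adjusting `weights` closes the loop described in the text.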

Novel enzymes for fragrance and flavour

Supervisors Aalt-Jan van Dijk, Marnix Medema Type Algorithm development, Machine learning Requirements Programming in Python, Pattern Recognition Skills Programming, statistics, chemical biology, data analysis Timestamp 11/4/2017 Description Ingredients from plants used in the flavour and fragrance industry are increasingly produced by microbial production platforms. Terpenes (a class of >10,000 natural compounds) are prime examples of such plant flavour compounds, used in a wide range of products. Microbial platforms for plant compounds often work by expression of the plant biochemical pathway in the microorganism, upon which the microorganism will produce the plant metabolite. The production of terpenes is largely mediated by a single class of enzymes, the terpene synthases. These synthases are often limiting for production, and differ greatly in their efficiency. To improve microbial production platforms, it is imperative to identify superior plant terpene synthases. In this project, you will apply machine learning, to recognize synthases with a specific product, e.g. valencene or patchoulol, among thousands of uncharacterized terpene synthases in daily expanding plant genomics data. We will focus on one particular class of terpene synthases, producing sesquiterpenes. Sesquiterpenes can be categorized based on their cyclization pattern. The terpene synthase reaction can comprise one or a few of a set of 13 reactions, all catalyzed by a single enzyme. A particular combination of these reactions results in a specific cyclization pattern of the final product. Machine learning includes methods that convert data (here: terpene synthase sequences) into numerical representations (“features”) and find patterns in these features that distinguish various “classes” (here: types of terpene products produced). Training data consisting of sequences with known product specificity is available, and additional cases will be obtained from literature and databases. 
Features can be derived by, e.g., counting sub-strings in a sequence or analysing the conservation of amino acids in an alignment. By associating particular patterns in these features with product specificity, the algorithm can learn to recognize different classes of proteins. Specifically, support vector machines (SVMs) will be trained to predict the absence or presence of each of the 13 different reactions as labels, each time for the entire set of enzymes with known product specificity. After training and estimation of classification performance using cross-validation, the SVMs will be applied to predict functionality for all available sesquiterpene synthase sequences. This will make it possible to prioritize sequences as candidate enzymes for particular compounds of interest, and will allow targeted exploration of the rich biosynthetic diversity encoded in plant genomes.
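The sub-string counting mentioned above can be illustrated with a small feature extractor that turns a protein sequence into a fixed-length vector of dipeptide counts. This is a sketch of the feature step only: the sequence is a made-up fragment, the choice of k = 2 is arbitrary, and no SVM library is used here. The resulting vectors are the kind of input a classifier would be trained on.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_features(sequence, k=2):
    """Fixed-length vector of k-mer counts over the 20 standard amino acids.
    k-mers containing other characters are ignored."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    vocabulary = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return [counts[kmer] for kmer in vocabulary]

# Made-up fragment; real input would be full terpene synthase sequences.
features = kmer_features("MDDLDD", k=2)
print(len(features), sum(features))
```

Each of the 13 reactions would get its own binary classifier trained on such vectors, with cross-validation used to estimate how well each one generalizes.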

Key references: [1] Medema & Osbourn (2016) Computational genomic identification and functional reconstitution of plant natural product biosynthetic pathways. Natural Product Reports 33: 951. [2] Röttig M, Rausch C, Kohlbacher O (2010) Combining Structure and Sequence Information Allows Automated Prediction of Substrate Specificities within Enzyme Families. PLoS Comput Biol 6(1): e1000636. doi: 10.1371/journal.pcbi.1000636

Biosynthetic Gene Cluster Prediction Using Phylogenomic Synteny Networks

Supervisors Hernando Suarez, Eric Schranz, Tao Zhao, Marnix Medema Type Data analysis, Computational genomics, Phylogenetics Requirements Programming in Python, Advanced Bioinformatics Skills Genomics, programming, chemical biology Timestamp September 1, 2017 Description Biosynthetic gene clusters (BGCs) encode the enzymatic machinery to synthesize valuable natural products. In plants, BGCs are sometimes conserved across species: e.g., the α-tomatine and α-chaconine/solanine gene clusters in the tomato and potato genomes are homologous [1]. In addition, genomes with a recent Whole-Genome Duplication (WGD) contain more BGCs than other plants in the same taxonomic clade, based on automated BGC discovery tools like plantiSMASH [2]. This is because the majority of clusters in these plants have one or more additional copies in the genome, some with stark architectural differences. Recently, advances have been made in the way we visualize and analyze syntenic relations between two or more species with the use of phylogenomic synteny networks [3,4]. These networks make it possible to identify evolutionary patterns across a number of species within a graph theory framework similar to those widely used in other bioinformatics fields (gene coexpression networks, regulatory networks, etc.). Given the known conservation of BGCs, synteny networks can be used to enhance plantiSMASH’s predictive capabilities by identifying the loci syntenic to predicted or known BGCs and using plantiSMASH’s existing capabilities with less stringent (or more specific) parameters to find previously undetected gene clusters that may not pass our current definition of BGCs. This approach would be ideal to discover new clusters and study the possible divergence of evolutionary trajectories of cluster copies in closely related species pairs, like tomato and potato, and for species with recent WGDs like Camelina sativa and Brassica napus.
Mining synteny networks may also help to find associations of known classes of biosynthetic pathways with previously unassociated protein families, by identifying conserved synteny blocks around genes encoding scaffold-generating enzymes. This analysis may help us understand patterns of gene birth, death and divergence among diverse biosynthetic loci.

Key references

1. Itkin, M. et al. Biosynthesis of Antinutritional Alkaloids in Solanaceous Crops Is Mediated by Clustered Genes. 341, 175–179 (2013).

2. Kautsar, S. A., Suarez Duran, H. G., Blin, K., Osbourn, A. & Medema, H. plantiSMASH: automated identification, annotation and expression analysis of plant biosynthetic gene clusters. 1–9 (2017). doi:10.1093/nar/gkx305

3. Zhao, T., Holmer, R., Bruijn, S. de, Burg, H. A. van den & Schranz, M. E. Phylogenomic Synteny Network Analysis Reveals an Ancient MADS-Box Transcription Factor Tandem Duplication and Lineage-Specific Transpositions. 1, 3–13 (2017).

4. Zhao, T. & Schranz, M. E. Network approaches for plant phylogenomic synteny analysis. Curr. Opin. Plant Biol. 36, 129–134 (2017).
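To make the idea of "loci syntenic to a known BGC" concrete, here is a minimal sketch that stores syntenic gene pairs as an undirected graph and collects every gene reachable from a seed gene. The gene identifiers and function names are hypothetical, and real synteny networks carry more information (e.g. block scores) than this toy graph.

```python
from collections import defaultdict, deque


def build_network(syntenic_pairs):
    """Build an undirected graph from pairs of syntenic genes."""
    graph = defaultdict(set)
    for a, b in syntenic_pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def syntenic_loci(graph, seed):
    """Breadth-first search: all genes connected to `seed` via synteny."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(seed)
    return seen


# Toy example with made-up gene identifiers.
pairs = [
    ("tomato_g1", "potato_g7"),
    ("potato_g7", "pepper_g3"),
    ("tomato_g9", "potato_g2"),
]
network = build_network(pairs)
candidates = syntenic_loci(network, "tomato_g1")
```

In this sketch, the genes returned for a seed inside a known cluster would be the candidate loci to re-examine with relaxed detection parameters.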

MADS-box transcription factor - DNA interaction specificity

Supervisors Aalt-Jan van Dijk Type Data analysis Requirements Machine Learning, Programming in Python, Adv. bioinformatics Skills Programming, statistics Timestamp 12/3/2018 Description Correct flower formation requires highly specific regulation of gene expression. In Arabidopsis thaliana the majority of the regulators that determine flower organ identity belong to the MADS-box transcription factor family. The canonical DNA binding motif for this transcription factor family is the CArG-box, which has the consensus CC(A/T)6GG. The availability of genome-wide binding maps for transcription factors based on ChIP-seq makes it possible to investigate binding patterns of MADS-box proteins in great detail. Recently, we re-analyzed eight ChIP-seq datasets of MADS-box proteins. The preferred DNA binding motif of each protein was found to be a CArG-box with the 3’ extension 5’-NAA-3’. Furthermore, motifs of other transcription factors were found in the binding sites of the MADS-box transcription factors, suggesting that interaction of MADS-box proteins with other transcription factors is important for target gene regulation. In this project we will focus on three aspects of transcription factor - DNA interactions:

1) What distinguishes the binding sites of different MADS proteins from each other? To answer this question, we will train a classifier to predict, for a given putative binding site, which MADS proteins bind to it. Given the recent success of deep learning approaches, we might follow such an approach, but we will test other classifiers as well.

2) What role, if any, do dependencies between positions in the binding site play? Standard descriptions of binding sites (such as the above-mentioned CArG-box definition) consider each position as independent of the other positions. However, it has been shown that, due to e.g. structural properties of DNA, interactions (dependencies) between positions can be quite important as well [1]. Such dependencies can be analyzed with, for example, a Bayesian network-based approach. Possibly, sequence conservation in ecotypes could give insight as well: dependencies between positions in a binding site might lead to compensatory mutations.

3) Can we discover a higher-order “grammar” that indicates how binding sites for different MADS proteins and/or for other types of regulators occur together in regulatory modules? Unsupervised analyses e.g. using a Self Organizing Map might be able to discover such patterns.
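As a starting point for collecting putative binding sites, the consensus CC(A/T)6GG and the 3’ extension NAA given in the description can be written as a simple pattern scan. This is only a sketch: it scans the forward strand, treats the consensus as a hard rule rather than a position weight matrix, and the function name and test sequences are hypothetical.

```python
import re

# Lookahead so that overlapping CArG-boxes are also reported.
CARG_BOX = re.compile(r"(?=(CC[AT]{6}GG))")


def find_carg_boxes(sequence):
    """Return (start, site, has_NAA_extension) for each CArG-box found."""
    sequence = sequence.upper()
    hits = []
    for match in CARG_BOX.finditer(sequence):
        start = match.start()
        end = start + 10  # CC + 6 x (A/T) + GG
        extension = sequence[end:end + 3]
        # The 3' extension is N-A-A: any base followed by two adenines.
        has_extension = len(extension) == 3 and extension[1:] == "AA"
        hits.append((start, match.group(1), has_extension))
    return hits


sites = find_carg_boxes("GGTCCAAATTTGGTAACC")
```

Sites found this way, together with their flanking sequence, could serve as input examples for the classifiers mentioned under 1).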

[1] http://biorxiv.org/content/early/2016/04/07/047647

Mining the dark genome for ‘known unknowns’

Supervisors Ben Oyserman, Marnix Medema Type Tool development, Comparative genomics Requirements Advanced Bioinformatics Skills Programming Timestamp April 19, 2017 The process of genome annotation identifies the location of gene coding regions and their function throughout the genome. Sequence similarity with previously characterized genes (the ‘known knowns’) allows for rapid annotation of genomic content. However, a large fraction of sequence space does not share sequence similarity with the known knowns, and consequently, these sequences remain unannotated. Experimental characterization of unknown genomic content is relatively slow; thus, a fundamental asymmetry exists in the rate at which novel sequences are being discovered and the rate at which they are being characterized. This asymmetry necessitates the development of bioinformatics approaches independent of sequence similarity to associate unknown sequences with a putative function. To address this challenge, a bioinformatic approach will be developed in this project based on the concept of the ‘known unknowns’. In this context, the known unknowns are those functions that may be predicted to be contained within the genome but for which no genomic content is yet annotated (the function is known to be present, but the sequence corresponding to the function is unknown). For example, auxotrophy, the inability to synthesize a specific compound, is commonly assigned to organisms based on the lack of a biosynthetic pathway. For an organism to survive, these compounds must be transported into the cell. However, transporters are generally not well annotated. Consequently, there are likely many instances in which an auxotrophy is predicted, yet a corresponding transporter is lacking in the annotation. Hence, the presence of an uncharacterized transporter represents a known unknown. If a specific known unknown (e.g. an auxotrophy/missing transporter combination) is identified across many genomes, comparative analysis may be leveraged to associate shared genomic content as candidate genes that fill this gap in annotation. Using publicly available reference genomes, a pipeline will be developed to identify contradictions in annotations that may be explained through the concept of the ‘known unknown’. Methods used will include genome annotation, identification of orthologous gene clusters, prediction of auxotrophy based on genomic content, and comparative genomics. Depending on progress, phylogenetic and transcriptomic information may also be incorporated into the project.

Integrative analysis of Arabidopsis BRC1 ChIP- and RNA-seq data

Supervisors Aalt-Jan van Dijk, Sam van Es Type Data analysis Requirements Programming in Python, Advanced bioinformatics Skills Programming, statistics, data analysis Timestamp May 15, 2017 Description Plants are hugely diverse in their appearance under different conditions. Probably the most famous example of what changes in plant architecture can accomplish is the domestication of teosinte (Zea mays ssp. parviglumis) to modern-day maize (Zea mays ssp. mays). In this species the reduction of lateral branches gave rise to a highly productive staple crop. In our model plant Arabidopsis thaliana, branching is controlled by the transcription factor BRANCHED1 (BRC1) [1]. The actual molecular mode of action of the BRC1 transcription factor is still largely unknown. To investigate it, we have performed a transcriptome analysis (RNA-seq) on Arabidopsis seedlings. With this method we can identify genes that respond to changes in BRC1 expression. This gives an idea of the downstream mechanisms regulated by BRC1; however, it is very unlikely that BRC1 does this singlehandedly and that all the observed transcriptional changes are a direct consequence of BRC1 DNA binding. Transcription factors typically bind gene promoter regions in large multimeric complexes. Moreover, if a target regulated by a transcription factor is itself a transcription factor, it can in turn regulate other targets (which would then only indirectly be regulated by BRC1). To identify the possible direct binding sites of BRC1 complexes in the genome, we have performed a Chromatin Immunoprecipitation experiment followed by in-depth sequencing (ChIP-seq). A thorough and integrative analysis of the comprehensive RNA-seq and ChIP-seq datasets provides an excellent opportunity to unravel the role of BRC1 in Arabidopsis branching and axillary meristem outgrowth.
The project gives you the chance to work with new and largely unanalysed datasets and to learn about transcriptome analysis, starting from raw sequence reads, identifying lists of differentially expressed genes, and following up with e.g. enrichment and clustering analyses. ChIP-seq analysis will consist of the identification of binding regions, identifying DNA sequences that enable BRC1 binding (“DNA binding motifs”), enrichment analyses, and ultimately, a detailed comparison with the RNA-seq outcomes.

[1] Aguilar-Martínez, J.A., Poza-Carrión, C. & Cubas, P., 2007. Arabidopsis BRANCHED1 acts as an integrator of branching signals within axillary buds. The Plant Cell, 19(2), pp.458–472.
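One simple way to start the comparison between ChIP-seq and RNA-seq outcomes is to ask whether genes bound by the transcription factor overlap with differentially expressed genes more than expected by chance. The sketch below uses a one-sided hypergeometric test; the gene names, set sizes and function name are hypothetical, and the choice of test is an assumption rather than the method prescribed by the project.

```python
from math import comb


def overlap_pvalue(n_genes, n_bound, n_de, n_overlap):
    """P(overlap >= n_overlap) when n_de genes are drawn at random
    from n_genes, of which n_bound are bound (hypergeometric tail)."""
    upper = min(n_bound, n_de)
    total = comb(n_genes, n_de)
    tail = sum(
        comb(n_bound, k) * comb(n_genes - n_bound, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy example: 10 genes in the genome, 3 bound, 3 differentially expressed.
bound_genes = {"g1", "g2", "g3"}
de_genes = {"g2", "g3", "g4"}
direct_targets = bound_genes & de_genes  # bound AND responsive
p_value = overlap_pvalue(10, len(bound_genes), len(de_genes), len(direct_targets))
```

Genes in the intersection are candidates for direct regulation; differentially expressed genes without a binding site are more likely to be indirect targets.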

Combining methods for prediction of protein interfaces

Supervisors Aalt-Jan van Dijk, Miguel Correa Marrero Type Data analysis, Tool development Background / skills Programming, machine learning, basic biochemistry Requirements Programming in Python, Adv. bioinformatics, Machine learning Timestamp September 1, 2017 Description Proteins are the chief actors within the cell, but they seldom carry out their function by themselves; they are actually rather social, and many essential cellular processes depend on their interactions. These interactions arise from contacts between residues on the surface of the interacting proteins. These residues, as a whole, form what is called a protein interface. Given the importance of protein-protein interactions, understanding these interfaces can give insight into many processes. Knowledge about protein interfaces can also be used in rational drug design.

Experimental methods to gather information about which residues form part of the interface are rather cumbersome and time-consuming. In response, there have been many efforts to develop computational methods to predict interface residues. These methods can be roughly divided into predictors that exploit physicochemical information about the residues and co-evolutionary methods. The latter are based on the fact that interacting proteins evolve together to maintain the interaction, giving rise to correlations between their sequences [2].

In parallel, there have been many efforts to predict which residues within a protein interact with each other. A recent breakthrough in predicting protein structures has come with the analysis of correlated mutations within proteins. The best performance can be achieved, however, by combining analysis of correlated mutations with methods that use physicochemical information [3].
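To illustrate what "correlated mutations" means in practice, the sketch below computes the mutual information between two columns of a multiple sequence alignment. Mutual information is used here only as a simple stand-in: the co-evolutionary methods cited in this description are more sophisticated, and the toy alignment and function names are hypothetical.

```python
from collections import Counter
from math import log2


def column(alignment, index):
    """Extract one alignment column as a list of residues."""
    return [sequence[index] for sequence in alignment]


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))
    return sum(
        (count / n) * log2((count / n) / ((freq_i[a] / n) * (freq_j[b] / n)))
        for (a, b), count in freq_ij.items()
    )


# Toy alignment: columns 0 and 1 covary, columns 0 and 3 are independent.
alignment = ["AKLE", "AKLD", "GRLE", "GRLD"]
mi_covarying = mutual_information(column(alignment, 0), column(alignment, 1))
mi_independent = mutual_information(column(alignment, 0), column(alignment, 3))
```

A high score for a pair of positions suggests that they may be in contact, within one protein or across an interface, which is the signal that could be combined with physicochemical predictors.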

This approach has not yet been applied to predict protein interfaces. In this project, you will explore how and why different strategies yield different results, for example, by inspecting biophysical and physicochemical differences between the sets of predicted residues. After this initial analysis, you will investigate if it is possible to combine different methods to achieve greater prediction accuracy.

Key references

[1] Xue, Li C., et al. "Protein-protein interface predictions by data-driven methods: a review." FEBS Letters 589.23 (2015): 3516.

[2] De Juan, D., Pazos, F., & Valencia, A. (2013). Emerging methods in protein co-evolution. Nature Reviews Genetics, 14(4), 249-261.

[3] Kosciolek, T., & Jones, D. T. (2015). Accurate contact predictions using covariation techniques and machine learning. Proteins: Structure, Function, and Bioinformatics.

Prediction of homomeric protein interfaces

Supervisors Aalt-Jan van Dijk, Miguel Correa Marrero Type Data analysis, Tool development Background / skills Programming, machine learning, basic biochemistry Requirements Programming in Python, Adv. bioinformatics, Machine learning Timestamp September 1, 2017 Description Proteins have evolved to carry out a large number of functions within the cell, such as catalyzing chemical reactions or signal transduction. Most frequently, they must interact with other proteins to perform these functions. This might be with other, different proteins (forming hetero-oligomers), or with other copies of the same protein (forming homo-oligomers). Understanding exactly which residues participate in the interaction can give important insight, but is experimentally hard to determine. For this reason, different computational sequence-based methods have been developed to predict these residues. In the case of homo-oligomeric interactions, these have traditionally been machine learning approaches that distinguish interface residues based on biophysical and physicochemical properties, as in [1].

A recent study [2] shows that a different strategy is able to predict residues in homomeric interfaces. This study uses a co-evolutionary approach, which relies on the correlated evolution of residues within the protein [3] and has been widely used before to predict residues that are interacting within a protein. However, distinguishing whether these predictions are residues within the protein or residues in the interface poses another problem. To differentiate them, the authors used the structures of the proteins. However, this cannot really be done for most use cases of co-evolutionary methods, as the structure is frequently unknown.

In this project, you will investigate if this approach is applicable to cases where only the protein sequence is known. You will analyze model predictions and train classifiers to distinguish whether these predictions are residues participating in the interface or not, based exclusively on data that can be derived from the protein sequence (e.g. whether the residue is buried in the protein). This could be applied not only to predict homomeric interfaces, but also to improve predictions of intra-molecular contacts.

Key references

[1] Hou, Q., et al. "Seeing the Trees through the Forest: Sequence-based Homo- and Heteromeric Protein-protein Interaction sites prediction using Random Forest." Bioinformatics (Oxford, England) (2017).

[2] Uguzzoni, Guido, et al. "Large-scale identification of coevolution signals across homo-oligomeric protein interfaces by direct coupling analysis." Proceedings of the National Academy of Sciences 114.13 (2017): E2662-E2671.

[3] De Juan, D., Pazos, F., & Valencia, A. (2013). Emerging methods in protein co-evolution. Nature Reviews Genetics, 14(4), 249-261.

Arabidopsis root development cell trajectories

Supervisors Dick de Ridder Type Algorithm development Requirements Programming in Python, Advanced Bioinformatics, Machine Learning Skills Programming, machine learning (some R skills are desirable) Timestamp September 1, 2017 Description Recent years have seen an increasing interest in single cell transcriptomics, i.e. measurements of mRNA expression in individual cells [1]. Such measurements are essential to help understand developmental processes, in which cells switch between expression programs as they mature into various cell types. This has mostly been applied in human developmental biology, but promises to also help understand development of plants [2].

In plants, single cell sequencing has not been widely applied, but a number of root development transcription atlases are available [3-5]. In this exploratory project, the aim is to investigate the possibility of applying a number of recently developed methods for learning development trajectories from single cell sequencing data, such as Monocle [6], to the data found in these atlases. The goal is to verify whether the actual spatial development of the root is reflected in these trajectories, and to what extent actual location data can help to guide learning such trajectories [7].

References

[1] Saliba AE, Westermann AJ, Gorski SA, Vogel J. Single-cell RNA-seq: advances and future challenges. Nucleic Acids Research 42(14):8845-60, 2014.

[2] Efroni I, Birnbaum KD. The potential of single-cell profiling in plants. Genome Biology 17:65, 2016.

[3] Brady SM, Orlando DA, Lee JY, Wang JY, Koch J, Dinneny JR, Mace D, Ohler U, Benfey PN. A high-resolution root spatiotemporal map reveals dominant expression patterns. Science 318(5851):801-6, 2007.

[4] Birnbaum K, Shasha DE, Wang JY, Jung JW, Lambert GM, Galbraith DW, Benfey PN. A gene expression map of the Arabidopsis root. Science 302(5652):1956-60, 2003.

[5] Li S, Yamada M, Han X, Ohler U, Benfey PN. High-resolution expression map of the Arabidopsis root reveals alternative splicing and lincRNA regulation. Developmental Cell 39(4):508–22, 2016.

[6] Qiu X, Mao Q, Tang Y, Wang L, Chawla R, Pliner HA, Trapnell C. Reversed graph embedding resolves complex single-cell trajectories. Nature Methods 2017.

[7] Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of single-cell gene expression data. Nature Biotechnology 33(5):495-502, 2015.

Protein contact prediction by biclustering

Supervisors Miguel Correa, Dick de Ridder, Aalt-Jan van Dijk Type Algorithm development Requirements Programming in Python, Advanced Bioinformatics, Machine Learning Skills Programming, machine learning (some R skills are desirable) Timestamp September 5, 2017 Description Interactions between the amino acid residues that constitute a protein drive its folding. Similarly, interactions between residues at the surface allow for the formation of protein complexes. Knowing which residues participate in such contacts can give us important information: for example, it can help us predict protein structures ab initio, or model the structures of complexes [1]. Evolutionary pressure on maintaining interactions between residues, whether intramolecular or intermolecular, gives rise to correlations in protein sequences. This phenomenon allows us to predict which residues participate in contacts using classifiers that model the relationship between different positions in a multiple sequence alignment. These algorithms, however, model global trends in the proteins under study, and it is not straightforward to distinguish the existence of subgroups which, for example, oligomerize in a different way [2]. A different way to approach the contact prediction problem would be by biclustering, a set of unsupervised techniques that simultaneously cluster observations and features [3,4]. In this project, you will explore whether an algorithm based on biclustering methods can help to predict both subgroups of proteins according to their patterns of residue contacts and the residue contacts themselves.

References

[1] Simkovic, Felix, et al. "Applications of contact predictions to structural biology." IUCrJ 4.3 (2017).

[2] Uguzzoni, Guido, et al. "Large-scale identification of coevolution signals across homo-oligomeric protein interfaces by direct coupling analysis." Proceedings of the National Academy of Sciences 114.13 (2017).

[3] Clevert, Djork-Arné, et al. "Rectified factor networks for biclustering of omics data." Bioinformatics 33.14 (2017).

[4] Padilha, Victor A., and Ricardo JGB Campello. "A systematic comparative evaluation of biclustering techniques." BMC Bioinformatics 18.1 (2017).

Inferring Gene Ontology annotation from -omics data

Supervisors Jan Top, Dick de Ridder Type Algorithm development Requirements Programming in Python, Advanced Bioinformatics, Machine Learning/Algorithms in Bioinformatics Skills Programming, machine learning (some R skills are desirable) Timestamp September 1, 2017 Description The Gene Ontology (GO) is widely used to assign (potential) functions to unknown genes and to interpret the outcomes of high-throughput measurements [1]. In recent years, a number of methods have been proposed to automatically predict GO annotations based on omics data, by deriving gene similarity from molecular interaction networks [2,3,4]. Mostly, these approaches have been tested on already well characterized organisms. However, in most plants GO annotations are derived based on homology of genes to their counterparts in model organisms, such as Arabidopsis thaliana (e.g. BLAST2GO, [5]). Moreover, GO annotations are to a far lesser extent based on solid experimental evidence in plants than in other organisms, such as human or yeast.

In this project, we intend to implement and benchmark a number of these methods, and experimentally study the influence of:

● the type of molecular interaction network available: inferred from co-expression or based on actual interaction measurements;

● the fact that for many plants, molecular interaction networks (derived from co-expression) are derived from homology to other plants;

● the overall lower level of annotation, leading to many terms that are assigned to just a few genes (cf. [4]).

The goal is to learn whether GO inference is feasible to help annotate crop species and, if so, what data are required and what procedure and settings are needed to achieve optimal results.

References

[1] The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nature Genetics 25(1): 25–29, 2000.

[2] Kramer M, Dutkowski J, Yu M, Bafna V, Ideker T. Inferring gene ontologies from pairwise similarity data. Bioinformatics 30(12):i34–i42, 2014.

[3] Li L, Yip KY. Integrating information in biological ontologies and molecular networks to infer novel terms. Scientific Reports 15(6):39237, 2016.

[4] Wang S, Cho H, Zhai C, Berger B, Peng J. Exploiting ontology graph for predicting sparsely annotated gene function. Bioinformatics 31(12):i357-64, 2015.

[5] Conesa A, Götz S. Blast2GO: A comprehensive suite for functional analysis in plant genomics. International Journal of Plant Genomics 619832, 2008.

Calling repetitive sequences in nanopore data

Supervisors Carlos de Lannoy, Judith Risse, Dick de Ridder Type Algorithm development Requirements Programming in Python, Advanced Bioinformatics, Machine Learning Skills Programming, machine learning (some R skills are desirable) Timestamp September 1, 2017 Description Nanopore technology provides a novel approach to DNA sequencing that yields long, label-free reads of constant quality [1]. The first commercial implementation of this approach, the MinION, has shown promise in various sequencing applications; however, the error-prone nature of its reads remains an unresolved issue. In particular, it is hard to distinguish homopolymer stretches, dinucleotide repeats and other repetitive structures. In recent work, we explored the use of recurrent neural networks (cf. [2]) to address this problem. Rather than relying on a pre-segmented and averaged signal as most currently used analysis tools do, the neural network directly interpreted the raw signal produced by the MinION. While this yielded promising results on simulated data, performance on real data was still unsatisfactory. In this project, the goal is to continue this work and explore a number of alternative solutions:

● improving the simulation, to create more realistic reads;

● studying the underlying complexity of the prediction problem, by extracting simple features from the signal and predicting repeat presence;

● fitting physical models of DNA translocation through the pore (cf. [3]).
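Predicting repeat presence requires knowing where repeats occur in a reference or basecalled sequence. The sketch below marks homopolymer stretches and dinucleotide repeats with regular expressions, for example to produce training labels. The thresholds, function name and test sequence are hypothetical choices, not values taken from the project.

```python
import re


def find_repeats(sequence, min_homopolymer=4, min_dinuc_units=3):
    """Return (kind, start, repeat) tuples, sorted by start position."""
    sequence = sequence.upper()
    hits = []

    # Homopolymer: one base repeated at least `min_homopolymer` times.
    homopolymer = re.compile(r"([ACGT])\1{%d,}" % (min_homopolymer - 1))
    for match in homopolymer.finditer(sequence):
        hits.append(("homopolymer", match.start(), match.group(0)))

    # Dinucleotide repeat: a unit of two *different* bases, repeated.
    # The lookahead (?!\2) excludes units such as "AA".
    dinucleotide = re.compile(
        r"(([ACGT])(?!\2)[ACGT])\1{%d,}" % (min_dinuc_units - 1)
    )
    for match in dinucleotide.finditer(sequence):
        hits.append(("dinucleotide", match.start(), match.group(0)))

    return sorted(hits, key=lambda hit: hit[1])


repeats = find_repeats("ACGTAAAAACGCACACACTT")
```

The same positions can be mapped back to the raw signal to obtain labelled windows for a feature-based classifier.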

At the same time, existing and novel solutions will have to be compared to the continuously developing state of the art, including the recently released Scrappie basecaller. The final goals of the project will be to (1) develop a realistic simulation tool for MinION raw signal, i.e. mimicking true-to-nature error rates and signal variation, and (2) develop a solution for the repeat detection problem in nanopore sequencing data, most likely as a postprocessing step on the results of a standard basecaller.

References

[1] De Lannoy C, de Ridder D, Risse J. A sequencer coming of age: de novo genome assembly using MinION reads. bioRxiv, https://doi.org/10.1101/142711, 2017.

[2] Boza V, Brejova B, Vinar T. Deepnano: deep recurrent neural networks for base calling in MinION nanopore reads. PLoS One 12(6):e0178751, 2017.

[3] Szalay T, Golovchenko JA. De novo sequencing and variant calling with nanopores using PoreSeq. Nature Biotechnology 33(10):1087–1091, 2015.

Haplotyping using 10X Genomics synthetic long reads

Supervisors Ehsan Motazedi, Mehmet Akdel, Dick de Ridder Type Tool/Algorithm development Requirements Programming in Python, Advanced Bioinformatics, Algorithms in Bioinformatics Skills Programming, genomics Timestamp September 1, 2017 Description Knowledge of haplotypes, i.e. the sequence of a single chromosome, provides invaluable information about gene functionality and the inheritance pattern in a population. In addition, haplotypes can be used as powerful genetic markers to detect the genomic loci associated with quantitative and qualitative traits [1]. Given the haplotypes, one can determine which variants are located on the same chromosome (the so-called phasing of the variants) and therefore tend to be inherited together. However, it is much more difficult to obtain the haplotypes in heterozygous individuals than to simply detect unphased genomic variants using next generation sequencing data [2].

Recently, phasing-aware assemblers have been released that promise to obtain the consensus sequence of each haplotype by taking advantage of the long sequence reads generated by technologies such as 10X Genomics or PacBio [3], [4]. The aim of this project is to apply these assemblers to simulated as well as real data, and to compare their performance on “diploid” and “polyploid” plants, the latter of which inherit more than one chromosome set from at least one of the parents. An important example of a polyploid is the tetraploid potato, which inherits two sets of core chromosomes from each parent (2n=4x=48). If time allows, the performance of these assemblers may also be compared to traditional haplotype estimation methods that use the sequence reads plus the unphased called variants in order to estimate the phasing of these variants [2], [5].

References

[1] Zheng G et al. Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nature Biotechnology 34(3):303-311, 2016.

[2] Motazedi E, Finkers R, Maliepaard C, de Ridder D. Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study. Briefings in Bioinformatics, 2017.

[3] Hulse-Kemp, Amanda M., et al. "Reference Quality Assembly of the 3.5 Gb genome of Capsicum annuum from a Single Linked-Read Library." bioRxiv (2017): 152777.

[4] Chin CS, Peluso P, Sedlazeck FJ, et al. Phased diploid genome assembly with single molecule real-time sequencing. Nat Methods 2016.

[5] Edge, Peter, Vineet Bafna, and Vikas Bansal. "HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies." Genome research 27.5 (2017): 801-812.

Predicting pathogen-host protein-protein interactions

Supervisors Sander Rodenburg, Dick de Ridder Type Data analysis, Algorithm development Requirements Programming in Python, Advanced Bioinformatics, Genomics Skills Programming, Statistics Timestamp 08-09-2017 The oomycete Phytophthora infestans is considered one of the most destructive plant pathogens for tomato and potato, causing huge crop losses worldwide. Genome sequencing of this pathogen revealed that it has an extremely large set of so-called effector proteins that are secreted into the cytoplasm of the plant cell to disrupt host defense [1]. For only a few of these effector proteins has the interacting counterpart in the host cell been determined. One example is the P. infestans Avr1 effector, which binds to the Sec5 exocyst subunit in potato [2], disrupting the plant immune system and thereby facilitating infection by the pathogen. A few years ago, an interaction network was reconstructed for P. infestans, based on predicted protein co-localization, co-occurrence, co-expression and interologs (conserved interactors) [3]. At that time, only a few effectors were included in the network. However, the latest bioinformatics tools and databases provide a huge array of new opportunities to characterize these proteins. This project comprises the re-characterization of P. infestans effector proteins with new database information, and prediction of their host interaction partners. By integrating the information from predictors, pathogen-host -omics data, and known interactions in other species, we aim to predict new candidate interactors in the P. infestans-host system [4,5].
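The interolog idea mentioned above (transferring a known interaction to the orthologs of both partners) can be sketched in a few lines. All protein identifiers, the ortholog mappings and the function name below are hypothetical placeholders; in practice the known interactions and orthology assignments would come from databases.

```python
def predict_interologs(known_interactions, pathogen_orthologs, host_orthologs):
    """Transfer known pathogen-host interactions from a template system.

    known_interactions: iterable of (template_pathogen, template_host) pairs
    pathogen_orthologs: template pathogen protein -> list of pathogen proteins
    host_orthologs:     template host protein -> list of host proteins
    """
    predictions = set()
    for template_pathogen, template_host in known_interactions:
        for effector in pathogen_orthologs.get(template_pathogen, ()):
            for host_protein in host_orthologs.get(template_host, ()):
                predictions.add((effector, host_protein))
    return predictions


# Toy example with made-up identifiers.
known = [("tmpl_P1", "tmpl_H1"), ("tmpl_P2", "tmpl_H2")]
pathogen_map = {"tmpl_P1": ["effector_A"]}
host_map = {"tmpl_H1": ["host_X", "host_Y"], "tmpl_H2": ["host_Z"]}
candidates = predict_interologs(known, pathogen_map, host_map)
```

Predictions of this kind are only candidates; the description proposes combining them with other predictors and -omics data before prioritizing interactors.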

1. Haas, B. J. et al. Genome sequence and analysis of the Irish potato famine pathogen Phytophthora infestans . Nature 461, 393–398 (2009).

2. Du, Y., Mpina, M. H., Birch, P. R. J., Bouwmeester, K. & Govers, F. Phytophthora infestans RXLR Effector AVR1 interacts with exocyst component Sec5 to manipulate plant immunity. Plant Physiol. 169, 1975–90 (2015).

3. Seidl, M. F., Schneider, A., Govers, F. & Snel, B. A predicted functional gene network for the plant pathogen Phytophthora infestans as a framework for genomic biology. BMC Genomics 14, 483 (2013).

4. Wuchty, S. Computational prediction of host-parasite protein interactions between P. falciparum and H. sapiens. PLoS One 6, e26960 (2011).

5. Durmuş, S., Çakır, T., Özgür, A. & Guthke, R. A review on computational systems biology of pathogen-host interactions. Front. Microbiol. 6, 235 (2015).

Metagenomic analysis of soil samples using nanopore sequencing of ribosomal operons

Supervisors Carlos de Lannoy, Dick de Ridder Type Data analysis Requirements Programming in Python Skills Programming Timestamp 1/11/2017 Metagenomic analysis of soil samples may provide agriculturists with valuable information, allowing them to attune fertilization and their choice of crops to the soil microbiome [1]. However, the high cost of second generation sequencing (SGS), the expertise required for analysis and the time between sample acquisition and results impede large-scale and frequent sampling, especially in remote areas. Due to recent advances in accuracy and throughput, nanopore sequencing technology may prove to be a viable and cost-effective alternative to SGS in metagenomic analyses in the near future [2].

Nanopore sequencers, such as the MinION and the GridION, produce much longer reads than SGS methods [2] and thus require less sequencing depth to reach the same phylogenetic depth. Furthermore, sequences may be acquired and analysed during sequencing itself, so a run needs to last only as long as strictly necessary to characterize a metagenomic sample. The MinION is also a highly portable device, roughly the size of a phone, potentially allowing on-the-spot analysis of samples.

In summary, nanopore sequencers have attractive properties that could open up routine metagenomic analysis for various purposes. Several metagenomic studies using nanopore sequencing have been published [3, 4]; however, to our knowledge, agricultural soil sample analysis has not been attempted before. The aims of this project are to (1) set up an easy-to-use pipeline for the characterization of agricultural soils, (2) apply the pipeline to actual samples collected from fields throughout The Netherlands and (3) compare the pipeline results to a similar analysis conducted using SGS (MiSeq).

References

[1] Roesch LF, Fulthorpe RR, Riva A, Casella G, Hadwin AK, Kent AD, Daroub SH, Camargo FA, Farmerie WG, Triplett EW. Pyrosequencing enumerates and contrasts soil microbial diversity. The ISME journal. 2007 Aug;1(4):283.

[2] De Lannoy C, de Ridder D, Risse J. A sequencer coming of age: de novo genome assembly using MinION reads. bioRxiv, https://doi.org/10.1101/142711, 2017.

[3] Kerkhof LJ, Dillon KP, Häggblom MM, McGuinness LR. Profiling bacterial communities by MinION sequencing of ribosomal operons. Microbiome. 2017 Sep 15;5(1):116.

[4] Leggett RM, Alcon-Giner C, Heavens D, Caim S, Brook TC, Kujawska M, Hoyles L, Clarke P, Hall L, Clark MD. Rapid MinION metagenomic profiling of the preterm infant gut microbiota to aid in pathogen diagnostics. bioRxiv. 2017 Jan 1:180406.
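The comparison in aim (3) of the project above could, for instance, be quantified with a community dissimilarity measure. The sketch below converts per-taxon read counts into relative abundances and computes the Bray-Curtis dissimilarity between a nanopore-based and a MiSeq-based profile. The taxon labels and counts are invented for illustration, and Bray-Curtis is an assumed choice; the project description does not prescribe a particular measure.

```python
def relative_abundance(counts):
    """Convert per-taxon read counts into fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads assigned to any taxon")
    return {taxon: n / total for taxon, n in counts.items()}


def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity: 0 for identical profiles, 1 for no shared taxa."""
    taxa = set(profile_a) | set(profile_b)
    shared = sum(min(profile_a.get(t, 0.0), profile_b.get(t, 0.0)) for t in taxa)
    total = sum(profile_a.values()) + sum(profile_b.values())
    return 1.0 - 2.0 * shared / total


# Hypothetical read counts per taxon for one soil sample
nanopore_counts = {"taxon_A": 50, "taxon_B": 30, "taxon_C": 20}
miseq_counts = {"taxon_A": 40, "taxon_B": 40, "taxon_D": 20}

dissimilarity = bray_curtis(
    relative_abundance(nanopore_counts),
    relative_abundance(miseq_counts),
)
print(round(dissimilarity, 3))
```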

Analysis of variations in event duration of nanopore reads

Supervisors: Carlos de Lannoy, Judith Risse, Dick de Ridder
Type: Data analysis, algorithm development
Requirements: Programming in Python
Skills: Programming, machine learning
Timestamp: 1/11/2017

Nanopore sequencing is a novel, rapidly developing approach to biological sequence analysis, which yields long but error-prone reads. The first commercially available nanopore sequencers, the phone-sized MinION and the table top-sized GridION, have shown promise in portable sequencing applications at low investment costs [1]. Moreover, their long reads allow assemblies of unprecedented contiguity [e.g. 2, 3] and the resolution of low-complexity regions [4]. A major issue, however, remains the accuracy of the reads, which lags behind that of second generation sequencing methods.

In a nutshell, nanopore sequencing allows the direct analysis of DNA and RNA strands by pulling the strand through a protein pore across which a constant electric potential is applied and measuring how the strand influences the electric current. Combinations of bases obstruct the current to a specific extent, so the measured current level is characteristic of the nucleotides residing in the pore at a given time. The pace at which the strand is threaded through the pore is regulated by a modified helicase, which feeds the strand through the pore in a step-wise manner [1]. While the helicase keeps the feed-through pace reasonably stable, some variation remains in the duration of these steps, commonly referred to as events. We hypothesize that this variation may be explained, in part, by the base sequence that is being threaded through the helicase at a given moment (i.e. a part of the strand that has yet to enter the pore and be sequenced). If this is the case, the duration of events may be used explicitly to improve sequencing accuracy. Moreover, it would provide valuable insight into the molecular mechanics at work during nanopore sequencing and the functioning of helicases in general.
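One simple way to probe the hypothesis above is to group event durations by the k-mer assumed to sit in the helicase and ask how much of the duration variance the grouping explains (eta squared, as in a one-way ANOVA). The sketch below assumes that events have already been paired with such a k-mer; how that pairing is obtained, the k-mer length and the toy values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def variance_explained(events):
    """Fraction of duration variance explained by k-mer identity (eta squared).

    events: list of (kmer, duration) pairs, where kmer is the sequence assumed
    to be in the helicase while the event was recorded.
    """
    durations = [duration for _, duration in events]
    grand_mean = mean(durations)

    groups = defaultdict(list)
    for kmer, duration in events:
        groups[kmer].append(duration)

    ss_total = sum((d - grand_mean) ** 2 for d in durations)
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return ss_between / ss_total if ss_total > 0 else 0.0


# Hypothetical events: durations fully determined by the k-mer ...
dependent = [("AAA", 1.0), ("AAA", 1.0), ("CCC", 3.0), ("CCC", 3.0)]
# ... versus durations unrelated to the k-mer
independent = [("AAA", 1.0), ("AAA", 3.0), ("CCC", 1.0), ("CCC", 3.0)]

print(variance_explained(dependent), variance_explained(independent))
```

On real data, a value between these two extremes would still need a significance assessment, for example by shuffling the k-mer labels.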

The aim of this project is to analyse whether there is a relation between event duration and the biological sequence that is analysed and, if such a relation is found, to design a tool that utilizes this information to correct nanopore reads.

References

[1] De Lannoy C, de Ridder D, Risse J. A sequencer coming of age: de novo genome assembly using MinION reads. bioRxiv, https://doi.org/10.1101/142711, 2017.

[2] Fournier T, Gounot JS, Freel KC et al. High-quality de novo genome assembly of the Dekkera bruxellensis UMY321 yeast isolate using Nanopore MinION sequencing. bioRxiv, https://doi.org/10.1534/g3.117.300128, 2017.

[3] Michael TP, Jupe F, Bemm F et al. High contiguity Arabidopsis thaliana genome assembly with a single nanopore flow cell. bioRxiv, https://doi.org/10.1101/149997, 2017.

[4] Jain M, Olsen HE, Turner DJ et al. Linear Assembly of a Human Y Centromere using Nanopore Long Reads. bioRxiv, https://doi.org/10.1101/170373, 2017.

NP Plug&Play - Linking Natural Product Molecular Substructures to Mass Spec data:

Mapping chemical substructures in the natural product space

Supervisors: Justin van der Hooft and Marnix Medema
Type: Metabolome analysis, Computational metabolomics
Requirements: Programming in Python, Advanced Bioinformatics
Skills: Basic Chemistry and Biology, Programming
Timestamp: January 15, 2018

Mammals and plants are externally and internally colonized by diverse microbial communities that have evolved an arsenal of structurally diverse specialized molecules or natural products (NPs) that play key roles in microbial-microbial and microbial-host interactions [1]. Recently, metabolome mining strategies have been introduced that find biochemical/chemical relationships between molecules [2, 3]. While progress is being made in structural and functional annotation, we are still largely in the dark regarding the diversity and identities of the full arsenal of specialized molecules [4]. The recognition and annotation of NP biochemical building blocks (substructures) is a promising way to accelerate structural elucidation efforts.

The main goal of this project is to define the molecular substructures that are detected by current metabolome mining strategies and to represent them in both computer-readable and human-readable form. In the process, you will assist in setting up a framework to connect those molecular substructures to gene modules present in the biosynthetic gene clusters that produce specialized molecules. A large selection of NP structures is available from different databases, including Super Natural II (>325,000 structures). During the project, you will: i) assemble an up-to-date NP database, ii) extract chemical substructures/scaffolds and their statistics using an open-source framework such as RDKit, CDK or Indigo, inspired by proposed algorithms to compare functional substructures, iii) define a subset of the chemical substructures that are mass-spectrometry detectable, iv) visualize and analyse substructures in a molecular network with Cytoscape and NetworkX, and v) correlate the presence/absence of substructures and Mass2Motifs in mass spec data [3].
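Step v) above boils down to comparing two binary presence/absence vectors over the same set of molecules. A minimal sketch, assuming that both vectors are already available, is the phi coefficient (the Pearson correlation for binary data). The vectors below are invented, and the choice of phi rather than another association measure is an assumption made for illustration.

```python
from math import sqrt


def phi_coefficient(x, y):
    """Phi coefficient between two equally long binary presence/absence vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    pairs = list(zip(x, y))
    n11 = sum(1 for a, b in pairs if a and b)          # both present
    n10 = sum(1 for a, b in pairs if a and not b)      # only x present
    n01 = sum(1 for a, b in pairs if not a and b)      # only y present
    n00 = sum(1 for a, b in pairs if not a and not b)  # both absent
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # A constant vector carries no information; report 0 in that case
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0


# Hypothetical data: one entry per molecule in the database
substructure_present = [1, 1, 0, 0, 1, 0]
mass2motif_present = [1, 1, 0, 0, 1, 0]
unrelated_motif = [1, 0, 1, 0, 1, 0]

print(phi_coefficient(substructure_present, mass2motif_present))
print(phi_coefficient(substructure_present, unrelated_motif))
```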

Finally, this project will reveal tractable molecular substructures in natural products. This is essential to categorize them and link those substructures to genomics data. In particular, recognition of mass spectrometry-detectable substructures will aid NP discovery workflows that rely on mass spectrometry fragmentation data as molecular structural input. Furthermore, such an approach does not only work for natural products: we foresee that general metabolomics annotation workflows will also be enriched with substructure-based annotations.

Key references:

1. Donia, M.S. and M.A. Fischbach, Science, 2015. 349(6246).

2. Wang, M., et al., Nat Biotech, 2016. 34(8): p. 828-837.

3. van der Hooft, J.J.J., et al., 2016. 113(48): p. 13738-13743.

4. Loulou Peisl, B.Y., E.L. Schymanski, and P. Wilmes, Analytica Chimica Acta, 2018, in press.

NP Plug&Play - Linking Natural Product Molecular Substructures to Mass Spec data:

In silico generation of mass-based natural product substructures

Supervisors: Justin van der Hooft and Marnix Medema
Type: Metabolome analysis, Computational metabolomics
Requirements: Programming in Python, Advanced Bioinformatics
Skills: Basic Chemistry and Biology, Programming
Timestamp: January 15, 2018

Mammals and plants are externally and internally colonized by diverse microbial communities that have evolved an arsenal of structurally diverse specialized molecules or natural products (NPs) that play key roles in microbial-microbial and microbial-host interactions [1]. Recently, metabolome mining strategies have been introduced that find (bio)chemical relationships between molecules [2, 3]. While progress is being made in structural and functional annotation, we are still largely in the dark regarding the diversity and identities of the full arsenal of specialized molecules [4]. The recognition and annotation of NP biochemical building blocks (substructures) is a promising way to accelerate structural elucidation efforts.

The main goal of this project is to exploit state-of-the-art in silico spectral prediction software to support molecular substructure annotations. You will assist in setting up a framework to connect those molecular substructures to gene modules present in the biosynthetic gene clusters that produce specialized molecules. A large selection of NP structures is available from different databases, including Super Natural II (>325,000 structures). During the project, you will: i) predict MS/MS fragmentation spectra and fragment structures of a very large set of natural products using CFM-ID [5], ii) discover mass fragmentation patterns (Mass2Motifs) [3] from in silico data, iii) correlate the presence/absence of chemical substructures and “in silico Mass2Motifs” in reference compounds and experimental data, and iv) visualize/analyse substructure presence/absence with Cytoscape and NetworkX.

Finally, this project will build an inventory of potentially tractable molecular substructures in natural products. Recognition of these building blocks of specialized molecules will help with linking those substructures to genomics data. In particular, the large set of in silico MS/MS spectra and discovered Mass2Motifs will aid natural product discovery workflows that rely on mass spectrometry. Furthermore, such an approach does not only work for natural products: we foresee that general metabolomics annotation workflows will also be enriched with substructure-based annotations.

Key references:

1. Donia, M.S. and M.A. Fischbach, Science, 2015. 349(6246).

2. Wang, M., et al., Nat Biotech, 2016. 34(8): p. 828-837.

3. van der Hooft, J.J.J., et al., 2016. 113(48): p. 13738-13743.

4. Loulou Peisl, B.Y., E.L. Schymanski, and P. Wilmes, Analytica Chimica Acta, in press.

5. Allen, F., R. Greiner, and D. Wishart, Metabolomics, 2015. 11(1): p. 98-110.

Phylogenetic distribution and evolution of microbial terpene synthases

Supervisors: Marnix Medema, Paolina Garbeva, Lara Martin-Sanchez
Type: Data analysis, Computational genomics
Requirements: Advanced Bioinformatics
Skills: Genomics, programming, chemical biology
Timestamp: January 15, 2018

Description

Terpenes are the largest and most diverse group of natural products. Enzymes for their synthesis are found in organisms belonging to all kingdoms of life. Terpenes play important roles in plant-plant, plant-insect and plant-microbe interactions, but also in microbe-microbe interactions and communication, and are therefore considered a possible common language for communication between different organisms [1, 2]. As an example, we recently showed that terpenes can play an important role in long-distance fungal-bacterial interactions between the soil-borne fungal plant pathogen Fusarium culmorum and the rhizosphere bacterium Serratia plymuthica [3]. A recent publication showed that microbial terpene synthase-like genes are widely distributed in non-seed plants and suggests that these plants acquired the genes from bacteria and fungi by horizontal gene transfer [4]. Terpene biosynthetic gene clusters (BGCs) are recognised by antiSMASH and include four different types of enzymes: terpene synthases (TS), phytoene synthases (PS), squalene synthases (SS) and terpene/squalene cyclases (SC).

The aim of the project is to gain more knowledge on the evolution of the different terpene synthase genes (TSs, PSs, SSs and SCs) in microbes. We will perform phylogenetic analyses to determine their distribution in bacteria (and later in fungi and protists), using microbial genomes from representatives of different phylogenetic groups. We will compare these analyses with those of housekeeping genes of bacteria, fungi and protists to see how the evolution of terpene synthase genes is linked to microbial evolution. Furthermore, we will explore which types of terpene synthase genes are present in microbial communities and how the different genes are distributed. We will analyse terpene synthase genes in metagenomes from sugar beet endophytes to determine which terpene synthases are dominant and how they relate to the microbial communities.

Key references

[1] van Dam et al. (2016) Calling in the Dark: the Role of Volatiles for Communication in the Rhizosphere. In Deciphering Chemical language of Plant Communication. Springer, pp. 175-210.

[2] Schulz-Bohm et al. (2017) Microbial volatiles: small molecules with an important role in intra- and inter-kingdom interactions. Frontiers in Microbiology. vol 8, 02484.

[3] Schmidt et al. (2017) Fungal volatile compounds induce production of the secondary metabolite Sodorifen in Serratia plymuthica PRI-2C. Scientific Reports, 7, 862.

[4] Jia et al. (2016) Microbial-type terpene synthase genes occur widely in nonseed land plants, but not in seed plants. PNAS, 113(43), 12328–12333.

Phylogeny-corrected discovery of gene modules encoding for specialized molecule substructures

Supervisors: Justin v/d Hooft, F. del Carratore (UoM), Satria Kautsar, Marnix Medema
Type: Genome analysis, Computational genomics
Requirements: Programming in Python, Advanced Bioinformatics
Skills: Basic Biology and Chemistry, Programming
Timestamp: January 23, 2018

Specialized metabolites produced by microbes, fungi, and plants are increasingly used for various applications, such as antibiotics and anticancer drugs. The genes encoding the enzyme ensembles that produce those specialized molecules (also called natural products) are often physically clustered together in so-called biosynthetic gene clusters (BGCs). Progress in genome analysis has led to sophisticated algorithms that can predict the presence of such BGCs in whole genome sequences [1]. These tools yield large numbers of putative BGCs with known but also largely unknown molecular products, whose structures remain difficult to predict. Experimental validation of links between BGCs and the products they encode has led to the discovery of several subclusters of genes (also known as ‘gene modules’) that are responsible for the production of specific molecular substructures across otherwise structurally diverse molecules. Automated methods based on statistical inference of gene co-localization have been devised to recognize these subclusters; however, these methods suffer from redundancy in the sets of BGCs used as input data, which violates the independence assumption of the hypergeometric tests employed.

The main goal of this project is to develop an improved algorithm to identify subclusters, which takes into account the evolutionary relationships between gene clusters. In the process, you will assist in setting up a framework to connect those gene modules to molecular substructures, which will assist in the functional annotation of both genes and molecules. During the project you will: i) run antiSMASH [1], ii) followed by orthoMCL [2] to find clusters of co-evolved genes, i.e., secondary metabolite Clusters of Orthologous Groups (smCOGs), iii) implement/adapt methods to correct for the phylogenetic distance of the producing organisms and/or the relatedness of the gene clusters, e.g. inspired by existing methods [3, 4], and iv) finally, visualize/analyse gene module presence/absence with Cytoscape and NetworkX by mapping the results onto Gene Cluster Families (GCFs), including validation using modules with known products.
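The redundancy problem described above can be made concrete with a small sketch of a hypergeometric co-occurrence test for two gene families across a set of BGCs. This is an illustrative baseline written for this text, not the implementation of any existing subcluster detection tool, and the counts are invented. Duplicating every BGC, as happens when near-identical clusters from closely related strains are included, changes nothing biologically but makes the p-value much smaller.

```python
from math import comb


def hypergeom_pvalue(n_total, n_with_a, n_with_b, n_both):
    """P(X >= n_both) for the overlap X of two gene families across BGCs.

    n_total:  number of BGCs in the data set
    n_with_a: BGCs containing gene family A
    n_with_b: BGCs containing gene family B
    n_both:   BGCs containing both families
    Assumes the BGCs are independent draws, which redundancy violates.
    """
    upper = min(n_with_a, n_with_b)
    tail = sum(
        comb(n_with_a, k) * comb(n_total - n_with_a, n_with_b - k)
        for k in range(n_both, upper + 1)
    )
    return tail / comb(n_total, n_with_b)


# Hypothetical counts: 10 BGCs, families A and B always occur together (3 times)
p_original = hypergeom_pvalue(10, 3, 3, 3)
# The same data set with every BGC present twice (e.g. two near-identical strains)
p_duplicated = hypergeom_pvalue(20, 6, 6, 6)

print(p_original, p_duplicated)
```

A phylogeny-aware method would need to down-weight or collapse such related clusters before testing; how best to do that is the open question of the project.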

In the end, this project will build a library of gene modules possibly representing the biosynthesis of novel chemical moieties, which will drive the discovery of natural products such as antibiotics and anticancer agents by matching them to patterns related to chemical substructures in large-scale metabolomic data.

Key references:

1. Blin, K., et al., Nucleic Acids Research, 2017. 45(W1): p. W36-W41.

2. Li, L., C.J. Stoeckert, and D.S. Roos, Genome Research, 2003. 13(9): p. 2178-2189.

3. Lozupone, C. and R. Knight, Applied and Environmental Microbiology, 2005. 71(12): p. 8228-8235.

4. Keck, F., et al., 2016. 6(9): p. 2774-2780.

Topic modelling to discover gene modules encoding for specialized molecule substructures

Supervisors: Justin van der Hooft, Simon Rogers (UoG), Satria Kautsar, Marnix Medema
Type: Genome analysis, Computational genomics
Requirements: Programming in Python, Advanced Bioinformatics
Skills: Basic Biology and Chemistry, Programming
Timestamp: May 14, 2018

Specialized molecules produced by microbes, fungi, and plants are increasingly used for various applications, such as antibiotics and anticancer drugs. The genes encoding the enzyme ensembles that produce those specialized molecules (also called natural products) are often physically clustered together in so-called biosynthetic gene clusters (BGCs). Progress in genome analysis has led to sophisticated algorithms that can predict the presence of such BGCs in whole genome sequences [1]. These tools yield large numbers of putative BGCs with known but also largely unknown molecular products, whose structures remain difficult to predict. Experimental validation of links between BGCs and the products they encode has led to the discovery of several subclusters of genes (modules) that together produce the same molecular substructure in otherwise structurally diverse molecules.

The main goal of this project is to find co-occurring groups of genes in a large pool of BGCs. You will assist in setting up a framework to connect those gene modules to molecular substructures, which will assist in the functional annotation of both genes and molecules. Topic modelling with Latent Dirichlet Allocation (LDA) [2] is nowadays applied in different fields, including metabolomics [3], to discover co-occurring patterns of signals. Here, you will apply LDA to discover gene subclusters in a large pool of BGCs from a variety of sources. During the project you will: i) run antiSMASH [1], ii) followed by orthoMCL [4] to find clusters of co-evolved genes, i.e., secondary metabolite Clusters of Orthologous Groups (smCOGs), iii) subsequently implement LDA to find co-occurring smCOGs (i.e., gene modules) in the BGCs, and iv) finally, visualize/analyse gene module presence/absence with Cytoscape and NetworkX, including validation with known modules.
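To make the analogy in step iii) concrete: each BGC is treated as a "document" and each smCOG it contains as a "word", so that LDA topics correspond to candidate gene modules. The sketch below is a deliberately small collapsed Gibbs sampler written for illustration only. The smCOG identifiers, hyperparameters and number of topics are assumptions, and a real analysis would more likely rely on an established LDA implementation.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one word distribution per topic."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]            # counts per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # counts per (topic, word)
    topic_total = [0] * n_topics
    assignments = []

    # Random initial topic for every word occurrence
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the current assignment from the counts
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Resample the topic from its conditional distribution
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    return [
        {
            vocab[v]: (topic_word[k][v] + beta) / (topic_total[k] + n_words * beta)
            for v in range(n_words)
        }
        for k in range(n_topics)
    ]


# Hypothetical BGCs, each listed as the smCOGs it contains
bgcs = [
    ["smCOG_01", "smCOG_02", "smCOG_03"],
    ["smCOG_01", "smCOG_02", "smCOG_03", "smCOG_09"],
    ["smCOG_05", "smCOG_06", "smCOG_07"],
    ["smCOG_05", "smCOG_06", "smCOG_07", "smCOG_09"],
]
topics = lda_gibbs(bgcs, n_topics=2)
for topic in topics:
    print(sorted(topic, key=topic.get, reverse=True)[:3])
```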

Finally, this project will build a library of gene modules possibly representing the biosynthesis of natural product building blocks. Reliable structural annotation of substructures will boost product prediction and prioritization efforts, since the chemical classification of molecules is facilitated and extensive elucidation efforts can be steered towards building blocks that are known to be bioactive or that have completely novel chemistry.

Key references:

1. Blin, K., et al., Nucleic Acids Research, 2017. 45(W1): p. W36-W41.

2. Blei, D.M., A.Y. Ng, and M.I. Jordan, J. Mach. Learn. Res., 2003. 3: p. 993-1022.

3. van der Hooft, J.J.J., et al., 2016. 113(48): p. 13738-13743.

4. Li, L., C.J. Stoeckert, and D.S. Roos, Genome Research, 2003. 13(9): p. 2178-2189.

Prioritizing genomic variants

Supervisors: Christian Gross, Dick de Ridder
Type: Data analysis
Requirements: Programming in Python, Advanced Bioinformatics
Skills: Programming
Timestamp: May 28, 2018

Description

Every individual genome is different, and these differences can lead to changes in evolutionary fitness. Besides structural differences, insertions and deletions, every human genome varies by around 10 million single nucleotide variants (SNVs). To identify which of these SNVs cause changes in fitness, genome-wide association studies (GWAS) are conducted, which yield a plethora of SNVs that are associated with a particular phenotypic trait. The number of identified SNVs is too large for each one to be evaluated individually. Therefore, in silico variant prioritization tools are applied to filter these SNVs further. The first tool capable of prioritizing variants according to their functional effect across the entire genome was Combined Annotation Dependent Depletion (CADD) [1]. The authors of CADD developed a method to generate data sets of variants that can be used as a proxy for variants with a true effect on the phenotype of an organism. To do this, they simulate variants that have not experienced purifying selection and may therefore have a fitness-reducing effect. Optimizing the data generation process will lead to better variant scoring algorithms and novel discoveries in research on the genomes of human and other species. One important aspect of this data generation process is the size/depth of the whole-genome sequence alignments that are used to generate SNVs and conservation scores for the annotation of those SNVs. In this project, you will try to identify optimally sized multiple sequence alignments based on various mouse models and compare them with models derived from human and pig.

References

1. M. Kircher, D. M. Witten, P. Jain, B. J. O'Roak, G. M. Cooper and J. Shendure, “A general framework for estimating the relative pathogenicity of human genetic variants,” Nature Genetics, vol. 46, no. 3, pp. 310-317, 2014.
