Research Strategy A. Significance

15
Research Strategy A. Significance Plant derived natural products are not only a promising emerging area in modern medicine, but a foundation of successful current treatments. Over 50% of U.S. pharmaceuticals derive from plants 1 , and many “breakthrough” cancer treatment compounds are produced directly from plants. For example, vinblastine extracted from C. roseus plants has helped bring childhood leukemia survival rates from ~10% before the drug’s inception as a pharmaceutical in the 1960’s up to ~95% survival chances today 2 . These complex natural products are synthesized as specialized metabolites, which are substances that a plant produces in response to stress in order to provide a survival advantage when an environmental trigger occurs (e.g. drought or soil salinity stress, insect feeding, pathogen attack). Therefore, these compounds are normally produced in plants at very low concentrations, except under stressful conditions that are detrimental to plant growth and development. This has led to global shortages of critical plant-derived pharmaceuticals 3 . Two major barriers must be overcome in order to address this problem in the native plant system. The first is to understand how to upregulate specialized metabolite production in the plant, particularly where biosynthetic pathways are mostly known. The second challenge is to systematically identify genes responsible for the biosynthesis of plant natural products where pathways are not known. Solutions to these two challenges are therefore recognized as specific priorities listed in PAR-18-434 (Synthetic Biology for Engineering Applications). Our proposal directly addresses each of these challenges through the modeling and analysis of plant transcriptional control as a lynchpin of plant synthetic biology, and is significant at multiple levels: A1) Our proposal presents an efficient modeling approach for identifying critical regulatory elements in biosynthetic gene promoters. The Transcription Initiation Pattern Recognition (TIPR) machine learning model allows reverse-engineering of endogenous gene promoters, to efficiently identify direct transcriptional regulators and their specific binding locations for each gene in a pathway of interest. A2) Our proposal develops algorithms to extend natural product genome mining to the estimated 75% of plant biosynthetic genes not located in physical clusters in plant genomes. The TIPR concept will allow researchers to go beyond transcriptomic screens to an efficient new technique that can use patterns of direct regulators to predict genes involved in natural products biosynthesis; this technique does not require physical proximity of genes in a pathway. A3) Our proposal develops much-needed genomic resources for medicinal plants using Catharanthus roseus (C. roseus) as a model system. The limited availability of high-quality full-coverage genomes for medicinal plants presents a practical barrier to computational genomics research in plant natural products development. C. roseus is an ideally suited choice in which to develop and apply principled solutions to PAR-18-434 that are extensible to other plant systems. The expression of plant natural product pathways in yeast or other heterologous host is challenging due to the complexity of plant gene regulation 4 . Even if a full biosynthetic pathway is known, molecular conditions (e.g. transcriptional control pathways, post-transcriptional modifications) for some specialized metabolites make heterologous expression in yeast or bacteria particularly difficult to achieve. Tissue-culture systems can encounter similar difficulties when critical biosynthetic pathway components synthesized by a plant are not produced 5, 6 (e.g. vinblastine production in hairy root cultures). Tackling this general problem requires a more complete scientific understanding of specialized metabolite pathway regulation in plant systems to allow in planta pathway engineering. Several common factors underpin the challenges in genome mining and synthetic biology of plant specialized metabolites, causing it to lag significantly behind that of bacteria and fungi. Specifically, the increased size and complexity of plant genomes as compared to those found in microorganisms, the lack of localized biosynthetic gene clusters involved in specialized metabolite production, and more complex regulatory pathways all pose unique challenges in plant genomes. This leads to difficulties in predicting the direct transcriptional regulators of natural product biosynthetic genes, despite the expectation that plant stress response pathways regulating specialized metabolite production typically involve control by several master regulators 7, 8 . Transcriptional analysis in plant genomes has historically been considered an informational bottleneck due to lack of knowledge about promoter structure. In fact, only about ~20% of promoters in the model plant Arabidopsis have a TATA-box, with poorly understood core promoter structure overall as compared to non-plant model organisms 9 . However, my lab (the Megraw lab) has developed machine learning

Transcript of Research Strategy A. Significance

Research Strategy

A. Significance

Plant derived natural products are not only a promising emerging area in modern medicine, but a foundation

of successful current treatments. Over 50% of U.S. pharmaceuticals derive from plants1, and many

“breakthrough” cancer treatment compounds are produced directly from plants. For example, vinblastine extracted from C. roseus plants has helped bring childhood leukemia survival rates from ~10% before the drug’s inception as a pharmaceutical in the 1960’s up to ~95% survival chances today2. These complex natural products are synthesized as specialized metabolites, which are substances that a plant produces in response to stress in order to provide a survival advantage when an environmental trigger occurs (e.g. drought or soil salinity stress, insect feeding, pathogen attack). Therefore, these compounds are normally produced in plants at very low concentrations, except under stressful conditions that are detrimental to plant growth and development. This has led to global shortages of critical plant-derived pharmaceuticals3.

Two major barriers must be overcome in order to address this problem in the native plant system. The first is to understand how to upregulate specialized metabolite production in the plant, particularly where biosynthetic pathways are mostly known. The second challenge is to systematically identify genes responsible for the biosynthesis of plant natural products where pathways are not known. Solutions to these two challenges are therefore recognized as specific priorities listed in PAR-18-434 (Synthetic Biology for Engineering Applications).

Our proposal directly addresses each of these challenges through the modeling and analysis of plant transcriptional control as a lynchpin of plant synthetic biology, and is significant at multiple levels:

A1) Our proposal presents an efficient modeling approach for identifying critical regulatory elements in biosynthetic gene promoters. The Transcription Initiation Pattern Recognition (TIPR) machine learning model allows reverse-engineering of endogenous gene promoters, to efficiently identify direct transcriptional regulators and their specific binding locations for each gene in a pathway of interest.

A2) Our proposal develops algorithms to extend natural product genome mining to the estimated 75% of plant biosynthetic genes not located in physical clusters in plant genomes. The TIPR concept will allow researchers to go beyond transcriptomic screens to an efficient new technique that can use patterns of direct regulators to predict genes involved in natural products biosynthesis; this technique does not require physical proximity of genes in a pathway.

A3) Our proposal develops much-needed genomic resources for medicinal plants using Catharanthus roseus (C. roseus) as a model system. The limited availability of high-quality full-coverage genomes for medicinal plants presents a practical barrier to computational genomics research in plant natural products development. C. roseus is an ideally suited choice in which to develop and apply principled solutions to PAR-18-434 that are extensible to other plant systems.

The expression of plant natural product pathways in yeast or other heterologous host is challenging due to the complexity of plant gene regulation4. Even if a full biosynthetic pathway is known, molecular conditions (e.g. transcriptional control pathways, post-transcriptional modifications) for some specialized metabolites make heterologous expression in yeast or bacteria particularly difficult to achieve. Tissue-culture systems can encounter similar difficulties when critical biosynthetic pathway components synthesized by a plant are not produced5, 6 (e.g. vinblastine production in hairy root cultures). Tackling this general problem requires a more complete scientific understanding of specialized metabolite pathway regulation in plant systems to allow in planta pathway engineering.

Several common factors underpin the challenges in genome mining and synthetic biology of plant specialized metabolites, causing it to lag significantly behind that of bacteria and fungi. Specifically, the increased size and complexity of plant genomes as compared to those found in microorganisms, the lack of localized biosynthetic gene clusters involved in specialized metabolite production, and more complex regulatory pathways all pose unique challenges in plant genomes. This leads to difficulties in predicting the direct transcriptional regulators of natural product biosynthetic genes, despite the expectation that plant stress response pathways regulating specialized metabolite production typically involve control by several master regulators7, 8. Transcriptional analysis in plant genomes has historically been considered an informational bottleneck due to lack of knowledge about promoter structure. In fact, only about ~20% of promoters in the model plant Arabidopsis have a TATA-box, with poorly understood core promoter structure overall as compared to non-plant model organisms9. However, my lab (the Megraw lab) has developed machine learning

models that have led to significant advancements in predicting direct transcriptional regulators from genomic sequence in plants10, 11. This proposal outlines an approach to apply these machine-learning techniques to identify and upregulate genes involved in specialized metabolite biosynthesis based on the DNA sequences involved in plant transcriptional control4.

B. Innovation

Even for known natural product biosynthetic pathways, the transcriptional engineering of a gene or entire pathway that one wishes to control in planta is currently a time-consuming guess-and-check process. To gain an understanding of the endogenous combinatorial control mechanisms of each gene in a pathway, one may either start by (i) taking advantage of a very limited number of confirmed direct interactions reported in the literature in the plant species, tissues, and conditions of interest (if present), or (ii) first generating lists of co-expressed Transcription Factors (TFs) identified through transcriptomic screens under these conditions (if available), then testing many hundreds of candidates one-by-one for TF:promoter interactions. Moving forward to understand combinatorial regulation of each gene by multiple TFs is then orders of magnitude more time-consuming. If libraries of exogenous elements are available, this may be helpful for constitutively overexpressing individual genes; however, even if the desired effect is achieved for individual genes in planta at the appropriate developmental time and in the correct tissues, control of an entire pathway is difficult to achieve. Thus, systematic methods for reverse-engineering and rational promoter design are greatly needed.

The genome mining process presents similar challenges. Traditionally, genome mining for plant biosynthetic enzymes has taken advantage of standard transcriptomic approaches such as RNA-Seq. This approach has

aided success in identifying plant natural product genes in specific families of great importance12, 13

. This

approach typically works best when the specific family of enzymes is already known (e.g. cytochrome P450, methyltransferase), so that correlational transcriptomic data (RNA-Seq) can be used to reduce the set of candidates down to a manageable size for exhaustive laboratory testing. However, there are a few commonly encountered problems that have greatly hindered wide scale adoption on a genomic scale of discovery, including difficulty in determining which genes are truly directly transcriptionally regulated as compared to simply being correlated in RNA-Seq readouts. Therefore it has been specifically noted in prominent recent

reviews on the subject4, 14 that new gene discovery approaches are critically needed to drive the field forward.

In summary, current techniques for plant synthetic pathway engineering and genome mining are labor intensive and require extensive manual curation. In the following research design, we propose a completely different paradigm that is made possible by machine learning models. The central idea of the Transcription Initiation Pattern Recognition (TIPR) system is that the functional combinations of direct TF regulators and their patterns of cis-regulatory elements (CREs) in each promoter can be learned from accurate genome-wide information about the start sites of the transcripts expressed in a sample of interest. This modeling concept is applicable for inferring the combinations of direct regulators for every gene in a specific pathway of interest, and thus in reverse-engineering promoters in the pathway. This concept is immediately extensible to genome mining, as it allows one to learn regulatory patterns that are pathway-specific for metabolites of interest. This innovative use of pattern recognition techniques enables the possibility to move from genome sequencing efforts directly to specialized metabolite biosynthetic pathway identification and engineering in a simplified, efficient, and reproducible way.

C. Approach

C.1 Overview of the team and the approach

A genomics-based approach to increasing production of plant specialized metabolites is complementary to the more standard approach of metabolomic screening and flux modeling, which is then necessarily followed by the researcher’s application of chemical logic. Central to a genomics approach is tackling the understanding of combinatorial gene regulation in plants, an area of study typically considered too opaque for practical application. The advent of new computational approaches based on machine learning models10, 11 make this feasible. The scientific and technological challenges to our approach are quite surmountable as described in each proposed step of our research strategy; yet the historical separation between the fields of computational biology, plant biology, synthetic biology, and medicinal chemistry yields a current research climate in which a much-needed synthesis of these approaches is under-utilized. Our research team comprised of the Megraw and Philmus labs at Oregon State University (OSU), and the Lee-Parsons lab at Northeastern University (NU). Our team environment is uniquely suited in this area; it brings together strengths in computational biology, plant biology, plant genomics, plant bioengineering, and natural products chemistry. OSU’s Center for Genome Research and Biocomputing (CGRB) and Linus Pauling Institute (LPI) are outstanding drivers of successful

Figure 1: Project overview and group interactions.

interdisciplinary collaboration across this unusual spectrum of research areas. NU’s unique strengths in biological chemistry and bioengineering complement and complete this environment for our project.

Project leadership and algorithm development: PI Megraw will lead and coordinate the project. Dr. Megraw is a computational biologist by graduate degree, and she received training in laboratory plant biology during her postdoctoral experience. Dr. Megraw’s lab has extensive experience in plant computational genomics algorithms, focusing on the development of applied machine learning methods for understanding plant promoter architecture. The Megraw lab has used these algorithms to develop approaches specifically for elucidating plant stress response pathways, which is integral to the understanding of specialized metabolite

production. This research led to a scientific collaboration with Dr. Philmus, whose expertise in natural products chemistry guided the team in applying these algorithms to the problem of plant natural products discovery. Megraw and Philmus labs have worked together to compile an internal database of specialized metabolites in the model plant Arabidopsis thaliana (many of which come from the same families as the medically relevant metabolites in C. roseus), and to implement a preliminary machine learning analysis using available Transcription Start Site Sequencing (TSS-Seq) datasets10, 15 in the Megraw lab. Both labs have extensive experience working with the CGRB to ensure that data dissemination, resource sharing, and reproducible research are carried out at all levels. The Lee-Parsons lab brings a critical complementary component to our

team with extensive experience in bioengineering medicinal plant systems. Dr. Lee-Parsons’ depth of expertise is in investigating and engineering the transcriptional regulation of alkaloid biosynthesis in C. roseus plants and tissue culture systems, completing our team. The Megraw, Philmus, and Lee-Parsons labs currently meet at scientific conferences and collaborate using Zoom video-conference meetings, a practice which we would enhance during this project by meeting monthly via Zoom and in-person at two scientific conferences per year.

Choice of organism for the proposed study: The team has performed preliminary experiments in the medicinal plant C. roseus, which is ideally suited for the proposed research. Like the model plant Arabidopsis in which our preliminary machine learning analyses took place, C. roseus is a diploid flowering dicot with a relatively compact genome. C. roseus produces vinblastine, vincristine, and other specialized metabolites currently in use as pharmaceuticals. The vinca alkaloid “model pathway” producing vinblastine and vincristine is almost entirely known, with missing genes for only a few biosynthetic steps. C. roseus is also stably transformable16-18, making it ideally suited for the genome-engineering goals of our proposed study. The C. roseus genome was initially sequenced19 as part of the Medicinal Plant Genomics Resource, which provides a good starting point for initial mapping of TSS-Seq reads. However, the need for renewed efforts at obtaining highly accurate full-coverage medicinal plant genomes is currently being highlighted by the plant natural

products community4, 14

. Without recently developed long-read technologies for scaffolding, the present C.

roseus genome sequence20 does not contain several known biosynthetic genes in the pathway of interest. The transcriptional regulatory reverse-engineering goal of our proposal also relies on a complete genome sequence that is accurate even in the non-genic regions where short (~6-10 nt) DNA cis-regulatory elements control RNA Polymerase-II transcription. Furthermore, multiple varieties of C. roseus are important to the field, but their genomes differ substantially. The CGRB at OSU has the requisite experience in long-read assembly along with a history of successful sequencing of highly complex plant genomes in conjunction with OSU’s Herbarium.

Wet laboratory work: The Megraw, Philmus, and Lee-Parsons groups are independently funded laboratories committed to gathering the necessary data. As part of her K99 postdoctoral training project, Dr. Megraw worked to adapt TSS-Seq from animal systems to Arabidopsis, and further developed this technique in her own laboratory15 as one of her R00 research goals; her lab was the first to publish this method and the resulting analysis in a plant system. The Megraw lab has also taken on other challenging protocols, including development of an adaptation of the DNase-I Hypersensitive Site Sequencing protocol for plants21, 22, and has experience in transcriptome profiling under various abiotic stress treatments (including plant hormone

treatments) in addition to protein:DNA interaction assays including AlphaScreen and Electrophoresis Mobility Shift Assays (gel-shifts). Dr. Megraw and Dr. Philmus have worked together on induction experiments using meJA and ethephon treatments of C. roseus seedlings23. The Philmus group is experienced in medicinal chemistry, including mass spectrometric (MS) analysis of specialized metabolites. The proximity of the Megraw and Philmus laboratories on campus facilitates the collaboration, enabling rapid progress with in-person troubleshooting particularly for MS experiments. Dr. Lee-Parsons’ extensive wet-laboratory experience includes design of synthetic promoter constructs, development of transient expression assays (i.e. the EASI protocol) for transformation in C. roseus seedlings24, and development of transgenic plant tissue culture in C. roseus, bringing critical complementary technical experience in plant bioengineering. These methods will be applied by our team to generate the proposed C. roseus datasets in Table C.1 below.

Table C.1: Overview of datasets to be collected by the Megraw, Philmus, and Lee-Parsons groups.

Aim Assay Group Dataset Scale

1 TSS-Seq* Megraw C. roseus seedlings, induced seedlings Genome-wide

1 AlphaScreen assays Megraw C. roseus seedlings, induced seedlings ~1500 assays

2 HPLC analysis+ Philmus C. roseus induced root culture Vinca alkaloid pathway

2,3 MS+ Philmus C. roseus induced root culture Vinca alkaloid pathway

1 EASI assays Lee-Parsons C. roseus induced seedlings Vinca alkaloid pathway

2,3 Transgenic constructs Lee-Parsons C. roseus induced root culture, seedlings Vinca alkaloid pathway

Prototype workflow and preliminary data funded by: *NIH R00 GM097188,

+OSU College of Pharmacy

Timeline: Work on Aim 1 will begin immediately and take place throughout the project time. Computational work in Aims 2.1 and 3.1 will commence approximately six months after funding when TSS-Seq datasets in C. roseus seedlings are complete, and will be completed in their initial modeling phases during the second year. Engineering in root culture (Aim 2) will take place immediately following the modeling phases in Aims 1.2, 1.3, and 2.1, and will be finished by the end of the fourth year. Aim 3.2 missing pathway candidates that are confirmed will be considered for incorporation in Aim 1 promoter modeling. In planta validations in all Aims will be completed and evaluated by the end of year 5.

C.2 Key preliminary studies supporting the modelling approach

C.2.1. Plant promoter architecture discovery modeling with TIPR identifies functional CREs

The Megraw laboratory has developed a technique for promoter architecture analysis that accurately identifies Transcription Start Site (TSS) locations and the unique sets of Transcription Factor (TF) binding sites known as CREs (~6-10 nucleotide Cis-Regulatory Elements) associated with these TSS locations, using high precision datasets generated by TSS-Seq. For instance, we accurately predicted the positions of highly expressed TSS locations with extremely high sensitivity and specificity (auROC=0.98) using a TSS-seq dataset generated from wildtype Arabidopsis roots10. In this study, genome-wide sequencing of 5’ TSSs (TSS-Seq data) was generated from wild type Arabidopsis roots. We then designed a high-resolution machine-learning model based on L1-regularized logistic regression to accurately predict positions of highly expressed TSS locations. Our model uses features (i.e. CREs and their locations relative to the TSS) that corresponded to these TSS locations. Our model showed that a large compendium of known CREs is necessary and sufficient for accurate promoter prediction in developing Arabidopsis roots. Our model can identify high-confidence collections of CREs that are likely to work together to upregulate their targets. This is a significant contribution to previous TF:gene interaction network estimates in Arabidopsis. We also developed TIPR (Transcription Initiation Pattern Recognizer)11, a sequence-based machine learning model which identifies not only the locations of TSSs, but also the expected peak type (for example ‘Narrow Peak’ vs ‘Broad Peak’) that each TSS tag cluster will form along the chromosome (Fig. 2 below). The TIPR pipeline ‘learns’ which TF binding site Regions of Enrichment (i.e., candidate CRE sites) are relevant to the TSS tag cluster at hand without having to observe a priori that a TSS tag cluster has a particular shape. For simplicity, we will generally refer to our modeling concept and pipeline throughout the proposal as TIPR. The TIPR modeling framework can be translated to TSS-seq data from other plant species, tissue types, or sets of conditions to identify relevant combinations of TF binding sites (different feature weights) associated with alkaloid biosynthesis in C. roseus.

As a first step in constructing the feature set associated with the TSS prediction, we “scan” the Positional Weight Matrix (PWM) representing each TF’s DNA binding profile over the sequence in the vicinity of each

Figure 3. EMSA using purified AtMyb20 protein (+) and CRE from ABI1 promoter (probe). Binding was abolished in the presence of an excess of competitor DNA probe.

TSS in the training set ( + or – 5kb from the TSS) using standard log-likelihood scoring with a di-nucleotide background model (Fig. 2, Left). At each nucleotide, we sum log-likelihood scores above zero to produce a summary signal across this region (Fig. 2, Right) for each TF. Strikingly, for dozens of TFs including a majority that have not previously been identified as core promoter elements in the literature, the TF’s summed positive log-likelihood signal exhibits a single unimodal enrichment in the vicinity of TSSs. In other words, we observed a preferred functional positioning of TF with respect to the TSS. When a TF’s CRE sites were enriched in a region of the promoter (i.e. Region of Enrichment or ROE), we then subdivided this defined region into sub-windows, and scored the presence/absence of TF sites by the cumulative positive log-likelihood signal for each TF in each window; this is defined as the feature set. Heavily weighted features in a successfully trained model (auROC=0.98) thus represent specific combinations of TFs and their TF-associated regions relative to a TSS location such that CRE sites in these regions are strongly predictive of expression at this location. We will use a similar process for generating a feature set and model to identify putatively functional CREs associated with condition-specific expression of alkaloid biosynthesis in C. roseus.

C.2.2 Reverse-engineering of direct TF regulators with TIPR yields strong CRE validation rate

Our computational approach for identifying functional CREs was supported through experimental TF protein:DNA binding assays; the computational approach followed by experimental validation is described as follows. First, we used the TIPR machine learning model10, 11 described above to accurately predict the presence of TSSs by analyzing CREs within 5 kilobases (kb) of observed transcription start site Narrow Peak modes. In this experiment, the TIPR model was trained and tested using our Arabidopsis whole root TSS-Seq dataset. Then, likelihood scores for individual putative CRE sites that contributed to each TSS prediction (i.e. each transcript detected in the Arabidopsis root) were calculated using their corresponding Positional Weight Matrix (PWM) and their position relative to the ROE for each element. The output of this pipeline was a genome-wide “master-list” of potentially functional CRE sites. These candidates were filtered by considering only strongly expressing Narrow-Peak TSSs (associated with regulatory genes) and other restrictions to generate a stringent short-list of 100 CRE sites (i.e. sites associated with the top 20% of CREs according to their TIPR model weight, then sites with highest likelihood scores located near the center of their ROE and within 120bp of their corresponding TSS). Finally, we arbitrarily selected “high-scoring” CRE candidates for experimental validation using the Electrophoretic Mobility Shift Assay (EMSA or “gel shift” assay); EMSA is a commonly used in vitro assay for probing TF protein:DNA binding interactions.

For example, we demonstrated the interaction of an Arabidopsis TF AtMYB20 with a DNA probe (25 bp) derived from the promoter of the ABA

Figure 2: The Transcription Initiation Pattern Recognizer (TIPR) extension of 3PEAT recognizes the TF binding site Regions of Enrichment associated with each TSS read cluster shape. (Left) TIPR scans sequences upstream of TSS for TF binding site sequences (i.e. CCAT/ATG). (Right) TIPR then recognizes patterns of TF sites in genome sequences that associate with specific Regions of Enrichment for Narrow Peak (shown), Broad with Peak, or Weak Peak TSS tag clusters.

-40 -35 0

INSENSITIVE 1 (ABI1) gene25. A gel-shift occurred in the presence of AtMYB20 (Fig. 3, center) demonstrating formation of a TF protein-DNA complex. Next, we tested our five predicted Arabidopsis CRE candidates using EMSA and nuclear extracts containing the TFs prepared from Arabidopsis roots. Four out of the five putative CREs caused distinct gel shifts. Next, we further improved our initial candidate filtration process and generated a formal scoring function that estimates the relative likelihood of a candidate’s functional binding given its contribution to an accurate TSS prediction. We again selected five top-scoring candidate CRE interactions for validation (BPC1, GT1, ARR10, CCA1, RAV1-A), and all five candidates gel-shifted. The results of these TF:DNA binding experiments (9/10 EMSA assays supporting predicted TF binding locations) demonstrate the strong potential of our modeling approach to identify functional CREs.

C.2.3 Extending TIPR to plant natural products discovery

Evaluation of current methods for promoter discovery. Genome mining for plant specialized metabolite biosynthetic genes poses a challenging open problem. A recently published tool called plantiSMASH26 uses a genome-cluster based discovery method. We applied this tool to identify a class of well-studied specialized metabolite biosynthetic genes, i.e. terpene synthases, in the model plant Arabidopsis with a very well-sequenced genome. Of the 52 literature-identified terpene synthase genes in Arabidopsis, only 17 terpene synthases were identified as biosynthetic genes using plantiSMASH with a standard genome release (TAIR10); this number increased to 19 when the latest genome annotation release (Araport 11) was used. This outcome was consistent with current estimates that ~ 25% of plant biosynthetic genes may be arranged in a clustered formation on the genome26, therefore extensions of the genomic cluster-based algorithms currently available are unlikely to identify the majority of genes in plant specialized metabolite pathways.

Preliminary application of TIPR to genome mining. We hypothesized that patterns of CREs in the promoters of biosynthetic genes form “regulatory signatures” that can be uniquely distinguished from other gene types regardless of location on the genome. We then applied the TIPR-based classifier for identifying these patterns of CREs. Using TSS-Seq data for wildtype Arabidopsis seedlings under normal conditions (when many plant specialized metabolites are expressed at detectable levels without explicit stress induction), 46 out of the 81 specialized metabolite biosynthetic genes (compiled from the literature) had high-quality TSS peaks in our Arabidopsis root or leaf datasets. This dataset of 81 biosynthetic genes was considered the ‘positive set’, and a negative set of 81 genes was selected at random from the rest of the genome. Datasets were then split (with random selection) into an 80% cross-validation set, and a 20% completely independent held-out test-set. The unaltered TIPR model pipeline was applied to generate features (from the promoter regions of each “TSS” location) and train a classifier. Even with this relatively crude set-up and imprecise TSS approximation for much of the dataset, 5-fold cross-validation yielded an average performance of 80% auROC and 68% auPRC, with performance holding at 80% auROC, 70% auPRC on the independent held-out test set. Heavily positively weighted model factors (putative direct regulators whose presence of associated CREs in a promoter are indicative of the specialized metabolite biosynthetic gene class) included many TFs involved in stress response, as well as light response, and circadian control of gene expression, as expected.

This preliminary study strongly supported our hypothesis that patterns of CREs in gene promoters do form useful regulatory signatures that show excellent promise for genome mining. When a comparison was performed using only TAIR10 annotated start sites for all genes, cross-validation performance dropped to 74% auROC and 58% auPRC, a substantial performance decrease. It is likely that just as we observed in TIPR’s published examination as a TSS prediction model, precise TSS data plays an important role in developing a very high-performance classifier (auROC, auPRC above 90%). This is because CREs are typically very short (~6-10 nt) elements, thus if the estimate of a heavily transcribed TSS in a sample is inaccurate by even 10-20nt on average, locational patterns of these elements with respect to the TSS will create a noisy feature set that significantly reduces classifier performance. We will apply TIPR to identify patterns of CREs associated with alkaloid biosynthesis in C. roseus and by obtaining a higher quality sequenced genome for C. roseus.

C.3 Research Plan

Aim 1: Construct datasets and a corresponding machine learning model to identify direct transcriptional regulators of key TIA biosynthetic genes in C. roseus seedlings.

Objective. We will first obtain Transcription Start Site Sequencing (TSS-Seq) data to pinpoint promoter locations for all genes expressed in the following C. roseus samples: roots and shoots of hormone-induced and control plants, for two medicinally relevant varieties. Second, we will construct a genome-scale TIPR machine learning model to predict patterns of CREs within each promoter that collectively upregulate gene expression in each sample. To identify relevant TFs, we will then reverse-engineer the promoters of the seven currently

Catharanthine Tabersonine

Vindoline

VinblastineVincristine

Strictosidine

Geranyl PP (GPP)

Secologanin

Tryptophan

Tryptamine

7 steps

~ 2 steps

9 steps 1 step

1 step

10 steps

Fig. 4. The C. roseus terpenoid indole alkaloid (TIA) biosynthetic pathway:

solid & dotted lines represent single & multiple enzymatic steps, respectively

known biosynthetic genes in the vindoline pathway to back out the patterns of CREs associated with their upregulation; the predicted TF:promoter regulatory interactions will be screened in vitro and then top candidates evaluated further in planta using C. roseus seedlings.

Rationale. Terpene indole alkaloids (TIAs) including vinblastine are produced at low levels in developing C. roseus plants under normal growth conditions27, and induced as part of the plant’s stress response to insect feeding28, 29. A central goal of this research is to provide a framework for increasing medicinally relevant specialized metabolite production. We will build our study around the vindoline branch of the TIA biosynthetic pathway, as it is critical to vinblastine and vincristine production, yet the TFs involved in directly regulating vindoline biosynthetic genes remain mostly unknown. Comparisons among plant tissue, treatment, and variety are keys to understanding regulation of TIA biosynthesis, particularly vindoline biosynthesis. For instance, understanding transcriptional regulation both in plant roots and shoots is essential, as vindoline is not produced in C. roseus roots, even upon pathway induction using standard laboratory phytohormone treatment with methyl jasmonate (meJA). Treatment with meJA induction differentially increases production of the upstream precursors catharanthine and tabersonine and to a lesser extent vindoline (in seedlings).

A recent preliminary study by the Megraw and Philmus labs23 showed that production of important precursors—including vindoline, catharanthine, and tabersonine—differs significantly between the sequenced C. roseus variety, SunStorm Apricot (SSA), and the predominantly laboratory-studied variety Little Bright Eyes (LBE). Additionally, our study showed that the SSA and LBE varieties respond differently to TIA pathway induction treatments including meJA, that these responses can vary dramatically between roots and shoots, and that transcriptional regulation is very likely to be playing a role. Studies in the Lee-Parsons lab show that there are extensive DNA sequence differences in the upstream sequences of vindoline pathway genes in LBE as compared to SSA30, 31, supporting transcriptional regulation as a driver of varietal differences in precursor expression under both control and treatment conditions.

Thus, in order to understand how to upregulate vindoline pathway biosynthetic genes, it will be critical to dissect regulatory differences between roots and shoots, meJA treatment and control conditions, and SSA and LBE varieties. Systematic promoter model construction in C. roseus, including comparison of models in each sample type, will provide a conceptual platform that is broadly applicable to other specialized metabolite pathways. TSS-Seq data generated for this project will be a directly useful contribution to the C. roseus research community by precisely characterizing promoter regions, thus providing information on CREs that recruit direct transcriptional regulators.

Preliminaries and Feasibility. Our team has extensive experience generating TSS-Seq data in plant tissues10, 15, 21, 22, and we have developed software tools for cleaning, mapping, and peak-identification for these data types. As discussed in C.1, the C. roseus genome has been initially sequenced19, and CGRB re-sequencing efforts will provide fully sufficient information for the proposed model construction. C. roseus has a diploid genome that facilitates mapping of high-throughput sequencing data and genomic analysis. The proposed modeling process uses known TF binding domain profiles (PWMs) which have often been characterized using transcription factors isolated in Arabidopsis. While this could be of concern, we have recently confirmed (unpublished data) that a well-performing TSS prediction model can be generated in rice using existing TF binding domain databases. This preliminary test strongly suggests that known binding domain profiles will be sufficient in C. roseus as well. Our team also has the requisite experience in natural products extraction and LCMS characterization of C. roseus tissues23 using previously published procedures32. The vinblastine alkaloid pathway (Fig. 4) is one of the best-characterized pathways in plant natural product production, with all intermediates known (33 intermediates) up to the last two steps converting α-vinblastine to vinblastine and vincristine. This lends this pathway well to the following set of research process components.

Proposed research.

1.1 TSS-Seq data acquisition shoots and roots

High quality re-sequencing of the SSA variety genome and sequencing of the LBE variety genome for C. roseus will be performed by the CGRB at OSU using PacBio HiFi reads, followed by genome assembly,

haplotig identification, repeat annotation, gene prediction, and gene annotation. This will provide the complete assemblies necessary for mapping of TSS-Seq reads.

TSS-Seq will be performed by the Megraw lab in eight samples from C. roseus seedlings: SSA roots and shoots under control conditions and treated with the TIA pathway induction hormone meJA (4 samples), and LBE roots and shoots under control conditions and treated with meJA (4 samples).

These data along with the modeling process in Aim 1.2 below will allow us to identify previously unknown TF regulators of the TIA biosynthesis pathway in each tissue, validate the process for known regulators from the literature including MYC233, ORCA234, ORCA335, BIS136 and BIS237, and use outcomes to engineer the TFs that bind the cis-regulatory element (CRE) sites as needed for Aim 2.2 (LBE root culture).

1.2 Model development

We will perform TSS-Seq analysis using our nanoCAGE-XL and CapFilter process15 to precisely identify TSS peak locations for all genes expressed in each C. roseus tissue sample (Aim 1.1). We will then use the same computational pipeline that we have tested using our Arabidopsis TSS-Seq dataset (Section C.2.1). The Transcription Initiation Pattern Recognizer machine learning model11 described in Section C.2.1 will predict interrelationships between CRE locations and their TSSs by analyzing 5 kilobases (kb) upstream (and 500nt downstream) of observed TSS peak modes. Then, likelihood scores for individual putative CRE sites that contributed to each TSS location will be calculated using their corresponding Positional Weight Matrix (PWM) and their position relative to the Region of Enrichment (ROE) for each element. This pipeline will provide a scored list of potentially functional CRE sites in gene promoter regions. We will use this process to identify CRE locations in each sample type, allowing us to compare outcomes in each variety (SSA and LBE) in both root and shoots under control conditions and under TIA pathway-induced conditions resulting from meJA treatment. Fig. 5 shows the modeling pipeline that we have in place for this process.

We will use the TIPR models in roots and shoots to predict direct regulators of the seven vindoline sub-pathway genes. We will initially focus on the SSA variety, as modeling work here can begin immediately after mapping TSS-Seq data from these samples while high-resolution genomes for both SSA and LBE varieties are completed. Vindoline biosynthesis gene upstream regions are well-covered in the current SSA sequence, therefore their modeling outcomes are not expected to be greatly affected by re-sequencing. These SSA variety models will provide an initial understanding of potential differences between CRE site usage across tissues, as well as important common TF regulators in vindoline biosynthesis. We will prepare the scored sets of CRE locations in promoters of the seven vindoline biosynthesis genes from TIPR modeling outcomes in each sample, in both varieties as data becomes available, so that the top TF:promoter interaction sites per gene can be screened in vitro and evaluated in planta in Aim 1.3 below in preparation for developing transgenic lines in Aim 2.

Figure 5. Schematic of TIPR modeling pipeline. The pipeline scans sequences upstream of TSS tag clusters for transcription factor binding site sequences, turns these into model features, trains a binary classifier model to recognize predictive patterns, and evaluates model performance.

We will use the TIPR modeling outcomes for meJA-treated LBE roots and shoots, where TFs from the ORCA and BIS gene families have each been reported in the literature to regulate specific genes from the TIA pathway33-37 upstream of the vindoline biosynthesis, to prepare a literature-curated set of TF:promoter interactions for in vitro and in planta testing in Aim 1.3 below. Top-scoring CRE locations (representing potential binding sites for these TFs) will be identified in the promoters of the literature-reported target genes. This will provide a helpful additional gauge of strongly-regulating CRE sites and their locations with respect to the TSSs of these genes, as the literature-reported interactions via laboratory assays do not test or report genomic binding sites (the LBE genome is currently un-sequenced).

In preparation for Aims 2 and 3, we will perform comparisons of TSS peak locations in treated vs control samples of both varieties, in order to understand whether meJA induction affects core promoter location, TSS peak shape, as well as level of gene expression for TIA pathway genes. We will also compare TSS locations in roots and shoots in each variety. This is important for subsequent transcriptional engineering steps, because it suggests whether the entire promoter sequence may be varying for certain genes by treatment and tissue (not just the embedded functional CRE site patterns). Similarly, we will compare ROE locations (highest binding site affinity regions for each TF) and highest TIPR model weights (most influential/predictive TFs and their binding locations) between treatment vs control samples and roots vs shoots samples within each variety, to gain an understanding of genome-wide differences in specific transcriptional control patterns of CREs across these sample types. This is the first time that TSS peak locations, TSS peak shape, and ROE have been evaluated under different tissue, treatment, and varieties for a given plant species.

1.3 In vitro and in planta validation

We will begin by evaluating predicted TF:promoter binding interactions in vitro using AlphaScreen high-throughput protein:DNA interaction assays, which are conceptually similar to EMSA assays (Section C.2.2) but can be performed using TF and DNA-oligo labeled beads in a 96-well plate on a plate-reader that detects a specific frequency of emitted light if the TF and DNA-oligo beads interact. We will use AlphaScreens to screen two sets of predicted interactions:

(1) The union of the 20 highest-scoring CRE sites in the promoters of each of the seven vindoline sub-pathway genes across all sampled conditions. At most 1120 interactions (2 varieties x 2 tissues x 2 treatments x 7 genes x 20 sites) or about 12 AlphaScreen plates (~1 month of screening time)—however, not all of these genes are expected to express strongly under all conditions, particularly un-induced conditions—this union most likely represents ~600-800 interactions.

(2) The 20 highest-scoring CRE sites in each promoter of the literature-curated TF:promoter interaction set prepared in Aim 1.1 using TIPR modeling and the induced LBE roots and shoots samples. At most ~280 interactions (2 tissues x 7 genes x 20 sites) or about 3 AlphaScreen plates.

These initial screens will aid in identifying any suspected modeling issues, for example a PWM not well-representing a particular TF’s binding domain, or needing to adjust the extent of the TSS-proximal sequences in which one should be including ROEs.

We will then narrow the candidate interaction set to screened TF:promoter interactions, and investigate up to 10 top-scoring CRE sites per vindoline pathway gene for gene upregulation in SSA and LBE seedlings using a transient expression assay24 (known as EASI, efficient Agrobacterium seedling infiltration) developed by the Lee-Parsons lab. The EASI assay uses Agrobacterium to infect and express the engineered vector of interest into the plant seedling. For TF:promoter interactions, we will transiently co-express two vectors, one encoding the TF and another encoding the promoter of the vindoline pathway gene in C. roseus seedlings driving the luciferase reporter gene using our EASI method. If the predicted TF interacts with CRE site in the promoter, the expression of the luciferase gene will increase and luminescence will significantly increase relative to the control. This assay will enable us to evaluate the interaction of specific TFs and the associated CREs in a given promoter in planta. We can validate the interaction with specific CREs by repeating the EASI assay with the TF and the promoter with the mutated CRE. The in vitro assay allows high-throughput screening of CREs determined from TIPR followed by in planta validation of specific TF:CRE interactions in C. roseus seedlings.

Potential problems and alternative strategies. We do not anticipate difficulty in TSS-Seq (or if necessary, OC-Seq) data generation or mapping due to extensive past experience with these protocols in a variety of plant tissues and genomes. If difficulty arises in C. roseus TIPR modeling, we will examine any substantial differences between past TIPR models in A. thaliana10, 11 and C. roseus. If we find that known TFs are not showing strong predictive binding site enrichments in C. roseus, we will revisit model feature construction. We

will determine whether inclusion of non-position-specific cis-regulatory element locations coupled with a more aggressive feature reduction method is necessary in C. roseus. Should we encounter any unexpected difficulties with AlphaScreens, we will scale back in vitro screening to the most confident CRE candidate sites and proceed with EMSA assays as in C.2.2.

Future research. This Aim constructs the machine learning model structure and pipeline that allows us to learn about the regulation of TIA biosynthesis across different tissues, treatments, and C. roseus varieties. It also builds the foundation for developing in planta TIA pathway engineering strategies in C. roseus, and provides a modeling process prototype for promoter engineering of other medicinal plants. We specifically apply this foundation for expressing a shoot-specific pathway in roots in Aim 2, and for discovering and identifying missing biosynthetic genes in Aim 3.

AIM 2: Extend our modeling framework to a tissue-specific expression setting in the C. roseus hairy root culture system

Objective. We will identify TF binding sites in the promoters of vindoline biosynthetic genes that are associated with its shoot-specific expression in C. roseus and use this understanding to upregulate vindoline biosynthesis in C. roseus root cultures, a synthetic biology and scalable production system.

Rationale. Vindoline is not produced in the roots while the direct precursor tabersonine is (Fig. 4). Therefore, the 7-step vindoline pathway represents a barrier to vinblastine and vincristine production in C. roseus hairy root cultures. Understanding the regulation of the vindoline pathway in shoots or in the LBE variety will enable us to engineer root cultures to produce vindoline from tabersonine, a root-abundant precursor. Transgenic root cultures are easier to develop and genetically engineer than plants, grow vigorously, and can be cultivated in a bioreactor under controlled conditions for consistent production of the desired product. If we are successful, we will see vindoline accumulate in a tissue that normally does not produce vindoline, breaking a major barrier to efficient and reproducible vinblastine and vincristine production in C. roseus hairy root culture.

Preliminaries and Feasibility. The Lee-Parsons Lab focuses on understanding and engineering the transcriptional regulation of the C. roseus TIA biosynthetic pathway. The discovery of TFs that regulate the TIA pathway is currently based on correlation between TF and TIA biosynthetic gene expression using RNA-seq and hierarchical clustering. The application of machine learning models from the Megraw Lab will provide a complementary, unbiased approach to accelerate the discovery of relevant TFs of the vindoline branch. The Lee-Parsons lab has developed the highly efficient EASI assay for rapidly assessing TF function and promoter-activity24, 38, which will allow the team to evaluate each TF’s role rapidly (< 2 weeks) before proceeding to develop a stable transgenic system with the top candidates. The Lee-Parsons lab has also optimized the genetic engineering method described in Aim 2.2 for manipulating TF expression in hairy root culture18, 39.

Proposed research.

2.1 Model application to engineering hairy root cultures

There appears to be a transcriptional bottleneck governed by TFs that prevents the synthesis of vindoline in roots, yet vindoline is expressed in plant shoots. TIPR model comparison in roots and shoots will reveal important differences in the TF-ROE regions in both tissues, identifying promoter features (CREs) that result in different levels of vindoline production in LBE shoots (high) vs SSA shoots (lower) vs roots (none). We will select up to 5 TFs for over-expression in hairy root cultures. The TFs will (i) bind to top-scoring CREs from Aim 1.3, (ii) bind CREs that appear in the promoters of multiple vindoline pathway genes, and (iii) express strongly in LBE shoots, moderately in SSA shoots, and not in roots.

The TFs will be encoded in DNA vectors under the control of an estrogen-inducible promoter system (previously used in the Lee-Parsons Lab)18, 31. This inducible system allows the expression level of the TFs to be controlled by the estrogen dosage to reflect biologically relevant levels (unlike a constitutive promoter system). Transgenic hairy root cultures will be developed by infecting C. roseus seedlings with Agrobacterium and selected using our optimized protocol (4 – 6 months)18. We will then select healthy hairy root lines, assess the expression of the desired TF at the optimized inducer concentration (5 uM estradiol)18, 31, measure the expression of the vidoline pathway genes using quantitative PCR (Lee-Parsons Lab), and profile their alkaloid content (Philmus Lab, Aim 2.2).

2.2 LCMS quantification of specialized metabolites from the TIA pathway

We will utilize high-performance liquid chromatography-high resolution mass spectrometry (LC-HRMS) using a Q-ToF instrument to identify and quantify the changes in accumulation of TIA pathway intermediates in both

engineered hairy root cultures as compared to control cultures (using GFP in place of the TF). We have chosen LC-HRMS as this method allows separation of compounds, high-resolution mass accuracy for identifying molecular formulas, and the ability to generate characteristic fragmentation patterns that definitively identify specialized metabolites. For example, catharanthine and tabersonine have the identical molecular formula (C21H24N2O2) but they separate out, and generate different fragmentation patterns which allow them to be distinguished with confidence. We have a suite of commercially available alkaloids (tabersonine, ajmalicine, vindoline, etc.) that are used to accurately quantitate the amount present in the plant material. For compounds that we do not have standards for, the data LC-HRMS provides accurate mass (and therefore molecular formula) and fragmentation data, which help determine the structure of the unknown compound. The unknown compound can then be targeted for isolation using standard natural product methodologies and the structure will be determined using 1D and 2D nuclear magnetic resonance.

Potential problems and alternative strategies. The TFs that bind to the CREs from Aim 1.3 are identified in literature-curated databases, and in some cases multiple TFs are able to bind a particular CRE. In this situation, TSS-Seq (or RNA-seq) can be used identify the most highly-expressed TF(s) in a sample as the best candidates for binding, and a ChIP (Chromatin ImmunoPrecipitation) experiment can be performed if further disambiguation is required. The most well-studied TFs can also be prioritized for experimental validation.

Future research. The study in Aim 2 will apply and build upon the machine learning pipeline and algorithms from Aim 1 toward engineering a shoot-specific pathway (vindoline biosynthesis) in C. roseus root cultures. Aim 2 will demonstrate the value and the foundational features highlighted through the machine learning approach to engineering the production of valuable TIA pathway precursors in hairy root culture, an industrially scalable production system for plant natural products.

AIM 3: Train machine learning algorithms to identify biosynthetic genes, using the TIA pathway in C. roseus as an experimental model system for validation

Objective. We will create machine learning model extensions of TIPR trained using C. roseus TSS-Seq data to predict genes involved in the TIA pathway that are currently unknown/missing. As part of this process, we propose to extend this method as a general computational platform for specialized metabolite gene classification in plants.

Rationale. As discussed in section C.2.3, physical gene cluster-based methods such as plantiSMASH26 are helpful for the estimated up to 25% of plant biosynthetic genes that may be clustered in this way in various genomes. Augmenting such a method with RNA-Seq data greatly boosts the performance of such methods because co-expression (or closely timed expression) of genes in a specialized metabolite pathway ensures that product intermediates are available for biosynthesis of the end product. This is also why transcriptomic screens have been so successful after a time-consuming manual curation process of reducing from thousands of gene candidates to tens that can be tested. However, if one can move from identifying co-expressed genes to identifying directly co-regulated genes, one can use the “promoter signatures” or patterns of co-regulators to distinguish genes in a pathway of interest from other genes. The TIPR machine learning model provides a strategy for doing exactly this.

Preliminaries and Feasibility. Just as in Aim 1, the main idea behind using TSS-Seq data and extending the TIPR method for this aim is that by mining the precise locational patterns of CREs within the promoter regions of genes, it is possible to determine the collection of TF binding sites that are indicative of transcription initiation at each highly transcribed TSS36,41. These TF binding sites suggest direct upstream regulators of their target genes, thereby forming pathway regulatory signatures. Here we propose to exploit this technique to extend the TIPR model base to develop a platform for identifying genes involved in natural products biosynthesis. The Megraw lab has successfully used this computational method along with TSS-Seq data in order to identify direct transcriptional regulators of genes involved in many settings including plant development36. We have also shown in preliminary experiments (C.2.3) that even with incomplete TSS-Seq data (but with the advantage of a very well-annotated genome and limiting to specialized metabolite-associated transcripts near to the annotated sites), one can train a TIPR-based classifier that distinguishes a small set of specialized metabolite pathway genes from randomly selected non-biosynthetic genes with ~80% auROC and ~70% auPRC. This is extremely encouraging, and suggests that with proper genomic data and model design, very well-performing models can be trained to recognize specialized metabolite pathways of interest. We will carry out this process in the following components.

Proposed research.

3.1 Build TIA pathway and general specialized-metabolite gene classification models

Just as in C.2.3 for Arabidopsis, from a comprehensive set of literature-curated specialized metabolite genes in C. roseus, we will associate genes to TSS peaks, forming a ‘positive’ training dataset of TSS locations that specifically identify strongly expressed specialized metabolite promoter regions. We will use SSA seedlings, where mapped TSS-Seq data will first become available. We will then build a contrasting ‘negative’ dataset from two component gene sets that the model will need to distinguish from specialized metabolite genes: (i) genes from the same families as the positive examples (e.g. terpene synthases involved in volatile compound production) that are not specialized metabolites, and (ii) a balanced random selection from gene families that are known not to participate in specialized metabolite synthesis (e.g. ribosomal genes, amino acid biosynthetic genes). We will use this training dataset to build a machine learning classifier, using 5-fold cross-validation as a standard training/testing process to ensure that a successful model is likely to generalize well to as-yet-unseen datasets. We will begin by examining performance of L1-regularized logistic regression, a highly interpretable model type with automatic feature selection that has performed well on TIPR modeling problems in the past by our group10, 11, 40. We will consider whether preliminary feature selection without L1-regularization is necessary in this case given the number of examples, and evaluate Support Vector Machine (SVM) classifiers if necessary (less straightforward to interpret but can sometimes achieve better performance on datasets containing highly dependent features). We will evaluate performance according to auROC and auPRC values (standard measures of whether a model can simultaneously achieve high sensitivity and high specificity, as well as high precision and high recall).

In summary, using the process above we will apply the resulting predictor to obtain a map of potential metabolite genes in the C. roseus genome. We will generate a TIA-pathway-specific sub-model by placing only known TIA specialized metabolite genes in the positive set (Fig. 4), and use the negative set that includes terpene synthases involved in volatile compound production. We will then apply a two-step classification process. We will first apply the general specialized-metabolite recognition model; this will generate a set of genes that participate in specialized metabolism. These predicted specialized metabolite genes will then be classified using the TIA-specific model, to distinguish TIA biosynthetic genes from other specialized metabolite pathways. We will use this set of predicted TIA biosynthetic genes for testing in Aim 3.2 below. Finally, we will apply the general C. roseus specialized metabolite model to Arabidopsis to evaluate how well it generalizes across genomes. This comparison will identify species-specific feature patterns as well as those patterns that generalize across models. Ultimately, this analysis will provide critical information toward the future goal of constructing a more universal model that generalizes well across plant genomes.

3.2 Use the model to find missing genes for conversion of α-vinblastine to vinblastine

We will first narrow the TIA-pathway-specific model generated in 3.1 to a vinblastine-specific model by including biosynthetic genes in the negative training set from branches of the TIA pathway that don’t lead to vinblastine production. We will then apply the vinblastine-specific model to obtain candidate gene set. Finally, we will evaluate whether gene knock-down leads to reduced metabolite production in C. roseus seedlings (Lee-Parsons and Philmus Labs). The Lee-Parsons Lab will develop viral vectors to transiently silence genes individually in C. roseus seedlings (using viral induced gene silencing, VIGS41). The alkaloid profile from the harvested leaves will be analyzed by the Philmus Lab, specifically for the presence and absence of metabolites downstream of the silenced gene. Genes that are shown to be involved in the vinca alkaloid biosynthetic pathway through gene silencing and chemical profiling will be then investigated through the use of in vitro biochemical characterization. Genes will be amplified from the previously constructed cDNA libraries and cloned into Escherichia coli heterologous expression vectors (e.g. pET series). The heterologously expressed proteins will be purified using standard protein purification techniques (e.g IMAC) and then analyzed in vitro.

Potential problems and alternative strategies. While we anticipate that DNA cis-regulatory sequence information is sufficient to distinguish specialized metabolite genes in a pathway, it is possible that additional features will need to be considered. For example, it is straightforward to expand the set of TIPR model features to include regulatory elements that may be contained further upstream or downstream of the TSS, such as those in introns. Thus, if it is suspected that additional genomic elements should be accounted for in pathway classification, this is readily incorporated into the current framework.

Future research. Once the missing biosynthetic genes have been identified in Aim 3, direct regulators of these genes can then be identified as in Aim 1 using the TIPR modeling process.

REFERENCES

1. Balunas, M.J. & Kinghorn, A.D. Drug discovery from medicinal plants. Life Sciences 78, 431-441 (2005).

2. Cooper, R. & Deakin, J.J. Botanical Miracles: Chemistry of Plants That Changed the World. (CRC Press, 2016).

3. Ventola, C.L. The drug shortage crisis in the United States: causes, impact, and management strategies. P T 36, 740-757 (2011).

4. Owen, C., Patron, N.J., Huang, A. & Osbourn, A. Harnessing plant metabolic diversity. Curr Opin Chem Biol 40, 24-30 (2017).

5. Industrial Biotechnology: Products and Processes. (Wiley, 2017).

6. Vazquez-Flota, F., De Luca, V., Carrillo-Pech, M., Canto-Flick, A. & de Lourdes Miranda-Ham, M. Vindoline biosynthesis is transcriptionally blocked in Catharanthus roseus cell suspension cultures. Mol Biotechnol 22, 1-8 (2002).

7. Nascimento, N.C. & Fett-Neto, A.G. Plant secondary metabolism and challenges in modifying its operation: an overview. Methods in molecular biology (Clifton, N.J.) 643, 1-13 (2010).

8. Memelink, J., Menke, F.L.H., Van Der Fits, L. & Kijne, J.W. in Metabolic Engineering of Plant Secondary Metabolism. (eds. R. Verpoorte & A.W. Alfermann) (Springeri, Dordrecht; 2000).

9. Megraw, M., Cumbie, J.S., Ivanchenko, M.G. & Filichkin, S.A. Small Genetic Circuits and MicroRNAs: Big Players in Polymerase II Transcriptional Control in Plants. Plant Cell 28, 286-303 (2016).

10. Morton, T. et al. Paired-end analysis of transcription start sites in Arabidopsis reveals plant-specific promoter signatures. Plant Cell 26, 2746-2760 (2014).

11. Morton, T., Wong, W.K. & Megraw, M. TIPR: transcription initiation pattern recognition on a genome scale. Bioinformatics 31, 3725-3732 (2015).

12. O'Connor, S.E. & Maresh, J.J. Chemistry and biology of monoterpene indole alkaloid biosynthesis. Natural Product Reports 23, 532-547 (2006).

13. Lau, W. & Sattely, E.S. Six enzymes from mayapple that complete the biosynthetic pathway to the etoposide aglycone. Science 349, 1224-1228 (2015).

14. Wurtzel, E.T. & Kutchan, T.M. Plant metabolism, the diverse chemistry set of the future. Science 353, 1232-1236 (2016).

15. Cumbie, J.S., Ivanchenko, M.G. & Megraw, M. NanoCAGE-XL and CapFilter: an approach to genome wide identification of high confidence transcription start sites. BMC Genomics 16, 597 (2015).

16. Srivastava, T., Das, S., Sopory, S.K. & Srivastava, P.S. A reliable protocol for transformation of Catharanthus roseus through Agrobacterium tumefaciens. Physiol Mol Biol Plants 15, 93-98 (2009).

17. Wang, Q. et al. Development of efficient Catharanthus roseus regeneration and transformation system using agrobacterium tumefaciens and hypocotyls as explants. BMC Biotechnol 12, 34 (2012).

18. Rizvi, N.F. et al. An efficient transformation method for estrogen-inducible transgene expression in Catharanthus roseus hairy roots. Plant Cell, Tissue and Organ Culture (PCTOC) 120, 475-487 (2015).

19. Kellner, F. et al. Genome-guided investigation of plant natural product biosynthesis. The Plant Journal 82, 680-692 (2015).

20. Franke, J. et al. Gene Discovery in Gelsemium Highlights Conserved Gene Clusters in Monoterpene Indole Alkaloid Biosynthesis. ChemBioChem 20, 83-87 (2019).

21. Filichkin, S.A. & Megraw, M. DNase I SIM: A Simplified In-Nucleus Method for DNase I Hypersensitive Site Sequencing. Methods in molecular biology (Clifton, N.J.) 1629, 141-154 (2017).

22. Cumbie, J.S., Filichkin, S.A. & Megraw, M. Improved DNase-seq protocol facilitates high resolution mapping of DNase I hypersensitive sites in roots in Arabidopsis thaliana. Plant Methods 11, 42 (2015).

23. Fraser, V., Philmus, B. & Megraw, M. Metabolomics Analysis Reveals Both Plant Variety and Choice of Hormone Treatment Modulate Vinca Alkaloid Production in Catharanthus Roseus. ChemRxiv, 10.26434/chemrxiv.11312828.v2 (2020). (Under Review in Plant Direct)

24. Mortensen, S. et al. EASI Transformation: An Efficient Transient Expression Method for Analyzing Gene Function in Catharanthus roseus Seedlings. Front Plant Sci 10, 755 (2019).

25. Xu, R. et al. Salt-induced transcription factor MYB74 is regulated by the RNA-directed DNA methylation pathway in Arabidopsis. Journal of Experimental Botany 66, 5997-6008 (2015).

26. Kautsar, S.A., Suarez Duran, H.G., Blin, K., Osbourn, A. & Medema, M.H. plantiSMASH: automated identification, annotation and expression analysis of plant biosynthetic gene clusters. Nucleic Acids Res (2017).

27. Aerts, R.J., Gisi, D., De Carolis, E., De Luca, V. & Baumann, T.W. Methyl jasmonate vapor increases the developmentally controlled synthesis of alkaloids in Catharanthus and Cinchona seedlings. The Plant Journal 5, 635-643 (1994).

28. Duge de Bernonville, T. et al. Folivory elicits a strong defense reaction in Catharanthus roseus: metabolomic and transcriptomic analyses reveal distinct local and systemic responses. Sci Rep 7, 40453 (2017).

29. Luijendijk, T.J., van der Meijden, E. & Verpoorte, R. Involvement of strictosidine as a defensive chemical inCatharanthus roseus. J Chem Ecol 22, 1355-1366 (1996).

30. Goklany, S., Rizvi, N.F., Loring, R.H., Cram, E.J. & Lee-Parsons, C.W. Jasmonate-dependent alkaloid biosynthesis in Catharanthus Roseus hairy root cultures is correlated with the relative expression of Orca and Zct transcription factors. Biotechnol Prog 29, 1367-1376 (2013).

31. Rizvi, N.F., Weaver, J.D., Cram, E.J. & Lee-Parsons, C.W. Silencing the Transcriptional Repressor, ZCT1, Illustrates the Tight Regulation of Terpenoid Indole Alkaloid Biosynthesis in Catharanthus roseus Hairy Roots. PLoS One 11, e0159712 (2016).

32. Liscombe, D.K., Usera, A.R. & O’Connor, S.E. Homolog of tocopherol C methyltransferases catalyzes N methylation in anticancer alkaloid biosynthesis. Proceedings of the National Academy of Sciences 107, 18793-18798 (2010).

33. Zhang, H. et al. The basic helix-loop-helix transcription factor CrMYC2 controls the jasmonate-responsive expression of the ORCA genes that regulate alkaloid biosynthesis in Catharanthus roseus. The Plant Journal 67, 61-71 (2011).

34. Li, C.Y. et al. The ORCA2 transcription factor plays a key role in regulation of the terpenoid indole alkaloid pathway. BMC Plant Biology 13, 155 (2013).

35. van der Fits, L. & Memelink, J. ORCA3, a jasmonate-responsive transcriptional regulator of plant primary and secondary metabolism. Science 289, 295-297 (2000).

36. Van Moerkercke, A. et al. The bHLH transcription factor BIS1 controls the iridoid branch of the monoterpenoid indole alkaloid pathway in Catharanthus roseus. Proceedings of the National Academy of Sciences 112, 8130 (2015).

37. Van Moerkercke, A. et al. The basic helix-loop-helix transcription factor BIS2 is essential for monoterpenoid indole alkaloid production in the medicinal plant Catharanthus roseus. The Plant Journal 88, 3-12 (2016).

38. Mortensen, S. et al. The regulation of ZCT1, a transcriptional repressor of monoterpenoid indole alkaloid biosynthetic genes in Catharanthus roseus. Plant Direct 3, e00193 (2019).

39. Grützner, R. et al. Addition of Multiple Introns to a Cas9 Gene Results in Dramatic Improvement in Efficiency for Generation of Gene Knockouts in Plants. bioRxiv, 2020.2004.2003.023036 (2020).

40. Megraw, M., Pereira, F., Jensen, S.T., Ohler, U. & Hatzigeorgiou, A.G. A transcription factor affinity-based code for mammalian transcription initiation. Genome Research 19, 644-656 (2009).

41. Liscombe, D.K. & O’Connor, S.E. A virus-induced gene silencing approach to understanding alkaloid metabolism in Catharanthus roseus. Phytochemistry 72, 1969-1977 (2011).