Sequence, organization, and genes characteristics of ORFs ...

75
American University in Cairo American University in Cairo AUC Knowledge Fountain AUC Knowledge Fountain Theses and Dissertations Student Research 2-1-2015 Sequence, organization, and genes characteristics of ORFs Sequence, organization, and genes characteristics of ORFs identified in a metagenomic DNA fragment from microbial identified in a metagenomic DNA fragment from microbial community of the deep brine environment of Atlantis II in the Red community of the deep brine environment of Atlantis II in the Red Sea. Sea. Dina Hassan Assal Follow this and additional works at: https://fount.aucegypt.edu/etds Recommended Citation Recommended Citation APA Citation Assal, D. (2015).Sequence, organization, and genes characteristics of ORFs identified in a metagenomic DNA fragment from microbial community of the deep brine environment of Atlantis II in the Red Sea. [Master's Thesis, the American University in Cairo]. AUC Knowledge Fountain. https://fount.aucegypt.edu/etds/1176 MLA Citation Assal, Dina Hassan. Sequence, organization, and genes characteristics of ORFs identified in a metagenomic DNA fragment from microbial community of the deep brine environment of Atlantis II in the Red Sea.. 2015. American University in Cairo, Master's Thesis. AUC Knowledge Fountain. https://fount.aucegypt.edu/etds/1176 This Master's Thesis is brought to you for free and open access by the Student Research at AUC Knowledge Fountain. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of AUC Knowledge Fountain. For more information, please contact [email protected].

Transcript of Sequence, organization, and genes characteristics of ORFs ...

Page 1: Sequence, organization, and genes characteristics of ORFs ...

American University in Cairo American University in Cairo

AUC Knowledge Fountain AUC Knowledge Fountain

Theses and Dissertations Student Research

2-1-2015

Sequence, organization, and genes characteristics of ORFs Sequence, organization, and genes characteristics of ORFs

identified in a metagenomic DNA fragment from microbial identified in a metagenomic DNA fragment from microbial

community of the deep brine environment of Atlantis II in the Red community of the deep brine environment of Atlantis II in the Red

Sea. Sea.

Dina Hassan Assal

Follow this and additional works at: https://fount.aucegypt.edu/etds

Recommended Citation Recommended Citation

APA Citation Assal, D. (2015).Sequence, organization, and genes characteristics of ORFs identified in a metagenomic DNA fragment from microbial community of the deep brine environment of Atlantis II in the Red Sea. [Master's Thesis, the American University in Cairo]. AUC Knowledge Fountain. https://fount.aucegypt.edu/etds/1176

MLA Citation Assal, Dina Hassan. Sequence, organization, and genes characteristics of ORFs identified in a metagenomic DNA fragment from microbial community of the deep brine environment of Atlantis II in the Red Sea.. 2015. American University in Cairo, Master's Thesis. AUC Knowledge Fountain. https://fount.aucegypt.edu/etds/1176

This Master's Thesis is brought to you for free and open access by the Student Research at AUC Knowledge Fountain. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of AUC Knowledge Fountain. For more information, please contact [email protected].

Page 2: Sequence, organization, and genes characteristics of ORFs ...

i

School of Sciences and Engineering

Sequence, organization, and genes characteristics of ORFs

identified in a metagenomic DNA fragment from microbial

community of the deep brine environment of Atlantis II in

the Red Sea

A Thesis Submitted to the Biotechnology Program

In partial fulfilment of the requirements for

the degree of Master’s of Science

By:

Dina Hassan Assal

Under the supervision of

Dr. Hamza El-Dorry

January/2015

Page 3: Sequence, organization, and genes characteristics of ORFs ...

ii

The American University in Cairo

Sequence, organization, and genes characteristics of ORFs identified in a

metagenomic DNA fragment from microbial community of the deep brine

environment of Atlantis II in the Red Sea

A Thesis Submitted by

Dina Hassan Assal

To the Biotechnology Graduate Program

Month/ Year

In partial fulfilment of the requirements for

The degree of Master of Science

Has been approved by

Thesis Committee Supervisor/Chair _______________________________________________

Affiliation ____________________________________________________________________

Thesis Committee Reader/Examiner ______________________________________________

Affiliation ____________________________________________________________________

Thesis Committee Reader/Examiner ______________________________________________

Affiliation ____________________________________________________________________

Thesis Committee Reader/External Examiner _______________________________________

Affiliation ____________________________________________________________________

____________________ _____________ ___________________ _______________

Dept. Chair/Director Date Dean Date

Page 4: Sequence, organization, and genes characteristics of ORFs ...

iii

DEDICATION

I would like to dedicate this thesis to my dearest family: parents, siblings and

their spouses and youngsters who were always the source of encouragement and

love throughout my life.

Also I want to dedicate my work to my friends and colleagues inside and

outside the AUC who supported me throughout this journey to earn my MSc.

degree.

Finally, for everyone who is passionate about science and research.

Page 5: Sequence, organization, and genes characteristics of ORFs ...

iv

ACKNOWLEDGEMENT

First, I would like to express my sincere gratitude to my advisor Prof. Dr.Hamza El-Dorry for

the continuous support of my MSc. study, research and thesis preparation. Also I would like

to acknowledge Dr.Mohamed Ghazy for his extensive guidance that helped me in all the time

of research and writing of this thesis.

Besides my advisors, I would like to thank the rest of AUC Academics: Dr. Rania Siam,

Dr.Ahmed Mustafa, Dr.Asma Amleh and Dr. Ahmed el Sayed who provided me with

continuous support during studying the MSc. graduate courses.

My sincere thanks also goes to Mr.Amged Ouf for his continuous help in the lab, especially

during ABI sequencing process and Mr.Mustafa Adel and Mr.Osama el-Said who helped me

throughout computational data analysis.

Also I want to extend my gratitude to Ms. Mariam Azmy for her continuous help in the lab

and during thesis preparation as well as Ms. Dalia Hamed and Ms. Nahla Hussein for their

continuous support during thesis preparation.

I thank my fellow labmates at AUC STRC lab: to all my friends and collegues in the STRC

lab exclusively : Ayman Yehia, Mohamed Maged, Mahera Mohamed, Sara Sonbol, Noha

Nagdy , Eman Rabie, Sara Al-Alawi, Hadeel EL-Bardisy for everyone who supported me in

the lab and throughout my journey at AUC.

I would also like to thank KAUST for providing us with the funds necessary to do this

research work.

Last but not the least; I would like to thank my family: my parents, my siblings and their

spouses who supported me and encouraged me throughout my life

Page 6: Sequence, organization, and genes characteristics of ORFs ...

v

Table of Contents

....................................................................................................................................................................................................i

School of Sciences and Engineering ............................................................................................................................i

LIST OF FIGURES ............................................................................................................................................................... vii

LIST OF TABLES ............................................................................................................................................................... viii

LIST OF ABBREVIATIONS ............................................................................................................................................... 10

LIST OF BIOINFORMATICS TOOLS ............................................................................................................................... 12

Chapter 1: Literature Review ................................................................................................................................... 13

1. Background ............................................................................................................................................................ 13

2. Atlantis II brine pool in the Red Sea ............................................................................................................. 14

3. The advent of Metagenomics .......................................................................................................................... 16

4. Construction of metagenomic libraries ...................................................................................................... 18

5. Targeted Metagenomics .................................................................................................................................... 18

5.1 Functional screening approach ............................................................................................................ 18

5.2 Sequence based Approach ...................................................................................................................... 19

6. Sequencing technology ...................................................................................................................................... 20

7. Data Analysis ......................................................................................................................................................... 22

8. Metagenomics’ successes in exploring microbial diversity, metabolism and industrial

potential of extreme ecosystems ............................................................................................................................ 23

8.1 Assessment of microbial and metabolic diversity ........................................................................ 23

8.2 Industrial Potential of Extreme Ecosystems ................................................................................... 26

8.3 Therapeutics and Health care applications ..................................................................................... 28

9. Thesis Objective .................................................................................................................................................... 30

Chapter 2: Materials and Methods ......................................................................................................................... 31

1. Sample collection ................................................................................................................................................. 31

2. DNA extraction and Fosmid Library Construction................................................................................. 31

3. Pyrosequencing .................................................................................................................................................... 32

4. Data Assembly and Annotation ...................................................................................................................... 32

5.Subcloning of shotgun DNA fragments of the recombinant 99B6 fosmid in bacterial vector….32

6. DNA Sequencing using dideoxy chain termination technique (Sanger Method) ...................... 33

7. Assembly of the sequenced DNA fragments ............................................................................................. 33

8. ORFs identification and gene annotation ................................................................................................... 33

9. Other descriptive features for identified ORFs ........................................................................................ 34

Page 7: Sequence, organization, and genes characteristics of ORFs ...

vi

Chapter 3: Results and Discussion ......................................................................................................................... 35

1. Analysis of the ATII-LCL library’s Fosmids by restriction endonuclease ................................ 35

2. Establishing a library for fosmid clone 99B6 in pGEM plasmid vector .................................... 36

3. Sequencing of 99B6 pGEM library clones ............................................................................................. 37

4. Assembly of the sequenced reads ............................................................................................................ 38

5. Open reading frames search for potential genes and annotation of the identified ORFs . 42

6. Structural characteristics of the annotated proteins ....................................................................... 45

7. Detailed analysis of 10 selected ORFs identified by Interproscan software ............................... 48

8. Identified proteins with potential biotechnological applications.................................................... 60

Chapter 3: Concluding Remarks .............................................................................................................................. 64

Chapter 4: References .................................................................................................................................................. 65

Page 8: Sequence, organization, and genes characteristics of ORFs ...

vii

LIST OF FIGURES

Figure 1. Location of Study Area, Atlantis II brine pool. ........................................................ 15

Figure 2. Summary of experimental metagenomic processes. ................................................ 17

Figure 3. Restriction analysis of six randomly selected recombinant fosmids. ....................... 35

Figure 4. Isolation of 1.5 to 2 kb DNA fragments of nebulized 99B6 fosmid clone. .............. 36

Figure 5. Analysis of the quality and integrity of clones from 99B6 pGEM library. .............. 37

Figure 6. Assigned base quality score value for sequenced reads.. ......................................... 38

Figure 7. Assembly of the sequenced reads. ............................................................................ 39

Figure 8. EcoRI restriction analysis of recombinant 99B6 fosmid clone. ............................... 40

Figure 9. Nucleotide sequence composition of un-sequence DNA fragment and sequence end

of assembled fosmid 99B6. ...................................................................................................... 41

Figure 10. Protein sequence analysis for identified ORFs using different protein annotation

tools. ......................................................................................................................................... 42

Figure 11. Prediction of putative secretory signals and THMMs domains in annotated ORFs.

.................................................................................................................................................. 47

Figure 12. Schematic representations of potential full-length ORFs in silico annotated using

Interproscan protein analysis software..................................................................................... 48

Figure 13. The prediction of transmembrane domains for the putative membrane associated

protease (ORF#34) was detected using the THMM tool. ........................................................ 58

Page 9: Sequence, organization, and genes characteristics of ORFs ...

viii

LIST OF TABLES

Table 1. Schematic diagram for the location and annotation of identified ORFs on contig9. ................ 43

Table 2. Protein sequence similarity search using blastx tool against nr protein database. ..................... 44

Table 3. Determination of halophilic potential for identified ORFs.. ............................................................. 46

Table 4. ORFs identified in contig9 (99B6) with potential industrial or medical applications. ............. 60

Page 10: Sequence, organization, and genes characteristics of ORFs ...

ix

Abstract

Microbial communities that reside in different natural habitats, particularly those of extreme

environments, constitute a rich source for novel industrial enzymes and bioactive compounds.

Until the advent of metagenomics technique, extreme environments represented a locked area

with huge genetic repertoire that remained unexplored. The Atlantis II brine pool of the Red

Sea (ATII) is one of such unexplored extreme environment. The lower part of this pool, the

lower convective layer (LCL), has a pH of 5.3, high temperature (68°C), elevated

concentration of toxic heavy metals, and extreme salinity (26% salt). To understand the

metabolic and the physiological properties of proteins and enzymes that contribute to the

survival of microorganisms in this extreme and hostile environment, the structure and

characteristics of their genes should be determined. Metagenomics approach helped in this

task through two different techniques: 1) mass sequencing of environmental DNA by high

throughput sequencing technique such as pyrosequencing technique; 2) sequencing of

environmental DNA fragments from metagenomic fosmid library.

The advantage of the first approach is that it produces massive number of reads that can be

assembled into long contigs. Its disadvantage is that the majority of the contigs are chimeric

i.e. assembled from reads belong to genomes of different microbial species. The second

technique has an advantage of establishing the sequence of a contiguous piece of genomic

DNA–of around 30 to 40 kb–that most probably is not a chimeric. The major disadvantages

however are the high cost of the sequencing process, it involves elaborate steps, and it has a

limited output of nucleotide sequences.

In this work we sequenced a contiguous fragment of DNA from the microbial community of the

ATII-LCL environment and presenting the structural and potential function of its annotated

genes. Interestingly, out of the 39 identified ORFs, 10 ORFs (25%) have no matches in the

database. The structure and the function of the potential annotated genes are presented and

discussed. In addition, we were able to assembled 28.378 kb out of 33.819 kb of the insert in

the recombinant fosmid. The unassembled 5.441 kb is most probably due to the detection of

characteristic patterns of low complexity regions, simple repeats as well as gene duplication

exists at the end of the assembled sequence.

Page 11: Sequence, organization, and genes characteristics of ORFs ...

LIST OF ABBREVIATIONS

Abbreviation Description

AT II Deep Atlantis II deep

BAC Bacterial Artificial Chromosome

BLAST Basic Local Alignment Search Tool

CDD Conserved domain database

COG Clusters of Orthologous Groups

DNA Deoxyribonucleic acid

GOS Global Ocean Sampling

HGT Horizontal Gene Transfer

HMM Hidden Markov Model

IPTG Isopropy-beta-D-thiogalactoside

KAUST King Abdullah University of Science and Technology

KEGG Kyoto Encyclopedia of Genes and Genomes

LCL Lower Convective Layer

LB Luria-Bertani

NCBI National Centre for Biotechnology Information

NGS Next generation sequencing

ORF Open Reading Frame

PCR Polymerase Chain Reaction

PSORT Protein Subcellular Localization Prediction Tool

PFAM Protein Family

Page 12: Sequence, organization, and genes characteristics of ORFs ...

11

Psu Practical Salinity Unit

rRNA Ribosomal Ribonucleic Acid

SSU Small Subunit

SB Subtilisin

THMM Transmembrane Hidden Markov Model

WGS Whole Genome Shotgun Sequencing

Web MGA Web Metagenomic Annotator

X-Gal 5-bromo-4-chloro-indolyl-β-D-galactopyranoside

Page 13: Sequence, organization, and genes characteristics of ORFs ...

12

LIST OF BIOINFORMATICS TOOLS

Bioinformatics tool Source

Artemis https://www.sanger.ac.uk/resources/software/artemis/

BLAST www.ncbi.nlm.nih.gov/BLAST/

BPROM www.softberry.com/berry.phtml?topic=bprom

CLC genomics Workbench http://www.clcbio.com/products/clc-genomics-

workbench

Interproscan www.ebi.ac.uk/Tools/pfa/iprscan/

SIGNALP www.cbs.dtu.dk/services/SignalP

Phred/Phrap/consed www.phrap.org/phredphrapconsed.html

PFAM pfam.sanger.ac.uk/

PRED-TAT www.compgen.org/tools/PRED-TAT

PSORT psort.hgc.jp

THMM www.cbs.dtu.dk/services/TMHMM

WebMGA http://weizhong-lab.ucsd.edu/metagenomic-

analysis/server

Page 14: Sequence, organization, and genes characteristics of ORFs ...

13

Chapter 1: Literature Review

1. Background

The total number of prokaryotic cells in the different habitats on Earth is estimated to be

around 4–6 × 1030. They are pivotal as primary producers of essential nutrients and also in

salvage and recycling of organic compounds (Whitman, Coleman, & Wiebe, 1998; Wooley,

Godzik, & Friedberg, 2010).The majority of these microorganisms are not studied yet and

most probably they possess unknown and diverse metabolic and also novel biocatalysts and

biomolecules (Schirmer, Gadkari, Reeves, Ibrahim, DeLong, et al., 2005).

Their significance is not only recognized as being the primary producers of nutrients and

recyclers of organic matter to different carbon sources , but also as a repertoire with relatively

unexplored functional and metabolic diversity (Simon & Daniel, 2011). Currently, many

microbial derived compounds are considered potentially valuable for multiple industrial

disciplines, for example: biocatalysts , antibiotics , anticancer compounds and biofuels are

on continuous demand by different biotechnological and pharmaceutical sectors (Zhang &

Kim, 2010). However, the potential for 99% of these microbes are remained untapped by

traditional genomic techniques that requires prior isolation and culturing of the organism for

microbial genome sequencing and analysis ( Kennedy, Marchesi, & Dobson, 2008; Schloss &

Handelsman, 2003).

Page 15: Sequence, organization, and genes characteristics of ORFs ...

14

2. Atlantis II brine pool in the Red Sea

The Red Sea is one of the most distinctive marine environments among others, yet it is

considered to be the least studied and explored. The Red Sea was formed as a result of the

separation of the Arabian and Asian plates 3-4 million years ago. Its water is considered to be

extremely unique due to its high temperature and salinity, which are caused by the increased

evaporation rate and low rainfall rate (Bougouffa et al., 2013; Gurvich G. E., 2006). The Red

Sea contains at least 25 deep brine pools scattered along its central axial rift (Siam et al.,

2012). These brine pools were formed as a result of water perfusion into evaporitic deposits

in the shallow layers of the sea floor, thus making water in brine pools denser than seawater.

In some of the brine pools, the water is heated by tectonic movement underneath and finally

returns to the brine pools by the current flow effect(Noguchi, Taniguchi, & Itoh, 2008).

The largest hyper saline basin, the Atlantis II deep (ATII), is considered one of the intriguing

environments located in the central region of the Red Sea (latitude 21°23′ N and longitude

38°04′ E) (Figure 1) at a depth of 2194 m below the sea surface. The ATII brine pool is

further stratified into several distinct layers: brine interface, upper convective layer,

intermediate and lower convective layer. The layers are characterized by the gradual rise in

both salinity and temperature towards deep sediments. Hence, this brine basin is uniquely

known to have the highest temperature, be most dynamic, and have the largest “ore –

forming” body (Antunes, Ngugi, & Stingl, 2011; Mohamed et al., 2013). The lowest part of

the brine pool, the lower convective layer (LCL), has a pH value of 5.3, a temperature of 68

OC and salinity of 25.7% which is 7.5 times more than the salinity of the surface of normal

seawater (Antunes et al., 2011).

The Atlantis II deep LCL (ATII-LCL) is an anoxic environment, where dissolved N2 gas and

methane CH4 are available in high levels with traces of CO2, ethane and H2S. The richness of

the ATII-LCL sediment with metalliferous sediments, high levels of iron, copper and other

toxic heavy metals, as well as its unique geochemical parameters, have made it an attractive

site for many geological and geophysical studies (Antunes et al., 2011; Bougouffa et al.,

2013). However, little interest was shown from microbiologists in this environment mainly

because in the late 1960s it was believed that sustainability of any form of life is not possible

under these extreme conditions and thus Atlantis II deep brine was considered to be sterile

(Antunes et al., 2011; M. Schmidt et al., 2003). Lately, other findings supported the presence

of distinctive microbial consortia that can thrive in these exotic environments. The

Page 16: Sequence, organization, and genes characteristics of ORFs ...

15

extremophilic microorganisms that thrive in ATII-LCL environment should therefore possess

an extraordinary biochemical and molecular machinery that is need to survive in such

extreme setting.

Figure 1. Location of Study Area, Atlantis II brine pool.

The figure represents the location of Atlantis II brine pool in the Red Sea; the study area is

located at latitude 21°23′ N and longitude 38°04′ E.Picture adapted from

http://www.emr.rwthaachen.de/aw/EMR/Energy_Mineral_Resources/Zielgruppen/iml/publik

ationen/~wcw/A_Theoretical_Analysis_of_Hydrothermal_F.

Page 17: Sequence, organization, and genes characteristics of ORFs ...

16

3. The advent of metagenomics

Metagenomics represents an alternative to traditional genomics approaches. In this approach

sequencing of total microbial genomes isolated from natural habitat (environmental DNA)

enables us to explore the biochemical and the physiological process that support growth of

microbes in its natural environment. This technique permits to develop an overall view of

these processes – biochemical and physiological – without prior need for cultivation.

Accordingly, this would presumably accelerate the rate of novel gene discovery from

different microbial ecosystems (D.-G. Lee et al., 2007; Uchiyama & Miyazaki, 2009; Verma,

Kawarabayasi, Miyazaki, & Satyanarayana, 2013).

The use of metagenomics was first proposed by Pace on 1985 and his colleagues to gain

insight into the metabolic and physiological properties of uncultured microbial communities

in different environments (Handelsman, 2004; Jeon, Kim, Kang, Lee, & Kim, 2009). Later,

many metagenomic projects were conducted such as the Global Ocean Sampling and the acid

mine drainage project (Rusch et al., 2007; Tyson et al., 2004). Their primary goal was to

investigate questions associated with organisms (who is there) and their metabolic and

physiological processes in different complex environments (what are they doing) (Eisen,

2007). As summarized in Figure 2, several methods were taken to tackle these questions in

any metagenomic project. In general, the workflow for most metagenomic approaches would

either go through constructing, screening and sequencing of metagenomic clones of interest,

or direct sequencing of environmental DNA that was facilitated by advances made in next

generation sequencing approaches (NGS).

Page 18: Sequence, organization, and genes characteristics of ORFs ...

17

Figure 2. Summary of experimental metagenomic processes.

In brief, the experimental design for any metagenomic process begins by the isolation of

microbial communities from environmental habitats under investigation, followed by the

extraction of total microbial DNA samples. Depending on the objective of the metagenomic

projects and questions that being addressed, the following approaches can be used:

(1) Targeted metagenomics: the extracted DNA is randomly sheared and cloned into

vectors to construct metagenomic libraries and clones are then subjected to either

sequence based, or functional based screening analysis.

(2) Phylogenetic analysis: assesses microbial diversity of environmental niche using set

of primers for conserved genes as 16S rRNA and Rad genes.

(3) Random sequencing of environmental DNA to gain insight on the metabolic and

physiological processes of microbial community:

a. Sanger sequencing of metagenomic clones

b. NGS based metagenomic analysis.

(4) Whole (Meta) genome assembly (WGA) of microbial genomes for sequences

obtained from Sanger or NGS sequencing processes.

Sampling of microbial communities from different enviromental niches

For example : Soil , Sea , Air and Insect's gut

Targeted metagenomics

Phylogentic analysis 16S

rRNA

Direct Sequencing

Whole (Meta) genome

assembly

DNA extraction

Page 19: Sequence, organization, and genes characteristics of ORFs ...

18

4. Construction of metagenomic libraries

In general, the construction of metagenomic libraries and screening processes begins by

microbial collection from the environmental niche of interest, such as water, soil, air, and

human gut (Culligan, Sleator, Marchesi, & Hill, 2013). Environmental DNA is then isolated

from the microbial samples and metagenomic library are constructed in a suitable vectors.

Nowadays, many vectors are available for use with capacity to accommodate different sizes

of inserts, ranging from smaller size plasmids (accommodate less than 15 kbp) to, moderate

size fosmids (accommodate 40 kbp) and bacterial artificial chromosomes (BAC) (>40 kbp).

The recombinant vectors are then transformed into appropriate surrogate hosts, usually E.coli

or others host cell that are appropriate for heterologous gene expression. Once a

metagenomic library is created, successive experimental methods and metagenomic

approaches are used to explore functional analysis of specific gene of interest, taxonomic

analysis of the microbial community, or structural studies of the microbial genomes (Daniel,

2005; Rusch et al., 2007; Simon & Daniel, 2011).

5. Targeted metagenomics

5.1 Functional screening approach

The functional based analysis relies on the detection of active clones that express

characteristic traits of interest in the surrogate host. This approach enables the discovery of

novel genes without prior need for sequence information and guarantees the retrieval of both

the sequence and activity of isolated genes (Li, McCorkle, Monchy, Taghavi, & van der

Lelie, 2009). Several microbial enzymes and biomolecules were successfully identified and

isolated using functional screens of metagenomic clones (Jeon, Kim, Kang, et al., 2009; J.

Kennedy et al., 2008; Sayed et al., 2014) . The limitations of this approach are mainly

represented by problems of gene expression in the surrogate host due to gene toxicity,

inappropriate selection of the host strain required for heterologous gene expression, or that

the promoter of the metagenome gene is not functioning in the host cell (Felczykowska,

Bloch, Nejman-Faleńczyk, & Barańska, 2012).The frequency of active clones identified from

massive screening of metagenomic libraries is considerably low, it’s not always possible to

scale up the functional assay of interest on large libraries and the presence of all genetic

elements required for proper gene expression is mandatory (Schloss & Handelsman, 2003;

Sharma, Khan, & Qazshari, 2010).

Page 20: Sequence, organization, and genes characteristics of ORFs ...

19

Several attempts were made to address these limitations such as the use of different strains

that are either adaptable to extreme conditions or genetically modified strains as those

commonly used E. coli strains shuttle vector systems to extend their capabilities for gene

expression and overcome gene toxicity problems ( Kennedy et al., 2011; Li et al., 2009;

Lorenz & Eck, 2005; J. Singh et al., 2009).

5.2 Sequence based approach

As for the sequence based approach, either a set of PCR primers or hybridization probes are

used to identify novel genes, which represent a new variant of already known genes or

protein families (Culligan et al., 2013), or using the phylogenetic based analysis where

markers of different phylogenetic anchors as 16S ribosomal RNA or rec A sequences link

the discovery of novel genes to phylogenetic analysis (Handelsman, 2004; Schloss &

Handelsman, 2003; J. Singh et al., 2009). The former approach showed to be successful in

identifying new homologs of enzymes and proteins with wide applications in various

industries such as: oxidoreductases, chitinases (Simon & Daniel, 2011), mercuric reductases

(Sayed et al., 2014), and amylases (Sharma et al., 2010). In addition, the isolation of several

polyketide synthases and antibiotics were also reported from different microbial niches

(Banik & Brady, 2010; Kerkhof & Goodman, 2009; Schirmer, Gadkari, Reeves, Ibrahim,

Delong, et al., 2005).

Regarding phylogenetic marker based sequence analysis, this strategy was frequently used

for the discovery of new genes from the sequencing of clones harboring phylogenetic

anchors. One of the famous findings was the discovery of the photorodopsin gene which was

affiliated to gama-proteobacteria after it was presumed to be only affiliated to archeal

lineages (Handelsman, 2004). This enables us to have an insight into ecology and taxonomy

of other bacteria in complex environments that have never been cultured (Culligan et al.,

2013; Daniel, 2005; Kerkhof & Goodman, 2009). Alternatively, the random sequencing of

the metagenomic clones strategy was conducted by large metagenomic sequencing projects

such as the Sargasso Sea and acid mine drainage projects, as well as many others that

revealed huge inference about microbial genome assemblage, horizontal gene transfer and

linkage of traits in addition to other insights that were gained from the studies of uncultured

bacterial communities (Handelsman, 2004).

Page 21: Sequence, organization, and genes characteristics of ORFs ...

20

6. Sequencing technology While the arrival of next generation sequencing technologies have revolutionized the field of

metagenomics (Dark, 2013; Teeling & Glockner, 2012), the Sanger sequencing was the first

technology applied for metagenome sequencing (Hoff, 2009). The Sanger method is based on

a chain termination reaction that involves the termination of DNA synthesis using a

chemically modified 2, 3-dideooxynucleotides, leading to DNA fragments of varying lengths

separated on electrophoresis gel where DNA sequence of the template is determined.

Successive advances were made in sequencing technologies using this approach over time as

the release of automated Sanger platforms introduced some modifications to old principles by

using fluorescent tagged ddNTPs. The fluorophore of ddNTPs are then excited by the laser

beam of the sequencer, resulting in the emission of four different colors. The base calling step

takes place based on tracking fluorescence signals of the four different emitted colors and

quality assessment algothirms are deployed to assess correct base calling process of DNA

sequencing (Metzker, 2005; Shendure & Ji, 2008).

The long read length (700 bp or more) obtained, high sequencing accuracy with decreased

error range (0.001-1%), and the ability to sequence clones with large insert size (fosmids and

BACS) are among the advantages that made Sanger sequencing applicable for whole genome

sequencing of low diversity microbial niches. The inherent requirement for metagenomic

propagation by vectors into bacterial hosts will probably lead to bias against genes that are

toxic for the surrogate hosts, in addition to its labor demanding and exhaustive nature. It is

also worth noting that the overall cost of sequence data per Gbp is about USD 400,000

(Thomas, Gilbert, & Meyer, 2012) .

The paradigm shift from classical Sanger sequencing to next generation sequencing

technologies is triggered by the need to lower sequencing costs and generate high throughput

of sequence data during any metagenomic study (Danhorn, Young, & DeLong, 2012).

The 454/ Roche and the Illumina/ Solexa systems are widely used NGS technologies for

DNA sequencing to date (Thomas et al., 2012). For the 454/Roche technology, emulsion

polymerase chain reaction (ePCR) is applied to “clonally amplified” random DNA fragments

of metagenome, where the single stranded DNAs are attached to microscopic beads, then

deposited into a picotitre plate for further parallel and individual pyrosequencing (Sulsultana

& Neelakanta, 2013).

Page 22: Sequence, organization, and genes characteristics of ORFs ...

21

The pyrosequencing process utilizes a “sequencing by synthesis approach” that involves the

successive addition of all deoxynucleotides and the incorporation is monitored by conversion

of the released pyrophosphate (PPi) into emitted light which is further detected by a CDD

camera and converted to sequence data (Dark, 2013).

As for any emerging NGS technologies, there are a couple of limitations associated with 454

pyrosequencing especially when applied to metagenomics. First, the emulsion based PCR

method creates artificial replicates of sequencing data that would interfere with estimation of

gene abundance; nonetheless those artificial replicates can be detected and filtered out with

bioinformatics programs. Secondly, high sequencing error rates are reported as a result of

sequencing errors associated with existing homopolymers that might cause deletion or

insertion in predicted open reading frames (ORF). However, this could also be tackled by

ORF prediction tools to correct these frame shifts. In spite of these disadvantages, the low

sequencing costs produced by 454 pyrosequencing (20,000 per Gbp), massive sequencing

capacity of ~ 500 Mb per run (multiplexing) along with less man power needed, yet have

made it the technology of choice for many metagenomic and other applications (Kim et al.,

2013; Lin Liu et al., 2012; Thomas et al., 2012).

Also, one of the major shortcomings associated with other NGS platforms is the shorter read

length produced while sequencing microbial genomes. Lately, the advanced version of 454

pyrosequencing technology (GS-FLX Titanium) can produce reasonably longer read-lengths

(400-500 bp) compared to the competing NGS platforms (Kim et al., 2013). Moreover,

Illumina, and to a lesser extent, SOLiD technologies were found to be applicable for large

scale metagenomic surveys owing to their rapidness, huge sequencing data, and more

coverage obtained at the same cost when compared to 454 pyrosequencing. It was also

noticed in previous studies that Illumna platforms showed comparable results to 454

pyrosequencing in terms of assemblies and high sequencing coverage when studying

metabolic and taxonomic diversity of microbial ecosystems (Teeling & Glockner, 2012).

Recently it has been discovered that the utility of the hybrid sequencing approach using data

from two or more sequencing technologies represents a promising alternative for many whole

microbial genome assembly projects. The primary goal is to obtain cost effective and high

quality draft or finished assemblies of microbial genomes. For instance, Craig Vendor and his

colleagues investigated the strengths and weaknesses associated with the use of the hybrid

Page 23: Sequence, organization, and genes characteristics of ORFs ...

22

sequencing approach to evaluate assemblies of six microbial genomes against assembly from

Sanger sequencing only. It was demonstrated that the incorporation of 454 data into whole

genome assembly resulted in the completion of 2 out of 6 microbial genomes and

approximately 86% of gaps were reduced for other microbial genomes. The advantage of 454

sequencing over Sanger is represented by no clonal bias and the ability to sequence non

clonal regions. Although the 454 sequencing platform was considered a promising alternative

for cost effective and high quality assemblies, it was not recommended for use in de novo

sequencing due to the lack of paired end information and ambiguous incorporation of repeats

by 454 data. The conclusion of the investigation was that de novo genome sequencing

projects should depend on the Sanger platform to make use of pair-end information and deal

with large repeats and physical gaps, and that 454 data usage would work best to complement

Sanger sequencing for “hard stop” non-clonal regions and gap closure projects (Goldberg et

al., 2006).

Despite the progress made in lowering sequence costs, the data analysis phase is still a

computational extensive process (L. D. Stein, 2010; Wooley et al., 2010). The following

challenges are still considered when analyses of huge metagenomic datasets take place: first,

a large storage capacity is needed for data storage and second, the prerequisite for data

standardization prior to the analysis and presence of advanced computing resources is

required. In addition, the short read length generated results with a higher rate of sequencing

errors and difficulties with data assembly. Accordingly, this would presumably limit insights

into gene content and correct functional assignment of metagenomes when compared to

results obtained by Sanger technology (Kim et al., 2013).

7. Data analysis The emerging NGS technologies are currently introduced in many research aspects including

metagenome sequencing. Albeit differences in sequencing chemistries, data output and

characteristics of each platform in terms of advantages and limitations, they all share the

potential to generate tremendously high throughput sequencing data. Thus, the downstream

analysis for these data output is a very complex process in terms of time consumption and

lack of bioinformatics expertise that still represent a real challenge for many biologists and

researchers of different research facilities (Barzon, Lavezzo, Militello, Toppo, & Palù, 2011).

Today, many computational tools are available to analyze sequence data of different

microbial genomes and some of them have shown applicable for metagenomic data analysis

Page 24: Sequence, organization, and genes characteristics of ORFs ...

23

of mostly small scale projects, however additional bioinformatics pipelines with higher

computing power and data storage capacity are still highly required, so that sequencing data

can be fully exploited and problems associated with metagenome sequencing can be handled

(Barzon et al., 2011; Teeling & Glockner, 2012).

In general, a typical framework for metagenomic data analysis to provide insights into

functional and taxonomic diversity of microbes might involve data cleaning and processing,

assembly, gene calling, taxonomic classification and finally gene annotation or phylogenetic

inference. Further data analysis could be achieved on the transcription or translation level as

well (Teeling & Glockner, 2012; Wooley et al., 2010).

8. Metagenomics’ successes in exploring microbial diversity, metabolism and

industrial potential of extreme ecosystems

Metagenomics represent a powerful approach to analyze microbial habitats from different

extreme environments in terms of microbial diversity, metabolism and adaptation

mechanisms to different extreme conditions as well as comparative studies (Lewin, Wentzel,

& Valla, 2013; Wemheuer, Taube, Akyol, Wemheuer, & Daniel, 2013).

Several approaches are used for the application of microbial shotgun metagenomics for

current research including the construction of metagenomic libraries in suitable vectors as

small size vectors plasmids or large size fosmids and BACs followed by random or targeted

sequencing of one or more metagenomic clones. Furthermore, direct sequencing of

environmental DNA is carried out using NGS platforms to gain more insights into the

microbial genome structure and gene content, as well as microbial genome assembly from

assembled metagenomic data (J. a Gilbert & Dupont, 2011).

8.1 Assessment of microbial and metabolic diversity

Scientists have initially focused to characterize poorly studied uncultured microbes from

different environmental habitat. In order to identify novel microbial lineages or gain more

insights into their metabolic functions, this was accomplished by sequencing part of their

genomes. This metagenomic approach is presumably have helped us to identify novel

microbial lineages and contributed in landmark discoveries from uncultured microbes. For

instance, two independent studies were conducted by Beja et.al., and Stein’s group that have

led to the discovery of bacterial derived rhodopsin and the first identification of 16S rRNA

linked genomic fragment of uncultured marine planktonic archaeon respectively (Béjà et al.,

Page 25: Sequence, organization, and genes characteristics of ORFs ...

24

2000; J. L. Stein, Marsh, Wu, Shizuya, & Delong, 1996). Interestingly, sequencing of 43kb

genomic fragment of uncultured crenarchaeota from the fosmid library of calcareous

grassland soil have revealed the functional role of ammonia oxidizing bacteria in different

marine habitats using both shotgun metagenomics and inverse transcription PCR based

analysis (Treusch et al., 2005).

Another remarkable study by Gilbert et al, have reported the shotgun sequencing of a 32,86

kb fosmid clone from the metagenomic library of a coastal marine microbial habitats. The

phylogenetic analysis has revealed high sequence similarity to SAR11 bacteria “C.

Pelagibacter unique”. However the results from shotgun metagenome analysis have showed

that almost half of putative genes identified in the sequenced fosmid clone were very less

similar to the genome of the most closely SAR11 bacteria, unlike the results demonstrated by

rRNA based analysis. These results have demonstrated that the shotgun metagenomic

approach was superior over 16S based phylogeny to explain the genetic versatility and

diversity of ubiquitous C. Pelagibacter across many habitats (J. A. Gilbert, Mühling, & Joint,

2008). More studies were conducted using shotgun metagenomic clones/s sequencing

approach as reviewed elsewhere in (Nesbø, Boucher, Dlutek, & Doolittle, 2005; Schirmer,

Gadkari, Reeves, Ibrahim, DeLong, et al., 2005; Vergin et al., 1998; G. Y. Wang et al.,

2000), as it is estimated that shotgun sequencing of large genomic fragments of uncultured

microbes from one or more metagenomic clones will not only participate in the identification

of novel microbial lineages, but also can produce large contiguous DNA sequences that

provides more insights into gene neighborhood context and identify novel genes variants and

biosynthetic clusters with biotechnological prospects that are usually unrecognized during the

analysis of short read metagenomic datasets generated by high throughput sequencing

technologies (Danhorn et al., 2012; Kunin, Copeland, Lapidus, Mavromatis, & Hugenholtz,

2008).

Lately, an increasing number of shotgun metagenomic studies describing different extreme

environments have been reported. The study of Acid mine drainage (AMD) biofilm of Iron

Mountain in California represents one of the early metagenomic studies conducted on low

complexity microbial community from extreme environments. The sequence data generated

by random shotgun sequencing using Sanger technology (76 Mbp) showed dominance by

two microbial subgroups; Leptospirillum species and archaeon Ferroplasma acidarmanus.

Interestingly, it was the first attempt to determine nearly completes genome from uncultured

dominant species, unlocking metabolic pathways for nitrogen and carbon fixation, energy

Page 26: Sequence, organization, and genes characteristics of ORFs ...

25

production and other adaptation mechanisms to extreme environments. Thus, this study

provided comprehensive insight into the microbial community and the metabolic structure, as

well as adaptation strategies for extreme habitats and this was further regarded as a milestone

for subsequent analysis of other extreme acidic environments (Tyson et al., 2004).

The hydrothermal vents represent one of the exotic marine ecosystems from which, only few

metagenomic studies were reported to date. For example, Brazelton and his colleagues

reported the isolation of DNA from microbial communities of hydrothermal chimney biofilm

samples of the lost city hydrothermal field on the Mid-Atlantic Ridge (Brazelton & Baross,

2009). The earlier phylogenetic analysis of chimney microbial community revealed that

metagenome of biofilm samples was dominated by methane cycling archaea

Methanosarcinales, which accounted for more than 80% of the total microbial structure. The

pooled metagenome was then randomly sheared, cloned and sequenced by Sanger

technology, yielding 46,316 reads of total sequence data size 35 Mbp. Sequence analyses of

shotgun reads against nr database using BLAST sequence similarity search tool revealed that

more than 8% of the data is represented by transposases, showing a significant abundance

and diversity of transposases compared to metagenome of other ecosystems. The identified

transposases showed high coverage but yet found small in size, suggesting their location on

small extragenomic molecules as phages or plasmids and thus, these mobile elements were

thought to contribute in the phenotypic variety of low-complex microbial communities.

Further comparative analysis for other deep-sea hydrothermal chimneys by metagenomics

have shown common diversity and functions that are selective for chimney biofilms

regardless the geochemical differences of studied ecosystems (Sievert & Vetriani, 2012).

The work of Simon and his colleagues represent one of the first attempts to explore microbial

diversity and metabolic functions of the Northern Schneeferner (Germany) glacial ice

metagenome using 454 pyrosequencing data. The metagenomic analysis of glacial ice DNA

revealed potential metabolic versatility of microbial community displaying autotrophic

manner in response to the low nutrient content of glacial ice (Lewin et al., 2013). Also,

several cold adaptation mechanisms were also determined as cryoprotectants, unsaturated

fatty acids and ROS scavengers. Hence, the metagenomic analysis of glacial ice samples

provided comprehensive insights into the microbial structure and metabolic functions of

psychrophilic community as well as potential adaptations to cold stress on Earth and a

Page 27: Sequence, organization, and genes characteristics of ORFs ...

26

glimpse onto microbial life of other similar extraterrestrial habitats(Simon, Wiezer,

Strittmatter, & Daniel, 2009).

8.2 Industrial potential of extreme ecosystems

Microbial enzymes and bioactive molecules from different microbial isolates have been

extensively used for many industrial applications to date (Zhang & Kim, 2010). The Global

initiatives towards the application of white biotechnology have shifted the paradigm towards

novel microbial resources which hold great potential for industries. The advent of

metagenomics along with in vitro evolution and high-throughput screening technologies, hold

great promises towards microbial enzymes commercialization. Several attempts were made

by many biotechnological companies to isolate novel biocatalysts and valuable biomolecules

using metagenomics (Lorenz & Eck, 2005). The release of novel nitrilase, phytase and

glycosidase enzymes into industry by Diversa company represent one of the success stories

accomplished by metagenomics for novel biocatalysts discovery (Urban & Adamczak, 2008).

Recent studies have employed different metagenomic approaches for the isolation of novel

microbial enzymes and molecules from different extreme ecosystems. In regard to lipolytic

and polysaccharide modifying biocatalysts, Jeon et al. have reported the isolation of a novel

low temperature active lipase enzyme from cold sea sediment. One positive recombinant

clone was detected with lipolytic activity from the metagenomic library on tricaprylin

medium, the clone of interest was then fully sequenced using the shotgun metagenomic

approach. Sequence analysis and phylogenetic studies revealed sequence similarity of the sub

clone (ORF20) to a new lipase / esterase family. Upon further characterization, the expressed

protein displayed lipolytic activity over a range of natural oil substrates and it retained more

than 50% activity at 5 degrees and resistance to different detergents. The results described a

cold active lipase with interesting features for industry (Jeon, Kim, Kim, et al., 2009). Later,

the same co-workers have identified two novel esterases from another extreme low

temperature environment, Arctic metagenome samples. These enzymes are potentially used

in chiral resolution of heat sensitive substrates (Jeon, Kim, Kang, et al., 2009).

Another example illustrates the isolation of a novel thermostable amylase from screening

metagenomic fosmid library of the Juan De Fuca Ridge Hydrothermal Vent; this novel

alpha amylase is a new member of glycosyl hydrolase family 57 and the 3D structural

dimensions of the protein were predicted by homology modelling. The enzyme showed

maximum activity at pH of 7.5 and temperature of 90 degrees Celsius (°C) with capacity

Page 28: Sequence, organization, and genes characteristics of ORFs ...

27

to retain more than 50% of its activity at this temperature for 4 hours (H. Wang et al.,

2011). More investigation into extreme cold and hot ecosystems would presumably reveal a

plethora of novel microbial enzymes and molecules that shows maximum activity at extreme

temperatures/thermal conditions.

Furthermore, metagenomics has been successfully employed to explore enzyme potential

with other desirable characteristics from different extreme habitats as well. As for example,

the thermo-alkali-stable xylanase encoding gene was identified by Verma and her co-workers

by shotgun sequencing of metagenomic subclone of composite soil samples. The

recombinant enzyme displayed activity over a wide range of pH and with optima activity at

pH 9.0 and temperature at 80 degrees Celsius (°C). This study represent one of the first

metagenomic attempts to screen for xylanases that tolerate harsh conditions predominant in

many industrial processes, as required by feed as well as paper and pulp industries (Verma et

al., 2013). Another novel thermostable pectinase was identified by Singh and his

collaborators from screening soil metagenome from hot springs of northern India. The gene

encoding for pectinase was expressed into E.coli strain and characterized over a broad

range of temperature and pH. The enzyme was found to be thermostable at temperature

of 60 degrees Celsius (°C) and activity over wide pH range, reaching temperature and

pH optima at 70 degrees Celsius (°C) and 7.0 respectively which make it applicable for

different industries as waste water treatment and textile processing (R. Singh, Dhawan,

Singh, & Kaur, 2012).

As for proteases, they are widely used in different industrial applications and represent more

than 60% of sales of the global enzyme industry (Zhang & Kim, 2010). The metagenomic

survey for novel environmental niches as Gobi and the Death Valley deserts, resulted in the

detection of 17 putative proteolytic clones, two proteases belong to the substilisin (S8A)

family and were further purified and characterized displaying different biochemical activities

in terms of pH and temperature. Protease M30 revealed optimum activity at alkaline pH >11

and temperature of 40°C, whereas protease DV1 showed an optimum activity at pH of 8 and

temperature of 55°C, and thus these unique characteristics make them suitable for

biotechnological uses (Neveu, Regeard, & DuBow, 2011).

Page 29: Sequence, organization, and genes characteristics of ORFs ...

28

Interestingly, Mohamed et al., have reported the isolation and characterization of a

thermostable, metal-resistant esterase from most exotic layer of the Atlantis II deep in the

Red Sea, the lowest convective layer of the brine pool (Mohamed et al., 2013).The above-

mentioned features make them a potential biocatalysts for industry. Recently, the richness of

the Atlantis II deep with heavy metals have intrigued further interests to explore a n Atlantis

II metagenomic dataset for novel heavy metal detoxifying enzymes using “synthetic

metagenomics” approach (Culligan et al., 2013). The work revealed the first novel

thermostable, halo tolerant and metal resistant mercuric reductase (Sayed et al., 2014).

8.3 Therapeutics and health care applications

Today, many microbial enzymes have been successfully employed in many clinical and

healthcare applications. For example, many thermostable superoxide dismutases (SOD) are

highly required for different cosmetics preparations. A study by He and his colleagues

reported the isolation of a novel thermostable Fe-superoxide dismutase enzyme from

screening metagenomic libraries constructed from hot spring sample. The identified gene was

expressed into E. coli and characterized by pyrogallol method as an iron dependant

superoxide dismutase (Fe-SOD) enzyme. The enzyme displayed optima temperature at 80

degrees and stability at a broad pH range 4-11. Comparative modelling results revealed that

the presence of a large number of inter-subunit ion pairs and hydrogen bonds and decreased

solvent accessible hydrophobic surfaces accounted for its high thermostablity (He et al.,

2007). The microbial derived production of melanin pigments represents a hot topic for R&D

scientists in cosmetics industry as an inexpensive alternative for producing melanin (Gabani

& Singh, 2013). The discovery of a melanin producing clone from screening metagenomic

library of deep sea sediment represents one of the accomplishments made from exploring

untapped microbial repertoire of extreme environments (Huang et al., 2009).

In highlight to the potential use of proteases in clinical applications, some metalloproteases

were identified as potential thrombolytic agents for treatment and prevention of

cardiovascular diseases. As the use of current thrombolytic agents is associated with

undesirable side effects and lower specify to fibrin, thus mining for novel thrombolytic

enzymes from extreme environments as deep sea and other extreme environments that

represent a unique microbial repertoire for mining novel fibrinolytic enzymes for treatment of

thrombosis. Lee et al. described the isolation of a novel zinc-dependant metalloprotease from

deep sea sediments of clam bed metagenome in the west coast of Korea. Primary screening of

Page 30: Sequence, organization, and genes characteristics of ORFs ...

29

metagenomic libraries showed one metagenomic clone with proteolytic activity on azocasein

substrate. The purified enzyme was further characterized displaying optimal activity at 50 oC

for 1 hour and pH 7.0. The functional assay was performed using azocaesin and fibrin as

substrates and enzyme activity was further inhibited by different metal chelating agents (D.-

G. Lee et al., 2007).

Also, the discovery of novel antibiotics and anticancer compounds has gained special

interest by metagenomics experts. Several bioactive compounds were successfully isolated

from temperate environmental niches by metagenomics (Gillespie et al., 2002; E. W. Schmidt

et al., 2005; G. Y. Wang et al., 2000), however still little is known about extreme

environments as a potential source for novel therapeutics for treatment of cancer and other

life threatening diseases. Recently, three recent independent metagenomics studies of desert

sand soil have reported the identification of anticancer agent indolotryptoline and two

polyketide type II synthases derived molecules; fluostatins and erdacin respectively. The

study of desert soil metagenome represents an interesting example for extreme ecosystems

that hold promises towards novel bioactive molecules discovery (Chang & Brady, 2013;

Feng, Kim, & Brady, 2011; King, Bauer, & Brady, 2010 reviewed in M. H. Lee & Lee,

2013).

The contribution of microbes to pharmaceutical and health care sector goes back to the early

discovery of antibiotics by Lewis Pasteur and later from other microbial isolates. The

emerging resistance against current antibiotics and the increasing potential of microbial

enzymes in clinical applications (Zhang & Kim, 2010), have provoked the need to fully

explore novel microbial habitats for novel therapeutics and bioactive compounds using

successful approaches as the use of metagenomic approach.

These findings are supporting evidence that a unique genetic and microbial repertoire are

showing high endurance to extreme conditions of different extreme ecosystems, and thus

necessitate further exploration by metagenomics.

Page 31: Sequence, organization, and genes characteristics of ORFs ...

30

9. Thesis objective

Several metagenomic approaches are currently used to find novel microbial products in

different environmental niches. However, despite the extensive screening efforts made for

novel genes discovery, most of the targeted metagenomic strategies that involve screening of

full libraries containing thousands of clones have usually resulted in few to zero positive hits,

leaving behind a plethora of metagenomic clones (negative hits) that could be useful for other

applications. As a consequence, the discovery rate and application of novel molecules has

remarkably decreased (Mária Džunková, Giuseppe D’Auria, David Pérez-Villarroya, 2012).

In addition, the genes most distantly related to the consensus for a gene family are most

probably of great interest for biotechnological applications. However, they are the least likely

to be detected by probes or primers designed based on the database entries. Thus, the

undirected approach is likely to unmask a homologues set of useful genes that would not

have been detected by directed metagenomic studies. Moreover, even with huge

metagenomic data generated from emerging NGS technologies, the functional potential of a

large number of generated sequences are yet remained unexplored. This is due to some

pitfalls in assembly and scaffolding strategies of NGS data as a result of short read length and

lack of mate-paired sequence information (Jackson, Rounsley, & Purugganan, 2006). To

address these problems, and to establish in our laboratory the processes to obtain full length,

contiguous, and un-chimeric assembled metagenomic sequences, the Sanger chain

termination technique was used to sequenced the insert of one recombinant fosmid clone

from an LCL-ATII metagenomic library. Therefore, the objective of this work is to determine

the sequence, gene organization and to characterize identified ORFs of long de-novo

contiguous metagenomic DNA fragment from the deep brine environment of Atlantis II in

the Red Sea.

Page 32: Sequence, organization, and genes characteristics of ORFs ...

31

Chapter 2: Materials and Methods

1. Sample collection Samples were collected by the AUC research team from the LCL of Atlantis II brine pool (at

21° 20.72 North and 38° 04.59 East) during the King Abdullah University for Sciences and

Technology (KAUST) (Thuwal, Kingdom of Saudi Arabia); Woods Hole Oceanographic

Institute (WHOI) (Woods Hole, MA); and Hellenic Center for Marine Research (HCMR)

(Anavisso, Greece) oceanographic cruise of the research vessel Aegaeo in March/April 2010.

Microbial communities were collected by serial filtration of the water samples through 3 µm,

0.8 µm and 0.1 µm of mixed cellulose-ester membrane filters (nitrocellulose/cellulose

acetate). The filters were then stored in sucrose buffer at -20 oC (Mohamed et al., 2013).

2. DNA extraction and fosmid library construction Isolation of environmental DNAs was performed as described (Rusch et al., 2007). Fosmid

DNAs were extracted by phenol/chloroform method and purified using Miniprep fosmid

extraction kit (Promega) according to manufacturer’s instructions.

To construct the metagenomic fosimd library, the CopyControl™ HTP Fosmid Library

Production Kit (Epicentre) was used according to the manufacturer’s instructions. In brief,

prokaryotic DNA isolated from LCL was further separated on 1% low-melting point agarose

by gel electrophoresis to purify the fragments that have an approximate size of 40 Kb, end-

repaired to have 5’-phosphorylated blunt ends, and then ligated to the pCC2FOS vector. The

ligated mix was then packed into lambda phage using lamda packaging extract and then used

to transfect EPI300-T1R phage T1-resistant E. coli cells provided with the kit. The

transformants were then plated on Luria-Bertani (LB) agar plates containing 12.5µg of

chlormaphenicol/ml and transformantes were collected individually into 96-well plates.

Using this approach a fosmid library containing 111plates comprising 10,656 clones were

assembled. These steps were performed in our laboratory in collaboration with Dr. Mohamed

Ghazy and Mr. Amged Ouf.

Page 33: Sequence, organization, and genes characteristics of ORFs ...

32

3. Pyrosequencing Environmental DNA isolated from the ATLII-LCL water samples were prepared and then

sequenced according to the 454 pyrosequencing GS FLX Titanium library guidelines using

Roche GS FLX Titanium 454 sequencing platform located in the AUC genomics facility. The

sequencing was conducted by Dr. Ahmed El Sayed, Dr. Mohamed Ghazy and Mr .Amged

Ouf from AUC.

4. Data assembly and annotation Data generated from 454 pyrosequencer were further assembled based on Newbler® GS

assembler version 2.6 using the default parameters and minimum overlap identity value of

90%. The software Metagene Annotator (MGA) was used to detect potential ORFs within

454 assembled contigs and further annotated using different annotation pipelines as

MGRAST.

5. Subcloning of shotgun DNA fragments of the recombinant 99B6 fosmid in

bacterial vector Six recombinant fosmid clones were randomly selected from fosmid plate #99. Subsequently,

their fosmid DNA were extracted and analyzed by restriction digestion using BamHI

restriction enzyme to check for recombinant clones’ heterogeneity. For the purpose of full

clone insert sequencing, fosmid B6 was further arbitrary selected and fosmid DNA isolated

for this purpose. Fosmid DNA was nebulized at 10 psi for 45’’ (Life Tech, K7025-05) and

fragmented DNA between 1.5 to 2 kb were localized on the gel, and the gel containing the

DNA fragments between these sizes were then excised. DNA fragments were then purified

using QIAquick Gel Extraction Kit (Qiagen, 28704) and end-repaired using T4 DNA

polymerase. Addition of 3’ dA overhangs was done by incubating end-repaired fragments

with dATP and Taq polymerase for 1 hour at 72 oC. Finally, the A-tailed DNA fragments

were ligated to pGEM®-T Easy (Promega) according to manufacturer’s instructions. Next, a

1ul of the ligation reaction was used to transform 50 ul of electrocompetent E. coli TOP10

cells using MicroPulser™ Electroporation Apparatus (Bio-Rad). Finally, the transformed

cells were then plated on LB/Ampicillin/IPTG/X-Gal agar plates.

Page 34: Sequence, organization, and genes characteristics of ORFs ...

33

6. DNA Sequencing using dideoxy chain termination technique (Sanger Method)

Subclones were randomly sequenced using in the Applied Biosystems 3730xl DNA Analyzer

using the BigDye® Terminator v3.1 Cycle Sequencing Kit. Two reactions were prepared for

each plate containing: Big dye terminator, 5X Sequencing Buffer, universal primers for

pGem-T easy vector (forward or reverse) and, DNA template. The primers used were T7

prom and Sp6 as universal vector primers. Thermocycler conditions were adjusted at initial

denaturation step at 96oC for 1 minute, 25 cycles (denaturation at 96oC for 10 seconds,

annealing at 55oC for 5 seconds and extension at 60oC for 4 minutes).

7. Assembly of the sequenced DNA fragments Data assembly was done using phredPhrap consed program. All raw Sanger reads were

screened from pGem-T easy vector and fosmid sequences and low quality reads were masked

before assembly. Phred/Phrap/Consed was used to mask raw sequences with quality score

limit with error probability 0.01(Q20). The assembly of masked raw reads was performed by

Phred/Phrap/Consed using the software default parameters. Base quality score of 20 or more

was assigned for bases in consensus sequence and overlap length of minimum 15 nucleotides

and 95% identity were used for effective assembly.

8. ORFs identification and gene annotation Putative open reading frames (ORFs) were then identified using the MetaGene Annotator

tool (MGA). Artemis genome browser was used as visualization, annotation and six-frame

translation tool to determine genomic features and protein signatures of identified ORFs. This

was achieved using a collection of embedded sequence similarity search tools as Basic Local

Alignment Search Tool (BLAST) and functional domain analysis using PFAM search engine.

Furthermore, the identified ORFs were also submitted to KEGG annotation/COG search tools

to assign potential metabolic pathways and detection of orthologous groups associated with

identified sequences based on a collection of manually curated databases within Kyoto

Encyclopedia of Genes and Genomes (KEGG) database resource and Clusters of Orthologous

groups (COGs) (Wu, Zhu, Fu, Niu, & Li, 2011).

Page 35: Sequence, organization, and genes characteristics of ORFs ...

34

9. Other descriptive features for identified ORFs Several bioinformatics tools were used to determine other descriptive features for identified

ORFs. The transcriptional regulatory elements [-35,-10] of the promoter regions were traced

by Bprom software (www.softberry.com) and the ribosomal binding site (RBS) was searched

for manually RBS consensus sequence AGGAG and other A/G rich motifs. Furthermore,

SignalP4 program (Petersen, Brunak, von Heijne, & Nielsen, 2011) was primarily used to

predict possible signal peptide triggered protein secretion pathways. Alternatively, sequences

were further analyzed to identify other non-classical secretory pathways and Twin Arginine

secretory signals using SecretomeP (Bendtsen, Kiemer, Fausbøll, & Brunak, 2005) and

PRED-TAT (Bagos, Nikolaou, Liakopoulos, & Tsirigos, 2010) respectively. The subcellular

localization of proteins was predicted by PSORTb v3.0.2 (Yu et al., 2010) software and the

likelihood of proteins to contain transmembrane domain/s was checked by THMM (Krogh,

Larsson, von Heijne, & Sonnhammer, 2001)analysis tool. Finally, the halophilicity of

potential coding sequences were estimated by calculating the ratio of glutamate and aspartate

residues for pairwise alignments between ORF and corresponding reference sequence of

blastx similarity search results.

Furthermore, the potential full-length ORFs were selected for detailed protein sequence

analysis using a collection of protein sequence analysis tools as CDD search, Interproscan,

phmmer software. For instance, the EBI Interproscan (Jones et al., 2014) is a web interface

that integrates different protein sequence analysis search tools and databases into one user-

friendly platform for the detection of potential functional domains and other important

protein signatures.

Page 36: Sequence, organization, and genes characteristics of ORFs ...

35

Chapter 3: Results and Discussion

1. Analysis of the ATII-LCL library’s fosmids by restriction endonuclease A metagenome fosmid library of environmental DNA isolated from microbial communities

collected from the ATII-LCL was constructed in our laboratory. This library comprises

10,656 clones distributed in 111 96-well plates. To test if the fosmid clones do not have a

high level of redundancy of the inserted DNAs, we randomly selected 6 clones from plate #

99 for restriction endonuclease analysis of the inserted DNAs. Electrophoresis of the isolated

DNAs from the clones 99A1, 99A10, 99B6, 99B12, 99F7, and 99G8, showed DNAs with

high quality and high molecular weight (Figure 3). The DNAs of the 6 recombinant fosmids

were digested by BamHI enzyme and the result is shown in Figure 3. As it can be seen from

the restriction patterns of the 6 clones, all the clones analysed gave different restriction

patterns with different sizes DNA fragments. This result indicates that no particular tendency

for insertion of specific fragment of DNA occurs during construction of the library.

Figure 3. Restriction analysis of six randomly selected recombinant fosmids.

Six clones were randomly selected; 99A1, 99A10, 99B6, 99B12 and 99G8, and their

recombinant fosmids were isolated and digested with BamHI and different DNA fragments

were then analysed on 0.8 % agarose gel electrophoresis as indicated in the figure.

Page 37: Sequence, organization, and genes characteristics of ORFs ...

36

2. Establishing a library for fosmid clone 99B6 in pGEM plasmid vector To sequence the entire DNA insert from one fosmid clone, we arbitrary selected clone 99B6

for this purpose. Purified DNA from clone 99B6 was nebulized using Nebulizer (Life Tech,

K7025-05) for 45 seconds at 10 psi. The condition of nebulization was adjusted to yield DNA

fragments of approximate size of 1.5-2 kb (Figure 4). DNA fragments of size 1.5-2 kb were

gel purified after electrophoretic separation without exposure to UV light and then ligated

into pGEM-T easy vector (Promega). The ligated mixture was used to transform E. coli

Top10 competent cell (Invitrogen) using electroporation technique. This transformation

yielded 469 recombinant clones. The 469 clones were inoculated into five 96-well plates and

stored at -80 oC.

Figure 4. Isolation of 1.5 to 2 kb DNA fragments of nebulized 99B6 fosmid clone.

Isolated fosmid 99B6 was nebulized for 15’’ at 10 psi as described in the Material and

Methods section. The nebulized fragments were separated on 0.8% agarose gel

electrophoresis. The fragments sizes, 1.5 to 2 kb, were isolated from the excised gel section

shown in the figure without exposure to UV light.

Page 38: Sequence, organization, and genes characteristics of ORFs ...

37

Plasmid DNAs from 12 subclones were isolated using R.E.A.L. Prep 96 Plasmid Kit

(Qiagen) and analysed on 0.8 % agarose gel electrophoresis. As it can be seen from Figure 5,

the plasmids DNAs have good quality. pGEM-T easy vector has 3015 bp and each

recombinant plasmid was found to have an estimated size of about 4.0 to 4.5 kb (Figure 5).

Therefore, each plasmid has an insert size that range from 1 to 1.5 kb. This result indicates

that the size of the inserts in our pGEM-T 99B6 library is adequate for complete sequencing

of each insert using the Sanger technique using the pGEM-T easy forward and reverse

primers.

Figure 5. Analysis of the quality and integrity of clones from 99B6 pGEM library.

Recombinant plasmid isolated from 12 clones of the 99B6 pGEM library was randomly

selected and their plasmids were isolated and analysed on 0.8% agarose electrophoresis.

3. Sequencing of 99B6 pGEM library clones The plasmids from the 469 clones of the 99B6 pGEM library were isolated and purified as

described earlier. Plasmids were sequenced using the chain-termination methods of Sanger

and analysed on Applied Biosystems 3730xl DNA sequencer. Plasmid’s inserts were

sequenced from both ends using the forward and the reverse primers of the pGEM-T easy

vector. Out of 952 forward and reverse sequencing reaction, only 476 reads were mate-

paired, i.e., sequences from the forward and the reverse reaction that was overlapped. The

rest of the reads, 298, were found to be sequences of low quality, vector sequences or

remained as singletons during the assembly process.

Page 39: Sequence, organization, and genes characteristics of ORFs ...

38

4. Assembly of the sequenced reads DNA sequences from the ABI 3730xl DNA sequencer files were traced and each base was

assigned a quality value using the Phred software utilizing the default parameters (a cut-off

value of 20 or more represents a good), the base quality score values before and after

screening process are provided in Figure 6 (Ewing & Green, 1998). The quality reads were

then assembled, visualized, and edited by the Phrap and Consed softwares (Gordon, Abajian,

& Green, 1998).

A. Quality (Pre-screening) B. Quality (Post-screening)

Avg. quality: 41.9 per base

Cum. expected error :19.39

Avg. full length trimmed: 657.3

Avg. quality: 31.8 per base

Cum. expected error :335.85

Avg. full length: 945.9

Figure 6. Assigned base quality score value for sequenced reads. (A) Quality assessment for

the sequenced reads before low sequence quality screening process (B) Quality for the

sequenced reads after screening process using Phred/Phrap/Consed software.

The 672 assembled reads is presented in Figure 7. The assembled reads has a single contig of

28,378 bp with average read depth of coverage value of 20 times. The orientation of reads is

denoted by color codes; red color represents reverse read and green represents forward reads

as denoted by CLC genomics workbench tool (Figure 7).

Page 40: Sequence, organization, and genes characteristics of ORFs ...

39

Figure 7. Assembly of the sequenced reads. Sequenced reads were traced and bases were

assigned a quality values using Phred software(Ewing & Green, 1998) , and then assembled

and visualized using Phrap, consed softwares and CLC genomics workbench (Gordon et al.,

1998; Sequencing, 2011) For more details refer to the text and the materials and methods

section.

As mentioned above, around 298 reads did not assembled in the major contig. The estimated

average insert size in our fosmid library is between 30 to 40 kb. Our assembled contig has

28,378 bp, therefore, it is apparent that we did not sequence the entire insert. To estimate the

size of the inserted DNA in the sequenced fosmid, and to clarify the size of the un-sequenced

DNA fragment, we digested the recombinant 99B6 fosmid clone with Eco RI. Four DNA

fragments were observed with the following sizes: 25, 8, 6, and 3 kb (Figure 8A). From the

assembled 28,378 bp, we mapped two Eco RI sites within the inserted DNA fragment, and

one Eco RI site within the fosmid vector located at 300 bp upstream from the insertion site

(Figure 8B). From the sizes of the Eco RI DNA fragments, we estimated that our

recombinant vector 99B6 has a size of 42 kb. Since the fosmid vector has a size of 8.181 kb,

the inserted DNA fragment, therefore, should have 33.819 kb. Based on this analysis, the un-

sequenced DNA fragment should have an estimated size of 5.441 kb (Figure 8B).

Page 41: Sequence, organization, and genes characteristics of ORFs ...

40

Figure 8. Eco RI restriction analysis of recombinant 99B6 fosmid clone.A) Recombinant

fosmid 99B6 was isolated, purified, and digested with Eco RI restriction enzyme. The

digested fragments were analyzed on 0.8% agarose gel electrophoresis. B) Eco RI restriction

map of recombinant 99B6 fosmid. The sketch shows the un-sequenced fragments highlighted

in red.

Several attempts were made to finish sequencing of the fosmid clone insert. As for instance,

we manually inspect the base composition of the assembled 99B6 sequence (28,378 bp) and

sequenced un-assembled singletons for the presence of low complexity regions and repeats.

The results have shown a characteristic patterns of simple repeats and low complexity regions

at the nucleotide sequences of some of the unassembled singletons as well as at the end of the

assembled 99B6 sequence (28,378bp) (Figure 9). It has been estimated that regions of low

complexity and simple repeats, compositional bias of sequence data (AT-rich, poly-purines

etc. are presumably accompanied with DNA polymerase slippage that takes place during

DNA sequencing when a nearby repeat is present (Frith, 2011). These simple repeats and

low complexity regions are thus most likely associated with assembly problems and spurious

annotations (Treangen & Salzberg, 2012).

Page 42: Sequence, organization, and genes characteristics of ORFs ...

41

Furthermore, annotation of 28,378 bp sequence from 99B6 recombinant clone, have

demonstrated a gene which encodes for a methyl transferase enzyme from an uncultured

microbe is duplicated, these results are further displayed in the annotation section.

Accordingly, we have presumed that the presence of simple repeats, low complexity regions

as shown in Figure 9, along with gene duplication at the end of the assembled 99B6

consensus sequence, have enforce us to recognize the difficulty to recover the un-assembled

DNA fragment 5.441 kb of the recombinant fosmid clone 99B6 insert.

> Sequence end (approx.400bp) of the assembled fosmid sequence (28,378 bp) AACGTCCGACCACGGTGAACTCCTCGGGGAGGACGGAATATACGAGCATCCGGTATGGTCGGATCACCCCGTATTGAGGGAAGTTCCTTGGTT AAGGGTGGAGGAGAATGAAGATTGAAGATTTCGTTCCCGACGACGAAAAAACTCTCGACGTGGGATGCGGGAATAGGAAAATCGGCGATGTG GGCATTAAAAGACCCAGGCTCAGACGCGGATGTTATCCAAGACGTTGAGAAAGGACTCCCCTTCGAAAGCGAGGAGTTGGGTGCGTCGTTTCG CGCCACTTGCGGAACACTTGAGTGACGCGCAAGGACTATTGAACGAGGTTCATAGGATTCTGAAACCGGATGGAAGGCTGGTCCTTGAAACCG AAAAGCCATACCCGAAGGCCAGTTCCTTTTGGGACGACCCCACTCACGTAAGGCCATACACCAAGAAGTCCTTGAGGAAATTGTCAAAGAATGC CGGATTTTCAGAGGTAAAACTTACGACTGGGGGATAAGGGGGTTATCCTATTGTCCAACCTTCTTGTCCAAAGTGCTGATGAACGCCGGCTTG GGTTCATCGCTATTATTGGTCGCCGTGAAATAGTTCAGAAGTCGAATTAAACCTTCGGGGGCTAATGTCCACGGTTCCGTGCTGTTCAACCGA AAAATGAGGAGAATAAGGTAGCTTAGGAGGGAAAAAACCTCCTTAAGAAAAAAAGGGGGGGA

> Unassembled sequence I

TTATTCCTTCAGTTCCAATATTGCCCATATCTTTCAGAACCTTGCCGGTCTTAGAATCCTCTATTTTACATTGGAACCCCAAGCCAGTTGGTGA

GGGCTTACACTTGAGCAGGTTCACTCTATCTAAGTCTGAGCTACATCCAGCTGCTTTCTCCCTAAACTCTTTTTCCACCATCTACCTACGCACC

TTTGAGTGTGAAAGGGAGCGATGCCGACGGCCTGCCCGCGTCGCCCCCTTTAGCGTGATAATTATACCTCGATGCCTTCCGAGGCGGCCTCCTC

GGACATGCAGCTCAGGAACTTCGGCAACCTGTCTTTCTTCCCCCTGGTCTTCCTCGAACACTTCGCCGCCGATTTCTTGAAGAACTTCCAGCGC

TTCCTCAGTTTTTTCGGCACCTGCCGTTTGCCTTTTTTGATCTTCGGCTCGCCCTTGCAGAATACTTTCTCCTGCGGGTCGCTCAGCATCCCTA

TGACCCCTGGCGACCACAGGAACTTTTCCCCGTCCATCTGGTAGCCGTGGCAACCCGTAGTCCTAGCCGTATCCCTGTCCACTACGAGGCCCAT

AGCTGATCACTACTCTTAAACCTTTTTCTTGCACTCGTGGGCCGCCTTCCTGAAAGCTTTTTGGACTTCCTCCAAGTTGGCGAAACTTTCTCCA

GACATCTTTCCGGCAACGCACCCGCGATATTTGAGCAGTCTCGGTGAAGTTGCCTTCGTGGACGCCTTTCGGGTGGCGAACTTGCCCCATCCGA

GGTAGGGGCTGTCCGGCGCTTCGCGCTCGTACTCAAACATCGGGGTGGTTACAACTTTTCTTCGCGCCATATTATAGGTTCGAGAAGCTCCTAT

AAAAAAGGCGTAGAAAGAAGTTTTCGCAAAGAAAAATTTGTGAAAACTTTTTATCTAATTAGAAAGCTCTTGGGACTCCCCCTTATTAAGTTTC

TTGATTATATCGGGAAATTTCTTTGCGGTTTCGCTCCCAGGTTTAGCCACGATGTGGATGGTGTTATCTTGTGATTTAAGAAACATAGCTTCGT

CCATCACCTAGCACATTTATCGGGTAAGTTTCGTGCGCCGGGTCTTTCCTGTTCGCCGGTCCCGCCTATCCCTTAACCCA

> Unassembled sequence II

TCTAGCTTTTTCAAAGACATTAAAACAAAAACTACAGAATGGGAAGTTCCGAAATTTCTTTTTTCTTCAAAGAAAAAGATAAAAGAAAATTTCC

TAAGGGGTTTTGCAGATTCGGAAGGAAGTGTTGGTAATAGGTCTATTTGTCTTTGTAATAGAAATAAGAGAGGTTTGAGAGAAGTGAAGTACCT

CTTAGACTTAGTTGGACTGAGAAAGAAAAATATAAATTTAGGTAAAAAGAAGCTGAATATTTACCATAAGAAAAATCTTTTAATATTCAAAAAA

AGAATTGGGTTTTCTTTAGAAAGAAAAAACAGAAAATTGGAAGATGTTTTAAATGAATACAAAAATTTGCCCGAACTGTAAGAAGTTAATCATC

TGCAAAGGGGGCAAATGCCCCACGCGCTGCCCAAACTGCAGGAAGCCGCTATAGATATTTAAGGCGGGAAGGAGAAAATTTTATATGCCGGAAT

TTAAGTCCAAGAAGGAATTCCTCAAAAACCTCTCGGGCTGCGACGATATGTCGGCGGCGAAGAGGAACTACCCAAAGCTCTGGAAAAAATACAT

GGAGCCCAAATTGAAGAAGGCCTTCGAGCAGAGGAAGAAAAGGCCGAAAATTTCTTAGGTCACCACCCGAACGAGAGCTTCTTGAGCGACCTCT

TCTGGACTTTCAATATTTCCTTGACCGCCTCCTTCACTTTACGGAGCAATTTCTCGAACCGCTCGATTTCGGATAGCAGTTCTTTCCTCAGCTC

CTTCATTTTCGGCTTGATCGGGCAATTCCTTCAGGGGTTCGATCCTTCTCCTCGGCGACTTCGATTCGTATCTTTTATTCTTTCCAGATCCTTT

TTTGCTGAAAACGAACTTTCGATGTTACTTTTGGTTTTCGCCAATTTATAACCATTTTCGAGGTTTTGGTGGCTTTCCGGAAGGATATTTTGTT

TGTTAAAATTTTTTTTTTTGGAAAGATGGA

For instance, one of the unassembled singletons represent a varying pattern of small

sequence repeats and low complexity sequences , in addition its compositional bias

(low GC content 28% ).

Figure 9. Nucleotide sequence composition of sequence end of assembled fosmid 99B6 and

un-sequenced DNA fragment and. Detection of low complexity regions and simple repeats in

both assembled sequence 99B6 (approx. last 400 bp of assembled sequence) and

unassembled (sequence I and II) respectively. The estimated low complexity regions and

simple repeats are highlighted with different colour codes and underlined respectively.

Page 43: Sequence, organization, and genes characteristics of ORFs ...

42

5. Open reading frames search for potential genes and annotation of the

identified ORFs Using Metagene Annotator (Noguchi et al., 2008), we identified 39 ORFs displayed along the

28,378 kb assembled contig. To annotate the 39 identified ORFs, different annotation tools

were deployed simultaneously for data analysis including NCBI-CDD sequence search tool

for conserved sequence detection of domains, Interproscan integrated pipeline for detection

of protein domains, families and other protein signatures against different databases (Jones et

al., 2014), Phmmer for protein sequence similarity search based on the more sensitive hidden

Markov model (hmmer.janelia.org), and finally assigning ORFs to different metabolic

networks databases using COG/KEGG for orthologous group detection and infer ORF

metabolic functions (Wu et al., 2011).

The outcome of the annotation processes using these programs is presented in Figure 10. Out

of the 39 identified ORFs, 29 have matches; While 10 ORFs have no recognized matches

using the pipelines mentioned earlier.

The 39 annotated ORFs and their orientations are presented in Figure 9 and Table 1

respectively.

Figure 10. Protein sequence analysis for identified ORFs using different protein annotation

programs.

0 5 10 15 20 25 30 35

No match

NCBI-CDD

Interproscan

phmmer

COG/KEGG

Total number of non-redundant hits

# of hits

# of hits

Page 44: Sequence, organization, and genes characteristics of ORFs ...

43

Table 1. Schematic diagram for the location and annotation of identified ORFs on contig9.

It is not surprisingly that about 26% of identified ORFs remained unannotated during our

analysis, as previous studies have estimated that about 50% of all metagenomic data

generated to date are considered as hypothetical proteins or proteins of unknown function.

For instance, it was estimated that about 50% of 1.2 million protein coding genes identified

from shotgun sequencing of Sargasso Sea metagenome were predicted as hypothetical

proteins(Venter et al., 2004). Hence, there is a global initiative toward implementing more

powerful computing resources and integrated platforms to alleviate bottlenecks of huge

dataset analysis to accelerate novel gene discovery processes (Teeling & Glockner, 2012).

Table 2 presents a detailed analysis of the annotated ORFs including the % of query coverage

to the best hit and the % of sequence identity and similarity. The table also presents the

identification of -10 and -35 consensus sequences required for RNA polymerase interaction,

potential ribosomal binding site and the start and stop codons for the translation process.

Page 45: Sequence, organization, and genes characteristics of ORFs ...

44

Table 2. Protein sequence similarity search using blastx tool against nr protein

database.

# ‘highlighted’ ORF descriptions: Blastx hits resulted using significant E-value (10E-5).

ORFs in

Contig 9

Similarity Query

Coverage

Identity% RBS -10 -35 Start

Codon

Stop

codon

Annotation

1 NA NA NA YES YES NA GTG* YES No similarity

2 43% 87% 30% YES YES YES TTG* YES DeoR family transcriptional regulator [Bacillus megaterium

QM B1551

3 69% 91% 44% YES YES YES YES YES hypothetical protein [Natronorubrum bangense]

4 72% 94% 53% NO YES YES YES YES initiation factor IIB [Archaeoglobus fulgidus DSM 4304]

5 53% 65% 41% YES* NO NO TTG* YES membrane protein [uncultured organism]

6 56% 58% 35% NO YES YES TTG* YES ribonuclease H [Cellulomonas fimi ATCC 484]

7 60% 25% 43% YES* YES YES GTG* YES ECF family sigma-70 type sigma factor [Rhodococcus opacus]

8 NA NA NA NO YES YES TTG* YES NO similartiy

9 53% 67% 29% YES YES YES GTG* YES serine/threonine kinase family protein [uncultured bacterium

A1Q1_fos_1807]

10 62% 37% 49% YES* YES YES YES YES NADPH-dependent F420 reductase [Natrialba chahannaoensis]

11 54% 50% 29% NO YES YES GTG* YES AhpC/TSA family protein [Zunongwangia profunda SM-A87]

12 56% 54% 42% NO NO NO GTG* YES Glutathione S-transferase [Pseudomonas pseudoalcaligenes]

13 61% 52% 39% NO YES YES YES YES serine protease [Streptococcus mitis B6]

14 59% 74% 33% YES* YES YES TTG* YES 2-hydroxypenta-2,4-dienoate hydratase [Pseudomonas stutzeri]

15 62% 58% 44% YES* YES YES GTG* YES hypothetical protein GUITHDRAFT_88488 [Guillardia theta

CCMP2712]

16 51% 36% 30% NO YES YES YES YES Hore_22870 hypothetical protein [ Halothermothrix orenii

H 168 ]

17 59% 79% 43% YES* YES YES GTG* Yes PREDICTED: similar to GTP cyclohydrolase I [Tribolium

castaneum]

18 49% 46% 37% YES* YES YES YES YES sterol 3beta-glucosyltransferase [Puccinia graminis f. sp. tritici

CRL 75-36-700-3]

19 52% 56% 35% YES* YES YES GTG* YES hypothetical protein [Bacillus cereus]

20 48% 60% 33% YES YES YES GTG* YES 60 kDa chaperonin [uncultured pig faeces bacterium]

21 NA NA NA YES* YES YES GTG* YES No similarity

22 50% 43% 46% NO YES YES YES YES hypothetical protein Deide_07800 [Deinococcus deserti

VCD115]

23 39% 65% 27% NO YES YES TTG* YES PREDICTED: collagen alpha-5(VI) chain [Orcinus orca]

24 41% 92% 28% NO YES YES GTG* YES hypothetical protein LEPBI_I2319 [Leptospira biflexa serovar

Patoc strain 'Patoc 1 (Paris)'

25 NA NA NA YES YES NO YES YES No similarity

26 NA NA NA YES YES YES GTG* YES No similarity

27 47% 89% 31% YES* YES YES GTG* YES arabinosyltransferase [Mycobacterium leprae TN]

28 52% 19% 39% NO YES YES YES YES hypothetical protein GYMC10_1090 [Paenibacillus sp.

Y412MC10]

29 47% 29% 37% YES YES YES GTG* YES filamentous hemagglutinin outer membrane protein

[Enterobacter sp. SST3]

30 92% 97% 30% NO YES YES YES YES GH13668 [Drosophila grimshawi]

31 62% 24% 43% NO YES YES YES YES Putative subtilisin peptidase [Micromonospora lupini]

32 50% 12% 35% NO YES YES YES YES hypothetical protein Thimo_0605 [Thioflavicoccus mobilis

8321]

33 58% 97% 39% NO YES YES YES YES Polyprenyl synthetase [Aciduliprofundum boonei T469]

34 64% 18% 52% NO YES YES YES YES membrane-associated protease [Prochlorococcus marinus str.

AS9601]

35 50% 56% 32% NO YES YES YES YES membrane or secreted protein [uncultured organism]

36 64% 59% 43% YES* YES YES YES YES Methyltransferase type 11 [uncultured bacterium

37 49% 49% 26% NO YES YES YES YES glycosyl transferase family protein [Rhodobacter sphaeroides

ATCC 17025]

38 47% 98% 33% NO YES YES YES YES hypothetical protein [Halorubrum tebenquichense]

39 64% 59% 43% YES* YES YES YES YES Methyltransferase type 11 [uncultured bacterium

Page 46: Sequence, organization, and genes characteristics of ORFs ...

45

# ‘*’ Start codons other than ATG.

# RBS ‘Yes*’ stands for A/G rich region other than consensus AGGAGG.

# ORF 36 and 39 are identical in both nucleotide and protein sequence composition (i.e. gene duplication).

6. Structural characteristics of the annotated proteins Potential halophilic features of the annotated proteins —Microorganisms that survive in

halophilic environments are classified into slightly, moderately, and extremely halophilic,

with an optimal growth in 0.5, 0.5–2.5, and around 4 M NaCl respectively ( a Oren, 2002;

Sayed et al., 2014). Two mechanisms are known that adapted by halophilic microorganisms

to cope with the high salinity of its surrounding environment and keep the osmotic pressure

of the cell within the physiological range, the salt-out and the salt-in mechanisms.

Microorganisms that adapted the salt-out approach accumulate within their cytoplasm

osmotic solute such as glycine, betaine, and ectoine (A. Oren, 2008). The second approach,

the salt-in, the microorganism accumulates molar concentration of potassium chloride to

balance the high salt concentration of the surrounding environment (CHRISTIAN &

WALTHO, 1962; Chromatography et al., 1992; Lanyi, 1974) In this case the proteins are

adapted to function at high salt concentration reaching up to 4 molars. Such adaptations

include increase of acidic amino acid residues on the surface of the protein and decrease the

in hydrophobic amino acids (Paul, Bag, Das, Harvill, & Dutta, 2008; Siglioccolo, Paiardini,

Piscitelli, & Pascarella, 2011).

The halophilicty of the annotated proteins was estimated by calculating the ratio between the

numbers of acidic amino acid residues (D+E) in each protein coding sequence and the

number of acidic amino acid residues (D+E) in corresponding homolog blastx match. A

protein is set to be potentially halophilic if possess a halophilic ratio of 1.25 or more. As it

can be seen from, Table 2, 14 out of the 39 Proteins have ratios of D+E at least 1.25 times

higher than their corresponding homologous proteins indicating that they may have some

level of halophilic properties.

Page 47: Sequence, organization, and genes characteristics of ORFs ...

46

Table 3.Determination of halophilic potential for identified ORFsThe halophilicty of the

annotated orfs was estimated by calculating the ratio between the numbers of acidic amino

acid residues (D+E) in each protein coding sequence and the number of acidic amino acid

residues (D+E) in corresponding blastx matches.

ORFs in Contig 9 Annotation D+E( ORF) D+E(REF) orf/ref ratio Halophilicity

1 No similarity 3 8 0.375 NO

2 DeoR family transcriptional regulator [Bacillus megaterium QM B1551 6 11 0.545 NO

3 hypothetical protein [Natronorubrum bangense] 14 8 1.75 YES

4 initiation factor IIB [Archaeoglobus fulgidus DSM 4304] 46 41 1.12 NO

5 membrane protein [uncultured organism] 3 3 1 NO

6 ribonuclease H [Cellulomonas fimi ATCC 484] 5 4 1.25 YES

7 ECF family sigma-70 type sigma factor [Rhodococcus opacus] 5 6 0.833 NO

8 NO similartiy 4 9 0.444 NO

9 serine/threonine kinase family protein [uncultured bacterium

A1Q1_fos_1807] 27 16 1.687 YES

10 NADPH-dependent F420 reductase [Natrialba chahannaoensis] 5 6 0.833 NO

11 AhpC/TSA family protein [Zunongwangia profunda SM-A87] 13 9 1.44 YES

12 Glutathione S-transferase [Pseudomonas pseudoalcaligenes] 7 5 1.4 YES

13 serine protease [Streptococcus mitis B6] 1 0 no ratio N/A

14 2-hydroxypenta-2,4-dienoate hydratase [Pseudomonas stutzeri] 8 7 1.14 NO

15 hypothetical protein GUITHDRAFT_88488 [Guillardia theta CCMP2712] 10 6 1.66 YES

16 Hore_22870 hypothetical protein [ Halothermothrix orenii H 168 ] 23 21 1.09 NO

17 PREDICTED: similar to GTP cyclohydrolase I [Tribolium castaneum] 3 2 1.5 YES

18 sterol 3beta-glucosyltransferase [Puccinia graminis f. sp. tritici CRL 75-36-

700-3] 13 14 0.92 NO

19 hypothetical protein [Bacillus cereus] 2 2 1 NO

20 60 kDa chaperonin [uncultured pig faeces bacterium] 8 9 0.889 NO

21 No similarity NA NA NA NA

22 hypothetical protein Deide_07800 [Deinococcus deserti VCD115] 5 6 0.833 NO

23 PREDICTED: collagen alpha-5(VI) chain [Orcinus orca] 37 20 1.85 YES

24 hypothetical protein LEPBI_I2319 [Leptospira biflexa serovar Patoc strain 'Patoc 1 (Paris)' 27 21 1.28 YES

25 No similarity NA NA NA NA

26 No similarity NA NA

NA

27 arabinosyltransferase [Mycobacterium leprae TN] 3 9 0.333 NO

28 hypothetical protein GYMC10_1090 [Paenibacillus sp. Y412MC10] 9 10 0.9 NO

29 filamentous hemagglutinin outer membrane protein [Enterobacter sp. SST3] 10 7 1.42 YES

30 GH13668 [Drosophila grimshawi] 15 23 0.65 NO

31 Putative subtilisin peptidase [Micromonospora lupini] 10 15 0.667 NO

32 hypothetical protein Thimo_0605 [Thioflavicoccus mobilis 8321] 27 20 1.35 YES

33 Polyprenyl synthetase [Aciduliprofundum boonei T469] 70 61 1.14 YES

34 membrane-associated protease [Prochlorococcus marinus str. AS9601] 3 3 1 YES

35 membrane or secreted protein [uncultured organism] 18 16 1.125 NO

36 Methyltransferase type 11 [uncultured bacterium 20 11 1.81 YES

37 glycosyl transferase family protein [Rhodobacter sphaeroides ATCC 17025] 12 7 1.71 YES

38 hypothetical protein [Halorubrum tebenquichense] 40 64 0.625 NO

39 Methyltransferase type 11 [uncultured bacterium 20 11 1.81 YES

#Putative Proteins are considered potentially halophilic protein if the halophilic ratio is 1.25 or more.

Page 48: Sequence, organization, and genes characteristics of ORFs ...

47

Detection of potential secretory signals —Detection of potential secretory signals and

THMM domains within annotated ORFs were performed using different detection online

tools as well as predictions for protein subcellular localization of putative proteins (Bagos et

al., 2010; Bendtsen et al., 2005; Krogh et al., 2001; Petersen et al., 2011; Yu et al., 2010).

Figure 11 shows the different secretory pathways that were identified in the 39 potential

proteins based on the presence of sequences and domains involved in the corresponding

secretory process.

The results indicate that the majority of the potential secreted proteins identified using the

above mentioned tools use a non-classical secretory pathway (20 out of 39; around 50%),

which uses signal peptide independent secretion pathways (Bendtsen et al., 2005).The second

secretory pathway that is potentially used by the microbial community in LCL environment

employ transmembrane domains (16 out of 39; around 40%). Interestingly, just very few

proteins (4 out of 39; around 10%) were found a signal peptide using the classical secretion

process.

Figure 11. Prediction of putative secretory signals and THMMs domains in annotated ORFs.

4

20

16

00

5

10

15

20

25

Classical secretions Non-classicalsecretions

Transmembranedomains

Twine -argininepathway

Number of ORF matches per secretory pathways

classification

Page 49: Sequence, organization, and genes characteristics of ORFs ...

48

7. Detailed analysis of 10 selected ORFs identified by Interproscan software Out of the 39 ORFs identified, 10 potential full-length ORFs were annotated using

Interproscan web interface software. The detailed analysis of each ORF including the

identified specific domains for the putative proteins is presented in Figure 12.

Figure 12. Schematic representations of potential full-length ORFs in silico annotated using

Interproscan protein analysis software.

1. ORF#4

a) Gene level

o Position :start: 1887 , end :2837

o Blast best-hit :Transcription factor TFIIB [Archaeoglobus

profundus DSM 5631]

o Query coverage : 94%

o Identity : 53% , Fraction conserved : 72%

ATG TAA

19 bp 140 bp

b) Protein level (length 316 amino acids)

o Zinc finger, TFIIB type IPR013137 (18-59)

o Transcription factor TFIIB, Cyclin-like domain IPR013150 (134-202,228-298)

o Transcription factor TFIIB, conserved site IPR023486 (260-275)

o Non-classical protein secretion detected by SecretomeP

o No transmembrane regions detected

o Cytoplasmic protein

o Halophilic ratio 1.25

-35 -10

Cyclin-like domain (228-298)

TFIIB domain (18-59)

Cyclin-like domain (134-202)

Page 50: Sequence, organization, and genes characteristics of ORFs ...

49

2. ORF#9

a) Gene level

o RBS : GGAGG

o Position :start: 4278 , end :4676

o Blast best-hit : Serine/Threonine Kinase family protein

[uncultured bacterium A1Q1_fos_1807]

o Query coverage : 67%

o Identity: 29% , Fraction conserved : 53%

GTG TGA

15 bp 84 bp 8 bp

b) Protein level (length 132 amino acids)

o AraC-type arabinose-binding/dimerization domain IPR003313 (20-104)

o Possible coil-coil region (74-102,107-128)

o Non-classical protein secretion detected by SecretomeP

o No transmembrane regions detected

o Cytoplasmic protein

o Halophilic ratio 1.687

3. ORF#14

a) Gene level

o RBS :

o Position :start: 5921 , end :6151

o Blast best-hit : 2-hydroxypenta-2, 4-dienoate

hydratase[Peudomonas stutzeri]

o Query coverage : 74%

o Identity: 33% , Fraction conserved: 59%

-35 RBS

-10

COIL (74-102)

COIL (107-128)

AraC-type arabinose-binding domain (20-104)

Page 51: Sequence, organization, and genes characteristics of ORFs ...

50

TTG TGA

14 bp 31 bp

b) Protein level (76 amino acids) )

o Ribbon-helix-helix protein,copG (Arc-type repressor) domain IPR02145 (10-43)

o No secretory signals detected

o No transmembrane regions detected

o Cytoplasmic protein

o Halophilic ratio 1.14

4. ORF#28

a) Gene level

o Position :start: 13974, end :15800

o Blast best-hit : Hypothetical protein GYMC10_1090

[Paenibacillus sp,Y41MC10]

o Query coverage : 19%

o Identity: 39% , Fraction conserved : 52%

ATG TAG

18 bp 127 bp

b) Protein level (608 amino acids)

o Carbohydrate-binding-like fold IPR013784 (492-571)

o Domain of unknown function 4480 (DUF4480) , IPR027804 (497-575)

o Non-classical protein secretion detected by SecretomeP

o Two transmembrane regions detected

o Unknown localization

o Halophilic ratio 0.9

-35 -10

CopG domain (Arc-type repressor) (10-43)

-35 -10

Domain of unknown function 4480 (497-575)

Page 52: Sequence, organization, and genes characteristics of ORFs ...

51

5. ORF#31

a) Gene level

o Position :start: 17001, end :18302

o Blast best-hit : Putative subtilisin peptidase [Micromonospora

lupini]

o Identity: 43% , Fraction conserved : 62%

b) Protein level ( length 433 amino acids)

o Peptidase S8/S53 domain IPR000209 (117-382).

o Peptidase S8, substilisin, Ser-active site IPR023828 characterized by catalytic triad

Asp, His and Ser (342-352).

o Signal peptide detected by Secretome P, Pred-TAT and THMM

o One transmembrane region detected, most likely detected as signal peptide

o Cytoplasmic membrane protein

o Halophilic ratio 0.667

-35 -10

Peptidase S8/S53 domain (117-382) Serine-active site

(342-352)

33 bp 66 bp

ATG TGA

Page 53: Sequence, organization, and genes characteristics of ORFs ...

52

6. ORF#32

a) Gene level

o RBS : GGAGG

o Position :start: 18351, end :22940

o Blast best-hit : Hypothetical protein Thimo_0605 [Thioflavicoccus

moblilis 8321]

o Query coverage : 12%

o Identity: 35% , Fraction conserved : 50%

ATG TGA

13 bp 272 bp 5 bp

b) Protein level (length 1529 amino acids)

o Concanavalin A like lectin/glucanase,subgroup domain IPR013320 (82-296)

o Glycosyl hydrolase , five bladed beta propeller domain IPR023296 (585-827)

o Signal peptide detected by SignalP4, Secretome P and Pred-TAT and THMM

o Two transmembrane regions detected, one of them most likely detected as signal

peptide

o Probable multiple localization sites (as cytoplasmic membrane or outermembrane

protein)

o Halophilic ratio 1.36

-35 RBS

-10

Glycosyl hydrolase, five bladed beta propeller domains (585-827)

Glucanase, subgroup domain (82-296)

Page 54: Sequence, organization, and genes characteristics of ORFs ...

53

7. ORF#33

a) Gene level

o Position :start: 22994, end :24010

o Blast best-hit : Polytprenyl synthetase [Aciduliprofundum boonei

T467]

o Query coverage : 97%

o Identity: 39% , Fraction conserved: 58%

GAT GTA

132 bp 19 bp

b) Protein level (338 amino acids)

o Terpenoid synthase domain IPR008949 (10-337)

o Polyprenyl synthase_1 PS00723 (81-95) and polyprenyl synthase_2 PS00444 (222-

234) aspartic acid rich patterns

o Non-classical protein secretion detected by SecretomeP

o No transmembrane regions detected

o Cytoplasmic protein

o Halophilic ratio 1.14

-35

-10

Terpenoid synthase domain Polyprenyl

synthase_1

Polyprenyl

synthase_2

Page 55: Sequence, organization, and genes characteristics of ORFs ...

54

8. ORF#34

a) Gene level

o Position :start: 24011, end :24772

o Blast best-hit : Membrane associated protease [Prochlorococccus

marinus str.AS9601]

o Query coverage : 18%

o Identity: 52% , Fraction conserved : 64%

AGT GTA

123 bp 17 bp

b) Protein level (length 253 amino acids)

o CAAX amino terminal protease family IPR003675 and Abi domain PF02517 (152-

244)

o Non classical protein secretion detected by Signal peptide detected by Secretome P

o Eight transmembrane region detected

o Cytoplasmic membrane

o Halophilic ratio 1

-35 -10

Abi domain (152-244)

Page 56: Sequence, organization, and genes characteristics of ORFs ...

55

9. ORF#38

a) Gene level

o RBS : AGxAGG/AGGxGG

o Position :start: 26948, end :27778

o Blast best-hit : Hypothetical protein [Halorubrum tebenquichense]

o Query coverage : 60%

o Identity: 33% , Fraction conserved : 47%

14 bp 174 bp 19 bp

b) Protein level (length 276 amino acids)

o Alkaline-phosphatase-like, core domain IPR017850 (86-135,207-258)

o No Signal peptide detected

o No transmembrane regions detected

o Cytoplasmic protein

o Halophilic ratio 0.625

-35 RBS -10

Alkaline-phosphatase-like, core domain (86-135) Alkaline-phosphatase-like, core domain (207-258)

domain

ATG TGA

Page 57: Sequence, organization, and genes characteristics of ORFs ...

56

10. ORF#39

a) Gene level

o RBS : GGA/GAG/AGG

o Position :start: 27768, end :28256

o Blast best-hit : Methyltransferase type 11 [uncultured bacterium]

o Query coverage : 59%

o Identity: 43%, Fraction conserved: 64%

15 bp 19 bp 24 bp

b) Protein level (length 162 amino acids)

o S- adenosyl-L-Methionine-dependent methyltransferases superfamily SSF53335

(3-140) ,Methyltransferase_23 domain PF13489 (9-127) and no IPR

o Non classical protein secretion detected by Signal peptide detected by SecretomeP

o Cytoplasmic protein

o No transmembrane regions detected

o Halophilic ratio 1.81

-10

RBS -35

Methyltransferase_23 domain (9-127)

ATG TGA

Page 58: Sequence, organization, and genes characteristics of ORFs ...

57

The translation start and the stop codons of each ORF and the upstream transcription -35

and -10 sequences were identified using MGA and PBROM programs respectively. The RBS

sequence was searched for manually. The RBS consensus sequence AGGAG was detected

for 6 out of the 10 analyzed ORFs, suggesting that RBS of the uncultured metagenome could

possess more diverse RBS sequences than commonly known RBS conserved sequences.

Furthermore, the 10 selected full-length ORFs were further investigated for the presence of

potential secretory signals and proteins localization sites were also predicted using a different

set of bioinformatics tools. The results from Signal4P analysis demonstrated that, out of the

ten selected full-length ORFs, two putative proteins were predicted to have signal peptides

needed for classical protein secretion. While for the remaining ORFs that possess no signal

peptides, five were found to be secreted through non-classical pathway, and two were not

secreted.

Furthermore, the protein localization sites were predicted for the two aforementioned

secreted proteins using PSORT analysis. The analysis results suggest that ORF#31 which

potentially encodes for subtilisin was secreted to the cytoplasmic membrane, whereas the

subcellular location of the hypothetical Thimo_0605 protein encoded by ORF#32 was not

exactly determined, suggesting that it might have multiple localization sites and hence, was

further regarded as a cytoplasmic membrane or an outer membrane protein.

As for the remaining 8 ORFs that possess no secretory signals, they were further analysed by

SecretomeP that further categories them into non-classically secreted proteins and proteins

that are not secreted (Yu et al., 2010).

Results from both SecretomeP and PSORT analysis revealed that ORF#14 (2-hydroxypenta-

2, 4-dienoate hydratase) and ORF#38 (Hypothetical protein) were potentially a cytoplasmic

proteins.

Regarding ORF#28, which encodes for Hypothetical protein GYMC10_1090, was predicted

by SecretomeP to be secreted through a non-classical secretory pathway. However, the

subcellular localization of the protein was not known.

Page 59: Sequence, organization, and genes characteristics of ORFs ...

58

Similarly, ORFs#4 (Serine/Threonine kinase family protein), ORF# 9 (Transcription factor

TFIIB), ORF#33(Polytprenyl synthetase), and ORF#39 (Methyltransferase type 11), were

regarded as cytoplasmic proteins by PSORT analysis. However, these findings are

contradicted by SecretomeP results which indicated their potential non-classical secretion of

proteins.

As for the membrane associated protease which was encoded by ORF#34, it was denoted as a

potential cytoplasmic membrane protein by both PSORT and SecretomeP; these findings

would presumably indicate that the protein is secreted non-classically.

In addition, the full-length ten ORFs were assessed for the presence of transmembrane

domains using the THMM search tool. The results have shown that, only ORFs# 28, 32 and

34 have displayed true positive signals for the presence of transmembrane regions within

proteins. For instance, THMM analysis of the putative membrane associated protease

(ORF#34) has predicted 8 transmembrane helices within the protein. Moreover, THMM

assigned the first transmembrane domain detected in the N-terminus region as a potential

signal peptide sequence (Figure 13).

Figure 13.The prediction of transmembrane domains for the putative membrane associated

protease (ORF#34) was detected using the THMM tool.

The putative enzyme displayed 8 transmembrane helices with Topology=i2-20o53-70i77-99o119-

141i153-175o185-204i209-226o230-252i. Moreover, THMM assigned the first transmembrane

domain detected in the N-terminus region as a potential signal peptide sequence

Page 60: Sequence, organization, and genes characteristics of ORFs ...

59

Finally, the halophilic potential of the 10 full-length annotated ORFs was further estimated.

By comparing the number of acidic amino acid residues (D+E) in our selected ORFs to

number of acidic amino acid residues in the corresponding best matches of pairwise

alignment; ORFs #4, 9, 32 and 39 have displayed halophilic ratios greater than 1.25, and thus

were regarded as putative halophilic proteins (Danson & Hough, 1997).

It has been estimated that bacteria and archaea mainly depend on two mechanisms to protein

machinery characterized by high acidic amino acid residues to high salt concentration and

osmotic stress of extreme environments; these acidic proteins are stabilized through increase

of intracellular K+ ions levels to balance excess salt concentration outside (salt-in) which is

common for most halophilic archaea and some extremely halophiles from domain bacteria.

Otherwise, the accumulation of compatible organic solutes (osmolytes) inside the cell

without altering protein machinery is capable to maintain osmotic stability across cell. A

good example for the use of organic molecules as osmolytes by different domains of life is

represented by the accumulation of ectoine and glycine betaine by halophilic bacteria and use

of other osmolytes by several archaeal lineages as halophilic methangens (salt-out) ( a Oren,

2002). As for the molecular adaptation of secreted halophilic proteins to high salt

concentration, it is noteworthy that they were not only characterized by increased acidic

amino acid residues as Aspartate and Glutamate, decreased hydrophobic amino acid residues

as cysteine and thus less positive charges and hydrophobic surfaces, but also it is presumed

that some transport systems as TAT-pathways are usually involved to maintain structural

stability of these acidic proteins as noticed for several haloarchaea (Rose, Brüser, Kissinger,

& Pohlschröder, 2002). However, according to our findings there were no TAT secretory

signals detected for any of the identified orfs using PRED-TAT software.

Page 61: Sequence, organization, and genes characteristics of ORFs ...

60

8. Identified proteins with potential biotechnological applications Table 4 shows description and potential applications of some putative genes of interest

identified in our metagenomic study.

Table 4. ORFs identified in contig9 (99B6) with potential industrial or medical applications.

Protein Name ORF Number Potential Applications

Serine peptidase 31 Pharmaceutical and Industrial applications in

detergents and paper industries (Pushpam, Rajesh, &

Gunasekaran, 2011)

Glycosyl hydrolase family 43 (GH43) 32 Valuable products (e.g.Arabinan) used in Food

Industry (Shallom & Shoham, 2003; Shi et al., 2014)

Na+/K+ antiporter 25 Salt tolerance; agriculture and industrial

applications(Khan, 2011; Waditee, Hibino,

Nakamura, Incharoensakdi, & Takabe, 2002)

Geranyl pyrophosphate synthase 34 Isoprenoids with cosmetic, pharmaceutical and food

applications (Holstein & Hohl, 2004)

Methylase enzyme 36 and 39 VitK2 and coenzyme Q10 biosynthesis ; they

maintain coagulation homeostasis and prevention of

neurodegenerative disorders (Cluis, Burja, & Martin,

2007; Conly, Stein, Worobetz, & Rutledge-Harding,

1994).

Biotin synthase 16 Biotin with pharmaceutical and livestock industries

(Ikeda et al., 2013).

Hyaluronan synthase (GT2) 37 Production of Hyaluronic acid used in cosmetics and

pharmaceutical preparations (Long Liu, Liu, Li, Du,

& Chen, 2011).

Antimicrobial peptide 15 Antibacterial and antifungal peptide involved in

pharmaceutical preparations and as a food

preservative against food pathogens (Pushpanathan et

al., 2012; Pushpanathan, Gunasekaran, & Rajendhran,

2013).

For example, the potential in-silico characterization of novel member of subtilisin S8/S53

serine peptidase superfamily from our metagenomic dataset (ORF#31, NCBI Blast CDD ,

Interproscan and COG) that contains peptidase S8/S53 domain and Ser active site (Asp, His

and Ser catalytic triad residues) essential for activity. As previously mentioned, industrial

applications of proteases account for two-thirds of sales in the worldwide enzyme industry,

this is due to their potential role in different pharmaceutical and industrial applications

(Pushpam et al., 2011).

Page 62: Sequence, organization, and genes characteristics of ORFs ...

61

Another remarkable finding from our ATLCL brine pool metagenomic library is the

identification of a novel glycosyl hydrolase like protein (ORF #32, CDD and Interproscan),

where the domain structure analysis of the putative protein by Interproscan suggests that it

most probably belongs to glycosyl hydrolase, family 43. In general, O-Glycosyl hydrolases

represent a group of enzymes that are involved in glycosidic bond cleavage between

carbohydrates or carbohydrate and non-carbohydrate moieties. They are further classified

based on sequence homology into 85 protein families as provided in the CAZy

(CArbohydrate-Active EnZymes) database. Glycosyl hydrolase, family 43(GH43) involves

different GH members of varying enzymatic activities, such as; beta-xylosidase, alpha-L-

arabinofuranosidase, arabinanase and xylanase which are generally involved in the cleavage

of different polysaccharides. The detection of glycosyl hydrolase, five-bladed beta propellor

domain in ORF#32 further describes it as a member of GH43 family. This domain was first

displayed in the three-dimensional structure of a-L-arabinanase enzyme isolated from C.

japonicus (Shallom & Shoham, 2003). However, experimental analysis for enzyme activity

and substrate specificity is required for full characterization. Members of GH43 family are

potentially used for many industrial applications, such as: pulp and paper industry

(xylanases), the food industry (arabinanase and alpha-L-arabinofuranosidase) (Shallom &

Shoham, 2003; Shi et al., 2014) and xylitol manufacturing processes (xylosidase)

(Prakasham, Rao, & Hobbs, 2009).

Another protein of interest is Na+/K+ antiporter found in our identified metagenomic clone

(ORF#25, COG). Na+/K+ antiporter is a ubiquitous membrane protein that can be located at

the cytoplasm or cellular membrane of many microbial origins as plants, animals microbes

and humans (Liew, Illias, Mahadi, & Najimudin, 2007). Their main physiological role is to

maintain pH homeostasis of cells in the presence of high salt concentration environment. In

addition, the application of Na+/K+ antiporter proteins in transgenic plants and engineered

microbes for salt tolerance in agriculture and industry is currently reviewed in several studies

(Khan, 2011; Waditee et al., 2002)

Page 63: Sequence, organization, and genes characteristics of ORFs ...

62

Our results were not only exclusive for the isolation of novel biocatalysts but they also

included the identification of other important enzymes as putative geranyl geranyl

pyrophosphate (GGPP) synthase (ORF#34), which is involved in the biosynthetic pathway of

several biologically important isoprenoids. The Intermediate product GGPP acts as a

precursor for several monoterpenes and monoterpenoids that are primarily found in herbs and

plant extracts and commonly used in the food and cosmetic industries. Recently, some of

these derivatives such as: limonene and perillyl alcohol, were assessed for their potential role

as chemopreventive agents (Holstein & Hohl, 2004).

We found a gene that encodes for a putative methylase enzyme involved in vitamin K2 and

coenzyme Q10 biosynthesis (ORF#39, NCBI CDD, Interproscan and phmmer). Vitamin K2

is an essential cofactor involved in the production of various clotting factors to maintain

coagulation homeostasis of the body. Hence, deficiency in vitk2 production from diet of

inadequate production by intestinal microbiota might lead to significant coagulopathy (Conly

et al., 1994). As for coenzyme Q10, there is growing evidence that coenzyme Q10

supplementations help in treatments of various disease conditions such as cardiomyopathy,

diabetes and Alzheimer’s. In addition, it has been acknowledged in cosmetics applications

owing to its antioxidant properties (Cluis et al., 2007).

We also report the identification of a gene classified as a biotin synthases (ORF#16, COG).

Biotin (vitamin B7) is not only regarded as an important cofactor but also have significant

contribution in many industrial applications such as cosmetics, pharmaceuticals and livestock

industries. It was estimated that the global market for biotin production have reached 10-30

tons with a hundred million U.S. dollars per year. Recent attempts investigate the potential

use of molecular biology and genetic engineering to produce an ecofriendly biotin, instead of

more hazardous multistep chemical synthesis (Ikeda et al., 2013).

It is noteworthy; a gene was annotated according to COG classification as a putative glycosyl

transferase with hyaluronic acid synthase activity (ORF#37). Hyaluronic acid is one of the

valuable natural sugar polymer produced by glycosyl tranferases_2 (GT) which is a member

of large protein families that are involved in the biosynthesis and transfer of different

polysaccharides, oligosaccharides and glycoconjugates which mediate various functions as

ranging from structure and storage to signalling. Hyaluronic acid is highly diverse in

prokaryotes and also present in eukaryotes. The most common forms of GTs are involved in

sugar residue transfer from activated nucleophile sugar donor to specific acceptor molecule

Page 64: Sequence, organization, and genes characteristics of ORFs ...

63

that takes place in new sugar biosynthesis through the retention of the inversion of the

configuration of anomeric carbon (Breton, Snajdrová, Jeanneau, Koca, & Imberty, 2006).

Owing to its unique rheological, physiological, biological characteristics and safety, it is

currently implicated in many Industrial and medical applications. It is estimated that the

global market for Hyaluronic acid to have reached over $1 billion with prominent increase in

the fields of dermal fillers and viscosupplementation. Recently, the recombinant microbial

production of hyaluronic acid has gained an increasing attention by the industrial society to

alleviate the unnecessarily costs and potential toxicity associated with the traditional

Hyaluronic acid isolation from rooster combs (Long Liu et al., 2011).

Until now, the continuous rise in infectious diseases and development of microbial resistance

has provoked the need for novel antimicrobial compounds. The use of metagenomics for the

isolation of novel antimicrobial compounds from different environmental resources have

proven to be an effective approach as it has been reported in earlier studies (Gillespie et al.,

2002).We found an ORF that potentially encodes for a antimicrobial peptide that showed

homology to well characterized alpha-helical antimicrobial peptide (Maganin) isolated from

an African clawed frog Xenopus laevis (ORF#15, Phmmer.pl.). Antimicrobial peptides

(AMPs) represent a diverse set of ubiquitous molecules with different biological functions.

The structural properties and their proposed specific mode of action against microbes, which

evade possible resistant mechanisms, are amongst the factors that made them a promising

alternative to currently available antibiotics (Pushpanathan et al., 2013). Many AMPs were

reported in literature as antibacterial, antifungal, immune-modulatory and anticancer

peptides. For instance, a novel antifungal peptide was identified from marine metagenome

through functional metagenomic approach, and then its antifungal activity and mode of action

against most common fungal infections were further assessed (Pushpanathan et al., 2012,

2013).

Page 65: Sequence, organization, and genes characteristics of ORFs ...

64

Chapter 3: Concluding Remarks

The continuous need for novel gene inventories with unusual biochemical characteristics will

hold the attention of the scientific community towards mining exotic natural microbial

habitats that were early thought to be deemed of any forms of life.

In this study, a 28.378 kb DNA fragment of recombinant fosmid clone from the deep brine

environment of Atlantis II in the Red Sea was characterized. The Shotgun sequencing of 469

subclones of the recombinant 99B6 fosmid followed by assembly process, have produced

28.387 Kb of a contiguous DNA fragment encompassing 39 ORFs.

Out of the 39 ORFs identified, 10 (26%) have no match in the database, whereas at least 8

out of the 39 ORFS, according to literature, were shown to have potential for

biotechnological applications.

Using restriction endonuclease analysis, it was estimated that the recombinant fosmid 99B6

has an insert of around 33.819 kb indicating that 5.441 kb was not assembled at the end of the

28.378 kb contig.

Analyses of the end of the assembled contig show a high degree of repetitions (simple repeats

and gene duplication) that most probably explain the difficulties of recruiting more reads at

this end.

Despite high costs and laborious process, the use of Sanger sequencing procedure for

recovery of long de-novo contiguous assemblies yet represent the method of choice for

obtaining accurate gene prediction and annotation.

Page 66: Sequence, organization, and genes characteristics of ORFs ...

65

Chapter 4: References

Antunes, A., Ngugi, D. K., & Stingl, U. (2011). Microbiology of the Red Sea (and other)

deep-sea anoxic brine lakes. Environmental Microbiology Reports, 3(4), 416–33.

doi:10.1111/j.1758-2229.2011.00264.x

Bagos, P. G., Nikolaou, E. P., Liakopoulos, T. D., & Tsirigos, K. D. (2010). Combined

prediction of Tat and Sec signal peptides with hidden Markov models. Bioinformatics

(Oxford, England), 26(22), 2811–7. doi:10.1093/bioinformatics/btq530

Banik, J. J., & Brady, S. F. (2010). Recent application of metagenomic approaches toward

the discovery of antimicrobials and other bioactive small molecules. Current Opinion in

Microbiology, 13(5), 603–609. doi:10.1016/j.mib.2010.08.012

Barzon, L., Lavezzo, E., Militello, V., Toppo, S., & Palù, G. (2011). Applications of next-

generation sequencing technologies to diagnostic virology. International Journal of

Molecular Sciences, 12(11), 7861–84. doi:10.3390/ijms12117861

Béjà, O., Aravind, L., Koonin, E. V, Suzuki, M. T., Hadd, A., Nguyen, L. P., … DeLong, E.

F. (2000). Bacterial rhodopsin: evidence for a new type of phototrophy in the sea.

Science (New York, N.Y.), 289, 1902–1906. doi:10.1126/science.289.5486.1902

Bendtsen, J. D., Kiemer, L., Fausbøll, A., & Brunak, S. (2005). Non-classical protein

secretion in bacteria. BMC Microbiology, 5, 58. doi:10.1186/1471-2180-5-58

Bougouffa, S., Yang, J. K., Lee, O. O., Wang, Y., Batang, Z., Al-Suwailem, A., & Qian, P.

Y. (2013). Distinctive microbial community structure in highly stratified deep-sea brine

water columns. Applied and Environmental Microbiology, 79, 3425–3437.

doi:10.1128/AEM.00254-13

Brazelton, W. J., & Baross, J. a. (2009). Abundant transposases encoded by the metagenome

of a hydrothermal chimney biofilm. The ISME Journal, 3(12), 1420–1424.

doi:10.1038/ismej.2009.79

Breton, C., Snajdrová, L., Jeanneau, C., Koca, J., & Imberty, A. (2006). Structures and

mechanisms of glycosyltransferases. Glycobiology, 16(2), 29R–37R.

doi:10.1093/glycob/cwj016

Chang, F.-Y., & Brady, S. F. (2013). Discovery of indolotryptoline antiproliferative agents

by homology-guided metagenomic screening. Proceedings of the National Academy of

Sciences of the United States of America, 110(7), 2478–83.

doi:10.1073/pnas.1218073110/-

/DCSupplemental.www.pnas.org/cgi/doi/10.1073/pnas.1218073110

CHRISTIAN, J. H., & WALTHO, J. A. (1962). Solute concentrations within cells of

halophilic and non-halophilic bacteria. Biochimica et Biophysica Acta, 65, 506–508.

Page 67: Sequence, organization, and genes characteristics of ORFs ...

66

Chromatography, A. S., Chromatography, A., Methods, O., Aspects, B., Aspects, S. M.,

Subunits, R., … Tools, G. (1992). Biochemical. structural. and molecular genetic

aspects of halophilism.

Cluis, C. P., Burja, A. M., & Martin, V. J. J. (2007). Current prospects for the production of

coenzyme Q10 in microbes. Trends in Biotechnology, 25(11), 514–21.

doi:10.1016/j.tibtech.2007.08.008

Conly, J. M., Stein, K., Worobetz, L., & Rutledge-Harding, S. (1994). The contribution of

vitamin K2 (menaquinones) produced by the intestinal microflora to human nutritional

requirements for vitamin K. The American Journal of Gastroenterology, 89, 915–923.

Culligan, E. P., Sleator, R. D., Marchesi, J. R., & Hill, C. (2013). Metagenomics and novel

gene discovery: Promise and potential for novel therapeutics. Virulence, 5(3), 1–14.

Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/24317337

Danhorn, T., Young, C. R., & DeLong, E. F. (2012). Comparison of large-insert, small-insert

and pyrosequencing libraries for metagenomic analysis. The ISME Journal, 6(11),

2056–66. doi:10.1038/ismej.2012.35

Daniel, R. (2005). The metagenomics of soil. Nature Reviews. Microbiology, 3(6), 470–478.

doi:10.1038/nrmicro1160

Danson, M. J., & Hough, D. W. (1997). The structural basis of protein halophilicity. In

Comparative Biochemistry and Physiology - A Physiology (Vol. 117, pp. 307–312).

doi:10.1016/S0300-9629(96)00268-X

Dark, M. J. (2013). Whole-genome sequencing in bacteriology: state of the art. Infection and

Drug Resistance, 6, 115–123. doi:10.2147/IDR.S35710

Eisen, J. a. (2007). Environmental shotgun sequencing: Its potential and challenges for

studying the hidden world of microbes. PLoS Biology, 5(3), 0384–0388.

doi:10.1371/journal.pbio.0050082

Ewing, B., & Green, P. (1998). Base-calling of automated sequencer traces using phred. II.

Error probabilities. Genome Research, 8, 186–194. doi:10.1101/gr.8.3.186

Felczykowska, a, Bloch, S. K., Nejman-Faleńczyk, B., & Barańska, S. (2012). Metagenomic

approach in the investigation of new bioactive compounds in the marine environment.

Acta Biochimica Polonica, 59(4), 501–505. Retrieved from

http://www.scopus.com/inward/record.url?eid=2-s2.0-

84873731682&partnerID=40&md5=7c92570459037bba8e66bf5e56f2929a

Feng, Z., Kim, J. H., & Brady, S. F. (2011). reassembled environmental DNA derived Type

II PKS gene cluster, 132(34), 11902–11903. doi:10.1021/ja104550p.Fluostatins

Frith, M. C. (2011). A new repeat-masking method enables specific detection of homologous

sequences. Nucleic Acids Research, 39. doi:10.1093/nar/gkq1212

Page 68: Sequence, organization, and genes characteristics of ORFs ...

67

Gabani, P., & Singh, O. V. (2013). Radiation-resistant extremophiles and their potential in

biotechnology and therapeutics. Applied Microbiology and Biotechnology, 97(3), 993–

1004. doi:10.1007/s00253-012-4642-7

Gilbert, J. a, & Dupont, C. L. (2011). Microbial metagenomics: beyond the genome. Annual

Review of Marine Science, 3, 347–71. doi:10.1146/annurev-marine-120709-142811

Gilbert, J. A., Mühling, M., & Joint, I. (2008). A rare SAR11 fosmid clone confirming

genetic variability in the “Candidatus Pelagibacter ubique” genome. The ISME Journal,

2, 790–793. doi:10.1038/ismej.2008.49

Gillespie, D. E., Brady, S. F., Bettermann, A. D., Cianciotto, N. P., Liles, M. R., Rondon, M.

R., … Handelsman, J. (2002). Isolation of antibiotics turbomycin a and B from a

metagenomic library of soil microbial DNA. Applied and Environmental Microbiology,

68(9), 4301–4306. doi:10.1128/AEM.68.9.4301

Goldberg, S. M. D., Johnson, J., Busam, D., Feldblyum, T., Ferriera, S., Friedman, R., …

Venter, J. C. (2006). A Sanger/pyrosequencing hybrid approach for the generation of

high-quality draft assemblies of marine microbial genomes. Proceedings of the National

Academy of Sciences of the United States of America, 103, 11240–11245.

doi:10.1073/pnas.0604351103

Gordon, D., Abajian, C., & Green, P. (1998). Consed: A graphical tool for sequence

finishing. Genome Research, 8, 195–202. doi:10.1101/gr.8.3.195

Gurvich G. E. (2006). Metalliferous Sediments of the World Ocean. Metalliferous Sediments

of the World Ocean—Fundamental Theory of Deep-Sea Hydrothermal Sedimentation (p.

416). Springeronline.com. Retrieved from http://epic.awi.de/19211/1/Gur2006a.pdf

Handelsman, J. (2004). Metagenomics: application of genomics to uncultured

microorganisms. Microbiology and Molecular Biology Reviews : MMBR, 68(4), 669–

685. doi:10.1128/MBR.68.4.669

He, Y.-Z., Fan, K.-Q., Jia, C.-J., Wang, Z.-J., Pan, W.-B., Huang, L., … Dong, Z.-Y. (2007).

Characterization of a hyperthermostable Fe-superoxide dismutase from hot spring.

Applied Microbiology and Biotechnology, 75(2), 367–376. doi:10.1007/s00253-006-

0834-3

Hoff, K. J. (2009). The effect of sequencing errors on metagenomic gene prediction. BMC

Genomics, 10, 520. doi:10.1186/1471-2164-10-520

Holstein, S. a, & Hohl, R. J. (2004). Isoprenoids: remarkable diversity of form and function.

Lipids, 39(4), 293–309. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/15357017

Huang, Y., Lai, X., He, X., Cao, L., Zeng, Z., Zhang, J., & Zhou, S. (2009). Characterization

of a deep-sea sediment metagenomic clone that produces water-soluble melanin in

Escherichia coli. Marine Biotechnology (New York, N.Y.), 11(1), 124–31.

doi:10.1007/s10126-008-9128-3

Page 69: Sequence, organization, and genes characteristics of ORFs ...

68

Ikeda, M., Miyamoto, A., Mutoh, S., Kitano, Y., Tajima, M., Shirakura, D., … Takeno, S.

(2013). Development of biotin-prototrophic and -hyperauxotrophic Corynebacterium

glutamicum strains. Applied and Environmental Microbiology, 79(15), 4586–94.

doi:10.1128/AEM.00828-13

Jeon, J. H., Kim, J.-T., Kang, S. G., Lee, J.-H., & Kim, S.-J. (2009). Characterization and its

potential application of two esterases derived from the arctic sediment metagenome.

Marine Biotechnology (New York, N.Y.), 11(3), 307–16. doi:10.1007/s10126-008-9145-

2

Jeon, J. H., Kim, J.-T., Kim, Y. J., Kim, H.-K., Lee, H. S., Kang, S. G., … Lee, J.-H. (2009).

Cloning and characterization of a new cold-active lipase from a deep-sea sediment

metagenome. Applied Microbiology and Biotechnology, 81, 865–874.

doi:10.1007/s00253-008-1656-2

Jones, P., Binns, D., Chang, H.-Y., Fraser, M., Li, W., McAnulla, C., … Hunter, S. (2014).

InterProScan 5: genome-scale protein function classification. Bioinformatics (Oxford,

England), 30(9), 1236–40. doi:10.1093/bioinformatics/btu031

Kennedy, J., Marchesi, J., & Dobson, A. (2008). Marine metagenomics: strategies for the

discovery of novel enzymes with biotechnological applications from marine

environments. Microb Cell Fact, 7, 27. doi:10.1186/1475-2859-7-27

Kennedy, J., O’Leary, N. D., Kiran, G. S., Morrissey, J. P., O’Gara, F., Selvin, J., & Dobson,

a D. W. (2011). Functional metagenomic strategies for the discovery of novel enzymes

and biosurfactants with biotechnological applications from marine ecosystems. Journal

of Applied Microbiology, 111(4), 787–799. doi:10.1111/j.1365-2672.2011.05106.x

Kerkhof, L. J., & Goodman, R. M. (2009). Ocean microbial metagenomics. Deep-Sea

Research Part II: Topical Studies in Oceanography, 56(19-20), 1824–1829.

doi:10.1016/j.dsr2.2009.05.005

Khan, M. S. (2011). Role of sodium and hydrogen ( Na + / H + ) antiporters in salt tolerance

of plants : Present and future challenges, 10(63), 13693–13704.

doi:10.5897/AJB11.1630

Kim, M., Lee, K.-H., Yoon, S.-W., Kim, B.-S., Chun, J., & Yi, H. (2013). Analytical tools

and databases for metagenomics in the next-generation sequencing era. Genomics &

Informatics, 11(3), 102–13. doi:10.5808/GI.2013.11.3.102

King, R. W., Bauer, J. D., & Brady, S. F. (2010). NIH Public Access, 48(34), 6257–6261.

doi:10.1002/anie.200901209.An

Krogh, a, Larsson, B., von Heijne, G., & Sonnhammer, E. L. (2001). Predicting

transmembrane protein topology with a hidden Markov model: application to complete

genomes. Journal of Molecular Biology, 305(3), 567–80. doi:10.1006/jmbi.2000.4315

Page 70: Sequence, organization, and genes characteristics of ORFs ...

69

Kunin, V., Copeland, A., Lapidus, A., Mavromatis, K., & Hugenholtz, P. (2008). A

bioinformatician’s guide to metagenomics. Microbiology and Molecular Biology

Reviews : MMBR, 72, 557–578, Table of Contents. doi:10.1128/MMBR.00009-08

Lanyi, J. K. (1974). Salt-dependent properties of proteins from extremely halophilic bacteria.

Microbiology and Molecular Biology Reviews. Retrieved from

http://mmbr.asm.org/cgi/reprint/38/3/272.pdf\npapers2://publication/uuid/71242963-

F85C-4506-9123-A10526E4C337

Lee, D.-G., Jeon, J. H., Jang, M. K., Kim, N. Y., Lee, J.-H. J. H., Kim, S.-J., … Lee, S.-H.

(2007). Screening and characterization of a novel fibrinolytic metalloprotease from a

metagenomic library. Biotechnology Letters, 29(3), 465–472. doi:10.1007/s10529-006-

9263-8

Lee, M. H., & Lee, S.-W. (2013). Bioprospecting potential of the soil metagenome: novel

enzymes and bioactivities. Genomics & Informatics, 11(3), 114–20. Retrieved from

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3794083&tool=pmcentrez&

rendertype=abstract

Lewin, A., Wentzel, A., & Valla, S. (2013). Metagenomics of microbial life in extreme

temperature environments. Current Opinion in Biotechnology, 24(3), 516–525.

doi:10.1016/j.copbio.2012.10.012

Li, L.-L., McCorkle, S. R., Monchy, S., Taghavi, S., & van der Lelie, D. (2009).

Bioprospecting metagenomes: glycosyl hydrolases for converting biomass.

Biotechnology for Biofuels, 2, 10. doi:10.1186/1754-6834-2-10

Liew, C. W., Illias, R. M., Mahadi, N. M., & Najimudin, N. (2007). Expression of the

Na+/H+ antiporter gene (g1-nhaC) of alkaliphilic Bacillus sp. G1 in Escherichia coli.

FEMS Microbiology Letters, 276(1), 114–22. doi:10.1111/j.1574-6968.2007.00925.x

Liu, L., Li, Y., Li, S., Hu, N., He, Y., Pong, R., … Law, M. (2012). Comparison of next-

generation sequencing systems. Journal of Biomedicine & Biotechnology, 2012, 251364.

doi:10.1155/2012/251364

Liu, L., Liu, Y., Li, J., Du, G., & Chen, J. (2011). Microbial production of hyaluronic acid:

current state, challenges, and perspectives. Microbial Cell Factories, 10(1), 99.

doi:10.1186/1475-2859-10-99

Lorenz, P., & Eck, J. (2005). Metagenomics and industrial applications. Nature Reviews.

Microbiology, 3(6), 510–516.

Mária Džunková, Giuseppe D’Auria, David Pérez-Villarroya, A. M. (2012). Hybrid

Sequencing Approach Applied to Human Fecal Metagenomic Clone Libraries Revealed

Clones with Potential Biotechnological Applications. PLoS ONE, 7(10), e47654.

doi:10.1371/journal.pone.0047654

Page 71: Sequence, organization, and genes characteristics of ORFs ...

70

Metzker, M. L. (2005). Emerging technologies in DNA sequencing. Genome Research,

15(12), 1767–76. doi:10.1101/gr.3770505

Mohamed, Y. M., Ghazy, M. a, Sayed, A., Ouf, A., El-Dorry, H., & Siam, R. (2013).

Isolation and characterization of a heavy metal-resistant, thermophilic esterase from a

Red Sea Brine Pool. Scientific Reports, 3, 3358. doi:10.1038/srep03358

Nesbø, C. L., Boucher, Y., Dlutek, M., & Doolittle, W. F. (2005). Lateral gene transfer and

phylogenetic assignment of environmental fosmid clones. Environmental Microbiology,

7, 2011–2026. doi:10.1111/j.1462-2920.2005.00918.x

Neveu, J., Regeard, C., & DuBow, M. S. (2011). Isolation and characterization of two serine

proteases from metagenomic libraries of the Gobi and Death Valley deserts. Applied

Microbiology and Biotechnology, 91(3), 635–644. doi:10.1007/s00253-011-3256-9

Noguchi, H., Taniguchi, T., & Itoh, T. (2008). MetaGeneAnnotator: detecting species-

specific patterns of ribosomal binding site for precise gene prediction in anonymous

prokaryotic and phage genomes. DNA Research : An International Journal for Rapid

Publication of Reports on Genes and Genomes, 15(6), 387–96.

doi:10.1093/dnares/dsn027

Oren, a. (2002). Diversity of halophilic microorganisms: environments, phylogeny,

physiology, and applications. Journal of Industrial Microbiology & Biotechnology,

28(1), 56–63.

Oren, A. (2008). Microbial life at high salt concentrations: phylogenetic and metabolic

diversity. Saline Systems, 4, 2. doi:10.1186/1746-1448-4-2

Paul, S., Bag, S. K., Das, S., Harvill, E. T., & Dutta, C. (2008). Molecular signature of

hypersaline adaptation: insights from genome and proteome composition of halophilic

prokaryotes. Genome Biology, 9, R70. doi:10.1186/gb-2008-9-4-r70

Petersen, T. N., Brunak, S., von Heijne, G., & Nielsen, H. (2011). SignalP 4.0: discriminating

signal peptides from transmembrane regions. Nature Methods, 8(10), 785–6.

doi:10.1038/nmeth.1701

Prakasham, R. S., Rao, R. S., & Hobbs, P. J. (2009). Current trends in biotechnological

production of xylitol and future prospects, 3(January), 8–36.

Pushpam, P. L., Rajesh, T., & Gunasekaran, P. (2011). Identification and characterization of

alkaline serine protease from goat skin surface metagenome. AMB Express, 1(1), 3.

doi:10.1186/2191-0855-1-3

Pushpanathan, M., Gunasekaran, P., & Rajendhran, J. (2013). Antimicrobial Peptides :

Versatile Biological Properties, 2013(Table 1).

Pushpanathan, M., Rajendhran, J., Jayashree, S., Sundarakrishnan, B., Jayachandran, S., &

Gunasekaran, P. (2012). Identification of a Novel Antifungal Peptide with Chitin-

Page 72: Sequence, organization, and genes characteristics of ORFs ...

71

Binding Property from Marine Metagenome. Protein and Peptide Letters.

doi:10.2174/092986612803521620

Rose, R. W., Brüser, T., Kissinger, J. C., & Pohlschröder, M. (2002). Adaptation of protein

secretion to extremely high-salt conditions by extensive use of the twin-arginine

translocation pathway. Molecular Microbiology, 45(4), 943–50. Retrieved from

http://www.ncbi.nlm.nih.gov/pubmed/12180915

Rusch, D. B., Halpern, A. L., Sutton, G., Heidelberg, K. B., Williamson, S., Yooseph, S., …

Venter, J. C. (2007). The Sorcerer II Global Ocean Sampling expedition: northwest

Atlantic through eastern tropical Pacific. PLoS Biology, 5(3), e77.

doi:10.1371/journal.pbio.0050077

Sayed, A., Ghazy, M. a, Ferreira, A. J. S., Setubal, J. C., Chambergo, F. S., Ouf, A., … El-

Dorry, H. (2014). A novel mercuric reductase from the unique deep brine environment

of Atlantis II in the Red Sea. The Journal of Biological Chemistry, 289(3), 1675–87.

doi:10.1074/jbc.M113.493429

Schirmer, A., Gadkari, R., Reeves, C. D., Ibrahim, F., DeLong, E. F., & Hutchinson, C. R.

(2005). Metagenomic analysis reveals diverse polyketide synthase gene clusters in

microorganisms associated with the marine sponge Discodermia dissoluta. Applied and

Environmental Microbiology, 71, 4840–4849. doi:10.1128/AEM.71.8.4840-4849.2005

Schloss, P. D., & Handelsman, J. (2003). Biotechnological prospects from metagenomics.

Current Opinion in Biotechnology, 14(3), 303–310. doi:10.1016/S0958-1669(03)00067-

3

Schmidt, E. W., Nelson, J. T., Rasko, D. a, Sudek, S., Eisen, J. a, Haygood, M. G., & Ravel,

J. (2005). Patellamide A and C biosynthesis by a microcin-like pathway in Prochloron

didemni, the cyanobacterial symbiont of Lissoclinum patella. Proceedings of the

National Academy of Sciences of the United States of America, 102(20), 7315–7320.

Schmidt, M., Botz, R., Faber, E., Schmitt, M., Poggenburg, J., Garbe-Sch??nberg, D., &

Stoffers, P. (2003). High-resolution methane profiles across anoxic brine-seawater

boundaries in the Atlantis-II, Discovery, and Kebrit Deeps (Red Sea). Chemical

Geology, 200(3-4), 359–375. doi:10.1016/S0009-2541(03)00206-7

Sequencing, H. (2011). CLC Genomics Workbench. Workbench, 1–4.

Shallom, D., & Shoham, Y. (2003). Microbial hemicellulases. Current Opinion in

Microbiology, 6(3), 219–228. doi:10.1016/S1369-5274(03)00056-0

Sharma, S., Khan, F. G., & Qazi, G. N. (2010). Molecular cloning and characterization of

amylase from soil metagenomic library derived from Northwestern Himalayas. Applied

Microbiology and Biotechnology, 86(6), 1821–1828. doi:10.1007/s00253-009-2404-y

Shendure, J., & Ji, H. (2008). Next-generation DNA sequencing. Nature Biotechnology,

26(10), 1135–45. doi:10.1038/nbt1486

Page 73: Sequence, organization, and genes characteristics of ORFs ...

72

Shi, H., Ding, H., Huang, Y., Wang, L., Zhang, Y., Li, X., & Wang, F. (2014). Expression

and characterization of a GH43 endo-arabinanase from Thermotoga thermarum. BMC

Biotechnology, 14(1), 35. doi:10.1186/1472-6750-14-35

Siam, R., Mustafa, G. a., Sharaf, H., Moustafa, A., Ramadan, A. R., Antunes, A., … Dorry,

H. El. (2012). Unique prokaryotic consortia in geochemically distinct sediments from

red sea Atlantis II and discovery deep brine pools. PLoS ONE, 7(8), e42872.

doi:10.1371/journal.pone.0042872

Sievert, S., & Vetriani, C. (2012). Chemoautotrophy at Deep-Sea Vents: Past, Present, and

Future. Oceanography. doi:10.5670/oceanog.2012.21

Siglioccolo, A., Paiardini, A., Piscitelli, M., & Pascarella, S. (2011). Structural adaptation of

extreme halophilic proteins through decrease of conserved hydrophobic contact surface.

BMC Structural Biology. doi:10.1186/1472-6807-11-50

Simon, C., & Daniel, R. (2011). Metagenomic analyses: past and future trends. Applied and

Environmental Microbiology, 77(4), 1153–61. doi:10.1128/AEM.02345-10

Simon, C., Wiezer, A., Strittmatter, A. W., & Daniel, R. (2009). Phylogenetic diversity and

metabolic potential revealed in a glacier ice metagenome. Applied and Environmental

Microbiology, 75(23), 7519–7526. doi:10.1128/AEM.00946-09

Singh, J., Behal, A., Singla, N., Joshi, A., Birbian, N., Singh, S., … Batra, N. (2009).

Metagenomics: Concept, methodology, ecological inference and recent advances.

Biotechnology Journal, 4(4), 480–494. doi:10.1002/biot.200800201

Singh, R., Dhawan, S., Singh, K., & Kaur, J. (2012). Cloning, expression and

characterization of a metagenome derived thermoactive/thermostable pectinase.

Molecular Biology Reports, 39(8), 8353–8361. doi:10.1007/s11033-012-1685-x

Stein, J. L., Marsh, T. L., Wu, K. Y., Shizuya, H., & Delong, E. F. (1996). Characterization

of uncultivated prokaryotes: Isolation and analysis of a 40-kilobase-pair genome

fragment from a planktonic marine archaeon. Journal of Bacteriology, 178, 591–599.

Stein, L. D. (2010). The case for cloud computing in genome informatics. Genome Biology,

11(5), 207. doi:10.1186/gb-2010-11-5-207

Sultana, H., & Neelakanta, G. (2013). The Use of Metagenomic Approaches to Analyze

Changes in Microbial Communities. Microbiology Insights, 37.

doi:10.4137/MBI.S10819

Teeling, H., & Glockner, F. O. (2012). Current opportunities and challenges in microbial

metagenome analysis--a bioinformatic perspective. Briefings in Bioinformatics, 13(6),

728–742. doi:10.1093/bib/bbs039

Page 74: Sequence, organization, and genes characteristics of ORFs ...

73

Thomas, T., Gilbert, J., & Meyer, F. (2012). Metagenomics - a guide from sampling to data

analysis. Microbial Informatics and Experimentation, 2(1), 3. doi:10.1186/2042-5783-2-

3

Treangen, T. J., & Salzberg, S. L. (2012). Repetitive DNA and next-generation sequencing:

computational challenges and solutions. Nature Reviews Genetics. doi:10.1038/nrg3164

Treusch, A. H., Leininger, S., Kietzin, A., Schuster, S. C., Klenk, H. P., & Schleper, C.

(2005). Novel genes for nitrite reductase and Amo-related proteins indicate a role of

uncultivated mesophilic crenarchaeota in nitrogen cycling. Environmental Microbiology,

7, 1985–1995. doi:10.1111/j.1462-2920.2005.00906.x

Tyson, G. W., Chapman, J., Hugenholtz, P., Allen, E. E., Ram, R. J., Richardson, P. M., …

Banfield, J. F. (2004). Community structure and metabolism through reconstruction of

microbial genomes from the environment. Nature, 428(6978), 37–43.

Uchiyama, T., & Miyazaki, K. (2009). Functional metagenomics for enzyme discovery:

challenges to efficient screening. Current Opinion in Biotechnology, 20(6), 616–622.

doi:10.1016/j.copbio.2009.09.010

Urban, M., & Adamczak, M. (2008). – A REVIEW, 58(1), 11–22.

Venter, J. C., Remington, K., Heidelberg, J. F., Halpern, A. L., Rusch, D., Eisen, J. A., …

Smith, H. O. (2004). Environmental genome shotgun sequencing of the Sargasso Sea.

Science (New York, N.Y.), 304, 66–74. doi:10.1126/science.1093857

Vergin, K. L., Urbach, E., Stein, J. L., DeLong, E. F., Lanoil, B. D., & Giovannoni, S. J.

(1998). Screening of a fosmid library of marine environmental genomic DNA fragments

reveals four clones related to members of the order Planctomycetales. Applied and

Environmental Microbiology, 64, 3075–3078.

Verma, D., Kawarabayasi, Y., Miyazaki, K., & Satyanarayana, T. (2013). Cloning,

Expression and Characteristics of a Novel Alkalistable and Thermostable Xylanase

Encoding Gene (Mxyl) Retrieved from Compost-Soil Metagenome. PLoS ONE, 8(1),

e52459. doi:10.1371/journal.pone.0052459

Waditee, R., Hibino, T., Nakamura, T., Incharoensakdi, A., & Takabe, T. (2002).

Overexpression of a Na ؉ ͞ H ؉ antiporter confers salt tolerance on a freshwater

cyanobacterium , making it capable of growth in sea water, (13).

Wang, G. Y., Graziani, E., Waters, B., Pan, W., Li, X., McDermott, J., … Davies, J. (2000).

Novel natural products from soil DNA libraries in a streptomycete host. Organic

Letters, 2(16), 2401–2404.

Wang, H., Gong, Y., Xie, W., Xiao, W., Wang, J., Zheng, Y., … Liu, Z. (2011).

Identification and characterization of a novel thermostable gh-57 gene from

metagenomic fosmid library of the Juan de Fuca Ridge hydrothemal vent. Applied

Biochemistry and Biotechnology, 164(8), 1323–1338. doi:10.1007/s12010-011-9215-1

Page 75: Sequence, organization, and genes characteristics of ORFs ...

74

Wemheuer, B., Taube, R., Akyol, P., Wemheuer, F., & Daniel, R. (2013). Microbial diversity

and biochemical potential encoded by thermal spring metagenomes derived from the

Kamchatka peninsula. Archaea, 2013.

Whitman, W. B., Coleman, D. C., & Wiebe, W. J. (1998). Perspective Prokaryotes : The

unseen majority, 95(June), 6578–6583.

Wooley, J. C., Godzik, A., & Friedberg, I. (2010). A primer on metagenomics. PLoS

Computational Biology, 6(2), e1000667. doi:10.1371/journal.pcbi.1000667

Wu, S., Zhu, Z., Fu, L., Niu, B., & Li, W. (2011). WebMGA: a customizable web server for

fast metagenomic sequence analysis. BMC Genomics, 12(1), 444. doi:10.1186/1471-

2164-12-444

Yu, N. Y., Wagner, J. R., Laird, M. R., Melli, G., Rey, S., Lo, R., … Brinkman, F. S. L.

(2010). PSORTb 3.0: improved protein subcellular localization prediction with refined

localization subcategories and predictive capabilities for all prokaryotes. Bioinformatics

(Oxford, England), 26(13), 1608–15. doi:10.1093/bioinformatics/btq249

Zhang, C., & Kim, S.-K. (2010). Research and application of marine microbial enzymes:

status and prospects. Marine Drugs, 8(6), 1920–1934. doi:10.3390/md8061920