Metagenomic Analysis of Mariana Trench Sediment Samples



Vera Maria Leal Carvalho

MSc IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY


So long, and thanks for all the fish!


Abstract

The emergence of Metagenomics allowed the study of the microbial community in the

deepest point on Earth: the Challenger Deep on the Mariana Trench. Its extreme

conditions, a water depth of almost 11km, a temperature of 2.5 degrees Celsius and a

pressure around 112 MPa, made it very difficult to perform a comprehensive study of

its microecology, given the previous dependency on culturing methods. This

metagenomic analysis included taxonomic identification and exploration of some

functional potential of the genomic sequences of the community, generated by Illumina

Next-Generation Sequencing technique, therefore bypassing the need for cloning. Here

we show that Proteobacteria clearly dominate this environment but that there is no

obvious correlation between the sediment depth and the community composition.

Moreover, the abundance of enzymes involved in oxidative phosphorylation in all

samples suggests aerobic activity within the sediment. This supports the finding that

there is oxygen consumption along the depth of the sediment. An extensive description

of all the data generated was not feasible; however, as soon as the data become

available, they will be accessible to the public to search for features of interest.

Keywords: metagenomics, Mariana Trench, Challenger Deep, extreme environments,

Illumina, community structure, energy metabolism

O aparecimento da Metagenómica permitiu o estudo da comunidade microbiana no

ponto mais profundo na Terra: o “Challenger Deep” na Fossa das Marianas. As

condições extremas aí presentes - a coluna de água de quase 11km, 2.5ºC de

temperatura e a pressão à volta de 112MPa - tornaram um estudo aprofundado da sua

microecologia muito difícil de executar, dada a prévia dependência em métodos que

envolviam culturas em laboratório. Esta análise metagenómica incluiu identificação

taxonómica e a pesquisa do potencial funcional das sequências genómicas da

comunidade, geradas utilizando a tecnologia de nova geração de sequenciação da

Illumina, ultrapassando assim a necessidade de clonagem. Neste trabalho demonstra-

se que Proteobacteria domina claramente este habitat, mas que não há uma

correlação inequívoca entre a profundidade do sedimento e a composição da

comunidade. Além disso, a abundância de enzimas envolvidas na oxidação


fosforilativa em todas as amostras, sugere actividade aeróbia no sedimento. Isto

sustenta a descoberta de que há consumo de oxigénio ao longo da profundidade do

sedimento. Uma descrição extensa de todos os dados que foram gerados era

proibitivo, no entanto, assim que os dados se tornarem públicos, serão acessíveis a

todos os que os queiram investigar consoante os seus interesses.

Palavras-chave: metagenómica, Fossa das Marianas, “Challenger Deep”, ambientes

extremos, Illumina, estrutura da comunidade, metabolismo energético

Introduction


1. Introduction

1.1. Background

The Challenger Deep on the Mariana Trench is one of the most extreme environments

on Earth, with a depth of almost 11 km, a temperature of 2.5 degrees Celsius[1] and a

pressure of around 111.79 MPa, calculated as the hydrostatic pressure assuming a mean sea water density of

1036 kg/m³[2] and a gravitational acceleration of 9.81 m/s²[3]. It is located at roughly 11º 22.1' N,

142º 25.8' E[1] (Figure 1).

Figure 1 - Challenger Deep location (11º 22.1'N 142º 25.8' E)
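The pressure quoted above can be checked with a short calculation. The sketch below assumes the value was obtained as the hydrostatic pressure P = ρ·g·h, with "almost 11 km" rounded to 11,000 m (an assumption, since the exact depth used is not stated) and atmospheric pressure ignored.

```python
# Hydrostatic pressure at the bottom of the water column: P = rho * g * h.
RHO_SEAWATER = 1036.0  # mean density of sea water in kg/m^3, as quoted in the text
G = 9.81               # gravitational acceleration in m/s^2, as quoted in the text
DEPTH_M = 11_000.0     # assumption: "almost 11 km" rounded to 11,000 m


def hydrostatic_pressure_mpa(rho: float, g: float, depth_m: float) -> float:
    """Return the hydrostatic pressure in MPa (atmospheric pressure ignored)."""
    return rho * g * depth_m / 1e6


pressure_mpa = hydrostatic_pressure_mpa(RHO_SEAWATER, G, DEPTH_M)
print(f"{pressure_mpa:.2f} MPa")
```

Under these assumptions the result agrees with the 111.79 MPa given in the text.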

It has been a subject of human curiosity for many years[4]; so far, however, there has been no

detailed study of its microecology. With the emergence of Metagenomics, it is now

finally possible to unravel which organisms live in the deepest point on Earth, and what

they are doing.

It was in 1998 that the term "Metagenomics" was first used by Jo Handelsman [5] in an

effort to study the microflora as a unit, the metagenome, instead of addressing each

type of organism individually.

Previously, it was thought that it was necessary to study the morphology, physiology

and pathogenic characters in order to classify a microorganism[6], but since Woese in


1977 pioneered the use of 16S sequences for classification[7], sequence comparison

has been widely used and accepted as valid to do so.

With the development of the sequencing technology, one can now take a sample

directly from the environment, extract its DNA, sequence it, and infer the microbial

composition of the sample, therefore overcoming the bottleneck of growing pure

cultures in the laboratory. This method enables the discovery of new forms of life that

are not cultivable, and the assessment of the genetic richness and diversity, as well as the

metabolic potential, of a community of organisms as a whole[8].

Metagenomic analysis can accordingly be defined as “the identification, and functional

and evolutionary analysis of the genomic sequences of a community of organisms”.[9]

Moreover, the paradigm that most of the microbial world was already known gave way to the

acknowledgement that there is still a lot to know and to explore[10]. Discovering new

forms of life in extreme environments can provide insights into a variety of topics, like

the biogeochemical activities that occur in the ocean[11], and the impact that human

activity may have on them[12].

1.2. Metagenomic Analysis

To analyse a metagenome, several steps are typically involved, from the experimental

design to sharing the data[13]. Firstly, one has to obtain the samples. Ideally, true

replicates should be taken as well. Afterwards, one may filter the samples, to target a

(more-or-less) specific group of organisms [14].

The following step is sequencing. There are several technologies to sequence DNA,

each with its own advantages and weaknesses. The Mariana Trench sediment

samples were sequenced using Illumina’s paired-end assay. Its advantage is that it is

cheap and generates a large number of reads per run; the reads, however, are very short

(50–250 bp), which can pose a problem for assembly and comparison, since it becomes

more difficult to assign a read unequivocally to a template[15].

Illumina’s technology consists of attaching random DNA fragments to a surface, amplifying

them to form clusters of the same sequence, and then using them as templates for

repeated cycles of polymerase-directed single-base extension. Extension by a single base per cycle is ensured by

using 3′-modified nucleotides, labeled with a removable fluorophore. After determining

the identity of the nucleotide incorporated by laser-induced excitation of the

fluorophores, these as well as the side arm (that prevents the incorporation of more

than one nucleotide per cycle) are removed. The images of the fluorescent signal are


used to determine the sequence (each nucleotide is attached to a fluorophore of a

different colour), and its quality, defined as the likelihood of each call being correct[16].

The paired-end option means that each fragment is sequenced from both of its ends (one read from each strand),

which provides additional positional information that is helpful for the assembly[17].
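The relation between a base-call quality score and the probability of an incorrect call can be sketched as follows. This is an illustration only: it assumes the standard Phred scale, Q = -10·log10(P), and the Phred+33 character encoding used from Illumina 1.8 onwards (Illumina 1.3/1.5 files use an offset of 64 instead). The function names are hypothetical.

```python
def decode_quality(quality_string: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string into Phred scores (offset 33 is assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def phred_to_error_prob(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


scores = decode_quality("II5+")
error_probs = [phred_to_error_prob(q) for q in scores]
print(scores, error_probs)
```

On this scale, a score of 20 corresponds to a 1 in 100 chance of an incorrect call, and a score of 30 to 1 in 1,000.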

Assembly is the next step in the Metagenomic analysis pipeline, although it is

sometimes skipped. Its usefulness is debatable[18], given that the accuracy of the

assemblers is difficult to assess, since there is currently no microbial community with

known reference sequences to compare to[13].

The main problem with assembly is that it distorts abundance information, since

abundant fragments will be considered as belonging to the most abundant species,

when in reality they may be present in rare species[18]. Moreover, some fragments

may be incorrectly discarded as mistakes or repeats, or joined up in the wrong places

or orientations[19]. Nonetheless, if these setbacks are taken into account when doing

the analysis, then assembly can be advantageous as it produces longer sequences

that are more easily and unambiguously annotated.

Gene prediction and annotation usually follow. The first classifies the sequences as

coding or non-coding, and the second tries to find homology between the coding

sequences and known sequences stored in databases. Once again, these methods

have their own flaws, mainly because they are based on models, hence failing to

predict exceptions that can occur in the biological world.

Typically, the final step is to share the sequence data on public databases together

with the metadata. Contextual data is necessary to compare with other datasets,

essentially making the sequences useful for the database and the scientific community.

By complying with standard languages for metadata, such as MIMS, the data becomes

more accessible, as complex searches will retrieve more information[20].

The drawbacks surrounding metagenomic analysis are not at all

surprising if one considers that it is still a very young field. A quick search on Web of

Knowledge[21] for the total number of articles featuring the term “metagenome” or

“metagenomics” gives a very clear picture of how novel this field is, and how much

data has been produced (Figure 2).

With the popularity of the field expanding, a multitude of tools has been developed, which makes

the choice of which one to use far from trivial. There is still no evident consensus

on which is the best tool for each step (not even for sequencing), so the errors in the


data are most likely directly related to the flaws in each method, which means that a

different set of methods will yield a different set of errors.

Figure 2 - Total number of metagenomics articles published since 1998

Given this explosion of data, an obvious question concerns its applicability. One example

would be bioremediation[22]. The process of biodegradation encompasses several

metabolic pathways that, when considered on a community basis instead of an

individual basis, lead to a global understanding of what is essential and what is

superfluous, easing the design of such a system. Moreover, the industry sector is

always in search of novel enzymes and processes[23].

Even so, metagenomics tends to be regarded as exploratory research, raising more

questions instead of addressing them. Accordingly, the aim of this project was not only

to answer some simple questions, but also to raise some more, and hopefully to

encourage further studies in this environment.

1.3. Objective

This project dealt solely with the analysis of the raw data output by sequencing. The

goal was to assess the taxonomical distribution of the community along the depth of

the sediment and to explore its metabolic potential, using the most adequate tools.

However, with the publication of the article [1], which included these sediment samples,

the focus turned to assessing whether the data generated by this analysis would corroborate the

published data, namely to confirm the O2 consumption throughout the sediment depth.

[Figure 2 chart data. Y axis: total number of articles (0 to 14,000). X axis: year. Plotted values, in chart order: 1, 3, 4, 7, 19, 52, 110, 225, 383, 637, 1,046, 1,689, 6,538, 9,381, 13,106, 13,853. Legend: number of articles with the keyword metagenome* or metagenomic*.]


1.4. Structure of the thesis

This report is organized in five chapters. Starting with the Introduction, some

background information is presented regarding both the sampling site and

the technology and pipeline typically employed in this kind of study. The second

chapter describes the methodology, with an explanation of each method and its output.

Chapter three presents selected results, and chapter four gives a critical discussion of

them. The last chapter has the conclusion, with final remarks and

some future directions for similar studies.

Methods


2. Methods

Seven out of eight sequenced samples from different depths were analysed. Each

sample corresponded to a gradient of 5cm, starting from 0-5cm (sample 1) to 35-40cm

(sample 8). The data was cleaned using the collection of tools Biopieces[24], the reads

assembled using IDBA-UD[25] [26], and both the generated contigs and the clean

reads were submitted to MG-RAST[27].

2.1. Sample collection, preparation and sequencing

The upstream work was carried out by collaborators and consisted of the following:

the DNA was extracted from 5g of sediment collected at different depths from the

Challenger Deep-Mariana Trench at 10,900m, using PowerMax soil DNA isolation kit

(MoBio Laboratories, CA USA). Eight DNA samples, each corresponding to a different

depth, were sent to BGI-Shenzhen (China), for library preparation and sequencing.

Since one of the samples (sample 4: 15-20 cm) did not contain enough DNA for library

preparation (as reported in the BGI Sample Test Report), 14 FASTQ files were

received back, 2 for each of the seven samples – one with the forward and another

with the reverse reads.

2.2. Preliminary Analysis

The initial number of reads on each sample ranged from around 41 million to almost 84

million (Table 1).

Table 1 - Percentage of reads removed with the cleaning

Sample   Raw reads (forward + reverse)   Clean reads   Reads removed (%)
1        58,814,066                      43,569,096    25.921
2        45,533,260                      18,717,708    58.892
3        47,163,190                      34,419,612    27.020
5        83,968,942                      36,784,382    56.193
6        61,751,498                      43,891,904    28.922
7        46,894,236                      33,786,396    27.952
8        41,030,848                      28,508,242    30.520
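The percentages in Table 1 follow directly from the read counts. The snippet below is a minimal check that recomputes them from the values in the table; the variable and function names are illustrative only.

```python
# Read counts copied from Table 1: sample -> (raw reads, clean reads).
READ_COUNTS = {
    1: (58_814_066, 43_569_096),
    2: (45_533_260, 18_717_708),
    3: (47_163_190, 34_419_612),
    5: (83_968_942, 36_784_382),
    6: (61_751_498, 43_891_904),
    7: (46_894_236, 33_786_396),
    8: (41_030_848, 28_508_242),
}


def percent_removed(raw: int, clean: int) -> float:
    """Percentage of raw reads discarded by the cleaning."""
    return 100.0 * (raw - clean) / raw


removed = {sample: percent_removed(raw, clean) for sample, (raw, clean) in READ_COUNTS.items()}
for sample, pct in removed.items():
    print(f"sample {sample}: {pct:.3f}% removed")
```

The lowest and highest values come from samples 1 and 2, respectively.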


2.3. Biopieces

Low-quality residues at the ends of the reads were removed, as well as the adaptors

used in the sequencing. Reads shorter than 30 bp were also excluded, in

addition to reads with a local mean score under 15, to overcome errors propagated

from cycle to cycle[28]. The cleaning removed from about 26% to almost 59% of the reads in

the samples (Table 1). The Biopieces script used is shown in Figure 3.

The tool trim_seq removes residues from the ends of sequences whose quality, according to the

scores in the FASTQ file, is below the minimum quality specified (in this case

25). The flag “-l” makes sure that residues are removed until a stretch of at least 3

residues with good quality is found, to avoid a premature termination due to a good

quality residue at the end. This step is necessary to overcome the effect of phasing and

pre-phasing. These are caused by incomplete removal of the 3' terminators and

fluorophores, sequences missing an incorporation cycle, or by the incorporation of

nucleotides without effective 3' terminators[28]. This means that each cycle’s signal is

affected by the signal of the previous and subsequent cycles, hindering the detection of

the right base.
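The trimming rule described above can be illustrated with a small sketch. It is not the Biopieces implementation; it only assumes that trimming proceeds from each end until a stretch of `stretch` consecutive residues with quality at least `min_qual` is found, which is how the text describes the "-m 25 -l 3" settings. The function name is hypothetical.

```python
def trim_ends(seq, quals, min_qual=25, stretch=3):
    """Trim both ends until a run of `stretch` residues with quality >= min_qual is met."""

    def first_good(qs):
        # Index where the first run of `stretch` good residues starts, or None.
        for i in range(len(qs) - stretch + 1):
            if all(q >= min_qual for q in qs[i:i + stretch]):
                return i
        return None

    left = first_good(quals)
    if left is None:  # no acceptable stretch anywhere: the whole read is trimmed away
        return "", []
    right = len(quals) - first_good(quals[::-1])
    return seq[left:right], quals[left:right]


trimmed_seq, trimmed_quals = trim_ends(
    "ACGTACGTAC", [10, 30, 12, 30, 30, 30, 28, 9, 35, 11]
)
print(trimmed_seq, trimmed_quals)
```

In this toy read, the isolated good residues near each end (qualities 30 and 35) do not stop the trimming, because they are not part of a stretch of three good residues.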

read_fastq -i - |
trim_seq -m 25 -l 3 |
find_adaptor -l 6 -L 6 -f ACACGACGCTCTTCCGATCT -r AGATCGGAAGAGCACACGTC |
clip_adaptor |
merge_pair_seq |
grab -e 'SEQ_LEN_LEFT >= 30' |
grab -e 'SEQ_LEN_RIGHT >= 30' |
mean_scores -l |
grab -e 'SCORES_MEAN_LOCAL >= 15' |
split_pair_seq |
write_fastq -x

Figure 3 – Cleaning script

Find_adaptor searches the reads for the given adaptors (forward:

ACACGACGCTCTTCCGATCT and reverse: AGATCGGAAGAGCACACGTC), or

partial adaptors at least 6 residues long (flags “-l” for the forward and “-L” for

the reverse adaptor). By default, a percentage of the adaptor length is allowed for

mismatches, insertions, and deletions (10%, 5% and 5%, respectively).

Once the adaptors are found, clip_adaptor removes them, based on the keys output by

find_adaptor: ADAPTOR_POS_RIGHT, ADAPTOR_POS_LEFT, and ADAPTOR_LEN_LEFT.


The merge_pair_seq tool merges paired sequences, as long as they are interleaved.

Sequence names must be in either Illumina1.3/1.5 format, with a trailing “/1” or “/2”, or

Illumina1.8 format, containing “1:” or “2:”. The sequence names should also match.

Grab is an improved version of Unix’s “grep”. It selects records that match a pattern, a

regular expression, or a numerical evaluation. In this case, we selected reads with a

length of at least 30 bp, by examining the keys SEQ_LEN_LEFT and

SEQ_LEN_RIGHT, output by merge_pair_seq.

Afterwards, mean_scores -l was used to calculate the local mean scores, which means

that instead of calculating the mean as the sum of all the scores over the length of the

string, it uses means from a sliding window, and returns the smallest value.
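The difference between a global and a local mean score can be sketched as below. This is an illustration of the sliding-window idea only, not the Biopieces code; in particular, the window size of 5 is a hypothetical choice, as the window used by mean_scores is not stated in the text.

```python
def local_mean_min(quals, window=5):
    """Smallest mean quality over all sliding windows of the given size."""
    if len(quals) <= window:
        return sum(quals) / len(quals)
    running = sum(quals[:window])
    lowest = running
    for i in range(window, len(quals)):
        running += quals[i] - quals[i - window]  # slide the window by one residue
        lowest = min(lowest, running)
    return lowest / window


quals = [40, 40, 40, 40, 40, 5, 5, 5, 5, 5, 40, 40]
global_mean = sum(quals) / len(quals)
local_min = local_mean_min(quals)
print(global_mean, local_min)
```

The toy read has a global mean above 15 but contains a low-quality stretch, so it would pass a filter on the global mean while failing the local one.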

Finally, split_pair_seq was used to split the sequences merged with merge_pair_seq.

To speed up the process, this script was run with GNU Parallel[29] using the -L 8 option,

which takes two records at a time (each record has 4 lines), to avoid breaking the

pairs. GNU Parallel allows Biopieces to be executed in parallel using multiple CPUs on

multiple cores and servers[24].

The merge_pair_seq and split_pair_seq tools were created within this project, to

overcome speed and memory problems caused by the use of order_pairs. The latter

interleaves the sequences, as long as their names are in the Illumina 1.5 or 1.8 scheme,

and adds a key stating whether the read is “paired” or “orphan”. This should be used after the

trimming and grabbing steps, and subsequently, only the paired reads should be

grabbed.

Example of a script using order_pairs (Figure 4):

read_fastq -i - |
trim_seq -m 25 -l 3 |
find_adaptor -l 6 -L 6 -f ACACGACGCTCTTCCGATCT -r AGATCGGAAGAGCACACGTC |
clip_adaptor |
grab -e 'SEQ_LEN >= 30' |
mean_scores -l |
grab -e 'SCORES_MEAN_LOCAL >= 15' |
order_pairs |
grab -p 'pair' -k ORDER |
write_fastq -x

Figure 4 – order_pairs script


2.4. Assembly

The decision to assemble smaller reads into larger contigs was made based on the

postulation that “The longer the sequence information, the better is the ability to obtain

accurate information.” The annotation procedure becomes easier, since longer

sequences yield more information to compare with the databases. The same applies to the

classification of DNA fragments. Assembly also raises the confidence in accuracy, compensating for

the lower quality of single reads by having multiple reads cover the same segment

of information, provided that the coverage is high enough[13]. The IDBA-UD algorithm

is based on de Bruijn graphs adapted for metagenomic sequencing technologies with

uneven sequencing depths[26].

In a de Bruijn graph, every possible (k-1)-mer is assigned to a node, and a node has a directed

edge to another one if there is some k-mer whose prefix is the former and whose suffix

is the latter. This means that the edges in the graph represent all possible k-mers.

The idea is to find an Eulerian cycle[30], which spells the shortest superstring that contains each

k-mer exactly once (Figure 5).

Because each edge is visited only once, the time to run the algorithm is roughly proportional to

the number of edges[31]. Finding a Hamiltonian cycle[32], where each node is visited

only once, is in contrast an NP-complete problem[33] (no algorithm is known whose running time

grows only polynomially with the size of the input).

Applied to genome assembly, all the k-mers are the ones present in the reads

generated by sequencing[31], so ideally, the Eulerian cycle would generate the

genome. In practice this method cannot be applied directly, since there are some

assumptions that do not hold. Firstly, we cannot be sure that all the k-mers present in

the genome were generated; secondly, k-mers are not error-free; thirdly, each k-mer is

very likely to appear more than once in the genome; and lastly, we should not assume

that the genome is a single circular chromosome.

To deal with the first problem, instead of trying to assemble the reads, the algorithm

breaks them into smaller k-mers which are more likely to be representative of the whole

genome. To handle errors, the assembler chooses the path which is supported by

higher coverage. Regarding repeats, if a k-mer appears more than once in the

genome, it shall be represented by several edges connecting the same two nodes.

Finally, rather than searching for an Eulerian cycle, if the algorithm is modified to

search for an Eulerian path[34], then it is not required to end in the same node where it

began[31].
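The idea of spelling a sequence from an Eulerian path can be sketched on a toy example. The code below is a minimal illustration using Hierholzer's algorithm on error-free k-mers from a single linear sequence; it is not how IDBA-UD is implemented, and the function names and the toy sequence are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(kmers):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes it points to."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])  # one edge per k-mer; repeats give parallel edges
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: return a node sequence that uses every edge exactly once."""
    balance = defaultdict(int)  # out-degree minus in-degree
    for node, successors in graph.items():
        balance[node] += len(successors)
        for nxt in successors:
            balance[nxt] -= 1
    # A path (rather than a cycle) starts at the node with one extra outgoing edge.
    start = next((n for n, b in balance.items() if b == 1), next(iter(graph)))
    remaining = {node: list(successors) for node, successors in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Concatenate the first node with the last character of every following node."""
    return path[0] + "".join(node[-1] for node in path[1:])


def kmers_of(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


genome = "ATGGCGTGCA"  # toy linear sequence
kmers = kmers_of(genome, 3)
assembled = spell(eulerian_path(build_de_bruijn(kmers)))
print(assembled)
```

The assembled string contains exactly the same k-mers as the toy sequence, but it need not be identical to it: the repeated (k-1)-mers "TG" and "GC" allow more than one Eulerian path, which is the repeat problem mentioned above in miniature.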


Figure 5 – Genome assembly strategies: Hamiltonian and Eulerian cycles[31].

The main problem with metagenomic data is that species with different abundances will

be represented by reads with uneven depth, and this cannot be disregarded as, e.g.,

an amplification bias. IDBA-UD solves this problem by adopting variable thresholds on

the multiplicity of the k-mers, making them dependent on the sequencing depth of the

neighboring contigs. The idea is that contigs with much lower sequencing depths than

their neighbors are more likely erroneous[26]. Moreover, IDBA-UD uses paired-end

information, namely the distance between the pairs, to solve issues such as missing k-

mers and repeats.

The assembler IDBA-UD was first used with the default minimum contig size setting

(200 bp), which yielded N50 values from 3,545 to 9,240 bp. N50 is the length of the smallest

contig in the set of the fewest (largest) contigs whose combined length represents no

less than 50% of the assembly. It is one of the common assembly statistics[35].

A higher minimum contig size of 500 bp was then chosen, which improved

the N50 values, so these contigs were uploaded to the MG-RAST server[27]. The

complete analysis of both assemblies (using the Biopiece analyze_assembly) is shown

in Table 2 and Table 3, including N50, contig length (maximum, minimum, mean and

total) and the number of contigs.
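The N50 definition given above translates into a few lines of code. The sketch below is an illustration, not the analyze_assembly implementation; the function name and the toy contig lengths are hypothetical.

```python
def n50(contig_lengths):
    """Length of the smallest contig among the largest contigs covering >= 50% of the assembly."""
    if not contig_lengths:
        raise ValueError("at least one contig is required")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:  # at least half of the assembly is now covered
            return length


toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_lengths))
```

In the toy assembly the total length is 32, and the two largest contigs (8 and 8) already cover half of it, so the N50 is 8.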


Table 2 – Analysis of the assembly with minimum contig size 200 bp.

Sample                       1            2            3            5            6            7            8
N50 (bp)                 3,545        4,705        4,397        6,430        8,726        7,136        9,240
Max length (bp)        614,662      215,848      466,951      305,081      551,041      305,025      452,197
Min length (bp)            200          200          200          200          200          200          200
Mean length (bp)         1,439        1,956        1,651        2,206        2,068        1,822        2,234
Total length (bp)  106,683,337   42,414,452   79,943,730   68,959,229   61,784,508   70,790,963   64,146,341
Number of contigs       74,124       21,681       48,418       31,250       29,868       38,839       28,704

Table 3 – Analysis of the assembly with minimum contig size 500 bp.

Sample                       1            2            3            5            6            7            8
N50 (bp)                14,106        6,199       14,122        8,662       17,856       16,340       16,698
Max length (bp)        614,662      215,848      548,284      305,081      551,037      337,423      551,034
Min length (bp)            504          502          503          501          505          518          503
Mean length (bp)         3,261        3,180        3,384        3,883        4,277        4,235        4,492
Total length (bp)   76,906,252   36,624,750   60,016,272   59,604,646   51,333,136   55,864,260   54,104,132
Number of contigs       23,581       11,514       17,732       15,349       12,000       13,190       12,044

2.5. MG-RAST

MG-RAST[27] uses several bioinformatics tools in its pipeline. Firstly, it filters

sequences based on length, number of ambiguous bases and quality values. All the

contigs uploaded from the 7 samples passed this preprocessing stage.

Then, “technical replicates”, identified as sequences with identical first 50 base pairs,

are removed in a step called Dereplication. Between 0.7% (surface sample) and 2.3%

(sample 7) of the contigs were removed in this step, but no reads were removed. This

can be explained by the use of the same reads for different contigs.
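The dereplication criterion described above can be sketched as follows. This is an illustration only and not the MG-RAST code: it assumes that the first sequence seen with a given 50 bp prefix is the one kept, which the text does not specify, and the function name is hypothetical.

```python
def dereplicate(sequences, prefix_len=50):
    """Keep one sequence per distinct prefix of `prefix_len` bases (first one seen is kept)."""
    seen_prefixes = set()
    kept = []
    for seq in sequences:
        prefix = seq[:prefix_len]
        if prefix in seen_prefixes:
            continue  # treated as a technical replicate
        seen_prefixes.add(prefix)
        kept.append(seq)
    return kept


seq_a = "ACGT" * 13 + "AAAA"  # 56 bases
seq_b = "ACGT" * 13 + "CCCC"  # same first 50 bases as seq_a, different tail
seq_c = "T" * 60
unique = dereplicate([seq_a, seq_b, seq_c])
print(len(unique))
```

Note that two sequences are treated as replicates even when they differ after position 50, as with seq_a and seq_b here.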

After that, FragGeneScan[36] is used to predict coding regions. This tool is an ab initio

gene-calling algorithm that uses a hidden Markov model for coding and non-coding

regions, and that was developed especially for metagenomes. It includes codon usage

bias, sequencing error models and start/stop codon patterns. A gene is reported if it is

longer than 60 bp, and begins either with a start or an internal codon of a gene and

ends with a stop or an internal codon. This way, both complete and partial genes are

predicted. From 29,239 (sample 2) to 63,877 (sample 1) coding sequences were


predicted within the contigs, and from 16,387,405 (sample 2) to 40,199,546 (sample 6)

within the reads.

The sequences output from FragGeneScan are then clustered at 90% identity with

qiime-uclust. QIIME[37] is a software package developed especially for high-throughput

amplicon sequencing data, although it also supports metagenomic data. It incorporates

many third party tools, such as UCLUST[38]. This algorithm clusters sequences based

on their similarity, according to a threshold set by the user (or in this case by MG-

RAST). Each cluster is therefore represented by a sequence, and all the sequences in

it should have a similarity higher than the threshold to the sequence representing the

cluster (centroid), and centroids should have similarity below the threshold to the other

centroids. The algorithm starts with no centroids; each sequence is compared to

the list of centroids and is either assigned to a cluster or selected as a new centroid.
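The greedy, centroid-based procedure described above can be sketched with a toy example. This is not UCLUST itself: the identity function below simply counts matching positions without aligning the sequences, which is a strong simplifying assumption, and all names are hypothetical.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions, without alignment (an assumption)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(sequences, threshold=0.9):
    """Assign each sequence to the first centroid it matches, else make it a new centroid."""
    clusters = {}  # centroid sequence -> list of member sequences
    for seq in sequences:
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append(seq)
                break
        else:
            clusters[seq] = [seq]  # no centroid was close enough
    return clusters


toy_sequences = ["ACGTACGTAC", "ACGTACGTAT", "TTTTTTTTTT", "ACGTACGTAC"]
clusters = greedy_cluster(toy_sequences)
for centroid, members in clusters.items():
    print(centroid, len(members))
```

Because the procedure is greedy, the result depends on the order in which the sequences are processed; the first sequence always becomes a centroid.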

The centroids and the singletons (unclustered sequences) are then searched using

BLAT[39] against the M5NR protein database. M5NR is a non-redundant protein

database which incorporates data from GO[40], KEGG[36][37], NCBI[38][39],

SEED[40][41], UniProt[47], VBI[48] and eggNOG[49], and has almost 16,000,000

sequences. BLAT builds an index of the database and then scans linearly through the

query sequence, whereas BLAST builds an index of the query sequence and then

scans linearly through the database. This makes BLAT faster, since it does not have to scan

through a database of gigabases of sequence but only through a relatively short query

sequence. BLAT, however, loses to BLAST in terms of sensitivity, since it needs an

exact or nearly exact match to find a hit, making it suitable mostly for closely related

species. The alignment identified between 25,261 (sample 2) and 50,816 (sample 1)

protein features in the contigs, and from 4,859,593 (sample 2) to 10,890,942 (sample

6) in the reads, which proved to be correlated at 98% with the number of dereplicated

reads, using Pearson’s coefficient:

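\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \]

where \(x_i\) and \(y_i\) are the number of dereplicated reads and the number of protein features in sample \(i\), and \(\bar{x}\) and \(\bar{y}\) are the averages of the number of dereplicated reads and the number of

protein features over the \(n\) samples, respectively.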

The results of the search against the M5NR database were retrieved for each of the

samples, at 90% identity, to map against the metabolic pathway maps based on

KEGG data, using KEGG Mapper[41] [42] and iPath[50] [51].


Besides being the input for the Dereplication step, the filtered sequences are

pre-screened to identify ribosomal sequences at 70% identity, and then they are clustered

using UCLUST at 97% identity. The clusters are then searched for similarity against the

M5RNA database (Greengenes[52], SILVA[53] and RDP[54]), using BLAT[39]. This

alignment identified between 36 (sample 2) and 72 (sample 1) rRNA features in the

contigs, whilst in the reads the number ranged from 19,014 (sample 2) to 38,639

(sample 1).

MG-RAST also automatically calculated the alpha diversity of each sample, to

summarize the distribution of species-level annotations in that sample, using the

following equation:

\[ \alpha = 10^{-\sum_{i=1}^{m} p_i \log_{10} p_i} \]

where \(p_i\) is the ratio of the number of annotations for species \(i\) to the total number of

annotations and \(m\) is the total number of different species annotations, using all the

annotation source databases incorporated by MG-RAST[27].

Based on the abundances of each species in each sample (using the reads), the R

package vegan[55] was used to calculate the beta diversity, as suggested in the

manual[56]. It was therefore calculated pairwise between samples, using the

Sørensen index of dissimilarity:

\[ \beta_{sor} = \frac{b + c}{2a + b + c} \]

where \(a\) is the number of species shared by the two samples, and \(b\) and \(c\) are the

numbers of species unique to each sample; as well as the widely known Whittaker's

species turnover:

\[ \beta_w = \frac{\gamma}{\bar{\alpha}} - 1 \]

where \(\gamma\) is the total number of species in the collection of samples (gamma diversity),

and \(\bar{\alpha}\) is the average richness per sample. Subtraction of one guarantees that \(\beta_w = 0\)

means that there are no excess species or no heterogeneity between samples.
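Both β-diversity measures can be illustrated with presence/absence data. The analysis itself was done with the R package vegan; the Python sketch below is only a stand-in that follows the definitions given above (a, b, c, γ and mean richness), with hypothetical function names and toy species sets.

```python
def sorensen_dissimilarity(sample_x: set, sample_y: set) -> float:
    """(b + c) / (2a + b + c), with a shared species and b, c unique to each sample."""
    a = len(sample_x & sample_y)
    b = len(sample_x - sample_y)
    c = len(sample_y - sample_x)
    return (b + c) / (2 * a + b + c)


def whittaker_turnover(samples: list) -> float:
    """gamma / mean alpha - 1, using species richness of each sample as alpha."""
    gamma = len(set().union(*samples))
    mean_alpha = sum(len(s) for s in samples) / len(samples)
    return gamma / mean_alpha - 1


x = {"sp_A", "sp_B", "sp_C"}
y = {"sp_B", "sp_C", "sp_D"}
print(sorensen_dissimilarity(x, y), whittaker_turnover([x, y]))
```

For identical samples both measures are zero; for a pair of samples, Whittaker's turnover coincides with the Sørensen dissimilarity.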

Rarefaction curves were also automatically generated. The idea behind them is to

repeatedly re-sample the pool of reads at random, plotting the average number of

species represented by 1, 2, …, N reads[57].
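The re-sampling idea behind a rarefaction curve can be sketched as follows. This is a minimal illustration, not the MG-RAST procedure: the number of repetitions, the sampling depths and the toy community are all hypothetical choices.

```python
import random


def rarefaction_curve(read_labels, depths, n_rep=100, seed=0):
    """Average number of distinct species seen in random subsamples of each depth."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    curve = []
    for depth in depths:
        total = 0
        for _ in range(n_rep):
            subsample = rng.sample(read_labels, depth)  # sampling without replacement
            total += len(set(subsample))
        curve.append(total / n_rep)
    return curve


# Toy community: each read is labelled with the species it was assigned to.
reads = ["sp1"] * 50 + ["sp2"] * 30 + ["sp3"] * 15 + ["sp4"] * 5
depths = [1, 10, 50, 100]
curve = rarefaction_curve(reads, depths)
print(list(zip(depths, curve)))
```

A curve that is still rising at the full sampling depth indicates that additional reads would probably reveal more species, whereas a plateau suggests that most of the annotated richness has been captured.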


Krona[58] was used to view the percentage of reads with predicted proteins and

ribosomal RNA genes annotated based on all the databases.

Results


3. Results

The reads and contigs submitted to MG-RAST were automatically assigned

unique IDs, as indicated in Table 4.

Table 4 - IDs of the reads and contigs submitted to MG-RAST

Sample   Reads       Contigs
1        4525786.3   4518922.3
2        4525785.3   4518923.3
3        4525784.3   4518924.3
5        4525781.3   4518925.3
6        4525782.3   4518926.3
7        4525783.3   4518927.3
8        4525787.3   4518928.3

To compare the abundances among the samples, the results were extracted from the

reads, whereas to assess presence or absence of a defined feature, the contigs’

results were retrieved.

3.1. Taxonomic Hits Distribution

Extracting the best hit classification from the reads compared to M5NR using a

maximum e-value of 1e-5, a minimum identity of 90%, and a minimum alignment length

of 15 aa, it is clear that Bacteria, and more specifically Proteobacteria, largely dominate

in all the 7 samples (Figure 6 and Figure 20).

In terms of class, Betaproteobacteria seems to comprise 78% of Proteobacteria in

Sample 1, unlike the other samples, where Gammaproteobacteria seems to be the

dominant class (Figure 7). Sample 3 shows a larger representation of

Alphaproteobacteria compared to the other samples.

Most of the Gammaproteobacteria in sample 1 belong to Pseudoalteromonas and in sample 2 to

Pseudomonas, whereas from sample 3 to sample 8 other genera, namely

Marinobacter, become just as dominant (see Figure 21 to Figure 27).


Figure 6 – Taxonomic distribution of the reads at the domain level

Figure 7 - Taxonomic distribution of the reads at the class level (Proteobacteria)

In terms of α-diversity, calculated using the reads against all the annotation databases

used by MG-RAST, sample 1 shows the highest: 430.83 species. The other samples

have diversities between 184.10 species (sample 6) and 252.14 species (sample 7).

The values of α-diversity for all the samples are shown in Table 5.


Table 5 - α-diversity

Sample     α-diversity (species)
Sample 1   430.83
Sample 2   213.47
Sample 3   232.97
Sample 5   210.42
Sample 6   184.10
Sample 7   252.14
Sample 8   240.39

The β-diversity value, using Whittaker's species turnover, was 1.181461, and the

pairwise comparisons are shown in Table 6 and Figure 8.

Table 6 – Pairwise β-diversity

Sample   Sample 1   Sample 2   Sample 3   Sample 5   Sample 6   Sample 7
2        0.422489
3        0.353043   0.319049
5        0.382264   0.283298   0.292187
6        0.364884   0.30632    0.292165   0.288654
7        0.360278   0.333708   0.314927   0.307126   0.287154
8        0.365677   0.324216   0.309876   0.306393   0.292684   0.294254

Figure 8 - β-diversity barchart


A correlation analysis of the distance between samples and their β-diversity shows no

relation between them (Figure 28).

The rarefaction curves of annotated species richness for all the samples show a quick

rise at first, and then they become flatter but without leveling off towards an asymptote

(Figure 9). This means that if there had been more reads, probably more species would

be found. Even so, these results allow a reasonable guess of the community structure.

Figure 9 – Rarefaction curve of annotated species richness

The Principal Coordinates Analysis (PCoA) of the reads of the 7 samples, with annotation

against the M5RNA database, using the Bray-Curtis measure (chosen for showing a

robust relationship with ecological distance[59]), an e-value of 1e-5 and a minimum

identity of 97%, does not show a clear trend; neither does the analysis using the M5NR database

with a minimum identity of 90% (Figure 10). See Figure 29 and Figure 30 for the

heatmaps with the same thresholds and values normalized to the size of the samples.
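The Bray-Curtis measure used for the ordination compares two samples through their abundance profiles. The sketch below shows the standard definition, the sum of absolute abundance differences divided by the total abundance of both samples; the abundance vectors are invented for illustration and the function name is hypothetical.

```python
def bray_curtis(abundances_u, abundances_v):
    """Bray-Curtis dissimilarity between two abundance vectors over the same taxa."""
    numerator = sum(abs(u - v) for u, v in zip(abundances_u, abundances_v))
    denominator = sum(u + v for u, v in zip(abundances_u, abundances_v))
    return numerator / denominator


sample_u = [10, 0, 5]  # toy counts for three taxa
sample_v = [5, 5, 5]
print(bray_curtis(sample_u, sample_v))
```

The value ranges from 0 for identical profiles to 1 for samples that share no taxa.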


Figure 10 - PCoA using the M5RNA database (left) and the M5NR database (right)

Nevertheless, when the samples are compared with metagenomes from 1) the gut microbiota of 91 pregnant women of varying prepregnancy BMIs and gestational diabetes status and their infants (http://metagenomics.anl.gov/linkin.cgi?project=265), and 2) activated sludge from 2 full-scale tannery wastewater treatment plants (http://metagenomics.anl.gov/linkin.cgi?project=922), the Mariana Trench samples clearly form a distinct group of their own. As the gut and the activated sludge environments are expected to be very different and quite different, respectively, from the deep-sea Mariana Trench samples, this is a good indicator of the reliability of the latter. See, for example, Figure 11 for a comparison against the M5NR database at 90% minimum identity and a maximum e-value of 1e-5.

Figure 11 - PCoA of the reads against the M5NR database. Red - Mariana Trench; Blue - Activated Sludge; Green – Gut Microbiota

3.2. Functional Category Hits Distribution

Comparing the features annotated from the reads with those annotated from the contigs, the contigs clearly provide a much more reliable source for annotation, as seen from the range of e-values, which was expected. See, for example, sample 7 in Figure 12 and Figure 13. More features were predicted from the reads, but there were also more reads than contigs.

Figure 12 - Number of features in the reads of sample 7 annotated by the different databases

Figure 13 – Number of features in the contigs of sample 7 annotated by the different databases

Moreover, taking again sample 7 as an example, only 50.7% of the predicted protein

features in the reads could be annotated with similarity to a protein of known function,

whereas 84.9% of the predicted protein features of the contigs were annotated.

Of all the databases used to compare the protein sequences generated from the contigs, SEED Subsystems[45] had the highest number of annotations (Figure 12 and Figure 13). It is worth noting, however, that each database holds a different type of annotation data, hence the different number of hits. Since the tools used to analyse the pathways (KEGG Mapper and iPath) rely on the KEGG database, the focus was put on the functional hierarchy given by KEGG Orthology (KO)[41][42].

Comparing the reads to KO, using a maximum e-value of 1e-5, a minimum identity of

90%, and a minimum alignment length of 15, on average 53% (±0.03) of the reads with

predicted protein functions were annotated as belonging to the Metabolism category.

Of those, 14% (±0.05) belong to Energy metabolism.

In sample 1, roughly 100% of the Energy metabolism reads correspond to oxidative phosphorylation; in the rest of the samples, this value lies around 77% (±0.07).
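The thresholds quoted above (maximum e-value of 1e-5, minimum identity of 90%, minimum alignment length of 15) and the resulting category percentages can be illustrated with a small sketch. The `Hit` record, its field names and the example hits are hypothetical and do not reflect the actual format of the annotation output; only the three cutoffs come from the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Hit:
    read_id: str
    category: str     # top-level functional category of the matched KO
    evalue: float
    identity: float   # percent identity of the alignment
    aln_length: int   # alignment length


def passes(hit, max_evalue=1e-5, min_identity=90.0, min_length=15):
    """Apply the three cutoffs quoted in the text."""
    return (
        hit.evalue <= max_evalue
        and hit.identity >= min_identity
        and hit.aln_length >= min_length
    )


def category_fractions(hits):
    """Fraction of retained hits falling into each functional category."""
    counts = Counter(h.category for h in hits if passes(h))
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()} if total else {}


# Hypothetical hits, chosen so that each cutoff rejects one of them.
hits = [
    Hit("r1", "Metabolism", 1e-8, 95.0, 30),
    Hit("r2", "Metabolism", 1e-6, 91.0, 20),
    Hit("r3", "Genetic Information Processing", 1e-7, 92.0, 25),
    Hit("r4", "Metabolism", 1e-3, 99.0, 40),                          # e-value too high
    Hit("r5", "Metabolism", 1e-9, 85.0, 40),                          # identity too low
    Hit("r6", "Environmental Information Processing", 1e-10, 97.0, 12),  # too short
    Hit("r7", "Environmental Information Processing", 1e-6, 90.0, 15),   # on the cutoffs
]

fractions = category_fractions(hits)
for category, fraction in sorted(fractions.items()):
    print(f"{category}: {fraction:.0%}")
```

Applying the same function to the hits of each sample separately would give the per-sample percentages from which an average and spread, such as 53% (±0.03), can be derived.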

In fact, the F-type H+-transporting ATPase subunit beta (K02112), involved in both oxidative phosphorylation (Figure 14) and photosynthesis (Figure 15), is the second most abundant hit in sample 1 (out of 54 hits), with an average identity of 91.06% and an average e-value exponent of -6.14.

Figure 14 - Oxidative Phosphorylation, pathway ko00190.

In sample 2, K02112 appears in 11th place (out of 239 hits) with an abundance of 9187, together with F-type H+-transporting ATPase subunit alpha (K02111) in 10th place with an abundance of 9307.

In sample 3, K02112 has an abundance of 9513 and K02111 of 9758, ranking 8th and 6th, respectively, when sorted by abundance. For sample 5 the values are 13405 for K02112 and 12764 for K02111 (10th and 12th). Sample 6 has even higher abundances for K02112 and K02111: 16632 and 16260 (8th and 9th most abundant). In samples 7 and 8 they appear in 5th and 6th place, out of 108 and 115 hits, with abundances of 11492 and 11257, and 10691 and 10294. In all samples from the second to the seventh, these subunits have an average identity above 91.5%.

Figure 15 – Photosynthesis, pathway ko00195.

Using the contigs, with the same settings, only K02112 was found, and only in samples 2 and 8. However, the average alignment length of the hits was 356.55 and 332.22, respectively, whereas for the reads it was 27.67 and 27.57. Nevertheless, other hits also classified as belonging to Oxidative Phosphorylation were found, such as NADH-quinone oxidoreductase subunit (K13380 and K13378), NADH-quinone oxidoreductase subunits (K00338 and K00340), F-type H+-transporting ATPase subunit c (K02110), V-type H+-transporting ATPase subunits (K02118 and K02122), cytochrome c oxidase assembly protein subunit 17 (K02260), nucleosome-remodeling factor 38 kDa subunit (K11726), cytochrome o ubiquinol oxidase subunit III (K02299), cytochrome o ubiquinol oxidase operon protein cyoD (K02300) and NAD(P)H-quinone oxidoreductase subunit 5 (K05577).

To address, with some degree of confidence, whether alternative energy metabolism

processes occur in any of the samples, the contigs results were further explored.

Indeed, all samples contained contigs involved in Methane Metabolism (Figure 16).

Figure 16 - Methane metabolism, pathway ko00680. In red the enzymes found in the samples.

In addition, contigs from samples 2, 5 and 8 matched hits from nitrogen metabolism (Figure 17). In all three samples, nitric oxide reductase subunit B (K04561) (EC:1.7.2.5), which is involved in denitrification (nitrate → nitrogen), was present.

Sample 2 also had a nitrogenase iron protein NifH (K02588) (EC:1.18.6.1), a

nitrogenase molybdenum-cofactor synthesis protein NifE (K02587) and a nitrogen

fixation protein NifX (K02596).

Figure 17 - Nitrogen metabolism, pathway ko00910. In red the enzymes found in samples 2, 5 and 8.

Finally, the map generated with iPath (Figure 18) gives a general overview of the pathways present when combining all samples. It is worth noting that photosynthesis appears mapped; however, this is most likely a misleading mapping, since the enzyme identified is an F-type H+-transporting ATPase, which is involved in photosynthesis but also in oxidative phosphorylation, as mentioned earlier.

Figure 18 - Metabolic map of the seven samples
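The ambiguity described above can be checked mechanically by separating KOs that map to a single pathway from KOs shared between several. The following is a minimal sketch under stated assumptions: the mapping table is a small hand-written excerpt (K02112 mapping to both ko00190 and ko00195 is taken from the text; K00338 and K02299 are listed only under oxidative phosphorylation, as in the text), the function names are hypothetical, and "specific" only means specific within this supplied table, not within the full KEGG database.

```python
def pathway_support(observed_kos, ko_to_pathways):
    """Group observed KOs per pathway into pathway-specific and shared hits."""
    support = {}
    for ko in observed_kos:
        pathways = ko_to_pathways.get(ko, set())
        kind = "specific" if len(pathways) == 1 else "shared"
        for pathway in pathways:
            entry = support.setdefault(pathway, {"specific": set(), "shared": set()})
            entry[kind].add(ko)
    return support


def ambiguous_pathways(support):
    """Pathways that are backed only by KOs shared with other pathways."""
    return sorted(p for p, entry in support.items() if not entry["specific"])


# Excerpt of a KO-to-pathway table (assumed for illustration, see lead-in).
ko_to_pathways = {
    "K02112": {"ko00190", "ko00195"},  # F-type ATPase subunit beta
    "K00338": {"ko00190"},             # NADH-quinone oxidoreductase subunit
    "K02299": {"ko00190"},             # cytochrome o ubiquinol oxidase subunit III
}

observed = ["K02112", "K00338", "K02299"]
support = pathway_support(observed, ko_to_pathways)
flagged = ambiguous_pathways(support)
print("pathways supported only by shared KOs:", flagged)
```

In this toy example photosynthesis (ko00195) is flagged because its only evidence is the shared ATPase subunit, whereas oxidative phosphorylation (ko00190) is also backed by pathway-specific hits.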

4. Discussion

Marine sediments, and in particular hadal trenches, receive substantial deposition of

microbes and organic matter from the upper water layer[1], and provide a matrix of

complex nutrients and solid surfaces for microbial growth[60]. However, the low

temperature and the extreme hydrostatic pressure demand a certain degree of

adaptation from the organisms inhabiting such an environment. Even so, there seems

to be a fairly high diversity along the sediment depth, as seen in Table 5 and Figure 9.

Proteobacteria is the largest and most metabolically diverse group of Bacteria. All are gram-negative, and they divide into 5 classes: alpha, beta, gamma, delta and epsilon[61]. The dominance of Gammaproteobacteria is in accordance with a study from the Pacific Arctic Ocean, where the temperatures are also very low[62], and partly with the study of sediments at 4000m depth in the Pacific Ocean, where not only Gammaproteobacteria but also Alphaproteobacteria dominate the community[63]. Intriguingly, the outer layer of an actively venting black-smoker chimney from a hydrothermal vent field on the Juan de Fuca Ridge[64] is also dominated by Gammaproteobacteria, even though its temperature lies above 310ºC.

The PCoA graphs group together samples that exhibit similar abundance profiles, in terms of taxonomy or function. However, when comparing the seven samples, there is no obvious trend in the community along the depth of the sediment (Figure 10). Nevertheless, the fact that this project's samples group together, and very distinctly from other projects' samples, is a good indicator that this environment has its own community structure.

The poor correlation between β-diversity and distance between samples is consistent with the PCoA results (Table 6 and Figure 28). This means that the difference in microbial community composition (as defined in [65]) is most likely due to factors other than depth. It is possible that, under such high pressure, a few centimeters of sediment do not really make a difference in the community structure. Alternatively, there might have been some mixing of the communities during the sampling process.

It should be noted, however, that the absence of a shift in the community as a whole along the depth of the sediment does not exclude the hypothesis that some taxa correlate with it.

Regarding the decision to assemble, the range of e-values of the features annotated with the different databases, as well as the percentage of predicted protein features that were annotated, should provide some degree of confidence in the assembly.

The high number of hits to the oxidative phosphorylation pathway supports the predictions from [1] that there is intensified O2 consumption within the sediment, unlike in the sediment of the reference site (≈6000m of water depth), where the microbial activity has reduced rates. This was supported by measurements of the O2 concentration throughout the depth of the sediment. Attenuation in the O2 concentration reflects higher rates of its consumption[1] (Figure 19), which is consistent with the presence of genes involved in aerobic respiration in all the samples.

Figure 19 – Oxygen micro-profiles at 6,018 m water depth (a); and at Challenger Deep (b) [1].

Even though oxidative phosphorylation dominates the energy metabolism processes,

methane and nitrogen metabolism still play a part in the community’s energetic

potential.

Normally, methanogenesis is associated with anoxic environments; still, it is known that

even in oxic environments, anoxic microenvironments can form, where

methanogenesis takes place[61].

Once more, the predictions that there is intensified mineralization mediated by the

prokaryotic community at Challenger Deep[1] are supported by the contigs with

homology to features involved in nitrogen metabolism.

Finally, the misleading mapping of the ATPase (Figure 18) should be taken as an example that care and critical judgment are fundamental when using automated tools.

5. Conclusion

This study was a first description of both the community structure and its functional potential in the Mariana Trench, an environment unique for its extreme conditions. The amount of data generated made it prohibitive to describe in full. Energy metabolism was selected for this thesis, since it was interesting to compare with the results from [1]. The finding that there are enzymes involved in the oxidative phosphorylation pathway in all 7 samples supports the published measurements of oxygen consumption throughout the sediment.

A taxonomic and/or functional gradient along the depth of the sediment was expected, but none seems to be present. Further investigation of this matter would help to determine whether there are any signature taxa of the depth.

The data used in the study will soon be publicly available on MG-RAST, and therefore accessible for additional investigation. However, in the future, it would be sensible to sample with true replicates, and to take a broader range of environmental measurements, to make the data more comparable to other studies. It would also be interesting to take samples from sediments from other depths along the Challenger Deep, to assess whether the community uniqueness is due to the extreme depth or to the overall conditions at that site.

To conclude, it is probable that in 10 years' time, with the development of new tools or the improvement of the existing ones, all of these results will be proved inaccurate. However, the aim of this thesis was neither to develop new tools nor to compare the existing ones, but to use them wisely and understand their purpose for this analysis. Hence, the argument of this project is that with this set of tools, this is the product.

References

[1] R. N. Glud, F. Wenzhöfer, M. Middelboe, K. Oguri, R. Turnewitsch, D. E. Canfield, and H. Kitazato, “High rates of microbial carbon turnover in sediments in the deepest oceanic trench on earth,” Nature Geoscience, vol. 6, no. 4, pp. 284–288, Mar. 2013.

[2] R. Pawlowicz, “Key physical variables in the ocean: temperature, salinity, and density,” Nature Education Knowledge, vol. 4, no. 4, p. 13, 2013.

[3] “The international system of units.” Bureau International des Poids et Mesures, 2006.

[4] R. A. Lutz and P. G. Falkowski, “Ocean science. A dive to Challenger Deep.,” Science (New York, N.Y.), vol. 336, no. 6079, pp. 301–2, Apr. 2012.

[5] J. Handelsman, M. R. Rondon, S. F. Brady, J. Clardy, and R. M. Goodman, “Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products,” Chemistry & Biology, vol. 5, no. 10, pp. R245–R249, Oct. 1998.

[6] Society of American Bacteriologists., Bergey’s manual of determinative bacteriology, 1st ed. Baltimore, Williams & Wilkins Co., 1923.

[7] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: The primary kingdoms,” Proceedings of the National Academy of Sciences, vol. 74, no. 11, pp. 5088–5090, Nov. 1977.

[8] P. Hugenholtz and G. W. Tyson, “Microbiology: metagenomics.,” Nature, vol. 455, no. 7212, pp. 481–3, Sep. 2008.

[9] E. M. Glass and F. Meyer, “Analysis of metagenomics data,” in Bioinformatics for High Throughput Sequencing, N. Rodríguez-Ezpeleta, M. Hackenberg, and A. M. Aransay, Eds. New York, NY: Springer New York, 2012, pp. 219–229.

[10] J. Handelsman, “Metagenomics: application of genomics to uncultured microorganisms.,” Microbiology and molecular biology reviews : MMBR, vol. 68, no. 4, pp. 669–85, Dec. 2004.

[11] X. Hao and T. Chen, “OTU analysis using metagenomic shotgun sequencing data,” PLoS ONE, vol. 7, no. 11, p. e49785, Nov. 2012.

[12] V. Iverson, R. M. Morris, C. D. Frazar, C. T. Berthiaume, R. L. Morales, and E. V. Armbrust, “Untangling genomes from metagenomes: revealing an uncultured class of marine Euryarchaeota.,” Science (New York, N.Y.), vol. 335, no. 6068, pp. 587–90, Feb. 2012.

[13] T. Thomas, J. Gilbert, and F. Meyer, “Metagenomics - a guide from sampling to data analysis.,” Microbial informatics and experimentation, vol. 2, no. 1, p. 3, Jan. 2012.

[14] J. C. Wooley, A. Godzik, and I. Friedberg, “A primer on metagenomics.,” PLoS computational biology, vol. 6, no. 2, p. e1000667, Feb. 2010.

[15] N. Whiteford, N. Haslam, G. Weber, A. Prügel-Bennett, J. W. Essex, P. L. Roach, M. Bradley, and C. Neylon, “An analysis of the feasibility of short read sequencing.,” Nucleic acids research, vol. 33, no. 19, p. e171, Jan. 2005.

[16] D. R. Bentley, S. Balasubramanian, H. P. Swerdlow, G. P. Smith, J. Milton, C. G. Brown, K. P. Hall, D. J. Evers, C. L. Barnes, H. R. Bignell, J. M. Boutell, J. Bryant, R. J. Carter, R. Keira Cheetham, A. J. Cox, D. J. Ellis, M. R. Flatbush, N. A. Gormley, S. J. Humphray, L. J. Irving, M. S. Karbelashvili, S. M. Kirk, H. Li, X. Liu, K. S. Maisinger, L. J. Murray, B. Obradovic, T. Ost, M. L. Parkinson, M. R. Pratt, I. M. J. Rasolonjatovo, M. T. Reed, R. Rigatti, C. Rodighiero, M. T. Ross, A. Sabot, S. V Sankar, A. Scally, G. P. Schroth, M. E. Smith, V. P. Smith, A. Spiridou, P. E. Torrance, S. S. Tzonev, E. H. Vermaas, K. Walter, X. Wu, L. Zhang, M. D. Alam, C. Anastasi, I. C. Aniebo, D. M. D. Bailey, I. R. Bancarz, S. Banerjee, S. G. Barbour, P. A. Baybayan, V. A. Benoit, K. F. Benson, C. Bevis, P. J. Black, A. Boodhun, J. S. Brennan, J. A. Bridgham, R. C. Brown, A. A. Brown, D. H. Buermann, A. A. Bundu, J. C. Burrows, N. P. Carter, N. Castillo, M. Chiara E Catenazzi, S. Chang, R. Neil Cooley, N. R. Crake, O. O. Dada, K. D. Diakoumakos, B. Dominguez-Fernandez, D. J. Earnshaw, U. C. Egbujor, D. W. Elmore, S. S. Etchin, M. R. Ewan, M. Fedurco, L. J. Fraser, K. V Fuentes Fajardo, W. Scott Furey, D. George, K. J. Gietzen, C. P. Goddard, G. S. Golda, P. A. Granieri, D. E. Green, D. L. Gustafson, N. F. Hansen, K. Harnish, C. D. Haudenschild, N. I. Heyer, M. M. Hims, J. T. Ho, A. M. Horgan, K. Hoschler, S. Hurwitz, D. V Ivanov, M. Q. Johnson, T. James, T. A. Huw Jones, G.-D. Kang, T. H. Kerelska, A. D. Kersey, I. Khrebtukova, A. P. Kindwall, Z. Kingsbury, P. I. Kokko-Gonzales, A. Kumar, M. A. Laurent, C. T. Lawley, S. E. Lee, X. Lee, A. K. Liao, J. A. Loch, M. Lok, S. Luo, R. M. Mammen, J. W. Martin, P. G. McCauley, P. McNitt, P. Mehta, K. W. Moon, J. W. Mullens, T. Newington, Z. Ning, B. Ling Ng, S. M. Novo, M. J. O’Neill, M. A. Osborne, A. Osnowski, O. Ostadan, L. L. Paraschos, L. Pickering, A. C. Pike, A. C. Pike, D. Chris Pinkard, D. P. Pliskin, J. Podhasky, V. J. 
Quijano, C. Raczy, V. H. Rae, S. R. Rawlings, A. Chiva Rodriguez, P. M. Roe, J. Rogers, M. C. Rogert Bacigalupo, N. Romanov, A. Romieu, R. K. Roth, N. J. Rourke, S. T. Ruediger, E. Rusman, R. M. Sanches-Kuiper, M. R. Schenker, J. M. Seoane, R. J. Shaw, M. K. Shiver, S. W. Short, N. L. Sizto, J. P. Sluis, M. A. Smith, J. Ernest Sohna Sohna, E. J. Spence, K. Stevens, N. Sutton, L. Szajkowski, C. L. Tregidgo, G. Turcatti, S. Vandevondele, Y. Verhovsky, S. M. Virk, S. Wakelin, G. C. Walcott, J. Wang, G. J. Worsley, J. Yan, L. Yau, M. Zuerlein, J. Rogers, J. C. Mullikin, M. E. Hurles, N. J. McCooke, J. S. West, F. L. Oaks, P. L. Lundberg, D. Klenerman, R. Durbin, and A. J. Smith, “Accurate whole human genome sequencing using reversible terminator chemistry.,” Nature, vol. 456, no. 7218, pp. 53–9, Nov. 2008.

[17] W. Zhang, J. Chen, Y. Yang, Y. Tang, J. Shang, and B. Shen, “A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies.,” PloS one, vol. 6, no. 3, p. e17915, Jan. 2011.

[18] H. Teeling and F. O. Glöckner, “Current opportunities and challenges in microbial metagenome analysis--a bioinformatic perspective.,” Briefings in bioinformatics, Sep. 2012.

[19] M. Baker, “De novo genome assembly: what every biologist should know,” Nature Methods, vol. 9, no. 4, pp. 333–337, Mar. 2012.

[20] P. Yilmaz, R. Kottmann, D. Field, R. Knight, J. R. Cole, L. Amaral-Zettler, J. A. Gilbert, I. Karsch-Mizrachi, A. Johnston, G. Cochrane, R. Vaughan, C. Hunter, J. Park, N. Morrison, P. Rocca-Serra, P. Sterk, M. Arumugam, M. Bailey, L. Baumgartner, B. W. Birren, M. J. Blaser, V. Bonazzi, T. Booth, P. Bork, F. D. Bushman, P. L. Buttigieg, P. S. G. Chain, E. Charlson, E. K. Costello, H. Huot-Creasy, P. Dawyndt, T. DeSantis, N. Fierer, J. A. Fuhrman, R. E. Gallery, D. Gevers, R. A. Gibbs, I. San Gil, A. Gonzalez, J. I. Gordon, R. Guralnick, W. Hankeln, S. Highlander, P. Hugenholtz, J. Jansson, A. L. Kau, S. T. Kelley, J. Kennedy, D. Knights, O. Koren, J. Kuczynski, N. Kyrpides, R. Larsen, C. L. Lauber, T. Legg, R. E. Ley, C. A. Lozupone, W. Ludwig, D. Lyons, E. Maguire, B. A. Methé, F. Meyer, B. Muegge, S. Nakielny, K. E. Nelson, D. Nemergut, J. D. Neufeld, L. K. Newbold, A. E. Oliver, N. R. Pace, G. Palanisamy, J. Peplies, J. Petrosino, L. Proctor, E. Pruesse, C. Quast, J. Raes, S. Ratnasingham, J. Ravel, D. A. Relman, S. Assunta-Sansone, P. D. Schloss, L. Schriml, R. Sinha, M. I. Smith, E. Sodergren, A. Spo, J. Stombaugh, J. M. Tiedje, D. V Ward, G. M. Weinstock, D. Wendel, O. White, A. Whiteley, A. Wilke, J. R. Wortman, T. Yatsunenko, and F. O. Glöckner, “Minimum

information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications.,” Nature biotechnology, vol. 29, no. 5, pp. 415–20, May 2011.

[21] “Web of Knowledge.” [Online]. Available: www.webofknowledge.com.

[22] J. L. Fox, “Natural-born eaters.,” Nature biotechnology, vol. 29, no. 2, pp. 103–6, Feb. 2011.

[23] P. Lorenz and J. Eck, “Metagenomics and industrial applications.,” Nature reviews. Microbiology, vol. 3, no. 6, pp. 510–6, Jun. 2005.

[24] “www.biopieces.org.”

[25] Y. Peng, H. Leung, S. Yiu, and F. Chin, “IDBA – a practical iterative de Bruijn graph de novo assembler,” in 14th RECOMB 2010, 2010, pp. 426–440.

[26] Y. Peng, H. C. M. Leung, S. M. Yiu, and F. Y. L. Chin, “IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth.,” Bioinformatics (Oxford, England), vol. 28, no. 11, pp. 1420–8, Jun. 2012.

[27] F. Meyer, D. Paarmann, M. D’Souza, R. Olson, E. M. Glass, M. Kubal, T. Paczian, A. Rodriguez, R. Stevens, A. Wilke, J. Wilkening, and R. A. Edwards, “The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes.,” BMC bioinformatics, vol. 9, no. 1, p. 386, Jan. 2008.

[28] M. Kircher, U. Stenzel, and J. Kelso, “Improved base calling for the Illumina Genome Analyzer using machine learning strategies.,” Genome biology, vol. 10, no. 8, p. R83, Jan. 2009.

[29] O. Tange, “GNU Parallel: the command-line power tool,” ;login: The USENIX Magazine, pp. 42–47, 2011.

[30] E. W. Weisstein, “Eulerian Cycle -- from Wolfram MathWorld.” Wolfram Research, Inc.

[31] P. E. C. Compeau, P. A. Pevzner, and G. Tesler, “How to apply de Bruijn graphs to genome assembly.,” Nature biotechnology, vol. 29, no. 11, pp. 987–91, Nov. 2011.

[32] E. W. Weisstein, “Hamiltonian Cycle -- from Wolfram MathWorld.” Wolfram Research, Inc.

[33] E. W. Weisstein, “NP-Complete Problem -- from Wolfram MathWorld.” Wolfram Research, Inc.

[34] E. W. Weisstein, “Eulerian Path -- from Wolfram MathWorld.” Wolfram Research, Inc.

[35] J. R. Miller, S. Koren, and G. Sutton, “Assembly algorithms for next-generation sequencing data.,” Genomics, vol. 95, no. 6, pp. 315–27, Jun. 2010.

[36] M. Rho, H. Tang, and Y. Ye, “FragGeneScan: predicting genes in short and error-prone reads.,” Nucleic acids research, vol. 38, no. 20, p. e191, Nov. 2010.

[37] J. G. Caporaso, J. Kuczynski, J. Stombaugh, K. Bittinger, F. D. Bushman, E. K. Costello, N. Fierer, A. G. Peña, J. K. Goodrich, J. I. Gordon, G. A. Huttley, S. T. Kelley, D. Knights, J. E. Koenig, R. E. Ley, C. A. Lozupone, D. McDonald, B. D. Muegge, M. Pirrung, J. Reeder, J. R. Sevinsky, P. J. Turnbaugh, W. A. Walters, J. Widmann, T.

Yatsunenko, J. Zaneveld, and R. Knight, “QIIME allows analysis of high-throughput community sequencing data.,” Nature methods, vol. 7, no. 5, pp. 335–6, May 2010.

[38] R. C. Edgar, “Search and clustering orders of magnitude faster than BLAST.,” Bioinformatics (Oxford, England), vol. 26, no. 19, pp. 2460–1, Oct. 2010.

[39] W. J. Kent, “BLAT--the BLAST-like alignment tool.,” Genome research, vol. 12, no. 4, pp. 656–64, Apr. 2002.

[40] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock, “Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.,” Nature genetics, vol. 25, no. 1, pp. 25–9, May 2000.

[41] M. Kanehisa and S. Goto, “KEGG: Kyoto encyclopedia of genes and genomes.,” Nucleic acids research, vol. 28, no. 1, pp. 27–30, Jan. 2000.

[42] M. Kanehisa, S. Goto, Y. Sato, M. Furumichi, and M. Tanabe, “KEGG for integration and interpretation of large-scale molecular data sets.,” Nucleic acids research, vol. 40, no. Database issue, pp. D109–14, Jan. 2012.

[43] E. W. Sayers, T. Barrett, D. A. Benson, S. H. Bryant, K. Canese, V. Chetvernin, D. M. Church, M. DiCuccio, R. Edgar, S. Federhen, M. Feolo, L. Y. Geer, W. Helmberg, Y. Kapustin, D. Landsman, D. J. Lipman, T. L. Madden, D. R. Maglott, V. Miller, I. Mizrachi, J. Ostell, K. D. Pruitt, G. D. Schuler, E. Sequeira, S. T. Sherry, M. Shumway, K. Sirotkin, A. Souvorov, G. Starchenko, T. A. Tatusova, L. Wagner, E. Yaschenko, and J. Ye, “Database resources of the National Center for Biotechnology Information.,” Nucleic acids research, vol. 37, no. Database issue, pp. D5–15, Jan. 2009.

[44] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers, “GenBank.,” Nucleic acids research, vol. 37, no. Database issue, pp. D26–31, Jan. 2009.

[45] R. Overbeek, T. Begley, R. M. Butler, J. V Choudhuri, H.-Y. Chuang, M. Cohoon, V. de Crécy-Lagard, N. Diaz, T. Disz, R. Edwards, M. Fonstein, E. D. Frank, S. Gerdes, E. M. Glass, A. Goesmann, A. Hanson, D. Iwata-Reuyl, R. Jensen, N. Jamshidi, L. Krause, M. Kubal, N. Larsen, B. Linke, A. C. McHardy, F. Meyer, H. Neuweger, G. Olsen, R. Olson, A. Osterman, V. Portnoy, G. D. Pusch, D. A. Rodionov, C. Rückert, J. Steiner, R. Stevens, I. Thiele, O. Vassieva, Y. Ye, O. Zagnitko, and V. Vonstein, “The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes.,” Nucleic acids research, vol. 33, no. 17, pp. 5691–702, Jan. 2005.

[46] R. K. Aziz, D. Bartels, A. A. Best, M. DeJongh, T. Disz, R. A. Edwards, K. Formsma, S. Gerdes, E. M. Glass, M. Kubal, F. Meyer, G. J. Olsen, R. Olson, A. L. Osterman, R. A. Overbeek, L. K. McNeil, D. Paarmann, T. Paczian, B. Parrello, G. D. Pusch, C. Reich, R. Stevens, O. Vassieva, V. Vonstein, A. Wilke, and O. Zagnitko, “The RAST Server: rapid annotations using subsystems technology.,” BMC genomics, vol. 9, p. 75, Jan. 2008.

[47] The UniProt Consortium, “Reorganizing the protein space at the Universal Protein Resource (UniProt).,” Nucleic acids research, vol. 40, no. Database issue, pp. D71–5, Jan. 2012.

[48] J. J. Gillespie, A. R. Wattam, S. A. Cammer, J. L. Gabbard, M. P. Shukla, O. Dalay, T. Driscoll, D. Hix, S. P. Mane, C. Mao, E. K. Nordberg, M. Scott, J. R. Schulman, E. E. Snyder, D. E. Sullivan, C. Wang, A. Warren, K. P. Williams, T. Xue, H. S. Yoo, C. Zhang, Y. Zhang, R. Will, R. W. Kenyon, and B. W. Sobral, “PATRIC: the comprehensive bacterial bioinformatics resource with a focus on human pathogenic species.,” Infection and immunity, vol. 79, no. 11, pp. 4286–98, Nov. 2011.

[49] S. Powell, D. Szklarczyk, K. Trachana, A. Roth, M. Kuhn, J. Muller, R. Arnold, T. Rattei, I. Letunic, T. Doerks, L. J. Jensen, C. von Mering, and P. Bork, “eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges.,” Nucleic acids research, vol. 40, no. Database issue, pp. D284–9, Jan. 2012.

[50] I. Letunic, T. Yamada, M. Kanehisa, and P. Bork, “iPath: interactive exploration of biochemical pathways and networks.,” Trends in biochemical sciences, vol. 33, no. 3, pp. 101–3, Mar. 2008.

[51] T. Yamada, I. Letunic, S. Okuda, M. Kanehisa, and P. Bork, “iPath2.0: interactive pathway explorer.,” Nucleic acids research, vol. 39, no. Web Server issue, pp. W412–5, Jul. 2011.

[52] T. Z. DeSantis, P. Hugenholtz, N. Larsen, M. Rojas, E. L. Brodie, K. Keller, T. Huber, D. Dalevi, P. Hu, and G. L. Andersen, “Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB.,” Applied and environmental microbiology, vol. 72, no. 7, pp. 5069–72, Jul. 2006.

[53] C. Quast, E. Pruesse, P. Yilmaz, J. Gerken, T. Schweer, P. Yarza, J. Peplies, and F. O. Glöckner, “The SILVA ribosomal RNA gene database project: improved data processing and web-based tools.,” Nucleic acids research, vol. 41, no. Database issue, pp. D590–6, Jan. 2013.

[54] J. R. Cole, Q. Wang, E. Cardenas, J. Fish, B. Chai, R. J. Farris, A. S. Kulam-Syed-Mohideen, D. M. McGarrell, T. Marsh, G. M. Garrity, and J. M. Tiedje, “The Ribosomal Database Project: improved alignments and new tools for rRNA analysis.,” Nucleic acids research, vol. 37, no. Database issue, pp. D141–5, Jan. 2009.

[55] J. Oksanen, F. G. Blanchet, R. Kindt, P. Legendre, P. R. Minchin, R. B. O’Hara, G. L. Simpson, P. Solymos, M. H. H. Stevens, and H. Wagner, “vegan: Community Ecology Package. R package version 2.0-7.” 2013.

[56] J. Oksanen, “Vegan: ecological diversity.”

[57] N. J. Gotelli and R. K. Colwell, “Quantifying biodiversity: procedures and pitfalls in the measurement and comparison of species richness,” Ecology Letters, vol. 4, no. 4, pp. 379–391, Jul. 2001.

[58] B. D. Ondov, N. H. Bergman, and A. M. Phillippy, “Interactive metagenomic visualization in a web browser.,” BMC bioinformatics, vol. 12, p. 385, Jan. 2011.

[59] D. P. Faith, P. R. Minchin, and L. Belbin, “Compositional dissimilarity as a robust measure of ecological distance,” Vegetatio, vol. 69, no. 1–3, pp. 57–68, Apr. 1987.

[60] Y. Wang, H.-F. Sheng, Y. He, J.-Y. Wu, Y.-X. Jiang, N. F.-Y. Tam, and H.-W. Zhou, “Comparison of the levels of bacterial diversity in freshwater, intertidal wetland, and marine sediments by using millions of illumina tags.,” Applied and environmental microbiology, vol. 78, no. 23, pp. 8264–71, Dec. 2012.

[61] M. T. Madigan, J. M. Martinko, P. V. Dunlap, and D. P. Clark, Brock Biology of Microorganisms, 12th ed. Pearson, 2009.

[62] H. Li, Y. Yu, W. Luo, Y. Zeng, and B. Chen, “Bacterial diversity in surface sediments from the Pacific Arctic Ocean.,” Extremophiles : life under extreme conditions, vol. 13, no. 2, pp. 233–46, Mar. 2009.

[63] K. T. Konstantinidis, J. Braff, D. M. Karl, and E. F. DeLong, “Comparative metagenomic analysis of a microbial community residing at a depth of 4,000 meters at station ALOHA in the North Pacific subtropical gyre.,” Applied and environmental microbiology, vol. 75, no. 16, pp. 5345–55, Aug. 2009.

[64] W. Xie, F. Wang, L. Guo, Z. Chen, S. M. Sievert, J. Meng, G. Huang, Y. Li, Q. Yan, S. Wu, X. Wang, S. Chen, G. He, X. Xiao, and A. Xu, “Comparative metagenomics of microbial communities inhabiting deep-sea hydrothermal vent chimneys with contrasting chemistries.,” The ISME journal, vol. 5, no. 3, pp. 414–26, Mar. 2011.

[65] J. Wang, Y. Wu, H. Jiang, C. Li, H. Dong, Q. Wu, J. Soininen, and J. Shen, “High beta diversity of bacteria in the shallow terrestrial subsurface,” Environmental Microbiology, vol. 10, no. 10, pp. 2537–2549, Oct. 2008.

Appendix

Figure 20 - Taxonomic distribution of the reads from the seven samples at the phylum level

Figure 21 - Krona graph of the distribution of the reads of Gammaproteobacteria in sample 1


Figure 28 - Beta diversity related to spatial distance

[Scatter plot "Correlation between Spatial Distance and Dissimilarity": x-axis Distance (cm), 0 to 40; y-axis Dissimilarity, 0 to 0.45; linear fit y = 0.0013x + 0.3036, R² = 0.0999]

Figure 29 - Heatmap of the reads against the M5RNA database at 97% identity

Figure 30 - Heatmap of the reads against the M5NR database at 90% identity
