Bioinformatics Review - October 2015
OCTOBER 2015 VOL 1 ISSUE 1
ALFALFA:
explained! By Muniba Faiza
Computer and Drugs: What you need to know
“Gene expression signatures are
commonly used to create cancer
prognosis and diagnosis methods.
Gene expression signature is a
group of genes in a cell whose
combined expression pattern is
uniquely characteristic of a
biological phenotype or a medical
condition.”
- Charles Wins
Public Service Ad sponsored by IQLBioinformatics
Contents
October 2015
Topics: Genomics | Proteomics | Systems Biology | Data Analysis | Software | CADD

Editorial .... 5
ALFALFA: explained .... 6
Tumor progression prediction by variability based expression signatures .... 8
BioMiner & Personalized Medicine: A new perspective .... 12
Meta-analysis of biological literature: Explained .... 15
How to check new peptides accuracy in Proteogenomics .... 17
The basic concepts of genome assembly .... 19
Computer and Drugs: What you need to know .... 21
Basic Concept of Multiple Sequence Alignment .... 23
Basics of Mathematical Modelling - Part 1 .... 25
MUSCLE: Tool for Multiple Sequence Alignment .... 27
Introduction to mathematical modelling - Part 2 .... 29
DNA test for paternity: This is how you can fail! .... 32
EDITORIAL
SECTION EDITORS
ALTAF ABDUL KALAM MANISH KUMAR MISHRA
SANJAY KUMAR PRAKASH JHA NABAJIT DAS
REPRINTS AND PERMISSIONS
You must have permission before reproducing any material from Bioinformatics Review. Send e-mail requests to [email protected]. Please include your contact details in your message.
BACK ISSUES
Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com at $5 per issue. Back issues in print format cost $2 for India delivery and $11 for international delivery, subject to availability. Pre-payment is required.
CONTACT
PHONE +91. 991 1942-428 / 852 7572-667
MAIL Editorial: 101 FF Main Road Zakir Nagar, Okhla New Delhi IN 110025
STAFF ADDRESS To contact any of the Bioinformatics Review staff members, simply format the address as [email protected]
PUBLICATION INFORMATION
Volume 1, Number 1. Bioinformatics Review™ is published monthly for one year (12 issues) by the Social and Educational Welfare Association (SEWA) trust (registered under the Trust Act, 1882). Copyright 2015 SEWA Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and used under licence by SEWA trust.
Published in India
EXECUTIVE EDITOR: FOZAIL AHMAD
FOUNDING EDITOR: MUNIBA FAIZA
Bioinformatics Review – The
Road Ahead
Bioinformatics, being one of the best fields in terms of future prospects, lacks one thing: a news source. Although a lot of journals publish a large volume of quality research on a variety of topics such as genome analysis, algorithms and sequence analysis, these studies barely get any notice in the popular press.

One reason behind this rather disturbing trend is that there are very few people who can successfully read a research paper and make news out of it. Moreover, the bioinformatics community has not yet been introduced to research reporting. These factors are common to every relatively new (and rising) discipline such as bioinformatics.

Although there are a number of science reporting websites and portals, very few accept entries from their audience, which is expected to have expertise in one field or another.

Bioinformatics Review has been conceptualized to address all these concerns. We will provide an insight into bioinformatics, as an industry and as a research discipline. We will post new developments in bioinformatics and the latest research.

We will also accept entries from our audience and, if possible, reward them. To create an ecosystem of bioinformatics research reporting, we will engage all kinds of people involved in bioinformatics: students, professors, instructors and industries. We will also provide a free job listing service for anyone who can benefit from it.
Tariq Abdullah
Founder
Letters and responses:
SOFTWARE
ALFALFA: explained
Muniba Faiza
Image Credit: Google Images
“ALFALFA is a new tool for mapping sequencing reads. It is extremely fast and accurate at mapping long reads (>500 bp), while still being competitive for moderately sized reads (>100 bp). Both end-to-end (i.e., global) and local read alignment are supported, and several strategies for paired-end mapping can efficiently handle large variations in insert size (the length of the sequenced DNA fragment).”
High-throughput sequencing has revolutionized the world of bioinformatics research. Since the Human Genome Project sequenced the human genome, a vast number of other species have been sequenced. Sequencing is a very important aspect of bioinformatics, so newer, faster and better sequencing techniques are needed. New sequencing platforms produce biological sequence fragments faster and cheaper.
An ideal read mapper should accomplish the following:
1. Maximal speed
2. Minimal memory
3. Maximal accuracy
4. Shooting at a moving target (fast-evolving technologies differ in read-length distribution and sequencing errors)
Recent advances in next-generation sequencing technologies have led to increased read lengths, higher error rates, and error models featuring more and longer indels (insertions and deletions). A preprocessing step that indexes the reference genome and/or the sequencing reads must guarantee fast substring matching. The overall search space is pruned to candidate genomic regions by searching for matching segments (called seeds) between reads and the reference genome. These candidate regions are then investigated further to look for acceptable alignments that reach a particular score, after which the read is reported as mapped.
ALFALFA is a new read mapper that is extremely fast and accurate at mapping long reads (>500 bp), while still being competitive for moderately sized reads (>100 bp). Both end-to-end (i.e., global) and local read alignment are supported, and several strategies for paired-end mapping can efficiently handle large variations in insert size. The name is an acronym for "A Long Fragment Aligner/A Long Fragment Aligner". It is repeated twice as a pun on the repetitive and overlapping fragments observed in genome sequences, which heavily distort read mapping and genome assembly.
The most fascinating feature of ALFALFA is its unique use of 'enhanced sparse suffix arrays' to index the reference genome (the genome against which reads are mapped). An index is a data structure that allows quick location of all occurrences of patterns starting at positions of interest. A sparse suffix array indexes only a fraction of the suffixes and uses LCP (Longest Common Prefix) information, which reduces the solution space and emulates a suffix tree efficiently; ALFALFA pairs it with a chaining algorithm to speed up dynamic programming extensions of candidate regions. This data structure facilitates fast calculation of maximal and super-maximal exact matches. The speed-memory trade-off is tuned by setting the sparseness value of the index.
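To make the idea concrete, here is a toy sketch in Python of the sparse-indexing principle (illustrative only, not ALFALFA's implementation): only every s-th suffix of the reference is indexed, so memory shrinks with s at the cost of missing occurrences that start at unindexed positions.

```python
# Toy sparse suffix array: index every s-th suffix; larger s means less
# memory but fewer indexed positions (a real mapper compensates for the
# missed positions when extending seeds).
def build_sparse_suffix_array(text, s=4):
    return sorted(range(0, len(text), s), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Return indexed positions where `pattern` occurs, via binary search."""
    lo, hi = 0, len(sa)
    while lo < hi:                       # lower bound over the sorted suffixes
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + len(pattern)] == pattern:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)

genome = "ACGTACGTGACG"
sa = build_sparse_suffix_array(genome, s=2)
print(find_occurrences(genome, sa, "ACG"))  # [0, 4]; the hit at 9 is unindexed
```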
SOFTWARE
ALFALFA follows a canonical seed-and-extend workflow for mapping reads onto a reference genome, which its authors describe with an alfalfa-plant metaphor:

Root system: the reference genome, indexed by an enhanced sparse suffix array (enabling quick retrieval of variable-length seeds, the super-maximal exact matches between a read and the reference genome).

Seed: a super-maximal exact match between the reference genome and a read.

Flower bud: clusters of seeds form candidate genomic regions (seeds are grouped into non-overlapping clusters that mark candidate genomic regions for read alignment).

Flower: gaps between seeds are filled by dynamic programming (the handling of candidate regions is prioritized by the aggregate base-pair coverage of their seeds; the final extend phase samples seeds from candidate regions to form collinear chains that are bridged using dynamic programming).
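As a miniature end-to-end illustration of this seed / cluster / extend loop (hypothetical code, far simpler than ALFALFA: it uses a plain k-mer hash instead of a suffix array and a single gapless diagonal instead of banded dynamic programming):

```python
from collections import defaultdict

def kmer_index(genome, k):
    """Hash every k-mer of the reference to its start positions."""
    idx = defaultdict(list)
    for j in range(len(genome) - k + 1):
        idx[genome[j:j + k]].append(j)
    return idx

def map_read(read, genome, idx, k):
    """Seed -> cluster -> extend, in miniature."""
    # Seed: exact k-mer matches (read offset i, genome offset j).
    seeds = [(i, j) for i in range(len(read) - k + 1)
                    for j in idx.get(read[i:i + k], [])]
    # Cluster: group seeds by diagonal (j - i); a diagonal with many
    # seeds marks a candidate genomic region.
    diagonals = defaultdict(list)
    for i, j in seeds:
        diagonals[j - i].append((i, j))
    best_diag = max(diagonals, key=lambda d: len(diagonals[d]), default=None)
    if best_diag is None:
        return None
    # Extend: score a gapless alignment on the best diagonal (assumes the
    # read lies fully inside the reference; a real mapper would bridge
    # seed gaps with banded dynamic programming instead).
    start = best_diag
    window = genome[start:start + len(read)]
    score = sum(a == b for a, b in zip(read, window))
    return start, score

genome = "TTACGTACGGATCCATGGAAC"
idx = kmer_index(genome, k=5)
print(map_read("CGGATCCATG", genome, idx, k=5))  # -> (7, 10)
```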
Features of ALFALFA:
ALFALFA exploits the technological evolution towards longer reads by using maximal exact matches (MEMs) and super-maximal exact matches (SMEMs) as seeds. (Since MEMs between a read and a reference genome may overlap, super-maximal exact matches are defined as MEMs that are not contained in another MEM in the read.) These seeds are extensively filtered, and the order in which candidate regions are aligned is then decided, allowing more accurate prioritization of candidate regions. To reduce the number of expensive dynamic programming computations needed, ALFALFA chains seeds together to form a gapped alignment. As a result, the extension phase (aligning the matches) is limited to filling the gaps between chains while evaluating alignment quality.
The sparseness value s of the sparse suffix array (controlled by the option -s) provides an easily tunable trade-off between performance and memory footprint. In theory, sparse suffix arrays take up 9/s + 1 bytes of memory per indexed base; a sparse suffix array with sparseness factor 12 thus indexes the entire human genome with a memory footprint of 5.8 GB. This shows that ALFALFA can map reads at high speed while occupying minimal memory.
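A quick back-of-the-envelope check of that formula (illustrative arithmetic only):

```python
# Memory per indexed base is 9/s + 1 bytes; scale up to a ~3.1 Gbase
# human genome (a rough figure; index overhead is ignored here).
GENOME_BASES = 3.1e9
for s in (1, 4, 12):
    bytes_per_base = 9 / s + 1
    print(f"s={s:2d}: {bytes_per_base:.2f} B/base, "
          f"{GENOME_BASES * bytes_per_base / 1e9:.1f} GB")
# s=12 gives 1.75 B/base, i.e. roughly 5.4 GB, in line with the ~5.8 GB
# quoted above once implementation overhead is included.
```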
ALFALFA tries to balance the number and the quality of seeds using a combination of maximal and super-maximal exact matches. The intervals [i..i+l−1] and [j..j+l−1] correspond to a maximal exact match between a read and a reference genome if there is a perfect match between the two subsequences of length l starting at position i in the read and at position j in the reference genome, with mismatches occurring at positions (i−1, j−1) and (i+l, j+l) just before and after the location of the matching subsequence.
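That definition translates almost directly into code (a toy check, assuming 0-based indexing; the boundary tests are exactly the "cannot be extended" conditions above):

```python
# [i..i+l-1] in the read and [j..j+l-1] in the reference form a maximal
# exact match if the substrings agree and the match cannot be stretched
# one base to the left or to the right.
def is_maximal_exact_match(read, ref, i, j, l):
    if read[i:i + l] != ref[j:j + l]:
        return False
    left_extendable = i > 0 and j > 0 and read[i - 1] == ref[j - 1]
    right_extendable = (i + l < len(read) and j + l < len(ref)
                        and read[i + l] == ref[j + l])
    return not left_extendable and not right_extendable

print(is_maximal_exact_match("GGACGTA", "TTACGTC", 2, 2, 4))  # True
```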
A combination of neighboring seeds increases the evidence that some region in the reference genome is a potential mapping location. ALFALFA therefore sorts seeds by their starting position in the reference genome and bins them into non-overlapping clusters, using the locally longest seeds as anchors around which regions are built. This results in a list of candidate regions along the reference genome. To limit the number of candidate regions requiring further examination, only SMEMs and rare MEMs are used for candidate region identification. Candidate regions are then ranked by their coverage of read bases, calculated from the seeds that make up the clusters. Sequential processing of these prioritized candidate regions halts when a sufficient number of feasible alignments has been found, a series of consecutive candidate regions fails to produce an acceptable alignment, or read coverage drops below a certain threshold.
The dimensions of a dynamic programming matrix correspond to the bounds of a candidate region, but computations are often restricted to a band around the main diagonal of the matrix. The width of this band depends on the minimal alignment score required. ALFALFA further reduces the dimensions of the matrix by forming a collinear chain from a subset of the seeds that make up a candidate region. Dynamic programming can then be restricted to filling the gaps between consecutive non-overlapping seeds. The chaining algorithm starts from an anchor seed and greedily adds new seeds that do not introduce a high skew into the chain. The skew is defined as the difference between the distances separating two seeds on the read sequence and on the reference genome. The amount of skew allowed is decided automatically based on the gap between the seeds and the parameters that influence the feasibility of an alignment. ALFALFA allows multiple chains per candidate region, based on the available anchor seeds. Anchors are selected by seed length, and seeds contained in chains can no longer be used as anchors in subsequent chain construction.
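A simplified sketch of this greedy chaining step (hypothetical code; ALFALFA's actual skew tolerance is adaptive, whereas here it is a fixed parameter):

```python
def chain_seeds(seeds, anchor, max_skew=5):
    """seeds: (read_pos, ref_pos, length) triples; anchor: the starting seed."""
    chain = [anchor]
    for seed in sorted(s for s in seeds if s != anchor):
        last = chain[-1]
        gap_read = seed[0] - (last[0] + last[2])    # distance on the read
        gap_ref = seed[1] - (last[1] + last[2])     # distance on the reference
        if gap_read >= 0 and gap_ref >= 0:          # in order, non-overlapping
            if abs(gap_read - gap_ref) <= max_skew: # skew (implied indel) small
                chain.append(seed)
    return chain

print(chain_seeds([(0, 100, 10), (15, 115, 8), (30, 200, 9)], (0, 100, 10)))
# -> [(0, 100, 10), (15, 115, 8)]; the third seed's skew is far too large
```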
Overall, Bowtie 2 has the highest sensitivity, reaching 100%. However, Bowtie 2 is also less able to distinguish between good and bad alignments. CUSHAW3, BWA-MEM and ALFALFA exhibit the best trade-off between true positives and false positives. Mapping quality is assessed with ROC (receiver operating characteristic) curves.

The benchmark results demonstrate that ALFALFA is extremely fast at mapping long reads, while still being competitive for moderately sized reads. Together with BWA-SW and BWA-MEM, it is one of the few mappers that scale well for read lengths up to several kilobases.
Reference
Michaël Vyverman, Bernard De Baets, Veerle Fack and Peter Dawyndt. BMC Bioinformatics 16:59. doi:10.1186/s12859-015-0533-0
SYSTEMS BIOLOGY
Tumor progression prediction by variability based expression signatures
Muniba Faiza
Image Credit: Stock Photos
“Gene expression signatures are commonly used to create cancer prognosis and diagnosis methods. Gene expression
signature is a group of genes in a cell whose combined expression pattern is uniquely characteristic of a biological
phenotype or a medical condition.”
Cancer has become a very common disease nowadays, but its root cause is still unknown. Various explanations have been offered, and recent research suggests that improper sleeping patterns may also lead to cancer. Just as the cause of cancer is hard to pin down, its progression and prognosis are also very difficult to predict. Despite many advances in cancer treatment, early detection remains very difficult. Many early cancer screening techniques exist, but they are impractical because they lack cost-effectiveness or require invasive procedures. Genomic screening techniques are a promising approach in this area. Gene expression signatures are commonly used to create cancer prognosis and diagnosis methods. A gene expression signature is a group of genes in a cell whose combined expression pattern is uniquely characteristic of a biological phenotype or a medical condition. However, only a few of these signatures have been successfully used in clinics, and many have failed to perform. Because these signatures attempt to model the highly variable and unstable genomic behavior of cancer, they are unable to predict it reliably. The degree of deviation in gene expression from normal tissue, i.e., the hyper-variability across cancer types, can instead be used as a measurement of the risk of relapse or death. This gives rise to the concept of gene expression anti-profiles. Anti-profiles are used to develop cancer genomic signatures that specifically take advantage of gene expression heterogeneity. They explicitly model the increased gene expression variability in cancer to define robust and reproducible gene expression signatures capable of accurately distinguishing tumor samples from healthy controls.
Differentially variable genes = anti-profile genes
Many experiments with cancer anti-profiles have indicated that the anti-profile approach can be used as a more robust and stable indicator of tumor malignancy than traditional classification approaches.

The researchers' hypothesis is that the degree of hyper-variability (with respect to normal samples) is directly proportional to tumor progression, i.e., the degree of hyper-variability measured with respect to the normal samples increases with tumor progression. Corrada Bravo et al. found
out a way to derive a colon-cancer anti-profile for screening colon tumors by measuring deviation from normal colon samples. To create an anti-profile, they used a set of normal samples and a set of tumor samples. Probesets are ranked in descending order by the quantity σj,tumor / σj,normal (where σj,tumor and σj,normal are the standard deviations among the tumor samples and the normal samples, respectively, for probeset j), and a certain number of probesets (typically 100) with the highest values are selected. The normal region of each probeset is then calculated, and the number of probesets whose expression lies outside the normal region is counted to give the anti-profile score of a sample.
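That scoring procedure can be sketched in a few lines of NumPy (illustrative; the published method defines the "normal region" with specific statistics, approximated here as median ± 5 MAD of the normal samples):

```python
import numpy as np

def antiprofile_score(normal, tumor, sample, n_probes=100):
    """normal, tumor: (probesets x samples) arrays; sample: 1-D expression."""
    ratio = tumor.std(axis=1) / normal.std(axis=1)   # sigma_tumor / sigma_normal
    top = np.argsort(ratio)[::-1][:n_probes]         # most hypervariable probesets
    med = np.median(normal[top], axis=1)
    mad = np.median(np.abs(normal[top] - med[:, None]), axis=1)
    outside = np.abs(sample[top] - med) > 5 * mad    # outside the normal region
    return int(outside.sum())                        # the anti-profile score
```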
To test their hypothesis, they obtained two publicly available microarray datasets with normal, adenoma, and cancer colon samples. For each dataset, they computed an anti-profile from the other dataset and plotted the distribution of the ratio of cancer/adenoma sample variance to normal sample variance (in log2 scale) for the selected probe sets (Fig. 1A and B).
Fig. 1: Among probes that exhibit higher variability among cancers than among normals, the degree of hypervariability observed is related to the level of progression. (A) Distribution of the variance ratio statistic log2(σ²tumor / σ²normal) for the colon dataset of Gyorffy et al. (GSE4183), from an anti-profile computed using another colon dataset (Skrzypczak et al.; GSE20916). (B) Distribution of the variance ratio statistic for the Skrzypczak et al. colon dataset, from an anti-profile computed using the Gyorffy et al. colon dataset. (C) Distribution of the variance ratio statistic for adrenocortical data (Giordano et al.; GSE10927) for universal anti-profile probe sets.
Both adenoma and cancer samples show higher variability than normals (the region to the right of x = 0), while cancer samples show higher hypervariability than adenomas. This suggests that hypervariability is a stable marker across experimental datasets, and that the specific selection of genes that are hypervariable across cancer types, together with the anti-profile method, can be extended to model tumor progression. These studies showed that gene expression anti-profiles capture tumor progression.
DNA methylation is one of the primary epigenetic mechanisms of gene regulation and is believed to play a particularly important role in cancer. High levels of methylation in promoter regions are usually associated with low transcription. Cancer shows a sharp loss of methylation, which is associated with increased hypervariability in gene expression across multiple tumor types. The authors applied the anti-profile scoring method to DNA methylation data from thyroid and colon samples, where normal, adenoma and cancer samples were available for each tissue type. Figure 2 shows the distribution of adenoma and
carcinoma samples against normal samples on a principal component plot, showing the presence of the hypervariability pattern in methylation data: the normal samples cluster tightly, while the adenomas show some dispersion and the carcinomas show even greater dispersion. Since this behavior is present for both colon and thyroid data, it reinforces the notion that the anti-profile approach has wide application for classification in cancer.
Fig. 2: Anti-profiles applied to methylation data: first two principal components of (A) thyroid methylation data and (B) colon methylation data.
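Figure 2's style of analysis is easy to reproduce in miniature with scikit-learn (synthetic data standing in for a real methylation matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
normals = rng.normal(0.0, 0.2, size=(20, 500))     # tight cluster
adenomas = rng.normal(0.0, 0.6, size=(15, 500))    # some dispersion
carcinomas = rng.normal(0.0, 1.2, size=(15, 500))  # greatest dispersion
X = np.vstack([normals, adenomas, carcinomas])     # samples x probes

pcs = PCA(n_components=2).fit_transform(X)
# Plotting pcs coloured by group reproduces the hypervariability pattern:
# normals cluster tightly while adenomas and carcinomas spread out.
print(pcs.shape)  # (50, 2)
```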
Conclusion:
The anti-profile approach is well suited to cancer prognosis. It can robustly predict tumor progression and prognosis based on the variability in gene expression. The results presented above also confirm that gene expression signatures based on hyper-variability can be highly valuable.
Reference:
Wikum Dinalankara and Héctor
Corrada Bravo Center for
Bioinformatics and Computational
Biology, Department of Computer
Science and UMIACS, University of
Maryland, College Park, MD, USA.
TOOLS
BioMiner & Personalized Medicine: A new perspective
Muniba Faiza
Image Credit: Google Images
“BioMiner is a web-based tool which provides various tools for statistical analysis and deep insight into transcriptomics, proteomics and metabolomics with a cross-omics concept.”
Personalized medicines have become a very important part of the medical world nowadays. They are also known as 'individualized medicines'. Personalized medicines allow a doctor to prescribe more specific and efficient medicines to a particular patient. This concept has opened many new opportunities and perspectives in medicine. Personalized medicine is accomplished by obtaining high-throughput data sets from genomics, transcriptomics, proteomics and metabolomics, but more specifically it requires 'cross-omics', i.e., the linkage between transcriptomics, proteomics and metabolomics. Current web-based tools are simple and do not allow much access to high-throughput omics datasets. But a novel web-based tool, "BioMiner", has been launched recently which provides access to a wide variety of high-throughput datasets. The tool was developed within the scope of an international and interdisciplinary project (SYSTHER). BioMiner provides the user with convenient tools that help analyze high-throughput datasets and provides deep insight into complex cross-omics datasets with enhanced visualization abilities. Since BioMiner was developed under the SYSTHER (System Biology Tools Development for Cell Therapy and Drug Development - www.systher.eu) project, its main focus is on cancer.
Public data repositories such as the Gene Expression Omnibus (GEO) and ArrayExpress for microarray data, PRoteomics IDEntifications (PRIDE) for proteomics data, or the Sequence Read Archive (SRA) of NCBI are used to store biological high-throughput datasets, including next-generation sequencing data. The limitation of these repositories is that each stores biological data of a single dedicated omics type and does not support cross-omics.
A database named SystherDB has been developed in which the stored data is well presented and easily accessible, and whose data is mined and analyzed by the BioMiner tools. A public instance of BioMiner is freely available online. It currently contains 18 different studies, with almost 4,000 microarrays and more than 187 million measured values of genes, proteins, or metabolites.
Since BioMiner was developed in the SYSTHER project, most of the studies focus on glioblastoma multiforme (GBM).
Fig.1 Workflow of BioMiner
FEATURES:
1. BioMiner uses the Google Web Toolkit (GWT) for its graphical user interface (GUI).
2. A separate, manually curated MySQL database is used to store the experimental data from genomics, proteomics and metabolomics.
3. Data import has to be performed by a dedicated specialist to ensure data consistency.
4. Response times are within just a few seconds; special indexing methods are implemented for this purpose.
5. Metabolite data are annotated using three different identifier systems: the Golm Metabolome Database, the Human Metabolome Database (HMDB), and the Kyoto Encyclopedia of Genes and Genomes (KEGG).
6. Predefined cross-omics relationships (e.g., a mapping of metabolites onto genes, or vice versa) connect the biological datasets.
7. Pathway and functional information is drawn from Reactome, KEGG, and WikiPathways.
8. Gene Ontology is also supported.
9. Correlation analyses (statistical analysis of any two variables) are based on Pearson correlation coefficients.
10. Correlations are calculated for high-variance genes (by default the top 500 genes); a small sketch follows this list.
11. BioMiner complies with public data management standards such as Minimum Information About a Microarray Experiment (MIAME), Minimum Information About a Proteomics Experiment (MIAPE), and Minimum Information About a Metabolomics Experiment (MIAMET).
12. The ENSEMBL database is used for cross-mapping between genes and proteins.
13. For cross-mapping between genes and metabolites, the combined information of ConsensusPathDB and HMDB is used.
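As an illustration of items 9 and 10, the high-variance gene filter plus Pearson correlation can be sketched in a few lines of NumPy (illustrative code, not BioMiner's):

```python
import numpy as np

def top_gene_correlations(expr, n_top=500):
    """expr: (genes x samples) expression matrix."""
    variances = expr.var(axis=1)
    top = np.argsort(variances)[::-1][:n_top]   # high-variance genes first
    return np.corrcoef(expr[top])               # Pearson correlation matrix

expr = np.random.default_rng(1).normal(size=(2000, 40))
print(top_gene_correlations(expr).shape)        # (500, 500)
```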
Fig. 2: Data mining with BioMiner. Screenshots of different results from data mining with BioMiner, including: (A) study overview, (B) detection of differentially expressed genes, (C) correlation of gene expression and survival time, (D) identification of significantly enriched pathways, (E) visual pathway inspection based on predefined layouts, and (F) biomolecule comparison of gene and protein expression. Results are typically presented in synchronized, parallel views composed of a table and a plot.

Fig. 3: Pathway visualization. Interactive pathway visualization of the cell cycle pathway from the WikiPathways repository.
BioMiner is a web-based tool which provides various tools for statistical analysis and deep insight into transcriptomics, proteomics and metabolomics with a cross-omics concept. Results are presented in two parallel views composed of a table and a plot. Both views are interactive, and user-defined selections can be synchronized. Pathway visualization is achieved by extending the PathVisio library. BioMiner also provides clinicians and physicians with a platform integrating high-throughput data together with clinical parameters, thereby leading to better personalized medicines.
Reference:
Chris Bauer, Karol Stec, Alexander Glintschert, Kristina Gruden, Christian Schichor, Michal Or-Guil, Joachim Selbig and Johannes Schuchhardt
META ANALYSIS
Meta-analysis of biological literature: Explained
Manish Kumar Mishra
Image Credit: Google Images
“Meta-analysis is an analysis of already published data, done by simply rearranging it, sorting it, and trying to find hidden patterns in the published literature.”
It's a fine Monday morning, and the new intern finds his way to the laboratory for biological data mining. His brief interview with the scientist concerned gave him only a very limited understanding of the subject... Upon his arrival he is greeted with a humongous corpus of mixed articles, say some 4,000, and he is required to assemble specific information out of the data set by diligently scrutinizing the components of each article.

The situation could be frightening to a purely wet-lab biologist, but anyone who has had exposure to the real power of file handling with a programming language will know how to let a few simple lines of code do his bidding.
So what is meta-analysis about? The new buzzword in the biological realm, "meta-analysis", can be understood from the first half of the term: META, meaning data about data, making meta-analysis an analysis of already published data by simply rearranging it, sorting it, and trying to find hidden patterns in the published literature.

At its most rudimentary, meta-analysis can be achieved by reading a corpus of research and review articles concerning a particular topic, which may be as wide as a whole eukaryotic genome or narrowed down to a phylum, group or species, or to a specific disease or even a particular gene. Where we try to narrow down to a disease or gene, one must also realize that biological systems are the most complex known to date and present-day computer simulations fail to rival their complexity with equal efficiency; any analysis narrowed down to a gene must also consider that the gene may well be found in multiple organisms and thus may return a considerable number of results irrelevant to the study.
A rigorous manual inspection of program-sorted data is required to sort out such entries. Since meta-analysis relies heavily on statistical study of data, researchers tend to rely on statistical programming environments such as Stata and R to write their specific analysis code. R, unlike Stata, is free, produces publication-quality output and provides a plethora of packages, a few of which (such as PDF-mining and PubMed-mining packages) are used for accessing the PubMed database. These packages contain code to access the database and extract information from it through a command-based interface, handling huge data sets at once and cutting down the manual effort and time taken to achieve the task.
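The packages mentioned above are R tools; as a rough Python counterpart, Biopython's Entrez module can pull PubMed records in bulk (a sketch; the query term and result cap are arbitrary choices, and NCBI asks for a real contact address):

```python
from Bio import Entrez

Entrez.email = "you@example.org"   # NCBI requires a contact address

# Search PubMed and collect matching record IDs.
search = Entrez.read(Entrez.esearch(db="pubmed", term="BRCA1", retmax=100))
ids = search["IdList"]

# Fetch the matching records as MEDLINE text for downstream parsing.
handle = Entrez.efetch(db="pubmed", id=",".join(ids),
                       rettype="medline", retmode="text")
records = handle.read()
print(records[:500])
```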
All praise aside, the method has its fair share of drawbacks and issues. The current query systems of NCBI and its sister organizations fail to acknowledge synonymous terms and treat them as individual, unlinked entities, matching them only in association with the other query terms supplied. A more robust query system is needed to improve the results and make the whole concept more efficient.

The need of the hour is to put more resources into developing well-structured and somewhat intelligent query systems which can truly recognize gene names and abbreviations, scientific and common English names of organisms, and the varied ways in which the names of techniques are written.
PROTEOGENOMICS
How to check new peptides accuracy in Proteogenomics
Muniba Faiza
Image Credit: Google Images
“During the discovery of novel genes, there is a large chance of getting false positives, i.e., the algorithm may report peptides that are not actually real.”
Proteogenomics is an emerging area at the interface of proteomics and genomics. It employs genomic and transcriptomic information to identify novel peptides using mass spectrometry-based techniques. The proteomic data can then be used to identify the fingerprints of genic regions in a particular genome, which may result in modified gene models and can also improve gene annotations. So we can say that proteogenomics is well accepted as a tool to discover novel proteins and genes. "But, during the discovery of novel genes, there is a large chance of getting false positives, i.e., the algorithm may report peptides that are not actually real." Therefore, to avoid, or more accurately to minimize, the chance of false positives, a False Discovery Rate (FDR) is used. The FDR is the ratio of the number of decoy hits to the number of target hits:

FDR = decoys / targets
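A minimal sketch of how such a target-decoy FDR threshold is applied in practice (illustrative code; the names are hypothetical):

```python
def fdr_threshold(psms, max_fdr=0.01):
    """psms: (score, is_decoy) pairs in any order.

    Walk down the score-ranked list and keep the lowest score cutoff at
    which decoys/targets still stays below the desired rate.
    """
    targets = decoys = 0
    threshold = None
    for score, is_decoy in sorted(psms, reverse=True):  # best score first
        decoys += is_decoy
        targets += not is_decoy
        if targets and decoys / targets <= max_fdr:
            threshold = score                           # cutoff still acceptable
    return threshold

psms = [(9.1, False), (8.7, False), (8.2, True), (7.9, False), (7.5, True)]
print(fdr_threshold(psms, max_fdr=0.5))  # 7.9 (1 decoy / 3 targets at that cut)
```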
In most conventional proteogenomic studies, a global false discovery rate (i.e., the identifications of annotated peptides and novel peptides are subjected to FDR estimation in combination) is used to filter out false positives when identifying credible novel peptides. However, it has been found that the actual level of false positives among novel peptides is often out of control and behaves differently for different genomes.

It has been observed previously that, under a fixed FDR, the inflated database generated by, e.g., six-open-reading-frame (6-ORF) translation of a whole genome significantly decreases the sensitivity of peptide identification.
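The 6-ORF search space mentioned here can be generated directly, for instance with Biopython (a sketch):

```python
from Bio.Seq import Seq

def six_frame_translation(dna):
    """Translate all six reading frames of `dna` (three per strand)."""
    seq = Seq(dna)
    frames = []
    for strand in (seq, seq.reverse_complement()):
        for offset in range(3):
            sub = strand[offset:]
            sub = sub[:len(sub) - len(sub) % 3]   # trim to whole codons
            frames.append(str(sub.translate()))
    return frames

print(six_frame_translation("ATGGCCATTGTAATGGGCCGC"))
```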
Recently, Krug showed that the identification accuracy of novel peptides is greatly affected by the completeness of the genome annotation, i.e., the more completely the genome is annotated, the higher the chances of identifying accurate novel peptides.

In a recent paper, the authors followed the same framework as Fu's work to
quantitatively investigate the subgroup FDRs of annotated and novel peptides identified by a 6-ORF translation search.

In this article, they reveal that the genome annotation completeness ratio is the dominant factor influencing the identification accuracy of novel peptides identified by a 6-ORF translation search when a global FDR is used for quality assessment. However, with stringent FDR control (e.g. 1%), many low-scoring but true peptide identifications may be excluded along with the false positives.
To increase the sensitivity and specificity of novel gene discovery, one should reduce the size of the searched database as much as possible. For example, when transcriptome information (especially from strand-specific cDNA-seq data) is available, it is clearly preferable to search against the transcriptome as well rather than against the genome alone. If transcriptome information is unavailable, it also helps to reduce the 6-ORF translation database by removing sequences that are predicted to be very unlikely to be real proteins.
Reference:
A note on the false discovery rate of novel peptides in proteogenomics. Kun Zhang, Yan Fu, Wen-Feng Zeng, Kun He, Hao Chi, Chao Liu, Yan-Chang Li, Yuan Gao, Ping Xu and Si-Min He.
GENOMICS
The basic concepts of genome assembly
Muniba Faiza
Image Credit: Google Images
“The genome, as we all know, is the complete set of DNA in an organism, including all of its genes. It contains all the heritable information, including some regions which are not even expressed.”
The genome, as we all know, is the complete set of DNA in an organism, including all of its genes. It contains all the heritable information, including some regions which are not even expressed. Although almost 98% of the human genome has been sequenced by the Human Genome Project, only 1 to 2% of it is understood; much of the human genome remains to be explored, whether in terms of genes or proteins. Many sequencing strategies and algorithms have been proposed for genome assembly. Here I want to discuss the basic strategy involved in genome assembly, which sounds quite difficult but is not really complex if understood well.

The basic strategy behind discovering new genome information is explained in the following steps:
1. First of all, the whole genome of an organism is sequenced, which results in thousands or hundreds of different unknown fragments, each starting and ending at arbitrary positions.
2. Since we don't know what the sequence is or which fragment should be placed next to which, the concept of 'contigs' is employed. Contigs are built from overlapping reads: broken fragments are joined together by matching the overlapping regions of their sequences, so that many consecutive fragments merge into a contig. Many such contigs are formed during the joining process (a toy sketch appears after this list).
3. Now, the question that arises is: how do we know that a fragment which may be a repeat has been placed correctly, given that a genome may have many repeated regions? To overcome this, paired ends are used. Paired ends are the two ends of the same sequence fragment, which are linked together, so that if one end of the fragment aligns in, let's say, contig 1, then the other end, being part of the same fragment, must also align in the same contig, as it is the consecutive part of the sequence. Various software packages let us define different lengths for the paired ends.
4. After that, the contigs are combined to form scaffolds, sometimes called metacontigs or supercontigs, which are then further processed until the genome sequence is complete.
All of this is done by various assembly algorithms; Velvet is among the most widely used, and SPAdes is one of the latest. In my experience, the more efficient algorithms are those that provide us with the most information in one go.
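Here is a toy greedy overlap assembler illustrating the contig-building idea of step 2 (real assemblers such as Velvet and SPAdes use de Bruijn graphs and are vastly more sophisticated):

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that prefixes `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    reads = list(reads)
    while len(reads) > 1:
        # Pick the pair of reads with the largest overlap.
        n, a, b = max(((overlap(a, b, min_len), a, b)
                       for a in reads for b in reads if a != b),
                      key=lambda t: t[0])
        if n == 0:
            break                      # no overlaps left: separate contigs
        reads.remove(a); reads.remove(b)
        reads.append(a + b[n:])        # merge the best-overlapping pair
    return reads

print(greedy_assemble(["ACGTAC", "GTACGG", "ACGGAT"]))  # ['ACGTACGGAT']
```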
Just imagine that we were handed a thread of sequence with unknown base pairs: what would we do with that thread, and how would we identify and extract the useful information from it?

Thank you for reading. Don't forget to share this article if you liked it.
CADD
Computer and Drugs: What you need to know
Altaf Abdul Kalam
Image Credit: Google Images
” Computers are being used to design drugs and it is being done on a humongous level, by almost every
multinational pharma company.”
Would you chance your life to a lowly piece of hardware called the computer? Would you let it fabricate and determine drugs for life-threatening diseases like hepatitis, cancer and even AIDS? Well, actually, your answer (or your opinion) doesn't seem to matter, because the world has moved over to the other side. Computers are being used to design drugs, and it is being done on a humongous level by almost every multinational pharma company, the names of which you will undoubtedly find on the back of the prescription medicines at home. So what's with all this computer stuff? Have we parted with our perspicacity, our intuition, our ready willingness to tackle any challenge head-on? We have always found solutions to mankind's biggest problems all by ourselves. As Matthew McConaughey's character in Interstellar says, "..or perhaps we've forgotten we are still pioneers?"
Philosophical mumbo-jumbo aside, it's not as simple as it sounds. Of course, most of you reading this already have some background in this topic and have already understood what I am talking about. But for those of you who haven't the slightest clue, don't worry: this write-up is for you. Throughout this series of articles on this particular issue, I am going to try and break it down to the basics. Let's say that by the end you will see a car not for what it is, with all its complexity and slickness, but for what made it the way it is: the nuts and bolts and rubber and... whatever, you get the point!
So where do we start? Money! Yes, the thing that runs the world. Contrary to what all the losers who never made a dime say, money simply is everything. Even Bill Gates was forced to acknowledge the fact and declare, "Money isn't everything in this world, but you gotta have a lot of it before you say such rubbish." So that settles it then. Now let's come back.
The basic modus operandi of designing a drug is that you first find a suitable target which you believe will be key to challenging the disease. This is mostly a protein/enzyme that can metabolise a particular drug, or in some cases even a disease-causing gene from the pathogen itself. Finding this target is not easy, but it is not that hard either. We have documentation, intensive studies and databases dedicated to listing, characterizing and studying the drug-metabolizing genes and proteins in the body. Different classes of metabolizers act on different types of chemicals (or drugs, if you like). A class of metabolizers called the CYP enzymes metabolize over sixty percent of the known drugs and medicines that humans consume. This includes drugs (the real ones: LSD, cocaine, heroin... get it?) and even poisons and sedatives. The metabolizers of course don't know which is which. If it suits them, they metabolize it; else it passes out of your system.
Now, under the assumption that we have a drug target, the next step is finding the suitable drug candidate itself. This step is what you call finding a needle in a haystack. There are literally millions of drug candidates out there, and if that is not enough, you can go design your own and get it synthesized. In a drug target (we will simply call it the 'protein' henceforth) there are multiple points of action where multiple drugs can act. So, for example, in a protein made of 200 amino acids we might find 50 actionable amino acids, and for these fifty amino acids we may find thousands upon thousands of drug candidates, all capable of bringing about some change or the other in the protein.
So how do we find the One? If you asked that question about fifteen years ago, the answer would have been to slog it out: match every drug candidate you have against the protein and check the effects in vivo. Countless factors come into play when a drug and a protein interact: global minima, energy minimization, binding affinity, hydrogen-bonding intensity and what not. We shall learn about them in more detail in upcoming articles.
So, to put it simply, scientists spent their whole sorry lives pitting different drug candidates against the same protein over and over again until they found something worthwhile to hang on to. Even if all the above-mentioned factors blended in wonderfully, they might sadly discover at the end that the drug actually caused more harm than good. So the candidate gets discarded and they start the process all over again! Sometimes you got lucky and found the right drug within a few combinations, but mostly it took years to even zero in on a drug that could be moved further into the drug discovery pipeline, which in itself is another torturous process!
So, coming back to the money factor: you don't need to be a Harvard Business School graduate to see that this tiresome task costs money, a lot of money! Money in the form of manpower, reagents, biological matter like tissues, test animals and plants, instrumentation, electricity and what not. Another thing it costs is something none of us care much about: time. Picture designing a drug for some novel disease which is killing thousands of people each year, and picture having to do this same procedure and coming out with a drug after 10-15 years. The cost of such a life-saving drug will also be high, because the company or lab that made it would want to recover all the time and money they spent on it in the first place. Not exactly feasible and effective, I would say.
So here comes computer-aided drug design, which, brace yourself, can shave years off the drug discovery pipeline. It can get you into the clinical trials phase within, say, 2-3 years as opposed to the earlier average of 7-8 years. Less money spent, less time spent, faster availability of a potential cure and, who knows, even less expensive medicines.
So how does it work? How does the entry of a man-made machine change everything for the better so drastically? What does a computer do that humans could not? Can you trust results obtained in silico over something that happens in vivo? Are computers finally so evolved that they can simulate life forms inside their motherboards and processors? We will hopefully see those questions answered in the next few posts!
SEQUENCE ANALYSIS
Basic Concept of Multiple Sequence Alignment
Muniba Faiza
Image Credit: Google Images
“The major goal of MSA pairwise alignment is to identify the alignment that maximizes protein sequence similarity.”
Multiple Sequence Alignment (MSA) is a very basic step in the phylogenetic analysis of organisms. In MSA, all the sequences under study are aligned together pairwise on the basis of similar regions within them. The major goal is to identify the alignment that maximizes protein sequence similarity. This is done by seeking an alignment that "maximizes the sum of similarities for all pairs of sequences", known as the 'sum-of-pairs' or SP score. The SP score is the basis of many alignment algorithms.
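Here is a small worked example of the SP score under the simplest possible scoring (+1 per identical residue pair per column, gaps scoring nothing; real scorers use substitution matrices and gap penalties):

```python
from itertools import combinations

def sp_score(alignment):
    """alignment: list of equal-length aligned sequences (rows)."""
    score = 0
    for column in zip(*alignment):            # walk the MSA column by column
        for a, b in combinations(column, 2):  # every pair of sequences
            if a == b and a != "-":           # identical, non-gap residues
                score += 1
    return score

print(sp_score(["AC-GT", "ACAGT", "AC-GA"]))  # 10
```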
The most widely used approach for constructing an MSA is 'progressive alignment', where a set of n proteins is aligned by performing n−1 pairwise alignments of pairs of proteins, or pairs of intermediate alignments, guided by a phylogenetic tree connecting the sequences. A methodology that has been successfully used to improve progressive alignment based on the SP score is 'consistency-based scoring', where an alignment is kept consistent with previously obtained alignments. For example, given three sequences A, B and C, the pairwise alignments A-B and B-C imply an alignment of A and C which may be different from the directly computed A-C alignment.
Now the question arises: how much can we rely on the obtained MSA, and how is an MSA validated? The validation of an MSA program typically uses a benchmark data set of reference alignments: an MSA produced by the program is compared with the corresponding reference alignment, which yields an accuracy score.

Before 2004, the standard benchmark was BAliBASE (Benchmark Alignment dataBASE), a database of manually refined MSAs consisting of high-quality documented alignments, used to identify the strong and weak points of the numerous alignment programs available. "Recently, several new benchmarks have been made available, namely OXBENCH, PREFAB, SABmark, IRMBASE and a new extended version of BAliBASE."
Another parameter considered basic in most alignment programs is the fM score. It is used to assess the specificity of an alignment tool and identifies the proportion of predicted matched residues that also appear in the reference alignment. It is often the case that some regions of the sequences are alignable and some are not; however, there are usually also intermediate cases, where sequence and structure have diverged to a point at which homology is not reliably detectable. In such cases the fM score, at best, provides a noisy assessment of alignment tool specificity, one that becomes increasingly less reliable as one considers sequences of increasing structural divergence.

Moreover, even with reference alignments, the accuracy of the results remains questionable, as the reference alignments themselves are of varying quality.
REFERENCES:
Edgar RC, Batzoglou S. Multiple sequence alignment.
Thompson JD, Plewniak F, Poch O. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs.
SYSTEMS BIOLOGY
Basics of Mathematical Modelling - Part 1
Fozail Ahmad
Image Credit: Google Images
“Mathematical modeling covers a broad domain of cellular processes, such as metabolic regulation, gene-gene interaction, and gene-protein interaction. This has built a bridge between experimental and expected outcomes.”
Biochemical processes are simply complex, and their apparent features do not easily allow us to investigate what exactly the system means. Moreover, most biochemical processes obey nonlinear reaction kinetics; that is, the amount of reactant (protein/RNA/DNA) is not directly proportional to its product (a small simulation example appears at the end of this article). This further increases the complexity of the molecular mechanism and creates biological noise such as randomization (stochasticity) of biomolecules, perturbation of cell signaling, difficulty in quantifying cell products, and even unexpected responses of the entire system. Here the development and use of a mathematical model comes in: it takes multiple factors/parameters into consideration and provides the researcher with a visual understanding of how a complex biological system functions and responds to external (hormone/drug/cation/anion) and internal (protein/enzyme/cation/anion) signals or to adverse environmental conditions, such as a deficiency of Fe2+ ions during the formation of Vitamin D. Basically, mathematical modeling covers a broad domain of cellular processes, such as metabolic regulation, gene-gene interaction, and gene-protein interaction. This has built a bridge between experimental and expected outcomes; in case of discrepancies between the two, the parameters taken into consideration need to be refined. The general approach of modeling gives us the following benefits:
1. Discrepancies between a mathematical model and the actual experimental result point to components that are still missing from the hypothetically developed model, so one can develop a more comprehensive scenario of the system's behavior. On the other hand, a well-developed model assists in designing and clarifying additional issues in an ongoing experiment.
2. With the help of a mathematical model, a researcher can modify an experimental parameter (e.g., by introducing a modified protein associated with Mg2+ uptake into the cell) and run computer simulations.
3. Most importantly, mathematical models are not limited by environmental/experimental constraints. They can be quickly changed for multiple conditions/parameters, and the most suitable simulation can be selected for developing a reliable experimental design.
4. A mathematical model may help to investigate a sub-system that regulates a special biochemical process (though not all biological reactions can be treated the same), which provides substantial information about the behavior of the larger system.
Fig. 1: Schematic representation of the biological modelling process, knowledge generation and experimental design.
By running multiple simulations and changing parameter values, we are able to represent real biochemical/molecular phenomena which otherwise seem difficult to treat.
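As a minimal example of such nonlinear kinetics, here is Michaelis-Menten substrate decay, dS/dt = -Vmax·S/(Km + S), simulated with SciPy; the parameter values are arbitrary illustrations, and re-running with different Vmax/Km is exactly the "change parameters and simulate again" loop described above:

```python
import numpy as np
from scipy.integrate import solve_ivp

Vmax, Km = 1.0, 0.5                      # assumed kinetic parameters

def rate(t, S):
    return -Vmax * S / (Km + S)          # consumption is NOT linear in S

sol = solve_ivp(rate, t_span=(0, 10), y0=[2.0],
                t_eval=np.linspace(0, 10, 50))
print(sol.y[0][-1])                      # substrate remaining at t = 10
```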
(To be continued...)
TOOLS
MUSCLE: Tool for Multiple Sequence Alignment
Muniba Faiza
Image Credit: Google Images
“MUSCLE is a piece of software known for its speed and accuracy on each of four benchmark test sets (BAliBASE, SABmark, SMART and PREFAB).”
In my last article I discussed Multiple Sequence Alignment and how it is created. In this article, I am going to explain the workflow of one MSA tool: MUSCLE. MUSCLE is software used to create an MSA of the sequences of interest. It is known for its speed and accuracy on each of four benchmark test sets (BAliBASE, SABmark, SMART and PREFAB), and is comparable to T-COFFEE and MAFFT (these tools will be explained in upcoming articles).
MUSCLE algorithm:
MUSCLE uses two distance measures for a pair of sequences: the kmer distance (for an unaligned pair) and the Kimura distance (for an aligned pair). A kmer is a contiguous subsequence of length k, also known as a word or k-tuple; k determines how many letters of the sequences are searched and aligned together at a time. The Kimura distance is a measure based on the fact that multiple substitutions can occur at a single site.
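Simplified sketches of both measures (the exact formulas in MUSCLE differ in detail; the Kimura correction shown is the standard protein-distance form, valid for moderate distances):

```python
import math
from collections import Counter

def kmer_distance(s1, s2, k=3):
    """Word-content distance for an UNALIGNED pair of sequences."""
    c1 = Counter(s1[i:i + k] for i in range(len(s1) - k + 1))
    c2 = Counter(s2[i:i + k] for i in range(len(s2) - k + 1))
    shared = sum((c1 & c2).values())                 # common k-mer count
    return 1 - shared / (min(len(s1), len(s2)) - k + 1)

def kimura_distance(aligned1, aligned2):
    """Corrected distance for an ALIGNED pair (gaps are skipped)."""
    pairs = [(a, b) for a, b in zip(aligned1, aligned2) if "-" not in (a, b)]
    d = sum(a != b for a, b in pairs) / len(pairs)   # observed distance
    return -math.log(1 - d - d * d / 5)              # multiple-hit correction

print(kmer_distance("GATTACA", "GATCACA"))           # 0.6
print(kimura_distance("GAT-TACA", "GATCTAC-"))       # 0.0
```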
1. MUSCLE first computes the kmer distance between every pair of (unaligned) input sequences and compiles these values into a distance matrix.
2. The matrix is clustered using UPGMA (a phylogenetic tree construction method based on the assumption that mutations occur at a constant rate), giving TREE1; a progressive alignment along TREE1 forms MSA1.
3. From MSA1, MUSCLE computes pairwise percent identities, i.e., the percentage of positions in each pair of sequences that are aligned/matched, and converts them into a Kimura distance matrix.
4. UPGMA is applied again, giving TREE2, which is again followed by a progressive alignment, forming MSA2.
5. In the refinement stage, an edge is deleted from the last obtained tree, which splits it into two subtrees; the profile (sub-alignment) of each subtree is computed.
6. The two subtree profiles are re-aligned to give a new MSA, for which the SP score is calculated (explained in the previous article, "Basic Concept of MSA").
7. Only if the SP score is better is the newly obtained MSA kept (as MSA3); otherwise it is discarded.
8. Steps 5-7 are repeated until convergence, finally giving the refined MSA.
Fig. 1: The workflow of MUSCLE.

This is how MUSCLE works. MUSCLE alignment is also used in the MEGA6 tool, which is used for phylogenetic tree construction. Every piece of software has its own benefits depending on the needs at hand. Various other tools are also available for MSA, such as T-COFFEE and MAFFT, which have high accuracy and speed; they will be explained in upcoming articles.
Reference:
Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput.
SYSTEMS BIOLOGY
Introduction to mathematical modelling - Part 2
Fozail Ahmad
Image Credit: Google Images
“In order to obtain the parameter values for analysing the kinetic behaviour of biochemical processes, we simply investigate expression data (gene/protein) at different transcriptional and translational levels, enabling us to frame out the comprehensive structure of the biochemical pathway.”
Gathering of dynamic/kinetic information

In the previous section you might have noticed that modelling a biochemical process requires a calibrated set of fine-tuned parameters which fit across the set of chemical/reactant species (gene/protein/molecule) involved in the process. The question arises: where do we collect the data from, and what are the standard criteria for determining parameters? Basically, a researcher needs to know the source of the data first, and then how to manipulate it to obtain the information relevant for modelling. The source of the data can be chosen depending on the requirements of the experimental design. For modelling, data can be taken in the form of gene-gene interactions, gene expression (microarrays) and gene-protein interactions. Interaction and expression data do not directly reveal the dynamic/kinetic values of the system and therefore need to be manipulated before further use.
In order to obtain the parameter values for analysing the kinetic behaviour of biochemical processes, we simply investigate expression data (gene/protein) at different transcriptional and translational levels, which enables us to frame out the comprehensive structure of the biochemical pathway. This can be done using the following methods:

1. Genome (complete set of genes) analysis at the transcription level through DNA sequencing and genotyping
2. Transcriptome (all mRNA) analysis at the translation level using microarrays
3. Proteome (entire protein complement) analysis at the cellular level (reactions between proteins and other molecules in the cell) using mass spectrometry and 2D gel electrophoresis
4. Metabolome (total metabolites and their intermediates) analysis at the cell level, covering interactions of metabolites and regulators, using 13C labelling and NMR techniques
5. Interactome (all interacting molecules) analysis by yeast-2-hybrid screens and TAP techniques

(Legend: TAP, tandem affinity purification; NMR, nuclear magnetic resonance.)
Fig. 2: Omics generates the data for developing the structural pathway, and parameter values are set from the same data. A mathematical model, in the form of differential equations derived from the reaction channels, is then executed/solved using a suitable algorithm. The resulting simulation shows the dynamic behaviour of the system; it can be tuned by changing parameter values to approach the experimental data.
The techniques mentioned above are collectively referred to as omics. They provide structural and dynamic data that are used to generate the mathematical formulas representing the observable reactions, followed by the development of a mathematical model and a comprehensive pathway of the biological system. These tentative models allow us (as mentioned in Part 1) to observe the effect of a stimulus on a specific signalling pathway, perturbations of cellular activities, gene expression levels, etc.
Omics approaches are characterized by a number of features. First, they allow the researcher to analyse different molecular levels, such as the gene, protein and metabolite levels. These levels sometimes show asynchronous behaviour; for example, a metabolite such as glucose may be abundant in a cell while the corresponding enzymes that catalyse its reactions are scarce, or vice versa. Asynchronous behaviour is an indication of a complex regulatory mechanism, so it is crucially important to evaluate the degree of synchronization across all cellular levels. Second, omics methods are highly parallelized: all genes/mRNAs (the read-outs in a sample) can be studied simultaneously, rather than having to perform separate experiments focusing on individual genes. This parallelization also allows the researcher to compare expression results for the same gene and to relate them to interactions between the resulting proteins. Third, they are very standardized and therefore need highly automated computing, providing the scientist with a large number of samples at a time. After collecting this huge volume of data, the most relevant information is picked out and processed further for the final analyses. Together, the techniques of omics are very important in that they generate the numerical data from which we can develop a structural pathway mimicking the real biological system and then represent it in the form of a mathematical model.

(Continue to Part 3)
BIOINFORMATICS NEWS
DNA test for paternity: This is how you can fail!
Tariq Abdullah
Image Credit: Google Images
“DNA contains the genetic information of an individual. The whole set of genes in an organism is known as the genome. 99.9% of the genome is exactly the same in every human individual, but 0.1% differs. That 0.1% is unique to each person, making it possible to identify every individual.”
DNA testing, also called DNA fingerprinting, is done to verify paternity and to detect criminal involvement, and is used in forensic science, archaeology and other scientific fields. It is a well-established fact that the DNA fingerprinting test is foolproof, and it has its merits in court cases too: it is considered credible evidence of criminal involvement and of paternity. The case of N.D. Tiwari received wide media attention in recent times.

BUT the chances are, you can fail a DNA test even with your real father or mother! To understand how this is possible, let us look at how DNA fingerprinting is done.
The technique of DNA fingerprinting (genetic fingerprinting), or simply the DNA test, was discovered by Dr. Alec Jeffreys in 1984. DNA contains the genetic information of an individual. The whole set of genes in an organism is known as the genome. 99.9% of the genome is exactly the same in every human individual, but 0.1% differs; that 0.1% is unique to each person, making it possible to identify every individual.

To identify a person by differences in DNA sequence, the sequences are simply compared to each other. To speed up this comparison, rather than comparing each nucleotide, a molecular biologist compares regions of high variation in the DNA sequence called minisatellites. The location and sequence of minisatellites in the genome vary in every individual. The chance of two people sharing the same minisatellites is very low (about 1 in a billion), so they can be treated as unique to every individual, just like a fingerprint.
To perform a DNA test, the DNA is first broken into smaller pieces by a restriction endonuclease called EcoRI, which cuts the sequence at each location where it reads GAATTC (or its complement). The locations of this repeated sequence vary in every individual. The fragments are then sorted according to their molecular weight (or size) by a technique called gel electrophoresis.
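A toy digest illustrates the idea (simplified sketch; EcoRI cuts within its GAATTC recognition site, between the G and the A):

```python
def ecori_fragments(dna, site="GAATTC"):
    """Cut `dna` at every EcoRI site and return the fragment sizes."""
    cuts, start = [], 0
    while True:
        hit = dna.find(site, start)
        if hit == -1:
            break
        cuts.append(hit + 1)             # cut between G and AATTC
        start = hit + 1
    pieces = [dna[a:b] for a, b in zip([0] + cuts, cuts + [len(dna)])]
    return [len(p) for p in pieces]      # what the gel separates on

print(ecori_fragments("AAGAATTCGGTTGAATTCCA"))  # [3, 10, 7]
```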
The fragments are then compared to each other. If the fragments generated by the restriction enzyme are of the same sizes, it is likely that both sequences originated from the same individual.

So how can DNA fingerprinting fail?
For a DNA test to fail, we would have to have two different sets of DNA (genomes) in our bodies. This is possible in the following ways (a concise list for quick reference):

1. Since the human body is a complex and dynamic system, the environmental conditions in different parts of the body may lead to changes in DNA; this is a comparatively new idea studied in epigenetics. Though changes do occur this way, they are not likely to alter the entire DNA or the locations of minisatellites.
2. Transposable elements may also cause the location of some sequences to change. However, the occurrence of transposable elements is not so widespread as to change the locations of all minisatellites, so this idea does not seem satisfactory either.
3. The occurrence of more than one kind of cell, in terms of genome, in a single human, i.e., a human chimera, was recently seen in a US man who failed a paternity test with his real child.
So what is a human chimera? To be simple and precise, a human chimera is an individual carrying cells with completely different sets of genes. It is a very rare condition and may go unnoticed. Sometimes, when one of a pair of twins dies during early pregnancy, the remnant cells may be taken up and absorbed by the surviving embryo. The surviving embryo will then have two kinds of genome in different parts of the body, depending on the process of differentiation. Thus, if the cells that form sperm in your body carry a different genome than the rest of the cells in your body, you may fail a DNA test! This is a rare condition, and the chances are low that you would get away with a crime because of it.

If you liked this article or found it worth reading, please do not forget to share. Who knows, there might be cases of human chimera around you. :)
Subscribe to the Bioinformatics Review newsletter to get the latest posts in your mailbox and never miss out on any of your favorite topics.
Log on to
www.bioinformaticsreview.com