
VEL TECH HIGH TECH Dr. RANGARAJAN Dr. SAKUNTHALA ENGINEERING COLLEGE

DEPARTMENT OF BIOTECHNOLOGY

BT6701 BIOINFORMATICS AND COMPUTATIONAL BIOLOGY



UNIT I

DIFFERENT NETWORK TOPOLOGY WITH ILLUSTRATION.

TOPOLOGY

The term topology, or more specifically network topology, refers to the arrangement or physical layout of computers, cables, and other components on the network. "Topology" is the standard term that most network professionals use when they refer to the network's basic design.

"Network topology is the study of the arrangement or mapping of the elements of a network, especially the physical (real) and logical (virtual) interconnections between nodes."

Classification of network topologies

There are three basic categories of network topologies:

Physical topologies

Signal topologies

Logical topologies

a) Physical topologies

The physical topology is the mapping of the nodes of a network and the physical connections between them, i.e., the layout of wiring, cables, the locations of nodes, and the interconnections between the nodes and the cabling or wiring system.

Classification of physical topologies

Bus

Star

Ring

Mesh

Tree

Hybrid-network topologies

b) Signal topology

The mapping of the actual connections between the nodes of a network, as evidenced by the path that the

signals take when propagating between the nodes.

c) Logical topology

The mapping of the apparent connections between the nodes of a network, as evidenced by the path that data

appears to take when traveling between the nodes.

TOPOLOGIES

BUS, STAR, RING, MESH, TREE, HYBRID






1. BUS TOPOLOGY

A linear bus topology consists of a main run of cable with a terminator at each end. All nodes (file server, workstations, and peripherals) are connected to the linear cable. Ethernet and LocalTalk networks use a linear bus topology.

Advantages of a Linear Bus Topology

• Easy to connect a computer or peripheral to a linear bus.

• Requires less cable length than a star topology.

Disadvantages of a Linear Bus Topology

• Entire network shuts down if there is a break in the main cable.

• Terminators are required at both ends of the backbone cable.

• Difficult to identify the problem if the entire network shuts down.

• Not meant to be used as a stand-alone solution in a large building.

2. STAR TOPOLOGY

In a star topology, each device has a dedicated point-to-point link only to a central controller called a hub. The devices are not linked to each other. Data on a star network passes through the hub or concentrator before continuing to its destination. The hub or concentrator manages and controls all functions of the network. It also acts as a repeater for the data flow. This configuration is common with twisted-pair cable, and is also used with coaxial or fiber-optic cable. The protocols used with star configurations are usually Ethernet or LocalTalk. Token Ring uses a similar topology, called the star-wired ring.

Advantages of a Star Topology

• Easy to install and wire.

• No disruptions to the network when connecting or removing devices.

• Easy to detect faults and to remove parts.

Disadvantages of a Star Topology

• Requires more cable length than a linear topology.

• If the hub or concentrator fails, the nodes attached to it are disabled.

• More expensive than linear bus topologies because of the cost of the concentrators.

3. TREE TOPOLOGY

A tree topology combines characteristics of linear bus and star topologies. It consists of groups of star-configured workstations connected to a linear bus backbone cable. Tree topologies allow for the expansion of an existing network, and enable schools to configure a network to meet their needs.


Advantages of a Tree Topology

• Point-to-point wiring for individual segments.

• Supported by several hardware and software vendors.

Disadvantages of a Tree Topology

• Overall length of each segment is limited by the type of cabling used.

• If the backbone line breaks, the entire segment goes down.

• More difficult to configure and wire than other topologies.

4. MESH TOPOLOGY

A mesh topology network offers superior redundancy and reliability. In a mesh topology, each

computer is connected to every other computer by separate cabling. This configuration provides redundant

paths throughout the network so that if one cable fails, another will take over the traffic. While ease of

troubleshooting and increased reliability are definite pluses, these networks are expensive to install because

they use a lot of cabling. Often, a mesh topology will be used in conjunction with other topologies to form a

hybrid topology.
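The cabling cost mentioned above can be made concrete by counting links. The following Python sketch is illustrative only: the function name `link_count` and the simplified counting rules (for example, ignoring drop cables on a bus and treating the hub as a separate device in a star) are assumptions made for this example, not part of the notes.

```python
def link_count(topology: str, n: int) -> int:
    """Return the number of cable runs needed to connect n computers
    under a simplified model of each topology."""
    if n < 2:
        raise ValueError("need at least two computers")
    if topology == "mesh":
        # Full mesh: every pair of computers has its own cable.
        return n * (n - 1) // 2
    if topology == "star":
        # One cable from each computer to the central hub.
        return n
    if topology == "ring":
        # Each computer links to its two neighbours and the ring closes on itself.
        return n
    if topology == "bus":
        # A single shared backbone cable (drop cables are ignored here).
        return 1
    raise ValueError(f"unknown topology: {topology}")


for name in ("bus", "star", "ring", "mesh"):
    print(name, link_count(name, 10))
```

Under these assumptions the link count of a full mesh grows quadratically with the number of computers, while the other layouts grow at most linearly, which is why a mesh is the costliest to cable.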

5. RING TOPOLOGY

In a ring topology, each device has a dedicated point-to-point line configuration only with the two devices on either side of it. A signal is passed along the ring in one direction, from device to device, until it reaches its destination. Each device in the ring incorporates a repeater. When a device receives a signal intended for another device, its repeater regenerates the bits and passes them along.

Advantages of a Ring Topology

• A ring is relatively easy to install and reconfigure.

Disadvantages of a Ring Topology

• Unidirectional traffic can be troublesome: in a simple ring, a break in the ring can disable the entire network.

TYPES OF NETWORKING AND ITS SIGNIFICANCES

Networks are often classified by their physical or organizational extent or their purpose. Usage, trust level, and

access rights differ between these types of networks.

Personal area network

A personal area network (PAN) is a computer network used for communication among computers and other information technology devices close to one person. Some examples of devices that are used in a PAN are personal computers, printers, fax machines, telephones, PDAs, scanners, and even video game consoles. A PAN


may include wired and wireless devices. The reach of a PAN typically extends to 10 meters.[11] A wired PAN is usually constructed with USB and FireWire connections, while technologies such as Bluetooth and infrared communication typically form a wireless PAN.

Local area network

A local area network (LAN) is a network that connects computers and devices in a limited geographical area such as a home, school, computer laboratory, office building, or closely positioned group of buildings. Each computer or device on the network is a node. Current wired LANs are most likely to be based on Ethernet technology, although newer standards such as ITU-T G.hn also provide a way to create a wired LAN using existing home wires (coaxial cables, phone lines and power lines).[12]

Typical library network, in a branching tree topology and controlled access to resources

A sample LAN is depicted in the accompanying diagram. All interconnected devices must understand the

network layer (layer 3), because they are handling multiple subnets (the different colors). Those inside the

library, which have only 10/100 Mbit/s Ethernet connections to the user device and a Gigabit Ethernet

connection to the central router, could be called "layer 3 switches" because they only have Ethernet interfaces

and must understand IP. It would be more correct to call them access routers, where the router at the top is a

distribution router that connects to the Internet and academic networks' customer access routers.

The defining characteristics of LANs, in contrast to WANs (Wide Area Networks), include their higher data

transfer rates, smaller geographic range, and no need for leased telecommunication lines. Current Ethernet or

other IEEE 802.3 LAN technologies operate at data transfer rates up to 10 Gbit/s. IEEE has projects

investigating the standardization of 40 and 100 Gbit/s.[13] LANs can be connected to a wide area network by using routers.

Home area network

A home area network (HAN) is a residential LAN which is used for communication between digital devices

typically deployed in the home, usually a small number of personal computers and accessories, such as printers

and mobile computing devices. An important function is the sharing of Internet access, often a broadband

service through a cable TV or Digital Subscriber Line (DSL) provider.


Storage area network

A storage area network (SAN) is a dedicated network that provides access to consolidated, block level data

storage. SANs are primarily used to make storage devices, such as disk arrays, tape libraries, and optical

jukeboxes, accessible to servers so that the devices appear like locally attached devices to the operating system.

A SAN typically has its own network of storage devices that are generally not accessible through the local area

network by other devices. The cost and complexity of SANs dropped in the early 2000s to levels allowing wider

adoption across both enterprise and small to medium sized business environments.

Campus area network

A campus area network (CAN) is a computer network made up of an interconnection of LANs within a limited geographical area. The networking equipment (switches, routers) and transmission media (optical fiber, copper plant, Cat5 cabling, etc.) are almost entirely owned by the campus tenant or owner (an enterprise, university, government, etc.).

In the case of a university campus network, the network is likely to link a variety of campus buildings including, for example, academic colleges or departments, the university library, and student residence halls.

Backbone network

A backbone network is part of a computer network infrastructure that interconnects various pieces of network,

providing a path for the exchange of information between different LANs or subnetworks. A backbone can tie

together diverse networks in the same building, in different buildings in a campus environment, or over wide

areas. Normally, the backbone's capacity is greater than that of the networks connected to it.

A large corporation that has many locations may have a backbone network that ties all of these locations together, for example, if a server cluster needs to be accessed by different departments of a company that are located at different geographical locations. The equipment that ties these departments together constitutes the network backbone. Network performance management and network congestion are critical factors taken into account when designing a network backbone.

A specific case of a backbone network is the Internet backbone, which is the set of wide-area network

connections and core routers that interconnect all networks connected to the Internet.

Metropolitan area network

A metropolitan area network (MAN) is a large computer network that usually spans a city or a large campus.


Sample EPN made of Frame relay WAN connections and dialup remote access.

Sample VPN used to interconnect 3 offices and remote users

Wide area network

A wide area network (WAN) is a computer network that covers a large geographic area such as a city, country,

or spans even intercontinental distances, using a communications channel that combines many types of media

such as telephone lines, cables, and air waves. A WAN often uses transmission facilities provided by common

carriers, such as telephone companies. WAN technologies generally function at the lower three layers of the

OSI reference model: the physical layer, the data link layer, and the network layer.

Enterprise private network

An enterprise private network is a network built by an enterprise to interconnect various company sites, e.g.,

production sites, head offices, remote offices, shops, in order to share computer resources.

Virtual private network

A virtual private network (VPN) is a computer network in which some of the links between nodes are carried by

open connections or virtual circuits in some larger network (e.g., the Internet) instead of by physical wires. The

data link layer protocols of the virtual network are said to be tunneled through the larger network when this is

the case. One common application is secure communications through the public Internet, but a VPN need not


have explicit security features, such as authentication or content encryption. VPNs, for example, can be used to

separate the traffic of different user communities over an underlying network with strong security features.

A VPN may have best-effort performance, or may have a defined service level agreement (SLA) between the VPN customer and the VPN service provider. Generally, a VPN has a topology more complex than point-to-point.

Internetwork

An internetwork is the connection of multiple computer networks via a common routing technology using

routers. The Internet is an aggregation of many connected internetworks spanning the Earth.

Organizational scope

Networks are typically managed by organizations which own them. According to the owner's point of view,

networks are seen as intranets or extranets. A special case of network is the Internet, which has no single owner

but a distinct status when seen by an organizational entity – that of permitting virtually unlimited global

connectivity for a great multitude of purposes.

UNIX, ITS COMMANDS AND ADVANTAGES OVER OTHER OS.

The UNIX system has become quite popular since its inception in 1969, running on machines of varying processing power, from microprocessors to mainframes.

The system is divided into two parts.

The first part consists of the programs and services that have made the UNIX system environment so popular; it is the part readily apparent to users, and includes such programs as the shell, mail, text processing packages, and source code control systems.

The second part consists of the operating system that supports these programs and services.

UNIX ARCHITECTURE

The high-level architecture of the UNIX system can be pictured as a set of layers around the hardware.

The hardware at the centre of the diagram provides the operating system with basic services. The OS interacts directly with the hardware, providing common services to the programs and insulating them from hardware idiosyncrasies.

The operating system is commonly called the system kernel, which is isolated from user programs. As the programs are independent of the underlying hardware, it is easy to move them between UNIX systems running on different hardware.

Programs such as the shell and editors (ed and vi) in the outer layers interact with the kernel by invoking a well-defined set of system calls.


The system calls instruct the kernel to do various operations for the calling program and exchange

data between the kernel and the program. Several programs shown in the figure are in standard system

configurations and are known as commands.

Private user programs also exist in this layer as indicated by the programs such as a.out, the standard

name for executable files produced by the C compiler.

FEATURES OF UNIX SYSTEM

Some high-level features of the UNIX system are

The file system,

The processing environment, and

The building block primitives

THE FILE SYSTEM

The UNIX file system is characterized by

A hierarchical structure

Consistent treatment of file data

The ability to create and delete files

Dynamic growth of files

The protection of file data

The treatment of peripheral devices (such as terminal and tape units) as files

BASIC UNIX COMMANDS

1. who:

This command displays all the users who are currently logged on to the system.

Syntax: $ who

2. whoami:

This displays the name of the current user.

Syntax: $ whoami

3. logname:

This confirms the login name.

Syntax: $ logname

4. pwd:

This displays the current working directory.

Syntax: $ pwd

5. echo:

This displays the text passed to it as arguments.

Syntax: $ echo <text>
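As a hedged illustration of how such commands are ordinary programs whose output can be captured, the Python sketch below runs `echo` and `pwd` through the standard `subprocess` module. It assumes a UNIX-like system on which both commands are installed; the helper name `run` is introduced only for this example.

```python
import os
import subprocess


def run(command: list[str]) -> str:
    """Run a command and return its standard output without the trailing newline."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.rstrip("\n")


# echo prints the text it receives as arguments.
print(run(["echo", "hello", "world"]))

# pwd prints the current working directory; os.getcwd() asks the
# operating system the same question from within Python.
print(run(["pwd"]))
print(os.getcwd())
```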

TELNET

TELNET is an abbreviation for Terminal Network. TELNET enables the establishment of a






connection to a remote system in such a way that the local terminal appears to be a terminal at the

remote system.

TELNET is a general purpose client-server program.

Briefly, TELNET is a program that allows the user to log into another computer on the Internet as a user on that system. With TELNET, a user can log into a server to access information stored on it.

For example, public databases such as GenBank, EMBL and PDB work on the principle of TELNET.

File Transfer Protocol (FTP)

File Transfer Protocol (FTP) is a standard mechanism provided by TCP/IP for copying a file from one

host to another. Transferring files from one computer to another is one of the most common tasks expected

from a networking or internetworking environment.

FTP differs from other client-server applications in that it establishes two connections between the

hosts.

One connection is used for data transfer; and the other for control information (commands and

responses), making FTP more efficient.

The client has three components

The user interface

The client control process and

The client data transfer process.

The server has two components;

The server control process

The server data transfer process

The control connection remains open during the entire interactive FTP session. The data connection is opened each time a command that involves transferring a file is used, and it is closed when that file has been transferred.

The two FTP connections are

Control

Data
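The lifetime of the two connections can be sketched with a toy model. The Python class below is not an FTP client and does no networking; the class name `FtpSessionModel` and its methods are hypothetical, introduced only to show that the control connection spans the whole session while a data connection exists only for the duration of each transfer.

```python
class FtpSessionModel:
    """Toy model of an FTP session: one long-lived control connection,
    one short-lived data connection per file transfer."""

    def __init__(self) -> None:
        self.control_open = False
        self.data_open = False
        self.log: list[str] = []

    def connect(self) -> None:
        self.control_open = True
        self.log.append("control: opened")

    def transfer(self, filename: str) -> None:
        if not self.control_open:
            raise RuntimeError("control connection is not open")
        # The command itself travels over the control connection.
        self.log.append(f"control: command to transfer {filename}")
        # The data connection exists only while this file is being moved.
        self.data_open = True
        self.log.append(f"data: opened for {filename}")
        self.log.append(f"data: closed after {filename}")
        self.data_open = False

    def quit(self) -> None:
        self.control_open = False
        self.log.append("control: closed")


session = FtpSessionModel()
session.connect()
session.transfer("a.txt")
session.transfer("b.txt")
session.quit()
print("\n".join(session.log))
```

In the printed log the control connection is opened and closed once, whereas the data connection is opened and closed once per file.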

HYPERTEXT TRANSFER PROTOCOL (HTTP)

HTTP is a protocol used mainly to access data on the World Wide Web (WWW).

The protocol transfers data in the form of plain text, hypertext, audio, video and so on.

It is efficient for use in a hypertext environment, where there are rapid jumps from one document to another.

HTTP functions like a combination of FTP and SMTP: it resembles FTP in that it transfers files, and SMTP in that the data exchanged between client and server take the form of messages.

BIOLOGICAL DATABASES

A database is a collection of data stored in a standardized format, designed to be shared by multiple users. Databases provide the long-term memory of computer operations; they take on a variety of names, depending on their structure, contents, use and the amount of data they contain.

Databases are effectively electronic filing cabinets: a convenient and efficient method of storing vast amounts of information.

Databases differ in

o the nature of the information being stored, and

o the manner of data storage.

CLASSIFICATION OF DATABASES

PRIMARY SEQUENCE DATABASES

SECONDARY DATABASES

COMPOSITE DATABASES

PRIMARY SEQUENCE DATABASES

In the early 1980s, sequence information started to become more abundant in the scientific literature, owing to the invention of DNA and protein sequencing tools and to developments in molecular biology. Realizing the impact of this, several laboratories started collecting and storing these biological sequences in central repositories (data warehouses), leading to the birth of the primary databases.

The most important primary databases are

For nucleic acid sequences:

o EMBL (European Molecular Biology Laboratory)

o GenBank

o DDBJ (DNA Data Bank of Japan)

For protein sequences:

o PIR(Protein Information Resources)

o MIPS

o SWISS-PROT

o TrEMBL

o NRL-3D






NUCLEIC ACID SEQUENCE DATABASES

The principal DNA sequence databases are GenBank (USA), EMBL (European Molecular Biology Laboratory, Europe) and DDBJ (Japan), which exchange data on a daily basis to ensure comprehensive coverage at each of the sites.

EMBL, the nucleotide sequence database from the European Bioinformatics Institute (EBI), includes sequences from

o direct author submissions,

o genome sequencing groups,

o the scientific literature, and

o patent applications.

Information can be retrieved from EMBL using the Sequence Retrieval System (SRS), which links the principal DNA and protein sequence databases with motif, structure, mapping and other specialized databases.

EMBL also has links with MEDLINE.

EMBL can be searched with query sequences via EBI's web interfaces to the BLAST and FastA programs.

DDBJ

DDBJ, the DNA Data Bank of Japan, began in 1986 in collaboration with EMBL and GenBank. The database is produced, maintained and distributed at the National Institute of Genetics.

GenBank

GenBank, the DNA database from the NCBI, incorporates sequences from publicly available sources,

primarily from direct author submissions and large-scale sequencing projects.

For convenience, GenBank is split into divisions (17 as of 2005), such as:

o PRI Primate

o ROD Rodent

o MAM Other mammalian

o SYN Synthetic

o HTG High throughput genomic sequences

Information can be retrieved from GenBank using the Entrez integrated retrieval system. Entrez combines data from the principal DNA and protein sequence databases with information from genome maps and protein structures.

Additional information on the sequences can be accessed via MEDLINE, which provides abstracts of the original published articles.






STRUCTURE OF GENBANK ENTRIES

GenBank entries include sequence files, indices created on various database fields and information derived from the database.

The figure below depicts a GenBank entry.

In the figure, the keywords include LOCUS, DEFINITION, ACCESSION, NID, SOURCE, REFERENCE, FEATURES, BASE COUNT and ORIGIN.

The LOCUS line introduces a short label for the entry that may suggest the function of the sequence, and summarizes the number of bases, the source of the DNA, and the date of submission (here, "1593 bp DNA linear VRL", i.e., a 1593 bp linear DNA from a viral source).

The DEFINITION line contains a concise definition of the sequence (Canine parvovirus, VP-2 gene, complete cds).

Following this, the ACCESSION line gives the accession number (AY742949).

The KEYWORDS line introduces a list of short phrases, assigned by the author, describing gene products and other relevant information about the entry (here, VP-2 gene of CPV, viral coat protein).

The SOURCE record provides information on the source from which the data have been derived (here, a fecal sample).
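The keyword layout described above lends itself to simple parsing. The Python sketch below is a minimal illustration, not a full GenBank parser: it assumes that each keyword occupies the first 12 columns of a line and that indented lines continue the previous field. The abridged record is assembled only from the values quoted in the text; the use of the accession number as the locus name, the spacing, and the omission of the date are assumptions made for the example.

```python
# Abridged, illustrative record built from the values quoted in the text.
# The locus name and the exact column spacing are assumptions.
RECORD = """\
LOCUS       AY742949                1593 bp    DNA     linear   VRL
DEFINITION  Canine parvovirus, VP-2 gene,
            complete cds.
ACCESSION   AY742949
"""


def parse_header(text: str) -> dict[str, str]:
    """Collect keyword fields, joining indented continuation lines."""
    fields: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        keyword = line[:12].strip()
        value = line[12:].strip()
        if keyword:
            # A new field starts when the keyword columns are not blank.
            current = keyword
            fields[current] = value
        elif current is not None:
            # Blank keyword columns: this line continues the previous field.
            fields[current] += " " + value
    return fields


fields = parse_header(RECORD)
length_bp = int(fields["LOCUS"].split()[1])
print(fields["ACCESSION"], length_bp, fields["DEFINITION"])
```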

SECONDARY DATABASES

In addition to the numerous primary and composite resources, there are many secondary (or pattern)

databases, so-called because they contain the fruits of analyses of the sequences in the primary sources. Some

of the main secondary resources are;

SECONDARY DATABASE   PRIMARY SOURCE    STORED INFORMATION
PROSITE              SWISS-PROT        Regular expressions (patterns)
Profiles             SWISS-PROT        Weighted matrices (profiles)
PRINTS               OWL               Aligned motifs (fingerprints)
Pfam                 SWISS-PROT        Hidden Markov Models (HMMs)
BLOCKS               PROSITE/PRINTS    Aligned motifs (blocks)
IDENTIFY             BLOCKS/PRINTS     Fuzzy regular expressions (patterns)

PROSITE

PROSITE was the first secondary database to be developed; it is maintained by the Swiss Institute of Bioinformatics. The rationale behind it was that protein families could be simply and effectively characterized by the single most conserved motif observable in a multiple alignment of known homologues, such motifs usually encoding key biological functions.

Within PROSITE, motifs are encoded as regular expressions, referred to as patterns. The process used to derive patterns involves the construction of a multiple alignment and manual inspection to identify conserved regions. Sequence information within individual motifs is reduced to single consensus expressions, and the resulting seed patterns are used to search SWISS-PROT. Results are checked manually to determine how well the patterns perform.
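Because PROSITE patterns are regular expressions, they can be translated mechanically into the syntax of a general-purpose regex engine. The Python sketch below handles the common elements of the PROSITE notation (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends). The function name `prosite_to_regex`, the example pattern and the test sequence are illustrative assumptions and are not taken from a specific PROSITE entry.

```python
import re

# One pattern element: optional '<', a residue specification,
# an optional repeat count, optional '>'.
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into Python regex syntax."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, residue, low, high, end = match.groups()
        if residue == "x":
            core = "."                            # any residue
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"     # any residue except these
        else:
            core = residue                        # single residue or [allowed set]
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + ("$" if end else ""))
    return "".join(parts)


regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
hit = re.search(regex, "MKCAACAAALAAAAAAAAHAAAHGG")  # made-up sequence
print(regex)
print(hit.start(), hit.group())
```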

STRUCTURE CLASSIFICATION DATABASES

Many proteins share structural similarities, reflecting common evolutionary origins. The evolutionary

process involves substitutions, insertions and deletions in amino acid sequences. For distantly related

proteins, such changes can be extensive, yielding folds in which the numbers and orientations of secondary

structures vary considerably. Two classification schemes, together with a different kind of structural information database, are described here:

SCOP (Structural Classification of Proteins)

CATH (Class, Architecture, Topology, Homology)

PDBsum, a different kind of structural information database

SCOP

The SCOP (Structural Classification of Proteins) database is maintained by the MRC Laboratory of Molecular Biology and the Centre for Protein Engineering.

The database describes the structural and evolutionary relationships between proteins of known structure.

As current automatic structure comparison tools cannot reliably identify all such relationships, SCOP has been constructed using a combination of manual inspection and automated methods.

The SCOP classification is a hierarchy that reflects the structural and evolutionary relatedness of proteins. Within the hierarchy there are many levels, but principally these describe the family, superfamily and fold.

Family – Proteins are clustered into families with clear evolutionary relationships if they have sequence identities ≥30%. This is not an absolute measure, however, as it is possible to infer common descent from similar structures and functions in the absence of significant sequence identity.

Superfamily – Proteins are placed in superfamilies when, in spite of low sequence identity, their structural and functional characteristics suggest a common evolutionary origin.

Fold – Proteins are classed as having a common fold if they have the same secondary structures in the same arrangement and with the same topology.
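The sequence identity threshold used at the family level can be illustrated with a short calculation. The Python sketch below computes percent identity over the aligned, ungapped columns of a pairwise alignment; this choice of denominator is one of several conventions in use and is an assumption here, as are the function name and the two made-up aligned sequences.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Two hypothetical aligned sequences; '-' marks a gap.
identity = percent_identity("MKTAYIAK-QR", "MKSAYLAKEQR")
print(identity, identity >= 30.0)
```

As the text notes, the threshold is only a guide: proteins below it may still be grouped together on structural and functional evidence.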

CATH

The CATH (Class, Architecture, Topology, Homology) database is a hierarchical domain classification of protein structures maintained at UCL. The resource is largely derived using automatic methods, assisted by manual inspection. Different categories within the classification are identified by means of both unique numbers and descriptive names. Such a numbering scheme allows efficient computational manipulation of the data. There are five levels within the hierarchy:

Class

Architecture

Topology

Homology

Sequence

CATH is accessible for keyword interrogation via the web server of UCL's Biomolecular Structure and Modelling Unit.

FSSP

FSSP (Fold classification based on Structure-Structure alignment of Proteins; families of structurally similar proteins) is based on the structural alignment of pairwise combinations of proteins in the PDB. Alignments and classification are carried out automatically with the DALI structure comparison program and are updated continuously, so the database presents a continuously updated structural classification of three-dimensional protein folds.

MMDB

NCBI's macromolecular 3D structure database is called MMDB, the Molecular Modeling Database. It is also known as the NCBI structure division. It was designed to archive structural data from the PDB as well as structures of biomolecules obtained by electron microscopy. MMDB is linked to the rest of NCBI's databases.

PDBsum

PDBsum is a web-based compendium of structural information maintained at UCL. PDBsum provides summaries and analyses of all structures in the PDB. Each summary gives an at-a-glance overview of the contents of a PDB entry in terms of resolution and R-factor, number of protein chains, ligands, metal ions, secondary structures, folds and ligand interactions. This compendium is helpful for retrieving 1D (sequence), 2D (motif) and 3D (structure) information on proteins.

DALI

The DALI (Distance mAtrix aLIgnment) server is a network service used to compare three-dimensional protein structures. The coordinates of the query structure are compared against those in the Protein Data Bank. The output is a multiple alignment of structural neighbours. The DALI server is useful for comparing 3D structures where similarities are not detectable by comparing sequences directly. The comparison uses the MaxSprout program to generate backbone and side-chain coordinates.






UNIT II

DOT PLOT matrix

Dot plots

Dot plots have proved to be an easy yet powerful means of sequence comparison. A dot plot is a graphical representation of the similarities between two sequences. Dot matrix analysis is especially useful for identifying repeats (repeated characters or short sequences) and for simple alignment problems, especially those involving sequences with relatively high degrees of similarity.

Merits of dot plots

Dot plots are useful as a first-level filter for determining an alignment between two sequences.

Regions of similarity will show up as diagonals within the dot plot matrix.

Regions containing insertions/deletions are easily identified with respect to the coding regions.

Regions of direct and inverted repeats can be detected using dot plots

Dot plots show all possible matches of residues between two sequences, given a certain threshold level.

Dot plots can be used to compare two different assemblies of the same sequence.

Dot plots are an excellent approach for finding sequence transpositions.

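    C O R R E L A T I O N S
R   . . × × . . . . . . . .
E   . . . . × . . . . . . .
L   . . . . . × . . . . . .
A   . . . . . . × . . . . .
T   . . . . . . . × . . . .
I   . . . . . . . . × . . .
O   . × . . . . . . . × . .
N   . . . . . . . . . . × .
S   . . . . . . . . . . . ×
H   . . . . . . . . . . . .
I   . . . . . . . . × . . .
P   . . . . . . . . . . . .
S   . . . . . . . . . . . ×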
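The matrix for the two words CORRELATIONS and RELATIONSHIPS can be generated with a few lines of code. The Python sketch below is a minimal version that places a dot wherever two single characters are identical; it uses no window or threshold, and the function names `dot_matrix` and `render` are assumptions made for this example.

```python
def dot_matrix(horizontal: str, vertical: str) -> list[list[bool]]:
    """Return one row per character of `vertical`; a cell is True where
    that character equals the corresponding character of `horizontal`."""
    return [[v == h for h in horizontal] for v in vertical]


def render(horizontal: str, vertical: str) -> str:
    """Draw the matrix as text, with 'x' for a match and '.' otherwise."""
    lines = ["  " + " ".join(horizontal)]
    for v, row in zip(vertical, dot_matrix(horizontal, vertical)):
        lines.append(v + " " + " ".join("x" if hit else "." for hit in row))
    return "\n".join(lines)


matrix = dot_matrix("CORRELATIONS", "RELATIONSHIPS")
print(render("CORRELATIONS", "RELATIONSHIPS"))
```

The shared substring RELATIONS shows up as a run of dots along one diagonal, which is how regions of similarity are recognized in a dot plot.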

Demerits of dot plots

They can be noisy due to lots of insignificant matches when long sequences are compared

Dot plots are not alignments.

Dot plot packages

EMBOSS software package

o dotmatcher

o dotpath

o dottup

Vector NTI software package

o Dotlet Java applet

o Dotter

GCG software package

o Compare

o Dotplot

PAM & BLOSUM

Scoring matrices-Definition

A scoring matrix gives the score for aligning two amino acids (match or mismatch) in a pairwise alignment. A scoring matrix can be considered a measure of evolutionary change. The most widely used matrices are the PAM and BLOSUM matrices. Both are based on substitution frequencies between amino acids, and both are derived from known protein alignments.

Substitution matrices algorithms

Scoring matrices are also known as substitution matrices. These tools are used extensively

to model mutations in sequences. The most common method of creating a substitution matrix for

amino acids is to observe the actual substitution rates among the various amino acid residues in nature.

The score is favourable if the substitution is observed frequently. On the other hand, the alignment of a pair of

residues is penalized if the substitution is not observed frequently.

A score that counts only identical residues looks only at the degree of identity. It does not show

how similar those proteins are with respect to their structure and function. Thus, to make amino acid

alignment and scoring more significant, matrices are developed that score mutations among amino acids with

similar physicochemical properties.

Types of scoring models

There are two popular scoring models for protein sequences;

PAM (Point accepted mutation)

BLOSUM (BlOcks Substitution Matrix)

Differences between PAM and BLOSUM

PAM | BLOSUM

Based on an explicit evolutionary model | Based on empirical frequencies

Represents a specific evolutionary distance | Always a blend of distances as seen in the database and PROSITE

Ranges from identical to completely random | Narrower range than the PAM matrices

PAM matrices

PAM stands for point accepted mutation (also expanded as percent accepted mutation)

Margaret Dayhoff and co-workers proposed the PAM model.

These matrices are not mere mathematical constructs designed simply to facilitate computational

sequence alignment, but they reflect the biology of the molecules represented by the sequences.

There are many types of PAM matrices

o PAM1 - 1% of residues have undergone mutation

o PAM 250- An average of 2.5 accepted mutations per residue.

o PAM 1000-

A matrix value of 0 signifies that a substitution typically occurs at a random base rate.

A negative matrix value implies that the substitution is less likely than by chance alone






A positive matrix value means that substitution occurs more often than suggested by chance.

Creation of PAM matrix

Step 1 : Construct multiple sequence alignment between sequences with high similarity

Step 2 : Construct phylogenetic tree to show the order of various substitutions

Step 3 : Compute the relative mutability $m_j$ for each amino acid $j$

Step 4 : Compute the relative mutability divided by the total number of mutations and multiplied by the

frequency of the amino acid and a scaling factor of 100.

Step 5 : Compute the substitution tally $A_{i,j}$

Step 6 : Calculate the mutation probability $M_{i,j}$ for each pair of amino acids:

$M_{i,j} = \dfrac{m_j A_{i,j}}{\sum_i A_{i,j}}$

Step 7 : Divide the mutation probability $M_{i,j}$ by the frequency of occurrence $f_i$ of residue

$i$, and then calculate $R_{i,j}$ by taking the log of the resulting value for each entry of the PAM matrix. This is

repeated to compute the values for the non-diagonal entries in the PAM matrix.

Step 8 : Diagonal entries in the PAM matrix are computed by taking $M_{j,j} = 1 - m_j$ and then proceeding

as in Step 7.
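
Steps 6 to 8 can be sketched as follows. This is only an illustration under stated assumptions: the three-letter alphabet, the tally A, the mutabilities m and the frequencies f are made-up toy numbers (Steps 1 to 5 are not reproduced), the log is taken to base 10, and the function name pam_from_tally is hypothetical.

```python
import math


def pam_from_tally(A, m, f):
    """Return (M, R): mutation probabilities M[i][j] and log-odds R[i][j].

    A[i][j] : substitution tally between residues i and j (toy values)
    m[j]    : relative mutability of residue j (toy values)
    f[i]    : frequency of occurrence of residue i (toy values)
    """
    letters = list(m)
    M = {i: {} for i in letters}
    for j in letters:
        column_total = sum(A[i][j] for i in letters if i != j)
        for i in letters:
            if i == j:
                M[i][j] = 1.0 - m[j]                     # Step 8, diagonal
            else:
                M[i][j] = m[j] * A[i][j] / column_total  # Step 6
    # Step 7: log-odds of the mutation probability against the frequency
    R = {i: {j: math.log10(M[i][j] / f[i]) for j in letters} for i in letters}
    return M, R


# Toy, symmetric tally for an assumed alphabet of three residues
A = {"A": {"B": 30, "C": 10}, "B": {"A": 30, "C": 20}, "C": {"A": 10, "B": 20}}
m = {"A": 0.01, "B": 0.02, "C": 0.015}
f = {"A": 0.5, "B": 0.3, "C": 0.2}
M, R = pam_from_tally(A, m, f)
```

With these numbers each column of M sums to 1, conserved residues (the diagonal) get positive log-odds, and the rare substitutions get negative ones.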

Multiple Sequence Alignment

A multiple sequence alignment (MSA) is an alignment of three or more biological sequences, generally protein or DNA. MSA

is a tool to determine levels of homology and to infer phylogeny among the members of a set of related

sequences. Visual depictions illustrate point mutations, and insertions or deletions that appear as gaps in one or

more sequences in the sequence alignment. MSA is used to assess sequence conservation of protein domains,

tertiary and secondary structures.

MSA is time consuming to handle manually when more than three biological sequences are

compared. Hence, computational algorithms are used to analyze alignments.

Tools for MSA

HMM

BLOCKS-HMM profile library

CDD-Conserved Domain Database

Pfam-Profile HMM library

PRINTS-Protein Fingerprints from SP/TrEMBL

PROSITE- A dictionary of protein motifs

AMAS-Analyze Multiple Aligned Sequences

ClustalW-General purpose MSA tool

DIALIGN-Local MSA

Musca-MSA

CINEMA-Colour Interactive Editor for Multiple Alignment

MEME/MAST-Search motifs and then query it against the database

MultiAlin-MSA with hierarchical clustering

Types of MSA

Global MSA

Local MSA

Multiple sequence alignment requires more sophisticated methodologies than pairwise alignment. Most

MSA programs use heuristic methods rather than global optimization.






Methods for MSA

Dynamic programming

Sum of pair method

Progressive alignment

Iterative methods

Hidden Markov Models

Genetic Algorithms

Automated tools (Macaw, Meme etc)

Dynamic programming

Refer to Dynamic programming (Needleman-Wunsch algorithm and Smith-Waterman

algorithm)

Sum of pair method

The SP score of a multiple sequence alignment is the sum of the scores of the pair-wise alignments

implied by the multiple alignment. DP for multiple sequences is similar to DP for two sequences. However, instead of

aligning two sequences at a time with dynamic programming, one needs to align three or more simultaneously.

For three sequences this requires a three-dimensional matrix, which makes the method cumbersome and practically

difficult to apply.

A multiple alignment of N sequences implies N(N-1)/2 pair-wise alignment:

Example:

– Scoring system: identity +1, mismatch -2, gap -(2 + 2 × gap length)

– Multiple alignment and implied pair-wise alignments:

Consider three aligned sequences: seq A = ACCTGTGAC, seq B = AC-TGTAAC, seq C = ACCTGT-AC

A with B       A with C       B with C
ACCTGTGAC      ACCTGTGAC      AC-TGTAAC
AC-TGTAAC      ACCTGT-AC      ACCTGT-AC
Score = 1      Score = 4      Score = -1

Sum of pairs (SP) score = 1 + 4 – 1 = 4
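
The worked example can be checked with a short script. This is a sketch that assumes the gap penalty is -(2 + 2 × gap length), which is the reading that reproduces the three pairwise scores of the example; the function names pair_score and sp_score are hypothetical.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-2, gap_open=-2, gap_extend=-2):
    """Score one implied pairwise alignment (two rows of the MSA)."""
    score = 0
    previous_gap = None  # which row (0 or 1) held a gap in the previous column
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # a column of two gaps says nothing about this pair
        if x == "-" or y == "-":
            side = 0 if x == "-" else 1
            if previous_gap != side:
                score += gap_open    # a new gap starts here
            score += gap_extend      # charged once per gap position
            previous_gap = side
        else:
            score += match if x == y else mismatch
            previous_gap = None
    return score


def sp_score(alignment):
    """Sum the scores of all N(N-1)/2 implied pairwise alignments."""
    return sum(pair_score(a, b) for a, b in combinations(alignment, 2))


msa = ["ACCTGTGAC", "AC-TGTAAC", "ACCTGT-AC"]
print([pair_score(a, b) for a, b in combinations(msa, 2)], sp_score(msa))
```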

Progressive alignment construction

One heuristic approach to MSA is the progressive technique (a two-step method) that

builds up a final alignment by performing a series of pairwise alignments on successively less closely

related sequences.

This approach has two steps

Find the most similar sequences by comparing all pairs of the sequences using the

dynamic programming method to obtain similarity scores.

Consensus sequences are derived from the most similar sequences in these comparisons and are aligned with each other or

with other sequences which are similar to them; finally all the sequences are joined, using a

guide tree to fix the order. The alignment obtained will favour the most similar sequences.

Demerits

Initial clustering of related sequences

Quality of alignments

Sensitive

The most closely related pair is determined by a clustering method such as the neighbour-joining method, and

the progressive technique then follows the resulting guide tree, which resembles a phylogenetic tree.






Examples of progressive alignments are

Clustal – It is a widely used MSA tool. It has two main variants

o Clustal X

o Clustal W

T-Coffee

PileUp-UPGMA (Unweighted Pair Group Method with Arithmetic Mean)

Iterative methods

A set of methods that produce MSAs while reducing the errors inherent in progressive methods are

classified as iterative because they work similarly to progressive methods but repeatedly realign the initial

sequences as well as adding new sequences to the growing MSA.

Examples for Iterative methods are

MUSCLE

CHAOS/DIALIGN

PRRN/PRPP

Hidden Markov Models

HMMs are probabilistic models that can assign likelihoods to all possible combinations of gaps,

matches and mismatches to determine the most likely MSA or set of possible MSAs.

HMMs produce highest-scoring output generating the best optimal alignment which is of biological

significance producing both local and global alignments. HMMs are dynamic as they are probabilistic.

Examples of HMMs

Viterbi algorithm - align the growing MSA to the next

POA-Partial Order Algorithm- HMM based

SAM-Sequence Alignment and Modelling Package

Genetic algorithm

A genetic algorithm is a dynamic, iterative process based on evolutionary

principles, wherein a particular function or definition that best fits the constraints of an environment survives

to the next generation, and the other functions are eliminated. This method works by breaking a series of

possible MSAs into fragments and repeatedly rearranging those fragments with introduction of gaps at varying

positions.

Examples of Genetic algorithm

SAGA-Sequence Alignment by Genetic Algorithm

MSASA-Multiple Sequence Alignment by Simulated Annealing

Applications of Multiple Sequence Alignment

The principal applications of MSA are as follows;

Alignment of amino and nucleotide sequences

Searching for sequences

PCR primer design

Needleman –Wunsch algorithm (Global alignment)

The most basic algorithm to align two sequences was developed by S.A. Needleman and C.D.Wunsch (1970, J.

Mol. Biol. 48:443). The algorithm is a simple and beautiful way to find an alignment that maximizes a

particular score.

The initial steps of the algorithm are reminiscent of the dot plot. The first step is to place the two sequences

along the margins of a matrix






The introduction of a gap (either by an insertion or a deletion - an indel) in either sequence would correspond to

moving either above or below the main diagonal. To find the best route, Needleman and Wunsch suggested that

you modify the matrix to represent this idea of tracing different pathways through the matrix. However, you

want to include all possible pathways and from among these choose only that one which is best (in the sense of

maximizing some score). Their method consists of two passes through the matrix. The first pass traces a score

for all possible routes and moves right to left, bottom to top. Once the score for all possible routes are found, the

maximum can be chosen (it will be somewhere on the topmost row or leftmost column) and a second pass can

be carried out, this time running left to right, top to bottom to find that alignment that gives the maximum score.

The following is an example of global sequence alignment using Needleman/Wunsch techniques. For

this example, the two sequences to be globally aligned are

G A A T T C A G T T A (sequence #1)

G G A T C G A (sequence #2)

So M = 11 and N = 7 (the length of sequence #1 and sequence #2, respectively)

General algorithm

A simple scoring scheme is assumed where

Si,j = 1 if the residue at position i of sequence #1 is the same as the residue at position j of

sequence #2 (match score); otherwise

Si,j = 0 (mismatch score)

w = 0 (gap penalty)

Three steps in dynamic programming

1. Initialization

2. Matrix fill (scoring)

3. Traceback (alignment)

Step 1 Initialization Step

The first step in the global alignment dynamic programming approach is to create a matrix with M + 1

columns and N + 1 rows where M and N correspond to the size of the sequences to be aligned. In this

example there is no gap opening or gap extension penalty, the first row and column is initially filled with 0.

Step 2 Matrix Fill Step

The matrix fill step starts in the upper left hand corner of the matrix and finds the maximal score

Mi,j for each position in the matrix. To find Mi,j, it is necessary to know Mi-1,j, Mi,j-1 and Mi-1,j-1.

For each position, Mi,j is defined to be the maximum score at position i,j; i.e.

Mi,j = MAXIMUM[

Mi-1,j-1 + Si,j (match/mismatch in the diagonal),

Mi,j-1 + w (gap in sequence #1),

Mi-1,j + w (gap in sequence #2)]

Since the first residue in both sequences is a G, S1,1 = 1, and by the assumptions stated at the

beginning, w = 0. Thus, M1,1 = MAX[M0,0 + 1, M1, 0 + 0, M0,1 + 0] = MAX [1, 0, 0] = 1.

A value of 1 is then placed in position 1,1 of the scoring matrix.

Use the algorithm, fill the column and rows. For example, at row 1 and column 1, both the bases are

guanine(G), the assigned value of maximum of 1(match), 0(horizontal gap) or 0 (vertical gap).

Similarly now let's look at column 2. The location at row 2 will be assigned the value of the maximum

of 1(mismatch), 1(horizontal gap) or 1 (vertical gap). So its value is 1.

At the position column 2 row 3, there is an A in both sequences. Thus, its value will be the maximum

of 2(match), 1 (horizontal gap), 1 (vertical gap) so its value is 2.

Moving along to position column 2 row 4, its value will be the maximum of 1 (mismatch), 1

(horizontal gap), 2 (vertical gap) so its value is 2.

Note that for all of the remaining positions except the last one in column 2, the choices for the value

will be the exact same as in row 4 since there are no matches. The final row will contain the value 2

since it is the maximum of 2 (match), 1 (horizontal gap) and 2(vertical gap).

Step 3 Traceback Step

After the matrix fill step, the maximum alignment score for the two test sequences is 6.

The traceback step determines the actual alignment(s) that result in the maximum score.

The traceback step begins in the M,N position in the matrix, i.e. the position that leads to the maximal

score. In this case, there is a 6 in that location.

Traceback takes the current cell and looks at the neighbour cells (to its left, above it, and the

diagonal neighbour) that could be direct predecessors of the cell in consideration. Each neighbour

corresponds to a gap, a match or a mismatch.

o Neighbour to the left (gap in sequence #2),

o The diagonal neighbour (match/mismatch), and

o The neighbour above it (gap in sequence #1).

The algorithm for traceback chooses as the next cell in the sequence one of the possible predecessors.
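
The three steps (initialization, matrix fill, traceback) can be sketched as below, using the scoring scheme of the example (match 1, mismatch 0, gap 0). The function name needleman_wunsch and the tie-breaking order in the traceback (diagonal first, then left, then above) are choices made for this illustration; several alignments reach the maximum score, and this sketch returns only one of them.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=0, gap=0):
    """Global alignment; returns (score, aligned seq1, aligned seq2)."""
    cols, rows = len(seq1), len(seq2)
    # Step 1: initialization, (N + 1) rows by (M + 1) columns
    M = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, cols + 1):
        M[0][i] = i * gap
    for j in range(1, rows + 1):
        M[j][0] = j * gap
    # Step 2: matrix fill, starting from the upper left corner
    for j in range(1, rows + 1):
        for i in range(1, cols + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            M[j][i] = max(M[j - 1][i - 1] + s,   # diagonal: match/mismatch
                          M[j - 1][i] + gap,     # above: gap in sequence #1
                          M[j][i - 1] + gap)     # left: gap in sequence #2
    # Step 3: traceback from the bottom right cell
    out1, out2 = [], []
    i, j = cols, rows
    while i > 0 or j > 0:
        both = i > 0 and j > 0
        s = match if both and seq1[i - 1] == seq2[j - 1] else mismatch
        if both and M[j][i] == M[j - 1][i - 1] + s:
            out1.append(seq1[i - 1]); out2.append(seq2[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[j][i] == M[j][i - 1] + gap:
            out1.append(seq1[i - 1]); out2.append("-"); i -= 1
        else:
            out1.append("-"); out2.append(seq2[j - 1]); j -= 1
    return M[rows][cols], "".join(reversed(out1)), "".join(reversed(out2))


score, aligned1, aligned2 = needleman_wunsch("GAATTCAGTTA", "GGATCGA")
print(score)
print(aligned1)
print(aligned2)
```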






UNIT III

MUTATIONS

Any change in the DNA sequence of an organism is a mutation. Mutations are the source of the altered

versions of genes that provide the raw material for evolution. Most mutations have no effect on the organism,

especially among the eukaryotes, because a large portion of the DNA is not in genes and thus does not affect

the organism’s phenotype. Of the mutations that do affect the phenotype, the most common effect of

mutations is lethality, because most genes are necessary for life. There are various types of mutations such as

Point mutation

Insertion mutation

Deletion mutation

Frameshift mutation

Types of mutations in ORFs

Nonsense mutation

Base pair substitution results in a stop codon (and shorter polypeptide). Example: Hb-β McKees Rocks.

Normal beta-globin is 146 amino acids long. In this mutation, codon 145 UAU (codes for tyrosine) is mutated

to UAA (stop). The final protein is thus 144 amino acids long. The clinical effect is to cause overproduction

of red blood cells, resulting in thick blood subject to abnormal clotting and bleeding.

Non synonymous/missense mutation

Base pair substitution results in substitution of a different amino acid. Example: HbS, sickle cell

hemoglobin, is a change in the beta-globin gene, where a GAG codon is converted to GUG. GAG codes for

glutamic acid, which is a hydrophilic amino acid that carries a -1 charge, and GUG codes for valine, a

hydrophobic amino acid. This amino acid is on the surface of the globin molecule, exposed to water. Under

low oxygen conditions, valine’s affinity for hydrophobic environments causes the hemoglobin to crystallize

out of solution.

Neutral nonsynonymous mutation

Base pair substitution results in substitution of an amino acid with similar chemical properties (protein

function is not altered).






Synonymous/silent mutation

Base pair substitution results in the same amino acid.

Frameshift mutations:

Deletions or insertions (not divisible by 3) result in translation of incorrect amino acids, stop codons

(shorter polypeptides), or read-through of stop codons (longer polypeptides). Example: Hb-α Wayne. The final

codons of the alpha globin chain are usually AAA UAC CGU UAA, which code for lysine- tyrosine-arginine-

stop. In the mutant, one of the A’s in the first codon is deleted, resulting in altered codons: AAU ACC GUU

AAG, for asparagine-threonine-valine-lysine. There are also 5 more new amino acids added to this, until the

next stop codon is reached.
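
The mutation types above can be checked against the standard genetic code with a short script. This is an illustrative sketch: the helper names translate and classify_substitution are hypothetical, and the classifier only distinguishes silent, nonsense and missense changes for a single substituted codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order; '*' is a stop
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(rna):
    """Translate an RNA string codon by codon, ignoring a trailing partial codon."""
    usable = len(rna) - len(rna) % 3
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, usable, 3))


def classify_substitution(before, after):
    """Classify a base substitution inside one codon."""
    aa_before, aa_after = CODON_TABLE[before], CODON_TABLE[after]
    if aa_before == aa_after:
        return "silent"
    if aa_after == "*":
        return "nonsense"
    return "missense"


print(classify_substitution("UAU", "UAA"))  # Hb-beta McKees Rocks, codon 145
print(classify_substitution("GAG", "GUG"))  # HbS
normal = "AAAUACCGUUAAG"                    # end of alpha globin plus the next base
print(translate(normal), translate(normal[1:]))  # Hb-alpha Wayne: one A deleted
```

Deleting a single base shifts every downstream codon, so the stop codon UAA is no longer read in frame and translation continues past it.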

Reverse mutations and suppressor mutations

Forward mutation

Mutation changes wild type to mutant.

Reverse mutation (back mutation)

Mutation changes mutant to wild type. Reversion to the wild type amino acid restores function.

Reversion to another amino acid partly or fully restores function.






Suppressor mutation

Occur at sites different from the original mutation and mask or compensate for the initial mutation

without reversing it.

• Intragenic suppressors occur in the same gene; e.g., a nearby addition restores a deletion

• Intergenic suppressors occur on a different gene.

Spontaneous and induced mutations

Spontaneous mutations

Spontaneous mutations can occur at any point of the cell cycle. Movement of transposons (mobile

genetic elements) causes spontaneous mutations. Mutation rate = ~10^-4 to 10^-6

mutations/gene/generation. Rates vary by lineage, and many spontaneous errors are repaired. Different types of

spontaneous mutations include;

Wobble pairing

Insertions and Deletions

Deamination and Depurination (Mutations by spontaneous chemical changes)

Depurination

Common; A or G are removed and replaced with a random base.






Deamination

Amino group is removed from a base (C






Induced mutations

Radiation (e.g.,X-rays, UV). Ionizing radiation breaks covalent bonds including those in DNA and is

the leading cause of chromosome mutations. Ionizing radiation has a cumulative effect and kills cells at high

doses. UV (254-260 nm) causes purines and pyrimidines to form abnormal dimer bonds and bulges in the

DNA strands.

Base analogs

Similar to normal bases, incorporated into DNA during replication. Some cause mis-pairing (e.g.,

5-bromouracil). Not all are mutagenic.

METHODS OF ANALYZING PHYLOGENETIC TREE

There are various methods of building and analyzing phylogenetic trees. Broadly these methods are of

two types

Optimization methods

Algorithmic methods

These methods can either use character or distance data.






Distance methods are also called phenetic methods. The trees are constructed by similarities of

sequences and the resulting tree is called a dendrogram. A dendrogram does not necessarily reflect

evolutionary relationships. Distance methods compress all of the individual differences between pairs of

sequences into a single number.

Character based approaches are also called cladistic methods. The trees are calculated by considering

the various possible pathways of evolution and are based on parsimony or likelihood methods. The resulting

tree is called cladogram. Cladistic methods use each alignment position as evolutionary information to build

tree.

Reconstruction models

Inferring a phylogeny is done on the basis of incomplete information because we do not have

information about the past. This process involves performing the best estimate of the evolutionary history,

so it is an estimation process.

Two processes are involved in the inference: estimation of the topology and estimation of the branch

lengths for a given topology. Once the topology is known, the branch lengths can be estimated easily;

the harder problem is to estimate the topology.

Therefore, for comparing alternative phylogenies and to select one or few trees to represent the best

estimate of the true evolutionary history, we must specify reconstruction i.e. conditions or criteria for

constructing phylogeny.

A phylogenetic reconstruction consists of two steps;






1. Defining an optimality criterion or objective function

This step assigns a value to a tree and is subsequently used for comparing other trees.

2. Developing specific algorithms

These steps are used to develop an algorithm to compute the objective function values. This helps to

identify the tree or a set of trees that have the best values according to this criterion.

There are two approaches to phylogenetic reconstructions such as;

Evolutionary distance method

Character state method

Evolutionary distance method

In evolutionary distance method or distance matrix method, all possible pairs of sequences are aligned

to determine which pairs are the most similar or closely related. These alignments provide a measure of the

genetic distance between the sequences. These distance measurements are then used to predict the

evolutionary relationships.

There are various distance matrix algorithms such as

UPGMA

WPGMA

Neighbour-joining

Fitch-Margoliash

UPGMA

The clustering procedure called UPGMA stands for Unweighted Pair Group Method using Arithmetic

averages. The method is simple and intuitively appealing. It works by clustering the sequences, at each stage

amalgamating two clusters and at the same time creating a new node on a tree. The tree can be imagined as

being assembled upwards, each node being added above the others, and the edge lengths being determined by

the difference in heights of the nodes at the top and bottom of the edge.

The distance dij between two clusters i and j is the average distance between pairs of

sequences, one taken from each cluster.

UPGMA assumes that the rates of evolution are the same among different lineages

• In general, this method should not be used for phylogenetic tree reconstruction (unless this

assumption is believed to hold)

• Produces a rooted tree
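
The clustering procedure can be sketched as follows. This is a minimal illustration, not a production implementation: clusters are represented as nested tuples, distances are stored in a dictionary keyed by frozenset pairs, and the function name upgma and the toy distances are assumptions made for the example.

```python
from itertools import combinations


def upgma(names, dist):
    """Return (root, height): the tree as nested tuples and the node heights.

    dist maps a pair of labels (a, b) to the distance between them.
    """
    size = {n: 1 for n in names}        # number of sequences in each cluster
    height = {n: 0.0 for n in names}    # leaves sit at height 0
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(size) > 1:
        # amalgamate the two closest clusters
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        new = (a, b)
        height[new] = d[frozenset((a, b))] / 2
        for c in size:
            if c not in (a, b):
                # average distance over all pairs of sequences (size-weighted)
                d[frozenset((new, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[new] = size[a] + size[b]
        del size[a], size[b]
    root = next(iter(size))
    return root, height


toy = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
       ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10}
root, height = upgma(["A", "B", "C", "D"], toy)
print(root, height[root])
```

The edge length above a node is the height of its parent minus its own height; for the toy data the edge above the cluster (A, B) has length 3 - 1 = 2.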

WPGMA

In the Weighted Pair Group Method using Arithmetic mean (WPGMA), the two clusters being merged contribute equally to the

new distance regardless of their size, so that the candidate taxon K is equivalent in weighting to all previous taxa in the cluster.

Neighbour Joining (NJ)

NJ is a clustering method related to UPGMA that solves the same kind of problem as other distance matrix

algorithms. It is computationally fast and does not make the assumption that the rate of evolution is

the same in all lineages. It begins by choosing the two most closely related sequences and then adds the next most distant

sequence as a third branch to the tree.

NJ Algorithm

Step 1: Let $u_i = \sum_{k \neq i} M_{ik} / (n - 2)$, where n is the current number of nodes

– (Almost) “average” distance to other nodes

Step 2: Choose i and j for which $M_{ij} - u_i - u_j$ is smallest

– Look for nodes that are close to each other, and far from everything else

– Turns out minimizing this is minimizing sum of branch lengths

Step 3: Define a new cluster (i, j), with a corresponding node in the tree

Distance from i and j to node (i,j):

$d_{i,(i,j)} = 0.5\,(M_{ij} + u_i - u_j)$

$d_{j,(i,j)} = 0.5\,(M_{ij} + u_j - u_i)$

Step 4: Compute distance between new cluster and all other clusters:

$M_{(ij)k} = (M_{ik} + M_{jk} - M_{ij})/2$

Step 5: Delete i and j from matrix and replace by (i, j)

Step 6: Continue until only 2 leaves remain

Advantages

Fastest tree building method

Uses empirical substitution methods

Disadvantages

Tests only a single tree






Doesn’t consider intermediate
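
One iteration of the NJ algorithm (Steps 1 to 4) can be sketched as below. The sketch assumes that the "(almost) average" distance of Step 1 is the row sum divided by n - 2; the function name nj_step and the toy distances are hypothetical. The toy matrix is taken from a tree with branch lengths A = 1, B = 2, C = 4, D = 5 and an internal branch of 3, so the recovered values can be compared with the known ones.

```python
from itertools import combinations


def nj_step(names, M):
    """Perform one neighbour-joining iteration (needs at least 3 nodes).

    M maps frozenset({a, b}) to the distance between nodes a and b.
    Returns (new cluster, branch length to i, branch length to j,
    distances from the new cluster to every remaining node).
    """
    n = len(names)
    # Step 1: (almost) average distance to the other nodes
    u = {i: sum(M[frozenset((i, k))] for k in names if k != i) / (n - 2)
         for i in names}
    # Step 2: pair that is close to each other and far from everything else
    i, j = min(combinations(names, 2),
               key=lambda p: M[frozenset(p)] - u[p[0]] - u[p[1]])
    d_ij = M[frozenset((i, j))]
    # Step 3: branch lengths from i and j to the new node (i, j)
    branch_i = 0.5 * (d_ij + u[i] - u[j])
    branch_j = 0.5 * (d_ij + u[j] - u[i])
    # Step 4: distances from the new cluster to all other nodes
    new_dist = {k: (M[frozenset((i, k))] + M[frozenset((j, k))] - d_ij) / 2
                for k in names if k not in (i, j)}
    return (i, j), branch_i, branch_j, new_dist


pairs = {("A", "B"): 3, ("A", "C"): 8, ("A", "D"): 9,
         ("B", "C"): 9, ("B", "D"): 10, ("C", "D"): 9}
M = {frozenset(p): v for p, v in pairs.items()}
cluster, len_a, len_b, new_dist = nj_step(["A", "B", "C", "D"], M)
print(cluster, len_a, len_b, new_dist)
```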

Distance matrices are derived in such a way that each mismatch between two sequences adds to the

distance, and each identity subtracts from the distance. Scoring matrices include values for all possible

substitutions.

Transition and transversion substitution models are general time reversible models.

MOLECULAR CLOCK HYPOTHESIS

For any given macromolecule (a protein or DNA sequence) the rate of evolution is approximately

constant over time in all evolutionary lineages (Zuckerkandl and Pauling 1965 in Wen-Hsiung Li 1997).

Converts measures of genetic distance between sequences into estimates of the time at which the lineages

diverged (Welch and Bromham 2005).

Relevant mutations

Species differ in their characteristics, also called characters. These characters may be observable and

measurable for all the individuals. These characters are considered as properties. For example,

mammals can be classified based on a particular morphological character, or by a molecular character

(gene duplication).

Any character can be used to classify species and reconstruct a phylogenetic tree. These characters are

mainly due to mutations. If a species depends on a character for its continued survival, that character will not

change, as any mutations of it will be eliminated, and such characters are considered essential. The differences

or similarities in essential characters are very relevant to the construction of the general shape of the

phylogenetic tree, but they can’t be used to determine the relative lengths of lines within the tree.

Irrelevant mutations

Changes in non-essential characters are effected by mutations which are referred to as irrelevant. The

rate of change of irrelevant mutations should be uniform among species that are closely related. E.g., for

amino acids there are 64 codons, so a mutation in the third position of a codon is mostly considered irrelevant. The DNA

can mutate at this site and the resulting protein doesn’t change.

Mutations as a measure of time

Let p(t) be the probability that the character has the same value at the beginning of a time interval of length t as

it does at the end, and let q(t) be the probability that the character has one value at the beginning of a time interval of

length t but a different, specific value at the end of the interval.

Suppose there are m different possible alternate values, and suppose that the mutation rate is r

mutations per unit time interval, then,






When t=0 p(0) = 1 (The character is unchanged)

q(0) = 0 (As there is no mutation in no time)

When t approaches infinity, p(t) and q(t) approach 1/m, which means that in the long run

each of the m alternative values is equally probable. Suppose there are n different characters, not just one.

Then the expected number of character differences E(t) is given by

E(t) = n(m-1) q(t)

Here's the graph of that function when there are m = 4 alternate values for each character, there are n =

40 characters, and the mutation rate is r = 0.1. Time t is shown on the horizontal axis, while the vertical axis

gives y, the expected number of character differences.

When t gets large, the expected number of character differences approaches 30.

E(t) = n(m-1)q(t) is thus the expected number of characters that are not the same at the end of a time

interval of length t as they were at the beginning.

Given an observed number y of character differences, the time is recovered from the inverse of the function

y = E(t), which is denoted by g.

The estimate for the time t in terms of the observed number of character differences, i.e., mutations, is

therefore t = g(y).
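
The behaviour described here can be reproduced numerically. The explicit formulas for p(t) and q(t) are not given in the text, so the sketch below assumes a simple symmetric model in which a character changes to each specific alternative value at rate r; under that assumption q(t) = (1 - e^(-mrt))/m and p(t) = (1 + (m-1)e^(-mrt))/m. The function names are hypothetical.

```python
import math


def q(t, m, r):
    """Probability of one specific different value after time t (assumed model)."""
    return (1 - math.exp(-m * r * t)) / m


def p(t, m, r):
    """Probability of the same value after time t (assumed model)."""
    return (1 + (m - 1) * math.exp(-m * r * t)) / m


def expected_differences(t, n, m, r):
    """E(t) = n (m - 1) q(t), the expected number of character differences."""
    return n * (m - 1) * q(t, m, r)


def estimate_time(y, n, m, r):
    """g(y): invert y = E(t) to estimate the elapsed time from y differences."""
    return -math.log(1 - m * y / (n * (m - 1))) / (m * r)


n, m, r = 40, 4, 0.1
for t in (0, 1, 5, 20, 100):
    print(t, round(expected_differences(t, n, m, r), 3))
print(estimate_time(expected_differences(5, n, m, r), n, m, r))
```

Under this assumed model the curve starts at 0 and levels off at n(m-1)/m = 30, so the estimate g(y) becomes unreliable once the observed number of differences is close to that plateau.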

Node represents a taxonomic unit. It can be either an existing species or an ancestor.

Branch defines the relationship between the taxa in terms of descent and ancestry.

Topology is the branching pattern of a phylogenetic tree.






Branch length is a very significant part of the phylogenetic tree. It represents the number of changes

that have occurred in a branch.

Distance scale is a scale chosen to represent the number of differences between organisms or

sequences.

Characters define the units upon which analysis is made. Eg. Amino acids, nucleotides etc.

Bootstrap is a method for assessing the statistical significance of a particular node on the phylogenetic

tree. It is arrived by randomly resampling subsets of the data.

Monophyletic refers to a group of taxa that have a single origin on a tree, i.e., all taxa descend from a

common ancestor.

Polyphyletic refers to a group of taxa that have multiple origins on a tree, i.e., they arose more than once in evolution.

TYPES OF PHYLOGENETIC TREE.

Types of trees

Rooted tree

Unrooted tree

Root

Root is the common ancestor of all taxa.

Rooted tree

The root is the common ancestor of all the taxonomic units of the tree. The direction and the length of

each branch correspond to the evolution and time taken for evolution. Such trees are used to study

evolutionary relationships.






Unrooted tree

Unrooted tree specifies the relationships among taxonomic units, but it does not show the evolution

path. The number of possible tree topologies increases as the number of operational taxonomic units (OTU)

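increases. In general, the numbers of possible topologies for a bifurcating rooted tree ($N_R$) and an unrooted tree ($N_U$) of n

taxa are given by

$N_R = \dfrac{(2n-3)!}{2^{n-2}(n-2)!}$ for $n \ge 2$, and $N_U = \dfrac{(2n-5)!}{2^{n-3}(n-3)!}$ for $n \ge 3$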
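
How quickly the number of topologies grows can be seen with a short calculation. The sketch uses the standard counts for bifurcating trees, (2n-3)!! rooted and (2n-5)!! unrooted topologies, written as products of odd numbers; the function names are hypothetical.

```python
import math


def rooted_topologies(n):
    """Number of bifurcating rooted trees on n taxa: 1 * 3 * ... * (2n - 3)."""
    return math.prod(range(1, 2 * n - 2, 2))


def unrooted_topologies(n):
    """Number of bifurcating unrooted trees on n taxa: 1 * 3 * ... * (2n - 5)."""
    return math.prod(range(1, 2 * n - 4, 2))


for n in (3, 4, 5, 10):
    print(n, rooted_topologies(n), unrooted_topologies(n))
```

Already for 10 taxa there are more than two million unrooted and more than 34 million rooted topologies, which is why exhaustive tree searches are feasible only for small data sets.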

Ways of representing rooted trees

There are three ways of representing rooted trees:

1. Cladogram

2. Phylogram

3. Ultrametric trees

Cladogram

It is a branching diagram representing the most parsimonious distribution of derived characters within a set of

taxa. The branching pattern of a cladogram is intended to show the relative relationship among taxa. It is not a

true evolutionary tree of how those relationships came to be. Branch lengths are equal in a cladogram.

Phylogram

It is a phylogenetic tree that indicates the relationships among the taxa. It also shows evolution and the rate of

evolution. Branch lengths are proportional to distance.






Ultrametric trees

It is a rooted tree where each internal node is labelled with a number. Each internal node has at least

two offspring and the labels decrease along the path from the root to any leaf. The branch lengths of an

ultrametric tree are proportional to the divergence time.

MAXIMUM PARSIMONY

Parsimony

It is an algorithm that works by finding the tree which can explain the observed sequences with

minimal number of substitutions. The various parsimony methods are

Camin-Sokal parsimony

Dollo parsimony

Wagner parsimony

Polymorphism parsimony

Maximum parsimony

Maximum parsimony uses character state data. The principle of maximum parsimony is to search for

a tree that requires the smallest number of evolutionary changes to explain the differences among the OTUs

under study. Such a tree is called maximum parsimonious tree, i.e., each tree is a hypothetical model of

evolutionary events and the most parsimonious tree is therefore a tree which requires fewer mutational steps

to visit each node.
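
Counting the changes that a given tree requires can be sketched with the Fitch small-parsimony procedure for a single character. Fitch's procedure is a standard technique that is not named in the text above; the tree encoding (nested tuples of taxon names), the function name fitch and the toy character are assumptions made for this example.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum number of changes).

    tree   : a taxon name (leaf) or a pair (left subtree, right subtree)
    states : maps each taxon name to its observed character state
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch(left, states)
    right_states, right_changes = fitch(right, states)
    shared = left_states & right_states
    if shared:
        # the two subtrees can agree on a state: no extra change needed
        return shared, left_changes + right_changes
    # otherwise one substitution is required at this node
    return left_states | right_states, left_changes + right_changes + 1


observed = {"human": "A", "chimp": "A", "mouse": "G", "rat": "G"}
tree_1 = (("human", "chimp"), ("mouse", "rat"))
tree_2 = (("human", "mouse"), ("chimp", "rat"))
print(fitch(tree_1, observed)[1], fitch(tree_2, observed)[1])
```

For this character the first tree needs one substitution and the second needs two, so the first is the more parsimonious; a full analysis sums these counts over all alignment positions and compares many candidate trees.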

Advantages

Reconstructs ancestral nodes, using all evolutionary data

Provides numerous most parsimonious tree

Disadvantages

Branch length cannot be determined, only topology

Slower than matrix methods

Sensitive to order in which sequences are added to tree

Provides numerous most parsimonious trees

There are various other parsimony methods such as

Camin-Sokal parsimony

Dollo parsimony

Wagner parsimony

Polymorphism parsimony

Felsenstein

Camin-Sokal parsimony


The main assumption of Camin-Sokal parsimony is that the ancestral state is defined as 0 i.e., at one

time an allele does not exist, and at a later time it does exist. Camin-sokal further assumes that loss of the

allele does not occur. It is therefore probably more appropriate for use with morphologic traits rather than

molecular marker data.

Dollo parsimony

Dollo parsimony assumes 0 as an ancestral state. It assumes that 1→0 is much more likely than 0→1, but that both are rare

over the evolutionary time scale being studied.

Wagner parsimony

Wagner parsimony assumes that ancestral states are unknown, and that roughly equal rates of

substitutions occur in either direction. This assumption is probably not valid for most molecular marker

methods. With marker data, both Wagner and Dollo parsimony trees should be compared. Branches

that are seen in both trees are likely to be robust, because they appear regardless of which set of assumptions

is used.

Felsenstein

Felsenstein model is the one in which polymorphism can be retained in the population, thus

effectively allowing what looks like 0→1 if the 1 allele becomes fixed at some later time. In this model,

mutation from 1→0 is more likely than loss of an allele and the probability of 0→1 is essentially negligible.

Maximum likelihood

Maximum likelihood is a well established statistical method. The first application of this method to tree

construction was made by Cavalli-Sforza and Edwards using gene frequency data. Later, Felsenstein

developed maximum likelihood for amino acid and nucleic acid sequence data.

In their simplest form, they begin by listing all possible models, and then calculating the probability

that each model would generate the data actually observed. The model with the highest probability of

generating the observed data is chosen as the best model.

Advantages

Reconstructs ancestral nodes, using all data

Generates branch lengths

Generates statistical estimate of significance of each branch

Disadvantages

Very slow

For a small number of sequences, the time required increases roughly with the fourth power of the number

of sequences.






UNIT IV

HIDDEN MARKOV MODELS

A first-order discrete HMM is a stochastic (randomly determined) generative model for time series

defined by a finite set S of states, a discrete alphabet A of symbols, a transition probability matrix T = (t_ij)

and an emission probability matrix E = (e_iX).

The system randomly evolves from state to state while emitting symbols from the alphabet. When the

system is in state i, it has a probability t_ij of moving from state i to state j and a probability e_iX of emitting

symbol X. Thus an HMM can be visualized by imagining that two different dice are associated with each state:

An emission die

A transition die

The essential first-order Markov assumption is that the emission and transition depend on the current

state only and not on the past. Only the symbols emitted by the system are observable, not the underlying

random walk between states; hence “hidden”. The random walks can be viewed as hidden or latent variables

underlying the observation.

The directed graph associated with the nonzero t_ij connections is called the architecture of the HMM. There

are a start state and an end state. At time 0, the system is always in the start state. Alternatively, one can use a

distribution over all states at time 0. The transition and emission probabilities are the parameters of the model.

HMM algorithms

HMM algorithms are recursive and can be viewed as forms of dynamic programming or as

propagation algorithms in the directed graph associated with HMM. These algorithms are essential building

blocks for learning algorithms.

Forward algorithm

In addition to determining the path π with the highest probability, it is also necessary to

determine the probability of a sequence under a given HMM. This could be done by summing the probability over

all the possible paths, but the number of possible paths π increases exponentially with the length of the

sequence, so enumerating them is impractical. Using only the most probable path as an approximation assumes

that it is the only path with significant probability. Instead, the full probability can be calculated by steps

similar to those of the Viterbi algorithm, with the maximization steps replaced by sums. This is called the

forward algorithm.

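The quantity in the forward algorithm that corresponds to V_k(i) in the Viterbi algorithm is

f_k(i) = P(x_1 \ldots x_i, \pi_i = k)

which is the probability of the observed sequence up to and including x_i, with the requirement that π_i = k.

Writing a_{kl} for the probability of a transition from state k to state l, and e_l(x) for the probability that state l

emits symbol x, the recursion equation is

f_l(i+1) = e_l(x_{i+1}) \sum_k f_k(i) \, a_{kl}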
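The forward recursion can be sketched in Perl as follows. The two-state model (F for a fair coin, L for a loaded coin), all of its probabilities, and the names %start, %trans and %emit are hypothetical, chosen only to keep the arithmetic easy to follow. The sketch also uses a start distribution over states and sums over the final states rather than modelling an explicit end state.

```perl
use strict;
use warnings;

# Hypothetical two-state HMM: F = fair coin, L = loaded coin (illustrative numbers).
my @states = ('F', 'L');
my %start  = (F => 0.5, L => 0.5);                       # distribution at time 0
my %trans  = (F => { F => 0.9, L => 0.1 },               # a_kl
              L => { F => 0.2, L => 0.8 });
my %emit   = (F => { H => 0.5, T => 0.5 },               # e_k(x)
              L => { H => 0.9, T => 0.1 });

# Probability of the whole observed sequence, summed over all paths.
sub forward {
    my @x = split //, shift;

    # Initialisation: f_k(1) = start_k * e_k(x_1)
    my %f = map { ($_ => $start{$_} * $emit{$_}{ $x[0] }) } @states;

    # Recursion: f_l(i+1) = e_l(x_{i+1}) * sum_k f_k(i) * a_kl
    for my $i (1 .. $#x) {
        my %next;
        for my $l (@states) {
            my $sum = 0;
            $sum += $f{$_} * $trans{$_}{$l} for @states;
            $next{$l} = $emit{$l}{ $x[$i] } * $sum;
        }
        %f = %next;
    }

    # Termination: add up the probabilities of ending in any state.
    my $total = 0;
    $total += $f{$_} for @states;
    return $total;
}

printf "P(HH) = %.4f\n", forward('HH');   # prints "P(HH) = 0.5040"
```

Only one hash of values is kept per position, so the work grows linearly with the sequence length instead of exponentially with the number of paths.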

Viterbi algorithm

The Viterbi algorithm is a dynamic programming algorithm used to find the most probable path through

the HMM; the path is calculated recursively. Suppose V_k(i), the probability of the most probable path ending in

state k with observation x_i, is known for all the states k. Then the corresponding probability for the next

observation x_{i+1} in state l is the emission probability e_l(x_{i+1}) multiplied by the maximum, over all the

previous states k, of V_k(i) a_{kl}:

V_l(i+1) = e_l(x_{i+1}) \max_k [ V_k(i) \, a_{kl} ]
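A minimal Perl sketch of this recursion, with a traceback to recover the path, is given below. The two-state model (F for a fair coin, L for a loaded coin), its probabilities, and the names %start, %trans and %emit are hypothetical; the sketch uses a start distribution and no explicit end state.

```perl
use strict;
use warnings;

# Hypothetical two-state HMM: F = fair coin, L = loaded coin (illustrative numbers).
my @states = ('F', 'L');
my %start  = (F => 0.5, L => 0.5);
my %trans  = (F => { F => 0.9, L => 0.1 },
              L => { F => 0.2, L => 0.8 });
my %emit   = (F => { H => 0.5, T => 0.5 },
              L => { H => 0.9, T => 0.1 });

# Returns the most probable state path and its probability.
sub viterbi {
    my @x = split //, shift;

    # Initialisation: V_k(1) = start_k * e_k(x_1)
    my %v = map { ($_ => $start{$_} * $emit{$_}{ $x[0] }) } @states;
    my @back;   # $back[$i]{$l} = best previous state when state $l emits x_i

    # Recursion: V_l(i+1) = e_l(x_{i+1}) * max_k [ V_k(i) * a_kl ]
    for my $i (1 .. $#x) {
        my %next;
        for my $l (@states) {
            my ($best_k, $best_p) = (undef, -1);
            for my $k (@states) {
                my $p = $v{$k} * $trans{$k}{$l};
                ($best_k, $best_p) = ($k, $p) if $p > $best_p;
            }
            $next{$l}     = $emit{$l}{ $x[$i] } * $best_p;
            $back[$i]{$l} = $best_k;
        }
        %v = %next;
    }

    # Traceback from the most probable final state.
    my ($last) = sort { $v{$b} <=> $v{$a} } @states;
    my @path = ($last);
    for (my $i = $#x; $i >= 1; $i--) {
        unshift @path, $back[$i]{ $path[0] };
    }
    return (join('', @path), $v{$last});
}

my ($path, $prob) = viterbi('HH');
printf "%s %.4f\n", $path, $prob;   # prints "LL 0.3240"
```

The only difference from the forward recursion is that the sum over previous states is replaced by a maximum, plus the bookkeeping needed to recover the path.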

HMM programs

HMMER

SAM

HMMPro

ProDom

PROSITE

InterPro

NEURAL NETWORKS

It is claimed that the human central nervous system comprises about 1.3×10^10 neurons and that

about 1×10^10 of them are located in the brain. At any time, some of these neurons are firing and the power

dissipation due to this electrical activity is estimated to be on the order of 10 watts.

Monitoring the activity in the brain has shown that, even when asleep, 5×10^7 nerve impulses per second are

being relayed back and forth between the brain and other parts of the body. This rate is increased significantly

when awake.






A neuron has a roughly spherical cell body called the soma (Figure below). The signals generated in the soma

are transmitted to other neurons through an extension of the cell body called the axon or nerve fibre. Another

kind of extension around the cell body, resembling a bushy tree, is the dendrites, which are responsible for receiving

the incoming signals generated by other neurons.

An axon, whose length in the human body varies from a fraction of a millimeter to a meter, extends

from the cell body at a point called the axon hillock. At the other end, the axon separates into several branches,

at the very end of which the axon enlarges and forms terminal buttons. Terminal buttons are placed in special

structures called synapses, which are the junctions transmitting signals from one neuron to another (Figure

1.3). A neuron typically drives 10^3 to 10^4 synaptic junctions.

Basics of neural networks Neurodes

The only genuine neural networks in existence occur in the brains of all animals, including humans.

Artificial neural networks seek to mimic certain features of neurophysiology. The artificial neurons are often

called 'neural nodes' or 'neurodes'.

Neurodes are connected to each other by 'links', each of which has an associated weight. The total

input to a neurode is the sum of all the weighted inputs to that node. This input is measured against a threshold

function or activation level, and if the input exceeds the threshold then the neurode 'fires'. When the neurode

fires it passes a full strength signal, viz a 1, to the output.

A diagram showing this process is given in figure below
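The firing rule described above (sum the weighted inputs, compare with a threshold, output 1 if it is exceeded) can be sketched in a few lines of Perl. The function name neurode_output, the weights and the threshold are hypothetical values chosen for illustration.

```perl
use strict;
use warnings;

# A single neurode: fires (returns 1) when the weighted sum of its
# inputs exceeds the threshold, otherwise stays silent (returns 0).
sub neurode_output {
    my ($inputs, $weights, $threshold) = @_;
    my $total = 0;
    $total += $inputs->[$_] * $weights->[$_] for 0 .. $#{$inputs};
    return $total > $threshold ? 1 : 0;
}

my @weights   = (0.5, 0.9, 0.4);   # illustrative link weights
my $threshold = 0.8;               # illustrative activation level

print neurode_output([1, 0, 1], \@weights, $threshold), "\n";   # 0.5 + 0.4 = 0.9 > 0.8, prints 1
print neurode_output([0, 0, 1], \@weights, $threshold), "\n";   # 0.4 <= 0.8, prints 0
```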






Network layers

The commonest type of artificial neural network consists of three groups, or layers, of units: a layer of

"input" units is connected to a layer of "hidden" units, which is connected to a layer of "output" units. (Refer

the simple neural layer)

The activity of the input units represents the raw information that is fed into the network.

The activity of each hidden unit is determined by the activities of the input units and the weights on the

connections between the input and the hidden units.

MICROARRAY TECHNOLOGY

Overview of steps in a Microarray

There are four laboratory steps in using a microarray to measure gene expression in a sample;

Sample preparation and labeling

Hybridization

Washing

Image acquisition






STEPS IN USING A MICROARRAY

Sample preparation

There are a number of different ways in which a sample can be prepared and labeled for microarray

experiments. In all the cases, the first step is to extract the RNA from the tissue. The labeling step depends on

the technology used. For the Affymetrix platforms, one constructs a biotin-labeled

complementary RNA for hybridizing to the GeneChip. In the past, the DNA has been radioactively labeled,

but now most of the laboratories use fluorescent labeling.

In most experiments, two samples are hybridized to the arrays, one with each dye; allowing the

simultaneous measurement of both samples. In future, it is possible that more than two labeled samples could

be used.

There are three common ways to make labeled cDNA.

Direct incorporation by reverse transcriptase.

Indirect labeling

Random primed labeling using Klenow fragment of DNA polymerase I.

Hybridization

Hybridization is the step in which the DNA probes on the glass and the labeled DNA (or RNA) target

form heteroduplexes via Watson-Crick base pairing. Hybridization is a complex process and it is affected by

many parameters such as;

Temperature

Humidity

Salt concentration

Formamide concentration

Volume of target solution and operator

Hybridization involves two main steps;

1. Labeling the target with fluorescent dye and mixing

2. Hybridization with the probes

There are two main methods to carry out hybridization such as;






Manual

Robotic

Manual hybridization

In manual hybridization, the array is placed in a hybridization chamber. The scientist injects the

hybridization solution containing the target onto the array under a cover slip before sealing the chamber. The

chamber is placed in an incubator which keeps the array at the correct temperature for the 12 to 24 hours that

the hybridization process takes.

Robotic hybridization

In robotic hybridization, the process is performed by a hybridization station.

Robotic hybridization has the advantage over manual in that it provides much better control of the

temperature of the target and slide. The consistent use of a hybridization station also reduces the variability

between hybridizations and operators.

Addition of the formamide enables mixing of hybridization solution of the target over the array but

has the effect of decreasing the melting temperature of duplexes. It is also usual to include Na+ in the

hybridization solution; the lower the Na+ concentration, the greater the stringency of hybridization. It is common to add DNA

such as repetitive DNA, poly-A or poly-T to the hybridization solution that blocks unwanted cross-

hybridizations.

Washing

After hybridization, the slides are washed. There are two reasons for washing such as;

Remove excess hybridization solution from the array

Increase the stringency of the experiment by reducing cross hybridization.

Image acquisition (Image generation) Raw data from microarray

Microarray data comprise images from hybridized arrays representing hybridization signal intensities

for individual spots. These may be generated by single fluorescent, dual fluorescent, radioactive or

colorimetric labels and the recording methods differ in each case

Microarray image

► Scanner output consists of two TIFF images, one for each of red and green channels.






► Shown here is false-coloured image with two channels overlaid:

Red spot: More highly expressed in mutant

Green spot: More highly expressed in normals

Yellow spot: Equally expressed
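Under the usual convention for overlaid two-channel images, a spot looks red where the red channel dominates, green where the green channel dominates, and yellow where the two are similar. The sketch below classifies a spot from its two intensities using the log2 ratio. The function name spot_colour, the intensities and the two-fold cut-off are assumptions made for illustration, not values from any particular scanner or analysis package.

```perl
use strict;
use warnings;

# Classify a spot from its red- and green-channel intensities.
# The two-fold cut-off (|log2 ratio| >= 1) is an assumed, illustrative threshold.
sub spot_colour {
    my ($red, $green) = @_;
    my $log_ratio = log($red / $green) / log(2);   # log base 2 of red/green
    return 'red'   if $log_ratio >=  1;
    return 'green' if $log_ratio <= -1;
    return 'yellow';
}

print spot_colour(800, 200), "\n";   # red channel four times stronger: prints "red"
print spot_colour(200, 800), "\n";   # green channel four times stronger: prints "green"
print spot_colour(500, 450), "\n";   # similar intensities: prints "yellow"
```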

Serial Analysis of Gene Expression (SAGE)

Serial analysis of gene expression, or SAGE, is an experimental technique designed to gain a direct and

quantitative measure of gene expression. The SAGE method is based on the isolation of unique sequence tags

(9-10 bp in length) from individual mRNAs and concatenation of tags serially into long DNA molecules for

lump-sum sequencing. The SAGE method can be applied to studies exploring virtually any kind of

biological phenomenon in which changes in cellular transcription are responsible. SAGE is a highly

competent technology that can not only give a global gene expression profile of a particular type of cell or

tissue, but also help us identify a set of genes specific to given cellular conditions by comparing the profiles

constructed for a pair of cells that are kept under different conditions.
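Once the tags have been read from the sequenced concatemers, building an expression profile is essentially a counting exercise. The sketch below assumes the tags have already been extracted; the 10 bp tag sequences are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical 10 bp tags, already extracted from sequenced concatemers.
my @tags = qw(
    ACGTACGTAC TTGACCGTAA ACGTACGTAC
    GGCATTCAGG ACGTACGTAC TTGACCGTAA
);

# Count how often each tag occurs; the count estimates transcript abundance.
my %count;
$count{$_}++ for @tags;

# Report the tags from most to least frequent (ties broken alphabetically).
for my $tag (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    printf "%s\t%d\n", $tag, $count{$tag};
}
```

Comparing two such count tables, one per cellular condition, is what identifies the condition-specific genes mentioned above.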






SHOTGUN SEQUENCING (HIERARCHICAL SHOT GUN SEQUENCING)

The genome is fragmented using restriction enzymes and the fragmented DNA is cloned into a

DNA vector, usually a bacterial artificial chromosome (BAC), and amplified in Escherichia coli. The

amplified DNA can then be purified from the bacterial cells. These short DNA fragments purified from

individual bacterial colonies are then individually and completely sequenced and assembled electronically

into one long, contiguous sequence by identifying 100%-identical overlapping sequences between them. This

method does not require any pre-existing information about the sequence of the DNA and is often referred to

as de novo sequencing. Gaps in the assembled sequence may be filled by primer walking, often with sub-

cloning steps (or transposon-based sequencing, depending on the size of the remaining region to be sequenced).

These strategies all involve taking many small reads of the DNA by one of the above methods and

subsequently assembling them into a contiguous sequence.
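The core of electronic assembly, joining two reads through an identical overlap, can be sketched as follows. The function name merge_reads, the reads and the minimum overlap length are hypothetical. Real assemblers also handle sequencing errors, reverse complements and repeats, none of which this sketch attempts.

```perl
use strict;
use warnings;

# Join two reads if the end of $left is identical to the start of $right.
# Tries the longest possible overlap first; returns undef if none reaches $min.
sub merge_reads {
    my ($left, $right, $min) = @_;
    my $max = length($left) < length($right) ? length($left) : length($right);
    for (my $len = $max; $len >= $min; $len--) {
        if (substr($left, -$len) eq substr($right, 0, $len)) {
            return $left . substr($right, $len);
        }
    }
    return;   # no sufficiently long overlap
}

# The reads share the 4-base overlap GTAC.
print merge_reads('ATGCGTAC', 'GTACCTTA', 3), "\n";   # prints "ATGCGTACCTTA"
```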


A single DNA sequencing reaction yields up to 600-800 bases of DNA sequence. Both the forward

and reverse strands can be sequenced separately using specific primers. In the case of a 3 kb DNA, the DNA has to be

sequenced as stretches of 600-800 bases in five sequencing reactions. The resulting sequences have to be

contig-assembled as shown in the picture above using software, e.g. BIOEDIT, DNASIS, DNASTAR. Before

contig-assembly, vector sequence present in the sequence should be removed (usually done using molecular

biology software, e.g. DNASIS, DNASTAR).

The shotgun method is the most practical for sequencing large genomes, but its assembly process is

complex and potentially error-prone - particularly in the presence of sequence repeats.

2D GEL ELECTROPHORESIS

Two-dimensional electrophoresis (2-D electrophoresis) is a powerful and widely used method for the

analysis of complex protein mixtures extracted from cells, tissues, or other biological samples. This technique

sorts proteins according to two independent properties in two discrete steps: the first-dimension step,

isoelectric focusing (IEF), separates proteins according to their isoelectric points (pI); the second-dimension

step, SDS-polyacrylamide gel electrophoresis (SDS-PAGE), separates proteins according to their molecular

weights (Mr, relative molecular weight). Each spot on the resulting two-dimensional array corresponds to a

single protein species in the sample. Thousands of different proteins can thus be separated, and information

such as the protein pI, the apparent molecular weight, and the amount of each protein is obtained.


Steps involved in 2D Gel Electrophoresis

2DE Image processing

Modern 2-DE research in quantitative proteomics often uses software-based image analysis tools. These tools primarily

analyze biomarkers by quantifying individual proteins, and showing the separation between one or more

protein "spots" on a scanned image of a 2-DE product. Additionally, these tools match spots

between gels of similar samples to show, for example, proteomic differences between early and advanced

stages of an illness.

Software packages include Delta2D, PDQuest and Progenesis. Challenges for automatic software-based

analysis include:

incompletely separated (overlapping) spots (less-defined and/or separated)

weak spots / noise (e.g., "ghost spots")

running differences between gels (e.g., protein migrates to different positions on different gels)

differences in software algorithms and therefore analysis tendencies


Modern software packages include advanced features, such as image warping, to try to compensate for

running differences between gels, and the generation of picking lists for the automated in-gel digestion of

protein spots, leading to the identification of the proteins by mass spectrometry. However, as noted above, 2-

DE automated image analysis technology has not been perfected - a fact which keeps manual visual analysis

as the "gold standard" for validation.

Advantages and disadvantages of 2D-PAGE.

The main advantage of using 2D-PAGE is the large mass range and the sheer amount of proteins that

can be analysed at any one time. 2D-PAGE is particularly good for looking at proteins within the mass range

of 20 - 250 kDa and pI of 3 - 11. Proteins of a particular pI and mass can be focused on by using narrow-range

pH isoelectric focusing strips and the percentage of acrylamide used in the 2nd dimension PAGE gel,

respectively. If sufficient sample is present on the gel (>300 µg of total protein) then proteins can be excised

from the gel, subjected to in-gel proteolysis and analysed by mass spectrometry.


UNIT V

PERL

Perl is a high-level, general-purpose, interpreted, dynamic programming language. Perl was originally

developed by Larry Wall in 1987 as a general-purpose Unix scripting language to make report processing

easier. Since then, it has undergone many changes and revisions and become widely popular amongst

programmers. Larry Wall continues to oversee development of the core language. Perl borrows features from

other programming languages including C, shell scripting (sh), AWK, and sed. The language provides powerful

text processing facilities without the arbitrary data length limits of many contemporary Unix tools, facilitating

easy manipulation of text files. Perl gained widespread popularity in the late 1990s as a CGI scripting language,

in part due to its parsing abilities.

CREATING THE PROGRAM

A Perl program consists of an ordinary text file containing a series of Perl statements. Statements are written in

what looks like an amalgam of C, UNIX shell script, and English. In fact, that's pretty much what it is.

Perl code can be quite free-flowing. The broad syntactic rules governing where a statement starts and ends are:

• Leading spaces on a line are ignored. You can start a Perl statement anywhere you want: at the

beginning of the line, indented for clarity (recommended) or even right-justified (definitely frowned on because

the code would be difficult to understand) if you like.

• Statements are terminated with a semicolon.

• Spaces, tabs, and blank lines outside of strings are irrelevant; one space is as good as a hundred. That

means you can split statements over several lines for clarity. A string is basically a series of characters enclosed

in quotes.

• Here's our first uninspired Perl statement, hello1.pl:

print("Hello World\n");

NUMERIC AND STRING LITERALS

All computer programs use data in some way. Some use it to personalize the program. For example, a mail

program might need to remember your name so that it can greet you upon starting. Another program, say one

that searches your hard disk for files, might remember your last search parameters in case you want to perform

the same search twice. Perl uses four types of literals. Here is a quick glimpse at them:






• Numbers - This is the most basic data type.

• Strings - A string is a series of characters that are handled as one unit.

• Arrays - An array is a series of numbers and strings handled as a unit. You can also think of an array as a

list.

• Associative Arrays - This is the most complicated data type. Think of it as a list in which every value

has an associated lookup item.

NUMERIC LITERALS

Numeric literals are frequently used. They represent a number that your program will need to work with. Most

of the time you will use numbers in base ten-the base that everyone uses. However, Perl will also let you use

base 8 (octal) or base 16 (hexadecimal).
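A short sketch of how the three bases are written in Perl source: a leading 0 marks an octal literal and a leading 0x marks a hexadecimal literal. The variable names are illustrative. The built-in oct() and hex() functions convert strings in those bases at run time.

```perl
use strict;
use warnings;

my $decimal = 15;      # base ten
my $octal   = 015;     # leading 0: octal, (1 * 8) + 5 = 13
my $hex     = 0x15;    # leading 0x: hexadecimal, (1 * 16) + 5 = 21
my $bd      = 0xBD;    # (11 * 16) + 13 = 189

# Numbers are always printed in base ten unless you format them otherwise.
print "$decimal $octal $hex $bd\n";         # prints "15 13 21 189"

# Strings holding octal or hexadecimal digits are converted with oct() and hex().
print oct('15'), " ", hex('BD'), "\n";      # prints "13 189"
```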

Note: For those of you who are not familiar with non-decimal numbering systems, here is a short explanation.

In decimal notation (base ten), when you see the value 15 it signifies (1 * 10) + 5, or 15 in base ten.

In octal notation (base eight), when you see the value 15 it signifies (1 * 8) + 5, or 13 in base ten.

In hexadecimal notation (base 16), when you see the value 15 it signifies (1 * 16) + 5, or 21 in base ten. Base 16 needs

an extra six characters in addition to 0 to 9 so that each position can have a total of 16 values. The letters A-F

are used to represent 10-15. So the hexadecimal value BD is equal to (B * 16) + D, or (11 * 16) + 13 in base ten, which is

189.

Perl is the Swiss Army chainsaw of programming languages: powerful and adaptable. It was first

developed by Larry Wall, a linguist working as a systems administrator for NASA in the late 1980s, as a way to

make report processing easier. Since then, it has moved into several other areas: automating system

administration, acting as glue between different computer systems, web programming, bioinformatics, data

munging, and even application development.

Why did Perl become so popular when the Web came along? Two reasons: First, most of what is being done on

the Web happens with text, and is best done with a language that's designed for text processing. More

importantly, Perl was appreciably better than the alternatives at the time when people needed something to use.

C is complex and can produce security problems (especially with untrusted data), Tcl can be awkward, and

Python didn't really have a foothold.






It also didn't hurt that Perl is a friendly language. It plays well with your personal programming style. The Perl

slogan is "There's more than one way to do it," and that lends itself well to large and small problems alike. Even

more so, Perl is very portable and widespread -- it's available pre-installed almost everywhere -- and of course

there are thousands of freely-distributable libraries available from the CPAN.

In this first part of our series, you'll learn a few basics about Perl and see a small sample program.

A Word About Operating Systems

This series assumes that you're using a Unix or Unix-like operating system (Mac OS X and Cygwin qualify)

and that you have the perl binary available at /usr/bin/perl. It's OK if you're running Windows through

ActivePerl or Strawberry Perl; most Perl code is platform-independent.

Functions and Statements

Perl has a rich library of built-in functions. They're the verbs of Perl, the commands that the interpreter runs.

You can see a list of all the built-in functions in the perlfunc man page (perldoc perlfunc, from the command

line). Almost all functions can take a list of comma-separated parameters.

The print function is one of the most frequently used parts of Perl. You use it to display things on the screen or

to send information to a file. It takes a list of things to output as its parameters.

print "This is a single statement.";

print "Look, ", "a ", "list!";

A Perl program consists of statements, each of which ends with a semicolon. Statements don't need to be on

separate lines; there may be multiple statements on one line. You can also split a single statement across

multiple lines.

print "This is "; print "two statements.\n";

print "But this ", "is only one statement.\n";

Wait a minute though. What's the difference between say and print? What's this \n in the print statements?

The say function behaves just like the print function, except that it appends a newline at the end of its

arguments. It prints all of its arguments, and then a newline character. Always. No exceptions. print, on the


other hand, only prints what you see explicitly in these examples. If you want a newline, you have to add it

yourself with the special character escape sequence \n.

use feature ':5.10';

say "This is a single statement.";

say "Look, ", "a ", "list!";

Why do both exist? Why would you use one over the other? Usually, most "display something" statements need

the newline. It's common enough that say is a good default choice. Occasionally you need a little bit more

control over your output, so print is the option.

Note that say is two characters shorter than print. This is an important design principle for Perl -- common

things should be easy and simple.

Table: Perl Escape Sequences

Escape Sequence    Description or Character

\b                 Backspace
\e                 Escape
\f                 Form Feed
\n                 Newline
\r                 Carriage Return
\t                 Tab
\x0B               Vertical Tab (Perl strings have no \v escape; \013 also works)
\$                 Dollar Sign
\@                 At Sign
\0nnn              Any octal byte
\xnn               Any hexadecimal byte
\cn                Any control character
\l                 Change the next character to lowercase
\u                 Change the next character to uppercase
\L                 Change the following characters to lowercase until a \E sequence is encountered.
                   Note that you need to use an uppercase E here; lowercase will not work.
\Q                 Quote meta-characters as literals.
\U                 Change the following characters to uppercase until a \E sequence is encountered.
                   Note that you need to use an uppercase E here; lowercase will not work.
\E                 Terminate the \L, \Q, or \U sequence.
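A few of these escape sequences in action. The variable names and strings are invented for illustration. Escape sequences are only interpreted inside double-quoted strings.

```perl
use strict;
use warnings;

my $name = 'perl';

my $capitalised = "\u$name";             # \u uppercases the next character: "Perl"
my $shouted     = "\U$name\E rocks";     # \U ... \E uppercases a span: "PERL rocks"
my $whispered   = "\LSHOUT\E";           # \L ... \E lowercases a span: "shout"
my $email       = "user\@example.com";   # \@ gives a literal at sign
my $price       = "Cost: \$5";           # \$ gives a literal dollar sign
my $quoted      = "\Qa.b\E";             # \Q backslash-escapes meta-characters: a\.b
my $columns     = "a\tb";                # \t is a single tab character

print "$capitalised\n$shouted\n$whispered\n$email\n$price\n$quoted\n$columns\n";
```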

Variables

If functions are Perl's verbs, then variables are its nouns. Perl has three types of variables: scalars, arrays, and

hashes. Think of them as things, lists, and dictionaries respectively. In Perl, all variable names consist of a

punctuation character, a letter or underscore, and zero or more alphanumeric characters or underscores.

Scalars are single things. This might be a number or a string. The name of a scalar begins with a dollar sign,

such as $i or $abacus. Assign a value to a scalar by telling Perl what it equals:

my $i = 5;

my $pie_flavor = 'apple';

my $constitution1776 = "We the People, etc.";

You don't need to specify whether a scalar is a number or a string. It doesn't matter, because when Perl needs to

treat a scalar as a string, it does; when it needs to treat it as a number, it does. The conversion happens

automatically. (This is different from many other languages, where strings and numbers are two separate data

types.)






If you use a double-quoted string, Perl will insert the value of any scalar variables you name in the string. This

is often useful to fill in strings on the fly:


my $apple_count = 5;

my $count_report = "There are $apple_count apples.";

say "The report is: $count_report";

The final output from this code is The report is: There are 5 apples..

You can manipulate numbers in Perl with the usual mathematical operations: addition, multiplication, division,

and subtraction. (The multiplication and division operators in Perl use the * and / symbols, by the way.)

my $a = 5;

my $b = $a + 10; # $b is now equal to 15.

my $c = $b * 10; # $c is now equal to 150.

$a = $a - 1; # $a is now 4, and algebra teachers are cringing.

That's all well and good, but what's this strange my, and why does it appear with some assignments and not

others? The my operator tells Perl that you're declaring a new variable. That is, you promise Perl that you

deliberately want to use a scalar, array, or hash of a specific name in your program. This is important for two

reasons. First, it helps Perl help you protect against typos; it's embarrassing to discover that you've accidentally

mistyped a variable name and spent an hour looking for a bug. Second, it helps you write larger programs,

where variables used in one part of the code don't accidentally affect variables used elsewhere.

You can also use special operators like ++, --, +=, -=, /= and *=. These manipulate a scalar's value without

needing two elements in an equation. Some people like them, some don't. I like the fact that they can make code

clearer.

my $a = 5;

$a++; # $a is now 6; we added 1 to it.

$a += 10; # Now it's 16; we added 10.

$a /= 2; # And divided it by 2, so it's 8.






Strings in Perl don't have quite as much flexibility. About the only basic operator that you can use on strings is

concatenation, which is a ten dollar way of saying "put together." The concatenation operator is the period.

Concatenation and addition are two different things:

my $a = "8"; # Note the quotes. $a is a string.

my $b = $a + "1"; # "1" is a string too.

my $c = $a . "1"; # But $b and $c have different values!

Remember that Perl converts strings to numbers transparently whenever necessary, so to get the value of $b, the

Perl interpreter converted the two strings "8" and "1" to numbers, then added them. The value of $b is the

number 9. However, $c used concatenation, so its value is the string "81".

Remember, the plus sign adds numbers and the period puts strings together. If you add things that aren't

numbers, Perl will try its best to do what you've told it to do, and will convert those non-numbers to numbers

to the best of its ability.
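The sketch below shows what this conversion looks like in practice; the variable names and strings are illustrative. Perl uses any leading number in the string and treats a string with no leading number as 0. With warnings enabled it would also complain that the argument "isn't numeric", so the sketch switches that one warning category off.

```perl
use strict;
use warnings;
no warnings 'numeric';   # silence "isn't numeric" warnings for this demonstration

my $sum     = "8" + "1";          # both strings converted to numbers: 9
my $joined  = "8" . "1";          # concatenation: "81"
my $partial = "3 apples" + 2;     # the leading 3 is used: 5
my $none    = "apples" + 2;       # no leading number, so 0 + 2: 2

print "$sum $joined $partial $none\n";   # prints "9 81 5 2"
```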

Arrays are lists of scalars. Array names begin with @. You define arrays by listing their contents in parentheses,

separated by commas:

my @lotto_numbers = (1, 2, 3, 4, 5, 6); # Hey, it could happen.

my @months = ("July", "August", "September");

You retrieve the contents of an array by an index, sort of like "Hey, give me the first month of the year."

Indexes in Perl start from zero. (Why not 1? Because. It's a computer thing.) To retrieve the elements of an

array, you replace the @ sign with a $ sign, and follow that with the index position of the element you want. (It

begins with a dollar sign because you're getting a scalar value.) You can also modify it in place, just like any

other scalar.



say $months[0]; # This prints "July".

$months[2] = "Smarch"; # We just renamed September!

If an array value doesn't exist, Perl will create it for you when you assign to it.

my @winter_months = ("December", "January");






$winter_months[2] = "February";

Arrays always return their contents in the same order; if you go through @months from beginning to end, no

matter how many times you do it, you'll get back July, August, and September in that order. If you want to find

the number of elements of an array, assign the array to a scalar.



my $month_count = @months;

say $month_count; # This prints 3.

my @autumn_months; # no elements

my $autumn_count = @autumn_months;

say $autumn_count; # this prints 0

Some programming languages call hashes "dictionaries". That's what they are: a term and a definition. More

precisely, they contain keys and values. Each key in a hash has one and only one corresponding value. The

name of a hash begins with a percentage sign, like %parents. You define hashes by comma-separated pairs of

key and value, like so:

my %days_in_month = ( "July" => 31, "August" => 31, "September" => 30 );

You can fetch any value from a hash by referring to $hashname{key}, or modify it in place just like any other

scalar.

say $days_in_month{September}; # 30, of course.

$days_in_month{February} = 29; # It's a leap year.

To see what keys are in a hash, use the keys function with the name of the hash. This returns a list containing all

of the keys in the hash. The list isn't always in the same order, though; while you can count on @months always

to return July, August, September in that order, keys %days_in_month might return them in any order

whatsoever.

my @month_list = keys %days_in_month;

# @month_list is now ('July', 'September', 'August', 'February')!
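When a predictable order matters, a common approach is to sort the keys before using them. This sketch repeats the %days_in_month hash from the examples above (with February added) so that it runs on its own; @sorted_months is an illustrative name.

```perl
use strict;
use warnings;

my %days_in_month = (
    July      => 31,
    August    => 31,
    September => 30,
    February  => 29,
);

# sort with no comparison block orders the keys as strings.
my @sorted_months = sort keys %days_in_month;

print "$_ has $days_in_month{$_} days.\n" for @sorted_months;
# Prints August, February, July, September, in that order, on every run.
```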






The three types of variables have three separate namespaces. That means that $abacus and @abacus are two

different variables, and $abacus[0] (the first element of @abacus) is not the same as $abacus{0} (the value in

%abacus that has the key 0).

Comments

Some of the code samples from the previous section contained code comments. These are useful for explaining

what a particular piece of code does, and vital for any piece of code you plan to modify, enhance, fix, or just

look at again. (That is to say, comments are important.)

Anything in a line of Perl code that follows a # sign is a comment (unless that # sign appears in a string).


say "Hello world!"; # That's more like it.

# This entire line is a comment.

Loops

Almost every program ever written uses a loop of some kind. Loops allow you to run a particular piece of code

over and over again. This is part of a general concept in programming called flow control.

Perl has several different functions that are useful for flow control, the most basic of which is for. When you use

the for function, you specify a variable to use as the loop index, and a list of values to loop over. Inside a pair of

curly brackets, you put any code you want to run during the loop:


for my $i (1, 2, 3, 4, 5) {

say $i;

}

This loop prints the numbers 1 through 5, each on a separate line. (It's not very useful; you might think "Why

not just write say 1, 2, 3, 4, 5;?". The reason is that say adds only one newline, at the end of its list of arguments.)

A handy shortcut for defining loop values is the range operator .., which specifies a range of numbers. You can

write (1, 2, 3, 4, 5) as (1 .. 5) instead. You can also use arrays and scalars in your loop list. Try this code and see

what happens:







my @one_to_ten = (1 .. 10);

my $top_limit = 25;

for my $i (@one_to_ten, 15, 20 .. $top_limit) {

say $i;

}

Of course, again you could write say @one_to_ten, 15, 20 .. $top_limit;

The items in your loop list don't have to be numbers; you can use strings just as easily. If the hash %month_has

contains names of months and the number of days in each month, you can use the keys function to step through

them.


for my $i (keys %month_has) {

say "$i has $month_has{$i} days.";

}

for my $marx ('Groucho', 'Harpo', 'Zeppo', 'Karl') {

say "$marx is my favorite Marx brother.";

}

The Miracle of Compound Interest

You now know enough about Perl -- variables, print/say, and for() -- to write a small, useful program. Everyone

loves money, so the first sample program is a compound-interest calculator. It will print a (somewhat) nicely

formatted table showing the value of an investment over a number of years. (You can see the program at

http://www.perl.com/pub/2008/04/23/compound_interest.pl)

The single most complex line in the program is:

my $interest = int( ( $apr / 100 ) * $nest_egg * 100 ) / 100;






$apr / 100 is the interest rate, and ($apr / 100) * $nest_egg is the amount of interest earned in one year. This line

uses the int() function, which returns the integer value of a scalar (its value after stripping off any fractional

part). We use int() here because when you multiply, for example, 10925 by 9.25%, the result is 1010.5625,

which we must round off to 1010.56. To do this, we multiply by 100, yielding 101056.25, use int() to throw

away the leftover fraction, yielding 101056, and then divide by 100 again, so that the final result is 1010.56. Try

stepping through this statement yourself to see just how we end up with the correct result, rounded to cents.
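The stepping-through can also be done in code. The sketch below is not the original compound_interest.pl; the helper name yearly_interest and the three-year loop are assumptions made for illustration. Note that int() truncates toward zero rather than rounding to the nearest cent, so any fraction of a cent is simply dropped.

```perl
use strict;
use warnings;

# Hypothetical helper: one year's interest, cut to whole cents
# with the same int() trick as the line quoted above.
sub yearly_interest {
    my ($apr, $nest_egg) = @_;
    return int( ( $apr / 100 ) * $nest_egg * 100 ) / 100;
}

my $nest_egg = 10925;   # example balance from the text
my $apr      = 9.25;    # example rate from the text

# Print year, interest earned, and new balance.
# The three-year span is an arbitrary choice for this sketch.
for my $year (1 .. 3) {
    my $interest = yearly_interest($apr, $nest_egg);
    $nest_egg += $interest;
    printf "%d\t%.2f\t%.2f\n", $year, $interest, $nest_egg;
}
```

For the first year this gives an interest of 1010.56, matching the worked figures above.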

Play Around!

At this point you have some basic knowledge of Perl syntax and a few simple toys to play with. Try writing

some simple programs with them. Here are two suggestions, one simple and the other a little more complex:

A word frequency counter. How often does each word show up in an array of words? Print out a report.

(Hint: Use a hash to count the number of appearances of each word.)

Given a month and the day of the week that's the first of that month, print a calendar for the month.
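One possible shape for the first suggestion is sketched below, following the hint about using a hash. The helper name count_words and the sample word list are made up for the example; other designs would work just as well.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: count how often each word appears in a list.
sub count_words {
    my @words = @_;
    my %count;
    $count{$_}++ for @words;   # each key is a word, each value its tally
    return %count;
}

my @words = qw(the cat saw the dog and the bird);   # made-up sample data
my %count = count_words(@words);

# Report in alphabetical order so the output is repeatable.
for my $word (sort keys %count) {
    say "$word: $count{$word}";
}
```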

Shell script

A shell script is just a text file containing a sequence of shell commands.

Example:

$ cat testscript

#!/bin/sh

echo Here is a long listing of the current directory

ls -l

Scalar data

A single number or string, depending on context

References to scalars always begin with $

Variable names may contain letters, digits and underscores

Assignment is done using the = operator

Examples:

o $pi = 3.14159;

o $color = 'red';






o $old_color = "was $color before";

o $host = `hostname`; # command substitution

Strings

Sequences of characters

• No end of string character as in C

• Single-quoted (note: ', not `)

• Example

o $x = "dog"; print 'bob $x'; # displays bob $x
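To make the difference between the two quoting styles concrete, here is a brief sketch; the variable names $single and $double are chosen only for this illustration.

```perl
use strict;
use warnings;
use feature 'say';

my $x = "dog";

my $single = 'bob $x';   # single quotes: no interpolation
my $double = "bob $x";   # double quotes: $x is replaced by its value

say $single;   # bob $x
say $double;   # bob dog
```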

What are arrays or ordered lists?

Arrays are ordered lists of scalar data items, indexed by integers; array variable names start with @. There is a

separate namespace for scalar and array variables. Arrays are subscripted using square brackets, where

indexing begins at 0. The (scalar) variable $#arry is the highest assigned index of the array @arry. Arrays

need not be declared; they come into existence when used.
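A short sketch of indexing and of $#arry follows. It runs under use strict, which does require a my declaration, so the remark that arrays need not be declared applies only to code without strict; the array contents are made up.

```perl
use strict;
use warnings;
use feature 'say';

my @arry = ('a', 'b', 'c');

say $arry[0];       # a  (indexing begins at 0)
say $#arry;         # 2  (highest assigned index)
say scalar @arry;   # 3  (number of elements)

$arry[5] = 'f';     # assigning past the end extends the array
say $#arry;         # 5
```

The elements at indexes 3 and 4 exist after the last assignment but hold undef.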

Array functions

Example 1:

push/pop

– Add/remove an element to/from the end of an array

– Either a scalar or a list can be added

Code

push (@a, $b); # same as @a = (@a, $b);

@x = (1,2);

push (@a, @x); # same as @a = (@a, 1, 2);

$c = pop (@a); # returns and removes last element of @a

Example 2:

shift/unshift






– Add/remove element(s) at the beginning of a list

– Either a scalar or a list can be added

Code

unshift (@a, $b); # same as @a = ($b, @a);

$c = shift (@a); # returns and removes first element

# of @a

Associative arrays or Hashes?

An array indexed by arbitrary scalars (not necessarily integers)

o Index values are called keys

o Associative array variable names begin with a %

o Subscripted using curly braces {}

o Elements have no particular order
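The points above can be illustrated with a short sketch; the %age hash and its contents are made up for the example.

```perl
use strict;
use warnings;
use feature 'say';

my %age = ( alice => 30, bob => 25 );   # made-up data

$age{carol} = 41;                       # add an entry with curly-brace subscript
say $age{alice};                        # 30
say exists $age{dave} ? 'yes' : 'no';   # no

delete $age{bob};                       # remove an entry
say scalar keys %age;                   # 2
```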

String operators?

o Concatenation: .

"hello" . "world" # the same as "helloworld"

o Repetition: x

"fred" x 3 # same as "fredfredfred"

"Bob" x (1+1) # same as "BobBob"

(3+2) x 4 # same as "5" x 4 or "5555"

# (note auto-conversion of 5 to "5")

o "Postfix" form: in an expression, $j is first used, and afterwards incremented (or decremented)

$j++; # the same as $j = $j + 1

$j--; # the same as $j = $j - 1

o "Prefix" form: in an expression, $j is first incremented (or decremented), and then used in the expression

++$j; # the same as $j = $j + 1

--$j; # the same as $j = $j - 1
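As standalone statements the postfix and prefix forms have the same effect; the difference only shows when the value is used in the same expression. A minimal sketch, with variable names chosen for illustration:

```perl
use strict;
use warnings;
use feature 'say';

my $greeting = "hello" . "world";   # concatenation: helloworld
my $repeat   = "fred" x 3;          # repetition: fredfredfred

my $j    = 5;
my $post = $j++;   # $post gets the old value 5; $j becomes 6
my $pre  = ++$j;   # $j becomes 7 first; $pre gets 7

say $greeting;          # helloworld
say $repeat;            # fredfredfred
say "$post $pre $j";    # 5 7 7
```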

Data types used in PERL language?

Scalar − Scalars are simple variables. They are preceded by a dollar sign ($). A scalar is a number, a

string, or a reference. A reference is actually an address of a variable, which we will see in the






upcoming chapters.

Arrays − Arrays are ordered lists of scalars that you access with a numeric index which starts with 0.

They are preceded by an "at" sign (@).

Hashes − Hashes are unordered sets of key/value pairs that you access using the keys as subscripts.

They are preceded by a percent sign (%).

Write a code to calculate the reverse complement of a DNA strand?

#!/usr/bin/perl -w

# Calculating the reverse complement of a strand of DNA

# The DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Print the DNA onto the screen
print "Here is the starting DNA:\n\n";

print "$DNA\n\n";

# Make a new, reversed copy of the DNA
$revcom = reverse $DNA;

# Complement each base (A<->T, C<->G) with tr///, which
# translates character by character and keeps the case
$revcom =~ tr/ACGTacgt/TGCAtgca/;

# Print the reverse complement DNA onto the screen
print "Here is the reverse complement DNA:\n\n";

print "$revcom\n";

exit;

Write a Perl code to transcribe DNA to RNA?

#!/usr/bin/perl -w

# Transcribing DNA into RNA

# The DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Print the DNA onto the screen
print "Here is the starting DNA:\n\n";

print "$DNA\n\n";

# Transcribe the DNA to RNA by substituting all T's with U's.
$RNA = $DNA;

$RNA =~ s/T/U/g;

# Print the RNA onto the screen
print "Here is the result of transcribing the DNA to RNA:\n\n";

print "$RNA\n";

# Exit the program.
exit;
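The two techniques, reverse plus tr/// for the reverse complement and s/T/U/g for transcription, can also be wrapped in subroutines so they are easy to reuse and check. This is only a sketch; the subroutine names reverse_complement and transcribe and the short test sequences are assumptions made for illustration.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: reverse the strand, then complement each base.
sub reverse_complement {
    my ($dna) = @_;
    my $revcom = reverse $dna;          # scalar context reverses the string
    $revcom =~ tr/ACGTacgt/TGCAtgca/;   # A<->T, C<->G, case preserved
    return $revcom;
}

# Hypothetical helper: transcribe DNA to RNA by replacing every T with U.
sub transcribe {
    my ($dna) = @_;
    (my $rna = $dna) =~ s/T/U/g;        # copy first, then substitute in the copy
    return $rna;
}

say reverse_complement('AACG');   # CGTT
say transcribe('ACGT');           # ACGU
```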
