Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

30
The Complexity of Eukaryotic Genomes -genomes of most eukaryotes are larger and more complex than those of prokaryotes - However, the genome size of many eukaryotes does not appear to be related to genetic complexity. - Much of the complexity of eukaryotic genomes thus results from the abundance of several different types of noncoding sequences, which constitute most of the DNA of higher eukaryotic cells. Figure 4.1. Genome size The range of sizes of the genomes of representative groups of organisms are shown on a logarithmic scale.

Transcript of Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

Page 1: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

The Complexity of Eukaryotic Genomes -genomes of most eukaryotes are larger and more complex than those of prokaryotes - However, the genome size of many eukaryotes does not appear to be related to genetic complexity. - Much of the complexity of eukaryotic genomes thus results from the abundance of several different types of noncoding sequences, which constitute most of the DNA of higher eukaryotic cells.

Figure 4.1. Genome size The range of sizes of the genomes of representative groups of organisms are shown on a logarithmic scale.

Page 2: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

Introns and Exons

-some of the noncoding DNA in eukaryotes is accounted for by long DNA sequences that lie between genes (spacer sequences)

-large amounts of noncoding DNA are also found within most eukaryotic genes. Such genes have a split structure in which segments of coding sequence (called exons) are separated by noncoding sequences (intervening sequences, or introns)

Figure 4.2. The structure of eukaryotic genes The introns are then removed by splicing to form the mature mRNA.

Page 3: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

-On average, introns are estimated to account for about ten times more DNA than exons in the genes of higher eukaryotes. (eg.human gene that encodes the blood clotting protein factor VIII. This gene spans approximately 186 kb of DNA and is divided into 26 exons. The mRNA is only about 9 kb long, so the gene contains introns totaling more than 175 kb).

-introns are clearly not required for gene function in eukaryotic cells.

-most introns have no known cellular function, although a few have been found to encode functional RNAs or proteins.

-generally thought to represent remnants of sequences that were important earlier in evolution.

-introns may have helped accelerate evolution by facilitating recombination between protein-coding regions (exons) of different genes—a process known as exon shuffling.

(recombination between introns of different genes would result in new genes containing novel combinations of protein-coding sequences)

Page 4: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

Table 9-1. Classification of Eukaryotic DNA

Protein-coding genes Solitary genes Duplicated and diverged genes (functional gene families and nonfunctional pseudogenes)Tandemly repeated genes encoding rRNA, 5S rRNA, tRNA, and histonesRepetitious DNA Simple-sequence DNA Moderately repeated DNA (mobile DNA elements) Transposons Viral retrotransposons Long interspersed elements (LINES; nonviral retrotransposons) Short interspersed elements (SINES; nonviral retrotransposons)Unclassified spacer DNA

Page 5: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

Gene Families and Pseudogenes-another factor contributing to the large size of eukaryotic genomes is

that some genes are repeated many times.-many eukaryotic genes are present in multiple copies, called gene

families while most prokaryotic genes are represented only once in the genome

Figure 4.5. Globin gene families Members of the human α- and β-globin gene families are clustered on chromosomes 16 and 11, respectively. Each family contains genes that are specifically expressed in embryonic, fetal, and adult tissues, in addition to nonfunctional gene copies (pseudogenes).

Page 6: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

Repetitive DNA Sequences-3 classes based on the number of times nucleotide sequence is

repeated within the population of DNA fragments

a. Highly Repeated DNA Sequences- sequences present in at least 105 copies per genome (1-10% of total

DNA)- typically short and present in clusters in which the given sequence

repeats itself over and over again w/out interruption- the sequence is said to be present in tandem (sequence arranged in

end-to-end manner)1. Satellite DNAs

- short sequences about 5 to a few hundred base pairs in length(form very large clusters each containing up to several million base pairs of DNA)

- base composition of the said DNA segments is different from the bulk of the DNA that fragments contg. the sequence that can be separated into a distinct ”satellite” band during density gradient centrifugation

- localization of satellite DNA within centromeres of chromosomes

Page 7: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

-precise function of satellite DNA remains a mystery2. Minisatellite DNAs

-sequences range from about 12 to 100 base pairs in length and are found in clusters contg as many as 3000 repeats.

-occupy considerably shorter stretches of the genome than the satellite sequences

-tend to be unstable, and the number of copies of a particular sequence often increase or decrease from one generation to the next (the length of a particular minisatellite locus is highly variable in the population even among the same members of the same family

-used to identify individuals in criminal or paternity cases thru the technique of DNA fingerprinting

Use of PCR in forensic science.Hypervariable microsatellite sequences (such as variable number of tandemrepeats, or VNTR) are identified by PCR.

Page 8: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

3. Microsatellite DNAs-shortest sequences (1 to 5 base pairs long) and are typically

present in small clusters of about 10 to 40 base pairs in length-scattered quite evenly thru the DNA – more than 100,000

different loci are present in the human genome

b. Moderately Repeated DNA Sequence- moderately repeated fraction of the genomes of plants and animals - vary from about 20 to more than 80% of the total DNA, depending on the organism- sequences that are repeated within the genome anywhere from a few times to tens of thousands of times.- sequences that code for known gene products, either RNAs or proteins, and those that lack a coding function

1. Repeated DNA Sequences with Coding Functions-includes genes that code for ribosomal RNA as well as

those that code for histones (impt. chromosomal proteins)-the repeated sequences that code for each of these

products are typically identical to one another and located in tandem array

Page 9: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

2. Repeated DNAs that Lack Coding Functions-the bulk of the moderately repeated DNA fraction does not

encode any type of product -repetitive DNA sequences are scattered throughout the

genome rather than being clustered as tandem repeats (i.e. interspersed thruout the genome as individual elements).

-can be grouped into two classes:a. SINEs (short intesrpersed elements)

-major SINEs in mammalian genomes are Alusequences, so-called because they usually contain a single site for the restriction endonuclease AluI.

-Alu sequences are approximately 300 base pairs long, and about a million such sequences are dispersed throughout the genome, accounting for nearly 10% of the total cellular DNA.

-Although Alu sequences are transcribed into RNA, they do not encode proteins and their function is unknown. b. LINEs (long interspersed elements)

-major human LINEs (which belong to the LINE 1, or L1, family) are about 6000 base pairs long and repeat

approximately 50,000 times in the genome.

Page 10: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

-L1 sequences are transcribed and at least some encode proteins, but like Alu sequences, they have no known function in cell physiology.

-both Alu and L1 sequences are examples of transposable elements, which are capable of moving to different sites in genomic DNA.

- some of these sequences may help regulate gene expression, but most Alu and L1 sequences appear not to make a useful contribution to the cell.

-they may have played important evolutionary roles by contributing to the generation of genetic diversity.

3. Nonrepeated DNA Sequences- are present in a single copy in the genome which includes genes that exhibit

Mendelian patterns of inheritance-always localize to a particular site on a particular chromosome-contains by far the greatest amount of genetic information-include DNA sequences that code for virtually all proteins other thanhistones-even if these sequenes are not present in multiple copies, genes that code for

polypeptides are usually members of a family of related genes(this is true for globins, actins, myosins, collagens, tubulins, integrins and most other proteins in a eukaryotic cell.

Page 11: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

The Number of Genes in Eukaryotic Cells

Organism Genome size (Mb)a Protein-coding sequence (percent)b

Number of genesb

E. coli 4.6 90 4,288S. cerevisiae 12 70 5,885C. elegans 97 25 19,099Drosophila 180 13 13,600Human 3,000 3 *100,000

a Mb = millions of base pairs.b Determined from the E. coli, S. cerevisiae, C. elegans, and Drosophila genomic sequences and estimated for the human genome.

Table 4.1. The Numbers of Genes in Cellular Genomes

Page 12: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

*Nature 431, 931-945 (21 October 2004) | doi:10.1038/nature03001Finishing the euchromatic sequence of the human genome

Latest information:-2.85 billion nucleotides interrupted by only 341 gaps. -encode only 20,000–25,000 protein-coding genes.

Page 13: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

Chromosomes and Chromatin

Organism Genome size (Mb)a Chromosome numbera

Yeast (Saccharomyces cerevisiae) 12 16Slime mold (Dictyostelium) 70 7Arabidopsis thaliana 130 5Corn 5,000 10Onion 15,000 8Lily 50,000 12Nematode (Caenorhabditis elegans) 97 6

Fruit fly (Drosophila) 180 4Toad (Xenopus laevis) 3,000 18Lungfish 50,000 17Chicken 1,200 39Mouse 3,000 20Cow 3,000 30Dog 3,000 39Human 3,000 23

a Both genome size and chromosome number are for haploid cells. Mb = millions of base pairs.

Table 4.2. Chromosome Numbers of Eukaryotic Cells

Page 14: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

-genomes of prokaryotes are contained in single chromosomes, which are usually circular DNA molecules.-the genomes of eukaryotes are composed of multiple chromosomes, each containing a linear molecule of DNA. -DNA of eukaryotic cells is tightly bound to small basic proteins (histones) that package the DNA in an orderly way in the cell nucleus(total extended length of DNA in a human cell is nearly 2 m, but this DNA must fit into a nucleus with a diameter of only 5 to 10 μm).

Chromatin-complexes between eukaryotic DNA and proteins

-major proteins of chromatin are the histones

small proteins containing a high proportion of basic amino acids (arginine and lysine) that facilitate binding to the negatively charged DNA molecule.

- five major types of histones—called H1, H2A, H2B, H3, and H4—which are very similar among different species of eukaryotes

Page 15: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

Histonea Molecular weight Number of amino acids

Percentage Lysine + Arginine

H1 22,500 244 30.8H2A 13,960 129 20.2H2B 13,774 125 22.4H3 15,273 135 22.9H4 11,236 102 24.5

a Data are for rabbit (H1) and bovine histones.

Table 4.3. The Major Histone Proteins

-histones are extremely abundant proteins in eukaryotic cells; together, their mass is approximately equal to that of the cell's DNA.

Page 16: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

-not found in eubacteria (e.g., E. coli), but Archaebacteria, do contain histones that package their DNAs in structures similar to eukaryotic chromatin.

-the basic structural unit of chromatin, the nucleosome, consisting of DNA (≈146-bp segment of DNA )wrapped around a histone core.

Figure 4.8. The organization of chromatin in nucleosomes (A) The DNA is wrapped around histones in nucleosome core particles and sealed by histone H1. Nonhistone proteins bind to the linker DNA between nucleosome core particles. The linker DNA between the nucleosome core particles is preferentially sensitive, so limited digestion of chromatin yields fragments corresponding to multiples of 200 base pairs.

Chromosomal DNA and Its Packaging in the Chromatin Fiber

Page 17: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

Figure 4.10. Chromatin fibers The packaging of DNA into nucleosomes yields a chromatin fiber approximately 10 nm in diameter. The chromatin is further condensed by coiling into a 30-nm fiber, containing about six nucleosomes per turn.

Solenoid structure

Figure 4.9. Structure of a chromatosome (A) The nucleosome core particle consists of 146 base pairs of DNA wrapped 1.65 turns around a histone octamer consisting of two molecules each of H2A, H2B, H3, and H4. A chromatosome contains two full turns of DNA (166 base pairs) locked in place by one molecule of H1.

Page 18: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

Figure 4.11. Model for the packing of chromatin and the chromosome scaffold in metaphase chromosomes. In interphase chromosomes, long stretches of 30-nm chromatin loop out from extended scaffolds. In metaphase chromosomes, the scaffold is folded into a helix and further packed into a highly compacted structure, whose precise geometry has not been determined.

Page 19: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

The chromatin in chromosomal regions that are not being transcribed exists predominantly in the condensed, 30-nm fiber form. The regions of chromatin actively being transcribed are thought to assume the extended beads-on-a-string form

Euchromatin -Decondensed, transcriptionally active interphase chromatin.Most of the euchromatin in interphase nuclei appears to be in the form of 30-nm fibers, organized into large loops containing approximately 50 to 100 kb of DNA. About 10% of the euchromatin, containing the genes that are actively transcribed, is in a more decondensed state (the 10-nm conformation) that allows transcription

Heterochromatin-Condensed, transcriptionally inactive chromatin. Heterochromatin is transcriptionally inactive and contains highly repeated DNA sequences, such as those present at centromeres and telomeres.

As cells enter mitosis, their chromosomes become highly condensed so that they can be distributed to daughter cells. The loops of 30-nm chromatin fibers are thought to fold upon themselves further to form the compact metaphase chromosomes of mitotic cells, in which the DNA has been condensed nearly 10,000-fold (Figure 4.11). Such condensed chromatin can no longer be used as a template for RNA synthesis, so transcription ceases during mitosis. Electron micrographs indicate that the DNA in metaphase chromosomes is organized into large loops attached to a protein scaffoldMetaphase chromosomes are so highly condensed that their morphology can be studied using the light microscope

Page 20: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

However, euchromatin is interrupted by stretches of heterochromatin, in which 30-nm fibers are subjected to additional levels of packing that usually render it resistant to gene expression. Heterochromatin is commonly found around centromeres and near telomeres, but it is also present at other positions on chromosomes.

Figure 4-21. A comparison of extended interphase chromatin with the chromatin in a mitotic chromosome. (A) An electron micrograph showing an enormous tangle of chromatin spilling out of a lysed interphase nucleus. (B) A scanning electron micrograph of a mitotic chromosome: a condensed duplicated chromosome in which the two new chromosomes are still linked together . The constricted region indicates the position of the centromere. Note the difference in scales.

Page 21: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

Each DNA Molecule That Forms a Linear Chromosome Must Contain a Centromere, Two Telomeres, and Replication Origins Figure 4-22. The three DNA

sequences required to produce a eucaryotic chromosome that can be replicated and then segregated at mitosis. Each chromosome has multiple origins of replication, one centromere, and two telomeres. Shown here is the sequence of events a typical chromosome follows during the cell cycle. The DNA replicates in interphase beginning at the origins of replication and proceeding bidirectionally from the origins across the chromosome. In M phase, the centromere attaches the duplicated chromosomes to the mitotic spindle so that one copy is distributed to each daughter cell during mitosis. The centromere also helps to hold the duplicated chromosomes together until they are ready to be moved apart. The telomeres form special caps at each chromosome end.

Page 22: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

Figure 4.17. Assay of a centromere in yeast Both plasmids shown contain a selectable marker (LEU2) and DNA sequences that serve as origins of replication in yeast (ARS, which stands for autonomously replicating sequence). However, plasmid I lacks a centromere and is therefore frequently lost as a result of missegregation during mitosis. In contrast, the presence of a centromere (CEN) in plasmid II ensures its regular transmission to daughter cells.

Page 23: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

replication origin- Location on a DNA molecule at which duplication of the DNA begins.Centromere- Constricted region of a mitotic chromosome that holds sister chromatids together. It is also the site on the DNA where the kinetochore forms that captures microtubules from the mitotic spindle.Telomere-End of a chromosome, associated with a characteristic DNA sequence that is replicated in a special way. Counteracts the tendency of the chromosome otherwise to shorten with each round of replication. (From Greek telos, end.)

-perform another function: the repeated telomere DNA sequences, together with the regions adjoining them, form structures that protect the end of the chromosome from being recognized by the cell as a broken DNA molecule in need of repair.

Figure 4.19. Structure of a telomere Telomere DNA loops back on itself to form a circular structure that protects the ends of chromosomes

Page 24: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

Figure 4-11. The banding patterns of human chromosomes. Chromosomes 1–22 are numbered in approximate order of size. A typical human somatic (non-germ line) cell contains two of each of these chromosomes, plus two sex chromosomes—two X chromosomes in a female, one X and one Y chromosome in a male. The chromosomes used to make these maps were stained at an early stage in mitosis, when the chromosomes are incompletely compacted. The horizontal green line represents the position of the centromere, which appears as a constriction on mitotic chromosomes; the knobs on chromosomes 13, 14, 15, 21, and 22 indicate the positions of genes that code for the large ribosomal RNAs (discussed in Chapter 6). These patterns are obtained by staining chromosomes with Giemsa stain, and they can be observed under the light microscope.

Page 25: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

Karyotype- Full set of chromosomes of a cell arranged with respect to size, shape, and number.

Figure 1-30. The genome of E. coli. (A) A cluster of E. coli cells. (B) A diagram of the E. coli genome of 4,639,221 nucleotide pairs (for E. coli strain K-12). The diagram is circular because the DNA of E. coli, like that of other procaryotes, forms a single, closed loop. Protein-coding genes are shown as yellow or orange bars, depending on the DNA strand from which they are transcribed; genes encoding only RNA molecules are indicated by green arrows. Some genes are transcribed from one strand of the DNA double helix (in a clockwise direction in this diagram), others from the other strand (counterclockwise).

The Sequences of Complete Genomes

Page 26: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

Figure 4-13. The genome of S. cerevisiae (budding yeast). (A) The genome is distributed over 16 chromosomes, and its complete nucleotide sequence was determined by a cooperative effort involving scientists working in many different locations, as indicated (gray, Canada; orange, European Union; yellow, United Kingdom; blue, Japan; light green, St Louis, Missouri; dark green, Stanford, California). The constriction present on each chromosome represents the position of its centromere . (B) A small region of chromosome 11, highlighted in red in part A, is magnified to show the high density of genes characteristic of this species. As indicated by orange, some genes are transcribed from the lower strand ,while others are transcribed from the upper strand. There are about 6000 genes in the complete genome, which is 12,147,813 nucleotide pairs long.

Page 27: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

Figure 4-16. Scale of the human genome. If each nucleotide pair is drawn as 1 mm as in (A), then the human genome would extend 3200 km (approximately 2000 miles), far enough to stretch across the center of Africa, the site of our human origins (red line in B). At this scale, there would be, on average, a protein-coding gene every 300 m. An average gene would extend for 30 m, but the coding sequences in this gene would add up to only just over a meter.

The Human Genome

Page 28: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

Figure 4-15. The organization of genes on a human chromosome. (A) Chromosome 22, one of the smallest human chromosomes, contains 48 × 106 nucleotide pairs and makes up approximately 1.5% of the entire human genome. Most of the left arm of chromosome 22 consists of short repeated sequences of DNA that are packaged in a particularly compact form of chromatin (heterochromatin), which is discussed later in this chapter. (B) A tenfold expansion of a portion of chromosome 22, with about 40 genes indicated. Those in dark brown are known genes and those in light brown are predicted genes. (C) An expanded portion of (B) shows the entire length of several genes. (D) The intron-exon arrangement of a typical gene is shown after a further tenfold expansion. Each exon (red) codes for a portion of the protein, while the DNA sequence of the introns (gray) is relatively unimportant. The entire human genome (3.2 × 109 nucleotide pairs) is distributed over 22 autosomes and 2 sex chromosomes (see Figures 4-10 and 4-11). The term human genome sequence refers to the complete nucleotide sequence of DNA in these 24 chromosomes. Being diploid, a human somatic cell therefore contains roughly twice this amount of DNA. Humans differ from one another by an average of one nucleotide in every thousand, and a wide variety of humans contributed DNA for the genome sequencing project. The published human genome sequence is therefore a composite of many individual sequences.

Page 29: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

Table 4-1. Vital Statistics of Human Chromosome 22 and the Entire Human GenomeCHROMOSOME 22 HUMAN GENOME

DNA length 48 × 106 nucleotide pairs* 3.2 × 109 Number of genes approximately 700 approximately 30,000Smallest protein-coding gene 1000 nucleotide pairs not analyzedLargest gene 583,000 nucleotide pairs 2.4 × 106 nucleotide pairsMean gene size 19,000 nucleotide pairs 27,000 nucleotide pairsSmallest number of exons per gene 1 1

Largest number of exons per gene 54 178

Mean number of exons per gene 5.4 8.8Smallest exon size 8 nucleotide pairs not analyzedLargest exon size 7600 nucleotide pairs 17,106 nucleotide pairsMean exon size 266 nucleotide pairs 145 nucleotide pairsNumber of pseudogenes** more than 134 not analyzedPercentage of DNA sequence in exons (protein coding sequences)

3% 1.5%

Percentage of DNA in high-copy repetitive elements

42% approximately 50%

Percentage of total human genome 1.5% 100%

* The nucleotide sequence of 33.8 × 106 nucleotides is known; the rest of the chromosome consists primarily of very short repeated sequences that do not code for proteins or RNA.** A pseudogene is a nucleotide sequence of DNA closely resembling that of a functional gene, but containing numerous deletion mutations that prevent its proper expression. Most pseudogenes arise from the duplication of a functional gene followed by the accumulation of damaging mutations in one copy.

Page 30: Bio108 Cell Biology lec 4 The Complexity of Eukaryotic Genomes

Figure 4-17. Representation of the nucleotide sequence content of the human genome. LINES, SINES, retroviral-like elements, and DNA-only transposons are all mobile genetic elements that have multiplied in our genome by replicating themselves and inserting the new copies in different positions. Mobile genetic elements are discussed in Chapter 5. Simple sequence repeats are short nucleotide sequences (less than 14 nucleotide pairs) that are repeated again and again for long stretches. Segmental duplications are large blocks of the genome (1000–200,000 nucleotide pairs) that are present at two or more locations in the genome. Over half of the unique sequence consists of genes and the remainder is probably regulatory DNA. Most of the DNA present in heterochromatin, a specialized type of chromatin (discussed later in this chapter) that contains relatively few genes, has not yet been sequenced.