Transcription: Re-writing the DNA code into RNA
By the end of this section, you should know more about
- The structure and properties of RNA vs. DNA
- The different classes of RNA
- The phases of transcription (DNA --> RNA)
- Post-transcriptional modification of mRNA
In the next two lectures, we will discover how genes enable the cell to make proteins for various functions. Here is a gigantic overview of gene expression in eukaryotes that will set the stage for us to add a little bit more detail.
More Genes Does not Equal Greater Complexity
How can our species, Homo sapiens, have about the same number of genes as the humble Wall Cress (Arabidopsis thaliana), and many fewer genes than many other plants? Aren't we the most complex of organisms?
- #1: No, we are not.
- #2: Who says more DNA means greater complexity?
Animal diversity in structure and function is due more to differences in how DNA is packaged, read and processed than in differences in quantity of DNA/genes. More derived eukaryotes can use the same genes to manufacture different proteins by "editing" those genes at the level of RNA.
This allows the average mammal genome consisting of about 20,000 genes to produce more than 100,000 different proteins from those genes that encode proteins.
How can this be?
It's all about the RNA. Transcription is the process of manufacturing RNA.
RNA: Structure and Properties
DNA is the permanent "blueprint" for constructing and operating an organism.
DNA does not govern these things directly, nor by itself. For a gene to
exert its function, it must first be transcribed ("rewritten") into
another nucleic acid language, that of RNA.
RNA vs. DNA
- RNA is (usually) single stranded. DNA is double-stranded.
- RNA has ribose in its "backbone" instead of deoxyribose.
- RNA has the pyrimidine uracil instead of thymine (both H-bond with adenine)

- RNA can sometimes fold into complex, three-dimensional shapes.

- RNA can be a vital component of biological machines that bind to DNA or other RNA (e.g., telomerase, ribosomes)
- Unlike DNA, RNA can catalyze biological reactions with enzyme-like activity. RNA molecules that can behave like enzymes are known as ribozymes. (This short video shows the initiation of protein synthesis, in which several RNA molecules, such as tRNA, exhibit enzyme-like activity.)

Types of RNA
RNA can be either
- informational
Messenger RNA (mRNA) is the intermediate "courier" of the protein-coding genes of the DNA. It is a relatively short-lived template that binds to the ribosome and is translated by the ribosome (and enzymes) into protein.
- functional
Functional RNA is transcribed from specific genes, but never translated into protein. These are "finished products" that perform vital functions related to gene expression. These are encoded by a relatively small family of genes (a few dozen to a few hundred, depending on species).
- Transfer RNA (tRNA)
These three-dimensional molecules bind to amino acids and bring them to the ribosome for addition to the growing polypeptide chain.
- Ribosomal RNA (rRNA)
These are the major component of ribosomes, the protein/RNA complexes responsible for assembling proteins from the mRNA code. rRNA is relatively stable and long-lived. It accounts for the majority of the RNA present in the cell at any given time.
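Some functional RNA molecules are unique to eukaryotes, and are involved in the processing of other RNA molecules to effect control of gene expression. These include small nuclear RNA (snRNA), microRNA (miRNA), small interfering RNA (siRNA), and Piwi-interacting RNA (piRNA).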
Long noncoding RNAs (lncRNAs or just ncRNAs) are not translated into protein, and their function is largely unknown. They are transcribed from most regions of the genome. Some are known to participate in dosage compensation, but most are a mystery.
Transcription and translation are happening constantly in any cell. Hence, rRNA, tRNA, mRNA and snRNA are constantly being transcribed (i.e., their transcription is constitutive).
The other functional RNAs (miRNA, siRNA, piRNA, lncRNA) are manufactured only when they are needed.
See a complete list of the types of RNA discovered so far.
Messenger RNA: Encoding Polypeptides
The only type of RNA that carries the code for building proteins is mRNA. In the initial phase of gene expression, DNA is transcribed into mRNA, which is read by the ribosomes and translated into protein.
Proteomes
The full range of proteins a particular organism is able to manufacture is known as the complete proteome.
A cellular proteome is all the proteins found in a specific type of cell (and/or tissue) under a specific set of environmental conditions.
Gene expression changes not only on a daily, hour-by-hour basis, but also over the lifespan of an organism.
Even your daily Circadian Rhythm is ultimately under genetic control, as the hormones and other molecules that direct your body to perform daily functions are at least partly encoded by one or more genes that respond to daily cycles.
Transcribing mRNA in Prokaryotes
No matter which type of RNA is being manufactured, the machinery of transcription is essentially the same.
Transcription is the process by which the DNA code is rewritten into the informational code of RNA.
The three phases of this process are known as initiation, elongation, and termination.
Initiation
When a gene is being transcribed, the nitrogenous base code must be made available for reading by polymerizing enzymes. But where to begin? Much of what we first learned about transcription came from our old pal, E. coli.
At the start of every gene, there is a short base sequence (about 60 base pairs (bp) long) known as the promoter. This is where the polymerizing enzyme will attach.
Promoters: Conserved/Consensus Sequences of DNA
Within species, and even across species, many homologous genes have small segments of base pairs exactly or almost exactly in common. These are called conserved sequences.
When conserved sequences are found in several different locations in the genome, or in homologous regions of genomes of different species, they are known as consensus sequences. These may have resulted from functional or evolutionary (i.e., common ancestry) relationships among the sequences.
Retention of conserved sequences suggests that these sequences have not tolerated a great deal of mutation: their products are sensitive to changes, and may not function in mutant form.
In E. coli and other prokaryotes, one such consensus sequence is found in the promoter of all genes: The Pribnow Box.
- The Pribnow Box, located approximately 10 base pairs upstream of the first DNA base actually transcribed, facilitates proper attachment of RNA polymerase.
- The sequence is sometimes called the Pribnow-Schaller Box, named after its discoverers, David Pribnow and Heinz Schaller.
- The sequence is usually T(82%) A(89%) T(52%) A(59%) A(49%) T(89%), with the likelihood of each base at its position shown in parentheses.
- The Box (a "box" is a short sequence of nucleotides with a specific function) notably consists of A-T pairs (two H-bonds), not G-C pairs (three H-bonds).
- When RNA polymerase is attached, less energy is needed to denature the double helix, opening it for transcription.
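As a rough illustration of how a consensus sequence can be used, the sketch below scores every six-base window of a DNA string against the Pribnow Box frequencies listed above, and also counts the hydrogen bonds holding a short sequence to its complement. The example DNA string and the function names (`match_score`, `best_pribnow_site`, `hydrogen_bonds`) are invented for this illustration. The scoring scheme (adding up the percentages at matching positions) is a simplifying assumption, not an established promoter-finding method.

```python
# Consensus bases and their frequencies, taken from the list above.
CONSENSUS = "TATAAT"
FREQ = [0.82, 0.89, 0.52, 0.59, 0.49, 0.89]

# Hydrogen bonds per base pair: A-T pairs have two, G-C pairs have three.
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def match_score(hexamer):
    """Sum the consensus frequencies at positions where the hexamer matches."""
    return sum(f for base, cons, f in zip(hexamer, CONSENSUS, FREQ) if base == cons)


def best_pribnow_site(dna):
    """Return (start index, hexamer, score) for the best-scoring 6-base window."""
    windows = [(i, dna[i:i + 6]) for i in range(len(dna) - 5)]
    start, hexamer = max(windows, key=lambda w: match_score(w[1]))
    return start, hexamer, match_score(hexamer)


def hydrogen_bonds(dna):
    """Count the H-bonds between this strand and its complement."""
    return sum(H_BONDS[base] for base in dna)


# Hypothetical stretch of coding-strand DNA, written 5'->3'.
dna = "GGCCGTATAATGCGCCGA"
start, hexamer, score = best_pribnow_site(dna)
print(start, hexamer, round(score, 2))
print(hydrogen_bonds("TATAAT"), hydrogen_bonds("GCGCGC"))
```

Six A-T pairs are held together by 12 hydrogen bonds, compared with 18 for six G-C pairs, which is one way to see why an A-T-rich box is easier to open.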
RNA Polymerase
The promoter is the starting location of RNA synthesis. In prokaryotes, the enzymatic complex known as RNA polymerase (RNAP or RNApol) is the main player.
RNA polymerase must be able to
- Recognize start (initiation) and stop (termination) signals on the DNA.
- Avoid transcribing the non-informational DNA.
Prokaryotic RNA polymerase consists of a
- core factor (made up of four protein subunits: two α subunits, β, and β')
This region of the enzyme polymerizes the chain of RNA.
- sigma factor
The sigma (σ) factor is a small protein that recognizes the promoter and facilitates binding of RNA polymerase to it. Bacterial species manufacture several different sigma factors (E. coli has seven). The specific sigma factor used to initiate transcription of a given gene depends on the gene and on the cell's internal environment.
An RNA polymerase with its sigma factor removed is "blind"
and attaches randomly to the DNA.
Note:
- holoenzyme - a biochemically active compound formed by the combination of an enzyme with a coenzyme.
- coenzyme - a small, non-proteinaceous molecule required for activation of an enzyme
RNA polymerase, armed with its sigma factor, attaches to the promoter and denatures the double helix.
Promoters of different genes may vary slightly in sequence. Different sigma factors that recognize specific sequences are active under the physiological conditions that make it necessary to transcribe a specific gene.
Like DNA replication, RNA transcription takes place in the 5' to 3' direction. Incoming nucleotides are oriented with the 5' (phosphate) end pointing towards the 3' (-OH) end of the growing chain.
Unlike DNA replication, RNA transcription involves little proofreading. (Though there is some evidence that RNApol may pause at a mismatch and replace the mistake.)
(Why do you suppose proofreading is less critical in manufacturing RNA than in replicating DNA?)
The DNA template strand to which the mRNA nucleotides are attached is known as the non-coding strand.
- The non-coding strand base sequence is complementary to the mRNA base sequence.
- The non-coding strand is also known as the antisense or template strand.
The DNA strand that is not used as a template is known as the coding strand.
- The coding strand base sequence is the same as the mRNA base sequence (except that the DNA has thymine where the mRNA has uracil).
- The coding strand is also known as the sense strand or non-template strand.
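The relationship between the two DNA strands and the mRNA can be sketched in a few lines. This is only an illustration: the 12-base sequence and the function names (`template_from_coding`, `transcribe`) are made up, and the template strand is written 3'->5' so that it lines up base-for-base under the coding strand.

```python
# Base-pairing rules. DNA pairs A-T and G-C; in RNA, U takes the place of T.
DNA_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def template_from_coding(coding_5to3):
    """Return the template (antisense) strand, written 3'->5'."""
    return "".join(DNA_PAIR[base] for base in coding_5to3)


def transcribe(template_3to5):
    """Pair each template base with an RNA base; the RNA reads 5'->3'."""
    return "".join(RNA_PAIR[base] for base in template_3to5)


coding = "ATGGCCTTAGAT"                  # hypothetical coding (sense) strand, 5'->3'
template = template_from_coding(coding)  # template (antisense) strand, 3'->5'
mrna = transcribe(template)              # mRNA, 5'->3'

print("coding  ", coding)
print("template", template)
print("mRNA    ", mrna)

# The mRNA matches the coding strand, with U in place of T.
print(mrna == coding.replace("T", "U"))
```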

Viruses may always use the same DNA strand as a template.
Bacteria (and possibly eukaryotes) may use either strand of DNA as
the template, but only one side is the template in any given gene
or gene sequence.
Elongation
RNA polymerase travels along the DNA template strand, laying down nucleotides in a 5' to 3' direction.
Hydrolysis of the two phosphate bonds of the incoming nucleotide yields energy that drives the reaction.
Termination
In prokaryotes, two termination mechanisms are known
- direct/intrinsic (a.k.a. "rho-independent")
- rho-dependent
Intrinsic/Direct Termination
- The mRNA transcript itself has a termination sequence about 40bp long.
- Within this section are inverted repeats--sections that read the
same, forward or backward.
- When these are transcribed, they form complementary sequences that bind together, forming a stem loop:

- The "stem" is rich in the more stable G-C bonds.
- The DNA encoding the stem loop is immediately followed by a long string of Adenines.
This is transcribed as a long string of Uracils.
- This less stable A/U area of the RNA/DNA hybrid causes the RNA polymerase to pause there.
- After the RNA polymerase transcribes the two complementary regions, the
inverted repeats release from the DNA template and bind into a stem loop.
- Formation of the stem loop may reduce RNApol's affinity for the complex, facilitating its release.
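The signature described above (two arms that can pair with each other, a short loop between them, then a run of uracils) can be searched for with a small sketch like the one below. It treats the inverted repeats as reverse-complementary arms, which is what lets them pair into a stem. The stem length, loop length, and minimum U-run are assumed parameters chosen for this toy example, and the RNA strings are hypothetical; real terminators vary in all three.

```python
RNA_PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the sequence that would pair with `rna`, read 5'->3'."""
    return "".join(RNA_PAIR[base] for base in reversed(rna))


def find_terminator(rna, stem=6, loop=4, min_u=4):
    """Return the index where a stem loop followed by a U-run begins, or -1."""
    span = 2 * stem + loop
    for i in range(len(rna) - span - min_u + 1):
        left_arm = rna[i:i + stem]
        right_arm = rna[i + stem + loop:i + span]
        tail = rna[i + span:i + span + min_u]
        if right_arm == reverse_complement(left_arm) and tail == "U" * min_u:
            return i
    return -1


# Hypothetical transcripts: a G-C rich stem (GCCGCC ... GGCGGC) around a GAAA loop.
with_terminator = "AUAC" + "GCCGCC" + "GAAA" + "GGCGGC" + "UUUUUU" + "AG"
without_u_run = "AUAC" + "GCCGCC" + "GAAA" + "GGCGGC" + "AGAG"

print(find_terminator(with_terminator))
print(find_terminator(without_u_run))
```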
Rho-dependent Termination
- In rho-dependent genes, a small hexamer (six-subunit) protein named rho recognizes the termination sequences on a nascent RNA.
- Rho-dependent genes do not form stem loops.
- Instead, they have a 40-60bp region rich in cytosines, poor in guanines, and containing a segment known as rut (rho utilization site).
- Rho protein binds to rut, which is just upstream from an unstable sequence on the gene that causes RNA polymerase to pause.
- The proximity of the rho protein allows it to actually knock the RNA polymerase off the strand, terminating transcription.
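A cytosine-rich, guanine-poor stretch like the one described above can be flagged with a simple sliding window. The sketch below is illustrative only: the 40-base window follows the lower end of the size range given above, but the cutoffs (`min_c`, `max_g`), the function names, and the test sequence are assumptions made for the example, not measured properties of real rut sites.

```python
def base_fraction(rna, base):
    """Fraction of positions in `rna` occupied by `base`."""
    return rna.count(base) / len(rna)


def candidate_rut_windows(rna, window=40, min_c=0.4, max_g=0.1):
    """Return start indices of windows that are C-rich and G-poor."""
    hits = []
    for i in range(len(rna) - window + 1):
        segment = rna[i:i + window]
        if base_fraction(segment, "C") >= min_c and base_fraction(segment, "G") <= max_g:
            hits.append(i)
    return hits


# Hypothetical transcript: G-rich flanks around a 40-base C-rich, G-free core.
flank = "GGAUGGCAUG" * 3
core = "CCUCCACCUC" * 4
rna = flank + core + flank

hits = candidate_rut_windows(rna)
print(hits)
```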

The Finished mRNA
mRNA rolling straight off the DNA template is known as the primary transcript.
- In prokaryotes, the primary mRNA transcript is translated, without
modification, directly into protein, and the two processes occur
simultaneously.

- In eukaryotes, the primary transcript is modified via processing and splicing before traveling to the ribosomes for protein translation.
- 5' end is capped with 7-methylguanosine
- 3' end is capped with a line of adenines
- introns are removed, and remaining exons are spliced together

But this is only one of the many ways that eukaryotic transcription is more derived than prokaryotic transcription. Let us explore.
Eukaryotic Transcription: What's Different?
A typical prokaryote has a few thousand genes, and one type of RNA polymerase to transcribe them all.
Eukaryotes have many more genes than prokaryotes, and also have a great deal of non-informational DNA interspersed between the genes. This means that the factors searching for promoters have a lot more searching to do before they connect. To facilitate this, eukaryotes have not one, but three different types of RNA polymerase:
- RNA polymerase I - transcribes most ribosomal RNA (rRNA) genes, which are clustered in the nucleolar organizer regions
- RNA polymerase II - transcribes protein-encoding genes (mRNA) and some snRNAs
- RNA polymerase III - transcribes 5S rRNA and tRNA genes
In eukaryotes, transcription takes place in the nucleus, and transcripts move out of the nucleus via the (selectively permeable) nuclear pores. Translation takes place in the cytosol (liquid portion of the cytoplasm), facilitated by free ribosomes and by the ribosomes lining the rough endoplasmic reticulum.
Organelle DNA is transcribed and translated within the organelle, yet another reminder of the Endosymbiont Model of their origins.

Quick Guide to Eukaryotic Transcription Acronyms
- CTD - carboxyl terminal (or "tail") domain - the end of a protein chain that terminates in a free carboxyl group (-COOH). The CTD of RNA polymerase II contains multiple repeats (roughly 50 in humans, fewer in some other species) of the sequence tyr-ser-pro-thr-ser-pro-ser. The CTD serves as a binding site for other proteins that activate RNA polymerase II.
The CTD:
- participates in the initiation of transcription
- caps the RNA transcript
- attaches to the spliceosome (more on this later) to facilitate exon splicing
- GTF - general transcription factor - proteins that bind around the promoter, attracting RNA polymerase. They have names such as TFIIA and TFIIB.
- TATA box - conserved sequence in the eukaryotic (and archaean) promoter region
- TBP - TATA-binding protein - a component of TFIID that binds to the TATA box
- PIC - pre-initiation complex - RNA polymerase bound to the appropriate GTFs
In eukaryotes, varied proteins known as general transcription factors (GTFs) bind around the promoter before RNA polymerase can attach. These are somewhat analogous to the prokaryotic sigma factors.
The GTFs attract RNA polymerase to the proper location, and help it attach correctly to the DNA template.
There are many GTFs, each with its own name, such as TFIIA, TFIIB, etc.
(TF for "transcription factor" and II for "RNA polymerase II").
Together, RNA polymerase II and the GTFs will comprise the preinitiation complex (PIC).
The Eukaryotic Promoter: TATA Box
As in prokaryotes, eukaryotic promoters lie upstream (5' direction) of the coding region of the DNA, with the core sequence about 25bp upstream.
The promoter includes a core sequence:
5'-TATAAA-3'
...(or very similar), usually followed by a short sequence of Adenines.
This sequence is called the TATA box (or Goldberg-Hogness Box), and is analogous to the Pribnow Box of prokaryotes.
The TATA box is found in archaeans as well as eukaryotes, providing another bit of evidence of their shared ancestry.
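Finding candidate TATA boxes in a stretch of DNA is a simple pattern search. The sketch below looks for the exact core sequence 5'-TATAAA-3' and then estimates where the coding region might begin, using the "about 25bp" spacing mentioned above. The DNA string and function names are hypothetical, measuring the spacing from the first base of the box is an arbitrary choice for the example, and real promoters often match the core sequence only approximately.

```python
import re


def find_tata_boxes(dna):
    """Return the start index of every exact TATAAA match (overlaps included)."""
    return [match.start() for match in re.finditer("(?=TATAAA)", dna)]


def predicted_start(tata_index, spacing=25):
    """Rough estimate of where transcription begins, `spacing` bases downstream."""
    return tata_index + spacing


# Hypothetical coding-strand DNA, written 5'->3'.
dna = "GCGCGCGCGC" + "TATAAA" + "AGGCT" * 6

boxes = find_tata_boxes(dna)
print(boxes)
print([predicted_start(i) for i in boxes])
```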
The Initiation Complex
The TATA Box is the binding site of TATA-binding protein (TBP) , a component of TFIID (one of the GTFs).
TFIID is the first GTF to bind to the promoter. It attracts (1) other GTFs and (2) RNA polymerase II to the site.
The GTFs and RNA polymerase II also bind to the site, creating the BIG BIG initiation complex.
Once properly seated, RNA polymerase II detaches from the GTFs via the activity of the carboxyl tail domain (CTD) of its largest subunit.
The CTD is phosphorylated (with ATP), which
- provides energy to drive the reaction
- changes its physical configuration, allowing it to move
This change weakens RNA polymerase II's affinity for the transcription factor (GTF) proteins.
RNA polymerase II gently dissociates from the GTFs and goes on its merry, elongating way.
If multiple copies of the gene are being transcribed, some GTFs remain at the promoter and attract additional RNA polymerases to the site, and transcription proceeds with several RNA polymerases in a conga line, each producing a new transcript.
Elongation in Eukaryotes
Elongation is essentially similar to that seen in prokaryotes, with the new RNA being laid down inside the transcription bubble (bordered by two Y-junctions) formed by the denaturation of the DNA.
Two major differences:
I:
- In prokaryotes, translation may begin at the 5' end of the new RNA strand while transcription is still taking place (translation is co-transcriptional).
- In eukaryotes, transcription and translation are both spatially and temporally separated.
Processing of the eukaryotic RNA transcript begins even as it's being manufactured. What was once believed to be post-transcriptional modification is now understood to be cotranscriptional.
II:
- Processing of the mRNA transcript is directed by the CTD, which does not exist in prokaryotes.
- The repeats serve as binding sites for enzymes that cap, cleave and splice the RNA.
- Recall that the CTD is located right where the new RNA squirts out of the enzyme, so it's conveniently placed to perform these functions.
- The CTD is alternately phosphorylated and dephosphorylated, and its changing composition affects its affinity for various processing proteins. It can sequentially determine which protein task must be completed as the mRNA transcript emerges!
Processing the 5' and 3' ends of the transcript
- As the first bit of mRNA emerges from the RNA polymerase, guanylyltransferase, guided by the CTD, attaches a 7-methylguanosine triphosphate cap to the 5' end. This protects the transcript from enzymatic degradation, and is essential for translation.
- The A-U-rich conserved sequence (the polyadenylation signal) close to the 3' end is recognized as it emerges, and the CTD directs the associated proteins to cut the RNA about 20 bases downstream of the sequence.
- Finally, 150-200 adenines are attached to the cut end, forming a poly-A tail. This, too, protects the transcript from degradation.
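The three end-processing steps above (cap the 5' end, cut about 20 bases past the polyadenylation signal, add a poly-A tail) can be mimicked on a string. This is a toy sketch: the signal is assumed to be the common AAUAAA motif, the cap is represented by the text label "m7G-" rather than a real nucleotide, the tail length of 150 is the low end of the range above, and the transcript and function name are hypothetical.

```python
def process_ends(pre_mrna, signal="AAUAAA", cut_offset=20, tail_len=150):
    """Cap the 5' end, cleave downstream of the poly-A signal, add a poly-A tail."""
    pos = pre_mrna.find(signal)
    if pos == -1:
        raise ValueError("no polyadenylation signal found")
    # Cut about `cut_offset` bases past the end of the signal.
    cut = pos + len(signal) + cut_offset
    body = pre_mrna[:cut]
    return "m7G-" + body + "A" * tail_len


# Hypothetical primary transcript: the trailing GUGUGUGU lies past the cut site.
pre_mrna = "GGCAUCGAUC" + "AAUAAA" + "C" * 20 + "GUGUGUGU"
mature = process_ends(pre_mrna)

print(mature[:24])
print(len(mature))
```

Everything past the cut site is discarded, so the tail is attached to the cleaved end rather than to the original 3' end of the transcript.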
Introns and Exons
The coding region of the primary transcript is composed of introns, which are cut out and discarded, and exons, which are spliced together to form an RNA sequence that is completely colinear with the protein it encodes. (Phillip Sharp et al. published this work in 1977. In 1993 Sharp shared the Nobel Prize with Richard Roberts for this work.)
Recall that the number and size of introns varies not only across genes, but also across species.
Why bother with introns?
- The nematode Caenorhabditis elegans has only about 12,000 - 13,000 known genes.
- Homo sapiens is believed to have about 25,000 functional genes. Only about twice as many.
- So how do we explain the ability of the human body to manufacture more than 100,000 different proteins from its protein-coding genome?
- The answer: alternative splicing (referred to in these notes as exon shuffling).
- One gene can encode several different proteins if its transcript is spliced in alternative ways, so that the finished sequence is different when the cell requires a different protein product.
- Interesting side note: Not all species are capable of alternative splicing. Whereas alternative splicing appears to be rare in plants (they often have more genes than animals), it's now known that more than 70% of human genes can be spliced in different ways.
- Mutations that cause splicing defects (splice site mutations) can have mild to devastating consequences for the organism:
- Type 1 neurofibromatosis
- certain types of microcephaly
- neuromuscular disorders
- keratin/skin disorders
- the list will grow as new relationships are discovered.
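To get a feel for how quickly alternative splicing multiplies the number of possible products, the sketch below enumerates every combination of exons for a single made-up gene. It assumes a deliberately simple model in which the first and last exons are always kept and each internal exon can independently be included or skipped. The exon names and the function `isoforms` are hypothetical, and real genes are constrained in ways this ignores.

```python
from itertools import combinations


def isoforms(exons, optional):
    """List every exon combination, keeping exon order and all required exons."""
    skippable = [exon for exon in exons if exon in optional]
    results = []
    for k in range(len(skippable) + 1):
        for kept in combinations(skippable, k):
            keep = set(kept)
            results.append([e for e in exons if e not in optional or e in keep])
    return results


# Hypothetical gene with five exons; the three internal ones may be skipped.
exons = ["E1", "E2", "E3", "E4", "E5"]
variants = isoforms(exons, optional={"E2", "E3", "E4"})

for variant in variants:
    print("-".join(variant))
print(len(variants))
```

With three optional exons there are 2 x 2 x 2 = 8 possible transcripts from one gene. By comparison, 100,000 proteins from about 25,000 genes works out to an average of only four products per gene.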
Intron Excision, Exon Splicing
How are the introns excised and the exons spliced together?
Several models for intron excision/exon splicing have been proposed.
Self-splicing Introns
- Discovered in the ciliate Tetrahymena, group I introns are removed via this mechanism.
- A ribozyme is an RNA molecule that exhibits enzymatic activity.
- A hammerhead ribozyme (a.k.a. hammerhead RNA), is an RNA that can cleave itself via a small, conserved structural motif called a hammerhead (because of its shape).

- Introns are enzymatically snipped out by the ribozyme RNA itself.
- Hydrolysis of guanosine triphosphate (GTP) molecules provides energy to drive the reactions.
- Ribozymes thus catalyze removal of their own introns via transesterification reactions, breaking the phosphodiester backbone in an unusual way to create
(a) a 2'-3' cyclic phosphodiester bond (causing the backbone to form a ring at its broken end)
(b) a hydroxyl group at the 5' end; this highly unstable and reactive group reacts to create a terminus that is stable and unreactive.
- This configuration develops at both ends of the intron, preventing it from
re-attaching to the exon.
- Ribozymes can both hydrolyze themselves and form peptide bonds. They have been found in numerous different systems across organisms.
- Could these properties be a window to our ancient past? The RNA world theory holds that because RNA is the only macromolecule known to both encode genetic information and act as an enzyme, it was likely the first genetic material found in living organisms. DNA came later.
Self-splicing mitochondrial and chloroplast introns (group II introns)
- introns fold into secondary structure to form stem loops
- RNA autocatalyzes the breakage points, snipping out the introns
- These may be evolutionary remnants of snRNAs (small nuclear RNAs) involved in the splicing of nuclear precursor mRNA in the spliceosome

A spliceosome is a protein-RNA complex that mediates excision of introns and splicing of exons.
The spliceosome apparatus consists of small nuclear RNAs (snRNAs) complexed with protein to form small nuclear ribonucleoproteins (snRNPs), affectionately known as "snurps".
Five different snurps take part in this process, and they comprise the spliceosome:
U1, U2, U4, U5 and U6
The RNA portion of each of these ranges from 100-215 nucleotides in length.
The RNA portion of the snurp may be complementary to the intron, the exon or
one of the other snurps, depending upon its function.
Sequence of events in spliceosome process:
1. U1 binds at the 5' end of the intron to be removed.
2. U2 binds at a central location of the intron, known as the branch
point.
3. Meanwhile, U4, U5 and U6 bind together to form a complex.
4. U4/5/6 complex approaches the intron; as it draws near, U4 releases from
the complex; the U6 portion of the complex is now unstable and reactive.
4A. U5/6 complex replaces U1 at the 5' end of the intron.
5. U6 has a high affinity for U2; it binds to U2, creating a "loop" out of
the intron. The two ends of the adjacent exons are now close together, though
still separated by the intron.
6. U5 catalyzes the linkage of the two exon ends, which have been brought into close proximity by the binding of U6 and U2.
7. U5 and RNA may both participate in intron removal; U5 does the actual
catalysis of the final phosphodiester bond between the exons.
So cool.
The spliceosome is one type of control mechanism in intron/exon splicing,
allowing greater accuracy of splicing.
Conserved sequences in the genes being spliced are critical to the
operation of the spliceosome. Three conserved sequences (one at each end of the intron, and one
in the middle) are highly conserved across species, probably for
functional reasons.
Since spliceosomes may physically vary, different introns may be removed
from the same transcript, depending on the "needs" of the cell at the given
time.
Result: a huge variety of products potentially coded by the same gene.
(alternative splicing; exon shuffling)
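The net result of splicing, whatever machinery carries it out, is that intron stretches are removed and the flanking exons are joined. The sketch below shows that outcome on a string, and shows how removing a different set of segments from the same transcript yields a different product. Everything here is hypothetical: the sequences, the use of lowercase letters to mark introns for readability, and the function `splice`, which takes intron positions as given rather than recognizing conserved sequences the way the spliceosome does.

```python
def splice(pre_mrna, introns):
    """Remove each (start, end) half-open intron interval and join what is left."""
    pieces = []
    previous_end = 0
    for start, end in sorted(introns):
        pieces.append(pre_mrna[previous_end:start])
        previous_end = end
    pieces.append(pre_mrna[previous_end:])
    return "".join(pieces)


# Hypothetical transcript: three exons (uppercase) and two introns (lowercase).
exon1, intron1, exon2, intron2, exon3 = "AUGGCC", "guaaguuuuag", "GAUUAC", "guccucag", "CCGUGA"
pre_mrna = exon1 + intron1 + exon2 + intron2 + exon3

first = (len(exon1), len(exon1) + len(intron1))
second = (first[1] + len(exon2), first[1] + len(exon2) + len(intron2))

# Remove both introns, keeping all three exons.
full_length = splice(pre_mrna, [first, second])
# Remove everything from the start of intron 1 to the end of intron 2, skipping exon 2.
exon2_skipped = splice(pre_mrna, [(first[0], second[1])])

print(full_length)
print(exon2_skipped)
```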
Evolutionary Origin of Introns and Exons
Oddities of Eukaryotic RNA: Editing
In some cases, DNA sequence does not predict the protein product's
amino acid sequence, even when introns/exons are
considered.
It is now known that--at least in eukaryotes--RNA is sometimes "edited" at the nitrogenous base level, with chemical reactions occurring to change one nucleotide to another.
RNA editing can occur via:
- substitution editing
- chemical alteration of a base in situ
- mostly seen in mtRNA, cpRNA and some nucRNA
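A single substitution edit can change what a transcript encodes, which is why the DNA sequence alone may not predict the protein. The sketch below changes one base in a short, made-up RNA and splits the result into codons. The sequence, the edited position, and the function names are hypothetical, and the C-to-U change is used only as one example of a base alteration. The three stop codons listed are the standard ones.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def substitute_base(rna, position, new_base):
    """Return a copy of `rna` with the base at `position` replaced."""
    return rna[:position] + new_base + rna[position + 1:]


def codons(rna):
    """Split an RNA string into complete three-base codons."""
    usable = len(rna) - len(rna) % 3
    return [rna[i:i + 3] for i in range(0, usable, 3)]


original = "AUGCAAGGU"                       # hypothetical transcript
edited = substitute_base(original, 3, "U")   # C -> U at index 3

print(codons(original))
print(codons(edited))
print([codon in STOP_CODONS for codon in codons(edited)])
```

In this toy case the edit turns the second codon from CAA into UAA, a stop codon, so the edited transcript would be translated into a shorter product than the unedited one.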
The significance of this phenomenon can most notably be seen in Trypanosoma, the flagellate parasite
responsible for African Sleeping Sickness.
- mini- and maxi-circles of DNA are present in the kinetoplasts (modified
mitochondria) of Trypanosoma.
- These evidently code for a fifth type of
RNA, known as "guide RNA" (gRNA).
- gRNA seems to "guide" the RNA editing
process.
- Poly Uracil editing may explain why Trypanosoma is so readily able to constantly change its surface proteins, thwarting the host's immune system.
But could gRNA be the Trypanosoma's Achilles' heel? Eukaryotic hosts of the parasite lack gRNA.