Font size: Increase font size Decrease font size Switch style sheet from default to highcontrast and back

A short Tutorial on RNA Bioinformatics
The ViennaRNA Package and related Programs

Ivo Hofacker

RNA Web Services

This tutorial aims to give a basic introduction to using the command line programs in the Vienna RNA package in a UNIX-like (LINUX) environment. Of course, some of you may ask “Why are no friendly graphical user interfaces?”. Well, there are some, especially in the form of web services.

If a few simple structure predictions is all you want to do, there are several useful sites for doing RNA structure analysis available on the web. Indeed many of the tasks described below can be performed using various web servers.

Web servers are also a good starting point for novice users since they provide a more intuitive interface. Moreover, the Vienna RNA server will return the equivalent command line invocation for each request, making the transition from web services to locally installed software easier.

On the other hand, web servers are not ideal for analyzing many or very long sequences, and usually they offer only few often-used tasks. Much the same is true for point-and-click graphical programs. Command line tools, on the other hand, are ideally suited for automating repetitive tasks. They can even be combined in pipes to process the results of one program with another, or they can be used in parallel, running tens or hundreds of tasks simultaneously on a cluster of PCs.

You can try some of these web services in parallel to the exercises below.

Prerequisites to get started

Typographical Conventions

  • Constant width font is used for program names, variable names and other literal text like input and output in the terminal window.
  • Lines starting with a $ within a literal text block are commands. You should type the text following the $ into your terminal window finishing by hitting the return-key. (The $ signifies the command line prompt, which may look different on your system).
  • All other lines within a literal text block are the output from the command you just typed.

Data Files

Data files containing the sequences used in the examples below are available at http://www.tbi.univie.ac.at/~ivo/RNA/tutorial/Data/

Terminal, Command line and Editor

  • You can get a terminal by moving your mouse-pointer to an empty spot on you desktop, clicking the right mouse-button on choosing “Open Terminal” from the pull-down menu.
  • You can run commands in the terminal by typing them next to the command line prompt (usually something like $) followed by hitting the return-button.
          $ date
          Mon Mar 13 13:40:00 CET 2006
  • To get more information about a command type man followed by the command-name and hitting of the return-button.
          $ man date

|’ ties stdout to stdin

<’ redirects stdout to stdin

>’ redirects stdout to a file.

  • cd change the working directory (initially your “HOME”)
  • ls list files in the current (or a specified) directory
  • less Show FILE(s) one page at a time
  • echo Echo the STRING(s) to standard output
    $ ls > file_list
    $ less file_list
    $ rm file_list
    $ ls | less

The wc command prints the number of newlines, words, and bytes in files. Use cd ../ to change to the parent directory and cd FOO to change to the directory FOO. With ls -F you can get a directory listing. The command pwd displays the path to the current working directory. rm FOO.txt to remove file FOO.txt.

Use the mouse and the pull-down menu!!

  • To edit the file FOO.seq run
      $ emacs FOO.seq
  • type a short “random” DNA sequence into Emacs e.g. ATGAAGATGA”
  • save the file via the menu File->Save
  • replace all T by U to get a RNA sequence via the menu Edit->Replace->Replace String
    (space to leave unchanged; ! to replace all occurrences without asking, ...)

Installing Software from Source

Many bioinformatics programs are available only as source code that has to be compiled and installed. We’ll demonstrate the standard way to install programs from source using the Vienna RNA Package.

  1. Make a directory with your name and change to it
          $ mkdir ncRNA
          $ cd ncRNA
  2. Download the Vienna RNA package from http://www.tbi.univie.ac.at/RNA/ and save it to the newly created directory.
  3. Unpack the gzipped tar archive by running
          $ tar -zxf ViennaRNA-1.6.2.tar.gz
  4. list the content of the directory
          $ ls -F
          ViennaRNA-1.6.2/  ViennaRNA-1.6.2.tar.gz

Build the Vienna RNA Package

  1. To configure and build the package just run
          $ cd ViennaRNA-1.6.2
          $ ./configure
          $ make all
  2. To install the Vienna RNA package system wide (only for people with superuser privileges) run
          $ make install

The installation location can be controlled through options to the configure script. E.g. to change the default installation location to the directory FOO in your home directory use

  $ ./configure --prefix=$HOME/FOO

Have a look at the file INSTALL, distributed with the Vienna RNA Package, for more detail or read documentation on the web. Wherever you installed the main programs of the Vienna RNA Package, make sure the path to the executables shows up in your PATH environment variable. Similarly, the MANPATH environment variable contains the list of directories searched for man pages (online help). To check the contents of e.g. the PATH environment variable run

  $ echo $PATH

What’s in the Vienna RNA Package

The core of the Vienna RNA Package is formed by a collection of routines for the prediction and comparison of RNA secondary structures. These routines can be accessed through stand-alone programs, such as RNAfold, RNAdistance etc., which should be sufficient for most users. For those who wish to develop their own programs a library which can be linked to your own code is provided.
  • make a directory listing
      $ ls -F
aclocal.m4      configure*         lib/         Readseq/
AUTHORS         configure.in       Makefile     RNAforester/
ChangeLog       COPYING            Makefile.am  stamp-h1
Cluster/        default.par        Makefile.in  THANKS
config.guess*   depcomp*           man/         Utils/
config.h        H/                 missing*     vienna13.par
config.h.in     INSTALL            NEWS         ViennaRNA.spec
config.log      INSTALL.configure  Perl/        ViennaRNA.spec.in
config.status*  install-sh*        Progs/
config.sub*     Kinfold/           README
RNAalifold Calculate structures for a set of aligned RNAs
RNAcofold Calculate structures of two RNAs with dimerization
RNAdistance Calculate distances between RNAs
RNAduplex Compute hybridization structure of two RNA strands
RNAeval Calculate energy of RNA sequences on given structure
RNAfold Calculate structures of RNA sequences
RNAheat Calculate specific heat of RNAs
RNAinverse Find RNA sequences with given structure
RNALfold Calculate locally stable structures of RNAs
RNApaln
RNApdist Calculate distances between RNA structures ensembles
RNAplfold Calculate pair probabilities for locally stable structures
RNAplot Draw RNA Secondary Structures
RNAsubopt Calculate suboptimal structures of RNAs
b2ct converts bracket notation to Zukers mfold ’.ct’ file format
b2mt.pl converts bracket notation to x y values
cmount.pl generates colored mountain plot
coloraln.pl colorize an alirna.ps file
colorrna.pl colorize a secondary structure with reliability annotation
ct2b.pl converts Zukers mfold ’.ct’ file format to bracket notation
dpzoom.pl extract a portion of a dot plot
Fold converts sequence files from databanks using readseq
mountain.plgenerates mountain plot
popt extract Zuker’s p-optimal folds from subopt output
relplot.pl add reliability information to a RNA secondary structure plot
rotate_ss.pl rotate the coordinates of an RNA secondary structure plot

All programs in the Vienna RNA Package are documented in “man pages” that can be called up using the man command, e.g.

  $ man RNAalifold

Most Perl scripts carry embedded documentation that is displayed by typing e.g.

  $ perldoc coloraln.pl

All scripts and programs give a short usage message when called with the -h command line option (e.g. RNAalifold -h).

The Input File Format

RNA sequences come in a variety of formats. The sequence format used throughout the Vienna RNA package is very simple. An sequence file contains one or more sequences. Each sequence must appear as a single line in the file without embedded white spaces. A sequence may be preceded by a special line starting with the ‘>’ character followed by a sequence name. This name will be used by the programs in the Vienna RNA Package as basename for the PostScript output files for this sequence. Note that this is almost the fasta sequence format, except that no line-breaks are allowed within a sequence while the header line is optional.

Structure Prediction on single Sequences

The Program RNAfold

Our first task will be to do a simple structure prediction using RNAfold. This should get you familiar with the input and output format, as well as the graphical output produced.

RNAfold reads RNA sequences from stdin, calculates their minimum free energy (MFE) structure and prints to stdout the MFE structure in bracket notation and its free energy. If the -p option was given it also computes the partition function (Z) and base pairing probability matrix, and prints the free energy of the thermodynamic ensemble, the frequency of the MFE structure in the ensemble, and the ensemble diversity to stdout.

  1. Use a text editor to prepare an input file as shown below an save it under the name test.seq
    > test
    CCGCACAGCGGGCAGUGCCC
  2. Compute the best (MFE) structure for this sequence
   $ RNAfold < test.seq
   > test
   CCGCACAGCGGGCAGUGCCC
   ((((...))))......... ( -4.70)
   $ gv test_ss.ps &

The last line of the text output contains the predicted MFE structure as bracket notation and its free energy in kcal/mol. A dot in the bracket notation represents an unpaired position, while a base pair (i,j) is represented by a pair of matching parentheses at position i and j.

Compare the bracket notation to the PostScript1 drawing shown in the file test_ss.ps.

The calculation above, does not tell us whether the predicted structure is the only possibility. Let’s look at the equilibrium ensemble instead.

  1. Run RNAfold -p to compute the partition function and pair probabilities
  2. Look at the generated PostScript files test_ss.ps and test_dp.ps
   $ RNAfold -p < test.seq
   > test
   CCGCACAGCGGGCAGUGCCC
   ((((...))))......... ( -4.70)
   ..((...},||,,...)),, [ -5.72]
   freq. of mfe in ens. 0.190676; ens. diversity 2.84
   $ gv test_dp.ps &

The last two lines are new compared to the text output without the -p option and are a rough measure for the well-definedness of the MFE structure. The line before the last line shows a condensed representation of the pair probabilities, similar to the bracket notation, followed by the ensemble free energy in kcal/mol. The structure string contains additional symbols coding for the pairing tendency of that position.

Note that the MFE structure is adopted only with 19% probability, also the diversity is high for such a short sequence.

pict pict

The “dot plot” in test_dp.ps shows the pair probabilities within the equilibrium ensemble as n × n matrix, and is an excellent way to visualize structural alternatives. A square at row i and column j indicates a base pair. The area of a square in the upper right half of the matrix is proportional to the probability of the base pair (i,j) within the equilibrium ensemble. The lower left half shows all pairs belonging to the MFE structure. While the MFE consists of a single helix, 3 different helices are clearly visible in the pair probabilities.

Next, let’s use the relplot utility to visualize which parts of a predicted MFE are well-defined and thus more reliable.
Also let’s use a real example for a change, and produce yet another representation of the predicted structure, the mountain plot.

fold the 5S rRNA sequence and visualize the structure

   $ RNAfold -p < 5S.seq
   $ mountain.pl 5S_dp.ps | xmgrace -pipe
   $ relplot.pl 5S_ss.ps 5S_dp.ps > 5S_rss.ps

pict pict

A mountain plot is especially useful for long sequences, where conventional structure drawings become terribly cluttered. It is a xy-diagram plotting the number of base pairs enclosing a sequence position versus the position. The Perl script mountain.pl transforms a dot plot into the mountain plot coordinates, which can be visualized with any xy-plotting program, e.g. xmgrace.

The resulting plot shows three curves, two mountain plots derived from the MFE structure (red) and the pairing probabilities (black) and a positional entropy curve (green). Well-defined regions are identified by low entropy. By superimposing several mountain plots structures can easily be compared.

The Perl script relplot.pl adds reliability information to a RNA secondary structure plot in the form of color annotation. The script computes a well-definedness measure we call “positional entropy” (S(i) = pij log(pij) for those who want to know the details) and encodes it as color hue, ranging from red (low entropy, well-defined) via green to blue and violet (high entropy, ill-defined). In the example above, two helices of the 5S RNA are well-defined (red) and indeed predicted correctly, the left arm is not quite correct and disordered.

For the figure above, we had to rotate and mirror the structure plot, e.g.

     $ rotate_ss.pl -a 180 -m 5S_rss.ps > 5S_rot.ps

You can manually add annotation to structure drawings using the RNAplot program. Here’s a somewhat complicated example:

  $ RNAfold < 5S.seq > 5S.fold
  $ RNAplot --pre "76 107 82 102 GREEN BFmark 44 49 0.8 0.8 0.8 Fomark \
     1 15 8 RED omark 80 cmark 80 -0.23 -1.2 (pos80) Label" < 5S.fold
  $ gv 5S_ss.ps

To see what exactly the alternative structures of our sequence are, we need to predict suboptimal structures.

The Program RNASubopt

RNAsubopt calculates all suboptimal secondary structures within a given energy range above the MFE structure. Be careful, the number of structures returned grows exponentially with both sequence length and energy range.
  • Generate all suboptimal structures within a certain energy range from the MFE.
   $ RNAsubopt -e 1 -s < test.seq
   > test [100]
   CCGCACAGCGGGCAGUGCCC   -470    100
   ((((...)))).........  -4.70
   ..((((........))))..  -4.70
   .........((((...))))  -4.60
   ..((((.(....).))))..  -4.50
   ..((...))((((...))))  -4.30
   (((.....))).........  -3.80
   .........(((.....)))  -3.70

The text output shows an energy sorted list (option -s) of all secondary structures within 1 kcal/mol of the MFE structure. Our sequence actually has two ground state structures, and a third one only 0.1 kcal/mol higher. MFE folding alone gives no indication that there are actually 3 plausible structures.

Note that the number of suboptimal structures grows exponentially with sequence length and therefore this approach is only tractable for sequences with less than 100nt. To keep the number of suboptimal structures manageable the option -noLP can be used, forcing RNAsubopt to produce only structures without isolated base pairs. While RNAsubopt produces all structures within an energy range, mfold produces only few, hopefully representative, structures. If you have extra time, try folding the sequence on the mfold server at http://www.bioinfo.rpi.edu/~zukerm/rna/.

Sometimes you want to get information about unusual properties of the Boltzmann ensemble for which no specialized program exists. For example you want to know for a bacterial mRNA the fraction of structures in the Boltzmann ensemble for which the Shine-Dalgarno (SD) sequence is unpaired. If the SD sequence is concealed by secondary structure the translation efficiency is reduced.

In such cases you can resort to drawing a representative sample of structures from the Boltzmann ensemble by using the option -p. Now you can simply count how many structures in the sample possess the feature you are looking for. This number divided by the size of your sample gives you the desired fraction.

The following example calculates the fraction of structures in the ensemble that have bases 6 to 8 unpaired.

  1. Draw a sample of size 100000 from the Boltzmann ensemble
  2. Calculate the desired property
   $ RNAsubopt -p 100000 < test.seq > tt
   $ perl -nle '$h++ if substr($_,5,3) eq "...";
     END {print $h/$.}' tt
   0.2

A far better way to calculate this property is to use RNAfold -p to get the ensemble free energy, which is related to the partition function via F = RT ln(Q), for the unconstrained (Fu) and the constrained case (Fc), where the three bases are nor allowed to form base pairs (use option -C), and evaluate pc = exp((Fu Fc)RT) to get the desired probability.

RNA folding kinetics

RNA folding kinetics studies the dynamical process of how a RNA molecule approach to its unique folded biological active conformation (often referred to as the native state) starting from an initial ensemble of disordered conformations e.g. the unfolded open chain. The key for resolving the dynamical behavior of a folding RNA chain lies in the understanding of the ways in which the molecule explores its astronomically large free energy landscape, a rugged and complex hyper-surface established by all the feasible base pairing patterns a RNA sequence can form. The challenge is to understand how the interplay of formation and break up of base pairing interactions along the RNA chain can lead to an efficient search in the energy landscape which reaches the native state of the molecule on a biologically meaningful time scales.

The following assumes you have the barriers and treekin programs installed. If not, the current release can be found at http://www.tbi.univie.ac.at/~ivo/RNA/Barriers/. Installation proceeds as shown for the Vienna RNA package in section 2.4.

 $ echo UCCACGGCUGUUAGUGGAUAACGGC > FOO
 $ RNAsubopt -noLP -s -e 20 < FOO > FOO.sub
 $ barriers -G RNA-noLP --bsize --rates < FOO.sub > FOO.bar

(You can restrict the number of local minima using the barriers command-line option --max followed by a number). Look at the barriers file FOO.bar (e.g. less -S Foo.bar). Use the arrow keys to navigate.

  UCCACGGCUGUUAGUGGAUAACGGC
1 ......(((((((.....)))))))  -6.80    0  19.00    etc.
2 (((((........))))).......  -6.70    1   8.20    etc.
3 ....((..((((....)))).))..  -1.50    2   2.70    etc.
4 ....((.((......))....))..  -0.50    2   0.90    etc.
etc.

The first row holds the input sequence, the successive list the local minima ascending in energy. The meaning of the first 5 columns is as follows

  1. label (number) of the local minima (1=MFE)
  2. structure of the minimum
  3. free energy
  4. label of deeper local minimum I merge with (note that the MFE has no deeper local minimum to merge with)
  5. height of the energy barrier to the local minimum I merge with

pict

barriers produced two additional files, the PostScript file tree.ps which represents the basic information of the FOO.bar file visually (look at the file e.g. gv tree.ps) and a text file rates.out which holds the matrix of transition probability between the local minima.

 $ treekin -m I --p0 5=1 < FOO.bar | xmgrace -log x -nxy -

The program treekin is used to simulate the time evolution of the population densities of local minima starting from an initial population density distribution p0 (given on the command-line) and the transition rate matrix in the file rates.out.

pict

The simulation starts with all the population density in the open chain (local minimum 5). Over time the population density of this state decays (yellow curve) and other local minima get populated. The simulation ends with the population densities of the thermodynamic equilibrium in which the MFE (black curve) and local minimum 2 (red curve) are nearly equally populated. (Look at the dot plot of the sequence!!!)

pict

Sequence Design

The Program RNAinverse

RNAinverse searches for sequences folding into a predefined structure, thereby inverting the folding algorithm. Input consists of the target structures (in bracket notation) and (optionally) starting sequences.

Lower case characters in the start sequence indicate fixed positions, i.e. they can be used to add sequence constraints. ’N’s in the starting sequence will be replaced by a random nucleotide. For each search the best sequence found and its Hamming distance to the start sequence are printed to stdout. If the the search was unsuccessful, a structure distance to the target is appended.

By default the program stops as soon as it finds a sequence that has the target as MFE structure. The option -Fp switches RNAinverse to the partition function mode where the probability of the target structure exp(E(S)RT)Q is maximized. This tends to produce sequences with much more well-defined structure. This probability is written in brackets after the found sequence and Hamming distance. With the option -R you can specify how often the search should be repeated.

  1. Prepare an input file inv.in containing the target structure and sequence constraints
    (((.(((....))).)))
    NNNgNNNNNNNNNNaNNN
  2. Design sequences using RNAinverse
$ RNAinverse < inv.in
GCUgAGUAAUCACUaGGC    4

$ RNAinverse -R 5 -Fp < inv.in
GCCgCCCGAAAGGGaGGC   12  (0.988554)
GCCgCGCCAUGGCGaGGC   13  (0.985429)
GCCgCCCGAAAGGGaGGC   11  (0.988554)
GCCgCGCCAUGGCGaGGC   10  (0.985429)
GCCgCCCGAAAGGGaGGC   13  (0.988554)

Another useful program for inverse folding is RNA designer, see http://www.rnasoft.ca/index.html. The switch.pl program can be used to design bi-stable structures, i.e. structures with two almost equally good foldings.

RNA-RNA Interactions

The Program RNAcofold

RNAcofold works much like RNAfold, but allows to specify two RNA sequences which are then allowed to form a dimer structure. In the input the two RNA sequences should be concatenated using the ‘&’ character as separator. As in RNAfold, the -p option can be used to compute partition function (Z) and base pairing probabilities.

Since dimer formation is concentration dependent, RNAcofold can be used to compute equilibrium concentrations for all five monomer and (homo/hetero)-dimer species, given input concentrations for the monomers (see the man page for details).

  1. Prepare a sequence file for input that looks like this
    >t
    GCGCUUCGCCGCGCGCA&GCGCUUCGCCGCGCGCA
  2. Compute the MFE and the ensemble properties
  3. Look at the generated PostScript files t_ss.ps and t_dp.ps
 $ RNAcofold -p < t.seq
 > t
 GCGCUUCGCCGCGCGCA&GCGCUUCGCCGCGCGCA
 GCGCUUCGCCGCGCGCA&GCGCUUCGCCGCGCGCA
 ((((..((..((((...&))))..))..))))... (-17.70)
 ((((..,({.({((((.&))))..,)}.)})))). [-18.90]
 frequency of mfe structure in ensemble 0.143277 , delta G binding= -4.39

pict pict The cross in the dot plot marks the chain break between the two concatenated sequences.

Concentration Dependency

Cofolding is an intermolecular process, therefore whether duplex formation will actually occur is concentration dependent. Trivially, if one of the molecules is not present, no dimers are going to be formed. The partition functions of the molecules give us the equilibrium constants:
KAB = [AB] [A][B] = ZAB ZAZB

with these and mass conservation, the equilibrium concentration of homo- and hetero-dimers and monomers can be computed in dependence of the start concentrations of the two molecules. This is most easily done by creating a file with the initial concentrations of molecules A and B in two columns:

[a1]([moll]) [b1]([moll]) [a2]([moll]) [b2]([moll]) [an]([moll]) [bn]([moll])
  1. Prepare a concentration file for input
  2. Compute the MFE, the ensemble properties and the concentration dependency of binding.
  3. Look at the generated output
$ echo "1e-04 1e-9\n0.06 0.8\n 0.002 2" > concfile
$ RNAcofold -f concfile < t.seq
...
Free Energies:
AB              AA              BB              A               B
-18.469802      -18.469802      -18.469802      -7.251093       -7.251093
Kaa..624.893 624.893 624.893
Initial concentrations          relative Equilibrium concentrations
A                B               AB              AA              BB              A               B
0.0001          1e-09           0.00000         0.05050         0.00000         0.89898         0.00001
0.06            0.8             0.05302         0.00662         0.42455         0.00351         0.02811
0.002           2               0.00096         0.00000         0.48914         0.00004         0.01977

Where the 5 different free energies as well as the equilibrium constants KAA, KBB, and KAB are put out first. A list of all the equilibrium concentration follows, where the first two columns denote the initial (absolute) concentrations of molecules A and B, respectively. The next five columns denote the equilibrium concentrations of dimers and monomers, relative to the total particle number. (Hence, the concentrations don’t add up to one, except in the case where no dimers are built – if you want to know the fraction of particles in a dimer, you have to take the relative dimer concentrations times 2).
Since there are 7 columns in this output, it is not trivial to visualize the results. We usually try to keep the initial concentration of one molecule constant to avoid having to create 3D plots. As an example we show the following plot of t.seq where, because there is only one kind of molecule, one of the initial concentrations was set to 0:

pict

ΔGbinding = 5.01 kcal/mol

  sequences:GCGCUUCGCCGCGCGCG&GCGCUUCGCCGCGCGCG

In this example, at a concentration of about 0.5 mmol 50% of the molecule is still in monomer form.

Finding potential binding sites with RNAduplex

A common problem is the prediction of binding sites between two RNAs, as in the case of miRNA-mRNA interactions. If the sequences are very long (many kb) RNAcofold is too slow to be useful. The RNAduplex program is a fast alternative, that works by predicting only inter-molecular base pairs. It’s almost as fast as simple sequence alignment, but much more accurate than a BLAST search.

The example below searches the 3’ UTR of an mRNA for a miRNA binding site.

The file duplex.seq contains the 3’UTR of NM_024615 and the microRNA mir-145.

 $ RNAduplex < duplex.seq
 >NM_024615
 >hsa-miR-145
 .(((((.(((...((((((((((.&)))))))))))))))))). 34,57:1,19 (-21.8)

Most favorable binding has an interaction energy of 21.8 kcal/mol and pairs up pos. 34-57 of the UTR with pos. 1-19 of the miRNA.

RNAduplex can also produce alternative binding sites, e.g. running RNAduplex -e 5 would list all binding sites within 5kcal/mol of the best one.

Since RNAduplex forms only inter-molecular pairs, it neglects the competition between intramolecular folding and hybridization. Thus, it is recommended to use RNAduplex as a pre-filter and analyse good RNAduplex hits additionally with RNAcofold or RNAup. On the example above, running RNAup will yield:

 $ RNAup -b < duplex.seq
 >NM_024615dG = dGint + dGu_l + dGu_s
.((((.(((...(((((((((&)))))))))))))))).  35,55  :   2,18  (-7.61 = -21.38 +
11.32 + 2.44)
RNAup output in file: NM_024615_hsa-miR-145_w25_u4_up.out

The free energy of the duplex is -21.38 kcal/mol almost the same as computed by RNAduplex (differences stem from the fact that RNAup computes partition functions rather than optimal structures). However, the total free energy of binding is much less favorable (-7.61 kcal/mol), since it includes the energetic penalty for opening the binding site on the mRNA (11.32 kcal/mol) and miRNA (2.44 kcal/mol).

You can also run RNAcofold on the example to see the complete structure after hybridization (neither RNAduplex nor RNAup produce structure drawings). Note however, that the input format for RNAcofold is different, an input file suitable for RNAcofold is given in duplex2.seq.

As a more difficult example let’s look at the interaction of the bacterial smallRNA RybB and its target mRNA ompN. First we’ll try predicting the binding site using RNAduplex:

$ RNAduplex < RybB.seq
>RybB Salmonella typhimurium LT2
>ompN
.(((.((((..((((((.(((....(((((((((.(((((.((.((..((....((((..((((((..(((..((((((&.))))))..)))..)).)))).....))))....)).)).)).))).))........)))).)))))..))).)))))).))))......))).   1,79  :  80,173 (-40.30)

Note, that the predicted structure spans the full length of the RybB small RNA. Compare the predicted interaction to the structures predicted for RybB and ompN alone, and ask yourself whether the predicted interaction is indeed plausible.

pict pict

GCCAC-----TGCTTTTCTTTGATGTCCCCATTTT-GTGGA-------GC-CCATCAACCCCGCCATTTCGGTT---CAAG-GTTGGTGGGTTTTTT
 |||      ||||  |||||| |||    ||||| ||||        || ||| || ||  ||    ||||     |||| ||  |||  |||||| -40.30
AGGTCAAACAACGGC-AGAAACAATATT--TAAAGTCGCCGCACACGACGCGGTCGTCGGT-CGTCTCGGCCCTACTGTTCACGGTTATGAAAAGAAACC-3'

Compare the RNAduplex prediction with the interaction predicted by RNAcofold and RNAup.

pict

Consensus Structure Prediction

The Program RNAalifold

RNAalifold generalizes the folding algorithm for sequence alignments, treating the entire alignment as a single “generalized sequence”. To assign an energy to a structure on such a generalized sequence, the energy is simply averaged over all sequences in the alignment. This average energy is augmented by a covariance term, that assigns a bonus or penalty to every possible base pair (i,j) based on the sequence variation in columns i and j of the alignment.

Sequence co-variations are a direct consequence of RNA base pairing rules. RNA helices normally contain only 6 out of the 16 possible combinations: the Watson-Crick pairs GC, CG, AU, UA, and the somewhat weaker wobble pairs GU and UG. Mutations in helical regions therefore have to be correlated. In particular we often find “compensatory mutations” where a mutation on one side of the helix is compensated by a second mutation on the other side, e.g. a CG pair changes into a UA pair. Mutations where only one pairing partner changes (such as CG to UG) are termed “consistent mutations”.

Compensatory mutations are a strong indication of structural conservation, while consistent mutations provide a weaker signal. The covariance term used by RNAalifold therefore assigns a bonus of 1 kcal/mol to each consistent and 2 kcal/mol for each compensatory mutation. Sequences that cannot form a standard base pair incur a penalty of 1 kcal/mol. Thus, for every possible consensus pair between two columns i and j of the alignment a covariance score Cij is computed by counting the fraction of sequence pairs exhibiting consistent and compensatory mutations, as well as the fraction of sequences that are inconsistent with the pair. The weight of the covariance term relative to the normal energy function, as well as the penalty for inconsistent mutations can be changed via command line parameters.

Apart from the covariance term, the folding algorithm in RNAalifold is essentially the same as for single sequence folding. In particular, folding an alignment containing just one sequence will give the same result as single sequence folding using RNAfold. For N sequences of length n the required CPU time scales as 𝒪(N n2 + n3) while memory requirements grow as the square of the sequence length. Thus RNAalifold is in general faster than folding each sequence individually. The main advantage, however, is that the accuracy of consensus structure predictions is generally much higher than for single sequence folding, where typically only between 40% and 70% of the base pairs are predicted correctly.

Apart from prediction of MFE structures RNAalifold also implements an algorithm to compute the partition function over all possible (consensus) structures and the thermodynamic equilibrium probability for each possible pair. These base pairing probabilities are useful to see structural alternatives, and to distinguish well defined regions, where the predicted structure is most likely correct, from ambiguous regions.

As a first example we’ll produce a consensus structure prediction for the following four tRNA sequences.

$  cat four.seq
>M10740 Yeast-PHE
GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA
>K00349 Drosophila-PHE
GCCGAAAUAGCUCAGUUGGGAGAGCGUUAGACUGAAGAUCUAAAGGUCCCCGGUUCAAUCCCGGGUUUCGGCA
>K00283 Halobacterium volcanii Lys-1
GGGCCGGUAGCUCAUUUAGGCAGAGCGUCUGACUCUUAAUCAGACGGUCGCGUGUUCGAAUCGCGUCCGGCCCA
>AF346993
CAGAGUGUAGCUUAACACAAAGCACCCAACUUACACUUAGGAGAUUUCAACUUAACUUGACCGCUCUGA

RNAalifold uses aligned sequences as input. Thus, our first step will be to align the sequences. We use clustalw in this example, since it’s one of the most widely used alignment programs and has been shown to work well on structural RNAs Other alignment programs can be used (including programs that attempt to do structural alignment of RNAs), but the resulting multiple sequence alignment must be in Clustal format.

  1. Prepare a sequence file (use file four.seq)
  2. Align the sequences
  3. Compute the consensus structure from the alignment
  4. Inspect the output files alifold.out,alirna.ps, alidot.ps
  5. for comparison fold the sequences individually using RNAfold
$ clustalw four.seq > four.out
$ RNAalifold -p four.aln
$ RNAfold -p < four.seq
__GCCGAU_UAGCUCAGUUGGG_AGAGCGCCAGACUGAAAAUCAGAAGGUCCCGUGUUCAAUCCACGG_UCCGGCA__
..(((((((..((((.........)))).(((((.......))))).....(((((.......))))))))))))...
 minimum free energy = -17.00 kcal/mol (-14.62 +  -2.38)
..(((((((,.((((.........)))).(((((.......))))).....{((((.......))))))))))))...
 free energy of ensemble = -17.95 kcal/mol
 frequency of mfe structure in ensemble 0.213582

The output contains a consensus sequence and the consensus structure in bracket notation. The consensus structure has an energy of 17 kcal/mol, which in turn consists of the average free energy of the structure 14.62 kcal/mol and the covariance term 2.38 kcal/mol. The strongly negative covariance term shows that there must be a fair number of consistent and compensatory mutations, but in contrast to the average free energy it’s not meaningful in the biophysical sense.

Compare the predicted consensus structure with the structures predicted for the individual sequences using RNAfold. How often is the correct “clover-leaf” shape predicted?

A structure annotated alignment or color annotated structure drawing can be produced using the coloraln.pl and colorrna.pl commands. Alternatively, these can be generated by RNAalifold by using the -aln and -color options.

  $ RNAalifold -color -aln four.aln
  $ gv aln.ps &
  $ gv alirna.ps &
4 sequence; length of alignment 78
alifold output
   6    72  0  99.9%   0.006 GC:2    GU:1    AU:1
   7    71  0  99.5%   0.020 GC:2    AU:2
   5    73  1  99.9%   0.002 CG:2    GC:1
  33    43  0  97.9%   0.122 GC:2    GU:1    AU:1
   4    74  1  99.8%   0.005 CG:3
  31    45  0  98.2%   0.062 CG:3    UA:1
   3    75  1  99.0%   0.030 GC:3
  30    46  0  96.9%   0.097 CG:2    UA:2

pict pict

The last output file produced by RNAalifold -p, named alifold.out, is a plain text file with detailed information on all plausible base pairs sorted by the likelihood of the pair. In the example above we see that the pair (6, 72) has no inconsistent sequences, is predicted almost with probability 1, and occurs as a GC pair in two sequences, a GU pair in one, and a AU pair in another.

RNAalifold automatically produces a drawing of the consensus structure in Postscript format and writes it to the file “alirna.ps”. In the structure graph consistent and compensatory mutations are marked by a circle around the variable base(s), i.e. pairs where one pairing partner is encircled exhibit consistent mutations, whereas pairs supported by compensatory mutations have both bases marked. Pairs that cannot be formed by some of the sequences are shown gray instead of black. In the example given, many pairs show such inconsistencies. This is because one of the sequences (AF346993) is not aligned well by clustalw.

Note, that subsequent calls to RNAalifold will overwrite any existing output alirna.ps (alidot.ps, alifold.out) files in the current directory. Be sure to rename any files you want to keep.

Structure predictions for the individual sequences

The consensus structure computed by RNAalifold will contain only pairs that can be formed by most of the sequences. The structures of the individual sequences will typically have additional base pairs that are not part of the consensus structure. Moreover, ncRNA may exhibit a highly conserved core structure while other regions are more variable. It may therefore be desirable to produce structure predictions for one particular sequence, while still using covariance information from other sequences.

This can be accomplished by first computing the consensus structure for all sequences using RNAalifold, then folding individual sequences using RNAfold -C with the consensus structure as a constraint. In constraint folding mode RNAfold -C allows only base pairs to form which are compatible with the constraint structure. This resulting structure typically contains most of the constraint (the consensus structure) plus some additional pairs that are specific for this sequence.

The refold.pl script removes gaps and maps the consensus structure to each individual sequence.

$ RNAalifold  RNaseP.aln >  RNaseP.alifold
$ gv alirna.ps
$ refold.pl RNaseP.aln RNaseP.alifold | head -3 > RNaseP.cfold
$ RNAfold -C -noLP < RNaseP.cfold > RNaseP.refold
$ gv E-coli_ss.ps

pict

Correct base pairs already present are indicated in yellow, correct pairs added by refolding in orange. Incorrect base pairs in blue (present in the consensus) or violet (added by refolding).

Comparing the consensus structure with the E. coli reference structure we find that 73 out of 91 predicted base pairs are correct, but we’re nevertheless missing 49122 40% of the base pairs in the reference structure. After refolding we have 90 correctly predicted pairs (17 more) and 20 false predictions (vs. 18 before). Note that since RNase P forms sizable pseudo-knots, a perfect prediction is impossible in this case.

If constrained folding results in a structure that is very different from the consensus, or if the energy from constrained folding is much worse than from unconstrained folding, this may indicate that the sequence in question does not really share a common structure with the rest of the alignment or is misaligned. One should then either remove or re-align that sequence and recompute the consensus structure.

Structural Alignments

Manually correcting Alignments

As the tRNA example above demonstrates, sequence alignments are often unsuitable as a basis for determining consensus structures. As a last resort, one may always try manually correcting an alignment. Sequence editors that are structure-aware may help in this task. In particular the SARSE http://http://sarse.kvl.dk/ editor, and the ralee-mode for emacs http://www.sanger.ac.uk/Users/sgj/ralee/ are useful, ralee has been pre-installed on your machines.

Try correcting the Clustal generated alignment four.aln from the example above. For this we first have to convert it to Stockholm format. Fortunately the formats are similar. Make a copy of the file add the correct header line and the consensus structure from RNAalifold:

$ cp four.aln four.stk
$ emacs four.stk
.....
$ cat four.stk
# STOCKHOLM 1.0

K00349          --GCCGAAAUAGCUCAGUUGGG-AGAGCGUUAGACUGAAGAUCUAAAGGUCCCCGGUUCAAUCCCGGGUUUCGGCA--
K00283          GGGCCG--GUAGCUCAUUUAGGCAGAGCGUCUGACUCUUAAUCAGACGGUCGCGUGUUCGAAUC--GCGUCCGGCCCA
M10740          --GCGGAUUUAGCUCAGUUGGG-AGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA--
AF346993        --CAGAGUGUAGCUUAAC---ACAAAGCACCCAACUUACACUUAGGAGAUUUCAACUUAACUUGACCGCUCUGA----
#=GC SS_cons    ..(((((((..((((.........)))).(((((.......))))).....(((((.......))))))))))))...

Now use the functions under the Edit menu to improve the alignment, the coloring by structure should help highlight misaligned positions.

Automatic structural alignments

Next, we’ll compute alignments using two structural alignment programs: LocARNA and StrAl. LocARNA is an implementation of the Sankoff algorithm for simultaneous folding and alignment (i.e. it will generate both alignment and consensus structure. StrAl uses a much simpler string alignment algorithm that takes structure only approximately into account (a bit like only distinguishing paired or un-paired, but forgetting about who pairs with whom). StrAl is thus a much faster program suitable for longer sequences than LocARNA.

Both programs can read the fasta file four.seq.

$ mlocarna-p four.seq
...
CLUSTAL --- LocARNA - Local Alignment of RNA --- Score: 2555

      (((((((..((((.........))))........................((((.......)))).))))))).
K00283_Halobacte   GGGCCGGUAGCUCAUUUAGGCAGAGCGUCUGACUCUUAAUCAGACGGUCGCGUGUUCGAAUCGCGUCCGGCCCA
K00349_Drosophil   GCCGAAAUAGCUCAGUU-GGGAGAGCGUUAGACUGAAGAUCUAAAGGUCCCCGGUUCAAUCCCGGGUUUCGGCA
M10740_Yeast_PHE   GCGGAUUUAGCUCAGUU-GGGAGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA
AF346993           CAGAGUGUAGCUUA---ACACAAAGCACCCAACUUACACUUAGGAGAU-UUCAACU-UAACUUGACCGCUCUGA
      (((((((..((((.---.....))))......................-.((((..-....)))).))))))).

$ stral four.seq
$ cat resultDIR/four.seq
> M10740
GCGGAUUUAGCUCAGUU-GGGAGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA
> K00349
GCCGAAAUAGCUCAGUU-GGGAGAGCGUUAGACUGAAGAUCUAAAGGUCCCCGGUUCAAUCCCGGGUUUCGGCA
> K00283
GGGCCGGUAGCUCAUUUAGGCAGAGCGUCUGACUCUUAAUCAGACGGUCGCGUGUUCGAAUCGCGUCCGGCCCA
> AF346993
CAGAGUGUAGCUUA-AC--ACAAAGCACCCAACUUACACUUAGGAGAUUUCAACUU--AACUUGACCGCUCUGA

Use RNAalifold to predict structures for all your alignments (Clustal, handcrafted, StrAl, and LocARNA) and compare them. The handcrafted and LocARNA alignments should be essentially perfect, in the StrAl alignment the T-arm is slipped for AF346993.

Other interesting approaches to structural alignment include CMfinder, dynalign, and stemloc.

Noncoding RNA gene prediction

Prediction of ncRNAs is still a largely unsolved problem in bioinformatics. Unlike protein coding genes, ncRNAs do not have any statistically significant features in primary sequences that could be used for reliable prediction. A large class of ncRNAs, however, depend on a defined secondary structure for their function. As a consequence, evolutionarily conserved secondary structures can be used as characteristic signal to detect ncRNAs. All currently available programs for de novo prediction make use of this principle and are therefore, by construction, limited to structured RNAs.

  • QRNA (Eddy & Rivas, 2001)
  • ddbRNA (di Bernardo, Down & Hubbard, 2003)
  • MSARi (Coventry, Kleitman & Berger, 2004)
  • AlifoldZ (Washietl & Hofacker, 2004)
  • RNAz (Washietl, Hofacker & Stadler, 2005)
  • EvoFold (Pedersen et al, 2006)

QRNA

QRNA analyzes pairwise alignments for characteristic patterns of evolution. The alignment is scored by three probabilistic models: (i) Position independent, (ii) coding, (iii) RNA. The independent and the coding model is a pair hidden Markov model. The RNA model is a pair stochastic context-free grammar. First, it calculates the prior probability that, given a model, the alignment is observed. Second, it calculates the posterior probability that, given an alignment, it has been generated by one of the three models. The posterior probabilities are compared to the position independent background model and a “winner” is found.

pict
  • Available at: ftp://ftp.genetics.wustl.edu/pub/eddy/software
  • Unzip/Untar the package and follow the instructions in the INSTALL document
  • QRNA reads pairwise alignments in MFASTA format (i.e. FASTA format with gaps)
  • Basic usage (for further option read the user guide that comes with the package):
    $ qrna -h
    $ qrna trnas.fa
    $ qrna coding.fa

AlifoldZ

AlifoldZ is based on an old hypothesis: functional RNAs are thermodynamically more stable than expected by chance. This hypothesis can be statistically tested by calculating z-scores: Calculate the MFE m of the native RNA and the mean μ and standard deviation σ of the background distribution of a large number of random (shuffled) RNAs. The normalized z-score z = (m μ)σ expresses how many standard deviations the native RNA is more stable than random sequences.

Unfortunately, most ncRNAs are not significantly more stable than the background. See for example the distribution of z-scores of some tRNAs.

pict

AlifoldZ calculates z-scores for consensus structures folded by RNAalifold. This significantly improves the detection performance compared to single sequence folding:

pict
    $ alifoldz.pl -h
    $ alifoldz.pl miRNA.aln
    $ alifoldz.pl -w 120 -x 100 < unknown.aln

RNAz

AlifoldZ has some shortcomings that limits its usefulness in practice: The z-scores are not deterministic, i.e. you get a different score each time you run AlifoldZ. To get stable z-scores you need to sample a large number of random alignments which is computationally expensive. Moreover, AlifoldZ is extremely sensitive to alignment errors.

The program RNAz overcomes these problems by using a different approach to asses a multiple sequence alignment for significant RNA structures. It is based on two key innovations: (i) The structure conservation index (SCI) to measure structural conservation in an alignment and (ii) z-scores that are calculated by regression without sampling. Both measures are combined to an overall score that is used to classify an alignment as “structured RNA” or “other”.

pict
  • The structure conservation index is an easy way to normalize an RNAalifold consensus MFE.
  • The mean μ and standard deviation σ of random samples of a given sequence are functions of the length and the base composition: μ,σ(length, GC AT, G C, A T)
  • It is therefore be possible to calculate z-scores by solving this 5 dimensional regression problem.
pict
  • A support vector machine learning algorithm is used to classify an alignment based on z-score and structure conservation index.
  • Available at: www.tbi.univie.ac.at/~wash/RNAz
  • Package includes the core program RNAz in ISO C, a set of helper programs in Perl, and an extensive manual.
  • Standard installation (requires root privileges):
    $ tar -xzf RNAz-1.0-tar.gz
    $ cd RNAz-1.0
    $ ./configure
    $ make
    $ make install
    $ cp /usr/local/share/RNAz/perl/* /usr/local/bin
  • RNAz reads one or more multiple sequence alignments in ClustalW or MAF format.
    $ RNAz --help
    $ RNAz tRNA.aln
    $ RNAz --both-strands --predict-strand tRNA.maf
  • RNAz is limited to a maximum alignment length of 400 columns and a maximum number of 6 sequences. To process larger alignments a set of Perl helper scripts are used.
  • Selecting one or more subsets of sequences from an alignment with more than 6 sequences:
    $ rnazSelectSeqs.pl miRNA.maf |RNAz
    $ rnazSelectSeqs.pl --num-seqs=4 --num-samples=3 miRNA.maf |RNAz
  • Scoring long alignments in overlapping windows:
    $ rnazWindow.pl --window=120 --slide=40 unknown.aln \
       | RNAz --both-strands

Large scale screens

The RNAz package provides a set of Perl scripts that implement a complete analysis pipeline suitable for medium to large scale screens of genomic data.

  1. Obtain or create multiple sequence alignments in MAF format
  2. Run through the RNAz pipeline:
pict
  1. Align Epstein Barr Virus genome (Acc.no: NC_007605) to two related primate viruses (Acc.nos: NC_004367, NC_006146) using multiz and run it through the RNAz pipeline.
  2. Analyze snoRNA cluster in the human genome for conserved RNA structures: download pre-computed alignments from the UCSC genome browser and run it through the RNAz pipeline
  >NC_004367:genome:1:+:149696
  AGAATTCGTCTTGCTCTATTCACCCTTACTTTTCTTCTTGCCCGTTCTCTTTCTTAGTAT
  GAATCCAGTATGCCTGCCTGTAATTGTTGCGCCCTACCTCTTTTGGCTGGCGGCTATTGC
  CGCCTCGTGTTTCACGGCCTCAGTTAGTACCGTTGTGACCGCCACCGGCTTGGCCCTCTC
  ACTTCTACTCTTGGCAGCAGTGGCCAGCTCATATGCCGCTGCACAAAGGAAACTGCTGAC
  ACCGGTGACAGTGCTTACTGCGGTTGTCACTTGTGAGTACACACGCACCATTTACAATGC
  ATGATGTTCGTGAGATTGATCTGTCTCTAACAGTTCACTTCCTCTGCTTTTCTCCTCAGT
  CTTTGCAATTTGCCTAACATGGAGGATTGAGGACCCACCTTTTAATTCTCTTCTGTTTGC
  • To get a multiple alignment a phylogenetic tree and the following three steps are necessary:
    1. Run blastz each vs. each
    2. Combine blastz results to multiple sequence alignments
    3. Project raw alignments to a reference sequence.
  • The corresponding commands:
    all_bz - "((NC_007605 NC_006146) NC_004367)" | bash
    tba "((NC_007605 NC_006146) NC_004367)" \
  *.sing.maf raw-tba.maf
    maf_project raw-tba.maf NC_007605 > final.maf
  • Note: The tree is given in NEWICK like format with blanks instead of commas. The sequence data files must be named exactly like the names in this tree and in the FASTA headers.
  • First the alignments are filtered and sliced in overlapping windows:
    $ rnazWindow.pl < final.maf > windows.maf
  • RNAz is run on these windows:
    $ RNAz --both-strands --show-gaps --cutoff=0.5 windows.maf \
    > rnaz.out
  • Overlapping hits are combined to “loci” and visualized on a web-site:
    $ rnazCluster.pl --html rnaz.out > results.dat
  • The predicted hits are compared with available annotation of the genome:
    $ rnazAnnotate.pl --bed annotation.bed results.dat \
          > results_annotated.dat
  • The results file is formatted in a HTML overview page:
    $ rnazIndex.pl --html results_annotated.dat \
      > results/index.html
  • rnazIndex.pl can be used to generate a BED formatted annotation file which can be analyzed using rnazBEDstats.pl (after sorting, for the case the input alignments were unsorted)”
    $ rnazIndex.pl --bed results.dat | \
          rnazBEDsort.pl | rnazBEDstats.pl
  • RNAzfilter.pl can be used to filter the results by different criteria. In this case it gives us all loci with P >0.9”:
    $ rnazFilter.pl "P>0.9" results.dat | \
         rnazIndex.pl --bed | \
       rnazBEDsort.pl | rnazBEDstats.pl
  • To get an estimate on the (statistical) false positives one can repeat the complete screen with randomized alignments:
    $ rnazRandomizeAln final.maf > random.maf
  • Go to the UCSC genome browser (genome.ucsc.edu) and go to “Tables”. Download “multiz17” alignments in MAF format for the region: chr11:93103000-93108000
pict
  • The Perl scripts are run in the same order as in Example 1:
    $ rnazWindow.pl --min-seqs=4 region.maf > windows.maf
    $ RNAz --both-strands --show-gaps --cutoff=0.5 windows.maf \
    > rnaz.out
    $ rnazCluster.pl --html rnaz.out > results.dat
    $ rnazAnnotate.pl --bed annotation.bed results.dat \
          > results_annotated.dat
    $ rnazIndex.pl --html results_annotated.dat \
       > results/index.html
  • The results can be exported as UCSC BED file which can be displayed in the genome browser:
    $ rnazIndex.pl --bed --ucsc results.dat > prediction.bed
  • Upload the BED file as “Custom Track”…
pict
  • …and have a look at the results:
pict
\let \prOteCt \relax \Protect \gl:nopartrue