The core of the Vienna RNA Package is formed by a collection of routines for the prediction and comparison of RNA secondary structures. These routines can be accessed through stand-alone programs, such as RNAfold, RNAdistance etc., which should be sufficient for most users. For those who wish to develop their own programs we provide a library which can be linked to your own code.
This document only describes the library and will be primarily useful to programmers. The stand-alone programs are described in separate man pages. The latest version of the package including source code and html versions of the documentation can be found at the ViennaRNA page. This manual documents version 1.3.
Please send comments and bug reports to <Ivo.Hofacker@tbi.univie.ac.at>.
The library provides a fast dynamic programming minimum free energy folding algorithm as described by Zuker & Stiegler (1981). Associated functions are:
fold()
. If fold_constrained
(see section Global Variables for the Folding Routines)
is 1, the structure string is interpreted on input as a list of
constraints for the folding. The characters " | x < > " mark bases that
are paired, unpaired, paired upstream, or downstream, respectively;
matching brackets " ( ) " denote base pairs, dots "." are used for
unconstrained bases. Constrained folding works by assigning bonus energies
to all structures compliing with the constraint.
fold()
.
initialize_fold()
.
temperature
(see section Global Variables for the Folding Routines).
Prototypes for these functions are declared in `fold.h'.
Instead of the minimum free energy structure the partition function of all possible structures and from that the pairing probability for every possible pair can be calculated, using a dynamic programming algorithm as described by McCaskill (1990). The following functions are provided:
fold_constrained
(see section Global Variables for the Folding Routines) is 1, the structure string is
interpreted on input as a list of constraints for the folding. The
character "x" marks bases that must be unpaired, matching brackets " ( )
" denote base pairs, all other characters are ignored. Any pairs
conflicting with the constraint will be forbidden. This usually sufficient
to ensure the constraints are honored. If do_backtrack
(see section Global Variables for the Folding Routines) has been set to 0 base pairing probabilities will not
be computed (saving CPU time), otherwise the pr[iindx[i]-j]
(see section Global Variables for the Folding Routines) will contain the probability that bases i
and j pair.
pf_fold()
.
init_pf_fold()
.
Prototypes for these functions are declared in `part_func.h'.
We provide two functions that search for sequences with a given structure, thereby inverting the folding routines.
give_up
is set to 1, the function will return as soon as it is
clear that the search will be unsuccessful, this speeds up the algorithm
if you are only interested in exact solutions.
Since inverse_fold()
calls fold()
you have to allocate memory
for folding by calling initialize_fold()
inverse_fold()
.
Since inverse_pf_fold()
calls pf_fold()
you have to allocate
memory for folding by calling init_pf_fold()
char symbolset[]
contains the allowed bases, and
is initially set to "AUGC".
Prototypes for these functions are declared in `inverse.h'.
The following global variables change the behavior the folding algorithms or contain additional information after folding.
temperature
C.
Default is 37C. You have to call the update_..._params() functions after
changing this parameter.
dangles = 1
) dangling end energies are assigned only to unpaired
bases and a base cannot participate simultaneously in two dangling ends. In
the partition function algorithm pf_fold()
these checks are neglected.
If dangles
is set to 2, the fold()
and
energy_of_struct()
function will also follow this convention. This
treatment of dangling ends gives more favorable energies to helices
directly adjacent to one another, which can be beneficial since such
helices often do engage in stabilizing interactions through co-axial
stacking. Default is 1, pf_fold()
treats 1 as 2.
fold()
.
base_pair[0].i
contains the total number of pairs.
pf_fold()
.
pr[iindx[i]-j]
.
pf_fold()
to avoid overflows. Should be set
to approximately exp((-F/kT)/length), where F is an estimate
for the ensemble free energy, for example the minimum free energy. You must
call update_pf_params()
or init_pf_fold()
after changing this
parameter. If pf_scale is -1 (the default) , an estimate will be provided
automatically when calling init_pf_fold()
or
update_pf_params()
. The automatic estimate is usually insufficient
for sequences more than a few hundred bases long.
pf_fold()
; this is about
twice as fast. Default is 1.
inverse_fold()
; 'C': force (1,N) to be paired,
'M' fold as if the sequence were inside a multi-loop. Otherwise the
usual mfe structure is computed.
include `fold_vars.h' if you want to change any of these variables from their defaults.
A default set of parameters, identical to the one described in Walter et.al. (1994), is compiled into the library. Alternately, parameters can be read from and written to a file.
The following describes the file format expected by
read_parameter_file()
. All energies should be given as integers in
units of 0.01kcal/mol.
The first line of the file should read
## RNAfold parameter file
lines of the form
# token
mark the beginning of a list of energy parameters of the type specified by
token. The following tokens are recognized:
# stack_energies
The list of free energies for stacked pairs. Since the library
distinguishes 8 types of pairs (no pair=0, CG=1, GC=2, GU=3, UG=4, AU=5,
UA=6, nonstandard=7), the list should be formated as an 8*8 matrix. As an
example the stacked pair
5'-GU-3' 5'-AC-3' 3'-CA-5' 3'-UG-5'
corresponds to the entry [2,6], which should be identical to [5,1]. Be careful to preserve the symmetry when editing parameter files.
# stack_enthalpies
enthalpies for stacked pairs, used to rescale stacking energies to
temperatures other than 37C. Same format as stack_energies.
# hairpin
Free energies of hairpin loops as a function of size. The list should
contain 31 entries on one or more lines. Since the minimum size of a
hairpin loop is 3 and we start counting with 0, the first three values
should be INF to indicate a forbidden value.
# bulge
Free energies of bulge loops. Should contain 31 entries, the first one
being INF.
# internal_loop
Free energies of internal loops. Should contain 31 entries, the first 2
being INF.
# mismatch_interior
Free energies for the interaction between the closing pair of an interior loop
and the two unpaired bases adjacent to the helix. This is a three
dimensional array indexed by the type of the closing pair (see above) and
the two unpaired bases. The library distinguishes 5 bases: A=1, C=2, G=3,
U=4 and 0 for anything else. The list therefore contains 8*5*5 entries and
should be formated either as an 8*25 matrix or 8 5*5 matrices. The order is
such that for example the mismatch
5'-CU-3' 3'-GC-5'
corresponds to entry [1,4,2] (CG=1, U=4, C=2).
# mismatch_hairpin
Same as above for hairpin loops.
# mismatch_enthalpies
Corresponding enthalpies for rescaling to temperatures other than 37C.
# dangle5
Energies for the interaction of an unpaired base on the 5' side and
adjacent to a helix in multiloops and free ends (the equivalent of mismatch
energies in interior and hairpin loops). The array is indexed by the type
of pair closing the helix and the unpaired base and, therefore, forms a 8*5
matrix. For example the dangling base in
5'-C-3' 3'-GC-5'
corresponds to entry [1,2] (CG=1, C=2);
# dangle3
Same as above for bases on the 3' side of a helix.
# ML_params
For the energy of a multi-loop a function of the form
E = cu*n_unpaired + ci*loop_degree + cc
is used where n_unpaired is the number of unpaired bases in the loop and
loop_degree is the number of helices forming the loop. The line following
the token should contain the values cu cc ci (in that order).
# Tetraloops
Some tetraloops particularly stable tetraloops are assigned an energy
bonus. Up to forty tetraloops and their bonus energies can be listed
following the token, one sequence per line. For example:
GAAA -200
assigns a bonus energy of -2 kcal/mol to tetraloops containing the sequence GAAA.
# END
Anything beyond this token will be ignored.
A parameter file need not be complete, it might may contain only a subset
of interaction parameters, such as only stacking energies. However, for
each type of interaction listed, all entries have to be present.
A `*' may be used to indicate entries of a list that are to retain
their default value. For loop energies a `x' may be used to indicate that
the value is to be extrapolated from the values for smaller loop sizes.
Parameter files may contain C-style comments, i.e. any text between
/*
and */
will be ignored. However, you may have no more
than one comment per line and no multi-line comments.
A parameter file listing the default parameter set should accompany your distribution as `default.par', the file `old.par' contains parameters used in version 1.1b of the Package.
The standard representation of a secondary structure is the "bracket notation", where matching brackets symbolize base pairs and unpaired bases are shown as dots. Alternatively, one may use two types of node labels, 'P' for paired and 'U' for unpaired; a dot is then replaced by '(U)', and each closed bracket is assigned an additional identifier 'P'. We call this the expanded notation. In Fontana et al. (1993) a condensed representation of the secondary structure is proposed, the so-called homeomorphically irreducible tree (HIT) representation. Here a stack is represented as a single pair of matching brackets labeled 'P' and weighted by the number of base pairs. Correspondingly, a contiguous strain of unpaired bases is shown as one pair of matching brackets labeled 'U' and weighted by its length. Generally any string consisting of matching brackets and identifiers is equivalent to a plane tree with as many different types of nodes as there are identifiers.
Bruce Shapiro (1988) proposed a coarse grained representation, which, does not retain the full information of the secondary structure. He represents the different structure elements by single matching brackets and labels them as 'H' (hairpin loop), 'I' (interior loop), 'B' (bulge), 'M' (multi-loop), and 'S' (stack). We extend his alphabet by an extra letter for external elements 'E'. Again these identifiers may be followed by a weight corresponding to the number of unpaired bases or base pairs in the structure element. All tree representations (except for the dot-bracket form) can be encapsulated into a virtual root (labeled 'R'), see the example below.
The following example illustrates the different linear tree representations used by the package. All lines show the same secondary structure.
a) .((((..(((...)))..((..)))).)). (U)(((((U)(U)((((U)(U)(U)P)P)P)(U)(U)(((U)(U)P)P)P)P)(U)P)P)(U) b) (U)(((U2)((U3)P3)(U2)((U2)P2)P2)(U)P2)(U) c) (((H)(H)M)B) ((((((H)S)((H)S)M)S)B)S) (((((((H)S)((H)S)M)S)B)S)E) d) ((((((((H3)S3)((H2)S2)M4)S2)B1)S2)E2)R)
Above: Tree representations of secondary structures. a) Full structure: the first line shows the more convenient condensed notation which is used by our programs; the second line shows the rather clumsy expanded notation for completeness, b) HIT structure, c) different versions of coarse grained structures: the second line is exactly Shapiro's representation, the first line is obtained by neglecting the stems. Since each loop is closed by a unique stem, these two lines are equivalent. The third line is an extension taking into account also the external digits. d) weighted coarse structure, this time including the virtual root.
For the output of aligned structures from string editing, different representations are needed, where we put the label on both sides. The above examples for tree representations would then look like:
a) (UU)(P(P(P(P(UU)(UU)(P(P(P(UU)(UU)(UU)P)P)P)(UU)(UU)(P(P(UU)(U... b) (UU)(P2(P2(U2U2)(P2(U3U3)P3)(U2U2)(P2(U2U2)P2)P2)(UU)P2)(UU) c) (B(M(HH)(HH)M)B) (S(B(S(M(S(HH)S)(S(HH)S)M)S)B)S) (E(S(B(S(M(S(HH)S)(S(HH)S)M)S)B)S)E) d) (R(E2(S2(B1(S2(M4(S3(H3)S3)((H2)S2)M4)S2)B1)S2)E2)R)
Aligned structures additionally contain the gap character '_'.
Several functions are provided for parsing structures and converting to different representations.
b2C()
.
All the above functions allocate memory for the strings they return.
tree_edit_distance()
function back to bracket notation with '_'
as the gap character. The result overwrites the input.
loop_size[0]
contains the
number of external bases.
Prototypes for the above functions can be found in `RNAstruct.h'.
A simple measure of dissimilarity between secondary structures of equal length is the base pair distance, given by the number of pairs present in only one of the two structures being compared. I.e. the number of base pairs that have to be opened or closed to transform one structure into the other. It is therefore particularly useful for comparing structures on the same sequence. It is implemented by
For other cases a distance measure that allows for gaps is preferable. We can define distances between structures as edit distances between trees or their string representations. In the case of string distances this is the same as "sequence alignment". Given a set of edit operations and edit costs, the edit distance is given by the minimum sum of the costs along an edit path converting one object into the other. Edit distances like these always define a metric. The edit operations used by us are insertion, deletion and replacement of nodes. String editing does not pay attention to the matching of brackets, while in tree editing matching brackets represent a single node of the tree. Tree editing is therefore usually preferable, although somewhat slower. String edit distances are always smaller or equal to tree edit distances.
The different level of detail in the structure representations defined above naturally leads to different measures of distance. For full structures we use a cost of 1 for deletion or insertion of an unpaired base and 2 for a base pair. Replacing an unpaired base for a pair incurs a cost of 1.
Two cost matrices are provided for coarse grained structures:
/* Null, H, B, I, M, S, E */ { 0, 2, 2, 2, 2, 1, 1}, /* Null replaced */ { 2, 0, 2, 2, 2, INF, INF}, /* H replaced */ { 2, 2, 0, 1, 2, INF, INF}, /* B replaced */ { 2, 2, 1, 0, 2, INF, INF}, /* I replaced */ { 2, 2, 2, 2, 0, INF, INF}, /* M replaced */ { 1, INF, INF, INF, INF, 0, INF}, /* S replaced */ { 1, INF, INF, INF, INF, INF, 0}, /* E replaced */ /* Null, H, B, I, M, S, E */ { 0, 100, 5, 5, 75, 5, 5}, /* Null replaced */ { 100, 0, 8, 8, 8, INF, INF}, /* H replaced */ { 5, 8, 0, 3, 8, INF, INF}, /* B replaced */ { 5, 8, 3, 0, 8, INF, INF}, /* I replaced */ { 75, 8, 8, 8, 0, INF, INF}, /* M replaced */ { 5, INF, INF, INF, INF, 0, INF}, /* S replaced */ { 5, INF, INF, INF, INF, INF, 0}, /* E replaced */
The lower matrix uses the costs given in Shapiro (1990). All distance functions use the following global variables:
edit_backtrack
set to 1. See section Representations of Secondary Structures, for
details on the representation of structures.
Tree
( essentially the postorder list ) of the
structure xstruc, for use in
tree_edit_distance()
.
xstruc may be any rooted structure representation.
Prototypes for the above functions can be found in `treedist.h'. The
type Tree
is defined in `dist_vars.h', which is automatically
included with `treedist.h'
string_edit_distance()
.
Prototypes for the above functions can be found in `stringdist.h'.
For comparison of base pair probability matrices, the matrices are first condensed into probability profiles which are the compared by alignment.
pr
(see section Global Variables for the Folding Routines) and calculates a profile, i.e. a vector containing for
each base the probabilities of being unpaired, upstream, or downstream
paired, respectively. The returned array is suitable for
profile_edit_distance
.
Prototypes for the above functions can be found in `profiledist.h'.
The following utilities are used and therefore provided by the library:
pf_fold()
from the
global array pr
and the pair list base_pair
produced by
fold()
and produces a postscript "dot plot" that is written to
filename. The "dot plot" represents each base pairing
probability by a square of corresponding area in a upper triangle
matrix. The lower part of the matrix contains the minimum free energy
structure.
base_pair
array anymore.
option
is an uppercase letter the
sequence
is used to label nodes, if option
equals 'X'
or 'x' the resulting file will coordinates for an initial layout of
the graph.
PS_rna_plot
and gmlRNA
. Current possibility are
0 for a simple radial drawing or 1 for the modified radial drawing taken from
the naview
program of Bruccoleri & Heinrich (1988).
erand48()
.
urn ()
. These should be set to some random number seeds
before the first call to urn ()
.
free()
when the string is no longer needed.
Prototypes for PS_rna_plot()
and PS_dot_plot()
reside in
`PS_dot.h', the other functions are declared in `utils.h'.
The following program exercises most commonly used functions of the library. The program folds two sequences using both the mfe and partition function algorithms and calculates the tree edit and profile distance of the resulting structures and base pairing probabilities.
#include <stdio.h> #include <math.h> #include "utils.h" #include "fold_vars.h" #include "fold.h" #include "part_func.h" #include "inverse.h" #include "RNAstruct.h" #include "treedist.h" #include "stringdist.h" #include "profiledist.h" void main() { char *seq1="CGCAGGGAUACCCGCG", *seq2="GCGCCCAUAGGGACGC", *struct1,* struct2,* xstruc; float e1, e2, tree_dist, string_dist, profile_dist, kT; Tree *T1, *T2; swString *S1, *S2; float **pf1, **pf2; /* fold at 30C instead of the default 37C */ temperature = 30.; /* must be set *before* initializing */ /* allocate memory for fold(), could be skipped */ initialize_fold(strlen(seq1)); /* allocate memory for structure and fold */ struct1 = (char* ) space(sizeof(char)*(strlen(seq1)+1)); e1 = fold(seq1, struct1); struct2 = (char* ) space(sizeof(char)*(strlen(seq2)+1)); e2 = fold(seq2, struct2); free_arrays(); /* free arrays used in fold() */ /* produce tree and string representations for comparison */ xstruc = expand_Full(struct1); T1 = make_tree(xstruc); S1 = Make_swString(xstruc); free(xstruc); xstruc = expand_Full(struct2); T2 = make_tree(xstruc); S2 = Make_swString(xstruc); free(xstruc); /* calculate tree edit distance and aligned structures with gaps */ edit_backtrack = 1; tree_dist = tree_edit_distance(T1, T2); free_tree(T1); free_tree(T2); unexpand_aligned_F(aligned_line); printf("%s\n%s %3.2f\n", aligned_line[0], aligned_line[1], tree_dist); /* same thing using string edit (alignment) distance */ string_dist = string_edit_distance(S1, S2); free(S1); free(S2); printf("%s mfe=%5.2f\n%s mfe=%5.2f dist=%3.2f\n", aligned_line[0], e1, aligned_line[1], e2, string_dist); /* for longer sequences one should also set a scaling factor for partition function folding, e.g: */ kT = (temperature+273.15)*1.98717/1000.; /* kT in kcal/mol */ pf_scale = exp(-e1/kT/strlen(seq1)); init_pf_fold(strlen(seq1)); /* calculate partition function and base pair probabilities */ e1 = pf_fold(seq1, struct1); pf1 = Make_bp_profile(strlen(seq1)); e2 = pf_fold(seq2, struct2); pf2 = Make_bp_profile(strlen(seq2)); free_pf_arrays(); /* free space allocated for pf_fold() */ profile_dist = profile_edit_distance(pf1, pf2); printf("%s free energy=%5.2f\n%s free energy=%5.2f dist=%3.2f\n", aligned_line[0], e1, aligned_line[1], e2, profile_dist); free_profile(pf1); free_profile(pf2); }
In a typical Unix environment you would compile this program using: cc -c example.c -Ihpath and link using cc -o example -Llpath -lRNA -lm where hpath and lpath point to the location of the header files and library, respectively.
Jump to: a - b - e - f - g - h - i - m - n - p - r - s - t - u - w
Jump to: a - b - c - d - e - f - g - h - i - l - n - p - r - s - t - u - x
This document was generated on 7 July 1998 using the texi2html translator version 1.52.