Detecting Conserved RNA Structures

Ivo L Hofacker and Peter F Stadler

Introduction

The programs alidot and pfrali detect conserved secondary structure elements in relatively small sets of RNAs by combining multiple sequence alignments and secondary structure predictions. Both a (good) sequence alignment and predicted secondary structure predictions for each sequence in the alignment must be provided as inputs. While alidot works with single structures, its variant pfrali uses base pairing probability matrices.

The starting point the analysis of conserved secondary structure elements is a list of all predicted base pairs. This list will in general not be a valid secondary structure, i.e., it will violate one or both of the following two conditions:

No nucleotide takes part in more than one base pair.
Base pairs never cross, that is, for two base pairs (i.j) and (k.l) we have never i<k<j<l or k<i<l<j.

The basic idea behind both alidot and pfrali is to sort the individual base pairs by their credibility and to reduce the number of entries in the list by subsequent filtering steps until only those secondary structure elements are left that are consistetly predicted.

Clearly the sorting procedure is of crucial importance. For each predicted base pair (i.j) we store the nucleotides occurring in the corresponding positions in the sequence alignment. We shall call a sequence non-compatible with a base pair (i.j) if the two nucleotides at positions i and j would form a non-standard base pair such as GA or UU. A sequence is compatible with base pair (i.j) if the two nucleotides form either one of the follow six combinations: GC, CG, AU, UA, GU, UG.

When different standard combinations are found for a particular base pair (i.j) we may speak of consistent mutations. If we find combinations such as GC and CG or GU and UA, where both positions are mutated at once we have compensatory mutations. The occurrence of consistent and, in particular, compensatory mutations strongly supports a predicted base pair, at least in the absence of non-consistent mutations.

From the frequencies f_{ij} with which (i.j) is predicted in the sample of sequences we derive a pseudo-entropy

S_ij = -\sum_k f_ikln f_ik - \sum_k f_kjln f_kj + f_ijln f_ij

involving i and j, respectively. The pseudo-entropy is a measure for the reliability with which (i.j) is predicted.

We call a base pair (i.j) symmetric if j is the most frequently predicted pairing partner of i and if i is the most frequently predicted pairing partner of j. Note that for each sequence position i there is at most one symmetric base pair involving i. A symmetric base pair (i.j) has necessarily a rather large value f_{ij}; in particular, it does not allow a large number of structural alternatives.

The basic logic of both programs in similar: In the first step, a list of "believable base pairs" is extracted from the set of all pairs that are contained in the input. The details of the procedure are not the same for both programs.

In the first processing step of alidot, all but the most frequent pair (i.j) for each base i is removed. The remaining list is then sorted according to the following hierarchical criteria:

The more sequences are non-compatible with (i.j), the less credible is the base pair.
Symmetric base pairs are more credible than other base pairs.
A base pair with more consistent mutations is more credible.
Base pairs with smaller values of a pseudo-entropy are more credible.

In contrast, pfrali does not prune the list in a preprocessing step and uses the follwing two-step mechanism for sorting the list of base pairs:

The more sequences are non-compatible with (i.j), the less credible is the base pair.
If the number of non-compatible sequences is the same, then the pairs are ranked by the product of the mean probability and the number of different pairing combinations.

The next step is common to both programs: the sorted list in reduced by running through it and removing all base pairs that cross with higher ranking ones and hence would not yield a valid secondary structure.

The resulting secondary structure will, in general, still contain ill-supported base pairs. These are removed by three subsequent filtering steps:

All pairs are removed that have more than two non-compatible sequences (for large samples this number may have to be increased), as well as pairs with two non-compatible sequences adjacent to a pair that also has non-compatible sequences.
Next we omit all isolated base pairs. The remaining pairs are collected into helices.
Only those helices are retained that satisfy the following conditions: @itemize @bullet @item the highest ranking base pair must not have non-compatible sequences. @item for the highest ranking base pair the product of the mean probability and the number of different pairing combinations must be greater than 0.3 @item if the helix has length 2, it must not have more non-compatible sequences than consistent mutations. @end itemize

Input Files

The search algorithms for detecting conserved RNA structure elements are based on both a multiple sequence alignment and predicted secondary structures. These must be present in the following form:

The multiple sequence alignment must be contained in a single file in CLUSTAL W format. In particular, the first line of the alignment file must begin with the word CLUSTAL.
The predicted secondary structures must be contained in individual files. The names of these files must conform the names of the sequences in the CLUSTAL file. The files must have the same file extension. (For example: *.mfe for minimum energy files, and blah_dp.ps or blah.pf for base pairing probabilites.)
All structure files must reside in the directory from which alidot or pfrali is invoked.
The structure file must have standard Vienna RNA Package I/O format. This means that a minimum energy (single structure) file must contain the secondary structure in dot-bracket notation. The first line of a single structure file must begin with a > sign followed by the sequence name.
Base pairing probability matrices are read from the *_dp.ps produced by RNAfold -p or the .pf produced for instance by the message passing version of RNAfold. The .pf files are plain ascii files containing a line of the form i j p for each base pairs (i,j) with sufficiently high pairing probability p.

Look at the Vienna RNA Web site for more information on RNAfold and related programs.

Output Files

Overview

The program alidot contains the detection algorithm described in Hofacker:98a. It reads CLUSTAL W alignment file from stdin and uses single structure predictions for each of the aligned sequences as input.

The program pfrali is a variant of alidot that uses base pairing probability matrices instead of single structures as input. It is described in detail in Hofacker:98b. pfrali reads a CLUSTAL W alignment file from stdin and uses base pairing probability matrices in the form of the *_dp.ps as produced by RNAfold -p or as .pf files as produced by the parallel (mpi) version of McCaskill's algorithm Fekete:98a.

Both programs are used in a similar way:

alidot -e kin < /home/murxer/ali/AlignedSeqs.aln > AlignedSeqs.aliout
pfrali        < /home/murxer/ali/AlignedSeqs.aln > AlignedSeqs.pfrali

The CLUSTAL W file named AlignedSeqs.aln in the directory /home/murxer/ali/ is read. Then the program searches for the files containing the structure predictions in the current directory. alidot will search for *.mfe files unless the -e option is specified. pfrali first searches for *.pf files. If these are not found, it tries *_dp.ps files which must be PostScript files in the format specified below.

Both programs produce both text and PostScript output. The text output is written to stdout and can be redirected into files as the example above. The PostScript output of alidot is written to aln_dp.ps, the PostScript output of pfrali is written to aln_pf_dp.ps.

Text Output

The text output generated by alidot and pfrali has the following common format:

15 sequences; length of alignment 9538. 25048 different pairs, reduced: 2755.
1425 believable pairs 0 2 6 71 442 904
 5333  5348  0  66.7%   1.891 CG:4    GC:2    GU:1    UG:7    UA:1   
 3879  3923  0  53.3%   1.643 GU:1    UG:9    AU:2    UA:3   
 8207  8342  0  66.7%   1.983 CG:4    GU:2    UG:1    AU:8   
 3133  3193  0  60.0%   2.288 CG:1    GU:8    UG:1    AU:5   
  525   549  0 100.0%   0.000 GC:11   GU:1    AU:3   
 9101  9117  0 100.0%   0.000 CG:12   UG:1    AU:2   
...
 4633  4689  0  26.7%   3.423 CG:15   +
...
 5936  6011  0  13.3%   3.962 CG:2    GC:8    GU:2    UG:3    * +
 2230  2246  0  40.0%   2.113 CG:9    UG:5    UA:1    *
 2875  2924  0  33.3%   2.629 CG:9    GU:2    UG:4    *
 7053  7070  0  33.3%   3.041 CG:13   UG:1    UA:1    *
...
 2040  2048  1  13.3%   4.338 GC:14   * +
 2726  2737  2  46.7%   2.840 GC:9    GU:1    UG:1    AU:2   
((((......(((..((((..(((((((..(

The header of the output contains the number of sequences from the alignment file for which structure files with valid secondary structures have been found in the current directory and some simple statistics. The number of reduced pairs is the number of entries in the color dot plots. The second line gives the number of "believable base pairs" followed by number of pairs with 6, 5, 4, 3, 2, 1, and no different types. The final entry is the number pairs with at least one incompatible sequence among the believable base pairs (The example shown here is taken from a sample of complete Hepatitis C virus genomes.)

The base pairing data follow from line 3. The base pairs are sorted by credibility according to the algorithm described in detail in ref.Hofacker:98a.

The last line of the text output contains the conserved structure in bracket notation.

The numbers in the first two columns identify the predicted base pair. The numbering conforms the Clustal W input file. The third column contains the number of sequence in which i.j is not one of the six standard RNA base pairs. Column 4 gives the percentage of sequences for which the base was predicted. Column 5 contains the pseudo-entropy. The following columns show which of the six standard base pairs occurs in the aligned sequences.

At the end of each record there may be a + sign and/or a *. The + sign indicates non-symmetric base pairs, that is, base pairs i.j for which either i was not the most frequently predicted pairing partner of j or vice versa. A * indicates that the base pair conflicts with a base pair that is predicted with a higher credibility.

PostScript Output

Dot plots are simple way of representing contact structures. Each base pair i.j corresponds to a small square in a matrix of size sequence length times sequence length. The size and color of this "dot" are used to encode additional information. The simplest version of the dot plots are produced automatically by RNAfold -p. In these files, which serve as input for pfrali in the present package, the area of the squares is proportional to the (predicted) equilibrium base pairing probability.

This type of output is modified to include information of the reliability of a prediction. As a general rule we use

the size of the square to indicate the frequency of prediction;
the color hue of the square to indicate the number of different consistent nucleotide pairs occurring for a given base pair; and
the saturation of the color to indicate the number of sequences that are not consistent with the base pair in the sense that they have nucleotides as the relevant positions that do not form one of the six standard RNA base pairs.

The color codes are as follows:

red all sequences have the same two nucleotides
ocre two types of base pairs occur
green three types of base pairs occur
turquoise four types of base pairs occur
blue five types of base pairs occur
violet all six types of base pairs occur

The saturation scale is as follows:

saturated no inconsistent sequences
pale color one sequence in the sample is inconsistent
very pale two sequence in the sample are inconsistent
invisible more than two sequences in the sample are inconsistent

The interpretation of the size of the dots differs somewhat between the output of alidot and pfrali. It will be described in detail in the following two sections.

The dot plot files generated by alidot and pfrali have entries both in the upper right and the lower left triangle. The two halves of the matrix do not contain the same information.

The upper right triangle displays all predicted base pairs with not more than two inconsistent sequences.

The lower left triangle contains only the secondary structure formed by the most believable base pairs. It still contains pairs that are not in the final prediction.

Technical Note. We include here a brief description of the internal format of _dp.ps files, since it is possible to recover the most interesting data from the PostScript output.

After the usual PostScript header a number of macros are defined. The first piece of useful information is the string

/sequence {(a_string) } def

In the color dot plots produced by alidot and pfrali it contains a predicted secondary structure in dot-bracket notation. In the output RNAfold -p it contains the RNA sequence.

After a number of further definitions the base pairing data are listed. The format is slightly different between the color dot plots of alidot and pfrali and the black and white version produced by RNAfold -p. In the latter case the format is

6 21 0.9309 ubox
7 20 0.9661 ubox

The first 2 columns give the position of the base pair, the third column is the edge length of the square (1 being a full square). The PostScript macros ubox or lbox determine whether the dot is plotted in the upper right or the lower left triangle.

In the color version ubox and lbox take two more arguments determining the color of the dots

6 21 0.9309 0.00 0.60 ubox
7 20 0.9661 0.00 0.60 ubox

Column 4 contains 0.16*(number of different base types -1), and column 5 contains max[0, 1-0.4*(number of inconsistent sequences)]. The computation of the three hsb-color parameters from these values is encoded in the definition of the PostScript macros.

Note that the edge length (column 3) is the square root of the equilibrium pairing probability in the output of RNAfold -p or the square root of prediction frequency. If pfrali is used, the this column contains the square root of the pairing probabilities averaged over all sequences in the alignment file.

Perl Utilities

Split ViennaRNA Files

Output from the Vienna RNA Package oftentimes contains the minimum energy structures of list of sequences in a single file. The format of such files is

> 1_10770_lmnt_1  begin= 7 end= 64
AGUCUACGUGGACCGACAAAGACAGAUUCUUUGAGGGAGCUUAGCUCAACGUAGUUCU
((.((((((...((..(...((.....))...).))((((...)))).))))))..))
> 1_10770_lmnt_2  begin= 78 end= 98
AUUAGAGAGCAGAUCUCUGAU
((((((((.....))))))))
> 1_10770_lmnt_3  begin= 116 end= 136
AAAGGCGAGAAAAACGCCUUU
(((((((.......)))))))
....

The entry for each sequence begin with a (single) comment line (beginning with >) containing the name of the sequence and optional comments. The following two lines contain the sequence and structure in dot-bracket notation. The structure entry may be followed by the folding energy in parenthesis.

The script split.pl splits such a file into separate files for each sequence.

split.pl ViennaRNA_file

The filename is determined by first string in the comment line, to which .mfe is appended. The first file created from the above example is thus 1_10770_lmnt_1.mfe. This script is necessary since alidot expects all structures to be located in separate files.

Zooming into Dot Plots

The Package provides two utilities for zooming into _dp.ps files: dpzoom.pl extracts the diagonal block between two sequence positions. The defaults for the first and large position are 1 and the sequence length, respectively.

dpzoom.pl [-f first] [-l last] [dp_file]

The script reads either from the file dp_file if its name is provided as an argument, or it reads from stdin. The output is written to stdout.

A variant of this script is dprzoom.pl. This script in addition to extracting a box along the diagonal rotates the graphics by 45 degrees. It can be used to obtain details for the band along the diagonal, i.e., for shorter range interactions. It is used in the same way as dpzoom.pl @pindex dprzoom.pl

dprzoom.pl [-f first] [-l last] [dp_file]

PostScript Mountain Plots

The Perl5 script cmount.pl produces colored Hogeweg mountain representation in PostScript from _dp.ps generated by alidot or dp_zoom.

Here the Hogeweg mountain representation is based on the following definition:
mm[i], mp[i]=number of base pairs enclosing base i. The script reads from stdin and writes to stdout.

Consensus Sequence

The Perl5 script consensus.pl prints the consensus sequence for part of an Clustal W alignment and the corresponding piece of structure in bracket notation if the text output of alidot is provided.

consens.pl [-f from] [-l last] clustalw.aln [alidot.out]

The name of the CLUSTAL W file is a required argument. If the name of the file containing the corresponding alidot text output is passed as the second argument, then the consensus structure is extracted in dot-bracket notation.

The script can be used to extract a subsequence/substracture by specifying the first (option -f) and/or last (option -l) sequence position in the alignment file that is to be printed. Defaults are -f 1 and -l length of aligned sequences.

The following additional information is printed to stderr: The length of the aligned sequence, the number of conserved positions, and the mean pairwise sequence identity for both the complete alignment and sequence window defined by the -f and -l options. This script is particularly useful as the first step towards producing input files for a secondary structure editor such as XRNA.

Annotated RNA Secondary Structure Plots

The script anote.pl read an _ss.ps file as produced e.g. by RNAplot or RNAfold and adds circles around bases with complentary mutations and marks bases that are incompatible in some sequences using a grayscale. The original .ps file is renamed to .ps.bak.

anote.pl  alidot.out rna_ss.ps [offset]

You may create a secondary structure plot for your alidot output alidot.out obtained with the CLUSTAL W data.aln for the sequence window from, say, positions 476 to 756 in the alignment by using the following command line:

consensus.pl -f 476 -l 756 data.aln alidot.out | RNAplot ;
anote.pl 476-756_ss.ps alidot.out 475

RNAplot is part of the Vienna RNA Package 1.3 and higher.

Annotate XRNA Secondary files

XRNA is an interactive program for drawing and editing secondary structures. It has its own file format with the usual extension .ss. Appart from its own format, XRNA reads Michael Zucker's .xt files which do not allow to pass information beyond the list of base pairs to XRNA.

The script xrna.pl reads the output from alidot or pfrali and edits a previously produced .ss file to include circles around bases with complentary mutations and gray letter for bases that are incompatible in some sequences. The original .ss file is renamed to .ss.bak.

Use the following procedure to create a secondary structure plot for your alidot output alidot.out obtained with the CLUSTAL W data.aln.

Identify the region that you want to plot, say positions 476 to 756 in the alignment.
Run consensus.pl -f 476 -l 756 data.aln alidot.out > fragment.dat to obtain the consensus sequence and structure for the reason of interest.
Convert fragment.dat into Zucker style base pair file using
b2ct < fragment.dat > fragment.ct.
Invoke XRNA, load the structure file fragment.ct, edit the secondary structure drawing, and save the drawing as fragment.ss. See the XRNA Manual for details.
Run xrna.pl alidot.out fragment.ss 475 to annotate the .ss file. The third argument 475 is the offset of the sequence fragment of interest. It takes the value 0 (and can be omitted) if the fragment begins at the first position of the CLUSTAL W alignment.
Run XRNA in batchmode to produce a PostScript file from fragments.ss. See the XRNA Manual for details.

References

Ivo L. Hofacker, Martin Fekete, Christoph Flamm, Martijn A. Huynen, Susanne Rauscher, Paul E. Stolorz, and Peter F. Stadler: Automatic Detection of Conserved RNA Structure Elements in Complete RNA Virus Genomes. Nucl. Acids Res. 26, 3825--3836 (1998).
Ivo L. Hofacker and Peter F. Stadler: Automatic Detection of Conserved Base Pairing Patterns in RNA Virus Genomes. Comp. & Chem. 23, 401-414 (1999).

Ivo Hofacker <ivo@tbi.univie.ac.at>

Last modified: Thu Feb 10 12:24:13 CET 2000