[Top]

[Index]

[ ? ]

Conserved RNA Structures

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

1. Introduction

The program alidot (ALIgned DOT-plots) detects conserved secondary structure elements in relatively small sets of RNAs by combining multiple sequence alignments and secondary structure predictions. Both a (good) sequence alignment and predicted secondary structure predictions for each sequence in the alignment must be provided as inputs. alidot works either with predicted mfe structures, or with base pairing probability matrices.

The starting point the analysis of conserved secondary structure elements is a list of all predicted base pairs. This list will in general not be a valid secondary structure, i.e., it will violate one or both of the following two conditions:

No nucleotide takes part in more than one base pair.
Base pairs never cross, that is, for two base pairs (i.j) and (k.l) we have never i<k<j<l or k<i<l<j.

The basic idea behind alidot is to sort the individual base pairs by their credibility and to reduce the number of entries in the list by subsequent filtering steps until only those secondary structure elements are left that are consistetly predicted.

Clearly the sorting procedure is of crucial importance. For each predicted base pair (i.j) we store the nucleotides occurring in the corresponding positions in the sequence alignment. We shall call a sequence non-compatible with a base pair (i.j) if the two nucleotides at positions i and j would form a non-standard base pair such as GA or UU. A sequence is compatible with base pair (i.j) if the two nucleotides form either one of the follow six combinations: GC, CG, AU, UA, GU, UG.

When different standard combinations are found for a particular base pair (i.j) we may speak of consistent mutations. If we find combinations such as GC and CG or GU and UA, where both positions are mutated at once we have compensatory mutations. The occurrence of consistent and, in particular, compensatory mutations strongly supports a predicted base pair, at least in the absence of non-consistent mutations.

From the frequencies f_{ij} with which (i.j) is predicted in the sample of sequences we derive a pseudo-entropy

S_ij = -Σ_k f_ikln f_ik - Σ_k f_kjln f_kj + f_ijln f_ij

involving i and j, respectively. The pseudo-entropy is a measure for the reliability with which (i.j) is predicted.

We call a base pair (i.j) symmetric if j is the most frequently predicted pairing partner of i and if i is the most frequently predicted pairing partner of j. Note that for each sequence position i there is at most one symmetric base pair involving i. A symmetric base pair (i.j) has necessarily a rather large value f_{ij}; in particular, it does not allow a large number of structural alternatives.

The basic logic of the program is the following: In the first step, a list of "believable base pairs" is extracted from the set of all pairs that are contained in the input, in the second step this list is sorted by one of several possible "credibilty" criteria.

The first sorting method is strictly hierarchical and was used for the original alidot program, as described in Hofacker:98. It uses the following criteria:

The more sequences are non-compatible with (i.j), the less credible is the base pair.
Symmetric base pairs are more credible than other base pairs.
A base pair with more consistent mutations is more credible.
Base pairs with smaller values of a pseudo-entropy are more credible.

The --compf=2 option will select this compare function.

The second sorting method was introduced in the original pfrali program Hofacker:99, now available as alidot -p. It uses the follwing two-step mechanism for sorting the list of base pairs:

The more sequences are non-compatible with (i.j), the less credible is the base pair.
If the number of non-compatible sequences is the same, then the pairs are ranked by the product of the mean probability and the number of different pairing combinations.

Use the --compf=3 option to select this compare function.

Finally it is possible to use the covariance term introduced in Hofacker:02 for the RNAalifold program. This credibility criterium is the sum of three terms:

crd_ij = log(p_ij) + C_ij -H_ij

where p_ij is the mean probabilty of the pair (i,j), H_ij is the fraction of sequences inconsistent with the pair and

C_ij = Σ_XY,X'Y' f_ij(XY)D_XY,X'Y' f_ij(X'Y'),

where f_ij(XY) is the frequency of the combination XY at positions i and j of the alignment and D is the hamming distance between XY and X'Y' if both are base pairs, 0 otherwise.

Finally, the sorted list in reduced by running through it and removing all base pairs that cross with higher ranking ones and hence would not yield a valid secondary structure.

The resulting secondary structure will, in general, still contain ill-supported base pairs. These are removed by three subsequent filtering steps:

All pairs are removed that have more than two non-compatible sequences (for large samples this number may have to be increased), as well as pairs with two non-compatible sequences adjacent to a pair that also has non-compatible sequences.
Next we omit all isolated base pairs. The remaining pairs are collected into helices.
Only those helices are retained that satisfy the following conditions:
- the highest ranking base pair must not have non-compatible sequences.
- for the highest ranking base pair the product of the mean probability and the number of different pairing combinations must be greater than 0.3
- if the helix has length 2, it must not have more non-compatible sequences than consistent mutations.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

2. Input Files

The search algorithms for detecting conserved RNA structure elements are based on both a multiple sequence alignment and predicted secondary structures. These must be present in the following form:

The multiple sequence alignment must be contained in a single file in CLUSTAL W format. In particular, the first line of the alignment file must begin with the word CLUSTAL.
The predicted secondary structures must be contained in individual files. The names of these files must conform the names of the sequences in the CLUSTAL file. The files must have the same file extension. (For example: *.mfe for minimum energy files, and blah_dp.ps or blah.pf for base pairing probabilites.)
All structure files must reside in the directory from which alidot is invoked.
The structure file must have standard Vienna RNA Package I/O format. This means that a minimum energy (single structure) file must contain the secondary structure in dot-bracket notation. The first line of a single structure file must begin with a > sign followed by the sequence name.
Base pairing probability matrices are read from the *_dp.ps produced by RNAfold -p or the .pf produced for instance by the message passing version of RNAfold. The .pf files are plain ascii files containing a line of the form i j p for each base pairs (i,j) with sufficiently high pairing probability p.

Look at the Vienna RNA Web site for more information on RNAfold and related programs.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

3. Output Files

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

3.1 Overview

The program alidot contains the detection algorithm described in Hofacker:98 and Hofacker:99. It reads CLUSTAL W alignment file from stdin and uses predicted either predicted mfe structures or base pairing probability matrices for each of the aligned sequences as input.

A typical usage would be:

alidot    /home/murxer/ali/AlignedSeqs.aln > AlignedSeqs.aliout
alidot -p /home/murxer/ali/AlignedSeqs.aln > AlignedSeqs.pfrali

The CLUSTAL W file named AlignedSeqs.aln in the directory /home/murxer/ali/ is read. Then the program searches for the files containing the structure predictions in the current directory. In the first case alidot will search for *.mfe files unless the -e option is specified. With the -p option it first searches for *.pf files. If these are not found, it tries *_dp.ps files which must be PostScript files in the format specified below.

In both cases text and PostScript output is produced. The text output is written to stdout and can be redirected into files as the example above. The PostScript output of alidot is written to aln_dp.ps or aln_pf_dp.ps (with -p).

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

3.2 Text Output

The text output generated by alidot has the following common format:

15 sequences; length of alignment 9538. 25048 different pairs, reduced: 2755.
1425 believable pairs 0 2 6 71 442 904
 5333  5348  0  66.7%   1.891 CG:4    GC:2    GU:1    UG:7    UA:1   
 3879  3923  0  53.3%   1.643 GU:1    UG:9    AU:2    UA:3   
 8207  8342  0  66.7%   1.983 CG:4    GU:2    UG:1    AU:8   
 3133  3193  0  60.0%   2.288 CG:1    GU:8    UG:1    AU:5   
  525   549  0 100.0%   0.000 GC:11   GU:1    AU:3   
 9101  9117  0 100.0%   0.000 CG:12   UG:1    AU:2   
...
 4633  4689  0  26.7%   3.423 CG:15   +
...
 5936  6011  0  13.3%   3.962 CG:2    GC:8    GU:2    UG:3    +
 2230  2246  0  40.0%   2.113 CG:9    UG:5    UA:1    
 2875  2924  0  33.3%   2.629 CG:9    GU:2    UG:4    
 7053  7070  0  33.3%   3.041 CG:13   UG:1    UA:1    
...
 2040  2048  1  13.3%   4.338 GC:14   +
 2726  2737  2  46.7%   2.840 GC:9    GU:1    UG:1    AU:2   
((((......(((..((((..(((((((..(

The header of the output contains the number of sequences from the alignment file for which structure files with valid secondary structures have been found in the current directory and some simple statistics. The number of reduced pairs is the number of entries in the color dot plots. The second line gives the number of "believable base pairs" followed by number of pairs with 6, 5, 4, 3, 2, 1, and no different types. The final entry is the number pairs with at least one incompatible sequence among the believable base pairs (The example shown here is taken from a sample of complete Hepatitis C virus genomes.)

The base pairing data follow from line 3. The base pairs are sorted by credibility according to the algorithm described in detail in ref. Hofacker:98.

The last line of the text output contains the conserved structure in bracket notation.

The numbers in the first two columns identify the predicted base pair. The numbering conforms the Clustal W input file. The third column contains the number of sequence in which i.j is not one of the six standard RNA base pairs. Column 4 gives the percentage of sequences for which the base was predicted. Column 5 contains the pseudo-entropy. The following columns show which of the six standard base pairs occurs in the aligned sequences.

At the end of each record there may be a + sign indicating that the base pair conflicts with a base pair that is predicted with a higher credibility.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

3.3 PostScript Output

Dot plots are simple way of representing contact structures. Each base pair i.j corresponds to a small square in a matrix of size sequence length times sequence length. The size and color of this "dot" are used to encode additional information. The simplest version of the dot plots are produced automatically by RNAfold -p. In these files, which serve as input for alidot -p in the present package, the area of the squares is proportional to the (predicted) equilibrium base pairing probability.

This type of output is modified to include information of the reliability of a prediction. As a general rule we use

the size of the square to indicate the frequency of prediction;
the color hue of the square to indicate the number of different consistent nucleotide pairs occurring for a given base pair; and
the saturation of the color to indicate the number of sequences that are not consistent with the base pair in the sense that they have nucleotides as the relevant positions that do not form one of the six standard RNA base pairs.

The color codes are as follows:

.
red all sequences have the same two nucleotides
ocre two types of base pairs occur
green three types of base pairs occur
turquoise four types of base pairs occur
blue five types of base pairs occur
violet all six types of base pairs occur

The saturation scale is as follows:

.
saturated no inconsistent sequences
pale color one sequence in the sample is inconsistent
very pale two sequence in the sample are inconsistent
invisible more than two sequences in the sample are inconsistent

The dot plot files generated by alidot have entries both in the upper right and the lower left triangle. The two halves of the matrix do not contain the same information.

The upper right triangle displays all predicted base pairs with not more than two inconsistent sequences.

The lower left triangle contains only the secondary structure formed by the most believable base pairs. It still contains pairs that are not in the final prediction.

Technical Note. We include here a brief description of the internal format of _dp.ps files, since it is possible to recover the most interesting data from the PostScript output.

After the usual PostScript header a number of macros are defined. The first piece of useful information is the string

/sequence {(a_string) } def

In the color dot plots produced by alidot it contains a predicted secondary structure in dot-bracket notation. In the output RNAfold -p it contains the RNA sequence.

After a number of further definitions the base pairing data are listed. The format is slightly different between the color dot plots of alidot and the black and white version produced by RNAfold -p. In the latter case the format is

6 21 0.9309 ubox
7 20 0.9661 ubox

The first 2 columns give the position of the base pair, the third column is the edge length of the square (1 being a full square). The PostScript macros ubox or lbox determine whether the dot is plotted in the upper right or the lower left triangle.

In the color version ubox and lbox take two more arguments determining the color of the dots

6 21 0.9309 0.00 0.60 ubox
7 20 0.9661 0.00 0.60 ubox

Column 4 contains 0.16*(number of different base types -1), and column 5 contains max[0, 1-0.4*(number of inconsistent sequences)]. The computation of the three hsb-color parameters from these values is encoded in the definition of the PostScript macros.

Note that the edge length (column 3) is the square root of the equilibrium pairing probability in the output of RNAfold -p or the square root of prediction frequency. If alidot -p is used, the this column contains the square root of the pairing probabilities averaged over all sequences in the alignment file.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

4. Perl Utilities

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

4.1 Split ViennaRNA Files

Output from the Vienna RNA Package oftentimes contains the minimum energy structures of list of sequences in a single file. The format of such files is

> 1_10770_lmnt_1  begin= 7 end= 64
AGUCUACGUGGACCGACAAAGACAGAUUCUUUGAGGGAGCUUAGCUCAACGUAGUUCU
((.((((((...((..(...((.....))...).))((((...)))).))))))..))
> 1_10770_lmnt_2  begin= 78 end= 98
AUUAGAGAGCAGAUCUCUGAU
((((((((.....))))))))
> 1_10770_lmnt_3  begin= 116 end= 136
AAAGGCGAGAAAAACGCCUUU
(((((((.......)))))))
....

The entry for each sequence begin with a (single) comment line (beginning with >) containing the name of the sequence and optional comments. The following two lines contain the sequence and structure in dot-bracket notation. The structure entry may be followed by the folding energy in parenthesis.

The script split.pl splits such a file into separate files for each sequence.

split.pl ViennaRNA_file

The filename is determined by first string in the comment line, to which .mfe is appended. The first file created from the above example is thus 1_10770_lmnt_1.mfe.

This script is necessary since alidot expects all structures to be located in separate files.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

4.2 Zooming into Dot Plots

The Package provides two utilities for zooming into _dp.ps files: dpzoom.pl extracts the diagonal block between two sequence positions. The defaults for the first and large position are 1 and the sequence length, respectively.

dpzoom.pl [-f first] [-l last] [dp_file]

The script reads either from the file dp_file if its name is provided as an argument, or it reads from stdin. The output is written to stdout.

A variant of this script is dprzoom.pl. This script in addition to extracting a box along the diagonal rotates the graphics by 45 degrees. It can be used to obtain details for the band along the diagonal, i.e., for shorter range interactions. It is used in the same way as dpzoom.pl

dprzoom.pl [-f first] [-l last] [dp_file]

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

4.3 PostScript Mountain Plots

The Perl5 script cmount.pl produces colored Hogeweg mountain representation in PostScript from _dp.ps generated by alidot or dp_zoom.

Here the Hogeweg mountain representation is based on the following definition:
mm[i], mp[i]=number of base pairs enclosing base i. The script reads from stdin and writes to stdout.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

4.4 Consensus Sequence

The Perl5 script consensus.pl prints the consensus sequence for part of an Clustal W alignment and the corresponding piece of structure in bracket notation if the text output of alidot is provided.

consens.pl [-f from] [-l last] clustalw.aln [alidot.out]

The name of the CLUSTAL W file is a required argument. If the name of the file containing the corresponding alidot text output is passed as the second argument, then the consensus structure is extracted in dot-bracket notation.

The script can be used to extract a subsequence/substracture by specifying the first (option -f) and/or last (option -l) sequence position in the alignment file that is to be printed. Defaults are -f 1 and -l length of aligned sequences.

The following additional information is printed to stderr: The length of the aligned sequence, the number of conserved positions, and the mean pairwise sequence identity for both the complete alignment and sequence window defined by the -f and -l options.

This script is particularly useful as the first step towards producing input files for a secondary structure editor such as XRNA.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

4.5 Annotated RNA Secondary Structure Plots

The script anote.pl read an _ss.ps file as produced e.g. by RNAplot or RNAfold and adds circles around bases with complentary mutations and marks bases that are incompatible in some sequences using a grayscale. The original .ps file is renamed to .ps.bak.

anote.pl  alidot.out rna_ss.ps [offset]

You may create a secondary structure plot for your alidot output alidot.out obtained with the CLUSTAL W data.aln for the sequence window from, say, positions 476 to 756 in the alignment by using the following command line:

consensus.pl -f 476 -l 756 data.aln alidot.out | RNAplot ;
anote.pl 476-756_ss.ps alidot.out 475

RNAplot is part of the Vienna RNA Package 1.3 and higher.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

4.6 Annotate XRNA Secondary files

XRNA is an interactive program for drawing and editing secondary structures. It has its own file format with the usual extension .ss. Appart from its own format, XRNA reads Michael Zucker's .xt files which do not allow to pass information beyond the list of base pairs to XRNA.

The script xrna.pl reads the output from alidot and edits a previously produced .ss file to include circles around bases with complentary mutations and gray letter for bases that are incompatible in some sequences. The original .ss file is renamed to .ss.bak.

Use the following procedure to create a secondary structure plot for your alidot output alidot.out obtained with the CLUSTAL W data.aln.

Identify the region that you want to plot, say positions 476 to 756 in the alignment.
Run consensus.pl -f 476 -l 756 data.aln alidot.out > fragment.dat to obtain the consensus sequence and structure for the reason of interest.
Convert fragment.dat into Zucker style base pair file using
b2ct < fragment.dat > fragment.ct.
Invoke XRNA, load the structure file fragment.ct, edit the secondary structure drawing, and save the drawing as fragment.ss. See the XRNA Manual for details.
Run xrna.pl alidot.out fragment.ss 475 to annotate the .ss file. The third argument 475 is the offset of the sequence fragment of interest. It takes the value 0 (and can be omitted) if the fragment begins at the first position of the CLUSTAL W alignment.
Run XRNA in batchmode to produce a PostScript file from fragments.ss. See the XRNA Manual for details.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

5. References

Ivo L. Hofacker, Martin Fekete, Christoph Flamm, Martijn A. Huynen, Susanne Rauscher, Paul E. Stolorz, and Peter F. Stadler: Automatic Detection of Conserved RNA Structure Elements in Complete RNA Virus Genomes. Nucl. Acids Res. 26, 3825--3836 (1998).
Ivo L. Hofacker and Peter F. Stadler: Automatic Detection of Conserved Base Pairing Patterns in RNA Virus Genomes. Comp. & Chem. 23, 401--414 (1999).
Ivo L. Hofacker, Martin Fekete, Peter F. Stadler: Secondary Structure Prediction for Aligned RNA Sequences J.Mol.Biol. 319, 1059--1066 (2002).

[Top]

[Contents]

[Index]

[ ? ]

1. Introduction
2. Input Files
3. Output Files
4. Perl Utilities
5. References

[Top]

[Contents]

[Index]

[ ? ]

About This Document

This document was generated by Ivo Hofacker on September, 2 2003 using texi2html 1.67.

The buttons in the navigation panels have the following meaning:

Button	Name	Go to	From 1.2.3 go to
[ < ]	Back	previous section in reading order	1.2.2
[ > ]	Forward	next section in reading order	1.2.4
[ << ]	FastBack	beginning of this chapter or previous chapter	1
[ Up ]	Up	up section	1.2
[ >> ]	FastForward	next chapter	2
[Top]	Top	cover (top) of document
[Contents]	Contents	table of contents
[Index]	Index	concept index
[ ? ]	About	about (this page)

where the Example assumes that the current position is at Subsubsection One-Two-Three of a document of the following structure:

1. Section One
- 1.1 Subsection One-One
  - ...
- 1.2 Subsection One-Two
  - 1.2.1 Subsubsection One-Two-One
  - 1.2.2 Subsubsection One-Two-Two
  - 1.2.3 Subsubsection One-Two-Three <== Current Position
  - 1.2.4 Subsubsection One-Two-Four
- 1.3 Subsection One-Three
  - ...
- 1.4 Subsection One-Four

This document was generated by Ivo Hofacker on September, 2 2003 using texi2html 1.67.

3.1 Overview
3.2 Text Output		Text Output Format
3.3 PostScript Output		PostScript

Conserved RNA Structures

1. Introduction

2. Input Files

3. Output Files

3.1 Overview

3.2 Text Output

3.3 PostScript Output

4. Perl Utilities

4.1 Split ViennaRNA Files

4.2 Zooming into Dot Plots

4.3 PostScript Mountain Plots

4.4 Consensus Sequence

4.5 Annotated RNA Secondary Structure Plots

4.6 Annotate XRNA Secondary files

5. References

Table of Contents

About This Document