The programs alidot
and pfrali
detect conserved secondary
structure elements in relatively small sets of RNAs by combining
multiple sequence alignments and secondary structure predictions. Both a
(good) sequence alignment and predicted secondary structure predictions
for each sequence in the alignment must be provided as inputs. While
alidot
works with single structures, its variant pfrali
uses base pairing probability matrices.
The starting point the analysis of conserved secondary structure elements is a list of all predicted base pairs. This list will in general not be a valid secondary structure, i.e., it will violate one or both of the following two conditions:
The basic idea behind both alidot
and pfrali
is to
sort the individual base pairs by their credibility and to
reduce the number of entries in the list by subsequent filtering
steps until only those secondary structure elements are left that
are consistetly predicted.
Clearly the sorting procedure is of crucial importance. For each predicted base pair (i.j) we store the nucleotides occurring in the corresponding positions in the sequence alignment. We shall call a sequence non-compatible with a base pair (i.j) if the two nucleotides at positions i and j would form a non-standard base pair such as GA or UU. A sequence is compatible with base pair (i.j) if the two nucleotides form either one of the follow six combinations: GC, CG, AU, UA, GU, UG.
When different standard combinations are found for a particular base pair (i.j) we may speak of consistent mutations. If we find combinations such as GC and CG or GU and UA, where both positions are mutated at once we have compensatory mutations. The occurrence of consistent and, in particular, compensatory mutations strongly supports a predicted base pair, at least in the absence of non-consistent mutations.
From the frequencies f_{ij} with which (i.j) is predicted in the sample of sequences we derive a pseudo-entropy
Sij = -\sumk fikln fik - \sumk fkjln fkj + fijln fij
involving i and j, respectively. The pseudo-entropy is a measure for the reliability with which (i.j) is predicted.We call a base pair (i.j) symmetric if j is the most frequently predicted pairing partner of i and if i is the most frequently predicted pairing partner of j. Note that for each sequence position i there is at most one symmetric base pair involving i. A symmetric base pair (i.j) has necessarily a rather large value f_{ij}; in particular, it does not allow a large number of structural alternatives.
The basic logic of both programs in similar: In the first step, a list of "believable base pairs" is extracted from the set of all pairs that are contained in the input. The details of the procedure are not the same for both programs.
In the first processing step of alidot
, all but the most
frequent pair (i.j) for each base i is removed. The remaining
list is then sorted according to the following hierarchical criteria:
In contrast, pfrali
does not prune the list in a preprocessing
step and uses the follwing two-step mechanism for sorting the list of
base pairs:
The next step is common to both programs: the sorted list in reduced by running through it and removing all base pairs that cross with higher ranking ones and hence would not yield a valid secondary structure.
The resulting secondary structure will, in general, still contain ill-supported base pairs. These are removed by three subsequent filtering steps:
The search algorithms for detecting conserved RNA structure elements are based on both a multiple sequence alignment and predicted secondary structures. These must be present in the following form:
CLUSTAL W
format. In particular, the first line of
the alignment file must begin with the word CLUSTAL.
*.mfe
for minimum energy files,
and blah_dp.ps
or blah.pf
for base pairing probabilites.)
alidot
or pfrali
is invoked.
Vienna RNA Package
I/O format. This means that a minimum energy (single structure)
file must contain the secondary structure in dot-bracket notation.
The first line of a single structure file must begin with a > sign
followed by the sequence name.
*_dp.ps
produced by RNAfold -p
or the .pf
produced for instance
by the message passing version of RNAfold
. The .pf
files
are plain ascii
files containing a line of the form
i j p
for each base pairs (i,j) with sufficiently high pairing
probability p.
Look at the Vienna RNA Web
site for more information on RNAfold
and related programs.
The program alidot
contains the detection algorithm described in
Hofacker:98a. It reads CLUSTAL W
alignment file from
stdin
and uses single structure predictions for each of the
aligned sequences as input.
The program pfrali
is a variant of alidot
that uses
base pairing probability matrices instead of single structures as
input. It is described in detail in Hofacker:98b.
pfrali
reads a CLUSTAL W
alignment file from stdin
and uses base pairing probability matrices in the form of the
*_dp.ps
as produced by RNAfold -p
or as .pf
files
as produced by the parallel (mpi
) version of McCaskill's
algorithm Fekete:98a.
Both programs are used in a similar way:
alidot -e kin < /home/murxer/ali/AlignedSeqs.aln > AlignedSeqs.aliout pfrali < /home/murxer/ali/AlignedSeqs.aln > AlignedSeqs.pfrali
The CLUSTAL W
file named AlignedSeqs.aln
in the
directory /home/murxer/ali/
is read. Then the program searches
for the files containing the structure predictions in the current
directory. alidot
will search for *.mfe
files unless
the -e
option is specified. pfrali
first searches for
*.pf
files. If these are not found, it tries *_dp.ps
files
which must be PostScript
files in the format specified
below.
Both programs produce both text and PostScript
output. The text
output is written to stdout
and can be redirected into files as
the example above. The PostScript
output of alidot
is
written to aln_dp.ps
, the PostScript
output of
pfrali
is written to aln_pf_dp.ps
.
The text output generated by alidot
and pfrali
has the
following common format:
15 sequences; length of alignment 9538. 25048 different pairs, reduced: 2755. 1425 believable pairs 0 2 6 71 442 904 5333 5348 0 66.7% 1.891 CG:4 GC:2 GU:1 UG:7 UA:1 3879 3923 0 53.3% 1.643 GU:1 UG:9 AU:2 UA:3 8207 8342 0 66.7% 1.983 CG:4 GU:2 UG:1 AU:8 3133 3193 0 60.0% 2.288 CG:1 GU:8 UG:1 AU:5 525 549 0 100.0% 0.000 GC:11 GU:1 AU:3 9101 9117 0 100.0% 0.000 CG:12 UG:1 AU:2 ... 4633 4689 0 26.7% 3.423 CG:15 + ... 5936 6011 0 13.3% 3.962 CG:2 GC:8 GU:2 UG:3 * + 2230 2246 0 40.0% 2.113 CG:9 UG:5 UA:1 * 2875 2924 0 33.3% 2.629 CG:9 GU:2 UG:4 * 7053 7070 0 33.3% 3.041 CG:13 UG:1 UA:1 * ... 2040 2048 1 13.3% 4.338 GC:14 * + 2726 2737 2 46.7% 2.840 GC:9 GU:1 UG:1 AU:2 ((((......(((..((((..(((((((..(
The header of the output contains the number of sequences from the alignment file for which structure files with valid secondary structures have been found in the current directory and some simple statistics. The number of reduced pairs is the number of entries in the color dot plots. The second line gives the number of "believable base pairs" followed by number of pairs with 6, 5, 4, 3, 2, 1, and no different types. The final entry is the number pairs with at least one incompatible sequence among the believable base pairs (The example shown here is taken from a sample of complete Hepatitis C virus genomes.)
The base pairing data follow from line 3. The base pairs are sorted by credibility according to the algorithm described in detail in ref.Hofacker:98a.
The last line of the text output contains the conserved structure in bracket notation.
The numbers in the first two columns identify the predicted base
pair. The numbering conforms the Clustal W
input file. The third
column contains the number of sequence in which i.j is not one of the
six standard RNA base pairs. Column 4 gives the percentage of sequences
for which the base was predicted. Column 5 contains the
pseudo-entropy. The following columns show which of the six standard
base pairs occurs in the aligned sequences.
At the end of each record there may be a + sign and/or a *. The + sign indicates non-symmetric base pairs, that is, base pairs i.j for which either i was not the most frequently predicted pairing partner of j or vice versa. A * indicates that the base pair conflicts with a base pair that is predicted with a higher credibility.
Dot plots are simple way of representing contact structures. Each base
pair i.j corresponds to a small square in a matrix of size sequence
length times sequence length. The size and color of this "dot" are
used to encode additional information. The simplest version of the
dot plots are produced automatically by RNAfold -p
. In these
files, which serve as input for pfrali
in the present package,
the area of the squares is proportional to the (predicted) equilibrium
base pairing probability.
This type of output is modified to include information of the reliability of a prediction. As a general rule we use
The color codes are as follows:
red
all sequences have the same two nucleotides
ocre
two types of base pairs occur
green
three types of base pairs occur
turquoise
four types of base pairs occur
blue
five types of base pairs occur
violet
all six types of base pairs occur
The saturation scale is as follows:
saturated
no inconsistent sequences
pale color
one sequence in the sample is inconsistent
very pale
two sequence in the sample are inconsistent
invisible
more than two sequences in the sample are
inconsistent
The interpretation of the size of the dots differs somewhat between
the output of alidot
and pfrali
. It will be described
in detail in the following two sections.
The dot plot files generated by alidot
and pfrali
have
entries both in the upper right and the lower left triangle. The
two halves of the matrix do not contain the same information.
The upper right triangle displays all predicted base pairs with not more than two inconsistent sequences.
The lower left triangle contains only the secondary structure formed by the most believable base pairs. It still contains pairs that are not in the final prediction.
Technical Note. We include here a brief description of the
internal format of _dp.ps
files, since it is possible to
recover the most interesting data from the PostScript output.
After the usual PostScript header a number of macros are defined. The first piece of useful information is the string
/sequence {(a_string) } def
In the color dot plots produced by alidot
and pfrali
it
contains a predicted secondary structure in dot-bracket notation.
In the output RNAfold -p
it contains the RNA sequence.
After a number of further definitions the base pairing data are
listed. The format is slightly different between the color dot plots
of alidot
and pfrali
and the black and white version
produced by RNAfold -p
. In the latter case the format is
6 21 0.9309 ubox 7 20 0.9661 ubox
The first 2 columns give the position of the base pair, the third column
is the edge length of the square (1 being a full square). The PostScript
macros ubox
or lbox
determine whether the dot is plotted in
the upper right or the lower left triangle.
In the color version ubox
and lbox
take two more
arguments determining the color of the dots
6 21 0.9309 0.00 0.60 ubox 7 20 0.9661 0.00 0.60 ubox
Column 4 contains 0.16*(number of different base types -1), and
column 5 contains max[0, 1-0.4*(number of inconsistent sequences)].
The computation of the three hsb
-color parameters from these
values is encoded in the definition of the PostScript macros.
Note that the edge length (column 3) is the square root of the
equilibrium pairing probability in the output of RNAfold -p
or the square root of prediction frequency. If pfrali
is
used, the this column contains the square root of the pairing
probabilities averaged over all sequences in the alignment file.
Output from the Vienna RNA Package
oftentimes contains
the minimum energy structures of list of sequences in a single file.
The format of such files is
> 1_10770_lmnt_1 begin= 7 end= 64 AGUCUACGUGGACCGACAAAGACAGAUUCUUUGAGGGAGCUUAGCUCAACGUAGUUCU ((.((((((...((..(...((.....))...).))((((...)))).))))))..)) > 1_10770_lmnt_2 begin= 78 end= 98 AUUAGAGAGCAGAUCUCUGAU ((((((((.....)))))))) > 1_10770_lmnt_3 begin= 116 end= 136 AAAGGCGAGAAAAACGCCUUU (((((((.......))))))) ....
The entry for each sequence begin with a (single) comment line
(beginning with >
) containing the name of the sequence and
optional comments. The following two lines contain the sequence
and structure in dot-bracket notation. The structure entry
may be followed by the folding energy in parenthesis.
The script split.pl
splits such a file into separate files
for each sequence.
split.pl ViennaRNA_file
The filename is determined by first string
in the comment line, to which .mfe
is appended. The first
file created from the above example is thus 1_10770_lmnt_1.mfe
.
This script is necessary since alidot
expects all structures
to be located in separate files.
The Package provides two utilities for zooming into _dp.ps
files:
dpzoom.pl
extracts the diagonal block between two sequence
positions. The defaults for the first and large position are 1 and
the sequence length, respectively.
dpzoom.pl [-f first] [-l last] [dp_file]
The script reads either from the file dp_file
if its name is
provided as an argument, or it reads from stdin
. The output
is written to stdout
.
A variant of this script is dprzoom.pl
. This script in addition
to extracting a box along the diagonal rotates the graphics by 45
degrees. It can be used to obtain details for the band along the
diagonal, i.e., for shorter range interactions. It is used in the same
way as dpzoom.pl
@pindex dprzoom.pl
dprzoom.pl [-f first] [-l last] [dp_file]
The Perl5
script cmount.pl
produces colored Hogeweg mountain
representation in PostScript from _dp.ps
generated by
alidot
or dp_zoom
.
Here the Hogeweg mountain representation is based on the following
definition:
mm[i], mp[i]=number of base pairs enclosing base i.
The script reads from stdin
and writes to stdout
.
The Perl5
script consensus.pl
prints the consensus sequence
for part of an Clustal W
alignment and the corresponding piece of
structure in bracket notation if the text output of alidot
is
provided.
consens.pl [-f from] [-l last] clustalw.aln [alidot.out]
The name of the CLUSTAL W file is a required argument. If the name of
the file containing the corresponding alidot
text output
is passed as the second argument, then the consensus structure is
extracted in dot-bracket notation.
The script can be used to extract a subsequence/substracture by
specifying the first (option -f
) and/or last (option -l
)
sequence position in the alignment file that is to be printed.
Defaults are -f 1
and -l
length of aligned sequences.
The following additional information is printed to stderr
:
The length of the aligned sequence, the number of conserved positions,
and the mean pairwise sequence identity for both the complete alignment
and sequence window defined by the -f
and -l
options.
This script is particularly useful as the first step towards producing
input files for a secondary structure editor such as XRNA
.
The script anote.pl
read an _ss.ps
file as produced
e.g. by RNAplot
or RNAfold
and adds circles around
bases with complentary mutations and marks bases that are incompatible
in some sequences using a grayscale.
The original .ps
file is renamed to .ps.bak
.
anote.pl alidot.out rna_ss.ps [offset]
You may create a secondary structure plot for your alidot
output
alidot.out
obtained with the CLUSTAL W data.aln
for the
sequence window from, say, positions 476 to 756 in the alignment by
using the following command line:
consensus.pl -f 476 -l 756 data.aln alidot.out | RNAplot ; anote.pl 476-756_ss.ps alidot.out 475
RNAplot
is part of the Vienna RNA Package 1.3
and higher.
XRNA
is an interactive program for drawing and editing secondary
structures. It has its own file format with the usual extension
.ss
. Appart from its own format, XRNA
reads Michael Zucker's
.xt
files which do not allow to pass information beyond the list
of base pairs to XRNA
.
The script xrna.pl
reads the output from alidot
or
pfrali
and edits a previously produced .ss
file to
include circles around bases with complentary mutations and gray letter
for bases that are incompatible in some sequences.
The original .ss
file is renamed to .ss.bak
.
Use the following procedure to create a secondary structure plot for
your alidot
output alidot.out
obtained with the CLUSTAL W
data.aln
.
consensus.pl -f 476 -l 756 data.aln alidot.out >
fragment.dat
to obtain the consensus sequence and structure
for the reason of interest.
fragment.dat
into Zucker style base pair file
using b2ct < fragment.dat > fragment.ct
.
XRNA
, load the structure file fragment.ct
,
edit the secondary structure drawing, and save the drawing as
fragment.ss
. See the XRNA
Manual for details.
xrna.pl alidot.out fragment.ss 475
to annotate the
.ss
file. The third argument 475 is the offset of the
sequence fragment of interest. It takes the value 0 (and can be
omitted) if the fragment begins at the first position of the
CLUSTAL W alignment.
XRNA
in batchmode to produce a PostScript file from
fragments.ss
. See the XRNA
Manual for details.
Ivo Hofacker <ivo@tbi.univie.ac.at>