[Top] | [Contents] | [Index] | [ ? ] |
1. Introduction | ||
2. Input Files | ||
3. Output Files | ||
4. Perl Utilities | Useful Perl/Perl5 scripts | |
5. References |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The program alidot
(ALIgned DOT-plots) detects conserved secondary
structure elements in relatively small sets of RNAs by combining
multiple sequence alignments and secondary structure predictions. Both a
(good) sequence alignment and predicted secondary structure predictions
for each sequence in the alignment must be provided as inputs.
alidot
works either with predicted mfe structures, or with
base pairing probability matrices.
The starting point the analysis of conserved secondary structure elements is a list of all predicted base pairs. This list will in general not be a valid secondary structure, i.e., it will violate one or both of the following two conditions:
No nucleotide takes part in more than one base pair.
Base pairs never cross, that is, for two base pairs (i.j) and (k.l) we have never i<k<j<l or k<i<l<j.
The basic idea behind alidot
is to sort the individual base pairs by
their credibility and to
reduce the number of entries in the list by subsequent filtering
steps until only those secondary structure elements are left that
are consistetly predicted.
Clearly the sorting procedure is of crucial importance. For each predicted base pair (i.j) we store the nucleotides occurring in the corresponding positions in the sequence alignment. We shall call a sequence non-compatible with a base pair (i.j) if the two nucleotides at positions i and j would form a non-standard base pair such as GA or UU. A sequence is compatible with base pair (i.j) if the two nucleotides form either one of the follow six combinations: GC, CG, AU, UA, GU, UG.
When different standard combinations are found for a particular base pair (i.j) we may speak of consistent mutations. If we find combinations such as GC and CG or GU and UA, where both positions are mutated at once we have compensatory mutations. The occurrence of consistent and, in particular, compensatory mutations strongly supports a predicted base pair, at least in the absence of non-consistent mutations.
From the frequencies f_{ij} with which (i.j) is predicted in the sample of sequences we derive a pseudo-entropy
Sij = -Σk fikln fik - Σk fkjln fkj + fijln fij
involving i and j, respectively. The pseudo-entropy is a measure for the reliability with which (i.j) is predicted.We call a base pair (i.j) symmetric if j is the most frequently predicted pairing partner of i and if i is the most frequently predicted pairing partner of j. Note that for each sequence position i there is at most one symmetric base pair involving i. A symmetric base pair (i.j) has necessarily a rather large value f_{ij}; in particular, it does not allow a large number of structural alternatives.
The basic logic of the program is the following: In the first step, a list of "believable base pairs" is extracted from the set of all pairs that are contained in the input, in the second step this list is sorted by one of several possible "credibilty" criteria.
The first sorting method is strictly hierarchical and was used for the
original alidot
program, as described in Hofacker:98.
It uses the following criteria:
The more sequences are non-compatible with (i.j), the less credible is the base pair.
Symmetric base pairs are more credible than other base pairs.
A base pair with more consistent mutations is more credible.
Base pairs with smaller values of a pseudo-entropy are more credible.
The --compf=2
option will select this compare function.
The second sorting method was introduced in the original pfrali
program Hofacker:99, now available as alidot -p
.
It uses the follwing two-step mechanism for sorting the list of
base pairs:
The more sequences are non-compatible with (i.j), the less credible is the base pair.
If the number of non-compatible sequences is the same, then the pairs are ranked by the product of the mean probability and the number of different pairing combinations.
Use the --compf=3
option to select this compare function.
Finally it is possible to use the covariance term introduced in
Hofacker:02 for the RNAalifold
program. This credibility
criterium is the sum of three terms:
crdij = log(pij) + Cij -Hij
where pij is the mean probabilty of the pair (i,j), Hij is the fraction of sequences inconsistent with the pair andCij = ΣXY,X'Y' fij(XY)DXY,X'Y' fij(X'Y'),
where fij(XY) is the frequency of the combination XY at positions i and j of the alignment and D is the hamming distance between XY and X'Y' if both are base pairs, 0 otherwise.Finally, the sorted list in reduced by running through it and removing all base pairs that cross with higher ranking ones and hence would not yield a valid secondary structure.
The resulting secondary structure will, in general, still contain ill-supported base pairs. These are removed by three subsequent filtering steps:
All pairs are removed that have more than two non-compatible sequences (for large samples this number may have to be increased), as well as pairs with two non-compatible sequences adjacent to a pair that also has non-compatible sequences.
Next we omit all isolated base pairs. The remaining pairs are collected into helices.
Only those helices are retained that satisfy the following conditions:
the highest ranking base pair must not have non-compatible sequences.
for the highest ranking base pair the product of the mean probability and the number of different pairing combinations must be greater than 0.3
if the helix has length 2, it must not have more non-compatible sequences than consistent mutations.
2. Input Files |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The search algorithms for detecting conserved RNA structure elements are based on both a multiple sequence alignment and predicted secondary structures. These must be present in the following form:
The multiple sequence alignment must be contained in a single
file in CLUSTAL W
format. In particular, the first line of
the alignment file must begin with the word CLUSTAL.
The predicted secondary structures must be contained in individual
files. The names of these files must conform the names of the
sequences in the CLUSTAL file. The files must have the same
file extension. (For example: *.mfe
for minimum energy files,
and blah_dp.ps
or blah.pf
for base pairing probabilites.)
All structure files must reside in the directory from which
alidot
is invoked.
The structure file must have standard Vienna RNA Package
I/O format. This means that a minimum energy (single structure)
file must contain the secondary structure in dot-bracket notation.
The first line of a single structure file must begin with a > sign
followed by the sequence name.
Base pairing probability matrices are read from the *_dp.ps
produced by RNAfold -p
or the .pf
produced for instance
by the message passing version of RNAfold
. The .pf
files
are plain ascii
files containing a line of the form
i j p
for each base pairs (i,j) with sufficiently high pairing
probability p.
Look at the Vienna RNA Web
site for more information on RNAfold
and related programs.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
3.1 Overview | ||
3.2 Text Output | Text Output Format | |
3.3 PostScript Output | PostScript |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The program alidot
contains the detection algorithm described in
Hofacker:98 and Hofacker:99. It reads CLUSTAL W
alignment file from stdin
and uses predicted either predicted mfe
structures or base pairing probability matrices for each of the aligned
sequences as input.
A typical usage would be:
alidot /home/murxer/ali/AlignedSeqs.aln > AlignedSeqs.aliout alidot -p /home/murxer/ali/AlignedSeqs.aln > AlignedSeqs.pfrali |
The CLUSTAL W
file named AlignedSeqs.aln
in the
directory /home/murxer/ali/
is read. Then the program searches
for the files containing the structure predictions in the current
directory. In the first case alidot
will search for *.mfe
files unless the -e
option is specified. With the -p
option
it first searches for *.pf
files. If these are not found, it tries
*_dp.ps
files which must be PostScript
files in the
format specified below.
In both cases text and PostScript
output is produced. The text
output is written to stdout
and can be redirected into files as
the example above. The PostScript
output of alidot
is
written to aln_dp.ps
or aln_pf_dp.ps
(with -p
).
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The text output generated by alidot
has the
following common format:
15 sequences; length of alignment 9538. 25048 different pairs, reduced: 2755. 1425 believable pairs 0 2 6 71 442 904 5333 5348 0 66.7% 1.891 CG:4 GC:2 GU:1 UG:7 UA:1 3879 3923 0 53.3% 1.643 GU:1 UG:9 AU:2 UA:3 8207 8342 0 66.7% 1.983 CG:4 GU:2 UG:1 AU:8 3133 3193 0 60.0% 2.288 CG:1 GU:8 UG:1 AU:5 525 549 0 100.0% 0.000 GC:11 GU:1 AU:3 9101 9117 0 100.0% 0.000 CG:12 UG:1 AU:2 ... 4633 4689 0 26.7% 3.423 CG:15 + ... 5936 6011 0 13.3% 3.962 CG:2 GC:8 GU:2 UG:3 + 2230 2246 0 40.0% 2.113 CG:9 UG:5 UA:1 2875 2924 0 33.3% 2.629 CG:9 GU:2 UG:4 7053 7070 0 33.3% 3.041 CG:13 UG:1 UA:1 ... 2040 2048 1 13.3% 4.338 GC:14 + 2726 2737 2 46.7% 2.840 GC:9 GU:1 UG:1 AU:2 ((((......(((..((((..(((((((..( |
The header of the output contains the number of sequences from the alignment file for which structure files with valid secondary structures have been found in the current directory and some simple statistics. The number of reduced pairs is the number of entries in the color dot plots. The second line gives the number of "believable base pairs" followed by number of pairs with 6, 5, 4, 3, 2, 1, and no different types. The final entry is the number pairs with at least one incompatible sequence among the believable base pairs (The example shown here is taken from a sample of complete Hepatitis C virus genomes.)
The base pairing data follow from line 3. The base pairs are sorted by credibility according to the algorithm described in detail in ref. Hofacker:98.
The last line of the text output contains the conserved structure in bracket notation.
The numbers in the first two columns identify the predicted base
pair. The numbering conforms the Clustal W
input file. The third
column contains the number of sequence in which i.j is not one of the
six standard RNA base pairs. Column 4 gives the percentage of sequences
for which the base was predicted. Column 5 contains the
pseudo-entropy. The following columns show which of the six standard
base pairs occurs in the aligned sequences.
At the end of each record there may be a + sign indicating that the base pair conflicts with a base pair that is predicted with a higher credibility.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Dot plots are simple way of representing contact structures. Each base
pair i.j corresponds to a small square in a matrix of size sequence
length times sequence length. The size and color of this "dot" are
used to encode additional information. The simplest version of the
dot plots are produced automatically by RNAfold -p
. In these
files, which serve as input for alidot -p
in the present package,
the area of the squares is proportional to the (predicted) equilibrium
base pairing probability.
This type of output is modified to include information of the reliability of a prediction. As a general rule we use
the size of the square to indicate the frequency of prediction;
the color hue of the square to indicate the number of different consistent nucleotide pairs occurring for a given base pair; and
the saturation of the color to indicate the number of sequences that are not consistent with the base pair in the sense that they have nucleotides as the relevant positions that do not form one of the six standard RNA base pairs.
The color codes are as follows:
.
red
all sequences have the same two nucleotides
ocre
two types of base pairs occur
green
three types of base pairs occur
turquoise
four types of base pairs occur
blue
five types of base pairs occur
violet
all six types of base pairs occur
The saturation scale is as follows:
.
saturated
no inconsistent sequences
pale color
one sequence in the sample is inconsistent
very pale
two sequence in the sample are inconsistent
invisible
more than two sequences in the sample are
inconsistent
The dot plot files generated by alidot
have
entries both in the upper right and the lower left triangle. The
two halves of the matrix do not contain the same information.
The upper right triangle displays all predicted base pairs with not more than two inconsistent sequences.
The lower left triangle contains only the secondary structure formed by the most believable base pairs. It still contains pairs that are not in the final prediction.
Technical Note. We include here a brief description of the
internal format of _dp.ps
files, since it is possible to
recover the most interesting data from the PostScript output.
After the usual PostScript header a number of macros are defined. The first piece of useful information is the string
/sequence {(a_string) } def |
In the color dot plots produced by alidot
it
contains a predicted secondary structure in dot-bracket notation.
In the output RNAfold -p
it contains the RNA sequence.
After a number of further definitions the base pairing data are
listed. The format is slightly different between the color dot plots
of alidot
and the black and white version
produced by RNAfold -p
. In the latter case the format is
6 21 0.9309 ubox 7 20 0.9661 ubox |
The first 2 columns give the position of the base pair, the third column
is the edge length of the square (1 being a full square). The PostScript
macros ubox
or lbox
determine whether the dot is plotted in
the upper right or the lower left triangle.
In the color version ubox
and lbox
take two more
arguments determining the color of the dots
6 21 0.9309 0.00 0.60 ubox 7 20 0.9661 0.00 0.60 ubox |
Column 4 contains 0.16*(number of different base types -1), and
column 5 contains max[0, 1-0.4*(number of inconsistent sequences)].
The computation of the three hsb
-color parameters from these
values is encoded in the definition of the PostScript macros.
Note that the edge length (column 3) is the square root of the
equilibrium pairing probability in the output of RNAfold -p
or the square root of prediction frequency. If alidot -p
is
used, the this column contains the square root of the pairing
probabilities averaged over all sequences in the alignment file.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
4.1 Split ViennaRNA Files | ||
4.2 Zooming into Dot Plots | ||
4.3 PostScript Mountain Plots | Convert Dot Plots to Mountain Plots | |
4.4 Consensus Sequence | ||
4.5 Annotated RNA Secondary Structure Plots | ||
4.6 Annotate XRNA Secondary files | Annotate Secondary Structure Plots with XRNA |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Output from the Vienna RNA Package
oftentimes contains
the minimum energy structures of list of sequences in a single file.
The format of such files is
> 1_10770_lmnt_1 begin= 7 end= 64 AGUCUACGUGGACCGACAAAGACAGAUUCUUUGAGGGAGCUUAGCUCAACGUAGUUCU ((.((((((...((..(...((.....))...).))((((...)))).))))))..)) > 1_10770_lmnt_2 begin= 78 end= 98 AUUAGAGAGCAGAUCUCUGAU ((((((((.....)))))))) > 1_10770_lmnt_3 begin= 116 end= 136 AAAGGCGAGAAAAACGCCUUU (((((((.......))))))) .... |
The entry for each sequence begin with a (single) comment line
(beginning with >
) containing the name of the sequence and
optional comments. The following two lines contain the sequence
and structure in dot-bracket notation. The structure entry
may be followed by the folding energy in parenthesis.
The script split.pl
splits such a file into separate files
for each sequence.
split.pl ViennaRNA_file |
The filename is determined by first string
in the comment line, to which .mfe
is appended. The first
file created from the above example is thus 1_10770_lmnt_1.mfe
.
This script is necessary since alidot
expects all structures
to be located in separate files.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The Package provides two utilities for zooming into _dp.ps
files:
dpzoom.pl
extracts the diagonal block between two sequence
positions. The defaults for the first and large position are 1 and
the sequence length, respectively.
dpzoom.pl [-f first] [-l last] [dp_file] |
The script reads either from the file dp_file
if its name is
provided as an argument, or it reads from stdin
. The output
is written to stdout
.
A variant of this script is dprzoom.pl
. This script in addition
to extracting a box along the diagonal rotates the graphics by 45
degrees. It can be used to obtain details for the band along the
diagonal, i.e., for shorter range interactions. It is used in the same
way as dpzoom.pl
dprzoom.pl [-f first] [-l last] [dp_file] |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The Perl5
script cmount.pl
produces colored Hogeweg mountain
representation in PostScript from _dp.ps
generated by
alidot
or dp_zoom
.
Here the Hogeweg mountain representation is based on the following
definition:
mm[i], mp[i]=number of base pairs enclosing base i.
The script reads from stdin
and writes to stdout
.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The Perl5
script consensus.pl
prints the consensus sequence
for part of an Clustal W
alignment and the corresponding piece of
structure in bracket notation if the text output of alidot
is
provided.
consens.pl [-f from] [-l last] clustalw.aln [alidot.out] |
The name of the CLUSTAL W file is a required argument. If the name of
the file containing the corresponding alidot
text output
is passed as the second argument, then the consensus structure is
extracted in dot-bracket notation.
The script can be used to extract a subsequence/substracture by
specifying the first (option -f
) and/or last (option -l
)
sequence position in the alignment file that is to be printed.
Defaults are -f 1
and -l
length of aligned sequences.
The following additional information is printed to stderr
:
The length of the aligned sequence, the number of conserved positions,
and the mean pairwise sequence identity for both the complete alignment
and sequence window defined by the -f
and -l
options.
This script is particularly useful as the first step towards producing
input files for a secondary structure editor such as XRNA
.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The script anote.pl
read an _ss.ps
file as produced
e.g. by RNAplot
or RNAfold
and adds circles around
bases with complentary mutations and marks bases that are incompatible
in some sequences using a grayscale.
The original .ps
file is renamed to .ps.bak
.
anote.pl alidot.out rna_ss.ps [offset] |
You may create a secondary structure plot for your alidot
output
alidot.out
obtained with the CLUSTAL W data.aln
for the
sequence window from, say, positions 476 to 756 in the alignment by
using the following command line:
consensus.pl -f 476 -l 756 data.aln alidot.out | RNAplot ; anote.pl 476-756_ss.ps alidot.out 475 |
RNAplot
is part of the Vienna RNA Package 1.3
and higher.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
XRNA
is an interactive program for drawing and editing secondary
structures. It has its own file format with the usual extension
.ss
. Appart from its own format, XRNA
reads Michael Zucker's
.xt
files which do not allow to pass information beyond the list
of base pairs to XRNA
.
The script xrna.pl
reads the output from alidot
and edits a previously produced .ss
file to
include circles around bases with complentary mutations and gray letter
for bases that are incompatible in some sequences.
The original .ss
file is renamed to .ss.bak
.
Use the following procedure to create a secondary structure plot for
your alidot
output alidot.out
obtained with the CLUSTAL W
data.aln
.
Identify the region that you want to plot, say positions 476 to 756 in the alignment.
Run consensus.pl -f 476 -l 756 data.aln alidot.out >
fragment.dat
to obtain the consensus sequence and structure
for the reason of interest.
Convert fragment.dat
into Zucker style base pair file
using
b2ct < fragment.dat > fragment.ct
.
Invoke XRNA
, load the structure file fragment.ct
,
edit the secondary structure drawing, and save the drawing as
fragment.ss
. See the XRNA
Manual for details.
Run xrna.pl alidot.out fragment.ss 475
to annotate the
.ss
file. The third argument 475 is the offset of the
sequence fragment of interest. It takes the value 0 (and can be
omitted) if the fragment begins at the first position of the
CLUSTAL W alignment.
Run XRNA
in batchmode to produce a PostScript file from
fragments.ss
. See the XRNA
Manual for details.
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Ivo L. Hofacker, Martin Fekete, Christoph Flamm, Martijn A. Huynen, Susanne Rauscher, Paul E. Stolorz, and Peter F. Stadler: Automatic Detection of Conserved RNA Structure Elements in Complete RNA Virus Genomes. Nucl. Acids Res. 26, 3825--3836 (1998).
Ivo L. Hofacker and Peter F. Stadler: Automatic Detection of Conserved Base Pairing Patterns in RNA Virus Genomes. Comp. & Chem. 23, 401--414 (1999).
Ivo L. Hofacker, Martin Fekete, Peter F. Stadler: Secondary Structure Prediction for Aligned RNA Sequences J.Mol.Biol. 319, 1059--1066 (2002).
[Top] | [Contents] | [Index] | [ ? ] |
[Top] | [Contents] | [Index] | [ ? ] |
This document was generated by Ivo Hofacker on September, 2 2003 using texi2html 1.67.
The buttons in the navigation panels have the following meaning:
Button | Name | Go to | From 1.2.3 go to |
---|---|---|---|
[ < ] | Back | previous section in reading order | 1.2.2 |
[ > ] | Forward | next section in reading order | 1.2.4 |
[ << ] | FastBack | beginning of this chapter or previous chapter | 1 |
[ Up ] | Up | up section | 1.2 |
[ >> ] | FastForward | next chapter | 2 |
[Top] | Top | cover (top) of document | |
[Contents] | Contents | table of contents | |
[Index] | Index | concept index | |
[ ? ] | About | about (this page) |
where the Example assumes that the current position is at Subsubsection One-Two-Three of a document of the following structure:
This document was generated by Ivo Hofacker on September, 2 2003
using texi2html 1.67.