rnazWindow.pl - Slice alignments in overlapping windows and process/filter alignment windows in various ways.
rnazWindow.pl [options] [file]
Size of the window (Default: 120)
Step size (Default: 40)
Slice only alignments longer than N columns. Blocks that are longer than the window size given by --window but not longer than N are kept intact and are not sliced. By default, N is set to the window size given by --window (120 unless changed).
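The interplay of window size, step size and the slicing threshold can be sketched as follows. This is a minimal illustration, not the program's actual code: the function name slice_ranges and its parameters are hypothetical, and the handling of the last window (truncated at the end of the block) is an assumption.

```python
def slice_ranges(length, window, step, max_length=None):
    """Return (start, end) column ranges for one alignment block.

    Hypothetical helper: max_length plays the role of N above and
    falls back to the window size when it is not given.
    """
    if max_length is None:
        max_length = window
    # Blocks that are not longer than max_length are kept intact.
    if length <= max_length:
        return [(0, length)]
    ranges = []
    start = 0
    while start < length:
        # Assumption: the last window is cut off at the block end.
        end = min(start + window, length)
        ranges.append((start, end))
        if end == length:
            break
        start += step
    return ranges


if __name__ == "__main__":
    print(slice_ranges(200, window=120, step=40))
    print(slice_ranges(150, window=120, step=40, max_length=200))
```

With a block of 200 columns, a window of 120 and a step of 40, the sketch yields three overlapping ranges; a block of 150 columns with N set to 200 is returned as a single unsliced range.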
Maximum fraction of gaps. If a reference sequence is used (i.e. --no-reference is not set), each sequence is compared to the reference sequence, and if the fraction of columns with gaps in the pairwise comparison is higher than X, the sequence is discarded. If no reference sequence is used, all sequences with a fraction of gaps higher than X are discarded. (Default: 0.25)
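The two modes of the gap filter can be illustrated with a small sketch. The helper name gap_fraction is hypothetical, and the exact counting rule is an assumption: in the pairwise case, columns where both sequences have a gap are ignored, and a column counts as gapped if either sequence has a gap there.

```python
def gap_fraction(seq, ref=None, gap_chars="-."):
    """Fraction of gapped columns, alone or relative to a reference.

    Hypothetical helper; the counting rule is an assumption and may
    differ from what the program does internally.
    """
    if ref is None:
        # No reference: plain fraction of gap characters in the sequence.
        if not seq:
            return 0.0
        return sum(c in gap_chars for c in seq) / len(seq)
    # With reference: drop columns that are gaps in both sequences.
    pairs = [(a, b) for a, b in zip(seq, ref)
             if not (a in gap_chars and b in gap_chars)]
    if not pairs:
        return 0.0
    gapped = sum(a in gap_chars or b in gap_chars for a, b in pairs)
    return gapped / len(pairs)


if __name__ == "__main__":
    max_gap = 0.25  # the documented default for X
    print(gap_fraction("AC-G-", ref="ACG--") > max_gap)
```

A sequence would be discarded when the returned value is higher than X.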
Maximum fraction of masked (lowercase) letters in a sequence. All sequences with a fraction of more than X lowercase letters are discarded. This is usually used for excluding repeat sequences marked by RepeatMasker, but any other information can be encoded by using lowercase letters. (Default: 0.1)
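As a rough illustration, the masked fraction of a sequence could be computed like this. The helper name masked_fraction is hypothetical, and it is an assumption that gap characters are excluded from the denominator.

```python
def masked_fraction(seq):
    """Fraction of lowercase letters among all letters of a sequence.

    Hypothetical helper; assumes gaps do not count towards the total.
    """
    letters = [c for c in seq if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.islower() for c in letters) / len(letters)


if __name__ == "__main__":
    max_masked = 0.1  # the documented default for X
    print(masked_fraction("ACgt--AC") > max_masked)
```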
Discard alignment windows with an overall mean pairwise identity smaller than X%. (Default: 50)
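One common way to compute a mean pairwise identity is sketched below. The function names are hypothetical, and the definition used here (identical columns divided by all columns that are not gaps in both sequences, compared case-insensitively) is an assumption; the program may use a different normalization.

```python
from itertools import combinations


def pairwise_identity(a, b, gap="-"):
    """Percent identity of two aligned sequences (assumed definition)."""
    cols = [(x, y) for x, y in zip(a.upper(), b.upper())
            if not (x == gap and y == gap)]
    if not cols:
        return 0.0
    # Double-gap columns are already removed, so x == y means a match.
    matches = sum(x == y for x, y in cols)
    return 100.0 * matches / len(cols)


def mean_pairwise_identity(seqs):
    """Average percent identity over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    min_id = 50  # the documented default for X
    window = ["ACGU", "ACGU", "ACGA"]
    print(mean_pairwise_identity(window) >= min_id)
```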
Minimum number of sequences in an alignment. Discard any windows with fewer than N sequences. (Default: 2)
Maximum number of sequences in an alignment. If the number of sequences in a window is higher than N, a subset of exactly N sequences is used. The greedy algorithm of the program rnazSelectSeqs.pl is used, which optimizes for a user-specified mean pairwise identity (see --opt-id). (Default: 6)
Number of different subsets of sequences that are sampled if there are more sequences in the alignment than --max-seqs. (Default: 1)
Minimum number of columns of an alignment slice. After removing sequences from the alignment, "all-gap" columns are removed. If the resulting alignment has fewer than N columns, the complete alignment is discarded.
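Removing "all-gap" columns and then checking the remaining length can be sketched in a few lines. The helper name drop_all_gap_columns and the value of min_length are hypothetical and chosen only for illustration.

```python
def drop_all_gap_columns(seqs, gap="-"):
    """Remove columns in which every sequence has a gap (hypothetical helper)."""
    keep = [i for i, col in enumerate(zip(*seqs))
            if any(c != gap for c in col)]
    return ["".join(s[i] for i in keep) for s in seqs]


if __name__ == "__main__":
    min_length = 3  # illustrative value for N, not a documented default
    cleaned = drop_all_gap_columns(["A--C", "G--U"])
    # The whole alignment is discarded if too few columns remain.
    print(cleaned, len(cleaned[0]) >= min_length)
```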
If the number of sequences has to be reduced (see --max-seqs), a subset of sequences is chosen which is optimized for this value of mean pairwise identity. (In percent, default: 80)
From each pair of sequences with a pairwise identity higher than X%, one sequence is removed. (Default: 99, i.e. only almost identical sequences are removed) NOT IMPLEMENTED
Output forward, reverse complement or both of the sequences in the windows. Please note: RNAz has the same options, so if you use rnazWindow.pl for an RNAz screen, we recommend setting the option directly in RNAz and leaving the default here. (Default: --forward)
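For reference, the reverse complement of an aligned sequence can be sketched as below. The helper is hypothetical and assumes a DNA alphabet; gaps, N and any other characters are left unchanged, and case (masking) is preserved.

```python
# Assumption: DNA alphabet; other characters (gaps, N) map to themselves.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the aligned sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("AC-GTn"))
```

All sequences of a window have to be reversed together so that the alignment columns stay consistent.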
By default the first sequence is interpreted as reference sequence. This means, for example, that if the reference sequence is removed during filtering steps the complete alignment is discarded. Also, if there are too many sequences in the alignment, the reference sequence is never removed when choosing an appropriate subset. Having a reference sequence is crucial if you are doing screens of genomic regions. For some other applications it might not be necessary and in such cases you can change the default behaviour by setting this option.
Verbose output on STDERR, describing all performed filtering steps.
Prints version information and exits.
Prints a short help message and exits.
Prints a detailed manual page and exits.
In many cases it is necessary to slice, pre-process and filter alignments to get the optimal input for RNAz. This can be a tedious task if you have a large number of alignments to analyze. This program performs the most common pre-processing and filtering steps.
Basically, it slices the input alignments (CLUSTAL W or MAF format) into overlapping windows. The resulting alignment windows are further processed, and only "reasonable" alignment windows are finally printed out, i.e. windows with not too many gaps or repeats and not too few or too many sequences.
# rnazWindow.pl --min-seqs=4 some.aln
Slices the alignment some.aln into overlapping windows of size 120 with a step size of 40 and filters the windows for an optimal input to RNAz (=default behaviour). Only alignments with at least four sequences are printed.
Stefan Washietl <wash@tbi.univie.ac.at>