NAME

rnazWindow.pl - Slice alignments in overlapping windows and process/filter alignment windows in various ways.


SYNOPSIS

 rnazWindow.pl [options] [file]


OPTIONS

-w, --window=N

Size of the window (Default: 120)

-s, --slide=N

Step size (Default: 120)

-m, --max-length=N

Slice only alignments longer than N columns. This means that blocks longer than the window size given by --window but shorter than N are kept intact and not sliced. By default, N is set to the window size given by --window (120 unless changed).

--max-gap=X

Maximum fraction of gaps. If a reference sequence is used (i.e. --no-reference is not set), each sequence is compared to the reference sequence and if in the pairwise comparison the fraction of columns with gaps is higher than X the sequence is discarded. If no reference sequence is used, all sequences with a fraction of gaps higher than X are discarded. (Default: 0.25)

--max-masked=X

Maximum fraction of masked (i.e. lowercase) letters in a sequence. All sequences with a fraction of more than X lowercase letters are discarded. This is usually used to exclude repeat sequences marked by RepeatMasker, but any other information can be encoded by using lowercase letters. (Default: 0.1)

--min-id=X

Discard alignment windows with an overall mean pairwise identity smaller than X%. (Default: 50)

--min-seqs=N

Minimum number of sequences in an alignment. Discard any windows with fewer than N sequences. (Default: 2)

--max-seqs=N

Maximum number of sequences in an alignment. If the number of sequences in a window is higher than N, a subset of exactly N sequences is used. The subset is chosen by the greedy algorithm of the program rnazSelectSeqs.pl, which optimizes for a user-specified mean pairwise identity (see --opt-id). (Default: 6)

--num-samples=N

Number of different subsets of sequences that are sampled if there are more sequences in the alignment than --max-seqs. (Default: 1)

--min-length=N

Minimum number of columns of an alignment slice. After removing sequences from the alignment, "all-gap" columns are removed. If the resulting alignment has fewer than N columns, the complete alignment is discarded.

--opt-id=X

If the number of sequences has to be reduced (see --max-seqs), a subset of sequences is chosen that is optimized for this value of mean pairwise identity. (In percent, default: 80)

--max-id=X

One sequence from each pair with a pairwise identity higher than X% is removed. (Default: 99, i.e. only almost identical sequences are removed) NOT IMPLEMENTED

--forward
--reverse
--both-strands

Output forward, reverse complement or both of the sequences in the windows. Please note: RNAz has the same options, so if you use rnazWindow.pl for an RNAz screen, we recommend setting the option directly in RNAz and leaving the default here. (Default: --forward)

--no-reference

By default, the first sequence is interpreted as the reference sequence. This means, for example, that if the reference sequence is removed during the filtering steps, the complete alignment is discarded. Also, if there are too many sequences in the alignment, the reference sequence is never removed when choosing an appropriate subset. Having a reference sequence is crucial if you are doing screens of genomic regions. For some other applications it might not be necessary, and in such cases you can change the default behaviour by setting this option.

--verbose

Verbose output on STDERR, describing all performed filtering steps.

-v, --version

Prints version information and exits.

-h, --help

Prints a short help message and exits.

--man

Prints a detailed manual page and exits.
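
The fraction-based filters above (--max-masked, --max-gap, --min-id) can be illustrated with a minimal Python sketch. This is not the code used by rnazWindow.pl. The function names are hypothetical, and the choice of denominators (non-gap residues for the masked fraction, columns that are not gapped in both sequences for pairwise comparisons) is an assumption made for illustration only.

```python
from itertools import combinations

GAP = "-"

def masked_fraction(seq):
    """Fraction of lowercase letters among the non-gap residues (assumed denominator)."""
    residues = [c for c in seq if c != GAP]
    if not residues:
        return 0.0
    return sum(c.islower() for c in residues) / len(residues)

def gap_fraction(seq, ref=None):
    """Fraction of gapped columns, for one sequence or pairwise against ref."""
    if ref is None:
        return seq.count(GAP) / len(seq) if seq else 0.0
    # Assumption: columns gapped in both sequences are ignored.
    cols = [(a, b) for a, b in zip(seq, ref) if not (a == GAP and b == GAP)]
    if not cols:
        return 0.0
    return sum(a == GAP or b == GAP for a, b in cols) / len(cols)

def mean_pairwise_identity(seqs):
    """Mean percent identity over all sequence pairs (case-insensitive)."""
    scores = []
    for a, b in combinations(seqs, 2):
        # Assumption: columns gapped in both sequences are ignored.
        cols = [(x.upper(), y.upper()) for x, y in zip(a, b)
                if not (x == GAP and y == GAP)]
        if cols:
            scores.append(100.0 * sum(x == y for x, y in cols) / len(cols))
    return sum(scores) / len(scores) if scores else 0.0
```

Under these assumptions, a sequence would be dropped if masked_fraction(seq) exceeds the --max-masked value or gap_fraction(seq, ref) exceeds the --max-gap value, and a window would be dropped if mean_pairwise_identity(seqs) falls below the --min-id value.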


DESCRIPTION

In many cases it is necessary to slice, pre-process and filter alignments to get the optimal input for RNAz. This can be a tedious task if you have a large number of alignments to analyze. This program performs the most common pre-processing and filtering steps.

Basically, it slices the input alignments (CLUSTAL W or MAF format) into overlapping windows. The resulting alignment windows are further processed and only "reasonable" alignment windows are finally printed out, i.e. windows with not too many gaps or repeats and with neither too few nor too many sequences.
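
The slicing step can be sketched as follows in Python. This is only an illustration of how --window, --slide and --max-length interact, not the implementation in rnazWindow.pl. The function name window_coords is hypothetical, and letting the last window be shorter than the window size is an assumption.

```python
def window_coords(length, window=120, slide=120, max_length=None):
    """Return (start, end) column ranges, 0-based with end exclusive."""
    if max_length is None:
        max_length = window
    # Alignments that are not longer than max_length are kept intact.
    if length <= max_length:
        return [(0, length)]
    coords = []
    start = 0
    while start < length:
        # Assumption: the final window is truncated at the alignment end.
        end = min(start + window, length)
        coords.append((start, end))
        if end == length:
            break
        start += slide
    return coords
```

For example, window_coords(300, window=120, slide=40) yields windows starting at columns 0, 40, 80, 120, 160 and 200, while window_coords(200, max_length=250) returns the whole alignment as a single block.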


EXAMPLES

 # rnazWindow.pl --min-seqs=4 some.aln

Slices the alignment some.aln into windows using the default window and step sizes (see --window and --slide) and filters the windows for an optimal input to RNAz (=default behaviour). Only alignments with at least four sequences are printed.


AUTHORS

Stefan Washietl <wash@tbi.univie.ac.at>