Speaker | Roman Stocsits |
Title | Alignments of partially coding regions:
The code2aln project and its applications |
High-quality alignments of RNA and DNA sequences are a prerequisite for the comparative analysis of genomic sequence data. The high level of sequence heterogeneity in nucleic acids, as compared to proteins, often makes good alignments of nucleic acid sequences impossible. In many cases, the nucleic acid sequences under consideration, or parts of them, code for proteins. While the protein sequences can still show substantial homology, the corresponding nucleic acid sequences may already have diverged so far that they are essentially randomized. This is caused by the inherent redundancy of the genetic code: most amino acids are encoded by more than one codon. As a consequence, nucleic acid alignments often contain gaps and incorrectly aligned segments within coding regions.
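The effect of codon redundancy can be illustrated with a small Python sketch. The two short sequences below are made up for this illustration and are not taken from the thesis; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in frame 0, ignoring any trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def count_identical(a, b):
    """Count positions at which two equally long sequences agree."""
    return sum(x == y for x, y in zip(a, b))


# Hypothetical example sequences: both encode Leu-Ser-Arg.
seq_a = "CTGTCCCGC"
seq_b = "TTAAGTAGA"

print(translate(seq_a), translate(seq_b))   # identical proteins
print(count_identical(seq_a, seq_b), "of", len(seq_a), "nucleotides identical")
```

Both sequences translate to the same peptide although they agree at only a few nucleotide positions, which is why a purely nucleotide-based alignment can lose the signal that is still visible at the protein level.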
In this thesis, a multiple nucleic acid alignment procedure was implemented that uses information about coding and non-coding regions as part of the scoring function in order to improve the resulting alignment. The algorithm combines (mis)match scores for nucleic acids with those for the encoded amino acids in open reading frames and exons. The program makes explicit use of information about overlapping open reading frames, as they occur in virus sequences, to further improve the reliability and quality of the nucleic acid alignment.
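The idea of a combined score can be sketched as follows. This is a minimal illustration under stated assumptions, not the scoring function used by code2aln: the function name `combined_codon_score`, the weight `aa_weight`, and the simple +1/-1 scores are all hypothetical, and a real implementation would more likely use an amino acid substitution matrix.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def combined_codon_score(codon_a, codon_b, match=1.0, mismatch=-1.0, aa_weight=2.0):
    """Score two aligned codons on the nucleotide and the amino acid level.

    Assumption for illustration: identical amino acids score +1, different
    ones -1, and the amino acid term is scaled by aa_weight.
    """
    nucleotide_part = sum(
        match if x == y else mismatch for x, y in zip(codon_a, codon_b)
    )
    same_aa = CODON_TABLE[codon_a] == CODON_TABLE[codon_b]
    amino_acid_part = 1.0 if same_aa else -1.0
    return nucleotide_part + aa_weight * amino_acid_part


# Synonymous codons (both Leu) that share only one nucleotide ...
print(combined_codon_score("CTG", "TTA"))
# ... versus codons sharing two nucleotides but coding Leu and Pro.
print(combined_codon_score("CTG", "CCG"))
```

With these assumed parameters the synonymous pair scores higher than the non-synonymous pair, even though it has fewer matching nucleotides. In non-coding regions only the nucleotide term would apply.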
The implementation is realized in the program package code2aln, which is freely available. Code2aln is based upon a Gotoh-type dynamic programming algorithm with affine gap penalties and features more complex scoring functions for coding regions that combine nucleic acid with amino acid scores.
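For readers unfamiliar with Gotoh-type alignment, the following sketch computes the optimal global alignment score of two sequences under affine gap penalties, using three dynamic programming matrices. It is a generic pairwise textbook version with plain nucleotide scores, not the code2aln implementation; the function name and all parameter values are assumptions chosen for illustration.

```python
def gotoh_score(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Optimal global alignment score with affine gap penalties.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap;
    Y: b[j-1] aligned to a gap.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    M = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    X = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    Y = [[neg_inf] * (m + 1) for _ in range(n + 1)]

    M[0][0] = 0
    for i in range(1, n + 1):          # leading gap in b
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):          # leading gap in a
        Y[0][j] = gap_open + (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)

    return max(M[n][m], X[n][m], Y[n][m])


print(gotoh_score("AAAGGGTTT", "AAATTT"))
```

Because extending a gap is cheaper than opening a new one, the algorithm prefers one contiguous gap of three positions here over several scattered single gaps. In coding regions this behavior is useful, since a gap whose length is a multiple of three keeps the reading frame intact.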
Alignments computed with code2aln have a significantly improved quality in coding regions compared to other methods for nucleic acids. In particular, disruptions of codons are reduced. Code2aln alignments are shown to improve the sensitivity of a method for the detection of conserved RNA structures.
An application of code2aln to two unrelated groups of viruses is described. The alignments were used as input for alidot to detect conserved RNA secondary structure elements in the RNA genomes of Leviviridae and the pregenomic RNA of human hepatitis B virus. Virus genomes contain various (partially overlapping) open reading frames and are therefore an ideal test case for a procedure that makes use of information about (overlapping) coding regions to improve the input alignments and, in turn, the identification of conserved secondary structures.
We find a number of highly significant secondary structure elements that have not been described in the literature so far, as well as some well-known elements such as the epsilon elements and two important elements of the HPRE region in hepatitis B virus. The results for the Levivirus group are also of particular interest: we detect various secondary structure elements that are strongly supported by compensatory mutations and gain novel insight into the structural organization of Levivirus genomes.
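What it means for a base pair to be supported by compensatory mutations can be sketched in a few lines. This is a simplified illustration and not the procedure used by alidot: the function name is hypothetical, the columns are assumed to be gap-free RNA alignment columns, and G-U wobble pairs are counted as valid pairs.

```python
# Watson-Crick pairs plus G-U wobble pairs.
CANONICAL_PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def has_compensatory_mutation(col_i, col_j):
    """Check two alignment columns for a compensatory mutation.

    col_i and col_j hold the nucleotides of all sequences at two alignment
    positions. Returns True if every sequence can form a base pair between
    the two positions and at least two sequences differ at both positions.
    """
    observed = set(zip(col_i, col_j))
    if not all(pair in CANONICAL_PAIRS for pair in observed):
        return False
    return any(
        p[0] != q[0] and p[1] != q[1] for p in observed for q in observed
    )


# Hypothetical columns from a three-sequence alignment.
print(has_compensatory_mutation("GGA", "CCU"))  # G-C replaced by A-U
print(has_compensatory_mutation("GGG", "CCU"))  # only one side changed
print(has_compensatory_mutation("GGA", "CCC"))  # A-C cannot pair
```

A pair of columns in which both nucleotides change while pairing is preserved is much stronger evidence for a conserved structure than a pair that is merely identical in all sequences, which is why accurate input alignments matter for this kind of analysis.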