Thesis Defense Abstract

Speaker Roman Stocsits
Title Alignments of Partially Coding Regions: The code2aln Project and Its Applications


High-quality alignments of RNA and DNA sequences are a prerequisite for the comparative analysis of genomic sequence data. Nucleic acid sequences are far more heterogeneous than protein sequences, which often makes good nucleic acid alignments impossible. In many cases, the nucleic acid sequences under consideration, or parts of them, code for proteins. While the protein sequences can still show substantial homology, the corresponding nucleic acid sequences may have diverged so far in the course of evolution that they are essentially randomized. This is caused by the inherent redundancy of the genetic code: most amino acids are encoded by more than one codon. In practice, this leads to spurious gaps and incorrectly aligned segments within coding regions.
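
The effect of codon redundancy can be illustrated with a small sketch. The two sequences below are invented for illustration and are not taken from the thesis; the codon table is a deliberately partial excerpt of the standard genetic code, and the helper names `translate` and `identity` are hypothetical.

```python
# Partial standard genetic code: only the six codons used in this example.
CODON_TABLE = {
    "CTG": "L", "TTA": "L",  # leucine
    "TCT": "S", "AGC": "S",  # serine
    "CGT": "R", "AGA": "R",  # arginine
}


def translate(seq):
    """Translate a nucleotide sequence in frame 0 using CODON_TABLE."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two invented sequences that encode the same peptide with different codons.
seq_a = "CTGTCTCGT"
seq_b = "TTAAGCAGA"

print(translate(seq_a), translate(seq_b))  # same peptide for both
print(identity(seq_a, seq_b))              # only 2 of 9 nucleotides agree
```

Both sequences translate to the same three-residue peptide, although they agree at only two of nine nucleotide positions. A scoring scheme that sees nucleotides alone has little signal to work with here.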

In this thesis, a multiple nucleic acid alignment procedure was implemented that uses genetic information about coding and non-coding regions as part of the scoring function in order to improve the resulting alignment. For open reading frames and exons, the algorithm combines (mis)match scores for nucleic acids with those for the encoded amino acids. The program makes explicit use of information about overlapping open reading frames, as they occur in virus sequences, to further improve the reliability and quality of the nucleic acid alignment.
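
One way to picture such a combined score is the following minimal sketch. It is not the scoring function of code2aln: the simple match/mismatch values, the weight `aa_weight`, and all function names are assumptions chosen for illustration, and the amino acid score stands in for a real substitution matrix.

```python
# Partial standard genetic code: only the codons used in this example.
CODON_TABLE = {"CTG": "L", "TTA": "L", "CGT": "R"}


def nucleotide_score(x, y, match=1.0, mismatch=-1.0):
    """Assumed nucleotide (mis)match score."""
    return match if x == y else mismatch


def amino_acid_score(p, q, match=2.0, mismatch=-2.0):
    """Assumed amino acid score; a stand-in for a substitution matrix."""
    return match if p == q else mismatch


def combined_codon_score(codon_a, codon_b, coding=True, aa_weight=1.0):
    """Score two aligned nucleotide triplets.

    Non-coding triplets receive the nucleotide score only. Coding triplets
    additionally receive the weighted score of the encoded amino acids.
    """
    nt = sum(nucleotide_score(x, y) for x, y in zip(codon_a, codon_b))
    if not coding:
        return nt
    aa = amino_acid_score(CODON_TABLE[codon_a], CODON_TABLE[codon_b])
    return nt + aa_weight * aa


# Synonymous codons (both leucine): the amino acid term rescues the score.
print(combined_codon_score("CTG", "TTA", coding=True))
print(combined_codon_score("CTG", "TTA", coding=False))
# Codons for different amino acids (leucine vs. arginine) are penalized further.
print(combined_codon_score("CTG", "CGT", coding=True))
```

Under these assumed values, two synonymous codons score positively when treated as coding but negatively when treated as plain nucleotides, which is the kind of signal that helps keep codons intact in the alignment.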

The method is implemented in the program package code2aln, which is freely available.

Code2aln is based upon a Gotoh-type dynamic programming algorithm with affine gap penalties, and features more complex scoring functions for coding regions that combine nucleic acid with amino acid scores.
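
For readers unfamiliar with Gotoh-type alignment, the recursion with affine gap penalties can be sketched as follows. This is a generic textbook formulation for two sequences that returns only the optimal score, not the code2aln implementation; the function name `gotoh_score`, the penalty values, and the convention that a gap of length k costs `gap_open + (k - 1) * gap_extend` are assumptions. The `score` argument is where a more complex scoring function for coding regions would be plugged in.

```python
NEG_INF = float("-inf")


def gotoh_score(a, b, score, gap_open=-4.0, gap_extend=-1.0):
    """Optimal global alignment score of a and b with affine gap costs.

    Three matrices are kept, one per state of the last alignment column:
      M: a[i-1] is aligned to b[j-1]
      X: a[i-1] is aligned to a gap
      Y: b[j-1] is aligned to a gap
    """
    n, m = len(a), len(b)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):  # leading gap in b
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):  # leading gap in a
        Y[0][j] = gap_open + (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = score(a[i - 1], b[j - 1]) + max(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Extending an open gap is cheaper than opening a new one.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


def simple_score(x, y):
    """Assumed nucleotide score: +1 for a match, -1 for a mismatch."""
    return 1.0 if x == y else -1.0


print(gotoh_score("ACGT", "ACGT", simple_score))
print(gotoh_score("ACGT", "AGT", simple_score))
```

Keeping a separate matrix per gap state is what allows one long gap to be charged less than several short ones, at the same quadratic cost as the linear-gap recursion.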

Alignments computed with code2aln are of significantly higher quality in coding regions than those produced by other nucleic acid alignment methods. In particular, fewer codons are disrupted by gaps. Code2aln alignments are also shown to improve the sensitivity of a method for detecting conserved RNA structures.

An application of code2aln to two unrelated groups of viruses is described. The alignments were used as input to alidot, a procedure for detecting conserved RNA secondary structure elements, which was applied to the RNA genomes of Leviviridae and the pregenomic RNA of human hepatitis B virus. Virus genomes contain various, partially overlapping open reading frames. They are therefore an ideal test case for a procedure that uses information about (overlapping) coding regions to improve the input alignments and, as a consequence, the identification of conserved secondary structures.

We find a number of highly significant secondary structure elements that have not previously been described in the literature, as well as some well-known elements such as the epsilon elements and two important elements of the HPRE region in hepatitis B virus. The results for the Levivirus group are also of particular interest: we detect various secondary structure elements that are strongly supported by compensatory mutations, and we gain novel insight into the structural organization of Levivirus genomes.