Defensio Abstract

Speaker Stefan Washietl
Title Prediction of structural non-coding RNAs by comparative sequence analysis


Non-coding RNAs (ncRNAs) are transcripts that function directly as RNA molecules without ever being translated into protein. Given the ever-growing list of newly discovered ncRNAs, it can be expected that further types of ncRNAs are still hidden in recently completed genomes. Unlike protein-coding genes, ncRNAs lack any statistically significant characteristics in primary sequence that could be exploited for reliable prediction. Therefore, de novo prediction of ncRNAs is still one of the most challenging, and largely unsolved, problems in bioinformatics. Since many functional ncRNAs depend on a defined secondary structure, algorithms based on secondary structure prediction seem to be the most promising.

In the first part I show that thermodynamic stability is a characteristic feature of functional ncRNAs but, if computed for a single sequence, is generally not significant enough to reliably distinguish native ncRNAs from the genomic background. However, functional structures are often evolutionarily conserved. Using a comparative approach, we demonstrated that the prediction of a consensus secondary structure of homologous sequences, which considers both thermodynamic stability and covariance information, can be a significant measure. We introduced a novel method to assess multiple sequence alignments for thermodynamically stable and evolutionarily conserved RNA secondary structures. The method is highly accurate, but since it depends on a time-consuming random shuffling algorithm, it is not suitable for screens of large genomes.
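The shuffling-based significance test mentioned above can be illustrated with a minimal sketch. It compares the energy of a native sequence with the energies of shuffled versions that keep the same length and base composition, and reports the difference as a z-score. The function names, the sample count, and in particular `toy_energy` are assumptions made for illustration: `toy_energy` is a crude stand-in that counts complementary bases in a single hairpin-like arrangement, not a real folding algorithm, and the sketch works on a single sequence rather than an alignment.

```python
import random
import statistics

# Base pairs treated as complementary (Watson-Crick plus G-U wobble).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def toy_energy(seq):
    """Hypothetical stand-in for a folding energy (NOT a real energy model).

    Pairs position i with its mirror position and returns minus the number
    of complementary pairs, so more pairs means a lower (more stable) value.
    """
    n = len(seq)
    return -float(sum((seq[i], seq[n - 1 - i]) in PAIRS for i in range(n // 2)))


def shuffle_z_score(seq, energy=toy_energy, samples=200, seed=0):
    """Z-score of the native energy against mononucleotide-shuffled sequences.

    Shuffling preserves length and base composition, so the z-score measures
    how unusual the native energy is given those two properties.
    """
    rng = random.Random(seed)
    native = energy(seq)
    background = []
    for _ in range(samples):
        letters = list(seq)
        rng.shuffle(letters)
        background.append(energy("".join(letters)))
    mu = statistics.mean(background)
    sigma = statistics.stdev(background)
    if sigma == 0:
        return 0.0  # every shuffle scored the same: no signal to report
    return (native - mu) / sigma


if __name__ == "__main__":
    hairpin = "GGGGGGAAAACCCCCC"
    print(shuffle_z_score(hairpin))  # negative: more stable than shuffled background
```

Every evaluation requires scoring many shuffled sequences, which shows why a sampling-based test becomes expensive when applied across a whole genome.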

I therefore developed an alternative algorithm. It consists of two basic components: (i) a novel measure for structure conservation based on consensus structure prediction and (ii) a measure for thermodynamic stability, which, in the spirit of a z-score, is normalized with respect to both sequence length and base composition but can be calculated without sampling from shuffled sequences. With the help of a support vector machine learning algorithm, both scores are combined into a composite score that efficiently detects functional secondary structures in sequence alignments. Our approach was implemented in the program RNAz. Benchmarking tests showed that RNAz clearly outperforms other available programs in terms of both accuracy and speed.
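The abstract does not spell out how the structure conservation measure is computed. One plausible form, assumed here purely for illustration, is the ratio of the energy of the consensus structure to the mean energy of the individual sequences folded on their own. The function name and all energy values below are hypothetical, and the energies would in practice come from a folding program rather than being typed in.

```python
from statistics import mean


def structure_conservation_index(consensus_energy, single_energies):
    """Ratio of the consensus folding energy to the mean single-sequence energy.

    Assumed definition for illustration only. A value near 1 suggests the
    sequences share one structure that is about as stable as their individual
    folds; a value near 0 suggests no common structure.
    """
    average = mean(single_energies)
    if average == 0:
        return 0.0  # no sequence folds at all, so nothing can be conserved
    return consensus_energy / average


if __name__ == "__main__":
    # Hypothetical energies in kcal/mol for an alignment of three sequences.
    print(structure_conservation_index(-20.0, [-20.0, -22.0, -18.0]))
```

A score of this kind needs only one consensus fold and one fold per sequence, with no shuffled sampling, which is consistent with the speed argument made above.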

In the last part I used RNAz to conduct the first comprehensive screen for conserved RNA structures in the human genome. We screened alignments of several mammals and other vertebrates and predicted more than 30,000 putative structural RNA elements throughout the human genome. Our screen recovers hundreds of known structural ncRNAs, identifies additional members of known ncRNA families, and detects previously undescribed conserved structural elements in some known ncRNAs. Most of the detected RNA structures, however, are of completely unknown function. Our computational results point to thousands of previously undetected functional ncRNAs in the human genome and provide a strong basis for further theoretical and experimental studies.