CMCompare: Discriminatory Power of RNA Family Models

Christian Hoener zu Siederdissen and Ivo Hofacker

The source code for the program, two example scripts and the needed data are available. You can find the paper here.

Installation: 'cabal install cmcompare'

News

2010/Sep/28 Rfam 10 clans and their scores. This file has been generated automatically, so it is not pretty (for now).

2010/Sep/14 Rfam 10 is available now. Old data remains available for comparison.

2010/Sep/08 Install the program using 'cabal install cmcompare'. This should pull all dependencies. Please mail me any problems.

2010/Jun/17 You can find the unofficial version of this page here. This page is about to get a nicer layout. ;-)

Rfam-10 Data

We again provide two files. The complete set contains all pairwise scores, while the weak pairs file contains only those pairs of models that have a Link score of at least 20 bits. Both files store one result per line. For each hit you get the scores, Link sequence, secondary structures and participating nodes. Both files are gzipped and sorted by model identifiers. Here are all clans.

Installation via cabal-install

If you are interested in installing from sources, you probably have installed both GHC and cabal-install. In that case a simple 'cabal install cmcompare' should be enough.

Rfam-9 Data

The results from the pairwise comparison can be found in the Main Results file. The file is 406 MByte in size and expands to several gigabytes. You might want to use Fuse and Archivemount. Inside the archive are about 1400 directories with entries like this: 00001/RF00001-RF00100.out. If you are looking for the result for two covariance models RFx.cm and RFy.cm and don't find it in x/RFx-RFy.out, take a look at y/RFy-RFx.out. This is an artifact of the way the results were collected (using a computer grid).
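The lookup rule above can be sketched in a few lines of Python. This is only an illustration, not part of the CMCompare distribution: the function names `candidate_paths` and `find_result` are made up here, and `root` is assumed to be the directory where the archive was extracted or mounted (for example with Archivemount).

```python
from pathlib import Path


def candidate_paths(acc_a, acc_b):
    """Return both relative paths under which a pair result may be stored.

    Assumes accessions look like "RF00001" and that the directory name is
    the numeric part of the first model, as in 00001/RF00001-RF00100.out.
    """
    def number(acc):
        # "RF00001" -> "00001"
        return acc[2:] if acc.startswith("RF") else acc

    return [
        "{}/{}-{}.out".format(number(acc_a), acc_a, acc_b),
        "{}/{}-{}.out".format(number(acc_b), acc_b, acc_a),
    ]


def find_result(root, acc_a, acc_b):
    """Return the first existing result file below root, or None."""
    for rel in candidate_paths(acc_a, acc_b):
        path = Path(root) / rel
        if path.is_file():
            return path
    return None
```

Calling `find_result(root, "RF00100", "RF00001")` therefore checks 00100/RF00100-RF00001.out first and falls back to 00001/RF00001-RF00100.out.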

The interesting results are collected in the Weak Pairs list. Some 70,000 pair scores are worth a look, and they are collected here together with additional data. One line equals one result (perfect for Unix pipes!) and looks like this:

RF00186.cm RF00267.cm 20.000000   28.00 33.23 21.49   40.00 47.16 34.57   -1.490000   -14.570000   -14.570000 GUUUUGCAAUGAUGAAUUAAAAAGACAGUUAACCUUUCAGUCUUUUGAGCAAC ((,,(((.........................................))))) ((..(((.........................................))))) [[]] [[]] [] [] [] []

Some explanation is necessary:

model 1, model 2, MaxiMin score, model 1 cutoffs, model 2 cutoffs, TC - score1, TC - score2, worst TC - score{1,2}, sequence, structure 1, structure 2, shape3 1, shape3 2, shape4 1, shape4 2, shape5 1, shape5 2

(please read the CMCompare paper concerning cutoff scores, shapes, etc. until an explanation is placed here)
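As a minimal sketch of how such a line could be split into the fields listed above, the following Python snippet may help. It assumes that fields are separated by whitespace, that each model contributes exactly three cutoff values (as in the example line), and that no field contains spaces; the names `WeakPair` and `parse_weak_pair` are hypothetical and not part of CMCompare.

```python
from collections import namedtuple

# Field names follow the explanation above; they are illustrative only.
WeakPair = namedtuple("WeakPair", [
    "model1", "model2", "maximin",
    "cutoffs1", "cutoffs2",
    "tc_minus_score1", "tc_minus_score2", "worst",
    "sequence", "structure1", "structure2",
    "shapes",  # shape3 1, shape3 2, shape4 1, shape4 2, shape5 1, shape5 2
])

EXAMPLE = (
    "RF00186.cm RF00267.cm 20.000000   28.00 33.23 21.49   40.00 47.16 34.57"
    "   -1.490000   -14.570000   -14.570000 "
    "GUUUUGCAAUGAUGAAUUAAAAAGACAGUUAACCUUUCAGUCUUUUGAGCAAC "
    "((,,(((.........................................))))) "
    "((..(((.........................................))))) "
    "[[]] [[]] [] [] [] []"
)


def parse_weak_pair(line):
    """Split one weak-pairs line into named fields (assumes 21 tokens)."""
    f = line.split()
    if len(f) != 21:
        raise ValueError("expected 21 fields, got {}".format(len(f)))
    return WeakPair(
        model1=f[0],
        model2=f[1],
        maximin=float(f[2]),
        cutoffs1=tuple(float(x) for x in f[3:6]),
        cutoffs2=tuple(float(x) for x in f[6:9]),
        tc_minus_score1=float(f[9]),
        tc_minus_score2=float(f[10]),
        worst=float(f[11]),
        sequence=f[12],
        structure1=f[13],
        structure2=f[14],
        shapes=tuple(f[15:21]),
    )
```

With the example line, `parse_weak_pair(EXAMPLE).maximin` is the MaxiMin score and `.shapes` holds the six shape strings in the order given above.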

Rfam 9 vs. Rfam 10

This data was created before the introduction of Rfam 10; therefore we used Infernal 1.0.2 together with Rfam 9.1 built using Infernal 1.0 (cf. infernal.janelia.org). With the newest Rfam version, it now becomes possible to compare Rfam clans with some high-scoring cliques.

Program, Sources and Scripts

programs

All sources are available under the GPLv3 or later. The program was written in the functional programming language Haskell. If you do not want to compile the program yourself, this version has been compiled under Fedora 12, 64-bit, and this one under Arch Linux, 64-bit. If neither works for you, mail me your system specs and we can think of a solution.

sources

The following description is now obsolete. Install via 'cabal install cmcompare'. This fetches all sources, too, and deals with compiling everything that is needed.

The easiest way, otherwise, is to just compile the program. Should you want to take a look at the implementation of the algorithm, you need only the program source.

  1. Download and install GHC 6.12. Your distribution should have this (or visit the GHC download page).
  2. Install cabal-install. Again, look into your distribution's repositories.
  3. Execute "cabal update" in a shell. This prepares the folder ".cabal" for user-installed libraries.
  4. Using cabal-install, the next steps are easy. Repeat steps 5 to 9 for Haskell Tools, Biobase, BiobaseInfernal and the Main Program.
  5. Download the xxx.tar.gz.
  6. extract the xxx.tar.gz using "tar xf xxx.tar.gz"
  7. "cd xxx"
  8. "cabal install". During this process, a number of additional libraries will be installed from hackage.haskell.org. Do not panic!
  9. If something fails, send me a mail.

If everything went well, there should be a lot of libraries installed under ~/.cabal and the program itself under ~/.cabal/bin/hsCMCompare. The easiest thing to do is to just try it out: "~/.cabal/bin/hsCMCompare --verbose x.cm y.cm"

An example (without --verbose) is:

two.cm   three.cm     27.996     19.500 CCCAAAGGGCCCAAAGGG (((...)))(((...))) (((...)))(((...))) [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17] [11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27]

This would be:

model 1, model 2, score 1, score 2, common string, secondary structure 1, structure 2, visited nodes 1, visited nodes 2.

By producing just one line of output, further processing via pipes is easy. The --verbose switch is for humans.
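To illustrate the kind of further processing this makes possible, here is a minimal Python sketch that turns one output line into a dictionary. It assumes nine whitespace-separated fields in the order described above and node lists written without spaces, as in the example; the function name `parse_cmcompare_line` is hypothetical.

```python
def parse_node_list(text):
    """Turn a string like "[1,2,3]" into a list of integers."""
    return [int(n) for n in text.strip("[]").split(",") if n]


def parse_cmcompare_line(line):
    """Parse one non-verbose output line (assumes 9 fields)."""
    f = line.split()
    if len(f) != 9:
        raise ValueError("expected 9 fields, got {}".format(len(f)))
    return {
        "model1": f[0],
        "model2": f[1],
        "score1": float(f[2]),
        "score2": float(f[3]),
        "sequence": f[4],
        "structure1": f[5],
        "structure2": f[6],
        "nodes1": parse_node_list(f[7]),
        "nodes2": parse_node_list(f[8]),
    }
```

A script could read such lines from standard input, call `parse_cmcompare_line` on each, and filter or sort on the two scores.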

scripts

Two scripts show how graphs can be created from the data easily and automatically. We use the weak pairs list as it contains the interesting edges between families. One of the scripts, HighScoreEdges.sh, generates graphs of the high-scoring edges between all families. The other script, NeighborGraph.sh, puts one family in the center and shows just the edges connected to this one family. Both (albeit in a more ad-hoc fashion) were used during the creation of some of the figures in the paper. We used Graphviz, which generates graphs automatically from data. While not perfect in all cases, this makes it possible to take a look at the neighborhood of, say, all 1400 family models without having to do much by hand, except taking a look at the graphs.
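The general idea can be sketched in Python as follows. This is not a transcription of HighScoreEdges.sh or NeighborGraph.sh; it only assumes that the first three whitespace-separated fields of each weak-pairs line are model 1, model 2 and the score, and it emits an undirected graph in Graphviz DOT syntax. The function name and the `min_score` parameter are made up for this example.

```python
def strip_cm(name):
    """Drop a trailing ".cm" so that node labels are plain accessions."""
    return name[:-3] if name.endswith(".cm") else name


def weak_pairs_to_dot(lines, min_score=20.0):
    """Build a DOT graph with one edge per pair scoring at least min_score."""
    out = ["graph weak_pairs {"]
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        score = float(fields[2])
        if score < min_score:
            continue
        out.append('  "{}" -- "{}" [label="{:.1f}"];'.format(
            strip_cm(fields[0]), strip_cm(fields[1]), score))
    out.append("}")
    return "\n".join(out)
```

The resulting text can be written to a file and rendered with one of the Graphviz layout programs, for example `dot -Tpdf`. A neighbor graph for a single family would additionally keep only those lines in which that family appears as model 1 or model 2.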

Some examples

Random Stuff

If you are asking yourself why Haskell, just count the lines of code!

This page has been created using Pandoc; choener has no HTML skills.