############################################################ # INSTALLATION # ===================== # please do the installation before the class ############################## # Get data and software # # Data for the tutorial and specific software is provided at # # https://www.tbi.univie.ac.at/~will/AlgoSB19 # # Get most data in one package: download e.g. by # wget https://www.tbi.univie.ac.at/~will/AlgoSB19/algosb19-comprna.tgz # # and extract # tar xzf algosb19-comprna.tgz cd AlgoSB19-CompRna ## # if you don't have conda installed already, install miniconda # (Python 3 version) following instructions (Steps 1 & 2) here: # # https://bioconda.github.io/ # # don't forget to set up the channels # # # if conda is not working, find instructions to install the most important applications from scratch # without using conda in INSTALL-WO-CONDA.txt # ## We assume conda resides in $HOME/miniconda3; otherwise adapt the path # CONDA_HOME=$HOME/miniconda3 ## create a new conda environment and activate it # conda create -n comprna viennarna clustalw rscape rnaz infernal bedtools newick_utils conda activate comprna ## install locarna-2.0.0RC7 in the conda environment # # NOTE: if locarna 2.0.0RC7 (and the newest Vienna package) was already # installed outside of conda, you may skip this step # # Get locarna-2.0.0RC7p0.tar.gz from # https://www.tbi.univie.ac.at/~will/AlgoSB19/ # to download it from linux command line run wget https://www.tbi.univie.ac.at/~will/AlgoSB19/locarna-2.0.0RC7p0.tar.gz tar xzf locarna-2.0.0RC7p0.tar.gz cd locarna-2.0.0RC7p0 ./configure --prefix=$CONDA_PREFIX make -j4 make install ############################################################ # Analysis of RNA alignments # ========================== # As running example, let's use a set of tRNAs in Data/trnas.fa # Data/trnas.aln contains a sequence-based alignment of the tRNAs, which # was produced using clustalw: cd Data clustalw Data/trnas.fa cd .. # # The sequences in MoreData can be analyzed in the same way as Data/trnas.fa # and used for comparison. What can you learn about those sequences? # Are they RNAs of the same family; do they have stable structure? # is their structure conserved? # ############################## # RNAalifold RNAalifold --aln --color Data/trnas.aln # have a look at the man page # man RNAalifold # # to understand the options and the output # # RNAalifold writes postscript files aln.pn and alirna.ps # view with postscript viewer (in Linux, e.g. evince, okular, gv, ...) # # look at Doc/RNAalifold-legend.png (and into the man page) for the color code # eog Doc/RNAalifold-legend.png # Usually, one gets better results from alifold by using ribosum scoring (option -r) # RNAalifold --aln --color -r --cfactor 0.6 --nfactor 0.5 Data/trnas.aln # # with the 'magic' parameters --cfactor 0.6 --nfactor 0.5 # # RNAalifold can compute partition functions and generate probability dot plots (option -p) # RNAalifold -p --color Data/trnas.aln # # have a look at alidot.ps (using evince, okular, ...) # # # Compare the consensus structure to the structures predicted from single sequences # RNAfold < Data/trnas.aln RNAalifold Data/trnas.aln # # and calculate the Structure conservation index (SCI) # # recall: SCI = MFE_alifold / MEAN_i(MFE_i) # # btw, RNAalifold even has an option for this (check whether you get the same result) # # # Finally, let's produce a stockholm file # RNAalifold --aln-stk=trnas Data/trnas.aln ############################################################ # R-scape # ======= # http://eddylab.org/R-scape/ # # Upload the stockholm file trnas.stk to the web server and run. # # Check out the additional information on significant covariation provided in the # R2R structure plot and the covariation dot plot # ############################################################ # RNAz # ==== # https://www.tbi.univie.ac.at/software/RNAz/ # RNAz Data/trnas.aln # # # What information is provided in the output of RNAz? # # Do you see how (roughly) the features "Shannon entropy", "Mean z-score", # and "Structure conservation index" influence the "SVM RNA-class probability"? # # RNAz alignment models # --------------------- # By default, RNAz assumes that the input alignment is generated by a sequence alignment # program (like clustalw). # With option -l, it assumes that the alignment is a structure alignments, # as produced by LocARNA. # # RNAz background models # --------------------- # What is the difference between option --dinucleotide and --mononucleotide? # Hint: this is related to the computation of z-scores. # ############################################################ # LocARNA # ======= # LocARNA is hosted at Github # https://github.com/s-will/LocARNA # where one finds newest releases and information. # # Additional information is found here: # http://www.bioinf.uni-freiburg.de/Software/LocARNA/ # While there is a web server for LocARNA # http://rna.informatik.uni-freiburg.de/LocARNA/Input.jsp # we will look into LocARNA's command line tools, # which provide much more flexibility (and are not limited in # the number and length of sequences that can be aligned). # Note that instead of using the released version, # I would now generally recommend to use the most recent # LocARNA release candidate (which we are using in this tutorial). # # The main tool of LocARNA is mlocarna. Use it to compute a multiple alignment # by simply calling it on a fasta file with multiple RNA sequences: # mlocarna Data/tRNA_5.fa # # This call writes output files to the directory Data/tRNA_5.out # One can change the output directory using --tgtdir mlocarna Data/tRNA_5.fa --tgtdir Out/tRNA_5.out # # To learn more about the workflow of mlocarna, # have a look at the output directory and also call mlocarna Data/tRNA_5.fa --tgtdir Out/tRNA_5.out -v # # and # mlocarna Data/tRNA_5.fa --tgtdir Out/tRNA_5.out --moreverbose # # In particular, the latter produces a lot of output, # including detailed information about each step of the # progressive multiple alignment procedure. # (One can also make mlocarna shut up with -q) # The output directory contains directories # * input # with files like input/AF008220 for each input sequence. # These files are in LocARNAs pp2.0 format. Each # specifies a sequence (or alignment) and base pair probabilities # (Recall that LocARNA requires base pair probabilities to score # the base pair matches) # # * results # which contains all final results, in particular # the final alignment results/result.aln # Moreover, the guide tree: Out/tRNA_5.out/results/result.tree # # * intermediates # which contains the intermediate alignments that are # generated by the progressive alignment steps # View the guide tree: # cat Out/tRNA_5.out/results/result.tree nw_display Out/tRNA_5.out/results/result.tree # or use a web server that displays trees in newick format, e.g. at # http://etetoolkit.org/treeview/ # # Now, have a look at the man page of mlocarna. It documents all of the many command # line arguments of the tool and provides some usage examples. mlocarna --man # As you have already seen, mlocarna performs a series of pairwise alignments for generating # the multiple alignment. These are performed by the work-horse tool locarna. # This program runs the PMcomp-style Sankoff algorithm on a pair of RNAs. Get help on # the command line options of locarna, by locarna --help # locarna takes two inputs and aligns them locarna Examples/mouse.fa Examples/human.fa # There are some special occasions, where one wants to call locarna directly. For example, # it can align alignments: # # generate two alignments by mlocarna: # mlocarna Data/tRNA_5.fa mlocarna Data/tRNA_7.fa # # and align them using locarna # locarna Data/tRNA_5.out/results/result.aln Data/tRNA_7.out/results/result.aln ## --------------------------- # The effect of heuristics # # We look at some options that control heuristics in locarna and mlocarna: # # -p controls the cutoff probability which filters considered base pairs # # --min-trace-probability controls the sequence-based pre-filter # that can strongly limit the calculated DP matrix entries based on # sequence similarities. Value 0 turns off this heuristic! # # --max-diff 60: limit the calculated entries in the DP matrices # to a diagonal band (of width 60) # compare times time locarna Examples/mouse.fa Examples/human.fa time locarna --min-trace-probability=0 Examples/mouse.fa Examples/human.fa time locarna --min-trace-probability=0 --max-diff 60 Examples/mouse.fa Examples/human.fa time locarna --min-trace-probability=0 -p 0.1 Examples/mouse.fa Examples/human.fa time locarna --min-trace-probability=0 -p 0.001 Examples/mouse.fa Examples/human.fa ## --------------------------- # Local alignment # # locarna and mlocarna supports several types of local alignment # # (sequence) local locarna --sequ-local on Examples/mouse.fa Examples/human.fa # sequence and structure local locarna --sequ-local on --struct-local on Examples/mouse.fa Examples/human.fa # free end gaps at the 3' ends or 5' ends mlocarna --free-endgaps-3 Data/web-example1.fa mlocarna --free-endgaps-5 Data/web-example1.fa ## --------------------------- ## Constraints for mlocarna # Mlocarna supports structure constraints for folding and anchor constraints # for alignment. Both types of constraints can be specified in extension of # the standard fasta format via 'constraint lines'. # (confer mlocarna's man page, section 'Use of constraints') # Here are two examples with constraints: mlocarna Example/example-w-constraints.fa mlocarna Examples/haca.snoRNA.fa # Both files give examples, how structure and anchor constraints can be # expressed in fasta input. (Have a look at the files.) # Anchor constraints can as well be passed using bed format: mlocarna --anchor-constraints Examples/example-anchors.bed Examples/example-wo-anchors.fa ## ---------------- # Realignment # # mlocarna can realign alignments, which allows to keep # certain columns unchanged by the new alignment # mlocarna --realign Examples/example-realign.aln # # This is useful to manually fix-up alignments and realign # while preserving the manual changes # Edit the example-realign.aln alignment and mark your changes: cp Examples/example-realign.aln example-realign2.aln #edit example-realign2.aln and mark changed columns mlocarna --realign example-realign2.aln # realignment can be moreover restricted to make only 'small' changes, # which is mostly useful to speed up the re-alignment procedure. # mlocarna --realign Examples/example-realign.aln --max-diff 5 --max-diff-aln . ## --------------- # Use local folding for long RNA sequences # compare: time mlocarna --plfold-span 100 Data/web-example1.fa time mlocarna Data/web-example1.fa ## --------------- # Probabilistic alignment # mlocarna --probabilistic Data/tRNA_5.fa mlocarna --probabilistic --plfold-span 100 Data/web-example1.fa ## --------------- # Use consistency transformation # mlocarna --probabilistic --consistency-transformation Data/tRNA_5.fa ## reliability profiles # # after the last call, compute a reliability profile of the alignment # to show the local alignment quality # reliability-profile.pl Data/tRNA_5.out # # This plots the profile in rel.pdf # evince rel.pdf # ## by default, it predicts the regions of highest signal, which can be used # for predicting more accurate boundaries of non-coding RNAs from an RNAz screen. # see reliability-profile.pl --help ############################## # Analyse locarna alignments # # The alignments generated by locarna can be further analyzed using RNAalifold, R-scape, etc # Go ahead:) # # Be aware of the different natures of sequence-based and structure-based alignments. # For example, RNAz provides a special mode to correctly interpret structure-based alignments clustalw Examples/hammerhead_3.fa RNAz Examples/hammerhead_3.aln mlocarna Examples/hammerhead_3.fa RNAz -l Examples/hammerhead_3.out/results/result.aln # generate stockholm format output (e.g. to analyse with R-scape) # mlocarna --stockholm Examples/hammerhead_3.fa # writes output to Examples/hammerhead_3.out/results/result.stk # additionally include consensus structure: mlocarna --consensus-structure alifold --stockholm Examples/hammerhead_3.fa cat Examples/hammerhead_3.out/results/result.stk ############################################################ # Infernal - build CMs # ==================== # Doc/Infernal-userguide.pdf # http://eddylab.org/infernal/ ## Build a consensus model from a stockholm file, # e.g. after the previous mlocarna call: cmbuild hammerhead_3.cm Examples/hammerhead_3.out/results/result.stk ## Since we want to use the model for database search, we need to # calibrate it (to later get proper E-values) cmcalibrate hammerhead_3.cm # ATTENTION: calibration is very time consuming. Use cmcalibrate --forecast hammerhead_3.cm # to learn how long this would take and # proceed with the pre-computed file Data/hammerhead_3.cm cmsearch Data/hammerhead_3.cm Data/ecoli-k12.fa ############################################################ # RNAz screen # =========== # # We are going to perform an RNAz screen on a tba whole genome alignment of Enterobacteria (with reference E.coli K12): # Data/tba_k12only_SORTED.maf.gz # # to predict non-coding RNAs. # # # get the RNAz utility tools into the path # export PATH=$CONDA_HOME/envs/comprna/share/RNAz/perl:$PATH # use an extra directory for output (stay clean and tidy) # mkdir Out # split the whole genome alignment into windows # and run RNAz on the windows; keep RNAz hits (RNAz class probability P > 0.5) # zcat Data/tba_k12.maf.gz | rnazWindow.pl -d -w 80 -s 40 --min-seqs=2 | RNAz -b -p 0.5 -o Out/tba_k12.rnaz # or zcat Data/tba_k12.maf.gz | rnazWindow.pl -d -w 100 -s 50 --min-seqs=3 | RNAz -b -p 0.5 -o Out/tba_k12.rnaz # ## NOTE: the same could be done in two steps, writing an intermediary file of the windows # zcat Data/tba_k12.maf.gz | rnazWindow.pl -w 80 -s 40 >Out/tba_k12.windows # cat Out/tba_k12.windows | RNAz -b -p 0.9 -o Out/tba_k12.rnaz # # Running the last command is very time consuming, but you can continue with # the precomputed results in Data/tba_k12.rnaz # For now, put the last command in the background. # (Stop with Ctrl-z and continue in the background with bg) # Later, see whether you get the same results. # # combine high confidence predictions (RNAz class probability P > 0.99) cat Data/tba_k12.rnaz | rnazCluster.pl -c 0.99 -d > Out/tba_k12.results # optionally: generate html output of rnaz hits (requires some additional tools) # using the option '--html' of rnazCluster.pl # look up details in the man page: rnazCluster.pl --man ## get the locus sequences # # generate bed file cat Out/tba_k12.results | grep '^locus' | awk -F$'\t' 'BEGIN {OFS=FS} {print "k12",$3,$4,$1;}' > Out/tba_k12_loci.bed # # extract fasta sequences using bedtools bedtools getfasta -fi Data/ecoli-k12.fa -bed Out/tba_k12_loci.bed -fo Out/tba_k12_loci.fa # the output file Out/tba_k12_loci.fa is ready for further clustering. # However, computing this is computationally too expensive for the tutorial # class. Let's rather look at two smaller examples. # ############################################################ # CLUSTERING with mlocarna # ======================== # Cluster the set of RNA sequences in file Data/cluster-example.fa # (alternatively Out/tba_k12_loci.fa from the Ecoli RNAz screen) # mlocarna Data/cluster-example.fa -v --threads 4 --tgtdir Out/cluster-example.out # # print the calculated guide tree # cat Out/cluster-example.out/results/result.tree nw_display Out/cluster-example.out/results/result.tree # # Alternatively: produce svg and view or edit (e.g. inkscape) nw_display -s Out/cluster-example.out/results/result.tree > Out/cluster-example.svg # # or use a web server that displays trees in newick format, e.g. at # http://etetoolkit.org/treeview/ # # # get a first overview of 'interesting' clusters by applying RNAz on # subtree alignments # for f in Out/cluster-example.out/intermediates/intermediate*.aln; do RNAz $f; done | tee Out/cluster-example.rnaz # # Let's perform the same steps on Data from a SELEX experiment. # We are interested to see whether there are any interesting clusters # that can be formed # mlocarna Data/selex.fa -v --tgtdir Out/selex.out nw_display Out/selex.out/results/result.tree for f in Out/selex.out/intermediates/intermediate*.aln; do RNAz $f; done | tee Out/selex.rnaz ############################## # Possible further analysis ("Assignments") # # 1) choose single clusters (of 3-6 RNAs) and find corresponding # alignments in Out/cluster-example.out/intermediates/ # # 2) use tools like RNAalifold, RNAz, R-scape (web-server) on the cluster's alignment # # 3) can you find cluster's of RNAs with interesting common structure? # # 4) write a script to apply the Duda rule to selected clusters # # 4a) apply the Duda rule to select maximal clusters # # 5) Find a nice cluster, and build a Consensus Model (CM) using Infernal (cmbuild,cmcalibrate) # # 6) Search through Data/ecoli-k12.fa and other bacterial genomes (cmsearch) #