############################################################
# INSTALLATION
# =====================
# please do the installation before the class

##############################
# Get data and software
#
# Data for the tutorial and specific software is provided at
#
# https://www.tbi.univie.ac.at/~will/AlgoSB19
#
# Get most data in one package: download e.g. by
#
wget https://www.tbi.univie.ac.at/~will/AlgoSB19/algosb19-comprna.tgz
#
# and extract
#
tar xzf algosb19-comprna.tgz
cd AlgoSB19-CompRna

##
# if you don't have conda installed already, install miniconda
# (Python 3 version) following instructions (Steps 1 & 2) here:
#
# https://bioconda.github.io/
#
# don't forget to set up the channels 
#

#
# if conda is not working, find instructions to install the most important applications from scratch 
# without using conda in INSTALL-WO-CONDA.txt
#


## We assume conda resides in $HOME/miniconda3; otherwise adapt the path
#
CONDA_HOME=$HOME/miniconda3

## create a new conda environment and activate it 
#
conda create -n comprna viennarna clustalw rscape rnaz infernal bedtools newick_utils
conda activate comprna

## install locarna-2.0.0RC7 in the conda environment
#
# NOTE: if locarna 2.0.0RC7 (and the newest Vienna package) was already
#       installed outside of conda, you may skip this step
#
# Get locarna-2.0.0RC7p0.tar.gz from
# https://www.tbi.univie.ac.at/~will/AlgoSB19/
# to download it from linux command line run 
wget https://www.tbi.univie.ac.at/~will/AlgoSB19/locarna-2.0.0RC7p0.tar.gz


tar xzf locarna-2.0.0RC7p0.tar.gz
cd locarna-2.0.0RC7p0
./configure --prefix=$CONDA_PREFIX
make -j4
make install

############################################################
# Analysis of RNA alignments
# ==========================

# As running example, let's use a set of tRNAs in Data/trnas.fa
# Data/trnas.aln contains a sequence-based alignment of the tRNAs, which
# was produced using clustalw: 
cd Data
clustalw Data/trnas.fa
cd ..

#
# The sequences in MoreData can be analyzed in the same way as Data/trnas.fa
# and used for comparison.  What can you learn about those sequences?
# Are they RNAs of the same family; do they have stable structure?
# is their structure conserved?
#

##############################
# RNAalifold

RNAalifold --aln --color Data/trnas.aln

# have a look at the man page
#
man RNAalifold
#
# to understand the options and the output

#
# RNAalifold writes postscript files aln.pn and alirna.ps
# view with postscript viewer (in Linux, e.g. evince, okular, gv, ...)
#
# look at Doc/RNAalifold-legend.png (and into the man page) for the color code
#
eog Doc/RNAalifold-legend.png

# Usually, one gets better results from alifold by using ribosum scoring (option -r) 
#
RNAalifold --aln --color -r --cfactor 0.6 --nfactor 0.5 Data/trnas.aln 
#
# with the 'magic' parameters --cfactor 0.6 --nfactor 0.5 

#
# RNAalifold can compute partition functions and generate probability dot plots (option -p)
#
RNAalifold -p --color Data/trnas.aln
#
# have a look at alidot.ps (using evince, okular, ...)
#


#
# Compare the consensus structure to the structures predicted from single sequences
#
RNAfold < Data/trnas.aln
RNAalifold Data/trnas.aln
#
# and calculate the Structure conservation index (SCI)
#
# recall: SCI = MFE_alifold / MEAN_i(MFE_i) 
#
# btw, RNAalifold even has an option for this (check whether you get the same result)
#

#
# Finally, let's produce a stockholm file
#
RNAalifold --aln-stk=trnas Data/trnas.aln


############################################################
# R-scape
# =======
# http://eddylab.org/R-scape/
#
# Upload the stockholm file trnas.stk to the web server and run.
#
# Check out the additional information on significant covariation provided in the
# R2R structure plot and the covariation dot plot
#

############################################################
# RNAz
# ====
# https://www.tbi.univie.ac.at/software/RNAz/
#
RNAz Data/trnas.aln
#
#
# What information is provided in the output of RNAz?
#
# Do you see how (roughly) the features "Shannon entropy", "Mean z-score",
# and "Structure conservation index" influence the "SVM RNA-class probability"?
#

# RNAz alignment models
# ---------------------
# By default, RNAz assumes that the input alignment is generated by a sequence alignment
# program (like clustalw).
# With option -l, it assumes that the alignment is a structure alignments,
# as produced by LocARNA.
#
 
# RNAz background models
# ---------------------
# What is the difference between option --dinucleotide and --mononucleotide?
# Hint: this is related to the computation of z-scores.
#

############################################################
# LocARNA
# =======
# LocARNA is hosted at Github
# https://github.com/s-will/LocARNA
# where one finds newest releases and information.
#
# Additional information is found here:
# http://www.bioinf.uni-freiburg.de/Software/LocARNA/

# While there is a web server for LocARNA
# http://rna.informatik.uni-freiburg.de/LocARNA/Input.jsp
# we will look into LocARNA's command line tools,
# which provide much more flexibility (and are not limited in
# the number and length of sequences that can be aligned).
# Note that instead of using the released version, 
# I would now generally recommend to use the most recent 
# LocARNA release candidate (which we are using in this tutorial).
#

# The main tool of LocARNA is mlocarna. Use it to compute a multiple alignment
# by simply calling it on a fasta file with multiple RNA sequences: 
# 
mlocarna Data/tRNA_5.fa
#
# This call writes output files to the directory Data/tRNA_5.out
# One can change the output directory using --tgtdir
mlocarna Data/tRNA_5.fa --tgtdir Out/tRNA_5.out

#
# To learn more about the workflow of mlocarna,
# have a look at the output directory and also call
mlocarna Data/tRNA_5.fa --tgtdir Out/tRNA_5.out -v
#
# and
#
mlocarna Data/tRNA_5.fa --tgtdir Out/tRNA_5.out --moreverbose
#
# In particular, the latter produces a lot of output, 
# including detailed information about each step of the
# progressive multiple alignment procedure.
# (One can also make mlocarna shut up with -q)

# The output directory contains directories
# * input
#   with files like input/AF008220 for each input sequence.
#   These files are in LocARNAs pp2.0 format. Each
#   specifies a sequence (or alignment) and base pair probabilities
#   (Recall that LocARNA requires base pair probabilities to score
#   the base pair matches)
#
# * results
#   which contains all final results, in particular
#   the final alignment results/result.aln
#   Moreover, the guide tree: Out/tRNA_5.out/results/result.tree
#
# * intermediates
#   which contains the intermediate alignments that are
#   generated by the progressive alignment steps

# View the guide tree:
#
cat Out/tRNA_5.out/results/result.tree
nw_display Out/tRNA_5.out/results/result.tree 
# or use a web server that displays trees in newick format, e.g. at 
#     http://etetoolkit.org/treeview/

#
# Now, have a look at the man page of mlocarna. It documents all of the many command
# line arguments of the tool and provides some usage examples.
mlocarna --man
# As you have already seen, mlocarna performs a series of pairwise alignments for generating
# the multiple alignment. These are performed by the work-horse tool locarna.
# This program runs the PMcomp-style Sankoff algorithm on a pair of RNAs. Get help on
# the command line options of locarna, by
locarna --help

# locarna takes two inputs and aligns them
locarna Examples/mouse.fa Examples/human.fa

# There are some special occasions, where one wants to call locarna directly. For example,
# it can align alignments:
#
# generate two alignments by mlocarna:
#
mlocarna Data/tRNA_5.fa 
mlocarna Data/tRNA_7.fa
# 
# and align them using locarna
#
locarna Data/tRNA_5.out/results/result.aln Data/tRNA_7.out/results/result.aln 


## ---------------------------
# The effect of heuristics
#

# We look at some options that control heuristics in locarna and mlocarna:
#

# -p controls the cutoff probability which filters considered base pairs 
#
# --min-trace-probability controls the sequence-based pre-filter
# that can strongly limit the calculated DP matrix entries based on
# sequence similarities. Value 0 turns off this heuristic!
#
# --max-diff 60: limit the calculated entries in the DP matrices 
#                to a diagonal band (of width 60)

# compare times

time locarna Examples/mouse.fa Examples/human.fa

time locarna --min-trace-probability=0 Examples/mouse.fa Examples/human.fa

time locarna --min-trace-probability=0 --max-diff 60 Examples/mouse.fa Examples/human.fa

time locarna --min-trace-probability=0 -p 0.1 Examples/mouse.fa Examples/human.fa 
time locarna --min-trace-probability=0 -p 0.001 Examples/mouse.fa Examples/human.fa 


## ---------------------------
# Local alignment
#
# locarna and mlocarna supports several types of local alignment
#

# (sequence) local
locarna --sequ-local on Examples/mouse.fa Examples/human.fa

# sequence and structure local
locarna --sequ-local on --struct-local on Examples/mouse.fa Examples/human.fa

# free end gaps at the 3' ends or 5' ends
mlocarna --free-endgaps-3 Data/web-example1.fa
mlocarna --free-endgaps-5 Data/web-example1.fa


## ---------------------------
## Constraints for mlocarna

# Mlocarna supports structure constraints for folding and anchor constraints
# for alignment. Both types of constraints can be specified in extension of
# the standard fasta format via 'constraint lines'.
# (confer mlocarna's man page, section 'Use of constraints') 

# Here are two examples with constraints: 
mlocarna Example/example-w-constraints.fa
mlocarna Examples/haca.snoRNA.fa

# Both files give examples, how structure and anchor constraints can be
# expressed in fasta input. (Have a look at the files.)

# Anchor constraints can as well be passed using bed format:
mlocarna --anchor-constraints Examples/example-anchors.bed Examples/example-wo-anchors.fa


## ----------------
# Realignment
#
# mlocarna can realign alignments, which allows to keep
# certain columns unchanged by the new alignment
# 
mlocarna --realign Examples/example-realign.aln
#
# This is useful to manually fix-up alignments and realign
# while preserving the manual changes 
# Edit the example-realign.aln alignment and mark your changes:
cp Examples/example-realign.aln example-realign2.aln
#edit example-realign2.aln and mark changed columns
mlocarna --realign example-realign2.aln


# realignment can be moreover restricted to make only 'small' changes,
# which is mostly useful to speed up the re-alignment procedure.
#
mlocarna --realign Examples/example-realign.aln --max-diff 5 --max-diff-aln .


## ---------------
# Use local folding for long RNA sequences
# compare:
time mlocarna --plfold-span 100 Data/web-example1.fa
time mlocarna Data/web-example1.fa

## ---------------
# Probabilistic alignment
#
mlocarna --probabilistic Data/tRNA_5.fa
mlocarna --probabilistic --plfold-span 100 Data/web-example1.fa

## ---------------
# Use consistency transformation
#
mlocarna --probabilistic --consistency-transformation Data/tRNA_5.fa

## reliability profiles
#
# after the last call, compute a reliability profile of the alignment
# to show the local alignment quality
#
reliability-profile.pl Data/tRNA_5.out
#
# This plots the profile in rel.pdf
#
evince rel.pdf
#
## by default, it predicts the regions of highest signal, which can be used
# for predicting more accurate boundaries of non-coding RNAs from an RNAz screen.
# see 
reliability-profile.pl --help


##############################
# Analyse locarna alignments
#
# The alignments generated by locarna can be further analyzed using RNAalifold, R-scape, etc
# Go ahead:)
#
# Be aware of the different natures of sequence-based and structure-based alignments.
# For example, RNAz provides a special mode to correctly interpret structure-based alignments

clustalw Examples/hammerhead_3.fa
RNAz Examples/hammerhead_3.aln

mlocarna Examples/hammerhead_3.fa
RNAz -l Examples/hammerhead_3.out/results/result.aln


# generate stockholm format output (e.g. to analyse with R-scape)
#
mlocarna --stockholm Examples/hammerhead_3.fa
# writes output to Examples/hammerhead_3.out/results/result.stk

# additionally include consensus structure:
mlocarna --consensus-structure alifold --stockholm Examples/hammerhead_3.fa

cat Examples/hammerhead_3.out/results/result.stk

############################################################
# Infernal - build CMs
# ====================
# Doc/Infernal-userguide.pdf
# http://eddylab.org/infernal/

## Build a consensus model from a stockholm file,
# e.g. after the previous mlocarna call:
cmbuild hammerhead_3.cm Examples/hammerhead_3.out/results/result.stk

## Since we want to use the model for database search, we need to
# calibrate it (to later get proper E-values)
cmcalibrate hammerhead_3.cm

# ATTENTION: calibration is very time consuming. Use
cmcalibrate --forecast hammerhead_3.cm
# to learn how long this would take and
# proceed with the pre-computed file Data/hammerhead_3.cm

cmsearch Data/hammerhead_3.cm Data/ecoli-k12.fa


############################################################
# RNAz screen
# ===========
#
# We are going to perform an RNAz screen on a tba whole genome alignment of Enterobacteria (with reference E.coli K12):
#
Data/tba_k12only_SORTED.maf.gz
#
# to predict non-coding RNAs.
#

#
# get the RNAz utility tools into the path
#
export PATH=$CONDA_HOME/envs/comprna/share/RNAz/perl:$PATH

# use an extra directory for output (stay clean and tidy)
#
mkdir Out

# split the whole genome alignment into windows
# and run RNAz on the windows; keep RNAz hits (RNAz class probability P > 0.5) 
#
zcat Data/tba_k12.maf.gz | rnazWindow.pl -d -w 80 -s 40 --min-seqs=2 | RNAz -b -p 0.5 -o Out/tba_k12.rnaz
# or zcat Data/tba_k12.maf.gz | rnazWindow.pl -d -w 100 -s 50 --min-seqs=3 | RNAz -b -p 0.5 -o Out/tba_k12.rnaz
#
## NOTE: the same could be done in two steps, writing an intermediary file of the windows
# zcat Data/tba_k12.maf.gz | rnazWindow.pl -w 80 -s 40 >Out/tba_k12.windows
# cat Out/tba_k12.windows | RNAz -b -p 0.9 -o Out/tba_k12.rnaz
#
# Running the last command is very time consuming, but you can continue with 
# the precomputed results in Data/tba_k12.rnaz
# For now, put the last command in the background. 
# (Stop with Ctrl-z and continue in the background with bg) 
# Later, see whether you get the same results.
#

# combine high confidence predictions (RNAz class probability P > 0.99)
cat Data/tba_k12.rnaz | rnazCluster.pl -c 0.99 -d > Out/tba_k12.results

# optionally: generate html output of rnaz hits (requires some additional tools)
# using the option '--html' of rnazCluster.pl
# look up details in the man page:
rnazCluster.pl --man

## get the locus sequences
#
# generate bed file
cat Out/tba_k12.results | grep '^locus' | awk -F$'\t' 'BEGIN {OFS=FS} {print "k12",$3,$4,$1;}' > Out/tba_k12_loci.bed
#
# extract fasta sequences using bedtools
bedtools getfasta -fi Data/ecoli-k12.fa -bed Out/tba_k12_loci.bed -fo Out/tba_k12_loci.fa

# the output file Out/tba_k12_loci.fa is ready for further clustering.
# However, computing this is computationally too expensive for the tutorial 
# class. Let's rather look at two smaller examples.
#

############################################################
# CLUSTERING with mlocarna
# ========================

# Cluster the set of RNA sequences in file Data/cluster-example.fa 
#   (alternatively Out/tba_k12_loci.fa from the Ecoli RNAz screen)
# 
mlocarna Data/cluster-example.fa -v --threads 4 --tgtdir Out/cluster-example.out

#
# print the calculated guide tree
#
cat Out/cluster-example.out/results/result.tree
nw_display Out/cluster-example.out/results/result.tree

#
# Alternatively: produce svg and view or edit (e.g. inkscape)
nw_display -s Out/cluster-example.out/results/result.tree > Out/cluster-example.svg
#
# or use a web server that displays trees in newick format, e.g. at 
#     http://etetoolkit.org/treeview/
#

#
# get a first overview of 'interesting' clusters by applying RNAz on
# subtree alignments 
#
for f in Out/cluster-example.out/intermediates/intermediate*.aln; do RNAz $f; done | tee Out/cluster-example.rnaz 

#
# Let's perform the same steps on Data from a SELEX experiment.
# We are interested to see whether there are any interesting clusters
# that can be formed
#
mlocarna Data/selex.fa -v --tgtdir Out/selex.out

nw_display Out/selex.out/results/result.tree

for f in Out/selex.out/intermediates/intermediate*.aln; do RNAz $f; done | tee Out/selex.rnaz 


##############################
# Possible further analysis ("Assignments")
#
# 1) choose single clusters (of 3-6 RNAs) and find corresponding
#    alignments in Out/cluster-example.out/intermediates/
#
# 2) use tools like RNAalifold, RNAz, R-scape (web-server) on the cluster's alignment
#  
# 3) can you find cluster's of RNAs with interesting common structure?
#
# 4) write a script to apply the Duda rule to selected clusters
#
# 4a) apply the Duda rule to select maximal clusters
#
# 5) Find a nice cluster, and build a Consensus Model (CM) using Infernal (cmbuild,cmcalibrate)
# 
# 6) Search through Data/ecoli-k12.fa and other bacterial genomes (cmsearch)
#