| [Top] | [Contents] | [Index] | [ ? ] |
1. Introduction 2. Routines 3. Parsing of alignments 4. References Function Index Variable Index
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
probA calculates the partition function over all alignments between two
sequences. The partition function is used to determine the matching
probabilities for all possible matches i,j of the two sequences.
Furthermore the partition function can be used to make a stochastic
backtracking.
The program probA is build from different modules that perform various
functions: calculation of a pairwise global alignment with affine gap
penalties, calculation of matching probabilities and a stochastic
backtracking, gernerating an ensemble of properly weighted optimal and
suboptimal alignments. For those who wish to develop their own programs we
provide a library which can be linked to your own code.
This document only describes the library and will be primarily useful to
programmers. The stand-alone program is described in a separate man page.
This manual documents version 0.1.0.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
2.1 Global Pairwise Alignment 2.2 Matching Probabilities 2.3 Stochastic Backtracking 2.4 Global Variables 2.4.3 Structures and Definitions
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The library provides a variation of a Needleman Wunsch dynamic programming
algorithm to calculate a pairwise global alignment of two sequences.
For the calculation of the alignment a affine gap penalty is used.
align is an array of two structures
sequ (see section 2.4.3 Structures and Definitions). Structure sequ stores a sequence
and the name of the sequence.
Each input sequence has to be written in the sequ structure
array.align calculates the global pairwise alignment of the two
sequences contained in seq_array and returns a structure of the type
aligm (see section 2.4.3 Structures and Definitions) containing the score of the alignment
and the alignment encoded as a string of digits
(see section 3.1 Representation of Alignments).align uses a variation of a dynamic programming algorithm
to calculate an optimal alignment of the two sequences. Using
Egap_flag the user can determine if endgaps are treated like gaps
inside the alignment or differently (see section 2.4.2 Flags).
pam_distance is the observed identity
returned by function observed_identity, the return value is the
PAM distance of the two sequences (see section 3.3 PAM Distance).align.
decode_alig.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Instead of calculating only one optimal alignment the partition function over all possible alignments of the two sequences can be calculated. From the partition function the matching probabilities for all possible matches between letters of the two sequences can be computed. The theory of probabilistic alignments derived from a thermodynamic partition function is described in Kschischo and Laessig (2000).
align
(see section 2.1 Global Pairwise Alignment), the only argument for partition_f is the aligm
structure returned by align.partition_f calculates the partition function over all alignments of
the two input sequences, the partition function is then used to determine
the matching probability for each possible match i,j between the two
sequences. partition_f returns a two dimensional real array
(see section 2.4.3 Structures and Definitions) containing the matching probability for each possible
match i,j. The matching probability array can be represented as a dot plot
using function ps_plot.
ps_plot represents the matching probabilities as a dot plot.
The first argument of ps_plot is the matching probability array
returned by partition_f, the second argument is the name of the
output file for the dot plot.
partition_f.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The partition function over all possible alignments of two sequences can be used to calculate a stochastic backtracking. The stochastic backtracking generates alternative optimal and suboptimal alignments stochastically.
stoch_backtr you have to invoke function partition_f
first.
partition_f sets some variables that are essential for the
stochastic backtracking.stoch_backtr can be used to generate alternative optimal and
suboptimal alignments by stochastic backtracking. The function returns
an aligm-structure (see section 2.4.3 Structures and Definitions) containing one stochastic
alignment.
To create an ensemble of optimal and suboptimal alignments function
stoch_backtr has to be invokes several times.
stoch_backtr.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
2.4.1 Global Variables 2.4.2 Flags
The following variables change the performance of the alignment algorithm.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The following flags should be set:
check_polymer() is called. Function
check_polymer() decides whether the input polymer is DNA or
protein. The decision is based on counting all A,C,G,T,U residues in the
polymer of question. If at least 85% of all residues of the sequence are
either A, C, G, T or U the polymer is treated as DNA. The concept for this
function is taken from ClustalW Thompson et al. (1994).
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
The following structures and definitions are used in the described functions:
When compiling the library, real can be set to float or
double, depending on the memory available on your machine.
real is defined in pfgoto.h.
#define real double |
Structure sequ stores the sequence name and the sequence :
typedef struct{
char *name; /* name of the sequence */
char *seq; /* sequence */
} sequ;
|
Structure aligm stores both sequences (each in one sequ structure), the alignment encoded as a string of digits (see section 3.1 Representation of Alignments), the score of the alignment and the probability of the alignment if the alignment was generated by stochastic backtracking.
typedef struct{
/* s0 and s1 store the sequences as they are used in the program */
sequ s0;
sequ s1;
char *a; /* alignment: encoded as a string of digits
| match
: mismatch
. gaps in the shorter seq (s1)
^ gaps in the longer seq (s0) */
double score; /* score of the alignment */
double prob; /* probability of the alignment */
} aligm;
|
Structure al stores an alignment in the classical sense: the name of the sequence and the gapped sequence. If the sequences have different lengths the longer sequence is stored in s0, the shorter in s1, otherwise they are stored as entered (see section 3.1 Representation of Alignments).
typedef struct{
sequ s0;
sequ s1;
double score;
} al;
|
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
3.1 Representation of Alignments 3.2 Scoring Matrices 3.3 PAM Distance
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
In the usual representation of alignments, the first sequence is written above the second one. In regions of high similarity, similar or evolutionary related residues are written in one column, this is called a match or a mismatch, respectively. In areas where residues were deleted or inserted, the missing residues are substituted by the gap symbol ( - ).
C P S G C T N F K - C A
C P T G - - N Y K K C A
| | : | . . | : | ^ | |
|
The alignment is explicitly described by four states: match (aligned residues
are identical), mismatch (aligned residues are different), deletion (residues
in the above sequence have been deleted in the lower sequence) and
insertion (residues have been added to the lower sequence). For a
non-ambiguous representation of an alignment it is mandatory to determine
which of the sequences is the above one.
The library assigns the longer of the two input sequences as the above
sequence of the alignment. If both sequences have the same length
the first entered sequence is the above one.
Throughout the program the above sequence is referred to as s0, the
lower one as s1.
The alignment is encoded as a string of symbols:
'|' is assigned to a match, ':' to a mismatch, '.' to a deletion (missing
residues in the lower sequence are replaced by gaps) and '^' to an insertion
(missing residues in the above sequence are replaced by gaps).
Function aligne (see section 2.1 Global Pairwise Alignment) returns structure
aligm containing the sequences as used in the program (s0,
above sequence; s1, lower sequence), the alignment encoded as a string
of digits and the score of the alignment (see section 2.4.3 Structures and Definitions).
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
Amino acid substitution matrices provided by the library include:
Gonnet matrix series Benner et al. (1994):
from the gonnet_series, which is the default series, seven members are used:
gonnet_series [gon]: gonnet_40, gonnet_80, gonnet_120, gonnet_160,
gonnet_250, gonnet_300, gonnet_350;
BLOSUM matrix series Henikoff and Henikoff (1992):
from the BLOSUM series four members are provided:
blosum_series [blo]: blosum_30, blosum_50, blosum_62, blosum_80;
PAM matrix series Dayhoff et al. (1978):
four of Dayhoff's PAM matrices are offered:
pam_series [pam]: pam_20, pam_60, pam_120, pam_350;
Shortcuts used for the matrix series are indicated in brackets (e.g. write pam to use the pam_series of substitution matrices. Write only pam, do not write the brackets).
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
To compute an alignment between two sequences, you need to measure their
evolutionary distance. The matrix series used in the library maximize the
similarity between two sequences: Similarity of aligned residues is rewarded
by a positive score, dissimilarity is penalized by a negative score.
A commonly used set of substitution matrices are Dayhoff's PAM matrices
Dayhoff et al. (1978). Dayhoff aligned a set of at least 85%
identical sequences to count the accepted point mutations (how often
different amino acids are replaced by each other in evolution) and the
relative mutability of different amino acids. These data were combined to
produce a mutation probability matrix. From the mutation probability matrix
a log odds matrix was constructed by dividing each element of the mutation
data matrix by its normalized frequency and then taking the log of each
element. The elements of the log odds matrix give the probability that
the amino acid in one column will be replaced by the amino acid in some
row after a given evolutionary interval. One PAM (Percent Accepted
Mutation) unit therefore represents one accepted point mutation between
two sequences, per 100 residues.
To determine the evolutionary distance for which a given substitution
matrix is calculated, the PAM matrix series and the Gonnet matrix series
use the PAM distance. The library provides the function pam_distance
(see section 2.1 Global Pairwise Alignment) to calculate the PAM distance.
For the BLOSUM matrix series the evolutionary distance is determined by
the observed identity. A BLOSUM 62 matrix is, for example, a substitution
matrix calculated for comparisons of sequences with no less than 62%
divergence (an identity of 62% and more)
Henikoff and Henikoff (1992). To calculate the observed identity
call function observed_identity (see section 2.1 Global Pairwise Alignment) and multiply
the return value by 100.
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
We used a simple formula to convert an observed distance to one that
is corrected for multiple hits. The observed distance is the mean
number of differences per site in an alignment (ignoring sites
with a gap) and is therefore always between 0.0 (for identical sequences)
and 1.0 (no residues the same at any site). These distances can be
multiplied by 100 to give percent difference values. 100 minus percent
difference gives percent identity.
The formula we use to correct for multiple hits is from Motoo Kimura
M. Kimura (1983) where D is the observed distance and K is the
corrected distance:
K = -Ln(1 - D - (D.D)/5)
|
This formula gives the mean number of estimated substitutions per site and,
in contrast to D (the observed number), can be greater than 1 (i.e. more
than one substitution per site, on average).
This can also be expressed in PAM units by multiplying by 100 (mean
number of substitutions per 100 residues).
Dayhoff et al constructed an elaborate model of protein evolution based
on observed frequencies of substitution between very closely related
proteins. Using this model, they derived a table relating observed
distances to predicted PAM distances.
Kimura's formula, above, is just a "curve fitting" approximation to this
table. It is very accurate in the range 0.75 > D > 0.0 but becomes
increasingly inaccurate at high D (>0.75) and fails completely at around
D = 0.85. For D > 0.75 the PAM distance is approximated using Dayhoff's
table. The concept for the calculation of PAM distances is taken from
ClustalW Thompson et al. (1994).
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
| Jump to: | *
A D F O P S |
|---|
| Jump to: | *
A D F O P S |
|---|
| [ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
| Jump to: | B D E M T |
|---|
| Index Entry | Section | |
|---|---|---|
| | ||
| B | ||
BETA | 2.4.1 Global Variables | |
| | ||
| D | ||
DISTANCE | 2.4.1 Global Variables | |
| | ||
| E | ||
Egap_flag | 2.4.2 Flags | |
ENDGAP | 2.4.1 Global Variables | |
| | ||
| M | ||
MAT_SER[20] | 2.4.1 Global Variables | |
| | ||
| T | ||
typ_flag | 2.4.2 Flags | |
| | ||
| Jump to: | B D E M T |
|---|
| [Top] | [Contents] | [Index] | [ ? ] |
| [Top] | [Contents] | [Index] | [ ? ] |
1. Introduction
2. Routines
3. Parsing of alignments
4. References
Function Index
Variable Index
| [Top] | [Contents] | [Index] | [ ? ] |
| Button | Name | Go to | From 1.2.3 go to |
|---|---|---|---|
| [ < ] | Back | previous section in reading order | 1.2.2 |
| [ > ] | Forward | next section in reading order | 1.2.4 |
| [ << ] | FastBack | previous or up-and-previous section | 1.1 |
| [ Up ] | Up | up section | 1.2 |
| [ >> ] | FastForward | next or up-and-next section | 1.3 |
| [Top] | Top | cover (top) of document | |
| [Contents] | Contents | table of contents | |
| [Index] | Index | concept index | |
| [ ? ] | About | this page |