[Top] [Contents] [Index] [ ? ]

probA_lib

This file documents probA_lib Version 0.1.0

1. Introduction  
2. Routines  
3. Parsing of alignments  
4. References  
Function Index  
Variable Index  


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

1. Introduction

probA calculates the partition function over all alignments between two sequences. The partition function is used to determine the matching probabilities for all possible matches i,j of the two sequences. Furthermore the partition function can be used to make a stochastic backtracking.
The program probA is build from different modules that perform various functions: calculation of a pairwise global alignment with affine gap penalties, calculation of matching probabilities and a stochastic backtracking, gernerating an ensemble of properly weighted optimal and suboptimal alignments. For those who wish to develop their own programs we provide a library which can be linked to your own code.
This document only describes the library and will be primarily useful to programmers. The stand-alone program is described in a separate man page. This manual documents version 0.1.0.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2. Routines

2.1 Global Pairwise Alignment  
2.2 Matching Probabilities  
2.3 Stochastic Backtracking  
2.4 Global Variables  
2.4.3 Structures and Definitions  


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.1 Global Pairwise Alignment

The library provides a variation of a Needleman Wunsch dynamic programming algorithm to calculate a pairwise global alignment of two sequences. For the calculation of the alignment a affine gap penalty is used.

Function: aligm align (sequ *seq_array)
The argument of function align is an array of two structures sequ (see section 2.4.3 Structures and Definitions). Structure sequ stores a sequence and the name of the sequence. Each input sequence has to be written in the sequ structure array.
Function align calculates the global pairwise alignment of the two sequences contained in seq_array and returns a structure of the type aligm (see section 2.4.3 Structures and Definitions) containing the score of the alignment and the alignment encoded as a string of digits (see section 3.1 Representation of Alignments).
The library provides three different scoring matrix series for protein alignments (see section 3.2 Scoring Matrices). The user can select the scoring matrix series (see section 2.4 Global Variables). The library than calculates the pairwise identity of the input sequences and determines which matrix out of a series to use. It is also possible to select one specific matrix out of a substitution matrix series (see section 2.4 Global Variables)
Function align uses a variation of a dynamic programming algorithm to calculate an optimal alignment of the two sequences. Using Egap_flag the user can determine if endgaps are treated like gaps inside the alignment or differently (see section 2.4.2 Flags).

Function: al decode_alig (aligm track)
converts the alignment encoded as a string of digits to the usual alignment representation (see section 3.1 Representation of Alignments), the alignment is written in a structure of type al (see section 2.4.3 Structures and Definitions).

Function: float observed_identity (aligm track)
calculats the observed identity (matches/(matches+mismatches)) between two sequences (see section 3.3 PAM Distance).

Function: float pam_distance (float identity)
the argument for function pam_distance is the observed identity returned by function observed_identity, the return value is the PAM distance of the two sequences (see section 3.3 PAM Distance).
This function only approximates the PAM distance, it is not useful for applications which require exact evolutionary distances !

Function: void free_align (aligm track)
frees everything allocated in align.

Function: void free_al (al a)
frees everything allocated in decode_alig.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.2 Matching Probabilities

Instead of calculating only one optimal alignment the partition function over all possible alignments of the two sequences can be calculated. From the partition function the matching probabilities for all possible matches between letters of the two sequences can be computed. The theory of probabilistic alignments derived from a thermodynamic partition function is described in Kschischo and Laessig (2000).

Function: real **partition_f (aligm track)
Before calling this function it is necessary to call function align (see section 2.1 Global Pairwise Alignment), the only argument for partition_f is the aligm structure returned by align.
partition_f calculates the partition function over all alignments of the two input sequences, the partition function is then used to determine the matching probability for each possible match i,j between the two sequences. partition_f returns a two dimensional real array (see section 2.4.3 Structures and Definitions) containing the matching probability for each possible match i,j. The matching probability array can be represented as a dot plot using function ps_plot.

Function: void ps_plot (real **matchprob, char* outfile)
Function ps_plot represents the matching probabilities as a dot plot. The first argument of ps_plot is the matching probability array returned by partition_f, the second argument is the name of the output file for the dot plot.
Function: void free_partition_f (real **matchprob, sequ* seq_array)
frees everything allocated in partition_f.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.3 Stochastic Backtracking

The partition function over all possible alignments of two sequences can be used to calculate a stochastic backtracking. The stochastic backtracking generates alternative optimal and suboptimal alignments stochastically.

Function: aligm stoch_backtr (aligm track)
To call stoch_backtr you have to invoke function partition_f first. partition_f sets some variables that are essential for the stochastic backtracking.
stoch_backtr can be used to generate alternative optimal and suboptimal alignments by stochastic backtracking. The function returns an aligm-structure (see section 2.4.3 Structures and Definitions) containing one stochastic alignment. To create an ensemble of optimal and suboptimal alignments function stoch_backtr has to be invokes several times.

Function: void free_stoch (aligm stochalig)
frees the alignment returned by stoch_backtr.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.4 Global Variables

2.4.1 Global Variables  
2.4.2 Flags  

The following variables change the performance of the alignment algorithm.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.4.1 Global Variables

Variable: float BETA
BETA is the ratio of a scoring matrix dependent partition module and a variable T. T governs the relative weight of alignment-paths with different scores. The lower the values of T the higher is the weight given to paths with high scores. The value for T should be provided as a command line option. Assign the inverse of T to BETA. As a default value for T we used 1.

Variable: float ENDGAP
stores the score for terminal gaps. Terminal gaps are scored differently from gaps inside an alignment if Egap_flag is 1 (see section 2.4.2 Flags). As a default value for ENDGAP we used 0.

Variable: char MAT_SER[20]
determines which substitution matrix series is used. The library provides different scoring matrix series for proteins (see section 3.2 Scoring Matrices). A shortcut for the selected substitution matrix series is stored in the global variable MAT_SER. As default substitution matrix series we used the gonnet_series [gon].

Variable: float DISTANCE
This variable allows the user to select one scoring matrix out of a series. The matrix series of choice has to be set by MAT_SER. To determine which matrix out of a series to use, set DISTANCE to the pam distance (Gonnet or PAM series) or the observed identity (for BLOSUM series) (see section 3.3 PAM Distance) of the selected matrix. (i.g. to select the commonly used PAM250 matrix set, MAT_SER to pam and DISTANCE to 250)
default is -1, a score matrix out of a series specified by MAT_SER is selected automatically. If DISTANCE is positive the scoring matrix closest to the value of DISTANCE is selected.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.4.2 Flags

The following flags should be set:

Variable: int typ_flag
This variable stores the type of the input polymer. For nucleic acids (DNA or RNA) it should be set to 1, for proteins it should be set to 0. For the default value of typ_flag, we used -1 (or FALSE). If typ_flag is set to -1, that is, if the user does not specify the kind of polymer, the function check_polymer() is called. Function check_polymer() decides whether the input polymer is DNA or protein. The decision is based on counting all A,C,G,T,U residues in the polymer of question. If at least 85% of all residues of the sequence are either A, C, G, T or U the polymer is treated as DNA. The concept for this function is taken from ClustalW Thompson et al. (1994).

Variable: int Egap_flag
If terminal gaps are penalized with the same scores as all other gaps, alignments can be generated which have one or a few matched residues at the margin of the alignment followed by an extended gap. To prevent single residues from jumping to the edge of the alignment, terminal gaps get a lower gap penalty than gaps inside the alignment. If Egap_flag is 1 terminal gaps are scored differently. The score for terminal gaps should be stored in the global variable ENDGAP (see section Global Variables).
To treat terminal gaps like gaps inside the alignment set Egap_flag to 0.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.4.3 Structures and Definitions

The following structures and definitions are used in the described functions:

When compiling the library, real can be set to float or double, depending on the memory available on your machine. real is defined in pfgoto.h.
 
#define real double

Structure sequ stores the sequence name and the sequence :
 
typedef struct{
  char *name; /* name of the sequence */
  char *seq;  /* sequence */
} sequ;

Structure aligm stores both sequences (each in one sequ structure), the alignment encoded as a string of digits (see section 3.1 Representation of Alignments), the score of the alignment and the probability of the alignment if the alignment was generated by stochastic backtracking.
 
typedef struct{
  /* s0 and s1 store the sequences as they are used in the program */
  sequ s0;
  sequ s1; 
  char *a;       /* alignment: encoded as a string of digits
	            | match
		    : mismatch
	            . gaps in the shorter seq (s1)
	            ^ gaps in the longer seq (s0) */ 
  double score;  /* score of the alignment */
  double prob;   /* probability of the alignment */
} aligm;

Structure al stores an alignment in the classical sense: the name of the sequence and the gapped sequence. If the sequences have different lengths the longer sequence is stored in s0, the shorter in s1, otherwise they are stored as entered (see section 3.1 Representation of Alignments).

 
typedef struct{
  sequ s0;
  sequ s1;
  double score;
} al;


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3. Parsing of alignments

3.1 Representation of Alignments  
3.2 Scoring Matrices  
3.3 PAM Distance  


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.1 Representation of Alignments

In the usual representation of alignments, the first sequence is written above the second one. In regions of high similarity, similar or evolutionary related residues are written in one column, this is called a match or a mismatch, respectively. In areas where residues were deleted or inserted, the missing residues are substituted by the gap symbol ( - ).

 
                     
                          C P S G C T N F K - C A   
                          C P T G - - N Y K K C A   
                     
                          | | : | . . | : | ^ | |

The alignment is explicitly described by four states: match (aligned residues are identical), mismatch (aligned residues are different), deletion (residues in the above sequence have been deleted in the lower sequence) and insertion (residues have been added to the lower sequence). For a non-ambiguous representation of an alignment it is mandatory to determine which of the sequences is the above one.
The library assigns the longer of the two input sequences as the above sequence of the alignment. If both sequences have the same length the first entered sequence is the above one. Throughout the program the above sequence is referred to as s0, the lower one as s1.
The alignment is encoded as a string of symbols: '|' is assigned to a match, ':' to a mismatch, '.' to a deletion (missing residues in the lower sequence are replaced by gaps) and '^' to an insertion (missing residues in the above sequence are replaced by gaps).
Function aligne (see section 2.1 Global Pairwise Alignment) returns structure aligm containing the sequences as used in the program (s0, above sequence; s1, lower sequence), the alignment encoded as a string of digits and the score of the alignment (see section 2.4.3 Structures and Definitions).


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.2 Scoring Matrices

Methods for alignment of protein sequences typically measure similarity by using a substitution matrix with scores for all possible exchanges of one amino acid with another. Each substitution matrix is defined for a specific evolutionary distance (see section 3.3 PAM Distance). Depending on the distance between the two sequences, an appropriate matrix is selected. The program contains different scoring matrix series for proteins, that can be selected by the option MAT_SER (see section 2.4 Global Variables). For the alignment of nucleic acid the scoring matrix used by ClustalW is applied Thompson et al. (1994).

Amino acid substitution matrices provided by the library include:

Gonnet matrix series Benner et al. (1994): from the gonnet_series, which is the default series, seven members are used:
gonnet_series [gon]: gonnet_40, gonnet_80, gonnet_120, gonnet_160, gonnet_250, gonnet_300, gonnet_350;

BLOSUM matrix series Henikoff and Henikoff (1992): from the BLOSUM series four members are provided:
blosum_series [blo]: blosum_30, blosum_50, blosum_62, blosum_80;

PAM matrix series Dayhoff et al. (1978): four of Dayhoff's PAM matrices are offered:
pam_series [pam]: pam_20, pam_60, pam_120, pam_350;

Shortcuts used for the matrix series are indicated in brackets (e.g. write pam to use the pam_series of substitution matrices. Write only pam, do not write the brackets).


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.3 PAM Distance

To compute an alignment between two sequences, you need to measure their evolutionary distance. The matrix series used in the library maximize the similarity between two sequences: Similarity of aligned residues is rewarded by a positive score, dissimilarity is penalized by a negative score.
A commonly used set of substitution matrices are Dayhoff's PAM matrices Dayhoff et al. (1978). Dayhoff aligned a set of at least 85% identical sequences to count the accepted point mutations (how often different amino acids are replaced by each other in evolution) and the relative mutability of different amino acids. These data were combined to produce a mutation probability matrix. From the mutation probability matrix a log odds matrix was constructed by dividing each element of the mutation data matrix by its normalized frequency and then taking the log of each element. The elements of the log odds matrix give the probability that the amino acid in one column will be replaced by the amino acid in some row after a given evolutionary interval. One PAM (Percent Accepted Mutation) unit therefore represents one accepted point mutation between two sequences, per 100 residues.
To determine the evolutionary distance for which a given substitution matrix is calculated, the PAM matrix series and the Gonnet matrix series use the PAM distance. The library provides the function pam_distance (see section 2.1 Global Pairwise Alignment) to calculate the PAM distance.
For the BLOSUM matrix series the evolutionary distance is determined by the observed identity. A BLOSUM 62 matrix is, for example, a substitution matrix calculated for comparisons of sequences with no less than 62% divergence (an identity of 62% and more) Henikoff and Henikoff (1992). To calculate the observed identity call function observed_identity (see section 2.1 Global Pairwise Alignment) and multiply the return value by 100.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

3.3.1 Calculation of the PAM Distance

We used a simple formula to convert an observed distance to one that is corrected for multiple hits. The observed distance is the mean number of differences per site in an alignment (ignoring sites with a gap) and is therefore always between 0.0 (for identical sequences) and 1.0 (no residues the same at any site). These distances can be multiplied by 100 to give percent difference values. 100 minus percent difference gives percent identity.
The formula we use to correct for multiple hits is from Motoo Kimura M. Kimura (1983) where D is the observed distance and K is the corrected distance:

 
                        K = -Ln(1 - D - (D.D)/5)
                      

This formula gives the mean number of estimated substitutions per site and, in contrast to D (the observed number), can be greater than 1 (i.e. more than one substitution per site, on average). This can also be expressed in PAM units by multiplying by 100 (mean number of substitutions per 100 residues).
Dayhoff et al constructed an elaborate model of protein evolution based on observed frequencies of substitution between very closely related proteins. Using this model, they derived a table relating observed distances to predicted PAM distances. Kimura's formula, above, is just a "curve fitting" approximation to this table. It is very accurate in the range 0.75 > D > 0.0 but becomes increasingly inaccurate at high D (>0.75) and fails completely at around D = 0.85. For D > 0.75 the PAM distance is approximated using Dayhoff's table. The concept for the calculation of PAM distances is taken from ClustalW Thompson et al. (1994).


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

4. References


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

Function Index

Jump to:   *  
A   D   F   O   P   S  

Index Entry Section

*
**partition_f2.2 Matching Probabilities

A
align2.1 Global Pairwise Alignment

D
decode_alig2.1 Global Pairwise Alignment

F
free_al2.1 Global Pairwise Alignment
free_align2.1 Global Pairwise Alignment
free_partition_f2.2 Matching Probabilities
free_stoch2.3 Stochastic Backtracking

O
observed_identity2.1 Global Pairwise Alignment

P
pam_distance2.1 Global Pairwise Alignment
ps_plot2.2 Matching Probabilities

S
stoch_backtr2.3 Stochastic Backtracking

Jump to:   *  
A   D   F   O   P   S  


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

Variable Index

Jump to:   B   D   E   M   T  

Index Entry Section

B
BETA2.4.1 Global Variables

D
DISTANCE2.4.1 Global Variables

E
Egap_flag2.4.2 Flags
ENDGAP2.4.1 Global Variables

M
MAT_SER[20]2.4.1 Global Variables

T
typ_flag2.4.2 Flags

Jump to:   B   D   E   M   T  


[Top] [Contents] [Index] [ ? ]

Table of Contents


[Top] [Contents] [Index] [ ? ]

Short Table of Contents

1. Introduction
2. Routines
3. Parsing of alignments
4. References
Function Index
Variable Index

[Top] [Contents] [Index] [ ? ]

About this document

This document was generated by Ulli Mueckstein on February, 26 2002 using texi2html

The buttons in the navigation panels have the following meaning:

Button Name Go to From 1.2.3 go to
[ < ] Back previous section in reading order 1.2.2
[ > ] Forward next section in reading order 1.2.4
[ << ] FastBack previous or up-and-previous section 1.1
[ Up ] Up up section 1.2
[ >> ] FastForward next or up-and-next section 1.3
[Top] Top cover (top) of document  
[Contents] Contents table of contents  
[Index] Index concept index  
[ ? ] About this page  

where the Example assumes that the current position is at Subsubsection One-Two-Three of a document of the following structure:

This document was generated by Ulli Mueckstein on February, 26 2002 using texi2html