Defensio Abstract

Speaker: Johannes Söllner
Title: Parametrization and Classification of Peptides for the Detection of Linear B-cell Epitopes


To date, machine learning strategies have been applied extensively to T-cell epitope prediction, while B-cell epitopes have been comparatively neglected. This is presumably because antibody binding sites are less restrictively defined than TCR/MHC/peptide interactions, for which several practical rules could be extracted. B-cell epitope determinants seem to be more diffusely defined, presumably mainly by their degree of surface localisation and their potential to interact. Existing methods in the field have focused either on mono-parametric approaches or on the weighted combination of a few scales, such as hydrophilicity and secondary structure prevalence. However, only a single solution, published by Kolaskar and Tongaonkar (1990), provides a class assignment, i.e. epitopic or non-epitopic. This lack of definitive classifications could be attributed to the difficulty of making clear-cut assignments where probabilities would rather be expected. Unfortunately, probabilities are also missing from current approaches, which usually feature a propensity scale of some sort but lack classification thresholds or rates of correct and false classifications.

To close this gap, we applied two conceptually simple and popular machine learning approaches to combine a number of parameters that can be calculated for short amino acid chains. These parameters (attributes) were taken from the literature, prepared in a molecular simulation package, or developed in the course of this work. Among the last class, those describing the neighbourhood relationships between amino acids are the most noteworthy. It should also be noted that the literature-derived single amino acid scales used in this work had only in a few cases previously been used in the context of B-cell epitopes, e.g. the one by Hopp and Woods (1981); most had instead been applied in other fields of protein/amino acid research. From the total available set of 18920 attributes, several possible subsets were selected to make the applied learning strategies practically feasible. These selections were accompanied by a procedure for reference sampling, as obtaining a dataset of non-epitopic data was one of the central obstacles of this project. Finally, we used Post Test Probability (PTP) filtering and ROC curves to select classifiers and ultimately to determine how well our approach performed compared with the current gold standard, antigenic, by Kolaskar and Tongaonkar (1990).
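The two evaluation tools named above can be illustrated with a minimal sketch. The abstract does not define PTP, so the code assumes the usual diagnostic-testing meaning: the probability that a peptide classified as epitopic really is epitopic, given the classifier's sensitivity and specificity and an assumed pre-test probability (prior). The function names, the toy scores and the prior are all hypothetical and are not taken from the study.

```python
def post_test_probability(sensitivity, specificity, pre_test_probability):
    """Probability of a true epitope given a positive call (Bayes' rule).

    Assumption: this is the standard diagnostic-testing definition,
    not necessarily the exact PTP filter used in the work.
    """
    true_pos = sensitivity * pre_test_probability
    false_pos = (1.0 - specificity) * (1.0 - pre_test_probability)
    if true_pos + false_pos == 0.0:
        return 0.0  # the classifier never calls anything positive
    return true_pos / (true_pos + false_pos)


def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    One point per distinct score threshold, from strictest to most lenient.
    labels: 1 = epitopic, 0 = non-epitopic (hypothetical encoding).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a ROC curve given as sorted (fpr, tpr) pairs."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data (illustrative only): classifier scores for four peptides.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 1, 0, 0]
toy_roc = roc_points(toy_scores, toy_labels)
toy_auc = area_under_curve(toy_roc)
toy_ptp = post_test_probability(sensitivity=0.8, specificity=0.9,
                                pre_test_probability=0.5)
```

Under these assumptions, a classifier would be retained only if its post-test probability exceeds a chosen cut-off, and the remaining candidates could then be compared by their ROC curves.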

Our work was facilitated by the large experimental dataset of putative epitopes of bacterial pathogens generated in the course of the Intercell Antigen Identification Program (AIP).