Probabilistic context-free grammar for pattern detection in protein sequences

  • Witold Dyrka

Research output: Thesis (Master's thesis)

Abstract

Analysis of protein sequences to predict their functions is a very challenging problem for which pattern recognition techniques based on Hidden Markov Models (HMMs) have proved to be the most efficient. However, HMMs have limitations: according to formal language theory, their expressive power is similar to that of Probabilistic Regular Grammars (PRGs). Here, we propose a pattern recognition method based on a more powerful grammar. We developed a system based on Probabilistic Context-Free Grammars (PCFGs) to detect protein regions that are involved in binding sites. To deal with the size of the protein alphabet, we use quantitative properties of amino acids to reduce the number of rules. The grammars based on different properties are then combined to retain as much information as possible. To increase the number of symbols while keeping the rule set at a manageable size, we imposed some structural constraints on the grammars. Moreover, to deal with motifs of variable length, we implemented a window-independent scoring scheme for parsing. The PCFGs are generated by an evolutionary process, which was customised to PCFG induction by implementing a diversity measure based on the weighted Hamming distance. Our PCFGs proved their ability to detect binding sites with high accuracy. They achieved very good results for protein sequence annotation and binding site localisation. We also showed that some features of protein patterns could be better represented by a PCFG than by a PRG. This confirms our initial assumption that binding site detection benefits from the expressive power provided by a context-free language. Finally, the results suggest that, unlike current state-of-the-art methods, our system would be particularly suited to dealing with patterns shared by non-homologous proteins.
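
As a rough illustration of two ideas mentioned in the abstract (reducing the amino acid alphabet through a quantitative property, and scoring a sequence with a PCFG), the following Python sketch may help. It is not the thesis implementation. The choice of the Kyte-Doolittle hydropathy scale, the thresholds, the three property classes, the toy grammar, and all function names are assumptions made only for this example. The grammar is written in Chomsky normal form and scored with the standard inside (probabilistic CYK) algorithm.

```python
from collections import defaultdict

# Kyte-Doolittle hydropathy values. Using this scale is an assumption of
# this sketch; the thesis may rely on other amino acid properties.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def reduce_sequence(seq, low=-1.0, high=1.0):
    """Map each residue to 'l', 'm' or 'h' (low, medium, high hydropathy).

    The thresholds are illustrative, not taken from the thesis.
    """
    out = []
    for residue in seq:
        value = HYDROPATHY[residue]
        out.append("l" if value < low else "h" if value > high else "m")
    return "".join(out)


# Hypothetical toy PCFG in Chomsky normal form. It favours nested
# patterns such as h..h l..l, which a regular grammar cannot count.
BINARY_RULES = {            # (parent, left child, right child) -> probability
    ("S", "H", "X"): 0.4,
    ("S", "H", "L"): 0.6,
    ("X", "S", "L"): 1.0,
}
LEXICAL_RULES = {           # (parent, terminal) -> probability
    ("H", "h"): 0.9, ("H", "m"): 0.1,
    ("L", "l"): 0.9, ("L", "m"): 0.1,
}


def inside_probability(symbols, start="S"):
    """Return the total probability that `start` derives `symbols`."""
    n = len(symbols)
    if n == 0:
        return 0.0
    # chart[(i, j, A)] = probability that A derives symbols[i:j]
    chart = defaultdict(float)
    for i, sym in enumerate(symbols):
        for (parent, terminal), p in LEXICAL_RULES.items():
            if terminal == sym:
                chart[(i, i + 1, parent)] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for (parent, left, right), p in BINARY_RULES.items():
                    p_left = chart.get((i, k, left), 0.0)
                    p_right = chart.get((k, j, right), 0.0)
                    if p_left and p_right:
                        chart[(i, j, parent)] += p * p_left * p_right
    return chart.get((0, n, start), 0.0)


if __name__ == "__main__":
    reduced = reduce_sequence("IVKD")
    print(reduced, inside_probability(reduced))
```

With a three-letter reduced alphabet, the number of lexical rules per non-terminal drops from 20 to at most 3, which is the kind of rule-set reduction the abstract refers to. Grammars built on different properties would each see a different reduced view of the same sequence.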
Original language: English
Qualification: Master of Science by Research (MSc(R))
Awarding Institution
  • Kingston University
Supervisors/Advisors
  • Makris, Dimitrios, Supervisor
  • Monekosso, Ndedi, Supervisor, External person
  • Nebel, Jean-Christophe, Supervisor
Publication status: Accepted/In press - Sept 2007
Externally published: Yes

Bibliographical note

Physical Location: This item is held in stock at Kingston University Library.

Keywords

  • Biological sciences
