Abstract
The CATH database is a free, publicly available online resource that annotates the evolutionary and structural relationships of protein domains. Because the influx of protein structures, driven mainly by the recent AlphaFold breakthrough, makes manual curation infeasible, the CATH team recently developed CATHe, an automatic CATH superfamily (SF) classifier that feeds protein Language Model (pLM) embeddings into a feed-forward neural network (FNN). Using the same dataset of remote homologues (with a 20% sequence identity threshold), this paper presents CATHe2, which improves on CATHe in three ways. First, it replaces the older pLM ProtT5 with a more recent one, ProstT5. Second, it incorporates domain 3D information into the classifier through a structural alphabet representation, specifically 3Di sequence embeddings. Third, it implements a new version of the FNN classifier architecture, tuned for the CATH superfamily prediction task. The best CATHe2 model reaches an accuracy of 92.2% ± 0.7% and an F1 score of 82.3% ± 1.3% on CATHe's largest dataset (∼1700 superfamilies), an improvement of 9.9% in F1 score and 6.6% in accuracy over the previous CATHe version (85.6% ± 0.4% accuracy and 72.4% ± 0.7% F1 score). This model takes ProstT5 amino acid (AA) sequence and 3Di sequence embeddings as classifier input, but a simplified version requiring only AA sequences already improves CATHe's F1 score by 6.7% ± 1.3% and accuracy by 6.6% ± 0.7% on the same dataset.
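As a rough illustration of the classifier input described above, the following minimal sketch concatenates an AA-sequence embedding and a 3Di-sequence embedding and passes them through a small feed-forward network. It is not the CATHe2 implementation: the function names, the single hidden layer, the toy dimensions, and the random (untrained) weights are all assumptions made for illustration, and the random vectors merely stand in for ProstT5 embeddings.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (untrained, illustration only)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Affine transform: one dot product plus bias per output unit."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    """Numerically stable softmax over a list of scores."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def predict_superfamily(aa_embedding, di_embedding, hidden_layer, output_layer):
    """Concatenate the two embeddings, run the FNN, return (class index, probabilities)."""
    features = list(aa_embedding) + list(di_embedding)
    hidden = relu(dense(features, hidden_layer))
    probs = softmax(dense(hidden, output_layer))
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy sizes (assumptions); the real dataset has roughly 1700 superfamilies.
EMB_DIM = 8
HIDDEN_DIM = 16
NUM_SUPERFAMILIES = 5

rng = random.Random(0)
hidden_layer = make_layer(2 * EMB_DIM, HIDDEN_DIM, rng)
output_layer = make_layer(HIDDEN_DIM, NUM_SUPERFAMILIES, rng)

# Placeholder vectors standing in for per-domain AA and 3Di embeddings.
aa_embedding = [rng.gauss(0, 1) for _ in range(EMB_DIM)]
di_embedding = [rng.gauss(0, 1) for _ in range(EMB_DIM)]

label, probs = predict_superfamily(aa_embedding, di_embedding, hidden_layer, output_layer)
print("predicted class index:", label)
```

In this sketch, the simplified AA-only variant mentioned in the abstract would correspond to dropping `di_embedding` and sizing the first layer for `EMB_DIM` inputs instead of `2 * EMB_DIM`.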
| Original language | English |
|---|---|
| Article number | bpaf080 |
| Journal | Biology Methods and Protocols |
| Volume | 10 |
| Issue number | 1 |
| Early online date | 4 Nov 2025 |
| DOIs | |
| Publication status | Published - 2025 |