Research Overview
We develop ACP-ETBLCA, a benchmark-validated predictor for anticancer peptides (ACPs) that unifies modern protein language modeling with interpretable biochemical priors. Residue-resolved ESM-2 embeddings are fused with classical descriptors (amino acid composition, AAC; dipeptide composition, DPC; and nine physicochemical indices) through a lightweight Transformer and a cross-attention module that aligns property cues to salient sequence positions. The result is a fast, interpretable screening tool that outputs calibrated probabilities and generalizes across independent test sets.
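To make the classical descriptors concrete, the following is a minimal sketch of AAC and DPC as they are conventionally defined: 20 residue frequencies and 400 adjacent-pair frequencies. The function names and the handling of non-standard residues are illustrative assumptions, not the project's actual code, and the nine physicochemical indices are left out because their exact definitions are not given here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All 400 ordered residue pairs, in a fixed order so vectors are comparable.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def _clean(sequence):
    # Assumption: non-standard symbols (X, B, Z, gaps) are simply dropped,
    # which can make two residues adjacent that were not adjacent originally.
    return "".join(r for r in sequence.upper() if r in AMINO_ACIDS)


def aac(sequence):
    """Amino acid composition: fraction of each standard residue (20 values)."""
    seq = _clean(sequence)
    if not seq:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dpc(sequence):
    """Dipeptide composition: fraction of each adjacent pair (400 values)."""
    seq = _clean(sequence)
    n_pairs = len(seq) - 1
    if n_pairs < 1:
        return [0.0] * len(DIPEPTIDES)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n_pairs):
        counts[seq[i:i + 2]] += 1
    return [counts[d] / n_pairs for d in DIPEPTIDES]
```

Concatenating `aac(seq) + dpc(seq)` gives a fixed-length 420-dimensional vector regardless of peptide length, which is what lets these descriptors sit alongside variable-length residue embeddings.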
Methodologies
Our pipeline extracts per-residue representations from ESM-2 (650M) and concatenates interpretable sequence descriptors capturing charge, pI, hydrophobicity, flexibility, and secondary-structure propensities. A Transformer encoder refines contextual signals; cross-attention (queries = properties, keys/values = ESM-2) learns residue–property interactions consistent with membrane-active ACP mechanisms. A compact MLP produces final probabilities. Training uses focal loss to mitigate class imbalance; property features are standardized for stability.
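The attention direction and the loss described above can be illustrated with a small dependency-free sketch. It shows single-head scaled dot-product cross-attention in which property vectors act as queries and per-residue vectors act as keys and values, followed by the standard binary focal loss. The learned projection matrices, multi-head structure, and the gamma and alpha values are omitted or assumed for illustration; none of these settings are taken from the actual model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections.

    queries: one vector per property token (assumed already mapped to dim d)
    keys, values: one vector per residue (stand-ins for ESM-2 embeddings)
    Returns (outputs, weights); weights[i][j] is how strongly property i
    attends to residue j.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of residue value vectors, one output per property token.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss for one example: -alpha_t * (1 - p_t)^gamma * log(p_t).

    p is the predicted probability of the positive class, y is 0 or 1.
    gamma and alpha are placeholder values, not the authors' settings.
    """
    p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

Because each row of `weights` sums to one over residues, it can be read as a per-property saliency map over sequence positions. With `gamma=0` the focal term disappears and the loss reduces to alpha-weighted cross-entropy; larger `gamma` shrinks the contribution of examples the model already classifies confidently.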
Benchmarking & Validation
We conduct stratified 10-fold cross-validation and independent evaluations on common ACP benchmarks (e.g., ACP135, ACP99). Performance is reported as accuracy (ACC), sensitivity (Sn), specificity (Sp), Matthews correlation coefficient (MCC), and area under the ROC curve (AUC), alongside probability calibration, ablations (feature subsets and attention components), and runtime profiling. Results show consistent gains over strong baselines and stable behavior across folds and test sets.
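For reference, the reported metrics follow their standard definitions, sketched below from confusion-matrix counts plus a pairwise (Mann-Whitney) form of ROC AUC. The function names are illustrative, and the counts in the usage note are made-up toy numbers, not results from this work.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """ACC, Sn, Sp, and MCC from confusion-matrix counts (0.0 if undefined)."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    sn = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity (recall on positives)
    sp = tn / (tn + fp) if (tn + fp) else 0.0  # specificity (recall on negatives)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


def roc_auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This O(n_pos * n_neg) form is fine for small benchmarks.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `binary_metrics(tp=8, tn=9, fp=1, fn=2)` returns the four threshold-dependent metrics for a toy 20-peptide test set, while `roc_auc` takes the raw predicted probabilities and is independent of the decision threshold. MCC is the most informative single number when positives and negatives are imbalanced, since it uses all four cells of the confusion matrix.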
Future Directions
We will incorporate structural cues (e.g., ESMFold/AlphaFold-derived graphs with GNNs), explore RFECV/SHAP-guided feature pruning, add uncertainty estimation for decision support, and extend training with harder negatives and larger curated corpora to strengthen out-of-distribution robustness.