Input
Select Properties
MW, net charge, pI, hydrophobicity (Sequence only)
*Requires protein sequence input above
Current Best Models Configuration
This table shows the models and thresholds currently being used for predictions:
| Permeability (Penetrance) | xgb_wt_log | 0.2801 | Transformer | 0.4343 | Classifier |
Note: Models marked as SVM, SVR, or ENET are automatically replaced with XGB as these models are not currently supported in the deployment environment.
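The substitution rule in the note can be pictured as a small lookup. This is a minimal sketch, not the app's code: the function name `resolve_model` and the exact label strings are assumptions made for illustration.

```python
# Hypothetical sketch of the fallback rule described in the note above.
# The label strings and the function name are illustrative assumptions.
UNSUPPORTED_MODELS = {"SVM", "SVR", "ENET"}
FALLBACK_MODEL = "XGB"


def resolve_model(label: str) -> str:
    """Return the model label to deploy, swapping unsupported ones for XGB."""
    return FALLBACK_MODEL if label.upper() in UNSUPPORTED_MODELS else label


print(resolve_model("SVR"))          # XGB
print(resolve_model("Transformer"))  # Transformer
```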
Input Requirements and Constraints
Supported Inputs
- Amino acid sequences: Linear peptides composed of the 20 standard amino acids
- SMILES: Chemically modified peptides, including cyclization, D-amino acids, and noncanonical residues
Validation
- Invalid sequences or SMILES will be rejected
- Properties not supported are labeled as (Not Supported)
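For the amino acid input path, the rejection rule can be sketched as a membership check against the 20 standard one-letter codes. The helper names below are hypothetical, and treating only uppercase letters as valid is an assumption. SMILES validation usually relies on a cheminformatics toolkit and is not shown.

```python
# Illustrative check for amino acid sequences only (assumed behavior, not
# the app's actual validator). Lowercase letters are treated as invalid here.
STANDARD_AA = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_valid_sequence(seq: str) -> bool:
    """True if seq is non-empty and uses only the 20 standard one-letter codes."""
    seq = seq.strip()
    return bool(seq) and all(residue in STANDARD_AA for residue in seq)


def invalid_residues(seq: str) -> list[str]:
    """List the distinct characters that would cause a rejection."""
    return sorted(set(seq.strip()) - STANDARD_AA)
```

A caller could use `invalid_residues` to report which characters triggered the rejection instead of failing silently.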
Training Data Collection
Data distribution. Classification tasks report counts for class 0/1; regression tasks report total sample size (N). AA stands for amino acid-based sequences.
Classification (counts for class 0 / 1)
| Property | AA (0) | AA (1) | SMILES (0) | SMILES (1) |
|---|---|---|---|---|
| Hemolysis | 4765 | 1311 | 4765 | 1311 |
| Non-Fouling | 13580 | 3600 | 13580 | 3600 |
| Solubility | 9668 | 8785 | - | - |
| Permeability (Penetrance) | 1162 | 1162 | - | - |
| Toxicity | - | - | 5518 | 5518 |
Regression (total N)
| Property | AA (N) | SMILES (N) |
|---|---|---|
| Permeability (PAMPA) | - | 6869 |
| Permeability (CACO2) | - | 606 |
| Half-Life | 130 | 245 |
| Binding Affinity | 1436 | 1597 |
Our models are trained on curated datasets from multiple sources. For the detailed data-cleaning procedures, please refer to our paper.
🩸 Hemolysis Dataset
- Primary Source: the Database of Antimicrobial Activity and Structure of Peptides (DBAASPv3)
- Secondary Source: peptide-dashboard
- Description: Probability of peptide disrupting red blood cell membranes.
- Interpretation: HC50 is the concentration (x μg/mL) at which 50% of red blood cells are lysed. Peptides with HC50 < 100 μM are labeled hemolytic (1) and all others non-hemolytic (0), yielding a binary dataset. Scores close to 1 indicate a high probability of red blood cell membrane disruption, while scores close to 0 indicate low hemolytic risk. The predicted probability should therefore be interpreted as a risk indicator, not an exact concentration estimate.
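The binarization rule above amounts to a single threshold comparison. A minimal sketch follows, assuming the HC50 value and the cutoff are expressed in the same unit; the function and parameter names are hypothetical.

```python
def hemolysis_label(hc50: float, cutoff: float = 100.0) -> int:
    """Binary label from an HC50 value: 1 = hemolytic, 0 = non-hemolytic.

    Assumes hc50 and cutoff share the same concentration unit.
    """
    return 1 if hc50 < cutoff else 0


labels = [hemolysis_label(v) for v in (12.5, 100.0, 350.0)]
print(labels)  # [1, 0, 0]
```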
💧 Solubility Dataset
- Primary Source: PROSO-II
- Secondary Source: peptideBERT
- Description: Probability of peptide remaining dissolved in aqueous conditions.
- Interpretation: Outputs a probability (0–1) that a peptide remains soluble in aqueous conditions. Higher scores indicate lower aggregation risk and better formulation stability.
Non-Fouling Dataset
- Primary Source: Classifying antimicrobial and multifunctional peptides with Bayesian network models
- Secondary Source: peptideBERT
- Description: A nonfouling peptide resists nonspecific interactions and protein adsorption.
- Interpretation: Outputs the probability (0–1) that a peptide resists nonspecific protein adsorption. Higher scores indicate stronger non-fouling behavior, desirable for circulation and surface-exposed applications.
🪣 Permeability Dataset
- Primary Source: CycPeptMPDB, PAMPA
- Secondary Source: PepLand
- Description: Probability of peptide penetrating the cell membrane.
- Interpretation: For PAMPA and Caco-2 regression, outputs are log-scaled permeability values. Following CycPeptMPDB conventions, log Pexp ≥ −6.0 indicates favorable permeability, while values below −6.0 indicate weak permeability. For penetrance prediction, a probability closer to 1 indicates a higher likelihood of cell penetrance, and a probability closer to 0 indicates a lower one.
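The regression cutoff quoted above can be applied as a simple comparison on the predicted log-scaled value. This is a sketch only; the function name and the returned strings are hypothetical.

```python
PERMEABILITY_CUTOFF = -6.0  # log Pexp cutoff quoted in the text above


def interpret_log_pexp(log_pexp: float) -> str:
    """Map a predicted log-scaled permeability onto the two bands described above."""
    return "favorable" if log_pexp >= PERMEABILITY_CUTOFF else "weak"


print(interpret_log_pexp(-5.2))  # favorable
print(interpret_log_pexp(-7.1))  # weak
```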
⏱️ Half-Life Dataset
- Primary Source: Thpdb2, PepTherDia, peplife
- Interpretation: Predicted values reflect relative peptide stability, expressed in hours. Higher scores indicate longer persistence in serum, while lower scores suggest faster degradation.
Toxicity Dataset
- Primary Source: ToxinPred3.0
- Interpretation: Outputs a probability (0–1) that a peptide exhibits toxic effects. Higher scores indicate increased toxicity risk.
Binding Affinity Dataset
- Primary Source: PepLand
- Description: The model predicts a continuous binding affinity score, where higher values indicate stronger binding. Training labels are already normalized in PepLand and combine Kd, Ki, and IC50 measurements. Scores are comparable across peptides binding to the same protein target.
- Interpretation:
  - Scores ≥ 9 correspond to tight binders (K ≤ 10⁻⁹ M, nanomolar to picomolar range)
  - Scores between 7 and 9 correspond to medium binders (10⁻⁷–10⁻⁹ M, nanomolar to micromolar range)
  - Scores < 7 correspond to weak binders (K ≥ 10⁻⁶ M, micromolar and weaker)
  - A difference of 1 unit in score corresponds to an approximately tenfold change in binding affinity.
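The score bands above are consistent with a negative log10 scale, where a score of 9 maps to K = 10⁻⁹ M. The sketch below assumes exactly that mapping (score = −log10 of K in mol/L), which the text implies but does not state outright. All function names are hypothetical.

```python
def score_to_k(score: float) -> float:
    """Assumed mapping: score = -log10(K in mol/L), so K = 10 ** (-score)."""
    return 10.0 ** (-score)


def binder_class(score: float) -> str:
    """Bucket a predicted score using the cutoffs listed above."""
    if score >= 9:
        return "tight"
    if score >= 7:
        return "medium"
    return "weak"


def fold_change(score_a: float, score_b: float) -> float:
    """Approximate affinity ratio between two scores (one unit is about 10x)."""
    return 10.0 ** (score_a - score_b)


for s in (9.5, 8.0, 6.0):
    print(s, binder_class(s), f"K ~ {score_to_k(s):.1e} M")
```

Because scores are described as comparable only across peptides binding the same target, `fold_change` is meaningful for ranking candidates against one protein, not across targets.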
Model Architecture
- Sequence Embeddings: ESM-2 650M model / PeptideCLM model. Foundational embeddings are frozen.
- XGBoost Model: Gradient boosting on pooled embedding features for efficient, high-performance prediction.
- CNN/Transformer Model: One-dimensional convolutional/self-attention transformer networks operating on unpooled embeddings to capture local sequence patterns.
- Binding Model: Transformer-based architecture with cross-attention between protein and peptide representations.
- SVR Model: Support Vector Regression applied to pooled embeddings, providing a kernel-based, nonparametric regression baseline that is robust on smaller or noisy datasets.
- Others: SVM and Elastic Net models were trained with RAPIDS cuML, which requires a CUDA environment and is therefore not supported in the web app. Model checkpoints remain available in the Hugging Face repository.
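The list above distinguishes pooled embeddings (used by XGBoost and SVR) from unpooled ones (used by the CNN/Transformer models). The toy sketch below illustrates the difference with mean pooling. Mean pooling is an assumption, since the source only says "pooled", and the numbers are made up rather than real encoder outputs.

```python
# Toy per-residue embeddings: L = 3 residues, D = 2 features each.
# Real embeddings would come from a frozen encoder; these values are invented.
def mean_pool(token_embeddings: list[list[float]]) -> list[float]:
    """Collapse an L x D matrix into a single D-dimensional vector."""
    length = len(token_embeddings)
    return [sum(column) / length for column in zip(*token_embeddings)]


unpooled = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]  # shape L x D, keeps residue order
pooled = mean_pool(unpooled)                      # shape D, order discarded
print(pooled)  # [2.0, 2.0]
```

A fixed-length pooled vector suits tabular learners, while the unpooled matrix preserves the per-residue positions that convolution and attention layers operate on.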
Model Training and Weight Hosting
- More instructions can be found at PeptiVerse.
🧪 Physicochemical Properties
Net Charge Calculation
- Uses Henderson-Hasselbalch equation
- pH-dependent calculation
- Considers all ionizable groups (K, R, H, D, E, C, Y, termini)
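A minimal sketch of a Henderson-Hasselbalch net charge calculation over the groups listed above. The pKa values are approximate textbook-style numbers chosen for illustration. The app's actual pKa table is not given here, so treat both the numbers and the function name as assumptions.

```python
# Assumed, approximate pKa values; the tool's own table may differ.
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}            # basic side chains
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}   # acidic side chains
PKA_N_TERM, PKA_C_TERM = 8.6, 3.6


def net_charge(seq: str, ph: float = 7.0) -> float:
    """Sum the fractional charges of all ionizable groups at the given pH."""
    positive = [PKA_N_TERM] + [PKA_POSITIVE[r] for r in seq if r in PKA_POSITIVE]
    negative = [PKA_C_TERM] + [PKA_NEGATIVE[r] for r in seq if r in PKA_NEGATIVE]
    # Basic groups carry +1 when protonated: fraction = 1 / (1 + 10^(pH - pKa))
    charge = sum(1.0 / (1.0 + 10.0 ** (ph - pka)) for pka in positive)
    # Acidic groups carry -1 when deprotonated: fraction = 1 / (1 + 10^(pKa - pH))
    charge -= sum(1.0 / (1.0 + 10.0 ** (pka - ph)) for pka in negative)
    return charge


print(round(net_charge("KKKK", 7.0), 2))
print(round(net_charge("DDDD", 7.0), 2))
```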
Isoelectric Point (pI)
- Bisection method to find pH where net charge = 0
- Precision: ±0.01 pH units
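The bisection search can be sketched generically: net charge decreases monotonically with pH, so the interval is halved until it is narrower than the target precision. The function below takes any charge-versus-pH callable. The toy molecule used to exercise it (one basic group with pKa 9.0 and one acidic group with pKa 3.0) is an assumption chosen because its zero-charge point is 6.0 by symmetry.

```python
from typing import Callable


def isoelectric_point(charge_at: Callable[[float], float],
                      lo: float = 0.0, hi: float = 14.0,
                      tol: float = 0.01) -> float:
    """Bisection on pH: narrow [lo, hi] until the bracket is smaller than tol."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if charge_at(mid) > 0:
            lo = mid   # still positively charged, pI lies at higher pH
        else:
            hi = mid   # neutral or negative, pI lies at lower pH
    return (lo + hi) / 2.0


def toy_charge(ph: float) -> float:
    """Toy molecule: one basic group (pKa 9.0) and one acidic group (pKa 3.0)."""
    return 1.0 / (1.0 + 10.0 ** (ph - 9.0)) - 1.0 / (1.0 + 10.0 ** (3.0 - ph))


pi_estimate = isoelectric_point(toy_charge)
print(round(pi_estimate, 2))
```

In practice the callable would be a sequence-specific net charge function with the pH as its only free argument.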
Hydrophobicity (GRAVY)
- Grand Average of Hydropathy
- Uses Kyte-Doolittle scale
- Range: -4.5 (hydrophilic) to +4.5 (hydrophobic)
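GRAVY is the mean Kyte-Doolittle hydropathy value over all residues, which is why it is bounded by the scale's extremes (arginine at -4.5, isoleucine at +4.5). A minimal sketch, with the function name chosen for illustration:

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq: str) -> float:
    """Grand Average of Hydropathy: mean hydropathy per residue."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[residue] for residue in seq) / len(seq)


print(round(gravy("IIII"), 2))  # 4.5
print(round(gravy("RRRR"), 2))  # -4.5
```

Characters outside the 20 standard codes raise a `KeyError` in this sketch, which matches the "Sequence only" restriction on this property.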
Citation
If you use this tool, please cite:
place holder
Contact
For questions or collaborations: yzhang@u.duke.nus.edu or pranam@seas.upenn.edu
Results
PeptiVerse - A Unified Platform for peptide therapeutic property prediction.
Please cite our work if you use this tool in your research.