Data for: An Energy Landscape Approach to Miniaturizing Enzymes using Protein Language Model Embeddings

2026 · 0 citations · Dataset · Green Open Access

Authors

Stefano Angioletti-Uberti · Thomas Young Centre

Abstract

Data for: An Energy Landscape Approach to Miniaturizing Enzymes using Protein Language Model Embeddings

Data for four enzyme systems: petase (PETase, PDB 5XJH), protease (subtilisin Carlsberg, PDB 1SBC), taq (Taq polymerase, PDB 1TAQ), and vioA (VioA, PDB 6FW9).

Chemical Potential Energy Weight Sweep

Effect of chemical potential energy weights on BAGEL protein sequence design.

Figures: Figure 1 (pipeline overview, energy evolution), Figure 2 (weight effect on pLDDT/RMSD), SI Figure S1 (structural representations across weights).

    weight_sweep/<enzyme>/
    ├── index.json                    # Aggregated results
    ├── esmfold_structures/           # ESMFold structures (2,500 per enzyme)
    │   └── <enzyme>_weight_<W>_run_<R>_seq_<S>.cif
    └── weight_<W>/run_<R>/           # 5 weights × 5 runs = 25 per enzyme
        ├── config.csv                # Energy function weights
        ├── optimization.log          # MCMC trajectory (step, temperature, accept)
        ├── best/                     # Best sequences so far in run
        │   ├── state_A.fasta, state_A.mask.fasta, energies.csv
        └── current/                  # Current sequences in a run
            ├── state_A.fasta, state_A.mask.fasta, energies.csv

Weights tested: 0.0001, 0.0005, 0.001, 0.005, 0.01.
Total: 4 enzymes × 5 weights × 5 runs = 100 trajectories, each with 100,000 Monte Carlo steps.

index.json

Aggregated results per enzyme. 500 sequences per weight (top 100 from each of 5 runs), 2,500 total per enzyme.

    enzyme: Enzyme name
    description: "Chemical Potential Energy Weight Sweep"
    weights: List of tested weight values
    statistics: {total_variants, variants_per_weight}
    variants[]: Each entry contains:
        id, sequence, sequence_length, immutable_indices
        weight: Chemical potential energy weight used
        run_id: Which of the 5 replicate runs
        step_id: MCMC step at which this sequence was sampled
        system_energy: Total system energy at sampling
        esmfold: structure_path, immutable_rmsd_heavy, immutable_rmsd_ca, immutable_plddt, immutable_plddt_array

Manual Trimming

Structural trimming to create minimal functional enzyme variants.
Two approaches: naive stepwise removal of terminal then internal residues, and BAGEL-guided trimming using PLM embeddings.

Figures: Figure 3 (naive trimming vs BAGEL comparison, embedding angle/RMSD/pLDDT trajectories).

    manual_trimming/
    ├── manual/                            # Stepwise trimming trajectories
    │   ├── <enzyme>.json                  # Trimming trajectory metadata
    │   └── <enzyme>/step_<N>.cif          # ESMFold structure at step N
    └── bagel/                             # BAGEL-guided trimming
        ├── <enzyme>.json                  # Sequence metadata with metrics
        └── <enzyme>/length_<L>_<C>.cif    # ESMFold structure (length L, counter C)

manual/<enzyme>.json

Stepwise trimming trajectory. Stage 1 removes terminal residues; stage 2 removes internal segments.

    enzyme: Enzyme name
    wild_type_length: Length of the full wild-type sequence
    immutable_length: Number of active-site residues held fixed
    trajectory[]: Each step contains:
        step, sequence, sequence_length, stage
        n_term_chops, c_term_chops: Cumulative residues removed from each terminus
        angles: Per-residue PLM embedding angles (length = immutable_length)
        mean_angle, median_angle, std_angle: Summary statistics of embedding angles
        immutable_indices: Current positions of active-site residues
        esmfold: structure_path, immutable_rmsd_heavy, immutable_rmsd_ca, immutable_plddt, immutable_plddt_array

bagel/<enzyme>.json

BAGEL-guided trimmed sequences at various lengths.

    enzyme: Enzyme name
    source: Design method ("plm")
    sequences[]: Each entry contains:
        sequence, sequence_length, immutable_indices
        esmfold: structure_path, immutable_rmsd_heavy, immutable_rmsd_ca, immutable_plddt, immutable_plddt_array

Production Dataset

Final PLM-designed mini-enzyme variants with structure predictions and MD analysis.

Figures: Figure 4 (multi-model structural validation, pLDDT/RMSD distributions), Figure 5 (per-residue RMSF of best variants), Figure 6 (active-site RMSD trajectories), SI Figures S2–S5 (RMSD-pLDDT densities, SolubleMPNN/hydrophobicity, full RMSF profiles, pLDDT-RMSF correlation).
    production/<enzyme>/
    ├── index.json        # Full variant index (all designed variants)
    ├── ranking.csv       # Aggregated metrics for top-16 MD-simulated variants
    ├── rmsd.json         # RMSD time series from 500 ns MD simulations
    ├── rmsf.json         # Per-residue RMSF from 500 ns MD simulations
    ├── bagel.tar.gz      # Compressed BAGEL optimization runs
    ├── mini-variants/    # Predicted structures per variant
    │   └── mini-<enzyme>-<N>/{esmfold,boltz2,chai1}/mini-<enzyme>-<N>.cif
    └── wild-type/        # Wild-type structures
        └── {crystal,esmfold,boltz2,chai1}/<enzyme>_WT.cif

    Enzyme      Total Variants    MD-Simulated
    petase      405               16
    protease    338               16
    taq         2,847             16
    vioA        1,648             16

index.json

Full index of all designed variants per enzyme.

    enzyme: Enzyme name
    statistics: {total_variants, variants_with_md}
    wild-type: Per-method (crystal, esmfold, boltz2, chai1) structure paths and scores
    variants[]: Each entry contains:
        id, sequence, sequence_length, sequence_identity, immutable_indices
        system_energy, source_run, chem_pot_weight, temperature
        Per-method (esmfold, boltz2, chai1): structure_path, immutable_rmsd_heavy, immutable_rmsd_ca, solmpnn scores, largest_hydrophobic_patch_area
        md: Non-empty for the 16 variants selected for MD simulation (empty {} for others)

rmsd.json

RMSD trajectories (Ångström) from 500 ns MD simulations, measuring structural deviation from the initial equilibrated frame.

    {
      "enzyme": "protease",
      "variants": {
        "WT": {
          "results": {
            "ca": { "<method>": { "rmsd_values": [...], "num_frames": 5000 } },
            "heavy": { ... }
          }
        },
        "mini-protease-12": {
          "results": { "ca": { ... }, "heavy": { ... } }
        }
      }
    }

    ca: C-alpha RMSD per frame (~5,000 frames), with num_frames count
    heavy: Heavy-atom RMSD per frame, with num_frames count
    WT has 4 methods (crystal, esmfold, boltz2, chai1); variants have 3 (no crystal)

rmsf.json

Per-residue C-alpha RMSF (Ångström) from 500 ns MD simulations, computed using MDAnalysis (AlignTraj alignment on all C-alpha atoms, then RMSF).
    {
      "protease": {
        "WT": {
          "rmsf_data": {
            "<method>": {
              "residue_ids": [1, 2, ...],
              "residue_ids_seq": ["M1", "K2", ...],
              "rmsf_values": [0.5, 0.6, ...],
              "sequence": "MKL...",
              "immutable_indices": [4, 5, ...]
            }
          }
        },
        "variants": {
          "mini-protease-12": {
            "rmsf_data": { "esmfold": {...}, "boltz2": {...}, "chai1": {...} },
            "wt_difference_esmfold": 0.15,
            "wt_difference_chai1": 0.12,
            "wt_difference_boltz2": 0.18,
            "crystal_difference": 0.14
          }
        },
        "best_variant_id": "mini-protease-35"
      }
    }

    immutable_indices: 1-indexed positions of active-site residues that are held fixed during design
    wt_difference_<method>: Mean absolute RMSF difference at immutable residues between variant and WT
    crystal_difference: Mean absolute RMSF difference at immutable residues between variant (esmfold) and WT (crystal)
    best_variant_id: Variant with lowest mean wt_difference averaged across esmfold, chai1, and boltz2

ranking.csv

Aggregated metrics for the top-16 variants per enzyme (those selected for MD simulation). Consolidates structural quality scores from index.json and MD-derived metrics from rmsf.json and rmsd.json.

    Column                           Description
    id                               Variant identifier (e.g., mini-protease-12)
    {method}.immutable_rmsd_heavy    Heavy-atom RMSD (Å) of immutable residues vs wild-type
    {method}.solmpnn.single_aa       SolubleMPNN single-AA log-likelihood score
    avg.immutable_rmsd_heavy         Mean across esmfold, chai1, boltz2
    avg.solmpnn.single_aa            Mean across esmfold, chai1, boltz2
    {method}.md_rmsf_imm             Mean RMSF (Å) of immutable (active-site) residues
    {method}.md_rmsf_all             Mean RMSF (Å) of all residues
    avg.md_rmsf_imm                  Mean across esmfold, chai1, boltz2
    avg.md_rmsf_all                  Mean across esmfold, chai1, boltz2
    {method}.md_motif_rmsd           Mean heavy-atom RMSD (Å) of active-site motif during MD
    avg.md_motif_rmsd                Mean across esmfold, chai1, boltz2

Where {method} is one of: esmfold, chai1, boltz2.

bagel.tar.gz

Compressed BAGEL optimization runs (text files: CSV, FASTA, logs).
Extract with:

    cd production/<enzyme> && tar -xzf bagel.tar.gz

Not required for working with index.json, structures, or MD data.

Note: Raw MD trajectories have been excluded due to size. Aggregated RMSD and RMSF analysis is provided instead.
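
As a rough illustration of how the weight-sweep index.json files might be consumed, the sketch below groups variants by chemical potential energy weight and averages their ESMFold immutable_plddt. It is not part of the dataset's own tooling. It assumes that each variants[] entry stores esmfold as a nested object with a scalar immutable_plddt, which the field listing suggests but does not state outright. The helper names load_index and mean_plddt_by_weight and the toy record are hypothetical and are used only so the example runs without the downloaded files.

```python
import json
from collections import defaultdict
from statistics import mean


def load_index(path):
    """Read one index.json file into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def mean_plddt_by_weight(index):
    """Average the ESMFold immutable_plddt of the variants at each weight."""
    grouped = defaultdict(list)
    for variant in index["variants"]:
        # Assumption: "esmfold" is a nested object with a scalar immutable_plddt.
        grouped[variant["weight"]].append(variant["esmfold"]["immutable_plddt"])
    return {weight: mean(values) for weight, values in sorted(grouped.items())}


# Hypothetical toy record that mimics the documented fields; not real data.
toy_index = {
    "enzyme": "petase",
    "weights": [0.0001, 0.01],
    "variants": [
        {"id": "toy-1", "weight": 0.0001, "run_id": 0,
         "esmfold": {"immutable_plddt": 75.0}},
        {"id": "toy-2", "weight": 0.0001, "run_id": 1,
         "esmfold": {"immutable_plddt": 85.0}},
        {"id": "toy-3", "weight": 0.01, "run_id": 0,
         "esmfold": {"immutable_plddt": 60.0}},
    ],
}

summary = mean_plddt_by_weight(toy_index)
for weight, value in summary.items():
    print(f"weight={weight}: mean immutable pLDDT = {value:.1f}")
```

With the archive unpacked, the same function could be applied to real data via mean_plddt_by_weight(load_index("weight_sweep/petase/index.json")), assuming the directory layout described above and a working directory at the dataset root.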

Topics & Keywords

Publication Details

Published in: Zenodo (CERN European Organization for Nuclear Research)

DOI: 10.5281/zenodo.18854113
