Structural information on protein-ligand complexes is crucial for small-molecule design and drug discovery. Yet primary resources often have heterogeneous annotations, lack machine-ready ligand categorization, and require substantial postprocessing before large-scale modeling. Here, we present <i>LigandExplorer</i>, an open-source, automated postprocessing pipeline that identifies and extracts covalent and noncovalent ligands from biomolecular complex structures and standardizes outputs for downstream use. Using residue-level graphs built solely from atomic coordinates, <i>LigandExplorer</i> is robust to missing or inconsistent metadata and integrates LightGBM models to classify ligands (peptides, nucleic acids, phospholipids, carbohydrates, organics, and ions) and assess interaction relevance. Because the pipeline is rerunnable, it can be applied to each new database release to keep derived, categorized data sets current without altering source records. On the PDBbind v2020 refined set, <i>LigandExplorer</i> achieved 98.38% raw structural agreement under harmonized comparison criteria prior to any manual reconciliation; the remaining discrepancies were analyzed separately and were dominated by divergences between raw RCSB entries and curated PDBbind records. On PepBDB, <i>LigandExplorer</i> successfully processed 4881 of 5005 complexes, a 97.52% success rate. Most failures reflected upstream record errors, whereas complex cyclic peptides constituted the primary algorithmic limitation. <i>LigandExplorer</i> thus mitigates data-cleaning burdens and enables rapidly refreshed, standardized data sets for computational modeling and molecular design.
Published in: Journal of Chemical Information and Modeling
Volume 66, Issue 6, pp. 3026-3035
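The abstract states that residue-level graphs are built solely from atomic coordinates but does not give the construction. As a minimal illustrative sketch of that general idea, and not the <i>LigandExplorer</i> implementation, the Python below links two residues whenever any pair of their atoms lies within a distance cutoff, then splits the graph into connected components so that isolated groups can be flagged as ligand candidates. The function names, the input layout of (residue id, coordinates) pairs, and the 1.9 Å cutoff are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import dist

# Hypothetical cutoff in angstroms for treating two atoms as bonded.
# The value used by the actual pipeline is not given in the abstract.
BOND_CUTOFF = 1.9


def build_residue_graph(atoms, cutoff=BOND_CUTOFF):
    """Build a residue adjacency map from (residue_id, (x, y, z)) pairs."""
    coords = defaultdict(list)
    for res_id, xyz in atoms:
        coords[res_id].append(xyz)

    graph = {res_id: set() for res_id in coords}
    # Brute-force comparison of all residue pairs; fine for a toy example.
    for a, b in combinations(coords, 2):
        if any(dist(p, q) <= cutoff for p in coords[a] for q in coords[b]):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return the connected components of the graph as a list of sets."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Toy input: two residues joined by a short bond, plus one distant group.
example_atoms = [
    ("A:1", (0.0, 0.0, 0.0)),
    ("A:1", (1.5, 0.0, 0.0)),
    ("A:2", (2.8, 0.0, 0.0)),
    ("A:2", (4.3, 0.0, 0.0)),
    ("L:1", (10.0, 10.0, 10.0)),
]
example_graph = build_residue_graph(example_atoms)
example_components = connected_components(example_graph)
```

Because the graph depends only on coordinates, a sketch like this does not consult chain or HETATM annotations, which is the property the abstract credits for robustness to missing or inconsistent metadata. A real pipeline would need a spatial index instead of the all-pairs loop and element-aware bond criteria.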