The development of machine learning models for protein–ligand interactions is fundamentally constrained by the quality and diversity of available structural data. Existing databases of protein–ligand complexes force an unsatisfying trade-off: carefully curated collections such as PDBBind and HiQBind offer high structural reliability but cover only a narrow slice of the Protein Data Bank (PDB), while large-scale resources like PLInder provide broad coverage at the expense of rigorous quality control. Here, we introduce CROWN (Curated Repository Of Well-resolved Non-covalent interactions), a machine learning-ready dataset that reconciles this tension by applying a comprehensive, fully automated preprocessing pipeline to the PLInder database. Starting from 649,915 protein–ligand interaction systems, CROWN applies a series of interleaved quality filters and processing stages that address crystallographic resolution, ligand identity, pocket completeness, structural repair, interaction quality, and protonation at physiological pH. A distinguishing feature of the pipeline is a final constrained energy minimization step using custom flat-bottomed restraints, which balances fidelity to the crystallographic evidence against relaxation of intramolecular strain. This step, absent from existing protein–ligand datasets, produces structurally uniform complexes by reconciling the heterogeneous refinement practices of different crystallographers and structure determination protocols, without distorting the experimentally observed binding geometry. The resulting dataset of 153,005 complexes represents a roughly fourfold increase in protein and species diversity over PDBBind and HiQBind, while maintaining rigorous structural standards.
Importantly, CROWN adopts a geometry-centric design philosophy that treats the 3D arrangement of atoms at the binding interface as a self-consistent source of information, rather than relying on externally measured binding affinities that cover only a fraction of known structures and introduce well-documented biases. We anticipate that CROWN will serve as a broadly useful resource for training generative models of protein–ligand binding poses, developing scoring functions, and benchmarking interaction prediction methods.
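To make the idea of a flat-bottomed restraint concrete, the following is a minimal sketch of one common form: an atom incurs no penalty while it stays within a tolerance radius of its crystallographic reference position, and a harmonic penalty once it moves beyond that radius. The function names, the harmonic form, and the parameters `tolerance` and `k` are illustrative assumptions, not the restraint definition or values used in the CROWN pipeline.

```python
import math


def flat_bottom_energy(pos, ref, tolerance, k):
    """Energy of a flat-bottomed harmonic restraint for a single atom.

    pos, ref  : (x, y, z) current and reference (crystal) coordinates
    tolerance : radius of the penalty-free region (hypothetical parameter)
    k         : force constant outside that region (hypothetical parameter)
    """
    d = math.dist(pos, ref)             # displacement from the crystal position
    excess = max(0.0, d - tolerance)    # zero anywhere inside the flat bottom
    return 0.5 * k * excess ** 2


def flat_bottom_force(pos, ref, tolerance, k):
    """Restoring force (negative energy gradient) for the same restraint."""
    d = math.dist(pos, ref)
    excess = d - tolerance
    if excess <= 0.0:
        # Inside the flat region the atom is free to relax.
        return (0.0, 0.0, 0.0)
    scale = -k * excess / d             # pulls the atom back toward ref
    return tuple(scale * (p - r) for p, r in zip(pos, ref))


if __name__ == "__main__":
    ref = (0.0, 0.0, 0.0)
    for x in (0.3, 0.5, 1.5):
        pos = (x, 0.0, 0.0)
        e = flat_bottom_energy(pos, ref, tolerance=0.5, k=2.0)
        f = flat_bottom_force(pos, ref, tolerance=0.5, k=2.0)
        print(f"displacement={x:.1f}  energy={e:.3f}  force={f}")
```

Under this form, small adjustments that relieve strain cost nothing, while larger departures from the experimentally observed position are increasingly penalized, which is the balance the minimization step is described as striking.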