Sixth Decade of Protein Data Bank Operations: Transition to Extended PDB IDs and PDBx/mmCIF Format

2025 · 0 citations · Journal Article · Gold Open Access

Authors

Sutapa Ghosh · Rutgers, The State University of New Jersey

Zukang Feng · Rutgers, The State University of New Jersey

Yu‐He Liang · Rutgers, The State University of New Jersey

Ezra Peisach · Rutgers, The State University of New Jersey

Irina Persikova · Rutgers, The State University of New Jersey

Jasmine Young · Rutgers, The State University of New Jersey

wwPDB Team

Abstract

The Protein Data Bank (PDB) was established in 1971 as the first open-access digital data resource in biology, with just seven X-ray crystallographic structures of proteins. Today, the single global PDB archive houses more than 215,000 experimentally determined, atomic-level three-dimensional (3D) structures of biological macromolecules that are made freely available to many millions of users worldwide with no limitations on data usage. 3D biostructure information facilitates basic and applied research and education across the sciences, impacting fundamental biology, biomedicine, biotechnology, and energy sciences. The Worldwide Protein Data Bank partnership (wwPDB, wwpdb.org) currently includes five Full Members (RCSB PDB, PDBe, PDBj, BMRB, and EMDB) and one Associate Member (PDBc), which together manage the PDB, EMDB, and BMRB Core Archives. wwPDB Members are committed to ensuring that structural biology data are Findable, Accessible, Interoperable, and Reusable (FAIR). With the accelerating growth of 3D structure depositions to the PDB, the wwPDB data centers anticipate that all possible four-character PDB accession codes (PDB IDs) will be consumed by 2029. In preparation for this milestone, the wwPDB has revised the PDB accession code format by extending its length to 12 characters. An extended PDB ID consists of the prefix “pdb_” followed by eight alphanumeric characters, e.g., pdb_10021abc. This change will facilitate more robust text-mining detection of PDB IDs across the published literature and allow for more informative and transparent delivery of revised data files. Once all available four-character PDB IDs have been assigned, newly deposited PDB entries will be issued with extended PDB ID codes. Such entries will not be compatible with the legacy PDB format and will be distributed only in Protein Data Bank exchange/macromolecular CIF (PDBx/mmCIF) format.
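To illustrate why a prefixed, fixed-length identifier is easier to detect in text than a bare four-character code, the following minimal Python sketch searches free text for strings of the form described above (“pdb_” plus eight alphanumeric characters). The function name `find_extended_pdb_ids`, the case-insensitive matching, and the lowercase normalization are assumptions made for this illustration and are not part of any wwPDB specification.

```python
import re
from typing import List

# Assumption: an extended PDB ID is "pdb_" followed by exactly eight ASCII
# alphanumeric characters. Matching is case-insensitive here, which is an
# illustrative choice rather than a documented wwPDB rule.
EXTENDED_PDB_ID = re.compile(r"\bpdb_[0-9a-z]{8}\b", re.IGNORECASE)


def find_extended_pdb_ids(text: str) -> List[str]:
    """Return all extended PDB IDs found in text, normalized to lowercase."""
    return [match.group(0).lower() for match in EXTENDED_PDB_ID.finditer(text)]


if __name__ == "__main__":
    sample = "Coordinates are available as pdb_00001abc and pdb_10021abc."
    print(find_extended_pdb_ids(sample))
```

A bare legacy code such as “1abc” is deliberately not matched by this pattern, since four-character strings are far more likely to collide with ordinary words or numbers in running text.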
All existing PDB entries bearing legacy (four-character) IDs will also be persistently identified with extended PDB IDs stored in the PDBx/mmCIF atomic coordinate file (e.g., PDB ID “1abc” will have the extended ID “pdb_00001abc”). Herein, we describe the five-year transition plan to help PDB depositors, users, and scientific journals embrace the extended PDB ID and exclusively rely on PDBx/mmCIF format data files. Resources for supporting the extended PDB ID format will be provided to PDB users in the form of an FAQ on PDB ID extension and learning materials/links on PDBx/mmCIF format. Additionally, a PDB “beta” archive will be provided starting in 2026, with a file directory structure consistent with that of the current PDB archive. The PDB “beta” archive will become the PDB main archive once all the four-character PDB accession codes are exhausted, and the current PDB archive will be removed. As the transition phase progresses, additional training resources will be provided. Community-wide adoption of PDBx/mmCIF format is the goal of the wwPDB. Since 2014, this format has been the official master format underpinning both the PDB Core Archive and the wwPDB global OneDep software system for complete deposition, rigorous validation, and expert biocuration of incoming 3D biostructure data. PDBx/mmCIF has many advantages over the legacy PDB format. It is flexible, fully extensible, both human- and machine-readable, and can accommodate 3D biostructures of any size and composition.
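The mapping from a legacy ID to its extended form, as in the “1abc” to “pdb_00001abc” example above, can be sketched as left-padding the four-character code with zeros to eight characters and adding the “pdb_” prefix. The Python below is a minimal illustration under that reading of the example; the function name `to_extended_pdb_id`, the input validation, and the lowercase normalization are assumptions, not a wwPDB-provided routine.

```python
def to_extended_pdb_id(legacy_id: str) -> str:
    """Convert a legacy four-character PDB ID to the extended 12-character form.

    Assumption: the extended ID is "pdb_" plus the legacy code left-padded
    with zeros to eight characters, as in "1abc" -> "pdb_00001abc".
    """
    code = legacy_id.strip().lower()
    # Reject anything that is not exactly four ASCII alphanumeric characters.
    if len(code) != 4 or not code.isascii() or not code.isalnum():
        raise ValueError(f"not a four-character PDB ID: {legacy_id!r}")
    return "pdb_" + code.zfill(8)


if __name__ == "__main__":
    print(to_extended_pdb_id("1abc"))
```

Normalizing to lowercase keeps the output consistent with the example in the text; software that must preserve the original casing would drop that step.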
This presentation will cover (a) a guide to the extended PDB ID and PDBx/mmCIF format for PDB users, scientific journal editors, and expert referees; (b) how the PDBx/mmCIF format accommodates extended IDs (PDB IDs, ligand codes), and where and how extended PDB IDs are stored in PDBx/mmCIF data files; (c) how to derive the new extended PDB ID from an old PDB ID for more robust identification of all PDB structures in the scientific literature; (d) the PDB “beta” archive that will be provided in 2026 to support this persistent identifier transition, and concomitant changes in file directory architecture; and (e) example files containing extended PDB IDs that can be accessed during testing of software changes designed to enable their adoption and management. RCSB PDB is funded by the National Science Foundation (DBI-1832184), the US Department of Energy (DE-SC0019749), and the National Cancer Institute, National Institute of Allergy and Infectious Diseases, and National Institute of General Medical Sciences of the National Institutes of Health under grant R01GM133198.

Topics & Keywords

Scientific Computing and Data Management · Advanced Proteomics Techniques and Applications

Publication Details

Published in: Structural Dynamics

Volume 12, Issue 2_Supplement, p. A27

DOI: 10.1063/4.0000336

Field-Weighted Citation Impact: 0.00
