A Large-Scale Structured Dataset of 3,087 Neuro-Oncology Case Reports for Clinical NLP and LLM Fine-Tuning

2026 | 0 citations | Dataset | Green Open Access

Authors

Abstract

Title: A Large-Scale Structured Dataset of 3,087 Neuro-Oncology Case Reports for Clinical NLP and LLM Fine-Tuning
Version: 1.0 (Sample Subset)
Curator: hamomaher06@gmail.com

This repository contains a highly curated, 100-case subset of a proprietary neuro-oncology clinical dataset. Extracted from peer-reviewed case reports and heavily standardized through rigorous medical domain-logic (MD-level curation), this sample represents the uppermost echelon of data fidelity. It is specifically designed to support the training of predictive algorithms, Large Language Models (LLMs), and Clinical Decision Support (CDS) frameworks in neuro-oncology.

This repository serves as an open-access demonstrative sample. The complete, commercially available dataset comprises over 3,000 strictly validated neuro-oncology cases. Researchers or commercial entities interested in acquiring the full dataset for high-throughput pipeline integration or foundational model training must contact the curator at hamomaher06@gmail.com.

Methodology & Curation Protocols

Unlike raw biomedical extractions, which frequently suffer from signal noise and semantic contradictions, this dataset has undergone deterministic clinical mapping:

1. Morphological and Molecular Standardization: Diagnoses are aligned with the WHO Classification of Tumors of the Central Nervous System (5th Edition) constraints. Molecular markers exhibiting ambiguous reporting structures (e.g., MGMT promoter methylation statuses reported as "partial," "borderline," or "hypermethylated") have been synthetically mapped to strict binary classifications (methylated or unmethylated). Contradictory biomarker signals (e.g., coincident mutant and wild-type IDH presentations) resulted in immediate record forfeiture to preserve cohort purity.
2. Demographic Normalization: Heterogeneous pediatric and adult age metrics have been mathematically interpolated into a continuous, two-decimal floating-point format (age_years).
3. AI-Synthesized Contextualization: Each record is accompanied by a generated_vignette, a highly structured, AI-synthesized narrative summary providing rich semantic context for natural language processing architectures.
4. Legal Compliance & Redaction: Original narratives (raw_text_snippet) derived from Non-Commercial (NC) open-access licenses or paywalled journals have been actively redacted to ensure robust commercial utility and copyright compliance, while derived structural variables remain fully intact.

Accessing the Full Database

The 100 cases provided herein are exclusively for evaluation and algorithmic benchmarking. The master dataset, containing >3,000 verified cases, provides the statistical power necessary for foundational model development.

For full access, licensing inquiries, and commercial distribution rights:
Contact: hamomaher06@gmail.com
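
The curation rules listed under Methodology & Curation Protocols (age normalization, MGMT binarization, and dropping records with conflicting IDH reports) could be sketched roughly as below. This is an illustration under stated assumptions, not the curator's actual pipeline: only age_years is a field name taken from the description above, while the input keys (age_value, age_unit, mgmt_reported, idh_reports), the conversion factors, and the function names are hypothetical. The description does not say which binary class each ambiguous MGMT label was mapped to, so that lookup is left to the caller.

```python
from typing import Iterable, Mapping, Optional

# Assumed conversion factors; the dataset description does not state them.
UNITS_PER_YEAR = {
    "years": 1.0,
    "months": 12.0,
    "weeks": 365.25 / 7,
    "days": 365.25,
}


def age_to_years(value: float, unit: str) -> float:
    """Convert an age in the given unit to years, rounded to two decimals."""
    return round(value / UNITS_PER_YEAR[unit], 2)


def binarize_mgmt(label: str, ambiguous_map: Mapping[str, str]) -> str:
    """Map a reported MGMT status to 'methylated' or 'unmethylated'.

    ambiguous_map is supplied by the caller because the source does not
    state how labels such as 'partial' or 'borderline' were resolved.
    """
    key = label.strip().lower()
    if key in ("methylated", "unmethylated"):
        return key
    if key in ambiguous_map:
        return ambiguous_map[key]
    raise ValueError(f"no binary mapping defined for MGMT label {label!r}")


def has_idh_conflict(statuses: Iterable[str]) -> bool:
    """Return True when both mutant and wild-type IDH are reported."""
    seen = {s.strip().lower() for s in statuses}
    return "mutant" in seen and "wild-type" in seen


def curate(record: Mapping, ambiguous_map: Mapping[str, str]) -> Optional[dict]:
    """Return a normalized record, or None if the record should be dropped."""
    if has_idh_conflict(record["idh_reports"]):
        return None  # contradictory biomarker signal: discard the record
    return {
        "age_years": age_to_years(record["age_value"], record["age_unit"]),
        "mgmt_status": binarize_mgmt(record["mgmt_reported"], ambiguous_map),
    }
```

For example, calling curate on a record with age_value 18 and age_unit "months" would yield an age_years of 1.5, and a record whose idh_reports contains both "mutant" and "wild-type" would return None. Passing the ambiguous-label table as an argument keeps the clinically sensitive mapping decision explicit rather than buried in the code.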

Topics & Keywords

Publication Details

Published in: Zenodo (CERN European Organization for Nuclear Research)

DOI: 10.5281/zenodo.19373208
