Introduction

Biomedical data integration requires term-to-identifier normalization, the process of linking natural-language biomedical terms to standardized ontology codes so that extracted concepts become computable and interoperable. Although large language models perform well on clinical text summarization and concept extraction, they remain markedly less accurate at mapping ontology terms to their corresponding identifiers.

Methods

We examined the roles of memorization and generalization in term-to-code mapping across the Human Phenotype Ontology (HPO), the Gene Ontology (GO), and the HGNC gene naming system, including mappings between gene names, lexicalized gene symbols, and arbitrary gene identifiers. Performance was assessed across multiple base models and after task-specific fine-tuning.

Results

Accuracy scaled with model size, with GPT-4o outperforming Llama 3.1 70B and Llama 3.1 8B. Fine-tuning improved forward mappings from term to identifier, with larger gains for GO than for HPO and minimal improvement for gene name-to-HGNC identifier mappings. Generalization to withheld mappings occurred primarily for HGNC gene name-to-gene symbol tasks, whereas fine-tuning on HPO and GO identifiers produced little generalization. Embedding analyses revealed strong semantic alignment between gene names and HGNC gene symbols but no comparable alignment between concept names and identifiers in GO, HPO, or HGNC.

Conclusions

These results suggest that fine-tuning success depends on two interacting factors: popularity and lexicalization. Popularity, a proxy for pretraining exposure to term-identifier pairs, predicted baseline accuracy and the magnitude of memorization gains during fine-tuning, whereas long-tail identifiers remained difficult to consolidate. Lexicalization, the extent to which a symbol functions as a meaningful token in embedding space, enabled generalization and explains why generalization emerged for HGNC gene symbols but not for the arbitrary identifiers used in GO and HPO. Together, these findings provide a predictive framework for identifying when fine-tuning can improve factual term normalization, when gains primarily reflect memorization, and when normalization is likely to fail.