Biomedical Named Entity Recognition (NER) is essential for structuring and extracting information from specialized medical texts, which in turn supports research and diagnostics. It matters particularly in emerging fields such as biofilm studies, where understanding gene-protein interactions is crucial for characterizing microbial communities and antimicrobial resistance mechanisms. This work presents a hybrid architecture that integrates BioBERT's deep contextualization with HunFlair's sequential modeling capabilities through a dimensional reprojection mechanism. The architecture combines a specialized embedding layer (BioBERT dmis-lab/biobert-v1.1), suited to biomedical and biofilm-related contexts, with a sequential processing stack (BiLSTM-CRF) designed to identify entities such as genes and proteins. A dimensional reprojection layer ($768 \rightarrow 4296$ dimensions) applies a learned linear transformation to align the two components and carry information between them, improving overall performance without compromising structural coherence. We trained our model on 12 harmonized biomedical corpora containing gene and protein annotations related to biofilms and general biomedical domains, fine-tuning with a learning rate of $5 \times 10^{-6}$ over 10 epochs. Testing demonstrates that our model outperforms conventional architectures in biomedical named entity recognition, achieving F1 scores of 90.58% on BC2GM (+5.43% compared to BioBERT), 90.70% on JNLPBA (+13.21% compared to HunFlair), 89.20% on BioNLPCG (+1.49% compared to HunFlair), and 80.56% on CRAFT (+8.37% compared to HunFlair). Precision scores reach 90.75% (BC2GM), 89.32% (JNLPBA), 89.03% (BioNLPCG), and 74.19% (CRAFT). Recall scores are particularly high: 90.41% (BC2GM), 92.12% (JNLPBA), 89.37% (BioNLPCG), and 88.13% (CRAFT). High recall is essential for comprehensive entity detection in biofilm research, where omitting a critical gene or protein could leave gaps in the understanding of microbial mechanisms. Statistical validation confirms the significance of the improvements ($p < 0.01$). These results represent a notable advance over existing models, paving the way for future applications in extracting biofilm-related information from large text datasets and enabling the construction of biofilm-specific knowledge graphs. The code is publicly available to ensure reproducibility.
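To make the data flow described in the abstract concrete, the following is a minimal PyTorch sketch of a learned linear reprojection from 768 to 4296 dimensions feeding a BiLSTM. It is an illustration under stated assumptions, not the released implementation: the class name `ReprojectionTagger`, the LSTM hidden size of 256, and the three-tag output are hypothetical choices, a random tensor stands in for real BioBERT token states, and the CRF decoding layer is omitted.

```python
import torch
from torch import nn


class ReprojectionTagger(nn.Module):
    """Hypothetical sketch: 768-d token states -> 4296-d -> BiLSTM -> tag scores."""

    def __init__(self, in_dim=768, proj_dim=4296, hidden=256, num_tags=3):
        super().__init__()
        # Learned linear transformation (the dimensional reprojection, 768 -> 4296).
        self.reproject = nn.Linear(in_dim, proj_dim)
        # Bidirectional LSTM over the reprojected token sequence.
        # hidden=256 is an assumption; the abstract does not state it.
        self.bilstm = nn.LSTM(proj_dim, hidden, batch_first=True, bidirectional=True)
        # Per-token emission scores. A CRF layer would decode these; it is
        # left out here because PyTorch has no built-in CRF module.
        self.emissions = nn.Linear(2 * hidden, num_tags)

    def forward(self, token_states):
        # token_states: (batch, seq_len, in_dim)
        projected = self.reproject(token_states)   # (batch, seq_len, proj_dim)
        context, _ = self.bilstm(projected)        # (batch, seq_len, 2 * hidden)
        return self.emissions(context)             # (batch, seq_len, num_tags)


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ReprojectionTagger()
    # Random stand-in for encoder output: 2 sentences of 12 tokens each.
    fake_encoder_output = torch.randn(2, 12, 768)
    scores = model(fake_encoder_output)
    print(scores.shape)  # torch.Size([2, 12, 3])
```

In a full pipeline the random tensor would be replaced by the last hidden states of the BioBERT encoder, and the emission scores would be passed to a CRF for sequence-level decoding.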
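The reported precision, recall, and F1 figures can be cross-checked against each other. The short sketch below assumes that each F1 score is the harmonic mean of the precision and recall reported for the same corpus; the abstract does not state the averaging scheme, so this is a consistency check rather than a reproduction of the evaluation.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) as stated in the abstract.
reported = {
    "BC2GM": (90.75, 90.41, 90.58),
    "JNLPBA": (89.32, 92.12, 90.70),
    "BioNLPCG": (89.03, 89.37, 89.20),
    "CRAFT": (74.19, 88.13, 80.56),
}

for corpus, (p, r, f1) in reported.items():
    computed = f1_score(p, r)
    print(f"{corpus}: computed F1 = {computed:.2f}, reported F1 = {f1:.2f}")
```

Under that assumption, the computed values agree with the reported F1 scores to two decimal places for all four corpora.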