Comparison of BioBERT and DistilBERT for Named Entity Recognition on Indonesian Radiology Clinical Data

Nadia Eka Aprilia; Danang Wahyu Utomo

Submitted

November 4, 2025

Published

March 30, 2026

Abstract

Named Entity Recognition (NER) in Indonesian-language radiology reports faces significant challenges due to the limited availability of labeled training data. This constraint is a major obstacle to developing an accurate medical information extraction system. Pseudo-labeling is a potential solution: it leverages abundant unlabeled data to expand the training set without time-consuming manual annotation. This study compares the performance of two transformer models, BioBERT and DistilBERT, fine-tuned on pseudo-labeled data for extracting medical entities from Indonesian radiology reports. The methodology comprises three main stages: text preprocessing and normalization, text alignment using regular expressions with BIO labeling, and model fine-tuning with a pseudo-labeling strategy. Model performance was evaluated using Precision, Recall, and F1-score on an adapted radiology dataset. The results indicate that pseudo-labeling was effective in improving the performance of both models. DistilBERT achieved the higher accuracy, 96.4%, while BioBERT reached 92.78%. DistilBERT also demonstrated superior computational efficiency, with faster training time. This study provides insight for selecting a model architecture for NER on Indonesian medical text, balancing accuracy against computational efficiency.
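
As a rough illustration of the second stage (regular-expression alignment with BIO labeling), the following minimal Python sketch tags tokens whose character spans fall inside a regex match. It is not the authors' implementation: the entity labels, the small Indonesian term lists in PATTERNS, and the function name bio_tag are all hypothetical and chosen only for demonstration.

```python
import re

# Hypothetical lexicon (assumption, not from the study): entity label -> regex.
# Longer alternatives come first so "paru kanan" wins over "paru".
PATTERNS = {
    "ANATOMY": re.compile(r"\b(?:paru kanan|paru kiri|paru|jantung)\b", re.IGNORECASE),
    "FINDING": re.compile(r"\b(?:efusi pleura|infiltrat|kardiomegali)\b", re.IGNORECASE),
}

# Simple tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def bio_tag(text):
    """Return (token, BIO label) pairs for one report sentence."""
    tokens = [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]
    labels = ["O"] * len(tokens)
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            inside = False  # becomes True once the first token of the span is tagged
            for i, (_, start, end) in enumerate(tokens):
                # A token belongs to the entity if its span lies inside the match.
                if start >= match.start() and end <= match.end() and labels[i] == "O":
                    labels[i] = ("I-" if inside else "B-") + label
                    inside = True
    return [(tok, lab) for (tok, _, _), lab in zip(tokens, labels)]


if __name__ == "__main__":
    for token, label in bio_tag("Tampak infiltrat di paru kanan."):
        print(token, label)
```

Labels produced this way are weak labels: their quality depends entirely on the coverage of the patterns, which is why a step of this kind would normally be followed by model fine-tuning rather than used as a final extractor.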

Keywords

Named Entity Recognition; Pseudo-labeling; BioBERT; DistilBERT; Radiology report

This work is licensed under a Creative Commons Attribution 4.0 International License.
