| Graduate student: | 瞿邦泰 Qu, Bang-Tai |
|---|---|
| Thesis title: | 將多任務學習用於中文疾病症狀的實體識別和規範化 Toward the use of multi-task learning for entity recognition and normalization of Chinese disease symptoms |
| Advisor: | 藍崑展 Lan, Kun-Chan |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Institute of Medical Informatics |
| Year of publication: | 2023 |
| Academic year of graduation: | 111 |
| Language: | English |
| Number of pages: | 72 |
| Keywords: | Medical Named Entity Recognition, Medical Entity Normalization, joint loss strategy, pseudo-labeling |
Interest in information extraction techniques has grown steadily in recent years because of the benefits they bring to research and applications. They are of particular interest in the medical field, where texts hold a wealth of clinically valuable information. Medical Named Entity Recognition (MER) and Medical Entity Normalization (MEN) are the fundamental tasks for extracting this information, and they play a crucial role in building medical knowledge graphs and in assisting physicians with diagnosis.
Previous research has confirmed that MER and MEN are highly correlated and that the two tasks can reinforce each other: through multi-task learning, a model can achieve better recognition and normalization results. However, we found that existing methods improve MER considerably but MEN only modestly. After examining their propagation strategies and the way they compute the joint loss, we found that these studies focus on MER, so MEN receives limited help from the MER task. In addition, the models perform poorly on unseen data and, partly because data collection is difficult, previous research has been based primarily on English data.
To address these three issues, we propose an improved joint loss strategy that assigns a trainable weight to each task, and we use pseudo-labeling to augment the data so that the model has more examples to learn from. We also built a Chinese dataset for MER and MEN and trained on it. On the English MER and MEN tasks, we achieved F1 scores of 93% and 92.5%, respectively, outperforming previous baseline models. On the Chinese dataset, however, the adjusted model's performance dropped slightly. Further analysis showed that Chinese medical text contains more nested entities and that part of the data is mislabeled, which makes it hard to identify entity boundaries accurately and leads to normalization matching errors. Resolving these Chinese-specific issues is likely to be a major challenge for improving model performance on Chinese data.
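The abstract does not give the exact form of the joint loss or the rule used to select pseudo-labels. As a minimal sketch only, the code below assumes an uncertainty-style weighting, in which each task i has a trainable scalar s_i and the joint loss is the sum of exp(-s_i) * L_i + s_i, together with a fixed confidence threshold for keeping pseudo-labels. The function names, the loss form, the example loss values, and the 0.9 threshold are all illustrative assumptions, not details taken from the thesis.

```python
import math


def joint_loss(task_losses, log_weights):
    """Combine per-task losses using one trainable scalar s_i per task.

    Assumed form (not confirmed by the thesis):
        L = sum_i exp(-s_i) * L_i + s_i
    The "+ s_i" term keeps the weights exp(-s_i) from collapsing to zero.
    """
    return sum(math.exp(-s) * loss + s
               for loss, s in zip(task_losses, log_weights))


def joint_loss_grad(task_losses, log_weights):
    """Gradient of joint_loss with respect to each s_i: 1 - exp(-s_i) * L_i."""
    return [1.0 - math.exp(-s) * loss
            for loss, s in zip(task_losses, log_weights)]


def select_pseudo_labels(predictions, threshold=0.9):
    """Keep unlabeled examples whose predicted label is confident enough.

    predictions: iterable of (text, label, confidence) tuples produced by a
    model trained on the labeled data (hypothetical input format).
    """
    return [(text, label)
            for text, label, confidence in predictions
            if confidence >= threshold]


if __name__ == "__main__":
    # Hypothetical MER and MEN losses for one batch.
    losses = [0.8, 1.6]
    s = [0.0, 0.0]
    learning_rate = 0.1

    # One gradient-descent step on the trainable scalars only.
    grads = joint_loss_grad(losses, s)
    s = [si - learning_rate * gi for si, gi in zip(s, grads)]
    print("joint loss:", joint_loss(losses, s))
    print("task weights:", [math.exp(-si) for si in s])
```

In a real training loop the scalars would be updated by the optimizer alongside the model parameters, and the examples returned by `select_pseudo_labels` would be added to the training set for the next round.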