
Graduate Student: Qu, Bang-Tai (瞿邦泰)
Thesis Title: Toward the use of multi-task learning for entity recognition and normalization of Chinese disease symptoms (將多任務學習用於中文疾病症狀的實體識別和規範化)
Advisor: Lan, Kun-Chan (藍崑展)
Degree: Master
Department: Institute of Medical Informatics, College of Electrical Engineering and Computer Science
Publication Year: 2023
Graduation Academic Year: 111
Language: English
Number of Pages: 72
Keywords (Chinese): 醫學命名實體識別, 醫學實體標準化, 計算聯合損失策略, 偽標簽
Keywords (English): Medical Named Entity Recognition, Medical Entity Normalization, joint loss strategy, pseudo-labeling
    Interest in applying information extraction techniques has been growing, because these techniques can bring substantial benefits to the corresponding research and applications. In the medical domain, medical texts carry a great deal of clinical value, and medical named entity recognition and normalization are the most fundamental tasks for extracting that value; they play a decisive role in the subsequent construction of medical knowledge graphs and in assisting physicians with diagnosis.
    Previous studies have confirmed that medical named entity recognition and normalization are highly correlated tasks that can promote each other's learning, and that through multi-task learning a model can achieve good recognition and normalization performance. We found that these methods bring relatively large improvements to medical entity recognition but only limited improvements to medical entity normalization. After examining their propagation strategies and their methods for computing the joint loss, we found that their work emphasizes entity recognition, so the normalization task receives only limited help from the recognition task, and the models perform poorly on unseen data. In addition, because such data are difficult to collect, previous studies were all built on English data.
    To address these three problems, we propose an improved strategy for computing the joint loss that gives each task a trainable weight, and we use pseudo-labeling to augment the data so that the model has more data to learn from. We also built a Chinese dataset for medical entity recognition and normalization and trained on it. On the English medical entity recognition and normalization tasks we achieved F1 scores of 93% and 92.5% respectively, outperforming the previous baseline models. On the Chinese dataset, however, we found that the adjusted model's performance dropped; our analysis showed that Chinese medical text contains more nested entities and some mislabeled data, making it hard to determine entity boundaries accurately or to match entities to their standardized forms. Resolving these Chinese-specific problems is likely to be a major difficulty in improving the model's performance on Chinese in the future.

    In recent years, there has been growing interest in the application of information extraction techniques. These techniques are of particular interest in the medical field because medical texts contain a wealth of valuable healthcare information. Medical Named Entity Recognition (MER) and Medical Named Entity Normalization (MEN) are the fundamental tasks for acquiring this information, and they play a crucial role in constructing medical knowledge graphs and in assisting physicians with diagnosis.
    Previous research has confirmed that MER and MEN are highly correlated tasks that can mutually enhance each other's learning; through multi-task learning, models can achieve better recognition and normalization results. However, we found that existing studies focus mainly on medical entity recognition, so the improvements they bring to medical entity normalization are limited, and the resulting models also perform poorly on unseen data. Furthermore, due to the difficulty of data collection, previous research has been based primarily on English data.
    To address these three issues, we propose an improved strategy for computing the joint loss: each task is given a trainable weight, and pseudo-labeling is used to augment the data so that the model can learn from more examples. We also created a Chinese dataset for MER and MEN and trained on it. On the English MER and MEN tasks, we achieved F1 scores of 93% and 92.5%, respectively, outperforming previous baseline models. On the Chinese dataset, however, we observed that the adjusted model's performance dropped. Further analysis revealed that Chinese medical text contains more nested entities and some mislabeled data, which makes it difficult to identify entity boundaries accurately or to match entities to their standardized forms. Addressing these Chinese-specific issues may be a key challenge for improving model performance on Chinese data.
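
    The joint loss with trainable per-task weights described above can be sketched in a few lines of PyTorch. This is a minimal sketch only, assuming a log-variance weighting scheme; the class name TrainableJointLoss, the parameter log_vars, and the weighting formula are illustrative assumptions, not the thesis's exact formulation.

        import torch
        from torch import nn

        class TrainableJointLoss(nn.Module):
            """Combine the MER and MEN losses with learnable per-task weights (sketch)."""

            def __init__(self, num_tasks: int = 2):
                super().__init__()
                # One learnable log-variance per task, initialised to 0 (weight = 1).
                self.log_vars = nn.Parameter(torch.zeros(num_tasks))

            def forward(self, task_losses):
                total = torch.zeros((), device=self.log_vars.device)
                for i, loss in enumerate(task_losses):
                    weight = torch.exp(-self.log_vars[i])
                    # Weighted task loss plus a regulariser that keeps the learned
                    # weight from collapsing to zero.
                    total = total + weight * loss + self.log_vars[i]
                return total

        # Usage during training (mer_loss and men_loss are scalar tensors):
        # joint_loss_fn = TrainableJointLoss(num_tasks=2)
        # loss = joint_loss_fn([mer_loss, men_loss])
        # loss.backward()

    For the weights to adapt during training, log_vars would need to be registered with the same optimizer as the model parameters.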
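
    Pseudo-labeling as used for data augmentation can likewise be illustrated with a short sketch. The interface below is an assumption: predict_fn stands for any trained tagger that returns BIO tags with per-token confidences, and the 0.9 confidence threshold is an illustrative value rather than one taken from the thesis.

        from typing import Callable, List, Tuple

        def pseudo_label(
            predict_fn: Callable[[List[str]], Tuple[List[str], List[float]]],
            unlabeled_sentences: List[List[str]],
            threshold: float = 0.9,
        ) -> List[Tuple[List[str], List[str]]]:
            """Tag unlabeled sentences and keep only confident predictions (sketch)."""
            augmented = []
            for tokens in unlabeled_sentences:
                tags, confidences = predict_fn(tokens)
                # Keep the sentence only if every token prediction is confident,
                # so that low-quality pseudo-labels do not pollute the training set.
                if confidences and min(confidences) >= threshold:
                    augmented.append((tokens, tags))
            return augmented

    The accepted (tokens, tags) pairs would then be mixed into the labeled training data for a further training round.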
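
    The nested-entity problem mentioned for Chinese medical text can be made concrete with a small example. The mention, the entity types (DIS, BODY), and the two-layer BIO scheme below are illustrative assumptions rather than details taken from the thesis dataset; they show why a single flat tag sequence cannot represent a nested mention.

        # Illustrative nested mention: the disease mention "急性上呼吸道感染"
        # (acute upper respiratory infection) contains the inner body-part
        # entity "上呼吸道" (upper respiratory tract).
        tokens = ["急", "性", "上", "呼", "吸", "道", "感", "染"]

        # Outer layer: the full disease mention.
        outer_tags = ["B-DIS", "I-DIS", "I-DIS", "I-DIS", "I-DIS", "I-DIS", "I-DIS", "I-DIS"]

        # Inner layer: the nested entity inside the disease mention.
        inner_tags = ["O", "O", "B-BODY", "I-BODY", "I-BODY", "I-BODY", "O", "O"]

        # A single flat tag sequence can encode only one of these two spans, which
        # is why one tag sequence per nesting level (joint multi-level annotation)
        # is needed.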

    Chinese Abstract
    Abstract
    Acknowledgements
    Contents
    List of Tables
    List of Figures
    Chapter 1. Introduction
    1.1 Definition of Knowledge
    1.1.1 Named Entity Recognition
    1.1.2 Named Entity Normalization
    1.1.3 NER and NEN in the medical field
    1.2 Prior work shows that jointly modeling MER and MEN with multi-task learning is beneficial
    1.3 Motivation for this study
    1.4 Definition of Problem
    1.5 Our contribution
    Chapter 2. Related work
    2.1 Prior work on NER
    2.2 Prior work on NEN
    2.3 Prior work on multi-task MER and MEN
    2.4 Prior work on Chinese data for NER
    2.5 Prior work on data augmentation for NER
    2.6 Adjusting the joint loss strategy
    2.7 Prior work on solving nested named entities
    Chapter 3. Methods
    3.1 The design of the joint loss function
    3.2 Data augmentation for unseen examples: pseudo-labeling
    3.3 Method for solving nested entities
    3.4 Overall framework
    Chapter 4. Experiment and Result
    4.1 Dataset
    4.2 Data Preprocessing
    4.2.1 Creating datasets for MER and MEN tasks
    4.2.2 Sequence labeling
    4.3 Experiment details for how the joint loss is implemented
    4.4 Experiment details for data augmentation: pseudo-labeling
    4.5 Experiment details for adding multiple levels of joint annotation to nested entities and adjusting the model output strategy
    4.6 Experiment setting
    4.7 Hyperparameter setting
    Chapter 5. Result and Discussion
    5.1 Result of our study
    5.2 Ablation experiment
    5.2.1 Effect of different data augmentation methods on MEN/MER
    5.2.2 Effect of different methods for solving nested named entities
    5.3 Discussion
    5.3.1 How to Match ICDs for Diseases Where Similarity Doesn't Exist
    5.3.2 Use of ChatGPT for MER and MEN tasks
    5.3.3 ChatGPT for Data augmentation
    5.3.4 PubMedBERT vs BioBERT
    5.3.5 Introducing more granular labeling
    5.3.6 Noise labels
    Chapter 6. Limitation and Future Work
    Limitation
    Future Work
    Chapter 7. Conclusion
    References
    Appendix 1. Possible solutions for semantic similarity matching (no same words or vocabulary) (code)
    Appendix 2. Disease map to ICD-10 (core code)
    Appendix 3. Simple data augmentation method (core code)

