| Field | Value |
|---|---|
| Author | Chiang, Sheng-Han (江昇翰) |
| Title | Privacy and Spoken Information Processing for De-identification in a Medical Dialogue Corpus (在醫病對話語料庫中隱私內容及口語資訊處理之去識別化研究) |
| Advisor | Kao, Hung-Yu (高宏宇) |
| Degree | Master |
| Department | Institute of Medical Informatics, College of Electrical Engineering and Computer Science |
| Year of publication | 2022 |
| Academic year of graduation | 110 |
| Language | English |
| Pages | 42 |
| Keywords (Chinese) | 自然語言處理、命名實體識別、資料集、醫學隱私相關、數據前處理 |
| Keywords (English) | Natural Language Processing, Named Entity Recognition, Dataset, Medical privacy, Data preprocessing |
| Access count | Hits: 64; Downloads: 2 |
In hospitals, many medical processes are carried out through conversations between medical or health-education personnel and patients. Based on the judgments formed in these conversations, medical personnel make the relevant treatment decisions, and every dialogue or question-and-answer session must be conducted by professional medical staff, so labor cost is a major challenge to the efficiency of the healthcare system. Moreover, using these conversations as medical data raises a dilemma between public privacy and medical research: how to conduct medical research reasonably while safeguarding people's privacy, their most fundamental right, is one of the major problems the medical field must face today.

In this thesis, we introduce several methods for processing private content and spoken dialogue data. We analyze the characteristics of this type of dataset and show how, given these characteristics, preprocessing alone can yield a 5% to 7% improvement. We also verify, discuss, and address the dataset's bias through text filtering and replacement, demonstrating how important preprocessing is to the task. In addition, we build CMCDD, a Traditional Chinese medical named entity recognition (NER) dataset. Its goal is to train models on the annotated privacy entities so that a model can focus on patient-related private content, producing reliably de-identified data and ultimately saving manpower and funding in initial consultations. The data come from physician-patient conversations, each longer than 5 minutes, in the Department of Infectious Diseases at National Cheng Kung University Hospital. The dataset contains 409 physician-patient dialogues annotated with 13 identifiable entity types. The complete dataset will be released through the Ministry of Education's AI competition and data annotation project.
In hospitals, many medical procedures are carried out through dialogue between medical or health-education personnel and patients. Based on judgments made during these conversations, medical personnel make appropriate treatment decisions, and every dialogue or question-and-answer session must be conducted by professional medical staff, so labor cost is a significant challenge to the efficiency of the healthcare system. In addition, using these interviews or question-and-answer sessions as medical data raises a dilemma between public privacy and medical research: how to conduct medical research reasonably while protecting people's most critical privacy rights is a significant problem the medical field must face today.

This thesis introduces several ways of handling private content and spoken conversation data. We analyze the characteristics of this type of dataset and show how, for data of this nature, preprocessing alone can achieve a 5% to 7% improvement. We also verify, discuss, and address the dataset's bias through text filtering and replacement, showing how important preprocessing is to the task. In addition, we build CMCDD, a Traditional Chinese medical named entity recognition (NER) dataset. The goal of this dataset is to train models on the annotated privacy entities so that a model can focus on patient-related private content, yield stable de-identified data, and achieve human-resource savings in initial consultations. The data come from physician-patient conversations, each lasting more than 5 minutes, at the Department of Infectious Diseases, National Cheng Kung University Hospital. The dataset contains 409 physician-patient dialogues with a total of 13 identifiable entity types. The complete dataset will be released through the Ministry of Education's AI competition and data annotation project.
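The two preprocessing steps described above, filtering spoken fillers out of transcribed dialogue and replacing privacy entities found by an NER model with placeholder tags, can be sketched as follows. This is a minimal illustration only: the filler list, tag labels, and function names are assumptions, not the thesis's actual implementation.

```python
import re

# Assumed spoken fillers to strip from transcripts (illustrative only).
FILLERS = ["嗯", "那個", "就是說"]

def filter_fillers(utterance: str) -> str:
    """Remove filler tokens from a transcribed utterance and collapse whitespace."""
    for f in FILLERS:
        utterance = utterance.replace(f, "")
    return re.sub(r"\s+", " ", utterance).strip()

def deidentify(utterance: str, entities) -> str:
    """Replace each (start, end, label) privacy span with a <LABEL> placeholder.

    Spans are applied right-to-left so that earlier character offsets
    remain valid after each replacement.
    """
    for start, end, label in sorted(entities, key=lambda e: e[0], reverse=True):
        utterance = utterance[:start] + f"<{label}>" + utterance[end:]
    return utterance

# Example: suppose an NER model tagged 王小明 as NAME and 台南 as LOCATION.
clean = filter_fillers("嗯我叫王小明住台南")        # "我叫王小明住台南"
masked = deidentify(clean, [(2, 5, "NAME"), (6, 8, "LOCATION")])
print(masked)  # 我叫<NAME>住<LOCATION>
```

Applying entity spans from right to left is the key detail: replacing a span changes the string length, so processing in reverse start order keeps the remaining (earlier) offsets valid without any bookkeeping.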