Author: Huang, Yen-Lin (黃彥霖)
Thesis title: A crowdsourcing mobile game used for biomedical named-entity recognition (群眾外包手機遊戲用以生醫命名實體辨識)
Advisor: Chang, Tien-Hao (張天豪)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of publication: 2015
Academic year of graduation: 103 (ROC calendar)
Language: Chinese
Pages: 30
Keywords: crowdsourcing, biomedical named-entity recognition, text mining, mobile game
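  • As information becomes increasingly open, a large number of indexable articles are published every year. Information extraction techniques, which pull key data out of text so that it can be stored in databases, are therefore becoming more and more important. In the existing information extraction workflow, a research group processes large volumes of natural-language literature with software and then hires domain experts to verify the output. This workflow is hard to sustain over the long term.
    To address this problem, this study applies an information extraction model based on crowdsourcing and attempts to use the crowd to handle the critical first step of information extraction: named-entity recognition. Based on this model, the study designed and implemented a mobile game, Markteria (馬克菌), which uses the fun of play as the crowdsourcing reward so that players willingly assist with the academic task of biomedical named-entity recognition while playing. Drawing on several commercially successful games, Markteria combines the tedious annotation work with an entertaining pet-raising game.
    The closed beta experiment used to validate Markteria targeted the recognition of protein names in protein-related literature. The results show that Markteria runs and plays as intended, that players' mistakes can be corrected against one another by the system's mechanism so that a sufficient number of players reaches a level comparable to experts, and that precision improves as the number of players grows. The structured data collected in this way can be used in downstream applications to build text-mining databases that serve domain experts and scholars.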

    The number of biomedical articles has increased dramatically in recent years, driving the development of many tools that automatically extract valuable knowledge from natural-language articles. However, for such fully automatically extracted data to be considered knowledge, a manual verification step is generally required. This manual verification, which requires employing many domain experts, may cost more than developing the extraction algorithms and is hard to sustain over a long period. This work introduces crowdsourcing, which borrows effort from the community, to solve this problem. A mobile game, Markteria, was implemented to extract knowledge from natural-language documents through crowdsourcing. Named-entity recognition (NER) was chosen as the task because it is a required preliminary step for information extraction applications. Markteria is topic-independent and can be extended to other NER topics in the future.
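
    The abstract reports that individual players' mistakes are corrected against one another by the system, but this record does not state the mechanism. As an illustration only, the sketch below assumes simple majority voting over the token positions each player marks as a protein name. The function name `aggregate_marks`, the `min_agreement` threshold, the example sentence, and the player data are all hypothetical and are not taken from the thesis.

```python
from collections import Counter


def aggregate_marks(player_marks, min_agreement=0.5):
    """Return the token indices marked by more than `min_agreement` of players.

    Majority voting is an assumption made for this sketch; the thesis
    record does not say how Markteria combines player answers.
    """
    if not player_marks:
        return set()
    votes = Counter()
    for marks in player_marks:
        votes.update(set(marks))  # each player votes at most once per token
    threshold = min_agreement * len(player_marks)
    return {token for token, count in votes.items() if count > threshold}


# Hypothetical sentence, split on whitespace.
tokens = "The p53 protein binds MDM2 in the nucleus".split()

# Three hypothetical players mark the token indices they believe are protein names.
player_marks = [
    {1, 4},     # player A marks "p53" and "MDM2"
    {1, 4, 7},  # player B also marks "nucleus" by mistake
    {1},        # player C misses "MDM2"
]

consensus = aggregate_marks(player_marks)
consensus_names = [tokens[i] for i in sorted(consensus)]
print(consensus_names)
```

    Under this assumed scheme, a mark made by only one of three players is outvoted and a miss by one player is covered by the other two, which is one way the "more players, better results" trend described in the abstract could arise.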

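    Chapter 1 Introduction
    Chapter 2 Related Work
      2.1 Crowdsourcing
      2.2 Text Mining
        2.2.1 Stop Words
        2.2.2 Edit Distance
      2.3 Named-Entity Recognition (NER)
      2.4 Related Databases and Services
        2.4.1 PubMed, PubMed Central
        2.4.2 UniProt
        2.4.3 GeneCards
    Chapter 3 Methods
      3.1 Datasets
        3.1.1 Article Collection
        3.1.2 Gene Symbol Lexicon Collection
        3.1.3 Stop Word Lexicon Collection
      3.2 Preprocessing
        3.2.1 Article Parsing
        3.2.2 Splitting Paragraphs into Sentences
        3.2.3 Splitting Sentences into Tokens
        3.2.4 Definition of Word Score
        3.2.5 Definition of Sentence Value
      3.3 Game Design
        3.3.1 Design Inspiration
        3.3.2 Gameplay and Game Flow
      3.4 System Architecture
        3.4.1 Game Client Application
        3.4.2 Server
        3.4.3 Offline Preprocessing
      3.5 Generating Information Extraction Answers
      3.6 Test Experiment Design
    Chapter 4 Results
      4.1 Game Screens
        4.1.1 Control Interface
        4.1.2 Annotation Page
        4.1.3 Petri Dish Page
        4.1.4 Shop and Collection Book Windows
      4.2 Experimental Results
        4.2.1 Evaluation of System Performance
        4.2.2 Effect of the Number of Players on System Performance
    Chapter 5 Conclusion
    References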

    Full-text availability: on campus, immediately; off campus, from 2016-08-26