Graduate student: Chou, Kuan-Ming (周冠銘)
Thesis title: Using automatic keywords extraction and text clustering methods for medical information retrieval improvement (利用自動化關鍵字選取與文件分群技術優化醫學文章之資訊擷取)
Advisors: Hsieh, Sun-Yuan (謝孫源)
Tsai, Pei-Hsuan (蔡佩璇)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Medical Informatics
Year of publication: 2013
Academic year of graduation: 101
Language: Chinese
Number of pages: 54
Chinese keywords: 分群 (clustering), EM algorithm, Neyman-Pearson test
Foreign-language keywords: Clustering, EM algorithm, Neyman-Pearson test
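  • Because a vast number of documents are stored on the web, a search easily returns many duplicate results. The motivation of this thesis is to reduce the time users spend filtering large amounts of duplicate information when searching. This study searches using patients' personal data from a medical information system, which reduces the problems caused by imprecise keywords. It then clusters the duplicate medical articles in the search results, so that patients and their families can find information related to the patient's disease more quickly.
    This study proposes an uncommon clustering method. A simple feature-word extraction method converts each article into a feature vector, and principal component analysis then reduces the dimensionality of these high-dimensional feature vectors. After the similarity values of all articles have been computed, the EM algorithm classifies them into two classes, "same" and "different", and the Neyman-Pearson hypothesis test is finally used to cluster the articles judged to be the same.
    In the experiments, we compared our method with the traditional K-means clustering algorithm. On the commonly used Precision, Recall, and F-measure metrics, our method gave better clustering results than K-means.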
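    The feature-vector and similarity steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the tokenizer, the function names, and the sample sentences are hypothetical, cosine similarity is the measure named in the English abstract, and the PCA dimensionality-reduction step is omitted for brevity.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Map each lowercase word in the text to its raw term frequency.

    The regex tokenizer is an assumption made for this sketch.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(tf_a, tf_b):
    """Cosine similarity between two sparse term-frequency vectors."""
    shared_terms = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def pairwise_similarities(docs):
    """Return {(i, j): similarity} for every document pair with i < j."""
    vectors = [term_frequencies(d) for d in docs]
    return {
        (i, j): cosine_similarity(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    }


# Hypothetical sample documents: the first two contain the same words.
docs = [
    "aspirin reduces fever and pain",
    "aspirin reduces pain and fever",
    "insulin regulates blood sugar",
]
sims = pairwise_similarities(docs)
```

    Because raw term frequencies ignore word order, the first two sample documents map to the same vector, which is the kind of near-duplicate pair the later clustering steps are meant to group.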

    Because the web holds a huge amount of data, a web search returns many duplicate and near-duplicate results. The motivation of this thesis is to reduce the time users spend filtering this duplicate and near-duplicate information when they search.
    In this thesis, we propose a novel clustering method to solve the near-duplicate problem. Our method transforms each document into a feature vector whose weights are the term frequencies of the corresponding words. To reduce the dimension of these feature vectors, we use principal component analysis (PCA) to transform them into another space. After PCA, we use cosine similarity to compute the similarity between documents. We then use the EM algorithm and the Neyman-Pearson hypothesis test to cluster the duplicate documents.
    We compared our results with those of the K-means method. The experiments show that our method outperforms K-means.
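    One way to picture the EM and hypothesis-test stage is sketched below. The abstract does not state the distributions or the test threshold, so everything specific here is an assumption: the similarity scores are modeled as a one-dimensional mixture of two Gaussians ("different" and "duplicate"), the decision is a plain likelihood-ratio test with a hypothetical threshold of 1.0, and the sample scores are invented for illustration. In a Neyman-Pearson setting the threshold would instead be chosen to fix the false-alarm rate.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(values, iterations=100):
    """EM for a 1-D two-component Gaussian mixture.

    Component 0 starts at the smallest value ("different" pairs) and
    component 1 at the largest ("duplicate" pairs).
    Returns (weights, means, variances).
    """
    lo, hi = min(values), max(values)
    means = [lo, hi]
    initial_var = max((hi - lo) ** 2 / 16, 1e-6)
    variances = [initial_var, initial_var]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = (p[0] + p[1]) or 1e-300  # guard against underflow
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weight, mean, and variance per component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # component received no mass; keep old parameters
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var, 1e-6)  # floor avoids a collapsed component
            weights[k] = nk / len(values)
    return weights, means, variances


def is_duplicate(x, means, variances, threshold=1.0):
    """Likelihood-ratio test: 'duplicate' (H1) against 'different' (H0).

    The threshold value is a hypothetical choice for this sketch.
    """
    h1 = gaussian_pdf(x, means[1], variances[1])
    h0 = max(gaussian_pdf(x, means[0], variances[0]), 1e-300)
    return h1 / h0 > threshold


# Invented pairwise similarity scores: five low, four high.
similarities = [0.05, 0.10, 0.12, 0.08, 0.15, 0.90, 0.95, 0.92, 0.88]
weights, means, variances = fit_two_gaussians(similarities)
```

    Document pairs for which `is_duplicate` returns True could then be linked, with each connected group of linked documents treated as one cluster; that grouping step is likewise an assumption of this sketch rather than something the abstract specifies.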

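    1. Introduction
    2. Related Work
    2.1 Medical Information Systems
    2.2 Personalized Search
    2.3 Information Retrieval
    2.3.1 Keyword Extraction
    2.3.2 Keyword Weighting
    2.3.3 Document Classification
    2.3.4 Document Clustering
    2.4 Comparison of Related Research
    3. Data Collection
    3.1 Keyword Extraction
    3.2 Article Search
    4. Data Preprocessing
    4.1 Feature-Word Analysis of Sample Articles
    4.2 Principal Component Analysis
    5. Data Clustering
    5.1 Similarity Computation
    5.2 EM Algorithm
    5.3 Neyman-Pearson Test Clustering
    6. Experimental Results
    6.1 Experimental Procedure
    6.2 Experimental Results
    7. Conclusion and Future Work
    8. References
    Appendix 1
    Appendix 2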


    Full text available on campus: 2015-08-30
    Full text available off campus: 2015-08-30