
Graduate student: 王京盛 (Wang, Ching Sheng)
Thesis title: 考量語意及引用分析之研究主題趨勢分析方法
A Research Trend Analyzing Method Based on Semantics and Citation Count
Advisor: 王惠嘉 (Wang, Hei-Chia)
Degree: Master
Department: College of Management - Institute of Information Management
Year of publication: 2012
Academic year of graduation: 100 (ROC calendar)
Language: Chinese
Number of pages: 62
Chinese keywords: 主題偵測與追蹤、趨勢分析、特徵選取、分群
English keywords: Topic Detection and Tracking, Trend Analysis, Feature Selection, Clustering
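  •   With the arrival of the era of data digitization, documents and publications are increasingly stored in electronic form to ease circulation. However, because the volume of electronic publications has grown explosively, researchers can easily collect large amounts of material yet cannot extract the important information from it. To address this problem, electronic databases today usually provide search engines that match keywords, but the search results are still mixed with much unnecessary information.
      To help researchers find relevant research material more efficiently, topic detection and tracking techniques can be used to identify the representative topics in a research collection and to track topic trends. Previous topic detection techniques, however, considered only a single data set and did not analyze the direction of research trends. Moreover, the semantics of the terms in the research data set, such as synonyms, spelling variations, and each author's choice of wording, also affect topic detection results. In addition, past studies neither organized and analyzed the detected topics and the rise and fall of their trends nor implemented them as a system, so the detection results were not presented directly to researchers.
      This study therefore exploits the chronological influence between conference papers and journal papers, using both as the data set and extracting each paper's title, abstract, and keywords. During feature selection it considers, in addition to term frequency, semantics and each paper's citation count, in order to improve the efficiency of feature selection for topic detection and tracking, and it verifies the positive effect that semantics and citation counts have on the topic detection results. The results of topic detection and tracking are also used to build a system interface that presents popular research topics and trend analyses to researchers more directly, with the aim of giving researchers a faster reference when they look for a research direction.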

    With the digitization of knowledge, documents of all kinds are gradually being converted to electronic form so that they can be transferred easily. However, because the amount of data is increasing rapidly, researchers cannot extract the important information even though they can collect research data easily. As a remedy, most electronic databases provide a search engine that filters by keyword, but the results still do not meet researchers' needs.
    To find research materials more efficiently, researchers use topic detection and tracking technology to summarize the topics of research papers and the trends of those topics. Nevertheless, past methods used only one data set and usually did not focus on analyzing research trends. Moreover, semantic information, such as word meaning, spelling, and each author's way of writing, makes topic detection and tracking harder. Past work that implemented topic detection and tracking also did not put the results to further use, for example by building a user interface to present the trends.
    Therefore, this paper takes advantage of the relationship between conference papers and journal papers to perform topic tracking. In addition, semantics and citation count are taken into consideration during feature selection to increase the efficiency of clustering, and the results are used to build a topic tracking system. This study can reduce the time researchers spend selecting a research field by presenting hot topics and a trend analysis of each topic through a friendly user interface.
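
    The abstract states that feature selection combines term frequency with semantics and citation count, but this record does not give the formula. The following is only a minimal sketch of how such a combination could look, not the thesis's method: the synonym table `SYNONYMS`, the function names, the sample papers, and the weighting `tf * idf * (1 + ln(1 + citations))` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical synonym table standing in for a lexical resource; entries are made up.
SYNONYMS = {"grouping": "cluster", "clustering": "cluster"}


def normalize(tokens):
    """Map each token to a canonical form so that synonyms are counted together."""
    return [SYNONYMS.get(token, token) for token in tokens]


def citation_weighted_scores(papers):
    """Score terms over `papers`, a list of (tokens, citation_count) pairs.

    Assumed formula (not from the thesis): for each paper,
    tf * idf * (1 + ln(1 + citations)), summed over all papers.
    """
    docs = [normalize(tokens) for tokens, _ in papers]
    # Document frequency: number of papers in which each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc, (_, citations) in zip(docs, papers):
        # Damped citation weight; an uncited paper keeps weight 1.0.
        weight = 1.0 + math.log1p(citations)
        for term, tf in Counter(doc).items():
            idf = math.log(len(docs) / doc_freq[term])
            scores[term] += weight * tf * idf
    return scores


# Made-up sample data: token lists with illustrative citation counts.
papers = [
    (["topic", "detection", "grouping"], 10),
    (["topic", "tracking", "clustering"], 0),
    (["topic", "trend"], 3),
]
scores = citation_weighted_scores(papers)
for term, score in scores.most_common(3):
    print(f"{term}: {score:.3f}")
```

    In this sketch a term that occurs in every paper scores zero through the IDF factor, synonyms are merged before counting, and terms from highly cited papers rank above otherwise identical terms from uncited ones. The logarithm is one possible way to keep a single heavily cited paper from dominating the ranking.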

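    Chapter 1 Introduction
      1.1 Research Background
      1.2 Research Motivation and Objectives
      1.3 Research Scope and Limitations
      1.4 Research Process
      1.5 Thesis Outline
    Chapter 2 Literature Review
      2.1 Topic Detection and Tracking
        2.1.1 Definition of a Topic
        2.1.2 Research Issues
        2.1.3 Evolution of Topic Detection and Tracking Methods
      2.2 Information Retrieval
        2.2.1 Vector Space Model
        2.2.2 Document Similarity Computation
      2.3 Natural Language Processing
        2.3.1 Part-of-Speech Tagging
        2.3.2 Stemming
      2.4 Feature Selection
        2.4.1 Document Frequency (DF)
        2.4.2 Mutual Information (MI)
        2.4.3 Chi-square Statistic Measure (CHI)
        2.4.4 Extended Applications of TF-IDF
      2.5 Document Clustering
        2.5.1 Partitional Clustering
        2.5.2 Cluster Validity Evaluation
      2.6 Academic Papers
        2.6.1 Conference Papers
        2.6.2 Journal Papers
        2.6.3 Analysis of Relationships Between Papers
      2.7 Summary
    Chapter 3 Research Method
      3.1 Research Framework
      3.2 Data Collection and Processing Module
        3.2.1 Data Collection
        3.2.2 Sentence Segmentation
        3.2.3 Part-of-Speech Tagging
        3.2.4 Stemming
      3.3 Topic Detection Module
        3.3.1 Feature Selection
        3.3.2 Clustering
        3.3.3 Topic Detection
      3.4 Trend Analysis Module
    Chapter 4 System Implementation and Validation
      4.1 System Implementation
        4.1.1 Data Collection and Preprocessing
        4.1.2 Feature Selection and Clustering
        4.1.3 Topic Detection and Trend Analysis
      4.2 Experimental Method
        4.2.1 Data Sources
        4.2.2 Evaluation Metrics
        4.2.3 Experimental Design
      4.3 Experimental Results and Analysis
        4.3.1 Experiment 1: Choosing the clustering threshold λ
        4.3.2 Experiment 2: Effect of the semantics-aware VSM transformation on the results
        4.3.3 Experiment 3: Merging versus not merging synonyms during feature selection
        4.3.4 Experiment 4: Effect of citation counts on feature selection
        4.3.5 Experiment 5: Discussion of the trend analysis results
        4.3.6 Summary of Experimental Results
      4.4 System Demonstration
    Chapter 5 Conclusions and Future Research Directions
      5.1 Research Results
      5.2 Future Research Directions
    References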


    Full text availability: on campus, public from 2022-12-31; off campus, public from 2022-12-31