Author: Lin, Yi-Ying (林宜瑩)
Thesis title: A Topic Tracking Method Based on Temporal Factor and Noun Phrases (利用時間因子與名詞片語之文獻主題追蹤法)
Advisor: Wang, Hei-Chia (王惠嘉)
Degree: Master
Department: College of Management - Institute of Information Management
Year of publication: 2010
Academic year of graduation: 98
Language: Chinese
Number of pages: 55
Chinese keywords: 主題追蹤, 老化理論, 時間因子, 名詞片語
English keywords: Topic Tracking, Aging Theory, Temporal Factor, Noun Phrase
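      With the rise of information technology and the rapid development of the Internet, electronic journals are gradually replacing traditional print publications so that they can circulate easily online, allowing users to publish and download valuable information over the network promptly and easily. However, the rapid growth in documents has created a serious information overload problem: researchers must slowly and laboriously extract the information they need from an enormous body of publications. In response, many electronic databases now offer search engines, but the search results do not account for how the data change over time and are mixed with a great deal of outdated material. The results are also poorly linked to one another, so users can only filter them sequentially to find the literature they want or to pick out topics of interest, and they still have to invest considerable time and effort in searching.
      Furthermore, current databases give no detailed description of each journal's research areas. A researcher who wants to explore a specific topic may not know where to start and must read the journal's articles from recent years before deciding whether the journal is suitable. Even so, understanding a journal's publication trends through extensive manual reading remains difficult, and the topics each journal publishes each year change with the passage of time, the development of new technologies, and the preferences of those who select the papers. Traditional topic detection methods do not consider that a journal's publication trends are not entirely fixed, and most use only single words as the criterion for feature selection, ignoring the information carried by noun phrases, so they cannot give users more precise results. This study adds noun phrases to feature selection and computes with three different modes of feature baseline. It then applies the concept of aging theory to account for changes in the temporal factor and examines the trends and the rise and fall of research topics in a specific field, so that researchers can quickly grasp the field's hot topics and trends in recent years and get started in it quickly. The results also confirm that the method with the temporal factor performs better than traditional methods.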

      With the rapid development of the Internet, electronic journals are gradually replacing traditional publications, letting users distribute and download information instantly online. However, the rapid explosion of documents has caused a serious information overload problem. Researchers can only slowly extract the information they need from large volumes of data, so many electronic databases provide search engines to help users. Yet search results do not account for how information changes over time and are usually mixed with outdated material, so the results are poorly connected to one another. Users who want to find literature or topics of interest still need to spend a lot of time on queries.
      Furthermore, electronic databases do not describe the research field of each journal in detail. Researchers who want to study a specific topic must read the journal's articles from recent years to determine whether the journal is suitable. Even then, understanding a journal's trends through manual reading is difficult. The topics a journal covers change from year to year with the passage of time, the development of new technologies, and the preferences of reviewers. Traditional topic detection methods do not consider that a journal's trends are not completely fixed. Moreover, they usually use unigrams as the feature selection criterion and ignore the information contained in noun phrases. We add noun phrases to feature selection and compute with three different feature baseline models. Finally, we use the concept of aging theory to provide researchers with hot topics and trend analysis for the field over recent years. We demonstrate that this method provides a valuable means of tracking journal topics with the temporal factor taken into account, and that its results are better than those of traditional methods.
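
      As a rough illustration of the aging idea mentioned in the abstract, the sketch below tracks an "energy" score for each noun phrase across consecutive time windows. It is not the thesis's actual formula: the function name `track_phrase_energy`, the `gain` and `decay` parameters, and the use of per-window document frequency as the input signal are all assumptions made for this example, and the noun phrases are assumed to have been extracted beforehand.

```python
from collections import Counter


def track_phrase_energy(windows, gain=1.0, decay=0.5):
    """Return {phrase: [energy after each time window]}.

    windows: list of time windows, oldest first. Each window is a list of
    documents, and each document is a list of noun-phrase strings.
    gain and decay are hypothetical parameters, not values from the thesis.
    """
    phrases = sorted({p for docs in windows for doc in docs for p in doc})
    energy = {p: 0.0 for p in phrases}
    history = {p: [] for p in phrases}

    for docs in windows:
        # Document frequency: number of documents in this window
        # that mention the phrase at least once.
        df = Counter(p for doc in docs for p in set(doc))
        n_docs = max(len(docs), 1)
        for p in phrases:
            # Old energy fades by `decay`; new mentions add energy.
            energy[p] = decay * energy[p] + gain * df[p] / n_docs
            history[p].append(energy[p])
    return history


# Toy corpus: three yearly windows of two documents each.
windows = [
    [["information extraction", "machine learning"], ["information extraction"]],
    [["machine learning"], ["machine learning", "topic tracking"]],
    [["machine learning"], ["topic tracking"]],
]
history = track_phrase_energy(windows)
for phrase, scores in history.items():
    print(phrase, scores)
```

      In this toy run a phrase that stops appearing loses energy window by window, while a phrase that keeps appearing stays high, which is the kind of rise and fall of topics that a time-aware method is meant to expose.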

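    Table of Contents
    Chapter 1 Introduction
      1.1 Research Background
      1.2 Research Motivation and Objectives
      1.3 Research Scope and Limitations
      1.4 Research Process
      1.5 Thesis Outline
    Chapter 2 Literature Review
      2.1 Topic Detection and Tracking
        2.1.1 Definitions of Topic and Event
        2.1.2 Research Tasks
      2.2 Information Retrieval
      2.3 Natural Language Processing
        2.3.1 Part-of-Speech Tagging
        2.3.2 Parsing
        2.3.3 Stemming
        2.3.4 Machine-Readable Dictionaries
      2.4 Feature Selection
        2.4.1 Document Frequency
        2.4.2 Mutual Information
        2.4.3 Chi-Square Statistic
        2.4.4 Summary
      2.5 Aging Theory
      2.6 Clustering
        2.6.1 Hierarchical Clustering
        2.6.2 Partitional Clustering
        2.6.3 Density-Based Clustering
        2.6.4 Grid-Based Clustering
        2.6.5 Summary
    Chapter 3 Research Method
      3.1 Research Framework
      3.2 Data Collection Module
      3.3 Topic Tracking Module
      3.4 Worked Example
    Chapter 4 System Implementation and Validation
      4.1 System Implementation
        4.1.1 Preprocessing
        4.1.2 Feature Extraction
        4.1.3 Clustering
      4.2 Experimental Method
        4.2.1 Data Sources
        4.2.2 Evaluation Metrics
      4.3 Experimental Results and Analysis
    Chapter 5 Conclusions and Future Research Directions
      5.1 Research Results
      5.2 Future Research Directions
    References

    List of Tables
      Table 2-1 Common similarity formulas in the vector space model
      Table 2-2 Selected part-of-speech tags
      Table 2-3 Comparison of clustering methods
      Table 4-1 Selected journals
      Table 4-2 Total number of articles per journal per year
      Table 4-3 ISI journal categories

    List of Figures
      Figure 1-1 Three-dimensional illustration of topic tracking with the temporal factor
      Figure 2-1 Illustration of topic detection with the temporal factor
      Figure 2-2 Tree representation of hierarchical clustering
      Figure 3-1 Flowchart of the research framework
      Figure 3-2 Flowchart of document preprocessing
      Figure 3-3 Data format of the corpus
      Figure 3-4 Flowchart of the topic tracking module
      Figure 3-5 Overview of the current time window
      Figure 3-6 Illustration of feature ranking
      Figure 3-7 Documents contained in cluster 1 (C1)
      Figure 3-8 Sliding of the time window
      Figure 4-1 System deployment diagram
      Figure 4-2 Overall research topic trends in the information management field, 1999~2009
      Figure 4-3 Overall research topic trends in the information management field, 1999~2009
      Figure 4-4 Accuracy of this study compared with Google Scholar Citation
      Figure 4-5 Linear trend analysis for information extraction (this study's method)
      Figure 4-6 Citation count trend for information extraction (citation validation)
      Figure 4-7 Linear trend analysis for human resource management (this study's method)
      Figure 4-8 Citation count trend for human resource management (citation validation)
      Figure 4-9 Linear trend analysis for machine learning (this study's method)
      Figure 4-10 Citation count trend for machine learning (citation validation)
      Figure 4-11 MRR comparison of single words and noun phrases
      Figure 4-12 MRR comparison with and without the temporal factor
      Figure 4-13 MRR comparison using feature clusters from three different time windows as the baseline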


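    Full text, on campus: open from 2020-12-31
    Full text, off campus: not open
    The electronic thesis has not been authorized for public release; for the print copy, please check the library catalog.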