Graduate student: | LIN, Shih-Chiang (林世強) |
---|---|
Thesis title: | An Ontology-based Documents Summarization System (植基於本體論之文件摘要系統) |
Advisor: | Li, Sheng-Tun (李昇暾) |
Degree: | Master |
Department: | College of Management - Institute of Information Management |
Year of publication: | 2006 |
Academic year of graduation: | 94 (ROC calendar) |
Language: | Chinese |
Number of pages: | 47 |
Keywords: | knowledge management, advanced tagging on final summarization, WordNet, summarization system, ontology |
With the arrival of the Internet era, the amount of information has grown exponentially, and information overload has become a pressing problem. Retrieving the information that meets a user's needs quickly and accurately from a huge number of sources is an important research topic, and document summarization techniques play a key role in it. Summarization condenses a large, noisy document collection into short, concise passages that convey its most important information, saving users the time and effort of reading and filtering. To produce a coherent, non-redundant summary from a large corpus, however, the whole corpus must be analyzed, and the information in each document must be extracted, filtered, merged, and ordered.
Most existing document summarization techniques use statistical analysis to extract representative sentences and assemble them into a summary. Others compute similarity to identify the representative concepts of a corpus and then expand and combine them into sentences or paragraphs. Both approaches leave room for improvement in readability and coherence. This research therefore implements an ontology-based summarization system for English multi-document collections, in which the summary knowledge structure is represented with an ontology. First, the corpus is clustered to improve the efficiency and accuracy of summary generation. Next, to measure the importance of each sentence within its document cluster, the feature values of each sentence are computed, multiplied by their weights, and summed. The sentences are ranked by this total score, and the most salient ones are extracted. The extracted sentences are then preprocessed with sentence segmentation and part-of-speech (POS) tagging to support the following step, surface generation. For that step, we propose advanced tagging of the sentences in the final summary: using a domain ontology and the WordNet lexical database, the similarity between each pair of sentences is computed to obtain their degree of association. If the similarity exceeds a predefined threshold, the sentences are annotated and hyperlinks between them are constructed automatically. This makes the associations among sentences in the final summary explicit, so that users can read the multi-document summary and grasp its information more easily.
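The extraction and linking steps described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the feature names (`position`, `centroid`), the weights, the threshold, and the `Sentence` class are hypothetical, and a simple word-overlap (Jaccard) measure stands in for the ontology- and WordNet-based similarity, whose details the abstract does not give.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Sentence:
    text: str
    features: Dict[str, float]  # hypothetical feature names mapped to values
    links: List[int] = field(default_factory=list)  # indices of related sentences


def score(sentence: Sentence, weights: Dict[str, float]) -> float:
    """Sum each feature value multiplied by its weight."""
    return sum(weights.get(name, 0.0) * value
               for name, value in sentence.features.items())


def extract_salient(sentences: List[Sentence], weights: Dict[str, float],
                    top_k: int) -> List[Sentence]:
    """Rank sentences by weighted score (highest first) and keep the top_k."""
    ranked = sorted(sentences, key=lambda s: score(s, weights), reverse=True)
    return ranked[:top_k]


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a stand-in for the ontology/WordNet measure."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def link_related(summary: List[Sentence],
                 similarity: Callable[[str, str], float],
                 threshold: float) -> None:
    """Cross-link every pair of summary sentences whose similarity exceeds the threshold."""
    for i in range(len(summary)):
        for j in range(i + 1, len(summary)):
            if similarity(summary[i].text, summary[j].text) > threshold:
                summary[i].links.append(j)
                summary[j].links.append(i)


# Illustrative data only; values are made up for the example.
WEIGHTS = {"position": 0.3, "centroid": 0.7}
sentences = [
    Sentence("ontology based summarization of documents",
             {"position": 1.0, "centroid": 0.8}),
    Sentence("the weather was pleasant",
             {"position": 0.5, "centroid": 0.1}),
    Sentence("summarization of documents with an ontology",
             {"position": 0.2, "centroid": 0.9}),
]
summary = extract_salient(sentences, WEIGHTS, top_k=2)
link_related(summary, jaccard, threshold=0.4)
for index, sentence in enumerate(summary):
    print(index, sentence.text, "-> related to", sentence.links)
```

Passing the similarity function as a parameter keeps the linking step independent of the measure, so a WordNet- or ontology-based function could replace `jaccard` without changing the rest of the sketch.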
The corpus consists of abstracts of papers on text summarization published at the SIGIR (Special Interest Group on Information Retrieval) conference of the ACM (Association for Computing Machinery). On English multi-document summarization, the system reaches a maximum utility-based performance of nearly 80 percent for the final summary, and it makes the summary more convenient for users to read.
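The abstract does not define how the utility-based score is computed. One plausible reading, assumed here purely for illustration, is the ratio between the total utility of the sentences the system selected and the best total utility achievable with the same number of sentences. The function name `utility_ratio` and the per-sentence utility scores below are hypothetical.

```python
from typing import List


def utility_ratio(utilities: List[float], selected: List[int]) -> float:
    """Utility achieved by the selected sentences, relative to the best
    achievable utility for a summary of the same length (assumed definition)."""
    achieved = sum(utilities[i] for i in selected)
    # Best case: the same number of sentences, chosen by highest utility.
    best = sum(sorted(utilities, reverse=True)[:len(selected)])
    return achieved / best if best else 0.0


# Illustrative utilities only; these are not data from the thesis.
utilities = [10.0, 8.0, 6.0, 4.0, 2.0]
print(round(utility_ratio(utilities, selected=[0, 3]), 3))
```

Under this assumed definition, a score of 1.0 means the system picked the most useful sentences possible for that summary length.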