簡易檢索 / 詳目顯示

研究生: 許恒耀
Hsu, Hen-Yao
論文名稱: 應用標題產生機制於文件摘要之設計與實作
Design and Implementation of the Topic Generation Methods for Document Summarization
指導教授: 楊竹星
Yang, Chu-Sing
學位類別: 碩士
Master
系所名稱: 電機資訊學院 - 電腦與通信工程研究所
Institute of Computer & Communication Engineering
論文出版年: 2008
畢業學年度: 96
語文別: 英文
論文頁數: 43
中文關鍵詞: 文件摘要搜尋引擎標題字彙分佈
外文關鍵詞: document summarization, search engine, topic, word distribution
相關次數: 點閱:92下載:3
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 近年來,越來越多的網際網路使用者利用搜尋引擎以找尋所需要的資訊,然而在搜尋引擎回傳的資訊中常包含大量不相關的資訊,讓使用者需花上許多不必要的時間來找出真正所需要的資訊。為了幫助使用者能夠快速的找到正確的資訊,此論文提出一個高效率的演算法DiSco (Distribution Scoring),此演算法提供將搜尋引擎所回傳的資訊,自動產生摘要以回傳給使用者。DiSco以標題(Topic)作為摘要的單位,考慮字彙在文件中出現位置所代表的重要性。利用一評分機制設定權重,並依字彙的權重高低以擷取出標題集合作為文件摘要之用。
    論文中使用Reuters-21578、Google Trends兩類資料集進行模擬實驗。衡量標準使用資料涵蓋率、資料重疊率以及計算時間。為了進一步提昇DiSco的處理效率,論文並根基於DiSco設計一更加快速的演算法,並討論在提昇計算效能後與文件涵蓋率、文件重疊率之間的權衡關係。
    實驗結果顯示,本論文提出的自動產生摘要機制,在衡量標準上均有優異的表現。實驗結果亦顯示此機制所自動產生的標題集合,能更適合作為文件摘要之用,以更加滿足使用者快速得到資訊的需求。

    In recently years, more and more Internet users rely on the search engines to help them find the information they need. However, the information they often consists of a large amount of irrelevant parts. It would often spend Internet users much time to achieve the correct information users need. To help Internet users find the information they are looking for quickly, an efficient algorithm for automatically building the summaries of a collection of documents found by a search engine in response to a user query, called DiSco (Distribution Scoring), is proposed. Topics are the basic units of the summaries DiSco generated. The main idea of DiSco is to consider the distribution of lexicons in a document, while the distribution is in practice thought to be related to different importance of lexicons. Using a scoring mechanism to score the weight of individual lexicons, DiSco could generate the topic sets to be the summaries of the document based on the weights.
    To demonstrate the performance of the proposed algorithm in this thesis, Reuters-21578 text categorization collection and the search results of the hot queries from Google Trends are used in the simulation. Moreover, several measure methods such as coverage, overlap, and the computation time are employed in evaluating the performance of the proposed algorithm. To further improve the efficiency of the proposed algorithm, an alternative version of DiSco is designed and implemented. The tradeoff between computation time and the quality of the summarization is also discussed.
    All the simulation results indicate that the proposed algorithm, which is based on the distribution of lexicons, outperforms all the existing algorithms in terms of not only the benchmark of data coverage rate, data overlap rate and the computation time. The simulation results also indicate that the topic sets generated by the proposed algorithm are better suited for document summarization to fit the requirement of getting information quickly.

    List of Tables iii List of Figures iv Chapter 1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Chapter 2 Related Works 5 2.1 Automatic Taxonomy Generation Algorithms . . . . . . . . . . . . . . . . . 5 2.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Chapter 3 The Proposed Algorithm 11 3.1 Algorithm Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.2 The Alternative Version of DiSco . . . . . . . . . . . . . . . . . . . . . . . . 14 Chapter 4 Simulation Results 16 4.1 Simulating Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 4.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 4.2.1 Google Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 4.2.2 Reuters-21578 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 4.3 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 4.3.1 Google Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.3.2 Reuters-21578 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.4 Simulation Results of the Alternative Version . . . . . . . . . . . . . . . . . 21 4.4.1 Google Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 4.4.2 Reuters-21578 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.4.3 Summary of the Simulation Results . . . . . . . . . . . . . . . . . . 24 Chapter 5 Conclusion and Future Work 30 5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Appendix A Stop-word List 36 Appendix B The Categories of Google Trends, Feb 29, 2008 40 Appendix C The Categories of Reuters-21578 42

    [1] R. Baeze-Yates and B. Ribeiro-Neto, Modern Information Retrieval. ACM Press, 1999.
    [2] S. Brin and L. Page, “The anatomy of a large-scale hypertextual Web search engine,”
    Computer Networks and ISDN Systems, vol. 30, no. 1–7, pp. 107–117, 1998.
    [3] J. M. Kleinberg, “Authoritative sources in a hyperlinked environment,” Journal of the
    ACM, vol. 46, pp. 668–677, 1999.
    [4] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins,
    and J. Wiener, “Graph structure in the web,” in Proceedings of the 9th international
    World Wide Web conference on Computer networks : the international journal of computer
    and telecommunications netowrking, 2000, pp. 309–320.
    [5] E. Agichtein, E. Brill, and S. Dumais, “Improving web search ranking by incorporating
    user behavior information,” in SIGIR ’06: Proceedings of the 29th annual international
    ACM SIGIR conference on Research and development in information retrieval, 2006,
    pp. 19–26.
    [6] H. Berghel, “Cyberspace 2000: dealing with information overload,” Commun. ACM,
    vol. 40, no. 2, pp. 19–24, 1997.
    [7] “Vivisimo,” http://search.vivisimo.com/.
    [8] “Snaket,” http://snaket.di.unipi.it/.
    [9] “Iboogie,” http://www.iboogie.com/.
    [10] “Grokker,” http://www.grokker.com/.
    [11] “Kartoo,” http://www.kartoo.com/flash04.php3.
    [12] O. Zamir and O. Etzioni, “Grouper: a dynamic clustering interface to web search results,”
    Computer Networks, vol. 31, no. 11–16, pp. 1361–1374, 1999.
    [13] W. Lam and Y. Han, “Automatic textual document categorization based on generalized
    instance sets and a metamodel,” IEEE Transactions on Pattern Analysis and Machine
    Intelligence, vol. 25, no. 5, pp. 628–633, 2003.
    [14] K. M. Hammouda and M. S. Kamel, “Efficient phrase-based document indexing for web
    document clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 16,
    no. 10, pp. 1279–1296, 2004.
    [15] H. Al-Mubaid and S. A. Umair, “A new text categorization technique using distributional
    clustering and learning logic,” IEEE Transactions on Knowledge and Data Engineering,
    vol. 18, no. 9, pp. 1156–1165, 2006.
    [16] S. Dumais, J. Platt, D. Heckerman, and M. Sahami, “Inductive learning algorithms and
    representations for text categorization,” in Proceedings of the 7th international conference
    on Information and knowledge management, 1998, pp. 148–155.
    [17] H. Chen and S. Dumais, “Bringing order to the web: automatically categorizing search
    results,” in Proceedings of the SIGCHI conference on Human factors in computing systems,
    2000, pp. 145–152.
    [18] S. T. Dumais and H. Chen, “Hierarchical classification of Web content,” in Proceedings
    of SIGIR-00, 23rd ACM International Conference on Research and Development in
    Information Retrieval, 2000, pp. 256–263.
    [19] I. Mani and E. Bloedorn, “Multi-document summarization by graph search and matching,”
    in Proceedings of the Fifteenth National Conference on Artificial Intelligence
    (AAAI-97), 1997, pp. 622–628.
    [20] K. R. McKeown, J. L. Klavans, V. Hatzivzssiloglou, R. Barzilay, and E. Eskin, “Towards
    multidocument summarization by reformulation: Progress and prospects,” in Proceedings
    of the Sixteenth National Conference on Artificial Intelligence, 1999, pp. 453–460.
    [21] H.-J. Zeng, Q.-C. He, Z. Chen, W.-Y. Ma, and J. Ma, “Learning to cluster web search
    results,” in Proceedings of the 27th Annual International ACM SIGIR Conference on
    Research and Development in Information Retrieval, 2004, pp. 210–217.
    [22] J. Dean and M. R. Henzinger, “Finding related pages in theWorldWideWeb,” Computer
    Networks (Amsterdam, Netherlands: 1999), vol. 31, no. 11–16, pp. 1467–1479, 1999.
    [23] E. Amitay and C. Paris, “Automatically summarising web sites - is there a way around
    it?” in Proceedings of the 9th international conference on Information and knowledge
    management, 2000, pp. 173–179.
    [24] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing
    order to the web,” Stanford Digital Library Technologies Project, Tech. Rep., 1998.
    [25] G. Carenini, R. T. Ng, and X. Zhou, “Summarizing email conversations with clue
    words,” Proceedings of the 16th international conference on World Wide Web (WWW
    2007), pp. 91–100, 2007.
    [26] P. G. Anick and S. Tipirneni, “The paraphrase search assistant: terminological feedback
    for iterative information seeking,” in Proceedings of the 22nd annual international ACM
    SIGIR conference on Research and development in information retrieval, 1999, pp. 153–
    159.
    [27] K. Kummamuru, R. Lotlikar, S. Roy, K. Singal, and R. Krishnapuram, “A hierarchical
    monothetic document clustering algorithm for summarization and browsing search results,”
    in Proceedings of the 13th conference on World Wide Web (WWW 2004), 2004,
    pp. 658–665.
    [28] M. Sanderson andW. B. Croft, “Deriving concept hierarchies from text,” in Proceedings
    of the 22nd annual international ACM SIGIR conference on Research and Development
    in Information Retrieval, 1999, pp. 206–213.
    [29] R. H. Thompson and W. B. Croft, “Support for browsing in an intelligent text retrieval
    system,” Int. J. Man-Mach. Stud., vol. 30, no. 6, pp. 639–668, 1989.
    [30] P. Pirolli, P. Schank, M. Hearst, and C. Diehl, “Scatter/gather browsing communicates
    the topic structure of a very large text collection,” in CHI ’96: Proceedings of the
    SIGCHI conference on Human factors in computing systems, 1996, pp. 213–220.
    [31] D. Lawrie, W. B. Croft, and A. Rosenberg, “Finding topic words for hierarchical summarization,”
    in Proceedings of the 24th annual international ACM SIGIR conference,
    2001, pp. 349–357.
    [32] D. J. Lawrie and W. B. Croft, “Generating hierarchical summaries for web searches,” in
    Proceedings of the 26nd annual international ACM SIGIR conference on Research and
    development in information retrieval, 2003, pp. 457–458.
    [33] K. Kummamuru and R. Krishnapuram, “A clustering algorithm for asymmetrically related
    data with applications to text mining,” in Proceedings of the tenth international
    conference on Information and knowledge management, 2001, pp. 571–573.
    [34] P. Ferragina and A. Gulli, “A personalized search engine based on web-snippet hierarchical
    clustering,” In Special interest tracks and posters of the 14th international conference
    on World Wide Web (WWW 2005), pp. 801–810, 2005.
    [35] ——, “A personalized search engine based on web-snippet hierarchical clustering,” Software:
    Practice and Experience, vol. 38, no. 2, pp. 189–225, 2008.
    [36] “Stop word list,” http://www-fog.bio.unipd.it/waishelp/stoplist.html.
    [37] M. Porter, “An algorithm for suffix stripping,” Program, vol. 14, no. 3, pp. 130–137,
    1980.
    [38] Google, “Hot trends, Feb 29, 2008,” http://www.google.com/trends/hottrends?sa=Xn
    &date=2008-2-29.
    [39] Reuters-21578, “text categorization test collection,” http://www.daviddlewis.com/
    resources/testcollections/reuters21578/.
    [40] U. M. L. Archive, http://kdd.ics.uci.edu/.

    下載圖示 校內:2009-09-03公開
    校外:2009-09-03公開
    QR CODE