
Graduate Student: Shen, Yu-chien (沈裕傑)
Thesis Title: Sentence-Based Latent Dirichlet Allocation for Text Summarization (以語句為主之LDA模型於文件摘要之應用)
Advisor: Chien, Jen-Tzung (簡仁宗)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2008
Graduation Academic Year: 96
Language: Chinese
Number of Pages: 72
Keyword (Chinese): 摘要 (summarization)
Keyword (English): summarization
    With the rapid growth of the Internet, the amount of information online has been multiplying, and users must spend considerable time filtering and searching for the information they want. Automatic summarization techniques were therefore developed to extract the key themes and concepts from large collections of documents or web pages for readers. Automatic summarization is currently developing in two directions: one uses a query to retrieve summary content the reader is interested in, while the other extracts the main concepts and themes directly from the original documents. The former belongs to the field of information retrieval and is called query-based summarization; the latter is traditional text summarization.
    In this thesis, we propose an automatic summarization technique built on latent Dirichlet allocation (LDA). It is a model-based approach, in contrast to the conventional vector space representation. A model-based summarizer avoids the problem that unseen words cannot be assigned weights, and lets synonyms and co-occurring words share similar probabilities, which raises the chance that the truly important sentences are extracted into the summary. Moreover, a major advantage inherent to LDA is that when new documents arrive, their probability models can be inferred directly from the trained model parameters without retraining.
    Once the model parameters are known, the proposed method can be applied not only to text summarization but also extended to query-based summarization in information retrieval. The experimental results also show that the proposed framework indeed achieves better summarization performance.

    As the Internet grows rapidly, the amount of information online is too large and too miscellaneous to browse, and users must spend considerable time digging out the information they need. Automatic text summarization was accordingly developed to help extract the concepts or themes from large document collections or web pages. There are two types of automatic summarization: one extracts the summary content relevant to a user's query, while the other finds the important points directly from the original articles. The former is applied in the domain of information retrieval, and the latter is traditional text summarization.
    In this thesis, we propose a new automatic summarization technique based on the state-of-the-art latent Dirichlet allocation (LDA) model and apply it to text summarization. Unlike traditional vector space summarization methods, we adopt a sentence-based LDA (SLDA) model to derive the summary of a document. SLDA tackles the problem of unseen words by sharing information among synonyms and co-occurring words, which helps extract the truly critical sentences from the given articles. Furthermore, SLDA generalizes easily to new documents: sentence selection can be computed without retraining the model. With the trained model parameters, the proposed method is applicable not only to text summarization but also extendible to query-based summarization. The experimental results show better performance than competing methods.
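The sentence-selection criterion described above can be sketched in a few lines. This is a minimal illustrative sketch, not the thesis implementation: the topic-word distributions and the document topic mixture are hand-set toy values standing in for parameters that SLDA would learn by variational inference, and each sentence is scored by its average log-likelihood under the document's topic mixture.

```python
import math

# Toy parameters for illustration only: two topics over a small vocabulary.
# In the thesis these distributions are learned by SLDA via variational
# inference; here they are hand-set hypothetical values.
topic_word = [
    {"internet": 0.40, "network": 0.40, "data": 0.10, "cat": 0.05, "dog": 0.05},
    {"cat": 0.45, "dog": 0.45, "data": 0.04, "internet": 0.03, "network": 0.03},
]

def score_sentence(words, doc_topic_mix):
    """Average log-likelihood of a sentence under the document's topic mixture.

    p(w | doc) = sum_k p(k | doc) * p(w | k), so synonyms and co-occurring
    words share probability mass through the topics, and a word that never
    appeared in this document can still receive a sensible weight.
    """
    log_p = 0.0
    for w in words:
        p = sum(mix * topic.get(w, 1e-9)  # tiny floor for out-of-vocabulary words
                for mix, topic in zip(doc_topic_mix, topic_word))
        log_p += math.log(p)
    return log_p / len(words)  # length-normalize so long sentences are not penalized

def summarize(sentences, doc_topic_mix, n=1):
    """Extractive summary: the n sentences that best fit the topic mixture."""
    ranked = sorted(sentences,
                    key=lambda s: score_sentence(s, doc_topic_mix),
                    reverse=True)
    return ranked[:n]

sentences = [["internet", "network", "data"], ["cat", "dog"]]
doc_mix = [0.9, 0.1]                  # this document is mostly about topic 0
print(summarize(sentences, doc_mix))  # the internet/network sentence is chosen
```

Because scoring only needs the topic-word distributions and a topic mixture for the new document, no retraining is involved at selection time, which mirrors the generalization property claimed above.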

    Chapter 1  Introduction  1
      1.1  Research Background and Motivation  1
      1.2  Research Objectives and Methods  3
      1.3  Chapter Overview  4
    Chapter 2  Literature Review  6
      2.1  Development of Automatic Text Summarization  6
      2.2  Vector Space Model  11
      2.3  PLSA Model  14
      2.4  LDA Model  16
      2.5  DUC-Related Research  20
    Chapter 3  Sentence-Based LDA Model  23
      3.1  SLDA Model  24
      3.2  Bayesian Variational Inference  28
      3.3  SLDA Parameter Estimation  32
      3.4  Gibbs Sampling  37
      3.5  SLDA Sentence Summarization  38
    Chapter 4  Experimental Results  40
      4.1  Experiment Description  40
      4.2  Evaluation Methods  44
      4.3  Evaluation Tools and Items  46
      4.4  Experimental Setup  47
      4.5  Experimental Results  48
      4.6  System Demonstration  52
    Chapter 5  Conclusions and Future Work  55
      5.1  Conclusions  55
      5.2  Future Research Directions  55
    Chapter 6  References  57


    On-campus access: available from 2009-08-27
    Off-campus access: available from 2018-08-27