
Author: Hung, Jui-Jung (洪瑞嶸)
Title: Bayesian Nonparametric Modeling of Hierarchical Topics and Sentences (階層式主題與句子之貝氏非參數模型)
Advisor: Chien, Jen-Tzung (簡仁宗)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2011
Academic year of graduation: 99 (ROC calendar, 2010-2011)
Language: Chinese
Number of pages: 91
Keywords: Machine Learning, Natural Language Processing, Information Retrieval, Topic Model, Unsupervised Learning, Bayesian Nonparametric Learning, Document Summarization
Abstract (Chinese):
In recent years, the enormous volume of multimedia documents shared on the Internet has grown beyond what human effort can easily handle, so the development of automatic document summarization and automatic document clustering techniques has become very important. These machine learning techniques can automatically extract the key content of each document and effectively assign documents to their groups, saving users a great deal of time in reading and searching.
This thesis proposes a novel topic model built on hierarchical latent Dirichlet allocation. It is a Bayesian nonparametric mixture model with an unbounded number of topics, designed specifically to address the model selection problem, and it is trained by unsupervised learning. The method inherits the advantages of latent Dirichlet allocation: when new documents arrive, the model need not be retrained from scratch; the existing parameters can be used directly to estimate the probability model after the new documents are incorporated. The model also carries a hierarchical, tree-structured view of the data: sentences and word groups are clearly assigned to different levels, which distinguishes the character of each sentence and extracts hierarchical clustering of sentences. We build the model under Bayesian nonparametric principles, using the nested Chinese restaurant process and the hierarchical Dirichlet process to realize a hierarchical model of sentence topics and word topics, resolving the problems of model selection and over-estimation. The model is highly adaptive and can automatically grow a topic-and-sentence model with unbounded nodes and unbounded levels from an unbounded stream of documents.
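As an illustration of how the nested Chinese restaurant process grows such a tree, the sketch below draws a root-to-leaf path for one sentence: at every level an existing child node is chosen with probability proportional to how many earlier sentences passed through it, and a new child is opened with probability proportional to a concentration parameter gamma. This is a minimal Python sketch of the standard nCRP draw, not the thesis implementation; the dict-based tree, the fixed depth, and the function names are assumptions made here for illustration.

import random

def crp_choose(counts, gamma):
    # Chinese restaurant process step: pick existing child k with probability
    # counts[k] / (n + gamma), or a brand-new child with gamma / (n + gamma).
    n = sum(counts)
    r = random.uniform(0.0, n + gamma)
    for k, c in enumerate(counts):
        if r < c:
            return k
        r -= c
    return len(counts)  # open a new node

def sample_ncrp_path(root, depth, gamma=1.0):
    # Walk from the root, one CRP choice per level, to draw a tree path.
    path, node = [], root
    for _ in range(depth):
        k = crp_choose([child["n"] for child in node["children"]], gamma)
        if k == len(node["children"]):
            node["children"].append({"n": 0, "children": []})
        node = node["children"][k]
        node["n"] += 1
        path.append(k)
    return path

# Example: draw paths for five sentences in a depth-3 tree.
root = {"n": 0, "children": []}
for _ in range(5):
    print(sample_ncrp_path(root, depth=3))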
This thesis uses Gibbs sampling to solve the model inference problem. We sample the model parameters and the tree path of each sentence from the posterior distribution, which yields the sentence clustering. The experiments evaluate document summarization and document clustering on two corpora, DUC (Document Understanding Conferences) 2007 and Reuters-21578, with a sentence standing for each word group. The results show that words and sentences cluster effectively at the tree nodes, and that thematic sentences can be selected from the tree nodes with the Kullback-Leibler divergence to produce the summaries and clustering results. In ROUGE and clustering-performance experiments, the proposed hierarchical topic and sentence model outperforms the other methods in the literature.
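To make the Kullback-Leibler selection step concrete, the sketch below ranks the sentences of a node by the divergence between each sentence's unigram word distribution and the node's overall word distribution; the lowest-divergence sentences are taken as thematic. This is a generic sketch under the assumption of smoothed unigram distributions, not the thesis code, and the helper names are hypothetical.

import math
from collections import Counter

def unigram_dist(words):
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_divergence(p, q, vocab, eps=1e-9):
    # KL(p || q) with additive smoothing so unseen words do not give log(0).
    return sum(p.get(w, eps) * math.log(p.get(w, eps) / q.get(w, eps))
               for w in vocab)

def thematic_sentences(sentences, k=2):
    # Rank sentences by KL between each sentence's word distribution and the
    # word distribution of the whole node; lower divergence = more thematic.
    node_words = [w for s in sentences for w in s.lower().split()]
    vocab = set(node_words)
    q = unigram_dist(node_words)
    return sorted(sentences,
                  key=lambda s: kl_divergence(unigram_dist(s.lower().split()),
                                              q, vocab))[:k]

print(thematic_sentences([
    "the model clusters sentences into tree nodes",
    "gibbs sampling infers the model parameters",
    "the model clusters words and sentences",
]))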

Abstract (English):
In recent years, due to the flood of information and the huge variety of media data on the Internet, it is difficult for information seekers to extract useful information within a short period. To save manpower and to cope with filtering enormous numbers of documents, the technologies of document summarization and text clustering have become much more important than before. With such tools, information seekers save a great deal of time searching for what they want to know and reading the documents that fit their interests.
Accordingly, in this thesis we propose a novel document model for document summarization. The model is built on topic modeling and extends hierarchical latent Dirichlet allocation. It is a Bayesian nonparametric method designed for data representation, in which the model selection problem is tackled simultaneously by allowing an unbounded number of mixture components. The model structure is learned in an unsupervised fashion and inherits the advantages of latent Dirichlet allocation (LDA). When new documents are enrolled, the complexity of the probability model grows gradually and automatically according to Bayesian nonparametric theory, so the issues of model selection and over-estimation are resolved. In addition, we build a hierarchical tree structure to represent the clusters of sentences or groups of words at different layers: the root node contains the most general topic while the leaf nodes contain specific topics, so the nodes represent sentence clusters at different levels. For the words in the sentences of a tree node, we further build a topic model over those words. The nested Chinese restaurant process (nCRP) and the hierarchical Dirichlet process (HDP) are developed to build hybrid topic models for sentences and words, respectively, yielding the hierarchical topic and sentence model (HTSM). This model is highly adaptive, so that infinite tree nodes with infinite tree layers can be generated from infinitely many observed words, sentences and documents.
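The Dirichlet process machinery behind both the nCRP and the HDP can be pictured through the stick-breaking construction, in which an unbounded sequence of mixture weights is produced by repeatedly breaking off Beta-distributed fractions of a unit-length stick. The following is a minimal sketch, with a finite truncation level chosen here only for illustration:

import random

def stick_breaking_weights(alpha, truncation=20):
    # Stick-breaking construction: beta_k ~ Beta(1, alpha) and
    # pi_k = beta_k * prod_{j<k} (1 - beta_j); truncated for practicality.
    weights, remaining = [], 1.0
    for _ in range(truncation):
        b = random.betavariate(1.0, alpha)
        weights.append(remaining * b)
        remaining *= 1.0 - b
    return weights

# Smaller alpha concentrates mass on a few mixture components (topics);
# larger alpha spreads it over many. The weights sum to nearly 1.
print(sum(stick_breaking_weights(alpha=1.0)))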
For model inference, we present collapsed Gibbs sampling to train the HTSM from training documents. We sample the model parameters and the tree paths of the different sentences according to their posterior probabilities, and the clustering of sentences and their corresponding words is obtained during model training. In this study, we conduct experiments on the DUC (Document Understanding Conferences) 2007 and Reuters-21578 corpora and investigate the clustering of words, sentences and abstracts by evaluating the tree structure of the HTSM. We further apply the model to document summarization, where representative or thematic sentences are extracted from the sentences allocated to tree nodes using the Kullback-Leibler divergence measure. In evaluations of ROUGE and clustering performance, the proposed HTSM is superior to LDA and sentence-LDA models.
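The HTSM sampler alternates over sentence paths, sentence levels and word topics; as a simpler picture of the collapsed-count bookkeeping such samplers share, the sketch below performs an LDA-style collapsed Gibbs sweep over word-topic assignments. It is a generic illustration, not the HTSM sampler itself, and all variable names are assumptions made here:

import random

def gibbs_sweep(assignments, n_dk, n_kw, n_k, alpha, beta, V, K):
    # One collapsed Gibbs sweep: for each token, remove its current topic
    # from the counts, resample z from the conditional posterior, put it back.
    for i, (d, w, z) in enumerate(assignments):
        n_dk[d][z] -= 1; n_kw[z][w] -= 1; n_k[z] -= 1
        probs = [(n_dk[d][k] + alpha) * (n_kw[k][w] + beta) / (n_k[k] + V * beta)
                 for k in range(K)]
        r, z_new = random.uniform(0.0, sum(probs)), K - 1
        for k, p in enumerate(probs):
            if r < p:
                z_new = k
                break
            r -= p
        assignments[i] = (d, w, z_new)
        n_dk[d][z_new] += 1; n_kw[z_new][w] += 1; n_k[z_new] += 1

# Toy setup: 2 documents over a vocabulary of 3 word ids, K = 2 topics.
D, V, K = 2, 3, 2
docs = [[0, 1, 1, 2], [2, 2, 0]]
assignments = [(d, w, random.randrange(K)) for d, ws in enumerate(docs) for w in ws]
n_dk = [[0] * K for _ in range(D)]
n_kw = [[0] * V for _ in range(K)]
n_k = [0] * K
for d, w, z in assignments:
    n_dk[d][z] += 1; n_kw[z][w] += 1; n_k[z] += 1
for _ in range(100):
    gibbs_sweep(assignments, n_dk, n_kw, n_k, alpha=0.1, beta=0.1, V=V, K=K)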

Table of contents:
Abstract (Chinese)
Abstract (English)
Acknowledgements
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Research background and motivation
  1.2 Research objectives and methods
  1.3 Outline of chapters
Chapter 2 Literature review
  2.1 Dirichlet process
  2.2 Chinese restaurant process
  2.3 Stick-breaking process
  2.4 Gibbs sampling
  2.5 Vector space model
    2.5.1 Construction of feature vectors
    2.5.2 Characteristics of the vector space method
  2.6 Sentence-based latent Dirichlet allocation model
    2.6.1 Variational Bayes-based model inference
    2.6.2 Gibbs sampling-based model inference
    2.6.3 Constructing the sentence-based latent Dirichlet allocation model
Chapter 3 Main models
  3.1 Hierarchical Dirichlet process
  3.2 Nested Chinese restaurant process
  3.3 Hierarchical latent Dirichlet allocation model
    3.3.1 Constructing the hierarchical latent Dirichlet allocation model
    3.3.3 Derivation of the Gibbs sampling formulas
    3.3.4 Gibbs sampling algorithm
    3.3.5 Evaluating model convergence
    3.3.6 Characteristics of hierarchical latent Dirichlet allocation
Chapter 4 Bayesian nonparametric model of hierarchical topics and sentences
  4.1 Constructing the Bayesian nonparametric model of hierarchical topics and sentences
  4.2 Hierarchical stick-breaking process
  4.3 Derivation of the Gibbs sampling formulas
    4.3.1 Sampling the path of each sentence in a document
    4.3.2 Sampling the level of each sentence in a document
    4.3.3 Sampling the topic distribution of each word in a sentence
  4.4 Gibbs sampling algorithm
Chapter 5 Experimental results
  5.1 Experiment description
    5.1.1 Reuters news data
    5.1.2 DUC data
  5.2 Evaluation methods
    5.2.1 Perplexity convergence evaluation
    5.2.2 Clustering evaluation
    5.2.3 Document summarization evaluation
  5.3 Summarization evaluation tool
    5.3.1 ROUGE commands
  5.4 Experimental settings and results
    5.4.1 Perplexity convergence experiments
    5.4.2 Clustering experiments on the Reuters corpus
    5.4.3 Summarization experiments on the DUC corpus
Chapter 6 Conclusion and future work
  6.1 Conclusion
  6.2 Future directions
References
Appendix 1 Visualization of DUC words at hierarchical tree nodes
Appendix 2 Visualization of DUC sentences at hierarchical tree nodes
Appendix 3 List of DUC sentences at hierarchical tree nodes


Full-text access: on campus from 2021-08-01; off campus from 2021-09-01.