| | |
|---|---|
| Graduate student: | 鄭家宇 (Cheng, Chia-Yu) |
| Thesis title: | 動態分段模型應用於主題偵測及追蹤 (Dynamic Segmentation Model for Topic Detection and Tracking) |
| Advisor: | 簡仁宗 (Chien, Jen-Tzung) |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of publication: | 2010 |
| Academic year of graduation: | 98 (ROC calendar; 2009–2010) |
| Language: | Chinese |
| Number of pages: | 68 |
| Keywords (Chinese): | 機器學習, 文件模型, 文件分段 |
| Keywords (English): | machine learning, document model, document segmentation |
In an era of information explosion, effectively processing large volumes of documents has become an important issue. Latent topic models can successfully extract the latent topic information in documents. However, many corpora consist of continuous document streams, such as meeting records, electronic news, or conversations; when no document boundary markers are available, an effective latent topic document model cannot be trained. In this thesis, we propose a dynamic segmentation model (DSM) built on the latent Dirichlet allocation (LDA) model, which integrates document segmentation with the document topic model to handle the segmentation task. DSM is an unsupervised segmentation model: even from unsegmented documents, we can still train an LDA model and attain performance close to that of the original LDA. By comparing the similarity between sentences to determine their segmentation probabilities, semantically related sentences are merged into the same segment. In addition, this thesis extends DSM by placing a Markov chain over the segmentation-point parameters; through its transition probabilities, segment length can be taken into account, yielding the Markov dynamic segmentation model (MDSM). We derive a VB-EM algorithm for effective estimation of the model parameters. In the experiments, we evaluate performance on the TDT2 corpus, compare several related models, and apply the proposed DSM to topic detection and tracking. Experimental results show that DSM effectively improves performance in terms of model perplexity, document segmentation, and topic detection accuracy.
As multimedia data grows rapidly, end users face an explosion of information in daily life, and extracting useful information from huge databases is time-consuming. Effective document modeling is therefore crucial for building state-of-the-art information systems. Latent topic models are a successful approach to capturing latent topic information from text data. However, in real-world applications the data often consists of sequential streams, e.g. meeting recordings, lecture transcriptions, and conversational speech. For such data without explicit document boundaries, it is difficult to train a sophisticated latent topic document model. This study presents a new dynamic segmentation model (DSM) based on the latent Dirichlet allocation (LDA) document model. DSM is a document topic model that incorporates contextual topic information and serves as an unsupervised learning method for document segmentation. The similarity between sentences is used to form a Beta distribution that reflects prior knowledge of segment boundaries, and the segment boundaries and latent topic regularities are extracted simultaneously. Under this approach, the distribution of the segmentation variable is adaptively updated according to the contextual topic information, establishing a flexible segmentation model that groups coherent sentences into a segment. Furthermore, we build the Markov DSM (MDSM) by introducing a Markov chain into DSM so that segment length is characterized through the transition probabilities. The model is trained by a variational Bayesian EM (VB-EM) procedure and evaluated on the TDT2 corpus. We compare the topic segmentation and detection performance of DSM against related methods, and we also apply the approach to topic detection and tracking. Experimental results show significant improvements with the proposed method in terms of perplexity as well as topic detection and tracking accuracy.
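To make the segmentation mechanism described in the abstract concrete, below is a minimal sketch in Python. It assumes, purely for illustration, that sentence-level topic proportions come from an already-trained LDA model, that cosine similarity measures topic coherence between adjacent sentences, and that a boundary is placed by thresholding the mean of each gap's Beta prior. The thesis itself infers boundaries jointly with the topics via the VB-EM algorithm, which this toy does not implement; all function names and parameter settings here are hypothetical.

```python
# Illustrative sketch (not the thesis's VB-EM implementation): adjacent-sentence
# topic similarity shapes a Beta-distributed boundary prior, and sentences are
# grouped into segments at likely topic shifts.
import numpy as np

rng = np.random.default_rng(0)

def cosine(u, v):
    # Cosine similarity between two topic-proportion vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def boundary_prior(theta, strength=10.0):
    """For each gap between adjacent sentences, set Beta(a, b) hyperparameters
    so that low topic similarity raises the expected boundary probability
    a / (a + b), and high similarity lowers it (hypothetical parameterization)."""
    priors = []
    for i in range(len(theta) - 1):
        s = max(cosine(theta[i], theta[i + 1]), 0.0)  # clamp to [0, 1]
        priors.append((strength * (1.0 - s) + 1.0,    # a grows for dissimilar gaps
                       strength * s + 1.0))           # b grows for similar gaps
    return priors

def segment(theta, threshold=0.5):
    """Place a boundary wherever the Beta mean exceeds the threshold, then
    group contiguous sentence indices into segments."""
    segments, current = [], [0]
    for i, (a, b) in enumerate(boundary_prior(theta)):
        if a / (a + b) > threshold:
            segments.append(current)
            current = []
        current.append(i + 1)
    segments.append(current)
    return segments

# Toy "document": six sentences whose topic proportions shift after the third,
# standing in for LDA-inferred sentence-topic vectors.
theta = np.vstack([rng.dirichlet([8, 1, 1], size=3),   # topic-1-heavy sentences
                   rng.dirichlet([1, 8, 1], size=3)])  # topic-2-heavy sentences
print(segment(theta))  # expected: two segments, e.g. [[0, 1, 2], [3, 4, 5]]
```

The Beta choice is the interesting design point: since the Beta is conjugate to a Bernoulli boundary variable, similarity evidence can plausibly be folded into the boundary posterior in closed form during variational inference, which would keep the VB-EM updates tractable.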