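Graduate student: | 李建志 Lee, Chien-Chih |
---|---|
Thesis title: | 應用混合式機率模型於新聞資訊檢索之研究 News Information Retrieval Based on Probabilistic Mixture Model |
Advisor: | 簡仁宗 Chien, Jen-Tzung |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
Year of publication: | 2002 |
Academic year of graduation: | 90 (ROC calendar) |
Language: | Chinese |
Pages: | 68 |
Chinese keywords: | 混合式機率模型 (probabilistic mixture model), 文件檢索 (document retrieval) |
Foreign-language keywords: | EM algorithm, news information retrieval, mixture model |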
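As information technology advances, the volume of information keeps growing, and users face more and more of it. Without techniques that help us search, finding the right data would be very difficult. Information retrieval arose to solve this problem. With this technology, users expect the retrieval system to find the content they actually want.
This thesis applies language models to information retrieval. We adopt a probabilistic mixture model to describe the stochastic characteristics of documents: using multiple language models and the Expectation-Maximization (EM) algorithm [9], we estimate posterior probabilities to obtain a document-specific set of parameters for each article. We also incorporate latent semantic indexing (LSI), a common technique in information retrieval, which stores the information in a large training collection as a low-dimensional matrix; at test time this matrix is used to obtain long-distance information and latent semantic features. The experiments show that treating the language models themselves as parameters, that is, replacing the balanced corpus with corpora of different categories and estimating the parameters with EM, improves retrieval accuracy more effectively. Adding the LSI information to the mixture model also improves retrieval accuracy.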
Due to the rapid development of information technology, we need to explore more methodologies to resolve increasingly difficult problems. Accordingly, it is very important to build effective information retrieval systems that deliver useful information. Users expect such a system to help them find what they really need.
The language model is a very important technique in speech recognition and natural language processing. In this thesis, we apply language models to news information retrieval. We use a probabilistic mixture model to characterize the documents and the Expectation-Maximization (EM) algorithm to estimate the model parameters for each individual document. Furthermore, we combine Latent Semantic Indexing (LSI) with the proposed retrieval model. Using LSI, the huge training data are reduced to a low-dimensional matrix, from which we can obtain long-distance information and latent semantics. In the experiments, we find that estimating the parameters separately for each document achieves better performance. Adding the LSI information also improves retrieval precision.
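The per-document parameter estimation mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the thesis's exact formulation, so the code below makes several assumptions: the component language models are fixed unigram distributions, only the mixture weights are estimated for each document, and retrieval ranks documents by the likelihood of the query under each document's mixture. All names (`em_mixture_weights`, `query_likelihood`, `FLOOR`) and the toy data are hypothetical.

```python
from typing import Dict, List, Sequence

FLOOR = 1e-9  # assumed probability floor for words a component has never seen


def em_mixture_weights(
    doc_tokens: Sequence[str],
    component_models: List[Dict[str, float]],
    iterations: int = 50,
) -> List[float]:
    """Estimate one document's mixture weights over fixed unigram models with EM."""
    if not doc_tokens or not component_models:
        raise ValueError("need at least one token and one component model")
    k = len(component_models)
    weights = [1.0 / k] * k  # start from uniform weights
    for _ in range(iterations):
        # E-step: posterior probability of each component for every token,
        # accumulated into expected counts per component.
        expected = [0.0] * k
        for word in doc_tokens:
            joint = [weights[j] * component_models[j].get(word, FLOOR) for j in range(k)]
            total = sum(joint)
            for j in range(k):
                expected[j] += joint[j] / total
        # M-step: each new weight is the average posterior of its component.
        weights = [e / len(doc_tokens) for e in expected]
    return weights


def query_likelihood(
    query_tokens: Sequence[str],
    weights: Sequence[float],
    component_models: List[Dict[str, float]],
) -> float:
    """Likelihood of the query under a document's mixture of language models."""
    likelihood = 1.0
    for word in query_tokens:
        likelihood *= sum(w * m.get(word, FLOOR) for w, m in zip(weights, component_models))
    return likelihood


# Toy category models (made-up probabilities, for illustration only).
finance_model = {"stock": 0.5, "market": 0.4, "game": 0.1}
sports_model = {"game": 0.6, "team": 0.3, "market": 0.1}
models = [finance_model, sports_model]

finance_doc_weights = em_mixture_weights(["stock", "market", "stock", "market"], models)
sports_doc_weights = em_mixture_weights(["game", "team", "game"], models)
```

With these toy models, a query such as `["stock"]` scores higher against the finance-like document than against the sports-like one, because the finance component dominates that document's estimated weights. Using category-specific component models, rather than a single balanced-corpus model, mirrors the setup the abstract reports as more effective.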
References
[1] Bo-re Bai and Berlin Chen, “Syllable-based Chinese Text/Spoken Document Retrieval Using Text/Spoken Queries”, Pattern Recognition and Artificial Intelligence, Vol. 14, No. 5, pp. 603-616, 2000.
[2] J. Bellegarda, “Exploiting latent semantic information in statistical language modeling.” Proceedings of the IEEE, pp.1279-1296, 2000.
[3] Berlin Chen, “Speech Information Retrieval for Mandarin Chinese Syllable-based Index Feature, Statistical Retrieval Models and Improved Approach”, Ph.D. Dissertation.
[4] Berlin Chen, Hsin-min Wang, and Lin-shan Lee, "An HMM/N-gram-based Linguistic Approach for Mandarin Spoken Document Retrieval," in Proc. 7th European Conference on Speech Communication and Technology (EUROSPEECH), Aalborg, Denmark, Sept. 2001.
[5] Berlin Chen and Hsin-min Wang, “Improved Spoken Document Retrieval by Exploring Extra Acoustic and Linguistic Cues”, EUROSPEECH, pp. 299-302, 2001.
[6] Berlin Chen, Hsin-min Wang, and Lin-shan Lee, “Retrieval of Broadcast News Speech in Mandarin Chinese Collected in Taiwan Using Syllable-Level Statistical Characteristics,” in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, 2000.
[7] Berlin Chen, Hsin-min Wang, and Lin-shan Lee, "Retrieval of Mandarin Broadcast News using Spoken Queries," in Proc. International Conference on Spoken Language Processing (ICSLP), Beijing, Oct. 2000.
[8] Croft, W.B., and Turtle, H.R. Text Retrieval and Inference. In Text-Based Intelligent Systems, edited by Paul S. Jacobs, pp. 127-155, Lawrence Erlbaum Associates, Publishers, 1992.
[9] A.P. Dempster, N.M. Laird, and D.B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. Roy. Stat. Soc., 39(1), pp. 1-38, 1977.
[10] M. Federico, Bayesian Estimation Methods for N-gram Language Model Adaptation, Proc. ICSLP, pp. 240-243, Philadelphia, 1996.
[11] Frakes, W.B., and Baeza-Yates, R. (editors). Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, New Jersey: Prentice Hall, 1992.
[12] C. Ng, R. Wilkinson, and J. Zobel, “Experiments in spoken document retrieval using phoneme n-grams”, Speech Communication, Vol. 32, No. 1-2, pp. 61-77, Sept. 2000.
[13] J.-T. Chien and H.-Y. Chen, “Association Rule based Language Models for Discovering Long Distance Dependency in Chinese”, Proc. of Research on Computational Linguistics Conference XIV(ROCLING XIV), pp.43-63, Tainan-Taiwan, August 2001. (in Chinese)
[14] David R.H. Miller, T. Leek, and R. Schwartz, “A Hidden Markov Model Information Retrieval System”, Proc. ACM SIGIR, pp. 214-221, 1999.
[15] J.L. Gauvain, L. Lamel, Y. de Kercadio, and G. Adda. Transcription and Indexation of Broadcast Data. In Proceedings of ICASSP, pages 1663-1666, Istanbul, Jun 2000.
[16] D. Harman, Overview of the Fourth Text Retrieval Conference (TREC-4). 1995. Available at http://trec.nist.gov/pubs/trec4/overvies.ps.
[17] Hsin-min Wang “Experiments in syllable-based retrieval of broadcast news speech in Mandarin Chinese”, Speech Communication 32, pp.49-60, 2000
[18] Hsin-min Wang “Mandarin spoken document retrieval based on syllable lattice matching”, Pattern Recognition Letters 21, pp.615-624, 2000
[19] Hsin-min Wang, H. Meng, P. Schone, B. Chen and W. K. Lo, “Multi-Scale Audio Indexing for Translingual Spoken Document Retrieval,” in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, 2001.
[20] R.M. Iyer and M. Ostendorf, “Modeling long distance dependence in language: topic mixtures versus dynamic cache models,” IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 1, Jan. 1999.
[21] J. M. Ponte and W. Bruce Croft, “A Language Modeling Approach to Information Retrieval”, Proc. ACM SIGIR, pp. 275-281, 1998.
[22] Frederick Jelinek. Statistical Methods for Speech Recognition. The MIT Press, 1999.
[23] P. Jourlin, S. E. Johnson, K. Sparck Jones, P. C. Woodland, “Spoken Document Representations for Probabilistic Retrieval,” Speech Communication, 32, pp. 21-36, 2000.
[24] J. Makhoul, F. Kubala, T. Leek, D. Liu, L. Nguyen, R. Schwartz and A. Srivastava, “Speech and language technologies for audio indexing and retrieval”, Proc. of the IEEE, Vol. 88, No. 8, August 2000.
[25] C. D. Manning, H. Schütze, “Foundations of statistical natural language processing”, Massachusetts Institute of Technology, pp. 315-407, 1999.
[26] Mario A.T. Figueiredo, Anil K. Jain, “Unsupervised Learning of Finite Mixture Models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.24, No.3, March 2002.
[27] M. Meteer and J. R. Rohlicek, “Statistical language modeling combining N-gram and context free grammars”, in Proc. Int. Conf. Acoustics, Speech, Signal Processing, vol. II, pp. 37–40, 1993.
[28] E. Mittendorf and P. Schäuble, “Document and Passage Retrieval Based on Hidden Markov Models”, Proc. ACM SIGIR, pp. 318-327, 1996.
[29] K. Ng, “Information fusion for Spoken Document Retrieval,” in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, 2000.
[30] L. Rabiner and B.H. Juang, “Fundamentals of Speech Recognition”, Prentice Hall, pp. 321-387, 1993.
[31] S. Renals, D. Abberley, D. Kirby, and T. Robinson, “Indexing and Retrieval of Broadcast news,” Speech Communication, 32, pp.5-20, 2000.
[32] Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley Longman, May 1999.
[33] R. Rosenfeld, “Two decades of Statistical Language Modeling: Where Do We Go From Here?” Proc of the IEEE, 88:1270-1278, August 2000.
[34] Sergios Theodoridis and Konstantinos Koutroumbas. Pattern Recognition. Academic Press, pp. 39-39, 1999.
[35] M. Siegler and M. Witbrock, “Improving the suitability of imperfect transcriptions for information retrieval from spoken documents,” ICASSP 1999.
[36] R. Silipo and F. Crestani, “Prosodic Stress and Topic Detection in Spoken Sentences,” Technical Report, International Computer Science Institute, Berkeley, 2000.
[37] F. Song and W.B. Croft, “A General Language Model for Information Retrieval”, Proc. CIKM , pp.93-96, 1999.
[38] F. Walls, H. Jin, S. Sista, and R. Schwartz. “Probabilistic models for topic detection and tracking,” In IEEE International Conference On Acoustics, Speech and Signal Processing, 1999.
[39] C. Ng, R. Wilkinson, and J. Zobel, “Experiments in Spoken Document Retrieval Using Phoneme N-grams,” Speech Communication, 32, pp. 61-77, 2000.
[40] M. Witbrock and A. Hauptmann, “Using Words and Phonetic Strings for Efficient Information Retrieval from Imperfectly Transcribed Spoken Documents,” in Proc. ACM Digital Libraries Conference, pp.30-35, 1997.
[41] I. H. Witten and T. C. Bell, “The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression”, IEEE Transactions on Information Theory, Vol. 37, pp. 1085-1094, 1991.
[42] S. Young, “Probabilistic methods in spoken dialogue systems”, Proc. of the Royal Society, London, Sept. 1999.
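[43] CKIP, http://godel.iis.sinica.edu.tw/, Chinese Knowledge and Information Processing (CKIP) Group, Institute of Information Science, Academia Sinica (中央研究院資訊科學研究所詞庫小組).
[44] cnYES (鉅亨網), http://www.cnyes.com/
[45] FTV Real-Time News (民視即時新聞), http://www.can.com.tw
[46] udn News (聯合新聞網), http://udnnews.com/NEWS
[47] ETtoday, http://www.ettoday.com/
[48] China Times Online (中時電子報), http://news.chinatimes.com/
[49] Yahoo News (雅虎新聞), http://news.yahoo.com.tw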