| Author: | 許軒睿 Hsu, Hsuan-Jui |
|---|---|
| Thesis Title: | 貝式主題混合資訊檢索模型 Bayesian Topic Mixture Model for Information Retrieval |
| Advisor: | 簡仁宗 Chien, Jen-Tzung |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering |
| Year of Publication: | 2007 |
| Academic Year: | 95 |
| Language: | Chinese |
| Pages: | 59 |
| Keywords (Chinese): | 資訊檢索, 語言模型, 文件模組化 |
| Keywords (English): | information retrieval, language model, document modeling |
| Hits: | Views: 97; Downloads: 1 |
In information retrieval, documents are commonly represented in bag-of-words form. In research on automatic text processing, probabilistic topic models are often used to infer latent topic variables from the correlations among words. In probabilistic latent semantic analysis (PLSA), each word in a document is treated as a sample from a mixture model whose mixture components are multinomial distributions. However, this approach does not account for the burstiness of high-frequency words in a corpus. Although PLSA can represent multiple topics, each topic model is quite simple. Moreover, PLSA has the drawbacks that it provides no probability model for previously unseen documents and that its number of parameters grows linearly with the number of documents. In recent years, probabilistic models and graphical models have become an important research area in information retrieval and machine learning. This thesis investigates several important statistical information retrieval models, focusing on the unsupervised learning problem for mixture models. Building on the PLSA model, we propose a new Bayesian topic mixture model to address the inability of the multinomial distribution to properly represent the probabilities of discriminative words. In combining the mixtures, we place a Dirichlet prior on the multinomial word distribution conditioned on each topic, so that different documents within the same class are generated by different multinomial distributions. We apply Gibbs sampling to speed up parameter estimation. The advantage of this method is that it provides a probability model over the topics associated with words. In information retrieval experiments on TREC collections, we apply the method to document retrieval and document modeling to verify the superiority of the Bayesian topic model.
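To make the contrast concrete, the following is a minimal sketch of the two model families described above. The symbols (K topics, topic-word multinomials φ, document-topic proportions θ, hyperparameters α and β) are illustrative assumptions in standard topic-model notation; the exact parameterization of the thesis's BTMM is not reproduced here.

```latex
% PLSA: each word w in document d is a draw from a mixture of K
% topic-conditional multinomials with document-specific weights.
% The weights P(z|d) are free parameters, one set per training
% document, so parameters grow linearly with the collection and
% unseen documents have no generative account.
P(w \mid d) \;=\; \sum_{z=1}^{K} P(w \mid z)\, P(z \mid d)

% Bayesian treatment (in the spirit of the proposed BTMM): place
% Dirichlet priors on the topic proportions and on the per-topic
% word multinomials, and integrate the proportions out. This
% yields a proper probability model even for unseen documents.
\theta_d \sim \operatorname{Dirichlet}(\alpha), \qquad
\phi_z \sim \operatorname{Dirichlet}(\beta)

P(\mathbf{w}_d \mid \alpha, \phi) \;=\;
  \int p(\theta_d \mid \alpha) \prod_{i=1}^{N_d}
  \sum_{z=1}^{K} \theta_{d,z}\, \phi_{z,\, w_{d,i}} \; d\theta_d
```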
In conventional information retrieval, documents are usually represented as bags of words. In state-of-the-art information retrieval, it is popular to apply probabilistic topic models to infer word correlation through latent topic variables. Probabilistic latent semantic analysis (PLSA) is such a model, in which each word in a document is seen as a sample from a mixture model whose components are multinomial distributions. Although PLSA deals with the issue of multiple topics, each topic model is quite simple, and the burstiness of frequent words is not taken into account. Moreover, PLSA has the weaknesses that there is no probability model for unseen documents and that the number of parameters grows linearly with the number of training documents. In this thesis, we present a new Bayesian topic mixture model (BTMM) to handle the weaknesses inherent in the PLSA model. Beyond the multinomial word variables, we use a Dirichlet prior to represent topic information, so that documents are generated by their associated multinomial distributions. In experiments on the TREC corpus, we report average precision and model perplexity to demonstrate the superiority of the Bayesian topic mixture model.
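The abstract mentions Gibbs sampling to accelerate parameter estimation but gives no algorithmic detail. As an illustrative assumption only, the sketch below implements a collapsed Gibbs sampler for a generic Dirichlet-multinomial topic model in the style of Griffiths and Steyvers, not the thesis's exact BTMM sampler; the function name and the hyperparameters `alpha` and `beta` are hypothetical.

```python
import numpy as np

def collapsed_gibbs(docs, vocab_size, n_topics, alpha=0.1, beta=0.01,
                    n_iters=200, seed=0):
    """Collapsed Gibbs sampler for a Dirichlet-multinomial topic model.

    `docs` is a list of lists of integer word ids. Illustrative sketch
    only; the BTMM sampler in the thesis is not specified in the abstract.
    """
    rng = np.random.default_rng(seed)
    n_docs = len(docs)
    n_dz = np.zeros((n_docs, n_topics))       # document-topic counts
    n_zw = np.zeros((n_topics, vocab_size))   # topic-word counts
    n_z = np.zeros(n_topics)                  # total words per topic
    # Initialize topic assignments uniformly at random.
    z_assign = []
    for d, doc in enumerate(docs):
        zs = rng.integers(n_topics, size=len(doc))
        z_assign.append(zs)
        for w, z in zip(doc, zs):
            n_dz[d, z] += 1; n_zw[z, w] += 1; n_z[z] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = z_assign[d][i]
                # Remove the current assignment before resampling.
                n_dz[d, z] -= 1; n_zw[z, w] -= 1; n_z[z] -= 1
                # Conditional posterior with the Dirichlet priors
                # integrated out ("collapsed"); the document-length
                # denominator is constant in z and omitted.
                p = (n_dz[d] + alpha) * (n_zw[:, w] + beta) \
                    / (n_z + beta * vocab_size)
                z = rng.choice(n_topics, p=p / p.sum())
                z_assign[d][i] = z
                n_dz[d, z] += 1; n_zw[z, w] += 1; n_z[z] += 1
    # Posterior-mean point estimates of the multinomials.
    phi = (n_zw + beta) / (n_z[:, None] + beta * vocab_size)
    theta = (n_dz + alpha) / (n_dz.sum(1, keepdims=True) + alpha * n_topics)
    return phi, theta
```

After burn-in, `phi` (topic-word probabilities) can score query terms for document ranking and `theta` gives per-document topic proportions; held-out perplexity, one of the evaluation measures named in the abstract, is computed from these same quantities.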