
Author: Hsu, Hsuan-Jui (許軒睿)
Thesis Title: Bayesian Topic Mixture Model for Information Retrieval (貝氏主題混合資訊檢索模型)
Advisor: Chien, Jen-Tzung (簡仁宗)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2007
Academic Year of Graduation: 95 (ROC academic year, 2006-2007)
Language: Chinese
Number of Pages: 59
Keywords: information retrieval, language model, document modeling
    In information retrieval, documents are most commonly represented in bag-of-words form. In automatic text processing, probabilistic topic models are widely used to infer latent topic variables from the relations among words. In the probabilistic latent semantic analysis (PLSA) model, each word in a document is treated as a sample from a mixture model whose mixture components are multinomial distributions. However, this approach does not account for the burstiness of high-frequency words in a corpus. Although PLSA can represent multiple topics, each individual topic model is quite simple. PLSA also has the drawbacks that it provides no probability model for previously unseen documents and that its number of parameters grows linearly with the number of documents. In recent years, probabilistic and graphical models have become an important research area in information retrieval and machine learning. This thesis addresses the unsupervised learning of mixture models and investigates several important statistical information retrieval models. Building on PLSA, we propose a new Bayesian topic mixture model to overcome the inability of the multinomial distribution to properly represent the probabilities of discriminative words. In combining the mixture components, we place a Dirichlet prior on the multinomial word distribution conditioned on each topic, so that different documents within the same class are generated by different multinomial distributions. Gibbs sampling is used to speed up parameter estimation. An advantage of this method is that a probability model can be expressed over the topics associated with words. In information retrieval experiments on TREC collections, we apply the method to document retrieval and document modeling to verify the superiority of the Bayesian topic mixture model.
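
    The abstract above mentions Gibbs sampling to speed up parameter estimation but gives no algorithmic detail on this page. As a rough illustration only, the following is a minimal collapsed Gibbs sampling sketch for a topic mixture with symmetric Dirichlet priors over the per-topic word distributions (an LDA-style stand-in, not the thesis' exact BTMM); the hyperparameters alpha and beta, the function name gibbs_topic_mixture, and the toy corpus are all assumptions made for this sketch.

import random

def gibbs_topic_mixture(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of documents, each a list of integer word ids in [0, V)."""
    rng = random.Random(seed)
    vocab_size = 1 + max(w for doc in docs for w in doc)
    ndk = [[0] * num_topics for _ in docs]                # document-topic counts
    nkw = [[0] * vocab_size for _ in range(num_topics)]   # topic-word counts
    nk = [0] * num_topics                                 # tokens per topic
    z = []                                                # topic assignment per token

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(num_topics)
            z[d].append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

    # Gibbs sweeps: resample every token's topic from its full conditional.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # p(z = t | rest) is proportional to
                # (ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) /
                           (nk[t] + vocab_size * beta) for t in range(num_topics)]
                r = rng.random() * sum(weights)
                k, acc = 0, weights[0]
                while acc < r:
                    k += 1
                    acc += weights[k]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

    # Posterior-mean estimate of the per-topic word distributions.
    phi = [[(nkw[k][w] + beta) / (nk[k] + vocab_size * beta)
            for w in range(vocab_size)] for k in range(num_topics)]
    return phi, z

if __name__ == "__main__":
    toy_docs = [[0, 1, 1, 2], [2, 3, 3, 4], [0, 1, 4, 4]]  # toy word-id corpus
    phi, _ = gibbs_topic_mixture(toy_docs, num_topics=2)
    print([round(p, 3) for p in phi[0]])

    After burn-in, phi approximates p(word | topic); the thesis' actual sampler would instead target the BTMM posterior described in Chapter 3.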

    In conventional information retrieval, documents are usually represented as bags of words. In state-of-the-art information retrieval, it is popular to apply a probabilistic topic model to infer word correlation through latent topic variables. Probabilistic latent semantic analysis (PLSA) is such a model, in which each word in a document is seen as a sample from a mixture model whose components are multinomial distributions. Although the PLSA model deals with the issue of multiple topics, each topic model is quite simple and the burstiness phenomenon of frequent words is not taken into account. In addition, PLSA has the weaknesses that it provides no probability model for unseen documents and that its number of parameters grows linearly with the number of training documents. In this thesis, we present a new Bayesian topic mixture model (BTMM) to handle the weaknesses inherent in the PLSA model. Beyond the multinomial word variables, we use a Dirichlet prior to represent topic information, so that different documents of the same class are generated by different multinomial distributions, and we estimate the parameters by Gibbs sampling. In experiments on the TREC corpus, we report average precision and model perplexity to demonstrate the superiority of the Bayesian topic mixture model.
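
    The abstract reports average precision and model perplexity. For reference, the small sketch below shows the standard definitions of these two measures; the function names and toy inputs are assumptions for illustration, not code from the thesis.

import math

def perplexity(docs, log_prob_fn):
    """Perplexity = exp(-total log-likelihood / total number of word tokens).
    log_prob_fn(doc) must return the log-likelihood of one document."""
    total_log_prob = sum(log_prob_fn(doc) for doc in docs)
    total_tokens = sum(len(doc) for doc in docs)
    return math.exp(-total_log_prob / total_tokens)

def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision for one query: mean of precision@k over the ranks k
    at which a relevant document is retrieved, normalized by |relevant_ids|."""
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_doc_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

if __name__ == "__main__":
    # Toy example: one query, five ranked documents, two of them relevant.
    print(mean_average_precision([(["d3", "d1", "d7", "d2", "d9"], {"d1", "d2"})]))

    Perplexity is computed per held-out word token, so lower is better; mean average precision averages per-query average precision over all queries, so higher is better.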

    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1  Introduction
      1.1  Research Background and Motivation
      1.2  Research Objectives and Methods
      1.3  Thesis Organization
    Chapter 2  Literature Review
      2.1  Document Representation
      2.2  Mixture Models for Documents
        2.2.1  Mixture of Unigrams
        2.2.2  Probabilistic Latent Semantic Analysis
        2.2.3  Latent Dirichlet Allocation
        2.2.4  Theme Topic Mixture Model
    Chapter 3  Bayesian Topic Mixture Model
      3.1  Dirichlet Distribution Model
        3.1.1  Word Burstiness Phenomenon
        3.1.2  Dirichlet-Multinomial Compound Model
      3.2  Bayesian Topic Mixture Model
        3.2.1  Model Definition
        3.2.2  Inference
        3.2.3  Relation to and Comparison with Other Models
    Chapter 4  Experimental Results
      4.1  Experimental Setup and Description
        4.1.1  Corpus Description
      4.2  Evaluation Measures
        4.2.1  Recall Rate
        4.2.2  Precision Rate
        4.2.3  Mean Average Precision
        4.2.4  Perplexity
        4.2.5  Convergence Criteria
      4.3  Document Modeling
        4.3.1  Parameter Settings
        4.3.2  Effect of Vocabulary Size on Document Modeling
        4.3.3  Document Modeling
      4.4  Applications to Information Retrieval
        4.4.1  Retrieval Performance on WSJ89 with Topics 101-150
        4.4.2  Retrieval Performance on WSJ89 with Topics 151-200
        4.4.3  Retrieval Performance on AP88 with Topics 101-150
        4.4.4  Retrieval Performance on AP88 with Topics 151-200
    Chapter 5  Conclusions and Future Work
      5.1  Conclusions
      5.2  Directions for Future Research
    Chapter 6  References
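
    Sections 3.1.1 and 3.1.2 of the outline concern word burstiness and the Dirichlet-multinomial compound model. For reference, the standard compound (Polya) marginal obtained by integrating the multinomial likelihood against its Dirichlet prior is, in textbook form (not quoted from the thesis):

\[
p(\mathbf{x} \mid \boldsymbol{\alpha})
  = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \boldsymbol{\alpha})\, d\boldsymbol{\theta}
  = \frac{n!}{\prod_{w} x_{w}!}\;
    \frac{\Gamma\!\big(\sum_{w} \alpha_{w}\big)}{\Gamma\!\big(n + \sum_{w} \alpha_{w}\big)}
    \prod_{w} \frac{\Gamma(x_{w} + \alpha_{w})}{\Gamma(\alpha_{w})},
  \qquad n = \sum_{w} x_{w},
\]

    where x_w is the count of word w in a document and alpha_w is the Dirichlet hyperparameter for word w. Because repeated draws of the same word raise its effective probability (a Polya-urn effect), this marginal assigns higher likelihood to bursty documents than a single multinomial does, which is the usual motivation for the compound model.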

