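Graduate student: 彭華瑞 (Peng, Hua-Jui)
Thesis title: 應用潛在式語意分析於語言模型之研究 (On use of Latent Semantic Analysis for Language Modeling)
Advisor: 簡仁宗 (Chien, Jen-Tzung)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2002
Academic year of graduation: 90 (ROC calendar)
Language: Chinese
Number of pages: 64
Chinese keywords: 潛在式語意分析 (latent semantic analysis), 語言模型 (language model)
English keywords: Latent Semantic Analysis, Language Model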
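  • This thesis proposes a language model that captures long-distance information, namely the latent semantic associations between words and between documents and words. These associations are extracted with latent semantic analysis, a technique from information retrieval. A conventional N-gram language model captures only the limited-distance information inside its N-gram window and cannot capture semantic information over longer distances, so overcoming this lack of long-distance information has long been an important research topic. In information retrieval, latent semantic analysis projects words to positions in a semantic space and then searches that space for the desired information, thereby retrieving the documents the user needs. This thesis uses the same approach to obtain the relationship between a word and its history, and from that relationship estimates the probability of the next word. The thesis also uses the latent semantic model to build an effective smoothing method, in which model parameters that do not appear in the training data are expressed as linear combinations of parameters that do appear. Experimental results show that the proposed method achieves lower perplexity than results reported in the literature. The technique can also be combined effectively with other smoothing techniques to further improve language model performance. Finally, the language model is used to develop two demonstration systems: an online document classification system and a toneless, personalized Zhuyin (phonetic) input method.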

    In this thesis, we propose a new statistical language modeling approach that captures long-distance dependencies among words and documents. The association between words and documents is established via Latent Semantic Analysis, a technique developed in the information retrieval field. Traditional N-gram language models capture word dependencies only within an N-gram window, so exploiting long-distance word dependencies is crucial for building more powerful language models. Latent semantic analysis was developed to model long-distance dependencies between words and documents. It maps terms into a common semantic space in which the relationship between words and documents can be explored. In this thesis, we use the relationship retrieved in this space between a word and its history to predict the next word. The method can also be combined with the Witten-Bell algorithm for parameter smoothing. We further apply the combined approach to document classification. Two practical applications are also constructed: a document classification system and a personalized Chinese character input method.
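The abstract describes the method only in words, so the following is a minimal sketch of the general idea rather than the thesis's actual formulation. It builds a toy word-by-document count matrix, reduces it with a truncated singular value decomposition, and scores each vocabulary word by its cosine similarity to the history in the reduced space. The toy data, the function names, the choice to represent the history as the mean of its word vectors, and the way similarities are turned into a normalized score are all assumptions made for illustration.

```python
import numpy as np

# Toy word-by-document count matrix (rows: words, columns: documents).
# The vocabulary and counts are invented for illustration only.
vocab = ["stock", "market", "price", "game", "team", "score"]
counts = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [1, 2, 0, 1],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
    [0, 1, 2, 2],
], dtype=float)


def lsa_word_vectors(matrix, rank):
    """Project words into a rank-dimensional latent space via truncated SVD."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    # Scale the left singular vectors by the singular values.
    return u[:, :rank] * s[:rank]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def lsa_next_word_scores(word_vectors, history_ids):
    """Score every word against the history (assumed: mean of history vectors)."""
    history = word_vectors[history_ids].mean(axis=0)
    sims = np.array([cosine(w, history) for w in word_vectors])
    # Assumed mapping: shift cosines from [-1, 1] to [0, 2], then normalize.
    shifted = sims + 1.0
    return shifted / shifted.sum()


word_vectors = lsa_word_vectors(counts, rank=2)
history_ids = [vocab.index("stock"), vocab.index("market")]
scores = lsa_next_word_scores(word_vectors, history_ids)

for word, score in zip(vocab, scores):
    print(f"{word}: {score:.3f}")
```

With a history of "stock" and "market", words that co-occur with them in the same documents receive higher scores than words from the unrelated documents, even though no N-gram window is involved. In a full language model, a score of this kind would typically be combined with N-gram probabilities; how the thesis performs that combination is not stated in this record.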

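    Abstract (Chinese)
    Abstract (English)
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
    Chapter 2 Introduction to N-gram Models
      2.1 Applications of N-gram Models
        2.1.1 Speech Recognition
        2.1.2 Document Classification
      2.2 Building N-gram Models
      2.3 Evaluating N-gram Models
      2.4 Shortcomings of N-gram Models
    Chapter 3 Directions for Improving N-gram Models
      3.1 Cache N-gram Models and Mixture N-gram Models
      3.2 Witten-Bell Smoothing
      3.3 Trigger Pair Algorithm
      3.4 Information Retrieval Techniques for Language Model Adaptation
    Chapter 4 Latent Semantic Analysis and Its Applications
      4.1 Latent Semantic Indexing
        4.1.1 Latent Semantic Matrix
      4.2 Singular Value Decomposition
      4.3 Latent Semantic Analysis Applied to Language Models
        4.3.1 Data Representation in Latent Semantic Analysis
        4.3.3 Integrating Latent Semantic Analysis Information Retrieval with N-grams
        4.3.4 Latent Semantic Analysis for Smoothing
      4.4 An Example of Latent Semantic Analysis
    Chapter 5 Experiments
      5.1 Lexicon
      5.2 Experimental Data
      5.3 Experimental Results
        5.3.1 Evaluation of Smoothing Performance
        5.3.2 Evaluation of Latent Semantic Analysis Performance
        5.3.3 Evaluation of Latent Semantic Analysis Smoothing Performance
        5.3.4 Evaluation of Latent Semantic Analysis Combined with Smoothing
        5.3.5 Effect of Latent Semantic Analysis Dimensionality on Language Model Performance
        5.3.6 Effect of Latent Semantic Analysis Window Size on Language Model Performance
        5.3.7 Document Classification Experiments
        5.3.8 Time Required for Singular Value Decomposition
    Chapter 6 Demonstration Systems
      6.1 Online Automatic Document Classification System
      6.2 Toneless Zhuyin Input Method
    Chapter 8 Conclusions and Future Research
    References
    Appendix 1 Sample Articles of Each Category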

    Full-text availability: on campus, available immediately; off campus, available from 2002-07-24.