| Graduate Student: | Chen, Chia-Ming (陳嘉明) |
|---|---|
| Thesis Title: | Customized Spoken Dialogue System Based on LVCSR with Human Memory-Like Language Model Learning and Prosodic-Contextual Post-processing |
| Advisor: | Wang, Jhing-Fa (王駿發) |
| Degree: | Master |
| Department: | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2012 |
| Academic Year of Graduation: | 100 (ROC era) |
| Language: | English |
| Pages: | 43 |
| Chinese Keywords (translated): | large vocabulary speech recognition, customized dialogue system, human memory-like language model learning, post-processing of speech recognition |
| English Keywords: | LVCSR, customized spoken dialogue system, human memory-like language model learning, post-processing of speech recognition |
As technology advances, the convenience of the human-machine interface grows increasingly important. Compared with the traditional keyboard and mouse, devices such as the KINECT motion-sensing camera, which combines depth and image detection, and Siri, the mobile voice assistant on the iPhone 4S, are gradually taking their place. However, traditional spoken dialogue systems have long suffered from two problems in large-vocabulary speech recognition: the corpus cannot evolve with time and new knowledge, and the recognition rate is low. Moreover, in dialogue handling they cannot recognize and respond according to each user's needs and habitual phrasing. This thesis therefore makes three improvements. 1. Observing how humans memorize, we divide the language model of large-vocabulary speech recognition into a short-term language model and a long-term language model, imitating human memory: urgent events are placed in short-term memory, while important events are placed in long-term memory. 2. To raise the recognition rate, the speech signal is first analyzed prosodically: the autocorrelation function (ACF) and the average syllable length are used to obtain the number of syllables and the pitch variation of the signal. The recognition result is then analyzed contextually: the occurrence frequencies of parts of speech and a bigram model over pairs of parts of speech are used to analyze the intonation of the recognized text, which is compared against the pitch-variation information of the speech signal to improve recognition accuracy. 3. A customized user interface is designed that links keywords to decision nodes of a dialogue decision tree. In the experiments, the post-processed recognition results reduce the word error rate by about 6.4%, and after long-term learning the word error rate is reduced by about 27.55%.
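The prosodic analysis above uses the autocorrelation function (ACF) to track pitch variation in the speech signal. As a rough illustration of the idea, the sketch below estimates the fundamental frequency of one frame by finding the ACF peak inside a plausible pitch range; the function name, frame length, and pitch bounds are illustrative assumptions, not the thesis's actual implementation:

```python
import math

def estimate_f0_acf(frame, sr, fmin=80.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of one speech frame.

    The autocorrelation of a voiced frame peaks at lags that are
    multiples of the pitch period; the lag of the strongest peak
    inside [sr/fmax, sr/fmin] gives the period estimate.
    """
    lag_min = int(sr / fmax)
    lag_max = min(int(sr / fmin), len(frame) - 1)
    best_lag, best_val = 0, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # ACF at this lag: correlate the frame with itself shifted by `lag`
        acf = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if acf > best_val:
            best_val, best_lag = acf, lag
    return sr / best_lag if best_lag else 0.0

# Synthetic 150 Hz tone, 16 kHz sampling, 40 ms frame (640 samples)
sr = 16000
frame = [math.sin(2 * math.pi * 150 * n / sr) for n in range(640)]
f0 = estimate_f0_acf(frame, sr)
```

Running the estimator over successive frames yields a pitch contour, whose variation can then be compared against the intonation predicted from the recognized text.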
With the development of technology, the human-machine interface becomes more and more important. Compared with traditional interfaces, a spoken dialogue system is more convenient. However, LVCSR (Large Vocabulary Continuous Speech Recognition) and dialogue systems have several disadvantages. For example, the corpora of LVCSR cannot evolve with time and new information; moreover, the recognition rate of LVCSR is low, and the dialogue system cannot serve different users' keyword requirements. Consequently, we improve the whole system in three parts. First, by observing human memory modes, a human memory-like language model is proposed that learns urgent corpora in a short-term language model and important corpora in a long-term language model. Second, to increase the recognition rate, a prosodic-contextual post-processing stage is built. Third, a customized user interface is designed to serve different users' keyword requirements. Finally, the accuracy of the prosodic-contextual post-processing and the recognition results of LVCSR are evaluated by word error rate (WER). The average WER is improved by 6.4% with prosodic-contextual post-processing, and by 27.55% with long-term learning over eight subjects. These experimental results show that the proposed system is effective as a human-machine interface.
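Both experiments are scored by word error rate. For reference, WER is conventionally computed as the word-level Levenshtein distance (substitutions + deletions + insertions) divided by the reference length; a minimal sketch follows, with hypothetical example sentences:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / len(ref),
    computed by standard Levenshtein dynamic programming over words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / len(r)

# One substitution ("on" -> "of") and one deletion ("living"): 2 / 6 errors
wer = word_error_rate("turn on the living room light",
                      "turn of the room light")
```

A relative WER improvement such as the 6.4% reported above compares the average of this measure before and after post-processing.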