| Author: | 黃建霖 Huang, Chien-Lin |
|---|---|
| Thesis Title: | 中英多語語音文件分析與檢索之研究 English/Mandarin Multilingual Spoken Document Analysis and Retrieval |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Degree: | 博士 Doctor |
| Department: | 電機資訊學院 - 資訊工程學系 College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of Publication: | 2008 |
| Graduation Academic Year: | 96 |
| Language: | English |
| Number of Pages: | 86 |
| Chinese Keywords: | 多語語音辨識 (multilingual speech recognition), 語音文件檢索 (spoken document retrieval), 相片檢索 (photo retrieval) |
| English Keywords: | photo retrieval, spoken document retrieval, multilingual speech recognition |
With the trend toward globalization, multiple languages are commonly encountered in daily life, in documents, conversations, news, music, and movies. At the same time, as multimedia and spoken documents grow rapidly, the effective management of multimedia content has become an important issue. The main purpose of this dissertation is to study English/Mandarin multilingual spoken document analysis and retrieval. For broadcast news spoken documents, it proposes methods for multilingual speech recognition, spoken document analysis, and retrieval, and it presents a speech-annotated photo retrieval application and its methods.
For broadcast news spoken documents, this study performs automatic speech recognition while accounting for English/Mandarin code-mixing in anchor reports. It investigates the definition of a unified English/Mandarin phone set. First, English and Mandarin pronunciations are mapped onto the International Phonetic Alphabet. Context-dependent tri-phone models are then considered, and acoustic similarity and contextual analysis are used to determine effective multilingual recognition units. Next, an English/Mandarin pronunciation lexicon is generated from the defined phone set, a Mandarin lexicon, and the CMU dictionary; based on this lexicon, the text corpus is segmented into words and language model statistics are estimated. This dissertation also proposes a new method for spoken document indexing and retrieval, in which a multi-level index is built by combining the syllable- and character-level recognition results, keywords extracted from the spoken documents, and the hypernyms of those keywords. A semantic verification method then re-ranks the retrieved results to improve retrieval accuracy. Finally, the study presents a speech-annotated photo retrieval application in which a syllable-to-image-pattern indexing method is combined with other indexing schemes to achieve effective and convenient photo retrieval.
Due to the trend of globalization, multiple languages are often used in daily life, in text, dialog, news, music, and movies. With the rapid increase of multimedia and spoken documents, the efficient management of multimedia content has become an important issue. The purpose of this study is English and Mandarin multilingual spoken document analysis and retrieval. In this thesis, we present novel approaches for multilingual speech recognition, spoken document analysis and retrieval in broadcast news, and an application of speech-annotated photo retrieval.
This study considers English and Mandarin mixed-language speech in the anchor reports of broadcast news for automatic speech recognition, and addresses the construction of a robust English and Mandarin phone set. First, the English and Mandarin phones are mapped onto a universal phone set according to the International Phonetic Alphabet (IPA). Left- and right-context dependent tri-phone models are then considered, and an effective multilingual phone set is determined according to acoustic and contextual analysis. The English and Mandarin pronunciation lexicon is produced from this phone set, a Mandarin lexicon, and the CMU dictionary, and the language model is then estimated from the lexicon definitions.
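To make the phone-set decision concrete, the sketch below shows one simplified way a merge-or-split decision across languages could be implemented in Python, assuming per-phone Gaussian statistics (feature means and variances) have already been estimated. The example IPA mappings, the Bhattacharyya distance, and the threshold are illustrative assumptions, not the exact acoustic and contextual analysis used in the thesis.

```python
import numpy as np

# Illustrative (hypothetical) mappings of a few Mandarin and English phones
# onto candidate IPA symbols; the thesis derives its own mapping and units.
MANDARIN_TO_IPA = {"b": "p", "p": "ph", "m": "m", "f": "f", "s": "s"}
ENGLISH_TO_IPA = {"P": "ph", "B": "b", "M": "m", "F": "f", "S": "s"}

def bhattacharyya(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    var = 0.5 * (var1 + var2)
    term1 = 0.125 * np.sum((mu1 - mu2) ** 2 / var)
    term2 = 0.5 * np.sum(np.log(var / np.sqrt(var1 * var2)))
    return term1 + term2

def decide_units(phone_stats, threshold=0.5):
    """For each IPA symbol, keep one shared unit if the Mandarin and English
    realizations are acoustically close; otherwise keep language-specific units.

    phone_stats: {ipa_symbol: {"man": (mu, var), "eng": (mu, var)}}
    """
    units = []
    for ipa, stats in phone_stats.items():
        if "man" in stats and "eng" in stats:
            d = bhattacharyya(*stats["man"], *stats["eng"])
            if d < threshold:
                units.append(ipa)                       # merged multilingual unit
            else:
                units += [ipa + "_man", ipa + "_eng"]   # language-specific units
        else:
            lang, = stats.keys()
            units.append(f"{ipa}_{lang}")
    return units

# Toy statistics over a two-dimensional feature space (illustrative only).
stats = {
    "m": {"man": (np.array([0.1, 0.2]), np.array([1.0, 1.0])),
          "eng": (np.array([0.15, 0.25]), np.array([1.1, 0.9]))},
    "f": {"man": (np.array([2.0, -1.0]), np.array([0.5, 0.5])),
          "eng": (np.array([-2.0, 1.5]), np.array([0.5, 0.5]))},
}
print(decide_units(stats))   # ['m', 'f_man', 'f_eng']
```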
This study also provides a novel approach for the indexing and retrieval of spoken documents. Multi-level knowledge indexing is used for spoken document indexing, based on the transcription data, keywords extracted from the spoken documents, and the hypernyms of the extracted keywords. A semantic verification approach is then utilized to re-rank the retrieved documents and improve the final results. Finally, this study proposes an approach to speech-annotated photo retrieval: speech is a convenient way to manage digital photographs, and a novel speech retrieval approach using syllable-transformed image-like patterns is applied.
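As a rough illustration of the multi-level indexing idea, the following Python sketch scores documents by a weighted combination of per-level term matches (syllables, extracted keywords, hypernyms). The toy documents, level weights, and the omission of the semantic verification re-ranking step are simplifying assumptions, not the thesis' actual implementation.

```python
import math
from collections import Counter

# Toy index: each document is represented at several knowledge levels.
# A real system would fill these from recognizer output (syllables/characters),
# keyword extraction, and a thesaurus for hypernyms; the data here is made up.
DOCS = {
    "news_001": {"syllable": ["tai", "wan", "xin", "wen"],
                 "keyword":  ["Taiwan", "news"],
                 "hypernym": ["region", "media"]},
    "news_002": {"syllable": ["qi", "xiang", "bao", "gao"],
                 "keyword":  ["weather", "report"],
                 "hypernym": ["forecast", "media"]},
}
LEVEL_WEIGHTS = {"syllable": 0.3, "keyword": 0.5, "hypernym": 0.2}  # assumed weights

def cosine(a, b):
    """Cosine similarity between two bags of terms."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
           math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def retrieve(query):
    """Rank documents by a weighted sum of per-level cosine similarities."""
    scores = {
        doc_id: sum(LEVEL_WEIGHTS[level] * cosine(query.get(level, []), terms)
                    for level, terms in levels.items())
        for doc_id, levels in DOCS.items()
    }
    # A semantic verification step would re-rank this list (e.g., by similarity
    # in a latent semantic space) before returning it to the user.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

query = {"syllable": ["tai", "wan"], "keyword": ["Taiwan"], "hypernym": ["region"]}
print(retrieve(query))   # news_001 ranks first
```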