| Graduate Student: | 莊則敬 Chuang, Ze-Jing |
|---|---|
| Thesis Title: | 針對聽語障人士之語音及手語處理技術之研究 A Study on Speech and Sign Language Processing for Speech/Hearing Impaired |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Degree: | Doctor (Ph.D.) |
| Department: | 電機資訊學院 - 資訊工程學系 Department of Computer Science and Information Engineering |
| Year of Publication: | 2007 |
| Graduation Academic Year: | 95 |
| Language: | Chinese |
| Number of Pages: | 92 |
| Keywords (Chinese): | multimedia representation, computer-assisted system, speech/hearing impaired, pronunciation learning and rectification, virtual character synthesis, sign language video synthesis |
| Keywords (English): | communication skill learning, pronunciation learning and rectification, multimedia representation, deaf, 3D virtual character, sign language synthesis |
With the rapid progress of real-time database access and multimedia presentation technologies, computer-assisted systems have become powerful tools for improving education quality, especially for communication-skill learning by speech/hearing-impaired students. Many communication-aid technologies have been applied to improve the communication skills of these students. Because speech/hearing-impaired people have difficulty communicating in spoken language, most communication-aid systems focus on strengthening alternative communication skills such as sign language, lip-reading, or other total communication techniques. These systems, however, still have several drawbacks. Taking natural-language grammar learning as an example, sign language learning systems that rely solely on text and icon-based input/output interfaces are usually ineffective: because such systems provide only fragmentary language units such as words, phrases, or clauses, learners cannot grasp the complete grammatical structure of natural language from the interfaces they provide. This dissertation proposes a framework for integrated communication-skill learning, training, and teaching designed for speech/hearing-impaired people, and describes a communication-aid system, TaiSign, built on this framework. TaiSign aims to combine different multimedia technologies to help speech/hearing-impaired people communicate naturally with hearing people.
The first communication-aid technology is a pronunciation learning and rectification system. For a personalized pronunciation learning system, efficiently evaluating a new user's mispronunciation patterns is essential. In the proposed approach, this evaluation is achieved by estimating the probabilities of personalized mispronunciation patterns. This dissertation presents a minimum-entropy-based method that obtains these probabilities effectively from a small number of evaluation sentences. In addition, because mispronunciations of the same type usually co-occur for most speakers, when the probability of one mispronunciation pattern changes, the probabilities of mispronunciation patterns of the same type are updated at the same time.
In addition, a video-based sign language synthesis system provides TaiSign with a natural sign language output interface and sign-based communication assistance between speech/hearing-impaired and hearing people. To generate smooth interpolated video segments between sign synthesis units, the minimum concatenation cost is first computed to determine the optimal cut points of two synthesis units; this cost is a linear combination of the distance, smoothness, and image distortion costs. Interpolated video between the two cut points is then generated by constructing a smooth curve in the hand-gesture image space with a Non-Uniform Rational B-spline (NURBS) curve, and the best hand images are selected along this curve to achieve hand-image interpolation. Once the individual image components are ready, an image-overlapping technique composites all components into a smooth interpolated video segment.
The last component is a 3D virtual character synthesis system, which provides TaiSign with a lip animation interface and visual feedback for pronunciation practice. The system first records video footage of a human speaker's lips and extracts a parameter sequence for each viseme synthesis unit. To synchronize lip motion with the speech output, a maximum direction change algorithm is used to determine the lip animation parameter changes according to the speaking rate. To obtain the smoothest possible lip motion at high speaking rates, the system combines the Bernstein-Bézier curve with the apparent motion property to derive four phoneme-dependent co-articulation interpolation functions.
Based on recent advances in real-time database access and multimedia representation, computers have become instruments that can improve education quality, especially for the deaf. Since deaf people have difficulty in speaking, numerous communication-aided approaches, such as sign language, finger spelling, lip-reading, and total communication, have been proposed and applied to enhance their language and communication skills. However, these systems have several drawbacks. For example, systems using only text or icon-based interfaces for sign language learning are ineffective: for learning sentence grammar, they provide only language fragments such as words, phrases, or fixed clauses, so users cannot readily understand the grammatical structure of a sentence translated from a sign language sequence. The approaches described in this dissertation cover communication skill learning, training, and teaching, and a final system, called TaiSign, is developed to integrate them. TaiSign is a novel communication skills learning system whose goal is to combine different multimedia technologies to assist deaf people in communicating naturally with hearing people.
The first technology is pronunciation learning and rectification. For a personalized Computer Assisted Pronunciation Training (CAPT) system, efficient evaluation of the mispronunciation patterns of a new user, that is, estimating the occurrence probabilities of personalized mispronunciation patterns, is important. This investigation presents an entropy-based approach that efficiently evaluates the personalized mispronunciation patterns using a minimum number of evaluation sentences. Moreover, since people who mispronounce a phone usually mispronounce other phones of a similar pronunciation type, the probabilities of all mispronunciation patterns of the same type that have high mutual information with the mispronounced phone are also updated. In addition, this investigation provides a learning phase in which users rectify and practice the mispronounced phones.
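To make the evaluation idea concrete, the following Python sketch shows one way an entropy-based sentence selection and a mutual-information-driven probability update could be organized. The function names, the greedy selection criterion, and the `weight` and `mi_threshold` parameters are illustrative assumptions, not the dissertation's actual formulation.

```python
import math

# Illustrative sketch (not the dissertation's exact method): greedily pick
# evaluation sentences whose phones currently carry the most uncertainty
# (highest entropy) about the user's mispronunciation patterns, then propagate
# each observation to phones that share a high mutual-information score.

def entropy(p):
    """Binary entropy of a mispronunciation probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def pick_next_sentence(sentences, mis_prob):
    """sentences: {sentence_id: [phone, ...]}, mis_prob: {phone: probability}.
    Returns the sentence whose phones have the largest total entropy, i.e. the
    sentence expected to be most informative for the new user."""
    return max(sentences,
               key=lambda s: sum(entropy(mis_prob[ph]) for ph in sentences[s]))

def update_probabilities(mis_prob, observed, mutual_info,
                         weight=0.5, mi_threshold=0.3):
    """observed: {phone: 1 if mispronounced else 0} from one scored sentence.
    Phones with high mutual information to an observed phone are nudged in the
    same direction, reflecting that errors of the same type tend to co-occur."""
    for ph, wrong in observed.items():
        mis_prob[ph] = (1 - weight) * mis_prob[ph] + weight * wrong
        for other, mi in mutual_info.get(ph, {}).items():
            if mi >= mi_threshold:
                mis_prob[other] = ((1 - weight * mi) * mis_prob[other]
                                   + weight * mi * wrong)
    return mis_prob
```

In such a setup, `pick_next_sentence` would be called repeatedly during the short evaluation phase, with `update_probabilities` applied after each scored sentence until the remaining entropy is low enough.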
The sign language synthesis system provides TaiSign with sign language output and deaf-to-hearing communication assistance. We concentrate on a spatial interpolation approach that uses a Non-Uniform Rational B-spline (NURBS) function to produce a smooth interpolation curve. To generate movement epenthesis, the beginning and end cut points are determined based on the concatenation cost, a linear combination of the distance, smoothness, and image distortion costs. An image component overlapping procedure is also employed to yield a smooth sign video output.
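The concatenation step can be pictured with a small Python sketch. The cost weights, the frame features (hand position, hand velocity, and a grey-level hand image), and the use of a SciPy B-spline in place of a full rational NURBS evaluator are assumptions made purely for illustration.

```python
import numpy as np
from scipy.interpolate import splprep, splev

# Illustrative sketch of the two concatenation steps described above.

def concatenation_cost(frame_a, frame_b, w_dist=1.0, w_smooth=1.0, w_distort=1.0):
    """Linear combination of distance, smoothness, and image-distortion costs
    between a candidate end frame of clip A and a candidate start frame of
    clip B. Each frame is a dict holding numpy arrays for the assumed features."""
    dist = np.linalg.norm(frame_a["hand_pos"] - frame_b["hand_pos"])
    smooth = np.linalg.norm(frame_a["hand_vel"] - frame_b["hand_vel"])
    distort = np.mean(np.abs(frame_a["hand_img"].astype(float) -
                             frame_b["hand_img"].astype(float)))
    return w_dist * dist + w_smooth * smooth + w_distort * distort

def best_cut_points(tail_a, head_b):
    """Search candidate end frames from the tail of clip A and candidate start
    frames from the head of clip B for the pair with minimum concatenation cost."""
    pairs = [(i, j) for i in range(len(tail_a)) for j in range(len(head_b))]
    return min(pairs, key=lambda ij: concatenation_cost(tail_a[ij[0]], head_b[ij[1]]))

def interpolation_path(positions, n_frames=10):
    """positions: (N, 2) array of hand positions around the two cut points
    (tail of clip A followed by head of clip B), N >= 4. Fits a smooth spline
    and samples n_frames positions along it; each sampled position would then
    be used to select the closest stored hand image for the in-between frame."""
    tck, _ = splprep([positions[:, 0], positions[:, 1]], s=0)
    u = np.linspace(0.0, 1.0, n_frames)
    xs, ys = splev(u, tck)
    return np.stack([xs, ys], axis=1)
```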
Finally, a 3D virtual character is integrated into TaiSign for pronunciation learning; its speech output is provided by a Text-To-Speech system. Motion parameters for each viseme are first constructed from video footage of a human speaker. To synchronize the parameter set sequence with the speech signal, a maximum direction change algorithm is proposed to select significant parameter sets according to the speech duration. Moreover, to improve the smoothness of the co-articulation part at a high speaking rate, four phoneme-dependent co-articulation functions are generated by integrating the Bernstein-Bézier curve with the apparent motion property.
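A minimal Python sketch of the two animation ideas, keyframe selection by maximum direction change and Bernstein-Bézier blending between viseme parameter sets, is given below. The selection rule and the single cubic blend are simplified stand-ins, not the four phoneme-dependent functions derived in the dissertation.

```python
import numpy as np

# Illustrative sketch: keep the frames where the lip-parameter trajectory turns
# most sharply, then blend between successive viseme parameter sets with a
# cubic Bernstein-Bezier curve.

def direction_change(params):
    """Angle (radians) between successive difference vectors of a parameter
    trajectory (shape (T, D)); large values mark frames where motion turns."""
    d = np.diff(params, axis=0)
    a, b = d[:-1], d[1:]
    cosang = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) *
                                      np.linalg.norm(b, axis=1) + 1e-9)
    return np.arccos(np.clip(cosang, -1.0, 1.0))

def select_keyframes(params, n_keep):
    """Keep the first and last frames plus the n_keep-2 frames with the largest
    direction change, so the animation fits the duration of the TTS output."""
    scores = direction_change(params)                      # frames 1 .. T-2
    inner = np.argsort(scores)[::-1][:max(n_keep - 2, 0)] + 1
    idx = np.sort(np.concatenate(([0], inner, [len(params) - 1])))
    return params[idx]

def bezier_blend(p0, p1, c0, c1, t):
    """Cubic Bernstein-Bezier interpolation between viseme parameter sets p0 and
    p1 with control sets c0 and c1, for t in [0, 1]."""
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * c0
            + 3 * (1 - t) * t ** 2 * c1 + t ** 3 * p1)
```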