| Graduate Student: | 金緒庭 Chin, Hsu-Ting |
|---|---|
| Thesis Title: | 改進的UIS-RNN結合降噪及文字糾錯之語者分離及語音辨識應用於多人聲會議紀錄 (Improved UIS-RNN Combined with Noise Reduction and Text Error Correction for Speaker Diarization and Speech Recognition Applied to Multi-Voice Conference Recording) |
| Advisor: | 王駿發 Wang, Jhing-Fa |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of Publication: | 2023 |
| Graduation Academic Year: | 111 (ROC calendar) |
| Language: | English |
| Number of Pages: | 67 |
| Chinese Keywords: | 噪音消除、語音嵌入、語者分離、文字糾錯 |
| Keywords: | noise reduction, speech embedding, speaker diarization, text error correction |
Countless meetings take place around the world every day, and their content often needs to be recorded. Most records are still produced by manual transcription, a process that is hurried and demanding, and the quality of the resulting records is inconsistent. In legal trials in particular, a court clerk must produce a complete transcript over several hours of proceedings and also label each speaker, which makes the clerk's workload heavy and reduces courtroom efficiency.
Speech recognition systems have been under development for many years, but the error rate of transcription applications remains high, and substantial post-editing is still required. Non-stationary noise during meetings further degrades speech-to-text accuracy. In addition, speech recognition alone cannot detect speaker changes, and this inability to distinguish speakers is another key factor limiting speech-to-transcript applications.
This thesis aims to build an audio speaker diarization and speech recognition system to address the high error rate and missing speaker labels of practical speech-to-transcript conversion. Through this system, recorders can reduce the burden of manual transcription, enhance the readability of automatic transcripts, and improve meeting efficiency.
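The title and abstract describe a four-stage pipeline: noise reduction, speaker diarization (the improved UIS-RNN), speech recognition, and text error correction, producing a speaker-attributed transcript. The sketch below shows only how such stages might be composed; every function name, signature, and the stub stages are hypothetical illustrations, not the thesis's actual models.

```python
# Hypothetical composition of the pipeline stages named in the abstract:
# noise reduction -> speaker diarization -> speech recognition -> text
# error correction. The real stages (DTLN-style denoiser, UIS-RNN
# diarizer, ASR model, BERT-based corrector) are replaced by callables.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Segment:
    speaker: str  # speaker label assigned by the diarization stage
    text: str     # recognized and corrected transcript for this segment


def run_pipeline(
    audio: List[float],
    denoise: Callable[[List[float]], List[float]],
    diarize: Callable[[List[float]], List[Tuple[str, int, int]]],
    recognize: Callable[[List[float]], str],
    correct: Callable[[str], str],
) -> List[Segment]:
    """Compose the four stages into a speaker-attributed transcript."""
    clean = denoise(audio)                      # stage 1: noise reduction
    segments: List[Segment] = []
    for speaker, start, end in diarize(clean):  # stage 2: diarization
        raw_text = recognize(clean[start:end])  # stage 3: speech recognition
        segments.append(Segment(speaker, correct(raw_text)))  # stage 4
    return segments


if __name__ == "__main__":
    # Toy stand-ins for demonstration only.
    audio = [0.0] * 100
    result = run_pipeline(
        audio,
        denoise=lambda a: a,
        diarize=lambda a: [("speaker_1", 0, 50), ("speaker_2", 50, 100)],
        recognize=lambda a: "helo world",
        correct=lambda t: t.replace("helo", "hello"),
    )
    for seg in result:
        print(f"{seg.speaker}: {seg.text}")
```

The design point is that each stage only consumes the previous stage's output, so denoising benefits both diarization and recognition, and the corrector operates purely on text.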
On-campus access: available from 2028-01-31.