| Graduate Student: | 洪千焙 Hong, Qian-Bei |
|---|---|
| Thesis Title: | 強健性語者嵌入學習與語音資訊相互作用於語者驗證之研究 (A Study on Robust Speaker Embedding Learning and Phonetic Information Interaction for Speaker Verification) |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Co-advisor: | 王新民 Wang, Hsin-Min |
| Degree: | Doctoral (Ph.D.) |
| Department: | College of Electrical Engineering and Computer Science - Multimedia System and Intelligent Computing Ph.D. Degree Program |
| Year of Publication: | 2023 |
| Academic Year of Graduation: | 111 (2022/2023) |
| Language: | English |
| Number of Pages: | 87 |
| Keywords (Chinese): | 語者驗證、父嵌入學習、部分自適應分數正規化、語音資訊 |
| Keywords (English): | speaker verification, parent embedding learning, partial adaptive score normalization, phonetic information |
Speaker verification has been an important task in artificial intelligence applications for years. Generalizing the model to handle mismatches between training and testing conditions, and to resist interference from other speakers’ voices, is crucial to speaker verification performance. Most speaker verification studies evaluate their models under favorable conditions such as clear pronunciation, high audio quality, and no interference. These studies often overlook the importance of generalization ability, and a model that generalizes poorly suffers performance degradation in complex real-world environments. Furthermore, speech content is closely related to the stability of speaker embeddings in speaker verification tasks. Natural speech signals contain many different factors, such as speech content, speaker identity, pitch, and emotion, all of which have a significant impact on the stability of the speaker representations extracted from them.
This dissertation focuses on constructing a robust speaker embedding learning system with score normalization in embedding similarity measurement. In practical applications, pre-trained embedding models are often used to evaluate speech signals recorded under mismatched conditions, so the generalization ability of the embedding model is very important. Therefore, this dissertation proposes a Parent Embedding Learning (PEL) approach that improves the model’s generalization ability against recording-scenario and language mismatches between training and testing conditions. The PEL training strategy attaches two classifiers to the same training task, leveraging the generalization ability of the shared structure to improve the extracted speaker embeddings. In addition, most speaker verification studies are conducted on interference-free speech data and often achieve good performance; in real-world environments, however, recorded signals usually contain various interferences that greatly degrade speaker verification performance. For these conditions, a Partial Adaptive Score Normalization (PAS-Norm) approach is proposed to improve the resistance to other speakers’ voices in embedding similarity measurement. PAS-Norm, applied in the similarity comparison of speaker embeddings, scores a trial by the maximum similarity over partial segments of the signal, reducing the influence of interference from other speakers.
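To make these two mechanisms concrete, the sketches below illustrate them in Python. The first is a minimal sketch of the dual-classifier idea behind PEL, assuming a generic encoder: the names `DualHeadSpeakerNet` and `pel_loss`, the layer sizes, and the simple feed-forward encoder are placeholders for illustration, not the dissertation’s actual TDNN/x-vector-style backbone.

```python
import torch
import torch.nn as nn

class DualHeadSpeakerNet(nn.Module):
    """Shared embedding extractor with two classifier heads trained on the
    same speaker-classification labels (a sketch of the PEL training idea)."""

    def __init__(self, feat_dim: int, embed_dim: int, num_speakers: int):
        super().__init__()
        # Placeholder encoder; the dissertation builds on TDNN-style backbones.
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, embed_dim), nn.ReLU(),
        )
        # Two independent classifiers over the shared embedding.
        self.head_a = nn.Linear(embed_dim, num_speakers)
        self.head_b = nn.Linear(embed_dim, num_speakers)

    def forward(self, x: torch.Tensor):
        emb = self.encoder(x)  # speaker embedding
        return emb, self.head_a(emb), self.head_b(emb)


def pel_loss(logits_a, logits_b, labels, criterion=nn.CrossEntropyLoss()):
    # Both heads are trained on the same labels, so the shared encoder must
    # satisfy two classifiers at once, encouraging more general embeddings.
    return criterion(logits_a, labels) + criterion(logits_b, labels)
```

The second sketches the partial maximum-similarity scoring behind PAS-Norm together with an AS-Norm-style cohort normalization; the segmentation scheme, cohort handling, and normalization statistics here are illustrative assumptions, and the exact PAS-Norm formulation follows the dissertation.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def partial_max_score(enroll_emb: np.ndarray, partial_test_embs) -> float:
    """Score a trial as the maximum similarity between the enrollment embedding
    and embeddings extracted from partial segments of the test utterance, so that
    segments dominated by an interfering speaker do not drag the score down."""
    return max(cosine(enroll_emb, p) for p in partial_test_embs)

def adaptive_norm(score: float, enroll_emb: np.ndarray, cohort_embs, top_k: int = 200) -> float:
    """AS-Norm-style normalization of the partial-max score against a cohort
    (the statistics actually used by PAS-Norm may differ)."""
    cohort_scores = np.sort([cosine(enroll_emb, c) for c in cohort_embs])[::-1][:top_k]
    return (score - cohort_scores.mean()) / (cohort_scores.std() + 1e-8)
```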
Furthermore, since phonetic features have been widely used to improve speaker embedding learning, this dissertation provides an in-depth analysis of the performance gain that phonetic information brings to speaker recognition and further investigates the ambiguity that phonetic information introduces into the training of speaker embedding models. A Decomposition and Reorganization of Phoneme TDNN (DROP-TDNN) model is proposed to remove the influence of phonetic information from low-level features and thereby extract more discriminative speaker-specific features.
The methods proposed in this dissertation can be applied to individual processing stages of speaker verification systems. In other words, they do not require changing the backbone structure of the speaker embedding model and can easily be adopted in different speaker verification studies. Experimental results show that applying the proposed methods to current state-of-the-art speaker verification systems further improves their performance.