| Graduate Student: | 周至宏 Chou, Chih-Hung |
|---|---|
| Thesis Title: | 高效率語音與語者辨識演算法及積體電路架構設計之研究 A Study on High Efficient Speech and Speaker Recognition Algorithm and VLSI Architecture Design |
| Advisor: | 王駿發 Wang, Jhing-Fa |
| Degree: | Doctor (博士) |
| Department: | College of Electrical Engineering and Computer Science, Department of Electrical Engineering |
| Year of Publication: | 2016 |
| Graduation Academic Year: | 104 |
| Language: | English |
| Number of Pages: | 89 |
| Chinese Keywords: | VLSI架構設計、語音辨識、語者辨識、語音信號處理 |
| English Keywords: | VLSI architecture design, speech recognition, speaker recognition, speech signal processing |
This study proposes a series of high-efficiency solutions for voice-control tasks, including speech recognition, speaker recognition, out-of-vocabulary (OOV) detection, and out-of-speaker (OOS) detection. The improvements span three levels: the algorithm level (ECWRT, CWCWRT, BHC, a two-stage out-of-vocabulary detection method, and a statistics-based out-of-speaker detection method), the circuit-architecture level (ULQAB, the ERT core, MFSR, CPE, and MRC), and the system level. The proposed methods are applied in four different voice-control systems, which serve as the basis for performance evaluation.
In the first system, the ULQAB and ECWRT methods are used to implement an embedded speech recognition system. The system runs on a 16-bit microprocessor platform (GPCE063A) with a 49.152 MHz clock and an 8 kHz sampling rate. Experiments show that, over 30 candidate vocabulary words, the platform achieves a 95.22% recognition rate using only 0.75 kB of system memory.
In the second system, a speaker-independent speech recognition system is designed as an integrated circuit. To achieve a low-cost chip design, ULQAB is applied to realize speech-feature computation under constrained memory. Compared with a system without ULQAB, the autocorrelation computation saves 96.48% of the memory requirement and responds 256 times faster (in a 256-sample/frame system). In addition, the improved LPCC feature-extraction procedure is mapped onto an optimized circuit design that reduces the dedicated circuit area, computational requirements, and critical path. The chip was fabricated through the Chip Implementation Center (CIC) in TSMC's 90 nm process: a 1.16 × 1.16 mm² die in a 48-pin package with 43,609 logic gates, a 10 MHz operating clock, and an 8 kHz sampling rate.
In the third system, out-of-speaker (OOS) and out-of-vocabulary (OOV) detection algorithms are proposed and, together with automatic speaker-speech recognition (ASSR), implemented on an FPGA platform. The ASSR system comprises four parts: 1) a signal pre-processing module, 2) a speaker-model training module, 3) a speaker and speech recognition module, and 4) an OOS and OOV detection module. The proposed OOS algorithm overcomes the drawback of conventional methods, which require a pre-trained background model to distinguish unenrolled users. The OOS detection algorithm is tightly embedded in the proposed speaker-classification method, Binary Halved Clustering (BHC), adding out-of-speaker detection capability to the speaker recognizer. Experiments show that the system achieves an 87.3% speaker recognition rate and an 86.6% speech recognition rate, while the OOS and OOV detection experiments reach 88.3% and 80.5%, respectively. The system is implemented on an ALTERA DE2-70 FPGA at a 50 MHz clock, using only 13,355 logic elements and 40.41 kB of memory.
The fourth system is an integrated-circuit design of a speaker-speech recognition system built around an extraction, recognition, and training (ERT) processing core. For integrated-circuit design, hardware cost and computational time complexity are always key issues; this work addresses them at two levels, the algorithm level and the circuit-architecture level. At the algorithm level, binary-halved clustering (BHC) is proposed to reduce time complexity and memory requirements. At the architecture level, the ERT processing core is proposed based on data-dependency analysis and hardware-reuse considerations, substantially reducing computation time and circuit cost. The chip was synthesized, placed, and routed in TSMC's 90 nm process. To measure the performance improvement that the integrated BHC algorithm brings to the system, a training simulation with nine speaker models was carried out. Recognition was also simulated for both speech and speaker recognition, achieving 93.38% and 87.56% recognition rates, respectively. The IC design uses 396 k gate counts with an average power consumption of 8.74 mW. Comparisons with related systems indicate that the proposed system outperforms conventional systems with similar functionality.
This study proposes a series of solutions for high-efficiency voice-control tasks, including speech recognition, speaker recognition, out-of-vocabulary detection, and out-of-speaker detection. Three levels of improvement are involved: the algorithm level (ECWRT, CWCWRT, BHC, a two-stage out-of-vocabulary detection method, and a statistics-based out-of-speaker detection method), the architecture level (ULQAB, the ERT core, MFSR, CPE, and MRC), and the system level. Applying the proposed methods, four types of voice-control systems are implemented.
In the first system, the Ultra-Low Queue-Accumulator Buffering (ULQAB) method and the Enhanced Cross-Words Reference Templates (ECWRT) method are adopted to implement an embedded automatic speech recognition (ASR) system. The system is implemented on a 16-bit GPCE063A microprocessor platform with a 49.152 MHz clock and an 8 kHz sampling rate. Experimental results demonstrate that recognition accuracy reaches 95.22% in a 30-sentence speaker-independent embedded ASR task, using only 0.75 kB of RAM.
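Cross-words reference template methods such as ECWRT build on dynamic time warping (DTW), which aligns an input utterance's feature sequence against stored templates. As a rough illustration of that matching core (not the thesis' actual cost function, pruning, or template construction), a minimal DTW distance might look like:

```python
def dtw_distance(seq_a, seq_b):
    """Return the DTW alignment cost between two 1-D feature sequences.

    Illustrative sketch only: real ASR systems compare multi-dimensional
    feature vectors (e.g. LPCC frames) and add path constraints.
    """
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    # cost[i][j] = best cumulative cost aligning seq_a[:i] with seq_b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])  # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]
```

The recognized word is the template with the smallest alignment cost; the O(nm) table is what memory-efficient buffering schemes aim to shrink on embedded targets.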
In the second system, a speaker-independent ASR system is implemented as a chip. To achieve a low-cost chip design, the ULQAB method is applied to realize the feature-extraction operation within limited memory resources. In total, 96.48% of the memory requirement of the autocorrelation computation is saved, and the per-frame response time of the autocorrelation computation improves 256-fold. In addition, the improved linear prediction cepstral coefficient (LPCC) feature-extraction procedure and its optimized circuits reduce the dedicated hardware area, computational requirements, and critical path. The chip was taped out in TSMC's 90 nm process via the Chip Implementation Center (CIC). The chip area is 1.16 × 1.16 mm² in a 48-pin package, the gate count is 43,609, and the power dissipation is 1.006 mW. The operating frequency is 10 MHz and the sampling rate is 4 kHz.
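The LPCC pipeline referenced above follows a standard chain: autocorrelation of the frame, Levinson-Durbin recursion for the LPC coefficients, then the LPC-to-cepstrum recursion. The sketch below shows that textbook chain in floating point (function names and the simple `a_k` sign convention are my own; the thesis' fixed-point, memory-reduced circuit realization is not reproduced):

```python
def autocorrelation(x, p):
    """r[k] = sum_n x[n] * x[n+k] for lags k = 0..p (the ULQAB-optimized step)."""
    N = len(x)
    return [sum(x[n] * x[n + k] for n in range(N - k)) for k in range(p + 1)]

def levinson_durbin(r, p):
    """Solve the normal equations for LPC coefficients a[1..p] from r[0..p]."""
    a = [0.0] * (p + 1)
    e = r[0]  # prediction error energy
    for i in range(1, p + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / e  # reflection coefficient
        a_new = a[:]
        a_new[i] = k
        for j in range(1, i):
            a_new[j] = a[j] - k * a[i - j]
        a = a_new
        e *= (1.0 - k * k)
    return a[1:]

def lpc_to_cepstrum(a, q):
    """LPCC recursion: c[n] = a[n] + sum_{k=1}^{n-1} (k/n) * c[k] * a[n-k]."""
    p = len(a)
    c = [0.0] * (q + 1)
    for n in range(1, q + 1):
        an = a[n - 1] if n <= p else 0.0
        c[n] = an + sum((k / n) * c[k] * (a[n - k - 1] if n - k <= p else 0.0)
                        for k in range(1, n))
    return c[1:]
```

Because the cepstral recursion reuses already-computed `c[k]` values, it maps naturally onto the kind of small, reusable multiply-accumulate datapath a dedicated chip favors.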
In the third system, out-of-speaker (OOS) and out-of-vocabulary (OOV) detection algorithms are proposed and implemented within an automatic speaker-speech recognition (ASSR) system on an FPGA platform. The ASSR system includes four parts: 1) pre-processing, 2) speaker-model training, 3) speaker and speech recognition, and 4) OOS and OOV detection. The proposed OOS algorithm overcomes the drawback of the traditional approach, which must train a universal background model (UBM) to verify unenrolled speakers against the training models. The proposed OOS detection is then embedded in the binary halved clustering (BHC) based ASSR system to improve recognition performance for both target and unenrolled speakers. Experimental results indicate that the proposed work achieves 87.3% and 86.7% speaker and speech recognition rates, respectively, while the OOS and OOV detection rates reach 88.3% and 80.5%, respectively. The system is implemented on an ALTERA DE2-70 at a 50 MHz working frequency, using only 13,355 logic elements and 40.41 kB of memory.
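The common core of OOS and OOV detection is open-set rejection: accept the best-matching enrolled model only when its score clears a rejection threshold, otherwise declare the input out-of-set. The thesis' statistical threshold derivation is not given in the abstract, so the snippet below is only a generic stand-in showing the decision structure (the function name and the fixed-threshold rule are illustrative assumptions):

```python
def classify_with_rejection(scores, threshold):
    """Open-set decision: pick the closest enrolled model, or reject.

    scores    -- dict mapping model name to a distance (lower = better match)
    threshold -- maximum distance still accepted as in-set
    Returns the best model name, or None for out-of-set input.
    """
    best = min(scores, key=scores.get)
    if scores[best] > threshold:
        return None  # no enrolled model is close enough: OOS / OOV
    return best
```

A statistics-based detector would replace the fixed `threshold` with one derived from the score distributions of enrolled versus impostor trials, which is what removes the need for a separately trained background model.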
The fourth work presents an automatic speaker-speech recognition (ASSR) system implemented as a chip with a built-in extraction, recognition, and training (ERT) core. For VLSI design (here, an ASSR system), hardware cost and time complexity are always important issues; this design improves both at two levels, algorithmic and architectural. At the algorithm level, a new binary-halved clustering (BHC) method is proposed to achieve low time complexity and a low memory requirement. At the architecture level, a new ERT core is proposed and implemented based on data dependency and a reuse mechanism to reduce both time and hardware cost. Finally, the chip implementation is synthesized, placed, and routed using the TSMC 90 nm technology library. To verify the performance of the proposed BHC method, a case study is performed with nine speakers. Moreover, validation of the ASSR system is examined in two parts, speech recognition and speaker recognition, where the proposed system achieves 93.38% and 87.56% recognition rates, respectively. Furthermore, the proposed ASSR chip comprises 396 k gate counts and consumes 8.74 mW. These results demonstrate that the proposed ASSR system outperforms conventional systems.
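The abstract does not spell out BHC's exact split rule, but halving-style clustering methods generally resemble the classical LBG binary-splitting scheme: start from one centroid, repeatedly split each centroid in two, and refine with a few k-means passes. The sketch below shows that scheme for scalar data as a point of reference, not as the thesis' BHC algorithm:

```python
def binary_split_clustering(data, levels, iters=5, eps=1e-3):
    """LBG-style binary splitting on scalar data.

    Doubles the codebook `levels` times (1 -> 2 -> 4 -> ...), refining
    with k-means after each split. Illustrative only; BHC's actual
    halving criterion and hardware mapping are not reproduced here.
    """
    centroids = [sum(data) / len(data)]  # level 0: global mean
    for _ in range(levels):
        # Split each centroid into two slightly perturbed copies.
        centroids = [c * (1 + eps) for c in centroids] + \
                    [c * (1 - eps) for c in centroids]
        for _ in range(iters):
            # Assign each point to its nearest centroid.
            buckets = [[] for _ in centroids]
            for x in data:
                idx = min(range(len(centroids)),
                          key=lambda i: abs(x - centroids[i]))
                buckets[idx].append(x)
            # Recompute centroids; keep the old one if a bucket is empty.
            centroids = [sum(b) / len(b) if b else centroids[i]
                         for i, b in enumerate(buckets)]
    return centroids
```

The appeal for a VLSI training core is that each halving level touches every sample a bounded number of times, keeping both runtime and the working-set memory predictable.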