
Author: 吳國吉 (Wu, Guo-Ji)
Title: 具二元對分分裂法之低成本語者及語音辨識系統晶片設計
(Low Cost Chip Design for Automatic Speaker and Speech Recognition System Using Binary Halved Clustering Method)
Advisor: 王駿發 (Wang, Jhing-Fa)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2014
Graduation Academic Year: 102 (ROC calendar)
Language: English
Number of Pages: 66
Keywords: speaker recognition, dynamic time warping, speech recognition, chip design
    This thesis proposes a low-cost, fast-trainable chip design for an automatic speaker and speech recognition system; its low cost and fast training make such a system practical and affordable in real-world applications. The design consists of four parts: a feature extraction module, a speaker model training module, a speaker recognition module, and a speech recognition module.
    The feature extraction module adopts Linear Prediction Cepstral Coefficients (LPCC) as the features of a speaker's utterance. The speech recognition part uses dynamic time warping (DTW) to identify the target speech. For speaker model training, a Binary Halved Clustering (BHC) method is proposed to generate speaker models; the regularity of binary-halved splitting lowers the computational complexity, saving 52% of chip area and 68% of response time while achieving a 90% recognition rate, effectively realizing a low-cost hardware design.
    The chip was taped out through the 90 nm process shuttle (TN90GUTM-103B) provided by the Chip Implementation Center (CIC) and Taiwan Semiconductor Manufacturing Company (TSMC). The chip area is 1.47 × 1.47 mm², packaged with 84 pins; the gate count is about 395,000, the power consumption is 8.74 mW, the operating frequency is 50 MHz, and the sampling rate is 16 kHz.

    This study proposes a low-cost and fast-trainable chip design for an automatic speaker-speech recognition (ASSR) system. The proposed system consists of four parts: a feature extraction module, a speaker model training module, a speaker recognition module, and a speech recognition module.
    Linear Predictive Cepstral Coefficients (LPCC) are adopted in the proposed feature extraction module. Speech recognition uses dynamic time warping (DTW) to classify the target speech. The proposed binary halved clustering (BHC) method uses binary-halved splitting to generate speaker models with low computational complexity. Compared with conventional works, simulation results indicate that the proposed hardware accelerator achieves 52% less cost and 68% less response time while reaching an ASSR accuracy of 90%, allowing the ASSR system to be implemented efficiently as a low-cost chip.
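    This record contains no implementation detail for the DTW matcher, so the following is only a minimal sketch of the classic DTW recursion mentioned above, assuming each utterance is stored as a sequence of LPCC frame vectors and that a Euclidean local distance is used; the function name and NumPy representation are illustrative and not taken from the thesis.

    import numpy as np

    def dtw_distance(ref, test):
        """Dynamic time warping distance between two feature sequences.

        ref, test: 2-D arrays of shape (num_frames, num_coeffs),
        e.g. one LPCC vector per speech frame.
        """
        n, m = len(ref), len(test)
        # Accumulated-cost matrix, initialised to infinity except the origin.
        acc = np.full((n + 1, m + 1), np.inf)
        acc[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                # Local distance between reference frame i and test frame j.
                d = np.linalg.norm(ref[i - 1] - test[j - 1])
                # Standard DTW recursion: extend the cheapest allowed predecessor.
                acc[i, j] = d + min(acc[i - 1, j],      # insertion
                                    acc[i, j - 1],      # deletion
                                    acc[i - 1, j - 1])  # match
        return acc[n, m]

    In such a scheme, an input word would be assigned to the reference template with the smallest DTW distance; the thesis's actual path constraints and distance measure may differ.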
    The design has been taped out in TSMC's 90 nm process. The chip area is 1.47 × 1.47 mm² in an 84-pin package, the gate count is about 395 K, and the power dissipation is 8.74 mW. The operating frequency is 50 MHz, and the sampling rate is 16 kHz.
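    The exact binary halved clustering procedure is not spelled out in this record, so the sketch below only illustrates a generic LBG-style binary-splitting codebook trainer as one plausible reading of "binary-halved splitting": the speaker model is assumed to be a small codebook of centroid vectors, and the function name, perturbation factor, and k-means refinement passes are assumptions rather than the author's actual method.

    import numpy as np

    def binary_split_codebook(features, num_splits=3, num_iters=5):
        """Illustrative binary-splitting codebook training (not the exact BHC algorithm).

        features: array of shape (num_frames, num_coeffs) gathered from a
        speaker's training utterances. Returns 2**num_splits centroids that
        serve as a simple speaker model.
        """
        codebook = [features.mean(axis=0)]  # start from the global centroid
        for _ in range(num_splits):
            # Halve every cell: split each centroid into a +/- perturbed pair.
            codebook = [c * (1 + eps) for c in codebook for eps in (0.01, -0.01)]
            for _ in range(num_iters):  # a few k-means refinement passes
                centers = np.array(codebook)
                labels = np.argmin(
                    np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2),
                    axis=1)
                codebook = [features[labels == k].mean(axis=0) if np.any(labels == k)
                            else centers[k] for k in range(len(centers))]
        return np.array(codebook)

    At recognition time, a test utterance could then be scored against each speaker's codebook, for example by the average distance of its frames to their nearest centroids; this decision rule is likewise an assumption, not the thesis's specification.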

    Abstract (Chinese)
    Abstract
    Acknowledgements
    Content
    Table List
    Figure List
    Chapter 1 Introduction
      1.1 Background
      1.2 Related Works
      1.3 Motivation
      1.4 Research Contributions
      1.5 Organization
    Chapter 2 Automatic Speaker-Speech Recognition System
      2.1 System Overview
      2.2 Preprocessing
        2.2.1 Voice Activity Detection
        2.2.2 Framing
      2.3 Feature Extraction
        2.3.1 Linear Predictive Coefficients
        2.3.2 Linear Prediction Cepstral Coefficients
      2.4 Dynamic Time Warping
      2.5 Enhanced Cross-words Reference Templates
      2.6 Binary Halved Clustering
    Chapter 3 Chip Design for Automatic Speaker-Speech Recognition System
      3.1 Architecture Overview
        3.1.1 Proposed System Flow
        3.1.2 Fixed-point Analysis
        3.1.3 Memory Requirement Evaluation
      3.2 Voice Activity Detection Part
      3.3 Autocorrelation Part
      3.4 Linear Prediction Cepstral Coefficients Part
      3.5 Dynamic Time Warping Part
      3.6 Binary Halved Clustering Part
        3.6.1 The Architecture of Binary Halved Clustering
        3.6.2 The Functionality of Binary Halved Clustering Module
        3.6.3 The Finite State Machine of Binary Halved Clustering
      3.7 Debug Interface Part
        3.7.1 Debug Interface Architecture
        3.7.2 System Observing Registers
        3.7.3 LPCC Debug Mode
        3.7.4 DTW Debug Mode
        3.7.5 BHC Debug Mode
    Chapter 4 Tape Out and Simulation Results
      4.1 Cell-Based Design Flow
      4.2 Specification of the Proposed Chip Design
      4.3 Simulation Environment
        4.3.1 Tools Summarize and Simulation Environmental Setting
        4.3.2 Training Database
      4.4 Simulation Results
        4.4.1 Simulation Waveform
        4.4.2 Simulation Results Overview
      4.5 Layout
      4.6 Measurement Considerations
        4.6.1 Debug Interface
        4.6.2 Testing Flow
        4.6.3 Chip Functional Test Illustration
    Chapter 5 Conclusions and Future Works
      5.1 Specification Comparison and Conclusions
      5.2 Future Works
    References


    Full-text availability: on campus, open access from 2019-08-29; off campus, not available.
    The electronic thesis has not yet been authorized for public release; please consult the library catalog for the printed copy.