| Graduate Student: | 王家慶 Wang, Jia-Ching |
|---|---|
| Thesis Title: | 語音辨識與壓縮架構設計之研究 (A Study of Architecture Design for Speech Recognition and Compression) |
| Advisor: | 王駿發 Wang, Jhing-Fa |
| Degree: | Doctor (博士) |
| Department: | 電機資訊學院 - 電機工程學系 (Department of Electrical Engineering) |
| Year of Publication: | 2003 |
| Academic Year of Graduation: | 91 |
| Language: | English |
| Number of Pages: | 108 |
| Chinese Keywords: | 語音辨識與壓縮 |
| Foreign Keywords: | Speech Recognition and Compression |
This dissertation presents hardware architectures and chip designs for speech recognition and speech compression systems. For speech recognition, it covers an MFCC feature-extraction hardware architecture and a discriminative Bayesian neural network (DBNN) chip. For speech compression, a 1.6 kbps low-bit-rate speech coder hardware architecture is proposed. Finally, we also design a portable voice memopad that provides both speech recognition and compression.
In this dissertation, we propose several VLSI architectures and implementations for speech recognition and compression systems. In the first part, the first chip for speech feature extraction based on the MFCC algorithm is proposed. The chip is implemented as an IP (intellectual property) block, making it suitable for adoption in a speech-recognition system-on-a-chip. The computational complexity and memory requirements of the MFCC algorithm are analyzed in detail and greatly reduced. A hybrid table-lookup scheme is presented to evaluate the elementary functions in the MFCC algorithm. Fixed-point arithmetic is adopted to reduce cost, guided by an accuracy study of finite-word-length effects. Finally, the area-efficient design is successfully implemented on a single Xilinx XC4062XL FPGA.
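For illustration, the sketch below shows one way a hybrid table-lookup scheme can evaluate an elementary function in fixed point: a small coarse table combined with linear interpolation on the low-order input bits. The logarithm target, the Q15 format, and the 64-entry table are assumptions made for the example, not the design reported in the thesis.

```c
/*
 * A minimal sketch of a hybrid table-lookup scheme for an elementary
 * function in fixed point, assuming a Q15 format, a 64-entry coarse
 * table, and linear interpolation; none of these choices is taken from
 * the thesis itself.
 */
#include <stdint.h>
#include <stdio.h>
#include <math.h>

#define FRAC_BITS  15                  /* Q15 input: x in [0, 1)        */
#define TABLE_BITS 6                   /* 64-entry coarse table         */
#define TABLE_SIZE (1 << TABLE_BITS)

static int32_t log1p_table[TABLE_SIZE + 1];   /* ln(1 + i/64) in Q15    */

static void init_table(void)
{
    for (int i = 0; i <= TABLE_SIZE; i++)
        log1p_table[i] = (int32_t)lround(
            log(1.0 + (double)i / TABLE_SIZE) * (1 << FRAC_BITS));
}

/* ln(1+x) for Q15 x: a coarse table entry refined by linear
 * interpolation on the low-order bits, trading a large ROM for one
 * extra multiply and add per lookup. */
static int32_t fixed_log1p(int32_t x_q15)
{
    int32_t idx  = x_q15 >> (FRAC_BITS - TABLE_BITS);             /* high bits */
    int32_t frac = x_q15 & ((1 << (FRAC_BITS - TABLE_BITS)) - 1); /* low bits  */
    int32_t y0 = log1p_table[idx];
    int32_t y1 = log1p_table[idx + 1];
    return y0 + (int32_t)(((int64_t)(y1 - y0) * frac)
                          >> (FRAC_BITS - TABLE_BITS));
}

int main(void)
{
    init_table();
    int32_t x = 1 << (FRAC_BITS - 1);                  /* x = 0.5 in Q15 */
    printf("approx ln(1.5) = %f, exact = %f\n",
           fixed_log1p(x) / (double)(1 << FRAC_BITS), log(1.5));
    return 0;
}
```

The same structure applies to the other elementary functions in the front end; only the table contents and the input normalization change.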
In addition, we present an efficient VLSI architecture for a stand-alone speech recognition system based on a discriminative Bayesian neural network (DBNN). For the recognition phase, the architecture of the Bayesian distance unit (BDU) is constructed first. In association with the BDU, we propose a template-serial architecture for path-distance accumulation to perform the recognition procedure. A corresponding architecture is also developed to accelerate the discriminative training procedure; it contains an intelligent look-up table for the sigmoid function. Compared with the traditional one-table method, the memory size is reduced drastically with only a slight loss of accuracy. By combining the proposed hardware accelerators with a cost-efficient programmable core, we obtain the best of both programmable and application-specific architectures in terms of performance, design complexity, and flexibility.
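One common way to shrink a sigmoid table, and a plausible reading of the intelligent look-up idea, is to store only the positive half of the input range, recover the negative half through the symmetry σ(−x) = 1 − σ(x), and saturate outside a finite window. The sketch below follows that approach; the table size, input range, and Q15 output format are illustrative assumptions rather than the chip's parameters.

```c
/*
 * A minimal sketch of a reduced-memory sigmoid look-up table: only the
 * positive half of the range is stored, the negative half is mirrored,
 * and inputs beyond |x| >= 8 saturate.  Table size and Q-formats are
 * assumptions for illustration, not the chip's actual design.
 */
#include <stdint.h>
#include <stdio.h>
#include <math.h>

#define SIG_ENTRIES 256                 /* covers x in [0, 8)          */
#define SIG_STEP    (8.0 / SIG_ENTRIES) /* input resolution            */
#define Q           15                  /* Q15 output format           */

static uint16_t sig_table[SIG_ENTRIES];

static void init_sigmoid_table(void)
{
    for (int i = 0; i < SIG_ENTRIES; i++) {
        double x = i * SIG_STEP;
        sig_table[i] = (uint16_t)lround((1 << Q) / (1.0 + exp(-x)));
    }
}

/* Approximate sigma(x) in Q15; the input stays in double only to keep
 * the sketch short (on chip it would also be fixed point). */
static uint16_t sigmoid_q15(double x)
{
    double ax = fabs(x);
    uint16_t y;
    if (ax >= 8.0)
        y = (1 << Q) - 1;                         /* saturated region  */
    else
        y = sig_table[(int)(ax / SIG_STEP)];
    return (x >= 0.0) ? y : (uint16_t)((1 << Q) - y);  /* mirror half  */
}

int main(void)
{
    init_sigmoid_table();
    printf("sigma(-1.0) ~= %f, exact = %f\n",
           sigmoid_q15(-1.0) / (double)(1 << Q), 1.0 / (1.0 + exp(1.0)));
    return 0;
}
```

Because the stored half-table is all that a mirrored lookup needs, the memory is roughly halved relative to a single table spanning both signs, at the cost of one subtraction per negative input.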
In the speech coding part, we present a single-chip design for a low-bit-rate speech vocoder. This coder produces fine-quality synthesized speech even though the bit rate is only 1.6 kbps. For efficiency, we designed separate dedicated architectures for the encoding and decoding modules. To lower the cost, these architectures are integrated by resource sharing, requiring only one multiplier and one adder. The proposed design was experimentally verified via semi-custom chips using 0.35 µm CMOS single-poly four-metal technology, with a die size of approximately 2.26 × 2.25 mm².
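The resource-sharing idea can be pictured in software by expressing both sides of the coder in terms of one multiply-accumulate routine, the way a single multiplier/adder datapath would be time-multiplexed between encoder and decoder. The routines below (autocorrelation for the encoder, an all-pole synthesis step for the decoder) are illustrative assumptions, not the coder's actual scheduling.

```c
/*
 * A minimal sketch of resource sharing: both the encoder-side
 * autocorrelation and the decoder-side LPC synthesis are written only
 * in terms of one shared multiply-accumulate routine, mirroring a
 * datapath with a single multiplier and adder.
 */
#include <stdio.h>

/* The one shared multiplier/adder pair. */
static double mac(double acc, double a, double b) { return acc + a * b; }

/* Encoder side: one autocorrelation lag of a frame, built from mac(). */
static double autocorr(const double *x, int n, int lag)
{
    double r = 0.0;
    for (int i = lag; i < n; i++)
        r = mac(r, x[i], x[i - lag]);
    return r;
}

/* Decoder side: one output sample of an all-pole synthesis filter,
 * likewise built only from mac(). */
static double lpc_synth_sample(const double *a, const double *past,
                               int order, double excitation)
{
    double y = excitation;
    for (int k = 0; k < order; k++)
        y = mac(y, -a[k], past[k]);   /* y = e - sum_k a[k] * y[n-1-k] */
    return y;
}

int main(void)
{
    double frame[8] = {0.1, 0.4, -0.2, 0.3, 0.0, -0.1, 0.2, 0.05};
    double a[2] = {-0.9, 0.4}, past[2] = {0.0, 0.0};
    printf("R(1) = %f\n", autocorr(frame, 8, 1));
    printf("y[0] = %f\n", lpc_synth_sample(a, past, 2, 0.5));
    return 0;
}
```

In hardware, the same effect is obtained by scheduling both loops onto the shared arithmetic unit so that only one multiplier and one adder are ever instantiated.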
Finally, this dissertation presents the design of a speech recognition and compression chip for portable memopad devices, especially suitable for use by the visually impaired. The proposed chip design is based on several cores that can be regarded as intellectual property (IP) cores for a variety of speech-related application systems. A cepstrum-extraction core and a dynamic time warping core are designed to map the speech recognition algorithms. In the cepstrum-extraction core, a novel architecture computes the autocorrelation between overlapping frames using two pairs of shift registers and an intelligent accumulation procedure. The architecture of the dynamic time warping core uses only a single processing element and is based on our extensive study of the relationships among the nodes in the dynamic time warping lattice. Bit rate is the key factor affecting the memory size for speech compression; therefore, a very low-bit-rate speech coder is used. The speech coder exploits a line-spectrum-based interpolation method, which yields fine-quality synthesized speech despite the low bit rate of 1.6 kbps. The 1.6 kbps vocoder core is cost-effective and integrates both the encoder and decoder algorithms. The proposed design has been tested via hardware simulations on Xilinx Virtex-series FPGAs.
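As an illustration of how a single processing element can sweep the warping lattice, the sketch below evaluates dynamic time warping column by column so that only one column of path distances is kept at a time. The local-path rules, distance measure, and feature dimension are assumptions chosen for brevity, not the core's actual design.

```c
/*
 * A minimal sketch of dynamic time warping computed column by column,
 * the software analogue of a single processing element scanning the
 * lattice with only one column of partial path distances in storage.
 */
#include <stdio.h>
#include <float.h>

#define DIM 2   /* toy feature dimension, assumed for the example */

static double local_dist(const double *a, const double *b)
{
    double d = 0.0;
    for (int k = 0; k < DIM; k++)
        d += (a[k] - b[k]) * (a[k] - b[k]);
    return d;
}

/* DTW distance between test[0..nt-1] and ref[0..nr-1], nr <= 64. */
static double dtw(const double test[][DIM], int nt,
                  const double ref[][DIM],  int nr)
{
    double col[64], prev[64];                 /* one lattice column each */
    prev[0] = local_dist(test[0], ref[0]);
    for (int j = 1; j < nr; j++)
        prev[j] = prev[j - 1] + local_dist(test[0], ref[j]);

    for (int i = 1; i < nt; i++) {
        for (int j = 0; j < nr; j++) {
            double best = prev[j];                            /* horizontal */
            if (j > 0 && prev[j - 1] < best) best = prev[j - 1]; /* diagonal */
            if (j > 0 && col[j - 1]  < best) best = col[j - 1];  /* vertical */
            col[j] = best + local_dist(test[i], ref[j]);
        }
        for (int j = 0; j < nr; j++)
            prev[j] = col[j];                 /* reuse the single column */
    }
    return prev[nr - 1];
}

int main(void)
{
    double test[3][DIM] = {{0, 0}, {1, 1}, {2, 2}};
    double ref[3][DIM]  = {{0, 0}, {1, 1}, {2, 2}};
    printf("DTW distance = %f\n", dtw(test, 3, ref, 3));  /* 0 for a match */
    return 0;
}
```

Keeping only one column means the storage grows with the template length rather than with the full lattice, which is what allows a single processing element to serve the whole computation.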
On-campus access: available to the public from 2053-02-10.