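Graduate student: | 馬秀雯 Ma, Hsiu-Wen |
---|---|
Thesis title: | 使用音訊與視訊資料融合方法於生物驗證 Audio and visual data fusion for biometrics |
Advisor: | 簡仁宗 Chien, Jen-Tzung |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
Year of publication: | 2004 |
Academic year of graduation: | 92 (ROC calendar) |
Language: | Chinese |
Number of pages: | 104 |
Chinese keywords: | 最大相似度 (maximum likelihood), 對數相似度 (log-likelihood), 階層式高斯混合模型 (hierarchical Gaussian mixture model), 資料融合 (data fusion) |
English keywords: | hierarchical Gaussian mixture model, data fusion, maximum likelihood, log-likelihood |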
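With today's advanced information technology, correctly verifying a user's identity is an important issue for protecting personal privacy and information security. In this thesis, we address a visual biometric, the face image, and an auditory biometric, speech, and propose a data fusion strategy based on a hierarchical classifier built from Gaussian mixture models (GMMs). The goal is more reliable and more robust verification, and thereby better system performance.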
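We first build a Gaussian mixture model for each data type. The log-likelihoods of each data type's data under its first-layer GMM are then used to build a second-layer GMM, so that the possible random variations (randomness) are represented by Gaussian mixture models. When one subsystem is disturbed by noise and its recognition rate degrades, or when the recognizers' results differ greatly, the second-layer GMM classifier can take these differences into account, compensate with the results of the other subsystems, and reach a correct decision. In this thesis, we build the hierarchical classifier under the maximum likelihood criterion and use the EM algorithm to derive the optimal parameter set. We also develop a minimum classification error (MCE) classifier to obtain a discriminative training effect for data fusion.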
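The two-layer construction described above can be written compactly. The notation below is our own and is only a sketch for illustration; the symbols $\lambda_m$, $\Lambda$, $K_m$, $J$ and the restriction to two modalities (audio $a$, visual $v$) are assumptions, not the thesis's definitions.

```latex
\begin{align*}
% First layer: one GMM per modality m for the claimed user
p(\mathbf{x}_m \mid \lambda_m) &= \sum_{k=1}^{K_m} w_{m,k}\,
  \mathcal{N}(\mathbf{x}_m;\, \boldsymbol{\mu}_{m,k}, \boldsymbol{\Sigma}_{m,k}),
  \qquad m \in \{a, v\} \\
% Integrated pattern: the stacked first-layer log-likelihoods
\mathbf{z} &= \bigl[\log p(\mathbf{x}_a \mid \lambda_a),\;
  \log p(\mathbf{x}_v \mid \lambda_v)\bigr]^{\top} \\
% Second layer: a GMM over the integrated pattern
p(\mathbf{z} \mid \Lambda) &= \sum_{j=1}^{J} \pi_j\,
  \mathcal{N}(\mathbf{z};\, \boldsymbol{\nu}_j, \boldsymbol{\Phi}_j)
\end{align*}
```

Under this reading, a modality degraded by noise shifts only its own coordinate of $\mathbf{z}$, and the second-layer mixture can still model the resulting pattern.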
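Experimental results show that applying the proposed method to face and speech databases yields good verification performance in experiments on a database of 32 persons. Combining the fusion of face verification and speaker verification with a verbal information verification system further allows the design of multi-stream, multi-modal verification that is both user-friendly and robust.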
In real applications, it is important to verify a user's identity correctly so that personal and private information stays safe and secure. This thesis proposes a hierarchical classifier of Gaussian mixture models (GMMs) that takes audio and visual data as inputs to build a multi-modal user authentication system.
The proposed hierarchical classifier has two layers of GMMs. Following the divide-and-conquer concept, the first-layer GMMs separately perform pattern classification for the different data streams. Their output likelihood streams serve as a new pattern that integrates the features from the different modalities and reflects the complex characteristics of a test user. Accordingly, we are motivated to present a second-layer GMM to classify this integrated pattern. When the observation data are degraded by noise or varying illumination, the proposed hierarchical classifier is able to combine the stream data effectively and make a correct decision. In this thesis, we present the maximum likelihood and the minimum classification error approaches to building the hierarchical GMM classifier. Importantly, we use the expectation-maximization (EM) algorithm to jointly estimate the optimal parameters in the different layers.
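A minimal Python sketch of the two-layer idea follows, under several assumptions that are ours rather than the thesis's: the data are synthetic 2-D features, the GMMs use diagonal covariances, the layers are trained one after the other (the abstract describes joint estimation), and the final decision compares a client second-layer model against a hypothetical impostor model with a threshold of 0. The names `DiagGMM`, `score_vector` and `verify` are illustrative only.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


class DiagGMM:
    """Small diagonal-covariance GMM fitted by EM (maximum likelihood)."""

    def __init__(self, n_components=2, n_iter=30, seed=0, var_floor=1e-3):
        self.k, self.n_iter = n_components, n_iter
        self.seed, self.var_floor = seed, var_floor

    def _component_logs(self, x):
        return [math.log(w) + log_gauss_diag(x, m, v)
                for w, m, v in zip(self.weights, self.means, self.vars)]

    def fit(self, data):
        n, dim = len(data), len(data[0])
        rng = random.Random(self.seed)
        # Initialise means from random samples, variances from the global spread.
        self.weights = [1.0 / self.k] * self.k
        self.means = [list(x) for x in rng.sample(data, self.k)]
        g_mean = [sum(x[d] for x in data) / n for d in range(dim)]
        g_var = [max(sum((x[d] - g_mean[d]) ** 2 for x in data) / n,
                     self.var_floor) for d in range(dim)]
        self.vars = [list(g_var) for _ in range(self.k)]
        for _ in range(self.n_iter):
            # E-step: posterior responsibility of each component for each sample.
            resp = []
            for x in data:
                logs = self._component_logs(x)
                norm = logsumexp(logs)
                resp.append([math.exp(v - norm) for v in logs])
            # M-step: re-estimate weights, means and (floored) variances.
            for j in range(self.k):
                nj = sum(r[j] for r in resp) + 1e-10
                self.weights[j] = nj / n
                self.means[j] = [sum(r[j] * x[d] for r, x in zip(resp, data)) / nj
                                 for d in range(dim)]
                self.vars[j] = [max(sum(r[j] * (x[d] - self.means[j][d]) ** 2
                                        for r, x in zip(resp, data)) / nj,
                                    self.var_floor) for d in range(dim)]
        return self

    def log_likelihood(self, x):
        return logsumexp(self._component_logs(x))


# Synthetic stand-ins for audio and face features (assumption, not MHMC data).
data_rng = random.Random(0)


def draw(center, n, dim=2):
    return [[data_rng.gauss(center, 1.0) for _ in range(dim)] for _ in range(n)]


client_audio, client_face = draw(0.0, 200), draw(0.0, 200)
impostor_audio, impostor_face = draw(4.0, 200), draw(4.0, 200)

# First layer: one GMM per modality, trained on the client's data.
audio_gmm = DiagGMM().fit(client_audio[:100])
face_gmm = DiagGMM().fit(client_face[:100])


def score_vector(audio_x, face_x):
    """Stack the first-layer log-likelihoods into one integrated pattern."""
    return [audio_gmm.log_likelihood(audio_x), face_gmm.log_likelihood(face_x)]


# Second layer: GMMs over the integrated patterns of clients and impostors.
client_z = [score_vector(a, f)
            for a, f in zip(client_audio[100:], client_face[100:])]
impostor_z = [score_vector(a, f) for a, f in zip(impostor_audio, impostor_face)]
client_layer2 = DiagGMM().fit(client_z)
impostor_layer2 = DiagGMM().fit(impostor_z)


def verify(audio_x, face_x, threshold=0.0):
    """Accept when the second-layer log-likelihood ratio reaches the threshold."""
    z = score_vector(audio_x, face_x)
    llr = client_layer2.log_likelihood(z) - impostor_layer2.log_likelihood(z)
    return llr >= threshold
```

Calling `verify(audio_x, face_x)` with one feature vector per modality returns a boolean accept/reject decision. Raising `threshold` trades false acceptances for false rejections.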
Experimental results show that the hierarchical GMM classifier works well on the MHMC multi-modal database. This database contains speech and face data from thirty-two persons: twenty-five males and seven females. We also find that the proposed classifier achieves robust data fusion performance in the presence of different noise and illumination conditions. The performance of both user verification and identification is investigated.