
Graduate student: Li, Ching-Chuan (李靜娟)
Thesis title: AI-based Dynamic Hand Gesture Recognition for MR Glasses (應用於MR眼鏡之AI動態手勢辨識系統)
Advisor: Yang, Jar-Ferr (楊家輝)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of publication: 2023
Academic year of graduation: 111
Language: English
Number of pages: 36
Keywords (Chinese): 深度學習, 卷積神經網路, 遞迴神經網路, 手勢辨識, 混合實境, 醫療手術
Keywords (English): deep learning, convolutional neural network, recurrent neural network, hand gesture recognition, mixed reality, medical surgery
    As demand for smart living grows, human-computer interaction is shifting from traditional device-based input to natural, intuitive body gestures. In this thesis, we propose an AI-based hand gesture recognition network that can be deployed on mixed reality (MR) glasses. The system consists primarily of a convolutional neural network (CNN) and a recurrent neural network (RNN), which capture the spatial and temporal information of gestures, respectively: the CNN serves as a spatial feature extractor, while the RNN acts as a temporal information analyzer. Experimental results on the EgoGesture dataset show that the proposed system achieves high prediction accuracy with a low parameter count. In addition, we introduce a two-stage model architecture and a structural re-parameterization technique to reduce resource consumption and enable real-time operation. In practice, we deploy the system on MR glasses to demonstrate its potential use in surgical operations. With dynamic gesture recognition, medical professionals can operate naturally and intuitively during surgery, improving the accuracy and efficiency of their work.
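    The structural re-parameterization mentioned in the abstract can be illustrated with a minimal sketch. This record does not describe the thesis's actual block design, so the example assumes a RepVGG-style block (a 3x3 branch, a 1x1 branch, and an identity shortcut) on a single channel, and it omits batch normalization. The function names `conv2d_same`, `multi_branch`, and `fuse_branches` are hypothetical and chosen only for illustration. The point is that the parallel branches used during training collapse into one 3x3 kernel at inference time, so the deployed network performs one convolution instead of three operations.

```python
# Sketch of structural re-parameterization (assumed RepVGG-style branch merging).
# Assumptions: single channel, no batch norm, zero padding; the thesis's real
# block layout is not given in this record.

def conv2d_same(image, kernel):
    """Cross-correlate a 2D list with an odd-sized square kernel ('same' zero padding)."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - r, x + dx - r
                    if 0 <= yy < h and 0 <= xx < w:  # outside the image counts as zero
                        acc += kernel[dy][dx] * image[yy][xx]
            out[y][x] = acc
    return out


def multi_branch(image, k3, k1):
    """Training-time form: 3x3 branch + 1x1 branch + identity shortcut."""
    a = conv2d_same(image, k3)
    b = conv2d_same(image, [[k1]])
    h, w = len(image), len(image[0])
    return [[a[y][x] + b[y][x] + image[y][x] for x in range(w)] for y in range(h)]


def fuse_branches(k3, k1):
    """Inference-time form: fold the 1x1 weight and the identity into the centre tap."""
    fused = [row[:] for row in k3]
    fused[1][1] += k1 + 1  # the 1x1 kernel and the identity both act on the centre pixel
    return fused


image = [[1, 2, 0, 3], [4, 1, 2, 1], [0, 5, 3, 2], [2, 2, 1, 4]]
k3 = [[1, 0, -1], [2, 1, -2], [1, 0, -1]]
k1 = 3

fused_kernel = fuse_branches(k3, k1)
branch_out = multi_branch(image, k3, k1)
fused_out = conv2d_same(image, fused_kernel)
print(branch_out == fused_out)  # the single fused convolution matches the three branches
```

    Because convolution is linear, the fusion is exact rather than approximate: the fused kernel costs one 3x3 convolution per output while reproducing the multi-branch result. In a multi-channel network the same idea applies per output channel, with batch-norm statistics folded into the kernel and bias before the branches are summed.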

    Abstract (Chinese) I
    Abstract II
    Acknowledgements III
    Contents IV
    List of Tables VI
    List of Figures VII
    Chapter 1 Introduction 1
      1.1 Research Background 1
      1.2 Motivations 2
      1.3 Thesis Organization 3
    Chapter 2 Related Work 5
      2.1 Dilated Convolution 5
      2.2 Gated Recurrent Unit 6
      2.3 Structural Re-parameterization 8
      2.4 Two-stage Hierarchical Architecture 10
    Chapter 3 The Proposed Dynamic Hand Gesture Recognition System for MR Glasses 12
      3.1 Overview of the Proposed System 13
      3.2 Network Architecture 14
        3.2.1 Feature Extractor 16
        3.2.2 Temporal Information Analyzer 18
      3.3 Training Loss Function 18
      3.4 The Proposed System Implemented in MR Glasses 19
    Chapter 4 Experimental Results 22
      4.1 Environment Settings 22
      4.2 Dataset 23
      4.3 Training Details 24
      4.4 Ablation Study and Comparing Results 24
      4.5 System Demonstration 28
    Chapter 5 Conclusions 29
    Chapter 6 Future Work 31
    References 33


    Full-text availability: on campus, open immediately; off campus, open immediately