Graduate student: | 朱孟淇 Chu, Meng-Chi |
---|---|
Thesis title: | 運用超解析度結合深度學習模型於課堂肢體分類 Integrating Super-Resolution Techniques with Deep Learning Models for Classroom Pose Classification |
Advisor: | 陳牧言 Chen, Mu-Yen |
Degree: | Master |
Department: | College of Engineering - Department of Engineering Science |
Year of publication: | 2025 |
Academic year of graduation: | 113 (ROC calendar; 2024-2025) |
Language: | Chinese |
Pages: | 61 |
Chinese keywords: | 課堂行為分析, 超解析度, 肢體預測, 深度學習, 影像分類 |
English keywords: | classroom behavior analysis, super-resolution, pose estimation, deep learning, image classification |
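As artificial intelligence and computer vision advance, their application in education is drawing growing attention, particularly in classroom behavior analysis. Classroom footage often suffers from insufficient resolution and background interference, so conventional recognition models are constrained by image quality and incomplete information, which lowers classification accuracy. To address this, the study proposes a data preprocessing strategy that combines super-resolution reconstruction with pose estimation and pairs it with several deep learning classification models, forming a complete experimental framework for classroom behavior recognition.
The study uses three super-resolution techniques, Blind Super-Resolution Generative Adversarial Network (BSRGAN), Real-World Enhanced Super-Resolution Generative Adversarial Networks (Real-ESRGAN), and Swin Transformer for Image Restoration (SwinIR); three pose estimation models, OpenPose, AlphaPose, and MediaPipe; and three deep learning models, ResNet-50, ConvNeXt-tiny, and InceptionNeXt-tiny. It compares how different combinations of data processing and model affect classification performance. The experiments are divided into four groups: no preprocessing, super-resolution only, pose estimation only, and a multi-module method that combines super-resolution with skeleton overlay.
The results show that super-resolution effectively improves classification accuracy, with Real-ESRGAN performing best. For pose estimation, using the complete set of detailed keypoints improves recognition over the basic keypoints alone, but overall accuracy is limited by excessive information loss. The multi-module method preserves both action and semantic information and performs best on the classification task: ConvNeXt-tiny combined with Real-ESRGAN and OpenPose (full skeleton) reaches the highest average accuracy of 92.33%. The study verifies the feasibility and benefit of integrating multiple preprocessing methods and models, and offers a reference for behavior monitoring and learning analytics in smart classrooms.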
With the advancement of artificial intelligence and computer vision, automatic classroom behavior analysis has become increasingly feasible. However, low image resolution and background interference often limit the performance of behavior recognition systems in real-world classroom environments.
To address these challenges, this study proposes an integrated preprocessing strategy that combines super-resolution and human pose estimation, evaluated across several deep learning models. Three super-resolution methods, Blind Super-Resolution Generative Adversarial Network (BSRGAN), Real-World Enhanced Super-Resolution Generative Adversarial Networks (Real-ESRGAN), and Swin Transformer for Image Restoration (SwinIR), and three pose estimation frameworks (OpenPose, AlphaPose, and MediaPipe) were combined with three classification models (ResNet-50, ConvNeXt-tiny, and InceptionNeXt-tiny). Four experimental settings were designed: a baseline without preprocessing, super-resolution only, pose estimation only, and a hybrid method that overlays skeletons on background-removed super-resolved images.
Results show that super-resolution preprocessing improves classification performance, especially with Real-ESRGAN. Pose-based inputs that use complete detailed keypoints improve recognition over inputs that use only basic keypoints, but they suffer from limited contextual information. The hybrid approach achieves the best overall performance by preserving both action and semantic features. The combination of ConvNeXt-tiny with Real-ESRGAN and OpenPose (with full body, face, and hand keypoints) achieved the highest average accuracy of 92.33%. This study highlights the effectiveness of combining preprocessing strategies and model architectures for robust behavior classification in classroom settings.
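The four experimental settings described in the abstract can be sketched as a grid of run configurations. This is a minimal illustration and not the thesis code: the function name `build_runs`, the dictionary keys, and the setting labels are hypothetical, and the sketch assumes each setting is evaluated over the full cross product of the named methods, which the abstract does not state. It also does not model the basic versus complete keypoint variants mentioned for pose estimation.

```python
from itertools import product

# Method names as listed in the abstract.
SR_METHODS = ["BSRGAN", "Real-ESRGAN", "SwinIR"]
POSE_METHODS = ["OpenPose", "AlphaPose", "MediaPipe"]
CLASSIFIERS = ["ResNet-50", "ConvNeXt-tiny", "InceptionNeXt-tiny"]


def build_runs():
    """Enumerate one configuration per (setting, preprocessing, classifier).

    Assumption: every setting is crossed with every classifier; the actual
    set of runs in the thesis may differ.
    """
    runs = []
    # Setting 1: baseline, no preprocessing.
    for clf in CLASSIFIERS:
        runs.append({"setting": "baseline", "sr": None, "pose": None, "classifier": clf})
    # Setting 2: super-resolution only.
    for sr, clf in product(SR_METHODS, CLASSIFIERS):
        runs.append({"setting": "sr_only", "sr": sr, "pose": None, "classifier": clf})
    # Setting 3: pose estimation only.
    for pose, clf in product(POSE_METHODS, CLASSIFIERS):
        runs.append({"setting": "pose_only", "sr": None, "pose": pose, "classifier": clf})
    # Setting 4: hybrid, skeleton overlaid on super-resolved images.
    for sr, pose, clf in product(SR_METHODS, POSE_METHODS, CLASSIFIERS):
        runs.append({"setting": "hybrid", "sr": sr, "pose": pose, "classifier": clf})
    return runs


if __name__ == "__main__":
    for run in build_runs():
        print(run)
```

Under these assumptions the grid contains 3 + 9 + 9 + 27 = 48 configurations, and the best-performing combination reported in the abstract (ConvNeXt-tiny, Real-ESRGAN, OpenPose) corresponds to one entry in the hybrid group.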