| Graduate student | 顏峻岳 Yen, Chun-Yueh |
|---|---|
| Thesis title | 結合YOLO-v3及非區域神經網路之物件辨識模型應用於深度影像之人體動作識別 / Object Detection Model Combined YOLO-v3 with Non-local Neural Network for Human Action Recognition in Depth Map |
| Advisor | 李順裕 Lee, Shuenn-Yuh |
| Co-advisor | 陳儒逸 Chen, Ju-Yi |
| Degree | Master |
| Department | College of Electrical Engineering and Computer Science, Department of Electrical Engineering |
| Year of publication | 2021 |
| Academic year of graduation | 109 |
| Language | English |
| Pages | 74 |
| Keywords (Chinese) | 人體動作識別、物件辨識、YOLO-v3、非區域神經網路、深度圖、飛時攝影、影像處理、長期健康照護 |
| Keywords (English) | human action recognition, object detection, YOLO-v3, non-local neural network, depth map, time-of-flight camera, image processing, long-term health care |
In recent years, the global trend of population aging has become increasingly pronounced, and the number of elderly people worldwide has reached a record high. As the elderly population grows, long-term care has become a problem that governments and private organizations alike must address. Within long-term elderly care, fall prevention and detection deserve the most attention. According to United Nations statistics, the elderly have the highest fall fatality rate, and the rate rises with age. Accordingly, if long-term care devices could detect getting up and falling and report the results to caregivers in real time, the person being cared for could be attended to promptly. Moreover, most current vision-based fall detection systems record conventional RGB video, which captures not only the user but also the user's private living space and appearance, raising privacy concerns. A system that provides video-based care while protecting user privacy would therefore be of great help to long-term care.
This thesis proposes a complete vision-based care system that combines a time-of-flight camera at the front end, a Raspberry Pi at the relay end, and a cloud server. The front end captures human actions as depth maps; the relay end performs data preprocessing such as image compression; and the images are then sent to the cloud over a wireless link for human action recognition. The thesis adopts YOLO-v3 as the basic architecture for human object detection and combines it with a non-local neural network, so that the model can learn temporal information and recognize human actions from eight consecutive frames. The thesis also builds a depth-image human action database containing 301 video files, totaling 10,150 frames, each recorded and labeled with common bedside actions: sitting, standing, lying, getting up, and falling. The proposed model achieves an mAP of 83.26%. Traditional object detection models can only solve two-dimensional problems; the proposed model extends object detection to three dimensions at only a small additional computational cost, enabling a wider range of applications.
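As a rough illustration of the relay-end step described above, the following Python sketch assumes the depth frame arrives as a 16-bit NumPy array, uses OpenCV for pseudo-color mapping and JPEG compression, and posts the result to a placeholder HTTP endpoint; the actual ToF SDK interface and the transport protocol used in the thesis are not specified here.

```python
# Minimal sketch of relay-end preprocessing (assumptions: NumPy depth frame,
# OpenCV colormap + JPEG, hypothetical HTTP endpoint on the cloud server).
import cv2
import numpy as np
import requests

SERVER_URL = "http://example-cloud-server/upload"  # placeholder endpoint (assumption)


def preprocess_and_send(depth_frame: np.ndarray) -> None:
    """Compress one depth frame and push it to the recognition server."""
    # Scale raw depth values into 0-255 so they can be colorized and JPEG-encoded.
    depth_8bit = cv2.normalize(depth_frame, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    # Map depth to a pseudo-color image; faces and room details are never captured.
    colorized = cv2.applyColorMap(depth_8bit, cv2.COLORMAP_JET)
    # JPEG compression keeps the wireless payload small.
    ok, jpeg_bytes = cv2.imencode(".jpg", colorized, [cv2.IMWRITE_JPEG_QUALITY, 80])
    if not ok:
        raise RuntimeError("JPEG encoding failed")
    requests.post(SERVER_URL, files={"frame": ("frame.jpg", jpeg_bytes.tobytes(), "image/jpeg")})
```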
The system and model proposed in this thesis start from human action recognition, especially getting-up and fall detection. It is hoped that these applications can be further refined in the future and contribute to both caregivers and care recipients in long-term care.
In recent years, population aging has become widespread across the world, and the population over 65 years old has hit a record high. With the growing elderly population, long-term care has become a serious issue for governments and civic organizations all over the world. In particular, it is important to detect whether an elderly person has fallen. According to statistics from the United Nations, the geriatric population has the highest fall fatality rate, and the older they are, the higher that rate becomes. Therefore, if long-term care devices can recognize human actions, including getting up and falling, the elderly can be attended to in time. However, modern vision-based fall detection devices use traditional RGB images, which record not only users' movements but also private information, such as their identity and personal home furnishings, so some users worry about an invasion of privacy. A system that provides video-based nursing while preserving user privacy would therefore be of great benefit to long-term care.
This thesis proposes a comprehensive vision-based nursing system consisting of time-of-flight (ToF) cameras at the front end, a Raspberry Pi at the edge, and a server in the cloud. First, the ToF cameras capture human actions as depth maps. Next, the Raspberry Pi performs image preprocessing and sends the images to the cloud over a wireless link. Finally, the cloud server carries out human action recognition. The thesis uses YOLO-v3 as the basic structure and combines it with non-local neural networks to learn temporal information, so that human actions can be determined from eight images in sequence. The thesis also builds a depth-map human action database containing 301 video clips, equivalent to 10,150 frames; each frame is recorded and labeled with one of several actions: sitting, standing, lying, getting up, and falling. The mAP of the proposed model is 83.26%. Traditional object detection models only deal with two-dimensional problems; with a little extra computational cost, the proposed model extends object detection to three dimensions, which opens the door to more applications.
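To make the temporal extension concrete, below is a minimal PyTorch sketch of a spatio-temporal non-local block in the embedded-Gaussian form described by Wang et al.; the channel width, feature-map size, and the exact insertion point inside the YOLO-v3 backbone are illustrative assumptions rather than the configuration reported in the thesis.

```python
# Minimal sketch of a spatio-temporal non-local block (embedded-Gaussian form).
import torch
import torch.nn as nn
import torch.nn.functional as F


class NonLocalBlock3D(nn.Module):
    """Self-attention over all (time, height, width) positions of a feature map."""

    def __init__(self, in_channels, inter_channels=None):
        super().__init__()
        self.inter_channels = inter_channels or max(in_channels // 2, 1)
        # 1x1x1 convolutions producing the query (theta), key (phi), and value (g) embeddings.
        self.theta = nn.Conv3d(in_channels, self.inter_channels, kernel_size=1)
        self.phi = nn.Conv3d(in_channels, self.inter_channels, kernel_size=1)
        self.g = nn.Conv3d(in_channels, self.inter_channels, kernel_size=1)
        # Projects the aggregated response back to the input channel count.
        self.w_z = nn.Conv3d(self.inter_channels, in_channels, kernel_size=1)
        nn.init.zeros_(self.w_z.weight)  # start as an identity mapping (residual passes through)
        nn.init.zeros_(self.w_z.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W), e.g. T = 8 consecutive depth frames.
        n, c, t, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (N, THW, C')
        k = self.phi(x).flatten(2)                     # (N, C', THW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (N, THW, C')
        attn = F.softmax(q @ k, dim=-1)                # pairwise affinity over all positions
        y = (attn @ v).transpose(1, 2).reshape(n, self.inter_channels, t, h, w)
        return x + self.w_z(y)                         # residual connection


if __name__ == "__main__":
    block = NonLocalBlock3D(in_channels=256)
    clip_features = torch.randn(1, 256, 8, 13, 13)     # feature volume for an 8-frame clip
    print(block(clip_features).shape)                  # torch.Size([1, 256, 8, 13, 13])
```

Because the attention is computed across every (time, height, width) position, each output location can aggregate evidence from all eight frames at once, which is what allows the detector to distinguish motion-dependent actions such as getting up or falling.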
The system and model proposed in this thesis are built around human action recognition, especially the detection of getting up and falling. With further optimization in the future, the system can improve the long-term care environment and relieve the burden of nursing.