| Graduate student: | 莊竣傑 Chuang, Chun-Chieh |
|---|---|
| Thesis title: | 透過選擇性特徵整合與跨層閘控增強基於 HRNet 的人體姿態估計 Enhancing HRNet-based Human Pose Estimation through Selective Feature Integration and Cross-level Gating |
| Advisor: | 陳奇業 Chen, Chi-Yeh |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2023 |
| Graduation academic year: | 111 |
| Language: | English |
| Number of pages: | 41 |
| Keywords (Chinese): | 單階段人體姿態辨識、解構式關鍵點回歸、跨層閘控 |
| Keywords (English): | Single-stage human pose estimation, Disentangled keypoint regression, Cross-level gating |
Human pose estimation has advanced rapidly in recent years and is widely used in augmented reality, virtual reality, and human-computer interaction. Earlier research concentrated on two-stage frameworks. More recently, many researchers have turned to single-stage models that borrow the core ideas of bottom-up or top-down approaches, yielding models that are easier to use and less computationally demanding. Among these, Disentangled Keypoint Regression (DEKR) is a classic work built on the bottom-up approach. This thesis takes DEKR as its baseline, addresses its shortcomings, and incorporates additional modules and training methods to achieve higher accuracy in human pose estimation.
Building on DEKR, we first select suitable combinations of feature maps for predicting heatmaps and keypoint offsets. We examine whether feeding combinations of feature maps at different resolutions from the backbone network to the heads for the different prediction tasks improves accuracy. For heatmap prediction, we employ Weight Adaptive Heatmap Regression (WAHR) to improve heatmap quality. For keypoint offset prediction, we introduce the Waterfall Atrous Spatial Pooling module to enlarge the receptive field, and we integrate low-level features into the feature-map combination through cross-level gating so that the predicted offsets are guided closer to the ground-truth positions. The heatmaps predicted by the model are also used as feature maps and combined with cross-level gating for offset prediction. With these techniques and modules combined, our model achieves higher accuracy on the CrowdPose dataset than the DEKR model at the same resolution and is comparable to models that require higher resolutions.
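The abstract's claim that atrous pooling enlarges the receptive field can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the thesis: it assumes 3x3 kernels, stride 1, and a hypothetical set of dilation rates, and it assumes that a waterfall arrangement chains the atrous convolutions so that each one feeds the next. The helper names `effective_kernel` and `cascade_receptive_field` are made up for this example.

```python
from typing import Sequence


def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Span, in input pixels, covered by one dilated (atrous) kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def cascade_receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field of stride-1 convolutions applied one after another."""
    rf = 1
    for d in dilations:
        # Each stride-1 layer widens the field by (effective kernel - 1).
        rf += effective_kernel(kernel_size, d) - 1
    return rf


# Hypothetical dilation rates, chosen only for illustration.
RATES = [6, 12, 18, 24]

# Four ordinary 3x3 convolutions stacked in sequence.
plain = cascade_receptive_field(3, [1, 1, 1, 1])
# Independent parallel branches: the widest single branch sets the limit.
parallel = max(effective_kernel(3, d) for d in RATES)
# The same rates chained in sequence (the assumed waterfall arrangement).
waterfall = cascade_receptive_field(3, RATES)

print(plain, parallel, waterfall)  # prints: 9 49 121
```

Under these assumptions, chaining the dilated convolutions covers a 121-pixel span per axis, compared with 49 for the widest independent branch and 9 for four undilated layers, with the same number of 3x3 kernels in each case. The actual rates and layout used in the thesis may differ.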