| Graduate student: | Hung, Yi-Chia (洪翊珈) |
|---|---|
| Thesis title: | VSDPose: Voxel-based Self-distillation for Multi-view 3D Human Pose Estimation |
| Advisor: | Tsai, Chia-Chi (蔡家齊) |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of publication: | 2025 |
| Academic year of graduation: | 114 |
| Language: | English |
| Number of pages: | 52 |
| Keywords: | Multi-view 3D Human Pose Estimation, Knowledge Distillation |
Multi-view human pose estimation aims to predict 3D human joint locations from multiple synchronized camera views, offering greater robustness to occlusion and reduced depth ambiguity. A key challenge lies in further improving accuracy by effectively leveraging multi-view information within voxel-based frameworks. In this work, we propose VSDPose, Voxel-based Self-Distillation for multi-view 3D human Pose estimation. In a typical voxel-based pipeline, 2D features extracted by a backbone network are processed by RootNet for root localization, followed by PoseNet for 3D joint regression. Our voxel-based self-distillation (VSD) module incorporates knowledge distillation into this two-stage pipeline, enabling self-supervised refinement for multi-view human pose estimation within voxel-based frameworks. Without relying on any external teacher model, our VSD module treats PoseNet as the teacher model, while RootNet and the Backbone act as student models. More accurate 3D human poses provide refined supervision for estimating 3D root joints and multi-view heatmaps, thereby establishing a positive feedback loop that progressively improves pose estimation accuracy. We evaluate the proposed VSD module by integrating it into representative voxel-based methods—VoxelPose, Faster VoxelPose, and 3DSA—on the Panoptic dataset, and achieve consistent performance improvements across all models.
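The feedback loop described above, in which the PoseNet output supervises RootNet and the Backbone, can be illustrated with a minimal NumPy sketch. The abstract does not specify how the targets or losses are constructed, so everything here is an assumption made for illustration: the pinhole projection, the Gaussian heatmap targets, the mean-squared-error losses, and the hypothetical function names `project_points`, `gaussian_heatmaps`, and `distillation_losses`.

```python
import numpy as np


def project_points(points_3d, K, R, t):
    """Pinhole projection of (J, 3) world points to (J, 2) pixel coordinates."""
    cam = points_3d @ R.T + t      # world frame -> camera frame
    uvw = cam @ K.T                # camera frame -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]


def gaussian_heatmaps(centers_2d, height, width, sigma=2.0):
    """Render one 2D Gaussian per joint; returns an array of shape (J, height, width)."""
    ys, xs = np.mgrid[0:height, 0:width]
    dx = xs[None] - centers_2d[:, 0, None, None]
    dy = ys[None] - centers_2d[:, 1, None, None]
    return np.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))


def distillation_losses(teacher_joints, root_index, student_root,
                        student_heatmaps, cameras, sigma=2.0):
    """Hypothetical self-distillation losses (assumed form, not the thesis's definition).

    teacher_joints   : (J, 3) 3D pose from the teacher (PoseNet), used as a fixed target
    student_root     : (3,) root position predicted by the first student (RootNet)
    student_heatmaps : list of (J, H, W) heatmaps from the second student (Backbone)
    cameras          : list of (K, R, t) tuples, one per view
    """
    # Student 1: pull the predicted root toward the teacher's root joint.
    root_loss = float(np.mean((student_root - teacher_joints[root_index]) ** 2))

    # Student 2: pull each view's heatmaps toward the reprojected teacher pose.
    heatmap_loss = 0.0
    for (K, R, t), pred in zip(cameras, student_heatmaps):
        centers = project_points(teacher_joints, K, R, t)
        target = gaussian_heatmaps(centers, pred.shape[1], pred.shape[2], sigma)
        heatmap_loss += float(np.mean((pred - target) ** 2))
    return root_loss, heatmap_loss / len(cameras)
```

In a real training framework the teacher output would typically be detached from the computation graph so that gradients flow only into the students; this sketch computes only the loss values and leaves that detail out.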