
Graduate Student: Kao, Ming-Hung
Thesis Title: KMH-Net: A Key-Motion-Highlight Network for Golf Coaching Applications
Advisor: Chen, Chao-Chun
Degree: Master
Department: Institute of Manufacturing Information and Systems, College of Electrical Engineering and Computer Science
Year of Publication: 2025
Graduation Academic Year: 113 (ROC calendar; 2024–2025)
Language: Chinese
Number of Pages: 71
Keywords: Golf Instruction, Generative Diffusion Models, Deep Learning, Pose Generation, Sports Science

    Golf swing training is a biomechanically demanding activity that requires precise coordination of body posture, swing trajectory, and muscle control. Traditional instruction approaches, such as verbal correction or coach-supervised video playback, are often inaccessible to beginners and frequently fail to provide immediate, intuitive guidance. Although recent advances in deep learning and computer vision have enabled motion recognition and video generation, most approaches focus primarily on motion analysis and error diagnosis, lacking the ability to directly deliver visualized instructional demonstrations. In this work, we propose the Key-Motion-Highlight Network (KMH-Net), a personalized, controllable video generation framework designed to synthesize instructional golf swing videos in which the player's appearance is seamlessly transferred onto expert swing motion. KMH-Net requires only a single image of the player and a coach's reference video as input, and it generates a fully personalized demonstration that preserves biomechanical fidelity and ensures visual consistency of the golf club throughout the swing. To achieve this, KMH-Net integrates three specialized components: (1) a Golf Swing Automatic Feature Extraction Mechanism that identifies key swing phases and extracts fine-grained trajectories of the golf club, (2) a Semantic-Focused Loss Function Design for Supporting Diffusion Network Architecture that enforces biomechanical plausibility and visual correctness, and (3) a Generative Network for Customizable Learning Objectives that fine-tunes a pre-trained video diffusion model to enhance video quality and shaft visibility.
Experiments demonstrate that KMH-Net substantially outperforms baseline video generation models in terms of biomechanical accuracy and instructional utility, achieving closer alignment between generated and expert motions as well as improved consistency of golf club rendering, thus providing significant practical value for novice learners without access to professional guidance. Ablation studies further confirm that the Shaft Focus Mechanism, the Kinematic & Shaft Fusion Network, and the Semantic-Focused Loss Function are all critical for achieving high-quality outputs. Beyond golf, the proposed methodology is also transferable and can be broadly applied to generating personalized instructional content for other sports training and rehabilitation settings.
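To make the loss design described above concrete, the following is a minimal numpy sketch of how a composite kinematic-plus-shaft objective could be assembled. All function names, the fixed scalar weights, and the toy pose/mask representations are illustrative assumptions; the thesis's actual Kinematic & Shaft Fusion Network learns the combination and operates inside a diffusion training loop rather than on raw arrays.

```python
import numpy as np

def kinematic_loss(generated_pose: np.ndarray, expert_pose: np.ndarray) -> float:
    """Mean squared joint-position error between generated and expert swing poses."""
    return float(np.mean((generated_pose - expert_pose) ** 2))

def shaft_loss(generated_mask: np.ndarray, reference_mask: np.ndarray) -> float:
    """1 - IoU of club-shaft segmentation masks; penalizes missing or extra shaft pixels."""
    inter = np.logical_and(generated_mask, reference_mask).sum()
    union = np.logical_or(generated_mask, reference_mask).sum()
    return float(1.0 - inter / union) if union > 0 else 0.0

def semantic_focused_loss(gen_pose, exp_pose, gen_mask, ref_mask,
                          w_kin: float = 1.0, w_shaft: float = 0.5) -> float:
    """Weighted fusion of the kinematic and shaft terms (a hand-set stand-in
    for a learned fusion of biomechanical and club-visibility objectives)."""
    return w_kin * kinematic_loss(gen_pose, exp_pose) + w_shaft * shaft_loss(gen_mask, ref_mask)
```

As a sanity check, a generated frame whose pose and shaft mask match the expert reference exactly yields a loss of zero, and the loss grows as either the joint positions or the shaft rendering drift from the reference.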

    Table of Contents:
    Abstract i
    Extended Abstract in English ii
    Table of Contents vi
    List of Tables viii
    List of Figures ix
    Nomenclature xi
    Chapter 1  Introduction 1
      1.1 Overview of Golf 1
      1.2 Existing Deep Learning Methods and Their Challenges 2
      1.3 Empirical Support for Visually Guided Learning 2
      1.4 Research Motivation 3
      1.5 Summary of Contributions 4
      1.6 Thesis Organization 5
    Chapter 2  Literature Review 7
      2.1 Development of Deep Generative Models for Image and Video Generation 7
      2.2 Controllable Generation Techniques in Diffusion Models and Their Applications to Motion Generation 11
      2.3 Analysis Techniques and Evaluation Methods for the Golf Swing 13
    Chapter 3  Methodology 15
      3.1 Application Requirements 15
      3.2 System Design and Workflow 16
      3.3 Golf Swing Automatic Feature Extraction Mechanism 18
        3.3.1 Golf Swing Phase Detection Network 20
        3.3.2 Golf Swing Trajectory Tracking Network 22
      3.4 Semantic-Focused Loss Function Design for Supporting the Diffusion Network Architecture 22
        3.4.1 Definition of Kinematic Indicators 23
        3.4.2 Kinematic Indicator Extraction Module 30
        3.4.3 Kinematics-Based Loss Function 34
        3.4.4 Shaft-Based Loss Function 34
        3.4.5 Kinematic & Shaft Fusion Network 36
      3.5 Generative Network for Customizable Learning Objectives 38
        3.5.1 Shaft Focus Mechanism 39
        3.5.2 Diffusion-Based Video Generation Network 39
    Chapter 4  Experiments and Results Analysis 42
      4.1 Dataset 42
      4.2 Experimental Design 43
      4.3 Qualitative Analysis 45
      4.4 Quantitative Analysis 49
      4.5 Ablation Studies 50
    Chapter 5  Conclusion 53
    References 54

