| Graduate Student: | 石維霈 Shi, Wei-Pei |
|---|---|
| Thesis Title: | 評估結合CoTracker與深度特徵編碼用於追蹤局部皮膚特徵的表現 Evaluation of CoTracker Combined with Deep Feature Encoding for Tracking Local Skin Features |
| Advisor: | 吳馬丁 Nordling, Torbjörn |
| Degree: | Master |
| Department: | College of Engineering - Department of Mechanical Engineering |
| Year of Publication: | 2024 |
| Academic Year: | 112 |
| Language: | English |
| Number of Pages: | 85 |
| Keywords: | Skin feature tracking, Human motion assessment, CoTracker, Deep feature encoder, Transformer, Autoencoder, Convolutional neural network |
Introduction: Skin feature tracking is pivotal for quantifying human motion in a manner that is interpretable and applicable to clinical assessments. Although accuracy is crucial, state-of-the-art deep neural network-based point tracking models, such as CoTracker, have not yet been thoroughly evaluated in this field. CoTracker, known for its joint point tracking capability, has outperformed five other leading deep learning methods on the two most widely used datasets for single-target point tracking evaluation. In 2021, Chang and Nordling introduced the Deep Feature Encoder (DFE), which achieved sub-pixel accuracy in skin feature tracking and demonstrated robust performance using low-cost training datasets.
Problem: How can the accuracy and efficiency of skin feature tracking methods be enhanced?
Methods: Our benchmark uses videos of the Unified Parkinson's Disease Rating Scale postural tremor test, recorded at two hospitals. We annotated the ground truth in videos of three subjects, using hand moles, wrinkles, and stickers as features. The Deep Feature Encoder (DFE) employs the encoder part of an autoencoder, a five-layer convolutional neural network trained to reconstruct skin image crops. The predicted position is determined by comparing the residual squared error between the encoder's latent feature vector of each candidate crop and that of the tracked feature's template crop. To reduce the time consumption of the DFE, we propose CoTracker-DFE: CoTracker predicts an approximate position, a small area around it is cropped, and the crop is fed into the DFE to obtain a more accurate prediction with reduced mean pixel error. Moreover, by adopting newer models, such as EfficientNet, ResNet, Swin Transformer, and ConvNeXt, as the DFE backbone and implementing a robust tracking algorithm, we aim to enhance performance. To ensure the quality of the trained models, we employ several training strategies, including learning rate scheduling and data augmentation. Additionally, we optimize the code by removing unnecessary variables and loops to further reduce the DFE's time consumption.
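The core of the CoTracker-DFE refinement can be sketched in a few lines of PyTorch. The following is a minimal illustration of the idea described above, not the thesis implementation: the `Encoder` class is a stand-in for the DFE's five-layer convolutional encoder, and the function names, crop size (24×24), and search radius are assumptions chosen for the example.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Stand-in for the DFE encoder: five convolutional layers that
    compress a 24x24 image crop into a latent feature vector."""

    def __init__(self, latent_dim: int = 64):
        super().__init__()
        chans = [3, 16, 32, 64, 128, 128]  # five conv layers
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                       nn.ReLU()]
        self.conv = nn.Sequential(*layers)
        self.head = nn.LazyLinear(latent_dim)  # flattened conv output -> latent

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.conv(x).flatten(1))


def refine_with_dfe(frame, coarse_xy, z_template, encoder, half=12, search=8):
    """Refine CoTracker's coarse (x, y) prediction: encode every candidate
    crop inside a small search window and return the centre whose latent
    vector has the smallest squared residual to the template's latent vector.
    Assumes the whole search window lies inside the frame; clamp in real use."""
    x0, y0 = coarse_xy
    crops, centres = [], []
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cx, cy = x0 + dx, y0 + dy
            crops.append(frame[:, cy - half:cy + half, cx - half:cx + half])
            centres.append((cx, cy))
    with torch.no_grad():
        z = encoder(torch.stack(crops))            # (N, latent_dim)
    residual = ((z - z_template) ** 2).sum(dim=1)  # squared latent residual
    return centres[int(residual.argmin())]
```

An illustrative call on random data; in practice the template crop comes from the annotated skin feature in the first frame and `coarse_xy` from CoTracker:

```python
frame = torch.rand(3, 480, 640)
encoder = Encoder().eval()
x_t, y_t = 200, 100  # annotated feature centre (hypothetical)
with torch.no_grad():
    z_t = encoder(frame[:, y_t - 12:y_t + 12, x_t - 12:x_t + 12].unsqueeze(0))
x, y = refine_with_dfe(frame, (203, 98), z_t, encoder)
```

Restricting the exhaustive latent-space search to a small window around CoTracker's coarse prediction is what removes most of the original DFE's computation: with the values above, the encoder processes only (2·8+1)² = 289 candidate crops per frame instead of scanning a large region.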
Results: The mean Euclidean distance errors of CoTracker, DFE, CoTracker-DFE, and CoTracker-DFE with an EfficientNet backbone are 0.88, 0.86, 0.86, and 0.45 pixels, respectively, on the hand mole dataset (subject A). Comparing CoTracker-DFE with the EfficientNet backbone against the original DFE, we achieved a 13-fold improvement in time efficiency (from 901.556 seconds down to 67.6 seconds) and reduced the mean pixel error by 48% (0.41 pixels). While CoTracker-DFE with the EfficientNet backbone consumes 14 times more time than CoTracker (from 4.6 seconds to 67.6 seconds), it also reduces the mean pixel error by 49.5% (0.43 pixels).
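As a quick arithmetic check, the relative figures follow from the raw numbers reported above (the final percentage is presumably computed from unrounded errors, hence the small gap to the value obtained from the rounded pixel errors):

$$
\frac{901.556\,\mathrm{s}}{67.6\,\mathrm{s}} \approx 13.3, \qquad
\frac{67.6\,\mathrm{s}}{4.6\,\mathrm{s}} \approx 14.7, \qquad
\frac{0.86 - 0.45}{0.86} \approx 47.7\% \approx 48\%, \qquad
\frac{0.88 - 0.45}{0.88} \approx 48.9\%.
$$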
Conclusion: Implementing CoTracker-DFE significantly reduces the computational demands of the DFE, facilitating the use of a variety of backbones. Coupled with the robust tracking algorithm, this approach yields a notable performance improvement, marking an advance in skin feature tracking technology.
Afifi, M. (2019). 11k hands: Gender recognition and biometric identification using a large dataset of hand images. Multimedia Tools and Applications, 78(15):20835–20854.
Andriluka, M., Iqbal, U., Insafutdinov, E., Pishchulin, L., Milan, A., Gall, J., and Schiele, B. (2018). PoseTrack: A benchmark for human pose estimation and tracking. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5167–5176.
Baker, S. and Matthews, I. (2004). Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision, 56(3):221–255.
Bay, H., Tuytelaars, T., and Gool, L. V. (2006). SURF: Speeded up robust features. In European Conference on Computer Vision, pages 404–417. Springer.
Brox, T. and Malik, J. (2010). Large displacement optical flow: Descriptor matching in variational motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3):500–513.
Chang, J. R. and Nordling, T. E. M. (2021). Skin feature point tracking using deep feature encodings. arXiv preprint.
Chen, J.-Y. (2022). Hand pose tracking using deep feature encodings.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.
Doersch, C., Gupta, A., Markeeva, L., Recasens, A., Smaira, L., Aytar, Y., Carreira, J., Zisserman, A., and Yang, Y. (2022). TAP-Vid: A benchmark for tracking any point in a video. Advances in Neural Information Processing Systems, 35:13610–13626.
Doersch, C., Yang, Y., Vecerik, M., Gokay, D., Gupta, A., Aytar, Y., Carreira, J., and Zisserman, A. (2023). TAPIR: Tracking any point with per-frame initialization and temporal refinement. arXiv preprint arXiv:2306.08637.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Fahn, S. (1987). Unified Parkinson's disease rating scale. Recent Developments in Parkinson's Disease, pages 153–163.
Fang, H.-S., Li, J., Tang, H., Xu, C., Zhu, H., Xiu, Y., Li, Y.-L., and Lu, C. (2022). AlphaPose: Whole-body regional multi-person pose estimation and tracking in real-time. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Goetz, C. G., Tilley, B. C., Shaftman, S. R., Stebbins, G. T., Fahn, S., Martinez-Martin, P., Poewe, W., Sampaio, C., Stern, M. B., Dodel, R., et al. (2008). Movement Disorder Society-sponsored revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS): Scale presentation and clinimetric testing results. Movement Disorders: Official Journal of the Movement Disorder Society, 23(15):2129–2170.
Goldblum, M., Souri, H., Ni, R., Shu, M., Prabhu, V., Somepalli, G., Chattopadhyay, P., Ibrahim, M., Bardes, A., Hoffman, J., et al. (2024). Battle of the backbones: A large-scale comparison of pretrained models across computer vision tasks. Advances in Neural Information Processing Systems, 36.
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. (2017). Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677.
Guo, Z., Zeng, W., Yu, T., Xu, Y., Xiao, Y., Cao, X., and Cao, Z. (2022). Vision-based finger tapping test in patients with Parkinson's disease via spatial-temporal 3D hand pose estimation. IEEE Journal of Biomedical and Health Informatics, 26(8):3848–3859.
Harley, A. W., Fang, Z., and Fragkiadaki, K. (2022). Particle video revisited: Tracking through occlusions using point trajectories. In European Conference on Computer Vision, pages 59–75. Springer.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
Huang, C.-Y. and Galeotti, J. (2021). Robust skin-feature tracking in free-hand video from smartphone or robot-held camera, to enable clinical-tool localization and guidance. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 10974–10980. IEEE.
Karaev, N., Rocco, I., Graham, B., Neverova, N., Vedaldi, A., and Rupprecht, C. (2023). CoTracker: It is better to track together. arXiv preprint arXiv:2307.07635.
Khan, T., Nyholm, D., Westin, J., and Dougherty, M. (2014). A computer vision framework for finger-tapping evaluation in Parkinson's disease. Artificial Intelligence in Medicine, 60(1):27–40.
Kung, H.-S. (2024). Tracking of skin features–comparison of five methods.
Lam, W. W., Tang, Y. M., and Fong, K. N. (2023). A systematic review of the applications of markerless motion capture (MMC) technology for clinical measurement in rehabilitation. Journal of NeuroEngineering and Rehabilitation, 20(1):1–26.
Liu, J., Huang, X., Zheng, J., Liu, Y., and Li, H. (2023). MixMAE: Mixed and masked autoencoder for efficient pretraining of hierarchical vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6252–6261.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022.
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., and Xie, S. (2022). A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11976–11986.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110.
Lucas, B. and Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In IJCAI, volume 81.
Manni, F., van der Sommen, F., Zinger, S., Shan, C., Holthuizen, R., Lai, M., Buström, G., Hoveling, R. J., Edström, E., Elmi-Terander, A., et al. (2020). Hyperspectral imaging for skin feature detection: Advances in markerless tracking for spine surgery. Applied Sciences, 10(12):4078.
Mathis, A., Mamidanna, P., Cury, K. M., Abe, T., Murthy, V. N., Mathis, M. W., and Bethge, M. (2018). DeepLabCut: Markerless pose estimation of user-defined body parts with deep learning. Nature Neuroscience, 21(9):1281–1289.
McLaren, K. (1976). XIII–The development of the CIE 1976 (L*a*b*) uniform colour space and colour-difference formula. Journal of the Society of Dyers and Colourists, 92(9):338–341.
Neog, D. R., Ranjan, A., and Pai, D. K. (2017). Seeing skin in reduced coordinates. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pages 484–489. IEEE.
Neoral, M., Šerých, J., and Matas, J. (2024). MFT: Long-term tracking of every pixel. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6837–6847.
Noh, H., Araujo, A., Sim, J., Weyand, T., and Han, B. (2017). Large-scale image retrieval with attentive deep local features. In Proceedings of the IEEE International Conference on Computer Vision, pages 3456–3465.
Rublee, E., Rabaud, V., Konolige, K., and Bradski, G. (2011). ORB: An efficient alternative to SIFT or SURF. In 2011 International Conference on Computer Vision, pages 2564–2571. IEEE.
Sun, K., Xiao, B., Liu, D., and Wang, J. (2019). Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5693–5703.
Tan, M. and Le, Q. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR.
Tan, M. and Le, Q. (2021). EfficientNetV2: Smaller models and faster training. In International Conference on Machine Learning, pages 10096–10106. PMLR.
Teed, Z. and Deng, J. (2020). RAFT: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer.
Wang, Q., Chang, Y.-Y., Cai, R., Li, Z., Hariharan, B., Holynski, A., and Snavely, N. (2023). Tracking everything everywhere all at once. arXiv preprint arXiv:2306.05422.
Williams, S., Relton, S. D., Fang, H., Alty, J., Qahwaji, R., Graham, C. D., and Wong, D. C. (2020a). Supervised classification of bradykinesia in Parkinson's disease from smartphone videos. Artificial Intelligence in Medicine, 110:101966.
Williams, S., Zhao, Z., Hafeez, A., Wong, D. C., Relton, S. D., Fang, H., and Alty, J. E. (2020b). The discerning eye of computer vision: Can it measure Parkinson's finger tap bradykinesia? Journal of the Neurological Sciences, 416:117003.
Xiang, D., Joo, H., and Sheikh, Y. (2019). Monocular total capture: Posing face, body, and hands in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10974.
Xiong, F., Zhang, B., Xiao, Y., Cao, Z., Yu, T., Zhou, J. T., and Yuan, J. (2019). A2J: Anchor-to-joint regression network for 3D articulated pose estimation from a single depth image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 793–802.
Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T. (2020). On layer normalization in the transformer architecture. In International Conference on Machine Learning, pages 10524–10533. PMLR.
You, Y., Gitman, I., and Ginsburg, B. (2017). Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888.
Zhang, Y., Wang, C., Wang, X., Liu, W., and Zeng, W. (2022). VoxelTrack: Multi-person 3D human pose estimation and tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):2613–2626.
Zheng, Y., Harley, A. W., Shen, B., Wetzstein, G., and Guibas, L. J. (2023). PointOdyssey: A large-scale synthetic dataset for long-term point tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19855–19865.