| Author: | 方柏凱 Fang, Po-Kai |
|---|---|
| Thesis Title: | 基於人像語意分割之姿勢遷移生成模型 Pose Transfer Generative Model based on Human Image Semantic Segmentation |
| Advisor: | 王宗一 Wang, Tzone-I |
| Degree: | Master |
| Department: | College of Engineering - Department of Engineering Science |
| Year of Publication: | 2023 |
| Graduation Academic Year: | 111 (ROC calendar) |
| Language: | Chinese |
| Pages: | 52 |
| Keywords (Chinese): | 深度學習, 姿勢遷移, 人像生成 |
| Keywords (English): | deep learning, pose transfer, person image generation |
Person image generation has great application potential in many fields, such as film and animation production, advertising and marketing, and virtual avatar creation, and pose transfer plays an important role among them. Pose transfer refers to extracting appearance information from a source person image and generating a corresponding target person image according to a target body pose, so that the generated image keeps the appearance of the source image while exhibiting the target pose; this technique allows a single person image to take on a variety of poses.

In the past, pose transfer required extensive manual work and 3D modeling, which made the task very costly to carry out. This study therefore aims to build a generative deep learning model that can produce target person images with only limited resources, lowering the barrier to entry and the human effort the task requires.

This study proposes a generative deep learning model that produces the target person image from the target pose and a human image semantic segmentation map. By decomposing the person image into six attributes, such as hair, face, and upper body, the model can learn feature correspondences more easily. At the same time, conditional normalization driven by the target semantic segmentation map lets the model generate images that match the given annotation. In addition, this study extracts more appearance information from the source image as input, including body pose and hand movements, which leads to better generation of hand gestures.

The training data for this study come from AVSpeech, an audio-visual dataset collected from YouTube videos that contain human subjects. Video clips suitable for person image generation are selected through preprocessing, and the pre-trained pose estimation model MediaPipe and the human semantic segmentation model Graphonomy are used to extract pose annotations and semantic segmentation maps, which, together with the source person images, serve as the model's training inputs.

For evaluation, more than one thousand test samples extracted from this dataset, together with the test set from DeepFashion, are used to measure generation quality with structural similarity and the Inception Score. The final structural similarity score is 0.855 and the Inception Score is 3.462, indicating that the model generates high-quality results and effectively resolves the inaccurate contours reported in previous studies.
Human image generation holds significant potential in various domains such as film and animation production, advertising, virtual character creation, and more. Among these domains, pose transfer plays a crucial role. Pose transfer involves extracting appearance information from a source human image and generating a corresponding target human image based on a desired pose, maintaining the visual features of the source image while presenting the target pose. This technology enables a single character image to exhibit diverse poses.
In the past, achieving pose transfer required substantial manual effort and 3D modeling, making the task quite challenging. Thus, this study aims to establish a generative deep learning model that can produce target human images with limited resources, reducing the barriers to entry and the manpower required for this task.
This research proposes a generative deep learning model for generating target human images based on the target pose and human image semantic segmentation. By decomposing the human image into six attributes (hair, face, upper body, and so on), the model learns feature correspondences more effectively. Additionally, by adopting conditional normalization based on the semantic segmentation map of the target human image, the model generates images that align with the expected appearance. Furthermore, this study improves hand movement generation by extracting more information from the original image as input, including body posture and hand gestures.
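The conditional normalization step is in the spirit of spatially-adaptive normalization (SPADE) [12]: the target semantic segmentation map predicts per-pixel scale and shift parameters applied to normalized generator features. Below is a minimal sketch of such a layer, assuming PyTorch; the module name `SPADENorm`, the hidden width, and the choice of instance normalization are illustrative assumptions, not the thesis's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADENorm(nn.Module):
    """Conditional normalization: scale/shift a normalized feature map with
    parameters predicted from the target semantic segmentation map."""
    def __init__(self, feature_channels, num_segmentation_classes, hidden=128):
        super().__init__()
        # Parameter-free normalization of the generator features.
        self.norm = nn.InstanceNorm2d(feature_channels, affine=False)
        # Small conv net mapping the segmentation map to gamma/beta maps.
        self.shared = nn.Sequential(
            nn.Conv2d(num_segmentation_classes, hidden, 3, padding=1),
            nn.ReLU())
        self.gamma = nn.Conv2d(hidden, feature_channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, feature_channels, 3, padding=1)

    def forward(self, x, segmap):
        # Resize the one-hot segmentation map to the feature resolution.
        segmap = F.interpolate(segmap, size=x.shape[2:], mode='nearest')
        h = self.shared(segmap)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)
```

In a SPADE-style generator, one such layer is applied at each resolution, with the segmentation map re-broadcast to every scale so that the layout annotation steers the output everywhere.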
The training dataset for this research is sourced from AVSpeech, a collection of video data from YouTube that spans a wide variety of people and poses. The study preprocesses and filters suitable video clips from this dataset and uses the pre-trained pose estimation model MediaPipe and the human semantic segmentation model Graphonomy to extract pose landmarks and semantic segmentation maps, which, together with the source human images, serve as inputs for model training.
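As an illustration of the landmark-extraction step, the sketch below uses MediaPipe's Python `solutions` API (Holistic), which returns body and hand landmarks in a single pass; the thesis cites the separate pose and hand landmarkers [17, 18], so treat this as an assumed, simplified stand-in. Graphonomy parsing is omitted because it has no comparably compact API.

```python
import cv2
import mediapipe as mp

def extract_landmarks(image_path):
    """Run MediaPipe Holistic on one frame and return normalized (x, y, z)
    coordinates for the body pose and both hands (None if not detected)."""
    image = cv2.imread(image_path)
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    with mp.solutions.holistic.Holistic(static_image_mode=True) as holistic:
        results = holistic.process(rgb)

    def to_list(landmarks):
        if landmarks is None:
            return None
        return [(lm.x, lm.y, lm.z) for lm in landmarks.landmark]

    return {
        "pose": to_list(results.pose_landmarks),             # 33 body landmarks
        "left_hand": to_list(results.left_hand_landmarks),   # 21 landmarks per hand
        "right_hand": to_list(results.right_hand_landmarks),
    }
```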
In this study, over a thousand test samples extracted from the dataset, together with the DeepFashion test set, were used to assess the model's generation performance. The generated results were evaluated using structural similarity (SSIM) and the Inception Score, yielding an SSIM of 0.855 and an Inception Score of 3.462. These findings suggest that the model has a strong capability for generating images and effectively addresses the contour inaccuracies seen in previous research.
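For reference, the two reported metrics are defined as follows: SSIM per Wang et al. [5], computed between corresponding local windows \(x\) and \(y\) of the generated and ground-truth images, and the Inception Score per Salimans et al. [6], where \(p(y \mid x)\) is the Inception network's class posterior for a generated sample \(x\) and \(p(y)\) is the marginal over the generated set.

```latex
\mathrm{SSIM}(x, y) =
  \frac{(2\mu_x\mu_y + C_1)\,(2\sigma_{xy} + C_2)}
       {(\mu_x^2 + \mu_y^2 + C_1)\,(\sigma_x^2 + \sigma_y^2 + C_2)},
\qquad
\mathrm{IS} = \exp\!\Big( \mathbb{E}_{x}\big[ D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \big] \Big)
```

SSIM is bounded above by 1 (identical windows), so 0.855 indicates close structural agreement with the ground truth; a higher Inception Score indicates sharper and more diverse samples.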
[1] A. Ephrat et al., "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," arXiv preprint arXiv:1804.03619, 2018.
[2] C. Lugaresi et al., "MediaPipe: A framework for building perception pipelines," arXiv preprint arXiv:1906.08172, 2019.
[3] K. Gong, Y. Gao, X. Liang, X. Shen, M. Wang, and L. Lin, "Graphonomy: Universal human parsing via graph transfer learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7450-7459.
[4] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, "DeepFashion: Powering robust clothes recognition and retrieval with rich annotations," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1096-1104.
[5] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.
[6] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," Advances in Neural Information Processing Systems, vol. 29, 2016.
[7] Y. Men, Y. Mao, Y. Jiang, W.-Y. Ma, and Z. Lian, "Controllable person image synthesis with attribute-decomposed GAN," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5084-5093.
[8] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125-1134.
[9] Z. Zhu, T. Huang, B. Shi, M. Yu, B. Wang, and X. Bai, "Progressive pose attention transfer for person image generation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2347-2356.
[10] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros, "Everybody dance now," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5933-5942.
[11] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, "High-resolution image synthesis and semantic manipulation with conditional GANs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8798-8807.
[12] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, "Semantic image synthesis with spatially-adaptive normalization," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2337-2346.
[13] X. Huang and S. Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1501-1510.
[14] T. Karras, S. Laine, and T. Aila, "A style-based generator architecture for generative adversarial networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4401-4410.
[15] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[16] L. A. Gatys, A. S. Ecker, and M. Bethge, "A neural algorithm of artistic style," arXiv preprint arXiv:1508.06576, 2015.
[17] Google LLC, "Hand landmarks detection guide," MediaPipe. [Online]. Available: https://developers.google.com/mediapipe/solutions/vision/hand_landmarker
[18] Google LLC, "Pose landmark detection guide," MediaPipe. [Online]. Available: https://developers.google.com/mediapipe/solutions/vision/pose_landmarker
[19] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[20] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in Computer Vision - ECCV 2016, Springer, 2016, pp. 694-711.
[21] R. Mechrez, I. Talmi, and L. Zelnik-Manor, "The contextual loss for image transformation with non-aligned data," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 768-783.