| Graduate Student: | 歐禮寬 Ou, Li-Kuan | 
|---|---|
| Thesis Title: | 擴散模型的地圖藝術風格動畫生成 Generating MapArt Style Animation Using Diffusion Model | 
| Advisor: | 李同益 Lee, Tong-Yee | 
| Degree: | Master | 
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering | 
| Year of Publication: | 2024 | 
| Academic Year of Graduation: | 112 (ROC academic year) | 
| Language: | English | 
| Number of Pages: | 86 | 
| Keywords: | Diffusion Model, Animation Generation, Text-Guided, Map Art, Artistic Style Transfer | 
In this thesis, we explore how to combine the generative and style-transfer capabilities of diffusion models, and we propose a diffusion-based framework for stylized animation generation. We focus on Map Art, an art form that uses a map as the background and seamlessly blends objects (e.g., portraits) into it. Our framework takes four user inputs: (1) a text prompt, (2) a target word index identifying the foreground, (3) a single style reference image, and (4) a background map. From these it generates a Map Art animation that matches the text prompt and resembles the reference style.
Traditional style transfer models aim to convert a given image or video into a specific style, so they cannot generate images themselves. In contrast, diffusion models, which have attracted considerable attention in recent years, show impressive potential in image generation. This prompts us to ask whether diffusion models can be used to generate Map Art style animation automatically.
Our approach builds on an existing video diffusion model architecture and improves it so that the generated animation has a Map Art style. We also use the correspondence between text and image to automatically generate and extract the foreground that the user is interested in, and then combine it with the background map to produce a new Map Art animation. Finally, we propose fine-tuning, refinement, and post-processing methods that further improve the quality of the generated animation and address blur, style similarity, and the flicker caused by video incoherence.
Compared with previous work, our method leverages the generative capability of diffusion models, so users no longer need to find a suitable video to serve as the Map Art foreground; they can instead describe the desired foreground animation in text, which makes the method more flexible and creative to use. Our method also blends the foreground animation with the map more effectively, which we attribute to the latent space of diffusion models and their step-by-step denoising.
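The foreground extraction and map combination described above can be pictured with a minimal sketch. It is not the thesis implementation: it assumes the text-image correspondence is available as one attention map per prompt token, and it blends in pixel space on a single grayscale frame, whereas the abstract attributes the blending quality to latent-space, step-by-step denoising. The names `attention_to_mask`, `composite`, and the `threshold` parameter are hypothetical.

```python
from typing import List

Grid = List[List[float]]  # one grayscale frame, map, or mask with values in [0, 1]


def attention_to_mask(attn: List[Grid], word_index: int, threshold: float = 0.5) -> Grid:
    """Turn the attention map of one prompt token into a soft foreground mask.

    attn[k][y][x] is an assumed per-token attention weight. The map selected by
    word_index is min-max normalized, and values below threshold are set to 0.
    """
    token_map = attn[word_index]
    flat = [v for row in token_map for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant map
    mask = []
    for row in token_map:
        out_row = []
        for v in row:
            norm = (v - lo) / span
            out_row.append(norm if norm >= threshold else 0.0)
        mask.append(out_row)
    return mask


def composite(foreground: Grid, background: Grid, mask: Grid) -> Grid:
    """Blend per pixel: mask * foreground + (1 - mask) * background."""
    return [
        [m * f + (1.0 - m) * b for f, b, m in zip(f_row, b_row, m_row)]
        for f_row, b_row, m_row in zip(foreground, background, mask)
    ]


# Toy 2x2 example: token 1 (the assumed foreground word) attends to the left column.
attn = [
    [[0.1, 0.1], [0.1, 0.1]],  # token 0
    [[0.9, 0.1], [0.9, 0.1]],  # token 1
]
frame = [[1.0, 1.0], [1.0, 1.0]]      # stand-in for a generated foreground frame
map_image = [[0.0, 0.0], [0.0, 0.0]]  # stand-in for the background map
mask = attention_to_mask(attn, word_index=1)
blended = composite(frame, map_image, mask)
```

For an animation, the same two steps would be repeated per frame with that frame's attention map; keeping the mask consistent across frames is one place where flicker could arise.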
On-campus access: available from 2029-08-22