| Graduate Student | 劉建志 Liu, Chien-Chih |
|---|---|
| Thesis Title | 對影片內容個性化編輯 Personalizing Text-Guided Video Editing via Hierarchical Control |
| Advisor | 李同益 Lee, Tong-Yee |
| Degree | Master |
| Department | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication | 2023 |
| Graduation Academic Year | 111 (ROC calendar) |
| Language | English |
| Pages | 60 |
| Keywords (Chinese) | 擴散生成模型, 影片編輯, 個性化內容生成 |
| Keywords (English) | diffusion model, video editing, personalization |
In this research, we combine diffusion-model-based video editing with personalized content generation: given images supplied by the user, we edit a video so that a character or object selected by the user is replaced with the user's subject. Image generation is not an easy task, but diffusion models have achieved remarkable results in this area, particularly when combined with large-scale language-image models that measure image-text similarity and thereby allow generation to be guided by text. Since no video diffusion model is publicly available, we use a pre-trained text-to-image diffusion model and extend its image network architecture to handle video. When the content we want to generate does not appear in the model's training data, no text prompt can produce the corresponding result, so we fine-tune the model on the user's input images to give it the ability to generate that content. This fine-tuning, however, takes time to find an appropriate number of training steps and inevitably degrades the model's generation quality. Therefore, after combining it with the video editing method, we propose two constraints and improvements on the editing and video generation steps that raise the quality and stability of personalized content generation without any additional training. The final evaluation also shows that, compared with the original method, our approach improves the details of the generated content.
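To make the personalization step concrete, the following is a minimal sketch of the kind of fine-tuning the abstract describes: a pre-trained text-to-image diffusion model is briefly trained on the user's input images so that a rare placeholder token comes to denote the user's subject. It assumes the Hugging Face diffusers library and a Stable Diffusion v1.5 checkpoint; the prompt, the `train_step` helper, and the image tensors are hypothetical, and this illustrates DreamBooth-style fine-tuning in general rather than the thesis's actual training code.

```python
# A minimal sketch, not the thesis's actual code: DreamBooth-style fine-tuning
# of a pre-trained text-to-image diffusion model on a few user-supplied images,
# so that a rare placeholder token ("sks") comes to denote the user's subject.
# Assumes the Hugging Face diffusers library and a Stable Diffusion v1.5 checkpoint.
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id).to(device)
unet, vae, text_encoder, tokenizer = pipe.unet, pipe.vae, pipe.text_encoder, pipe.tokenizer
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

# Only the denoising U-Net is updated; the VAE and text encoder stay frozen.
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
optimizer = torch.optim.AdamW(unet.parameters(), lr=5e-6)

# Hypothetical placeholder prompt describing the user's subject.
prompt = "a photo of sks toy"
input_ids = tokenizer(
    prompt, padding="max_length",
    max_length=tokenizer.model_max_length, return_tensors="pt",
).input_ids.to(device)
with torch.no_grad():
    text_emb = text_encoder(input_ids)[0]  # (1, 77, 768) text conditioning


def train_step(pixel_values: torch.Tensor) -> float:
    """One denoising-score-matching step on a batch of the user's images
    (float tensor in [-1, 1] with shape (B, 3, 512, 512))."""
    with torch.no_grad():
        latents = vae.encode(pixel_values.to(device)).latent_dist.sample() * 0.18215
    noise = torch.randn_like(latents)
    t = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=device,
    ).long()
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)
    # Predict the added noise conditioned on the placeholder prompt.
    pred = unet(
        noisy_latents, t,
        encoder_hidden_states=text_emb.repeat(latents.shape[0], 1, 1),
    ).sample
    loss = F.mse_loss(pred, noise)  # epsilon-prediction objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a sketch like this, the number of optimization steps has to be tuned per subject: too few and the model never learns the subject, too many and generation quality drifts, which is exactly the trade-off the abstract points out and the motivation for adding training-free constraints at the editing and video generation stages.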
On-campus access: public release on 2028-08-18.