| Graduate Student: | 曾俊翰 Tseng, Chun-Han |
|---|---|
| Thesis Title: | 生成式AI文字對圖像生成(Text-to-Image)中,提示詞(Prompt)順序影響與提示詞工程研究 (A Study on the Impact of Prompt Order and Prompt Engineering in Text-to-Image Generation within Generative AI) |
| Advisor: | 蔡明田 Tsai, Ming-Tian |
| Degree: | Master |
| Department: | College of Engineering - Engineering Management Graduate Program (on-the-job class) |
| Year of Publication: | 2025 |
| Graduation Academic Year: | 113 (ROC calendar) |
| Language: | Chinese |
| Number of Pages: | 73 |
| Chinese Keywords: | 提示詞工程 (Prompt Engineering), 生成式AI (Generative AI), 文生圖 (Text-to-Image), ChatGPT, Stable Diffusion, Prompt |
| English Keywords: | Generative AI, Text-to-Image, Prompt Engineering, Stable Diffusion, Prompt |
This study investigates the effect of prompt order and prompt engineering in generative AI text-to-image generation. It first outlines the development and applications of generative AI and examines how text-to-image techniques combine natural language processing with image generation models to produce high-quality images. Because research on prompt ordering is scarce and most users rely on trial and error, the motivation of this study is to explore how prompt order affects generation results.
The literature review introduces the technical foundations of AI image generation models such as Stable Diffusion and DALL·E, including the diffusion model, the U-Net architecture, and the CLIP text encoder. The study then tests whether changing the order of words in a prompt affects the composition, color, and detail of AI-generated images.
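Because the CLIP text encoder applies positional embeddings to the token sequence, reordering a prompt does change the encoded text even when the final image barely changes. The following minimal sketch, which is an illustration rather than the thesis's own setup, assumes the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint, and compares the embeddings of two orderings:

```python
# Minimal sketch (not from the thesis): compare CLIP text embeddings
# for two orderings of the same prompt words.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a moon and a mountain", "a mountain and a moon"]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")

with torch.no_grad():
    # pooler_output is the EOS-token embedding CLIP uses as a sentence vector
    embeddings = text_model(**inputs).pooler_output

similarity = torch.nn.functional.cosine_similarity(
    embeddings[0], embeddings[1], dim=0
)
print(f"cosine similarity between orderings: {similarity.item():.4f}")
```

Stable Diffusion actually conditions on the full per-token hidden states rather than this pooled vector, but the comparison gives a quick sense of how much a reordering shifts the encoded prompt.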
Methodologically, model parameters were held fixed while the prompt order was varied, so that changes in the generated images could be analyzed. Test cases covered natural landscapes (e.g., a moon and a mountain, a sun and a mountain) and everyday scenes (e.g., a plant and a house). Quantitative and qualitative analyses compared how different prompt orders affected composition, color, and lighting.
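As an illustration of this protocol, the sketch below assumes the diffusers library and the public runwayml/stable-diffusion-v1-5 checkpoint (the thesis does not state its exact toolchain); it fixes the seed, step count, and guidance scale, and varies only the prompt order:

```python
# Minimal sketch (assumed setup, not the thesis's exact one): fix all
# sampling parameters and vary only the prompt order.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt_orders = [
    "a moon and a mountain",
    "a mountain and a moon",
]

for i, prompt in enumerate(prompt_orders):
    # Re-seeding before each call keeps the initial latent noise identical,
    # so any difference between images is attributable to the prompt alone.
    generator = torch.Generator("cuda").manual_seed(42)
    image = pipe(
        prompt,
        num_inference_steps=30,
        guidance_scale=7.5,
        generator=generator,
    ).images[0]
    image.save(f"order_{i}.png")
```

Re-using the same generator seed across runs is the key design choice: it isolates the prompt ordering as the only changing variable.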
The results show that prompt order does not effectively influence the composition of generated images, indicating that the AI relies mainly on statistical patterns in its training data rather than on the word order of user input; prompt order alone does not determine the generation result. In the knife-and-fork placement test, the generated images followed the conventional Western table-setting arrangement with high probability regardless of prompt order, suggesting that the AI tends to reproduce established patterns from its training data rather than follow the order given in the prompt. Future research could examine generation behavior under cultural bias and training-data influence. Because the knife-and-fork arrangements generated here may reflect the large companies that train these models, most of which are rooted in Western culture, future work could also design experiments with prompts in different languages to test whether generation over-relies on the cultural concentration and stereotypes in existing training data. Whether generative AI can adapt to different cultures is a question of considerable research value.
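To make such a multilingual ordering experiment concrete, a hypothetical sketch (the prompt templates and element names below are illustrative, not from the thesis) could enumerate every ordering of the scene elements in each language:

```python
# Hypothetical sketch: enumerate all orderings of scene elements,
# paired with prompt templates in several languages, for a
# cultural-bias test like the one proposed above.
from itertools import permutations

templates = {
    "en": "a {} and a {} on a dining table",
    "zh": "餐桌上有{}和{}",
}
element_names = {
    "en": ("knife", "fork"),
    "zh": ("刀", "叉"),
}

for lang, template in templates.items():
    for a, b in permutations(element_names[lang]):
        print(lang, "|", template.format(a, b))
```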
This study investigates the impact of prompt order and prompt engineering in generative AI text-to-image generation. It reviews how models such as Stable Diffusion and DALL·E combine natural language processing with image generation. Since the effect of prompt order is underexplored, this research tests whether rearranging prompt words affects image composition, color, and detail. By fixing model parameters and varying prompt sequences in natural and everyday scenes, the results show that prompt order does not significantly influence the generated images. This indicates that the AI is more likely to reproduce established patterns from its training data than to be guided by the sequence of prompt words. For example, the AI consistently generated Western-style dining utensil arrangements regardless of prompt order, highlighting the dominance of learned data patterns. Future research should explore AI's cultural biases and test prompts in different languages to assess over-reliance on dominant cultural datasets, which is important for creating culturally inclusive AI.