
Author: Chen, Yi-Ju (陳薏如)
Title: A Single 2D Shoe Picture for 3D-Aware Image Synthesis (單張鞋子2D圖片用於三維感知圖像合成之應用)
Advisor: Wang, Tzone-I (王宗一)
Degree: Master
Department: Department of Engineering Science, College of Engineering
Year of Publication: 2023
Graduation Academic Year: 111 (ROC calendar; 2022-2023)
Language: Chinese
Pages: 53
Keywords: Deep Learning, Neural Radiance Field, 3D-Aware Image Synthesis
Usage: 129 views; 14 downloads

    With the popularity of online shopping, more and more people choose to purchase goods on the internet. For items such as shoes, however, online images and written descriptions alone may not fully meet consumers' needs, and many styles are sold only on particular online platforms, making it difficult to try them on in a physical store. Deciding whether to buy based solely on pictures brings hesitation and uncertainty, and buyers may also face regret and the hassle of returns or exchanges when a purchase does not meet their expectations.
    To address this, this study takes a single shoe image as input and builds a generative model that produces multi-view images of the shoe. The generated images are then used for three-dimensional reconstruction, simplifying the pipeline from a 2D picture to a 3D model, and the resulting models are applied to existing augmented reality filters on mobile devices so that users can virtually try on shoes through their smartphone cameras. To train the network, a custom multi-view shoe image dataset was created: 3D shoe models were downloaded from Sketchfab, and Blender was used to render each model from a range of viewpoints. To increase the color diversity of the dataset, color transformations were applied to the rendered images. An encoder extracts the features and camera viewpoint of a single input image, and the model generates shoe images at other viewpoints conditioned on the camera pose. To address the low resolution of the generated images, a sharpening effect was introduced into the generation process; measured with the FID (Fréchet Inception Distance) metric, sharpening reduced the score from 30.81 to 29.09, indicating a smaller distributional gap between generated and real images. The open-source Meshroom software was then used to reconstruct three-dimensional shoe models from the generated multi-view images. Finally, these models were loaded into the Footwear Try-On augmented reality filter developed by Snap Inc., enabling virtual try-on on a mobile phone.
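
    As a rough illustration of the dataset-construction step described above, the sketch below uses Blender's Python API (bpy) to render a model from evenly spaced viewpoints. This is a minimal sketch only: the object name "shoe", the view count, and the camera orbit parameters are illustrative assumptions, not values taken from the thesis.

```python
# Minimal Blender (bpy) sketch: orbit the scene camera around a shoe model
# and render evenly spaced views. Run inside Blender with a scene that
# already contains a camera and an object named "shoe" (hypothetical name).
import math
import bpy

scene = bpy.context.scene
cam = scene.camera
target = bpy.data.objects["shoe"]   # hypothetical object name
radius, height = 2.0, 0.8           # illustrative orbit parameters
num_views = 24                      # illustrative view count

for i in range(num_views):
    angle = 2 * math.pi * i / num_views
    cam.location = (radius * math.cos(angle), radius * math.sin(angle), height)
    # Aim the camera at the target: -Z is the camera's viewing axis in Blender.
    direction = target.location - cam.location
    cam.rotation_euler = direction.to_track_quat('-Z', 'Y').to_euler()
    scene.render.filepath = f"//renders/view_{i:03d}.png"  # relative to the .blend
    bpy.ops.render.render(write_still=True)
```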

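    Likewise, the color-transformation, sharpening, and FID-evaluation steps might look like the sketch below, assuming an HSV hue shift for the color transform, unsharp masking for the sharpening effect, and the torchmetrics FID implementation. The thesis does not specify its exact methods, so these choices and all parameter values are illustrative.

```python
# Sketch of post-processing and evaluation: HSV hue-shift color augmentation,
# unsharp-mask sharpening, and FID scoring. All parameters are illustrative.
import cv2
import numpy as np
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def shift_hue(img_bgr: np.ndarray, delta: int) -> np.ndarray:
    """Rotate the hue channel in HSV space (OpenCV's hue range is 0-179)."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    hsv[..., 0] = ((hsv[..., 0].astype(np.int32) + delta) % 180).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

def unsharp_mask(img: np.ndarray, sigma: float = 1.0, amount: float = 1.5) -> np.ndarray:
    """Sharpen by adding back the detail removed by a Gaussian blur."""
    blurred = cv2.GaussianBlur(img, (0, 0), sigma)
    return cv2.addWeighted(img, 1.0 + amount, blurred, -amount, 0)

# FID between real renders and generated views; torchmetrics expects uint8
# batches shaped (N, 3, H, W). Random tensors stand in for the actual data.
real_batch = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake_batch = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fid = FrechetInceptionDistance(feature=2048)
fid.update(real_batch, real=True)
fid.update(fake_batch, real=False)
print(f"FID: {fid.compute().item():.2f}")  # thesis reports 30.81 -> 29.09 after sharpening
```
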
    Table of Contents:
    Abstract (Chinese); Extended Abstract; Acknowledgements; Table of Contents; List of Tables; List of Figures
    Chapter 1: Introduction
        1.1 Research Background and Motivation
        1.2 Research Objectives
        1.3 Research Methods
        1.4 Research Contributions
    Chapter 2: Literature Review
        2.1 Image Color Space Conversion
            2.1.1 The RGB Color Model
            2.1.2 The HSV Color Model
        2.2 Neural Radiance Fields
        2.3 Photogrammetry
            2.3.1 Structure from Motion
            2.3.2 Multi-View Stereo
    Chapter 3: System Design and Model Architecture
        3.1 Training and Testing Pipeline
        3.2 Dataset and Data Processing
        3.3 Network Architecture
        3.4 Generative Adversarial Network
            3.4.1 Generator Network
            3.4.2 Discriminator Network
            3.4.3 GAN Loss Functions
                3.4.3.1 Generator Adversarial Loss
                3.4.3.2 Discriminator Adversarial Loss
        3.5 Autoencoder
            3.5.1 Encoder Network
            3.5.2 Autoencoder Loss Functions
                3.5.2.1 Encoder Inversion Loss
                3.5.2.2 Image Reconstruction Loss
                3.5.2.3 Conditional Adversarial Loss
    Chapter 4: Experimental Design and Results
        4.1 Dataset and Experimental Environment
            4.1.1 Dataset
            4.1.2 Experimental Environment
        4.2 Evaluation Metrics
        4.3 Experimental Procedure
        4.4 Experimental Results
            4.4.1 Multi-View Image Rendering
            4.4.2 Model Evaluation Results
        4.5 Applications of the Model
            4.5.1 Introduction to Meshroom
            4.5.2 Meshroom and 3D Reconstruction Results
                4.5.2.1 Meshroom results with and without the sharpening effect
                4.5.2.2 Meshroom results after changing the image background color
            4.5.3 Augmented Reality Filter Application
        4.6 Discussion
            4.6.1 Effect of Viewing Angle on Image Generation
            4.6.2 Fragmented Generation of Shoe Soles
            4.6.3 Meshroom Reconstruction Failures
    Chapter 5: Conclusions and Future Work
        5.1 Conclusions
        5.2 Future Work
    Chapter 6: References

    [1] Snap Inc., "Footwear Try-On Docs." [Online]. Available: https://docs.snap.com/lens-studio/references/templates/object/Try-On/foot-tracking
    [2] C. Griwodz et al., "AliceVision Meshroom: An open-source 3D reconstruction pipeline," in Proceedings of the 12th ACM Multimedia Systems Conference, 2021, pp. 241-247.
    [3] Wikipedia contributors, "HSL and HSV color space," in Wikipedia, The Free Encyclopedia, 2023.
    [4] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," Communications of the ACM, vol. 65, no. 1, pp. 99-106, 2021.
    [5] S. Baker and I. Matthews, "Lucas-Kanade 20 years on: A unifying framework," International Journal of Computer Vision, vol. 56, pp. 221-255, 2004.
    [6] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381-395, 1981.
    [7] Ö. Yılmaz and F. Karakuş, "Stereo and KinectFusion for continuous 3D reconstruction and visual odometry," Turkish Journal of Electrical Engineering and Computer Sciences, vol. 24, no. 4, pp. 2756-2770, 2016.
    [8] S. Cai, A. Obukhov, D. Dai, and L. Van Gool, "Pix2NeRF: Unsupervised conditional π-GAN for single image to neural radiance fields translation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3981-3990.
    [9] E. R. Chan, M. Monteiro, P. Kellnhofer, J. Wu, and G. Wetzstein, "π-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5799-5809.
    [10] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, "FiLM: Visual reasoning with a general conditioning layer," in Proceedings of the AAAI Conference on Artificial Intelligence, 2018, vol. 32, no. 1.
    [11] V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein, "Implicit neural representations with periodic activation functions," Advances in Neural Information Processing Systems, vol. 33, pp. 7462-7473, 2020.
    [12] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.
    [13] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in Computer Vision - ECCV 2016, Part II, Springer, 2016, pp. 694-711.
    [14] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
    [15] T. Karras, T. Aila, S. Laine, and J. Lehtinen, "Progressive growing of GANs for improved quality, stability, and variation," arXiv preprint arXiv:1710.10196, 2017.
    [16] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," Advances in Neural Information Processing Systems, vol. 30, 2017.

    Full-text access: on campus, immediately available; off campus, immediately available.