
Author: Cheng, Li-Wei (鄭力維)
Title: One-shot Learning Using Data Augmentation for TCM Herb Recognition (利用資料增強實現中藥材辨識之單樣本學習)
Advisor: Lan, Kun-Chan (藍崑展)
Degree: Master
Department: Graduate Program of Artificial Intelligence, College of Electrical Engineering and Computer Science
Year of Publication: 2025
Academic Year of Graduation: 113 (2024-2025)
Language: English
Pages: 70
Keywords: One-Shot Learning, Data Augmentation, 3D Reconstruction, Novel View Synthesis, Generative Models, Pose Estimation, TCM Herb Classification

Abstract: This research addresses the critical challenge of data acquisition for fine-grained object recognition, particularly in specialized domains like Traditional Chinese Medicine (TCM), where multi-view datasets are impractical to collect. We propose and validate a novel one-shot data augmentation pipeline that automatically generates multi-view training data from a single input image. Our 2D→3D→2D methodology leverages a state-of-the-art model, InstantMesh, to create a 3D proxy. However, our analysis identifies two fundamental limitations of this baseline approach: an Object Orientation Gap caused by a semantically blind rotation axis, and a Texture Fidelity Gap stemming from the reconstruction model's architectural bottlenecks and its training on general-purpose datasets.
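The record contains no code, but the 2D→3D→2D flow described above lends itself to a compact sketch. The version below is a hypothetical outline under stated assumptions, not the thesis's implementation: `reconstruct` stands in for a single-image-to-mesh model (e.g., an InstantMesh inference wrapper), `render` for any mesh renderer, and only NumPy is assumed.

```python
"""Hypothetical sketch of the 2D -> 3D -> 2D augmentation flow.

`reconstruct` and `render` are stand-ins (e.g., an InstantMesh inference
wrapper and a mesh renderer); neither name comes from the thesis.
"""
from typing import Callable, List

import numpy as np


def look_at(eye: np.ndarray, target: np.ndarray, up: np.ndarray) -> np.ndarray:
    """Build a 4x4 camera-to-world pose whose -z axis looks from eye to target."""
    fwd = (target - eye) / np.linalg.norm(target - eye)
    right = np.cross(fwd, up)
    right /= np.linalg.norm(right)
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1] = right, np.cross(right, fwd)
    pose[:3, 2], pose[:3, 3] = -fwd, eye
    return pose


def one_shot_views(image_path: str, reconstruct: Callable, render: Callable,
                   n_views: int = 16, radius: float = 2.0,
                   axis=(0.0, 1.0, 0.0)) -> List[np.ndarray]:
    """Lift one photo to a 3D proxy, then orbit a virtual camera around `axis`."""
    mesh = reconstruct(image_path)                 # 2D -> 3D: single-image reconstruction
    axis = np.asarray(axis, float)
    axis /= np.linalg.norm(axis)
    # Orthonormal basis (u, v) of the plane perpendicular to the orbit axis.
    u = np.cross(axis, (1.0, 0.0, 0.0))
    if np.linalg.norm(u) < 1e-6:                   # axis happened to be ~parallel to x
        u = np.cross(axis, (0.0, 0.0, 1.0))
    u /= np.linalg.norm(u)
    v = np.cross(axis, u)
    views = []
    for k in range(n_views):                       # 3D -> 2D: the virtual turntable
        theta = 2.0 * np.pi * k / n_views
        eye = radius * (np.cos(theta) * u + np.sin(theta) * v)
        views.append(render(mesh, look_at(eye, np.zeros(3), axis)))
    return views
```

The `axis` parameter is where the thesis's contribution plugs in: the baseline fixes it to the reconstructor's default vertical, which is exactly the semantically blind choice the abstract criticizes, while the Adaptive Orbit Axis derives it from the mesh itself.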
To overcome the orientation problem, we introduce a novel Adaptive Orbit Axis algorithm that analyzes the 3D mesh geometry to determine a semantically correct viewpoint. Experiments show this is the dominant factor for performance, increasing downstream classification accuracy from 71.8% to 81.4%. Our in-depth discussion further reveals two critical insights from the failure cases: the classifier tends to learn non-robust "shortcuts" from inconsistent synthetic data, and the pipeline's overall performance is ultimately capped by the upstream model's ability to preserve high-frequency texture features. Overall, this work provides a viable proof of concept for one-shot data augmentation in fine-grained contexts and concludes that future work must prioritize improving the fidelity of the foundational 3D reconstruction model.
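The abstract does not spell out the Adaptive Orbit Axis algorithm itself; Section 5.4 of the thesis compares it against plain PCA for orientation determination, so only that PCA baseline is sketched here (hypothetical helper name, NumPy only). It also illustrates why a purely geometric axis can be "semantically blind": for a flat, disc-shaped herb slice, the two largest-variance directions lie in the slice plane, so the principal axis need not match the object's natural upright direction.

```python
"""PCA baseline for picking an orbit axis from mesh geometry (the plain
approach the thesis's Adaptive Orbit Axis is compared against in Sec. 5.4;
the adaptive algorithm itself is not reproduced in this record)."""
import numpy as np


def pca_orbit_axis(vertices: np.ndarray) -> np.ndarray:
    """Unit vector along the dominant geometric axis of an (N, 3) vertex cloud."""
    centered = vertices - vertices.mean(axis=0)
    # Principal directions = eigenvectors of the 3x3 vertex covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(centered.T @ centered / len(centered))
    # eigh returns eigenvalues in ascending order, so the last column is the
    # direction of largest variance, e.g. the long axis of a root-shaped herb.
    return eigvecs[:, -1] / np.linalg.norm(eigvecs[:, -1])
```

Spinning the turntable about this axis works for elongated objects but can misorient flat or irregular ones, which is the Object Orientation Gap the adaptive algorithm is designed to close.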

Table of Contents:
ABSTRACT (CHINESE) / ABSTRACT / CONTENTS / LIST OF FIGURES / LIST OF TABLES
CHAPTER 1 INTRODUCTION
  1.1 BACKGROUND
  1.2 MOTIVATION: FROM MANY SHOTS TO ONE
  1.3 OUR STRATEGY: A 3D-BASED APPROACH FOR CROSS-VIEW CONSISTENCY
  1.4 OUR CONTRIBUTION
CHAPTER 2 RELATED WORK
  2.1 AUTOMATED TCM HERB RECOGNITION
  2.2 SINGLE-IMAGE 3D RECONSTRUCTION
  2.3 MULTI-VIEW RENDERING STRATEGIES AND OUR SEMANTIC APPROACH
  2.4 DETERMINING OBJECT ORIENTATION FROM 3D GEOMETRY
CHAPTER 3 METHODOLOGY
  3.1 SYSTEM OVERVIEW
  3.2 OUR STRATEGY: A 3D-BASED APPROACH FOR CROSS-VIEW CONSISTENCY
  3.3 SINGLE-IMAGE 3D RECONSTRUCTION VIA INSTANTMESH
  3.4 CAMERA POSE ESTIMATION
    3.4.1 CANDIDATE VIEW AND ID MAP GENERATION
    3.4.2 FEATURE MATCHING AND POSE DETERMINATION
  3.5 THE VIRTUAL TURNTABLE WITH ADAPTIVE ORBIT AXIS
    3.5.1 THE CHALLENGE: THE OBJECT ORIENTATION GAP
    3.5.2 OUR SOLUTION: THE ADAPTIVE ORBIT AXIS ALGORITHM
CHAPTER 4 EXPERIMENT RESULTS
  4.1 EXPERIMENTAL SETUP
    4.1.1 SYSTEM ENVIRONMENT AND HARDWARE
    4.1.2 DATASET
    4.1.3 EVALUATION METRIC AND DOWNSTREAM MODEL
    4.1.4 CONFIGURATIONS FOR COMPARISON
  4.2 MAIN RESULT AND PERFORMANCE ANALYSIS
  4.3 IMPACT ANALYSIS OF THE ADAPTIVE ORBIT AXIS
  4.4 QUALITATIVE RESULTS ANALYSIS
  4.5 AN EXPLORATORY STUDY ON TEXTURE ENHANCEMENT
    4.5.1 METHODOLOGY OF TEXTURE RECOVERY
    4.5.2 EXPERIMENTAL RESULTS AND ANALYSIS
CHAPTER 5 DISCUSSION
  5.1 VIEWPOINT QUALITY AS THE DOMINANT FACTOR
  5.2 IN-DEPTH ANALYSIS OF PREDICTION RESULTS: SUCCESSES AND FAILURES
    5.2.1 SUCCESS CASE ANALYSIS: TWO PATHWAYS TO HIGH PERFORMANCE
    5.2.2 ANOMALY CASE ANALYSIS (I): THE REMOVAL OF DECEPTIVE SHORTCUTS
    5.2.3 PERSISTENT FAILURE ANALYSIS (II): THE TEXTURE FIDELITY GAP
  5.3 CASE STUDY: DIFFERENTIATING THE VISUALLY SIMILAR CLASSES X2 AND X12
  5.4 OUR SEMANTIC APPROACH VS. PCA FOR ORIENTATION DETERMINATION
CHAPTER 6 LIMITATIONS AND FUTURE WORK
CHAPTER 7 CONCLUSION
REFERENCES

References:
    [1] 朱嘉瑩 (2023). Automated training platform for TCM herb recognition [Master's thesis, National Cheng Kung University]. National Digital Library of Theses and Dissertations in Taiwan.
    [2] 張祐維 (2023). Deep learning applied to TCM herb recognition [Master's thesis, Ming Chi University of Technology]. National Digital Library of Theses and Dissertations in Taiwan.
    [3] Chen, W., Tong, J., He, R., Lin, Y., Chen, P., Chen, Z., & Liu, X. (2021). An easy method for identifying 315 categories of commonly-used Chinese herbal medicines based on automated image recognition using AutoML platforms. Informatics in Medicine Unlocked, 25, 100607.
    [4] Weng, J. C., Hu, M. C., & Lan, K. C. (2017, June). Recognition of easily-confused TCM herbs using deep learning. In Proceedings of the 8th ACM on Multimedia Systems Conference (pp. 233-234).
    [5] Cai, C., Liu, S., Wang, L., Yang, B., Zhi, M., Wang, R., & He, W. (2019, October). Classification of Chinese herbal medicine using combination of broad learning system and convolutional neural network. In 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC) (pp. 3907-3912). IEEE.
    [6] Song, Z., Chen, G., & Chen, C. Y. C. (2024). AI empowering traditional Chinese medicine? Chemical Science, 15(41), 16844-16886.
    [7] Miao, J., Huang, Y., Wang, Z., Wu, Z., & Lv, J. (2023). Image recognition of traditional Chinese medicine based on deep learning. Frontiers in Bioengineering and Biotechnology, 11, 1199803.
    [8] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10012-10022).
    [9] Shen, Y., Zhou, K., Wang, H., Yang, Y., & Shao, T. (2025). High-fidelity 3D Object Generation from Single Image with RGBN-Volume Gaussian Reconstruction Model. In Proceedings of the Computer Vision and Pattern Recognition Conference (pp. 21558-21569).
    [10] Haroon, U., AlMughrabi, A., Marques, R., & Radeva, P. (2024). MVSBoost: An efficient point cloud-based 3D reconstruction. arXiv preprint arXiv:2406.13515.
    [11] Choi, S., Nguyen, A. D., Kim, J., Ahn, S., & Lee, S. (2019, September). Point cloud deformation for single image 3d reconstruction. In 2019 IEEE International Conference on Image Processing (ICIP) (pp. 2379-2383). IEEE.
    [12] Xu, J., Cheng, W., Gao, Y., Wang, X., Gao, S., & Shan, Y. (2024). InstantMesh: Efficient 3D mesh generation from a single image with sparse-view large reconstruction models. arXiv preprint arXiv:2404.07191.
    [13] Liu, M., Xu, C., Jin, H., Chen, L., Varma T, M., Xu, Z., & Su, H. (2023). One-2-3-45: Any single image to 3D mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems, 36, 22226-22246.
    [14] Liu, M., Shi, R., Chen, L., Zhang, Z., Xu, C., Wei, X., ... & Su, H. (2024). One-2-3-45++: Fast single image to 3D objects with consistent multi-view generation and 3D diffusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10072-10083).
    [15] Worchel, M., Diaz, R., Hu, W., Schreer, O., Feldmann, I., & Eisert, P. (2022). Multi-view mesh reconstruction with neural deferred shading. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6187-6197).
    [16] Zhang, C., Cui, Z., Zhang, Y., Zeng, B., Pollefeys, M., & Liu, S. (2021). Holistic 3D scene understanding from a single image with implicit representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8833-8842).
    [17] Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R. (2021). NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1), 99-106.
    [18] Kerbl, B., Kopanas, G., Leimkühler, T., & Drettakis, G. (2023). 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), Article 139.
    [19] Li, X., Zhang, Q., Kang, D., Cheng, W., Gao, Y., Zhang, J., ... & Shan, Y. (2024). Advances in 3D generation: A survey. arXiv preprint arXiv:2401.17807.
    [20] Poole, B., Jain, A., Barron, J. T., & Mildenhall, B. (2022). DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988.
    [21] Li, J., Tan, H., Zhang, K., Xu, Z., Luan, F., Xu, Y., ... & Bi, S. (2023). Instant3D: Fast text-to-3D with sparse-view generation and large reconstruction model. arXiv preprint arXiv:2311.06214.
    [22] Shen, T., Munkberg, J., Hasselgren, J., Yin, K., Wang, Z., Chen, W., ... & Gao, J. (2023). Flexible isosurface extraction for gradient-based mesh optimization. ACM Transactions on Graphics (TOG), 42(4), 1-16.
    [23] Debevec, P. (2008). Rendering synthetic objects into real scenes: Bridging traditional and image-based graphics with global illumination and high dynamic range photography. In ACM SIGGRAPH 2008 classes (pp. 1-10).
    [24] Kluge, S., & Staadt, O. (2025, March). Assessing Photorealism of Rendered Objects in Real-World Images: A Transparent and Reproducible User Study. In 2025 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW) (pp. 387-393). IEEE.
    [25] Niemeyer, M., Mescheder, L., Oechsle, M., & Geiger, A. (2020). Differentiable volumetric rendering: Learning implicit 3D representations without 3D supervision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3504-3515).
    [26] Zheng, X., Weng, Z., Lyu, Y., Jiang, L., Xue, H., Ren, B., ... & Hu, X. (2025). Retrieval augmented generation and understanding in vision: A survey and new outlook. arXiv preprint arXiv:2503.18016.
    [27] Fei, B., Xu, J., Zhang, R., Zhou, Q., Yang, W., & He, Y. (2024). 3D Gaussian splatting as new era: A survey. IEEE Transactions on Visualization and Computer Graphics.
    [28] Lorensen, W. E., & Cline, H. E. (1998). Marching cubes: A high resolution 3D surface construction algorithm. In Seminal graphics: pioneering efforts that shaped the field (pp. 347-353).
    [29] Berian, A., & Mahalanobis, A. (2025, May). Modern novel view synthesis algorithms: a survey. In Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications III (Vol. 13459, pp. 331-337). SPIE.
    [30] Garland, M., & Heckbert, P. S. (1997, August). Surface simplification using quadric error metrics. In Proceedings of the 24th annual conference on Computer graphics and interactive techniques (pp. 209-216).
    [31] Dunteman, G. H. (1989). Principal components analysis (Vol. 69). Sage.
    [32] Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., & Vondrick, C. (2023). Zero-1-to-3: Zero-shot one image to 3D object. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9298-9309).
    [33] Shi, R., Chen, H., Zhang, Z., Liu, M., Xu, C., Wei, X., ... & Su, H. (2023). Zero123++: A single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110.
    [34] Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., ... & Farhadi, A. (2023). Objaverse: A universe of annotated 3D objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 13142-13153).
    [35] Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., ... & Yu, F. (2015). ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012.
    [36] Sarlin, P. E., DeTone, D., Malisiewicz, T., & Rabinovich, A. (2020). SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4938-4947).
    [37] DeTone, D., Malisiewicz, T., & Rabinovich, A. (2018). SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 224-236).
    [38] Liu, Y., Lin, C., Zeng, Z., Long, X., Liu, L., Komura, T., & Wang, W. (2023). SyncDreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453.
    [39] Voleti, V., Yao, C. H., Boss, M., Letts, A., Pankratz, D., Tochilkin, D., ... & Jampani, V. (2024, September). SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion. In European Conference on Computer Vision (pp. 439-457). Cham: Springer Nature Switzerland.
    [40] Tochilkin, D., Pankratz, D., Liu, Z., Huang, Z., Letts, A., Li, Y., ... & Cao, Y. P. (2024). TripoSR: Fast 3D object reconstruction from a single image. arXiv preprint arXiv:2403.02151.
    [41] Zou, Z. X., Yu, Z., Guo, Y. C., Li, Y., Liang, D., Cao, Y. P., & Zhang, S. H. (2024). Triplane meets Gaussian splatting: Fast and generalizable single-view 3D reconstruction with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10324-10335).
    [42] Wang, C., Peng, H. Y., Liu, Y. T., Gu, J., & Hu, S. M. (2025). Diffusion models for 3D generation: A survey. Computational Visual Media, 11(1), 1-28.
    [43] Johnson, J., Alahi, A., & Fei-Fei, L. (2016, September). Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision (pp. 694-711). Cham: Springer International Publishing.
    [44] Liu, J., Sun, W., Yang, H., Zeng, Z., Liu, C., Zheng, J., ... & Mian, A. (2024). Deep learning-based object pose estimation: A comprehensive survey. arXiv preprint arXiv:2405.07801.

Full-text availability: on campus, immediate open access; off campus, immediate open access.