| Author: | 鄭力維 Cheng, Li-Wei |
|---|---|
| Thesis Title: | 利用資料增強實現中藥材辨識之單樣本學習 (One-shot Learning Using Data Augmentation for TCM Herb Recognition) |
| Advisor: | 藍崑展 Lan, Kun-Chan |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Graduate Program of Artificial Intelligence |
| Year of Publication: | 2025 |
| Academic Year of Graduation: | 113 (ROC) |
| Language: | English |
| Pages: | 70 |
| Keywords: | One-Shot Learning, Data Augmentation, 3D Reconstruction, Novel View Synthesis, Generative Models, Pose Estimation, TCM Herb Classification |
This research addresses the critical challenge of data acquisition for fine-grained object recognition, particularly in specialized domains like Traditional Chinese Medicine (TCM) where multi-view datasets are impractical to collect. We propose and validate a novel one-shot data augmentation pipeline that automatically generates multi-view training data from a single input image. Our 2D→3D→2D methodology leverages InstantMesh, a state-of-the-art single-image-to-3D model, to create a 3D proxy of the object. However, our analysis identifies two fundamental limitations of the baseline approach: an Object Orientation Gap caused by a semantically blind rotation axis, and a Texture Fidelity Gap stemming from the reconstruction model's architectural bottlenecks and its training on general-purpose datasets.
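The abstract summarizes the 2D→3D→2D pipeline without implementation detail, so the following is a minimal sketch of its 3D→2D half: given a textured mesh reconstructed by InstantMesh (the public repository at https://github.com/TencentARC/InstantMesh can export such a mesh), orbit a virtual camera around the object and render one training image per viewpoint. The use of trimesh and pyrender, the camera and lighting parameters, and the fixed world z orbit axis (the baseline behavior whose "semantically blind rotation axis" the Adaptive Orbit Axis later replaces) are all assumptions, not the thesis's code.

```python
import numpy as np
import trimesh
import pyrender
from PIL import Image

def look_at(eye, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    """Camera-to-world pose for a camera at `eye` looking at `target`."""
    z = eye - target
    z /= np.linalg.norm(z)                # pyrender cameras look along -z
    x = np.cross(up, z)
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1], pose[:3, 2], pose[:3, 3] = x, y, z, eye
    return pose

def render_orbit_views(mesh_path, out_prefix, num_views=16, elevation_deg=20.0):
    """Render evenly spaced orbit views of a reconstructed mesh (3D -> 2D)."""
    mesh = trimesh.load(mesh_path, force="mesh")
    mesh.apply_translation(-mesh.bounding_box.centroid)  # center the object
    radius = 2.0 * mesh.scale                            # assumed camera distance

    scene = pyrender.Scene(bg_color=[1.0, 1.0, 1.0], ambient_light=[0.3] * 3)
    scene.add(pyrender.Mesh.from_trimesh(mesh))
    cam = pyrender.PerspectiveCamera(yfov=np.pi / 6.0)
    light = pyrender.DirectionalLight(intensity=3.0)
    renderer = pyrender.OffscreenRenderer(512, 512)

    el = np.deg2rad(elevation_deg)
    for i in range(num_views):
        az = 2.0 * np.pi * i / num_views
        # Baseline behavior: orbit around the fixed world z-axis.
        eye = radius * np.array(
            [np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])
        pose = look_at(eye)
        cam_node = scene.add(cam, pose=pose)    # co-locate camera and light
        light_node = scene.add(light, pose=pose)
        color, _ = renderer.render(scene)
        Image.fromarray(color).save(f"{out_prefix}_{i:02d}.png")
        scene.remove_node(cam_node)
        scene.remove_node(light_node)
    renderer.delete()
```

Calling `render_orbit_views("herb.obj", "herb_view")` on the reconstructed mesh would yield 16 synthetic views to add to the classifier's training set; the view count and elevation are illustrative only.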
To overcome the orientation problem, we introduce a novel Adaptive Orbit Axis algorithm that analyzes the geometry of the reconstructed 3D mesh to determine semantically correct viewpoints. Experiments show this is the dominant factor in performance, raising downstream classification accuracy from 71.8% to 81.4%. Our in-depth analysis of failure cases further reveals two critical insights: the classifier tends to learn non-robust "shortcuts" from inconsistent synthetic data, and the pipeline's overall performance is ultimately capped by the upstream model's ability to preserve high-frequency texture features. Overall, this work provides a viable proof of concept for one-shot data augmentation in fine-grained contexts and concludes that future work must prioritize improving the fidelity of the foundational 3D reconstruction model.
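The abstract describes the Adaptive Orbit Axis only as analyzing 3D mesh geometry; given that the bibliography cites principal components analysis [31], one plausible reading is that the orbit axis is aligned with the dominant principal axis of the vertex cloud. The sketch below follows that reading and is speculative: `adaptive_orbit_axis` and its sign convention are hypothetical, not the thesis's published algorithm.

```python
import numpy as np
import trimesh

def adaptive_orbit_axis(mesh: trimesh.Trimesh) -> np.ndarray:
    """Unit vector along the mesh's dominant geometric axis (via PCA)."""
    pts = mesh.vertices - mesh.vertices.mean(axis=0)  # center the vertex cloud
    # Eigen-decompose the 3x3 covariance of the vertices; eigh returns
    # eigenvalues in ascending order, so the last eigenvector spans the
    # direction of greatest extent.
    _, eigvecs = np.linalg.eigh(np.cov(pts.T))
    axis = eigvecs[:, -1]
    return axis if axis[2] >= 0 else -axis            # fix the arbitrary sign
```

Passing this vector as the `up` argument of `look_at` in the earlier sketch (and sampling camera positions in a basis aligned with it) would make the camera circle the herb's own long axis instead of the renderer's default vertical, which is one way to close the Object Orientation Gap the abstract describes.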
[1] 朱嘉瑩 (2023). 自動化中藥辨識訓練平台 [Automated training platform for TCM herb recognition] (Master's thesis, National Cheng Kung University). National Digital Library of Theses and Dissertations in Taiwan.
[2] 張祐維 (2023). 深度學習應用於中藥材辨識 [Deep learning for TCM herb recognition] (Master's thesis, Ming Chi University of Technology). National Digital Library of Theses and Dissertations in Taiwan.
[3] Chen, W., Tong, J., He, R., Lin, Y., Chen, P., Chen, Z., & Liu, X. (2021). An easy method for identifying 315 categories of commonly-used Chinese herbal medicines based on automated image recognition using AutoML platforms. Informatics in Medicine Unlocked, 25, 100607.
[4] Weng, J. C., Hu, M. C., & Lan, K. C. (2017, June). Recognition of easily-confused TCM herbs using deep learning. In Proceedings of the 8th ACM on Multimedia Systems Conference (pp. 233-234).
[5] Cai, C., Liu, S., Wang, L., Yang, B., Zhi, M., Wang, R., & He, W. (2019, October). Classification of Chinese herbal medicine using combination of broad learning system and convolutional neural network. In 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC) (pp. 3907-3912). IEEE.
[6] Song, Z., Chen, G., & Chen, C. Y. C. (2024). AI empowering traditional Chinese medicine? Chemical Science, 15(41), 16844-16886.
[7] Miao, J., Huang, Y., Wang, Z., Wu, Z., & Lv, J. (2023). Image recognition of traditional Chinese medicine based on deep learning. Frontiers in Bioengineering and Biotechnology, 11, 1199803.
[8] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10012-10022).
[9] Shen, Y., Zhou, K., Wang, H., Yang, Y., & Shao, T. (2025). High-fidelity 3D Object Generation from Single Image with RGBN-Volume Gaussian Reconstruction Model. In Proceedings of the Computer Vision and Pattern Recognition Conference (pp. 21558-21569).
[10] Haroon, U., AlMughrabi, A., Marques, R., & Radeva, P. (2024). MVSBoost: An efficient point cloud-based 3D reconstruction. arXiv preprint arXiv:2406.13515.
[11] Choi, S., Nguyen, A. D., Kim, J., Ahn, S., & Lee, S. (2019, September). Point cloud deformation for single image 3D reconstruction. In 2019 IEEE International Conference on Image Processing (ICIP) (pp. 2379-2383). IEEE.
[12] Xu, J., Cheng, W., Gao, Y., Wang, X., Gao, S., & Shan, Y. (2024). InstantMesh: Efficient 3D mesh generation from a single image with sparse-view large reconstruction models. arXiv preprint arXiv:2404.07191.
[13] Liu, M., Xu, C., Jin, H., Chen, L., Varma T, M., Xu, Z., & Su, H. (2023). One-2-3-45: Any single image to 3D mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems, 36, 22226-22246.
[14] Liu, M., Shi, R., Chen, L., Zhang, Z., Xu, C., Wei, X., ... & Su, H. (2024). One-2-3-45++: Fast single image to 3D objects with consistent multi-view generation and 3D diffusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10072-10083).
[15] Worchel, M., Diaz, R., Hu, W., Schreer, O., Feldmann, I., & Eisert, P. (2022). Multi-view mesh reconstruction with neural deferred shading. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6187-6197).
[16] Zhang, C., Cui, Z., Zhang, Y., Zeng, B., Pollefeys, M., & Liu, S. (2021). Holistic 3d scene understanding from a single image with implicit representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8833-8842).
[17] Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R. (2021). NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1), 99-106.
[18] Kerbl, B., Kopanas, G., Leimkühler, T., & Drettakis, G. (2023). 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), Article 139.
[19] Li, X., Zhang, Q., Kang, D., Cheng, W., Gao, Y., Zhang, J., ... & Shan, Y. (2024). Advances in 3D generation: A survey. arXiv preprint arXiv:2401.17807.
[20] Poole, B., Jain, A., Barron, J. T., & Mildenhall, B. (2022). DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988.
[21] Li, J., Tan, H., Zhang, K., Xu, Z., Luan, F., Xu, Y., ... & Bi, S. (2023). Instant3D: Fast text-to-3D with sparse-view generation and large reconstruction model. arXiv preprint arXiv:2311.06214.
[22] Shen, T., Munkberg, J., Hasselgren, J., Yin, K., Wang, Z., Chen, W., ... & Gao, J. (2023). Flexible isosurface extraction for gradient-based mesh optimization. ACM Transactions on Graphics (TOG), 42(4), 1-16.
[23] Debevec, P. (2008). Rendering synthetic objects into real scenes: Bridging traditional and image-based graphics with global illumination and high dynamic range photography. In ACM SIGGRAPH 2008 classes (pp. 1-10).
[24] Kluge, S., & Staadt, O. (2025, March). Assessing Photorealism of Rendered Objects in Real-World Images: A Transparent and Reproducible User Study. In 2025 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW) (pp. 387-393). IEEE.
[25] Niemeyer, M., Mescheder, L., Oechsle, M., & Geiger, A. (2020). Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3504-3515).
[26] Zheng, X., Weng, Z., Lyu, Y., Jiang, L., Xue, H., Ren, B., ... & Hu, X. (2025). Retrieval augmented generation and understanding in vision: A survey and new outlook. arXiv preprint arXiv:2503.18016.
[27] Fei, B., Xu, J., Zhang, R., Zhou, Q., Yang, W., & He, Y. (2024). 3D Gaussian splatting as new era: A survey. IEEE Transactions on Visualization and Computer Graphics.
[28] Lorensen, W. E., & Cline, H. E. (1998). Marching cubes: A high resolution 3D surface construction algorithm. In Seminal graphics: pioneering efforts that shaped the field (pp. 347-353).
[29] Berian, A., & Mahalanobis, A. (2025, May). Modern novel view synthesis algorithms: a survey. In Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications III (Vol. 13459, pp. 331-337). SPIE.
[30] Garland, M., & Heckbert, P. S. (1997, August). Surface simplification using quadric error metrics. In Proceedings of the 24th annual conference on Computer graphics and interactive techniques (pp. 209-216).
[31] Dunteman, G. H. (1989). Principal components analysis (Vol. 69). Sage.
[32] Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., & Vondrick, C. (2023). Zero-1-to-3: Zero-shot one image to 3D object. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9298-9309).
[33] Shi, R., Chen, H., Zhang, Z., Liu, M., Xu, C., Wei, X., ... & Su, H. (2023). Zero123++: A single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110.
[34] Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., ... & Farhadi, A. (2023). Objaverse: A universe of annotated 3D objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 13142-13153).
[35] Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., ... & Yu, F. (2015). ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012.
[36] Sarlin, P. E., DeTone, D., Malisiewicz, T., & Rabinovich, A. (2020). SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4938-4947).
[37] DeTone, D., Malisiewicz, T., & Rabinovich, A. (2018). SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 224-236).
[38] Liu, Y., Lin, C., Zeng, Z., Long, X., Liu, L., Komura, T., & Wang, W. (2023). SyncDreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453.
[39] Voleti, V., Yao, C. H., Boss, M., Letts, A., Pankratz, D., Tochilkin, D., ... & Jampani, V. (2024, September). SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion. In European Conference on Computer Vision (pp. 439-457). Cham: Springer Nature Switzerland.
[40] Tochilkin, D., Pankratz, D., Liu, Z., Huang, Z., Letts, A., Li, Y., ... & Cao, Y. P. (2024). TripoSR: Fast 3D object reconstruction from a single image. arXiv preprint arXiv:2403.02151.
[41] Zou, Z. X., Yu, Z., Guo, Y. C., Li, Y., Liang, D., Cao, Y. P., & Zhang, S. H. (2024). Triplane meets Gaussian splatting: Fast and generalizable single-view 3D reconstruction with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10324-10335).
[42] Wang, C., Peng, H. Y., Liu, Y. T., Gu, J., & Hu, S. M. (2025). Diffusion models for 3D generation: A survey. Computational Visual Media, 11(1), 1-28.
[43] Johnson, J., Alahi, A., & Fei-Fei, L. (2016, September). Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision (pp. 694-711). Cham: Springer International Publishing.
[44] Liu, J., Sun, W., Yang, H., Zeng, Z., Liu, C., Zheng, J., ... & Mian, A. (2024). Deep learning-based object pose estimation: A comprehensive survey. arXiv preprint arXiv:2405.07801.