
Graduate Student: Chen, Yan-Ying
Thesis Title: Low-power Real-time Object Classification System based on Multimodal Zero-Shot Architecture
Advisor: Lai, Chin-Feng
Degree: Master
Department: College of Engineering - Department of Engineering Science
Publication Year: 2025
Graduation Academic Year: 113 (ROC calendar)
Language: Chinese
Pages: 182
Keywords: Multimodal Architecture, Zero-Shot Classification, YOLOv8, Contrastive Language-Image Pre-training (CLIP), Edge Computing, Neural Processing Unit (NPU), Model Quantization

    With the rapid proliferation of edge computing and the Internet of Things (IoT), a growing number of intelligent vision applications require real-time data processing and analysis on resource-constrained endpoint devices. Traditional object detection models, such as the widely used You Only Look Once (YOLO) series, owe their inference speed to a single-stage design, but their inherent limitation is that they can recognize only the fixed categories predefined during training. When real-world scenarios present unseen objects or demand nuanced behavioral differentiation, these closed-vocabulary models require the costly and time-consuming collection of extensively annotated data for retraining, and thus lack the flexibility to cope with dynamic environments.
    To overcome this bottleneck, this study designs and implements a multimodal zero-shot architecture for a real-time object classification system capable of running efficiently on low-power edge devices. The system introduces a dual-stage "detect-then-classify" pipeline. The front-end module employs the highly efficient YOLOv8 object detector, tasked with rapidly extracting targets from complex input images and generating bounding boxes for localized object cropping. The back-end zero-shot classification module integrates the Contrastive Language-Image Pre-training (CLIP) model developed by OpenAI, specifically utilizing the CLIP ResNet 50x4 architecture. By projecting visual content and arbitrary natural language text prompts into a shared semantic embedding space, CLIP endows the system with robust open-vocabulary classification capabilities. This frees the system from the constraints of fixed labels, enabling the dynamic semantic understanding of object images transmitted from the front end.
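    The zero-shot scoring step in the back-end module can be illustrated in isolation. The sketch below is a minimal, hypothetical numpy example of CLIP-style open-vocabulary classification: one image embedding and several prompt embeddings are L2-normalized, compared by cosine similarity in the shared space, and converted to class probabilities. The toy 4-dimensional vectors stand in for CLIP's real embedding outputs; the temperature value is illustrative, not the trained CLIP logit scale.

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs, temperature=100.0):
    """CLIP-style zero-shot scoring: L2-normalize the image embedding and
    each prompt embedding, take cosine similarities in the shared space,
    then softmax the scaled logits into class probabilities."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (text_embs @ image_emb)
    exp = np.exp(logits - logits.max())   # stable softmax
    return exp / exp.sum()

# Toy embeddings standing in for CLIP's image/text encoder outputs.
image_emb = np.array([1.0, 0.0, 0.0, 0.0])
prompt_embs = np.array([
    [0.9, 0.1, 0.0, 0.0],   # prompt closely aligned with the image
    [0.0, 1.0, 0.0, 0.0],   # orthogonal (unrelated) prompt
])
probs = zero_shot_scores(image_emb, prompt_embs)
print(probs.argmax())  # → 0: the aligned prompt wins
```

Because the class set is just a list of prompt embeddings, new categories can be added at inference time by encoding new text prompts, with no retraining of either model.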
    Finally, to fulfill the core deployment objectives of low power consumption and real-time performance, this study utilizes the Hailo-8L edge AI accelerator chip (Neural Processing Unit, NPU), specifically designed for parallel neural network computation. Utilizing the Hailo Dataflow Compiler toolkit for model quantization and compilation, the high-precision 32-bit floating-point (FP32) model weights are converted into a hardware-friendly mixed integer format of INT8 and INT4.
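    The FP32-to-INT8 conversion mentioned above can be sketched with a minimal symmetric per-tensor quantizer. This is an illustration of the general principle only, not the Hailo Dataflow Compiler's actual algorithm (which additionally performs calibration, per-layer optimization, and mixed INT8/INT4 assignment): a single scale factor maps floating-point weights onto the signed 8-bit range, and dequantization recovers an approximation whose error is bounded by half the scale.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: one scale factor maps the
    FP32 weight range onto [-127, 127]."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an FP32 approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.array([-0.51, 0.02, 0.37, 1.27], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())  # rounding error ≤ scale / 2
```

Storing `q` instead of `w` cuts weight memory by 4x and lets the NPU use integer arithmetic, which is the source of the power and latency savings targeted here.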

    Table of Contents
    Chinese Abstract
    English Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    List of Abbreviations
    Chapter 1 Introduction
    1-1 Research Background
    1-2 Research Objectives
    1-3 Research Contributions
    1-4 Thesis Organization
    Chapter 2 Literature Review
    2-1 Development of Machine Learning and Computer Vision
    2-2 Development of Convolutional Neural Networks
    2-3 Evolution of Object Detection Techniques
    2-3-1 Two-Stage Detectors
    2-3-2 Single-Stage Detectors
    2-4 Multimodal Learning and Zero-Shot Classification
    2-5 Model Optimization Techniques for Edge Computing
    Chapter 3 Research Methodology
    3-1 System Architecture
    3-2 Frontend Object Detection Module
    3-2-1 YOLOv8 Core Architecture
    3-2-2 YOLOv8 Model Scales
    3-3 Backend Zero-Shot Classification Module
    3-3-1 Core Principles of CLIP
    3-3-2 Zero-Shot Classification with CLIP
    3-3-3 CLIP Model Scale Selection
    3-4 Edge-Side Model Optimization and Deployment
    3-4-1 Model Conversion Workflow
    3-4-2 Model Quantization
    3-4-3 Configurable Compression Levels
    3-5 Chapter Summary
    Chapter 4 Experimental Design
    4-1 Experimental Environments
    4-1-1 Model Training Environment
    4-1-2 Hailo NPU Model Conversion Environment
    4-1-3 Model Inference Environment
    4-2 Experimental Datasets
    4-2-1 Object Detection Dataset Sources and Filtering
    4-2-2 Training Data Preprocessing and Augmentation
    4-2-3 Zero-Shot Classification Evaluation Dataset Sources
    4-3 Evaluation Metric Definitions
    4-3-1 Model Complexity Metrics
    4-3-2 Detection Accuracy Metrics
    4-3-3 Classification Accuracy Metrics
    4-3-4 Runtime Performance Metrics
    4-4 Experimental Procedures
    4-4-1 Object Detection Model Training Procedure
    4-4-2 Hailo NPU Model Quantization and Conversion Procedure
    4-4-3 Object Detection Model Metric Evaluation Procedure
    4-4-4 Post-Quantization Model Metric Evaluation Procedure
    4-4-5 Cross-Platform Inference Performance Testing of Object Detection Models
    4-4-6 Cross-Platform Inference Performance Testing of Zero-Shot Classification Models
    4-5 Chapter Summary
    Chapter 5 Experimental Results
    5-1 Experimental Environment and Settings
    5-1-1 Model Training Environment
    5-1-2 Object Detection Model Training Parameters
    5-1-3 Model Metric Definitions
    5-2 Object Detection Model Metric Evaluation
    5-2-1 Detailed Metric Analysis per Model
    5-2-2 Comprehensive Model Comparison and Analysis
    5-3 Post-Quantization Model Metric Evaluation
    5-4 Object Detection Model Runtime Performance Evaluation
    5-4-1 CPU-Only Platform Performance
    5-4-2 High-Performance GPU Platform Performance
    5-4-3 NPU Accelerator Performance
    5-5 Zero-Shot Classification Model Runtime Performance Evaluation
    5-5-1 CPU-Only Platform Performance
    5-5-2 High-Performance GPU Platform Performance
    5-5-3 NPU Accelerator Performance
    5-6 Detection Results of YOLOv8 PyTorch Models
    5-7 Post-Quantization Model Results
    5-7-1 YOLOv8s Detection Results
    5-7-2 CLIP ResNet 50x4 Classification Results
    5-7-3 Combined YOLOv8s and CLIP ResNet 50x4 Results
    5-8 Chapter Summary
    Chapter 6 Conclusion and Future Work
    6-1 Conclusion
    6-2 Future Work
    References

