| Field | Value |
|---|---|
| Author | 陳彥穎 (Chen, Yan-Ying) |
| Thesis title | 基於多模態 Zero-Shot 架構之低功耗即時物件分類系統 (Low-power Real-time Object Classification System based on Multimodal Zero-Shot Architecture) |
| Advisor | 賴槿峰 (Lai, Chin-Feng) |
| Degree | Master |
| Department | College of Engineering, Department of Engineering Science |
| Year of publication | 2025 |
| Graduating academic year | 113 (ROC calendar) |
| Language | Chinese |
| Pages | 182 |
| Keywords | Multimodal Architecture, Zero-Shot Classification, YOLOv8, Contrastive Language-Image Pre-training (CLIP), Edge Computing, Neural Processing Unit (NPU), Model Quantization |
With the rapid proliferation of edge computing and the Internet of Things (IoT), a growing number of intelligent vision applications require real-time data processing and analysis on resource-constrained endpoint devices. Traditional object detectors, such as the widely used You Only Look Once (YOLO) series, achieve excellent inference speed through their single-stage design, but they can recognize only the fixed set of categories predefined during training. When a real-world scene contains previously unseen objects, or when finer-grained object behavior must be described, such closed-vocabulary models require collecting large amounts of annotated data and retraining, a process that is both costly and inflexible in dynamic environments.
To overcome this bottleneck, this study designs and implements a multimodal zero-shot real-time object classification system that runs efficiently on low-power edge devices. The system adopts a two-stage "detect-then-classify" pipeline. The front-end detection module uses the efficient YOLOv8 model to rapidly localize targets in complex input images and crop them along their bounding boxes. The back-end zero-shot classification module integrates OpenAI's Contrastive Language-Image Pre-training (CLIP) model, specifically the CLIP ResNet-50x4 architecture. By projecting image content and arbitrary natural-language text prompts into a shared semantic embedding space, CLIP gives the system robust open-vocabulary classification capability, freeing it from fixed labels and enabling dynamic semantic understanding of the object crops passed from the front end.
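The back-end classification step described above reduces to nearest-neighbor search in the shared embedding space. The following is a minimal sketch of that step only: toy 4-dimensional NumPy vectors stand in for real CLIP RN50x4 image and text embeddings, and the label strings are illustrative, not taken from the thesis. The temperature value mimics CLIP's learned logit scale but is an assumed constant here.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels, temperature=100.0):
    """Rank candidate labels by cosine similarity between one image
    embedding and each text-prompt embedding in the shared space."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                      # cosine similarity per prompt
    z = temperature * sims                # CLIP-style logit scaling
    z -= z.max()                          # stabilize the softmax
    probs = np.exp(z) / np.exp(z).sum()
    order = np.argsort(-sims)             # best match first
    return [(labels[i], float(probs[i])) for i in order]

# Toy embeddings standing in for CLIP features of text prompts and a crop.
labels = ["a person walking", "a parked bicycle", "a stray dog"]
text_embs = np.array([[1.0, 0.1, 0.0, 0.0],
                      [0.0, 1.0, 0.2, 0.0],
                      [0.0, 0.0, 1.0, 0.3]])
image_emb = np.array([0.9, 0.2, 0.05, 0.0])   # closest to the first prompt
ranking = zero_shot_classify(image_emb, text_embs, labels)
print(ranking[0][0])  # → "a person walking"
```

Because the label set is just a list of strings embedded at runtime, swapping in new categories requires no retraining, which is the practical meaning of "open vocabulary" in this design.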
Finally, to meet the core deployment goals of low power consumption and real-time performance, this study deploys the models on the Hailo-8L edge AI accelerator, a neural processing unit (NPU) designed for parallel neural-network computation. Using the Hailo Dataflow Compiler toolchain for model quantization and compilation, the high-precision 32-bit floating-point (FP32) model weights are converted into a hardware-friendly mixed INT8/INT4 integer format.
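The Hailo Dataflow Compiler performs calibration-based quantization internally; the NumPy sketch below only illustrates the basic arithmetic of mapping FP32 weights to INT8 or INT4, not Hailo's actual algorithm. It uses simple symmetric per-tensor quantization, one assumed scheme among several, where the largest weight magnitude is mapped to the integer maximum.

```python
import numpy as np

def quantize_symmetric(w, n_bits=8):
    """Symmetric per-tensor quantization: map FP32 weights onto signed
    integers so the largest magnitude lands on the integer maximum."""
    qmax = 2 ** (n_bits - 1) - 1          # 127 for INT8, 7 for INT4
    scale = np.max(np.abs(w)) / qmax      # FP32 step size per integer level
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    """Recover an FP32 approximation of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=1000).astype(np.float32)
q8, s8 = quantize_symmetric(w, n_bits=8)
q4, s4 = quantize_symmetric(w, n_bits=4)
# Rounding error is bounded by half a quantization step.
err8 = np.max(np.abs(dequantize(q8, s8) - w))
err4 = np.max(np.abs(dequantize(q4, s4) - w))
```

The bound `err <= scale / 2` makes the INT8-vs-INT4 trade-off concrete: INT4 has 16x fewer levels, so its step size, and hence its worst-case weight error, is roughly 16x larger, which is why mixed-precision assignment of layers matters for accuracy.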