| Graduate Student: | 李婕 Lee, Chieh |
|---|---|
| Thesis Title: | Lightweight and Real-time Semantic Segmentation using Cross-layer Efficient Attention Computation |
| Advisor: | 許志仲 Hsu, Chih-Chung |
| Degree: | Master |
| Department: | Institute of Data Science, College of Management |
| Publication Year: | 2022 |
| Academic Year: | 110 |
| Language: | English |
| Pages: | 39 |
| Keywords: | real-time semantic segmentation, light-weight model, efficient attention computation |
| Hits: | Views: 172, Downloads: 7 |
Real-time semantic segmentation has become a popular research field for various applications such as autonomous driving, scene parsing, and even medical image diagnosis. Although recent real-time semantic segmentation networks achieve fast inference, their computational resource usage and power consumption still leave room for improvement. Specifically, the self-attention mechanism is widely used in semantic segmentation networks to boost performance. However, the computational bottleneck of self-attention can hinder fast inference, so improving computational efficiency is critical for real-time applications. We propose a lightweight real-time semantic segmentation network based on efficient attention computation, termed EAC, and empirically demonstrate that avoiding dimensionality reduction and enabling appropriate cross-channel interaction (CHI) are essential for learning effective and efficient channel attention. Furthermore, a detail-aware loss is proposed to learn edge-preserving auxiliary information during the training phase, improving boundary consistency between objects. Extensive experiments confirm that the proposed method achieves state-of-the-art performance on standard benchmark datasets, especially for real-time applications.
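The key design described in the abstract — channel attention that avoids dimensionality reduction and instead mixes information across neighboring channels — can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration of the general idea (global pooling, a 1D convolution over channel descriptors, then sigmoid gating), not the thesis's actual EAC module; the function name, kernel size, and fixed averaging weights are all illustrative.

```python
import numpy as np

def channel_attention_sketch(x, k=3):
    """Illustrative channel attention without dimensionality reduction.

    x: feature map of shape (C, H, W); k: odd 1D kernel size.
    Each channel keeps its own descriptor (no bottleneck projection),
    and cross-channel interaction comes from a 1D convolution over
    the channel axis.
    """
    C = x.shape[0]
    # Global average pooling -> one scalar descriptor per channel.
    s = x.mean(axis=(1, 2))                      # shape (C,)
    # Cross-channel interaction via a 1D convolution over channels.
    # Weights are fixed here purely for illustration; a learned
    # kernel would be used in practice.
    w = np.full(k, 1.0 / k)
    pad = k // 2
    s_pad = np.pad(s, pad, mode="edge")
    z = np.array([np.dot(w, s_pad[i:i + k]) for i in range(C)])
    # Sigmoid gate applied per channel; no reduction anywhere.
    a = 1.0 / (1.0 + np.exp(-z))                 # shape (C,)
    return x * a[:, None, None]
```

Because the attention weights stay in (0, 1) and are applied per channel, the output has the same shape as the input and is a channel-wise rescaling of it.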