
Author: Chen, Yi-Chen (陳奕辰)
Title: Positive and Negative Set Designs in Contrastive Feature Learning for Temporal Action Segmentation
Advisor: Chu, Wei-Ta (朱威達)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Publication Year: 2023
Graduation Academic Year: 111 (ROC calendar, 2022-2023)
Language: English
Number of Pages: 40
Keywords: Temporal Action Segmentation, Contrastive Learning, Representation Learning
Hits: 69; Downloads: 6

    When data labels are scarce, contrastive learning is often used to learn representations. In contrastive learning, not only the learning mechanism but also the design of the positive and negative sets is critical. While most previous works focus on designing new learning mechanisms, this thesis investigates the importance of positive and negative set designs in temporal action segmentation (TAS). We propose positive and negative set designs for both timestamp-supervised TAS and unsupervised TAS. In the timestamp-supervised case, we define a pseudo label for each frame according to its neighboring timestamp labels and use these pseudo labels to construct the positive and negative sets. For unsupervised TAS, we define the positive and negative sets directly from the similarity between the anchor frame and its neighboring frames. To mitigate over-segmentation, we introduce a merging algorithm that effectively combines positive sets. We then apply contrastive learning with the constructed sets to enhance the representations. By incorporating the learned representations into existing methods, we improve the MoF of timestamp-supervised TAS by 8% to 15% across three different datasets, and the F1 score of unsupervised TAS by 8% to 9%, achieving new state-of-the-art segmentation results.
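The abstract describes the timestamp-supervised set construction only at a high level. A minimal sketch of the idea, assuming a nearest-timestamp pseudo-labeling rule and a standard InfoNCE-style objective (both are illustrative assumptions; the thesis's exact procedure is not given in this record):

```python
import math

def pseudo_labels(num_frames, timestamps):
    """Assign each frame the label of its nearest annotated timestamp.

    timestamps: list of (frame_index, label) pairs.
    The nearest-neighbor rule here is an illustrative assumption, not
    necessarily the thesis's exact pseudo-labeling procedure.
    """
    return [min(timestamps, key=lambda p: abs(p[0] - t))[1]
            for t in range(num_frames)]

def pos_neg_sets(anchor, labels):
    """Positive set: frames sharing the anchor's pseudo label;
    negative set: frames with a different pseudo label."""
    pos = [i for i, lab in enumerate(labels) if lab == labels[anchor] and i != anchor]
    neg = [i for i, lab in enumerate(labels) if lab != labels[anchor]]
    return pos, neg

def info_nce(sim_pos, sim_negs, temperature=0.1):
    """Standard InfoNCE loss for one anchor, given a similarity score to
    one positive and a list of similarity scores to negatives."""
    num = math.exp(sim_pos / temperature)
    den = num + sum(math.exp(s / temperature) for s in sim_negs)
    return -math.log(num / den)

# Toy example: 6 frames, timestamps at frame 0 ('pour') and frame 5 ('stir').
labels = pseudo_labels(6, [(0, "pour"), (5, "stir")])
# labels == ['pour', 'pour', 'pour', 'stir', 'stir', 'stir']
pos, neg = pos_neg_sets(1, labels)  # anchor frame 1 is pseudo-labeled 'pour'
# pos == [0, 2], neg == [3, 4, 5]
```

In the unsupervised setting, the abstract instead derives the sets from frame-to-neighbor similarity and merges positive sets to curb over-segmentation; the same contrastive objective can then be applied.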

    Table of Contents

    Abstract (Chinese)
    Abstract
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1. Introduction
      1.1. Motivation
      1.2. Concept
      1.3. Contributions
      1.4. Thesis Organization
    Chapter 2. Related Works
      2.1. Temporal Action Segmentation
      2.2. Contrastive Representation Learning
      2.3. Other Similar Tasks
    Chapter 3. Pilot Study
      3.1. Background
      3.2. How Pos./Neg. Sets Influence Performance
    Chapter 4. Contrastive Representation Learning
      4.1. Timestamp-Supervised Representation Learning
      4.2. Unsupervised Representation Learning
      4.3. Temporal Action Segmentation
    Chapter 5. Experiments
      5.1. Datasets
      5.2. Implementation Details
      5.3. Evaluation Measures
        5.3.1. Frame-Based Measures
        5.3.2. Segment-Based Measures
      5.4. Performance of Timestamp-Supervised TAS
      5.5. Performance of Unsupervised TAS
      5.6. Ablation Study
        5.6.1. τ in Timestamp-Supervised TAS
        5.6.2. ρ and R in Unsupervised TAS
        5.6.3. Impact of the Sizes of Positive Sets
        5.6.4. Comparison with Fine-tuned Method
      5.7. Visual Samples
    Chapter 6. Conclusion
      6.1. Summary
      6.2. Future Work
    References


    Full-text availability: on campus: immediate open access; off campus: immediate open access.