
Author: Wu, Mei-Hsuan (吳玫萱)
Title: Robust DeepFake Video Detection based on Masked Spatiotemporal Transformer
Advisor: Hsu, Chih-Chung (許志仲)
Degree: Master
Department: Institute of Data Science, College of Management
Year of Publication: 2022
Graduating Academic Year: 110 (2021–2022)
Language: English
Number of Pages: 44
Keywords: DeepFake detection, visual transformer, generative adversarial nets, autoencoder, video forensics
  • Abstract (Chinese): The harm caused by DeepFake video manipulation techniques has become a serious problem in today's society. Many DeepFake detection methods have therefore been proposed and have shown good detection performance on public benchmark datasets, aiming to reduce the negative impact of DeepFake videos on social networks.
    However, DeepFake detection methods depend heavily on good and stable face detection results, and an unstable face detector can pose a new challenge to current DeepFake detection techniques. Specifically, unreliable face detectors or low-quality videos can produce incorrect face crops, introducing unexpected temporal jittering artifacts and degrading the performance of current DeepFake video detectors.
    In this thesis, a robust DeepFake video detection method based on masked feature learning and a convolutional context transformer (CCT) is proposed to address this threat.
    First, a conventional convolutional neural network extracts the spatial feature representation. Then, the context transformer in the proposed CCT explores the mutual dependencies between consecutive frames and between pixels to learn a spatiotemporal representation that benefits the DeepFake video detection task. Finally, a masked CCT autoencoder (MCCTA) is proposed, which further improves the robustness of the model by introducing frame-level random masking during training and using a generator to reconstruct valid face images from the invalid ones produced by the masking.
    As a result, even when an unstable face detector yields partially invalid face crops, the robust and effective spatiotemporal features in MCCTA maintain the detection performance. Comprehensive experiments show that the proposed MCCTA achieves state-of-the-art performance.
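The frame-level random masking described above can be pictured with a short sketch. The following PyTorch snippet is only an illustration under assumed conventions, not the thesis code: the tensor layout (B, Nf, C, H, W), the helper name mask_frames, and the example masking ratio of 0.3 are choices made here for demonstration. It simply drops a random subset of face frames to mimic invalid face-detector outputs during training.

```python
import torch

def mask_frames(clip: torch.Tensor, mr: float = 0.3):
    """Randomly drop a fraction `mr` of the frames in a face-crop clip.

    clip: (B, Nf, C, H, W) batch of per-frame face crops.
    Returns the masked clip and a boolean mask of shape (B, Nf),
    where True marks frames that were zeroed out.
    """
    B, Nf = clip.shape[:2]
    # One Bernoulli draw per frame; True means the frame is masked.
    mask = torch.rand(B, Nf, device=clip.device) < mr
    masked = clip.clone()
    masked[mask] = 0.0  # stand-in for an invalid / missing face crop
    return masked, mask

# Example: a batch of 2 clips, each with 16 frames of 112x112 RGB face crops.
clip = torch.randn(2, 16, 3, 112, 112)
masked_clip, mask = mask_frames(clip, mr=0.3)
```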

    Abstract (English): The harm caused by DeepFake video manipulation techniques is becoming an increasingly serious issue. Several DeepFake detection methods have therefore been proposed recently and have shown promising results on public benchmark datasets, aiming to reduce the impact of DeepFake videos on social networks.
    However, DeepFake detection methods rely heavily on reliable face detectors, and an unstable face detector poses a new challenge to current DeepFake detection techniques. Specifically, invalid facial images caused by unreliable face detectors or low-quality videos can introduce unexpected temporal jittering artifacts, degrading the performance of current DeepFake video detectors.
    Temporal features are important clues in video-level DeepFake detection, so temporal feature learning matters as much as spatial feature learning. Because false detections from the face detector can severely corrupt these temporal features, overcoming the temporal jittering artifact is the key problem this work addresses.

    In this thesis, a robust DeepFake video detection method based on masked feature learning and a convolutional context transformer (CCT) is proposed to address this threat. First, the spatial feature representation is extracted by a conventional convolutional neural network and fed to the CCT. Then, the proposed context transformer in the CCT explores the mutual dependencies between frames and between pixels, so that both spatial and temporal features benefit the DeepFake video detection task. Finally, a masked CCT autoencoder (MCCTA) is proposed to further boost the robustness of the model by introducing frame-level random masking during the training phase and reconstructing valid facial images from the invalid ones via a generator. As a result, the robust and effective spatiotemporal features maintain strong performance even when the unstable face detector yields partially invalid facial images. Comprehensive experiments confirm that the proposed MCCTA achieves state-of-the-art performance.
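To make the pipeline described above more concrete, the sketch below shows one plausible arrangement of the pieces in PyTorch. It is not the thesis implementation: the class name MCCTASketch, the tiny CNN backbone, the use of a standard nn.TransformerEncoder as a stand-in for the convolutional context transformer, and the feature-level reconstruction head are all assumptions made purely for illustration.

```python
import torch
import torch.nn as nn


class MCCTASketch(nn.Module):
    """Illustrative pipeline: per-frame CNN features -> transformer over
    frames (temporal context) -> real/fake head + reconstruction head."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Per-frame spatial encoder (stand-in for the conventional CNN backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Transformer encoder exploring dependencies between frames.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        # Real/fake classifier over the pooled spatiotemporal feature.
        self.cls_head = nn.Linear(feat_dim, 1)
        # Lightweight decoder ("generator") reconstructing per-frame features
        # for the masked positions.
        self.decoder = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                     nn.Linear(feat_dim, feat_dim))

    def forward(self, clip: torch.Tensor):
        # clip: (B, Nf, 3, H, W), possibly containing masked (zeroed) frames.
        B, Nf = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1)).view(B, Nf, -1)  # (B, Nf, D)
        ctx = self.temporal(feats)                                 # (B, Nf, D)
        logit = self.cls_head(ctx.mean(dim=1))                     # (B, 1)
        recon = self.decoder(ctx)                                  # (B, Nf, D)
        return logit, recon
```

In the setting described in the abstract, the reconstruction target for the masked frames would be the corresponding valid facial images (here simplified to frame-level features), and the classification and reconstruction objectives would be optimized jointly, as outlined in Section 3.4 of the table of contents.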

    Table of Contents:
    Chinese Abstract i
    Abstract ii
    Acknowledgements iv
    Contents v
    List of Tables vii
    List of Figures viii
    1 Introduction 1
    2 Related Works 5
      2.1 DeepFake image detection 5
      2.2 DeepFake video detection 6
      2.3 Robustness and generalization ability of deepfake detection 7
    3 Proposed Method 10
      3.1 Overview of the Proposed Method 10
      3.2 Convolutional Context Transformer 12
      3.3 Masked Autoencoders 15
      3.4 Joint Optimization 18
    4 Performance Evaluation 19
      4.1 Experimental Setting 19
      4.2 Experimental Evaluation Metrics 21
      4.3 Quantitative Results 22
      4.4 Model Interpretability 26
      4.5 Ablation Study 28
        4.5.1 Masking ratio mr 28
        4.5.2 The number of extracted frames Nf 30
        4.5.3 The penalty term of the loss function λ 31
        4.5.4 Impact evaluation of the proposed modules in MCCTA 33
        4.5.5 Masking the non-facial spatiotemporal features for reconstruction 35
    5 Conclusions 37
    References 38

    Full-text availability: on campus from 2025-09-01; off campus from 2025-09-01.