| Graduate Student: | 倪仕文 Ni, Shi-Wen |
|---|---|
| Thesis Title: | 基於對抗提示微調的判別式預訓練語言模型用於少樣本文本分類 (DisAPT: Discriminative Pre-trained Language Model with Adversarial Prompt Tuning for Few-shot Text Classification) |
| Advisor: | 高宏宇 Kao, Hung-Yu |
| Degree: | Doctoral (博士) |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science (電機資訊學院 資訊工程學系) |
| Year of Publication: | 2022 |
| Graduation Academic Year: | 111 |
| Language: | English |
| Number of Pages: | 89 |
| Keywords (Chinese): | 判別式預訓練語言模型、提示微調、對抗訓練、少樣本學習、文本分類 |
| Keywords (English): | Discriminative Pre-trained Language Model, Prompt Tuning, Adversarial Training, Few-shot Learning, Text Classification |
| Usage Statistics: | Views: 114; Downloads: 51 |
Recently, the new "pre-train, prompt, predict" paradigm in natural language processing has achieved excellent results on few-shot learning tasks. Fine-tuning a pre-trained language model requires a large amount of downstream task data to retrain a task-specific head, whereas prompt tuning reuses the pre-trained language model's head and therefore needs only a small amount of downstream data to achieve good results. The pre-trained language models most commonly used for prompt tuning are masked language models such as BERT, ALBERT, and RoBERTa, while the discriminative language model ELECTRA has been largely overlooked in prompt-learning research. As an alternative and an exploration, in this thesis we are the first to use a discriminative pre-trained language model (ELECTRA) for prompt learning. We propose DisAPT, a discriminative pre-trained language model with adversarial prompt tuning for few-shot text classification. Specifically, we design prompt templates that transform various downstream classification tasks into discriminative binary classification tasks. In few-shot learning, the model tends to overfit the small amount of training data, so we add adversarial training to the discriminative prompt tuning to regularize the model. We evaluate the proposed method on six public text classification datasets; the experimental results show that it surpasses the current state-of-the-art methods and that discriminative language models have great potential and capability for few-shot learning. In addition, we conduct ablation experiments to analyze the contribution of each part of the model. We also propose a variant of DisAPT, called DisAPTone, which concatenates all label words in the template so that inference requires only a single forward pass. Experimental results show that DisAPTone speeds up inference by a factor of |Y| (the number of task classes) with only a small drop in performance. Finally, we propose CoDisAPT, which replaces the original discrete natural-language prompts with trainable continuous vectors. During training, CoDisAPT freezes the parameters of the original model and updates only the newly introduced trainable parameters, and therefore consumes less GPU memory and computation than DisAPT. Experimental results show that CoDisAPT achieves even higher few-shot performance than DisAPT on some datasets.
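To make the core idea concrete, the following is a minimal sketch (not the thesis' released code) of how a discriminative pre-trained model such as ELECTRA can score a prompt template for classification: each candidate label word is inserted into the template, and the replaced-token-detection head judges how "original" that word looks in context. The template text ("It was ___."), the label words, and the checkpoint name below are illustrative assumptions rather than details taken from the thesis.

```python
# A minimal sketch of discriminative prompt-based classification with ELECTRA.
# The prompt template, label words, and checkpoint are illustrative assumptions.
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

tokenizer = AutoTokenizer.from_pretrained("google/electra-base-discriminator")
model = ElectraForPreTraining.from_pretrained("google/electra-base-discriminator")
model.eval()

def classify(sentence: str, label_words: list[str]) -> str:
    """Pick the label word that the replaced-token-detection head judges
    most likely to be an *original* (i.e. contextually plausible) token."""
    scores = []
    for word in label_words:
        # Fill the prompt template with one candidate label word at a time.
        prompt = f"{sentence} It was {word}."
        enc = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            # logits: (batch, seq_len); larger values mean "replaced" (fake).
            logits = model(**enc).logits[0]
        # Locate the label-word token(s) and average their "replaced" logits.
        word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
        positions = [i for i, t in enumerate(enc["input_ids"][0].tolist())
                     if t in word_ids]
        scores.append(logits[positions].mean().item())
    # A lower "replaced" score means the label word fits the context better.
    return label_words[int(torch.tensor(scores).argmin())]

print(classify("The movie was a complete waste of time.", ["great", "terrible"]))
```

During few-shot tuning, DisAPT additionally applies adversarial perturbations to the input embeddings as a regularizer, and DisAPTone places all label words in a single template so that one forward pass scores every class at once; neither of these steps is shown in the sketch above.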