
Author: Wu, Jhen-Yi (吳貞頤)
Title: Unsupervised Single Document Abstractive Summarization Using Semantic Units (利用語義單元之非監督式單文本生成式摘要)
Advisor: Kao, Hung-Yu (高宏宇)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of Publication: 2021
Academic Year of Graduation: 109
Language: English
Number of Pages: 62
Keywords (Chinese): 自然語言處理、自動化摘要、非監督式學習
Keywords (English): Natural Language Processing, Text Summarization, Unsupervised Learning
Views: 70; Downloads: 2
Abstract (Chinese):
Recent automatic summarization approaches typically rely on pretrained models or on large amounts of paired source articles and reference summaries as training data. However, collecting such large-scale paired training data or pretraining corpora is time-consuming and labor-intensive in many real-world applications; a pretrained model in the target summary language may also be hard to obtain, and collecting enough pretraining data in that language is even more difficult.
We therefore propose an unsupervised training method built on an auto-encoder. We first study, through corpus statistics, how the frequency of text spans in the source article that share the same meaning affects summarization, and we define such spans as semantic units. The model is trained to reconstruct the source article from inputs in which some semantic units are masked, and during training we gradually increase both the amount of masking and its degree (hard masks or soft masks) so that the model's state approaches the one it faces at inference time. The training procedure therefore requires no paired data and no pretrained model. We then show that a model trained this way automatically learns the frequency of semantic units in the source article and uses this information to filter the source content, judge the importance of text spans, and finally generate a summary.
Our model outperforms unsupervised models with settings similar to ours on the CNN/DailyMail summarization dataset. With far fewer parameters than large pretrained models, it even achieves ROUGE scores very close to those of pretrained summarization models. Our experiments further show that the model maintains comparable performance when the training data come from a different source than the test data, when only one third of the data is available, and when it is trained on another language.
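The masking curriculum described above can be pictured with a short, hypothetical PyTorch sketch. It is not the thesis implementation: the linear schedules, the function names, and the soft-mask formulation (blending token embeddings with a mask embedding) are assumptions made only to illustrate how the masked fraction and the mask hardness could both grow toward the inference-time condition.

    import torch

    def linear_schedule(step: int, total_steps: int, start: float, end: float) -> float:
        """Linearly anneal a value from `start` to `end` over training."""
        progress = min(step / max(total_steps, 1), 1.0)
        return start + (end - start) * progress

    def apply_mask(token_embeds, mask_embed, unit_mask, hardness):
        """Blend token embeddings with a mask embedding inside the selected semantic units.
        hardness=0.0 keeps the original embeddings (a soft mask); hardness=1.0 replaces
        them entirely with the mask embedding (a hard mask)."""
        mixed = (1.0 - hardness) * token_embeds + hardness * mask_embed  # broadcasts over (batch, seq, dim)
        return torch.where(unit_mask.unsqueeze(-1), mixed, token_embeds)

    # Toy usage: 2 sequences, 10 tokens, embedding size 16; tokens 3-5 form one semantic unit.
    step, total_steps = 5_000, 20_000
    mask_ratio = linear_schedule(step, total_steps, start=0.15, end=0.80)  # fraction of units to mask (assumed values)
    hardness = linear_schedule(step, total_steps, start=0.0, end=1.0)      # soft masks -> hard masks

    embeds = torch.randn(2, 10, 16)
    mask_embed = torch.zeros(16)                       # would be a learned vector in practice
    unit_mask = torch.zeros(2, 10, dtype=torch.bool)
    if torch.rand(1).item() < mask_ratio:              # each unit is masked with probability mask_ratio
        unit_mask[:, 3:6] = True
    masked_inputs = apply_mask(embeds, mask_embed, unit_mask, hardness)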

Abstract (English):
Recent neural summarization methods rely on pretrained models or on massive amounts of paired article-summary data. However, collecting a large amount of paired data for training is challenging in real-world applications, and pretrained models are hardly accessible for low-resource languages. In this work, we propose an unsupervised training method for summarization, based on an auto-encoder, that requires neither paired data nor pretrained models. We first observe that the frequency of text spans with similar semantics in the source article helps summarization, and we define such spans as "semantic units." The model first predicts the words corresponding to the masked semantic units in its inputs, and in the later training stage it is required to reconstruct the original article from inputs with many more masked semantic units. Moreover, we show that our model learns semantic-unit frequency information without supervision and uses it to select important sentences and distinguish salient semantic units when generating summaries abstractively. Our model outperforms other unsupervised methods under the same setting on the CNN/DailyMail summarization task. Furthermore, with far fewer model parameters, we achieve ROUGE scores close to those of several pretrained models. We also show that our model is robust under transfer learning, with less training data, and in other languages.
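For context, ROUGE scores such as those reported above are commonly computed with Google's rouge-score package (pip install rouge-score). The snippet below is a minimal example of the metric itself, not the evaluation pipeline used in the thesis.

    from rouge_score import rouge_scorer

    # ROUGE-1/2 count unigram/bigram overlap; ROUGE-L uses the longest common subsequence.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

    reference = "the cat sat on the mat"
    generated = "a cat was sitting on the mat"

    # score() returns a dict mapping metric name to a Score(precision, recall, fmeasure) tuple.
    for name, score in scorer.score(reference, generated).items():
        print(f"{name}: P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")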

Table of Contents:
Abstract (Chinese) i
Abstract (English) ii
Acknowledgements iii
Table of Contents iv
List of Tables vi
List of Figures vii
Chapter 1. Introduction 1
  1.1 Background 1
  1.2 Motivation 3
  1.3 Our Approach 12
Chapter 2. Related Work 15
  2.1 Sentence Compression 15
  2.2 Text Summarization 17
  2.3 Zero-shot Pretraining 20
Chapter 3. Method 22
  3.1 Transformer Encoder-decoder 23
  3.2 Semantic Unit Construction Layer 26
  3.3 Masked Semantic Units Prediction 27
  3.4 Reconstruction from Semantic Units 32
  3.5 Inference Stage 33
Chapter 4. Experiments 36
  4.1 Dataset 36
  4.2 Evaluation Metric 37
  4.3 Experimental Settings 39
  4.4 Baseline 40
  4.5 Result 41
Chapter 5. Analysis 44
  5.1 Coverage of High-frequency Semantic Units 44
  5.2 Decoding Times 45
  5.3 Transfer Learning 48
  5.4 Semantic Unit Construction 49
  5.5 Context Window Size 49
  5.6 Transition of Training Stages 50
  5.7 Semantic Unit Selection 50
  5.8 Number of Training Data 52
  5.9 Dataset in Different Languages 53
  5.10 Human Evaluation on Grammaticality 54
Chapter 6. Conclusion 56
References 58


Full-text availability: On campus: 2022-11-01; Off campus: 2022-11-01