Graduate student: Yung, Yiu-Cheong (翁耀昌)
Thesis title: HKELECTRA: Exploring the Effectiveness of Pre-training Language Models with Incorporation of Diglossia for Hong Kong Content
Thesis title (Chinese): HKELECTRA: 探索使用雙層語言現象的預訓練語言模型對香港內容的有效性
Advisor: Kao, Hung-Yu (高宏宇)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2023
Academic year of graduation: 111 (ROC calendar)
Language: English
Number of pages: 51
Chinese keywords: 語言建模, 雙層語言, 香港
English keywords: language modeling, diglossia, Hong Kong

    Although many studies on Chinese natural language processing (NLP) are published each year, research on the languages commonly used in Hong Kong (including Hong Kong Chinese and Hong Kong Cantonese) has long remained insufficient. This gap stems primarily from two problems. The first is the lack of publicly available pre-training datasets based on Hong Kong content. The second is the diglossia phenomenon in Hong Kong, which poses additional challenges for language modeling.

    To address the first problem, we surveyed media outlets and online platforms in Hong Kong. We selected sources according to technical criteria, such as source format and the time range of the content, and ethical requirements, such as the degree of affiliation between a media outlet or platform and the Chinese Communist Party. We ultimately selected eight text sources to build the first publicly available pre-training dataset based on Hong Kong content.

    To address the second problem, we adopted a preprocessing approach that differs from the one used for existing multilingual datasets. Existing preprocessing toolchains, such as CCNet, split content by language before processing it. We argue that this language segmentation destroys the diglossia present in Hong Kong content and thereby degrades language model performance. When building our dataset, we therefore did not perform language segmentation, preserving the diglossia of the original data. We then used this dataset to train several Hong Kong language models based on the ELECTRA architecture.
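
    The contrast between the two preprocessing strategies can be illustrated with a minimal Python sketch. It is not the thesis's pipeline or CCNet's implementation: the marker-character set, the `toy_language_id` heuristic, the `yue`/`zh` labels, and the sample document are all hypothetical stand-ins chosen for illustration, and a real pipeline would rely on a trained language identifier.

```python
from collections import defaultdict

# Hypothetical marker set: characters common in written Cantonese but rare in
# standard written Chinese. This is an illustrative assumption, not a real
# language identifier.
CANTONESE_MARKERS = set("嘅咗唔喺佢冇啲嘢咁哋")


def toy_language_id(line: str) -> str:
    """Label a line 'yue' if it contains any marker character, else 'zh'."""
    return "yue" if any(ch in CANTONESE_MARKERS for ch in line) else "zh"


def split_by_language(documents):
    """Segmentation-style preprocessing: route each line to a per-language bucket."""
    buckets = defaultdict(list)
    for doc in documents:
        for line in doc:
            buckets[toy_language_id(line)].append(line)
    return dict(buckets)


def keep_documents_intact(documents):
    """Diglossia-preserving preprocessing: each document stays one training example."""
    return ["\n".join(doc) for doc in documents]


# One made-up document mixing a standard written Chinese line and a Cantonese line.
documents = [
    ["政府今日公布新措施。", "網民話呢個安排唔合理。"],
]

split = split_by_language(documents)
intact = keep_documents_intact(documents)
```

    In this toy example, `split_by_language` places the two lines in separate `zh` and `yue` buckets, so neither bucket retains the context in which both varieties appear together, whereas `keep_documents_intact` returns a single example containing both lines.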

    Experimental results show that our models achieve state-of-the-art performance on downstream tasks based on Hong Kong content. We also tried several methods of removing diglossia from the dataset to measure its effect on language model performance. The results indicate that preserving diglossia, by not separating Hong Kong Cantonese and Hong Kong Chinese content, has an important effect on model performance on Hong Kong content. We intend to release the models and the pre-training dataset to encourage future research on Hong Kong language modeling.

    Abstract (Chinese)
    Abstract
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1. Introduction
        1.1 Diglossia phenomenon
        1.2 Advancement of Language Model
        1.3 Threats of Hong Kong National Security Law on data availability and collection
        1.4 Our Contribution
    Chapter 2. Related Work
        2.1 LSTM: Language models evolution in the pre-transformer era
        2.2 Transformer: Revolutionary architecture
        2.3 ELECTRA: discriminative language modeling
        2.4 Language modeling beyond English and Hong Kong language modeling landscape
        2.5 Ethical aspects of NLP
    Chapter 3. Datasets
        3.1 Pre-training data
        3.2 Downstream data
            3.2.1 wordshk-sem
            3.2.2 openrice-senti
            3.2.3 lihkg-cat-v2
            3.2.4 DRCD
            3.2.5 Legalref-cat
            3.2.6 ZhYue-cat
    Chapter 4. Pre-train and Evaluation
        4.1 Pre-train method
        4.2 Models
        4.3 Evaluation
    Chapter 5. Results and Discussion
        5.1 Model performance
        5.2 Effect of removing diglossia
        5.3 Quantitative analysis of diglossia in pre-train datasets
        5.4 Effect of diglossia language ratio in pre-train data
        5.5 Effect of Cantonese / Chinese separation under different diglossia language ratios
        5.6 Effect of adding language label to classified pre-train data
        5.7 Effect of classifier error rate on test results
        5.8 Effect of diglossia language ratio in downstream task
    Chapter 6. Limitations
        6.1 Language usage variety of pre-train data
        6.2 Possibility and risk of adding non-Hong Kong Chinese data
        6.3 Quality and availability of social platform data
        6.4 Computation cost for language model pre-training
        6.5 Lack of good reference model for performance comparison and explanation
    Chapter 7. Ethics Statement
        7.1 Ethical Criteria for source selection
        7.2 Crawl before disappear: NLP corpus as anti-censorship method in post NSL era
        7.3 Privacy measures for data collection
        7.4 Possibility of NLP for censorship circumvention
    Chapter 8. Conclusion
    References


    Full-text availability (on campus): available immediately
    Full-text availability (off campus): available immediately