| Graduate Student: | 郭庭瑋 KUO, Ting-Wei |
|---|---|
| Thesis Title: | 國台語語音轉錄技術結合大型語言模型於餐飲應用研究 Integrating Taiwanese-Mandarin Speech Transcription and Large Language Models for Food and Beverage Applications |
| Advisor: | 陳牧言 Chen, Mu-Yen |
| Degree: | Master |
| Department: | College of Engineering, Department of Engineering Science |
| Year of Publication: | 2025 |
| Graduating Academic Year: | 113 (ROC calendar) |
| Language: | Chinese |
| Pages: | 78 |
| Chinese Keywords: | 語音辨識, 語音合成, 閩南語, 餐飲應用, 大型語言模型 |
| English Keywords: | Automatic Speech Recognition, Text-to-Speech, Taiwanese Hokkien, Restaurant Applications, Large Language Models |
As Taiwan enters a super-aged society, the share of the population aged 65 and over is rising year by year. Older adults pick up information technology less quickly than younger users and face challenges such as complicated operation and difficulty with typed input, which limits their access to information. Searching for nearby restaurant information is a case in point: most existing applications still rely on text input, which ill suits older adults and users on the move, showing that this class of application urgently needs a more natural, intuitive interaction system.
In daily life, older adults communicate largely in Taiwanese Hokkien. Although Hokkien is one of Taiwan's principal native languages, it has long faced challenges such as scarce corpora and limited speech-technology support, and related systems remain at the low-resource stage of development.
To address these challenges, this thesis implements a speech transcription and semantic understanding system that supports mixed Mandarin-Hokkien input, combining automatic speech recognition (ASR) with a large language model (LLM) to build an intelligent interaction framework that understands the intent behind spoken queries. The system first fine-tunes the open-source multilingual Whisper-small model to improve recognition accuracy for Hokkien and code-switched utterances, then uses the LLM to extract semantics for restaurant information retrieval and recommendation. Finally, text-to-speech (TTS) based on the VITS model generates Mandarin and Hokkien voice replies. The architecture has three layers: a perception layer that receives speech, an inference layer that performs recognition and intent inference, and an action layer that handles retrieval, synthesis, and interactive responses, forming a voice interaction system with local-language adaptability and semantic reasoning capability.
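As a concrete illustration of the fine-tuning step described above, the following is a minimal sketch using the standard Hugging Face Transformers recipe for Whisper. The data directory, the "transcription" column, the checkpoint name, and all hyperparameters are illustrative assumptions, not the thesis's actual corpus or training configuration.

```python
# Minimal fine-tuning sketch for Whisper-small with Hugging Face Transformers.
# Data layout, column names, and hyperparameters are illustrative assumptions.
from dataclasses import dataclass
from typing import Any

from datasets import Audio, load_dataset
from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          WhisperForConditionalGeneration, WhisperProcessor)

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Hypothetical mixed Mandarin-Hokkien corpus laid out as an audio folder whose
# metadata.csv provides a "transcription" column.
ds = load_dataset("audiofolder", data_dir="data/hokkien_mix", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(batch):
    audio = batch["audio"]
    # Log-Mel input features for the Whisper encoder.
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Tokenized reference transcript used as decoder labels.
    batch["labels"] = processor.tokenizer(batch["transcription"]).input_ids
    return batch

ds = ds.map(prepare, remove_columns=ds.column_names)

@dataclass
class SpeechCollator:
    """Pads features and labels separately; -100 masks padding in the loss."""
    processor: Any

    def __call__(self, features):
        batch = self.processor.feature_extractor.pad(
            [{"input_features": f["input_features"]} for f in features],
            return_tensors="pt",
        )
        labels = self.processor.tokenizer.pad(
            [{"input_ids": f["labels"]} for f in features], return_tensors="pt"
        )
        batch["labels"] = labels["input_ids"].masked_fill(
            labels["attention_mask"].ne(1), -100
        )
        return batch

args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-hokkien",  # assumed checkpoint name
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    max_steps=4000,
)
Seq2SeqTrainer(model=model, args=args, train_dataset=ds,
               data_collator=SpeechCollator(processor)).train()
```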
Experimental results show that after fine-tuning, the Whisper-small model's character error rate (CER) on code-switched utterances dropped from 1.41 to 0.43, while text similarity (BertScore F1) rose from 0.584 to 0.783, yielding text of high enough quality for the LLM to perform semantic interpretation and query processing; combined with VITS-synthesized voice replies, this completes the question-answering interaction loop. Through system implementation and experimental evaluation, this thesis verifies the feasibility of voice-query tasks for a low-resource language and demonstrates the potential of integrating ASR and LLM technology in practical application scenarios, advancing the real-world use and development of local languages in intelligent speech technology.
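For reference, CER is conventionally defined over the character-level edit distance:

```latex
\mathrm{CER} = \frac{S + D + I}{N}
```

where S, D, and I are the numbers of substituted, deleted, and inserted characters in the hypothesis relative to the reference, and N is the number of characters in the reference. Because insertions are counted in the numerator but not the denominator, CER can exceed 1, which is why a pre-fine-tuning value of 1.41 is possible.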
As Taiwan enters a super-aged society, the growing older population faces barriers to digital services because of complex interfaces and text-based input, reducing information accessibility. Most restaurant-search applications still rely on text, which is inconvenient for older adults and users on the move. Many older adults primarily speak Taiwanese Hokkien, yet it remains a low-resource language with limited speech-technology support.
This study develops a bilingual speech transcription and semantic understanding system for mixed Mandarin-Hokkien input. It integrates Automatic Speech Recognition (ASR) and Large Language Model (LLM) technologies to interpret spoken queries. A multilingual Whisper-small model is fine-tuned to improve recognition accuracy for Hokkien and code-switched sentences. The recognized text is processed by the LLM to extract semantic parameters for restaurant information retrieval, and responses are generated in Mandarin or Hokkien using VITS-based Text-to-Speech (TTS).
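The abstract does not specify how the LLM extraction is prompted; the following is a hypothetical sketch of one common pattern, asking the model to return the query's semantic parameters as JSON through an OpenAI-style chat API. The model name and the parameter schema are assumptions, not the thesis's implementation.

```python
# Hypothetical intent-extraction sketch: prompt an LLM to return the semantic
# parameters of a restaurant query as JSON. Model name and schema are assumed.
import json

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Extract restaurant-search parameters from the user's utterance. "
    'Reply with JSON only, e.g. {"cuisine": "牛肉麵", "location": "附近", '
    '"price": "low"}; use null for anything not mentioned.'
)

def extract_intent(transcript: str) -> dict:
    """Inference-layer step: transcript in, structured query parameters out."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return json.loads(resp.choices[0].message.content)

# e.g. extract_intent("附近有無好食的牛肉麵?")
#   -> {"cuisine": "牛肉麵", "location": "附近", "price": None}
```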
The system comprises three layers: a perception layer for audio input, an inference layer for recognition and intent understanding, and an action layer for query execution and voice response.
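A toy sketch of how the three layers might compose follows. The ASR checkpoint path, the keyword-based intent stand-in, and the in-memory restaurant list are all illustrative assumptions; the thesis's action layer additionally hands the reply text to VITS for synthesis.

```python
# Toy sketch of the perception -> inference -> action composition.
from transformers import pipeline

# Perception layer: speech in, text out (fine-tuned checkpoint path assumed).
asr = pipeline("automatic-speech-recognition", model="whisper-small-hokkien")

# Toy stand-in for the restaurant database queried by the action layer.
RESTAURANTS = [{"name": "阿嬤牛肉麵", "tags": ["牛肉麵"]}]

def perception(audio_path: str) -> str:
    return asr(audio_path)["text"]

def inference(transcript: str) -> dict:
    # In the described system this is LLM-based extraction; a keyword match
    # stands in here to keep the sketch self-contained.
    return {"cuisine": "牛肉麵" if "牛肉麵" in transcript else None}

def action(intent: dict) -> str:
    # The real action layer would also pass the reply text to VITS for TTS.
    hits = [r for r in RESTAURANTS if intent["cuisine"] in r["tags"]]
    return f"為您找到 {len(hits)} 間餐廳" if hits else "附近沒有符合條件的餐廳"

def handle_query(audio_path: str) -> str:
    return action(inference(perception(audio_path)))
```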
Experiments show that the fine-tuned Whisper-small model reduced the Character Error Rate (CER) on code-switched utterances from 1.41 to 0.43, while BertScore F1 improved from 0.584 to 0.783. These results validate the feasibility of applying speech technology to a low-resource language and demonstrate the potential of combining ASR and LLMs to build localized intelligent voice systems.
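Both reported metrics can be reproduced with common open-source tools; the sketch below uses jiwer for CER and bert-score for BertScore F1 on a toy sentence pair. The choice of libraries and the example sentences are assumptions, since the thesis's exact evaluation tooling is not given here.

```python
# Sketch of computing the reported metrics with common open-source tools.
from jiwer import cer
from bert_score import score

refs = ["我想食牛肉麵"]  # reference transcript (toy example)
hyps = ["我想吃牛肉麵"]  # ASR hypothesis

# Character error rate: (substitutions + deletions + insertions) / ref length.
print("CER:", cer(refs, hyps))

# BertScore F1; lang="zh" selects a default Chinese encoder.
P, R, F1 = score(hyps, refs, lang="zh")
print("BertScore F1:", F1.mean().item())
```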
On-campus access: full text public from 2030-07-22.