
Graduate Student: KUO, Ting-Wei (郭庭瑋)
Thesis Title: Integrating Taiwanese-Mandarin Speech Transcription and Large Language Models for Food and Beverage Applications (國台語語音轉錄技術結合大型語言模型於餐飲應用研究)
Advisor: CHEN, Mu-Yen (陳牧言)
Degree: Master's
Department: Department of Engineering Science, College of Engineering
Publication Year: 2025
Graduation Academic Year: 113 (ROC calendar)
Language: Chinese
Pages: 78
Chinese Keywords: Speech Recognition, Speech Synthesis, Taiwanese Hokkien, Food and Beverage Applications, Large Language Models
English Keywords: Automatic Speech Recognition, Text-to-Speech, Taiwanese Hokkien, Restaurant Applications, Large Language Models
Chinese Abstract:
    As Taiwan enters a super-aged society, the share of the population aged 65 and over rises year by year. Older adults pick up information technology more slowly than younger users and face challenges such as complex operation and difficulty entering text, which limits their access to information. Searching for nearby restaurant information is a case in point: most existing applications still rely on text input, which is inconvenient for older adults and for users on the move, showing that this area urgently needs a more natural, intuitive interaction system.
    Older adults mostly communicate in Taiwanese Hokkien in daily life. Yet Hokkien, one of Taiwan's major local languages, has long faced scarce corpora and limited speech-technology support, and related systems remain at the low-resource-language stage of development.
    To address these challenges, this thesis implements a speech transcription and semantic understanding system that supports mixed Mandarin-Taiwanese input, combining Automatic Speech Recognition (ASR) with a Large Language Model (LLM) to build an intelligent interaction framework that can understand the intent behind spoken queries. The system first fine-tunes the open-source multilingual Whisper-small model to raise recognition accuracy for Hokkien and code-switched utterances, then uses the LLM to extract semantics for restaurant information retrieval and recommendation. Finally, Text-to-Speech (TTS) based on the VITS model generates spoken replies in Mandarin and Taiwanese. The architecture has three layers: a perception layer that receives speech, an inference layer that performs recognition and intent inference, and an action layer that handles retrieval, synthesis, and interactive responses, yielding a voice interaction system with local-language adaptability and semantic reasoning capability.
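    As a concrete illustration of the perception step, the following minimal Python sketch transcribes one utterance with a fine-tuned Whisper checkpoint through the Hugging Face transformers pipeline API. The checkpoint path and audio file name are placeholders; the thesis does not publish its code.

        from transformers import pipeline

        # Load a (hypothetical) locally fine-tuned Whisper-small checkpoint.
        asr = pipeline(
            "automatic-speech-recognition",
            model="./whisper-small-taigi-finetuned",  # placeholder path
        )

        # Transcribe one mixed Mandarin-Hokkien utterance from a WAV file.
        text = asr("query.wav")["text"]
        print(text)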
    Experimental results show that after fine-tuning, the Whisper-small model's Character Error Rate (CER) on code-switched utterances fell from 1.41 to 0.43, while text similarity (BERTScore F1) rose from 0.584 to 0.783, supplying high-quality text for the LLM's semantic interpretation and query processing; combined with VITS-synthesized spoken replies, this completes the question-and-answer interaction flow. Through system implementation and experimental evaluation, this thesis verifies the feasibility of spoken-query tasks for a low-resource language, demonstrates the integration potential of ASR and LLM technologies in practical application scenarios, and advances the concrete use and development of local languages in intelligent speech technology.
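    Metrics of this kind can in principle be reproduced with the jiwer and bert-score Python packages; the snippet below is a sketch under the assumption that references and hypotheses are paired lists of strings. The thesis does not name its evaluation tooling, and the sample sentences are invented.

        import jiwer
        from bert_score import score

        references = ["我想欲食滷肉飯"]  # invented reference transcript
        hypotheses = ["我想要吃滷肉飯"]  # invented ASR hypothesis

        # Character Error Rate: (substitutions + deletions + insertions)
        # divided by the number of reference characters.
        cer = jiwer.cer(references, hypotheses)

        # BERTScore precision/recall/F1 against a Chinese BERT model.
        _, _, f1 = score(hypotheses, references, lang="zh")

        print(f"CER = {cer:.2f}, BERTScore F1 = {f1.mean().item():.3f}")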

English Abstract:
    As Taiwan enters a super-aged society, the growing elderly population faces barriers to using digital services due to complex interfaces and text input, reducing information accessibility. Most restaurant search applications still rely on text, which is inconvenient for older adults or users on the move. Many elderly people primarily speak Taiwanese Hokkien, yet it remains a low-resource language with limited speech technology support.
    This study develops a bilingual speech transcription and semantic understanding system for mixed Mandarin-Hokkien input. It integrates Automatic Speech Recognition (ASR) and Large Language Model (LLM) technologies to interpret spoken queries. A multilingual Whisper-small model was fine-tuned to improve recognition accuracy for Hokkien and code-switched sentences. The recognized text is processed by the LLM to extract semantic parameters for restaurant information retrieval, and responses are generated using VITS-based Text-to-Speech (TTS) in Mandarin or Hokkien.
    The system comprises three layers: perception for audio input, inference for recognition and intent understanding, and action for query execution and voice response.
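    A minimal Python skeleton of this three-layer decomposition is sketched below, with each layer as an injected callable; all names are illustrative stand-ins for the fine-tuned Whisper model, the LLM, the restaurant index, and the VITS synthesizer, not the thesis's actual code.

        from dataclasses import dataclass
        from typing import Callable

        @dataclass
        class VoiceAssistant:
            """Perception -> inference -> action, as described above."""
            asr: Callable[[str], str]              # perception: audio path -> transcript
            extract_intent: Callable[[str], dict]  # inference: transcript -> intent slots
            search: Callable[[dict], list]         # action: intent -> matching restaurants
            tts: Callable[[str], bytes]            # action: reply text -> synthesized audio

            def handle(self, audio_path: str) -> bytes:
                transcript = self.asr(audio_path)         # perception layer
                intent = self.extract_intent(transcript)  # inference layer
                results = self.search(intent)             # action layer: retrieval
                reply = f"Found {len(results)} matching restaurants"  # illustrative reply
                return self.tts(reply)                    # action layer: VITS TTS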
    Experiments show that the fine-tuned Whisper-small model reduced Character Error Rate (CER) from 1.41 to 0.43, while BertScore F1 improved from 0.584 to 0.783. The results validate the feasibility of applying speech technology to low-resource languages and demonstrate the potential of combining ASR and LLMs to enhance localized intelligent voice systems.
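    For reference, CER is derived from the character-level Levenshtein alignment between hypothesis and reference; because insertions are counted in the numerator, CER can exceed 1.0, which is how a baseline of 1.41 is possible. In LaTeX notation:

        % S = substitutions, D = deletions, I = insertions in the alignment;
        % N = number of characters in the reference transcript.
        \mathrm{CER} = \frac{S + D + I}{N}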

    摘要 I 致謝 V 目錄 VI 表目錄 VIII 圖目錄 IX 第一章 緒論 1 1.1 研究背景與動機 1 1.2 AI應用趨勢與研究定位 3 1.3 研究目的 4 第二章 文獻探討 6 2.1 閩南語(臺語) 6 2.2 WHISPER 9 2.3 VITS 11 2.4 大型語言模型與語義理解應用 14 第三章 研究方法以及實驗設計 17 3.1 系統規劃 17 3.1.1 需求分析 17 3.1.2 系統架構 18 3.2 系統模組說明 20 3.2.1 感知層 20 3.2.2 推理層 21 3.2.3 行動層 22 3.3 系統運作流程 23 3.3.1 感知層 23 3.3.2 推理層 24 3.3.3 行動層 25 3.4 實驗設計 26 3.4.1 語音辨識(ASR) 26 3.4.2 台羅漢字轉拼音模型設計(C2T) 28 3.4.3 語音合成(TTS) 29 第四章 系統實作與實驗結果 30 4.1 開發環境 30 4.2 實作 31 4.2.1 感知層 31 4.2.2 推理層 33 4.2.3 行動層 35 4.2.4 實作結果 36 4.3 實驗數據 40 4.3.1 語音辨識(ASR) 40 4.3.2 台羅漢字轉拼音模型設計(C2T) 45 4.4.3 語音合成(TTS) 47 第五章 結論與未來展望 54 5.1 結論 54 5.2 研究限制 56 5.3 未來展望 57 參考文獻 58 附錄一 建置環境 61 聊天機器人開發環境 61 附錄二 聲韻表 65 台羅聲母對 IPA 對照表[6] 65 台羅韻母對 IPA 對照表[6] 65 台羅聲調對 IPA 對照表[33] [6] 66 附錄三 實驗量表 67 合成語音問卷題目 67

    References
    [1] Department of Statistics, Ministry of the Interior, "2020 census statistics on languages used by resident nationals aged 6 and over," [Online]. Available: https://segis.moi.gov.tw/STATCloud/QueryInterfaceView?COL=bDhjBLjaYkG6VIk7ahD1kA%253d%253d
    [2] OpenAI, "Introducing Whisper," 2022. [Online]. Available: https://openai.com/index/whisper
    [3] J. Kim, J. Kong, and J. Son, "Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech," arXiv preprint arXiv:2106.06103, 2021.
    [4] J. Kim, "VITS: Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech," GitHub repository. [Online]. Available: https://github.com/jaywalnut310/vits
    [5] National Academy for Educational Research, "Taiwanese Taigi corpus application and retrieval system," [Online]. Available: https://tggl.naer.edu.tw/corpora_applications/new
    [6] Ministry of Education, "User manual for the Taiwanese Hokkien romanization system (Tai-lo)," [Online]. Available: https://language.moe.gov.tw/001/Upload/FileUpload/3677-15601/Documents/tshiutsheh.pdf
    [7] A. Vaswani et al., "Attention is all you need," in Advances in Neural Information Processing Systems, vol. 30, 2017.
    [8] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
    [9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. NAACL-HLT, 2019. [Online]. Available: https://arxiv.org/abs/1810.04805
    [10] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," OpenAI, 2018. [Online]. Available: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
    [11] H. Touvron et al., "LLaMA: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023. [Online]. Available: https://arxiv.org/abs/2302.13971
    [12] Google DeepMind, "Gemma: Open models based on Gemini research and technology," 2024. [Online]. Available: https://ai.google.dev/gemma
    [13] TWNIC, "2024 Taiwan Internet Report," [Online]. Available: https://report.twnic.tw/2024/assets/download/TWNIC_TaiwanInternetReport_2024_CH_all.pdf
    [14] R. Ziman and G. Walsh, "Factors affecting seniors' perceptions of voice-enabled user interfaces," in Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems, 2018.
    [15] TWNIC, "2023 Taiwan Internet Report," [Online]. Available: https://report.twnic.tw/2023/assets/download/TWNIC_TaiwanInternetReport_2023_CH_all.pdf
    [16] OpenAI, "Introducing ChatGPT plugins," 2023. [Online]. Available: https://openai.com/blog/chatgpt-plugins
    [17] Google DeepMind, "Gemini: Google's multimodal AI agent," 2024. [Online]. Available: https://deepmind.google/technologies/gemini/
    [18] xAI, "Grok: An AI companion built for X," 2024. [Online]. Available: https://x.ai/blog/grok-release
    [19] Google, "Agents: An open source framework for building autonomous agents with LLMs," Kaggle, 2024. [Online]. Available: https://www.kaggle.com/whitepaper-agents
    [20] AARP, "2020 tech trends of the 50+," 2020. [Online]. Available: https://www.aarp.org/
    [21] The World of Chinese, "Learn Minnanhua: One of the most widely spoken Chinese dialects," 2022. [Online]. Available: https://www.theworldofchinese.com
    [22] Ministry of Education, Introduction to the Dictionary of Frequently-Used Taiwan Minnan (《臺灣閩南語常用詞辭典》), 2011.
    [23] J. DeFrancis, The Chinese Language: Fact and Fantasy. Honolulu: University of Hawaii Press, 1984.
    [24] Y.-C. Lin (林怡君), "A study of loanwords in Taiwanese Hokkien under language contact," 國文學報, no. 20, pp. 111–142, 2009.
    [25] M.-Y. Yeh (葉美瑤), "The influence of Pingpu languages on Taiwanese Hokkien from a phonological perspective," 台灣語文研究, vol. 10, no. 2, pp. 55–88, 2015.
    [26] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 4th ed. Pearson, 2020.
    [27] M. Wooldridge, An Introduction to MultiAgent Systems, 2nd ed. Chichester, U.K.: Wiley, 2009.
    [28] S. Yao et al., "ReAct: Synergizing reasoning and acting in language models," arXiv preprint arXiv:2210.03629, Oct. 2022. [Online]. Available: https://arxiv.org/abs/2210.03629
    [29] S.-M. Huang, H.-C. Tseng, and C.-H. Wu, "Design and evaluation of a web-based synthesis interface for Taiwanese Hokkien," International Journal of Computational Linguistics & Chinese Language Processing, vol. 27, no. 2, pp. 107–122, 2022. [Online]. Available: https://aclanthology.org/2022.ijclclp-2.6
    [30] Taiwanese Corpus Project, "icorpus_ka1_han3-ji7: Tai-lo character-to-transcription model and data-processing scripts," GitHub, 2023. [Online]. Available: https://github.com/Taiwanese-Corpus/icorpus_ka1_han3-ji7
    [31] M. Ott et al., "fairseq: A fast, extensible toolkit for sequence modeling," in Proc. NAACL-HLT 2019: Demonstrations, 2019, pp. 48–53. [Online]. Available: https://aclanthology.org/N19-4010
    [32] Plachtaa, "VITS-fast-fine-tuning: Easy-to-use and high-quality TTS fine-tuning framework based on VITS," GitHub, 2023. [Online]. Available: https://github.com/Plachtaa/VITS-fast-fine-tuning
    [33] C.-Y. Lai (賴志洋), "Phonetic symbols, Zhuyin, and Pinyin (音標·注音·拼音)," Blogspot, Sep. 2018. [Online]. Available: https://ipa-vot-simple-phonetics.blogspot.com/
    [34] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," OpenAI, 2022. [Online]. Available: https://cdn.openai.com/papers/whisper.pdf
    [35] Y. Yang, H. Chen, and M. Li, "Unified speech processing with mel-scale representations: Benefits and applications," arXiv preprint arXiv:2406.05298, 2024. [Online]. Available: https://arxiv.org/abs/2406.05298
    [36] Milvus, "What is mean opinion score (MOS) in voice quality testing?," Apr. 11, 2023. [Online]. Available: https://milvus.io/blog/what-is-mean-opinion-score-in-voice-quality-testing.md
    [37] S. Valentini-Botinhao, A. Ragano, J. Lorenzo-Trueba, and R. Barra-Chicote, "Refining the evaluation of speech synthesis," arXiv preprint arXiv:2403.07147, 2024. [Online]. Available: https://arxiv.org/abs/2403.07147
    [38] J. Li, Y. Wang, and X. Zhang, "Pairwise evaluation of accent similarity in speech synthesis," in Proc. Interspeech, 2025. [Online]. Available: https://arxiv.org/abs/2505.14410
    [39] X. Wang, T. Kaneko, and H. Kameoka, "Analysis of spectral distortion measures for voice conversion evaluation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1332–1344, 2023. doi: 10.1109/TASLP.2023.3249988
    [40] Mozilla Foundation, "Common Voice corpus (versions 11–17)," Hugging Face, 2024. [Online]. Available: https://huggingface.co/mozilla-foundation

    Full text not available for download.
    On-campus access: public from 2030-07-22
    Off-campus access: public from 2030-07-22
    The electronic thesis has not yet been authorized for public release; for the print copy, please consult the library catalog.