| Graduate student: | 柯建至 Ko, Chien-Chih |
|---|---|
| Thesis title: | 台灣學齡前兒童雙字詞口語語料分析 Analysis of Bi-syllables Words from Speech Corpus for Taiwanese Pre-school Children |
| Advisor: | 鍾高基 Chung, Kao-Chi |
| Degree: | 碩士 Master |
| Department: | 工學院 - 醫學工程研究所 College of Engineering - Institute of Biomedical Engineering |
| Year of publication: | 2006 |
| Academic year of graduation: | 94 |
| Language: | Chinese |
| Number of pages: | 52 |
| Chinese keywords: | 口語語料、雙字詞、學齡前兒童 |
| Foreign-language keywords: | Speech Corpus, Bi-syllables Words, Pre-school Children |
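Language learning and remediation for people with hearing and speech impairments depend on early intervention. Assessing and remediating the language abilities of preschool children relies on reliable and valid language scales and databases. Western countries have developed systematic lexical databases that support clinical assessment and educational training scales, but because the languages differ, these cannot be applied to local languages. In Taiwan, little work has been done on child language samples, and reliable and valid language scales are lacking, which seriously hampers language education, clinical speech-language therapy, and the development of speech technology.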
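The purpose of this study is to develop and establish a spoken-vocabulary database for Taiwanese preschool children aged two to six. The specific aims are to (1) analyze the conventional word frequency of bi-syllable (two-character) words in a transcribed spoken corpus of preschool children; (2) normalize bi-syllable word frequency using Bayesian conditional probability; (3) apply Bayesian conditional probability normalization to bi-syllable word frequency under situational category classification; and (4) analyze the differences among groups in conventional word frequency and under situational category classification. The methods are: (1) collect and record spontaneous speech from 80 preschool children aged 2 to 6, divided by age into four groups; (2) transcribe the spoken corpus into text files; (3) segment the text into words and classify it; (4) analyze conventional word frequency; and (5) normalize bi-syllable word frequency, overall and under situational category classification, using Bayesian conditional probability. The database is then developed further from the output of these steps.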
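The collected speech data comprise 45,197 utterances and 183,781 characters. The corpus was processed both manually and by computer analysis programs: automatic transcription into text samples was about 75.4% accurate, and word segmentation was about 94.37% accurate. The most frequent one-character word, 我 ("I"), occurred 6,133 times; the most frequent two-character word, 媽媽 ("mother"), 1,784 times; the most frequent three-character word, 為什麼 ("why"), 429 times; and the most frequent four-character word, 生日快樂 ("happy birthday"), 16 times. The outcome of this study is a spoken-vocabulary database for Taiwanese preschool children, which includes: (1) transcribed and annotated text; (2) classification and word segmentation of the transcribed text; (3) analysis of the bi-syllable word frequency distribution; (4) normalization of bi-syllable word frequency using Bayesian conditional probability; and (5) analysis of the bi-syllable word frequency distribution normalized by situational category.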
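The per-length counts reported above can be illustrated with a minimal Python sketch. It assumes the utterances have already been segmented into words; the function name `most_frequent_by_length` and the toy utterances are hypothetical and are not data or code from the thesis.

```python
from collections import Counter


def most_frequent_by_length(segmented_utterances):
    """Return {word_length: (word, count)} for the most frequent word of each length.

    `segmented_utterances` is a list of utterances, each a list of words
    (the assumed output of a word-segmentation step).
    """
    counts = Counter(word for utt in segmented_utterances for word in utt)
    best = {}
    for word, n in counts.items():
        length = len(word)  # number of characters in the word
        # Keep the first word seen with the highest count for this length.
        if length not in best or n > best[length][1]:
            best[length] = (word, n)
    return best


# Toy, made-up utterances for illustration only.
toy = [
    ["我", "要", "媽媽"],
    ["媽媽", "為什麼"],
    ["我", "要", "我", "媽媽"],
]
result = most_frequent_by_length(toy)
for length in sorted(result):
    word, n = result[length]
    print(f"{length}-character word: {word} ({n} times)")
```

Restricting the tally to entries with `len(word) == 2` would give the bi-syllable frequency list that the rest of the analysis builds on.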
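The word-frequency analysis shows that the number of bi-syllable words increases substantially as children's language develops. Normalization reveals the weight of a given word within the entire vocabulary data and therefore provides effective parameters for quantitative analysis of the corpus. Situational category classification makes it possible to analyze the distribution of children's vocabulary under different conditions, giving the corpus more complete data for research and assessment. The corpus, programs, and rules developed in this study can support the construction of preschool vocabulary databases for quantitative analysis and training in related fields. They provide the most basic database for the systematic and scientific development of assessment tools for language teaching in special education, educational training materials, computational linguistics research, clinical therapy for hearing and speech disorders, and related research, development, and training.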
Language learning and speech therapy are important issues in early intervention for language disorders. Assessment of language development in preschool children relies on verbal language scales. Developed countries have established lexicon databases that serve as language norms and testing standards. However, these cannot be transferred to or applied in our native languages, such as Mandarin and Taiwanese. Only a few studies have examined domestic language samples, so a reliable and valid lexicon database is lacking. This lack is detrimental to clinical evaluation, education, clinical training, and speech technology.
The purpose of this research is to develop and establish a Mandarin lexicon database for Taiwanese preschool children. More specifically, this research aims to (1) transcribe recorded spoken words into text; (2) analyze conventional bi-syllable word frequency; (3) analytically normalize the bi-syllable word frequency; and (4) compare conventional and normalized word frequency. Research methods: (1) a total of eighty preschool children aged 2 to 6 years were recruited and divided into four groups; (2) spontaneous speech was recorded and collected from these subjects, and the recorded corpus was transcribed to text; (3) the lexicon corpus was segmented and categorized; (4) conventional bi-syllable word frequency was analyzed; (5) bi-syllable word frequency was normalized by the Bayesian Normalization Index; and (6) word frequency was analyzed under semantic-based intension categorization using Bayesian conditional probability.
The recorded speech corpus contains 45,197 utterances and 183,781 word-tokens. Evaluated against manual transcription, automatic speech recognition was approximately 75.4% accurate and the automatic segmentation programs were approximately 94.37% accurate. The most frequent single character is 'I', appearing 6,133 times; the most frequent bi-syllable word is 'mother', appearing 1,784 times; and the most frequent tri-syllable word is 'why', appearing 429 times.
Conventional bi-syllable word frequency increases with age as children's language develops. Normalization reveals the weight of each word within the vocabulary data and provides a quantitative parameter for analyzing the speech corpus. Semantic-based intension categorization allows children's vocabulary to be analyzed under different conditions and yields more useful data for research and assessment with lexicon databases. The database can also serve students of child language disorders, aphasia, second language learning, computational linguistics, literacy development, narrative structures, and adult sociolinguistics. The contributions may be significant for language education, special education, and clinical speech therapy, as well as for commercial applications in domestic computer-aided instruction.
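The abstract does not give the formula behind the Bayesian normalization, so the following Python sketch shows only one plausible reading: applying Bayes' rule to word and category counts to obtain P(category | word) = P(word | category) P(category) / P(word). The function name `bayes_normalize`, the category labels `home` and `play`, and the toy data are all assumptions made for illustration, not the thesis's actual index or categories.

```python
from collections import Counter


def bayes_normalize(pairs):
    """Return {(word, category): P(category | word)} from (word, category) tokens.

    Assumed form (not confirmed by the abstract):
        P(c | w) = P(w | c) * P(c) / P(w)
    with all probabilities estimated from raw counts.
    """
    pairs = list(pairs)
    total = len(pairs)
    joint = Counter(pairs)                    # n(w, c)
    word_n = Counter(w for w, _ in pairs)     # n(w)
    cat_n = Counter(c for _, c in pairs)      # n(c)

    posterior = {}
    for (w, c), n in joint.items():
        p_w_given_c = n / cat_n[c]
        p_c = cat_n[c] / total
        p_w = word_n[w] / total
        posterior[(w, c)] = p_w_given_c * p_c / p_w
    return posterior


# Toy, made-up tokens: each bi-syllable word is tagged with a hypothetical category.
toy_pairs = [
    ("媽媽", "home"),
    ("媽媽", "home"),
    ("媽媽", "play"),
    ("玩具", "play"),
]
post = bayes_normalize(toy_pairs)
for (word, category), p in post.items():
    print(f"P({category} | {word}) = {p:.3f}")
```

Under this reading, the values for a given word sum to 1 across categories, so words with very different raw counts can be compared on the same scale.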