| Graduate Student: | 劉宥辰 Liu, Yu-Chen |
|---|---|
| Thesis Title: | A Text to Sign Language System (可將文字轉譯為手語影片之深度學習模型) |
| Advisor: | 王宗一 Wang, Tzone-I |
| Degree: | Master |
| Department: | Department of Engineering Science, College of Engineering |
| Year of Publication: | 2022 |
| Graduation Academic Year: | 110 (ROC calendar) |
| Language: | Chinese |
| Number of Pages: | 41 |
| Chinese Keywords: | 手語 (sign language)、機器翻譯 (machine translation)、動作轉移 (motion transfer)、深度學習 (deep learning) |
| English Keywords: | Sign Language, Machine Translation, Motion Transfer, Deep Learning |
Sign language is a language expressed through body movements, but it is not merely a sequence of gestures. Like the languages most people are familiar with, namely spoken languages, sign language has abstraction, rules, and compositionality. Its grammar also differs from that of spoken language; it is not simply text rendered as gestures, but an independent language in its own right. For deaf people, sign language is an important means of communication, yet it is not the primary language of most people, so hearing people often have to fall back on writing or gestures when communicating with the deaf. In addition, both traditional and new media frequently neither hire sign language interpreters nor provide subtitles, conveying information by voice alone, for example in speeches, crosstalk performances, live sports commentary, and talk shows, which makes such activities hard for deaf people to enjoy. This study therefore aims to build a system that can easily convert text into sign language, acting like a translator so that users can communicate with deaf people without having to learn sign language.

The task is divided into two parts. The first part builds a self-made training dataset and a deep translation model that translates Chinese sentences into sign language sentences. The dataset was created by collecting sign language example-sentence teaching videos from a hearing-impaired association, extracting the text from them, and manually cleaning it to exclude potentially erroneous data. The translation model adopts a Seq2seq (sequence-to-sequence) architecture, and to avoid the problems of traditional recurrent neural networks (RNNs), such as exploding and vanishing gradients, both the encoder and the decoder use a Long Short-Term Memory (LSTM) architecture. Because only a small amount of data could be collected, the data were augmented by methods such as replacing words with similar words and adding noise, in order to improve model performance.

The second part builds a motion transfer model that uses the human skeleton as the motion feature to generate sign language videos. Its training dataset was also made from the sign language teaching videos of the hearing-impaired association: the videos were edited to keep only the complete movement of each sign language word, the human pose estimation model OpenPose was used to extract the signer's skeleton from each clip, and the movements were grouped by vocabulary to form the training data. After weighing several considerations, the 60 most similar videos in the self-made dataset were selected as the training set. The model learns the motion characteristics of a specific person, so training on a different person yields a model that generates that different person.

The overall workflow of the complete system is as follows: first, a Chinese sentence is fed into the translation model, which outputs a sign language sentence; second, the database is searched for the set of movements corresponding to each word in the sign language sentence; third, the skeletons are fed into the motion transfer model, which outputs full-figure images performing the same movements as the skeletons; finally, the images are assembled into a complete sign language video. The test data for evaluating the system consist of 100 Chinese sentences and 25 sign language words. The translation model is evaluated with the BLEU score and achieves a BLEU-4 score of 0.09; the videos generated by the motion transfer model were evaluated by 8 hired sign language experts, with an average accuracy above 80% and an average subjective rating above 3.5 on a 1-to-5 scale. It can therefore be inferred that sign language users are able to understand the sign language expressed in the videos generated by the models built in this study.
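To make the translation component more concrete, the following is a minimal PyTorch sketch of the kind of LSTM-based Seq2seq encoder and decoder the abstract describes; the class names, embedding and hidden sizes, and the greedy decoding loop are illustrative assumptions, not details taken from the thesis.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Encodes a Chinese token sequence into final LSTM states."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        embedded = self.embedding(src)            # (batch, src_len, emb_dim)
        _, (hidden, cell) = self.lstm(embedded)   # final states summarize the sentence
        return hidden, cell

class Decoder(nn.Module):
    """Generates sign language gloss tokens one step at a time."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, token, hidden, cell):       # token: (batch, 1)
        embedded = self.embedding(token)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        return self.out(output.squeeze(1)), hidden, cell

def translate(encoder, decoder, src, sos_id, eos_id, max_len=20):
    """Greedy decoding: feed the most probable gloss token back in at each step."""
    hidden, cell = encoder(src)
    token = torch.full((src.size(0), 1), sos_id, dtype=torch.long)
    glosses = []
    for _ in range(max_len):
        logits, hidden, cell = decoder(token, hidden, cell)
        token = logits.argmax(dim=-1, keepdim=True)
        if (token == eos_id).all():               # stop once every sentence has ended
            break
        glosses.append(token)
    return torch.cat(glosses, dim=1) if glosses else token
```

The sketch covers only the model structure and inference loop; the thesis would additionally need training (typically with teacher forcing) on the cleaned sentence pairs and the augmented data described above.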
Sign language is expressed through body movements, but it is not just a combination of body movements. Like the spoken languages most people use, sign language has abstraction, rules, and compositionality. Its grammar differs from that of spoken language; rather than a word-for-word rendering of text in gestures, it is an independent language in its own right. For the hearing-impaired, sign language is an important communication tool, yet it is not the main language of most people, so when talking to the hearing-impaired, ordinary people may have to resort to writing or body language that might be misinterpreted. Few online or traditional media provide sign language interpretation or subtitles for the hearing-impaired; many convey information by voice alone, so the hearing-impaired find it difficult to experience activities such as speeches, crosstalk, live event commentary, and talk shows. This research aims to establish a system that can convert text into sign language video, like a translator, so that users can communicate with hearing-impaired people without learning sign language.

This study consists of two sub-tasks. The first creates a self-made training dataset and builds a deep translation model for translating Chinese sentences into sign language sentences. The training dataset is made by collecting sign language teaching videos from a hearing-impaired association, extracting the text in them, and excluding possibly erroneous data. The translation model uses the Seq2seq (sequence-to-sequence) architecture, and, to avoid problems frequently found in recurrent neural networks (RNNs), such as exploding and vanishing gradients, both the encoder and the decoder use a Long Short-Term Memory (LSTM) architecture. Because only a small dataset could be collected, this study enlarges it with augmentations such as replacing words with similar words and adding noise, aiming to improve the performance of the trained model.

The second sub-task trains a motion transfer model that uses the human skeleton as the motion feature to generate sign language videos. The model is trained on a self-made dataset: sign language teaching videos from the hearing-impaired association are collected and edited to retain only the clips containing the complete movement of each sign language word. These clips are passed through a human pose estimation model, OpenPose, to extract the skeleton movements of the signer for each word; sorted by vocabulary, the clips and skeleton movements together make up the dataset. After weighing several considerations, only the 60 most similar videos in the self-made dataset are selected to form the training set of the motion transfer model. The model learns the motion characteristics of a specific signer, so training on a different signer produces a model that generates that different signer.

The overall system works as follows. A Chinese sentence is fed into the deep translation model, which outputs a corresponding sign language sentence; for every word in that sentence, the database is searched for the matching set of skeleton movements; these skeletons are fed into the motion transfer model to render full-figure images performing the same movements, and the images are synthesized into a complete sign language video. This study uses a test set of 100 Chinese sentences and 25 sets of sign language lexical actions to evaluate the system. The translation model is evaluated with the BLEU score, achieving a BLEU-4 score of 0.09. The 25 videos generated by the motion transfer model were evaluated by 8 professional sign language experts; the average accuracy is over 80% and the average subjective rating of understandability is over 3.5 on a 1-to-5 scale, which confirms that sign language users can understand the sign language expressed in the videos generated by the system established in this study.
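As an illustration of the BLEU-4 evaluation mentioned above, the sketch below scores hypothetical model outputs against reference sign language sentences with NLTK; the example sentences, tokenization, and smoothing choice are assumptions for demonstration and are not the thesis test set.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Hypothetical tokenized sign language glosses: each hypothesis is a model output
# for one test sentence, and each inner reference list holds the gold gloss sequence.
references = [[["我", "去", "學校"]], [["你", "吃", "飯", "了", "嗎"]]]
hypotheses = [["我", "去", "學校"], ["你", "吃", "飯"]]

# BLEU-4 uses uniform weights over 1- to 4-gram precisions; smoothing avoids a zero
# score when a higher-order n-gram never matches, which is common for short sentences.
bleu4 = corpus_bleu(
    references,
    hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {bleu4:.2f}")
```

Corpus-level BLEU aggregates n-gram counts over all test sentences before computing the score, which is the usual way a single BLEU-4 number such as 0.09 is reported for a whole test set.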
1. Sutskever, I., O. Vinyals, and Q.V. Le. Sequence to Sequence Learning with Neural Networks. in Advances in Neural Information Processing Systems. 2014. Curran Associates, Inc.
2. Cho, K., et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014. Doha, Qatar: Association for Computational Linguistics.
3. 中華民國聽障人協會. 認識聽障 [Understanding hearing impairment]. Available from: http://www.cnad.org.tw/ap/news_view.aspx?bid=25&sn=35ef3d27-4e95-4116-a45d-74e76ece9998.
4. Goodfellow, I.J., et al., Generative Adversarial Nets, in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. 2014, MIT Press: Montreal, Canada. p. 2672–2680.
5. Stoll, S., et al., Text2Sign: Towards Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks. International Journal of Computer Vision, 2020. 128(4): p. 891-908.
6. Hochreiter, S. and J. Schmidhuber, Long Short-Term Memory. Neural Computation, 1997. 9(8): p. 1735-1780.
7. Robertson, S. NLP From Scratch: Translation with a Sequence to Sequence Network and Attention — PyTorch Tutorials 1.12.0+cu102 documentation. Available from: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html.
8. Toshev, A. and C. Szegedy. DeepPose: Human Pose Estimation via Deep Neural Networks. in 2014 IEEE Conference on Computer Vision and Pattern Recognition. 2014.
9. Fang, H.S., et al. RMPE: Regional Multi-person Pose Estimation. in 2017 IEEE International Conference on Computer Vision (ICCV). 2017.
10. Xiu, Y., et al. Pose Flow: Efficient Online Pose Tracking. in British Machine Vision Conference 2018 (BMVC 2018), Newcastle, UK, September 3-6, 2018. 2018. BMVA Press.
11. He, K., et al. Mask R-CNN. in 2017 IEEE International Conference on Computer Vision (ICCV). 2017.
12. Cao, Z., et al., OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. 43(1): p. 172-186.
13. Simon, T., et al. Hand Keypoint Detection in Single Images Using Multiview Bootstrapping. in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.
14. Cao, Z., et al. Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields. in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.
15. Wei, S., et al. Convolutional Pose Machines. in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.
16. Simonyan, K. and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. in International Conference on Learning Representations. 2015.
17. Pishchulin, L., et al. DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation. in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.
18. Gatys, L.A., A.S. Ecker, and M. Bethge. Image Style Transfer Using Convolutional Neural Networks. in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.
19. Isola, P., et al. Image-to-Image Translation with Conditional Adversarial Networks. in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.
20. Mirza, M. and S. Osindero. Conditional Generative Adversarial Nets. 2014. arXiv:1411.1784.
21. Zhu, J., et al. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. in 2017 IEEE International Conference on Computer Vision (ICCV). 2017.
22. Chan, C., et al. Everybody Dance Now. in 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019.
23. Wang, T., et al. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018.
24. Siarohin, A., et al. Animating Arbitrary Objects via Deep Motion Transfer. in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019.
25. Kappel, M., et al. High-Fidelity Neural Human Motion Transfer from Monocular Video. in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021.
26. Zabala, U., et al., Modeling and evaluating beat gestures for social robots. Multimedia Tools and Applications, 2022. 81.
27. Lee, Hung-yi (李宏毅). ML Lecture 21-1: Recurrent Neural Network (Part I). Available from: https://hackmd.io/@shaoeChen/B1CoXxvmm/https%3A%2F%2Fhackmd.io%2Fs%2FBJ14sUSzN.
28. He, K., et al. Deep Residual Learning for Image Recognition. in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.
29. Ronneberger, O., P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. in International Conference on Medical image computing and computer-assisted intervention. 2015. Springer.
30. Papineni, K., et al., BLEU: a method for automatic evaluation of machine translation, in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. 2002, Association for Computational Linguistics: Philadelphia, Pennsylvania. p. 311–318.
31. Dhole, K.D., et al. NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation. 2021. arXiv:2112.02721.
32. Chaudhary, A. A Visual Survey of Data Augmentation in NLP. Available from: https://amitness.com/2020/05/data-augmentation-for-nlp/.