Graduate student: Lai, Peng-Yu (賴鵬宇)
Thesis title: Establishing Sentiment Dictionary for Predicting Taiwan Stock Trends Based on Financial News (以財經新聞建立情緒辭典並用以預測台灣股票趨勢)
Advisor: Wang, Tzone-I (王宗一)
Degree: Master
Department: College of Engineering, Department of Engineering Science
Year of publication: 2018
Academic year of graduation: 106 (ROC calendar)
Language: Chinese
Number of pages: 50
Keywords: stock trends prediction, text mining, sentiment dictionary, latent semantic analysis (LSA), long short-term memory (LSTM)
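    Investing in the stock market requires reliable sources of information. Financial news is available from television, newspapers, and magazines, but now that Internet access is available at any time, online news has become the main channel. Compared with traditional media, the Internet produces news quickly, in real time, broadly, and in large volume. Too much information, however, can confuse readers, take a long time to read, and even mislead them. Faced with a large amount of unstructured information, readers find it hard to judge quickly and accurately, so an effective method of processing this news is needed before it can support investment decisions. Our experiments show that using Latent Semantic Analysis (LSA) to find concept similarity, combined with a news sentiment dictionary built with TF-IDF, is well suited to analyzing stock trends.
    This study examines the 17 stocks most discussed in the news among the constituents of the Taiwan 50 index, using financial news and trading data collected from January 1, 2016 to July 13, 2018. First, TF-IDF is used to extract the keywords of each news article, and each keyword is combined with the word before it and the word after it to form a feature phrase. Next, for each feature phrase, the closing-price changes of all trading days on which the phrase appears are summed to give its sentiment score; the feature phrases and their scores are collected into a news sentiment dictionary. Latent Semantic Analysis is then used to find concept similarity as the feature of each news article: singular value decomposition converts each article into a concept-similarity vector. A Long Short-Term Memory (LSTM) model is trained on the sentiment dictionary built in this study together with the LSA features, and its predictions are used for simulated trading. Across two experiments over different time periods, the average return on investment, total profit, and total investment cost were all better than those obtained with three existing sentiment dictionaries (NTUFSD, ANTUSD, UTFSD).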
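    The dictionary-building step described above can be sketched as follows. This is a minimal illustration, not the thesis code: the function names, the tuple representation of a phrase, and the toy dates and price changes are all assumptions made here, and the keyword set is taken as given (in the study it would come from TF-IDF). Each phrase is scored once per trading day on which it appears.

```python
from collections import defaultdict


def keyword_phrases(tokens, keywords):
    """Return (previous word, keyword, next word) tuples for one article."""
    phrases = []
    for i, tok in enumerate(tokens):
        if tok in keywords:
            # Slicing clamps at the article boundaries, so a keyword at the
            # start or end gives a shorter phrase (an assumption of this sketch).
            phrases.append(tuple(tokens[max(i - 1, 0):i + 2]))
    return phrases


def build_sentiment_dictionary(articles, keywords, pct_change_by_day):
    """Score each phrase by summing the price changes of its trading days.

    articles: iterable of (day, tokens) pairs, one per news article.
    pct_change_by_day: maps a trading day to that day's closing-price change.
    """
    days_by_phrase = defaultdict(set)
    for day, tokens in articles:
        for phrase in keyword_phrases(tokens, keywords):
            days_by_phrase[phrase].add(day)
    # Days without price data (non-trading days) are skipped (assumption).
    return {
        phrase: sum(pct_change_by_day[d] for d in days if d in pct_change_by_day)
        for phrase, days in days_by_phrase.items()
    }


# Toy data, invented for illustration only.
articles = [
    ("2018-06-01", ["revenue", "hits", "record", "high"]),
    ("2018-06-04", ["revenue", "hits", "new", "low"]),
    ("2018-06-04", ["profit", "hits", "record", "high"]),
]
price_changes = {"2018-06-01": 1.5, "2018-06-04": -2.0}
dictionary = build_sentiment_dictionary(articles, {"record"}, price_changes)
```

    With this toy input the only phrase is ("hits", "record", "high"), which appears on both days, so its score is the sum of the two daily changes.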

    This study uses web crawlers to collect financial news from the Internet, builds a sentiment dictionary from that news, and builds a model to predict the trends of Taiwan stocks. The crawlers focus on the 17 most discussed stocks among the Taiwan 50 (0050) constituents and collect financial news and trading information from January 1, 2016 to July 13, 2018. To build the sentiment dictionary, this study first uses the TF-IDF method to find the keywords in the collected news articles. Within an article, each keyword is combined with the word before it and the word after it to form a keyword phrase. The sentiment score of each keyword phrase is calculated by summing the percentage price changes, up (plus) and down (minus), of the trading days on which news articles containing the phrase were released. All keyword phrases and their scores are collected to form the sentiment dictionary. In addition to the sentiment dictionary, this study applies Latent Semantic Analysis (LSA) with Singular Value Decomposition (SVD) to the news articles to produce, for each article, a 200-element vector that represents the relationship between the document and the subjects, which are made up of concepts and phrases. The system for predicting stock market trends in Taiwan is a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN). Its input features consist of the 200-element vector plus the sentiment scores of the top 20 phrases of each article. The training set uses all the news articles collected between 2016/01/01 and 2018/04/30. After training, this study conducts simulated trading using the news articles collected between 2018/04/23 and 2018/05/31 to verify the correctness of the system.
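    The input layout described above (200 LSA values plus 20 phrase scores, 220 features per article) can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: the function name is hypothetical, the top phrases are ranked by a per-article weight assumed here to be TF-IDF (the abstract does not say how they are ranked), and both the zero score for unknown phrases and the zero padding are choices made for this sketch.

```python
def article_features(lsa_vector, phrase_weights, sentiment_dict,
                     n_concepts=200, n_phrases=20):
    """Concatenate an LSA vector with the sentiment scores of the top phrases.

    phrase_weights maps each keyword phrase of the article to a ranking
    weight (assumed to be a TF-IDF weight in this sketch).
    """
    if len(lsa_vector) != n_concepts:
        raise ValueError(
            f"expected {n_concepts} LSA values, got {len(lsa_vector)}")
    # Keep the n_phrases highest-weighted phrases of the article.
    top = sorted(phrase_weights, key=phrase_weights.get, reverse=True)[:n_phrases]
    # Phrases missing from the dictionary score 0.0 (assumption).
    scores = [sentiment_dict.get(p, 0.0) for p in top]
    # Zero-pad articles that have fewer than n_phrases phrases (assumption).
    scores += [0.0] * (n_phrases - len(scores))
    return list(lsa_vector) + scores


# Toy call: a constant LSA vector and two phrases, only one of them scored.
features = article_features([0.1] * 200, {"a": 0.9, "b": 0.5}, {"a": 2.0})
```

    A fixed-length vector per article is what lets articles with different numbers of phrases share one LSTM input shape.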
To evaluate the performance of the system, this study conducts a real-life experiment between 2018/06/01 and 2018/07/13 to measure the total profit, the average return on investment, and the total cost of holding the 17 stocks when a user trades them according to the system's predictions. The average return on investment is greater than 1.6%. Among the 17 stocks, stock 2317 earns the highest return on investment, 5.79%. Three other sentiment dictionaries, NTUFSD, ANTUSD, and UTFSD, are also used in the experiment for comparison. The dictionary built in this study yields the highest profit and the most stable average return on investment of the four.
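
The abstract does not define how return on investment is computed. One common reading, used here purely as an assumption, is total profit divided by total purchase cost per stock, averaged over the stocks. The function names and sample prices below are hypothetical, and fees, take-profit points, and stop-loss points (treated in the thesis appendices) are ignored.

```python
def return_on_investment(trades):
    """trades: list of (buy_price, sell_price) pairs for one stock."""
    cost = sum(buy for buy, _ in trades)
    profit = sum(sell - buy for buy, sell in trades)
    # A stock that was never traded has no cost, so report 0.0 for it.
    return profit / cost if cost else 0.0


def average_roi(trades_by_stock):
    """Unweighted mean of the per-stock returns on investment."""
    rois = [return_on_investment(t) for t in trades_by_stock.values()]
    return sum(rois) / len(rois) if rois else 0.0


# Toy prices, invented for illustration only.
sample = {
    "A": [(100.0, 110.0)],   # one trade, 10% gain
    "B": [(50.0, 50.0)],     # one trade, no gain
}
mean_roi = average_roi(sample)
```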

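    Table of Contents
    Abstract
    Extended Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
        Section 1 Research Background and Motivation
        Section 2 Research Objectives
        Section 3 Research Contributions
        Section 4 Thesis Structure
    Chapter 2 Literature Review
        Section 1 Predictability of Stock Prices
        Section 2 Text Mining
        Section 3 Sentiment Analysis and Sentiment Dictionaries
        Section 4 Natural Language Processing
        Section 5 Recurrent Neural Networks and LSTM
    Chapter 3 Research Method
        Section 1 Data Collection and Preprocessing
        Section 2 Building the Sentiment Dictionary
        Section 3 Latent Semantic Analysis
        Section 4 Class Labeling and Sample Construction
        Section 5 Building the Prediction Model
    Chapter 4 Experimental Design and Results
        Section 1 Experimental Environment
        Section 2 Simulated Trading
        Section 3 Experimental Results and Analysis
    Chapter 5 Conclusion and Future Work
        Section 1 Conclusion
        Section 2 Future Work and Suggestions
    References
    Appendices
        Appendix 1 Per-Stock Experimental Comparison (Original)
        Appendix 2 Per-Stock Experimental Comparison (With Transaction Fees)
        Appendix 3 Per-Stock Experimental Comparison (With Take-Profit and Stop-Loss Points and Transaction Fees)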

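    On campus: public from 2023-07-23
    Off campus: not public
    The electronic thesis has not yet been authorized for public release; for the print copy, please check the library catalog.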