| Graduate student: | 李亭葦 Li, Ting-Wei |
|---|---|
| Thesis title: | 自動化新聞事件內容策展方法之建立 A Method for Automatic Content Curation of News Events |
| Advisor: | 王惠嘉 Wang, Hei-Chia |
| Degree: | Master |
| Department: | College of Management - Institute of Information Management |
| Year of publication: | 2019 |
| Academic year of graduation: | 107 |
| Language: | Chinese |
| Pages: | 56 |
| Keywords (Chinese): | 自動化內容策展, 資訊檢索, 自動化新聞摘要 |
| Keywords (English): | Automatic Content Curation, Information Retrieval, Automatic News Summary |
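The rapid development of the Internet has changed how people obtain information. The number of readers who follow the news through traditional newspapers and radio keeps shrinking, and online news articles are taking their place: more and more readers browse news on computers or mobile devices. The news media industry has therefore moved toward digitization, with outlets publishing online news articles on the web to deliver information to the public. The large volume of online news articles offers readers diverse news, but readers also need to spend considerable time reading before they can digest the information. In addition, a news event often lasts for some time as it develops or as the topic keeps attracting attention. When readers want to look up a specific event to understand how it unfolded, they use the search function of current news sites, and the results are usually a large number of news articles, so readers must spend extra effort reviewing and organizing them one by one to find the information they are actually seeking.
To address these problems, some platforms apply the concept of content curation: they aggregate large numbers of news articles by event or issue, then process them through steps such as selection, refinement, organization, and value addition before presenting them to readers. Unlike earlier content curation platforms, where editors organize the material manually, this study implements content curation of news events automatically. It first extracts the topics of the dataset and uses a Hidden Markov Model over word sequences to find the sequence of topic transitions. It then computes topic strength and the variation in strength to detect the important time points in the event's development, and finally produces concise article summaries. Combining these two features, chronological ordering and summarization, the study designs the event curation result presented to readers, with the aim of helping readers read clearly and quickly grasp the context of an event.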
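The topic-sequence step can be illustrated with a minimal Viterbi decoding sketch. The abstract does not specify how the Hidden Markov Model is parameterized, so the setup here is an assumption: hidden states are topics, observations are words, and the start, transition, and emission probabilities are supplied as plain dictionaries (for example, emissions taken from a topic model's topic-word distributions). The function name `viterbi` and all parameter names are hypothetical.

```python
import math


def viterbi(words, topics, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most likely topic sequence for a word sequence.

    Assumed setup (not taken from the thesis): hidden states are topics,
    observations are words, and probabilities are nested dicts.
    """
    if not words:
        return []

    def log(p):
        # Floor zero or missing probabilities so math.log stays defined.
        return math.log(max(p, floor))

    # best[t][k]: log-probability of the best path that ends in topic k at step t
    best = [{k: log(start_p[k]) + log(emit_p[k].get(words[0], 0.0)) for k in topics}]
    back = []  # back[t - 1][k]: predecessor of topic k on the best path at step t

    for w in words[1:]:
        scores, pointers = {}, {}
        for k in topics:
            prev = max(topics, key=lambda j: best[-1][j] + log(trans_p[j][k]))
            scores[k] = best[-1][prev] + log(trans_p[prev][k]) + log(emit_p[k].get(w, 0.0))
            pointers[k] = prev
        best.append(scores)
        back.append(pointers)

    # Trace the best path backwards from the most likely final topic.
    path = [max(topics, key=lambda k: best[-1][k])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

With two toy topics where one favors "bill" and "vote" and the other favors "march", decoding a word sequence that moves from the first vocabulary to the second yields a topic path that switches once, which is the kind of topic transition the method looks for.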
Readers' habits have changed: more and more readers browse news on computers or mobile devices, and the news industry has digitized accordingly. News outlets publish online news to deliver information. The large volume of online news brings readers a variety of information, but readers also need more time to digest it. When readers query a news event, the search often returns a large number of results, which forces them to spend extra effort sorting through the articles.
To address this problem, some platforms apply the concept of content curation, which aggregates news articles by event and then organizes and presents them to readers. At present, most content curation is done manually. In contrast to such platforms, this study proposes an automated method of news curation. We first extract the topics from the dataset and use the word sequence to find the topic sequence through a Hidden Markov Model. We then calculate topic strength and its variation to detect important time points during the development of the event. Finally, we generate a concise summary for each time point. We combine chronology and summarization to design the curation result, with the aim of helping readers quickly grasp the context of a news event.
Experiments show that the method performs well in each module. The curation results are practical for readers, but coherence is slightly lacking and leaves room for improvement.
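The time-point detection step mentioned in the abstract can be sketched as follows. The abstract does not define topic strength or its variation, so this sketch makes its own assumptions: strength is the mean topic probability over a day's articles, variation is the absolute day-to-day change in strength, and a day counts as important when its change exceeds the mean change plus `k` standard deviations. The function names, the threshold rule, and the ISO date keys are all hypothetical.

```python
from statistics import mean, pstdev


def daily_strength(topic_probs_by_day):
    """Average one topic's probability over each day's articles.

    Assumption: keys are ISO dates ("YYYY-MM-DD"), values are lists of
    per-article probabilities for the topic of interest.
    """
    return {day: mean(probs) for day, probs in sorted(topic_probs_by_day.items())}


def key_time_points(strength_by_day, k=1.0):
    """Return days whose change in strength is unusually large.

    Assumed rule (not from the thesis): flag a day when the absolute
    change from the previous day exceeds mean + k * std of all changes.
    """
    days = sorted(strength_by_day)  # ISO dates sort chronologically
    changes = [
        abs(strength_by_day[b] - strength_by_day[a])
        for a, b in zip(days, days[1:])
    ]
    if not changes:
        return []
    threshold = mean(changes) + k * pstdev(changes)
    return [day for day, change in zip(days[1:], changes) if change > threshold]
```

In a five-day series where strength jumps sharply on the third day and is otherwise flat, only the third day is flagged. A summary would then be generated for each flagged day; the summarization step itself is outside this sketch.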