| Graduate Student: | 王琇薏 Wang, Hsiu-Yi |
|---|---|
| Thesis Title: | User Intention-based Document Summarization on Heterogeneous Sentence Networks (於異質性語句網路中基於使用者意圖之文件摘要模型) |
| Advisor: | 黃仁暐 Huang, Jen-Wei |
| Degree: | Master |
| Department: | Institute of Computer & Communication Engineering, College of Electrical Engineering & Computer Science |
| Year of Publication: | 2018 |
| Graduation Academic Year: | 106 |
| Language: | English |
| Pages: | 50 |
| Chinese Keywords: | Extraction-based Document Summarization, Word Vector Model, Heterogeneous Network |
| English Keywords: | Extraction-based Document Summarization, Word Model, Heterogeneous Network, BeamSearch, MMR |
Automatic extraction-based document summarization is one of the well-known and difficult tasks in natural language processing. In past research, summaries were usually generated by extracting the top-K salient sentences with graph-based ranking algorithms. However, such sentence feature representations capture only the surface relationship between two objects and cannot precisely extract the keywords and related information that interest the user. Our proposed model therefore attempts to address the following challenges: (1) obtaining a deeper semantic concept among candidate sentences through meaningful sentence vectors built from a combination of word vectors and TF-IDF values; (2) considering not only the relationships between sentences but also the importance of the user's keywords of interest to each sentence, and ranking and selecting significant sentences within a heterogeneous sentence network; (3) generating the result sentence by sentence to ensure that the summary semantics remain related to the original document.
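For context on the graph-based ranking algorithms mentioned above, a minimal TextRank-style power iteration over a sentence-similarity matrix might look like the sketch below. This is an illustrative baseline only, not the thesis's heterogeneous-network method; the matrix `sim` (diagonal assumed zero) and the function name are assumptions.

```python
def pagerank(sim, d=0.85, iters=50):
    """TextRank-style ranking: power iteration over a sentence-similarity
    matrix `sim` (symmetric weights, zero diagonal). Returns one salience
    score per sentence; the top-K scores pick the extractive summary."""
    n = len(sim)
    scores = [1.0 / n] * n
    # Total outgoing weight of each sentence node, used to normalize edges.
    out = [sum(row) for row in sim]
    for _ in range(iters):
        new = []
        for i in range(n):
            # Mass flowing into sentence i from every neighbor j.
            rank = sum(scores[j] * sim[j][i] / out[j]
                       for j in range(n) if out[j] > 0)
            new.append((1 - d) / n + d * rank)
        scores = new
    return scores
```

Sentences whose neighbors are themselves highly ranked accumulate more score, which is the "centrality as salience" idea the surface-level baselines rely on.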
We perform single- and multi-document summarization not only on the English benchmark datasets DUC 2001 and DUC 2002, but also on the large-scale Chinese summarization dataset LCSTS. In addition, we cooperated with a company to collect news data, after which bank auditing experts annotated reference summaries and scored the quality of the generated summaries. In the experimental results, the improvement in ROUGE scores demonstrates the effectiveness and promise of our proposed model, and its summary quality scores also surpass those of the comparison methods.
Automatic extraction-based document summarization is one of the well-known and difficult tasks in the natural language processing area. In past research, summaries were usually generated by extracting the top-K salient sentences with graph-based ranking algorithms. However, such feature representations of sentences capture only the surface relationship between two objects, so the results may not match user intentions. Therefore, our proposed model attempts to settle the following challenges: (1) obtain a deeper semantic concept among candidate sentences through meaningful sentence vectors based on a combination of word vectors and TF-IDF values; (2) consider not only the relationships between sentences but also the importance of user intentions to each sentence, then rank and choose significant sentences on a heterogeneous graph; (3) generate the result sentence by sentence to ensure that the summary semantics are related to the original document.
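Challenge (1) combines word vectors with TF-IDF values to form sentence vectors. One common realization of that combination is a TF-IDF-weighted average of word embeddings; the sketch below assumes that formulation, with each sentence treated as a document for the IDF statistics, and all names (`tfidf_weights`, `sentence_vector`, the toy `word_vecs` lookup) are illustrative rather than the thesis's exact design.

```python
import math
from collections import Counter

def tfidf_weights(sentences):
    """TF-IDF weight per (sentence, word), treating each tokenized
    sentence as its own document for the IDF count."""
    n = len(sentences)
    df = Counter()
    for s in sentences:
        df.update(set(s))          # document frequency of each word
    weights = []
    for s in sentences:
        tf = Counter(s)
        weights.append({w: (tf[w] / len(s)) * math.log(n / df[w]) for w in tf})
    return weights

def sentence_vector(sentence, word_vecs, weights, dim):
    """TF-IDF-weighted average of word vectors; words missing from the
    embedding table or with zero weight are skipped."""
    vec = [0.0] * dim
    total = 0.0
    for w in sentence:
        wt = weights.get(w, 0.0)
        if w in word_vecs and wt > 0:
            vec = [v + wt * x for v, x in zip(vec, word_vecs[w])]
            total += wt
    return [v / total for v in vec] if total > 0 else vec
```

The weighting lets rare, discriminative words dominate the sentence representation instead of frequent function words, which is what gives the sentence vectors their deeper semantic signal.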
We conduct experiments on the English summarization benchmark datasets, DUC 2001 and DUC 2002, as well as on a large-scale Chinese summarization dataset, LCSTS. Besides, based on our task assumption, we collect news data and have bank auditing experts label the reference summaries. The ROUGE evaluation results demonstrate the effectiveness and promise of our proposed model.
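The ROUGE evaluation mentioned above measures n-gram overlap between a candidate summary and the human reference. A small sketch of ROUGE-N recall as it is conventionally defined (whitespace tokenization and the function name are simplifying assumptions):

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: clipped count of candidate n-grams that also
    appear in the reference, divided by total reference n-grams."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    if not ref:
        return 0.0
    overlap = sum(min(cand[g], ref[g]) for g in ref)  # clipped overlap
    return overlap / sum(ref.values())
```

Recall-oriented ROUGE rewards summaries that cover the reference content, which is why it is the standard automatic metric for the DUC and LCSTS benchmarks.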
On campus: available to the public from 2023-06-29.