
Graduate Student: Lai, Ming-Wei (賴銘偉)
Thesis Title: A Feature Selection Method Based on Text Segmentation of E-Books (基於文件分段之電子書特徵選取)
Advisor: Wang, Hei-Chia (王惠嘉)
Degree: Master
Department: College of Management - Institute of Information Management
Year of Publication: 2010
Academic Year of Graduation: 98
Language: Chinese
Number of Pages: 90
Chinese Keywords: 電子書、k個最近鄰居法、支援向量機、特徵選取、文件分段
English Keywords: e-books, k-nearest neighbor, support vector machines, feature selection, text segmentation
  • With the rise of information technology and the rapid development of the Internet, converting paper books into e-books has gradually attracted attention. People can find the books they want directly over the Internet and download them to an e-book reader, which greatly improves the convenience of acquiring knowledge from reading. However, e-books have now accumulated in very large numbers, so sorting them into categories often demands a great deal of manpower and time.
    Traditional classification techniques such as decision trees, k-nearest neighbor (kNN), naïve Bayes, and support vector machines (SVM) usually extract representative feature words from an article to form a feature space. The longer the article, the more feature words it tends to produce and the higher the dimensionality of that space, which complicates subsequent classification; the classification process therefore includes a feature selection step to filter out poor feature words. The content of an e-book, however, is typically more than ten times the length of an ordinary article. Applying traditional feature selection to e-books easily produces a large number of feature words and increases the complexity of subsequent classification; it also lowers the overall weight of important feature words that are concentrated in only a few blocks, causing those words to be eliminated during selection.
    This study therefore takes the structure of the full text into account when extracting features from e-books. Using text segmentation, the full text is divided into several shorter blocks that are processed individually, and the degree to which each word is concentrated within each block is analyzed to identify the feature words relatively important to that block. We expect the feature words selected this way to improve classification accuracy.
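The segment-wise selection idea above can be sketched in a few lines of Python. This is only an illustration: the concentration score used here (a term's within-segment share of its whole-book frequency) and the function name are hypothetical stand-ins for the thesis's actual weighting.

```python
from collections import Counter

def segment_features(segments, top_k=3):
    """For each segment (a list of tokens), rank terms by how concentrated
    they are in that segment relative to the whole book, and keep the
    top_k highest-scoring terms as that segment's features."""
    total = Counter()
    for seg in segments:
        total.update(seg)
    features = []
    for seg in segments:
        tf = Counter(seg)
        # Concentration: the share of a term's book-wide occurrences
        # that fall inside this segment (illustrative score, not the
        # thesis's formula).
        scored = {t: tf[t] / total[t] for t in tf}
        top = sorted(scored, key=lambda t: (-scored[t], t))[:top_k]
        features.append(top)
    return features
```

Selecting features per segment keeps terms that dominate one block even when their whole-book weight is low, which is exactly the failure mode of whole-text selection described above.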

    With the rapid growth of information technology and the Internet, paper books can be transformed into e-books. People can obtain these e-books from the Internet and download them to an e-reader, which enhances the convenience of absorbing knowledge from books. However, the number of e-books has become very large, and classifying them costs a great deal of time and effort.
    Traditional classification approaches such as decision trees, k-nearest neighbor, naïve Bayes, and support vector machines usually select feature words from the content, and these words form a feature space. The longer the article, the more feature words it is likely to generate and the higher the dimensionality of the feature space, which complicates the subsequent classification process. The classification process therefore filters out undesirable feature words through a feature selection step. However, e-books are usually much longer than ordinary articles. With traditional approaches, e-books generate a large number of feature words, complicating subsequent classification, and may even lose important feature words because the long content reduces those words' overall weights.
    Therefore, we present a novel feature selection approach that applies a text segmentation algorithm. With this algorithm, an e-book is cut into several segments. We analyze the importance of every word within these segments and select the important feature words for each segment. We expect that the feature words selected by our approach can improve classification accuracy.
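Among the feature selection measures the thesis surveys (DF, IG, MI, and the χ2 statistic), CHI is easy to show concretely. Below is a minimal sketch computing the χ2 statistic for one term-category pair from a 2x2 contingency table; the function and argument names are illustrative, not taken from the thesis.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a term-category pair, from the 2x2 table:
    a = docs in the category that contain the term
    b = docs outside the category that contain the term
    c = docs in the category without the term
    d = docs outside the category without the term
    Higher values mean the term and category are more dependent,
    so the term is a better discriminating feature."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0
```

A term appearing in every in-category document and nowhere else scores maximally, while a term spread evenly across categories scores zero, which is why such terms are filtered out.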
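The segmentation step relies on a lexical score computed at each gap between sentences, in the spirit of TextTiling (Hearst, 1997), which the thesis cites. The following is a minimal sketch assuming word-count vectors and cosine similarity; it is not the author's exact algorithm.

```python
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(c1[t] * c2[t] for t in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def gap_scores(sentences, window=2):
    """Lexical score at each gap between sentences: the cosine similarity
    of the word-count vectors built from up to `window` sentences on
    either side of the gap. Low-scoring valleys mark a vocabulary shift
    and are candidate segment boundaries."""
    scores = []
    for gap in range(1, len(sentences)):
        left = Counter(w for s in sentences[max(0, gap - window):gap] for w in s)
        right = Counter(w for s in sentences[gap:gap + window] for w in s)
        scores.append(cosine(left, right))
    return scores
```

Boundaries would then be placed at the deepest valleys of the returned score sequence, cutting the e-book into the segments that feature selection operates on.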

    Table of Contents
    Chapter 1: Introduction
      1.1 Research Background
      1.2 Research Motivation and Objectives
      1.3 Research Scope and Limitations
      1.4 Research Process
      1.5 Thesis Outline
    Chapter 2: Literature Review
      2.1 Text Quantification
      2.2 Similarity Computation
        2.2.1 Vector Space Model
        2.2.2 Information Theory
        2.2.3 Graph Theory
        2.2.4 Summary
      2.3 Text Segmentation
        2.3.1 Similarity
        2.3.2 Lexical Chains
        2.3.3 Feature Words
        2.3.4 Summary
      2.4 Word Sense Disambiguation (WSD)
      2.5 Feature Selection
        2.5.1 Document Frequency (DF)
        2.5.2 Information Gain (IG)
        2.5.3 Mutual Information (MI)
        2.5.4 χ2 Statistic Measure (CHI)
        2.5.5 Summary
      2.6 Document Classification
        2.6.1 Decision Tree
        2.6.2 k-Nearest Neighbor (kNN)
        2.6.3 Naïve Bayes Classifier
        2.6.4 Support Vector Machines (SVM)
        2.6.5 Summary
    Chapter 3: Research Method
      3.1 Research Framework
      3.2 Document Collection and Processing Module
      3.3 Text Segmentation Module
      3.4 Segment Feature Selection Module
    Chapter 4: System Implementation and Evaluation
      4.1 Experimental Design
        4.1.1 Pre-processing
        4.1.2 Feature Selection
        4.1.3 Classifier
      4.2 Method Experiments
        4.2.1 Data Sources
        4.2.2 Methods Compared
        4.2.3 Evaluation Methods and Metrics
      4.3 Experimental Analysis and Discussion
        4.3.1 Experiment 1: Performance of traditional feature selection methods
        4.3.2 Experiment 2: Block parameter settings
        4.3.3 Experiment 3: Effect of semantic information
        4.3.4 Experiment 4: Classification performance with vs. without segmentation
        4.3.5 Experiment 5: Classification performance with manual vs. automatic segmentation
        4.3.6 Experiment 6: Classification performance of each method per e-book category
    Chapter 5: Conclusions and Future Research Directions
      5.1 Conclusions
      5.2 Future Research Directions
    References

    List of Tables
    Table 2-1 Similarity formulas based on the vector space model
    Table 2-2 Comparison of classification methods
    Table 3-1 Comparison of weighting computations
    Table 4-1 E-book categories and counts
    Table 4-2 Contingency table
    Table 4-3 Best cases of the traditional methods
    Table 4-4 Best cases of NOsym-CSF and sym-CSF
    Table 4-5 Best cases with and without segmentation
    Table 4-6 t-test of segmented vs. unsegmented
    Table 4-7 Best cases of automatic vs. manual segmentation
    Table 4-8 t-test of segmented vs. unsegmented
    Table 4-9 Best cases of each method for each e-book category

    List of Figures
    Figure 1-1 Research process
    Figure 2-1 Example of optimal matching
    Figure 2-2 Example of many-to-many matching
    Figure 2-3 Score curve at each gap
    Figure 2-4 Field-associated term tree (Lee et al., 2002)
    Figure 2-5 kNN classification concept
    Figure 2-6 SVM concept
    Figure 2-7 SVM data projection
    Figure 2-8 Non-separable case
    Figure 3-1 Framework of this study
    Figure 3-2 Document pre-processing flow
    Figure 3-3 Text segmentation flow
    Figure 3-4 Computing the lexical score of gap i
    Figure 3-5 Curve formed by the lexical scores of the gaps
    Figure 3-6 Score curve at each gap
    Figure 3-7 Article segmentation
    Figure 3-8 Segment feature word selection flow
    Figure 4-1 System deployment diagram
    Figure 4-2 Classification evaluation
    Figures 4-3 to 4-8 P, R, and F-M values of the traditional methods (K=3 and K=4)
    Figure 4-9 Variation in similarity
    Figures 4-10 to 4-21 P, R, and F-M comparisons of NOsym-CSF(80) vs. sym-CSF(80) and NOsym-CSF(140) vs. sym-CSF(140) (K=3 and K=4)
    Figures 4-22 to 4-33 P, R, and F-M comparisons of NOsym-CSF and sym-CSF (80 and 140) against the traditional methods (K=3 and K=4)
    Figures 4-34 to 4-45 P, R, and F-M comparisons of NOsym-CSF and sym-CSF (80 and 140) against manual segmentation (K=3 and K=4)
    Figures 4-46 to 4-75 P, R, and F-M comparisons of all methods within the ANI, COPU, FAR, GE, and ME categories (K=3 and K=4)

    Aslam, J. A., & Frost, M. (2003). An information-theoretic measure for document similarity. Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, 449-450.
    Beeferman, D., Berger, A., & Lafferty, J. (1999). Statistical models for text segmentation. Machine Learning, 34, 177-210.
    Brants, T., Chen, F., & Tsochantaridis, I. (2002). Topic-based document segmentation with probabilistic latent semantic analysis. In Proceedings of CIKM, 211-218.
    Carthy, J., & Sherwood-Smith, M. (2002). Lexical chains for topic tracking. IEEE International Conference on System, Man and Cybernetics, 7.
    Choi, F. Y. Y., Wiemer-Hastings, P., & Moore, J. (2001). Latent semantic analysis for text segmentation. In Proceedings of EMNLP, 109-117.
    Choudhary, A. K., Harding, J. A., & Popplewell, K. (2006). Knowledge discovery for moderating collaborative projects. Paper presented at the Proceedings of the 4th IEEE International Conference on Industrial Informatics Singapore.
    Choudhary, A. K., Harding, J. A., & Tiwari, M. K. (2008). Data mining in manufacturing: a review based on the kind of knowledge. Journal of Intelligent Manufacturing.
    Cordon, O., Herrera-Viedma, E., Lopez-Pujalte, C., Luque, M., & Zarco, C. (2003). A review on the application of evolutionary computation to information retrieval. Approximate Reasoning, 34, 241-264.
    Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory. New York: Wiley-Interscience.
    Frakes, W. B., & Baeza-Yates, R. (1992). Information Retrieval: Data Structures & Algorithms. Prentice-Hall.
    Galavotti, L., Nardi, V. J., Sebastiani, F., & Simi, M. (2000). Feature Selection and Negative Evidence in Automated Text Categorization. Proceedings of the 4th European Conference on Research and Advanced Technology for Digital Libraries.
    Hao, P.-Y., Chiang, J.-H., & Tu, Y.-K. (2007). Hierarchically SVM classification based on support vector clustering method and its application to document categorization. Expert Systems with Applications, 33(3), 627-635.
    Harding, J. A., Shahbaz, M., Srinivas, & Kusiak, A. (2006). Data mining in manufacturing: a review. ASME Journal of Manufacturing Science and Engineering, 128(4), 969-976.
    Hearst, M. A. (1994). Multi-paragraph segmentation of expository text. In Meeting of ACL, 9-16.
    Hearst, M. A. (1997). TextTiling: segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1), 33-64.
    Huang, K.-C., Geller, J., Halper, M., Perl, Y., & Xu, J. (2009). Using WordNet synonym substitution to enhance UMLS source integration. Artificial Intelligence in Medicine, 46, 97-109.
    Joachims, T. (1998). Text categorization with Support Vector Machines: Learning with many relevant features. Lecture Notes in Computer Science, 1398, 137-142.
    Joachims, T. (1999). Transductive Inference for Text Classification using Support Vector Machines. Proceedings of the Sixteenth International Conference on Machine Learning, 200-209.
    Kauchak, D., & Chen, F. (2005). Feature-based segmentation of narrative documents. ACL Workshops, 32-39.
    Kozima, H., & Furugori, T. (1994). Segmenting narrative text into coherent scenes. Literary and Linguistic Computing, 9, 13-19.
    Lee, S. S., Shishibori, M., Sumitomo, T., & Aoe, J.-i. (2002). Extraction of Field-Coherent Passages. Information Processing & Management, 38(2), 173-207.
    Lewis, D. D., & Ringuette, M. (1994). A Comparison of Two Learning Algorithms for Text Categorization. In Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval 81-93.
    Li, S., Xia, R., Zong, C., & Huang, C.-R. (2009). A Framework of Feature Selection Methods for Text Categorization. Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, 692–700.
    Li, Y. H., & Jain, A. K. (1998). Classification of Text Documents. THE COMPUTER JOURNAL, 41(8), 537-546.
    Konchady, M. (2006). Text Mining Application Programming. Boston, MA: Charles River Media.
    Maron, M. E. (1961). Automatic Indexing: An Experimental Inquiry. Journal of the ACM (JACM), 8(3), 404 - 417.
    Matveeva, I., & Levow, G.-A. (2007). Topic Segmentation with Hybrid Document Indexing. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 351-359.
    Mochizuki, H., Honda, T., & Okumura, M. (1998). Text segmentation with multiple surface linguistic cues. In COLING-ACL, 881-885.
    Neaga, E. I., & Harding, J. A. (2005). An enterprise modelling and integration framework based on knowledge discovery and data mining. International Journal of Production Research, 43(6), 1089–1108.
    Oh, H.-J., Myaeng, S. H., & Jang, M.-G. (2007). Semantic passage segmentation based on sentence topics for question answering. Information Sciences, 177(18), 3696-3717.
    Paradis, F., & Nie, J.-Y. (2007). Contextual feature selection for text classification. Information Processing and Management, 43, 344-352.
    Pham, D. T., & Afify, A. A. (2005). Machine learning techniques and their applications in manufacturing. Proceedings of the Institution of Mechanical Engineers, Journal of Engineering Manufacture: Part B 219, 395–412.
    Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.
    Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.
    Quinlan, J. R. (1993). C4.5: programs for machine learning. San Mateo, CA.: Morgan Kaufmann.
    Razi, M. A., & Athappilly, K. (2005). A comparative predictive analysis of neural networks (NNs), nonlinear regression and classification and regression tree (CART) models. Expert Systems with Applications, 29, 65–74.
    Reynar, J. (1999). Statistical models for topic segmentation. In Proceedings of ACL, 357-364.
    Ristad, E. S. (1995). A Natural Law of Succession. Technical Report TR-495-95, Princeton University.
    Salton, G. (1988). Automatic text processing. Addison-Wesley Longman Publishing Company.
    Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM 18(11), 613-620.
    Shah, P. K., Perez-Iratxeta, C., Bork, P., & Andrade, M. A. (2003). Information extraction from full text scientific articles: where are the keywords? BMC Bioinformatics, 4(1).
    Shahbaz, M., Srinivas, Harding, J. A., & Turner, M. (2006). Product design and manufacturing process improvement using association rules. Proceedings of the Institution of Mechanical Engineers, Part B: Journal of Engineering Manufacture, 220, 243-254.
    Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27, 379–432.
    Stokes, N., Carthy, J., & Smeaton, A. F. (2002). Segmenting broadcast news streams using lexical chains. In Proceedings of the Starting AI Researchers Symposium, 145-154.
    Suchanek, F. M., Kasneci, G., & Weikum, G. (2008). YAGO: A Large Ontology from Wikipedia and WordNet. Web Semantics: Science, Services and Agents on the World Wide Web, 6, 203–217.
    Tagarelli, A., & Karypis, G. (2008). A Segment-based Approach To Clustering Multi-Topic Documents. Paper presented at the In Text Mining Workshop, SIAM Datamining Conference.
    Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Berlin: Springer-Verlag.
    Wan, X. (2007). A novel document similarity measure based on earth mover's distance. Information Sciences, 177, 3718–3730.
    Wan, X. J., & Peng, Y. X. (2005). A new retrieval model based on TextTiling for document similarity search. Journal of Computer Science and Technology, 20(4), 552-558.
    Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques (Second Edition): Morgan Kaufmann.
    Xiao, W. S. (1993). Graph Theory and Its Algorithms. Beijing: Aviation Industrial Press.
    Xie, X. L., & Beni, G. (1991). A Validity Measure for Fuzzy Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(8), 841-847.
    Xu, Y., Wang, B., Li, J., & Jing, H. (2008). An Extended Document Frequency Metric for Feature Selection in Text Categorization. Lecture Notes in Computer Science, 4993, 71-82.
    Yang, Y., & Liu, X. (1999). A re-examination of text categorization methods. Annual ACM Conference on Research and Development in Information Retrieval, 42-49.

    Full text not available for download.
    On campus: available from 2020-12-31. Off campus: not available.
    The electronic thesis has not been authorized for public release; for the print copy, please consult the library catalog.