| Graduate student: | 賴銘偉 Lai, Ming-Wei |
|---|---|
| Thesis title: | 基於文件分段之電子書特徵選取 A Feature Selection Method Based on Text Segmentation of E-Books |
| Advisor: | 王惠嘉 Wang, Hei-Chia |
| Degree: | Master |
| Department: | College of Management - Institute of Information Management |
| Year of publication: | 2010 |
| Academic year of graduation: | 98 |
| Language: | Chinese |
| Number of pages: | 90 |
| Keywords: | e-books, k-nearest neighbor, support vector machines, feature selection, text segmentation |
With the rise of information technology and the rapid growth of the Internet, the conversion of printed books into e-books has drawn increasing attention. Readers can find the books they want online and download them to an e-book reader, which makes acquiring knowledge from books far more convenient. However, the number of e-books is already very large, so classifying them by hand takes considerable labor and time.
Traditional classification techniques, such as decision trees, k-nearest neighbor (kNN), naïve Bayes, and support vector machines (SVM), usually extract representative feature words from a document to form a feature space. The longer the document, the more feature words it tends to produce and the higher the dimensionality of the feature space, which complicates subsequent classification. A feature selection step is therefore used to filter out poor feature words. An e-book, however, is typically more than ten times as long as an ordinary article. Applying traditional feature selection to e-books yields a large number of feature words, which adds to the complexity of classification. It can also discard important feature words: a word concentrated in only a few parts of the book receives a low overall weight and is filtered out during selection.
This study therefore takes the structure of the full text into account during feature extraction. A text segmentation technique divides the full text into several shorter segments, which are processed individually. We analyze how concentrated each word is within each segment in order to find the feature words that are relatively important to that segment, and we expect the selected feature words to improve classification accuracy.
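The idea of selecting features per segment can be illustrated with a minimal sketch. The abstract does not specify the segmentation algorithm or the weighting scheme, so everything here is an assumption made for illustration: fixed-size segments stand in for a real text segmentation algorithm, raw term frequency stands in for the importance measure, and the function names (`split_into_segments`, `top_terms`, `select_features_by_segment`) are hypothetical.

```python
from collections import Counter


def split_into_segments(tokens, segment_size):
    """Cut a token list into consecutive fixed-size segments.

    Assumption: a stand-in for a real text segmentation algorithm.
    """
    return [tokens[i:i + segment_size] for i in range(0, len(tokens), segment_size)]


def top_terms(tokens, k):
    """Return the k most frequent terms, breaking ties alphabetically."""
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def select_features_by_segment(tokens, segment_size, k_per_segment):
    """Union of the top-k terms of every segment, in first-seen order."""
    selected = []
    for segment in split_into_segments(tokens, segment_size):
        for term in top_terms(segment, k_per_segment):
            if term not in selected:
                selected.append(term)
    return selected


# Toy "book": "dragon" is concentrated in the last segment only.
book = (
    ["ship"] * 5 + ["sea"] * 3        # segment 1
    + ["sea"] * 5 + ["ship"] * 3      # segment 2
    + ["dragon"] * 4 + ["ship"] * 2 + ["sea"] * 2  # segment 3
)

global_features = top_terms(book, 2)
segment_features = select_features_by_segment(book, 8, 1)
print(global_features)
print(segment_features)
```

In this toy input, whole-document selection with two slots keeps only "sea" and "ship", while per-segment selection with one slot per segment also keeps "dragon", the word that matters only in one part of the text.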
On campus: publicly available from 2020-12-31