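| Graduate student: | 郭孟家 Kuo, Meng-Chia |
|---|---|
| Thesis title: | 結合機率主題模型之模糊概念群集於文字探勘 Fuzzy Conceptual Clustering with Probabilistic Topic Model in Text Mining |
| Advisor: | 李昇暾 Li, Sheng-Tun |
| Degree: | Master (碩士) |
| Department: | College of Management - Institute of Information Management (管理學院 - 資訊管理研究所) |
| Year of publication: | 2011 |
| Graduation academic year: | 99 (ROC calendar, i.e. academic year 2010-2011) |
| Language: | English |
| Number of pages: | 57 |
| Chinese keywords: | 模糊理論 (fuzzy theory), 概念群集 (conceptual clustering), 正規化概念分析 (formal concept analysis), 主題模型 (topic model), 文字探勘 (text mining) |
| Foreign-language keywords: | fuzzy, conceptual cluster, formal concept analysis, topic model, text mining |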
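With the development of information technology and the Internet, digital documents have become a common medium for storing information, and people can obtain large amounts of digital information through many channels. How to process this information quickly and effectively, and retrieve the needed data from it, has become a new issue. Researchers have therefore begun to study text mining methods that improve the efficiency of obtaining specific information from large amounts of data, and the automatic classification of digital documents is an important part of that work.
Formal concept analysis (FCA) was first proposed as a theory of mathematical analysis. In recent years it has been shown to be an effective method for analyzing documents in specific domains. Traditional FCA, however, cannot represent the uncertainty carried by information, so researchers combined FCA with fuzzy theory and proposed fuzzy formal concept analysis (FFCA) to handle uncertain information. The biggest bottleneck of FCA is that its efficiency drops as the amount of information grows. This study therefore adopts a topic model as its main method for reducing the amount of information and thereby improving the efficiency of FCA. A topic model is a statistical model for finding the latent topics in a document collection, and topic models have been widely applied in machine learning areas such as document modeling and collaborative filtering. Latent Dirichlet allocation (LDA), one of the most frequently discussed topic models in recent years, is mainly used to build a generative model of the latent topics of a corpus.
Using information retrieval techniques and a topic model, this study builds a latent topic model of the corpus and classifies documents with FFCA on the basis of the latent topics. The properties of the topic model are used to improve the efficiency of FFCA without affecting its classification performance. An external thesaurus is also used to give each latent topic a meaningful label that further interprets what the topic means.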
Owing to the growth of the Internet and information technology, digital files have become a common storage medium, and people can obtain enormous numbers of digital files in various ways. A new problem therefore arises: how can people organize this enormous volume of digital files and extract useful information from them efficiently? The study of text mining has consequently become popular as a way to improve the efficiency of extracting specific information from huge amounts of data, and the automatic categorization of documents is one of its important issues.
Formal concept analysis (FCA), a mathematical approach, has proved to be an effective method for document searching in specific domains. Classical FCA, however, has difficulty representing the uncertainty in documents, so some researchers proposed fuzzy formal concept analysis (FFCA), which combines FCA with fuzzy theory. The efficiency of FCA nevertheless decreases as the amount of information grows. This research therefore reduces the amount of information by applying a topic model as an information filter, and thereby improves the efficiency of FFCA. A topic model is a statistical model used to discover the latent topics within a corpus. Topic models have been widely applied in areas of machine learning such as document modeling and collaborative filtering. Latent Dirichlet allocation (LDA), one of the most common topic models in recent years, is the one applied in this research.
First, this research constructs a latent topic model of the corpus through LDA and information retrieval techniques, then classifies the documents with FFCA based on the latent topics, and shows that the properties of the topic model help improve the efficiency of FFCA. Second, this research labels the latent topics with meaningful words drawn from a specific thesaurus, WordNet.
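The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of running FFCA over latent topics rather than over raw terms. It makes several assumptions. The `DOC_TOPIC` table is hypothetical data standing in for LDA output. The threshold `alpha` and the function names are illustrative. The membership of a document in a concept is taken as the minimum of its memberships over the concept's topics, which is one common FFCA convention and not necessarily the one used in the thesis.

```python
from itertools import combinations

# Hypothetical document-topic memberships in [0, 1], standing in for the
# per-document topic proportions an LDA model would produce.
DOC_TOPIC = {
    "d1": {"t1": 0.8, "t2": 0.1, "t3": 0.1},
    "d2": {"t1": 0.7, "t2": 0.3, "t3": 0.0},
    "d3": {"t1": 0.1, "t2": 0.6, "t3": 0.3},
    "d4": {"t1": 0.0, "t2": 0.5, "t3": 0.5},
}


def alpha_cut(fuzzy_context, alpha):
    """Keep only the (document, topic) pairs whose membership is >= alpha."""
    return {
        doc: {topic for topic, mu in row.items() if mu >= alpha}
        for doc, row in fuzzy_context.items()
    }


def formal_concepts(context, attributes):
    """Enumerate (extent, intent) pairs by closing every document subset.

    Brute force, exponential in the number of documents: illustration only.
    """
    docs = sorted(context)
    concepts = set()
    for size in range(len(docs) + 1):
        for subset in combinations(docs, size):
            # Intent: topics shared by every document in the subset.
            intent = set(attributes)
            for doc in subset:
                intent &= context[doc]
            # Extent: every document that has all topics of the intent.
            extent = {d for d in docs if intent <= context[d]}
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


def fuzzy_concepts(fuzzy_context, alpha):
    """Attach a fuzzy membership to each document of each crisp concept."""
    attributes = {t for row in fuzzy_context.values() for t in row}
    crisp = alpha_cut(fuzzy_context, alpha)
    result = []
    for extent, intent in formal_concepts(crisp, attributes):
        # Assumed convention: min over the intent's topics (1.0 if no topics).
        memberships = {
            doc: min((fuzzy_context[doc][t] for t in intent), default=1.0)
            for doc in extent
        }
        result.append((memberships, intent))
    return result


CONCEPTS = fuzzy_concepts(DOC_TOPIC, alpha=0.3)
```

Because the attributes here are a handful of topics rather than thousands of terms, the formal context has far fewer columns, which is the kind of reduction the abstract credits for the efficiency gain. A practical implementation would replace the brute-force enumeration with a dedicated concept-lattice algorithm.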
On-campus access: public from 2023-12-18