簡易檢索 / 詳目顯示

研究生: 林友民
Lin, Yeou-Min
論文名稱: 一個以矩陣為基礎的分割叢集演算法應用於非數值性資料
MBA – A Matrix-Based Partitioning Clustering Algorithm for Non-Numeric Data
指導教授: 焦惠津
Jiau, Hewi-Jin Christine
學位類別: 碩士
Master
系所名稱: 電機資訊學院 - 電機工程學系
Department of Electrical Engineering
論文出版年: 2003
畢業學年度: 91
語文別: 英文
論文頁數: 59
中文關鍵詞: 資料探勘叢集演算法
外文關鍵詞: data mining, clustering algorithm
相關次數: 點閱:125下載:1
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • Data clustering has attracted a lot of attention in the field of data mining and computational statistics. Many methods have been developed for clustering objects according to their similarity.
    However, most of the traditional clustering algorithms that use distances between points for clustering are not appropriate for non-numeric attributes such as nominal, boolean and categorical data. Because the data type of these specific applications is non-numeric(non-metric), thus it is not easy to map attribute tuples to vectors. And sometimes we would lose certain useful information embedded in data attributes with using traditional distance-based clustering algorithms directly.
    This thesis proposes and explores two simple algorithms more
    suitable than distance-based clustering algorithms for non-numeric data clustering. The Cosine measure for document clustering triggers the idea of our similarity measure. This similarity measure can easily reflect the ordinal information of data variables of sequential data. It means that our similarity measure shows whether the ``quality of consecution" is good or not. In addition, some issues about similarity matrix (SM) partitioning and Union-Find-like methods are discussed in detail.

    Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Non-numeric Data Attributes . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3 OurWork . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4 Application Domain: Sequential Data Clustering . . . . . . . . . . . . . . 7 1.4.1 Application Example: Mechanism of Predictive Pre-fetching . . . 9 2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.1 Market Basket Data and Categorical data . . . . . . . . . . . . . . . . . 11 2.2 Clustering for Personalized Mobile Web Usage and Data Allocation . . . 13 2.3 WebMining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.1 Web User Clustering By BIRCH Algorithm . . . . . . . . . . . . 15 2.3.2 Web User Clustering Using Sequence Alignment . . . . . . . . . . 19 3 A Clustering Approach for Non-numeric Data Types . . . . . . . . . . 23 3.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2 An Straightforward Clustering Algorithm . . . . . . . . . . . . . . . . . . 25 3.2.1 Calculation of Similarity . . . . . . . . . . . . . . . . . . . . . . . 26 3.2.2 Union-Find-like Algorithm . . . . . . . . . . . . . . . . . . . . . . 26 3.3 AMatrix-Based Clustering Algorithm. . . . . . . . . . . . . . . . . . . . 28 3.3.1 Measure of the Affinity . . . . . . . . . . . . . . . . . . . . . . . . 28 3.3.2 Partition of theMatrix . . . . . . . . . . . . . . . . . . . . . . . . 30 3.4 SimilarityMeasure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4 Application Examples and Experiment Results . . . . . . . . . . . . . . 39 4.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.2 Example Scenario:Web-Based E-learning System . . . . . . . . . . . . . . 42 4.2.1 E-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.2.2 Data Mining Assisting E-learning . . . . . . . . . . . . . . . . . . 43 4.3 Experimental Environment . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 54 5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 5.2 FutureWork . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

    [1] G. Karypis,E.-H. Han, and V. Kumar,“Chameleon:Hierarchical Clustering Using
    Dynamic Modeling, ” Computer, pp. 68–75, Aug. 1999.
    [2] L. Talavera and J. Bejar,“Generality-Based Conceptual Clustering with Probabilistic
    Concepts, ” IEEE Trans on Patter Analysis and Machine Intelligence, vol. 23,
    pp. 196–206, Feb. 2001.
    [3] T. Zhang, R. Ramakrishnan, and M. Livny,“BIRCH: An Efficient Data Clustering
    Method for Very Large Databases, ” In Proceedings of the ACM SIGMOD International
    Conference on Management of Data, pp. 196–206, Jun. 1996.
    [4] Y. Xie and V.V. Phoha,“Web User Clustering From Access Log Using Belief Function,
    ” First International Conference on Knowledge Capture, pp. 202–208, Oct.
    2001.
    [5] W. Wang and Osmar R. Zaiane,“Clustering Web Sessions by Sequence Alignment,
    ” Third International Workshop on Management of Information on the Web, Sep.
    2002
    [6] B. Hay, G. Wets, and K. Vanhoof,“Clustering Navigation Patterns on a Website
    using a Sequence Alignment Method, ” IJCAI’s Workshop on Intelligent Techniques
    for Web Personalization, Aug. 2001.
    [7] V. Guralnik and G. Karypis,“A Scalable Algorithm for Clustering Sequential Data,
    ” IEEE International Conference on Data Mining, pp. 179–1866, Dec. 2001.
    [8] H. Kawaji, Y. Yamaguchi, H Matsuda, and A. Hashimoto,“A Graph-Based Clustering
    Method for a Large set of sequence Using a Graph Partitioning Algorithm,
    ” The Twelfth International Conference on Genome Informatics, pp. 93–102, Dec.
    2001.
    [9] E. Bolten1, A. Schliep, S. Schneckener, D. Schomburg, and R. Schrader,“Clustering
    Protein Sequences - Structure Prediction by Transitive Homology, ” 8th International
    Conference on Intelligent Systems for Molecular Biology, Aug. 2000.
    [10] F. Giannotti,C. Gozzi, and G. Manco,“Characterizing Web User Accesses: A Transactional
    Approach to Web Log Clustering, ” Proceedings of the International Conference
    on Information Technology: Coding and Computing, pp. 246–251, Apr. 2002.
    [11] Y. Fu, K. Sandhu, and M. Shih,“Clustering ofWeb Users Based on Access Patterns, ”
    International Workshop on Web Usage Analysis and User Profiling (WEBKDD’99),
    Aug. 1999.

    [12] A. Nanopoulos, D. Katsaros, and Y. Manolopoulos,“Effective Prediction of Webuser
    Accesses: A Data Mining Approach, ” Mining Log Data Across All Customer
    TouchPoints (WEBKDD’2001), Aug. 2001.
    [13] Jiawei Han and Micheline Kamber,“Data Mining: Concepts and Techniques, ” Morgan
    Kaufmann, 2001.
    [14] Ricardo Baeza-Yates and Berthier Ribeiro-Neto,“Modern Information Retrieval, ”
    Addison Wesley, 1999.
    [15] R. Cooley, B.Mobasher, and J. Srivastava,“Data Preparation for MiningWorldWide
    Web Browsing Patterns, ” Knowledge and Information Systems, pp. 5–32, 1999.
    [16] J. Han, Y. Cai, and N. Cercine,“Knowledge Discovery in Databases: An Attribute-
    Oriented Approach, ” In Proceedings of 18th International Conference Very Large
    Data Base, pp. 547–559, Aug. 1992.
    [17] K. Charter, J. Schaeffer, and D. Szafron,“Sequence Alignment Using Fastlsa, ” International
    Conference on Mathematics and Engineering Techniques in Medicine and
    Biological Sciences (METMBS’2000), pp. 239–245, Jun. 2000.
    [18] H. Mannila, H.Toivonen, and A. Inleri Verkamo,“Discovering Frequent Episodes in
    Sequences, ” In Proceedings of the International Conference on Knowledge Discovery
    and Data Mining (KDD’95), pp. 210–215, Aug. 1995.
    [19] R. Agrawal and R. Srikant,“Mining Sequential Patterns, ” In Proceedings of the
    International Conference Data Engineering (ICDE’95), pp. 3–14, Aug. 1995.
    [20] B. Mobasher, H. Dai, T. Luo, M. Nakagawa, Y. Sun, and J. Wiltshire,“Discovery of
    Aggregate Usage Profiles for Web Personalization, ” Worskhop on Web Mining for
    E-Commerce – Challenges and Opportunities (WEBKDD’2000), Aug. 2000.
    [21] G. Adomavicius and A. Tuzhilin,“Using Data Mining Methods to Build Customer
    Profiles, ” IEEE Computer, vol. 34, pp. 74–82, Feb. 2001.
    [22] J. Yang, W.Wang, P.Yu, and J. Han,“Discovery of Aggregate Usage Profiles for
    Web Personalization, ” In Proceedings of ACM SIGMOD International Coference
    on Manegement of Data, Jun. 2002.
    [23] S. Guha, R. Rasrogi, K. Shim,“ROCK: A Robust Clustering Algorithm for Categorical
    Attributes, ” In Proceedings of IEEE International Conference on Data
    Engineering, pp. 512–521, Mar. 1999.
    [24] W.-C. Chih, M.-S. Chen,“Developing Data Allocation Schemes by Incremental Mining
    of User Moving Patterns in a Mobile Computing System, ” IEEE Transactions
    on Knowledge and Data Engineering, Vol. 15, pp. 70–85 Feb. 2003.
    [25] W.-C. Chih, M.-S. Chen,“Mining User Moving Patterns for Personal Data Allocation
    in aMobile Computing System, ” In Proceedings of the 29th International Conference
    on Parallel Processing (ICPP-2000), pp. 573–580, Aug. 2000.
    [26] D.S. Phatak, R. Mulvaney,“Clustering for Personalized Mobile Web Usage, ” Proceedings
    of the IEEE FUZZ’02, pp. 705–710, May. 2002.

    [27] C.-H. Yun, K.-T. Chuang and M.-S. Chen,“An Efficient Clustering Algorithm for
    Market Basket Data Based on Small-Large Ratios, ” In Proceedings of the 25th
    International Computer Software and Applications Conference (COMPSAC 2001),
    pp. 505–510, Oct. 2001.
    [28] C. Ordonez, E. Omiecinski, N. Ezquerra,“A fast algorithm to cluster high dimensional
    basket data, ” In Proceedings of IEEE International Conference on Data
    Mining (ICDM 2001), pp. 633–636, Dec. 2001.

    下載圖示 校內:2004-07-11公開
    校外:2004-07-11公開
    QR CODE