簡易檢索 / 詳目顯示

研究生: 林俊翰
Lin, Jun-Han
論文名稱: 巨量資料中平行挖掘前k個高效益項目集之研究
Parallel Mining of Top-k High Utility Itemsets in Big Data
指導教授: 謝孫源
Hsieh, Sun-Yuan
學位類別: 碩士
Master
系所名稱: 電機資訊學院 - 資訊工程學系
Department of Computer Science and Information Engineering
論文出版年: 2016
畢業學年度: 104
語文別: 英文
論文頁數: 74
中文關鍵詞: 高效益項目集MapReduce架構Spark平台前k個高效益項目集
外文關鍵詞: high utility itemset, MapReduce, Spark platform, top-k high utility itemset
相關次數: 點閱:113下載:9
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 前k高效益項目集探勘是資料探勘領域中一門重要且新興的研究,其主要目的乃從資料庫中有效率地找出前k個效益最高的項目集。雖然一些研究已致力於發展其相關技術,現今的方法僅考慮以單一電腦執行探勘任務,並未考慮巨量資料之特性。有鑒於此,本論文率先提出一套從巨量資料中平行挖掘前k個高效益項目集之架構。於此架構下,我們提出一套以Spark平台為基礎的高效性平行探勘演算法PKU (Parallel Top-K High Utility Itemset Mining),其利用MapReduce架構將主要的探勘工作切割成數個搜尋空間較小的獨立子工作,並憑藉Spark獨有的RDD (Resilient Distributed Dataset)資料結構以記憶體內運算(In-memory Computing)方式有效率地平行挖掘出前k高效益項目集。此外,本研究還提出一系列適用於平行探勘的動態門檻值攀爬策略,其能有效率地刪減冗餘的候選項目集,進而大幅降低探勘所需的執行時間及記憶體使用量。之外,PKU還承襲Spark的諸多優點如低傳輸成本、支援容錯、高延展性及高擴充性。我們以真實及模擬資料驗證所提出的方法之效率。實驗結果顯示,PKU具備高度延展性及優異的執行效能,其執行速度大幅優於現今最有效率的相關演算法。

    Top-k high utility itemset (abbreviated as Top-k HUI) mining has been an important and emerging topic in data mining, which aims at efficiently mining k itemsets having the highest utility without setting the thresholds. Although many studies have been conducted on top-k HUI mining, they mainly focused on centralized databases and may not be scalable enough to handle big data. To address the above issues, this thesis proposes a novel framework called parallel mining of top-k high utility itemsets in big data. In this framework, a new algorithm named PKU (Parallel Top-K High Utility Itemset Mining) is proposed for parallel mining top-k HUIs on Spark platform. It uses MapReduce framework to divide the mining task into several independent subtasks, and utilizes a special data structure of Spark, called RDD (Resilient Distributed Dataset), to efficiently process data in parallel and achieve in-memory computing. Besides, several dynamic threshold rising strategies are proposed to efficiently prune the redundant candidates and largely reduce the execution time and memory usage of mining process. PKU inherits several advantages of Spark, including low communication cost, fault tolerance, and high scalability. Experimental results on both real and synthetic datasets show that PKU has good performance and high scalability on large datasets and outperforms the state-of-the-art algorithms.

    中文摘要 I Abstract II 致謝 III Content IV List of Figures VI List of Tables VIII 1. Introduction 1 1.1 Background 1 1.2 Motivation 2 1.3 Challenges 5 1.4 Research Aims 6 1.5 Thesis Organization 7 2. Related Work 8 2.1 Frequent Itemset Mining 8 2.2 High Utility Itemset Mining 10 2.3 Top-k High Utility Itemset Mining 12 2.4 Parallel Mining of High Utility Patterns 13 3. The Proposed Framework 15 3.1 Preliminary 15 3.2 The Proposed Framework 19 3.2.1 Overview 19 3.2.2 Main Function of PKU 21 3.2.3 Pre-evaluation in Parallel 22 3.2.4 Reorganize Transactions in Parallel 28 3.2.5 Mining Patterns in Parallel 34 3.3 Novel Strategies for Effectively Pruning Search Space 43 3.3.1 Discarding Local Unpromising Items in Parallel 43 3.3.2 Merge Conditional Transactions in Parallel 47 4. Performance Evaluation 50 4.1 Experimental Settings 50 4.2 Datasets Descriptions 51 4.3 Performance Evaluation on Small Datasets 52 4.4 Performance Evaluation on Large Datasets 54 4.5 Performance Evaluation for Pruning Strategy 57 4.6 Scalability of PKU 63 4.6.1 Performance of PKU under Varied I 63 4.6.2 Performance of PKU under Varied N 64 4.6.3 Performance of PKU under Varied D 65 4.6.4 Performance of PKU under Varied Executors 66 4.6.5 Performance of PKU under Spark platform 69 5. Conclusion and Future Works 70 5.1 Conclusion 70 5.2 Future Works 71 Reference 72

    [1] R. C. Agarwal, C. C. Aggarwal and V. V. V. Prasad, “A Tree Projection Algorithm for Generation of Frequent Itemsets,” Journal of Parallel and Distributed Computing, 61(3):350-371, 2001.
    [2] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules, “ in Proc. of the Int’l Conf. on Very Large Data Bases, pp. 487-499, 1994.
    [3] C. F. Ahmed, S. K. Tanbeer and B. Jeong, “Mining High Utility Web Access Sequences in Dynamic Web Log Data,” in Proc. of Int’l Conf. on Software Engineering Artificial Intelligence Networking and Parallel/Distributed Computing, pp. 76-81, 2010.
    [4] C. F. Ahmed, S. K. Tanbeer, B. Jeong and Y. Lee, “Efficient Tree Structures for High Utility Pattern Mining in Incremental Databases,” IEEE Transactions on Knowledge and Data Engineering, 21(12):1708-1721, 2009.
    [5] R. Chan, Q. Yang and Y. Shen, “Mining High Utility Itemsets,” in Proc. of the IEEE Int’l Conf. on Data Mining, pp. 19, 2003.
    [6] Laney, Douglas. "3D Data Management: Controlling Data Volume, Velocity and Variety" (PDF). Gartner. Retrieved 6 February 2001.
    [7] A. Erwin, R. P. Gopalan and N. R. Achuthan, “A Bottom-Up Projection Based Algorithm for Mining High Utility Itemsets,” in Proc. of the Int’l Conf. on Artificial Intelligence and Data Mining, pp. 3-11, 2007.
    [8] A. Erwin, R. P. Gopalan and N. R. Achuthan, “CTU-Mine: An Efficient High Utility Itemset Mining Algorithm Using the Pattern Growth Approach,” in Proc. of the Int’l Conf. on Instruction & Technology, pp. 71-76, 2007.
    [9] A. Erwin, R. P. Gopalan and N. R. Achuthan, “Efficient Mining of High Utility Itemsets from Large Datasets,” in Proc. of the Pacific-Asia Conf. on Knowledge Discovery and Data Mining, pp. 554-561, 2008.
    [10] P. Fournier-Viger, C. W. Wu, S. Zida and V. S. Tseng, “FHM: Faster High-Utility Itemset Mining Using Estimated Utility Co-occurrence Pruning,” in Proc. of the Int’l Symposium on Methodologies for Intelligent Systems, pp. 83-92, 2014.
    [11] P. Y. Hsu, Y. L. Chen and C. C. Ling, “Algorithms for Mining Association Rules in Bag Database,” Information Sciences, 166(1-4):31-47, 2004.
    [12] J. Han, J. Pei and Y. Yin, “Mining Frequent Patterns without Candidate Generation,” in Proc. of the ACM SIGMOD Int’l Conf. on Management of Data, pp. 1-12, 2000.
    [13] S. Krishnamoorthy, “Pruning Strategies for Mining High Utility Itemsets,” Expert Systems with Applications, 42:2371-2381, 2015.
    [14] Y. Liu, C. Cheng and V. S. Tseng, “Mining Differential Top-k Co-expression Patterns from Time Course Comparative Gene Expression Datasets,” BMC Bioinformatics, 14:230, 2013.
    [15] C. Lin, T. Hong and W. Lu, “An Effective Tree Structure for Mining High Utility Itemsets,” Expert Systems with Applications, 38(6):7419-7424, 2011.
    [16] G. Lan, T. Hong and V. S. Tseng, “An Efficient Projection-based Indexing Approach for Mining High Utility Itemsets,” Knowledge and Information System, 38(1):85-107, 2014.
    [17] Y. Liu, W. Liao and A. Choudhary, “A Fast High Utility Itemsets Mining Algorithm,” in Proc. of the Int’l Workshop on Utility-Based Data Mining, pp. 90-99, 2005.
    [18] Y. Liu, J. Li, W. Liao, A. N. Choudhary and Y. Shi, “High Utility Itemsets Mining,” Int’l Journal of Information Technology and Decision Making, 9(6):905-934, 2010.
    [19] B. Le, H. Nguyen, T. A. Cao and B. Vo, “A Novel Algorithm for Mining High Utility Itemsets,” in Proc. of the Asian Conf. on Intelligent Information and Database Systems, pp. 13-17, 2009.
    [20] M. Liu and J. Qu, “Mining High Utility Itemsets without Candidate Generation,” in Proc. of the ACM SIGKDD Int’l Conf. on Knowledge Management, pp. 55-64, 2012.
    [21] J. Liu, K. Wang and B. Fung, “Direct Discovery of High Utility Itemsets without Candidate Generation,” in Proc. of the IEEE Int’l Conf. on Data Mining, pp. 984-989, 2012.
    [22] Y. Lin, C. Wu, V. S. Tseng, “Mining High Utility Itemsets in Big Data,” in Proc. of the Pacific-Asia Conf. on Knowledge Discovery and Data Mining, pp. 649-664, 2015
    [23] H. Li, Y. Wang, D. Zhang, M. Zhang and E. Y. Chang, “PFP: Parallel FP-Growth for Query Recommendation,” in Proc. Int’l Conf. on Recommender systems, pp. 107-114, 2008
    [24] Y. Li, J. Yeh and C. Chang, “Isolated Items Discarding Strategy for Discovering High Utility Itemsets,” Data & Knowledge Engineering, 64(1):198-217, 2008.
    [25] S. Moens, E. Aksehirli and B. Goethals, “Frequent Itemset Mining for Big Data,” in Proc. of the Int’l Conf. on Big Data, pp. 111-118, 2013
    [26] H. Ryang, U. Yun, “Top-k high utility pattern mining with effective threshold raising strategies,” Knowledge-Based Systems, 76:109-126, 2015
    [27] B. Shie, H. Hsiao and V. S. Tseng, “Efficient Algorithms for Discovering High Utility User Behavior Patterns in Mobile Commerce Environments,” Knowledge and Information System, 37(2):363-387, 2013.
    [28] K. Subramanian, P. Kandhasamy and S. Subramanian, “A Novel Approach to Extract High Utility Itemsets from Distributed Databases,” Computing and Informatics, 31:1597-1615, 2012.
    [29] W. Song, Y. Liu and J. Li, “Mining High Utility Itemsets by Dynamically Pruning the Tree Structure,” Applied Intelligence, 40(1):29-43, 2014.
    [30] W. Song, Y. Liu and J. Li, “Vertical Mining for High Utility Itemsets,” in Proc. of the IEEE Int’l Conf. on Granular Computing, pp. 429-434, 2012.
    [31] A. Savasere, E. Omiecinski and S. Navathe, “An Efficient Algorithm for Mining Association Rules in Large Databases, “ in Proc. of the Int’l Conf. on Very Large Data Bases, pp. 432-444, 1995.
    [32] M. Thilagu and R. Nadarajan, “Efficiently Mining of Effective Web Traversal Patterns With Average Utility,” in Proc. of the Int’l Conf. on Communication, Computing, and Security, pp. 444-451, 2012.
    [33] V. S. Tseng, B. Shie, C. Wu and P. S. Yu, “Efficient Algorithms for Mining High Utility Itemsets from Transactional Databases,” IEEE Transactions on Knowledge and Data Engineering, 25(8):1772-1786, 2013.
    [34] V. S. Tseng, C. Wu, P. Fournier-Viger and P. S. Yu, “Efficient Algorithms for Mining Top-K High Utility Itemsets,” IEEE Transactions on Knowledge and Data Engineering, 28(1):54-67, 2016.
    [35] V. S. Tseng, C. Wu, B. Shie and P. S. Yu, “UP-Growth: An Efficient Algorithm for High Utility Itemset Mining,” in Proc. of the ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, pp. 253-262, 2010.
    [36] B. Vo, H. Nguyen, T. B. Ho and B. Le, “Parallel Method for Mining High utility Itemsets from Vertically Partitioned Distributed Databases,” in Proc. of Int’l Conf. on Knowledge-based and Intelligent Information and Engineering Systems, pp. 251-260, 2009
    [37] C. Wu, B. Shie, V. S. Tseng and P. S. Yu, “Mining Top-k High Utility Itemsets,” in Proc. of the ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, pp. 78-86, 2012.
    [38] H. Yao and H. J. Hamilton, “Mining Itemset Utilities from Transaction Databases,” Data & Knowledge Engineering, 59(3):603-626, 2006.
    [39] H. Yao., H. J. Hamilton, and C. J. Butz, “A Foundational Approach to Mining Itemset Utilities from Databases,” in Proc. of the third SIAM International Conference on Data Mining, pp.482-486, 2004
    [40] J. Yin, Z. Zheng, L. Cao, Y. Song, W. Wei, “Efficiently mining top-k high utility sequential patterns,” in Proc. of the IEEE Int’l Conf. on Data Mining, pp. 1259–1264, 2013
    [41] M. J. Zaki, “Scalable Algorithms for Association Mining,” IEEE Transactions on Knowledge and Data Engineering, 12(3):372-390, 2000.
    [42] S. Zida, P. Fournier-Viger, J.C. Lin, C. Wu, V.S. Tseng, “EFIM: A Highly Efficient Algorithm for High-Utility Itemset Mining,” in Proc. of the 14th Mexican Int’l Conf. on Artificial Intelligence, pp. 530-546, 2015
    [43] Apache Software Foundation. http://www.apache.org/
    [44] Frequent Itemset Mining Dataset Repository. Available at (http://fimi.ua.ac.be/data/)
    [45] Hadoop. http://hadoop.apache.org/
    [46] IBM Quest Data Mining Project, Quest Synthetic Data Generation Code. Available at (https://sourceforge.net/projects/ibmquestdatagen/)
    [47] Microsoft Corporation. Example Database FoodMart2000. Available at (https://technet.microsoft.com/enus/library/aa217032(v=sql.80).aspx)
    [48] NU-MineBench Version 2.0 Dataset and Technical Report. Available at (http://cucis.ece.northwestern.edu/projects/DMS/MineBench.html)
    [49] Spark. http://spark.apache.org/

    下載圖示 校內:2021-07-01公開
    校外:2021-07-01公開
    QR CODE