成功大學博碩士論文系統

簡易檢索 / 詳目顯示

回結果列表

研究生：	林韋体 Lin, Wei-Tee
論文名稱：	分散式環境下快速頻繁樣式探勘演算法之研究 A Study on Fast Mining of Frequent Patterns in Distributed Computing Environments
指導教授：	朱治平 Chu, Chih-Ping
學位類別：	博士 Doctor
系所名稱：	電機資訊學院 - 資訊工程學系 Department of Computer Science and Information Engineering
論文出版年：	2015
畢業學年度：	103
語文別：	英文
論文頁數：	74
中文關鍵詞：	資料探勘、大數據、多工作計算、頻繁樣式探勘、分散式資料探勘
外文關鍵詞：	Data Mining, Big Data, Many-Task Computing, Frequent Patterns Mining, Distributed Data Mining
相關次數：	點閱：221 下載：1
分享至:	分享至facebook 分享至twitter

查詢本校圖書館目錄查詢臺灣博碩士論文知識加值系統勘誤回報

近年來，許多資料探勘的研究著眼於如何有效率的從大型資料庫中得到頻繁樣式。這些演算法主要可以分為兩類，一為基於Apriori之演算法，另一為以FP-growth為底之演算法。Apriori類的演算法，其探勘方式，乃為持續地產生候選樣式集合，接著檢驗每一個候選樣式是否為頻繁樣式。因此，一旦候選樣式多時，此類方法之效能隨即低落。因此，近年來之研究方法已朝向於以FP-growth之方式來做頻繁樣式的探勘。隨著資料量的大幅成長，舊式演算法的執行速度以及可擴展性皆面臨了挑戰。大數據(big data)通常包含了大量相異的項目，大量的交易數目，較長的交易長度，因此，當以FP-growth方式做探勘時，將會產生複雜的FP-tree。然而，FP-tree的複雜度，除了來自資料的本身特性外，也對於使用者所給定的最小支持度相當敏感。因為小的支持度會造成每一個子節點的平均分支數增加，所以在FP-tree所含之節點數亦增加，同時在使用FP-growth時，也需要探勘更複雜的條件(conditional)FP-tree。除此之外，為了增加執行速度，過去的研究大多著眼於利用分散式計算技術以設計高效率的演算法，但卻極少論文討論到如何決定適當的計算節點數目來做探勘。使用過多的計算節點時，由於現有分散式探勘演算法之特性，需要大量的資料交換，導致探勘速度會變慢。而若使用太少的計算節點，負載平衡的效能就不好。因此，在此論文中，我們將先提出一個可分散式探勘之高效率演算法與架構於多工作計算環境之中。接著，並將提出兩個演算法，其能夠在此環境內，自動決定適當數量的計算節點。經由一系列的實驗，我們證明了所提出的演算法能夠有較快的執行速度。

Many studies have tried to efficiently discover frequent patterns in large databases. The algorithms used in these studies fall into two main categories: Apriori algorithms and frequent pattern growth (FP-growth) algorithms. Apriori algorithms operate according to a generate-and-test approach, so performance suffers from the testing of too many candidate itemsets [1]. Therefore, most recent studies have applied an FP-growth approach to the discovery of frequent patterns. The rapid growth of data, however, has introduced new challenges for the mining of frequent patterns, in terms of both execution efficiency and scalability. Big data often contains a large number of items, a large number of transactions and long average transaction length, which result in large FP-trees. In addition to its dependence on data characteristics, FP-tree size is also sensitive to the minimum support threshold. This is because the small support is probable to bring many branches for nodes, greatly enlarging the FP-tree and the number of reconstructed conditional pattern-based trees. Moreover, most of the past studies focused on designing efficient mining algorithm, and none of them has explored how the appropriate number of computing nodes is determined. Using too many computing nodes will increase the execution time because the existing algorithms need to transmit the sub-databases or FP-tress over the network; using insufficient computing nodes may not effectively distribute the mining loading. In this thesis, we propose a novel algorithm and architecture for efficiently mining frequent patterns from big data in distributed many-task computing environments. We also propose two algorithms for efficiently mining frequent patterns with the ability to use the appropriate number of computing nodes in distributed computing environments. Through empirical evaluation of various simulation conditions, we show that the proposed methods deliver excellent execution time.

摘　要			I
ABSTRACT　　	III
致  謝			V
LIST OF FIGURES	VIII
LIST OF TABLES	X
CHAPTER 1. INTRODUCTION	1
CHAPTER 2. RELATED WORK	7
2.1	Frequent pattern mining and association rules	7
2.2 FP-tree and FP-growth algorithm	10
2.3 Parallel and distributed algorithms for frequent pattern mining	13
2.4	CARM algorithm and architecture	18
2.5 PSWS algorithm	20
CHAPTER 3. PROPOSED METHOD: DP-MINING	24
3.1 Proposed architecture	24
3.2 The details of DP-Mining	27
3.3 Experimental Results	30
3.3.1 Impact on execution time by varying the number of items for DP-Mining and PSWS	32
3.3.2 Impact on execution time by varying the average transaction length for the proposed method and PSWS	37
3.3.3 Impact on execution time by varying the number of transactions for the proposed method and PSWS	38
3.3.4 Impact on execution time by varying the average length of frequent patterns for the proposed method and PSWS	40
3.4 Conclusions	42
CHAPTER 4. PROPOSED METHOD: DAN-MINING	43
4.1 The details of DAN-Mining	43
4.2 Experimental Results	47
4.2.1 Impact on execution time by varying the number of computing nodes and sampling ratio for DAN-Mining	49
4.2.2 Impact on execution time by varying the support threshold for DAN-Mining and PSWS	51
4.2.3 Impact on execution time by varying the number of transactions for DAN-Mining and PSWS	53
4.2.4 Impact on execution time by varying the average transaction length for DAN-Mining and PSWS	54
4.3 Conclusions	56
CHAPTER 5. PROPOSED METHOD: FLB-MINING	58
5.1 The details of FLB-Mining	58
5.2 Experimental Results	61
5.2.1 Impact on execution time by varying the support threshold for FLB-Mining and DAN-Mining	62
5.2.2 Impact on execution time by varying the number of transactions for FLB-Mining and DAN-Mining	63
5.2.3 Impact on execution time by varying the average transaction length for FLB-Mining and DAN-Mining	63
5.3 Conclusions	65
CHAPTER 6. CONCLUSIONS	66
REFERENCES		67
 
LIST OF FIGURES
FIGURE 1. AN EXAMPLE OF FREQUENT PATTERN MINING USING APRIORI ALGORITHM.	9
FIGURE 2. AN EXAMPLE OF DISCOVERED ASSOCIATION RULES.	9
FIGURE 3. AN EXAMPLE OF FP-TREE AND FP-GROWTH.	12
FIGURE 4. THE GENERAL ARCHITECTURE OF EXISTING APPROACHES FOR MINING FREQUENT PATTERNS IN PARALLEL AND DISTRIBUTED ENVIRONMENTS.	14
FIGURE 5. AN EXAMPLE OF USING TPFP-TREE TO MINE FREQUENT PATTERNS IN DISTRIBUTED COMPUTING ENVIRONMENTS	15
FIGURE 6. THE EXCHANGING STAGES OF THE EXAMPLE SHOWN IN FIGURE 5	17
FIGURE 7. THE ARCHITECTURE OF CARM.	19
FIGURE 8. THE MECHANISMS OF THE FOUR ALGORITHMS.	21
FIGURE 9. THE FOUR ALGORITHMS PROPOSED IN [28].	22
FIGURE 10. PROPOSED ARCHITECTURE.	25
FIGURE 11. THE DP-MINING ALGORITHM.	28
FIGURE 12. THE IMPACT ON EXECUTION TIME FOR THE PROPOSED METHOD AND PSWS WITH THE NUMBER OF ITEMS VARIED FROM 150K TO 250K, AVERAGE NUMBER OF TRANSACTIONS = 100K.	33
FIGURE 13. THE IMPACT ON EXECUTION TIME FOR THE PROPOSED METHOD AND PSWS WITH THE NUMBER OF ITEMS VARIED FROM 150K TO 250K, AVERAGE NUMBER OF TRANSACTIONS = 200K.	35
FIGURE 14. IMPACT ON EXECUTION TIME FOR THE PROPOSED METHOD AND PSWS WITH THE SUPPORT THRESHOLD VARIED DECREASINGLY FROM 0.05% TO 0.01%.	36
FIGURE 15. IMPACT ON EXECUTION TIME FOR THE PROPOSED METHOD AND PSWS WITH THE AVERAGE TRANSACTION LENGTH VARIED FROM 10 TO 50.	39
FIGURE 16. IMPACT ON EXECUTION TIME FOR THE PROPOSED METHOD AND PSWS WITH THE NUMBER OF TRANSACTIONS VARIED FROM 250K TO 450K.	40
FIGURE 17. IMPACT ON EXECUTION TIME FOR THE PROPOSED METHOD AND PSWS WITH THE AVERAGE LENGTH OF FREQUENT PATTERNS VARIED FROM 5 TO 9.	41
FIGURE 18. THE DAN-MINING ALGORITHM.	47
FIGURE 19. IMPACT ON EXECUTION TIME FOR THE PROPOSED METHOD AND PSWS WITH THE NUMBER OF COMPUTING NODES VARIED FROM 5 TO 20 AND THE SAMPLING RATIO VARIED FROM 0.1% TO 3.2%.	50
FIGURE 20. IMPACT ON EXECUTION TIME FOR THE PROPOSED METHOD AND PSWS WITH THE SUPPORT THRESHOLD VARIED FROM 0.1% TO 0.03%.	52
FIGURE 21. IMPACT ON EXECUTION TIME FOR THE PROPOSED METHOD AND PSWS WITH THE NUMBER OF TRANSACTIONS VARIED FROM 50K TO 250K.	54
FIGURE 22. IMPACT ON EXECUTION TIME FOR THE PROPOSED METHOD AND PSWS WITH THE AVERAGE TRANSACTION LENGTH VARIED FROM 50K TO 250K.	56
FIGURE 23. THE FLB-MINING ALGORITHM.	61
FIGURE 24. IMPACT ON EXECUTION TIME FOR THE PROPOSED METHOD AND PSWS WITH THE NUMBER OF COMPUTING NODES VARIED FROM 5 TO 20 AND THE SAMPLING RATIO VARIED FROM 0.1% TO 3.2%.	62
FIGURE 25. IMPACT ON EXECUTION TIME FOR THE PROPOSED METHOD AND PSWS WITH THE NUMBER OF COMPUTING NODES VARIED FROM 5 TO 20 AND THE SAMPLING RATIO VARIED FROM 0.1% TO 3.2%.	64
FIGURE 26. IMPACT ON EXECUTION TIME FOR THE PROPOSED METHOD AND PSWS WITH THE NUMBER OF COMPUTING NODES VARIED FROM 5 TO 20 AND THE SAMPLING RATIO VARIED FROM 0.1% TO 3.2%.	64
 
LIST OF TABLES
TABLE  1. COMPARISON WITH PREVIOUS METHODS.	30
TABLE 2. THE PROPORTIONS OF THE EXECUTION TIME OF DP-MINING10% TO THAT OF PSWS WITH NUMBER OF ITEMS VARIED FROM 150K TO 250K.	33
TABLE 3. THE PROPORTIONS OF THE EXECUTION TIME OF DP-MINING10% TO THAT OF PSWS WITH NUMBER OF ITEMS VARIED FROM 150K TO 250K.	35
TABLE 4. THE PROPORTIONS OF THE EXECUTION TIME OF DP-MINING10% TO THAT OF PSWS WITH THE SUPPORT THRESHOLD VARIED DECREASINGLY FROM 0.05% TO 0.01%.	36
TABLE 5. THE PROPORTIONS OF THE EXECUTION TIME OF DP-MINING10% TO THAT OF PSWS WITH THE NUMBER OF TRANSACTIONS VARIED FROM 250K TO 450K.	40
TABLE 6. THE PROPORTIONS OF THE EXECUTION TIME OF DP-MINING10% TO THAT OF PSWS WITH THE AVERAGE LENGTH OF FREQUENT PATTERNS VARIED FROM 5 TO 9.	41
TABLE  7. COMPARISON WITH PREVIOUS METHODS.	47

                                    

1. Abaya, S. A., ‘Association Rule Mining based on Apriori Algorithm in Minimizing Candidate Generation‘, In:International Journal of Scientific & Engineering Research Volume 3, Issue 7, 2012.
2. Adnan, M. and Alhajj, R., ‘DRFP-tree: disk-resident frequent pattern tree’, Applied Intelligence, April, Vol. 30, Issue 2, pp. 84-97, 2009.
3. Aflori, C. and Craus, M., ‘Grid implementation of the Apriori algorithm’, Advances in Engineering Software, Vol. 38, Issue 5, pp. 295-300, 2007.
4. Agrawal, R. and Srikant, R. ‘Quest synthetic data generator’, IBM Almaden Research Center, San Jose, California, http://www.almaden.ibm.com/cs/quest/syndata.html.
5. Agrawal, R. and Srikant, R., ‘Fast Algorithms for Mining Association Rules’, Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487-499, 1994.
6. Agrawal, R. and Shafer, J. C., ‘Parallel Mining of Association Rules’, IEEE Transactions on Knowledge and Data Engineering, pp. 952-969, 1996.
7. Agarwal R., Aggarwal, C. and V. V. V. Prasad, ‘A tree projection algorithm for generation of frequent itemsets’, Journal of Parallel and Distributed Computing, Vol. 61, No 3, pp. 350-371, 2001.
8. Cao, Y., Patnaik, D., Ponce, S., Archuleta, J., Butler, P., Feng, W.C. and Ramakrishnan, N., ’Parallel Mining of Neuronal Spike Streams on Graphics Processing Units’, International Journal of Parallel Programming , pp. 1-28, 2011.
9. Chen, R. and Wang, H., ’Research on parallel mining algorithm of sequential pattern based on multi-processor scheduling’, World Automation Congress (WAC), pp.24-28, 2012.
10. Congiusta, A., Talia, D. and P. Trunfio, ‘Service-oriented middleware for distributed data mining on the grid’, Journal of Parallel and Distributed Computing, Vol. 68, Issue 1, pp. 3-15, 2008.
11. Dhanda, M., ’An Approach To Extract Efficient Frequent Patterns From Transactional Database’,In: International Journal of Engineering Science and Technology (IJEST), Vol.3 No.7, ISSN:0975-5462, 2011.
12. Dhanda, M., Guglani, S. and Gupta, G., ’Mining Efficient Association Rules Through Apriori Algorithm Using Attributes’, In: International Journal of Computer Science and Technology Vol 2,Issue 3, ISSN:0976-8491, 2011.
13. Do, T. D. T., Laurent, A. and Termier, A., ’Efficient Parallel Mining of Closed Frequent Gradual Itemsets’, IEEE International Conference on Data Mining, pp. 138-147, 2010.
14. Glimcher, L., Jin, R. and Agrawal, G., ‘Middleware for data mining applications on clusters and grids’, Journal of Parallel and Distributed Computing, Vol. 68, Issue 1, pp. 37-53, 2008.
15. Grahne G. and Zhu, J., ‘Mining Frequent Itemsets from Secondary Memory’, Proceedings of 2004 International Conference on Data Mining, November, pp. 91-98, 2004.
16. Han, E.H.S., Karypis, G. and Kumar, V., ‘Scalable parallel data mining for association rules’, IEEE Transactions on Knowledge and Data Engineering, Vol. 12, No 3, pp. 352 -377, 2000.
17. Han, J., Pei, J., Yin, Y. and Mao, R., ‘Mining frequent patterns without candidate generation: a frequent-pattern tree approach’, Journal of Data Mining and Knowledge Discovery, Vol. 8, No. 1, pp. 53-87, 2004.
18. Han, Y., Brezany, P. and Janciak, I., ‘Cloud-enabled scalable decision tree construction’, Proceedings of Fifth International Conference on Semantics, Knowledge and Grid, pp. 128 – 135, 2009.
19. Holt, J. D. and Chung, S. M., ‘Parallel mining of association rules from text databases on a cluster of workstations’, Proceedings of 18th International Symposium on Parallel and Distributed Processing, 2004.
20. Iko, P. and Kitsuregawa, M., ‘Shared nothing parallel execution of FP-growth’, DBSJ Letters, Vol. 2, No.1, pp. 43-46, 2003.
21. Itkar, S. A and Kulkarni, U. V, ’Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Framework’, Proc. of Int. Conf. on Advances in Computer Science, 2013.
22. Javed, A. and Khokhar A., ‘Frequent pattern mining on message passing multiprocessor systems’, Distributed and Parallel Database, Vol. 16, Issue 3, pp. 321-334, 2004.
23. Lai, Y. and ZhongZhi, S., ‘An efficient data mining framework on Hadoop using java persistence API’, Proceedings of the IEEE 10th International Conference on Computer and Information Technology, pp. 203-209, 2010.
24. Lai, Y. and ZhongZhi, S., Xu, L.D., Fan, L. and Kirsh I., ‘DH-TRIE Frequent Pattern Mining on Hadoop using JPA’, Proceedings of International Conference on Granular Computing, November, pp. 875-878, 2011.
25. Li, L. and Zhang M., ‘The strategy of mining association rule based on cloud computing’, Proceedings of 2011 Business Computing and Global Informatization, pp. 475-478, 2011.
26. Lin, K. W., Chen, P.L. and Chang, W.L., ‘A Novel Frequent Pattern Mining Algorithm for Very Large Databases in Cloud Computing Environments’, Proceedings of the 2011 IEEE International Conference on Granular Computing, pp. 399-403, 2011.
27. Lin, K. W., Deng, D.J., ‘A novel parallel algorithm for frequent pattern mining with privacy preserved in cloud computing environments’, International Journal of Ad Hoc and Ubiquitous Computing, Vol. 6, No.4, pp. 205-215, 2010.
28. Lin, K. W. and Luo, Y.C., ‘Efficient Algorithms for Frequent Pattern Mining in Many-Task Computing Environments’, Knowledge-Based Systems, Vol. 49, pp. 10-21, 2013.
29. Natarajan, S. and Sehar, S., ’Distributed FP-ARMH algorithm in Hadoop map reduce framework’, Green Computing, Communication and Conservation of Energy (ICGCE), pp. 264-270, 2013.
30. Nguyena, D., Vob, B. and Lec, B., ’Efficient strategies for parallel mining class association rules’, Expert Systems with Applications, 2014.
31. Papadimitriou, S. and Sun, J., ‘DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining’, Proceedings of the 8th IEEE International Conference on Data Mining, pp. 512-521, 2008.
32. Pérez, M. S., Sánchez, A., Robles, V., Herrero, P. and Peña J. M., ‘Design and implementation of a data mining grid-aware architecture’, Future Generation Computer Systems, Vol. 23, Issue 1, pp. 42-47, 2007.
33. Qiu, Y., Lan, Y. J. and Xie, Q.S., ‘An improved algorithm of mining from FP- tree’, Proceedings of the Third International Conference on Machine Learning and Cybernetics, pp. 26-29, 2004.
34. Raicu, I., Foster, I.T. and Zhao, Y., ‘Many-task computing for grids and supercomputers’, Proceedings of 2008 IEEE Workshop on Many-Task Computing on Grids and Supercomputers, pp. 1-11, 2008.
35. Riondato, M., DeBrabant, J. A, Fonseca, R. and Upfal, E., ’a parallel randomized algorithm for approximate association rules mining in MapReduce’, CIKM '12 Proceedings of the 21st ACM international conference on Information and knowledge management, pp. 85-94, 2012.
36. Schlegel, B., Gemulla, R. and Lehner, W., ’Memory-Efficient Frequent-Itemset Mining’, in EDBT/ICDT ‘11 Proceedings of the 14th International Conference on Extending Database Technology, pp. 461-472, 2011.
37. Sharma, K., Shrivastava, G. and Kumar, V., ‘Web mining: today and tomorrow’, Proceedings of the Third International Conference on Electronics Computer Technology, pp. 399-403, 2011.
38. White, B., Yeh, T., Lin, J. and Davis, L., ‘Web-scale computer vision using MapReduce for multimedia data mining’, Proceedings of the Tenth International Workshop on Multimedia Data Mining, 2010.
39. Wu, C.H., Lai, C.C. Lai and Lo, Y.C., ‘An empirical study on mining sequential patterns in a grid computing environment’, Expert Systems with Applications, Vol. 39, Issue 5, pp. 5748-5757, 2012.
40. Wu, X., Zhu, X., Wu, G. Q. and Ding, W., ’Data Mining with Big Data’, TKDE.
41. Xia, T. (2008) ‘Large-Scale SMS Messages Mining Based on Map-Reduce’, Proceedings of the International Symposium on Computational Intelligence and Design, pp. 7-12, 2014.
42. Yang, X. Y., Liu, Z. and Fu, Y., ’MapReduce as a programming model for association rules algorithm on Hadoop’, Information Sciences and Interaction Sciences (ICIS), pp. 99-102, 2010.
43. Yang, X. Y., Liu, Z. and Fu Y., ‘MapReduce as a programming model for association rules algorithm on Hadoop’, Information Sciences and Interaction Sciences, pp. 99-102, 2010.
44. Yen, S. J., Lee, Y. S., Wang, C. K., Wu, J. W. and Ouyang, L. Y., ‘The Studies of Mining Frequent Patterns Based on Frequent Pattern Tree‘, Advances in Knowledge Discovery and Data Mining, Lecture Notes in Computer Science, vol.5476, pp. 232-241, 2009.
45. Yong, W., Zhe, Z. and Fang, W., ‘A parallel algorithm of association rules based on cloud computing‘, Communications and Networking in China (CHINACOM), pp. 14-16, 2013.
46. Xiaoting, W., Yunlong, M., Feng, Z., Min, L. and Weiming, S., ‘Incremental FP-Growth mining strategy for dynamic threshold value and database based on MapReduce‘, Computer Supported Cooperative Work in Design (CSCWD), pp. 21-23, 2014.
47. Yu, K.M. and Zhou, J., ‘Parallel TID-based frequent pattern mining algorithm on a PC cluster and grid computing system’, Expert Systems with Applications, Vol. 37, Issue 3, pp. 2486-2494, 2010.
48. Zaki, M. J., ‘Scalable Algorithms for Association Mining’, IEEE Transactions on Knowledge and data engineering, Vol.12, No3, pp. 372-390, 2000.
49. Changsheng, Z., Zhongyue, L. and Dongsong, Z., ‘An Improved Algorithm for Apriori‘, In: IEEE,First International Workshop on Education Technology and Computer Science, 2009.
50. Zhou, J. and Yu, K.M., ‘Tidset-based parallel FP-tree algorithm for the frequent pattern mining problem on PC clusters’, Lecture Notes in Computer Science 5036, pp. 18-28, 2008.
51. Zhou, J. and Yu, K.M., ‘Balanced Tidset-based parallel FP-tree algorithm for the frequent pattern mining on grid system’, Proceedings of the Fourth International Conference on Semantics, Knowledge and Grid, pp. 103-108, 2008.

2020-01-30公開

簡易檢索 / 詳目顯示

相關論文