| 研究生: |
張文琇 Chang, Wen-Hsiu |
|---|---|
| 論文名稱: |
在串流及巨量資料下之機率型資料結構:Bloom Filter和LogLog Counter個案比較與探討 Probabilistic Membership Data Structures for Streaming, Big Data: The Comparative Study of Bloom Filter and LogLog Counter |
| 指導教授: |
蕭宏章
Hsiao, Hung-Chang |
| 學位類別: |
碩士 Master |
| 系所名稱: |
電機資訊學院 - 資訊工程學系 Department of Computer Science and Information Engineering |
| 論文出版年: | 2018 |
| 畢業學年度: | 106 |
| 語文別: | 中文 |
| 論文頁數: | 32 |
| 中文關鍵詞: | bloom filter 、LogLog counter 、機率型資料結構 |
| 外文關鍵詞: | bloom filter, LogLog counter, probabilistic data structure |
| 相關次數: | 點閱:61 下載:1 |
| 分享至: |
| 查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
隨著巨量資料的發展,原本解決一個問題所需要花的時間及記憶體空間成本都會加大。為了解決這樣的問題,資料的處理方式往往仰賴大型計算資源。但這樣的計算方式,需要大量的計算及儲存資源的投入,處理也需要耗費大量的時間。如果並不一定需要精確的答案,而是只要大概的數值即近似解,例如資料庫資料表的行數、網頁的瀏覽數或是影片的觀看次數等。這類的應用,通常希望能在real-time時間內解決且不希望耗費大量的計算及記憶體資源。因此我們可以使用機率型資料結構降低時間成本及空間成本,以獲得即時性的答案。
而本篇論文中,主要是以集合的概念來探討bloom filter及LogLog counter兩種皆是位元向量結構的機率型資料結構。將比較在單一集合下的操作、集合和集合間的運算,及透過資訊熵來做比較。透過比較來得出在上述的面向下,來得出bloom filter及LogLog counter的操作支援度、空間複雜度及時間複雜度,透過實驗來得出在做count運算時,改變記憶體空間、資料數、資料重複比例、集合數等的比較結果。
The thesis compared two common probabilistic data structures, the bloom filter and the LogLog counter which are based on the bit vector. Through the concept of collection, the operation in one set, the operation between the set and the set and the entropy are discussed. The thesis shows the usage and the relative error for operations between two probabilistic data structures.
[1]Welcome to Apache™ Hadoop®! (n.d.). Retrieved from http://hadoop.apache.org/
[2]Pugh, W. (1990). Skip lists: a probabilistic alternative to balanced trees. Communications of the ACM, 33(6), 668-676.
[3]Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 422-426.
[4]Flajolet, P., & Martin, G. N. (1985). Probabilistic counting algorithms for data base applications. Journal of computer and system sciences, 31(2), 182-209.
[5]Durand, M., & Flajolet, P. (2003, September). Loglog counting of large cardinalities. In European Symposium on Algorithms (pp. 605-617). Springer, Berlin, Heidelberg.
[6]Flajolet, P., Fusy, É., Gandouet, O., & Meunier, F. (2007, June). Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. In Discrete Mathematics and Theoretical Computer Science (pp. 137-156). Discrete Mathematics and Theoretical Computer Science.
[7]Heule, S., Nunkesser, M., & Hall, A. (2013, March). HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology (pp. 683-692). ACM.
[8]Xiao, Q., Zhou, Y., & Chen, S. (2017, May). Better with fewer bits: Improving the performance of cardinality estimation of large data streams. In INFOCOM 2017-IEEE Conference on Computer Communications, IEEE (pp. 1-9). IEEE.
[9]Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of ‘small-world’networks. nature, 393(6684), 440.
[10]Cormode, G. (2009). Count-min sketch. In Encyclopedia of Database Systems (pp. 511-516). Springer, Boston, MA.
[11]Cormode, G., & Muthukrishnan, M. (2011). Approximating data with the count-min sketch. IEEE software, (1), 64-69.
[12]Mitzenmacher, M., & Upfal, E. (2005). Probability and Computing: Randomized Algorithms and Probabilistic Analysis: Cambridge University Press.
[13]Fan, L., Cao, P., Almeida, J., & Broder, A. Z. (2000). Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking (TON), 8(3), 281-293.
[14]Bonomi, F., Mitzenmacher, M., Panigrahy, R., Singh, S., & Varghese, G. (2006, September). An improved construction for counting bloom filters. In European Symposium on Algorithms (pp. 684-695). Springer, Berlin, Heidelberg.
[15]Flajolet, P., & Martin, G. N. (1985). Probabilistic counting algorithms for data base applications. Journal of computer and system sciences, 31(2), 182-209.
[16]Kirsch, A., & Mitzenmacher, M. (2006, September). Less hashing, same performance: building a better bloom filter. In European Symposium on Algorithms (pp. 456-467). Springer, Berlin, Heidelberg.
[17]E. (2017, December 17). Echen/streaming-simulations. Retrieved from https://github.com/echen/streaming-simulations
[18]Egert, R., Fischlin, M., Gens, D., Jacob, S., Senker, M., & Tillmanns, J. (2015, June). Privately computing set-union and set-intersection cardinality via bloom filters. In Australasian Conference on Information Security and Privacy (pp. 413-430). Springer, Cham.
[19]Jeffrey, M. C., & Steffan, J. G. (2011, June). Understanding bloom filter intersection for lazy address-set disambiguation. In Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures (pp. 345-354). ACM.
[20]Ertl, O. (2017). New Cardinality Estimation Methods for HyperLogLog Sketches. arXiv preprint arXiv:1706.07290
校內:2023-12-31公開