簡易檢索 / 詳目顯示

研究生: 張文琇
Chang, Wen-Hsiu
論文名稱: 在串流及巨量資料下之機率型資料結構:Bloom Filter和LogLog Counter個案比較與探討
Probabilistic Membership Data Structures for Streaming, Big Data: The Comparative Study of Bloom Filter and LogLog Counter
指導教授: 蕭宏章
Hsiao, Hung-Chang
學位類別: 碩士
Master
系所名稱: 電機資訊學院 - 資訊工程學系
Department of Computer Science and Information Engineering
論文出版年: 2018
畢業學年度: 106
語文別: 中文
論文頁數: 32
中文關鍵詞: bloom filterLogLog counter機率型資料結構
外文關鍵詞: bloom filter, LogLog counter, probabilistic data structure
相關次數: 點閱:61下載:1
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 隨著巨量資料的發展,原本解決一個問題所需要花的時間及記憶體空間成本都會加大。為了解決這樣的問題,資料的處理方式往往仰賴大型計算資源。但這樣的計算方式,需要大量的計算及儲存資源的投入,處理也需要耗費大量的時間。如果並不一定需要精確的答案,而是只要大概的數值即近似解,例如資料庫資料表的行數、網頁的瀏覽數或是影片的觀看次數等。這類的應用,通常希望能在real-time時間內解決且不希望耗費大量的計算及記憶體資源。因此我們可以使用機率型資料結構降低時間成本及空間成本,以獲得即時性的答案。
    而本篇論文中,主要是以集合的概念來探討bloom filter及LogLog counter兩種皆是位元向量結構的機率型資料結構。將比較在單一集合下的操作、集合和集合間的運算,及透過資訊熵來做比較。透過比較來得出在上述的面向下,來得出bloom filter及LogLog counter的操作支援度、空間複雜度及時間複雜度,透過實驗來得出在做count運算時,改變記憶體空間、資料數、資料重複比例、集合數等的比較結果。

    The thesis compared two common probabilistic data structures, the bloom filter and the LogLog counter which are based on the bit vector. Through the concept of collection, the operation in one set, the operation between the set and the set and the entropy are discussed. The thesis shows the usage and the relative error for operations between two probabilistic data structures.

    摘要 i EXTENDED ABSTRACT ii 目錄 v 圖目錄 vii 表目錄 viii 第一章 簡介 1 第二章 研究背景 5 2.1 Bloom Filter 5 2.1.1 演算法描述 5 2.1.2 誤判機率 6 2.1.3 最小化誤判機率 6 2.2 LogLog Counter 7 2.2.1 演算法描述 7 第三章 Bloom Filter和LogLog Counter的比較與探討 10 3.1 單一集合下的Operation 10 3.1.1 二元搜尋樹的operation 11 3.1.2 雜湊表的operation 12 3.1.3 Bloom Filter的operation 12 3.1.4 LogLog counter的operation 14 3.1.5 operation的比較探討 16 3.1.6 透過實驗來進行operation的比較探討 17 3.2集合的運算 21 3.2.1 Bloom filter的集合運算 21 3.2.2 LogLog Counter的集合運算 22 3.2.3 兩者的集合運算比較與探討 22 3.2.4 透過實驗來進行集合間運算的比較與探討 23 3.3 資訊熵 27 第四章 結論 29 參考資料 30

    [1]Welcome to Apache™ Hadoop®! (n.d.). Retrieved from http://hadoop.apache.org/
    [2]Pugh, W. (1990). Skip lists: a probabilistic alternative to balanced trees. Communications of the ACM, 33(6), 668-676.
    [3]Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 422-426.
    [4]Flajolet, P., & Martin, G. N. (1985). Probabilistic counting algorithms for data base applications. Journal of computer and system sciences, 31(2), 182-209.
    [5]Durand, M., & Flajolet, P. (2003, September). Loglog counting of large cardinalities. In European Symposium on Algorithms (pp. 605-617). Springer, Berlin, Heidelberg.
    [6]Flajolet, P., Fusy, É., Gandouet, O., & Meunier, F. (2007, June). Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. In Discrete Mathematics and Theoretical Computer Science (pp. 137-156). Discrete Mathematics and Theoretical Computer Science.
    [7]Heule, S., Nunkesser, M., & Hall, A. (2013, March). HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology (pp. 683-692). ACM.
    [8]Xiao, Q., Zhou, Y., & Chen, S. (2017, May). Better with fewer bits: Improving the performance of cardinality estimation of large data streams. In INFOCOM 2017-IEEE Conference on Computer Communications, IEEE (pp. 1-9). IEEE.
    [9]Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of ‘small-world’networks. nature, 393(6684), 440.
    [10]Cormode, G. (2009). Count-min sketch. In Encyclopedia of Database Systems (pp. 511-516). Springer, Boston, MA.
    [11]Cormode, G., & Muthukrishnan, M. (2011). Approximating data with the count-min sketch. IEEE software, (1), 64-69.
    [12]Mitzenmacher, M., & Upfal, E. (2005). Probability and Computing: Randomized Algorithms and Probabilistic Analysis: Cambridge University Press.
    [13]Fan, L., Cao, P., Almeida, J., & Broder, A. Z. (2000). Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking (TON), 8(3), 281-293.
    [14]Bonomi, F., Mitzenmacher, M., Panigrahy, R., Singh, S., & Varghese, G. (2006, September). An improved construction for counting bloom filters. In European Symposium on Algorithms (pp. 684-695). Springer, Berlin, Heidelberg.
    [15]Flajolet, P., & Martin, G. N. (1985). Probabilistic counting algorithms for data base applications. Journal of computer and system sciences, 31(2), 182-209.
    [16]Kirsch, A., & Mitzenmacher, M. (2006, September). Less hashing, same performance: building a better bloom filter. In European Symposium on Algorithms (pp. 456-467). Springer, Berlin, Heidelberg.
    [17]E. (2017, December 17). Echen/streaming-simulations. Retrieved from https://github.com/echen/streaming-simulations
    [18]Egert, R., Fischlin, M., Gens, D., Jacob, S., Senker, M., & Tillmanns, J. (2015, June). Privately computing set-union and set-intersection cardinality via bloom filters. In Australasian Conference on Information Security and Privacy (pp. 413-430). Springer, Cham.
    [19]Jeffrey, M. C., & Steffan, J. G. (2011, June). Understanding bloom filter intersection for lazy address-set disambiguation. In Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures (pp. 345-354). ACM.
    [20]Ertl, O. (2017). New Cardinality Estimation Methods for HyperLogLog Sketches. arXiv preprint arXiv:1706.07290

    無法下載圖示 校內:2023-12-31公開
    校外:不公開
    電子論文尚未授權公開,紙本請查館藏目錄
    QR CODE