
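Graduate student: Liao, Chi-Tsun (廖啟村)
Thesis title: A Framework of Distributed Snapshots for Hadoop HBase
Advisor: Hsiao, Hung-Chang (蕭宏章)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2013
Academic year of graduation: 101 (ROC calendar)
Language: English
Number of pages: 32
Keywords (Chinese): HBase distributed snapshots
Keywords (English): HBase, distributed snapshots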
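  • Although Apache HBase™ is already an excellent distributed big data store, its support for restoring system state remains inflexible; for example, HBase cannot be told to recover to its state at a specified moment in the past. This thesis focuses on implementing a more flexible recovery mechanism in HBase, in four stages. First, HFiles and logs of update commands are recorded. Second, vector clocks are used to identify consistent distributed snapshots. Third, a bulk load process reads the HFiles to rebuild HBase. Fourth, the logged update commands that fall between the time of the backed-up HFiles and the time to which the table contents are actually restored are replayed. Finally, we use an application to demonstrate that the modified HBase can be restored to any distributed snapshot.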
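The second stage can be illustrated with a minimal Python sketch of the standard vector-clock test for a consistent cut. It is not the thesis implementation: the function names and the region-server identifiers (`rs1`, `rs2`) are hypothetical, and each server is assumed to contribute the vector clock of its last event included in the candidate snapshot.

```python
from typing import Dict

# A vector clock maps a node name to a logical counter (assumed representation).
VectorClock = Dict[str, int]


def tick(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of `clock` with `node`'s own counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Component-wise maximum, applied when a node receives a message."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def is_consistent_cut(cut: Dict[str, VectorClock]) -> bool:
    """A cut is consistent when no node has seen more of node i's
    history than node i itself has recorded in the cut."""
    for i, clock_i in cut.items():
        for clock_j in cut.values():
            if clock_j.get(i, 0) > clock_i.get(i, 0):
                return False
    return True


# rs1 sends a message; rs2 receives it and merges the sender's clock.
sent = tick({}, "rs1")                      # {"rs1": 1}
received = tick(merge({}, sent), "rs2")     # {"rs1": 1, "rs2": 1}

# Including the receive but not the send is inconsistent.
bad_cut = {"rs1": {}, "rs2": received}
# Including both the send and the receive is consistent.
good_cut = {"rs1": sent, "rs2": received}
```

A cut that contains a send without its matching receive (a message still in flight) also passes this test, which matches the usual definition of a consistent global state.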

    Apache Hadoop HBase™ is an emerging distributed, persistent key-value data store that can accommodate large volumes of data arriving rapidly from a variety of sources. Although the data objects stored in HBase are valuable, HBase cannot perform parallel recovery; that is, it cannot consistently recover historical data objects that are stored concurrently across multiple storage servers. This study presents a framework for implementing a data recovery scheme in HBase. The framework consists of four components: (1) distributed snapshots represented by event logs gathered from internal (system) and external (client) operations, (2) a global time labeling scheme for correlated events, (3) a bulk load process for bootstrapping an HBase cluster from a given snapshot, and (4) a forward replaying mechanism that advances the system precisely to any specified time instant in the past. We enhance HBase so that it can perform parallel recovery, and we report performance results for our prototype implementation. In addition, we demonstrate an application, built on the prototype, that tracks the locations of multiple clients.
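Components (3) and (4) can be pictured with a small Python sketch: load the latest snapshot at or before the requested time, then replay the logged events up to that time. It is only an illustration under stated assumptions, not the prototype. The `Snapshot` and `Event` types and the `restore` function are hypothetical, the global time label is assumed to be a totally ordered integer, and an in-memory dict stands in for an HBase table.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    label: int                   # global time label (assumed to be an integer)
    op: str                      # "put" or "delete"
    row: str
    value: Optional[str] = None


@dataclass(frozen=True)
class Snapshot:
    label: int                   # time label at which the snapshot was taken
    rows: Dict[str, str]         # table contents captured by the snapshot


def restore(snapshots: List[Snapshot], log: List[Event], target: int) -> Dict[str, str]:
    """Rebuild the table as it was at time label `target`."""
    candidates = [s for s in snapshots if s.label <= target]
    if not candidates:
        raise ValueError("no snapshot at or before the target time")
    base = max(candidates, key=lambda s: s.label)

    # Copying the snapshot rows stands in for the bulk load step.
    table = dict(base.rows)

    # Forward replay: apply events after the snapshot, up to the target.
    for ev in sorted(log, key=lambda e: e.label):
        if base.label < ev.label <= target:
            if ev.op == "put":
                table[ev.row] = ev.value
            elif ev.op == "delete":
                table.pop(ev.row, None)
    return table


snapshots = [Snapshot(0, {}), Snapshot(10, {"a": "1"})]
log = [
    Event(5, "put", "a", "1"),
    Event(12, "put", "b", "2"),
    Event(15, "delete", "a"),
    Event(20, "put", "c", "3"),
]
state_at_16 = restore(snapshots, log, 16)
```

Starting from the nearest earlier snapshot keeps the replayed portion of the log short, which is the usual reason for pairing snapshots with forward replay.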

    ABSTRACT (IN CHINESE)
    ABSTRACT
    ACKNOWLEDGEMENTS
    TABLE OF CONTENTS
    LIST OF TABLES
    LIST OF FIGURES
    CHAPTER 1 INTRODUCTION
      1.1 Solutions in State-of-the-Art Products for Recovery
      1.2 Research Issues
      1.3 Our Proposal and Contributions
      1.4 Roadmap
    CHAPTER 2 RELATED WORK
      2.1 Apache Hadoop
      2.2 Apache HBase
      2.3 Lamport Timestamps
      2.4 Vector Clock
    CHAPTER 3 OUR PROPOSED FRAMEWORK
      3.1 State Gathering
        3.1.1 HFile replication
        3.1.2 Event log
      3.2 Distributed Snapshots
        3.2.1 Moving region
        3.2.2 Implementation of the mark vector clock
        3.2.3 Consistent distributed snapshots
      3.3 Bulk Load and Log Replay
        3.3.1 Bulk load process
        3.3.2 Log replay process
    CHAPTER 4 EVALUATION
      4.1 System Deployment
      4.2 Experiment
    CHAPTER 5 APPLICATION OF THE CONSISTENT SNAPSHOTS
    CHAPTER 6 SUMMARY
    REFERENCES


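    Electronic full text: on campus, available from 2023-12-31
    off campus, not available
    The electronic thesis has not yet been authorized for public release; for the print copy, please consult the library catalog.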