From 565ae3654c599863fc41752b70986d5ebfcaee57 Mon Sep 17 00:00:00 2001
From: Misty Stanley-Jones
Date: Wed, 13 Apr 2016 12:14:29 -0700
Subject: [PATCH] HBASE-15646 Add some docs about exporting and importing snapshots using S3

---
 src/main/asciidoc/_chapters/configuration.adoc | 31 ++++++++++++++++
 src/main/asciidoc/_chapters/ops_mgt.adoc       | 51 ++++++++++++++++++++++++++
 2 files changed, 82 insertions(+)

diff --git a/src/main/asciidoc/_chapters/configuration.adoc b/src/main/asciidoc/_chapters/configuration.adoc
index 6aefd5a..58645d5 100644
--- a/src/main/asciidoc/_chapters/configuration.adoc
+++ b/src/main/asciidoc/_chapters/configuration.adoc
@@ -1111,6 +1111,37 @@ Only a subset of all configurations can currently be changed in the running serv
 Here is an incomplete list: `hbase.regionserver.thread.compaction.large`, `hbase.regionserver.thread.compaction.small`, `hbase.regionserver.thread.split`, `hbase.regionserver.thread.merge`, as well as compaction policy and configurations and adjustment to offpeak hours.
 For the full list consult the patch attached to link:https://issues.apache.org/jira/browse/HBASE-12147[HBASE-12147 Porting Online Config Change from 89-fb].
 
+[[amazon_s3_configuration]]
+== Using Amazon S3 Storage
+
+HBase is designed to be tightly coupled with HDFS, and testing of other filesystems
+has not been thorough. However, some HBase clusters are successfully using Amazon
+S3 buckets, typically in the context of Amazon Elastic MapReduce (Amazon EMR).
+
+The following limitations have been reported:
+
+- RegionServers should be deployed in Amazon EC2 to mitigate latency and bandwidth
+limitations when accessing the filesystem, and RegionServers must remain available
+to preserve data locality.
+- S3 writes each inbound and outbound file to disk, which adds overhead to each operation.
+- The best performance is achieved when all clients and servers are in the Amazon
+cloud, rather than in a heterogeneous architecture.
+- You must be aware of the location of `hadoop.tmp.dir` so that the local `/tmp/`
+directory is not filled to capacity.
+- HBase has a different file usage pattern than MapReduce jobs and has been optimized for
+HDFS, rather than distant networked storage.
+- You must use the `s3a://` protocol. The `s3n://` and `s3://` protocols have serious
+limitations and do not use official Amazon APIs. The `s3a://` protocol is supported
+in Hadoop 2.6 and higher.
+
+Configuration details for Amazon S3 and associated Amazon services such as EMR are
+out of the scope of the HBase documentation. See the
+link:https://wiki.apache.org/hadoop/AmazonS3[Hadoop Wiki entry on Amazon S3 Storage]
+and
+link:http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hbase.html[Amazon's documentation for deploying HBase in EMR].
+
+One use case that is well-suited for Amazon S3 is storing snapshots. See <<snapshots_s3>>.
+
 ifdef::backend-docbook[]
 [index]
 == Index
diff --git a/src/main/asciidoc/_chapters/ops_mgt.adoc b/src/main/asciidoc/_chapters/ops_mgt.adoc
index 53aee33..17113d7 100644
--- a/src/main/asciidoc/_chapters/ops_mgt.adoc
+++ b/src/main/asciidoc/_chapters/ops_mgt.adoc
@@ -2039,6 +2039,57 @@ The following example limits the above example to 200 MB/sec.
 $ bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://srv2:8082/hbase -mappers 16 -bandwidth 200
 ----
 
+[[snapshots_s3]]
+=== Storing Snapshots in an Amazon S3 Bucket
+
+For general information and limitations of using Amazon S3 storage with HBase, see
+<<amazon_s3_configuration>>. You can also store and retrieve snapshots from Amazon
+S3, using the following procedure.
+
+.Prerequisites
+- You must be using HBase 1.0 or higher and Hadoop 2.6 or higher.
+- You must use the `s3a://` protocol to connect to Amazon S3. The older `s3n://`
+and `s3://` protocols have various limitations and do not use the official Amazon
+APIs.
+- The `s3a://` URI must be configured and available on the server where you run
+the commands to export and restore the snapshot; one possible configuration is
+sketched after this list.
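+
+Exactly how you make the `s3a://` URI available depends on your Hadoop version and
+security setup, and is out of the scope of this documentation. As an illustration
+only, one common approach on Hadoop 2.6 and higher is to supply the S3A credentials
+in the `core-site.xml` read by the command, as sketched below with placeholder
+values. The `hadoop-aws` module and its AWS SDK dependency must also be on the
+classpath, and in production you will likely want to protect these secrets rather
+than storing them in plain text.
+
+----
+<!-- Illustration only: substitute your own credentials or a more secure mechanism. -->
+<property>
+  <name>fs.s3a.access.key</name>
+  <value>YOUR_ACCESS_KEY</value>
+</property>
+<property>
+  <name>fs.s3a.secret.key</name>
+  <value>YOUR_SECRET_KEY</value>
+</property>
+----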
+
+After you have fulfilled the prerequisites, take the snapshot as you normally would.
+Afterward, you can export it using the `org.apache.hadoop.hbase.snapshot.ExportSnapshot`
+command as in the examples below, substituting your own `s3a://` path in the `copy-from`
+or `copy-to` directive and substituting or modifying other options as required.
+
+The following example exports a snapshot from HDFS to Amazon S3:
+
+----
+$ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
+    -snapshot MySnapshot \
+    -copy-from hdfs://srv2:8082/hbase \
+    -copy-to s3a://<bucket>/<namespace>/hbase \
+    -chuser MyUser \
+    -chgroup MyGroup \
+    -chmod 700 \
+    -mappers 16
+----
+
+To copy the snapshot back from Amazon S3 to HDFS, reverse the `copy-from` and
+`copy-to` directives:
+
+----
+$ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
+    -snapshot MySnapshot \
+    -copy-from s3a://<bucket>/<namespace>/hbase \
+    -copy-to hdfs://srv2:8082/hbase \
+    -chuser MyUser \
+    -chgroup MyGroup \
+    -chmod 700 \
+    -mappers 16
+----
+
+You can also use the `org.apache.hadoop.hbase.snapshot.SnapshotInfo` utility with the
+`s3a://` path by including the `-remote-dir` option.
+
+----
+$ hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo \
+    -remote-dir s3a://<bucket>/<namespace>/hbase \
+    -list-snapshots
+----
+
 [[ops.capacity]]
 == Capacity Planning and Region Sizing
-- 
2.6.4 (Apple Git-63)