+ HBase includes several methods of loading data into tables. The most
+ straightforward method is to either use the TableOutputFormat class from
+ a MapReduce job, or use the normal client APIs; however, these are not
+ always the most efficient methods.
+
+ This document describes HBase's bulk load functionality. The bulk load
+ feature uses a MapReduce job to output table data in HBase's internal
+ data format, and then directly loads the generated data files into a
+ running cluster.
+
+ The HBase bulk load process consists of two main steps.
+
+ The first step of a bulk load is to generate HBase data files from
+ a MapReduce job using HFileOutputFormat. This output format writes
+ out data in HBase's internal storage format so that it can later be
+ loaded very efficiently into the cluster.
+
+ In order to function efficiently, HFileOutputFormat must be configured
+ such that each output HFile fits within a single region. To do this,
+ jobs use Hadoop's TotalOrderPartitioner class to partition the map
+ output into disjoint ranges of the key space, corresponding to the
+ key ranges of the regions in the table.
+
+ HFileOutputFormat includes a convenience function, configureIncrementalLoad(),
+ which automatically sets up a TotalOrderPartitioner based on the current
+ region boundaries of a table.
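+ The preparation step described above can be sketched as a small MapReduce
+ driver. This is a minimal, illustrative sketch only: the table name, input
+ path, column family "d", qualifier "c1", and the mapper's tab-separated
+ input format are all hypothetical, and the job must be run with the HBase
+ and Hadoop jars on the classpath.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadPrepare {

  // Hypothetical mapper: parses "rowkey<TAB>value" lines and emits a Put
  // for each record, keyed by the row key as required by HFileOutputFormat.
  static class MyMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      byte[] row = Bytes.toBytes(fields[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("d"), Bytes.toBytes("c1"),
              Bytes.toBytes(fields[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "bulk-load-prepare");
    job.setJarByClass(BulkLoadPrepare.class);
    job.setMapperClass(MyMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Reads the table's current region boundaries and configures the
    // TotalOrderPartitioner, reducer, and output format accordingly.
    HTable table = new HTable(conf, "mytable");
    HFileOutputFormat.configureIncrementalLoad(job, table);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```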
+
+ After the data has been prepared using HFileOutputFormat, it
+ is loaded into the cluster using a command line tool. This command line tool
+ iterates through the prepared data files, and for each one determines the
+ region the file belongs to. It then contacts the appropriate Region Server
+ which adopts the HFile, moving it into its storage directory and making
+ the data available to clients.
+
+ If the region boundaries have changed during the course of bulk load
+ preparation, or between the preparation and completion steps, the bulk
+ load command-line utility will automatically split the data files into
+ pieces corresponding to the new boundaries. This process is not
+ optimally efficient, so users should take care to minimize the delay
+ between preparing a bulk load and importing it into the cluster,
+ especially if other clients are simultaneously loading data through
+ other means.
+ importtsv tool
+ HBase ships with a command line tool called importtsv. This tool
+ is available by running hadoop jar /path/to/hbase-VERSION.jar importtsv.
+ Running this tool with no arguments prints brief usage information:
+
+Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
+
+Imports the given input directory of TSV data into the specified table.
+
+The column names of the TSV data must be specified using the -Dimporttsv.columns
+option. This option takes the form of comma-separated column names, where each
+column name is either a simple column family, or a columnfamily:qualifier. The special
+column name HBASE_ROW_KEY is used to designate that this column should be used
+as the row key for each imported record. You must specify exactly one column
+to be the row key.
+
+In order to prepare data for a bulk data load, pass the option:
+ -Dimporttsv.bulk.output=/path/for/output
+
+Other options that may be specified with -D include:
+ -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
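+ To illustrate how the -Dimporttsv.columns specification pairs up with the
+ fields of an input line, here is a small standalone sketch. The column
+ names "d:c1" and "d:c2" and the sample data are hypothetical; importtsv
+ performs this mapping internally.

```java
public class ColumnSpecDemo {
  public static void main(String[] args) {
    // Column spec as it would be passed via -Dimporttsv.columns
    String[] columns = "HBASE_ROW_KEY,d:c1,d:c2".split(",");
    // One tab-separated line of input data
    String[] fields = "row1\tvalue1\tvalue2".split("\t", -1);

    // The i-th column name describes the i-th field of each line;
    // HBASE_ROW_KEY marks the field used as the row key.
    for (int i = 0; i < columns.length; i++) {
      if (columns[i].equals("HBASE_ROW_KEY")) {
        System.out.println("row key = " + fields[i]);
      } else {
        System.out.println(columns[i] + " = " + fields[i]);
      }
    }
  }
}
```

+ Running this prints "row key = row1", then "d:c1 = value1" and
+ "d:c2 = value2", showing that exactly one column supplies the row key
+ while the rest name the cells each field is stored into.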
+
+ completebulkload tool
+ After a data import has been prepared using the importtsv tool, the
+ completebulkload tool is used to import the data into the running cluster.
+
+ The completebulkload tool simply takes the same output path where
+ importtsv put its results, and the table name. For example:
+
$ hadoop jar hbase-VERSION.jar completebulkload /user/todd/myoutput mytable
+ This tool will run quickly, after which point the new data will be
+ visible in the cluster.
+
+ Although the importtsv tool is useful in many cases, advanced users may
+ want to generate data programmatically, or import data from other formats.
+ To get started doing so, dig into ImportTsv.java and check the JavaDoc for
+ HFileOutputFormat.
+
+ The import step of the bulk load can also be done programmatically. See the
+ LoadIncrementalHFiles class for more information.
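+ As a rough sketch of the programmatic import, the snippet below invokes
+ the same completion step as the completebulkload tool. The output path and
+ table name are hypothetical, and this must run with HBase on the classpath
+ against a live cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class ProgrammaticBulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    // Directory previously produced by HFileOutputFormat
    // (e.g. the path given to -Dimporttsv.bulk.output).
    loader.doBulkLoad(new Path("/user/todd/myoutput"),
                      new HTable(conf, "mytable"));
  }
}
```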
+