HBase includes several methods of loading data into tables. The most straightforward method is to either use the TableOutputFormat class from a MapReduce job, or to use the normal client APIs; however, these are not always the most efficient methods.

The Driver class that is executed by the HBase jar can be used to invoke frequently accessed utilities such as completebulkload. The completebulkload utility, described below, moves generated StoreFiles into an HBase table and is often used in conjunction with output from importtsv.
This document describes HBase's bulk load functionality. The bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then directly loads the data files into a running cluster. Using bulk load will use less CPU and network resources than simply using the HBase API.
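For contrast, the straightforward client-API route looks like the following minimal sketch; the table, row, and column names here are illustrative only, not part of any shipped example:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class PutExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "mytable");     // illustrative table name
      Put put = new Put(Bytes.toBytes("row1"));
      // Each put travels through the normal write path (WAL, MemStore, flush),
      // which is exactly the overhead that bulk load avoids.
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
      table.put(put);
      table.close();
    }
  }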
The HBase bulk load process consists of two main steps.
The first step of a bulk load is to generate HBase data files from a MapReduce job using HFileOutputFormat. This output format writes out data in HBase's internal storage format so that the files can later be loaded very efficiently into the cluster.
In order to function efficiently, HFileOutputFormat must be configured such that each output HFile fits within a single region. To achieve this, jobs whose output will be bulk loaded into HBase use Hadoop's TotalOrderPartitioner class to partition the map output into disjoint ranges of the key space, corresponding to the key ranges of the regions in the table.
HFileOutputFormat includes a convenience function, configureIncrementalLoad(), which automatically sets up a TotalOrderPartitioner based on the current region boundaries of a table.
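For illustration, a minimal job-setup sketch follows. The mapper class, input format, paths, and table name are hypothetical stand-ins, not part of HBase itself:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class BulkLoadPrepare {

    // Hypothetical mapper: parses one comma-separated line per record into a Put.
    public static class MyMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
      @Override
      protected void map(LongWritable key, Text line, Context context)
          throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        byte[] row = Bytes.toBytes(fields[0]);
        Put put = new Put(row);
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(fields[1]));
        context.write(new ImmutableBytesWritable(row), put);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "prepare-bulk-load");
      job.setJarByClass(BulkLoadPrepare.class);
      job.setMapperClass(MyMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(Put.class);
      FileInputFormat.addInputPath(job, new Path("/user/todd/input"));
      FileOutputFormat.setOutputPath(job, new Path("/user/todd/myoutput"));

      // Inspects the table's current region boundaries and configures the
      // reducer, TotalOrderPartitioner, and HFileOutputFormat accordingly.
      HTable table = new HTable(conf, "mytable");
      HFileOutputFormat.configureIncrementalLoad(job, table);

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Because configureIncrementalLoad() chooses the reducer, partitioner, and output format itself, the job only needs to supply the map phase.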
After the data has been prepared using HFileOutputFormat, it is loaded into the cluster using completebulkload. This command-line tool iterates through the prepared data files and, for each one, determines the region the file belongs to. It then contacts the appropriate RegionServer, which adopts the HFile, moving it into its storage directory and making the data available to clients.
If the region boundaries have changed during the course of bulk load preparation, or between the preparation and completion steps, the completebulkload utility will automatically split the data files into pieces corresponding to the new boundaries. This process is not optimally efficient, so users should take care to minimize the delay between preparing a bulk load and importing it into the cluster, especially if other clients are simultaneously loading data through other means.
After a data import has been prepared, either by using the importtsv tool with the "importtsv.bulk.output" option or by some other MapReduce job using the HFileOutputFormat, the completebulkload tool is used to import the data into the running cluster.
The completebulkload tool simply takes the output path where importtsv or your MapReduce job put its results, and the table name to import into. For example:
$ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable
The -c config-file option can be used to specify a file containing the appropriate HBase parameters (e.g., hbase-site.xml) if these are not already supplied on the CLASSPATH. In addition, the CLASSPATH must contain the directory holding the ZooKeeper configuration file if ZooKeeper is not managed by HBase.
Note: if the target table does not already exist in HBase, this tool will create it automatically.

This tool will run quickly, after which point the new data will be visible in the cluster.
HBase ships with a command-line tool called importtsv which, when given files containing data in TSV form, can prepare this data for bulk import into HBase. By default, this tool uses the HBase put API to insert data into HBase one row at a time, but when the "importtsv.bulk.output" option is used, importtsv will instead generate files using HFileOutputFormat which can subsequently be bulk-loaded into HBase using the completebulkload tool described above. This tool is available by running "hadoop jar /path/to/hbase-VERSION.jar importtsv". Running this command with no arguments prints brief usage information:
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>

Imports the given input directory of TSV data into the specified table.

The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family, or a columnfamily:qualifier. The special
column name HBASE_ROW_KEY is used to designate that this column should be used
as the row key for each imported record. You must specify exactly one column
to be the row key, and you must specify a column name for every column that exists in the
input data.

By default importtsv will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:
  -Dimporttsv.bulk.output=/path/for/output
  Note: if you do not use this option, then the target table must already exist in HBase

Other options that may be specified with -D include:
  -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
  '-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
  -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
  -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
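For example, an invocation along the following lines (the paths and column names are illustrative) reads tab-separated files from /user/todd/input, uses the first column as the row key, and writes HFiles to /user/todd/myoutput for a later completebulkload:

$ hadoop jar /path/to/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 -Dimporttsv.bulk.output=/user/todd/myoutput mytable /user/todd/input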
Although the importtsv tool is useful in many cases, advanced users may want to generate data programmatically, or import data from other formats. To get started doing so, dig into ImportTsv.java and check the JavaDoc for HFileOutputFormat.
The import step of the bulk load can also be done programmatically. See the LoadIncrementalHFiles class for more information.
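A minimal sketch, reusing the illustrative output path and table name from the examples above:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

  public class BulkLoadComplete {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      // Performs the same work as the completebulkload command-line tool:
      // each prepared HFile is handed to the RegionServer that owns its key range.
      LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
      loader.doBulkLoad(new Path("/user/todd/myoutput"),
                        new HTable(conf, "mytable"));
    }
  }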
This page has been retired. The contents have been moved to the Bulk Loading section in the Reference Guide.