Index: src/docbkx/book.xml
===================================================================
--- src/docbkx/book.xml (revision 1342072)
+++ src/docbkx/book.xml (working copy)
@@ -2372,7 +2372,118 @@
+
+
+
Bulk Loading +
Overview + + HBase includes several methods of loading data into tables. + The most straightforward method is to either use the TableOutputFormat + class from a MapReduce job, or use the normal client APIs; however, + these are not always the most efficient methods. + + + The bulk load feature uses a MapReduce job to output table data in HBase's internal + data format, and then directly loads the generated StoreFiles into a running + cluster. Using bulk load will use less CPU and network resources than + simply using the HBase API. + +
+
Bulk Load Architecture + + The HBase bulk load process consists of two main steps. + +
Preparing data via a MapReduce job + + The first step of a bulk load is to generate HBase data files (StoreFiles) from + a MapReduce job using HFileOutputFormat. This output format writes + out data in HBase's internal storage format so that they can be + later loaded very efficiently into the cluster. + + + In order to function efficiently, HFileOutputFormat must be + configured such that each output HFile fits within a single region. + In order to do this, jobs whose output will be bulk loaded into HBase + use Hadoop's TotalOrderPartitioner class to partition the map output + into disjoint ranges of the key space, corresponding to the key + ranges of the regions in the table. + + + HFileOutputFormat includes a convenience function, + configureIncrementalLoad(), which automatically sets up + a TotalOrderPartitioner based on the current region boundaries of a + table. + +
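 As a sketch of how these pieces fit together, the job setup below prepares StoreFiles with HFileOutputFormat; the driver class, the mapper, the assumed tab-separated "rowkey value" input layout, and the "d" column family are illustrative assumptions only, not part of HBase. The call to configureIncrementalLoad() is what wires the reducer, the TotalOrderPartitioner, and the output format to the target table's current regions.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadPrepare {

  // Hypothetical mapper: parses "rowkey<TAB>value" lines and emits one Put per row,
  // keyed by the row so the data can be partitioned by region.
  static class PrepareMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      if (fields.length < 2) {
        return; // skip malformed lines in this sketch
      }
      byte[] row = Bytes.toBytes(fields[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("d"), Bytes.toBytes("c1"), Bytes.toBytes(fields[1]));
      context.write(new ImmutableBytesWritable(row), put);
    }
  }

  // Illustrative usage: hadoop jar bulkload.jar BulkLoadPrepare <inputdir> <outputdir> <tablename>
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "bulk-load-prepare");
    job.setJarByClass(BulkLoadPrepare.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(PrepareMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Configures the reducer, TotalOrderPartitioner and HFileOutputFormat so that
    // each output HFile falls within a single region of the existing target table.
    HTable table = new HTable(conf, args[2]);
    HFileOutputFormat.configureIncrementalLoad(job, table);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

 The output directory written by such a job is then handed to completebulkload, as described in the next section.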
+
Completing the data load

 After the data has been prepared using HFileOutputFormat, it is loaded into the cluster using completebulkload. This command line tool iterates through the prepared data files, and for each one determines the region the file belongs to. It then contacts the appropriate Region Server which adopts the HFile, moving it into its storage directory and making the data available to clients.

 If the region boundaries have changed during the course of bulk load preparation, or between the preparation and completion steps, the completebulkload utility will automatically split the data files into pieces corresponding to the new boundaries. This process is not optimally efficient, so users should take care to minimize the delay between preparing a bulk load and importing it into the cluster, especially if other clients are simultaneously loading data through other means.
+
+
Importing the prepared data using the completebulkload tool + + After a data import has been prepared, either by using the + importtsv tool with the + "importtsv.bulk.output" option or by some other MapReduce + job using the HFileOutputFormat, the + completebulkload tool is used to import the data into the + running cluster. + + + The completebulkload tool simply takes the output path + where importtsv or your MapReduce job put its results, and + the table name to import into. For example: + + $ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable + + The -c config-file option can be used to specify a file + containing the appropriate hbase parameters (e.g., hbase-site.xml) if + not supplied already on the CLASSPATH (In addition, the CLASSPATH must + contain the directory that has the zookeeper configuration file if + zookeeper is NOT managed by HBase). + + + Note: If the target table does not already exist in HBase, this + tool will create the table automatically. + + This tool will run quickly, after which point the new data will be visible in + the cluster. +
+
See Also
 For more information about the referenced utilities, see the ImportTsv and CompleteBulkLoad tool sections.
+
Advanced Usage

 Although the importtsv tool is useful in many cases, advanced users may want to generate data programmatically, or import data from other formats. To get started doing so, dig into ImportTsv.java and check the JavaDoc for HFileOutputFormat.

 The import step of the bulk load can also be done programmatically. See the LoadIncrementalHFiles class for more information.
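 For example, a minimal programmatic completion step might look like the following sketch (the class name and command-line argument layout are assumptions for illustration); it uses LoadIncrementalHFiles to load a directory of StoreFiles written by HFileOutputFormat into an existing table:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadComplete {
  // Illustrative usage: BulkLoadComplete <hdfs://storefile-output-dir> <tablename>
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    HTable table = new HTable(conf, args[1]);
    try {
      // Moves each prepared HFile into the region that owns its key range,
      // splitting files whose boundaries no longer match the table's regions.
      loader.doBulkLoad(new Path(args[0]), table);
    } finally {
      table.close();
    }
  }
}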
+
HDFS As HBase runs on HDFS (and each StoreFile is written as a file on HDFS), Index: src/docbkx/ops_mgt.xml =================================================================== --- src/docbkx/ops_mgt.xml (revision 1342072) +++ src/docbkx/ops_mgt.xml (working copy) @@ -36,6 +36,25 @@ Here we list HBase tools for administration, analysis, fixup, and debugging. +
Driver
 There is a Driver class, executed by the HBase jar, that can be used to invoke frequently accessed utilities. For example,
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar

... will return...

An example program must be given as the first argument.
Valid program names are:
  completebulkload: Complete a bulk data load.
  copytable: Export a table from local cluster to peer cluster
  export: Write table data to HDFS.
  import: Import data written by Export.
  importtsv: Import data in TSV format.
  rowcounter: Count rows in HBase table
  verifyrep: Compare the data from tables in two different clusters. WARNING: It doesn't work for incrementColumnValues'd cells since the timestamp is chan

... for allowable program names.
HBase hbck
 An fsck for your HBase install
@@ -133,15 +152,92 @@
ImportTsv
- Import is a utility that will load data in TSV format into HBase. Invoke via:
-$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
+ ImportTsv is a utility that will load data in TSV format into HBase. It has two distinct usages: loading data from TSV format in HDFS into HBase via Puts, and preparing StoreFiles to be loaded via the completebulkload.
+
+ To load data via Puts (i.e., non-bulk loading):
+$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <hdfs-inputdir>
+
+ To generate StoreFiles for bulk-loading:
+$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c -Dimporttsv.bulk.output=hdfs://storefile-outputdir <tablename> <hdfs-data-inputdir>
+ These generated StoreFiles can be loaded into HBase via the CompleteBulkLoad utility.
ImportTsv Options + Running ImportTsv with no arguments prints brief usage information: + +Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir> + +Imports the given input directory of TSV data into the specified table. + +The column names of the TSV data must be specified using the -Dimporttsv.columns +option. This option takes the form of comma-separated column names, where each +column name is either a simple column family, or a columnfamily:qualifier. The special +column name HBASE_ROW_KEY is used to designate that this column should be used +as the row key for each imported record. You must specify exactly one column +to be the row key, and you must specify a column name for every column that exists in the +input data. + +By default importtsv will load data directly into HBase. To instead generate +HFiles of data to prepare for a bulk data load, pass the option: + -Dimporttsv.bulk.output=/path/for/output + Note: if you do not use this option, then the target table must already exist in HBase + +Other options that may be specified with -D include: + -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line + '-Dimporttsv.separator=|' - eg separate on pipes instead of tabs + -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import + -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper + +
+
ImportTsv Example
 For example, assume that we are loading data into a table called 'datatsv' with a ColumnFamily called 'd' with two columns "c1" and "c2".

 Assume that an input file exists as follows:

row1 c1 c2
row2 c1 c2
row3 c1 c2
row4 c1 c2
row5 c1 c2
row6 c1 c2
row7 c1 c2
row8 c1 c2
row9 c1 c2
row10 c1 c2

 For ImportTsv to use this input file, the command line needs to look like this:

 HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile

 ... and in this example the first column is the rowkey, which is why HBASE_ROW_KEY is used. The second and third columns in the file will be imported as "d:c1" and "d:c2", respectively.
+
ImportTsv Warning
 If you are preparing a lot of data for bulk loading, make sure the target HBase table is pre-split appropriately.
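 One way to do this is to create the table with explicit split keys through HBaseAdmin. This is a sketch only; the table name 'datatsv', column family 'd', and split points below are illustrative and should be replaced with boundaries that match the real row key distribution of the input data.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePresplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("datatsv");   // hypothetical table name
    desc.addFamily(new HColumnDescriptor("d"));                // hypothetical column family

    // Illustrative split points; choose boundaries that evenly divide the real key space.
    byte[][] splits = new byte[][] {
        Bytes.toBytes("row3"),
        Bytes.toBytes("row6"),
        Bytes.toBytes("row9")
    };
    admin.createTable(desc, splits);                           // creates splits.length + 1 regions
    admin.close();
  }
}

 Creating the table with several regions up front lets the preparation job write roughly one HFile per region in parallel, instead of funnelling all of the prepared data into a single region.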
+
See Also
 For more information about bulk-loading HFiles into HBase, see the Bulk Loading section of this guide.
-
- Bulk Loading - For imformation about bulk-loading HFiles into HBase, see Bulk Loads. - This page currently exists on the website and will eventually be migrated into the RefGuide. + +
+ CompleteBulkLoad
+ The completebulkload utility will move generated StoreFiles into an HBase table. This utility is often used in conjunction with output from the importtsv tool.
+
+ There are two ways to invoke this utility, with explicit classname and via the driver:
+$ bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <hdfs://storefileoutput> <tablename>
+
+... and via the Driver:
+HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar completebulkload <hdfs://storefileoutput> <tablename>
+
+ For more information about bulk-loading HFiles into HBase, see the Bulk Loading section of this guide.
Index: src/site/xdoc/bulk-loads.xml
===================================================================
--- src/site/xdoc/bulk-loads.xml (revision 1342072)
+++ src/site/xdoc/bulk-loads.xml (working copy)
@@ -23,149 +23,9 @@
-
- HBase includes several methods of loading data into tables. The most straightforward method is to either use the TableOutputFormat class from a MapReduce job, or use the normal client APIs; however, these are not always the most efficient methods.
-
- This document describes HBase's bulk load functionality. The bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then directly loads the data files into a running cluster. Using bulk load will use less CPU and network resources than simply using the HBase API.
-
- The HBase bulk load process consists of two main steps.
-
- The first step of a bulk load is to generate HBase data files from a MapReduce job using HFileOutputFormat. This output format writes out data in HBase's internal storage format so that they can be later loaded very efficiently into the cluster.
-
- In order to function efficiently, HFileOutputFormat must be configured such that each output HFile fits within a single region. In order to do this, jobs whose output will be bulk loaded into HBase use Hadoop's TotalOrderPartitioner class to partition the map output into disjoint ranges of the key space, corresponding to the key ranges of the regions in the table.
-
- HFileOutputFormat includes a convenience function, configureIncrementalLoad(), which automatically sets up a TotalOrderPartitioner based on the current region boundaries of a table.
-
- After the data has been prepared using HFileOutputFormat, it is loaded into the cluster using completebulkload. This command line tool iterates through the prepared data files, and for each one determines the region the file belongs to. It then contacts the appropriate Region Server which adopts the HFile, moving it into its storage directory and making the data available to clients.
-
- If the region boundaries have changed during the course of bulk load preparation, or between the preparation and completion steps, the completebulkloads utility will automatically split the data files into pieces corresponding to the new boundaries. This process is not optimally efficient, so users should take care to minimize the delay between preparing a bulk load and importing it into the cluster, especially if other clients are simultaneously loading data through other means.
-
- After a data import has been prepared, either by using the importtsv tool with the "importtsv.bulk.output" option or by some other MapReduce job using the HFileOutputFormat, the completebulkload tool is used to import the data into the running cluster.
-
- The completebulkload tool simply takes the output path where importtsv or your MapReduce job put its results, and the table name to import into. For example:
-
- $ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable
-
- The -c config-file option can be used to specify a file containing the appropriate hbase parameters (e.g., hbase-site.xml) if not supplied already on the CLASSPATH (In addition, the CLASSPATH must contain the directory that has the zookeeper configuration file if zookeeper is NOT managed by HBase).
-
- Note: If the target table does not already exist in HBase, this tool will create the table automatically.
-
- This tool will run quickly, after which point the new data will be visible in the cluster.
-
- HBase ships with a command line tool called importtsv which when given files containing data in TSV form can prepare this data for bulk import into HBase. This tool by default uses the HBase put API to insert data into HBase one row at a time, but when the "importtsv.bulk.output" option is used, importtsv will instead generate files using HFileOutputFormat which can subsequently be bulk-loaded into HBase using the completebulkload tool described above. This tool is available by running "hadoop jar /path/to/hbase-VERSION.jar importtsv". Running this command with no arguments prints brief usage information:
-
-Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
-
-Imports the given input directory of TSV data into the specified table.
-
-The column names of the TSV data must be specified using the -Dimporttsv.columns
-option. This option takes the form of comma-separated column names, where each
-column name is either a simple column family, or a columnfamily:qualifier. The special
-column name HBASE_ROW_KEY is used to designate that this column should be used
-as the row key for each imported record. You must specify exactly one column
-to be the row key, and you must specify a column name for every column that exists in the
-input data.
-
-By default importtsv will load data directly into HBase. To instead generate
-HFiles of data to prepare for a bulk data load, pass the option:
-  -Dimporttsv.bulk.output=/path/for/output
-  Note: if you do not use this option, then the target table must already exist in HBase
-
-Other options that may be specified with -D include:
-  -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
-  '-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
-  -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
-  -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
-
- Although the importtsv tool is useful in many cases, advanced users may want to generate data programatically, or import data from other formats. To get started doing so, dig into ImportTsv.java and check the JavaDoc for HFileOutputFormat.
-
- The import step of the bulk load can also be done programatically. See the LoadIncrementalHFiles class for more information.
-
+
+ This page has been retired. The contents have been moved to the Bulk Loading section in the Reference Guide.