Index: CHANGES.txt
===================================================================
--- CHANGES.txt (revision 952559)
+++ CHANGES.txt (working copy)
@@ -672,7 +672,9 @@
HBASE-2651 Allow alternate column separators to be specified for ImportTsv
HBASE-2661 Add test case for row atomicity guarantee
HBASE-2578 Add ability for tests to override server-side timestamp
- setting (currentTimeMillis) (Daniel Ploeg via Ryan Rawson)
+ setting (currentTimeMillis) (Daniel Ploeg via Ryan Rawson)
+ HBASE-2558 Our javadoc overview -- "Getting Started", requirements, etc. --
+ is not carried across by mvn javadoc:javadoc target
NEW FEATURES
HBASE-1961 HBase EC2 scripts
Index: src/assembly/bin.xml
===================================================================
--- src/assembly/bin.xml (revision 952559)
+++ src/assembly/bin.xml (working copy)
@@ -40,6 +40,17 @@
HBase is a distributed, column-oriented store, modeled after Google's BigTable. HBase is built on top of Hadoop for its MapReduce and distributed file system implementation. All these projects are open-source and part of the Apache Software Foundation. As distributed, large scale platforms, the Hadoop and HBase projects mainly focus on *nix environments for production installations. However, being developed in Java, both projects are fully portable across platforms and, hence, also to the Windows operating system. For ease of development the projects rely on Cygwin to provide a *nix-like environment on Windows to run the shell scripts. This document explains the intricacies of running HBase on Windows using Cygwin as an all-in-one single-node installation for testing and development. The HBase Overview and QuickStart guides, on the other hand, go a long way in explaining how to set up HBase in more complex deployment scenarios.
For running HBase on Windows, 3 technologies are required: Java, Cygwin and SSH. The following paragraphs detail the installation of each.
HBase depends on the Java Platform, Standard Edition, 6 Release, so the target system has to be provided with at least the Java Runtime Environment (JRE); however, if the system will also be used for development, the Java Development Kit (JDK) is preferred. You can download the latest versions of both from Sun's download page. Installation is a simple GUI wizard that guides you through the process.
Cygwin is probably the oddest technology in this solution stack. It provides a dynamic link library that emulates most of a *nix environment on Windows. On top of that, a whole bunch of the most common *nix tools are supplied. Combined, the DLL and the tools form a very *nix-alike environment on Windows.
HBase (and Hadoop) rely on SSH for interprocess/-node communication and launching remote commands. SSH will be provisioned on the target system via Cygwin, which supports running Cygwin programs as Windows services.
Download the latest release of HBase from the website. As the HBase distributable is just a zipped archive, installation is as simple as unpacking the archive so it ends up in its final installation directory. Notice that HBase has to be installed in Cygwin; a good directory suggestion is given below.
There are 3 parts left to configure: Java, SSH and HBase itself. The following paragraphs explain each topic in detail. One important thing to remember in shell scripting in general (i.e. *nix and Windows) is that managing, manipulating and assembling path names that contain spaces can be very hard, due to the need to escape and quote those characters and strings. So we try to stay away from spaces in path names. *nix environments can help us out here very easily by using symbolic links. Configuring SSH is quite elaborate, but primarily a question of launching it by default as a Windows service. Working through the sections below concludes the installation and configuration of HBase on Windows using Cygwin, after which it is time to test the setup.
+For installation, Cygwin provides the setup.exe utility, which tracks the versions of all installed components on the target system and provides the mechanism for installing or updating everything from the Cygwin mirror sites. The setup.exe utility uses 2 directories on the target system: the Root directory for Cygwin (defaults to C:\cygwin), which will become / within the eventual Cygwin installation, and the Local Package directory (e.g. C:\cygsetup), which is the cache where setup.exe stores the packages before they are installed. The cache must not be the same folder as the Cygwin root.
+Perform the following steps to install Cygwin, which are elaborately detailed in the 2nd chapter of the Cygwin User's Guide:
+
+1. Make sure you have Administrator privileges on the target system.
+2. Choose a Root directory and a Local Package directory, e.g. the C:\cygwin\root and C:\cygwin\setup folders.
+3. Download the setup.exe utility and save it to the Local Package directory.
+4. Run the setup.exe utility.
+
+
+5. In the setup wizard, choose the Install from Internet option and, when asked for the Local Package directory, select the folder holding the setup.exe utility.
+6. Create a CYGWIN_HOME system-wide environment variable that points to your Root directory.
+7. Add %CYGWIN_HOME%\bin to the end of your PATH environment variable.
+8. Run the Cygwin.bat command in the Root folder. You should end up in a terminal window that is running a Bash shell. Test the shell by issuing the following commands:
+
+
+cd / should take you to the Root directory in Cygwin; an ls command should list all files and folders in the current directory. Use the exit command to end the terminal.
+
+To provision SSH, re-run the setup.exe utility. Click the Next button until the Select Packages panel is shown. Click the View button to toggle to the list view, which is ordered alphabetically on Package, making it easier to find the packages we'll need. Click the word Skip next to each SSH-related package you need (so it's marked for installation), then use the Next button to download and install the packages.
+
+
+A good installation directory is /usr/local/ (or [Root directory]\usr\local in Windows terms). You should end up with a /usr/local/hbase-<version> installation in Cygwin.
+
+Create a symbolic link in /usr/local to the Java home directory by using the following command and substituting the name of your chosen Java environment:
+ln -s /cygdrive/c/Program\ Files/Java/<jre name> /usr/local/<jre name>
+Test the setup by changing into the new directory with cd /usr/local/<jre name> and issuing the command ./bin/java -version. This should output your version of the chosen JRE.
+
+Start a Bash shell using Run as Administrator. You can inspect the current permissions with an ls -l command on the different files. Also notice that the auto-completion feature in the shell, using <TAB>, is extremely handy in these situations. Make sure the following permissions are in place:
+
+
+chmod +r /etc/passwd to make the passwords file readable for all
+chmod u+w /etc/passwd to make the passwords file writable for the owner
+chmod +r /etc/group to make the groups file readable for all
+
+chmod u+w /etc/group to make the groups file writable for the owner
+
+chmod 755 /var to make the var folder writable to owner and readable and executable to all
+
+Also make sure the following two entries appear above the PARANOID line:
+
+
+ALL : localhost 127.0.0.1/32 : allow
+ALL : [::1]/128 : allow
+
+Next, run the ssh-host-config utility and answer its questions as follows:
+
+
+For the question about /etc/ssh_config, answer yes. For the question about /etc/sshd_config, answer yes. To the following question, answer yes. When asked whether to install sshd as a service, answer yes; make sure you started your shell as Administrator! At the next prompt, just press <enter> as the default is ntsec. For the question about the sshd account, answer yes. For the next question, no as the default will suffice. For the question about the cyg_server account, answer yes, and enter a password for the account. Then start the service with net start sshd or cygrunsrv --start sshd. Notice that cygrunsrv is the utility that makes the process run as a Windows service. Confirm that you see a message stating that the CYGWIN sshd service was started successfully.
+
+Then regenerate the Cygwin user and group files from the Windows accounts:
+mkpasswd -cl > /etc/passwd
+mkgroup --local > /etc/group
+
+Run whoami to verify your userID, then ssh localhost to connect to the system itself.
+
+
+Answer yes when presented with the server's fingerprint. An exit command should take you back to your first shell in Cygwin; another exit should terminate the Cygwin shell. The steps below assume a shell with the HBase [installation directory] as working directory.
+
+
+HBase uses the ./conf/hbase-env.sh script to configure its dependencies on the runtime environment. Copy and uncomment the following lines just underneath their originals, changing them to fit your environment. They should read something like:
+
+
+export JAVA_HOME=/usr/local/<jre name>
+export HBASE_IDENT_STRING=$HOSTNAME as this most likely does not include spaces.
+
+HBase also uses the hbase-default.xml file for configuration. Some properties do not resolve to existing directories because the JVM runs on Windows. This is the major issue to keep in mind when working with Cygwin: within the shell all paths are *nix-alike, hence relative to the root /. However, every parameter that is consumed within the Windows processes themselves needs to be a Windows setting, hence C:\-alike. Change the following properties in the configuration file, adjusting paths where necessary to conform with your own installation:
+
+
+hbase.rootdir must read e.g. file:///C:/cygwin/root/tmp/hbase/data
+hbase.tmp.dir must read C:/cygwin/root/tmp/hbase/tmp
+hbase.zookeeper.quorum must read 127.0.0.1 because for some reason localhost doesn't seem to resolve properly on Cygwin.
+Make sure the configured hbase.rootdir and hbase.tmp.dir directories exist and have the proper rights set up, e.g. by issuing a chmod 777 on them.
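For illustration only, the entries could look like the following in the standard Hadoop-style property XML (inside the <configuration> element), using the example paths above:

<property>
  <name>hbase.rootdir</name>
  <value>file:///C:/cygwin/root/tmp/hbase/data</value>
</property>
<property>
  <name>hbase.tmp.dir</name>
  <value>C:/cygwin/root/tmp/hbase/tmp</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>127.0.0.1</value>
</property>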
+
+To start HBase, cd /usr/local/hbase-<version> (preferably using auto-completion) and run ./bin/start-hbase.sh
+
+
+If prompted, answer yes. Check the ./logs directory for any exceptions, then start the HBase shell with ./bin/hbase shell
+
+Create a table with create 'test', 'data', verify it with list, then insert some rows:
+put 'test', 'row1', 'data:1', 'value1'
+put 'test', 'row2', 'data:2', 'value2'
+put 'test', 'row3', 'data:3', 'value3'
+Run scan 'test', which should list all the rows previously inserted. Notice how 3 new columns were added without changing the schema! Clean up with disable 'test' followed by drop 'test', verified by list, which should give an empty listing; then leave the shell with exit. To stop HBase, run the ./bin/stop-hbase.sh command. And wait for it to complete!!! Killing the process might corrupt your data on disk.
+
+If something goes wrong, check the files in the ./logs directory, or ask for help on the mailing lists or IRC channel (#hbase@freenode.net). People are very active and keen to help out!
+Now your HBase server is running, start coding and build that next killer app on this particular, but scalable datastore!
+HBase is not an ACID-compliant database. However, it does guarantee certain specific properties.
+This specification enumerates the ACID properties of HBase.
+For the sake of common vocabulary, we define the following terms:
+The terms must and may are used as specified by RFC 2119. In short, the word "must" implies that, if some case exists where the statement is not true, it is a bug. The word "may" implies that, even if the guarantee is provided in a current release, users should not rely on it.
+A scan is not a consistent view of a table. Scans do not exhibit snapshot isolation.
+Rather, scans have the following properties:
+Those familiar with relational databases will recognize this isolation level as "read committed".
+Please note that the guarantees listed above regarding scanner consistency are referring to "transaction commit time", not the "timestamp" field of each cell. That is to say, a scanner started at time t may see edits with a timestamp value greater than t, if those edits were committed with a "forward dated" timestamp before the scanner was constructed.
+All of the above guarantees must be possible within HBase. For users who would like to trade off some guarantees for performance, HBase may offer several tuning options. For example:
+[1] In the context of HBase, "durably on disk" implies an hflush() call on the transaction log. This does not actually imply an fsync() to magnetic media, but rather just that the data has been written to the OS cache on all replicas of the log. In the case of a full datacenter power loss, it is possible that the edits are not truly durable.
+HBase emits Hadoop metrics.
+First read up on Hadoop metrics. If you are using ganglia, the GangliaMetrics wiki page is a useful read.
+To have HBase emit metrics, edit $HBASE_HOME/conf/hadoop-metrics.properties
+ and enable metric 'contexts' per plugin. As of this writing, hadoop supports
+ file and ganglia plugins.
+ Yes, the hbase metrics file is named hadoop-metrics rather than
+ hbase-metrics because currently at least the hadoop metrics system has the
+ properties filename hardcoded. Per metrics context,
+ comment out the NullContext and enable one or more plugins instead.
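As an indication only, routing the hbase context to ganglia would give entries along these lines (the ganglia endpoint gangliahost:8649 and the 10-second period are example values):

hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext
hbase.period=10
hbase.servers=gangliahost:8649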
+
+If you enable the hbase context, on regionservers you'll see total requests since last metric emission, count of regions and storefiles as well as a count of memstore size. On the master, you'll see a count of the cluster's requests.
+Enabling the rpc context is good if you are interested in seeing metrics on each hbase rpc method invocation (counts and time taken).
+The jvm context is useful for long-term stats on running hbase jvms -- memory used, thread counts, etc. As of this writing, if more than one jvm is running emitting metrics, at least in ganglia, the stats are aggregated rather than reported per instance.
+In addition to the standard output contexts supported by the Hadoop metrics package, you can also export HBase metrics via Java Management Extensions (JMX). This will allow viewing HBase stats in JConsole or any other JMX client.
+
+ To enable JMX support in HBase, first edit
+ $HBASE_HOME/conf/hadoop-metrics.properties to support
+ metrics refreshing. (If you've already configured
+ hadoop-metrics.properties for another output context,
+ you can skip this step).
+
+ For remote access, you will need to configure JMX remote passwords and access profiles. Create the files:
+$HBASE_HOME/conf/jmxremote.passwd (set permissions to 600)
+$HBASE_HOME/conf/jmxremote.access
+ Finally, edit the $HBASE_HOME/conf/hbase-env.sh
+ script to add JMX support:
+
In $HBASE_HOME/conf/hbase-env.sh, add the lines:
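As a sketch only (the HBASE_JMX_OPTS helper variable, the HBASE_MASTER_OPTS/HBASE_REGIONSERVER_OPTS targets and the ports 10101/10102 are assumptions here; the com.sun.management.jmxremote.* flags are standard JVM options):

HBASE_JMX_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false"
HBASE_JMX_OPTS="$HBASE_JMX_OPTS -Dcom.sun.management.jmxremote.password.file=$HBASE_HOME/conf/jmxremote.passwd"
HBASE_JMX_OPTS="$HBASE_JMX_OPTS -Dcom.sun.management.jmxremote.access.file=$HBASE_HOME/conf/jmxremote.access"
export HBASE_MASTER_OPTS="$HBASE_JMX_OPTS -Dcom.sun.management.jmxremote.port=10101"
export HBASE_REGIONSERVER_OPTS="$HBASE_JMX_OPTS -Dcom.sun.management.jmxremote.port=10102"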
+
+ After restarting the processes you want to monitor, you should now be
+ able to run JConsole (included with the JDK since JDK 5.0) to view
+ the statistics via JMX. HBase MBeans are exported under the
+ hadoop domain in JMX.
+
HBase is the Hadoop database. Use it when you need random, realtime read/write access to your Big Data.
+ This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters
+ of commodity hardware.
+
+
June 30th, HBase Contributor Workshop (Day after Hadoop Summit)
+May 10th, 2010: HBase graduates from Hadoop sub-project to Apache Top Level Project
+Signup for HBase User Group Meeting, HUG10 hosted by Trend Micro, April 19th, 2010
+HBase User Group Meeting, HUG9 hosted by Mozilla, March 10th, 2010
+Sign up for the HBase User Group Meeting, HUG8, January 27th, 2010 at StumbleUpon in SF
+September 8th, 2009: HBase 0.20.0 is faster, stronger, slimmer, and sweeter tasting than any previous HBase release. Get it off the Releases page.
+ApacheCon in Oakland: November 2-6th, 2009: The Apache Foundation will be celebrating its 10th anniversary in beautiful Oakland by the Bay. Lots of good talks and meetups including an HBase presentation by a couple of the lads.
+HBase at Hadoop World in NYC: October 2nd, 2009: A few of us will be talking on Practical HBase out east at Hadoop World: NYC.
+HUG7 and HBase Hackathon: August 7th-9th, 2009 at StumbleUpon in SF: Sign up for the HBase User Group Meeting, HUG7 or for the Hackathon or for both (all are welcome!).
+June, 2009 -- HBase at HadoopSummit2009 and at NOSQL: See the presentations
+March 3rd, 2009 -- HUG6: HBase User Group 6
+January 30th, 2009 -- LA Hbackathon: HBase January Hackathon Los Angeles at Streamy in Manhattan Beach
+HBase includes several methods of loading data into tables. The most straightforward method is to either use the TableOutputFormat class from a MapReduce job, or use the normal client APIs; however, these are not always the most efficient methods.
+This document describes HBase's bulk load functionality. The bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then directly loads the data files into a running cluster.
+The HBase bulk load process consists of two main steps.
+The first step of a bulk load is to generate HBase data files from a MapReduce job using HFileOutputFormat. This output format writes out data in HBase's internal storage format so that they can be later loaded very efficiently into the cluster.
+In order to function efficiently, HFileOutputFormat must be configured such that each output HFile fits within a single region. In order to do this, jobs use Hadoop's TotalOrderPartitioner class to partition the map output into disjoint ranges of the key space, corresponding to the key ranges of the regions in the table.
+
+ HFileOutputFormat includes a convenience function, configureIncrementalLoad(),
+ which automatically sets up a TotalOrderPartitioner based on the current
+ region boundaries of a table.
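As a rough sketch of such a preparation job (the mapper body, the input/output paths and the table name "mytable" are placeholders; the classes are assumed to come from org.apache.hadoop.hbase.mapreduce, org.apache.hadoop.hbase.client and the Hadoop MapReduce API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadPrepare {

  // Placeholder mapper: parse each input line and emit (row key, KeyValue) pairs.
  static class MyTsvMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable key, Text value, Context context) {
      // TODO: build an ImmutableBytesWritable row key and one KeyValue per cell,
      // then context.write(rowKey, keyValue).
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "bulk-load-prepare");
    job.setJarByClass(BulkLoadPrepare.class);
    job.setMapperClass(MyTsvMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);
    FileInputFormat.addInputPath(job, new Path("/user/todd/input"));
    FileOutputFormat.setOutputPath(job, new Path("/user/todd/myoutput"));
    // configureIncrementalLoad() wires in the reducer, the output format and a
    // TotalOrderPartitioner matched to the table's current region boundaries.
    HTable table = new HTable(conf, "mytable");
    HFileOutputFormat.configureIncrementalLoad(job, table);
    job.waitForCompletion(true);
  }
}

The key point is the configureIncrementalLoad() call, which ties the job's partitioning to the table's current regions as described above.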
+
+ After the data has been prepared using HFileOutputFormat, it
+ is loaded into the cluster using a command line tool. This command line tool
+ iterates through the prepared data files, and for each one determines the
+ region the file belongs to. It then contacts the appropriate Region Server
+ which adopts the HFile, moving it into its storage directory and making
+ the data available to clients.
+
+If the region boundaries have changed during the course of bulk load preparation, or between the preparation and completion steps, the bulk load commandline utility will automatically split the data files into pieces corresponding to the new boundaries. This process is not optimally efficient, so users should take care to minimize the delay between preparing a bulk load and importing it into the cluster, especially if other clients are simultaneously loading data through other means.
+
+ HBase ships with a command line tool called importtsv. This tool
+ is available by running hadoop jar /path/to/hbase-VERSION.jar importtsv.
+ Running this tool with no arguments prints brief usage information:
+
+Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
+
+Imports the given input directory of TSV data into the specified table.
+
+The column names of the TSV data must be specified using the -Dimporttsv.columns
+option. This option takes the form of comma-separated column names, where each
+column name is either a simple column family, or a columnfamily:qualifier. The special
+column name HBASE_ROW_KEY is used to designate that this column should be used
+as the row key for each imported record. You must specify exactly one column
+to be the row key.
+
+In order to prepare data for a bulk data load, pass the option:
+ -Dimporttsv.bulk.output=/path/for/output
+
+Other options that may be specified with -D include:
+ -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
+
+
+ After a data import has been prepared using the importtsv tool, the
+ completebulkload tool is used to import the data into the running cluster.
+
+ The completebulkload tool simply takes the same output path where
+ importtsv put its results, and the table name. For example:
+
$ hadoop jar hbase-VERSION.jar completebulkload /user/todd/myoutput mytable
+This tool will run quickly, after which point the new data will be visible in the cluster.
+
+ Although the importtsv tool is useful in many cases, advanced users may
+ want to generate data programmatically, or import data from other formats. To get
+ started doing so, dig into ImportTsv.java and check the JavaDoc for
+ HFileOutputFormat.
+
+ The import step of the bulk load can also be done programmatically. See the
+ LoadIncrementalHFiles class for more information.
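A minimal sketch of that programmatic import, assuming the table name and the HFileOutputFormat output path used in the examples above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadComplete {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");
    // Walks the HFileOutputFormat output directory and hands each HFile to the
    // region server owning its key range, splitting files if boundaries moved.
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    loader.doBulkLoad(new Path("/user/todd/myoutput"), table);
  }
}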
+
src/documentation/content/xdocs, or wherever the
- ${project.xdocs-dir} property (set in
- forrest.properties) points.
- HBase is a distributed, column-oriented store, modeled after Google's BigTable. HBase is built on top of Hadoop for its MapReduce and distributed file system implementation. All these projects are open-source and part of the Apache Software Foundation.
- -As being distributed, large scale platforms, the Hadoop and HBase projects mainly focus on *nix environments for production installations. However, being developed in Java, both projects are fully portable across platforms and, hence, also to the Windows operating system. For ease of development the projects rely on Cygwin to have a *nix-like environment on Windows to run the shell scripts.
-This document explains the intricacies of running HBase on Windows using Cygwin as an all-in-one single-node installation for testing and development. The HBase Overview and QuickStart guides on the other hand go a long way in explaining how to set up HBase in more complex deployment scenarios.
-For running HBase on Windows, 3 technologies are required: Java, Cygwin and SSH. The following paragraphs detail the installation of each of the aforementioned technologies.
-HBase depends on the Java Platform, Standard Edition, 6 Release. So the target system has to be provided with at least the Java Runtime Environment (JRE); however if the system will also be used for development, the Java Development Kit (JDK) is preferred. You can download the latest versions for both from Sun's download page. Installation is a simple GUI wizard that guides you through the process.
-Cygwin is probably the oddest technology in this solution stack. It provides a dynamic link library that emulates most of a *nix environment on Windows. On top of that a whole bunch of the most common *nix tools are supplied. Combined, the DLL with the tools form a very *nix-alike environment on Windows.
- -For installation, Cygwin provides the setup.exe utility that tracks the versions of all installed components on the target system and provides the mechanism for installing or updating everything from the mirror sites of Cygwin.
To support installation, the setup.exe utility uses 2 directories on the target system. The Root directory for Cygwin (defaults to C:\cygwin) which will become / within the eventual Cygwin installation; and the Local Package directory (e.g. C:\cygsetup that is the cache where setup.exe stores the packages before they are installed. The cache must not be the same folder as the Cygwin root.
Perform the following steps to install Cygwin, which are elaborately detailed in the 2nd chapter of the Cygwin User's Guide:
- -Administrator privileges on the target system.C:\cygwin\root and C:\cygwin\setup folders.setup.exe utility and save it to the Local Package directory.setup.exe utility,
-Install from Internet option,setup.exe utility in the Local Package folder.CYGWIN_HOME system-wide environment variable that points to your Root directory.%CYGWIN_HOME%\bin to the end of your PATH environment variable.Cygwin.bat command in the Root folder. You should end up in a terminal window that is running a Bash shell. Test the shell by issuing following commands:
-cd / should take you to the Root directory in Cygwin; an ls command should list all files and folders in the current directory. Use the exit command to end the terminal. HBase (and Hadoop) rely on SSH for interprocess/-node communication and launching remote commands. SSH will be provisioned on the target system via Cygwin, which supports running Cygwin programs as Windows services!
-Re-run the setup.exe utility. Click the Next button until the Select Packages panel is shown. Click the View button to toggle to the list view, which is ordered alphabetically on Package, making it easier to find the packages we'll need. Click Skip next to each required package so it's marked for installation. Use the Next button to download and install the packages.
-Download the latest release of HBase from the website. As the HBase distributable is just a zipped archive, installation is as simple as unpacking the archive so it ends up in its final installation directory. Notice that HBase has to be installed in Cygwin and a good directory suggestion is to use /usr/local/ (or [Root directory]\usr\local in Windows slang). You should end up with a /usr/local/hbase-<version> installation in Cygwin.
There are 3 parts left to configure: Java, SSH and HBase itself. The following paragraphs explain each topic in detail.
-One important thing to remember in shell scripting in general (i.e. *nix and Windows) is that managing, manipulating and assembling path names that contains spaces can be very hard, due to the need to escape and quote those characters and strings. So we try to stay away from spaces in path names. *nix environments can help us out here very easily by using symbolic links.
- -/usr/local to the Java home directory by using the following command and substituting the name of your chosen Java environment:
-ln -s /cygdrive/c/Program\ Files/Java/<jre name> /usr/local/<jre name>
Test the link by changing into the directory with cd /usr/local/<jre name> and issuing the command ./bin/java -version. This should output your version of the chosen JRE. Configuring SSH is quite elaborate, but primarily a question of launching it by default as a Windows service.
- -Run as Administrator.LS -L command on the different files. Also, notice the auto-completion feature in the shell using <TAB> is extremely handy in these situations.
-chmod +r /etc/passwd to make the passwords file readable for allchmod u+w /etc/passwd to make the passwords file writable for the ownerchmod +r /etc/group to make the groups file readable for allchmod u+w /etc/group to make the groups file writable for the ownerchmod 755 /var to make the var folder writable to owner and readable and executable to allPARANOID line:
-ALL : localhost 127.0.0.1/32 : allowALL : [::1]/128 : allowssh-host-config
-/etc/ssh_config, answer yes./etc/sshd_config, answer yes.yes.sshd as a service, answer yes. Make sure you started your shell as Adminstrator!<enter> as the default is ntsec.sshd account, answer yes.no as the default will suffice.cyg_server account, answer yes. Enter a password for the account.net start sshd or cygrunsrv --start sshd. Notice that cygrunsrv is the utility that make the process run as a Windows service. Confirm that you see a message stating that the CYGWIN sshd service was started succesfully.mkpasswd -cl > /etc/passwdmkgroup --local > /etc/groupwhoami to verify your userIDssh localhost to connect to the system itself
-yes when presented with the server's fingerprintexit command should take you back to your first shell in CygwinExit should terminate the Cygwin shell.[installation directory] as working directory.
-HBase uses the ./conf/hbase-env.sh script to configure its dependencies on the runtime environment. Copy and uncomment the following lines just underneath their originals, changing them to fit your environment. They should read something like:
-export JAVA_HOME=/usr/local/<jre name>export HBASE_IDENT_STRING=$HOSTNAME as this most likely does not inlcude spaces.hbase-default.xml file for configuration. Some properties do not resolve to existing directories because the JVM runs on Windows. This is the major issue to keep in mind when working with Cygwin: within the shell all paths are *nix-alike, hence relative to the root /. However, every parameter that is to be consumed within the windows processes themself, need to be Windows settings, hence C:\-alike. Change following propeties in the configuration file, adjusting paths where necessary to conform with your own installation:
-hbase.rootdir must read e.g. file:///C:/cygwin/root/tmp/hbase/datahbase.tmp.dir must read C:/cygwin/root/tmp/hbase/tmphbase.zookeeper.quorum must read 127.0.0.1 because for some reason localhost doesn't seem to resolve properly on Cygwin.hbase.rootdir and hbase.tmp.dir directories exist and have the proper rights set up e.g. by issuing a chmod 777 on them.-This should conclude the installation and configuration of HBase on Windows using Cygwin. So it's time to test it. -
CD /usr/local/hbase-<version>, preferably using auto-completion../bin/start-hbase.sh
-yes../logs directory for any exceptions../bin/hbase shellcreate 'test', 'data'listput 'test', 'row1', 'data:1', 'value1' -put 'test', 'row2', 'data:2', 'value2' -put 'test', 'row3', 'data:3', 'value3'-
scan 'test' that should list all the rows previously inserted. Notice how 3 new columns where added without changing the schema!disable 'test' followed by drop 'test' and verified by list which should give an empty listing.exit./bin/stop-hbase.sh command. And wait for it to complete!!! Killing the process might corrupt your data on disk../logs directory.#hbase@freenode.net). People are very active and keen to help out!-Now your HBase server is running, start coding and build that next killer app on this particular, but scalable datastore! -
-HBase is not an ACID-compliant database. However, it does guarantee certain specific properties.
-This specification enumerates the ACID properties of HBase.
-For the sake of common vocabulary, we define the following terms:
-- The terms must and may are used as specified by RFC 2119. - In short, the word "must" implies that, if some case exists where the statement - is not true, it is a bug. The word "may" implies that, even if the guarantee - is provided in a current release, users should not rely on it. -
-- A scan is not a consistent view of a table. Scans do - not exhibit snapshot isolation. -
-- Rather, scans have the following properties: -
- -- Those familiar with relational databases will recognize this isolation level as "read committed". -
-- Please note that the guarantees listed above regarding scanner consistency - are referring to "transaction commit time", not the "timestamp" - field of each cell. That is to say, a scanner started at time t may see edits - with a timestamp value greater than t, if those edits were committed with a - "forward dated" timestamp before the scanner was constructed. -
-All of the above guarantees must be possible within HBase. For users who would like to trade - off some guarantees for performance, HBase may offer several tuning options. For example:
-[1] In the context of HBase, "durably on disk" implies an hflush() call on the transaction - log. This does not actually imply an fsync() to magnetic media, but rather just that the data has been - written to the OS cache on all replicas of the log. In the case of a full datacenter power loss, it is - possible that the edits are not truly durable.
-- HBase emits Hadoop metrics. -
-First read up on Hadoop metrics. If you are using ganglia, the GangliaMetrics wiki page is a useful read.
-To have HBase emit metrics, edit $HBASE_HOME/conf/hadoop-metrics.properties
- and enable metric 'contexts' per plugin. As of this writing, hadoop supports
- file and ganglia plugins.
- Yes, the hbase metrics file is named hadoop-metrics rather than
- hbase-metrics because currently at least the hadoop metrics system has the
- properties filename hardcoded. Per metrics context,
- comment out the NullContext and enable one or more plugins instead.
-
- If you enable the hbase context, on regionservers you'll see total requests since last - metric emission, count of regions and storefiles as well as a count of memstore size. - On the master, you'll see a count of the cluster's requests. -
-- Enabling the rpc context is good if you are interested in seeing - metrics on each hbase rpc method invocation (counts and time taken). -
-- The jvm context is - useful for long-term stats on running hbase jvms -- memory used, thread counts, etc. - As of this writing, if more than one jvm is running emitting metrics, at least - in ganglia, the stats are aggregated rather than reported per instance. -
-- In addition to the standard output contexts supported by the Hadoop - metrics package, you can also export HBase metrics via Java Management - Extensions (JMX). This will allow viewing HBase stats in JConsole or - any other JMX client. -
-
- To enable JMX support in HBase, first edit
- $HBASE_HOME/conf/hadoop-metrics.properties to support
- metrics refreshing. (If you've already configured
- hadoop-metrics.properties for another output context,
- you can skip this step).
-
- For remote access, you will need to configure JMX remote passwords - and access profiles. Create the files: -
-$HBASE_HOME/conf/jmxremote.passwd (set permissions
- to 600)$HBASE_HOME/conf/jmxremote.access
- Finally, edit the $HBASE_HOME/conf/hbase-env.sh
- script to add JMX support:
-
$HBASE_HOME/conf/hbase-env.shAdd the lines:
-
- After restarting the processes you want to monitor, you should now be
- able to run JConsole (included with the JDK since JDK 5.0) to view
- the statistics via JMX. HBase MBeans are exported under the
- hadoop domain in JMX.
-
-The following documents provide concepts and procedures that will help you get started using HBase. If you have more questions, you can ask the mailing list or browse the archives.
-- HBase includes several methods of loading data into tables. - The most straightforward method is to either use the TableOutputFormat - class from a MapReduce job, or use the normal client APIs; however, - these are not always the most efficient methods. -
-- This document describes HBase's bulk load functionality. The bulk load - feature uses a MapReduce job to output table data in HBase's internal - data format, and then directly loads the data files into a running - cluster. -
-- The HBase bulk load process consists of two main steps. -
-- The first step of a bulk load is to generate HBase data files from - a MapReduce job using HFileOutputFormat. This output format writes - out data in HBase's internal storage format so that they can be - later loaded very efficiently into the cluster. -
-- In order to function efficiently, HFileOutputFormat must be configured - such that each output HFile fits within a single region. In order to - do this, jobs use Hadoop's TotalOrderPartitioner class to partition the - map output into disjoint ranges of the key space, corresponding to the - key ranges of the regions in the table. -
-
- HFileOutputFormat includes a convenience function, configureIncrementalLoad(),
- which automatically sets up a TotalOrderPartitioner based on the current
- region boundaries of a table.
-
- After the data has been prepared using HFileOutputFormat, it
- is loaded into the cluster using a command line tool. This command line tool
- iterates through the prepared data files, and for each one determines the
- region the file belongs to. It then contacts the appropriate Region Server
- which adopts the HFile, moving it into its storage directory and making
- the data available to clients.
-
- If the region boundaries have changed during the course of bulk load - preparation, or between the preparation and completion steps, the bulk - load commandline utility will automatically split the data files into - pieces corresponding to the new boundaries. This process is not - optimally efficient, so users should take care to minimize the delay between - preparing a bulk load and importing it into the cluster, especially - if other clients are simultaneously loading data through other means. -
-importtsv tool
- HBase ships with a command line tool called importtsv. This tool
- is available by running hadoop jar /path/to/hbase-VERSION.jar importtsv.
- Running this tool with no arguments prints brief usage information:
-
-Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
-
-Imports the given input directory of TSV data into the specified table.
-
-The column names of the TSV data must be specified using the -Dimporttsv.columns
-option. This option takes the form of comma-separated column names, where each
-column name is either a simple column family, or a columnfamily:qualifier. The special
-column name HBASE_ROW_KEY is used to designate that this column should be used
-as the row key for each imported record. You must specify exactly one column
-to be the row key.
-
-In order to prepare data for a bulk data load, pass the option:
- -Dimporttsv.bulk.output=/path/for/output
-
-Other options that may be specified with -D include:
- -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
-
- completebulkload tool
- After a data import has been prepared using the importtsv tool, the
- completebulkload tool is used to import the data into the running cluster.
-
- The completebulkload tool simply takes the same output path where
- importtsv put its results, and the table name. For example:
-
$ hadoop jar hbase-VERSION.jar completebulkload /user/todd/myoutput mytable
- - This tool will run quickly, after which point the new data will be visible in - the cluster. -
-
- Although the importtsv tool is useful in many cases, advanced users may
- want to generate data programmatically, or import data from other formats. To get
- started doing so, dig into ImportTsv.java and check the JavaDoc for
- HFileOutputFormat.
-
- The import step of the bulk load can also be done programmatically. See the
- LoadIncrementalHFiles class for more information.
-