diff --git src/main/docbkx/troubleshooting.xml src/main/docbkx/troubleshooting.xml
index 3d480d3..481d061 100644
--- src/main/docbkx/troubleshooting.xml
+++ src/main/docbkx/troubleshooting.xml
@@ -1244,6 +1244,215 @@ sv4r6s38: at org.apache.hadoop.security.UserGroupInformation.ensureInitial
+
+ HBase and HDFS
+ General configuration guidance for Apache HDFS is outside the scope of this guide. Refer to
+ the documentation available at http://hadoop.apache.org/ for extensive
+ information about configuring HDFS. This section deals with HDFS as it relates to HBase.
+
+ In most cases, HBase stores its data in Apache HDFS. This includes the HFiles containing
+ the data, as well as the write-ahead logs (WALs), which store data before it is written to the
+ HFiles and protect against data loss if a RegionServer crashes. HDFS provides reliability and
+ protection to data in HBase because it is distributed. To operate most efficiently, HBase needs
+ data to be available locally. Therefore, it is a good practice to run an HDFS datanode on each
+ RegionServer.
+
+ Important Information and Guidelines for HBase and HDFS
+
+ HBase is a client of HDFS.
+
+ HBase is an HDFS client, using the HDFS DFSClient class, and references
+ to this class appear in HBase logs alongside other HDFS client log messages.
+
+
+
+ Configuration is necessary in multiple places.
+
+ Some HDFS settings that affect HBase must be configured on the HDFS (server) side.
+ Others must be configured within HBase (on the client side). Still other settings must be
+ set on both the server and the client side, as illustrated in the sketch below.
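+ As a minimal sketch of this split (file names and placement are assumptions that depend
+ on your deployment), a client-side HDFS setting goes into the configuration that HBase
+ reads, while a datanode setting goes into hdfs-site.xml on the HDFS side:
+<!-- hbase-site.xml (read by HBase, the HDFS client); placement is an assumption -->
+<property>
+  <name>dfs.client.socket-timeout</name>
+  <value>60000</value>
+</property>
+
+<!-- hdfs-site.xml (read by the datanodes) -->
+<property>
+  <name>dfs.datanode.ipc.address</name>
+  <value>0.0.0.0:50020</value>
+</property>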
+
+
+
+
+ Write errors which affect HBase may be logged in the HDFS logs rather than HBase logs.
+
+ When writing, HDFS pipelines communications from one datanode to another. HBase
+ communicates with both the HDFS namenode and the datanodes, using the HDFS client classes.
+ Communication problems between datanodes are logged in the HDFS logs, not the HBase
+ logs.
+ HDFS writes are local whenever possible. HBase RegionServers should not
+ experience many write errors, because they write to the local datanode. If the datanode
+ cannot replicate the blocks, the errors are logged in HDFS, not in the HBase
+ RegionServer logs.
+
+
+
+ HBase communicates with HDFS using two different ports.
+
+ HBase communicates with datanodes using the ipc.Client interface and
+ the DataNode class. References to these will appear in HBase logs. Each of
+ these communication channels uses a different port (50010 and 50020 by default). The
+ ports are configured in the HDFS configuration, via the
+ dfs.datanode.address and dfs.datanode.ipc.address
+ parameters, as in the example below.
+
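+ A minimal hdfs-site.xml sketch showing these two ports with their stock defaults (the
+ values shown are the Hadoop 2.x defaults, not a tuning recommendation):
+<!-- hdfs-site.xml: datanode data-transfer port and IPC port (defaults shown) -->
+<property>
+  <name>dfs.datanode.address</name>
+  <value>0.0.0.0:50010</value>
+</property>
+<property>
+  <name>dfs.datanode.ipc.address</name>
+  <value>0.0.0.0:50020</value>
+</property>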
+
+
+ Errors may be logged in HBase, HDFS, or both.
+
+ When troubleshooting HDFS issues in HBase, check logs in both places for errors.
+
+
+
+ HDFS takes a while to mark a node as dead. You can configure HDFS to avoid using stale
+ datanodes.
+
+ By default, HDFS does not mark a node as dead until it is unreachable for 630
+ seconds. In Hadoop 1.1 and Hadoop 2.x, this can be alleviated by enabling checks for
+ stale datanodes, though these checks are disabled by default. You can enable the check for
+ reads and writes separately, via the dfs.namenode.avoid.read.stale.datanode and
+ dfs.namenode.avoid.write.stale.datanode settings. A stale datanode is one
+ that has not been reachable for dfs.namenode.stale.datanode.interval
+ (default is 30 seconds). Stale datanodes are avoided, and marked as the last possible
+ target for a read or write operation. For configuration details, see the HDFS
+ documentation and the sketch below.
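+ A minimal hdfs-site.xml sketch for the namenode that enables both checks; the interval
+ shown is the 30-second default, expressed in milliseconds:
+<!-- hdfs-site.xml on the namenode: avoid stale datanodes for reads and writes -->
+<property>
+  <name>dfs.namenode.avoid.read.stale.datanode</name>
+  <value>true</value>
+</property>
+<property>
+  <name>dfs.namenode.avoid.write.stale.datanode</name>
+  <value>true</value>
+</property>
+<property>
+  <name>dfs.namenode.stale.datanode.interval</name>
+  <value>30000</value>
+</property>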
+
+
+
+ Settings for HDFS retries and timeouts are important to HBase.
+
+ You can configure settings for various retries and timeouts. Some of the settings
+ important to HBase are listed here, with defaults current as of Hadoop 2.3. Always refer to
+ the Hadoop documentation for the most current values and recommendations. A configuration
+ sketch with these defaults follows the Retries list below.
+
+ Retries
+
+ ipc.client.connect.max.retries (default: 10)
+
+ The number of times a client will attempt to establish a connection with the
+ server. This value sometimes needs to be increased. You can specify a different
+ setting for the maximum number of retries if a timeout occurs (see the next entry). For SASL
+ connections, the number of retries is hard-coded at 15 and cannot be
+ configured.
+
+
+ ipc.client.connect.max.retries.on.timeouts (default: 45)
+ The number of times a client will attempt to establish a connection
+ with the server in the event of a timeout. If some retries are due to timeouts and
+ some are due to other reasons, this counter is added to
+ ipc.client.connect.max.retries, so the maximum number of retries for
+ all reasons could be the combined value.
+
+
+ dfs.client.block.write.retries (default: 3)
+ How many times the client attempts to write to the datanode. After the
+ number of retries is reached, the client reconnects to the namenode to get a new
+ datanode location. You can try increasing this value.
+
+
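+ A minimal sketch of the retry settings above with their Hadoop 2.3 defaults (values are
+ shown for illustration only; because HBase is the HDFS client, placing them in the
+ configuration HBase reads, such as hbase-site.xml, is an assumption of this sketch):
+<!-- Client-side retry settings; the values are the defaults listed above -->
+<property>
+  <name>ipc.client.connect.max.retries</name>
+  <value>10</value>
+</property>
+<property>
+  <name>ipc.client.connect.max.retries.on.timeouts</name>
+  <value>45</value>
+</property>
+<property>
+  <name>dfs.client.block.write.retries</name>
+  <value>3</value>
+</property>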
+
+ HDFS Heartbeats
+ HDFS heartbeats take place entirely on the HDFS side, between the namenode and the datanodes.
+
+ dfs.heartbeat.interval (default: 3)
+ The interval, in seconds, at which a datanode sends heartbeats to the namenode.
+
+
+ dfs.namenode.heartbeat.recheck-interval (default: 300000)
+
+ The interval of time, in milliseconds, between heartbeat checks. The total time before a node is
+ marked as dead is determined by the following formula, which with the defaults
+ works out to 2 * 300000 + 10 * 1000 * 3 = 630000 milliseconds, or 10 minutes and 30 seconds:
+ 2 * (dfs.namenode.heartbeat.recheck-interval) + 10 * 1000 * (dfs.heartbeat.interval)
+
+
+
+ dfs.namenode.stale.datanode.interval (default: 30000)
+
+ How long (in milliseconds) a node can go without a heartbeat before it is
+ determined to be stale, if the stale datanode checks described above are
+ enabled (they are disabled by default).
+
+
+
+
+
+
+ Connection Timeouts
+ Connection timeouts occur between the client (HBase) and the HDFS datanode. They may
+ occur when establishing a connection, attempting to read, or attempting to write. The two
+ settings below are used in combination, and affect connections between the DFSClient and the
+ datanode, the ipc.Client and the datanode, and communication between two datanodes. A
+ configuration sketch follows the list.
+
+ dfs.client.socket-timeout (default: 60000)
+
+ The amount of time before a client connection times out when establishing a
+ connection or reading. The value is expressed in milliseconds, so the default is 60
+ seconds.
+
+
+
+ dfs.datanode.socket.write.timeout (default: 480000)
+
+ The amount of time before a write operation times out. The default is 8
+ minutes, expressed as milliseconds.
+
+
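+ A minimal sketch of the two timeout settings with their defaults, in milliseconds. Because
+ they affect both client-to-datanode and datanode-to-datanode communication, setting them
+ consistently on both the HBase and HDFS sides is assumed here:
+<!-- Connection timeout settings; the values are the defaults discussed above -->
+<property>
+  <name>dfs.client.socket-timeout</name>
+  <value>60000</value>
+</property>
+<property>
+  <name>dfs.datanode.socket.write.timeout</name>
+  <value>480000</value>
+</property>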
+
+
+ Typical Error Logs
+ The following types of errors are often seen in the logs.
+
+ INFO hdfs.DFSClient: Failed to connect to /xxx:50010, add to deadNodes and
+ continue java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel
+ to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending
+ remote=/region-server-1:50010]
+
+ All datanodes for a block are dead, and recovery is not possible. Here is the
+ sequence of events that leads to this error:
+
+
+ The client attempts to connect to a dead datanode.
+
+
+ The connection fails, so the client moves down the list of datanodes and tries
+ the next one. It also fails.
+
+
+ When the client exhausts its entire list, it sleeps for 3 seconds and requests a
+ new list. It is very likely to receive the exact same list as before, in which case
+ the error occurs again.
+
+
+
+
+
+ INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
+ java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be
+ ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/
+ xxx:50010]
+
+ This type of error indicates a write issue. In this case, the master wants to split
+ the log. It does not have a local datanode, so it tries to connect to a remote datanode,
+ but the datanode is dead.
+ In this situation, there will be three retries (by default). If all retries fail, a
+ message like the following is logged:
+
+WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block
+
+ If the operation was an attempt to split the log, the following type of message may
+ also appear:
+
+FATAL wal.HLogSplitter: WriterThread-xxx Got while writing log entry to log
+
+
+
+
+
+
Running unit or integration tests