Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
When a Hive table is backed by an HBase table, the following access pattern shows up many times in ZooKeeper, even for a simple query like "SELECT * FROM table":
- A client connects to ZooKeeper
- Checks whether the /hbase ZNode exists
- Reads /hbase/hbaseid
- The client closes the connection.
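For illustration only, the same short-lived session can be reproduced with the plain ZooKeeper client roughly as follows; the connect string and timeout are placeholders, and this is just a sketch of the observed reads, not the code path HBase actually runs:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooKeeper;

public class ZkAccessPattern {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // "localhost:2181" and the 30 s timeout are placeholders for the real quorum settings.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {
            if (event.getState() == KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();
        zk.exists("/hbase", false);                // check that the /hbase ZNode exists
        zk.getData("/hbase/hbaseid", false, null); // read the cluster id
        zk.close();                                // the session lives only a few milliseconds
    }
}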
The number of these accesses depends on the amount of data; most likely it correlates with the number of HBase regions.
The same access pattern can be observed in ZK when running the following Java code:
import org.apache.hadoop.hbase.client.*;

public class Test {
    public static void main(String[] args) throws Exception {
        Connection c = ConnectionFactory.createConnection();
        c.close();
    }
}
The problem is that for large tables this creates an enormous number of session creations, which is expensive in ZooKeeper. If the table is queried frequently, the ZooKeeper transaction log is written heavily, and far more snapshots are created than otherwise because of the volume of createSession/closeSession transactions. In this particular case roughly 24 GB of data accumulated under the ZooKeeper data directory and almost filled the device; about 90% of the data written was createSession and closeSession transactions.
I am not sure which logs to provide, but the behaviour is easy enough to reproduce. With DEBUG-level logging enabled in ZooKeeper, the logs show what each session reads; these sessions live for 1-5 ms at most.
I imagine the solution is to share the connection object between the mappers if possible, following the suggestion in the ConnectionFactory API documentation: use one connection and request Table/Admin/other objects from it, or at least use a single connection object per map/reduce task and keep it alive for the whole task lifetime.
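As an illustration only, not the actual Hive HBase handler code, a mapper could hold a single Connection for its whole lifetime roughly like the sketch below; the class name, table name, and mapper key/value types are placeholders:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: one Connection for the whole task lifetime; Table instances
// are requested from it per use, as the ConnectionFactory documentation suggests.
public class SharedConnectionMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Connection connection;

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = HBaseConfiguration.create(context.getConfiguration());
        // One ZooKeeper session per task instead of one per region/record.
        connection = ConnectionFactory.createConnection(conf);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try (Table table = connection.getTable(TableName.valueOf("my_table"))) {
            // use the table; only the Table is closed here, not the shared connection
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        if (connection != null) {
            connection.close(); // the single ZooKeeper session is closed once per task
        }
    }
}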