Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
1.0.0
-
None
-
None
Description
HiveEndPoint represents a way to write to a specific partition transactionally.
Each HiveEndPoint creates TransactionBatch(es) and commits transactions.
Suppose you have 10 instances of Storm Hive bolt using Streaming API.
Each instance will create HiveEndPoints on demand when it sees an event for particular partition value.
If events are uniformly distributed wrt partition values and the table has 1000 partitions (for example it's partitioned by CustomerId), each of 10 bolt instances may create 1000 HiveEndPoints and thus > 10,000 (actually 10K * num_txn_per_batch) concurrent transactions.
This creates huge amount of Metastore traffic.
HIVE-11672 is investigating how some sort of "shuffle" phase can be added route events for a particular bucket to the same bolt instance.
The same idea should explored to route events based on partition value.
Attachments
Issue Links
- is related to
-
HIVE-11672 Hive Streaming API handles bucketing incorrectly
- Resolved