Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Duplicate
- Affects Version/s: 1.2.1
- Fix Version/s: None
- Component/s: None
Description
The Hive Streaming API lets clients obtain a random bucket and then insert data into it. This leads to incorrect bucketing, because Hive expects data to be distributed across buckets by a hash function applied to the bucket key. Clients currently insert data at random; they have no way of:
- Knowing which bucket a row (tuple) belongs to
- Asking for a specific bucket
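For reference, a hash-bucketed table expects each row to land in bucket hash(bucket key) mod numBuckets. Below is a minimal, illustrative Java sketch of that computation, assuming a Java-style hash for the key; Hive's actual per-type hashing lives in ObjectInspectorUtils.
{code:java}
// Illustrative sketch only: which bucket a row is expected to land in.
// Assumes a Java-style hash for the bucket key; Hive's real per-type
// hashing is in ObjectInspectorUtils.
public final class BucketIdSketch {
  static int bucketFor(Object bucketKey, int numBuckets) {
    // Mask the sign bit so the modulo result is non-negative.
    return (bucketKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  public static void main(String[] args) {
    // For a table CLUSTERED BY (customer_id) INTO 4 BUCKETS, this row
    // belongs in exactly one bucket; a random choice is wrong 3 times in 4.
    System.out.println(bucketFor("cust-42", 4));
  }
}
{code}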
Optimizations such as Sort Merge Join and Bucket Map Join rely on the data being correctly distributed across buckets; they will produce incorrect read results if it is not.
There are two obvious design choices:
- Hive Streaming API should fix this internally by distributing the data correctly
- Hive Streaming API should expose data distribution scheme to the clients and allow them to distribute the data correctly
With the first option, every client thread would write to many buckets, producing many small files in each bucket and holding too many connections open; this does not seem feasible. The second option pushes more functionality into the clients of the Hive Streaming API, but it can maintain high throughput and write well-sized ORC files. This option seems preferable.
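To make the second option concrete, a client could compute each row's bucket with the table's hash-mod scheme and route rows to per-bucket writers, so each writer fills well-sized ORC files for a single bucket. The following is a hypothetical sketch only; BucketWriter stands in for whatever per-bucket connection the API would expose and is not an existing Hive class.
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a per-bucket streaming connection the API
// would need to expose; not an existing Hive class.
interface BucketWriter {
  void write(byte[] record);
}

public final class ClientSideBucketing {
  private final int numBuckets;
  private final Map<Integer, List<byte[]>> pending = new HashMap<>();

  ClientSideBucketing(int numBuckets) {
    this.numBuckets = numBuckets;
  }

  // Same hash-mod scheme the table's bucketing declares (see sketch above).
  int bucketFor(Object bucketKey) {
    return (bucketKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  // Group rows by target bucket so each bucket receives one large write
  // instead of every thread dribbling single rows into every bucket.
  void add(Object bucketKey, byte[] record) {
    pending.computeIfAbsent(bucketFor(bucketKey), b -> new ArrayList<>()).add(record);
  }

  void flush(Map<Integer, BucketWriter> writers) {
    pending.forEach((bucket, records) ->
        records.forEach(writers.get(bucket)::write));
    pending.clear();
  }
}
{code}
Batching per bucket before flushing is what keeps file sizes healthy: each thread hands a bucket's worth of rows to one writer rather than opening a connection per bucket per row.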
Issue Links
- blocks
  - STORM-1014 Use Hive Streaming API bucket info to bucket correctly (Resolved)
- relates to
  - HIVE-11683 Hive Streaming may overload the metastore (Open)
  - HIVE-11983 Hive streaming API uses incorrect logic to assign buckets to incoming records (Resolved)