[HIVE-15852] Tablesampling on Tez in low-record case throws ArrayIndexOutOfBoundsException - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.1.1
Fix Version/s: None
Component/s: Tez
Labels: None

Description

HIVE-13040 ( https://issues.apache.org/jira/browse/HIVE-13040 ) stopped Hive on Tez from creating empty files to represent empty buckets. As a result, a couple of things are broken.

First, if there are empty buckets (which is possible with large datasets in the partitioned-bucketed case), tablesampling fails when the query references a bucket number higher than the number of files.
For example, suppose partition 'p' of table 't' holds three rows, and 't' is clustered into ten buckets. Even if every row hashes to a different bucket, only three bucket files are created. Running select * from t tablesample (bucket x out of 10) where <selecting from p> with x > 3 then throws an ArrayIndexOutOfBoundsException, because Hive assumes there are only three buckets.
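The failure mode can be illustrated with a small model. This is only a sketch and not Hive's actual code: the function names, the integer-modulo stand-in for the bucketing hash, and the `000001_0`-style file names are assumptions made for illustration. It shows how picking a bucket's file by its position in the file listing breaks once empty buckets no longer produce files.

```python
def write_bucket_files(keys, num_buckets, create_empty):
    """Return the sorted bucket file names a writer would produce.

    Hypothetical model: a row with integer key k lands in bucket
    k % num_buckets (a stand-in for the real bucketing hash).
    """
    non_empty = {k % num_buckets for k in keys}
    buckets = range(num_buckets) if create_empty else sorted(non_empty)
    return [f"{b:06d}_0" for b in buckets]


def sample_by_position(files, x):
    """Naive sampling: treat the x-th file (1-based) as bucket x."""
    return files[x - 1]


# Three rows, ten buckets, no files written for empty buckets.
files = write_bucket_files(keys=[1, 4, 7], num_buckets=10, create_empty=False)
print(files)

try:
    sample_by_position(files, 5)  # bucket 5 out of 10
except IndexError as err:
    # Python's analogue of ArrayIndexOutOfBoundsException
    print("sampling failed:", err)
```

With `create_empty=True` the same call succeeds, because the listing then contains one file per bucket.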

Second, other applications (such as Pig) may assume that the number of files equals the number of buckets.

Possible fixes:

- Revert HIVE-13040
- Change how tablesampling is implemented to accept the possibility that number of files != number of buckets
  - Would require coordination across projects to change assumptions
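The second option could look roughly like the sketch below: instead of relying on a file's position in the listing, the sampler recovers the bucket id from each file name and treats a missing file as an empty bucket. The `<bucket id>_<attempt>` naming pattern and all function names here are assumptions for illustration, not Hive's implementation.

```python
import re

# Assumed naming convention: zero-padded bucket id, underscore, attempt number.
BUCKET_FILE = re.compile(r"^(\d+)_\d+$")


def index_by_bucket_id(files):
    """Map bucket id -> file name, skipping names that do not match."""
    by_id = {}
    for name in files:
        match = BUCKET_FILE.match(name)
        if match:
            by_id[int(match.group(1))] = name
    return by_id


def sample_bucket(files, x, num_buckets):
    """Return the file for 'bucket x out of num_buckets', or None if empty."""
    if not 1 <= x <= num_buckets:
        raise ValueError(f"bucket {x} is outside 1..{num_buckets}")
    # TABLESAMPLE bucket numbers are 1-based; bucket ids in file names are 0-based.
    return index_by_bucket_id(files).get(x - 1)


files = ["000001_0", "000004_0", "000007_0"]
print(sample_bucket(files, 5, 10))  # file for bucket id 4
print(sample_bucket(files, 9, 10))  # None: the bucket is empty, not an error
```

A lookup like this makes the sampler tolerant of missing files, but it does not help other applications that count files, which is why the coordination concern above still applies.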

Things to consider:

- What performance gains are there from not creating empty files?
- If the gains are large, are we willing to lose them by reverting HIVE-13040?
- How else can we avoid creating unnecessary files while still maintaining the invariants other applications expect?

People

Assignee: Unassigned

Reporter: Thomas Poepping

Votes: 0

Watchers: 3

Dates

Created: 08/Feb/17 20:13

Updated: 14/Feb/17 00:16