[HIVE-7239] Fix bug in HiveIndexedInputFormat implementation that causes incorrect query result when input backed by Sequence/RC files - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.1.0
Fix Version/s: 2.2.0
Component/s: Indexing
Labels: None

Description

For sequence files, it is crucial that splits are computed along the boundaries enforced by the input sequence file. By default, however, Hadoop creates input splits from configuration parameters, and these may not match the boundaries of the input sequence file. Hive provides HiveIndexedInputFormat, which adds extra logic and recalculates the boundaries of each split according to the sequence file's boundaries.

However, we noticed "over"-reporting in query results for data backed by sequence files. We experimented on a sample data set and fixed the bug, and we verified the fix by comparing the query output when the input is in sequence file, RC file, and regular format. However, we have not been able to find the right place to include this as a unit test that would execute as part of the Hive tests. We tried writing a "clientpositive" test in the ql module, but the output is quite verbose and I could not interpret it well. Can someone please review this change and advise on how to write a test that will execute as part of Hive testing?
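
To make the split-boundary issue concrete, here is a minimal sketch in plain Java. It is not the code from the attached patches and does not use any Hive or Hadoop classes: the class `SplitAlignment`, its methods, and the offsets are all hypothetical, and the "file" is modeled as nothing more than a sorted array of record-start offsets. It assumes that over-reporting comes from a record being read by two neighboring splits, which is one plausible way such a symptom can arise. In Hadoop itself, `SequenceFile.Reader.sync(long)` plays a role similar to `nextBoundary` below.

```java
// Hypothetical, simplified model: NOT Hive's actual implementation.
final class SplitAlignment {

    // Smallest record boundary at or after offset, or fileLength if none exists.
    static long nextBoundary(long[] boundaries, long offset, long fileLength) {
        for (long b : boundaries) {
            if (b >= offset) {
                return b;
            }
        }
        return fileLength;
    }

    // Moves both ends of the raw split [start, start + length) forward to the
    // next record boundary. Returns {alignedStart, alignedEnd}.
    static long[] align(long[] boundaries, long start, long length, long fileLength) {
        long alignedStart = nextBoundary(boundaries, start, fileLength);
        long alignedEnd = nextBoundary(boundaries, start + length, fileLength);
        return new long[] {alignedStart, alignedEnd};
    }

    // Counts records whose start offset lies in [start, end).
    static int countStartingIn(long[] boundaries, long start, long end) {
        int n = 0;
        for (long b : boundaries) {
            if (b >= start && b < end) {
                n++;
            }
        }
        return n;
    }

    // Counts records that overlap [start, end) at all. A record straddling a
    // split edge is counted by both neighbors, which over-reports.
    static int countOverlapping(long[] boundaries, long start, long end, long fileLength) {
        int n = 0;
        for (int i = 0; i < boundaries.length; i++) {
            long recStart = boundaries[i];
            long recEnd = (i + 1 < boundaries.length) ? boundaries[i + 1] : fileLength;
            if (recStart < end && recEnd > start) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        long[] boundaries = {0, 100, 250, 400};   // four records (made-up offsets)
        long fileLength = 500;
        long[][] rawSplits = {{0, 200}, {200, 200}, {400, 100}}; // {start, length}

        int naive = 0;
        int aligned = 0;
        for (long[] s : rawSplits) {
            naive += countOverlapping(boundaries, s[0], s[0] + s[1], fileLength);
            long[] a = align(boundaries, s[0], s[1], fileLength);
            aligned += countStartingIn(boundaries, a[0], a[1]);
        }
        // The file holds 4 records; the naive count is 5, the aligned count is 4.
        System.out.println("naive=" + naive + " aligned=" + aligned);
    }
}
```

In this model the record spanning offsets 100 to 250 straddles the edge between the first two raw splits, so the overlap-based count sees it twice, while the aligned splits assign it to exactly one reader. A regression test for the real fix could follow the same idea: run the same query over the same rows stored in each file format and compare the row counts.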

Attachments

- HIVE-7239.2.patch, 21/Jul/16 19:13, 36 kB, Illya Yalovyy
- HIVE-7239.3.patch, 25/Jul/16 21:35, 36 kB, Illya Yalovyy
- HIVE-7239.4.patch, 27/Jul/16 05:09, 36 kB, Illya Yalovyy
- HIVE-7239.patch, 16/Jun/14 18:49, 7 kB, Sumit Kumar

Issue Links

links to

Code review

People

Assignee: Illya Yalovyy

Reporter: Sumit Kumar

Votes: 0

Watchers: 3

Dates

Created: 16/Jun/14 18:47

Updated: 26/Jul/17 03:30

Resolved: 09/Aug/16 19:44