[DRILL-5941] Skip header / footer logic works incorrectly for Hive tables when file has several input splits - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.11.0
Fix Version/s: 1.12.0
Component/s: Storage - Hive
Labels:
- ready-to-commit

Description

To reproduce
1. Create a CSV file with two columns (key, value) and 3000029 rows, where the first row is a header.
The data file should be larger than the chunk size of 256 MB. Copy the file to the distributed file system.

2. Create a table in Hive:

CREATE EXTERNAL TABLE `h_table`(
  `key` bigint,
  `value` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'maprfs:/tmp/h_table'
TBLPROPERTIES (
 'skip.header.line.count'='1');

3. Execute the query select * from hive.h_table in Drill (querying the data through the Hive plugin). The result contains fewer rows than expected. The expected result is 3000028 rows (the total count minus the one header row).
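
Step 1 can be sketched with a short Python script. This is only an illustration: the function name write_sample_csv, the output file name, and the padding width are assumptions and are not taken from the original report. With the default padding of 90 characters per value, each data row is at least 93 bytes, so the full 3000029-row file should come out above 256 MB.

```python
import csv


def write_sample_csv(path, total_rows=3_000_029, value_width=90):
    """Write one header row followed by (total_rows - 1) data rows.

    value_width is a hypothetical padding width, chosen only so that
    the default file ends up larger than a 256 MB chunk.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["key", "value"])  # the header that Hive should skip
        for key in range(1, total_rows):
            # Pad the value so that every row has a predictable minimum size.
            writer.writerow([key, f"value_{key}".ljust(value_width, "x")])


# Example call (writes a large file, so it is left commented out):
# write_sample_csv("h_table.csv")
```

The resulting file can then be copied to the table location, for example with hadoop fs -put, before creating the table in step 2.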

The root cause
Since the file is larger than the default chunk size, it is split into several fragments, known as input splits. For example:

maprfs:/tmp/h_table/h_table.csv:0+268435456
maprfs:/tmp/h_table/h_table.csv:268435457+492782112

TextHiveReader is responsible for handling the skip header and / or footer logic.
Currently Drill creates a reader for each input split, and the skip header and / or footer logic is applied to each input split. Ideally, the above-mentioned input splits should be read by one reader, so that the skip header / footer logic is applied correctly.

Issue Links

incorporates: DRILL-5106 Refactor SkipRecordsInspector to exclude check for predefined file formats (Resolved)
relates to: DRILL-5991 Performance improvements for Hive tables with skip header / footer logic (Open)
links to: GitHub Pull Request #1030

People

Assignee: Arina Ielchiieva
Reporter: Arina Ielchiieva
Reviewer: Padma Penumarthy
Votes: 0
Watchers: 3

Dates

Created: 07/Nov/17 16:06
Updated: 19/Jan/18 12:16
Resolved: 22/Nov/17 22:56