1. Create a CSV file with two columns (key, value) and 3000029 rows, where the first row is a header.
The data file must be larger than the chunk size of 256 MB. Copy the file to the distributed file system.
2. Create table in Hive:
CREATE EXTERNAL TABLE `h_table`( `key` bigint, `value` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'maprfs:/tmp/h_table' TBLPROPERTIES ( 'skip.header.line.count'='1');
3. Execute the query select * from hive.h_table in Drill (querying the data through the Hive plugin). The query returns fewer rows than expected. The expected result is 3000028 rows (total count minus the one header row).
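The data file from step 1 can be produced with a small script. The following is a minimal sketch, not part of the original report: the function name `write_test_csv`, the `value_width` parameter, and the padded value format are assumptions chosen only so that the file comfortably exceeds 256 MB.

```python
import csv
import os


def write_test_csv(path, total_rows, value_width=100):
    """Write one header row plus (total_rows - 1) data rows.

    Hypothetical helper: the padding width is an assumption used to push
    the file past the chunk size; it is not taken from the report.
    Returns the size of the written file in bytes.
    """
    with open(path, "w", newline="") as f:
        # Use "\n" so that no carriage return ends up in the value column.
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["key", "value"])
        padding = "v" * value_width
        for key in range(1, total_rows):
            writer.writerow([key, padding])
    return os.path.getsize(path)
```

Calling `write_test_csv("h_table.csv", 3000029)` with the default padding should give a file of a little over 300 MB, which can then be copied to the table location used in step 2.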
The root cause
Since the file is larger than the default chunk size, it is split into several fragments, known as input splits. For example:
TextHiveReader is responsible for handling skip header and / or footer logic.
Currently Drill creates a reader for each input split, and the skip header and / or footer logic is applied to every input split. Ideally, all input splits of one file should be read by a single reader, so that the skip header / footer logic is applied correctly.
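The effect of applying the skip logic per split can be illustrated with a toy model. This is a sketch under simplifying assumptions, not Drill's actual reader code: splits are modeled as a fixed number of lines rather than byte ranges, only header skipping is modeled, and the name `rows_returned` is hypothetical.

```python
def rows_returned(total_lines, lines_per_split, skip_header, per_split=True):
    """Count the rows left after header skipping.

    per_split=True models the behavior described in this issue: every
    input split drops `skip_header` lines. per_split=False models the
    intended behavior: lines are dropped once per file.
    """
    # Break the file into consecutive splits of at most lines_per_split lines.
    split_sizes = []
    remaining = total_lines
    while remaining > 0:
        size = min(lines_per_split, remaining)
        split_sizes.append(size)
        remaining -= size

    if per_split:
        # Each split loses its own "header", so data rows are lost too.
        return sum(max(size - skip_header, 0) for size in split_sizes)
    # A single reader for the whole file skips the header only once.
    return max(total_lines - skip_header, 0)
```

For a file of 10 lines cut into splits of 4 lines with one header line, `rows_returned(10, 4, 1)` gives 7, while `rows_returned(10, 4, 1, per_split=False)` gives the correct 9. In this model the number of lost rows grows with the number of splits.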
DRILL-5106 Refactor SkipRecordsInspector to exclude check for predefined file formats
- relates to
DRILL-5991 Performance improvements for Hive tables with skip header / footer logic