  1. Apache Drill
  2. DRILL-5941

Skip header / footer logic works incorrectly for Hive tables when file has several input splits

    Details

      Description

      To reproduce
      1. Create a CSV file with two columns (key, value) and 3000029 rows, where the first row is a header.
      The data file should be larger than the chunk size of 256 MB. Copy the file to the distributed file system.

      2. Create table in Hive:

      CREATE EXTERNAL TABLE `h_table`(
        `key` bigint,
        `value` string)
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
      STORED AS INPUTFORMAT
        'org.apache.hadoop.mapred.TextInputFormat'
      OUTPUTFORMAT
        'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
      LOCATION
        'maprfs:/tmp/h_table'
      TBLPROPERTIES (
       'skip.header.line.count'='1');
      

      3. Execute the query select * from hive.h_table in Drill (querying the data through the Hive plugin). The result contains fewer rows than expected. The expected row count is 3000028 (the total row count minus the one header row).
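
      For step 1, a data file of the required shape could be generated with a short script. This is only a sketch: the function name write_sample_csv, the file name h_table.csv, the padded value format and the default width of 240 characters are assumptions chosen so that 3000029 rows comfortably exceed 256 MB; none of them come from the original report.

```python
import csv
import os
import tempfile


def write_sample_csv(path, total_rows, value_width=240):
    """Write one header row plus (total_rows - 1) data rows of key,value.

    value_width is a hypothetical padding width; at 240 characters each row
    is roughly 250 bytes, so 3000029 rows are well above 256 MB.
    Returns the number of data rows written.
    """
    with open(path, "w", newline="") as f:
        # Use "\n" line endings so the plain text input format sees clean rows.
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["key", "value"])
        for key in range(1, total_rows):
            # Values contain no commas, so no quoting is introduced.
            writer.writerow([key, f"v{key}".ljust(value_width, "x")])
    return total_rows - 1


# Small demonstration with 5 rows. For the reproduction itself one would call
# write_sample_csv("h_table.csv", 3000029) and copy the file to the DFS.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    data_rows = write_sample_csv(demo_path, total_rows=5, value_width=8)
    with open(demo_path) as f:
        demo_lines = f.read().splitlines()
```

      The header is counted as part of total_rows, which matches the report: 3000029 rows in the file, 3000028 of them data.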

      The root cause
      Since the file is larger than the default chunk size, it is split into several fragments, known as input splits. For example:

      maprfs:/tmp/h_table/h_table.csv:0+268435456
      maprfs:/tmp/h_table/h_table.csv:268435457+492782112
      

      TextHiveReader is responsible for handling the skip header and / or footer logic.
      Currently Drill creates a reader for each input split, and the skip header and / or footer logic is applied to every input split. Ideally, the input splits above should be read by a single reader so that the skip header / footer logic is applied correctly.
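
      The effect of the root cause can be illustrated with a simplified model. This is not Drill's TextHiveReader code: real input splits are byte ranges rather than line ranges, and the helper names below (split_lines, read_per_split, read_per_file) are hypothetical. The sketch only shows why skipping the header once per split drops data rows.

```python
def split_lines(lines, split_size):
    """Cut a list of lines into consecutive chunks that stand in for input splits."""
    return [lines[i:i + split_size] for i in range(0, len(lines), split_size)]


def read_per_split(splits, header_count):
    """Model of the reported behaviour: each split's reader skips header_count lines."""
    rows = []
    for split in splits:
        rows.extend(split[header_count:])
    return rows


def read_per_file(splits, header_count):
    """Model of the intended behaviour: the header is skipped once per file."""
    rows = [line for split in splits for line in split]
    return rows[header_count:]


# One header row plus nine data rows, cut into two splits of five lines each.
lines = ["key,value"] + [f"{i},v{i}" for i in range(1, 10)]
splits = split_lines(lines, 5)

buggy = read_per_split(splits, header_count=1)
correct = read_per_file(splits, header_count=1)

print(len(splits), len(buggy), len(correct))
```

      In this model the per-split reader returns 8 rows instead of 9, because the first line of the second split ("5,v5") is treated as a header. With one header line, each split beyond the first costs one data row.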


              People

              • Assignee:
                arina Arina Ielchiieva
                Reporter:
                arina Arina Ielchiieva
                Reviewer:
                Padma Penumarthy