Hive / HIVE-25765

skip.header.line.count property skips rows of each block in FetchOperator when file size is larger

Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.2, 4.0.0
    • None
    • None

    Description

      When the skip.header.line.count property is set in the table properties, simple select queries that get converted into a FetchTask skip rows at the start of each block instead of skipping the header lines once per file. This happens when the file is large enough to be read in multiple blocks. The issue doesn't occur when the select query is converted into a map-only job by setting hive.fetch.task.conversion to none, because there the header lines are skipped only for the first block because of this check. We should have a similar check in FetchOperator to avoid this issue.
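      The difference between the two behaviours can be illustrated with a small model. This is only a sketch and not Hive's code: the function names are hypothetical, splits are modelled as ranges of line indices rather than byte offsets, and the "first block" check is assumed to be a test that the split starts at offset 0.

```python
from typing import List, Tuple

Split = Tuple[int, List[str]]  # (index of first line in the file, lines in the split)


def split_lines(lines: List[str], num_splits: int) -> List[Split]:
    """Cut a file into contiguous splits of roughly equal size (hypothetical helper)."""
    size = max(1, -(-len(lines) // num_splits))  # ceiling division, at least 1
    return [(start, lines[start:start + size]) for start in range(0, len(lines), size)]


def read_skipping_every_split(splits: List[Split], header_count: int) -> List[str]:
    """Model of the reported bug: header lines are dropped from every split."""
    rows: List[str] = []
    for _start, chunk in splits:
        rows.extend(chunk[header_count:])
    return rows


def read_skipping_first_split_only(splits: List[Split], header_count: int) -> List[str]:
    """Model of the desired behaviour: skip only when the split starts the file."""
    rows: List[str] = []
    for start, chunk in splits:
        skip = header_count if start == 0 else 0  # assumed form of the check
        rows.extend(chunk[skip:])
    return rows


# One header line followed by 11 data rows, read as 3 splits of 4 lines each.
lines = ["header"] + [f"row{i}" for i in range(1, 12)]
splits = split_lines(lines, 3)

print(len(read_skipping_every_split(splits, 1)))       # 9: one row lost per split
print(len(read_skipping_first_split_only(splits, 1)))  # 11: only the header is dropped
```

      In this model the number of data rows lost grows with the number of splits, which matches the observation that the result depends on how many blocks the file is read in.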

       

      Steps to reproduce: 

      -- Create table on top of the data file (uncompressed size: ~239M) attached in this ticket
      CREATE EXTERNAL TABLE test_table(
        col1 string,
        col2 string,
        col3 string,
        col4 string,
        col5 string,
        col6 string,
        col7 string,
        col8 string,
        col9 string,
        col10 string,
        col11 string,
        col12 string)
      ROW FORMAT SERDE
        'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
      STORED AS INPUTFORMAT
        'org.apache.hadoop.mapred.TextInputFormat'
      OUTPUTFORMAT
        'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
      LOCATION
        'location_of_data_file'
      TBLPROPERTIES ('skip.header.line.count'='1');
      
      
      -- Counting number of rows gives correct result with only one header line skipped
      
      select count(*) from test_table;
      3145727
      
      -- Select query skips more rows, and the result depends on the number of blocks configured in the underlying filesystem. 3 rows are skipped when the file is read in 3 blocks.
      
      select * from test_table;
      .
      .
      Fetched 3145724 rows
       

      Attachments

        1. data.txt.gz
          950 kB
          Ganesha Shreedhara

            People

              Assignee: Ganesha Shreedhara (ganeshas)
              Reporter: Ganesha Shreedhara (ganeshas)
                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 40m