IMPALA-7360

Avro scanner sometimes skips blocks when skip marker is on HDFS block boundary

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 2.10.0, Impala 2.11.0, Impala 3.0, Impala 2.12.0
    • Fix Version/s: Impala 3.1.0
    • Component/s: Backend

    Description

      The Avro changes in IMPALA-3905 introduced a correctness bug. It can be hit organically on a large Avro file in which a 16-byte sync marker straddles a block boundary. In that case the block after the sync marker may not be scanned, so a few records go missing.
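
To make the failure mode concrete, here is a toy model in Python. It is not Impala's scanner code. It assumes a simplified file layout (a header ending in a sync marker, then data blocks each followed by a 16-byte marker) and a hypothetical ownership rule: a block belongs to the scan range in which the marker preceding it starts. The `consistent=False` path models one hypothetical way neighbouring ranges could disagree about a straddling marker. It is not a claim about the actual IMPALA-3905 logic.

```python
SYNC_LEN = 16  # Avro sync markers are 16 bytes long


def marker_offsets(header_len, block_lens):
    """Start offset of each sync marker; marker 0 closes the file header."""
    offsets = [header_len - SYNC_LEN]
    for data_len in block_lens:
        offsets.append(offsets[-1] + SYNC_LEN + data_len)
    return offsets


def scan_ranges(file_len, range_len):
    """Split [0, file_len) into consecutive half-open ranges of range_len bytes."""
    return [(s, min(s + range_len, file_len))
            for s in range(0, file_len, range_len)]


def blocks_read(markers, ranges, consistent=True):
    """Indices of the blocks read across all ranges.

    Block k is the data that follows marker k; the last marker ends the file.
    consistent=True:  a range reads block k if marker k starts inside it.
    consistent=False: (hypothetical mismatch) a range additionally gives up
                      block k when marker k does not also end inside it, while
                      the next range still only looks for markers that start
                      inside it, so nobody reads the block.
    """
    read = []
    for start, end in ranges:
        for k, m in enumerate(markers[:-1]):
            starts_here = start <= m < end
            ends_here = m + SYNC_LEN <= end
            if starts_here and (consistent or ends_here):
                read.append(k)
    return read
```

With a 32-byte header and three 100-byte blocks, the markers start at offsets 16, 132, 248 and 364. A range length of 140 puts marker 1 (bytes 132 to 147) across the first boundary: the consistent rule still reads blocks 0, 1 and 2, while the mismatched rule reads only 0 and 2. A range length of 150 straddles no marker, and both rules read every block.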

      It can be reproduced on our test data by varying max_scan_range_length until count(*) returns fewer rows. In the session below, 255k returns 536 fewer rows than 256k.

      [localhost:21000] default> set max_scan_range_length=256k; select count(*) from tpch_avro_snap.lineitem;
      MAX_SCAN_RANGE_LENGTH set to 256k
      Query: select count(*) from tpch_avro_snap.lineitem
      Query submitted at: 2018-07-26 10:08:21 (Coordinator: http://tarmstrong-box:25000)
      Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=5142ec7a702e67ac:b6882a6f00000000
      +----------+
      | count(*) |
      +----------+
      | 6001215  |
      +----------+
      Fetched 1 row(s) in 6.77s
      [localhost:21000] default> set max_scan_range_length=255k; select count(*) from tpch_avro_snap.lineitem;
      MAX_SCAN_RANGE_LENGTH set to 255k
      Query: select count(*) from tpch_avro_snap.lineitem
      Query submitted at: 2018-07-26 10:08:31 (Coordinator: http://tarmstrong-box:25000)
      Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=3d40e63dacaac65b:99d17eaf00000000
      +----------+
      | count(*) |
      +----------+
      | 6000679  |
      +----------+
      Fetched 1 row(s) in 1.33s
      

      We do have test coverage in TestScanRangeLengths that exercises the code with Avro blocks straddling scan ranges. However, the necessary condition for this bug is that the scan range includes a full Avro block, followed by a sync marker on the boundary with the next scan range. We need to add test coverage for a wider range of values here: larger files and larger scan ranges.
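
A sweep over scan range lengths can be sketched as below. This is a minimal Python sketch, not the existing TestScanRangeLengths code. `find_count_mismatches` and the `run_count` callable are hypothetical names. In a real test, `run_count` would set max_scan_range_length to the given value and run the count(*) query. Here it is replaced by a stand-in that replays the two results from this report.

```python
from typing import Callable, Dict, Iterable


def find_count_mismatches(run_count: Callable[[str], int],
                          range_lengths: Iterable[str],
                          expected: int) -> Dict[str, int]:
    """Return {range_length: count} for every length whose count != expected.

    run_count is a hypothetical hook: given a max_scan_range_length value,
    it runs the count(*) query with that setting and returns the row count.
    """
    mismatches = {}
    for length in range_lengths:
        count = run_count(length)
        if count != expected:
            mismatches[length] = count
    return mismatches


# Stand-in for a real query runner, replaying the results quoted above.
observed = {"256k": 6001215, "255k": 6000679}
bad = find_count_mismatches(observed.__getitem__, ["256k", "255k"],
                            expected=6001215)
```

With the replayed data, `bad` contains only the 255k entry. A real sweep would cover many more lengths and larger files, since the bug shows up only when a marker happens to land on a range boundary.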

            People

              Assignee: Tim Armstrong (tarmstrong)
              Reporter: Tim Armstrong (tarmstrong)