Details
- Bug
- Status: Resolved
- Blocker
- Resolution: Fixed
- Impala 2.10.0, Impala 2.11.0, Impala 3.0, Impala 2.12.0
Description
The Avro changes in IMPALA-3905 introduced a correctness bug. You can hit it organically if you have a large Avro file where the 16-byte sync marker straddles a block boundary. In that case the block after the sync marker may not be scanned, so a few records go missing from the result.
It's possible to reproduce on our test data by tweaking max_scan_range_length until you find a value where count(*) returns fewer rows.
[localhost:21000] default> set max_scan_range_length=256k; select count(*) from tpch_avro_snap.lineitem;
MAX_SCAN_RANGE_LENGTH set to 256k
Query: select count(*) from tpch_avro_snap.lineitem
Query submitted at: 2018-07-26 10:08:21 (Coordinator: http://tarmstrong-box:25000)
Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=5142ec7a702e67ac:b6882a6f00000000
+----------+
| count(*) |
+----------+
| 6001215  |
+----------+
Fetched 1 row(s) in 6.77s

[localhost:21000] default> set max_scan_range_length=255k; select count(*) from tpch_avro_snap.lineitem;
MAX_SCAN_RANGE_LENGTH set to 255k
Query: select count(*) from tpch_avro_snap.lineitem
Query submitted at: 2018-07-26 10:08:31 (Coordinator: http://tarmstrong-box:25000)
Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=3d40e63dacaac65b:99d17eaf00000000
+----------+
| count(*) |
+----------+
| 6000679  |
+----------+
Fetched 1 row(s) in 1.33s
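The manual search above could be automated along these lines. This is only a sketch: `count_query`, `find_undercounts`, and the `run_count` callable are hypothetical names, and actually executing the statement against a cluster is left to the caller. It also assumes the bug only ever drops rows, so the largest count seen is treated as the correct one.

```python
def count_query(table, length_kb):
    """Build the statement pair used in the session above for one length."""
    return (f"set max_scan_range_length={length_kb}k; "
            f"select count(*) from {table};")


def find_undercounts(run_count, lengths_kb):
    """Return (length_kb, count) pairs whose count is below the largest seen.

    run_count is a caller-supplied (hypothetical) callable that runs
    count_query(...) for the given length and returns the count as an int.
    Assumption: rows are only ever lost, so the maximum is the true count.
    """
    counts = [(kb, run_count(kb)) for kb in lengths_kb]
    if not counts:
        return []
    expected = max(count for _, count in counts)
    return [(kb, count) for kb, count in counts if count < expected]


if __name__ == "__main__":
    # Stand-in for a real cluster: the two counts reported in this issue.
    observed = {256: 6001215, 255: 6000679}
    print(count_query("tpch_avro_snap.lineitem", 255))
    print(find_undercounts(observed.get, [256, 255]))
```

Passing the query runner in as a callable keeps the sweep logic testable without a running Impala instance.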
We do have test coverage in TestScanRangeLengths that exercises the code with Avro blocks straddling scan ranges. However, the necessary condition for this bug is that the scan range includes a full Avro block, followed by a sync marker on the boundary with the next scan range. We need to add test coverage for a larger range of values here: larger files and larger scan ranges.
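To pick scan range lengths worth testing, the necessary condition described above can be checked offline against a file's sync marker positions. The sketch below is illustrative only: the function names are hypothetical, the marker offsets are assumed to be known already (for example from parsing the file), scan ranges are assumed to start at multiples of the scan range length, and "full block" is read as "the block's data starts inside the same scan range that the trailing marker begins in".

```python
SYNC_MARKER_SIZE = 16  # Avro sync markers are 16 bytes long


def straddled_boundary(marker_offset, scan_range_length):
    """Return the scan range boundary that falls inside the marker, or None."""
    # First multiple of scan_range_length strictly after the marker's first byte.
    boundary = (marker_offset // scan_range_length + 1) * scan_range_length
    if boundary < marker_offset + SYNC_MARKER_SIZE:
        return boundary
    return None


def risky_boundaries(sync_offsets, scan_range_length, data_start=0):
    """Return (marker_offset, boundary) pairs matching the bug's condition.

    sync_offsets: byte offsets where each block-terminating marker begins.
    data_start: hypothetical offset where the first block's data begins.
    """
    hits = []
    block_start = data_start
    for offset in sorted(sync_offsets):
        boundary = straddled_boundary(offset, scan_range_length)
        range_start = None if boundary is None else boundary - scan_range_length
        # Require a full block inside the range that the marker begins in.
        if boundary is not None and block_start >= range_start:
            hits.append((offset, boundary))
        # The next block's data begins right after this marker.
        block_start = offset + SYNC_MARKER_SIZE
    return hits


if __name__ == "__main__":
    markers = [1000, 2040, 3100]  # made-up offsets for illustration
    print(risky_boundaries(markers, 2048))
    print(risky_boundaries(markers, 1024))
```

With the made-up offsets, a length of 2048 flags the marker at 2040, while 1024 does not, because the block ending at that marker starts before the scan range does.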
Issue Links
- is broken by
  - IMPALA-3905 Single-threaded scan node (Resolved)
- is related to
  - IMPALA-7363 Spurious error generated by sequence file scanner with weird scan range length (Resolved)
  - IMPALA-8452 Avro scanner seems broken (Resolved)