Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
Impala 2.8.0
-
None
Description
Impala may return non-deterministic errors for certain corrupt Parquet files that are compressed. See the relevant snippet from BaseScalarColumnReader::ReadDataPage() below:
    if (decompressor_.get() != NULL) {
      SCOPED_TIMER(parent_->decompress_timer_);
      uint8_t* decompressed_buffer =
          decompressed_data_pool_->TryAllocate(uncompressed_size);
      if (UNLIKELY(decompressed_buffer == NULL)) {
        string details = Substitute(PARQUET_COL_MEM_LIMIT_EXCEEDED, "ReadDataPage",
            uncompressed_size, "decompressed data");
        return decompressed_data_pool_->mem_tracker()->MemLimitExceeded(
            parent_->state_, details, uncompressed_size);
      }
      RETURN_IF_ERROR(decompressor_->ProcessBlock32(true,
          current_page_header_.compressed_page_size, data_, &uncompressed_size,
          &decompressed_buffer));
      VLOG_FILE << "Decompressed " << current_page_header_.compressed_page_size
                << " to " << uncompressed_size;
      if (current_page_header_.uncompressed_page_size != uncompressed_size) {
        return Status(Substitute("Error decompressing data page in file '$0'. "
            "Expected $1 uncompressed bytes but got $2", filename(),
            current_page_header_.uncompressed_page_size, uncompressed_size));
      }
      data_ = decompressed_buffer;
      data_size = current_page_header_.uncompressed_page_size;
      data_end_ = data_ + data_size;
The 'decompressed_buffer' is not initialized, and decompressor_->ProcessBlock32() can succeed without writing to every byte of it. Later stages of the scan then read whatever bytes happened to be in the buffer, which leads to non-deterministic errors. For example, this may happen when 'compressed_page_size' is corrupt and set to 1.
We've seen the following errors being reported for files like this:
Could not read definition level, even though metadata states there are <some_number> values remaining in data page.
Corrupt Parquet file '<file>' <some_number> bytes of encoded levels but only <some_number> bytes left in page.
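The effect of the uninitialized buffer can be illustrated with a small standalone sketch. It does not use Impala's classes: FakeDecompress and DecompressPage are hypothetical names, and the stale contents of the scratch vector stand in for whatever a pool allocation might contain. Zeroing the buffer first is shown only as one possible way to make the outcome deterministic, not as the fix that was actually committed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a decompressor that reports success but writes
// only `bytes_written` bytes of output, as could happen when the compressed
// size in the page header is corrupt. Not Impala's codec interface.
bool FakeDecompress(uint8_t* out, std::size_t out_len, std::size_t bytes_written) {
  assert(bytes_written <= out_len);
  std::fill_n(out, bytes_written, uint8_t{0xAB});
  return true;  // "success", even though the tail of `out` was never written
}

// `scratch` arrives holding stale bytes, simulating a buffer that was
// allocated without being initialized. If `zero_first` is true, the buffer is
// cleared before decompression so that unwritten bytes are always zero.
std::vector<uint8_t> DecompressPage(std::vector<uint8_t> scratch,
                                    std::size_t bytes_written, bool zero_first) {
  if (zero_first) std::fill(scratch.begin(), scratch.end(), uint8_t{0});
  bool ok = FakeDecompress(scratch.data(), scratch.size(), bytes_written);
  assert(ok);
  return scratch;  // the caller would go on to parse these bytes as page data
}
```

Calling DecompressPage with the same input but different stale contents yields different page bytes unless zero_first is set, which mirrors how the same corrupt file can surface different errors from run to run.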