IMPALA-5250

Non-deterministic error reporting for compressed corrupt Parquet files

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 2.8.0
    • Fix Version/s: Impala 2.11.0
    • Component/s: Backend
    • Labels: None

    Description

      Impala may return non-deterministic errors for certain corrupt Parquet files that are compressed. See the relevant snippet from BaseScalarColumnReader::ReadDataPage() below:

          if (decompressor_.get() != NULL) {
            SCOPED_TIMER(parent_->decompress_timer_);
            uint8_t* decompressed_buffer =
                decompressed_data_pool_->TryAllocate(uncompressed_size);
            if (UNLIKELY(decompressed_buffer == NULL)) {
              string details = Substitute(PARQUET_COL_MEM_LIMIT_EXCEEDED, "ReadDataPage",
                  uncompressed_size, "decompressed data");
              return decompressed_data_pool_->mem_tracker()->MemLimitExceeded(
                  parent_->state_, details, uncompressed_size);
            }
            RETURN_IF_ERROR(decompressor_->ProcessBlock32(true,
                current_page_header_.compressed_page_size, data_, &uncompressed_size,
                &decompressed_buffer));
            VLOG_FILE << "Decompressed " << current_page_header_.compressed_page_size
                      << " to " << uncompressed_size;
            if (current_page_header_.uncompressed_page_size != uncompressed_size) {
              return Status(Substitute("Error decompressing data page in file '$0'. "
                  "Expected $1 uncompressed bytes but got $2", filename(),
                  current_page_header_.uncompressed_page_size, uncompressed_size));
            }
            data_ = decompressed_buffer;
            data_size = current_page_header_.uncompressed_page_size;
            data_end_ = data_ + data_size;
      

      The 'decompressed_buffer' is not initialized, and decompressor_->ProcessBlock32() can succeed without writing to every byte of it. The unwritten bytes then hold whatever the allocation previously contained, so later stages of the scan decode garbage and report non-deterministic errors. For example, this may happen when 'compressed_page_size' is corrupt and set to 1.
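
      The failure mode can be illustrated with a small self-contained sketch. It is not Impala code and not the fix that was committed for this issue. ToyDecompress and ReadPageChecked are hypothetical names, and the "decompressor" just copies its input so that a 1-byte input produces a short write while still reporting success. The sketch assumes the decompressor can report how many bytes it actually wrote; the caller zero-fills the buffer and rejects the page unless that count matches the expected size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a decompressor that may report success after
// producing fewer bytes than the caller expected. It only copies the input.
// Returns true on "success" and stores the number of bytes actually written.
bool ToyDecompress(const uint8_t* in, size_t in_len, uint8_t* out,
                   size_t out_capacity, size_t* written) {
  assert(written != nullptr);
  const size_t n = in_len < out_capacity ? in_len : out_capacity;
  if (n > 0) std::memcpy(out, in, n);
  *written = n;
  return true;
}

// Decompresses one page and rejects it unless exactly expected_size bytes
// were produced. Returns an empty string on success, else an error message.
std::string ReadPageChecked(const std::vector<uint8_t>& compressed,
                            size_t expected_size, std::vector<uint8_t>* page) {
  // Zero-fill so that unwritten bytes are deterministic rather than
  // leftover heap contents.
  page->assign(expected_size, 0);
  size_t written = 0;
  if (!ToyDecompress(compressed.data(), compressed.size(), page->data(),
                     page->size(), &written)) {
    return "decompression failed";
  }
  // Compare against the bytes actually written, not the buffer capacity.
  if (written != expected_size) {
    return "Expected " + std::to_string(expected_size) +
           " uncompressed bytes but got " + std::to_string(written);
  }
  return "";
}
```

      In this sketch, a 4-byte input with an expected size of 4 is accepted, while a 1-byte input with the same expected size is rejected with a deterministic message instead of being handed on to the level decoders.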

      We've seen the following errors being reported for files like this:

      Could not read definition level, even though metadata states there are <some_number> values remaining in data page. 
      Corrupt Parquet file '<file>' <some_number> bytes of encoded levels but only <some_number> bytes left in page.
      

          People

            Assignee: Tianyi Wang (tianyiwang)
            Reporter: Alexander Behm (alex.behm)