IMPALA-4371

Incorrect DCHECK-s in hdfs-parquet-table-writer


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 2.2.4
    • Fix Version/s: Impala 3.0
    • Component/s: Backend
    • Labels: None

    Description

      The following two DCHECK-s in hdfs-parquet-table-writer.cc seem to be invalid:

          // Last page might be empty
          if (page.header.data_page_header.num_values == 0) {
            DCHECK_EQ(page.header.compressed_page_size, 0);
            DCHECK_EQ(i, num_data_pages_ - 1);
            continue;
          }
      

      The first DCHECK asserts that if a page contains no values, then its compressed size is also 0. This appears to be a false assumption, because the compressed output may include metadata, such as a length or a checksum, even when the input is empty.

      The GZIP compressor, for example, states that an input of 0 bytes requires 23 bytes when compressed. The Snappy compressor also mentions storing length information in the compressed output. The compressed size estimation in the LZ4 compressor also contains a constant part.
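
      To illustrate why an empty input can still produce non-empty compressed output, here is a minimal sketch of a length-prefixed codec. ToyCompress is a hypothetical helper written for this report, not Impala's or Snappy's code; it only assumes that the format starts with the uncompressed length encoded as a varint, and it copies the payload verbatim instead of compressing it.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical toy codec (assumption: not taken from Impala or any real
// compressor). It writes the uncompressed length as a varint preamble and
// then copies the input bytes unchanged.
std::vector<uint8_t> ToyCompress(const std::string& input) {
  std::vector<uint8_t> out;
  uint64_t len = input.size();
  // Varint encoding: 7 payload bits per byte, high bit set while more
  // bytes follow. A length of 0 still emits one byte (0x00).
  while (len >= 0x80) {
    out.push_back(static_cast<uint8_t>(len | 0x80));
    len >>= 7;
  }
  out.push_back(static_cast<uint8_t>(len));
  // Payload follows the preamble.
  out.insert(out.end(), input.begin(), input.end());
  return out;
}
```

      With this sketch, ToyCompress("") returns a one-byte buffer, so a check equivalent to DCHECK_EQ(compressed_page_size, 0) would fail for an empty page.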

      The "Last page might be empty" comment and the second DCHECK also seem to be based on a false assumption. If a value doesn't fit on the current page, AppendRow creates a new, possibly bigger page and tries writing the data to the new page instead. This means that if a value is bigger than the page size, the current page is finalized and a new page is added, even if the current page was still empty. In other words, empty pages can occur in the middle of the pages_ array as well, not only at the end of it.
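
      The paging behavior described above can be modeled with a small sketch. ToyPage and ToyColumnWriter are hypothetical names, and the logic is a simplified assumption of how AppendRow handles a value that does not fit; it is not the actual code in hdfs-parquet-table-writer.cc.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified page model (assumption: not Impala's real struct).
struct ToyPage {
  std::size_t capacity;
  std::size_t bytes_used;
  int num_values;
};

class ToyColumnWriter {
 public:
  explicit ToyColumnWriter(std::size_t page_size) : page_size_(page_size) {
    NewPage(page_size_);
  }

  // Appends one value of the given size. If it does not fit, the current
  // page is finalized as-is (even if it is empty) and a new page that is
  // large enough for the value is opened.
  void AppendValue(std::size_t value_size) {
    if (pages_.back().bytes_used + value_size > pages_.back().capacity) {
      NewPage(std::max(page_size_, value_size));
    }
    ToyPage& cur = pages_.back();
    cur.bytes_used += value_size;
    ++cur.num_values;
  }

  const std::vector<ToyPage>& pages() const { return pages_; }

 private:
  void NewPage(std::size_t capacity) {
    pages_.push_back(ToyPage{capacity, 0, 0});
  }

  std::size_t page_size_;
  std::vector<ToyPage> pages_;
};
```

      In this model, a writer created with a page size of 64 that receives a 100-byte value and then a 10-byte value ends up with three pages: an empty one followed by two pages holding one value each. The empty page is not the last one, so a check equivalent to DCHECK_EQ(i, num_data_pages_ - 1) would fail.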

          People

            Assignee: Zoltan Ivanfi
            Reporter: Zoltan Ivanfi