Details
Bug
Status: Resolved
Major
Resolution: Fixed
Impala 2.2.4
None
Description
The following two DCHECK-s in hdfs-parquet-table-writer.cc seem to be invalid:
    // Last page might be empty
    if (page.header.data_page_header.num_values == 0) {
      DCHECK_EQ(page.header.compressed_page_size, 0);
      DCHECK_EQ(i, num_data_pages_ - 1);
      continue;
    }
The first DCHECK asserts that if a page contains no values, its compressed size is also 0. This seems to be a false assumption, however, because the compressed output may include metadata such as a length or checksum.
The GZIP compressor, for example, states that an input of 0 bytes requires 23 bytes when compressed. The Snappy compressor also mentions storing length information in the compressed output. The LZ4 compressor's compressed-size estimate also includes a constant term.
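To illustrate why zero input bytes need not give zero output bytes, here is a minimal sketch of a length-prefixed framing. FrameWithLength and DemoEmptyInput are hypothetical names introduced for this example. No real compression is performed, and this is not the code of any of the compressors mentioned above. It only mimics the general idea of storing the uncompressed length as a varint at the start of the output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical toy framing (an assumption for illustration, not a real codec):
// prefix the payload with its length encoded as a little-endian base-128
// varint, then copy the payload unchanged.
std::vector<uint8_t> FrameWithLength(const std::string& input) {
  std::vector<uint8_t> out;
  uint64_t len = input.size();
  // Emit 7 bits per byte; the high bit signals that more bytes follow.
  do {
    uint8_t byte = static_cast<uint8_t>(len & 0x7F);
    len >>= 7;
    if (len != 0) byte |= 0x80;
    out.push_back(byte);
  } while (len != 0);
  out.insert(out.end(), input.begin(), input.end());
  return out;
}

// Even an empty input produces one byte of output: the encoded length 0.
void DemoEmptyInput() {
  assert(FrameWithLength("").size() == 1);
}
```

With a framing like this, a check of the form DCHECK_EQ(compressed_size, 0) would fail for an empty page, because the length prefix alone already occupies one byte.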
The "Last page might be empty" comment and the second DCHECK also seem to be based on a false assumption. If a value doesn't fit on the current page, AppendRow creates a new, possibly bigger page and tries writing the data to the new page instead. This means that if the data is bigger than the page size, the current page is finalized and a new page is added, even if the original page was empty. In other words, empty pages can occur in the middle of the pages_ array as well, not only at the end of it.
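The scenario can be reproduced with a simplified model of the behavior described above. This is a sketch only: Page, ToyColumnWriter and AppendValue are hypothetical stand-ins, not the actual classes in hdfs-parquet-table-writer.cc, and the sizing rule for the new page (the larger of the default page size and the value size) is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified page: tracks only its capacity and fill level.
struct Page {
  explicit Page(size_t cap) : capacity(cap), bytes_used(0), num_values(0) {}
  size_t capacity;
  size_t bytes_used;
  int num_values;
};

// Hypothetical writer modelling "finalize the current page and start a new,
// possibly bigger one whenever a value does not fit".
class ToyColumnWriter {
 public:
  explicit ToyColumnWriter(size_t default_page_size)
      : default_page_size_(default_page_size) {
    pages_.push_back(Page(default_page_size_));
  }

  void AppendValue(const std::string& value) {
    if (!Fits(value)) {
      // The current page is left as-is, even if it holds zero values.
      // Assumed sizing rule: grow the new page if the value is oversized.
      pages_.push_back(Page(std::max(default_page_size_, value.size())));
    }
    assert(Fits(value));
    Page& page = pages_.back();
    page.bytes_used += value.size();
    ++page.num_values;
  }

  const std::vector<Page>& pages() const { return pages_; }

 private:
  bool Fits(const std::string& value) const {
    const Page& page = pages_.back();
    return page.bytes_used + value.size() <= page.capacity;
  }

  size_t default_page_size_;
  std::vector<Page> pages_;
};
```

With a default page size of 8 bytes, appending a 20-byte value followed by a 2-byte value yields three pages in this model: an empty page at index 0, then two pages holding one value each. The empty page is therefore not the last one, which is the case the second DCHECK rules out.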