Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version: 1.13.1
- None
- None
- None
Description
While writing an encrypted Parquet file, we encountered the following error:
Encrypted parquet files can't have more than 32767 pages per chunk: 32768
Error Stack:
org.apache.parquet.crypto.ParquetCryptoRuntimeException: Encrypted parquet files can't have more than 32767 pages per chunk: 32768
	at org.apache.parquet.crypto.AesCipher.quickUpdatePageAAD(AesCipher.java:131)
	at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:178)
	at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:67)
	at org.apache.parquet.column.impl.ColumnWriterBase.writePage(ColumnWriterBase.java:392)
	at org.apache.parquet.column.impl.ColumnWriteStoreBase.sizeCheck(ColumnWriteStoreBase.java:231)
	at org.apache.parquet.column.impl.ColumnWriteStoreBase.endRecord(ColumnWriteStoreBase.java:216)
	at org.apache.parquet.column.impl.ColumnWriteStoreV1.endRecord(ColumnWriteStoreV1.java:29)
	at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endMessage(MessageColumnIO.java:295)
Reasons:
The `getBufferedSize` method of `FallbackValuesWriter` returns the raw data size when deciding whether to flush the current page, but the size of the page actually written can be much smaller due to dictionary encoding. This prevents pages from becoming too large when a fallback happens, but it can also produce too many pages in a single column chunk. Meanwhile, the encryption module supports at most 32767 pages per chunk, because the page ordinal is stored as a `Short` as part of the AAD.
Reproduce:
reproduce.zip