[PARQUET-1401] RowGroup offset and total compressed size fields - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: encryption-feature-branch
Component/s: parquet-cpp, parquet-format
Labels:
- pull-request-available

Description

Spark uses the filterFileMetaData* methods of the ParquetMetadataConverter class, which calculate the offset and total compressed size of a RowGroup's data.
The offset is taken from the ColumnMetaData of the first column, using its offset fields.
The total compressed size is computed by looping over all column chunks in the RowGroup and summing the size values from each chunk's ColumnMetaData.
If one or more columns are hidden (encrypted with a key unavailable to the reader), these calculations can't be performed, because the column metadata is protected.
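
As a rough illustration of the calculation described above and why a hidden column breaks it, here is a minimal Python sketch. It is not the parquet-mr code: the classes are simplified stand-ins for the Thrift structures, the function names are hypothetical, and taking the smaller of the dictionary page and data page offsets is an assumption about how the "offset fields" are combined.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnMetaData:
    # Simplified stand-in for the Thrift ColumnMetaData structure.
    data_page_offset: int
    total_compressed_size: int
    dictionary_page_offset: Optional[int] = None


@dataclass
class ColumnChunk:
    # meta_data is None when the column is hidden from this reader
    # (an assumption used here to model protected metadata).
    meta_data: Optional[ColumnMetaData]


def row_group_offset(columns: List[ColumnChunk]) -> int:
    """Start offset of the RowGroup, derived from the first column chunk."""
    first = columns[0].meta_data
    if first is None:
        raise ValueError("first column is hidden; offset is unavailable")
    if first.dictionary_page_offset is not None:
        return min(first.dictionary_page_offset, first.data_page_offset)
    return first.data_page_offset


def row_group_compressed_size(columns: List[ColumnChunk]) -> int:
    """Sum of the compressed sizes of all column chunks in the RowGroup."""
    total = 0
    for index, chunk in enumerate(columns):
        if chunk.meta_data is None:
            raise ValueError(f"column {index} is hidden; size is unavailable")
        total += chunk.meta_data.total_compressed_size
    return total
```

A single hidden column is enough to make the size calculation fail, even when the query never touches that column.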

However, these calculations don't really need the individual column values; the results pertain to the whole RowGroup, not to specific columns.
Therefore, we will define two new optional fields in the RowGroup Thrift structure:

optional i64 file_offset
optional i64 total_compressed_size

and calculate and set them when the file is written. Spark will then be able to query a file with hidden columns, provided the query itself doesn't need the hidden columns (that is, it works with a masked version of them, or reads only columns with available keys).
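
The intended write and read flow could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the classes are simplified stand-ins for the Thrift structures, `finalize_row_group` and `row_group_range` are hypothetical names, and the `hidden` flag is a modelling shortcut for column metadata the reader cannot decrypt.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ColumnChunk:
    # Simplified stand-in: offset and size normally live in ColumnMetaData.
    offset: int
    compressed_size: int
    hidden: bool = False  # True when the reader lacks the column key


@dataclass
class RowGroup:
    columns: List[ColumnChunk]
    file_offset: Optional[int] = None            # new optional field
    total_compressed_size: Optional[int] = None  # new optional field


def finalize_row_group(row_group: RowGroup) -> RowGroup:
    """Writer side: all column metadata is available, so set both fields."""
    row_group.file_offset = row_group.columns[0].offset
    row_group.total_compressed_size = sum(
        chunk.compressed_size for chunk in row_group.columns
    )
    return row_group


def row_group_range(row_group: RowGroup) -> Tuple[int, int]:
    """Reader side: prefer the RowGroup-level fields, else use the loop."""
    if (row_group.file_offset is not None
            and row_group.total_compressed_size is not None):
        return row_group.file_offset, row_group.total_compressed_size
    if any(chunk.hidden for chunk in row_group.columns):
        raise ValueError("hidden column and no RowGroup-level fields")
    return (
        row_group.columns[0].offset,
        sum(chunk.compressed_size for chunk in row_group.columns),
    )
```

Because both fields are optional, files written without them still go through the per-column fallback, which keeps working as long as no column is hidden.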

These values can be set only for encrypted files, or for all files in order to skip the loop upon reading. I've tested this; it works fine in Spark writers and readers.

I've also checked for other references to ColumnMetaData fields in parquet-mr. There are none; therefore, this is the only change we need in parquet.thrift to handle hidden columns.

Issue Links

links to

GitHub Pull Request #104

GitHub Pull Request #497

GitHub Pull Request #2573

GitHub Pull Request #6807

People

Assignee: Gidon Gershinsky

Reporter: Gidon Gershinsky

Votes: 0

Watchers: 2

Dates

Created: 22/Aug/18 14:50

Updated: 07/Apr/20 05:27

Resolved: 18/Sep/18 04:05

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 3.5h