Apache Drill / DRILL-5266

Parquet Reader produces "low density" record batches - bits vs. bytes


Details

    Description

      Testing with the managed sort revealed that, for at least one file, the Parquet reader produces "low-density" batches: batches in which only about 5% of each value vector holds actual data, with the rest being unused space. When such batches are fed into the sort, 95% of the buffered memory is wasted and only 5% of available memory holds actual query data. The result is poor sort performance, because the sort must spill to disk far more frequently than expected.

      The managed sort analyzes incoming batches to prepare accurate memory-use estimates. The following is its output for the Parquet file in question:

      Actual batch schema & sizes {
        T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
        T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
        T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
      ...
        c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
        Records: 1129, Total size: 32006144, Row width:28350, Density:5}
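The density figures above appear to be the data size expressed as a percentage of the allocated vector size (e.g. 4516 / 131072 ≈ 3.4%, reported as 4). A minimal sketch of that calculation, assuming the percentage is rounded up (this is an illustration, not Drill's actual batch-analysis code; the class and method names are hypothetical):

```java
// Hypothetical sketch: reproducing the per-column "density" numbers
// from the batch-analysis output above. Density is the share of the
// allocated vector memory that holds real data, as a percentage,
// assumed here to be rounded up.
public class BatchDensity {

    /** Percent of allocated vector bytes that contain actual data, rounded up. */
    static int densityPercent(long dataSize, long vectorSize) {
        // Integer ceiling division, so a nearly-empty vector still
        // reports at least 1% rather than 0%.
        return (int) ((dataSize * 100 + vectorSize - 1) / vectorSize);
    }

    public static void main(String[] args) {
        // cs_sold_date_sk: 4516 data bytes in a 131072-byte vector.
        System.out.println(densityPercent(4516, 131072));
        // c_email_address: 30327 data bytes in a 49152-byte vector.
        System.out.println(densityPercent(30327, 49152));
    }
}
```

With the log's values this yields 4 and 62, matching the densities reported for cs_sold_date_sk and c_email_address.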
      

            People

              Paul Rogers
              Rahul Kumar Challapalli
              Votes: 0
              Watchers: 6
