[DRILL-3918] Avoid extra loading of the metadata cache file - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.2.0
Component/s: Metadata
Labels:
None

Description

The metadata cache file is currently deserialized and read twice: once in ParquetFormatPlugin.expandSelection(), which runs as part of creating the DynamicDrillTable, and once in ParquetGroupScan. sphillips also pointed this out in DRILL-3901. We should read the file only once.
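One way to picture the intended behavior is a small memoizing holder: the first caller deserializes the cache file, and later callers reuse the same object. The sketch below is illustrative only and is not Drill's code or the actual fix. MetadataCacheSketch, TableMetadata, deserialize(), and the path used in the usage note are all hypothetical names, and the "deserialization" is simulated by a counter.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: loads each metadata cache file at most once. */
final class MetadataCacheSketch {

    /** Stand-in for the deserialized contents of a metadata cache file. */
    static final class TableMetadata {
        final String sourcePath;

        TableMetadata(String sourcePath) {
            this.sourcePath = sourcePath;
        }
    }

    // Already-deserialized metadata, keyed by cache file path.
    private final Map<String, TableMetadata> loaded = new ConcurrentHashMap<>();

    // Counts how many times the expensive load actually ran.
    private final AtomicInteger loadCount = new AtomicInteger();

    /** Simulates the expensive read + deserialize step (assumption). */
    private TableMetadata deserialize(String path) {
        loadCount.incrementAndGet();
        return new TableMetadata(path);
    }

    /** Returns the metadata for the path, deserializing only on first use. */
    TableMetadata get(String path) {
        return loaded.computeIfAbsent(path, this::deserialize);
    }

    int loadCount() {
        return loadCount.get();
    }
}
```

In this sketch, the selection-expansion step and the group-scan step would both call get() with the same path (for example a hypothetical "/data/t1/metadata_cache"), so only the first call pays the deserialization cost. An alternative with the same effect is to hand the already-loaded object from the first step to the second instead of sharing a map.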

The performance issue is more visible now because of the fix for DRILL-3917, which corrected the behavior of expandSelection() by reading the metadata cache file through the correct interface (previously it errored out and so spent no time in the expansion). That fix is needed for correct functionality. However, performance numbers show a slowdown of about 2.7x for the 400K-file test using caching. In my view, this performance comparison is not very meaningful because of the prior bug.

This JIRA specifically targets the extra load of the metadata cache file. There are other opportunities for improvement (for instance, reading from the metadata cache is single-threaded, whereas reading from Parquet files is parallelized), but those should be tracked in a separate JIRA.

People

Assignee: Aman Sinha

Reporter: Aman Sinha

Votes: 0

Watchers: 2

Dates

Created: 11/Oct/15 14:58

Updated: 11/Oct/15 19:47

Resolved: 11/Oct/15 19:47