[DRILL-2743] Parquet file metadata caching - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.2.0
Component/s: Storage - Parquet
Labels: None

Description

To run a query against Parquet files, we must first recursively search the directory tree for all of the files, get the block locations for each file, and read the footer of each file. All of this happens during the planning phase. When there are many files, it can delay the query significantly, and the approach does not scale.

However, there is no real need to read the footers during planning. If we instead treat each Parquet file as a single work unit, all we need to know are the file's block locations, its number of rows, and its columns. We should store only the information needed for planning in a file located in the top directory of a given Parquet table. Reading the footers can then be delayed until execution time, where it can be done in parallel.
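
The proposal above can be illustrated with a minimal sketch. It is not the code in the attached patch: the cache file name, the JSON layout, and the `read_footer` callback are all hypothetical, and `hosts` is only a stand-in for real block locations. The sketch shows the expensive footer reads being paid once when the cache is built, after which planning needs a single small read from the table's top directory.

```python
import json
import os
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List

# Hypothetical cache file name; not the name used by Drill.
CACHE_NAME = ".planning_metadata.json"


@dataclass
class FileEntry:
    path: str           # path relative to the table's top directory
    row_count: int      # number of rows in the file
    columns: List[str]  # column names
    hosts: List[str]    # stand-in for block locations


def build_cache(root: str, read_footer: Callable[[str], Dict]) -> List[FileEntry]:
    """Walk the table once, read each footer, and persist a planning summary.

    `read_footer` is a caller-supplied (hypothetical) function returning a dict
    with "row_count", "columns", and optionally "hosts" for a given file path.
    """
    entries = []
    for dirpath, dirs, files in os.walk(root):
        dirs.sort()  # make the traversal order deterministic
        for name in sorted(files):
            if not name.endswith(".parquet"):
                continue
            full = os.path.join(dirpath, name)
            footer = read_footer(full)  # the expensive step, paid once here
            entries.append(FileEntry(
                path=os.path.relpath(full, root),
                row_count=footer["row_count"],
                columns=footer["columns"],
                hosts=footer.get("hosts", []),
            ))
    with open(os.path.join(root, CACHE_NAME), "w") as f:
        json.dump([asdict(e) for e in entries], f)
    return entries


def load_cache(root: str) -> List[FileEntry]:
    """Planning-time path: one small read, no footer access."""
    with open(os.path.join(root, CACHE_NAME)) as f:
        return [FileEntry(**d) for d in json.load(f)]
```

A planner built along these lines would call `load_cache` to create one work unit per `FileEntry`, and each unit would read its own footer when it executes. The sketch ignores cache invalidation, which a real implementation would need when files are added or removed.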

Attachments

- DRILL-2743.patch, 72 kB, 10/Apr/15 01:13, Steven Phillips
- drill.parquet_metadata, 5 kB, 10/Apr/15 01:52, Steven Phillips

People

Assignee: Rahul Kumar Challapalli

Reporter: Steven Phillips

Votes: 0

Watchers: 9

Dates

Created: 10/Apr/15 00:21

Updated: 03/Oct/15 00:16

Resolved: 20/Sep/15 00:24