[DRILL-4601] Partitioning based on the parquet statistics - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: 1.9.0
Component/s: Query Planning & Optimization
Labels:

Flags: Patch

Description

Performance could improve considerably if the partitioning idea implemented in DRILL-3333 were extended further.
Currently, partitioning is based on statistics only when the min value equals the max value for the whole file. Based on this, files are removed from the scan in the planning phase. The problem is that this leads to many small Parquet files, which does not work well in the HDFS world. Also, only a few columns are partitioned.

I would like to extend this idea to use the statistics for all columns. If a filter requires a column to equal a constant, remove from the plan all files whose statistics rule that constant out (that is, it falls outside the file's min/max range for that column). This will really help performance for scans over many Parquet files.
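
To illustrate the idea (this is not taken from the attached patch), here is a minimal sketch of min/max pruning for an equality filter. The `FileStats` class, the `may_contain` and `prune` helpers, and the file names are hypothetical and exist only for this example; the sketch assumes per-file min/max values are already available per column and that a file without statistics for a column must be kept.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class FileStats:
    """Hypothetical per-file summary: column name -> (min, max)."""
    path: str
    ranges: Dict[str, Tuple[int, int]]


def may_contain(stats: FileStats, column: str, value: int) -> bool:
    """Return False only when the statistics prove the value is absent."""
    # Without statistics for the column the file cannot be ruled out.
    if column not in stats.ranges:
        return True
    lo, hi = stats.ranges[column]
    return lo <= value <= hi


def prune(files: List[FileStats], column: str, value: int) -> List[FileStats]:
    """Keep only the files that might hold rows where column == value."""
    return [f for f in files if may_contain(f, column, value)]


# Illustrative data (assumed, not from the issue).
files = [
    FileStats("a.parquet", {"id": (1, 100)}),
    FileStats("b.parquet", {"id": (101, 200)}),
    FileStats("c.parquet", {}),  # no statistics recorded for "id"
]
kept = [f.path for f in prune(files, "id", 150)]
```

With a filter such as `id = 150`, only the file whose range covers 150 and the file without statistics stay in the scan. Unlike the min-equals-max case, this does not require one file per distinct value.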

I have an initial patch ready, currently just to give an idea (it changes metadata v2, which is not acceptable).

Attachments

DRILL-4601.1.patch, 13/Apr/16 08:38, 36 kB, Miroslav Holubec

Issue Links

duplicates

DRILL-1950 Implement filter pushdown for Parquet

Resolved

People

Assignee: Unassigned

Reporter: Miroslav Holubec

Votes: 0

Watchers: 1

Dates

Created: 13/Apr/16 08:27

Updated: 24/Jan/17 09:17

Resolved: 24/Jan/17 09:17
