  1. Apache Drill
  2. DRILL-4601

Partitioning based on the parquet statistics


Details

    • Patch

    Description

      Extending the partitioning idea implemented in DRILL-3333 even further could significantly improve performance.
      Currently, partitioning relies on statistics only when the min value equals the max value for the whole file. Based on this, files are removed from the scan in the planning phase. The problem is that this leads to many small Parquet files, which is undesirable in the HDFS world. Also, only a few columns are partitioned.

      I would like to extend this idea to use the statistics of all columns. If a filter requires a column to equal a constant, every file whose statistics for that column rule out the constant can be removed from the plan. This should substantially help performance for scans over many Parquet files.
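
      The proposed check can be sketched roughly as follows. This is a minimal illustration of min/max pruning for an equality filter, not the attached patch and not Drill's actual planner code; `FileStats`, `may_contain`, and `prune_files` are hypothetical names, and the sketch assumes per-file min/max values are already available from the Parquet metadata.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class FileStats:
    """Hypothetical per-file summary: column name -> (min, max)."""
    path: str
    column_ranges: Dict[str, Tuple[Any, Any]] = field(default_factory=dict)


def may_contain(stats: FileStats, column: str, constant: Any) -> bool:
    """Return False only when the statistics prove the constant is absent."""
    value_range = stats.column_ranges.get(column)
    if value_range is None:
        # No statistics for this column: stay conservative and keep the file.
        return True
    lo, hi = value_range
    return lo <= constant <= hi


def prune_files(files: List[FileStats], column: str, constant: Any) -> List[FileStats]:
    """Keep only files that might hold rows where column == constant."""
    return [f for f in files if may_contain(f, column, constant)]
```

      With files covering `id` ranges 1 to 10 and 11 to 20, a filter `id = 15` would keep only the second file. A file without statistics for `id` is always kept, since nothing can be proven about it.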

      I have an initial patch ready, for now just to give an idea. (It changes metadata v2, which is not desirable.)

      Attachments

        Issue Links

        Activity


          People

            Assignee: Unassigned
            Reporter: Miroslav Holubec (myroch)

