IMPALA-8449

Avoid Parquet pages with too many rows + try to make them aligned


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component: Backend

    Description

      Currently Impala limits the size of a Parquet data page, but not the number of rows in it. This means that if a page can be encoded efficiently with RLE, then any number of rows can fit into it. This is an issue for column indexes: ordered columns (which are very good candidates for min/max filtering) whose NDV is low enough to fit into the dictionary will be encoded "too well", making the per-page index too coarse-grained.
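      To illustrate the problem, here is a minimal sketch of page splitting under a byte limit alone. It uses a toy cost model that is my assumption, not Impala's writer: every RLE run costs a fixed `RUN_BYTES` regardless of its length. The real Parquet RLE/bit-packing hybrid encoding is more involved, and the function name and constants are hypothetical.

```python
from itertools import groupby

# Assumed toy cost of one RLE run (run header + dictionary index).
# Real Parquet pages use an RLE/bit-packing hybrid; this is only a model.
RUN_BYTES = 4


def rows_per_page_size_limit_only(values, max_page_bytes):
    """Split a column into pages using only a byte limit.

    Returns the number of rows that end up in each page.
    """
    pages = []
    rows_in_page = 0
    bytes_in_page = 0
    for _, run in groupby(values):
        run_len = sum(1 for _ in run)
        # Close the current page if one more run would exceed the byte limit.
        if rows_in_page > 0 and bytes_in_page + RUN_BYTES > max_page_bytes:
            pages.append(rows_in_page)
            rows_in_page = 0
            bytes_in_page = 0
        rows_in_page += run_len
        bytes_in_page += RUN_BYTES
    if rows_in_page > 0:
        pages.append(rows_in_page)
    return pages


# Ordered column with 10 distinct values: only 10 runs, so everything
# fits into a single page no matter how many rows there are.
sorted_col = [v for v in range(10) for _ in range(1000)]
print(rows_per_page_size_limit_only(sorted_col, 64))  # [10000]

# Alternating values: every row is its own run, so pages stay small.
alternating_col = [i % 2 for i in range(1000)]
print(len(rows_per_page_size_limit_only(alternating_col, 64)))  # 63
```

      In this model the ordered column produces a single page, so a per-page min/max index on it could not skip anything smaller than the whole column chunk.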

      Parquet-mr chose the approach of adding a configurable "max row count in page" (20000 by default): PARQUET-1414. This would work for Impala too and is relatively simple to implement, but I think it is still a sub-optimal solution for column indexes, because it doesn't keep every page aligned: some pages may hit the max size limit first and end up with fewer than 20000 rows, which makes all subsequent pages in the column chunk non-aligned. The max size limit seems important for string columns, as long strings could otherwise lead to very large pages.

      An alternative algorithm is to start a new page at every Nth row of the column chunk, regardless of the number of rows in the current page. This would result in the same layout as the previous approach for columns where pages always hit the max row count limit before the max size limit, but for other columns, alignment would be reestablished at every Nth row.
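      The difference between the two approaches can be sketched as follows. This is a simplified model under my own assumptions, not the parquet-mr or Impala implementation: each row has a known encoded size in bytes, a page is closed before a row that would push it over the size limit, and both function names are hypothetical. Each function returns the row index at which every page starts.

```python
def page_starts_max_rows(row_sizes, max_page_bytes, max_rows):
    """Close a page when it holds max_rows rows or would exceed the byte limit."""
    if not row_sizes:
        return []
    starts = [0]
    rows = 0
    size = 0
    for i, sz in enumerate(row_sizes):
        if rows > 0 and (rows == max_rows or size + sz > max_page_bytes):
            starts.append(i)
            rows = 0
            size = 0
        rows += 1
        size += sz
    return starts


def page_starts_every_nth_row(row_sizes, max_page_bytes, n):
    """Start a page at every n-th row of the column chunk, and also
    whenever the current page would exceed the byte limit."""
    if not row_sizes:
        return []
    starts = [0]
    size = 0
    for i, sz in enumerate(row_sizes):
        if i > starts[-1] and (i % n == 0 or size + sz > max_page_bytes):
            starts.append(i)
            size = 0
        size += sz
    return starts


# Toy numbers (assumed): 12 rows, 10-byte page limit, N = 4.
small = [1] * 12                               # never hits the size limit
mixed = [1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # one large value early on

print(page_starts_max_rows(small, 10, 4))       # [0, 4, 8]
print(page_starts_max_rows(mixed, 10, 4))       # [0, 3, 7, 11]
print(page_starts_every_nth_row(small, 10, 4))  # [0, 4, 8]
print(page_starts_every_nth_row(mixed, 10, 4))  # [0, 3, 4, 8]
```

      With the max row count approach, the single size-limited page in `mixed` shifts every later boundary, so only row 0 is shared with `small`. With the every-Nth-row approach, the boundaries at rows 4 and 8 are shared again, at the cost of one extra short page.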

          People

            Assignee: Unassigned
            Reporter: Csaba Ringhofer (csringhofer)
