IMPALA-7294

TABLESAMPLE clause allocates arrays based on total file count instead of selected partitions


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: Impala 3.0
    • Fix Version/s: Impala 2.13.0, Impala 3.1.0
    • Component/s: None
    • Environment: None
    • Labels: ghx-label-4

    Description

      The HdfsTable.getFilesSample function takes a list of input partitions to sample files from, but then, when allocating an array to sample into, sizes that array based on the total file count across all partitions. This is an unnecessarily large array, which is expensive to allocate (may cause full GC when the heap is fragmented). The code claims this to be an optimization:

          // Use max size to avoid looping over inputParts for the exact size.
      

      ...but I think the loop over inputParts is likely to be trivial here, since we will loop over them anyway later in the function, so they will already be pulled into CPU cache. This is also necessary for fine-grained metadata loading in the impalad: for a large table with many partitions, we don't want to load the file lists of all partitions just to TABLESAMPLE from one partition.
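      The difference between the two sizing strategies can be sketched as below. This is a minimal, hypothetical illustration, not Impala's actual code: the class names `Partition` and `SampleSizer` are invented stand-ins (the real `HdfsTable.getFilesSample` works over the catalog's partition and file-descriptor objects), but it shows why summing file counts over only the selected partitions gives an exact, much smaller allocation than the table-wide total.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for a catalog partition; the real code
// tracks per-partition file descriptors rather than a bare count.
class Partition {
  final int numFiles;
  Partition(int numFiles) { this.numFiles = numFiles; }
}

class SampleSizer {
  // Before the fix: size the sample array by the table-wide file count,
  // even when TABLESAMPLE selected only a few partitions. For a large
  // table this allocates a needlessly huge array.
  static int oversizedCapacity(int totalFileCountAcrossAllPartitions) {
    return totalFileCountAcrossAllPartitions;
  }

  // After the fix: one cheap extra pass over the selected partitions
  // yields the exact capacity. These partitions are iterated again later
  // in the function anyway, so this loop is effectively free.
  static int exactCapacity(List<Partition> inputParts) {
    int numFiles = 0;
    for (Partition p : inputParts) {
      numFiles += p.numFiles;
    }
    return numFiles;
  }
}
```

      For example, sampling from two partitions holding 3 and 5 files needs an array of only 8 entries, regardless of how many files the table holds overall.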


          People

            Assignee: Todd Lipcon (tlipcon)
            Reporter: Todd Lipcon (tlipcon)
            Votes: 0
            Watchers: 2

