
Details

    • Sub-task
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • 1.8.0
    • 1.11.0
    • None
    • None

    Description

      The ExternalSortBatch (ESB) operator sorts data while spilling to disk as needed to operate within a memory budget.

      The sort happens in two phases:

      1. Gather the incoming batches from the upstream operator, sort them, and spill to disk as needed.
      2. Merge the "runs" spilled in step 1.
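
      The two phases can be sketched in a few lines of Python. This is only an illustration of the general external-sort pattern, not Drill's implementation: the helper names (`spill`, `read_run`, `external_sort`) and the row-count budget `max_rows_in_memory` are assumptions made for the example, and ESB budgets by memory rather than by row count.

```python
import heapq
import pickle
import tempfile


def spill(rows):
    """Write one sorted run to a temporary file and return the rewound file."""
    run = tempfile.TemporaryFile()
    for row in rows:
        pickle.dump(row, run)
    run.seek(0)
    return run


def read_run(run):
    """Stream the rows of a spilled run back in their stored order."""
    while True:
        try:
            yield pickle.load(run)
        except EOFError:
            return


def external_sort(batches, max_rows_in_memory):
    """Yield all rows in sorted order, spilling when the (hypothetical) budget is hit."""
    runs, buffered = [], []
    # Phase 1: gather incoming batches, sort them, and spill as needed.
    for batch in batches:
        buffered.extend(batch)
        if len(buffered) >= max_rows_in_memory:
            runs.append(spill(sorted(buffered)))
            buffered = []
    # Phase 2: merge the spilled runs with whatever is still in memory.
    streams = [read_run(run) for run in runs]
    streams.append(iter(sorted(buffered)))
    yield from heapq.merge(*streams)


print(list(external_sort([[5, 1], [4, 2], [3, 0]], max_rows_in_memory=3)))
```

      Only one row per run needs to be resident during the merge, which is why the second step normally fits in the memory used by the first.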

      In most cases, the second step should run within the memory available for the first step (which is why severity is only Minor). Large queries need multiple sort "phases" in which previously spilled runs are read back into memory, merged, and spilled again. It is here that ESB has an issue. This process correctly limits the amount of memory used, but at the cost of rewriting the same data over and over.

      Consider current Drill behavior:

      a b c d (re-spill)
      abcd e f g h (re-spill)
      abcdefgh i j k
      

      That is, batches a, b, c and d are re-spilled to create the combined run abcd, and so on. The same data is rewritten over and over.

      Note that spilled batches take no (direct) memory in Drill and require only a small on-heap memento, so keeping data on disk is "free". It would therefore be better to re-spill only the newer data:

      a b c d (re-spill)
      abcd | e f g h (re-spill)
      abcd efgh | i j k
      

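      Where the bar indicates a moving point to the left of which runs have already been merged and do not need to be merged again. If each letter is one unit of disk I/O, counting both the writes and the read-backs of every spill and re-spill but not the final merge read, the original method uses 35 units while the revised method uses 27 units.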
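
      The 35 and 27 figures can be reproduced with a small accounting sketch. The accounting rule is an assumption inferred from the numbers: every batch costs one unit to write, every re-spill costs one unit per batch read plus one per batch written, the last group (i j k) is not re-spilled, and the final merge read is left out. The function names are hypothetical.

```python
def io_original(groups):
    """Each re-spill rereads and rewrites everything already on disk."""
    io = 0
    merged = 0  # size of the single combined run built so far
    for i, group in enumerate(groups):
        io += group                      # initial spill of the new batches
        if i < len(groups) - 1:          # the last group is not re-spilled
            io += 2 * (merged + group)   # read all runs, write the combined run
            merged += group
    return io


def io_revised(groups):
    """Each re-spill touches only the batches added since the last re-spill."""
    io = 0
    for i, group in enumerate(groups):
        io += group                      # initial spill of the new batches
        if i < len(groups) - 1:          # the last group is not re-spilled
            io += 2 * group              # read and rewrite only the new batches
    return io


groups = [4, 4, 3]          # a b c d | e f g h | i j k
print(io_original(groups))  # 35
print(io_revised(groups))   # 27
```

      Under the same assumptions the gap widens quickly with more groups, because the original method's re-spill cost grows with the total data spilled so far while the revised method's cost stays proportional to the new data.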

      At some point the process may have to repeat by merging the second-generation spill files and so on.
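
      One way to picture the repeated, generation-by-generation merging is the sketch below. It is a simplified model, not Drill's logic: it assumes unit-sized batches and a hypothetical fixed `fan_in` (the number of runs a generation may hold before they are merged into one run of the next generation), whereas a real sort would trigger merges based on memory pressure.

```python
def spill_with_generations(num_batches, fan_in):
    """Return (generations, io_units) after spilling unit-sized batches.

    generations[g] lists the sizes of the runs currently held at level g.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    generations = [[]]
    io = 0
    for _ in range(num_batches):
        generations[0].append(1)
        io += 1                          # write the newly spilled batch
        g = 0
        # Cascade: a full generation is merged into one run of the next.
        while len(generations[g]) == fan_in:
            size = sum(generations[g])
            io += 2 * size               # read the runs, write the merged run
            generations[g] = []
            if g + 1 == len(generations):
                generations.append([])
            generations[g + 1].append(size)
            g += 1
    return generations, io


print(spill_with_generations(11, 4))  # ([[1, 1, 1], [4, 4]], 27)
print(spill_with_generations(16, 4))  # ([[], [], [16]], 80)
```

      With 11 batches and a fan-in of 4 the model reproduces the revised layout (abcd efgh | i j k). With 16 batches, the four second-generation runs are themselves merged into a single third-generation run.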

          People

            paul-rogers Paul Rogers