Hadoop Common / HADOOP-3473

io.sort.factor should default to 100 instead of 10

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: conf
    • Labels: None

      Description

      A default of 10 is really conservative and can make merges much more expensive.

        Activity

        Doug Cutting added a comment:

        Changing this has memory implications, no? A buffer is allocated for each stream being merged. Buffers should be large enough that transfer time dominates seek time: at 10 ms/seek and 100 MB/s transfer, seek time equals transfer time at a buffer size of 1 MB. So for merging not to be seek-bound with 100 buffers, the total buffer size needs to be substantially larger than 100 MB, which is currently the default for io.sort.mb. So I can see increasing this to 50 without changing the default for io.sort.mb.
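
        The arithmetic in this comment can be checked with a short sketch. The 10 ms seek time and 100 MB/s transfer rate are the figures quoted in the comment; the function names and the simple "one seek per buffer refill" cost model are assumptions made for illustration, not anything taken from the Hadoop code.

```python
def break_even_buffer_mb(seek_ms: float, transfer_mb_per_s: float) -> float:
    """Buffer size (MB) whose transfer time equals the time of one seek."""
    return seek_ms * transfer_mb_per_s / 1000.0


def seek_fraction(total_buffer_mb: float, streams: int,
                  seek_ms: float, transfer_mb_per_s: float) -> float:
    """Fraction of I/O time spent seeking, assuming one seek per buffer refill
    and the total buffer split evenly across the merged streams."""
    per_stream_mb = total_buffer_mb / streams
    transfer_ms = per_stream_mb / transfer_mb_per_s * 1000.0
    return seek_ms / (seek_ms + transfer_ms)


if __name__ == "__main__":
    # Figures from the comment: 10 ms per seek, 100 MB/s transfer.
    print(break_even_buffer_mb(10, 100))     # break-even buffer size in MB
    print(seek_fraction(100, 100, 10, 100))  # 100 streams sharing 100 MB
    print(seek_fraction(100, 50, 10, 100))   # 50 streams sharing 100 MB
```

        Under this model, 100 streams sharing 100 MB get 1 MB each, so half of the I/O time goes to seeking; 50 streams get 2 MB each, which brings the seek share down to about a third.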

        BTW, you've proposed a solution in the description rather than a problem. The problem, I assume, is that the sort-factor is non-optimal. Perhaps a better solution to this problem is to not specify the sort factor at all, but rather to have the sort code determine it automatically based on io.sort.mb? So if you increase io.sort.mb, you'd get a larger sort factor. Of course, then we'd have to make some assumptions about disk performance...
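
        One way the suggestion above might look is sketched here: the sort factor is derived from io.sort.mb and assumed disk characteristics. Everything in this sketch is hypothetical, including the function name, the `dominance` margin (how many times larger than the break-even size each buffer should be), and the lower bound of 2. It is not how Hadoop computes anything.

```python
def auto_sort_factor(sort_mb: float,
                     seek_ms: float = 10.0,            # assumed disk seek time
                     transfer_mb_per_s: float = 100.0, # assumed transfer rate
                     dominance: float = 2.0,           # assumed safety margin
                     minimum: int = 2) -> int:
    """Pick the number of streams to merge at once so that each stream's
    buffer is `dominance` times the seek/transfer break-even size."""
    break_even_mb = seek_ms * transfer_mb_per_s / 1000.0
    per_stream_mb = dominance * break_even_mb
    # A merge needs at least two input streams, hence the lower bound.
    return max(minimum, int(sort_mb // per_stream_mb))


if __name__ == "__main__":
    for mb in (1, 100, 200):
        print(mb, auto_sort_factor(mb))
```

        With these assumed defaults, a larger io.sort.mb yields a proportionally larger sort factor, which is the behaviour the comment asks for. The disk-performance assumptions the comment warns about show up here as the `seek_ms` and `transfer_mb_per_s` parameters.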


          People

          • Assignee: Owen O'Malley
          • Reporter: Owen O'Malley