  1. Apache Tez
  2. TEZ-3430

Make split sorting optional

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • None
    • 0.9.0
    • None
    • None
    • Reviewed

    Description

      The fair routing design in TEZ-3209 addresses skewed partitions, where one partition can be much larger than the others. To simplify stats tracking, however, it assumes that a given partition's data is distributed roughly evenly across source tasks, so that it can group consecutive source tasks together.

      However, this assumption does not hold, because MRInputHelpers's generateNewSplits and generateOldSplits sort the splits by size. As a result, the data size at the beginning of the source task range is larger than at the end.

      Arrays.sort(splits, new InputSplitComparator());
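
      The effect of that sort on consecutive grouping can be illustrated with a small sketch. It assumes, purely for illustration, that each source task's output for a partition is proportional to its split size and that the sort places the largest splits first; the class name, method names, and sizes are hypothetical and are not taken from Tez.

```java
import java.util.Arrays;
import java.util.Comparator;

final class SplitSortSkewDemo {

    /** Sums the sizes in each run of groupSize consecutive source tasks. */
    static long[] groupTotals(long[] taskSizes, int groupSize) {
        int groups = (taskSizes.length + groupSize - 1) / groupSize;
        long[] totals = new long[groups];
        for (int i = 0; i < taskSizes.length; i++) {
            totals[i / groupSize] += taskSizes[i];
        }
        return totals;
    }

    /** Returns a copy ordered largest first (assumed order of a size-based sort). */
    static long[] sortedBySizeDescending(long[] sizes) {
        Long[] boxed = Arrays.stream(sizes).boxed().toArray(Long[]::new);
        Arrays.sort(boxed, Comparator.reverseOrder());
        return Arrays.stream(boxed).mapToLong(Long::longValue).toArray();
    }

    public static void main(String[] args) {
        // Hypothetical per-task sizes in their original, unsorted order.
        long[] original = {40, 10, 30, 20, 10, 40, 20, 30};

        // Two groups of four consecutive tasks, without and with the sort.
        System.out.println(Arrays.toString(groupTotals(original, 4)));
        System.out.println(Arrays.toString(
                groupTotals(sortedBySizeDescending(original), 4)));
    }
}
```

      With these made-up sizes, the unsorted order gives two groups of 100 each, while the sorted order gives 140 and 60, so the first group of consecutive source tasks carries far more data than the second.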
      

      One way to fix this is to have fair routing track not only the aggregated size of each partition, but also the size of each partition for each source task. However, that would significantly increase the memory footprint.

      Alternatively, the sorting above can be skipped. Test results for TEZ-3209 show that jobs can finish 30% faster when the source tasks' output sizes are more balanced.
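
      A minimal sketch of what an optional sort could look like follows. It is not the code from the attached patch: the Split class, the orderSplits method, and the sortSplits flag are hypothetical names chosen for illustration, and the largest-first order is an assumption.

```java
import java.util.Arrays;
import java.util.Comparator;

final class OptionalSplitSortSketch {

    /** Hypothetical stand-in for an input split; only the length matters here. */
    static final class Split {
        final String name;
        final long length;

        Split(String name, long length) {
            this.name = name;
            this.length = length;
        }
    }

    /**
     * Returns the splits in the order they would be handed to source tasks.
     * When sortSplits is false, the original order is preserved.
     */
    static Split[] orderSplits(Split[] splits, boolean sortSplits) {
        Split[] ordered = splits.clone();
        if (sortSplits) {
            // Assumed behavior of the size-based sort: largest split first.
            Arrays.sort(ordered,
                    Comparator.comparingLong((Split s) -> s.length).reversed());
        }
        return ordered;
    }

    public static void main(String[] args) {
        Split[] splits = {
            new Split("a", 10), new Split("b", 40), new Split("c", 20)
        };
        for (Split s : orderSplits(splits, false)) {
            System.out.print(s.name + " ");
        }
        System.out.println();
        for (Split s : orderSplits(splits, true)) {
            System.out.print(s.name + " ");
        }
        System.out.println();
    }
}
```

      Leaving the flag off keeps the splits in their generated order, which is the case the description argues gives fair routing more balanced consecutive source tasks.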

      Attachments

        1. TEZ-3430.patch
          31 kB
          Ming Ma

          People

            Assignee: Ming Ma (mingma)
            Reporter: Ming Ma (mingma)
            Votes: 0
            Watchers: 3