Apache Tez / TEZ-3209 Support for fair custom data routing / TEZ-4521

Partition stats should always be the uncompressed size


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.2
    • Fix Version/s: 0.10.3
    • Component/s: None
    • Labels: None

    Description

      We always record the compressed size in ExternalSorter#partitionStats, whereas we record the uncompressed size in UnorderedPartitionedKVWriter#sizePerPartition. The two should have consistent semantics.

      As far as I know, the uncompressed size is preferable for two reasons.

      1. The stats are used in FairShuffleVertexManager to configure parallelism. The standard ShuffleVertexManager, which is widely used, computes parallelism from the uncompressed size. If the stats stay compressed, we have to tune `tez.fair-shuffle-vertex-manager.desired-task-input-size` against the compressed size even though `tez.shuffle-vertex-manager.desired-task-input-size` is tuned against the uncompressed size.
      2. Ming pointed out in TEZ-3206 that we should use the uncompressed size. It looks like we missed creating a follow-up ticket.
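
To make the first reason concrete, here is a minimal sketch of how the same desired task input size yields different parallelism depending on which size the partition stats report. It is a simplification and not the actual vertex manager logic: the class `PartitionStatsSketch`, the method `estimateTasks`, and the byte counts are all hypothetical, and the 4:1 compression ratio is only an assumption for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration only; this is not Tez code.
final class PartitionStatsSketch {

    // Number of tasks needed so that each task reads roughly
    // desiredTaskInputSize bytes, given the reported per-partition sizes.
    static int estimateTasks(long[] partitionSizes, long desiredTaskInputSize) {
        long total = Arrays.stream(partitionSizes).sum();
        // Ceiling division, with a floor of one task.
        return (int) Math.max(1L, (total + desiredTaskInputSize - 1) / desiredTaskInputSize);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long desired = 100 * mb; // assumed desired task input size

        // The same three partitions, reported two different ways.
        long[] uncompressed = {200 * mb, 150 * mb, 50 * mb}; // 400 MB in total
        long[] compressed = {50 * mb, 40 * mb, 10 * mb};     // 100 MB in total

        System.out.println("uncompressed stats: " + estimateTasks(uncompressed, desired));
        System.out.println("compressed stats:   " + estimateTasks(compressed, desired));
    }
}
```

With these assumed numbers, the uncompressed stats give 4 tasks and the compressed stats give 1, so a desired input size tuned for one vertex manager would not carry over to the other unless both report the same kind of size.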


            People

              Assignee: okumin Shohei Okumiya
              Reporter: okumin Shohei Okumiya
              Votes: 0
              Watchers: 2


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 20m