Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.20.2, 0.23.0
    • Fix Version/s: 0.24.0
    • Component/s: task, tasktracker
    • Labels:
      None

      Description

      As clusters and jobs grow, we see many empty partitions in MapOutputFile, caused by large reducer counts or partition skew. When map output compression is enabled, each empty map output partition becomes larger and incurs additional compressor/decompressor initialization overhead.
      This can be optimized by allowing empty MapOutputFile segments, in which both rawLength and partLength of the IndexRecord equal 0. Corresponding support needs to be added to the IFile reader, the IFile writer, and the reduce shuffle copier.
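
      The proposal above can be pictured with a small, self-contained sketch. It is not Hadoop code: IndexRecordSketch and EmptySegmentDemo are hypothetical names, and only the field names rawLength and partLength come from the description. The sketch assumes an empty partition is recorded with both lengths set to 0, so it occupies no bytes in the output file and a reader or copier can skip it without opening a decompressor.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an index record; not the Hadoop class. */
final class IndexRecordSketch {
    final long startOffset; // byte offset of the segment in the map output file
    final long rawLength;   // uncompressed length of the segment
    final long partLength;  // on-disk (possibly compressed) length of the segment

    IndexRecordSketch(long startOffset, long rawLength, long partLength) {
        this.startOffset = startOffset;
        this.rawLength = rawLength;
        this.partLength = partLength;
    }

    /** An empty segment is one where both lengths are 0. */
    boolean isEmpty() {
        return rawLength == 0 && partLength == 0;
    }
}

final class EmptySegmentDemo {
    /** Builds one index record per partition from the given per-partition lengths. */
    static List<IndexRecordSketch> buildIndex(long[] rawLengths, long[] partLengths) {
        List<IndexRecordSketch> index = new ArrayList<>();
        long offset = 0;
        for (int i = 0; i < partLengths.length; i++) {
            index.add(new IndexRecordSketch(offset, rawLengths[i], partLengths[i]));
            offset += partLengths[i]; // an empty segment does not advance the offset
        }
        return index;
    }

    /** Counts the segments a reader or copier would actually have to seek to and read. */
    static int segmentsToRead(List<IndexRecordSketch> index) {
        int count = 0;
        for (IndexRecordSketch record : index) {
            if (!record.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Three partitions; the middle one received no records.
        List<IndexRecordSketch> index =
            buildIndex(new long[] {100, 0, 40}, new long[] {60, 0, 30});
        System.out.println("segments to read: " + segmentsToRead(index));
    }
}
```

      In this sketch the reader-side change amounts to a single isEmpty() check before any seek or codec setup; the real IFile reader, writer, and shuffle copier would each need an equivalent guard.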

        Activity

        Binglin Chang added a comment -

        LzoCodec: 2-byte EOF marker + 4-byte checksum -> 14 bytes of compressed data + 4-byte checksum
        GzipCodec: 2-byte EOF marker + 4-byte checksum -> 26 bytes of compressed data + 4-byte checksum
        Empty segments contain no bytes at all, so the seek and read in MapOutputServlet can also be skipped.
        This optimization only matters in extreme cases. I often see a large proportion (90%) of empty segments in very large jobs (particularly those with a map-side filter) in our cluster. This is partly due to bad configuration or a bad partitioner, but tuning a partitioner or key distribution is sometimes non-trivial for users.
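
        For readers who want to reproduce this kind of measurement, here is a rough sketch. It uses the JDK's java.util.zip.GZIPOutputStream rather than Hadoop's GzipCodec, so the byte counts it prints may differ from the figures quoted above. The class name CompressedEmptySegmentSize is hypothetical, and the two 0xFF bytes are an assumed stand-in for the 2-byte EOF marker.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper: measures how large a tiny payload becomes after gzip. */
final class CompressedEmptySegmentSize {
    /** Returns the number of bytes produced by gzip-compressing the payload. */
    static int gzipSize(byte[] payload) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
            gz.write(payload);
        } catch (IOException e) {
            // Not expected for an in-memory sink; rethrown unchecked for simplicity.
            throw new UncheckedIOException(e);
        }
        return sink.size();
    }

    public static void main(String[] args) {
        // Assumption: the EOF marker of an empty segment is two 0xFF bytes.
        byte[] eofMarker = {(byte) 0xFF, (byte) 0xFF};
        System.out.println("uncompressed bytes: " + eofMarker.length);
        System.out.println("gzip bytes:         " + gzipSize(eofMarker));
    }
}
```

        The point is only that the fixed gzip header and trailer dominate when the payload is a couple of bytes, which is why a zero-length segment avoids more than the marker itself.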

        Harsh J added a comment -

        How much is the overhead of compressed, empty partition files?


          People

          • Assignee:
            Unassigned
            Reporter:
            Binglin Chang
          • Votes:
            1
            Watchers:
            3

            Dates

            • Created:
              Updated: