• Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.23.0
    • Fix Version/s: None
    • Component/s: performance, task
    • Labels:


      When running oprofile on a terasort workload, I noticed that a large amount of CPU usage was going to MapTask$ Upon disassembling this and looking at cycle counters, most of the cycles were going to memory loads dereferencing into the array of key-value data – implying expensive cache misses. This can be avoided as follows:

      • Rather than swapping only the indexes into the kv array, swap the entire meta entries in the meta array. Swapping 16 bytes is only negligibly slower than swapping 4 bytes. This requires storing the value length in the meta array, since we used to rely on the previous entry in the meta array to determine it. So we replace INDEX with VALUELEN and avoid one layer of indirection.
      • Introduce an interface that allows key types to provide a 4-byte comparison proxy. For string keys, this can simply be the first 4 bytes of the string. The idea is that if stringCompare(key1.proxy(), key2.proxy()) != 0, then compare(key1, key2) should have the same result (that is, the same sign). If the proxies are equal, the normal comparison method is used. We then include the 4-byte proxy in the metadata entry, so that in many cases the indirection into the data buffer can be avoided.
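
      The two ideas above can be sketched together as follows. This is a minimal illustration, not the code from the attached patches: the class name ProxySortSketch, the 4-int entry layout (PROXY, KEYSTART, VALSTART, VALUELEN), and the assumption that each value is stored directly after its key are all hypothetical, and the insertion sort is only a stand-in for the real sorter.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class ProxySortSketch {
    // Hypothetical layout: every record owns 4 ints (16 bytes) in the meta array.
    static final int NMETA = 4, PROXY = 0, KEYSTART = 1, VALSTART = 2, VALUELEN = 3;

    final byte[] data;  // serialized key/value bytes
    final int[] meta;   // NMETA ints per record
    final int count;

    ProxySortSketch(String[][] records) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        meta = new int[records.length * NMETA];
        for (int r = 0; r < records.length; r++) {
            byte[] k = records[r][0].getBytes(StandardCharsets.UTF_8);
            byte[] v = records[r][1].getBytes(StandardCharsets.UTF_8);
            int base = r * NMETA;
            meta[base + KEYSTART] = out.size();
            out.write(k, 0, k.length);
            meta[base + VALSTART] = out.size();  // assumption: value follows key
            out.write(v, 0, v.length);
            meta[base + VALUELEN] = v.length;
            meta[base + PROXY] = proxy(k, 0, k.length);
        }
        data = out.toByteArray();
        count = records.length;
    }

    // First 4 key bytes packed big-endian and zero-padded, so that unsigned
    // int order agrees with unsigned byte order whenever the proxies differ.
    static int proxy(byte[] buf, int off, int len) {
        int p = 0;
        for (int i = 0; i < 4; i++) {
            p = (p << 8) | (i < len ? (buf[off + i] & 0xff) : 0);
        }
        return p;
    }

    int compare(int i, int j) {
        int a = i * NMETA, b = j * NMETA;
        int c = Integer.compareUnsigned(meta[a + PROXY], meta[b + PROXY]);
        if (c != 0) {
            return c;  // decided from the meta array alone, no load from data
        }
        // Proxies tie: fall back to a full unsigned byte-wise key comparison.
        int s1 = meta[a + KEYSTART], l1 = meta[a + VALSTART] - s1;
        int s2 = meta[b + KEYSTART], l2 = meta[b + VALSTART] - s2;
        for (int k = 0; k < Math.min(l1, l2); k++) {
            int d = (data[s1 + k] & 0xff) - (data[s2 + k] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return l1 - l2;
    }

    // Swap whole meta entries instead of swapping indexes that point at them.
    void swap(int i, int j) {
        int a = i * NMETA, b = j * NMETA;
        for (int k = 0; k < NMETA; k++) {
            int t = meta[a + k];
            meta[a + k] = meta[b + k];
            meta[b + k] = t;
        }
    }

    // Insertion sort, used here only to keep the sketch short.
    void sort() {
        for (int i = 1; i < count; i++) {
            for (int j = i; j > 0 && compare(j - 1, j) > 0; j--) {
                swap(j - 1, j);
            }
        }
    }

    String key(int i) {
        int b = i * NMETA;
        int start = meta[b + KEYSTART];
        return new String(data, start, meta[b + VALSTART] - start, StandardCharsets.UTF_8);
    }

    String value(int i) {
        int b = i * NMETA;
        return new String(data, meta[b + VALSTART], meta[b + VALUELEN], StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ProxySortSketch s = new ProxySortSketch(new String[][] {
            {"pear", "1"}, {"apple", "2"}, {"peach", "3"}, {"peachy", "4"}});
        s.sort();
        for (int i = 0; i < s.count; i++) {
            System.out.println(s.key(i) + " -> " + s.value(i));
        }
    }
}
```

      In this sketch, keys whose first 4 bytes differ are ordered without reading the data buffer at all; only ties such as "peach" versus "peachy" pay for the extra indirection. Storing VALUELEN in each entry is what keeps value(i) correct after entries have been moved, since a neighboring entry no longer describes the neighboring record.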

      On a terasort benchmark, these optimizations plus an optimization to WritableComparator.compareBytes dropped the aggregate mapside CPU millis by 40%, and the compare() routine mostly dropped off the oprofile results.

      1. hashed-sort-MAPREDUCE-3235.patch
        10 kB
        Gopal V
      2. map_sort_perf.diff
        8 kB
        Hal Mo
      3. mr-3235-poc.txt
        12 kB
        Todd Lipcon



          Todd Lipcon created issue -
          Todd Lipcon made changes -
          Field Original Value New Value
          Attachment mr-3235-poc.txt [ 12499917 ]
          Todd Lipcon made changes -
          Link This issue is related to MAPREDUCE-1639 [ MAPREDUCE-1639 ]
          Todd Lipcon made changes -
          Component/s performance [ 12316500 ]
          Hal Mo made changes -
          Attachment map_sort_perf.diff [ 12532169 ]
          Gopal V made changes -
          Attachment hashed-sort-MAPREDUCE-3235.patch [ 12552509 ]
          Gopal V made changes -
          Attachment hashed-sort-MAPREDUCE-3235.patch [ 12552654 ]
          Gopal V made changes -
          Attachment hashed-sort-MAPREDUCE-3235.patch [ 12552509 ]
          Todd Lipcon made changes -
          Link This issue is related to MAPREDUCE-4755 [ MAPREDUCE-4755 ]


            • Assignee:
              Todd Lipcon
            • Votes:
              0
            • Watchers:
              38


              • Created: