  1. Apache Tez
  2. TEZ-3582

Exception swallowed in PipelinedSorter causing incorrect results

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.8.4
    • Fix Version/s: 0.7.2, 0.9.0, 0.8.5
    • Component/s: None
    • Labels: None

    Description

      I've run into a potentially serious issue with yarn-tez mapreduce.

      We've recently moved from using classic mapreduce on hadoop 1.0.3 to using Tez, and a user noticed a data inconsistency in some results calculated via yarn-tez.

      On investigation, I've determined that an error occurred during key deserialization while sorting.

      In this case, PipelinedSorter.SpanMerger.ready() caught the resulting ExecutionException, logged only the message (it should really log the stack trace as well), and returned false. PipelinedSorter.spill() interpreted the returned false as an empty spill and continued with no indication that an error had occurred. As a result, the data that was in the sort buffer after the failing record was lost.
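
      The failure mode described above can be sketched in isolation. The following is a minimal illustration, not the actual Tez code and not the committed fix: the class SpillErrorSketch and the methods failedSort, readySwallowing, readyPropagating, and spill are hypothetical names, and the failed sort is simulated with a CompletableFuture that is completed exceptionally. It contrasts a ready()-style method that turns an ExecutionException into a plain false with one that rethrows the cause so the caller cannot mistake the failure for an empty spill.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class SpillErrorSketch {

    // Hypothetical stand-in for a sort task that failed while deserializing a key.
    static Future<Integer> failedSort() {
        CompletableFuture<Integer> f = new CompletableFuture<>();
        f.completeExceptionally(new IOException("key deserialization failed"));
        return f;
    }

    // Swallowing variant: the failure is reduced to "false", which the caller
    // cannot tell apart from "there was nothing to merge".
    static boolean readySwallowing(Future<Integer> sortTask) {
        try {
            sortTask.get();
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            // Only the message is logged; the stack trace is dropped.
            System.err.println("sort failed: " + e.getMessage());
            return false;
        }
    }

    // Propagating variant: the underlying cause is rethrown to the caller.
    static boolean readyPropagating(Future<Integer> sortTask) throws IOException {
        try {
            sortTask.get();
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("interrupted while waiting for sort", e);
        } catch (ExecutionException e) {
            throw new IOException("sort failed", e.getCause());
        }
    }

    // Caller that treats "false" as an empty spill, as in the report above.
    static String spill(Future<Integer> sortTask) {
        return readySwallowing(sortTask) ? "spilled" : "empty spill, nothing written";
    }

    public static void main(String[] args) {
        // The failed sort is silently reported as an empty spill.
        System.out.println(spill(failedSort()));
        try {
            readyPropagating(failedSort());
        } catch (IOException e) {
            // With propagation, the failure reaches the caller.
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

      In the swallowing variant the caller has no way to distinguish a failed sort from an empty one, which matches the silent data loss reported here. Whether the real fix rethrows, wraps, or fails the task in some other way is not stated in this description.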

      I suspect that there may also be an error somewhere else in the sort code that is causing buffer corruption (or index corruption), since we've been using this mapreduce code for years and have never seen a deserialization error here; however, I can't confirm that there isn't a subtle error on our side.

      In any case, the fact that Tez is silently swallowing errors is a critical issue for us, as we can't trust the results it produces.

      Attachments

        1. TEZ-3582.debug.patch
          4 kB
          Rajesh Balamohan
        2. logs.zip
          15 kB
          Travis Woodruff
        3. TEZ-3582.1.patch
          2 kB
          Rajesh Balamohan
        4. TEZ-3582.2.patch
          3 kB
          Rajesh Balamohan

          People

            Assignee: Rajesh Balamohan (rajesh.balamohan)
            Reporter: Travis Woodruff (tmwoodruff)
            Votes: 0
            Watchers: 5