Hadoop Map/Reduce: MAPREDUCE-1932

record skipping doesn't work with the new map/reduce api

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.20.1
    • Fix Version/s: None
    • Component/s: task
    • Labels: None

      Description

      The new HADOOP-1230 map/reduce API doesn't support the record skipping feature.
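      For context, the feature in question is configured through the old mapred API's SkipBadRecords helper. A minimal sketch of that setup follows; the method names are per the old org.apache.hadoop.mapred API, while the attempt count, skip limit, and output path are illustrative values, not recommendations:

      ```java
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.mapred.SkipBadRecords;

      public class SkippingJobSetup {
          public static JobConf configure(JobConf conf) {
              // Enter skipping mode only after two ordinary failed attempts.
              SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
              // Tolerate at most one skipped record per bad region in the map phase.
              SkipBadRecords.setMapperMaxSkipRecords(conf, 1);
              // Record the skipped records for later inspection (illustrative path).
              SkipBadRecords.setSkipOutputPath(conf, new Path("/tmp/skipped-records"));
              return conf;
          }
      }
      ```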


        Activity

        Harsh J added a comment -

        Done, didn't notice that one.

        Robert Joseph Evans added a comment -

        Shouldn't MAPREDUCE-2165 be closed as Won't Fix also, as it is a dependent sub-task of this one?

        Harsh J added a comment -

        Won't Fix, per Tom and Owen's comments above.

        Owen O'Malley added a comment -

        The record skipping had a very unintuitive API and was not well tested.

        I'd recommend having best practices to deal with it rather than a bunch of framework changes.
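        A minimal sketch of the "best practices" approach Owen alludes to, assuming the common pattern of guarding per-record work in user code and counting failures instead of relying on framework-level skipping. All names here (BadRecordSkipper, parseRecord) are illustrative, not Hadoop APIs:

        ```java
        // Guard each record in user code: a malformed record is counted and
        // dropped rather than being allowed to kill the task attempt.
        public class BadRecordSkipper {
            private long badRecords = 0; // in a real job, a Hadoop Counter

            // Parse one input line; throws on malformed input.
            static int parseRecord(String line) {
                return Integer.parseInt(line.trim());
            }

            // Process a record defensively: swallow and count failures.
            Integer processSafely(String line) {
                try {
                    return parseRecord(line);
                } catch (RuntimeException e) {
                    badRecords++; // context.getCounter(...).increment(1) in a real mapper
                    return null;  // emit nothing for this record
                }
            }

            long badRecordCount() { return badRecords; }

            public static void main(String[] args) {
                BadRecordSkipper s = new BadRecordSkipper();
                String[] input = {"1", "2", "oops", "4"};
                long sum = 0;
                for (String line : input) {
                    Integer v = s.processSafely(line);
                    if (v != null) sum += v;
                }
                System.out.println(sum + " " + s.badRecordCount()); // prints "7 1"
            }
        }
        ```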

        Tom White added a comment -

        I wonder whether we want to add this to the new API, when we could instead suggest that people launch their own subprocess (as Owen suggests here: http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201108.mbox/%3cCAFQoU9Ekv+SBvAv-bSF5dORJO68VSj6zTqXywWUT+qHS3V3bbA@mail.gmail.com%3e).

        As I understand it, the record skipping feature finds bad records by doing a binary search on the record range covered by a given task, so it has to re-run the task many times until the size of the window is below a given threshold. Also, I'm not sure how it copes with the case of multiple corrupted records in a single split.
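        The narrowing behaviour Tom describes can be sketched as a toy binary search over a record range. This is a simulation of the idea, not Hadoop's actual implementation, and it assumes a single bad record per range, which mirrors the multiple-corrupt-records limitation Tom raises:

        ```java
        import java.util.function.IntPredicate;

        public class SkipWindowDemo {
            // Narrow [lo, hi) around a failing record until the window is no
            // larger than maxSkip, "re-running" halves the way the framework
            // re-runs task attempts. Returns {lo, hi} of the skip window.
            static int[] findSkipWindow(int lo, int hi, int maxSkip, IntPredicate isBad) {
                while (hi - lo > maxSkip) {
                    int mid = (lo + hi) / 2;
                    if (rangeFails(lo, mid, isBad)) {
                        hi = mid; // bad record is in the first half
                    } else {
                        lo = mid; // first half succeeded; search the second
                    }
                }
                return new int[] {lo, hi};
            }

            // Simulates re-running a task attempt over records [lo, hi):
            // the attempt "fails" if it hits any bad record.
            static boolean rangeFails(int lo, int hi, IntPredicate isBad) {
                for (int i = lo; i < hi; i++) {
                    if (isBad.test(i)) return true;
                }
                return false;
            }

            public static void main(String[] args) {
                // 100 records, record 42 is corrupt, threshold window of 1.
                int[] w = findSkipWindow(0, 100, 1, i -> i == 42);
                System.out.println(w[0] + ".." + w[1]); // prints "42..43"
            }
        }
        ```

        Each halving costs a full re-run of the surviving range, which is why the feature re-runs the task many times before the window shrinks below the threshold.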

        Harsh J added a comment -

        Here's the first attempt at this to get it rolling (smells like a regression!).

        Will add a test case for this soon and up a fresh patch post-verification.


          People

          • Assignee: Harsh J
          • Reporter: Owen O'Malley
          • Votes: 0
          • Watchers: 7
