Details

    • Hadoop Flags:
      Reviewed

      Description

      Currently, userlogs directory in TaskTracker is placed under hadoop.log.dir like <hadoop.log.dir>/userlogs. I am proposing to spread these userlogs onto multiple configured mapred.local.dirs to strengthen TaskTracker reliability w.r.t disk failures.

      1. MAPREDUCE-2415-3.patch
        36 kB
        Bharath Mundlapudi
      2. MAPREDUCE-2415-2.patch
        36 kB
        Bharath Mundlapudi
      3. TaskTracker Userlogs Design.pdf
        46 kB
        Bharath Mundlapudi
      4. MAPREDUCE-2415-1.patch
        18 kB
        Bharath Mundlapudi

        Issue Links

          Activity

          Hide
          Bharath Mundlapudi added a comment -

          Attaching the patch.

          Show
          Bharath Mundlapudi added a comment - Attaching the patch.
          Hide
          Bharath Mundlapudi added a comment -

          Attaching the design document.

          Show
          Bharath Mundlapudi added a comment - Attaching the design document.
          Hide
          Ravi Gummadi added a comment -

          >> to strengthen TaskTracker reliability w.r.t disk failures.

          In addition to that, user logs can grow to bigger sizes and can stay alive for more time by setting mapred.userlog.retain.hours to higher value because of spreading them on to multiple disks.

          Show
          Ravi Gummadi added a comment - >> to strengthen TaskTracker reliability w.r.t disk failures. In addition to that, user logs can grow to bigger sizes and can stay alive for more time by setting mapred.userlog.retain.hours to higher value because of spreading them on to multiple disks.
          Hide
          Owen O'Malley added a comment -

          It looks like you'll have undefined references when you goto cleanup. In general you should free the memory when you are done with it rather than moving the frees to a cleanup.

          Part of getting this code (or any important C/C++ code) correct requires testing and running in valgrind. Run both the unit tests and manual tests with valgrind, in particular looking at the failure cases and fixing all of the undefined memory references and unfreed memory.

          Use stdbool.h instead of defining your own.

          You have tabs instead of all spaces.

          I think it would be clearer to replace the code that traverses the link with code that does:
          rm link
          foreach dir: goodir
          rm dir

          The reinitialization code needs to delete old user logs that are no longer pointed to by a symlink. This is easier given the change above.

          You should create a new method createLogDir in TaskController rather than testing the type in an if statement.

          You need to update the unit tests to test the new functionality.

          Show
          Owen O'Malley added a comment - It looks like you'll have undefined references when you goto cleanup. In general you should free the memory when you are done with it rather than moving the frees to a cleanup. Part of getting this code (or any important C/C++ code) correct requires testing and running in valgrind. Run both the unit tests and manual tests with valgrind, in particular looking at the failure cases and fixing all of the undefined memory references and unfreed memory. Use stdbool.h instead of defining your own. You have tabs instead of all spaces. I think it would be clearer to replace the code that traverses the link with code that does: rm link foreach dir: goodir rm dir The reinitialization code needs to delete old user logs that are no longer pointed to by a symlink. This is easier given the change above. You should create a new method createLogDir in TaskController rather than testing the type in an if statement. You need to update the unit tests to test the new functionality.
          Hide
          Owen O'Malley added a comment -

          I'd also change the "Invalid local dir count" error message to "No valid local directories."

          I'd also add an error message in get_nth_local_dir to say, "Invalid index %d for %d local directories."

          With those two covered, you don't need to include the log for "Invalid local dir."

          It looks like the new error messages in initialize_job after calling open_file_as_task_tracker will produce duplicated messages. Same after change_user.

          Show
          Owen O'Malley added a comment - I'd also change the "Invalid local dir count" error message to "No valid local directories." I'd also add an error message in get_nth_local_dir to say, "Invalid index %d for %d local directories." With those two covered, you don't need to include the log for "Invalid local dir." It looks like the new error messages in initialize_job after calling open_file_as_task_tracker will produce duplicated messages. Same after change_user.
          Hide
          Bharath Mundlapudi added a comment -

          Thanks for the code review, Owen. Attached a revised patch which incorporates your comments.

          Show
          Bharath Mundlapudi added a comment - Thanks for the code review, Owen. Attached a revised patch which incorporates your comments.
          Hide
          Owen O'Malley added a comment -

          Bhrath,
          Can you run test-patch on this patch?

          Show
          Owen O'Malley added a comment - Bhrath, Can you run test-patch on this patch?
          Hide
          Bharath Mundlapudi added a comment -

          Owen,

          I have run ant test-patch. Most of the things seems fine - no javac warnings or findbugs errors.
          But found two minor javadoc warnings. I am attaching a patch which fixes these warnings.

          You will see the following in the diff b/w these two (v2 and v3) patches:

          < + * @param TaskAttemptID ID of the task

          > + * @param taskID ID of the task

          < + * @return

          > + * @return target task attempt log directory

          Show
          Bharath Mundlapudi added a comment - Owen, I have run ant test-patch. Most of the things seems fine - no javac warnings or findbugs errors. But found two minor javadoc warnings. I am attaching a patch which fixes these warnings. You will see the following in the diff b/w these two (v2 and v3) patches: < + * @param TaskAttemptID ID of the task — > + * @param taskID ID of the task < + * @return — > + * @return target task attempt log directory
          Hide
          Bharath Mundlapudi added a comment -

          Attaching the updated patch for javadoc warnings.

          Show
          Bharath Mundlapudi added a comment - Attaching the updated patch for javadoc warnings.
          Hide
          Owen O'Malley added a comment -

          I just committed this to 204. Thanks, Bharath!

          Show
          Owen O'Malley added a comment - I just committed this to 204. Thanks, Bharath!
          Hide
          Owen O'Malley added a comment -

          Hadoop 0.20.204.0 was just released.

          Show
          Owen O'Malley added a comment - Hadoop 0.20.204.0 was just released.

            People

            • Assignee:
              Bharath Mundlapudi
              Reporter:
              Bharath Mundlapudi
            • Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development