Key: HADOOP-3956
Type: New Feature
Status: Open
Priority: Major
Assignee: Unassigned
Reporter: Amir Youssefi
Votes: 0
Watchers: 8
Hadoop Common

map-reduce doctor (Mr Doctor)

Created: 14/Aug/08 11:36 PM   Updated: 10/Oct/08 03:33 PM
Component/s: None
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

Issue Links:
Reference
 


Description
Problem Description:

Users typically submit jobs with sub-optimal parameters, resulting in under-utilization, blacklisted task trackers, time-outs, retries, etc.

The issue can be mitigated by submitting jobs with custom Hadoop parameters.



Amir Youssefi added a comment - 15/Aug/08 12:37 AM
Proposed Solutions:

Hopefully, in the future Hadoop can develop dynamic configuration capabilities. Given the complexity of the issue, it may take a long time to get there.

Meanwhile, we can attack this problem from different angles or levels:

1) Having metrics: Provide understandable metrics on the Web UI to raise user awareness. We can expand the counters web page (or another page) to have more understandable and actionable metrics (e.g. a cluster utilization number) and more flow diagrams.
2) Detecting issues: Have an agent interpret the logs, then highlight an issue or trigger a process. Example: A rule-based agent loads a set of extensible rules and follows the Hadoop logs. An applicable rule creates a message/highlight in the UI or triggers a separate process.
3) Notification: The user gets a notification (e.g. email) from a process triggered by the rule-based agent above. This way, the user doesn't need to be pinned to the monitor watching the web UI all the time.

The focus of this JIRA is the development of the rule-based agent of item 2 above, which we call Mr Doctor (map-reduce doctor, aka Hadoop Doctor). It simply processes Hadoop logs and will be part of contrib. Mr Doctor will provide recommendations/prescriptions while following the live log of a running process or while reading postmortem logs.
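
To make the rule-based agent of item 2 concrete, here is a minimal sketch in Python. It is not Mr Doctor's design: the `Rule` class, the `diagnose` function, the two rules, and the sample log lines are all hypothetical, and the log text is not real Hadoop output. It only shows the shape of the idea: an extensible rule list is matched against log lines, and every applicable rule emits a prescription.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Rule:
    """One extensible rule: a pattern to look for and the advice to give."""
    name: str
    pattern: re.Pattern
    advice: str


def diagnose(lines: Iterable[str], rules: List[Rule]) -> Iterator[str]:
    """Yield one prescription for every (line, rule) pair that matches.

    `lines` can be any iterable, so the same function works on a
    postmortem log file or on a generator that follows a live log.
    """
    for line in lines:
        for rule in rules:
            if rule.pattern.search(line):
                yield f"[{rule.name}] {rule.advice}"


# Hypothetical rules; the patterns are assumptions, not real Hadoop log formats.
RULES = [
    Rule("task-timeout", re.compile(r"timed out", re.I),
         "Task timed out: report progress more often or raise the task timeout."),
    Rule("map-spill", re.compile(r"spill", re.I),
         "Map output is spilling to disk: consider a larger sort buffer."),
]

# Made-up sample lines standing in for a task log.
SAMPLE_LOG = [
    "INFO map 10% reduce 0%",
    "WARN attempt_01 timed out after 600 s",
    "INFO attempt_02 spill 12 finished",
]

if __name__ == "__main__":
    for finding in diagnose(SAMPLE_LOG, RULES):
        print(finding)
```

Adding a rule is just appending to `RULES`; a notification step (item 3) could consume the same stream of findings.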


Suhas Gogate added a comment - 15/Aug/08 02:08 AM
1. I have an agent written in Perl, which I call the "Hadoop Performance Adviser". It provides an extensible framework for evaluating the performance of a map/reduce job. It generates a report indicating potential problems affecting job performance, along with advice (if any) on corrective actions to rectify each problem.

2. The framework is extensible in the following sense:
– It allows adding new entries to a pre-defined list of performance and cluster utilization hints, and the framework evaluates them against the job execution counters and configuration parameters parsed from the log files.
– The hint subroutines are written in such a fashion that more complex hints can be built as boolean expressions over the existing set of hints.

I agree that there is a lot of potential in tools like this, which can help the user (as well as a grid service provider) get more targeted advice on job efficiency.
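
The composable hints of item 2 can be sketched as predicates plus boolean combinators. This is a Python illustration, not the Perl adviser itself: the combinators `all_of`, `any_of`, and `negate`, the two hints, and the counter names (`spilled_records`, `map_output_records`, `reduce_tasks`) are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Job counters and configuration values parsed from logs (names are hypothetical).
JobStats = Dict[str, float]
Hint = Callable[[JobStats], bool]


def all_of(*hints: Hint) -> Hint:
    """Hint that holds only when every given hint holds (boolean AND)."""
    return lambda stats: all(h(stats) for h in hints)


def any_of(*hints: Hint) -> Hint:
    """Hint that holds when at least one given hint holds (boolean OR)."""
    return lambda stats: any(h(stats) for h in hints)


def negate(hint: Hint) -> Hint:
    """Hint that holds when the given hint does not (boolean NOT)."""
    return lambda stats: not hint(stats)


def excessive_spill(stats: JobStats) -> bool:
    """More records spilled than produced means records were spilled repeatedly."""
    return stats.get("spilled_records", 0) > stats.get("map_output_records", 0)


def single_reducer(stats: JobStats) -> bool:
    """The whole reduce phase runs in one task."""
    return stats.get("reduce_tasks", 0) == 1


# A more complex hint built from existing ones.
spill_heavy_single_reducer = all_of(excessive_spill, single_reducer)

HINTS: List[Tuple[Hint, str]] = [
    (excessive_spill, "Map output is spilled more than once."),
    (spill_heavy_single_reducer, "Heavy spilling feeds a single reducer."),
]


def evaluate(stats: JobStats, hints: List[Tuple[Hint, str]]) -> List[str]:
    """Return the advice text of every hint that holds for this job."""
    return [advice for hint, advice in hints if hint(stats)]
```

For example, `evaluate({"spilled_records": 300, "map_output_records": 100, "reduce_tasks": 1}, HINTS)` returns both advice strings, while the same counters with `reduce_tasks` set to 8 return only the first.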


Amir Youssefi added a comment - 15/Aug/08 02:24 AM
Sample issues we detect from task/job logs:
  • Shuffle
  • Map Spill
  • Lagging single reducer (uneven distribution in terms of time or row count)

and more...
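
As an illustration of the lagging-reducer check, the sketch below flags tasks whose value (run time or row count) is far above the median of all tasks. The function name, the task IDs, and the factor of 3 are assumptions for the example; comparing against the median is one plausible heuristic, not necessarily what Mr Doctor would use.

```python
from statistics import median
from typing import Dict, List


def lagging_tasks(values: Dict[str, float], factor: float = 3.0) -> List[str]:
    """Return IDs of tasks whose value exceeds `factor` times the median.

    `values` maps a task ID to either its run time or its input row count,
    so the same check covers both kinds of uneven distribution.
    """
    if not values:
        return []
    mid = median(values.values())
    if mid <= 0:
        return []  # no meaningful baseline to compare against
    return sorted(task for task, v in values.items() if v > factor * mid)


# Made-up reducer run times in seconds.
reduce_seconds = {"r0": 100, "r1": 110, "r2": 95, "r3": 900}
print(lagging_tasks(reduce_seconds))
```

The median is used rather than the mean so that the lagging task itself does not pull the baseline up.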

Runping brought up a good point about the availability of data while a process is running. In some installations, logs are gathered by HOD after the job has finished, so the user needs to wait for the job to finish to see all the logs. We can change the logging process to some extent and improve the availability of items in the live log as the job progresses. Task counters are available as each task finishes.

BTW, the diagram in item 1 above refers to a progress diagram developed by Owen O'Malley.


Amir Youssefi added a comment - 15/Aug/08 02:33 AM - edited
More:
  • Task initialization problems. As we speak, I can see some tasks getting stuck in the initialization stage for 2 minutes or so. A user could detect this by watching the UI for a few hours, but Mr. Doctor can improve on that for both live and postmortem logs.
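
A check for stuck initialization could look like the sketch below. The event tuples, the phase names `init_start` and `init_done`, and the function name are hypothetical; real task logs would have to be parsed into this form first. The 120-second default mirrors the "2 minutes or so" mentioned above.

```python
from typing import Dict, Iterable, Optional, Tuple

# (task_id, phase, timestamp in seconds); phase is "init_start" or "init_done".
Event = Tuple[str, str, float]


def stalled_initializations(
    events: Iterable[Event],
    threshold_s: float = 120.0,
    now: Optional[float] = None,
) -> Dict[str, float]:
    """Map each slow task ID to the seconds it spent (or has spent) initializing.

    For postmortem logs leave `now` as None: only completed initializations
    are checked. For a live log pass the current time as `now`, so tasks that
    are still initializing are flagged as well.
    """
    started: Dict[str, float] = {}
    stalled: Dict[str, float] = {}
    for task_id, phase, ts in events:
        if phase == "init_start":
            started[task_id] = ts
        elif phase == "init_done" and task_id in started:
            elapsed = ts - started.pop(task_id)
            if elapsed >= threshold_s:
                stalled[task_id] = elapsed
    if now is not None:
        # Whatever is left in `started` has not finished initializing yet.
        for task_id, ts in started.items():
            if now - ts >= threshold_s:
                stalled[task_id] = now - ts
    return stalled
```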

Amir Youssefi added a comment - 15/Aug/08 04:06 AM
There is also another way to gather logs while the job runs: using Chukwa.

Chukwa can also give us other metrics, e.g. CPU utilization, to feed to Mr. Doctor.