1. I have an agent written in Perl, which I call "Hadoop Performance Adviser". It provides an extensible framework for evaluating the performance of a map/reduce job. It generates a report indicating potential problems affecting job performance, along with advice (if any) on corrective actions to rectify them.
2. The framework is extensible in the sense,
I agree that there is a lot of potential in such tools, which can help a user (as well as a grid service provider) get more targeted advice on job efficiency. Sample issues we detect from task/job logs:
and more...
Runping brought up a good point about the availability of data while the process is running. In some installations, logs are gathered by HOD after the job is finished, so the user needs to wait for the job to finish to see all logs. We can change the logging process to some extent and improve the availability of items in the in-progress live log. Task counters are available when each task finishes. BTW, the diagram in item 1 above refers to a progress diagram developed by Owen O'Malley. More:
Also, there is another path to gathering logs as the job runs: using Chukwa.
This way Chukwa can also give us other metrics, e.g. CPU utilization, to be fed to Mr. Doctor.
Hopefully, in the future, Hadoop can develop dynamic configuration capabilities. Given the complexity of the issue, it may take a long time to get there.
Meanwhile, we can attack this problem from different angles or levels:
1) Having metrics: Provide understandable metrics on the web UI to raise user awareness. We can expand the counters web page (or add another page) to show more understandable and actionable metrics (e.g. a cluster utilization number) and more flow diagrams.
2) Detecting issues: Have an agent interpret the logs, then highlight an issue or trigger a process. Example: a rule-based agent loads a set of extensible rules and follows the Hadoop logs. Each applicable rule creates a message/highlight in the UI or triggers a separate process.
3) Notification: The user gets a notification (e.g. email) from a process triggered by the rule-based agent above. This way, the user doesn't need to be pinned to the monitor watching the web UI all the time.
The focus of this JIRA is the development of the rule-based agent of item 2 above, which we call Mr Doctor (map-reduce doctor, aka Hadoop Doctor). It simply processes Hadoop logs and will be part of contrib. Mr Doctor will provide recommendations/prescriptions while following the live log of a running process or reading postmortem logs.
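To make the rule-based idea concrete, here is a minimal sketch in Python of how such an agent could be structured. It is not the proposed implementation. The names `Rule`, `Finding`, `diagnose`, and `RULES`, the `key=value` log lines, the 600-second threshold, and the advice texts are all hypothetical and do not reflect Hadoop's actual log format or any existing tool.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Rule:
    """One pluggable rule: a pattern to look for and a function that
    turns a match into advice (or None when nothing is wrong)."""
    name: str
    pattern: re.Pattern
    advise: Callable[[re.Match], Optional[str]]


@dataclass
class Finding:
    """A prescription produced by one rule for one log line."""
    rule: str
    line_no: int
    message: str


def diagnose(lines: Iterable[str], rules: List[Rule]) -> Iterator[Finding]:
    """Apply every rule to every line, yielding findings lazily.

    `lines` can be any iterable, so the same function can consume a
    finished (postmortem) log or a generator that follows a live log.
    """
    for line_no, line in enumerate(lines, start=1):
        for rule in rules:
            match = rule.pattern.search(line)
            if match is None:
                continue
            message = rule.advise(match)
            if message is not None:
                yield Finding(rule.name, line_no, message)


# Hypothetical threshold, chosen only for this illustration.
SLOW_TASK_SECONDS = 600


def slow_task_advice(match: re.Match) -> Optional[str]:
    task, seconds = match.group(1), int(match.group(2))
    if seconds > SLOW_TASK_SECONDS:
        return f"{task} ran for {seconds}s; look for skewed input"
    return None


def failed_task_advice(match: re.Match) -> Optional[str]:
    return f"{match.group(1)} failed; inspect its task log"


# Extending the agent means appending another Rule to this list.
RULES: List[Rule] = [
    Rule("slow_task",
         re.compile(r"task=(\S+).*\bduration_s=(\d+)"),
         slow_task_advice),
    Rule("failed_task",
         re.compile(r"task=(\S+).*\bstatus=FAILED\b"),
         failed_task_advice),
]
```

Each `Finding` could then be rendered as a highlight in the UI or handed to a notifier (item 3). Keeping the rules as plain data separate from the loop that reads the log is what makes the set extensible without touching the agent itself.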