Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Auto Closed
Description
nutch-default.xml still contains the property 'hadoop.job.history.user.location', which redirects the creation of job history files from the job output location to a custom location. I noticed that the current value does not work well with Cloudera (I have tested CDH3u4), because ${hadoop.log.dir} is not defined there. As a result, the jobtracker shows empty info for the job, with an 'incomplete' job status. This only happens once the job moves to 'retired'. While it is still listed as 'completed', everything looks fine.
This property can be set to 'none', because the job history is also stored in the central jobtracker location anyway. The 'hadoop.job.history.user.location' property only specifies an extra location. However, if it is set to an invalid value, it appears to prevent the central history location from storing the history as well. For more details, please see:
http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html
Besides setting it to 'none', another option is to set it to 'history', which does work with CDH. (This writes all logs to 'history' in the user directory on the configured filesystem, usually DFS.) The final option is to simply remove this property and not meddle with Hadoop properties at all. That, however, requires all jobs to correctly ignore the history files that Hadoop then writes to the job output location. I am not up to date on how well this currently works with Nutch jobs. This question is most relevant for trunk, since trunk relies heavily on the filesystem for jobs.
What do you think?
A) Set property to 'none'
B) Set property to 'history'
C) Remove property, see what happens, possibly fix jobs
D) ?
For now, I opt for A. But I think we need more input from people running other distributions (for example, official Hadoop 1.x) and also Nutch trunk.
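To make option A concrete, here is a minimal sketch of what the entry could look like. It assumes the usual Hadoop property layout and that the override goes into nutch-site.xml (the same entry could instead be edited in nutch-default.xml). The description text is my own wording, not taken from the existing file.

```xml
<!-- Sketch of option A: disable the extra per-user job history location.
     Assumption: placed inside the <configuration> element of nutch-site.xml. -->
<property>
  <name>hadoop.job.history.user.location</name>
  <value>none</value>
  <description>
    Disables the extra user job history location. The job history is
    still kept in the central jobtracker location.
  </description>
</property>

<!-- Option B would use the same entry with this value instead:
     <value>history</value>
     which writes to 'history' in the user directory on the configured
     filesystem. -->
```

Either variant avoids referencing ${hadoop.log.dir}, which is what is undefined on CDH3u4.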