Nutch / NUTCH-1452

hadoop.job.history.user.location in nutch-default making job history useless


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Auto Closed
    • None
    • 2.5
    • None
    • None

    Description

      nutch-default still contains the property 'hadoop.job.history.user.location', which redirects the creation of job history files from the job output location to a custom location. The current value does not work with Cloudera (tested on cdh3u4), because ${hadoop.log.dir} is not defined there. As a result, the jobtracker shows empty info for the job, with an 'incomplete' job status. This happens only once the job moves to 'retired'; while it is still 'completed', everything looks fine.

      This property can be set to 'none', because the job history is also stored in the central jobtracker location anyway; 'hadoop.job.history.user.location' only specifies an extra location. If it is set to an invalid value, however, it seems that the central history location does not store the history either. For more details, see:
      http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html

      Besides setting it to 'none', another option is to set it to 'history', which does work with CDH. (This writes all history logs to 'history' in the user's directory on the configured filesystem, usually DFS.) The final option is to simply remove the property and not meddle with Hadoop properties at all. That, however, requires all jobs to correctly ignore the history files written to their output locations. I am not up to date on how well this currently works with Nutch jobs. This question is most relevant for trunk, since trunk relies heavily on the filesystem for jobs.

      What do you think?
      A) Set property to 'none'
      B) Set property to 'history'
      C) Remove property, see what happens, possibly fix jobs
      D) ?

      For now, I opt for A. But I think we need more input from people using other distributions (for example, official Hadoop 1.x) and Nutch trunk.
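
      As a rough sketch, option A might look like this in nutch-default. The property name and the value 'none' come from the description above; the description text is only illustrative and is not the wording in the actual file.

      ```xml
      <!-- Sketch of option A: disable the extra per-user job history location. -->
      <property>
        <name>hadoop.job.history.user.location</name>
        <value>none</value>
        <!-- Illustrative description, not taken from the real nutch-default. -->
        <description>Do not write job history files to an extra user location.
        The history is still kept in the central jobtracker location.
        </description>
      </property>
      ```

      Option B would be the same entry with the value 'history' instead of 'none'.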

          People

            Assignee: Unassigned
            Reporter: Ferdy (ferdy.g)
            Votes: 0
            Watchers: 2
