NUTCH-2936

Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used


Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.19
    • Fix Version/s: 1.19
    • Component/s: plugin, protocol
    • Labels: None

    Description

      After merging NUTCH-2429, I observed that Nutch jobs running in distributed mode may fail early with the following dubious error:

      2022-01-14 13:11:45,751 ERROR crawl.DedupRedirectsJob: DeduplicationJob: java.io.IOException: Error generating shuffle secret key
              at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:182)
              at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1565)
              at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1562)
              at java.base/java.security.AccessController.doPrivileged(Native Method)
              at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
              at org.apache.hadoop.mapreduce.Job.submit(Job.java:1562)
              at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1583)
              at org.apache.nutch.crawl.DedupRedirectsJob.run(DedupRedirectsJob.java:301)
              at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
              at org.apache.nutch.crawl.DedupRedirectsJob.main(DedupRedirectsJob.java:379)
              at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
              at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.base/java.lang.reflect.Method.invoke(Method.java:566)
              at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
              at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
      Caused by: java.security.NoSuchAlgorithmException: HmacSHA1 KeyGenerator not available
              at java.base/javax.crypto.KeyGenerator.<init>(KeyGenerator.java:177)
              at java.base/javax.crypto.KeyGenerator.getInstance(KeyGenerator.java:244)
              at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:179)
              ... 16 more
      

      After removing the early registration of URL stream handlers (see NUTCH-2429) in NutchJob and NutchTool, the job starts without errors.
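
To narrow this down, the call that fails in the stack trace (KeyGenerator.getInstance for HmacSHA1) can be exercised on its own in the same JVM. The following is a minimal sketch, not Hadoop or Nutch code: the class name ShuffleKeyCheck and the method generatedKeyLength are hypothetical, and the key size of 64 bits is an assumption chosen for illustration.

```java
import java.security.NoSuchAlgorithmException;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

// Hypothetical diagnostic helper (not part of Nutch or Hadoop).
public class ShuffleKeyCheck {

    // Requests a KeyGenerator for the given algorithm and generates one key.
    // Throws NoSuchAlgorithmException if no security provider offers it,
    // which is the failure reported in the stack trace above.
    static int generatedKeyLength(String algorithm, int keySizeBits)
            throws NoSuchAlgorithmException {
        KeyGenerator keyGen = KeyGenerator.getInstance(algorithm);
        keyGen.init(keySizeBits); // key size in bits (assumed value, see lead-in)
        SecretKey key = keyGen.generateKey();
        return key.getEncoded().length;
    }

    public static void main(String[] args) {
        try {
            int length = generatedKeyLength("HmacSHA1", 64);
            System.out.println("HmacSHA1 KeyGenerator available, key bytes: " + length);
        } catch (NoSuchAlgorithmException e) {
            System.out.println("HmacSHA1 KeyGenerator NOT available: " + e.getMessage());
        }
    }
}
```

Running such a check once before and once after the URL stream handlers are registered might show whether the registration is what makes the algorithm lookup fail.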

      Notes:

      • The job in which this error was observed is a custom de-duplication job that flags redirects pointing to the same target URL. I'll try to reproduce it with a standard Nutch job and in pseudo-distributed mode.
      • We should also verify whether registering URL stream handlers works at all in distributed mode, because tasks are launched differently, not as NutchJob or NutchTool.
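
For readers unfamiliar with the mechanism, here is a hedged sketch of what a URL stream handler factory looks like in plain Java. It is not the Nutch implementation: the class name PluginProtocolFactory and the set of plugin protocols are hypothetical. The point it illustrates is that a factory can return null for protocols it does not own, in which case the JVM falls back to its built-in handlers.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Set;

// Hypothetical factory (illustration only, not Nutch code).
public class PluginProtocolFactory implements URLStreamHandlerFactory {

    // Protocols assumed to be served by plugins in this sketch.
    private final Set<String> pluginProtocols;

    public PluginProtocolFactory(Set<String> pluginProtocols) {
        this.pluginProtocols = pluginProtocols;
    }

    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
        if (!pluginProtocols.contains(protocol)) {
            // Returning null lets the JVM use its default handler
            // (for example for "file" or "jar" URLs).
            return null;
        }
        return new URLStreamHandler() {
            @Override
            protected URLConnection openConnection(URL url) throws IOException {
                // Placeholder: a real handler would delegate to a plugin.
                throw new IOException("sketch only, no connection for " + url);
            }
        };
    }
}
```

Such a factory would be installed with URL.setURLStreamHandlerFactory, which may be called at most once per JVM. That restriction is one reason why the point in time of registration matters, and why behavior in separately launched task JVMs needs its own verification.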

            People

              lewismc Lewis John McGibbney
              snagel Sebastian Nagel