Description
After merging NUTCH-2429, I've observed that Nutch jobs running in distributed mode may fail early with the following dubious error:
2022-01-14 13:11:45,751 ERROR crawl.DedupRedirectsJob: DeduplicationJob: java.io.IOException: Error generating shuffle secret key
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:182)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1565)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1562)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1562)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1583)
    at org.apache.nutch.crawl.DedupRedirectsJob.run(DedupRedirectsJob.java:301)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.nutch.crawl.DedupRedirectsJob.main(DedupRedirectsJob.java:379)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
Caused by: java.security.NoSuchAlgorithmException: HmacSHA1 KeyGenerator not available
    at java.base/javax.crypto.KeyGenerator.<init>(KeyGenerator.java:177)
    at java.base/javax.crypto.KeyGenerator.getInstance(KeyGenerator.java:244)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:179)
    ... 16 more
After removing the early registration of URL stream handlers (see NUTCH-2429) in NutchJob and NutchTool, the job starts without errors.
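The failing lookup can be checked in isolation. The following is only a diagnostic sketch, not code from Nutch or Hadoop: the class and method names (ShuffleKeyCheck, isKeyGeneratorAvailable) are hypothetical. It repeats the KeyGenerator.getInstance call that appears in the stack trace, with the algorithm name taken from the exception message. Calling it in the same JVM before and after the URL stream handlers are registered might help to narrow down whether the registration affects the lookup.

```java
import java.security.NoSuchAlgorithmException;
import javax.crypto.KeyGenerator;

// Hypothetical diagnostic helper, not part of Nutch or Hadoop.
class ShuffleKeyCheck {

  /** Returns true if this JVM can provide a KeyGenerator for the given algorithm. */
  static boolean isKeyGeneratorAvailable(String algorithm) {
    try {
      // Same JCA lookup as the one failing in JobSubmitter.submitJobInternal
      KeyGenerator.getInstance(algorithm);
      return true;
    } catch (NoSuchAlgorithmException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Algorithm name taken from the exception message in the stack trace
    String algorithm = args.length > 0 ? args[0] : "HmacSHA1";
    System.out.println(algorithm + " KeyGenerator available: "
        + isKeyGeneratorAvailable(algorithm));
  }
}
```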
Notes:
- the job in which this error was observed is a custom de-duplication job that flags redirects pointing to the same target URL. I'll try to reproduce it with a standard Nutch job and in pseudo-distributed mode.
- we should also verify whether registering URL stream handlers works at all in distributed mode. Tasks are launched differently, not via NutchJob or NutchTool.
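For the second note, one possible way to check from inside a task whether a handler is registered for a given protocol is sketched below. This is an assumption about how the verification could be done, not the approach taken in Nutch: the class and method names (UrlHandlerCheck, isProtocolSupported) are hypothetical. It relies only on the standard behaviour that converting a URI to a URL throws MalformedURLException when no stream handler is found for the protocol.

```java
import java.net.MalformedURLException;
import java.net.URI;

// Hypothetical diagnostic helper, not part of Nutch or Hadoop.
class UrlHandlerCheck {

  /** Returns true if the current JVM finds a URL stream handler for the protocol. */
  static boolean isProtocolSupported(String protocol) {
    try {
      // toURL() fails with MalformedURLException if no handler is registered
      URI.create(protocol + "://example.org/").toURL();
      return true;
    } catch (MalformedURLException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Protocols to probe; pass the ones provided by protocol plugins as arguments
    String[] protocols = args.length > 0 ? args : new String[] {"http", "https"};
    for (String protocol : protocols) {
      System.out.println(protocol + " handler registered: "
          + isProtocolSupported(protocol));
    }
  }
}
```

Calling isProtocolSupported from a task (for example while a mapper is being set up) and logging the result would show whether the handlers registered on the client side are also visible in the task JVM.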
Issue Links
- fixes
  - NUTCH-2949 Tasks of a multi-threaded map runner may fail because of slow creation of URL stream handlers (Closed)
- is caused by
  - NUTCH-2429 Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers (Closed)
- links to