Description
Issue
We have been facing a strange issue since we updated the Proxmox host from 7.2-4 to 7.2-11; it hosts the VMs/containers used for our Hadoop cluster.
When we use the sitemap component to add URLs, the generator process doesn't work: it generates 0 partitions.
But if we call the generator process a second time, it actually creates a segment.
This happens only when we use the sitemap process; if we use only the Injector process, the issue doesn't occur.
I checked the logs, and the generator simply seems to find no records in the crawldb, as if the crawldb were not available or its files were locked.
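One way to check this hypothesis (only a sketch, assuming the same crawldb path as below) would be to dump the crawldb statistics with the CrawlDbReader between the sitemap step and the first generator run:

hadoop jar <job> org.apache.nutch.crawl.CrawlDbReader crawl_000_111/crawldb -stats

If the new sitemap entries are counted there as db_unfetched while the Generator still selects 0 records, the crawldb content itself is fine and the problem would be its visibility to the Generator job.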
Here are the commands used:
Sitemap:
hadoop jar <job> org.apache.nutch.util.SitemapProcessor crawl_000_111/crawldb -sitemap <url_dir> -force
It returns, as expected:
2022-11-23 10:37:22,194 INFO util.SitemapProcessor: SitemapProcessor: Total records rejected by filters: 0
2022-11-23 10:37:22,195 INFO util.SitemapProcessor: SitemapProcessor: Total sitemaps from HostDb: 0
2022-11-23 10:37:22,195 INFO util.SitemapProcessor: SitemapProcessor: Total sitemaps from seed urls: 1
2022-11-23 10:37:22,196 INFO util.SitemapProcessor: SitemapProcessor: Total failed sitemap fetches: 0
2022-11-23 10:37:22,196 INFO util.SitemapProcessor: SitemapProcessor: Total new sitemap entries added: 151
Generator:
hadoop jar <job> org.apache.nutch.crawl.Generator crawl_000_111/crawldb crawl_000_111/segment -numFetchers 2 -mapper=1 -reducer=1 -noFilter -noNorm -force
The 1st run returns:
2022-11-23 11:25:15,202 WARN crawl.Generator: Generator: 0 records selected for fetching, exiting ...
The 2nd run returns:
2022-11-23 11:27:43,007 INFO crawl.Generator: Generator: Partitioning selected urls for politeness.
2022-11-23 11:27:44,009 INFO crawl.Generator: Generator: segment: crawl_000_111/segment/20221123112744
...
2022-11-23 11:28:34,061 INFO crawl.Generator: Generator: finished at 2022-11-23 11:28:34, elapsed: 00:01:53
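In the meantime, a possible workaround (only a sketch, assuming the Generator exits with a non-zero status when 0 records are selected, which may depend on the Nutch version) is to retry the same command once:

# retry the Generator once if the first run selects no records
for i in 1 2; do
  hadoop jar <job> org.apache.nutch.crawl.Generator crawl_000_111/crawldb crawl_000_111/segment \
    -numFetchers 2 -mapper=1 -reducer=1 -noFilter -noNorm -force && break
done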