Nutch / NUTCH-2975

Generate 0 partition when used with sitemap


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.16
    • Fix Version/s: None
    • Component/s: generator, sitemap
    • Labels: None
    • Environment: Hadoop 3.1.1

    Description

      Issue
      We have been facing a strange issue since we updated Proxmox from 7.2-4 to 7.2-11. Proxmox hosts the VMs/containers used for our Hadoop cluster.

      When we use the sitemap component to add URLs, the generator process doesn't work: it generates 0 partitions.

      But if we run the generator process a second time, it does create a partition segment.

      It happens only when we use the sitemap process. If we use only the Injector process, the issue doesn't occur.
      I checked the logs, and the generator simply seems to find no records in the crawldb. It is as if the crawldb weren't available or its files were locked.

      Here are the commands used:
      Sitemap:

      hadoop jar <job> org.apache.nutch.crawl.Generator crawl_000_111/crawldb crawl_000_111/segment -numFetchers 2 -mapper=1 -reducer=1 -noFilter -noNorm -force
      

      It returns the expected output:

      Sitemap output

      2022-11-23 10:37:22,194 INFO util.SitemapProcessor: SitemapProcessor: Total records rejected by filters: 0
      2022-11-23 10:37:22,195 INFO util.SitemapProcessor: SitemapProcessor: Total sitemaps from HostDb: 0
      2022-11-23 10:37:22,195 INFO util.SitemapProcessor: SitemapProcessor: Total sitemaps from seed urls: 1
      2022-11-23 10:37:22,196 INFO util.SitemapProcessor: SitemapProcessor: Total failed sitemap fetches: 0
      2022-11-23 10:37:22,196 INFO util.SitemapProcessor: SitemapProcessor: Total new sitemap entries added: 151
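
A check like the one below could confirm, before the Generator is started, that the sitemap step really added entries. It is a minimal sketch and not part of Nutch: `parse_sitemap_counters` is a hypothetical helper, and it assumes the summary lines keep the `SitemapProcessor: <label>: <count>` shape shown in the output above.

```python
import re

# Matches summary lines such as:
# "... INFO util.SitemapProcessor: SitemapProcessor: Total new sitemap entries added: 151"
# The label/count layout is assumed from the log excerpt in this report.
COUNTER_RE = re.compile(
    r"SitemapProcessor: SitemapProcessor: (?P<label>.+?): (?P<count>\d+)\s*$"
)


def parse_sitemap_counters(log_text):
    """Return {counter label: integer value} for each SitemapProcessor summary line."""
    counters = {}
    for line in log_text.splitlines():
        match = COUNTER_RE.search(line)
        if match:
            counters[match.group("label")] = int(match.group("count"))
    return counters


if __name__ == "__main__":
    sample = (
        "2022-11-23 10:37:22,196 INFO util.SitemapProcessor: "
        "SitemapProcessor: Total new sitemap entries added: 151\n"
    )
    counters = parse_sitemap_counters(sample)
    # Only proceed to the Generator when the sitemap step added something.
    if counters.get("Total new sitemap entries added", 0) == 0:
        print("Sitemap step added no entries; the Generator has nothing new to select.")
    else:
        print("Sitemap counters:", counters)
```

In this report the counter is 151, so a check of this kind would pass, which is consistent with the problem lying between the sitemap step and the first Generator run rather than in the sitemap fetch itself.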

      Generator:

      hadoop jar <job> org.apache.nutch.crawl.Generator crawl_000_111/crawldb crawl_000_111/segment -numFetchers 2 -mapper=1 -reducer=1 -noFilter -noNorm -force
      

      The first run returns:

      2022-11-23 11:25:15,202 WARN crawl.Generator: Generator: 0 records selected for fetching, exiting ...
      

      The second run returns:

      2022-11-23 11:27:43,007 INFO crawl.Generator: Generator: Partitioning selected urls for politeness.
      2022-11-23 11:27:44,009 INFO crawl.Generator: Generator: segment: crawl_000_111/segment/20221123112744
      ...
      2022-11-23 11:28:34,061 INFO crawl.Generator: Generator: finished at 2022-11-23 11:28:34, elapsed: 00:01:53
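
Since the second Generator run succeeds, one stopgap is to detect the "0 records selected" warning and run the job again. The sketch below illustrates that idea under stated assumptions: `extract_segment`, `selected_nothing` and `generate_with_retry` are hypothetical names, the `run_generator` callable stands in for however the `hadoop jar` command is launched and its log captured, and the matching relies only on the log messages quoted above. It works around the symptom and does not explain it.

```python
import re
import time

# Both markers are taken from the Generator log lines quoted in this report.
ZERO_RECORDS_MARKER = "Generator: 0 records selected for fetching"
SEGMENT_RE = re.compile(r"Generator: segment: (\S+)")


def extract_segment(log_text):
    """Return the segment path reported by the Generator, or None if absent."""
    match = SEGMENT_RE.search(log_text)
    return match.group(1) if match else None


def selected_nothing(log_text):
    """True when the log contains the '0 records selected' warning."""
    return ZERO_RECORDS_MARKER in log_text


def generate_with_retry(run_generator, max_attempts=2, delay_seconds=0.0):
    """Call run_generator() until a segment is reported or attempts run out.

    run_generator is a caller-supplied function (an assumption of this sketch)
    that runs the Generator job and returns its log output as a string.
    Returns (segment_path_or_None, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        log_text = run_generator()
        segment = extract_segment(log_text)
        if segment is not None:
            return segment, attempt
        if not selected_nothing(log_text):
            # Neither a segment nor the known warning: do not retry blindly.
            raise RuntimeError("Unrecognized Generator output; not retrying")
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return None, max_attempts


if __name__ == "__main__":
    # Replays the two outcomes described above instead of launching Hadoop.
    canned_logs = iter([
        "WARN crawl.Generator: Generator: 0 records selected for fetching, exiting ...",
        "INFO crawl.Generator: Generator: segment: crawl_000_111/segment/20221123112744",
    ])
    print(generate_with_retry(lambda: next(canned_logs)))
```

Passing the runner in as a function keeps the retry logic independent of how the job is started, so it can be exercised with canned logs as in the example.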
      

          People

            Assignee: Unassigned
            Reporter: Lucas Pauchard (lucasp)