  1. Nutch
  2. NUTCH-2588

Getting status code x01 (unfetched) on more than 80% of crawled URLs


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: crawldb, fetcher
    • Labels:
      None
    • Environment:

      I am using Apache Nutch 2.3.1 with Hadoop 2.7.6 and HBase 0.98.8-hadoop2.

      Operating System: Ubuntu 16.04

      Description

      When I run Nutch with external links enabled, a seed list of 10 URLs, and 5 rounds, using the command

      bin/crawl <seed_path> <db>  [<solr url>] <number of rounds>

      I use the default topN value, which is 50000.

      The process takes 11 to 12 hours to complete and generates about 280000 URL rows.

      When we analyze the HBase table and check the status codes of all URLs, we find roughly 242000 URLs with status code x01 (unfetched).

      This means that 242000 of the 280000 URLs that Nutch extracted remain unfetched.
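
      As a quick sanity check of the figures above, here is a minimal Python sketch of the arithmetic behind the "more than 80%" claim. It uses only the approximate row counts reported in this issue; the function name unfetched_share is hypothetical, and the script does not query HBase or Nutch.

```python
# Approximate figures reported in this issue (not read from HBase).
total_urls = 280_000      # URL rows generated by the crawl
unfetched_urls = 242_000  # rows with status code x01 (unfetched)


def unfetched_share(unfetched: int, total: int) -> float:
    """Return the fraction of rows that are still unfetched."""
    if total <= 0:
        raise ValueError("total must be positive")
    return unfetched / total


share = unfetched_share(unfetched_urls, total_urls)
remaining = total_urls - unfetched_urls  # rows with any other status

print(f"unfetched: {share:.1%}, other statuses: {remaining}")
# prints "unfetched: 86.4%, other statuses: 38000"
```

      With these numbers, only about 38000 rows carry a status other than x01.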

      After debugging Nutch and analyzing its logs, I found that no fetch was even attempted for the URLs with status code x01.

      Is this a bug in Nutch or a configuration issue?
      Kindly resolve my issue as soon as possible.

            People

            • Assignee:
              Unassigned
            • Reporter:
              Usama Tahir (usama_)
            • Votes:
              0
            • Watchers:
              2
