Nutch / NUTCH-2973

Single-element domain names (e.g. https://localnet) can't be crawled - filtering fails


Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.19
    • Fix Version/s: 1.20
    • Component/s: fetcher
    • Labels: None

    Description

      There appears to be a bug in the core of Nutch that prevents any URL with a single-element domain name from being crawled. Example:

      https://localnet/something.aspx

      The issue is that Nutch rejects any URL whose host name is a single element, such as "localnet" above. "localnet.com" is not rejected, nor is "local.localnet". It almost feels as if there's a chunk of code within Nutch, unrelated to the configurable filtering mechanisms, that rejects URLs outright when the host doesn't have a WWW-style format with a top-level domain such as .com.
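The observed behaviour is consistent with a host-validation pattern that requires at least one dot in the host part. The snippet below is a hypothetical illustration of that symptom, not the actual Nutch code: the regex rejects dotless hosts like "localnet" while accepting "localnet.com" and "local.localnet", exactly as described above.

```java
import java.util.regex.Pattern;

public class DotlessHostCheck {
    // Hypothetical filter pattern: the host part must contain at least
    // one dot. This mirrors the rejection behaviour reported here.
    static final Pattern HOST_WITH_DOT =
        Pattern.compile("^https?://[^/.]+\\.[^/]+.*$");

    static boolean accepted(String url) {
        return HOST_WITH_DOT.matcher(url).matches();
    }

    public static void main(String[] args) {
        // Dotless host: rejected
        System.out.println(accepted("https://localnet/something.aspx"));
        // Hosts containing a dot: accepted
        System.out.println(accepted("https://localnet.com/something.aspx"));
        System.out.println(accepted("https://local.localnet/something.aspx"));
    }
}
```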

      Error message:

      Total urls rejected by filters: 1

      I've checked and updated all the filter files in the conf directory. Even making them incredibly permissive (effectively "crawl everything") has not helped. As soon as a dot (.) is added to the host name, the URL is no longer rejected - e.g. blah.localnet.
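For context, a maximally permissive conf/regex-urlfilter.txt of the kind described above would end in a catch-all accept rule. This is a sketch based on the structure of the stock file, which ships with several skip rules before the final accept line:

```
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# accept anything else
+.
```

Even with only the catch-all `+.` rule in place, the reporter still sees the dotless-host URL rejected, which suggests the rejection happens outside the configurable regex filter.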


            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: David Smith (daviddsmith)
