Nutch / NUTCH-2365

HTTP Redirects to SubDomains don't get crawled if db.ignore.external.links.mode == byDomain

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.12
    • Fix Version/s: 1.14
    • Component/s: fetcher
    • Labels: None
    • Environment: Fedora 25
    • Patch Info: Patch Available

    Description

      Crawling http://www.mercenarytrader.com, which redirects to https://members.mercenarytrader.com, stops at the redirect: Nutch does not follow it even though 'db.ignore.external.links' is set to 'true' and 'db.ignore.external.links.mode' is set to 'byDomain'.
      The bug is in FetcherThread, where the redirect target is compared by host and not by domain:

      String origHost = new URL(urlString).getHost().toLowerCase();
      String newHost = new URL(newUrl).getHost().toLowerCase();
      if (ignoreExternalLinks) {
        if (!origHost.equals(newHost)) {
          if (LOG.isDebugEnabled()) {
            LOG.debug(" - ignoring redirect " + redirType + " from "
                + urlString + " to " + newUrl
                + " because external links are ignored");
          }
          return null;
        }
      }
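
      For illustration only, here is a minimal standalone sketch of how the check could take the mode into account. It is not the committed patch. The class name RedirectScopeSketch and the helpers naiveDomain and isExternalRedirect are hypothetical, and the "domain" is approximated as the last two host labels, which is wrong for suffixes such as co.uk. Inside Nutch, a utility like URLUtil.getDomainName would presumably be the better choice. The sketch uses java.net.URI rather than java.net.URL so that it needs no checked exceptions.

```java
import java.net.URI;
import java.util.Locale;

public class RedirectScopeSketch {

  /**
   * Hypothetical helper: approximates the registered domain as the last
   * two labels of the host. This is an assumption made for the sketch and
   * does not handle multi-label public suffixes (e.g. "co.uk").
   */
  static String naiveDomain(String host) {
    String lower = host.toLowerCase(Locale.ROOT);
    String[] labels = lower.split("\\.");
    if (labels.length <= 2) {
      return lower;
    }
    return labels[labels.length - 2] + "." + labels[labels.length - 1];
  }

  /**
   * Returns true if a redirect from urlString to newUrl would be treated
   * as external. With byDomain == false the hosts are compared (the
   * behaviour described in this report); with byDomain == true the
   * approximated domains are compared instead.
   */
  static boolean isExternalRedirect(String urlString, String newUrl,
      boolean byDomain) {
    String origHost = URI.create(urlString).getHost().toLowerCase(Locale.ROOT);
    String newHost = URI.create(newUrl).getHost().toLowerCase(Locale.ROOT);
    if (byDomain) {
      return !naiveDomain(origHost).equals(naiveDomain(newHost));
    }
    return !origHost.equals(newHost);
  }

  public static void main(String[] args) {
    String from = "http://www.mercenarytrader.com";
    String to = "https://members.mercenarytrader.com";
    // Host comparison: the redirect is rejected as external.
    System.out.println("byHost external:   " + isExternalRedirect(from, to, false));
    // Domain comparison: the redirect stays within mercenarytrader.com.
    System.out.println("byDomain external: " + isExternalRedirect(from, to, true));
  }
}
```

      With the two URLs from this report, the host comparison flags the redirect as external while the domain comparison does not, which matches the behaviour expected from 'byDomain'.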

            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: Sriram Nookala (srinookala)
              Votes: 0
              Watchers: 5