Nutch / NUTCH-387

host normalization in Generator$Selector


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: generator
    • Labels: None
    • Environment: nutch trunk since revision 449088

    Description

      The host normalization in Generator$Selector#reduce at line 177 seems broken:

      String host = new URL(url.toString()).getHost();
      ...
      try {
        host = normalizers.normalize(host, URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
        host = new URL(host).getHost().toLowerCase();
      } catch (Exception e) {
        LOG.warn("Malformed URL: '" + host + "', skipping");
      }

      With the default configuration, the basic normalizer is called, which itself does 'new URL(host)'.
      The line below it also calls 'new URL(host)'.
      Since getHost() always returns the host without a protocol, a MalformedURLException is thrown every time.
      The job continues as usual, though, because the exception is caught and only a warning is logged.
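
The failure described above can be reproduced outside Nutch with plain java.net.URL. The following is a minimal sketch, not the fix that was committed for this issue: the class name HostNormalizationDemo and the helpers parses and lowerCaseHost are hypothetical names introduced only for illustration, and the example URL is made up.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

public class HostNormalizationDemo {

    /** Returns true if java.net.URL accepts the spec, false if it throws. */
    static boolean parses(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            // A bare host such as "www.example.com" ends up here ("no protocol").
            return false;
        }
    }

    /**
     * Hypothetical workaround (an assumption, not the actual Nutch patch):
     * extract the host from the full URL and lower-case it directly,
     * without passing the bare host through new URL(...) again.
     */
    static String lowerCaseHost(String url) throws MalformedURLException {
        return new URL(url).getHost().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) throws MalformedURLException {
        String url = "http://www.example.com/index.html";
        String host = new URL(url).getHost();

        // The bare host has no protocol, so re-parsing it fails.
        System.out.println("bare host parses:     " + parses(host));
        // Prefixing a scheme makes the same host parseable again.
        System.out.println("prefixed host parses: " + parses("http://" + host));
        System.out.println("lower-cased host:     " + lowerCaseHost(url));
    }
}
```

Under these assumptions, any code path that hands the result of getHost() back to new URL(...) will always land in the catch block, which matches the behaviour reported here.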

      People

        Assignee: Andrzej Bialecki
        Reporter: Johannes Zillmann