Nutch / NUTCH-253

Normalize Host during Generate

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.9.0
    • Component/s: fetcher
    • Labels: None

    Description

      Extend the URL normalizer to allow normalization of the hostname during Generate. By default, no rules are applied.

      In short, this allows foo.bar.com, bif.baz.bar.com and bar.com to be counted as the same host for generate.max.per.host, provided an appropriate regex is used.
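
      To make the effect concrete, here is a minimal standalone sketch of the idea. It is not the code from the attached patch and does not use any Nutch API. The class name, the method names, and the bar.com rule are all hypothetical, chosen only to mirror the hostnames in the example above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative only: a hypothetical stand-in for a host normalization rule,
// not the implementation shipped in nutch_hostnormalize.patch.
public class HostNormalizeSketch {

    // Assumed rule: strip any subdomain labels in front of bar.com.
    private static final Pattern SUBDOMAINS_OF_BAR =
            Pattern.compile("^(?:[^.]+\\.)*(bar\\.com)$");

    // Returns the normalized host; hosts that do not match the rule are
    // returned unchanged apart from lowercasing.
    public static String normalizeHost(String host) {
        return SUBDOMAINS_OF_BAR
                .matcher(host.toLowerCase(Locale.ROOT))
                .replaceFirst("$1");
    }

    // Counts URLs per normalized host, which is the figure a per-host
    // limit such as generate.max.per.host would be compared against.
    public static Map<String, Integer> countPerHost(List<String> hosts) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String host : hosts) {
            counts.merge(normalizeHost(host), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> hosts = List.of(
                "foo.bar.com", "bif.baz.bar.com", "bar.com", "example.org");
        // prints {bar.com=3, example.org=1}
        System.out.println(countPerHost(hosts));
    }
}
```

      With no rule configured, each of the three bar.com hostnames would be counted separately, which matches the default behaviour described above.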

      Add "urlnormalizer-regex" to plugin.includes in nutch-site.xml in order to enable it.

      Since several modules now extend the urlnormalizer base, a "scope" parameter in plugin.xml is used to differentiate between the various urlnormalizer modules, so that the right module is selected for Generate.

      Attachments

        1. nutch_hostnormalize.patch
          41 kB
          Rod Taylor

        Activity

          People

            Assignee: Unassigned
            Reporter: Rod Taylor (rbt)

            Dates

              Created:
              Updated:
              Resolved: