Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.9.0
    • Component/s: fetcher
    • Labels:
      None

      Description

      Extend the URL Normalizer to allow normalization of the hostname during Generate. By default, no rules are applied.

      In short, this allows foo.bar.com, bif.baz.bar.com and bar.com to be counted as the same host for generate.max.per.host, provided an appropriate regex is used.
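
      As a rough illustration of the kind of regex meant here, the sketch below collapses any subdomain of bar.com onto bar.com using plain java.util.regex. It does not use the Nutch normalizer API or its rule file format; the class name HostCollapseSketch, the method collapseHost and the pattern itself are assumptions made up for this example.

```java
import java.util.regex.Pattern;

public class HostCollapseSketch {
    // Hypothetical rule: drop every subdomain label in front of bar.com.
    // Hosts that do not end in ".bar.com" (or equal "bar.com") do not match.
    private static final Pattern RULE =
        Pattern.compile("^(?:[^.]+\\.)*(bar\\.com)$");

    // Returns the collapsed host, or the input unchanged when the rule does not match.
    static String collapseHost(String host) {
        return RULE.matcher(host).replaceFirst("$1");
    }

    public static void main(String[] args) {
        String[] hosts = {"foo.bar.com", "bif.baz.bar.com", "bar.com", "example.org"};
        for (String h : hosts) {
            System.out.println(h + " -> " + collapseHost(h));
        }
    }
}
```

      With a rule like this, the three bar.com hosts would share one key when counting against generate.max.per.host, while unrelated hosts such as example.org are left alone.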

      Add "urlnormalizer-regex" to plugin.includes in nutch-site.xml in order to enable it.

      Since several modules now extend the urlnormalizer base, a "scope" parameter in plugin.xml differentiates between the various urlnormalizer modules so that the right one is selected for Generate.

        Activity

        Andrzej Bialecki made changes -
        Status Open [ 1 ] Closed [ 6 ]
        Fix Version/s 0.9.0 [ 12312013 ]
        Resolution Fixed [ 1 ]
        Andrzej Bialecki added a comment -

        This issue is fixed as a part of NUTCH-365 in trunk. Changes were too intrusive to be ported to branch-0.8, although the patch in NUTCH-365 should apply more or less cleanly.

        Rod Taylor made changes -
        Field Original Value New Value
        Attachment nutch_hostnormalize.patch [ 12325743 ]
        Rod Taylor created issue -

          People

          • Assignee:
            Unassigned
            Reporter:
            Rod Taylor
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved:
