Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels:
      None

      Description

      This patch is a heavily restructured version of the patch in NUTCH-253, so much so that I decided to create a separate issue. It changes URL normalization from a single selectable class to a flexible, context-aware chain of normalization filters.

      Highlights:

      • rename all UrlNormalizer classes to URLNormalizer for consistency.
      • use a "chained filter" pattern for running several normalizers in sequence.
      • the order in which normalizers are executed is defined by the "urlnormalizer.order" property, which lists implementation classes separated by spaces. If more normalizers are active than are named in this list, the unlisted ones are run in random order after the listed ones have been executed.
      • define a set of contexts (or scopes) in which normalizers may be called. Each scope can have its own list of normalizers (via the "urlnormalizer.scope.<scope_name>" property) and its own order (via the "urlnormalizer.order.<scope_name>" property). If any of these properties is missing, the default settings are used.
      • each normalizer may further select among several configurations, depending on the context in which it is called, using a modified API:

      URLNormalizer.normalize(String url, String scope);

      • if no config is defined for a given scope, the default config is used.
      • several standard contexts / scopes have been defined, and various applications have been modified to attempt to use the appropriate normalizers for their context.
      • all JUnit tests have been updated and run successfully.
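
The chaining and scope fallback described in the highlights above can be sketched roughly as follows. This is a minimal illustration, not the code from patch.txt: only the `normalize(String url, String scope)` signature comes from the description. The class names (`NormalizerChainSketch`, `Chain`, `StripFragment`, `StripSuffix`), the scope names "default" and "inject", and the use of short names instead of implementation class names in the order list are all assumptions made for the example. It models only the per-scope order with fallback to the default order, not the per-scope list of active normalizers.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a scope-aware normalizer chain; not the patch itself.
final class NormalizerChainSketch {
    // Assumed name for the fallback scope.
    static final String DEFAULT_SCOPE = "default";

    /** Mirrors the two-argument API named in the description. */
    interface URLNormalizer {
        String normalize(String url, String scope);
    }

    /** Removes a trailing "#fragment"; ignores the scope. */
    static final class StripFragment implements URLNormalizer {
        @Override public String normalize(String url, String scope) {
            int hash = url.indexOf('#');
            return hash < 0 ? url : url.substring(0, hash);
        }
    }

    /** Picks a per-scope config (a suffix to strip), falling back to the default config. */
    static final class StripSuffix implements URLNormalizer {
        private final Map<String, String> suffixByScope;
        StripSuffix(Map<String, String> suffixByScope) { this.suffixByScope = suffixByScope; }
        @Override public String normalize(String url, String scope) {
            String suffix = suffixByScope.getOrDefault(scope, suffixByScope.get(DEFAULT_SCOPE));
            if (suffix == null || suffix.isEmpty() || !url.endsWith(suffix)) return url;
            return url.substring(0, url.length() - suffix.length());
        }
    }

    /** Runs the listed normalizers first, then any remaining active ones. */
    static final class Chain {
        private final Map<String, URLNormalizer> active;      // name -> normalizer
        private final Map<String, List<String>> orderByScope; // scope -> ordered names
        Chain(Map<String, URLNormalizer> active, Map<String, List<String>> orderByScope) {
            this.active = active;
            this.orderByScope = orderByScope;
        }
        String normalize(String url, String scope) {
            // Use the scope's own order if present, else the default order.
            List<String> order = orderByScope.getOrDefault(
                scope, orderByScope.getOrDefault(DEFAULT_SCOPE, List.of()));
            Set<String> names = new LinkedHashSet<>();
            for (String name : order) {
                if (active.containsKey(name)) names.add(name);
            }
            // Unlisted normalizers run afterwards; their relative order is unspecified.
            names.addAll(active.keySet());
            String result = url;
            for (String name : names) {
                result = active.get(name).normalize(result, scope);
            }
            return result;
        }
    }

    /** Builds a small demo chain: only "fragment" is listed in the order. */
    static Chain demoChain() {
        Map<String, URLNormalizer> active = new LinkedHashMap<>();
        active.put("suffix", new StripSuffix(Map.of(DEFAULT_SCOPE, "/", "inject", "/index.html")));
        active.put("fragment", new StripFragment());
        return new Chain(active, Map.of(DEFAULT_SCOPE, List.of("fragment")));
    }

    public static void main(String[] args) {
        Chain chain = demoChain();
        System.out.println(chain.normalize("http://example.com/a/#top", DEFAULT_SCOPE));
        System.out.println(chain.normalize("http://example.com/index.html#x", "inject"));
    }
}
```

In this sketch the "inject" scope has no order of its own, so the default order applies, while `StripSuffix` still picks its "inject" config. Because "fragment" is listed and "suffix" is not, the fragment is always removed before the suffix check runs.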

      NUTCH-363 suggests to me that further changes may be required in this area. Perhaps we should combine urlfilters and urlnormalizers into a single URL-munging subsystem: now that we have support for scopes and flexible combinations of normalizers, we could turn URLFilters into a special case of normalizers (or vice versa, depending on the point of view) ...
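
One way to picture a filter as a special case of a normalizer is an adapter, sketched below. This is purely illustrative and not part of the patch. It assumes a filter is a one-argument function that returns the URL to accept it or null to reject it, and that a chain stops as soon as a step returns null. The names `FilterAsNormalizerSketch`, `asNormalizer`, `chain`, and `demo` are hypothetical.

```java
import java.util.Locale;

// Hypothetical sketch: treating a URL filter as a normalizer that may return null.
final class FilterAsNormalizerSketch {
    interface URLNormalizer {
        String normalize(String url, String scope);
    }

    /** Assumed filter contract: return the URL to accept it, null to reject it. */
    interface URLFilter {
        String filter(String url);
    }

    /** Wraps a filter so it can sit in a normalizer chain; the scope is ignored. */
    static URLNormalizer asNormalizer(URLFilter filter) {
        return (url, scope) -> url == null ? null : filter.filter(url);
    }

    /** Applies the steps in order and stops once a step rejects the URL. */
    static URLNormalizer chain(URLNormalizer... steps) {
        return (url, scope) -> {
            String result = url;
            for (URLNormalizer step : steps) {
                if (result == null) break;
                result = step.normalize(result, scope);
            }
            return result;
        };
    }

    /** Demo: lower-case the URL, then accept only http:// URLs. */
    static URLNormalizer demo() {
        URLNormalizer lowerCase = (url, scope) -> url.toLowerCase(Locale.ROOT);
        URLFilter httpOnly = url -> url.startsWith("http://") ? url : null;
        return chain(lowerCase, asNormalizer(httpOnly));
    }

    public static void main(String[] args) {
        URLNormalizer n = demo();
        System.out.println(n.normalize("HTTP://Example.com/", "default"));
        System.out.println(n.normalize("ftp://example.com/", "default"));
    }
}
```

Because the lower-casing step runs first in this sketch, a URL written as "HTTP://..." passes the filter, which hints at why a single ordered subsystem could be attractive.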

      1. patch.txt
        83 kB
        Andrzej Bialecki

        Activity

        Andrzej Bialecki created issue -
        Andrzej Bialecki made changes -
        Field Original Value New Value
        Assignee Andrzej Bialecki [ ab ]
        Andrzej Bialecki made changes -
        Attachment patch.txt [ 12340514 ]
        Andrzej Bialecki made changes -
        Resolution Fixed [ 1 ]
        Status Open [ 1 ] Closed [ 6 ]

          People

          • Assignee:
            Andrzej Bialecki
            Reporter:
            Andrzej Bialecki
          • Votes:
            0
            Watchers:
            0