Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels: None

      Description

      This patch is a heavily restructured version of the patch in NUTCH-253, so much so that I decided to create a separate issue. It changes URL normalization from a single selectable class to a flexible, context-aware chain of normalization filters.

      Highlights:

      • rename all UrlNormalizer classes to URLNormalizer for consistency.
      • use a "chained filter" pattern for running several normalizers in sequence
      • the order in which normalizers are executed is defined by the "urlnormalizer.order" property, which lists space-separated implementation classes. If more normalizers are active than are explicitly named on this list, they will be run in random order after the ones specified on the list are executed.
      • define a set of contexts (or scopes) in which normalizers may be called. Each scope can have its own list of normalizers (via the "urlnormalizer.scope.<scope_name>" property) and its own order (via the "urlnormalizer.order.<scope_name>" property). If any of these properties are missing, default settings are used.
      • each normalizer may further select among many configurations, depending on the context in which it is called, using a modified API (see the usage sketch after this list):

      URLNormalizer.normalize(String url, String scope);

      • if a config for a given scope is not defined, then the default config will be used.
      • several standard contexts/scopes have been defined, and various applications have been modified to use the appropriate normalizers for their context.
      • all JUnit tests have been modified, and run successfully.
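
      For illustration, here is a minimal, hypothetical usage sketch of the scoped chain. It is not taken from the patch: the "urlnormalizer.order" property and the URLNormalizers.normalize(url, scope) call appear in this issue, while the constructor call, the SCOPE_DEFAULT constant and the concrete plugin class names in the property value are assumptions.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.nutch.net.URLNormalizers;
      import org.apache.nutch.util.NutchConfiguration;

      public class ScopedNormalizeExample {
        public static void main(String[] args) throws Exception {
          Configuration conf = NutchConfiguration.create();
          // Run the basic normalizer before the regex normalizer; any other active
          // normalizers not named here run afterwards, in no particular order.
          // (The class names below are illustrative assumptions.)
          conf.set("urlnormalizer.order",
              "org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer "
            + "org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer");

          // Build the chain for one scope; if no scope-specific configuration
          // exists, the default settings are used.
          URLNormalizers normalizers = new URLNormalizers(conf, URLNormalizers.SCOPE_DEFAULT);
          String url = normalizers.normalize("HTTP://www.Example.COM/", URLNormalizers.SCOPE_DEFAULT);
          System.out.println(url);
        }
      }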

      NUTCH-363 suggests to me that further changes may be required in this area. Perhaps we should combine urlfilters and urlnormalizers into a single subsystem of URL munging: now that we have support for scopes and flexible combinations of normalizers, we could turn URLFilters into a special case of normalizers (or vice versa, depending on the point of view)...

      1. patch.txt
        83 kB
        Andrzej Bialecki

        Activity

        Doug Cook added a comment -

        Hi, Andrzej.

        Sounds very cool. Haven't had a chance to check out the patch yet to see if it supports this, but attaching a related discussion from the email list...

        ------

        Neal Richter wrote:

        Doug,

        I think it sounds like a good idea. It eliminates the need to order the
        rules precisely...

        We don't iterate them in HtDig and it's been on my todo list for a while as
        well.

        I would iterate until no matches, some max iteration number, or the URL is
        obviously junk.

        For the max iteration number I would use the number of rewrite rules you
        have. So if you have 10 rules, you iterate on all 10 rules 10 times. That
        will cover the case where your rules 'chain' in a 10 step sequence. Sure
        it's an edge case to do that, but I can see rule sets where you construct
        3-step chains (like swapping strings or something).

        Thanks

        Neal

        On 8/30/06, Doug Cook <nabble@...> wrote:
        >
        >
        > Hi,
        >
        > I've run across a few patterns in URLs where applying a normalization puts
        > the URL in a form matching another normalization pattern (or even the same
        > one). But that pattern won't get executed because the patterns are applied
        > only once.
        >
        > Should normalization iterate until no patterns match (with, perhaps, some
        > limit to the number of iterations to prevent loops from pattern mistakes)?
        >
        > It's a minor problem; it doesn't seem to affect too many URLs for things
        > like session ID removal, since finding two session IDs in the same URL is
        > rare (but does happen – that's how I noticed this). I could imagine it
        > being much more significant, however, if other Nutch users out there are
        > using "broader" normalization patterns.
        >
        > Any philosophical/practical objections? (it's early, I've only had 1
        > coffee,
        > and I've probably missed something obvious!)
        >
        > I'll file an issue and add it to my queue of things to do if people think
        > its a good idea.
        >
        > -Doug
        > –
        > View this message in context:
        > http://www.nabble.com/Should-URL-normalization-iterate--tf2190244.html#a6059957
        > Sent from the Nutch - Dev forum at Nabble.com.
        >
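
        For clarity, a minimal sketch of the iteration scheme discussed above (not part of the attached patch; the class names are hypothetical): the whole rule set is re-applied until no rule changes the URL, or until a cap derived from the number of rules is reached.

        import java.util.List;
        import java.util.regex.Pattern;

        /** Hypothetical match/replace rule, as in a regex-based normalizer. */
        class RewriteRule {
          final Pattern pattern;
          final String replacement;
          RewriteRule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
          }
          String apply(String url) {
            return pattern.matcher(url).replaceAll(replacement);
          }
        }

        class IterativeNormalizer {
          /** Re-apply the rules until the URL stops changing, or until the cap
              (here, the number of rules, as Neal suggests) is reached. */
          static String normalize(String url, List<RewriteRule> rules) {
            int maxIterations = Math.max(1, rules.size());
            for (int i = 0; i < maxIterations; i++) {
              String before = url;
              for (RewriteRule rule : rules) {
                url = rule.apply(url);
              }
              if (url.equals(before)) {
                break;              // fixed point: no rule matched on this pass
              }
            }
            return url;
          }
        }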

        Doug Cook added a comment -

        PS. I like your idea of combining URL filters & normalization. In a sense, a "filter" is just a normalizer that happens to normalize the URL either to itself or to nothing. It's a nice abstraction if we can implement such "normalizers" as efficiently as the current filters.

        If we iterated over these new "normalizers," and allowed for a flexible combination of normalizers, as we do with filters, with short-circuit evaluation, then the first pass could throw away the obvious garbage (file types we don't handle, advertisements, etc.), and later passes could normalize and then filter the normalized URLs.
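
        As a purely illustrative sketch of that abstraction (none of these names exist in Nutch), a single interface can cover both cases: a "filter" is simply a stage that returns either its input or null, and the chain short-circuits on null.

        /** Illustrative interface: a stage may rewrite a URL or drop it by returning null. */
        interface URLMunger {
          String munge(String url, String scope);
        }

        /** A filter expressed as a munger: passes the URL through or discards it. */
        class UnhandledTypeFilter implements URLMunger {
          public String munge(String url, String scope) {
            if (url.endsWith(".gif") || url.endsWith(".zip")) {
              return null;          // obvious garbage: file types we don't handle
            }
            return url;
          }
        }

        /** Runs a chain of mungers with short-circuit evaluation: once a stage
            returns null, later (possibly more expensive) stages never run. */
        class MungerChain {
          static String run(String url, String scope, URLMunger... mungers) {
            for (URLMunger m : mungers) {
              url = m.munge(url, scope);
              if (url == null) {
                return null;
              }
            }
            return url;
          }
        }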

        Also on a related note, I was just starting to think about how to implement efficient site-specific normalizations and use these to handle (an already large number of) site mirrors as well as (an increasing number of) site-specific patterns for things like session-ID removal.

        Andrzej Bialecki added a comment -

        Running several iterations of filters/normalizers may be risky... We would have to ensure that match/replace expressions are stable, in the sense that running the same URL two or more times through the same match/replace pair will still produce the same result.

        Example: if I want to always remove one level of domains (i.e. www.example.com -> example.com; foo.bar.baz.com -> bar.baz.com), running these filters again would produce unwanted results.

        Re: short-circuiting the evaluation loops: we would have to change the way we pass arguments, so that we can change or not change the URLs and still proceed with the loop if needed. This seems to be the key semantic difference between filters and normalizers. Filters are primarily in the business of discarding URLs, while normalizers only munge them and rarely cause them to be thrown away.

        Re: per-site rules: you can already accomplish this. Just write a normalizer or filter which applies different rule-sets depending on the domain/host name.

        Sami Siren added a comment -

        looks ok to me,

        the ugly regexps (the ones with &) could perhaps be put inside <![CDATA[ ]]> elements

        in the Generator there's

        + try {
        +   host = normalizers.normalize(host, URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
        +   host = new URL(host).getHost().toLowerCase();
        + } catch (Exception e) {
        +   LOG.warn("Malformed URL: '" + host + "', skipping");
        + }

        why isn't the .toLowerCase() also done in the normalizer?

        Andrzej Bialecki added a comment -

        Lowercasing is done here because we can't rely on each normalizer to do it, and uniform host names are important at this point.

        Doug Cook added a comment -

        It still seems to me that iterative normalization is useful and not risky. By definition, a "normalizer" is something which converts a URL to a "normal" form, and a URL in "normal" form should transform to itself. Thus a true "normalizer" should be stable. But I can see people wanting to do other transformations with normalizers, ones which perhaps shouldn't iterate. That's why there should be a configurable limit to the number of iterations, and those who want the current behavior can just set the limit to 1. Right now there is no good way, for example, to handle URLs with multiple session ID strings (rare, but extant!). Yes, one could manually repeat the pattern several times in the normalizer configuration, but this is hardly efficient. The second iteration of the same pattern should not be executed unless the first one matches.

        Re: your comment about site-specific normalization: is there already some way to do this efficiently? By "efficiently," I mean having a pattern which applies only to site foo.com and is not examined for other sites. I know I can already (and do already) add general regexps which will only match for foo.com – but these will be executed for all URLs, even if they only match for foo.com, and thus slow things down quite a bit if there are many of them. I was thinking something like having a hash table of sites with site-specific patterns, and then executing the given normalizations only for the given sites. That would allow us to efficiently build large tables of mirrors and other site-specific normalizations (for example, for session ID removals which would be unsafe in the general case). Thoughts? If there is already some easy way to do this, you will make me a happy man!
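
        A minimal sketch of the hash-table idea above (hypothetical, not existing Nutch code): rules are keyed by host, so a URL is only matched against the patterns registered for its own site, and every other URL pays only one map lookup.

        import java.net.URL;
        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;
        import java.util.regex.Pattern;

        class HostSpecificNormalizer {

          private static class Rule {
            final Pattern pattern;
            final String replacement;
            Rule(String regex, String replacement) {
              this.pattern = Pattern.compile(regex);
              this.replacement = replacement;
            }
          }

          /** Per-host match/replace rules; hosts without an entry are left untouched. */
          private final Map<String, List<Rule>> rulesByHost = new HashMap<String, List<Rule>>();

          void addRule(String host, String regex, String replacement) {
            String key = host.toLowerCase();
            List<Rule> rules = rulesByHost.get(key);
            if (rules == null) {
              rules = new ArrayList<Rule>();
              rulesByHost.put(key, rules);
            }
            rules.add(new Rule(regex, replacement));
          }

          String normalize(String url) throws Exception {
            String host = new URL(url).getHost().toLowerCase();
            List<Rule> rules = rulesByHost.get(host);   // O(1) lookup instead of scanning every pattern
            if (rules == null) {
              return url;                               // no site-specific rules for this host
            }
            for (Rule r : rules) {
              url = r.pattern.matcher(url).replaceAll(r.replacement);
            }
            return url;
          }
        }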

        Andrzej Bialecki added a comment -

        Patch applied with minor changes.


          People

          • Assignee:
            Andrzej Bialecki
          • Reporter:
            Andrzej Bialecki