PS. I like your idea of combining URL filters & normalization. In a sense, a "filter" is just a normalizer that happens to normalize the URL either to itself or to nothing. It's a nice abstraction if we can implement such "normalizers" as efficiently as the current filters.
If we iterated over these new "normalizers,"
Also on a related note, I was just starting to think about how to implement efficient site-specific normalizations and use these to handle (an already large number of) site mirrors as well as (an increasing number of) site-specific patterns for things like session-ID removal.
Running several iterations of filters/normalizers may be risky ... We would have to ensure that match/replace expressions are stable, in the sense that running the same URL two or more times through the same match/replace pair still produces the same result. Example: if I want to always remove one level of domains (i.e. www.example.com -> example.com; foo.bar.baz.com -> bar.baz.com), running these filters again would produce unwanted results.
Re: short-circuiting the evaluation loops: we would have to change the way we pass arguments, so that we can change or not change the URLs, and still proceed with the loop if needed. This seems to be the key semantic difference between filters and normalizers. Filters are primarily in the business of discarding URLs, while normalizers only munge them but rarely cause them to be thrown away.
Re: per-site rules: you can already accomplish this. Just write a normalizer or filter which applies different rule sets depending on the domain/host name.
Looks OK to me. The ugly (with &) regexps could perhaps be put inside <![CDATA[ ]]> elements.
In the generator, why isn't the .toLowerCase also done in the normalizer?
Lowercasing is done here because we can't rely on each normalizer to do it, and having uniform host names is important here.
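To make the stability concern raised earlier in this thread concrete, here is a minimal sketch of how one might test whether a match/replace pair is stable for a given URL. The class name, method names, and both regexes are hypothetical illustrations, not anything taken from Nutch or the patch.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: checks whether applying a
// match/replace pair a second time changes the result of the first pass.
class StabilityCheck {
    // Illustrative rule: always strip the first host label (the example above).
    static final Pattern STRIP_LABEL =
        Pattern.compile("^(https?://)[^./]+\\.([^/]+\\.[^/]+)");
    // Illustrative rule: strip only leading "www." labels, however many.
    static final Pattern STRIP_WWW =
        Pattern.compile("^(https?://)(?:www\\.)+");

    /** Applies one match/replace pair once. */
    static String apply(Pattern match, String replacement, String url) {
        return match.matcher(url).replaceAll(replacement);
    }

    /** True if a second pass leaves the first pass's output unchanged. */
    static boolean isStableFor(Pattern match, String replacement, String url) {
        String once = apply(match, replacement, url);
        return once.equals(apply(match, replacement, once));
    }

    public static void main(String[] args) {
        String url = "http://foo.bar.baz.com/x";
        System.out.println(apply(STRIP_LABEL, "$1$2", url));
        System.out.println(isStableFor(STRIP_LABEL, "$1$2", url));
        System.out.println(isStableFor(STRIP_WWW, "$1", "http://www.www.example.com/"));
    }
}
```

With these assumed rules, the "strip one label" pair is not stable for foo.bar.baz.com (a second pass goes on to baz.com), while the "strip leading www." pair is, because it consumes every leading www. label in one pass. A check like this only covers the URLs it is given; it does not prove a rule stable in general.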
It still seems to me that iterative normalization is useful and not risky. By definition, a "normalizer" is something which converts a URL to a "normal" form, and a URL in "normal" form should transform to itself. Thus a true "normalizer" should be stable. But I can see people wanting to do other transformations with normalizers, ones which perhaps shouldn't iterate. That's why there should be a configurable limit to the number of iterations, and those who want the current behavior can just set the limit to 1. Right now there is no good way, for example, to handle URLs with multiple session ID strings (rare, but extant!). Yes, one could manually repeat the pattern several times in the normalizer configuration, but this is hardly efficient. The second iteration of the same pattern should not be executed unless the first one matches.
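As a rough illustration of the proposal above, the following sketch iterates a rule list until the URL stops changing or a configurable limit is reached. Everything here (the class, the `sid` parameter name, the two regexes) is an assumption made for the example; it is not the code from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of iterative normalization with a configurable limit.
class IterativeNormalizer {
    /** One match/replace rule; the name is illustrative, not Nutch's. */
    static final class Rule {
        final Pattern match;
        final String replacement;
        Rule(String regex, String replacement) {
            this.match = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private final int maxIterations;

    IterativeNormalizer(int maxIterations) {
        this.maxIterations = maxIterations;
    }

    IterativeNormalizer add(String regex, String replacement) {
        rules.add(new Rule(regex, replacement));
        return this;
    }

    /** Runs all rules per pass; stops at a fixed point or at the limit. */
    String normalize(String url) {
        String current = url;
        for (int i = 0; i < maxIterations; i++) {
            String next = current;
            for (Rule r : rules) {
                // replaceFirst models a rule that removes one occurrence per pass.
                next = r.match.matcher(next).replaceFirst(r.replacement);
            }
            if (next.equals(current)) {
                break; // nothing matched in this pass, so the URL is in normal form
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        IterativeNormalizer n = new IterativeNormalizer(5)
            .add("([?&])sid=[^&]*&?", "$1") // assumed session-ID parameter name
            .add("[?&]$", "");              // tidy a dangling ? or &
        System.out.println(n.normalize("http://example.com/p?sid=1&a=b&sid=2"));
    }
}
```

With a limit of 1 this behaves like the current single pass and leaves the second `sid` in place; with a larger limit the second pass removes it and the third pass detects the fixed point. An unstable rule would still be cut off by the limit, which is the safety net discussed above.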
Re: your comment about site-specific normalization, is there already some way to do this efficiently? By "efficiently," I mean having a pattern which applies only to site foo.com and is not examined for other sites. I know I can already (and do already) add general regexps which will only match for foo.com – but these will be executed for all URLs, even if they only match for foo.com, and thus slow things down quite a bit if there are many of them. I was thinking of something like a hash table of sites with site-specific patterns, and then executing the given normalizations only for the given sites. That would allow us to efficiently build large tables of mirrors and other site-specific normalizations (for example, for session ID removals which would be unsafe in the general case). Thoughts? If there is already some easy way to do this you will make me a happy man!
Patch applied with minor changes.
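The hash-table idea described above might look roughly like the sketch below: rules are keyed by host, so a URL only pays for the patterns registered for its own site. The class and method names are hypothetical, the lookup is by exact lowercased host only (no domain suffix matching), and the `jsessionid` and mirror rules are made-up examples.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: per-host rule table, not an existing Nutch class.
class SiteRuleTable {
    static final class Rule {
        final Pattern match;
        final String replacement;
        Rule(String regex, String replacement) {
            this.match = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    private final Map<String, List<Rule>> rulesByHost = new HashMap<>();

    void addRule(String host, String regex, String replacement) {
        rulesByHost
            .computeIfAbsent(host.toLowerCase(Locale.ROOT), h -> new ArrayList<>())
            .add(new Rule(regex, replacement));
    }

    /** Applies only the rules registered for the URL's host, if any. */
    String normalize(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return url; // unparseable URL: leave it for other normalizers/filters
        }
        if (host == null) {
            return url;
        }
        List<Rule> rules = rulesByHost.get(host.toLowerCase(Locale.ROOT));
        if (rules == null) {
            return url; // one hash lookup, no regex evaluated for this site
        }
        String result = url;
        for (Rule r : rules) {
            result = r.match.matcher(result).replaceAll(r.replacement);
        }
        return result;
    }

    public static void main(String[] args) {
        SiteRuleTable table = new SiteRuleTable();
        table.addRule("foo.com", ";jsessionid=[0-9A-Za-z]+", "");
        table.addRule("mirror.foo.com", "^http://mirror\\.foo\\.com/", "http://foo.com/");
        System.out.println(table.normalize("http://foo.com/a;jsessionid=ABC123?x=1"));
        System.out.println(table.normalize("http://mirror.foo.com/doc"));
        System.out.println(table.normalize("http://bar.com/a;jsessionid=ABC123"));
    }
}
```

URLs on hosts with no entry cost a single map lookup, which is the efficiency property asked about. A real version would need a decision on how subdomains map to table keys, which this sketch deliberately leaves out.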
Sounds very cool. Haven't had a chance to check out the patch yet to see if it supports this, but attaching a related discussion from the email list...
------
Neal Richter wrote:
Doug,
I think it sounds like a good idea. It eliminates the need to order the
rules precisely...
We don't iterate them in HtDig and it's been on my todo list for a while as
well.
I would iterate until no matches, some max iteration number, or the URL is
obviously junk.
For the max iteration number I would use the number of rewrite rules you
have. So if you have 10 rules, you iterate on all 10 rules 10 times. That
will cover the case where your rules 'chain' in a 10 step sequence. Sure
it's an edge case to do that, but I can see rule sets where you construct
3-step chains (like swapping strings or something).
Thanks
Neal
On 8/30/06, Doug Cook <nabble@...> wrote:
>
>
> Hi,
>
> I've run across a few patterns in URLs where applying a normalization puts
> the URL in a form matching another normalization pattern (or even the same
> one). But that pattern won't get executed because the patterns are applied
> only once.
>
> Should normalization iterate until no patterns match (with, perhaps, some
> limit to the number of iterations to prevent loops from pattern mistakes)?
>
> It's a minor problem; it doesn't seem to affect too many URLs for things
> like session ID removal, since finding two session IDs in the same URL is
> rare (but does happen – that's how I noticed this). I could imagine it
> being much more significant, however, if other Nutch users out there are
> using "broader" normalization patterns.
>
> Any philosophical/practical objections? (it's early, I've only had 1
> coffee,
> and I've probably missed something obvious!)
>
> I'll file an issue and add it to my queue of things to do if people think
> it's a good idea.
>
> -Doug
> --
> View this message in context:
> http://www.nabble.com/Should-URL-normalization-iterate--tf2190244.html#a6059957
> Sent from the Nutch - Dev forum at Nabble.com.
>