Currently Nutch uses two subsystems related to URL validation and normalization:
- URLFilter: this interface checks whether a URL is valid for further processing. The input URL is not changed in any way; the output is a boolean value.
- URLNormalizer: this interface brings URLs to their base ("normal") form, removes unneeded URL components, or performs any other URL mangling as necessary. The input URL is changed and returned as the result.
However, various Nutch tools run filters and normalizers in a pre-determined order: normalizers first, then filters. In some cases, where normalizers are complex and costly to run (e.g. numerous regex rules, DNS lookups), it would make sense to run some of the filters first (e.g. prefix-based filters that select only certain protocols, or suffix-based filters that select only known "extensions"). This is currently not possible: we always have to run the normalizers, only to later throw away URLs because they failed to pass the filters.
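To make the cost argument concrete, here is a minimal sketch that counts normalizer invocations under both orderings. Everything in it is hypothetical and for illustration only: the class OrderingCostSketch, the lowercase "normalizer" and the http(s) prefix "filter" are stand-ins, not Nutch classes or plugins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; none of these names exist in Nutch.
final class OrderingCostSketch {
    // Counts how often the (supposedly expensive) normalizer is invoked.
    static int normalizerCalls = 0;

    // Stand-in for a costly normalizer (e.g. many regex rules).
    static final UnaryOperator<String> NORMALIZER = url -> {
        normalizerCalls++;
        return url.toLowerCase(Locale.ROOT);
    };

    // Stand-in for a cheap prefix-based filter that accepts only http(s),
    // compared case-insensitively so normalization cannot change its verdict.
    static final Predicate<String> PREFIX_FILTER = url ->
        url.regionMatches(true, 0, "http://", 0, 7)
            || url.regionMatches(true, 0, "https://", 0, 8);

    // Fixed order described above: normalize every URL, then filter.
    static List<String> normalizeThenFilter(List<String> urls) {
        List<String> out = new ArrayList<>();
        for (String url : urls) {
            String normalized = NORMALIZER.apply(url);
            if (PREFIX_FILTER.test(normalized)) {
                out.add(normalized);
            }
        }
        return out;
    }

    // Proposed order: run the cheap filter first, normalize only survivors.
    static List<String> filterThenNormalize(List<String> urls) {
        List<String> out = new ArrayList<>();
        for (String url : urls) {
            if (PREFIX_FILTER.test(url)) {
                out.add(NORMALIZER.apply(url));
            }
        }
        return out;
    }
}
```

Both methods return the same list here only because this filter's decision does not depend on normalization. A filter whose verdict can change after normalization would still have to run after the normalizers.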
I would like to solicit comments on the following two solutions, and then work on an implementation of one of them:
1) we could make URLFilters and URLNormalizers implement the same interface, and basically make them interchangeable. This way users could configure their order arbitrarily, even mixing filters and normalizers out of order. This is more complicated, but gives much more flexibility, and NUTCH-365 already provides a sufficient framework to implement this, including the ability to define different sequences for different steps in the workflow.
2) we could use a property "url.mangling.order" to define whether normalizers or filters should run first. This is simple to implement, but provides only limited improvement: either all filters or all normalizers would run first, so they could not be mixed in arbitrary order.
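A rough sketch of how the two options might look, to help the discussion. All names here are assumptions made for illustration: the UrlStep interface, the filter/normalizer adapters, and the "filters-first" value for "url.mangling.order" are hypothetical and are not part of Nutch or of NUTCH-365.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; none of these names exist in Nutch.
final class UnifiedUrlStepSketch {
    // Option 1: one common interface. A step returns the (possibly
    // rewritten) URL, or null to reject it.
    interface UrlStep {
        String apply(String url);
    }

    // Wraps a boolean filter: the URL passes through unchanged or is rejected.
    static UrlStep filter(Predicate<String> accept) {
        return url -> accept.test(url) ? url : null;
    }

    // Wraps a normalizer: the URL is rewritten and passed on.
    static UrlStep normalizer(UnaryOperator<String> rewrite) {
        return rewrite::apply;
    }

    // Runs the steps in the configured order and stops at the first
    // rejection, so later (possibly expensive) steps are skipped.
    static String run(List<UrlStep> steps, String url) {
        String current = url;
        for (UrlStep step : steps) {
            current = step.apply(current);
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    // Option 2: a single switch, e.g. read from "url.mangling.order".
    // The value "filters-first" is an assumed example; any other value
    // keeps the current normalizers-first behaviour.
    static List<UrlStep> ordered(String order,
                                 List<UrlStep> filters,
                                 List<UrlStep> normalizers) {
        List<UrlStep> all = new ArrayList<>();
        if ("filters-first".equals(order)) {
            all.addAll(filters);
            all.addAll(normalizers);
        } else {
            all.addAll(normalizers);
            all.addAll(filters);
        }
        return all;
    }
}
```

With option 1 a user could configure a sequence such as cheap prefix filter, then normalizers, then a regex filter. Option 2 can only produce the two lists built by ordered(), which is the limitation noted above.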