Key: NUTCH-365
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Andrzej Bialecki
Reporter: Andrzej Bialecki
Votes: 0
Watchers: 0
Nutch

Flexible URL normalization

Created: 09/Sep/06 01:21 PM   Updated: 22/Sep/06 09:02 PM
Component/s: None
Affects Version/s: 0.9.0
Fix Version/s: 0.9.0

Time Tracking:
Not Specified

File Attachments:
patch.txt (text file, 83 kB, licensed for inclusion in ASF works), attached 2006-09-09 01:22 PM by Andrzej Bialecki

Resolution Date: 22/Sep/06 09:02 PM


Description
This patch is a heavily restructured version of the patch in NUTCH-253, changed so much that I decided to create a separate issue. It changes URL normalization from a single selectable class to a flexible, context-aware chain of normalization filters.

Highlights:

  • rename all UrlNormalizer to URLNormalizer for consistency.
  • use a "chained filter" pattern for running several normalizers in sequence
  • the order in which normalizers are executed is defined by the "urlnormalizer.order" property, which lists implementation classes separated by spaces. If more normalizers are active than are named on this list, the unnamed ones are run in random order after the named ones.
  • define a set of contexts (or scopes) in which normalizers may be called. Each scope can have its own list of normalizers (via the "urlnormalizer.scope.<scope_name>" property) and its own order (via the "urlnormalizer.order.<scope_name>" property). If either of these properties is missing, the default settings are used.
  • each normalizer may further select among many configurations, depending on the context in which it is called, using a modified API:

URLNormalizer.normalize(String url, String scope);

  • if a config for a given scope is not defined, then the default config will be used.
  • several standard contexts / scopes have been defined, and various applications have been modified to attempt to use the appropriate normalizers for their context.
  • all JUnit tests have been modified, and run successfully.
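
As an illustration of the highlights above, here is a minimal sketch of a scoped normalizer chain. It is not the code from patch.txt: the class NormalizerChainSketch, its nested Normalizer interface, the scope names "default", "inject" and "fetcher", and the two toy normalizers are all assumptions made for the example. Only the two-argument normalize(String url, String scope) shape and the "named first, the rest afterwards" ordering rule come from the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NormalizerChainSketch {
  /** Hypothetical stand-in for the two-argument API quoted above. */
  interface Normalizer {
    String normalize(String url, String scope);
  }

  /** Assumed name of the fallback scope; not taken from the patch. */
  static final String SCOPE_DEFAULT = "default";

  /** Maps a scope name to its ordered chain of normalizers. */
  private final Map<String, List<Normalizer>> chains = new HashMap<>();

  void register(String scope, Normalizer... normalizers) {
    chains.put(scope, Arrays.asList(normalizers));
  }

  /** Runs the chain for the scope, or the default chain if the scope has none. */
  String normalize(String url, String scope) {
    List<Normalizer> chain = chains.get(scope);
    if (chain == null) {
      chain = chains.getOrDefault(SCOPE_DEFAULT, Collections.<Normalizer>emptyList());
    }
    String current = url;
    for (Normalizer n : chain) {
      current = n.normalize(current, scope); // output of one step feeds the next
    }
    return current;
  }

  /**
   * Orders active normalizer names: names listed in the space-separated
   * order property come first, the remaining active ones follow.
   * (Here the remainder keeps the input order; the issue only says the
   * order of unnamed normalizers is not guaranteed.)
   */
  static List<String> order(List<String> active, String orderProperty) {
    List<String> result = new ArrayList<>();
    for (String name : orderProperty.trim().split("\\s+")) {
      if (active.contains(name) && !result.contains(name)) {
        result.add(name);
      }
    }
    for (String name : active) {
      if (!result.contains(name)) {
        result.add(name);
      }
    }
    return result;
  }

  /** Builds a small demo configuration with two toy normalizers. */
  static NormalizerChainSketch demo() {
    Normalizer stripFragment = (url, scope) -> {
      int i = url.indexOf('#');
      return i < 0 ? url : url.substring(0, i);
    };
    Normalizer dropDefaultPort = (url, scope) -> url.replace(":80/", "/");
    NormalizerChainSketch c = new NormalizerChainSketch();
    c.register(SCOPE_DEFAULT, stripFragment);
    c.register("inject", stripFragment, dropDefaultPort);
    return c;
  }

  public static void main(String[] args) {
    NormalizerChainSketch c = demo();
    System.out.println(c.normalize("http://example.com:80/a#top", "inject"));
    System.out.println(c.normalize("http://example.com:80/a#top", "fetcher"));
    System.out.println(order(Arrays.asList("A", "B", "C"), "C A"));
  }
}
```

In this sketch the "inject" scope has its own two-step chain, while the unconfigured "fetcher" scope falls back to the default chain, which mirrors the fallback rule described in the bullets.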

NUTCH-363 suggests to me that further changes may be required in this area. Perhaps we should combine urlfilters and urlnormalizers into a single URL-munging subsystem: now that we have support for scopes and flexible combinations of normalizers, we could turn URLFilters into a special case of normalizers (or vice versa, depending on the point of view).
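
To make the "filter as a special case of a normalizer" idea concrete, here is a rough sketch. It assumes a convention that a normalizer returns null to reject a URL; that convention, the class FilterAsNormalizerSketch, and the helpers fromFilter and run are hypothetical and not part of this patch.

```java
import java.util.function.Predicate;

class FilterAsNormalizerSketch {
  /** Hypothetical stand-in for the two-argument normalizer API. */
  interface Normalizer {
    String normalize(String url, String scope);
  }

  /**
   * Wraps an accept/reject test as a normalizer: accepted URLs pass
   * through unchanged, rejected ones become null (assumed convention).
   */
  static Normalizer fromFilter(Predicate<String> accept) {
    return (url, scope) -> (url != null && accept.test(url)) ? url : null;
  }

  /** Runs the chain in order and stops as soon as a step rejects the URL. */
  static String run(String url, String scope, Normalizer... chain) {
    String current = url;
    for (Normalizer n : chain) {
      if (current == null) {
        return null; // an earlier step rejected the URL
      }
      current = n.normalize(current, scope);
    }
    return current;
  }

  public static void main(String[] args) {
    Normalizer stripFragment = (url, scope) -> {
      int i = url.indexOf('#');
      return i < 0 ? url : url.substring(0, i);
    };
    Normalizer httpOnly = fromFilter(u -> u.startsWith("http://"));
    System.out.println(run("http://example.com/a#x", "default", stripFragment, httpOnly));
    System.out.println(run("ftp://example.com/a", "default", stripFragment, httpOnly));
  }
}
```

With this convention a single chain can both rewrite and drop URLs, which is one possible reading of the merge suggested above; the opposite direction (normalizers as filters) would need a different contract.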



Subversion Commits

Repository: ASF
Revision: #449088
Date: Fri Sep 22 21:05:33 UTC 2006
User: ab
Message: Refactor URLNormalizers (NUTCH-365). Iterative normalization has been implemented, but is not used by default.

Development of this functionality was supported by SiteSell Inc.
Files Changed
DEL /lucene/nutch/trunk/src/java/org/apache/nutch/net/UrlNormalizerFactory.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-basic/build.xml
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-pass
DEL /lucene/nutch/trunk/src/test/org/apache/nutch/net/TestRegexUrlNormalizer.java
MODIFY /lucene/nutch/trunk/src/test/org/apache/nutch/crawl/TestInjector.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/parse/Outlink.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-pass/src/java/org
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-pass/build.xml
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/src/test/org/apache
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-basic/src
DEL /lucene/nutch/trunk/src/java/org/apache/nutch/net/BasicUrlNormalizer.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-basic/src/test
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/sample/regex-normalize-scope1.test
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/plugin.xml
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/src/test
DEL /lucene/nutch/trunk/src/test/org/apache/nutch/net/TestUrlNormalizerFactory.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/src/java/org/apache/nutch
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-basic/src/java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-pass/plugin.xml
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-pass/src/test
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/src
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/src/java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/sample/regex-normalize-default.test
DEL /lucene/nutch/trunk/src/test/org/apache/nutch/net/TestBasicUrlNormalizer.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-pass/src/java/org/apache
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/src/test/org/apache/nutch
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-pass/src/java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-basic
MODIFY /lucene/nutch/trunk/src/plugin/build.xml
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/PartitionUrlByHost.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/src/test/org/apache/nutch/net
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-pass/src/test/org/apache
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/LinkDb.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer/regex
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-basic/plugin.xml
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/net/URLNormalizer.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/sample/regex-normalize-scope1.xml
DEL /lucene/nutch/trunk/src/java/org/apache/nutch/net/UrlNormalizer.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/net/URLNormalizers.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-basic/src/test/org/apache/nutch
MODIFY /lucene/nutch/trunk/conf/nutch-default.xml
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-pass/src
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/build.xml
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex
DEL /lucene/nutch/trunk/src/test/org/apache/nutch/net/test-regex-normalize.xml
MODIFY /lucene/nutch/trunk/src/plugin/nutch-extensionpoints/plugin.xml
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbFilter.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/sample
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-basic/src/test/org
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-basic/src/test/org/apache/nutch/net
DEL /lucene/nutch/trunk/src/java/org/apache/nutch/net/RegexUrlNormalizer.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/src/test/org
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/LinkDbFilter.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
ADD /lucene/nutch/trunk/src/test/org/apache/nutch/net/TestURLNormalizers.java
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-basic/src/test/org/apache
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-basic/src/java/org
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/sample/regex-normalize-default.xml
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-pass/src/test/org
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/src/java/org
ADD /lucene/nutch/trunk/src/plugin/urlnormalizer-regex/src/java/org/apache