Nutch / NUTCH-2690

Configurable and fast URL filter

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: 1.16
    • Component/s: plugin
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      This improvement introduces a new URL filter plugin, "urlfilter-fast" (naming debatable), which has been in use at Common Crawl since 2013 to apply a long list of filters. It filters a URL in two steps:

      1. an exact (suffix) match against the host name retrieves the host/domain-specific regex rules
      2. the selected regular expressions are applied to the path (and query) component of the URL

      What makes it faster than urlfilter-regex for common cases:

      • regexes are selected by host name or its domain suffix, so there are usually fewer rules to check. This is similar to NUTCH-1838, but any domain suffix can be matched, including subdomain.domain.com, com, or . for global rules. The selection by host name suffix itself is fast.
      • regexes are applied only to the path component (optionally including the query) and not the entire URL.
        Matching against a shorter string can make a huge difference for more complex regular expressions.
      • the rule to deny everything from a host or domain gets special treatment so that it is fast
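
      The selection and matching steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the plugin's code: the class name HostSuffixFilterSketch, its methods, and the in-memory rule table are hypothetical, and the sketch does not model the rule file format documented in the plugin's README. It assumes that a URL is rejected as soon as any selected deny pattern matches.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

/** Illustrative sketch only; all names here are hypothetical. */
class HostSuffixFilterSketch {

  // Hypothetical rule table: host/domain suffix -> deny patterns.
  // The key "." holds global rules that apply to every host.
  private final Map<String, List<Pattern>> denyRules = new HashMap<>();

  // Suffixes for which everything is denied; no regex is evaluated for these.
  private final Set<String> denyAll = new HashSet<>();

  public void deny(String suffix, String regex) {
    denyRules.computeIfAbsent(suffix, k -> new ArrayList<>())
        .add(Pattern.compile(regex));
  }

  public void denyAll(String suffix) {
    denyAll.add(suffix);
  }

  /** Lists the host itself, each parent domain suffix, and "." for global rules. */
  static List<String> suffixes(String host) {
    List<String> out = new ArrayList<>();
    String s = host.toLowerCase(Locale.ROOT);
    while (true) {
      out.add(s);
      int dot = s.indexOf('.');
      if (dot < 0) {
        break;
      }
      s = s.substring(dot + 1);
    }
    out.add(".");
    return out;
  }

  /** Returns true if no deny rule selected by host suffix matches the URL. */
  public boolean accepts(String url) {
    URI uri;
    try {
      uri = new URI(url);
    } catch (URISyntaxException e) {
      return false; // assumption: unparsable URLs are rejected
    }
    String host = uri.getHost();
    if (host == null) {
      return false; // assumption: URLs without a host are rejected
    }
    // Regexes see only the path and query, not the whole URL.
    String path = uri.getRawPath();
    if (path == null || path.isEmpty()) {
      path = "/";
    }
    if (uri.getRawQuery() != null) {
      path = path + "?" + uri.getRawQuery();
    }
    for (String suffix : suffixes(host)) {
      if (denyAll.contains(suffix)) {
        return false; // fast path: a hash lookup, no regex involved
      }
      List<Pattern> rules = denyRules.get(suffix); // exact lookup per suffix
      if (rules == null) {
        continue;
      }
      for (Pattern p : rules) {
        if (p.matcher(path).find()) {
          return false;
        }
      }
    }
    return true;
  }

  public static void main(String[] args) {
    HostSuffixFilterSketch f = new HostSuffixFilterSketch();
    f.deny("example.com", "^/private/");
    f.denyAll("spam.example.org");
    f.deny(".", "\\.exe$");
    System.out.println(f.accepts("http://www.example.com/index.html"));
    System.out.println(f.accepts("http://www.example.com/private/x"));
    System.out.println(f.accepts("http://a.spam.example.org/anything"));
  }
}
```

      In this sketch, each URL costs a handful of hash lookups (one per host name suffix) plus only the regexes registered for the matching suffixes, which is why the number of rules evaluated per URL stays small even when the overall rule list is long.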

      More details about the rule format are found in the plugin's README.

              People

              • Assignee: Sebastian Nagel (wastl-nagel)
              • Reporter: Sebastian Nagel (wastl-nagel)