Nutch / NUTCH-2690

Configurable and fast URL filter

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: 1.16
    • Component/s: plugin
    • Labels: None
    • Patch Info: Patch Available

    Description

      This improvement introduces a new URL filter plugin, "urlfilter-fast" (naming debatable), which has been in use at Common Crawl since 2013 to apply a long list of filters. It works in two steps:

      1. the host name is matched exactly against host names and domain suffixes to retrieve the host/domain-specific regex rules
      2. the selected regular expressions are applied to the path (and optionally the query) component of the URL
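
The suffix lookup in step 1 can be pictured as enumerating the lookup keys for a host name, from the full host down to `.` for global rules. The following is a minimal sketch only: the class and method names (`HostSuffixes`, `candidates`) are hypothetical and not taken from the plugin, and the order in which the plugin consults the keys is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the plugin's API.
final class HostSuffixes {

    /**
     * Returns the lookup keys for a host name, most specific first,
     * ending with "." which stands for the global rules.
     */
    static List<String> candidates(String host) {
        List<String> keys = new ArrayList<>();
        String h = host.toLowerCase(Locale.ROOT);
        while (!h.isEmpty()) {
            keys.add(h);                 // e.g. "www.example.com", then "example.com", then "com"
            int dot = h.indexOf('.');
            if (dot < 0) {
                break;                   // no further label to strip
            }
            h = h.substring(dot + 1);    // drop the leftmost label
        }
        keys.add(".");                   // global rules apply to every host
        return keys;
    }
}
```

Each key is an exact string, so the rules for a host can be found with a handful of map lookups instead of testing every rule in the list.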

      What makes it faster than urlfilter-regex for common cases:

      • regexes are selected by host name or its domain suffix, so there are usually fewer rules to check. This is similar to NUTCH-1838, but any domain suffix can be matched, including subdomain.domain.com, com, or . for global rules. Selecting rules by host name suffix is fast.
      • regexes are applied only to the path component (optionally including the query) and not the entire URL.
        Matching against a shorter string can make a huge difference for more complex regular expressions.
      • a rule that denies everything from a host or domain is handled as a special case so that it is fast
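
The three points above can be combined into a small filter. This is a sketch under stated assumptions, not the plugin's implementation: the class `SuffixRuleFilterSketch`, the `Rules` structure, and the choice to apply the rules of every matching suffix are all hypothetical, and the actual rule file format is described only in the plugin's README. The sketch follows the Nutch URL filter convention of returning the URL when it is accepted and `null` when it is rejected.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch, not the plugin's code.
final class SuffixRuleFilterSketch {

    /** Rules attached to one host name or domain suffix (assumed structure). */
    static final class Rules {
        boolean denyAll;                                   // fast path: no regex is evaluated
        final List<Pattern> denyPath = new ArrayList<>();  // matched against path (+ query) only
    }

    private final Map<String, Rules> bySuffix = new HashMap<>();

    /** Returns (creating if needed) the rules for a suffix; "." holds the global rules. */
    Rules rulesFor(String suffix) {
        return bySuffix.computeIfAbsent(suffix.toLowerCase(Locale.ROOT), k -> new Rules());
    }

    /** Returns the URL if accepted, or null if rejected. */
    String filter(String url) {
        URI uri;
        try {
            uri = URI.create(url);
        } catch (IllegalArgumentException e) {
            return null;                                   // unparsable URLs are rejected in this sketch
        }
        String host = uri.getHost();
        if (host == null) {
            return null;
        }
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (uri.getRawQuery() != null) {
            path += "?" + uri.getRawQuery();               // this sketch always includes the query
        }
        String key = host.toLowerCase(Locale.ROOT);
        while (true) {
            Rules rules = bySuffix.get(key);               // exact map lookup per suffix
            if (rules != null) {
                if (rules.denyAll) {
                    return null;                           // whole host/domain denied, regexes skipped
                }
                for (Pattern p : rules.denyPath) {
                    if (p.matcher(path).find()) {
                        return null;
                    }
                }
            }
            if (key.equals(".")) {
                break;                                     // global rules were the last step
            }
            int dot = key.indexOf('.');
            key = dot < 0 ? "." : key.substring(dot + 1);  // strip the leftmost label
            if (key.isEmpty()) {
                key = ".";
            }
        }
        return url;
    }
}
```

With a deny-all rule for `spam.example`, a path rule `^/private/` for `example.com`, and a global rule `\.(?:gif|jpg)$`, a URL on `www.example.com` is checked against only two regexes, each of them run on the short path string rather than the full URL.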

      More details about the rule format are found in the plugin's README.

            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: Sebastian Nagel (snagel)