Nutch / NUTCH-2689

Speed up urlfilter-regex and urlfilter-automaton

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Implemented
    • Affects Version/s: 1.15
    • Fix Version/s: 1.16
    • Component/s: plugin
    • Labels: None
    • Patch Info: Patch Available

    Description

      The unit tests of urlfilter-regex and urlfilter-automaton include a benchmark. After experimenting with and benchmarking several modifications, the following changes seem to improve performance significantly:

      • do not extract host and domain name from the URL if not needed (no host/domain-specific rules used, cf. NUTCH-1838)
      • use non-capturing groups if possible
      • use (?i) to make the patterns case-insensitive and remove the now redundant uppercase variants
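
      The three changes above can be illustrated with a minimal Java sketch. It is not the actual plugin code: the class name, the file-suffix rule, and the hostIfNeeded helper are hypothetical and chosen only for illustration, and it uses plain java.util.regex rather than the Nutch filter classes.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

public class UrlFilterSketch {

  // Before: a capturing group plus explicit uppercase variants.
  // (Hypothetical rule, not copied from a Nutch configuration file.)
  static final Pattern BEFORE =
      Pattern.compile("\\.(gif|GIF|jpg|JPG|png|PNG|css|CSS)$");

  // After: (?i) makes the whole pattern case-insensitive, and (?:...)
  // avoids recording a capture group that the filter never reads.
  static final Pattern AFTER =
      Pattern.compile("(?i)\\.(?:gif|jpg|png|css)$");

  static boolean excludedBefore(String url) {
    return BEFORE.matcher(url).find();
  }

  static boolean excludedAfter(String url) {
    return AFTER.matcher(url).find();
  }

  // Hypothetical helper: parse the host only when host-specific rules
  // exist, so the common case skips URL parsing entirely.
  static String hostIfNeeded(String url, boolean hasHostRules) {
    if (!hasHostRules) {
      return null;
    }
    try {
      return new URI(url).getHost();
    } catch (URISyntaxException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    String[] urls = {
      "http://example.org/logo.GIF",
      "http://example.org/style.css",
      "http://example.org/index.html"
    };
    for (String url : urls) {
      System.out.println(url + " before=" + excludedBefore(url)
          + " after=" + excludedAfter(url));
    }
  }
}
```

      Note that (?i) is slightly broader than listing uppercase variants: it also matches mixed-case suffixes such as ".Gif", which the original pattern in this sketch would let through.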

            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: Sebastian Nagel (snagel)
              Votes: 0
              Watchers: 4