Nutch / NUTCH-2335

Injector not to filter and normalize existing URLs in CrawlDb

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.12
    • Fix Version/s: 1.14
    • Component/s: crawldb, injector
    • Labels: None
    • Patch Info: Patch Available
    • Flags: Patch

      Description

      With NUTCH-1712, the behavior of the Injector changed for the case where new URLs are added to an existing CrawlDb:

      • before, only the injected URLs were filtered and normalized
      • now, filters and normalizers are applied to all URLs, including those already in the CrawlDb

      The default should be, as before, not to filter or normalize existing URLs. Filtering and normalizing can take a long time for large CrawlDbs and/or complex URL filters. If the URL filter and normalizer rules have not changed, there is no need to apply them again every time new URLs are added. Injected URLs should, of course, still be filtered and normalized by default.
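
      The difference between the two behaviors can be illustrated with a minimal, self-contained sketch. This is not the Nutch implementation: the class InjectSketch, the method inject, the flag filterNormalizeAll, and the toy filter and normalizer are all hypothetical names chosen for illustration, and the CrawlDb is modeled as a plain set of URL strings.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical sketch, NOT Nutch code: the CrawlDb is modeled as a set of URL strings.
final class InjectSketch {

    /**
     * Merges injected URLs into the existing URLs.
     * Injected URLs are always normalized and then filtered.
     * Existing URLs are passed through untouched unless filterNormalizeAll is true.
     */
    static Set<String> inject(Set<String> existing,
                              List<String> injected,
                              UnaryOperator<String> normalizer,
                              Predicate<String> filter,
                              boolean filterNormalizeAll) {
        Set<String> result = new LinkedHashSet<>();
        for (String url : existing) {
            if (filterNormalizeAll) {
                // Expensive path: re-apply the rules to every URL already stored.
                String normalized = normalizer.apply(url);
                if (filter.test(normalized)) {
                    result.add(normalized);
                }
            } else {
                // Cheap path (proposed default): keep existing entries as they are.
                result.add(url);
            }
        }
        for (String url : injected) {
            String normalized = normalizer.apply(url);
            if (filter.test(normalized)) {
                result.add(normalized);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy rules (assumptions): strip URL fragments, accept only example.org.
        UnaryOperator<String> stripFragment = u -> u.replaceAll("#.*$", "");
        Predicate<String> onlyExampleOrg = u -> u.startsWith("http://example.org/");

        Set<String> existing = new LinkedHashSet<>(
                List.of("http://example.org/a#top", "http://other.test/x"));
        List<String> injected = List.of("http://example.org/b#sec", "http://spam.test/y");

        System.out.println(inject(existing, injected, stripFragment, onlyExampleOrg, false));
        System.out.println(inject(existing, injected, stripFragment, onlyExampleOrg, true));
    }
}
```

      With the flag set to false, the existing entries survive unchanged even if they would no longer pass the current rules, and only the injected URLs pay the cost of filtering and normalizing. With the flag set to true, every stored URL is processed again, which is the cost this issue wants to avoid by default.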

        Attachments

        1. Injector.java (21 kB, Markus Jelsma)

              People

              • Assignee: Sebastian Nagel (snagel)
              • Reporter: Sebastian Nagel (snagel)
              • Votes: 1
              • Watchers: 5
