NUTCH-2034: CrawlDB filtered documents counter

      Description

      During large crawls we would like to know how many URLs are being discarded by the regex filters. At the moment this count is only reported by the Injector:

      Injector: Total number of urls rejected by filters: 0

      It would be nice to have a counter in the CrawlDb class as well, so that after every round we know how many URLs were discarded by our filters:

      CrawlDb update: Total number of URLs filtered by regex filters: 31415
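
      The requested behaviour can be sketched roughly as follows. This is a standalone illustration in plain Java, not the actual Nutch patch: the class FilteredUrlCounter and its methods are hypothetical, and the exclusion patterns are made up for the example. The only thing borrowed from Nutch is the convention that a URL filter returns null to reject a URL. In a real CrawlDb update job the count would more likely be kept in a Hadoop job counter and logged once the job finishes.

      ```java
      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.List;
      import java.util.regex.Pattern;

      // Hypothetical helper (not part of Nutch): applies regex exclusion
      // rules to URLs and counts how many URLs were rejected.
      final class FilteredUrlCounter {
          private final List<Pattern> excludePatterns = new ArrayList<>();
          private long filtered = 0;

          FilteredUrlCounter(List<String> excludeRegexes) {
              for (String regex : excludeRegexes) {
                  excludePatterns.add(Pattern.compile(regex));
              }
          }

          // Returns the URL unchanged if no exclusion pattern matches it,
          // or null if it is rejected (the rejected URL is counted).
          String filter(String url) {
              for (Pattern p : excludePatterns) {
                  if (p.matcher(url).find()) {
                      filtered++;
                      return null;
                  }
              }
              return url;
          }

          long getFiltered() {
              return filtered;
          }

          // Formats the summary line proposed in this issue.
          String summary() {
              return "CrawlDb update: Total number of URLs filtered by regex filters: " + filtered;
          }

          public static void main(String[] args) {
              // Example patterns are assumptions made for this sketch.
              FilteredUrlCounter counter = new FilteredUrlCounter(
                      Arrays.asList("\\.(gif|jpg|png)$", "[?*!@=]"));
              for (String url : Arrays.asList(
                      "http://example.org/index.html",
                      "http://example.org/logo.png",
                      "http://example.org/search?q=nutch")) {
                  counter.filter(url);
              }
              System.out.println(counter.summary());
          }
      }
      ```

      Keeping the count next to the filtering call means the summary line can be emitted once per update round, in the same style as the existing Injector message.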

            People

            • Assignee: Lewis John McGibbney (lewismc)
            • Reporter: Luis Lopez (betolink)