Nutch / NUTCH-2737

Generator: count and log reason of rejections during selection

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Implemented
    • Affects Version/s: 1.15
    • Fix Version/s: 1.16
    • Component/s: generator
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      During the map phase of the selection step, the generator rejects many (usually most) items for various reasons:

      • not yet time for a refetch (returned by the fetch scheduler)
      • generator score too low
      • status does not match restrict status
      • Jexl expression not matched

      among others. It would be useful to count and log these reasons, especially as the CrawlDb grows and when multiple options restricting the selection are in use.
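
      A minimal sketch of the idea, not the actual patch: each rejection reason gets its own tally, which is reported once selection is done. The class, enum, and method names below are hypothetical and chosen only for illustration; in a MapReduce job the same tallies would more naturally be kept in Hadoop job counters than in an in-memory map.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical tally of rejection reasons; not part of Nutch. */
final class RejectionCounts {

    /** Reasons taken from the list above; the enum names are assumptions. */
    enum Reason {
        SCHEDULE_REJECTED,   // not yet time for a refetch
        SCORE_TOO_LOW,       // generator score too low
        STATUS_MISMATCH,     // status does not match restrict status
        EXPR_NOT_MATCHED     // Jexl expression not matched
    }

    private final Map<Reason, Long> counts = new EnumMap<>(Reason.class);

    /** Record one rejected item for the given reason. */
    void reject(Reason reason) {
        counts.merge(reason, 1L, Long::sum);
    }

    /** Number of items rejected for the given reason (0 if none). */
    long count(Reason reason) {
        return counts.getOrDefault(reason, 0L);
    }

    /** One "REASON=count" line per reason, suitable for logging. */
    String summary() {
        StringBuilder sb = new StringBuilder();
        for (Reason r : Reason.values()) {
            sb.append(r).append('=').append(count(r)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        RejectionCounts c = new RejectionCounts();
        c.reject(Reason.SCHEDULE_REJECTED);
        c.reject(Reason.SCHEDULE_REJECTED);
        c.reject(Reason.SCORE_TOO_LOW);
        System.out.print(c.summary());
    }
}
```

      With per-reason counts like these in the log, it becomes visible which restriction removes most items when several are combined.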

              People

              • Assignee:
                snagel Sebastian Nagel
                Reporter:
                snagel Sebastian Nagel
              • Votes:
                0
                Watchers:
                3