Nutch / NUTCH-3010

Injector: count unique number of injected URLs

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.19
    • Fix Version/s: 1.20
    • Component/s: injector
    • Labels: None
    • Patch Info: Patch Available

    Description

      Injector uses two counters: one for the total number of injected URLs, the other for the number of URLs "merged", that is, URLs already in the CrawlDb. There is no counter for the number of unique URLs injected, which may lead to wrong counts if the seed files contain duplicates.

      Consider the following seed file, which contains a duplicated URL:

      $> cat seeds_with_duplicates.txt 
      https://www.example.org/page1.html
      https://www.example.org/page2.html
      https://www.example.org/page2.html
      
      $> $NUTCH_HOME/bin/nutch inject /tmp/crawldb seeds_with_duplicates.txt
      ...
      2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0
      2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3
      2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 0
      2023-09-30 07:38:00,185 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 3
      ...
      

      However, because of the duplicated URL, only 2 URLs were injected into the CrawlDb:

      $> $NUTCH_HOME/bin/nutch readdb /tmp/crawldb -stats
      ...
      2023-09-30 07:39:43,945 INFO o.a.n.c.CrawlDbReader [main] TOTAL urls:   2
      ...
      

      If the Injector job is run again with the same input, the output erroneously reports that one "new URL" was injected:

      2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls rejected by filters: 0
      2023-09-30 07:41:13,625 INFO o.a.n.c.Injector [main] Injector: Total urls injected after normalization and filtering: 3
      2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total urls injected but already in CrawlDb: 2
      2023-09-30 07:41:13,626 INFO o.a.n.c.Injector [main] Injector: Total new urls injected: 1
      

      This is because the urls_merged counter counts unique items, while urls_injected does not, and the reported number of new URLs is the difference between the two counters.

      Adding a counter for the number of unique injected URLs makes it possible to report the correct count of newly injected URLs.
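
      The arithmetic behind both runs can be reproduced with a small standalone sketch. This is not the Nutch Injector code: the class `InjectCounterSketch`, its `Counts` holder, and the method names are hypothetical, and an in-memory `Set` stands in for the CrawlDb. It only illustrates why subtracting a unique count (merged) from a non-unique count (injected) overstates the number of new URLs, and how a unique-injected count would correct it.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative simulation only; names are hypothetical and not taken from Nutch.
public class InjectCounterSketch {

    /** Counter values collected during one simulated inject run. */
    public static final class Counts {
        public final long injected;       // every accepted seed line, duplicates included
        public final long injectedUnique; // distinct seed URLs
        public final long merged;         // distinct seed URLs already in the CrawlDb

        Counts(long injected, long injectedUnique, long merged) {
            this.injected = injected;
            this.injectedUnique = injectedUnique;
            this.merged = merged;
        }

        /** The calculation described in the issue: non-unique minus unique. */
        public long newByDifference() {
            return injected - merged;
        }

        /** The calculation made possible by a unique-injected counter. */
        public long newByUniqueCount() {
            return injectedUnique - merged;
        }
    }

    /** Simulates one inject run against an in-memory stand-in for the CrawlDb. */
    public static Counts inject(Set<String> crawlDb, List<String> seeds) {
        Set<String> unique = new HashSet<>(seeds);
        long merged = 0;
        for (String url : unique) {
            if (crawlDb.contains(url)) {
                merged++;
            }
        }
        crawlDb.addAll(unique);
        return new Counts(seeds.size(), unique.size(), merged);
    }

    public static void main(String[] args) {
        List<String> seeds = List.of(
            "https://www.example.org/page1.html",
            "https://www.example.org/page2.html",
            "https://www.example.org/page2.html");
        Set<String> crawlDb = new HashSet<>();

        for (int run = 1; run <= 2; run++) {
            Counts c = inject(crawlDb, seeds);
            System.out.println("run " + run
                + ": injected=" + c.injected
                + " merged=" + c.merged
                + " new (difference)=" + c.newByDifference()
                + " new (unique)=" + c.newByUniqueCount());
        }
    }
}
```

      With the seed file from the description, the difference-based figure is 3 on the first run and 1 on the second, matching the log output above, while the unique-based figure is 2 and then 0, matching the CrawlDb contents.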

            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: Sebastian Nagel (snagel)