Nutch / NUTCH-2565

MergeDB incorrectly handles unfetched CrawlDatums

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.14
    • Fix Version/s: 1.15
    • Component/s: crawldb
    • Labels: None
    • Patch Info: Patch Available
    • Flags: Patch

    Description

      I ran into this issue when merging a crawlDB that originated from sitemaps into our normal crawlDB. CrawlDatums are merged based on the output of AbstractFetchSchedule::calculateLastFetchTime(). An unfetched CrawlDatum has no actual last fetch, so the comparison can pick the wrong entry and overwrite fetchTime or other fields.

      I assume this is a bug and have a simple fix for it that checks whether the CrawlDatum has status db_unfetched.
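
      To illustrate the kind of check I mean, here is a minimal, self-contained sketch. It is not the Nutch code or the attached patch: Datum, Status, mergeNaive and mergeGuarded are hypothetical stand-ins, and deriving the last fetch time as fetchTime minus fetchInterval is an assumption made only for this example.

```java
public class MergeSketch {
    // Hypothetical stand-in for the CrawlDatum status; not the Nutch constants.
    enum Status { DB_UNFETCHED, DB_FETCHED }

    // Hypothetical stand-in for a CrawlDatum with only the fields this sketch needs.
    static final class Datum {
        final Status status;
        final long fetchTimeMs;      // fetch time in milliseconds
        final int fetchIntervalSec;  // fetch interval in seconds

        Datum(Status status, long fetchTimeMs, int fetchIntervalSec) {
            this.status = status;
            this.fetchTimeMs = fetchTimeMs;
            this.fetchIntervalSec = fetchIntervalSec;
        }
    }

    // Assumption for this sketch: last fetch time = fetchTime - fetchInterval.
    static long naiveLastFetchTime(Datum d) {
        return d.fetchTimeMs - d.fetchIntervalSec * 1000L;
    }

    // Guarded variant: an unfetched datum is treated as never fetched.
    static long guardedLastFetchTime(Datum d) {
        if (d.status == Status.DB_UNFETCHED) {
            return 0L;
        }
        return naiveLastFetchTime(d);
    }

    // Keep the datum with the later computed last fetch time (ties keep a).
    static Datum mergeNaive(Datum a, Datum b) {
        return naiveLastFetchTime(a) >= naiveLastFetchTime(b) ? a : b;
    }

    static Datum mergeGuarded(Datum a, Datum b) {
        return guardedLastFetchTime(a) >= guardedLastFetchTime(b) ? a : b;
    }

    public static void main(String[] args) {
        // A fetched entry with a 30-day interval, and an unfetched entry
        // (for example, one injected from a sitemap) with a 1-day interval.
        Datum fetched = new Datum(Status.DB_FETCHED, 5_000_000_000L, 2_592_000);
        Datum unfetched = new Datum(Status.DB_UNFETCHED, 4_000_000_000L, 86_400);

        System.out.println(mergeNaive(fetched, unfetched).status);   // prints DB_UNFETCHED
        System.out.println(mergeGuarded(fetched, unfetched).status); // prints DB_FETCHED
    }
}
```

      With these made-up values the unguarded comparison lets the unfetched entry replace the fetched one, while the guarded comparison keeps the fetched entry regardless of argument order.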

       

            People

              Assignee: Unassigned
              Reporter: Jurian Broertjes
              Votes: 0
              Watchers: 4
