  1. Nutch
  2. NUTCH-2219

Criteria order to be configurable in DeduplicationJob

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.12
    • Component/s: crawldb
    • Labels: None
    • Patch Info: Patch Available

    Description

      Current implementation:

      "This command takes a path to a crawldb as parameter and finds duplicates based on the signature. If several entries share the same signature, the one with the highest score is kept. If the scores are the same, then the fetch time is used to determine which one to keep with the most recent one being kept. If their fetch times are the same we keep the one with the shortest URL."

      The order in which these criteria are applied to select the document to keep is currently fixed and cannot be changed. Therefore I think an option like this would be useful:
      -compareOrder <score>,<fetchTime>,<urlLength>
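
To make the idea concrete, here is a minimal Java sketch of how a comma-separated criteria order could be turned into a comparator. It is an illustration only and not the code in the attached patch: the class `CompareOrderSketch`, the `Entry` type, and the methods `buildComparator` and `keep` are hypothetical names, and `Entry` is a simplified stand-in rather than Nutch's own crawldb record type. The per-criterion preferences follow the description quoted above (highest score, most recent fetch time, shortest URL).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class CompareOrderSketch {

  /** Hypothetical stand-in for a crawldb entry (not a Nutch class). */
  static final class Entry {
    final String url;
    final float score;
    final long fetchTime;

    Entry(String url, float score, long fetchTime) {
      this.url = url;
      this.score = score;
      this.fetchTime = fetchTime;
    }
  }

  /**
   * Builds a comparator from a list such as "score,fetchTime,urlLength".
   * The entry that should be kept sorts first.
   */
  static Comparator<Entry> buildComparator(String compareOrder) {
    Comparator<Entry> result = null;
    for (String name : compareOrder.split(",")) {
      Comparator<Entry> next;
      switch (name.trim()) {
        case "score": // highest score wins
          next = Comparator.comparingDouble((Entry e) -> e.score).reversed();
          break;
        case "fetchTime": // most recent fetch wins
          next = Comparator.comparingLong((Entry e) -> e.fetchTime).reversed();
          break;
        case "urlLength": // shortest URL wins
          next = Comparator.comparingInt((Entry e) -> e.url.length());
          break;
        default:
          throw new IllegalArgumentException("Unknown criterion: " + name);
      }
      // Later criteria only break ties left by earlier ones.
      result = (result == null) ? next : result.thenComparing(next);
    }
    return result;
  }

  /** Returns the entry to keep among entries sharing one signature. */
  static Entry keep(List<Entry> duplicates, String compareOrder) {
    return Collections.min(duplicates, buildComparator(compareOrder));
  }

  public static void main(String[] args) {
    List<Entry> dups = new ArrayList<>();
    dups.add(new Entry("http://example.com/a/long/path", 2.0f, 100L));
    dups.add(new Entry("http://example.com/a", 1.0f, 200L));

    // Current fixed order: the higher score decides.
    System.out.println(keep(dups, "score,fetchTime,urlLength").url);
    // Reordered: the shorter URL decides.
    System.out.println(keep(dups, "urlLength,score,fetchTime").url);
  }
}
```

With the two sample entries, the first call keeps `http://example.com/a/long/path` and the second keeps `http://example.com/a`, which shows how the same duplicate group can resolve differently once the order is configurable.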

      I have written a patch against trunk (rev 1730516). I'm looking forward to any peer review.

      Attachments

        1. NUTCH-2219.patch
          7 kB
          Markus Jelsma
        2. NUTCH-2219.patch
          7 kB
          Ron van der Vegt

          People

            Assignee: Markus Jelsma (markus17)
            Reporter: Ron van der Vegt (ronvandervegt)
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: