  1. Nutch
  2. NUTCH-2219

Criteria order to be configurable in DeduplicationJob

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.12
    • Component/s: crawldb
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Current implementation:

      "This command takes a path to a crawldb as parameter and finds duplicates based on the signature. If several entries share the same signature, the one with the highest score is kept. If the scores are the same, then the fetch time is used to determine which one to keep with the most recent one being kept. If their fetch times are the same we keep the one with the shortest URL."

      The order in which these criteria are applied to select the entry to keep is currently fixed. I therefore propose an option to make it configurable:
      -compareOrder <score>,<fetchTime>,<urlLength>

      I have written a patch against trunk (rev 1730516). I look forward to any peer review.
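
      To illustrate the idea, here is a minimal sketch of how a comma-separated criteria list could be turned into a chained comparator. It is not the code from the attached patch: the class `CompareOrderSketch`, the `Entry` type and the methods `fromCompareOrder` and `keep` are hypothetical names used only for this example, and `Entry` is a simplified stand-in rather than Nutch's own crawldb record type. The criteria names are taken from the option above; the direction of each criterion follows the quoted description (highest score, most recent fetch time, shortest URL).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class CompareOrderSketch {

    /** Hypothetical, simplified stand-in for a crawldb entry. */
    static final class Entry {
        final String url;
        final float score;
        final long fetchTime;

        Entry(String url, float score, long fetchTime) {
            this.url = url;
            this.score = score;
            this.fetchTime = fetchTime;
        }
    }

    /**
     * Builds a comparator from a list such as "score,fetchTime,urlLength".
     * The entry that sorts first is the one to keep.
     */
    static Comparator<Entry> fromCompareOrder(String compareOrder) {
        Comparator<Entry> result = null;
        for (String name : compareOrder.split(",")) {
            Comparator<Entry> next;
            switch (name.trim()) {
                case "score":      // highest score first
                    next = Comparator.comparingDouble((Entry e) -> e.score).reversed();
                    break;
                case "fetchTime":  // most recent fetch first
                    next = Comparator.comparingLong((Entry e) -> e.fetchTime).reversed();
                    break;
                case "urlLength":  // shortest URL first
                    next = Comparator.comparingInt((Entry e) -> e.url.length());
                    break;
                default:
                    throw new IllegalArgumentException("Unknown criterion: " + name);
            }
            // Later criteria only break ties left by earlier ones.
            result = (result == null) ? next : result.thenComparing(next);
        }
        return result;
    }

    /** Returns the entry to keep among entries sharing one signature. */
    static Entry keep(List<Entry> duplicates, String compareOrder) {
        return Collections.min(duplicates, fromCompareOrder(compareOrder));
    }

    public static void main(String[] args) {
        List<Entry> duplicates = Arrays.asList(
            new Entry("http://example.com/a", 1.0f, 100L),
            new Entry("http://example.com/long/path", 1.0f, 200L));

        System.out.println(keep(duplicates, "score,fetchTime,urlLength").url);
        System.out.println(keep(duplicates, "urlLength,score,fetchTime").url);
    }
}
```

      With the default order the two example entries tie on score, so the more recently fetched one is kept; putting `urlLength` first keeps the shorter URL instead.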

        Attachments

        1. NUTCH-2219.patch
          7 kB
          Markus Jelsma
        2. NUTCH-2219.patch
          7 kB
          Ron van der Vegt

            People

            • Assignee:
              Markus Jelsma (markus17)
            • Reporter:
              Ron van der Vegt (ronvandervegt)
            • Votes:
              0
            • Watchers:
              3

              Dates

              • Created:
                Updated:
                Resolved: