Nutch / NUTCH-420

DeleteDuplicates.HashPartitioner depends on the order of IndexDocs


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.9.0
    • Component/s: indexer
    • Labels: None

    Description

      DeleteDuplicates.HashPartitioner.reduce():

      // byScore case
      if (value.score > highest.score) {
        highest.keep = false;
        LOG.debug("-discard " + highest + ", keep " + value);
        output.collect(highest.url, highest); // delete highest
        highest = value;
      }
      // !byScore is also similar

      So assume two docs with the same hash are in values. If the second has a higher score than the first, then the first doc will be deleted. But if the second has a lower score than the first, then none will be deleted, so the result depends on the order in which the docs arrive. AFAICS, there should be an else condition to delete value and keep highest as it is.
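
A minimal, self-contained sketch of the proposed else branch follows. It is not the attached patch: `DedupSketch`, `Doc`, and `dedupByScore` are hypothetical stand-ins for the reducer, `IndexDoc`, and the byScore loop, and a plain list replaces `output.collect(...)`. It only illustrates that, with the else branch, the same doc is discarded regardless of the order of the values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DedupSketch {

  /** Hypothetical stand-in for IndexDoc; holds only the fields used in the snippet. */
  static class Doc {
    final String url;
    final float score;
    boolean keep = true;

    Doc(String url, float score) {
      this.url = url;
      this.score = score;
    }
  }

  /**
   * byScore case with the proposed else branch. Returns the docs marked for
   * deletion (assumption: this list stands in for output.collect).
   */
  static List<Doc> dedupByScore(List<Doc> values) {
    List<Doc> discarded = new ArrayList<>();
    Doc highest = null;
    for (Doc value : values) {
      if (highest == null) {
        highest = value; // first doc seen for this hash
        continue;
      }
      if (value.score > highest.score) {
        highest.keep = false;
        discarded.add(highest); // delete the previous highest
        highest = value;
      } else {
        value.keep = false;
        discarded.add(value); // proposed: delete value, keep highest as it is
      }
    }
    return discarded;
  }

  public static void main(String[] args) {
    Doc low = new Doc("http://example.com/low", 1.0f);
    Doc high = new Doc("http://example.com/high", 2.0f);
    // Both orders discard the lower-scoring doc.
    for (Doc d : dedupByScore(Arrays.asList(high, low))) {
      System.out.println("discard " + d.url);
    }
  }
}
```

Without the else branch, the `(high, low)` order would leave both docs marked as kept, which is the order dependence described above. Docs with equal scores fall into the else branch here, so the earlier one is kept; that tie-breaking choice is an assumption of this sketch.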

      Attachments

        1. index.tar.gz
          0.9 kB
          Dogacan Guney
        2. dedup-v3.patch
          5 kB
          Dogacan Guney
        3. dedup-v2.patch
          2 kB
          Dogacan Guney
        4. dedup.patch
          2 kB
          Dogacan Guney

          People

            Assignee: Andrzej Bialecki (ab)
            Reporter: Dogacan Guney (dogacan)
