  1. Nutch
  2. NUTCH-2391

Spurious Duplications for MD5

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.14
    • Component/s: commoncrawl
    • Labels: None

    Description

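      We are seeing a large number of documents incorrectly marked as duplicates in our crawl.

      We traced this back to one of the crawl plugins returning an empty array for the content field. An empty array is not null, so every such document gets the MD5 of empty input and therefore the same signature.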

      We'd like to propose changing the MD5 signature generation from:

      public byte[] calculate(Content content, Parse parse) {
        byte[] data = content.getContent();
        // Fall back to the URL only when the content is null.
        if (data == null)
          data = content.getUrl().getBytes();
        return MD5Hash.digest(data).getDigest();
      }
      

      to:

      public byte[] calculate(Content content, Parse parse) {
        byte[] data = content.getContent();
        // Fall back to the URL when the content is null or empty.
        if ((data == null) || (data.length == 0))
          data = content.getUrl().getBytes();
        return MD5Hash.digest(data).getDigest();
      }
      

      to address the issue: empty content would then fall back to the URL, just as null content already does.
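
The effect of the proposed check can be illustrated with a small standalone sketch. It is not the Nutch code: it assumes plain `java.security.MessageDigest` as a stand-in for `MD5Hash`, takes the content and URL as raw arguments instead of a `Content` object, and encodes the URL as UTF-8 (the snippets above use the platform default charset). The class name, method names, and example URLs are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical illustration class; not part of Nutch.
public class EmptyContentSignatureSketch {

  // Plain MD5 over a byte array, standing in for MD5Hash.digest(data).getDigest().
  static byte[] md5(byte[] data) {
    try {
      return MessageDigest.getInstance("MD5").digest(data);
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  // Current behavior: fall back to the URL only when the content is null.
  static byte[] signatureOld(byte[] content, String url) {
    byte[] data = content;
    if (data == null)
      data = url.getBytes(StandardCharsets.UTF_8); // assumption: UTF-8 encoding
    return md5(data);
  }

  // Proposed behavior: also fall back to the URL when the content is empty.
  static byte[] signatureNew(byte[] content, String url) {
    byte[] data = content;
    if ((data == null) || (data.length == 0))
      data = url.getBytes(StandardCharsets.UTF_8); // assumption: UTF-8 encoding
    return md5(data);
  }

  public static void main(String[] args) {
    byte[] empty = new byte[0];
    String urlA = "http://example.com/a"; // made-up URLs
    String urlB = "http://example.com/b";

    // Two different pages with empty content collide under the current check...
    System.out.println("old, same signature: "
        + Arrays.equals(signatureOld(empty, urlA), signatureOld(empty, urlB)));
    // ...but get distinct signatures once empty content falls back to the URL.
    System.out.println("new, same signature: "
        + Arrays.equals(signatureNew(empty, urlA), signatureNew(empty, urlB)));
  }
}
```

Under these assumptions the first line reports `true` and the second `false`, which matches the spurious duplicates described above and their disappearance with the proposed check.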

            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: David Johnson (kakrofoon)
              Votes: 1
              Watchers: 6
