Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.14
    • Component/s: commoncrawl
    • Labels: None

      Description

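      We are seeing a large number of documents marked as duplicates in our crawl.

      We traced this to one of the crawl plugins returning an empty array for the content field. Because the signature is the MD5 digest of the content bytes, every document with empty content gets the same signature and is treated as a duplicate of the others.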

      We'd like to propose changing the MD5 signature generation from:

      public byte[] calculate(Content content, Parse parse) {
        byte[] data = content.getContent();
        if (data == null)
          data = content.getUrl().getBytes();
        return MD5Hash.digest(data).getDigest();
      }

      to:

      public byte[] calculate(Content content, Parse parse) {
        byte[] data = content.getContent();
        // Treat empty content like missing content: fall back to the URL.
        if ((data == null) || (data.length == 0))
          data = content.getUrl().getBytes();
        return MD5Hash.digest(data).getDigest();
      }

      so that empty content also falls back to the URL, which addresses the issue.
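The effect of the proposed check can be reproduced outside Nutch. The following is a minimal sketch, not the Nutch implementation: the class `SignatureSketch` and the methods `calculateOld` and `calculateNew` are hypothetical names, the JDK's `MessageDigest` stands in for `MD5Hash`, and the content and URL are passed directly as a byte array and a string. It also assumes UTF-8 for the URL bytes, whereas the original code uses the platform default charset.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class SignatureSketch {

  // Plain JDK MD5; a stand-in (assumption) for MD5Hash.digest(data).getDigest().
  public static byte[] md5(byte[] data) {
    try {
      return MessageDigest.getInstance("MD5").digest(data);
    } catch (NoSuchAlgorithmException e) {
      // MD5 is one of the algorithms every Java platform must provide.
      throw new IllegalStateException(e);
    }
  }

  // Current behaviour: only null content falls back to the URL.
  public static byte[] calculateOld(byte[] content, String url) {
    byte[] data = content;
    if (data == null)
      data = url.getBytes(StandardCharsets.UTF_8);
    return md5(data);
  }

  // Proposed behaviour: null or empty content falls back to the URL.
  public static byte[] calculateNew(byte[] content, String url) {
    byte[] data = content;
    if (data == null || data.length == 0)
      data = url.getBytes(StandardCharsets.UTF_8);
    return md5(data);
  }

  public static void main(String[] args) {
    byte[] empty = new byte[0];
    String urlA = "http://example.com/a";
    String urlB = "http://example.com/b";

    // With the old check, two different URLs with empty content collide.
    System.out.println("old signatures equal: "
        + Arrays.equals(calculateOld(empty, urlA), calculateOld(empty, urlB)));

    // With the new check, each URL gets its own signature.
    System.out.println("new signatures equal: "
        + Arrays.equals(calculateNew(empty, urlA), calculateNew(empty, urlB)));
  }
}
```

Running `main` should report that the old signatures are equal and the new ones are not, which matches the duplicate marking described above.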

              People

              • Assignee: Sebastian Nagel (wastl-nagel)
              • Reporter: David Johnson (kakrofoon)
              • Votes: 1
              • Watchers: 6
