Nutch / NUTCH-2232

DeduplicationJob should decode URLs before length is compared

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.12
    • Component/s: crawldb
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      When several documents have the same signature, the deduplication job elects one of them as the duplicate. URLs are stored URL-encoded in the crawldb. When two URLs are compared by length, they are not decoded first, so the comparison can be based on a misleading length: each encoded character counts as three characters instead of one.
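
      The effect can be illustrated with a minimal sketch. This is not the code from the attached patch; the class and method names (`DecodedLengthSketch`, `decodedLength`, `shorterDecoded`) and the example URLs are hypothetical, and it assumes the shorter URL is the preferred one. It only shows how decoding with `java.net.URLDecoder` before measuring can change which URL counts as shorter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Hypothetical helper for illustration; not the code from the attached patch. */
final class DecodedLengthSketch {

  /** Length of the URL after percent-decoding; falls back to the raw length. */
  static int decodedLength(String url) {
    try {
      // URLDecoder also maps '+' to a space (form encoding); that does not change the length.
      return URLDecoder.decode(url, "UTF-8").length();
    } catch (UnsupportedEncodingException | IllegalArgumentException e) {
      // Malformed escape such as "%zz": measure the stored form instead.
      return url.length();
    }
  }

  /** Returns the URL that is shorter once decoded; ties go to the first argument. */
  static String shorterDecoded(String a, String b) {
    return decodedLength(a) <= decodedLength(b) ? a : b;
  }

  public static void main(String[] args) {
    String encoded = "http://example.com/a%20b"; // 24 chars as stored, 22 decoded
    String plain = "http://example.com/abcd";    // 23 chars either way

    // Comparing the stored forms makes the encoded URL look longer.
    System.out.println(encoded.length() > plain.length());
    // Comparing the decoded forms picks the encoded URL as the shorter one.
    System.out.println(shorterDecoded(encoded, plain));
  }
}
```

      In this sketch a URL with a malformed escape sequence is measured in its stored form rather than rejected; whether that fallback is appropriate for the crawldb is a design choice not stated in this issue.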

        Attachments

        1. NUTCH-2232.patch
          1 kB
          Markus Jelsma
        2. NUTCH-2232.patch
          1 kB
          Ron van der Vegt

            People

            • Assignee: Markus Jelsma (markus17)
            • Reporter: Ron van der Vegt (ronvandervegt)
            • Votes: 0
            • Watchers: 3

              Dates

              • Created:
                Updated:
                Resolved: