Nutch / NUTCH-2232

DeduplicationJob should decode URLs before length is compared


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.12
    • Component/s: crawldb
    • Labels: None
    • Patch Info: Patch Available

    Description

      When several documents have the same signature, the deduplication job will mark one of them as a duplicate. URLs are stored URL-encoded in the crawldb. When two URLs are compared by length, they are not decoded first. Because each percent-encoded character takes up three characters (for example, %20 for a space), the compared lengths can be misleading.
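The effect described above can be illustrated with a minimal sketch. This is not the attached patch: the class `UrlLengthSketch` and the helpers `decode` and `shorterDecoded` are hypothetical names, the example URLs are made up, and the use of `java.net.URLDecoder` with UTF-8 is an assumption about how decoding might be done.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical illustration only; not the code from NUTCH-2232.patch.
class UrlLengthSketch {

  // Decode percent-escapes, falling back to the raw string when the
  // input contains a malformed escape sequence.
  static String decode(String url) {
    try {
      return URLDecoder.decode(url, "UTF-8");
    } catch (UnsupportedEncodingException | IllegalArgumentException e) {
      return url;
    }
  }

  // Return whichever URL is shorter once both have been decoded.
  static String shorterDecoded(String a, String b) {
    return decode(a).length() <= decode(b).length() ? a : b;
  }

  public static void main(String[] args) {
    String encoded = "http://example.org/caf%C3%A9"; // 28 chars raw, 23 decoded
    String plain = "http://example.org/cafe-bar";    // 27 chars either way

    // Raw comparison treats the encoded URL as the longer one.
    System.out.println(encoded.length() > plain.length());
    // Decoded comparison treats it as the shorter one.
    System.out.println(shorterDecoded(encoded, plain).equals(encoded));
  }
}
```

Note that `URLDecoder` follows form-encoding rules and also turns `+` into a space; that does not change the length, but a real fix might prefer a different decoder.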

      Attachments

        1. NUTCH-2232.patch
          1 kB
          Ron van der Vegt
        2. NUTCH-2232.patch
          1 kB
          Markus Jelsma


          People

            markus17 Markus Jelsma
            ronvandervegt Ron van der Vegt
            Votes: 0
            Watchers: 4
