  1. Nutch
  2. NUTCH-2683

DeduplicationJob: add option to prefer https:// over http://

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: 1.15
    • Fix Version/s: 1.16
    • Component/s: crawldb
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The deduplication job can keep the shortest URL as the "best" URL of a set of duplicates, marking all longer ones as duplicates. Search engines have recently started to penalize non-https pages by ranking https pages higher and marking http pages as insecure.

      If URLs are identical except for the protocol, the deduplication job should be able to prefer https:// over http:// URLs, even though the latter are shorter by one character. This should be configurable and should apply in addition to the existing preferences (length, score and fetch time) used to select the "best" URL among duplicates.
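
      The requested preference could be sketched as follows. This is only an illustration of the idea, not the DeduplicationJob code: the class name `HttpsPreference`, its methods and the `preferHttps` flag are hypothetical, the fallback covers only URL length (score and fetch time are left out), and URLs are assumed to be normalized already so that the scheme is lowercase.

```java
// Hypothetical helper, not part of Nutch: picks the URL to keep among two duplicates.
class HttpsPreference {

    /** Returns the URL without its http:// or https:// prefix, or null for other schemes. */
    static String stripScheme(String url) {
        if (url.startsWith("https://")) {
            return url.substring("https://".length());
        }
        if (url.startsWith("http://")) {
            return url.substring("http://".length());
        }
        return null;
    }

    /** True if one URL is http://, the other https://, and the rest is identical. */
    static boolean differOnlyByProtocol(String a, String b) {
        String restA = stripScheme(a);
        String restB = stripScheme(b);
        return restA != null && restA.equals(restB) && !a.equals(b);
    }

    /**
     * Returns the URL to keep. With preferHttps enabled, the https:// variant
     * wins when the two URLs differ only by protocol; otherwise the shorter
     * URL wins, and a tie keeps the first argument.
     */
    static String pickBest(String a, String b, boolean preferHttps) {
        if (preferHttps && differOnlyByProtocol(a, b)) {
            return a.startsWith("https://") ? a : b;
        }
        return a.length() <= b.length() ? a : b;
    }
}
```

      With the flag enabled, `pickBest("http://example.com/", "https://example.com/", true)` keeps the https:// URL; with it disabled, the length rule keeps the http:// URL because it is one character shorter.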

              People

              • Assignee:
                wastl-nagel Sebastian Nagel
                Reporter:
                wastl-nagel Sebastian Nagel
              • Votes:
                0
                Watchers:
                3
