Nutch / NUTCH-2683

DeduplicationJob: add option to prefer https:// over http://

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version: 1.15
    • Fix Version: 1.16
    • Component: crawldb
    • Labels: None
    • Patch Info: Patch Available

    Description

      The deduplication job can keep the shortest URL as the "best" URL of a set of duplicates, marking all longer ones as duplicates. Recently, search engines have started to penalize non-https pages by ranking https pages higher and marking http pages as insecure.

      If URLs are identical except for the protocol, the deduplication job should be able to prefer the https:// URL over the http:// URL, even though the latter is shorter by one character. This should be configurable, and it should work in addition to the existing preferences (length, score and fetch time) used to select the "best" URL among duplicates.
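
      The requested behavior could be sketched as follows. This is a minimal, standalone illustration and not the code of the Nutch DeduplicationJob: the class name HttpsPreferenceSketch, the method names, and the boolean preferHttps switch are all hypothetical, and the score and fetch time criteria are left out for brevity.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: pick the "best" URL among duplicates, optionally preferring https. */
public class HttpsPreferenceSketch {

    /** Returns the URL without its scheme if it starts with http:// or https://, else null. */
    static String stripHttpScheme(String url) {
        if (url.startsWith("https://")) {
            return url.substring("https://".length());
        }
        if (url.startsWith("http://")) {
            return url.substring("http://".length());
        }
        return null;
    }

    /** True if the two URLs are identical except that one uses http and the other https. */
    static boolean differOnlyByHttpScheme(String a, String b) {
        String restA = stripHttpScheme(a);
        String restB = stripHttpScheme(b);
        return restA != null && restA.equals(restB) && !a.equals(b);
    }

    /**
     * Selects the best URL from a list of duplicates.
     * If preferHttps is set (assumed to come from configuration) and two candidates
     * differ only by protocol, the https:// one wins. Otherwise the shorter URL wins.
     */
    static String pickBest(List<String> duplicates, boolean preferHttps) {
        String best = null;
        for (String url : duplicates) {
            if (best == null) {
                best = url;
            } else if (preferHttps && differOnlyByHttpScheme(best, url)) {
                if (url.startsWith("https://")) {
                    best = url;
                }
            } else if (url.length() < best.length()) {
                best = url;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> dups = Arrays.asList(
            "http://example.org/page", "https://example.org/page");
        System.out.println(pickBest(dups, false));
        System.out.println(pickBest(dups, true));
    }
}
```

      With the switch off, the length rule alone decides and the http:// URL is kept because it is one character shorter. With the switch on, the protocol check takes precedence for pairs that differ only by protocol, and the length rule still applies to every other pair.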

            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: Sebastian Nagel (snagel)
