Description
The deduplication job can keep the shortest URL as the "best" URL of a set of duplicates and mark all longer ones as duplicates. Recently, search engines have started to penalize non-https pages by ranking https pages higher and marking http pages as insecure.
If two URLs are identical except for the protocol, the deduplication job should be able to prefer the https:// URL over the http:// one, even though the latter is shorter by one character. This should be configurable and should work in addition to the existing criteria (length, score and fetch time) for selecting the "best" URL among duplicates.
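To make the requested behaviour concrete, here is a minimal sketch of the pairwise decision. It is not the deduplication job's actual code: the class HttpsPreferenceSketch, the method names and the preferHttps flag are hypothetical, URLs are assumed to be normalized to a lower-case scheme, and only URL length is used as the fallback criterion (score and fetch time are left out).

```java
// Hypothetical illustration only; names and structure are assumptions,
// not taken from the deduplication job's source.
final class HttpsPreferenceSketch {

    // Returns the URL without its scheme if it is http or https, else null.
    static String stripHttpScheme(String url) {
        if (url.startsWith("https://")) {
            return url.substring("https://".length());
        }
        if (url.startsWith("http://")) {
            return url.substring("http://".length());
        }
        return null;
    }

    // True if the two URLs are identical apart from http:// versus https://.
    static boolean differOnlyByScheme(String a, String b) {
        String restA = stripHttpScheme(a);
        return restA != null && !a.equals(b) && restA.equals(stripHttpScheme(b));
    }

    // Picks the "best" of two duplicate URLs.
    // preferHttps stands in for the requested configuration switch.
    static String pickBest(String a, String b, boolean preferHttps) {
        if (preferHttps && differOnlyByScheme(a, b)) {
            // Same URL except for the protocol: https wins despite being longer.
            return a.startsWith("https://") ? a : b;
        }
        // Fallback criterion: the shorter URL wins; on a tie, keep the first.
        return a.length() <= b.length() ? a : b;
    }
}
```

With preferHttps enabled, pickBest("http://example.com/a", "https://example.com/a", true) returns the https URL; with it disabled, the shorter http URL is kept as before. Because the protocol rule only applies to pairs that differ in nothing else, the result for a larger set of duplicates may depend on the order in which pairs are compared.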