As suggested previously, we could either treat canonicals as redirections or handle them during deduplication. Neither is a satisfactory solution.
Redirection: we want to index the document if/when the target of the canonical is not available for indexing, and we also want to follow its outlinks.
Dedup: we could modify the *DeleteDuplicates code, but canonicals are more complex due to the fact that we need to follow redirections.
We probably need a third approach: prefilter by going through the crawlDB and detecting URLs whose canonical target is already indexed or ready to be indexed. We need to follow up to X levels of redirection, e.g. doc A is marked as having doc B as its canonical representation, doc B redirects to doc C, etc. If the end of the redirection chain exists and is valid, then mark A as a duplicate of C (the intermediate redirects will not get indexed anyway).
As we don't know whether such a URL has already been indexed, we would give it a special marker (e.g. status_duplicate) in the crawlDB. Then:
-> if the indexer comes across such an entry: skip it
-> make it so that *deleteDuplicates can take the list of URLs with status_duplicate as an additional source of input, OR have a custom resource that deletes such entries from the SOLR or Lucene indices (see the sketch below)
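A minimal sketch of both consumers of the new status, assuming a hypothetical STATUS_DUPLICATE constant and in-memory views of the crawlDB (none of this is existing Nutch API; the real hook points would be the indexer job and *deleteDuplicates):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DuplicateStatusHandling {
    // Hypothetical marker value; status_duplicate does not exist as a
    // CrawlDatum status today.
    static final byte STATUS_DUPLICATE = 0x7f;

    /** Indexer side: entries flagged as duplicates are simply skipped. */
    static boolean shouldIndex(byte crawlDbStatus) {
        return crawlDbStatus != STATUS_DUPLICATE;
    }

    /** Dedup side: collect the flagged URLs so that *deleteDuplicates (or a
     *  custom tool) can remove them from the SOLR/Lucene index. The map is
     *  an in-memory stand-in for a scan of the crawlDB. */
    static List<String> urlsToDelete(Map<String, Byte> crawlDbStatuses) {
        List<String> toDelete = new ArrayList<>();
        for (Map.Entry<String, Byte> e : crawlDbStatuses.entrySet()) {
            if (e.getValue() == STATUS_DUPLICATE) {
                toDelete.add(e.getKey());
            }
        }
        return toDelete;
    }
}
```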
The implementation would be as follows:
Go through all redirections and generate all redirection chains, e.g. given:
A -> B
B -> C
D -> C
where C is an indexable document (i.e. it has been fetched and parsed; it may already have been indexed), we would obtain:
A -> C
B -> C
D -> C
C -> C
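A minimal sketch of this alias-generation pass, assuming in-memory maps in place of a MapReduce job over the crawlDB and a configurable limit on the number of redirect levels to follow (all names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class AliasGenerator {
    /** Given the raw redirections (A -> B, B -> C, D -> C) and the set of
     *  indexable documents ({C}), produce the resolved aliases
     *  A -> C, B -> C, D -> C and C -> C, as in the example above. */
    static Map<String, String> buildAliases(Map<String, String> redirects,
                                            Set<String> indexable,
                                            int maxLevels) {
        Map<String, String> aliases = new HashMap<>();
        // every indexable document is an alias of itself (C -> C)
        for (String url : indexable) {
            aliases.put(url, url);
        }
        // follow each redirection chain down to an indexable endpoint
        for (String source : redirects.keySet()) {
            String current = source;
            for (int level = 0; level <= maxLevels && current != null; level++) {
                if (indexable.contains(current)) {
                    aliases.put(source, current);   // e.g. A -> C
                    break;
                }
                current = redirects.get(current);   // follow one more redirect
            }
        }
        return aliases;
    }
}
```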
Once we have all possible redirections, go through the crawlDB in search of canonicals. If the target of a canonical is the source of a valid alias (i.e. A, B, C or D above), mark the entry carrying the canonical as 'status:duplicate'.
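Using the alias map from the previous sketch, the canonical pass could then look roughly like this (again purely illustrative; in practice this would be another pass over the crawlDB):

```java
import java.util.HashMap;
import java.util.Map;

public class CanonicalMarker {
    /** Returns, for every URL carrying a canonical whose target resolves
     *  through the alias map, the document it duplicates; those entries
     *  would be rewritten in the crawlDB with 'status:duplicate'. */
    static Map<String, String> markDuplicates(Map<String, String> canonicals,
                                              Map<String, String> aliases) {
        Map<String, String> duplicateOf = new HashMap<>();
        for (Map.Entry<String, String> e : canonicals.entrySet()) {
            String url = e.getKey();                         // e.g. A
            String canonicalTarget = e.getValue();           // e.g. B
            String endpoint = aliases.get(canonicalTarget);  // e.g. C
            if (endpoint != null && !endpoint.equals(url)) {
                duplicateOf.put(url, endpoint);              // A is a duplicate of C
            }
        }
        return duplicateOf;
    }
}
```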
This design implies generating quite a few intermediate structures, scanning the whole crawlDB twice (once for the aliases, then for the canonicals) and rewriting the whole crawlDB to mark some of the entries as duplicates.
This would be much easier to do once we have Nutch2/HBase: we could simply follow the redirects from the initial URL carrying a canonical tag instead of generating these intermediate structures, and we could then modify the entries one by one instead of regenerating the whole crawlDB.
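For illustration only, a sketch of what that per-entry variant could look like against a hypothetical get/put row store standing in for the Nutch2/HBase backend (this is not the Gora API, and WebPage here is a made-up record):

```java
// Hypothetical row store: redirects are followed row by row starting from
// the URL carrying the canonical, and only the affected entry is updated,
// instead of rewriting the whole crawlDB.
interface RowStore {
    WebPage get(String url);
    void put(String url, WebPage page);
}

class WebPage {                        // illustrative stand-in, not Nutch's WebPage
    String canonical;                  // canonical target declared by this page, if any
    String redirectsTo;                // redirect target, if any
    String status;
    boolean isIndexable() { return "fetched_and_parsed".equals(status); }
}

class PerEntryCanonicalMarker {
    static void markIfDuplicate(String url, RowStore store, int maxLevels) {
        WebPage page = store.get(url);
        if (page == null || page.canonical == null) return;
        String current = page.canonical;
        for (int level = 0; level <= maxLevels && current != null; level++) {
            WebPage target = store.get(current);
            if (target == null) return;            // chain broken: leave the entry as is
            if (target.isIndexable()) {            // valid end of the redirection chain
                page.status = "duplicate";
                store.put(url, page);              // update this single entry only
                return;
            }
            current = target.redirectsTo;          // follow one more redirect
        }
    }
}
```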