Details
Improvement
Status: Closed
Major
Resolution: Fixed
None
None
None
All
Patch Available
Description
By default, the regex-urlnormalizers only remove PHPSESSID strings. I propose adding more default URL normalizers, including expressions for removing different types of session IDs, removing default pages, removing interpage links, and cleaning up URL strings. The point of these expressions is to decrease the number of duplicate URLs that are stored and scored in the crawl database and then fetched.
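
To illustrate the kind of rules proposed here, the following is a minimal sketch in Python, not the patch itself. The patterns, the parameter names (sid, sessionid, jsessionid), the default page names, and the normalize helper are all assumptions made for the example; the actual expressions in the patch may differ. It reads "interpage links" as URL fragments such as #top, which is also an assumption.

```python
import re

# Hypothetical (pattern, substitution) rules, applied in order.
# None of these are taken from the patch; they only illustrate the idea.
RULES = [
    # Drop ";jsessionid=..." path parameters.
    (re.compile(r";jsessionid=[^?#&]*", re.IGNORECASE), ""),
    # Drop common session-id query parameters, keeping the leading separator.
    (re.compile(r"([?&])(?:PHPSESSID|sid|sessionid)=[^&#]*&?", re.IGNORECASE), r"\1"),
    # Drop fragments (assumed meaning of "interpage links").
    (re.compile(r"#.*$"), ""),
    # Replace a default page at the end of the path with "/".
    (re.compile(r"/(?:index|default)\.(?:html?|php|jsp|asp)(?=$|\?)", re.IGNORECASE), "/"),
    # Clean up separators left dangling at the end of the URL.
    (re.compile(r"[?&]+$"), ""),
]


def normalize(url):
    """Apply every rule in order and return the rewritten URL."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url


if __name__ == "__main__":
    variants = [
        "http://example.com/",
        "http://example.com/index.html",
        "http://example.com/index.html?PHPSESSID=abc123",
        "http://example.com/#news",
    ]
    # All four variants collapse to a single crawl database entry.
    print({normalize(u) for u in variants})
```

With rules like these, the four variants in the usage example reduce to one URL, so only one entry would be stored, scored, and fetched instead of four.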