Description
Hi all,
While running the CommonCrawlDataDumper tool (bin/nutch commoncrawldump) with the new options described in NUTCH-1975, I noticed some problems when the crawled data contains an invalid URL.
For example, the following URLs (which I found in crawled data) break the naming scheme used by the -epochFilename command-line option:
- http://www/
- http:/
In more detail: with the -epochFilename option, the extracted files are organized in a reversed-DNS tree based on the FQDN of the web page, followed by a SHA1 hash of the complete URL. When the tool encounters URLs like those above, it is not able to build the reversed-DNS tree.
Attached is a simple patch for detecting invalid URLs. The patch uses the Apache Commons Validator API:
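To make the naming scheme concrete, here is a minimal sketch of how such a path could be derived. It is not the actual CommonCrawlDataDumper code: the class name, the method names, the use of "/" between reversed labels, and the rule that a host without a dot is unusable are all assumptions made for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch, not the Nutch implementation.
final class ReversedDnsPathSketch {

    // Reverses the dot-separated labels of a host name,
    // e.g. "www.example.com" becomes "com/example/www".
    static String reverseHost(String host) {
        String[] labels = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]);
            if (i > 0) {
                sb.append('/');
            }
        }
        return sb.toString();
    }

    // Returns the lowercase hex SHA-1 digest of the UTF-8 bytes of s.
    static String sha1Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Builds "<reversed host>/<sha1 of full url>", or returns null when
    // no multi-label host can be extracted (assumed failure condition).
    static String buildPath(String url) {
        String host;
        try {
            host = new URL(url).getHost();
        } catch (MalformedURLException e) {
            return null;
        }
        if (host == null || host.isEmpty() || !host.contains(".")) {
            return null;
        }
        return reverseHost(host) + "/" + sha1Hex(url);
    }

    public static void main(String[] args) {
        System.out.println(buildPath("http://www.example.com/index.html"));
        System.out.println(buildPath("http://www/")); // no domain to reverse
        System.out.println(buildPath("http:/"));      // no host at all
    }
}
```

Under these assumptions, both problem URLs yield no path, because there is either no host or only a single label to build the tree from.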
UrlValidator urlValidator = new UrlValidator();
if (!urlValidator.isValid(url)) {
  LOG.warn("Not valid URL detected: " + url);
}
The tool logs a warning message if an invalid URL is detected. I am wondering whether we should take a specific action when invalid URLs occur. We could skip invalid URLs, but I noticed that the following URLs are also reported as invalid:
2015-04-15 13:49:40,386 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://www.reddit.com/r/agora/comments/22ezoa/how_to_buy_drugs_on_agora_hur_man_köper_droger_på/
2015-04-15 13:49:41,603 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://www/
2015-04-15 13:49:41,632 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http:/
2015-04-15 13:49:44,601 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://allthingsvice.com/2012/05/30/the-great-420-scam/\/\/allthingsvice.com\/2012\/05\/30\/the-great-420-scam\/
2015-04-15 13:50:34,821 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://www.reddit.com/r/agora/comments/22ezoa/how_to_buy_drugs_on_agora_hur_man_köper_droger_på/
2015-04-15 13:50:35,847 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://www/
2015-04-15 13:50:35,866 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http:/
2015-04-15 13:50:38,605 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://allthingsvice.com/2012/05/30/the-great-420-scam/\/\/allthingsvice.com\/2012\/05\/30\/the-great-420-scam\/
2015-04-15 13:51:20,013 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://antilop.cc/sr/users/nomad bloodbath
2015-04-15 13:51:20,499 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/ars.to\/1aPaqvW
2015-04-15 13:51:20,500 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/arstechnica.com
2015-04-15 13:51:20,500 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/arstechnica.com\/gaming\/2015\/04\/mortal-kombat-x-charges-players-for-easy-fatalities\/
2015-04-15 13:51:20,500 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/cdn.arstechnica.net\/wp-content\/themes\/arstechnica\/assets
2015-04-15 13:51:20,500 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/civis
2015-04-15 13:51:20,588 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/\/ars.to\/1tECmHU
2015-04-15 13:51:20,589 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/\/arstechnica.com
2015-04-15 13:51:20,589 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/\/arstechnica.com\/tech-policy\/2014\/11\/prosecutor-silk-road-2-0-suspect-did-admit-to-everything\/
2015-04-15 13:51:20,590 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/\/cdn.arstechnica.net\/wp-content\/themes\/arstechnica\/assets
2015-04-15 13:51:20,590 WARN tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/civis
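Several of the URLs logged above fail validation yet still carry a host name, which is the only part the reversed-DNS tree needs. One possible narrower check, sketched here with java.net.URL only, asks whether a host can be extracted at all. This is not part of the attached patch; the class and method names are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch: a lenient check that only looks for a host.
final class HostCheckSketch {

    // Returns the host of the URL, or null if none can be extracted.
    static String hostOrNull(String url) {
        try {
            String host = new URL(url).getHost();
            return (host == null || host.isEmpty()) ? null : host;
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Samples copied from the log; backslashes are escaped for Java.
        String[] samples = {
            "http://antilop.cc/sr/users/nomad bloodbath",
            "http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\\/civis",
            "http://www/",
            "http:/"
        };
        for (String url : samples) {
            System.out.println(url + " -> " + hostOrNull(url));
        }
    }
}
```

With this check only http:/ is left without a host. http://www/ still yields the single label www, so it would need separate handling.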
I would be very pleased to get your feedback on what action to take when invalid URLs are detected, so that we neither drop data nor break the naming scheme when the -epochFilename option is used.
Next, I am going to add a counter for invalid URLs. Thanks to lewismc for supporting me on this work.
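A rough sketch of what such a counter might look like follows. The class name is hypothetical, and the regular expression is only a stand-in so that the example runs without extra dependencies; in the tool the check would be urlValidator.isValid(url).

```java
import java.util.function.Predicate;

// Hypothetical sketch of an invalid-URL counter.
final class InvalidUrlCounterSketch {

    // Stand-in validity rule (assumption): http(s) scheme, a host that
    // contains a dot, and no whitespace anywhere in the URL.
    static final Predicate<String> STAND_IN_RULE =
        u -> u.matches("https?://[^/\\s]+\\.[^/\\s]+(/\\S*)?");

    private final Predicate<String> isValid;
    private long invalidCount = 0;

    InvalidUrlCounterSketch(Predicate<String> isValid) {
        this.isValid = isValid;
    }

    // Checks one URL and bumps the counter if it is invalid.
    // The URL itself is never dropped here; the caller decides what to do.
    boolean record(String url) {
        boolean ok = isValid.test(url);
        if (!ok) {
            invalidCount++;
        }
        return ok;
    }

    long getInvalidCount() {
        return invalidCount;
    }

    public static void main(String[] args) {
        InvalidUrlCounterSketch counter = new InvalidUrlCounterSketch(STAND_IN_RULE);
        String[] urls = {
            "http://nutch.apache.org/",
            "http://www/",
            "http:/",
            "http://antilop.cc/sr/users/nomad bloodbath"
        };
        for (String url : urls) {
            counter.record(url);
        }
        System.out.println("Invalid URLs: " + counter.getInvalidCount());
    }
}
```

Reporting the total once at the end of the run would give a quick view of how many records are affected without scanning the whole log.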