  1. Nutch
  2. NUTCH-1989

Handling invalid URLs in CommonCrawlDataDumper

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.10
    • Component/s: tool
    • Labels:

      Description

      Hi all,
      While running the CommonCrawlDataDumper tool (bin/nutch commoncrawldump) with the new options described in NUTCH-1975, I noticed some problems when an invalid URL is encountered.
      For example, the following URLs, which I found in crawled data, break the naming schema used by the -epochFilename command-line option:

      In more detail, with the -epochFilename option the extracted files are organized in a reversed-DNS tree based on the FQDN of the web page, followed by a SHA1 hash of the complete URL. When the tool encounters URLs like those above, it cannot build the reversed-DNS tree.
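
To make the naming schema concrete, here is a minimal sketch of how a reversed-DNS path plus SHA1 hash could be derived from a URL. It is not the code used by CommonCrawlDataDumper: the class and method names are hypothetical, java.net.URI stands in for the tool's own URL handling, and the "/" separator is an assumption about the layout. It shows why a URL without a usable host leaves nothing to build the tree from.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; not the CommonCrawlDataDumper implementation.
public class ReversedDnsPathSketch {

  // Reverses the labels of a host name: "www.example.com" -> "com/example/www".
  static String reverseHost(String host) {
    String[] labels = host.split("\\.");
    StringBuilder sb = new StringBuilder();
    for (int i = labels.length - 1; i >= 0; i--) {
      sb.append(labels[i]);
      if (i > 0) {
        sb.append('/');
      }
    }
    return sb.toString();
  }

  // Lower-case hexadecimal SHA-1 digest of the UTF-8 bytes of the input.
  static String sha1Hex(String s) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-1");
      StringBuilder hex = new StringBuilder();
      for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-1 not available", e);
    }
  }

  // Returns "<reversed host>/<sha1 of full URL>", or null when no host
  // can be extracted, which is the case that breaks the tree.
  static String pathFor(String url) {
    String host;
    try {
      host = new URI(url).getHost();
    } catch (URISyntaxException e) {
      host = null;
    }
    if (host == null || host.isEmpty()) {
      return null;
    }
    return reverseHost(host) + "/" + sha1Hex(url);
  }
}
```

In this sketch, pathFor("http:/") returns null because there is no host component, and a URL containing a space or a backslash also returns null because java.net.URI rejects it.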

      Attached is a simple patch for detecting invalid URLs. The patch uses the Apache Commons Validator API:

      UrlValidator urlValidator = new UrlValidator();
      if (!urlValidator.isValid(url)) {
        LOG.warn("Not valid URL detected: " + url);
      }
      

      With the patch, the tool logs a warning message when an invalid URL is detected. I am wondering whether we should take a specific action when invalid URLs occur. We could skip them, but I noticed that the following URLs are also reported as invalid:

      2015-04-15 13:49:40,386 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://www.reddit.com/r/agora/comments/22ezoa/how_to_buy_drugs_on_agora_hur_man_köper_droger_på/
      2015-04-15 13:49:41,603 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://www/
      2015-04-15 13:49:41,632 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http:/
      2015-04-15 13:49:44,601 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://allthingsvice.com/2012/05/30/the-great-420-scam/\/\/allthingsvice.com\/2012\/05\/30\/the-great-420-scam\/
      2015-04-15 13:50:34,821 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://www.reddit.com/r/agora/comments/22ezoa/how_to_buy_drugs_on_agora_hur_man_köper_droger_på/
      2015-04-15 13:50:35,847 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://www/
      2015-04-15 13:50:35,866 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http:/
      2015-04-15 13:50:38,605 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://allthingsvice.com/2012/05/30/the-great-420-scam/\/\/allthingsvice.com\/2012\/05\/30\/the-great-420-scam\/
      2015-04-15 13:51:20,013 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://antilop.cc/sr/users/nomad bloodbath
      2015-04-15 13:51:20,499 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/ars.to\/1aPaqvW
      2015-04-15 13:51:20,500 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/arstechnica.com
      2015-04-15 13:51:20,500 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/arstechnica.com\/gaming\/2015\/04\/mortal-kombat-x-charges-players-for-easy-fatalities\/
      2015-04-15 13:51:20,500 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/\/cdn.arstechnica.net\/wp-content\/themes\/arstechnica\/assets
      2015-04-15 13:51:20,500 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/gaming/2015/04/mortal-kombat-x-charges-players-for-easy-fatalities/\/civis
      2015-04-15 13:51:20,588 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/\/ars.to\/1tECmHU
      2015-04-15 13:51:20,589 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/\/arstechnica.com
      2015-04-15 13:51:20,589 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/\/arstechnica.com\/tech-policy\/2014\/11\/prosecutor-silk-road-2-0-suspect-did-admit-to-everything\/
      2015-04-15 13:51:20,590 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/\/cdn.arstechnica.net\/wp-content\/themes\/arstechnica\/assets
      2015-04-15 13:51:20,590 WARN  tools.CommonCrawlDataDumper - Not valid URL detected: http://arstechnica.com/tech-policy/2014/11/prosecutor-silk-road-2-0-suspect-did-admit-to-everything/\/civis
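
The flagged URLs in the log above fall into a few different kinds (non-ASCII characters, whitespace, backslashes, a missing host, a single-label host), and some of them still carry a usable host name. The following sketch groups them with plain string checks. It is only an illustration: the class name, the reason labels, and the simplified host extraction (which ignores user info and ports) are assumptions, and it does not reproduce the rules applied by UrlValidator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for triaging flagged URLs; not part of the patch.
public class InvalidUrlReasonsSketch {

  // Simplified host extraction: text between "://" and the next '/', '?' or '#'.
  // Returns null when the URL has no "://" at all (for example "http:/").
  static String hostOf(String url) {
    int start = url.indexOf("://");
    if (start < 0) {
      return null;
    }
    start += 3;
    int end = url.length();
    for (int i = start; i < url.length(); i++) {
      char c = url.charAt(i);
      if (c == '/' || c == '?' || c == '#') {
        end = i;
        break;
      }
    }
    return url.substring(start, end);
  }

  // Lists the likely reasons a URL was flagged; empty if none of the checks match.
  static List<String> reasons(String url) {
    List<String> out = new ArrayList<>();
    if (url.indexOf('\\') >= 0) {
      out.add("backslash");
    }
    if (url.chars().anyMatch(Character::isWhitespace)) {
      out.add("whitespace");
    }
    if (url.chars().anyMatch(c -> c > 127)) {
      out.add("non-ascii");
    }
    String host = hostOf(url);
    if (host == null || host.isEmpty()) {
      out.add("no-host");
    } else if (!host.contains(".")) {
      out.add("single-label-host");
    }
    return out;
  }
}
```

With these checks, only "http:/" ends up without any host; the other logged URLs still expose a host such as arstechnica.com or antilop.cc, which suggests they could be kept under the reversed-DNS tree rather than skipped.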
      

      I would be grateful for your feedback on what action to take when invalid URLs are detected, so that we neither drop data nor break the naming schema when the -epochFilename option is used.

      Next, I am going to add a counter for invalid URLs. Thanks to Lewis John McGibbney for supporting me on this work.
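
A counter of this kind could look roughly like the sketch below. It is a guess at the shape, not the eventual patch: the class and method names are hypothetical, and the validity check uses java.net.URI as a stand-in so the example has no external dependency. That stand-in is more lenient than UrlValidator (for instance, it accepts http://www/), so the counts would differ from the warnings shown earlier.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical counter for invalid URLs; names and check are assumptions.
public class InvalidUrlCounterSketch {
  private long total = 0;
  private long invalid = 0;

  // Stand-in validity check: the URL must parse and have a scheme and a host.
  static boolean looksValid(String url) {
    try {
      URI u = new URI(url);
      return u.getScheme() != null && u.getHost() != null;
    } catch (URISyntaxException e) {
      return false;
    }
  }

  // Records one URL and returns whether it passed the check.
  boolean record(String url) {
    total++;
    boolean ok = looksValid(url);
    if (!ok) {
      invalid++;
    }
    return ok;
  }

  long getTotal() {
    return total;
  }

  long getInvalid() {
    return invalid;
  }

  // One-line summary that could be logged at the end of a dump.
  String summary() {
    return invalid + " of " + total + " URLs were invalid";
  }
}
```

Calling record(url) once per dumped URL and logging summary() at the end would report how many URLs were affected without changing what is written to disk.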

        Attachments

        1. NUTCH-1989.patch
          1 kB
          Giuseppe Totaro

            People

            • Assignee:
              chrismattmann Chris A. Mattmann
              Reporter:
              gostep Giuseppe Totaro