Nutch / NUTCH-1963

CommonCrawlDataDumper fails with "file name is too long ( > 100 bytes)" when the -gzip option is invoked


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 1.10
    • Fix Version: 1.10
    • Component: commoncrawl
    • Labels: None

    Description

  When invoking the commoncrawldump tool with the -gzip option and -mimetype application/pdf, I get the following stack trace, which results in failure of the task:

      java.lang.RuntimeException: file name 'Socio-Economic%20Impact%20of%20Ebola%20on%20Households%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf' is too long ( > 100 bytes)
      	at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.handleLongName(TarArchiveOutputStream.java:674)
      	at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.putArchiveEntry(TarArchiveOutputStream.java:275)
      	at org.apache.nutch.tools.CommonCrawlDataDumper.dump(CommonCrawlDataDumper.java:400)
      	at org.apache.nutch.tools.CommonCrawlDataDumper.main(CommonCrawlDataDumper.java:236)
      

  The workaround is to skip the -gzip option and defer compression to a later task; however, this is a workaround, not a solution.
  We need to fix this so the tool works as designed and required.
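  For context, the limit comes from the classic USTAR tar header, which reserves only 100 bytes for the entry name; Commons Compress's TarArchiveOutputStream defaults to LONGFILE_ERROR and throws the RuntimeException above when an entry name exceeds that field. The sketch below reproduces the underlying tar behavior using Python's standard tarfile module rather than Nutch/Commons Compress code; the 150-character name is a placeholder standing in for the long PDF file name from the report:

```python
import io
import tarfile

# Placeholder for an over-long entry name (> 100 bytes), like the
# URL-encoded PDF file name in the stack trace.
LONG_NAME = "a" * 150

def write_entry(fmt):
    """Write one empty tar entry with the over-long name and return the archive bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        info = tarfile.TarInfo(name=LONG_NAME)
        info.size = 0
        tar.addfile(info, io.BytesIO(b""))
    return buf.getvalue()

# Classic USTAR headers have a fixed 100-byte name field, so this fails,
# mirroring the "file name ... is too long ( > 100 bytes)" error.
try:
    write_entry(tarfile.USTAR_FORMAT)
    ustar_ok = True
except ValueError:
    ustar_ok = False

# The GNU long-name extension stores the name in an extra record, so the
# same entry writes fine and the full name round-trips.
data = write_entry(tarfile.GNU_FORMAT)
with tarfile.open(fileobj=io.BytesIO(data)) as tar:
    names = tar.getnames()
```

In Commons Compress, the analogous change is to call setLongFileMode(TarArchiveOutputStream.LONGFILE_GNU) (or LONGFILE_POSIX) on the TarArchiveOutputStream before putArchiveEntry, so long names are emitted via an extension record instead of raising an exception; which mode (or name truncation) the committed patch chose is not stated here.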


          People

            Assignee: gostep Giuseppe Totaro
            Reporter: lewismc Lewis John McGibbney
            Votes: 0
            Watchers: 2
