[NUTCH-1963] CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.10
Fix Version/s: 1.10
Component/s: commoncrawl
Labels: None

Description

When invoking the commoncrawldump tool with the -gzip option and -mimetype application/pdf, I get the following stack trace, which causes the task to fail:

java.lang.RuntimeException: file name 'Socio-Economic%20Impact%20of%20Ebola%20on%20Households%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf' is too long ( > 100 bytes)
	at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.handleLongName(TarArchiveOutputStream.java:674)
	at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.putArchiveEntry(TarArchiveOutputStream.java:275)
	at org.apache.nutch.tools.CommonCrawlDataDumper.dump(CommonCrawlDataDumper.java:400)
	at org.apache.nutch.tools.CommonCrawlDataDumper.main(CommonCrawlDataDumper.java:236)

The current workaround is to omit the -gzip option and defer compression to a later task, but that only sidesteps the problem; it is not a solution.
We need to fix this so that the tool works as designed and required.
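
To make the failure condition concrete, here is a minimal sketch that checks whether an entry name would overflow the tar header name field. The class and method names are hypothetical and are not part of Nutch or Commons Compress. The 100-byte threshold is taken from the exception message above; the exact boundary check inside the library may differ.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Nutch or Commons Compress.
class TarNameLengthCheck {

    // Assumed limit, taken from the "( > 100 bytes)" text in the exception message.
    static final int TAR_NAME_FIELD_BYTES = 100;

    // Number of bytes the entry name occupies once encoded as UTF-8.
    static int encodedLength(String entryName) {
        return entryName.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when the encoded name is longer than the assumed header limit.
    static boolean exceedsTarNameLimit(String entryName) {
        return encodedLength(entryName) > TAR_NAME_FIELD_BYTES;
    }

    public static void main(String[] args) {
        // File name copied from the stack trace in this report.
        String name = "Socio-Economic%20Impact%20of%20Ebola%20on%20Households"
                + "%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf";
        System.out.println(encodedLength(name) + " bytes, too long: "
                + exceedsTarNameLimit(name));
    }
}
```

Note that the URL-encoded spaces (%20) each cost three bytes, which is what pushes this name over the limit. Commons Compress's TarArchiveOutputStream also exposes setLongFileMode (for example with LONGFILE_GNU or LONGFILE_POSIX), which is one plausible way to lift the restriction; whether the eventual fix for this issue takes that route is not stated in this report.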

People

Assignee: Giuseppe Totaro

Reporter: Lewis John McGibbney

Votes: 0

Watchers: 2

Dates

Created: 12/Mar/15 18:13

Updated: 13/Mar/24 14:50

Resolved: 23/Apr/15 23:36