[NUTCH-2213] CommonCrawlDataDumper saves gzipped body in extracted form - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.12
Component/s: commoncrawl, dumpers
Labels:
- easyfix

External issue URL:
https://github.com/internetarchive/warctools/issues/15
External issue ID:
15

Description

I have downloaded a WARC file from the Common Crawl data. This file contains several gzipped responses whose bodies are stored as plaintext (without the gzip encoding).

I used warctools from the Internet Archive to extract the responses from the WARC file. However, this tool expects the Content-Length field to match the actual length of the body in the WARC (see the issue on GitHub). warctools uses a more up-to-date version of hanzo warctools, which is recommended on the Common Crawl website under "Processing the file format".
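The mismatch can be illustrated with a minimal sketch that uses only the Python standard library. It is not warctools or Nutch code: the helper names `build_http_response` and `check_content_length` are hypothetical, and the sketch assumes that the stored headers still carry the Content-Length of the original gzipped body while the body itself has been written out decompressed.

```python
import gzip


def build_http_response(body: bytes, declared_length: int) -> bytes:
    """Assemble a toy HTTP response (hypothetical helper, for illustration only)."""
    headers = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Encoding: gzip\r\n"
        b"Content-Length: " + str(declared_length).encode("ascii") + b"\r\n"
        b"\r\n"
    )
    return headers + body


def check_content_length(raw: bytes):
    """Return (declared Content-Length, actual body length) for a raw response."""
    head, _, body = raw.partition(b"\r\n\r\n")
    declared = None
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            declared = int(value.strip())
    return declared, len(body)


plain = b"<html>" + b"hello world " * 50 + b"</html>"
compressed = gzip.compress(plain)

# Response as it came over the wire: gzipped body, matching Content-Length.
faithful = build_http_response(compressed, len(compressed))
# Assumed faulty form: headers kept, body stored in extracted (plaintext) form.
extracted = build_http_response(plain, len(compressed))

declared, actual = check_content_length(faithful)
faithful_ok = declared == actual

declared, actual = check_content_length(extracted)
extracted_ok = declared == actual

print("faithful record consistent:", faithful_ok)
print("extracted record consistent:", extracted_ok)
```

A parser that reads exactly Content-Length bytes of body would stop early on the second record, since the plaintext body is longer than the compressed length the header declares.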

I have not used Nutch and therefore cannot say which versions are affected.

After reading the official WARC draft, I could not determine how gzipped content is supposed to be stored. However, multiple WARC file parsers will probably have an issue with this.

It would be nice to know whether you consider this a bug and plan to fix it, and whether it is a major issue affecting most WARC files in the Common Crawl data or only a small part of them.

People

Assignee: Chris A. Mattmann

Reporter: Joris Rau

Votes: 0

Watchers: 4

Dates

Created: 10/Feb/16 10:26

Updated: 13/Mar/24 14:51

Resolved: 01/Mar/16 03:44