Details
Improvement
Status: Closed
Minor
Resolution: Implemented
1.11
None
Patch Available
Patch
Description
Adds the ability to export Nutch segments to WARC files.
From a usage point of view, two new command-line options are available:
-warc: enables export to WARC files; if not specified, the default JACKSON formatter is used.
-warcSize: sets the maximum size of each WARC file; if not specified, a default of 1GB per file is used, as recommended by the WARC ISO standard.
The usual -gzip flag can be used to enable compression on the WARC files.
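To illustrate what a per-file size cap implies, here is a minimal sketch of size-based rotation. It is not the code from the patch: the class WarcSizeRotation and the method assign are hypothetical names, the 1GB default is assumed to mean 10^9 bytes, and rotating before a record would exceed the cap (rather than after) is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: groups records into output files
// so that a file is closed before a record would push it past the cap.
class WarcSizeRotation {
    // Assumption: "1GB" is read as 10^9 bytes.
    static final long DEFAULT_MAX_BYTES = 1_000_000_000L;

    // Returns, for each output file, the sizes of the records assigned to it.
    static List<List<Integer>> assign(int[] recordSizes, long maxBytes) {
        List<List<Integer>> files = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long used = 0;
        for (int size : recordSizes) {
            // Rotate when the next record would exceed the cap. An empty file
            // always accepts one record, so an oversized record gets a file of its own.
            if (!current.isEmpty() && used + size > maxBytes) {
                files.add(current);
                current = new ArrayList<>();
                used = 0;
            }
            current.add(size);
            used += size;
        }
        if (!current.isEmpty()) {
            files.add(current);
        }
        return files;
    }

    public static void main(String[] args) {
        int[] sizes = {400, 400, 400, 1200, 100};
        System.out.println(assign(sizes, 1000));
    }
}
```

With a cap of 1000 bytes, the five records above end up in four files, and the 1200-byte record sits alone in a file that exceeds the cap. A real writer has to decide how to treat such oversized records; this sketch simply lets them through.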
Some changes were made to the default CommonCrawlDataDumper, essentially to the Factory and the Formats. These changes avoid creating a new instance of a CommonCrawlFormat for each URL read from the segments.
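The instance-reuse idea can be sketched as follows. This is only an illustration of the pattern, not the actual Factory or CommonCrawlFormat code: Format, FormatFactory, and their methods are hypothetical stand-ins, and the sketch is single-threaded.

```java
// Hypothetical sketch of reusing one formatter across all URLs.
class FormatReuseSketch {
    // Stand-in for a formatter that is costly to construct.
    static final class Format {
        String format(String url, String content) {
            return url + "\t" + content.length();
        }
    }

    // Creates the formatter lazily on first use and returns the same
    // instance afterwards. Not thread-safe; a concurrent caller would
    // need synchronization.
    static final class FormatFactory {
        private Format instance;
        private int created = 0;

        Format get() {
            if (instance == null) {
                instance = new Format();
                created++;
            }
            return instance;
        }

        int createdCount() {
            return created;
        }
    }

    public static void main(String[] args) {
        FormatFactory factory = new FormatFactory();
        String[] urls = {"http://a.example/", "http://b.example/", "http://c.example/"};
        for (String url : urls) {
            // Every URL goes through the same Format instance.
            System.out.println(factory.get().format(url, "body"));
        }
        System.out.println("formatters created: " + factory.createdCount());
    }
}
```

The trade-off is that a shared formatter must not keep per-URL state between calls, or it must reset that state before each record.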