Details
- Type: Improvement
- Status: Closed
- Priority: Minor
- Resolution: Implemented
- Patch Available
Description
WARCExporter is a handy tool for dumping segments. Unfortunately, it also emits WARC records for statuses other than success or notmodified, which account for a decent number of records in each crawl cycle. Until now it also did not emit parse metadata or extracted text; with this patch it can.
This patch adds three switches:
- -includeOnlySuccessfulResponses to emit records only for responses with status success or notmodified
- -includeParseData to also emit parse metadata as a WARC metadata record
- -includeParseText to also emit extracted text as a WARC metadata record
Both metadata objects are stored in the same WARC metadata record to save space.
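To illustrate how the three switches interact, here is a minimal, self-contained Java sketch. It is not the patch itself and does not use Nutch classes: the class WarcExportSketch, its methods, the string-valued status names, and the "key: value" payload layout are all assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names come from the Nutch code base.
final class WarcExportSketch {

    // Status names as written in the description; modelling them as strings is an assumption.
    private static final Set<String> SUCCESSFUL = Set.of("success", "notmodified");

    // With -includeOnlySuccessfulResponses set, only success/notmodified pass the filter.
    static boolean shouldEmit(String status, boolean includeOnlySuccessfulResponses) {
        return !includeOnlySuccessfulResponses || SUCCESSFUL.contains(status);
    }

    // Builds ONE payload holding both parse metadata and extracted text,
    // mirroring the idea of a single shared WARC metadata record.
    // An empty result means no metadata record needs to be written.
    static String buildMetadataPayload(Map<String, String> parseData, String parseText,
                                       boolean includeParseData, boolean includeParseText) {
        StringBuilder sb = new StringBuilder();
        if (includeParseData && parseData != null) {
            for (Map.Entry<String, String> e : parseData.entrySet()) {
                sb.append(e.getKey()).append(": ").append(e.getValue()).append("\r\n");
            }
        }
        if (includeParseText && parseText != null) {
            // The "parseText" field name is an assumed placeholder.
            sb.append("parseText: ").append(parseText).append("\r\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> parseData = new LinkedHashMap<>();
        parseData.put("title", "Example page");

        // A non-successful fetch is skipped when the filter switch is on.
        System.out.println(shouldEmit("gone", true));
        // Both switches on: metadata and text share a single payload.
        System.out.print(buildMetadataPayload(parseData, "Hello world", true, true));
    }
}
```

In this sketch, sharing a single payload means one set of record headers per page rather than two, which is the space saving the description refers to.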