[NUTCH-2795] CrawlDbReader: compress CrawlDb dumps if configured - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Implemented
Affects Version/s: 1.17
Fix Version/s: 1.19
Component/s: crawldb
Labels:
- help-wanted

Description

The dumps written by CrawlDbReader (text, CSV, JSON) are not compressed even when file output compression is configured. For example, when running

$> bin/nutch readdb \
       -Dmapreduce.output.fileoutputformat.compress=true  \
       -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
       crawldb/ -dump crawldb.dump -format json

the output should be compressed using bzip2.

For reference, see the Hadoop class FileOutputFormat and how TextOutputFormat implements compressed output.
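The general pattern is to check the compression setting, pick a file extension, and wrap the raw output stream only when compression is enabled. Hadoop's FileOutputFormat exposes this through getCompressOutput and getOutputCompressorClass. The following is only a minimal stand-in sketch of that pattern and not the Nutch or Hadoop implementation: it assumes plain java.util.Properties in place of a Hadoop Configuration and java.util.zip GZIP in place of a Hadoop codec such as BZip2Codec, and the class and method names (DumpCompressionSketch, writeDump, and so on) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Stand-in sketch: compress a text dump only when the configuration asks for it. */
final class DumpCompressionSketch {

  // Same property name as in the command above, but read from plain Properties
  // here (assumption for the sketch), not from a Hadoop Configuration.
  static final String COMPRESS_KEY = "mapreduce.output.fileoutputformat.compress";

  /** True if output compression is switched on (defaults to off). */
  static boolean compressOutput(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(COMPRESS_KEY, "false"));
  }

  /** Extension to append to the dump file name; ".gz" because GZIP is the stand-in codec. */
  static String extension(Properties conf) {
    return compressOutput(conf) ? ".gz" : "";
  }

  /** Writes the dump text, wrapping the raw stream in a compressing stream if configured. */
  static byte[] writeDump(Properties conf, String text) {
    ByteArrayOutputStream raw = new ByteArrayOutputStream();
    try (OutputStream out = compressOutput(conf) ? new GZIPOutputStream(raw) : raw) {
      out.write(text.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    // The stream is closed at this point, so the GZIP trailer has been written.
    return raw.toByteArray();
  }

  /** Decompresses GZIP bytes back to text, to check the round trip. */
  static String gunzip(byte[] data) {
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In this sketch the writer code stays identical for all dump formats; only the stream it writes to changes, which is why the same approach could in principle cover the text, CSV, and JSON dumps alike.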

Issue Links

links to

GitHub Pull Request #746

People

Assignee: Sebastian Nagel

Reporter: Sebastian Nagel

Votes: 0

Watchers: 4

Dates

Created: 06/Jul/20 11:40

Updated: 13/Mar/24 14:51

Resolved: 21/Aug/22 10:38