Nutch / NUTCH-2795

CrawlDbReader: compress CrawlDb dumps if configured

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Implemented
    • Affects Version/s: 1.17
    • Fix Version/s: 1.19
    • Component/s: crawldb

    Description

      The dumps written by CrawlDbReader (text, CSV, JSON) are not compressed even when file output compression is configured. For example, when running

      $> bin/nutch readdb \
             -Dmapreduce.output.fileoutputformat.compress=true  \
             -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
             crawldb/ -dump crawldb.dump -format json
      

      the output should be compressed using bzip2.
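
      One way to check whether a dump file was actually compressed is to look at its leading magic bytes. The following is a minimal sketch, not part of Nutch: the helper name detect_compression is hypothetical, and it only recognizes bzip2 ("BZh") and gzip (0x1f 0x8b) headers.

```shell
# Hypothetical helper (not part of Nutch): report the compression format
# of a file by inspecting its first bytes.
detect_compression() {
  # Hex-encode the first three bytes, e.g. "425a68" for the ASCII string "BZh".
  magic="$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')"
  case "$magic" in
    425a68) echo "bzip2" ;;                   # bzip2 streams start with "BZh"
    1f8b*)  echo "gzip" ;;                    # gzip streams start with 0x1f 0x8b
    *)      echo "uncompressed or unknown" ;;
  esac
}
```

      It could be called on each output file in the crawldb.dump directory, for example `detect_compression <file>`. With the configuration above, the expected result for every dump file would be "bzip2".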

      For how the output compression settings are evaluated, see the Hadoop class FileOutputFormat and the implementation in TextOutputFormat.


            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: Sebastian Nagel (snagel)
              Votes: 0
              Watchers: 4
