Nutch / NUTCH-2795

CrawlDbReader: compress CrawlDb dumps if configured


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Implemented
    • Affects Version: 1.17
    • Fix Version: 1.19
    • Component: crawldb

    Description

      The dumps produced by CrawlDbReader (text, CSV, JSON) are not compressed even if file output compression is configured. E.g., if running

      $> bin/nutch readdb \
             -Dmapreduce.output.fileoutputformat.compress=true  \
             -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
             crawldb/ -dump crawldb.dump -format json
      

      the output should be compressed using bzip2.

      See the Hadoop class FileOutputFormat and the implementation in TextOutputFormat.
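      Once the dump honors the configured codec, the part files can be consumed as a compressed stream. A minimal sketch of reading such a dump, assuming a bzip2-compressed JSON-lines part file; the record fields and file name below are illustrative, not the exact CrawlDb dump schema:

      ```python
      import bz2
      import json
      import os
      import tempfile

      tmpdir = tempfile.mkdtemp()
      part = os.path.join(tmpdir, "part-r-00000.bz2")

      # Simulate one bzip2-compressed dump part file (hypothetical record shape).
      with bz2.open(part, "wt", encoding="utf-8") as f:
          f.write(json.dumps({"url": "http://example.org/", "status": "db_fetched"}) + "\n")

      # Stream the compressed dump directly; no need to decompress to disk first.
      with bz2.open(part, "rt", encoding="utf-8") as f:
          urls = [json.loads(line)["url"] for line in f]
      print(urls)
      ```

      The same streaming pattern works for any Hadoop-supported codec with a matching decompressor (e.g. `gzip` module for the default DeflateCodec output).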


          People

            Assignee: Sebastian Nagel (snagel)
            Reporter: Sebastian Nagel (snagel)
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved:
