NUTCH-2095: WARC exporter for the CommonCrawlDataDumper

Details

    • Improvement
    • Status: Closed
    • Minor
    • Resolution: Implemented
    • 1.11
    • None
    • commoncrawl, tool

    Description

      Adds the ability to export Nutch segments to WARC files.

      From a usage point of view, two new command line options are available:

      -warc: enables export to WARC files; if not specified, the default JACKSON formatter is used.
      -warcSize: sets the maximum size of each WARC file; if not specified, a default of 1 GB per file is used, as recommended by the WARC ISO standard.

      The usual -gzip flag can be used to enable compression on the WARC files.
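
      The effect of a per-file size limit such as -warcSize can be illustrated with a minimal sketch. This is not the code from the patch: the class WarcRotationSketch and the method assignFiles are hypothetical names, and the rule used here (start a new file when the next record would push the current file past the limit) is an assumption about how such a limit is typically applied.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the NUTCH-2095 patch.
public class WarcRotationSketch {

    /**
     * Returns, for each record, the index of the output file it would land in.
     * A new file is started when adding the record would exceed maxBytes,
     * unless the current file is still empty (an oversized record gets its own file).
     */
    public static List<Integer> assignFiles(long[] recordSizes, long maxBytes) {
        List<Integer> fileIndexes = new ArrayList<>();
        int fileIndex = 0;
        long usedBytes = 0;
        for (long size : recordSizes) {
            if (usedBytes > 0 && usedBytes + size > maxBytes) {
                fileIndex++;      // roll over to the next file
                usedBytes = 0;
            }
            usedBytes += size;
            fileIndexes.add(fileIndex);
        }
        return fileIndexes;
    }

    public static void main(String[] args) {
        long[] sizes = {400, 500, 300, 1200, 100};
        // With a 1000-byte limit this prints [0, 0, 1, 2, 3]
        System.out.println(assignFiles(sizes, 1000));
    }
}
```

      In this sketch a single record larger than the limit is still written whole into its own file; whether the actual exporter behaves the same way is not stated in the description.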

      Some changes were made to the default CommonCrawlDataDumper, mainly to the Factory and to the Formats. These changes avoid creating a new instance of a CommonCrawlFormat for each URL read from the segments.
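
      The idea of reusing one format instance across URLs, rather than building one per URL, can be sketched as follows. This is a simplified illustration under stated assumptions and not the patch's actual code: FormatFactorySketch, Format, WarcFormat, JacksonFormat and getFormat are hypothetical names, and the real Nutch factory and format classes have different signatures.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; the real Nutch classes differ.
public class FormatFactorySketch {

    /** Stand-in for an output format that serializes one fetched URL. */
    public interface Format {
        String write(String url);
    }

    public static class WarcFormat implements Format {
        public String write(String url) { return "WARC " + url; }
    }

    public static class JacksonFormat implements Format {
        public String write(String url) { return "JACKSON " + url; }
    }

    // One instance per format name, created lazily on first request.
    private static final Map<String, Format> CACHE = new HashMap<>();

    /** Returns the cached instance for the given name, creating it only once. */
    public static Format getFormat(String name) {
        return CACHE.computeIfAbsent(name, n -> {
            switch (n) {
                case "WARC": return new WarcFormat();
                case "JACKSON": return new JacksonFormat();
                default: throw new IllegalArgumentException("Unknown format: " + n);
            }
        });
    }

    public static void main(String[] args) {
        String[] urls = {"http://example.com/a", "http://example.com/b"};
        Format format = getFormat("WARC");   // obtained once, outside the loop
        for (String url : urls) {
            System.out.println(format.write(url));
        }
    }
}
```

      The cache here is a plain HashMap and is not thread-safe; a concurrent dumper would need a ConcurrentHashMap or a single instance passed in explicitly.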

      Attachments

        1. NUTCH-2095.patch
          83 kB
          Jorge Luis Betancourt Gonzalez

          People

            Assignee: Unassigned
            Reporter: Jorge Luis Betancourt Gonzalez (jorgelbg)
            Votes: 0
            Watchers: 4