WARC exporter for the CommonCrawlDataDumper

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Implemented
    • Affects Version/s: 1.11
    • Fix Version/s: None
    • Component/s: commoncrawl, tool
    • Labels:

      Description

      Adds the ability to export Nutch segments to WARC files.

      From a usage point of view, two new command-line options are available:

      -warc: enables export to WARC files; if not specified, the default JACKSON formatter is used.
      -warcSize: sets the maximum size of each WARC file; if not specified, a default of 1 GB per file is used, as recommended by the WARC ISO standard.

      The usual -gzip flag can be used to enable compression on the WARC files.
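
      To illustrate what a per-file size cap implies, here is a minimal sketch of size-based rollover. It is not taken from the patch: the class name SizeCappedRoller, the interpretation of 1 GB as 10^9 bytes, and the rule that a record never gets split across files are all assumptions made for illustration.

```java
/**
 * Hypothetical helper (not part of the patch): decides which output file
 * each record should be written to, given a maximum size per file.
 */
public class SizeCappedRoller {
    /** Assumed default cap: 1 GB, taken here as 10^9 bytes. */
    public static final long DEFAULT_MAX_BYTES = 1_000_000_000L;

    private final long maxBytes;
    private int fileIndex = 0;
    private long bytesInCurrentFile = 0;

    public SizeCappedRoller() {
        this(DEFAULT_MAX_BYTES);
    }

    public SizeCappedRoller(long maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        this.maxBytes = maxBytes;
    }

    /**
     * Returns the index of the file that should receive a record of the
     * given size. A new file is started only when the current file already
     * holds data and adding the record would exceed the cap, so a single
     * oversized record still lands in a file of its own.
     */
    public int assign(long recordBytes) {
        if (bytesInCurrentFile > 0 && bytesInCurrentFile + recordBytes > maxBytes) {
            fileIndex++;
            bytesInCurrentFile = 0;
        }
        bytesInCurrentFile += recordBytes;
        return fileIndex;
    }
}
```

      In this sketch the cap applies to the uncompressed record sizes passed to assign; whether the actual option measures sizes before or after gzip compression is not stated in this issue.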

      Some changes were made to the default CommonCrawlDataDumper, mainly to the Factory and the Formats. These changes avoid creating a new CommonCrawlFormat instance for each URL read from the segments.
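
      The idea of reusing one format instance across URLs can be sketched as follows. This is an illustration only, not the code in the patch: DemoFormat and DemoFormatFactory are hypothetical stand-ins and do not mirror the real Nutch classes or their signatures.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class FormatReuseSketch {

    /** Hypothetical stand-in for a format object that is costly to build. */
    public static final class DemoFormat {
        /** Renders one record as "url contentLength". */
        public String format(String url, byte[] content) {
            return url + " " + content.length;
        }
    }

    /** Hypothetical factory that builds the format once and hands it out again. */
    public static final class DemoFormatFactory {
        private DemoFormat instance;
        private int created = 0;

        public DemoFormat get() {
            if (instance == null) {
                instance = new DemoFormat();
                created++;
            }
            return instance;
        }

        /** Number of DemoFormat objects constructed so far. */
        public int createdCount() {
            return created;
        }
    }

    /** Formats every URL using the single instance provided by the factory. */
    public static List<String> dump(DemoFormatFactory factory, List<String> urls) {
        List<String> out = new ArrayList<>();
        for (String url : urls) {
            // No "new DemoFormat()" inside the loop: the factory returns the cached object.
            DemoFormat format = factory.get();
            // The URL bytes stand in for page content to keep the sketch self-contained.
            out.add(format.format(url, url.getBytes(StandardCharsets.UTF_8)));
        }
        return out;
    }
}
```

      A reused instance must not keep per-URL state between calls, or it must reset that state for each record; the sketch sidesteps this by making format a pure function of its arguments.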

        Attachments

        1. NUTCH-2095.patch
          83 kB
          Jorge Luis Betancourt Gonzalez

            People

            • Assignee: Unassigned
            • Reporter: Jorge Luis Betancourt Gonzalez (jorgelbg)
            • Votes: 0
            • Watchers: 4
