  1. Nutch
  2. NUTCH-2102

WARC Exporter

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.10
    • Fix Version/s: 1.11
    • Component/s: commoncrawl, dumpers
    • Labels:
      None

      Description

      This patch adds a WARC exporter (WARC specification: http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf). Unlike the code submitted in https://github.com/apache/nutch/pull/55, which is based on the CommonCrawlDataDumper (CCDD), this exporter is a MapReduce job. It should therefore be able to cope with large segments in a timely fashion, and it is not limited to the local file system.

      Later on we could add a WARCImporter to generate segments from WARC files, which is outside the scope of the CCDD anyway. Also, WARC is not specific to CommonCrawl, which is why the package name does not reflect it.

      I don't think it would be a problem to have both the modified CCDD and this class providing similar functionality.

      The class is invoked as follows:

      ./nutch org.apache.nutch.tools.warc.WARCExporter /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/
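      For readers unfamiliar with the format, the following is a minimal sketch of what a single WARC "response" record looks like on the wire, as laid out in the WARC specification linked above. It is not the code from the attached patch: the class name WarcRecordSketch and the method buildResponseRecord are hypothetical, and the real exporter may emit additional headers or record types.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.UUID;

/** Illustrative only: hypothetical helper, not the WARCExporter from the patch. */
final class WarcRecordSketch {

    private static final String CRLF = "\r\n";

    /**
     * Builds one WARC "response" record: named header fields, a blank line,
     * the content block (the raw HTTP response), then two trailing CRLFs.
     */
    static byte[] buildResponseRecord(String targetUri, Instant fetchTime, byte[] httpResponse) {
        StringBuilder header = new StringBuilder();
        header.append("WARC/1.0").append(CRLF);
        header.append("WARC-Type: response").append(CRLF);
        header.append("WARC-Target-URI: ").append(targetUri).append(CRLF);
        // WARC-Date is a UTC timestamp; truncated to whole seconds here.
        header.append("WARC-Date: ").append(fetchTime.truncatedTo(ChronoUnit.SECONDS)).append(CRLF);
        header.append("WARC-Record-ID: <urn:uuid:").append(UUID.randomUUID()).append(">").append(CRLF);
        header.append("Content-Type: application/http; msgtype=response").append(CRLF);
        // Content-Length counts only the bytes of the content block.
        header.append("Content-Length: ").append(httpResponse.length).append(CRLF);
        header.append(CRLF);

        byte[] headerBytes = header.toString().getBytes(StandardCharsets.UTF_8);
        byte[] trailer = (CRLF + CRLF).getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(headerBytes, 0, headerBytes.length);
        out.write(httpResponse, 0, httpResponse.length);
        out.write(trailer, 0, trailer.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up HTTP response used purely as sample input.
        byte[] http = "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] record = buildResponseRecord("http://example.com/", Instant.now(), http);
        System.out.write(record, 0, record.length);
        System.out.flush();
    }
}
```

      In a MapReduce setting, a routine of this kind would typically run once per fetched URL, with the framework taking care of writing the resulting records to the output directory.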

        Attachments

        1. NUTCH-2102.patch
          20 kB
          Julien Nioche

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              jnioche Julien Nioche
            • Votes:
              0
              Watchers:
              4

              Dates

              • Created:
                Updated:
                Resolved: