Nutch / NUTCH-1949

Dump out the Nutch data into the Common Crawl format

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.10
    • Component/s: crawldb, linkdb, storage, tool
    • Labels: None

    Description

      We are going to develop a CommonCrawlDataDumper.java class. The CommonCrawlDataDumper is a tool that performs the following steps:

      1. deserialize the crawled data from Nutch
      2. map the deserialized data onto the proper JSON structure
      3. serialize the data into CBOR format
      4. optionally, compress the serialized data using gzip
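
Steps 2 to 4 above can be sketched roughly as follows. This is a minimal illustration, not the attached implementation: it assumes a record has already been read from a segment (step 1 is omitted), uses hypothetical field names ("url", "content"), and hand-encodes only the small subset of CBOR (maps and text strings) needed for a flat record. A real tool would more likely rely on a CBOR library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

public class CborDumpSketch {

    // Writes a CBOR head: major type in the top 3 bits, then the length argument.
    static void writeHead(ByteArrayOutputStream out, int majorType, int n) {
        int mt = majorType << 5;
        if (n < 24) {
            out.write(mt | n);
        } else if (n < 0x100) {
            out.write(mt | 24);
            out.write(n);
        } else if (n < 0x10000) {
            out.write(mt | 25);
            out.write(n >> 8);
            out.write(n);
        } else {
            out.write(mt | 26);
            out.write(n >> 24);
            out.write(n >> 16);
            out.write(n >> 8);
            out.write(n);
        }
    }

    // Encodes a UTF-8 text string (CBOR major type 3).
    static void writeText(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeHead(out, 3, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Steps 2-3: encodes a flat key/value record as a CBOR map (major type 5).
    static byte[] toCbor(Map<String, String> record) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeHead(out, 5, record.size());
        for (Map.Entry<String, String> e : record.entrySet()) {
            writeText(out, e.getKey());
            writeText(out, e.getValue());
        }
        return out.toByteArray();
    }

    // Step 4: optional gzip compression of the serialized bytes.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Hypothetical record; the field names are placeholders, not the Common Crawl schema.
        Map<String, String> record = new LinkedHashMap<>();
        record.put("url", "http://example.org/");
        record.put("content", "<html><body>hello</body></html>");

        byte[] cbor = toCbor(record);
        byte[] compressed = gzip(cbor);
        System.out.println("CBOR bytes: " + cbor.length);
        System.out.println("gzip bytes: " + compressed.length);
    }
}
```

Keeping the gzip step as a separate function mirrors the "optionally" in step 4: the caller decides whether to wrap the CBOR bytes or write them as-is.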

      The tool must accept as input either a single Nutch segment or a directory containing segments.
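
One way to handle both kinds of input is to normalise the argument into a list of segment directories before dumping. The sketch below is an assumption about how that could look, not code from the attachments: the class and method names are hypothetical, and a directory is treated as a segment when it contains a `content` subdirectory, which is a simplifying heuristic.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

public class SegmentInputSketch {

    // Assumption: a directory counts as a segment if it has a "content" subdirectory.
    static boolean isSegment(Path dir) {
        return Files.isDirectory(dir.resolve("content"));
    }

    // Returns the input itself if it is a segment, otherwise its child segments sorted by path.
    static List<Path> resolveSegments(Path input) {
        if (isSegment(input)) {
            return Collections.singletonList(input);
        }
        List<Path> segments = new ArrayList<>();
        try (Stream<Path> children = Files.list(input)) {
            children.filter(SegmentInputSketch::isSegment).forEach(segments::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(segments);
        return segments;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: SegmentInputSketch <segment-or-segments-dir>");
            return;
        }
        for (Path segment : resolveSegments(Paths.get(args[0]))) {
            System.out.println(segment);
        }
    }
}
```

With this shape, the dumping loop only ever sees a list of segments, so the single-segment and directory cases share one code path.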

      Thanks lewismc and chrismattmann for your great suggestions, support and code.

      Attachments

        1. CommonCrawlDataDumper_v02.pdf
          97 kB
          Giuseppe Totaro
        2. CommonCrawlDataDumper.xlsx
          21 kB
          Giuseppe Totaro
        3. CommonCrawlDataDumper.pdf
          87 kB
          Giuseppe Totaro
        There are no Sub-Tasks for this issue.

          People

            Assignee: Lewis John McGibbney (lewismc)
            Reporter: Giuseppe Totaro (gostep)
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved: