  1. Nutch
  2. NUTCH-1949

Dump out the Nutch data into the Common Crawl format

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.10
    • Component/s: crawldb, linkdb, storage, tool
    • Labels:
      None

      Description

      We are going to develop a CommonCrawlDataDumper.java class. CommonCrawlDataDumper is a tool that performs the following steps:

      1. deserialize the crawled data from Nutch
      2. map the deserialized data onto the proper JSON structure
      3. serialize the data into CBOR format
      4. optionally, compress the serialized data using gzip
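
      The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the attached implementation: step 1 is stubbed with a hard-coded record (reading real segments needs the Nutch and Hadoop classes), the field names "url" and "content" are hypothetical, and the CBOR encoder is a hand-written minimum that handles only a flat map of text strings. A real tool would more likely use an existing CBOR library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

public class CommonCrawlDumpSketch {

    // Writes a CBOR head: major type in the top 3 bits, followed by the argument n.
    public static void writeHead(ByteArrayOutputStream out, int majorType, int n) {
        int mt = majorType << 5;
        if (n < 24) {
            out.write(mt | n);          // argument fits in the initial byte
        } else if (n < 0x100) {
            out.write(mt | 24);         // 1-byte argument follows
            out.write(n);
        } else if (n < 0x10000) {
            out.write(mt | 25);         // 2-byte big-endian argument follows
            out.write(n >> 8);
            out.write(n);
        } else {
            out.write(mt | 26);         // 4-byte big-endian argument follows
            out.write(n >> 24);
            out.write(n >> 16);
            out.write(n >> 8);
            out.write(n);
        }
    }

    // Encodes a UTF-8 text string (CBOR major type 3).
    public static void writeText(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeHead(out, 3, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Step 3: serialize a flat record as a CBOR map (major type 5) of text pairs.
    public static byte[] toCbor(Map<String, String> record) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeHead(out, 5, record.size());
        for (Map.Entry<String, String> e : record.entrySet()) {
            writeText(out, e.getKey());
            writeText(out, e.getValue());
        }
        return out.toByteArray();
    }

    // Step 4 (optional): gzip-compress the serialized bytes.
    public static byte[] gzip(byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // Steps 1 and 2, stubbed: a record as it might look after deserialization
        // and mapping. The field names here are hypothetical.
        Map<String, String> record = new LinkedHashMap<>();
        record.put("url", "http://example.com/");
        record.put("content", "<html>hello</html>");

        byte[] cbor = toCbor(record);
        byte[] compressed = gzip(cbor);
        System.out.println("CBOR bytes: " + cbor.length);
        System.out.println("gzip bytes: " + compressed.length);
    }
}
```

      Keeping compression as a separate function over the encoded bytes makes step 4 easy to switch on or off without touching the serialization code.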

      The tool must accept either a single Nutch segment or a directory containing segments as input.
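
      One possible way to handle both input shapes is sketched below. It assumes a directory counts as a segment when it has a `content` subdirectory; that heuristic, and the class and method names, are illustrative assumptions rather than the behavior of the attached code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SegmentInputSketch {

    // Assumption: a segment directory is recognized by its "content" subdirectory.
    public static boolean looksLikeSegment(Path dir) {
        return Files.isDirectory(dir.resolve("content"));
    }

    // Returns the input itself if it is a segment, otherwise its direct
    // children that look like segments, in sorted order.
    public static List<Path> resolveSegments(Path input) {
        List<Path> segments = new ArrayList<>();
        if (looksLikeSegment(input)) {
            segments.add(input);
            return segments;
        }
        try (DirectoryStream<Path> children = Files.newDirectoryStream(input)) {
            for (Path child : children) {
                if (looksLikeSegment(child)) {
                    segments.add(child);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(segments);
        return segments;
    }

    public static void main(String[] args) {
        Path input = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path segment : resolveSegments(input)) {
            System.out.println(segment);
        }
    }
}
```

      Resolving the input to a list of segment paths up front lets the rest of the tool process every case with the same loop.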

      Thanks to Lewis John McGibbney and Chris A. Mattmann for their great suggestions, support and code.

        Attachments

        1. CommonCrawlDataDumper_v02.pdf
          97 kB
          Giuseppe Totaro
        2. CommonCrawlDataDumper.xlsx
          21 kB
          Giuseppe Totaro
        3. CommonCrawlDataDumper.pdf
          87 kB
          Giuseppe Totaro

            People

            • Assignee:
              lewismc Lewis John McGibbney
              Reporter:
              gostep Giuseppe Totaro
            • Votes:
              0
              Watchers:
              6

              Dates

              • Created:
                Updated:
                Resolved: