Description
We are going to develop a CommonCrawlDataDumper.java class. The CommonCrawlDataDumper is a tool that performs the following steps:
- deserialize the crawled data from Nutch
- map the deserialized data onto the proper JSON structure
- serialize the data into CBOR format
- optionally, compress the serialized data using gzip
The tool must accept either a single Nutch segment or a directory containing segments as input.
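A rough sketch of the last two steps (CBOR serialization and optional gzip compression), using only the JDK. Everything here is illustrative: the class name CborDumpSketch, the helper methods, and the "url" and "content" field names are hypothetical, reading the Nutch segment is left out, and the hand-written encoder handles only a flat map of text strings. The actual tool would more likely rely on an existing CBOR library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch, not the CommonCrawlDataDumper implementation.
public class CborDumpSketch {

    // Write a CBOR item header: major type in the top 3 bits,
    // followed by the length (or element count) in the shortest form.
    public static void writeHeader(ByteArrayOutputStream out, int majorType, int n) {
        int mt = majorType << 5;
        if (n < 24) {
            out.write(mt | n);
        } else if (n < 0x100) {
            out.write(mt | 24);
            out.write(n);
        } else if (n < 0x10000) {
            out.write(mt | 25);
            out.write(n >> 8);
            out.write(n);
        } else {
            out.write(mt | 26);
            out.write(n >> 24);
            out.write(n >> 16);
            out.write(n >> 8);
            out.write(n);
        }
    }

    // Write a CBOR text string (major type 3) as UTF-8 bytes.
    public static void writeText(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeHeader(out, 3, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Serialize a flat JSON-like record as a CBOR map (major type 5).
    public static byte[] toCbor(Map<String, String> record) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeHeader(out, 5, record.size());
        for (Map.Entry<String, String> e : record.entrySet()) {
            writeText(out, e.getKey());
            writeText(out, e.getValue());
        }
        return out.toByteArray();
    }

    // Optional step: gzip-compress the serialized bytes.
    public static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for one crawled document; field names are assumptions.
        Map<String, String> record = new LinkedHashMap<>();
        record.put("url", "http://example.org/");
        record.put("content", "<html>hello</html>");

        byte[] cbor = toCbor(record);
        byte[] packed = gzip(cbor);
        System.out.println("cbor bytes: " + cbor.length + ", gzip bytes: " + packed.length);
    }
}
```

Keeping compression as a separate step after serialization matches the "optionally" in the list above: a caller can skip gzip() and write the CBOR bytes directly.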
Thanks lewismc and chrismattmann for your great suggestions, support and code.