We are going to develop a CommonCrawlDataDumper.java class. The CommonCrawlDataDumper is a tool that performs the following steps:
- deserialize the crawled data from Nutch
- map the deserialized data onto the proper JSON structure
- serialize the data into CBOR format
- optionally, compress the serialized data using gzip
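
The last two steps above (CBOR serialization and optional gzip compression) can be sketched roughly as follows. This is only an illustration using the JDK: the class name `CborGzipSketch`, its methods, and the `url`/`content` field names are hypothetical, and the hand-written encoder covers only a flat map of text strings. A real implementation would more likely rely on a CBOR library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch, not the actual CommonCrawlDataDumper code.
class CborGzipSketch {

    // Writes a CBOR item head: major type in the top 3 bits, followed by
    // the length, either inline (< 24) or in 1, 2 or 4 extra bytes.
    static void writeHead(ByteArrayOutputStream out, int majorType, int length) {
        int mt = majorType << 5;
        if (length < 24) {
            out.write(mt | length);
        } else if (length < 0x100) {
            out.write(mt | 24);
            out.write(length);
        } else if (length < 0x10000) {
            out.write(mt | 25);
            out.write(length >> 8);
            out.write(length);
        } else {
            out.write(mt | 26);
            out.write(length >> 24);
            out.write(length >> 16);
            out.write(length >> 8);
            out.write(length);
        }
    }

    // Writes a CBOR text string (major type 3) as UTF-8 bytes.
    static void writeText(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeHead(out, 3, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Encodes a flat string-to-string map as a CBOR map (major type 5).
    static byte[] encodeMap(Map<String, String> fields) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeHead(out, 5, fields.size());
        for (Map.Entry<String, String> e : fields.entrySet()) {
            writeText(out, e.getKey());
            writeText(out, e.getValue());
        }
        return out.toByteArray();
    }

    // Optional step: gzip-compresses the serialized bytes.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // Field names are assumptions made for this example only.
        Map<String, String> record = new LinkedHashMap<>();
        record.put("url", "http://example.org/");
        record.put("content", "<html>...</html>");
        byte[] cbor = encodeMap(record);
        byte[] packed = gzip(cbor);
        System.out.println(cbor.length + " CBOR bytes, " + packed.length + " gzip bytes");
    }
}
```

Keeping compression as a separate function applied to the already serialized bytes makes it easy to leave the gzip step optional.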
This tool must be able to work with either a single Nutch segment or a directory containing segments as input data.
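
One way to accept both kinds of input is to normalize them into a list of segment directories before processing. The sketch below assumes that a directory counts as a segment when it has a `content` subdirectory. That check, the class name `SegmentInputSketch`, and the method names are assumptions made for illustration, not the tool's actual logic.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical sketch, not the actual CommonCrawlDataDumper code.
class SegmentInputSketch {

    // Assumption: a segment is a directory with a "content" subdirectory.
    static boolean isSegment(Path dir) {
        return Files.isDirectory(dir.resolve("content"));
    }

    // Returns the input itself if it is a segment, otherwise every
    // direct child directory that looks like a segment, in sorted order.
    static List<Path> resolveSegments(Path input) {
        if (isSegment(input)) {
            return Collections.singletonList(input);
        }
        List<Path> segments = new ArrayList<>();
        try (Stream<Path> children = Files.list(input)) {
            children.filter(SegmentInputSketch::isSegment)
                    .sorted()
                    .forEach(segments::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return segments;
    }

    public static void main(String[] args) {
        // Usage: pass a segment path or a directory of segments.
        Path input = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path segment : resolveSegments(input)) {
            System.out.println(segment);
        }
    }
}
```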
|CommonCrawlDumper : Invalid format + skipped parts||Resolved|
|Make CommonCrawlFormatJackson instance reusable by properly handling object state||Closed||Unassigned|