We are going to develop a CommonCrawlDataDumper.java class. The CommonCrawlDataDumper is a tool that performs the following steps:
- deserialize the crawled data from Nutch
- map serialized data on the proper JSON structure
- serialize the data into CBOR format
- optionally, compress the serialized data using gzip
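The last two steps above can be sketched using only the JDK. This is a minimal illustration, not the tool's implementation: the class name `CborGzipSketch`, its helper methods, and the sample record fields are all hypothetical, and the hand-rolled encoder covers only a flat map of text strings (CBOR major types 5 and 3). A real implementation would more likely delegate to a CBOR library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: encode a flat string map as CBOR, then gzip it.
final class CborGzipSketch {

    // Writes a CBOR head byte: major type (0-7) plus a non-negative argument.
    static void writeHead(ByteArrayOutputStream out, int majorType, int value) {
        int mt = majorType << 5;
        if (value < 24) {
            out.write(mt | value);          // argument fits in the head byte
        } else if (value < 0x100) {
            out.write(mt | 24);             // 1-byte argument follows
            out.write(value);
        } else if (value < 0x10000) {
            out.write(mt | 25);             // 2-byte big-endian argument follows
            out.write(value >> 8);
            out.write(value);
        } else {
            out.write(mt | 26);             // 4-byte big-endian argument follows
            out.write(value >>> 24);
            out.write(value >> 16);
            out.write(value >> 8);
            out.write(value);
        }
    }

    // Writes a UTF-8 text string (CBOR major type 3).
    static void writeText(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeHead(out, 3, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Serializes the record as a definite-length CBOR map (major type 5).
    static byte[] toCbor(Map<String, String> record) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeHead(out, 5, record.size());
        for (Map.Entry<String, String> e : record.entrySet()) {
            writeText(out, e.getKey());
            writeText(out, e.getValue());
        }
        return out.toByteArray();
    }

    // Optional step: gzip-compress the serialized bytes.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // Sample fields are placeholders, not the tool's actual JSON structure.
        Map<String, String> record = new LinkedHashMap<>();
        record.put("url", "http://example.com/");
        record.put("content", "<html>hello</html>");

        byte[] cbor = toCbor(record);
        byte[] compressed = gzip(cbor);
        System.out.println("CBOR bytes: " + cbor.length);
        System.out.println("gzip bytes: " + compressed.length);
    }
}
```

For very small records the gzip output can be larger than the CBOR input because of the fixed gzip header and trailer, which is one reason to keep compression optional.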
The tool must accept either a single Nutch segment or a directory containing segments as input.
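One way to handle both input shapes is to normalize the input into a list of segment directories before processing. The sketch below is a rough illustration under a stated assumption: a directory is treated as a segment when it contains a `content` subdirectory. The class name `SegmentInputSketch` and both method names are hypothetical, and the real tool may detect segments differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: resolve the input path to a list of segment directories.
final class SegmentInputSketch {

    // Assumption: a segment is a directory that has a "content" subdirectory.
    static boolean isSegment(Path dir) {
        return Files.isDirectory(dir.resolve("content"));
    }

    // Returns the input itself if it is a segment, otherwise its child segments.
    static List<Path> resolveSegments(Path input) {
        if (isSegment(input)) {
            return Collections.singletonList(input);
        }
        List<Path> segments = new ArrayList<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(input)) {
            for (Path child : children) {
                if (isSegment(child)) {
                    segments.add(child);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(segments); // deterministic processing order
        return segments;
    }

    public static void main(String[] args) {
        Path input = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path segment : resolveSegments(input)) {
            System.out.println("segment: " + segment);
        }
    }
}
```

Normalizing to a list up front lets the rest of the pipeline iterate over segments without caring which kind of input was supplied.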
Thanks to Lewis John McGibbney and Chris A. Mattmann for your great suggestions, support and code.
1. CommonCrawlDumper: Invalid format + skipped parts | Resolved | Chris A. Mattmann
2. Make CommonCrawlFormatJackson instance reusable by properly handling object state | Resolved | Unassigned