Description
Hi all, attached is a new patch that adds support for new options to CommonCrawlDataDumper.
The new options are passed to the CommonCrawlFormat object (which provides methods to create the JSON output) through a configuration object (CommonCrawlConfig).
With this patch, CommonCrawlDataDumper supports the following options:
- -SimpleDataFormat: enables timestamps in GMT epoch (milliseconds) format.
- -epochFilename: extracted files are organized in a reversed-DNS tree based on the FQDN of the web page, followed by a SHA1 hash of the complete URL. Scraped data is stored in these directories as individual GMT-timestamped files named with the epoch time (in milliseconds) plus the file extension.
- -jsonArray: organizes both request and response headers into a JSON array instead of using a JSON sub-object.
- -reverseKey: uses the same layout as described for the -epochFilename option, with underscores in place of directory separators.
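To make the -epochFilename and -reverseKey layouts more concrete, here is a minimal sketch of how such a key could be built. It is not the code from the patch: the class DumpLayoutSketch, the method names, the example URL and the "json" extension are all hypothetical, and it assumes that the hash is hex-encoded and that -reverseKey also replaces the separator in front of the file name.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not part of Nutch or of the attached patch.
public class DumpLayoutSketch {

    // Reverse the labels of the host name: "www.example.com" -> [com, example, www].
    static List<String> reversedHostLabels(String url) {
        String host = URI.create(url).getHost();
        List<String> labels = new ArrayList<>(Arrays.asList(host.split("\\.")));
        Collections.reverse(labels);
        return labels;
    }

    // Hex-encoded SHA-1 digest of the complete URL (hex encoding is an assumption).
    static String sha1Hex(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(url.getBytes(StandardCharsets.UTF_8))) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Join reversed host labels, URL hash and "<epochMillis>.<extension>".
    // separator "/" mimics -epochFilename, "_" mimics -reverseKey.
    static String buildKey(String url, long epochMillis, String extension, String separator) {
        List<String> parts = reversedHostLabels(url);
        parts.add(sha1Hex(url));
        parts.add(epochMillis + "." + extension);
        return String.join(separator, parts);
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/index.html"; // example URL
        long fetched = 1420070400000L;                    // example fetch time in epoch milliseconds
        System.out.println(buildKey(url, fetched, "json", "/"));
        System.out.println(buildKey(url, fetched, "json", "_"));
    }
}
```

With the "/" separator the result has the shape com/example/www/<sha1>/<epoch millis>.json, and with "_" the same components are joined into a single flat key.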
You can use the options above in addition to the options already supported, as described in the Nutch wiki page.
This patch builds on NUTCH-1974.
Thanks to Chris A. Mattmann and Ann Burgess for supporting me in this work.