Details
Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Description
This JIRA adds support for online mode in Corona.
In online mode, Common Crawl data from AWS is used to populate Ozone. The default source is CC-MAIN-2017-17/warc.paths.gz, which lists the paths to the actual data segments; the user can override it with -source.
The following values are derived from the URL of each Common Crawl record:
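To illustrate what the source file holds, here is a minimal Python sketch (not Corona's actual code) that reads a locally downloaded warc.paths.gz and returns the segment paths it lists. It assumes the file is gzip-compressed text with one path per line; the function name `read_segment_paths` is hypothetical.

```python
import gzip
from typing import List


def read_segment_paths(paths_file: str) -> List[str]:
    """Return the data segment paths listed in a gzip-compressed paths file.

    Assumption: the file is UTF-8 text with one segment path per line.
    """
    segments = []
    with gzip.open(paths_file, "rt", encoding="utf-8") as handle:
        for line in handle:
            path = line.strip()
            if path:  # skip blank lines
                segments.append(path)
    return segments
```

Each returned entry is a relative path to one data segment, which would then be fetched and parsed for records.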
- Domain will be used as Volume
- URL will be used as Bucket
- FileName will be used as Key
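The mapping above can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the implementation in the patch: the names `OzoneTarget` and `derive_target` are hypothetical, the URL is used verbatim as the bucket (the ticket does not say how it is normalized into a valid bucket name), and the file name is taken to be the last path segment.

```python
import posixpath
from typing import NamedTuple
from urllib.parse import urlparse


class OzoneTarget(NamedTuple):
    volume: str  # derived from the domain
    bucket: str  # derived from the URL
    key: str     # derived from the file name


def derive_target(url: str) -> OzoneTarget:
    """Derive volume, bucket, and key from a record URL (hypothetical helper)."""
    parsed = urlparse(url)
    domain = parsed.hostname or ""
    # Assumption: the file name is the last segment of the URL path.
    file_name = posixpath.basename(parsed.path)
    # Assumption: the URL is kept as-is; real bucket names may need sanitizing.
    return OzoneTarget(volume=domain, bucket=url, key=file_name)


target = derive_target("http://example.com/docs/index.html")
print(target.volume, target.key)
```

A URL whose path ends in a slash yields an empty key in this sketch, so a real implementation would need a fallback name for such records.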