Details
- Type: Improvement
- Status: Done
- Priority: Major
- Resolution: Done
Description
We have a nice, generalized infrastructure for loading data into HBase and interacting with it via `flatfile_loader.sh` and `ENRICHMENT_GET()`. It would also be useful to summarize a data set into a static data structure, store it on HDFS, and interact with it via Stellar. To that end, and to complement `flatfile_loader.sh`, we should have a `flatfile_summarizer.sh` that uses the same extractor config to process a flat file and output a serialized object.
The use case is as follows:
Say I have a static list of domains in the second column of a CSV file, `domains.csv`, and I want to generate a Bloom filter containing those domains with the TLD removed.
I should be able to create a file called `bloom.ser` containing the serialized Bloom filter, given the extractor config:
    {
      "config" : {
        "columns" : {
          "rank" : 0,
          "domain" : 1
        },
        "value_transform" : {
          "domain" : "DOMAIN_REMOVE_TLD(domain)"
        },
        "value_filter" : "LENGTH(domain) > 0",
        "state_init" : "BLOOM_INIT()",
        "state_update" : {
          "state" : "BLOOM_ADD(state, domain)"
        },
        "state_merge" : "BLOOM_MERGE(states)",
        "separator" : ","
      },
      "extractor" : "CSV"
    }
Note: the associated Stellar function `OBJECT_GET` is still pending.
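One way to read the `state_init`, `state_update`, and `state_merge` fields in the config is as a fold over the rows of each partition, followed by a merge of the partial states. The Python sketch below illustrates that reading only; it is not Metron's implementation. The `BloomFilter` class, `remove_tld`, `summarize`, and `summarize_csv` are hypothetical names, `remove_tld` is a naive stand-in for `DOMAIN_REMOVE_TLD` that drops just the last dot-separated label, and the JSON byte format is an assumption standing in for whatever serialization the summarizer ends up using.

```python
import csv
import hashlib
import io
import json


class BloomFilter:
    """Tiny illustrative Bloom filter (hypothetical, not Metron's)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos
        return self

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other):
        # Two filters with the same parameters merge by OR-ing their bits.
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged

    def to_bytes(self):
        # Assumed on-disk format for this sketch only.
        payload = {"num_bits": self.num_bits,
                   "num_hashes": self.num_hashes,
                   "bits": hex(self.bits)}
        return json.dumps(payload).encode("utf-8")

    @classmethod
    def from_bytes(cls, data):
        payload = json.loads(data.decode("utf-8"))
        restored = cls(payload["num_bits"], payload["num_hashes"])
        restored.bits = int(payload["bits"], 16)
        return restored


def remove_tld(domain):
    """Naive stand-in for DOMAIN_REMOVE_TLD: drops the last label only."""
    domain = domain.strip()
    head, sep, _ = domain.rpartition(".")
    return head if sep else domain


def summarize(rows):
    """Fold one partition of rows into a single state."""
    state = BloomFilter()                # state_init: BLOOM_INIT()
    for row in rows:
        domain = remove_tld(row[1])      # columns.domain = 1, value_transform
        if len(domain) > 0:              # value_filter: LENGTH(domain) > 0
            state.add(domain)            # state_update: BLOOM_ADD(state, domain)
    return state


def summarize_csv(text, partitions=2):
    """Split rows into partitions, summarize each, then merge the states."""
    reader = csv.reader(io.StringIO(text), delimiter=",")  # separator: ","
    rows = [row for row in reader if len(row) > 1]
    chunks = [rows[i::partitions] for i in range(partitions)]
    partials = [summarize(chunk) for chunk in chunks]
    merged = partials[0]
    for partial in partials[1:]:
        merged = merged.merge(partial)   # state_merge: BLOOM_MERGE(states)
    return merged


sample = "1,google.com\n2,youtube.com\n3,\n4,example.org\n"
bloom = summarize_csv(sample)
serialized = bloom.to_bytes()            # a real tool would write this to bloom.ser
restored = BloomFilter.from_bytes(serialized)
print("google" in restored)
```

Because merging is a bitwise OR, the result does not depend on how the rows are split across partitions, which is what makes a `state_merge` step safe to run over independently computed partial states.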
Attachments
Issue Links
- blocks: METRON-1380 Create a typosquatting use-case (Done)
- links to