Metron (Retired) / METRON-1378

Create a summarizer


Details

    • Type: Improvement
    • Status: Done
    • Priority: Major
    • Resolution: Done
    • None
    • 0.5.0
    • None

    Description

      We have a nice, generalized infrastructure for loading data into HBase via `flatfile_loader.sh` and interacting with it via `ENRICHMENT_GET()`. It is also useful to summarize a set of data into a static data structure, store it on HDFS, and interact with it via Stellar. To that end, to complement `flatfile_loader.sh`, we should have a `flatfile_summarizer.sh` that, using the same extractor config, processes a flat file and outputs a serialized object.

      The use case is as follows:
      Let's say that I have a static list of domains in the second column of a CSV file, `domains.csv`, and I want to generate a Bloom filter containing those domains with the TLD removed.

      I should be able to create a file called `bloom.ser` containing the serialized Bloom filter, given the following extractor config:

      {
        "config" : {
          "columns" : {
            "rank" : 0,
            "domain" : 1
          },
          "value_transform" : {
            "domain" : "DOMAIN_REMOVE_TLD(domain)"
          },
          "value_filter" : "LENGTH(domain) > 0",
          "state_init" : "BLOOM_INIT()",
          "state_update" : {
            "state" : "BLOOM_ADD(state, domain)"
          },
          "state_merge" : "BLOOM_MERGE(states)",
          "separator" : ","
        },
        "extractor" : "CSV"
      }
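
      To make the intended semantics of `state_init`, `state_update`, and `state_merge` concrete, here is a minimal Python sketch of the same fold over a CSV. It is not Metron's implementation: the `BloomFilter` class, `remove_tld`, `summarize_partition`, and `summarize` are hypothetical names made up for illustration, the TLD removal naively drops the last dot-separated label, and splitting the input into partitions is an assumption about how per-partition states would be merged.

```python
import csv
import hashlib
from functools import reduce


class BloomFilter:
    """Toy Bloom filter; a hypothetical stand-in for what BLOOM_INIT() returns."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos
        return self

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other):
        # The union of two filters with identical parameters is a bitwise OR.
        assert (self.size_bits, self.num_hashes) == (other.size_bits, other.num_hashes)
        self.bits |= other.bits
        return self


def remove_tld(domain):
    # Naive assumption: the TLD is the last dot-separated label.
    # A value without any dot yields "" and is dropped by the filter below.
    head, _, _ = domain.strip().rpartition(".")
    return head


def summarize_partition(lines):
    state = BloomFilter()                  # state_init
    for row in csv.reader(lines):          # "extractor": "CSV", "separator": ","
        if len(row) < 2:
            continue                       # no domain column on this row
        domain = remove_tld(row[1])        # columns.domain = 1, then value_transform
        if len(domain) > 0:                # value_filter
            state = state.add(domain)      # state_update
    return state


def summarize(partitions):
    states = [summarize_partition(p) for p in partitions]
    # state_merge: combine the per-partition states into one summary object.
    return reduce(lambda a, b: a.merge(b), states, BloomFilter())


# Two hypothetical partitions of domains.csv (rank, domain).
partitions = [
    ["1,google.com", "2,youtube.com"],
    ["3,facebook.com", "4,"],
]
bloom = summarize(partitions)
print("google" in bloom)  # True: a Bloom filter never reports a false negative
```

      The row with an empty domain is rejected by the filter step, so it never reaches the update step. Membership tests for values that were never added can still return true occasionally, which is the usual Bloom filter trade-off.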
      

      Note: the associated Stellar function `OBJECT_GET` is still pending.
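
      The round trip this issue is aiming for (serialize the summary once, then load it and query it later) can be pictured with the sketch below. Everything in it is an assumption for illustration: `write_summary` and `load_summary` are hypothetical helpers, `pickle` stands in for whatever serialization the summarizer ends up using, a `frozenset` stands in for the Bloom filter, and a local temporary directory stands in for HDFS.

```python
import os
import pickle
import tempfile


def write_summary(obj, path):
    # Hypothetical stand-in for the summarizer writing its serialized object.
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_summary(path):
    # Hypothetical stand-in for a lookup function that reads the object back.
    with open(path, "rb") as f:
        return pickle.load(f)


# A frozenset stands in for the Bloom filter of domains without their TLD.
summary = frozenset({"google", "youtube", "facebook"})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bloom.ser")  # local file instead of HDFS
    write_summary(summary, path)
    loaded = load_summary(path)

print("google" in loaded)  # True
```

      The point of the design is that the expensive pass over the flat file happens once, offline, while later lookups only pay for deserializing a compact object.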

            People

              cestella Casey Stella
              cestella Casey Stella