Affects Version/s: None
Fix Version/s: 0.7.0
Create a Batch Profiler that satisfies the following use cases.
- As a Security Data Scientist, I want to understand the historical behaviors and trends of a profile that I have created so that I can determine if I have created a feature set that has predictive value for model building.
- As a Security Data Scientist, I want to understand the historical behaviors and trends of a profile that I have created so that I can determine if I have defined the profile correctly and created a feature set that matches reality.
- As a Security Platform Engineer, I want to generate a profile using archived telemetry when I deploy a new model to production so that models depending on that profile can function on day 1.
- Currently, a profile can only be generated from the telemetry consumed after the profile was created.
- The goal would be to enable “profile seeding”, which allows profiles to be populated from telemetry captured before the profile was created.
- A profile would be seeded using the telemetry that has been archived by Metron in HDFS.
- A profile consumer should not be able to distinguish the “seeded” portion of a profile from the portion generated from live telemetry.
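To make the indistinguishability requirement concrete, here is a minimal sketch. The class name, field names, and 15-minute period duration are illustrative assumptions, not the actual implementation; the point is only that seeded and live measurements derive the same storage key from the same event timestamp.

```python
from dataclasses import dataclass

PERIOD_MILLIS = 15 * 60 * 1000  # assume 15-minute profile periods

@dataclass(frozen=True)
class ProfileKey:
    """Hypothetical key under which a profile measurement is stored."""
    profile: str
    entity: str
    period: int  # which profile period the measurement falls in

def key_for(profile: str, entity: str, timestamp_ms: int) -> ProfileKey:
    """Derive the storage key from the event timestamp in the message."""
    return ProfileKey(profile, entity, timestamp_ms // PERIOD_MILLIS)

# A measurement seeded from archived telemetry...
seeded = key_for("hosts", "10.0.0.1", 1_500_000_000_000)
# ...and one produced live by the Streaming Profiler at the same event time
live = key_for("hosts", "10.0.0.1", 1_500_000_000_000)
assert seeded == live  # a consumer cannot tell them apart
```

Because both paths key measurements identically, a consumer reading the profile sees one continuous history.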
- There are currently two ports of the Profiler: the Streaming Profiler, which processes streaming data in Storm, and another that runs in the REPL and allows a user to manually build, test, and debug profiles.
- These ports largely share a common code base in metron-analytics/metron-profiler-common.
- Each port additionally requires a smaller set of “orchestration” logic: one for Storm, another for the REPL.
- Both Profiler ports support both system time and event time processing.
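For clarity on the distinction, a small sketch with a hypothetical archived message; the `timestamp` field name is an assumption here:

```python
import json
import time

# A telemetry message as it might be archived; field names are illustrative.
message = json.loads('{"ip_src_addr": "10.0.0.1", "timestamp": 1500000000000}')

# System time processing: the wall clock at the moment a message happens to
# be processed decides which profile period it falls into. Replaying archived
# telemetry this way would collapse all history into "now".
system_time_ms = int(time.time() * 1000)

# Event time processing: the timestamp carried inside the message decides the
# period, so archived telemetry lands in its historically correct periods.
event_time_ms = message["timestamp"]
```

This is why the Batch Profiler can only run on event time: system time is meaningless for telemetry that was archived in the past.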
- Create a third port of the Profiler: the Batch Profiler.
- The Batch Profiler will be built to run in Spark so that the telemetry can be consumed in batch.
- Allows a user to seed profiles using the JSON telemetry that is archived in HDFS by Metron Indexing.
- Only generates the profile data stored in HBase, not the messages that are produced for Threat Triage and Kafka.
- Any number of profiles can be generated at once, but dependencies between profiles are not supported; that is, one profile cannot consume the output generated by another.
- The Batch Profiler must use the timestamps contained within the telemetry; that is, it runs on event time, which the Profiler already supports.
- Enable a pluggable mechanism so that telemetry stored in different formats can be consumed by the Batch Profiler. For example, the Profiler should be able to consume telemetry stored as raw JSON or in other formats like ORC or Parquet.
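To make the intended behavior concrete, here is a minimal, non-Spark sketch of what the Batch Profiler would do over archived telemetry. The profile name, field names, 15-minute period, and reader registry are assumptions for illustration, not the actual implementation; in Spark, the bucketing below would be a `groupBy`, and ORC/Parquet support would delegate to Spark's built-in readers.

```python
import json
from collections import defaultdict

PERIOD_MILLIS = 15 * 60 * 1000  # hypothetical 15-minute profile periods

# Pluggable input: each supported storage format maps to a reader that yields
# telemetry messages as dictionaries. Only JSON is sketched here.
READERS = {
    "json": lambda lines: (json.loads(line) for line in lines),
}

# Archived telemetry as Metron Indexing might have written it to HDFS
# (field names are illustrative).
archived = [
    '{"ip_src_addr": "10.0.0.1", "timestamp": 1500000000000}',
    '{"ip_src_addr": "10.0.0.1", "timestamp": 1500000100000}',
    '{"ip_src_addr": "10.0.0.2", "timestamp": 1500000200000}',
]

# Bucket each message by (profile, entity, period) and aggregate within the
# bucket, using the event timestamp from the message itself.
counts = defaultdict(int)
for msg in READERS["json"](archived):
    entity = msg["ip_src_addr"]
    period = msg["timestamp"] // PERIOD_MILLIS  # event time, not system time
    counts[("message_count", entity, period)] += 1

# Each aggregate would then be written to HBase as profile data; no messages
# are produced for Kafka or Threat Triage.
```

Note that the two messages from 10.0.0.1 fall into the same period and are aggregated together, exactly as the Streaming Profiler would have done had it been running at the time.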