[METRON-1699] Create Batch Profiler - ASF JIRA

Create a Batch Profiler that satisfies the following use cases.

As a Security Data Scientist, I want to understand the historical behaviors and trends of a profile that I have created so that I can determine if I have created a feature set that has predictive value for model building.

As a Security Data Scientist, I want to understand the historical behaviors and trends of a profile that I have created so that I can determine if I have defined the profile correctly and created a feature set that matches reality.

As a Security Platform Engineer, I want to generate a profile using archived telemetry when I deploy a new model to production so that models depending on that profile can function on day 1.

Currently, a profile can only be generated from the telemetry consumed after the profile was created.
The goal is to enable “profile seeding”, which allows a profile to be populated with data from before the profile was created.
A profile would be seeded using the telemetry that has been archived by Metron in HDFS.
A profile consumer should not be able to distinguish the “seeded” portion of a profile.

There are currently two ports of the Profiler: the Streaming Profiler, which handles streaming data in Storm, and a second port that runs in the REPL and allows a user to manually build, test, and debug profiles.
These ports largely share a common code base in metron-analytics/metron-profiler-common.
Each port also needs a smaller set of “orchestration” logic of its own: one for Storm, another for the REPL.
Both Profiler ports support system time and event time processing.
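
To make the difference between system time and event time concrete, here is a minimal sketch of how a message might be assigned to a profile period. It is an illustration only, not the Profiler's actual code: the function names, the `timestamp` field name, and the 15-minute period duration are all assumptions made for this example.

```python
import time
from typing import Optional

# Assumed period duration for this sketch: 15 minutes, in milliseconds.
PERIOD_MILLIS = 15 * 60 * 1000


def period_id(timestamp_millis: int, period_millis: int = PERIOD_MILLIS) -> int:
    """Map an epoch timestamp (ms) to the index of the period containing it."""
    return timestamp_millis // period_millis


def message_period(message: dict, timestamp_field: Optional[str] = "timestamp") -> int:
    """Pick the period for a message.

    Event time: read the timestamp carried inside the message itself.
    System time: ignore the message and use the wall clock at processing time.
    """
    if timestamp_field is not None:
        ts = int(message[timestamp_field])   # event time
    else:
        ts = int(time.time() * 1000)         # system time
    return period_id(ts)
```

With event time, replaying the same archived message always lands in the same period regardless of when it is processed, which is the property that seeding from archived telemetry depends on.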

Create a third port of the Profiler: the Batch Profiler.
The Batch Profiler will be built to run in Spark so that the telemetry can be consumed in batch.
It allows a user to seed profiles using the JSON telemetry that is archived in HDFS by Metron Indexing.
It generates only the profile data stored in HBase, not the messages that are produced for Threat Triage and Kafka.
Any number of profiles can be generated at once, but dependencies between profiles are not supported. A dependency exists when one profile consumes the profile generated by another.
The Batch Profiler must use the timestamps contained within the telemetry; it runs on event time. The Profiler already supports event time.
Enable a pluggable mechanism so that telemetry stored in different formats can be consumed by the Batch Profiler. For example, the Profiler should be able to consume telemetry stored as raw JSON or in other formats like ORC or Parquet.
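
One way to picture the pluggable input mechanism is a small registry that maps a format name to a reader function. The sketch below is hypothetical and is not the Batch Profiler's API: `register_reader`, `read_telemetry`, and the format names are invented for illustration, and only a line-delimited JSON reader is shown because ORC or Parquet readers would need third-party libraries.

```python
import json
from typing import Callable, Dict, Iterable, Iterator

# A reader turns raw input lines into telemetry messages (dicts).
TelemetryReader = Callable[[Iterable[str]], Iterator[dict]]

# Hypothetical registry of readers, keyed by lower-case format name.
_READERS: Dict[str, TelemetryReader] = {}


def register_reader(name: str) -> Callable[[TelemetryReader], TelemetryReader]:
    """Decorator that plugs a reader into the registry under a format name."""
    def decorator(reader: TelemetryReader) -> TelemetryReader:
        _READERS[name.lower()] = reader
        return reader
    return decorator


@register_reader("json")
def read_json_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Read one JSON message per line, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def read_telemetry(fmt: str, lines: Iterable[str]) -> Iterator[dict]:
    """Look up the reader for the requested format and apply it."""
    try:
        reader = _READERS[fmt.lower()]
    except KeyError:
        raise ValueError(f"no telemetry reader registered for format: {fmt}")
    return reader(lines)
```

Under this design, supporting another storage format means registering one more reader; the code that builds profiles only ever sees message dicts and does not change.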

Is duplicated by: METRON-594 Replay Telemetry Data through Profiler

1.	Make Core Profiler Components Serializable	Done	Nick Allen
2.	Message Timestamp Logic Should be Shared	Done	Nick Allen
3.	Create ProfilePeriod Using Period ID	Done	Nick Allen
4.	HbaseClient.mutate should return the number of mutations	Done	Nick Allen
5.	Port Profiler to Spark	Done	Nick Allen
6.	Run the Batch Profiler in Spark	Done	Nick Allen
7.	Create RPM Packaging for the Batch Profiler	Done	Nick Allen
8.	Create DEB Packaging for Batch Profiler	Done	Nick Allen
9.	Relocate Storm Profiler Code	Done	Nick Allen
10.	Enhance Batch Profiler Integration Test	Done	Nick Allen
11.	Move REPL Port of Profiler to Separate Project	Done	Nick Allen
12.	Merge Batch Profiler Feature Branch into Master	Done	Nick Allen
13.	Add Docs for Running the Profiler with Spark on YARN	Done	Nick Allen
14.	Support alternative input formats in the Batch Profiler	Done	Nick Allen
15.	Fix RPM Spec File	Done	Nick Allen
16.	Input Time Constraints for Batch Profiler	Done	Nick Allen