
[SPOT-141] Ingestion using Spark Streaming


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed

    Description


      A new branch of the Spot Ingest framework adds ingestion via Spark Streaming [*Spark2*] for the flow, dns and proxy pipelines. The code was developed without modifying the existing framework, so this feature ships as extra functionality: the current Spot Ingest framework keeps working as before, and Streaming Ingestion can be used as an alternative.

      Implementation

      A new collector class [*Distributed Collector*] has been added, which is the same for all pipelines. Its role is similar to the existing collectors': it processes the data before transmission. The Distributed Collector watches a directory for newly created files. When a file is detected, it converts it to CSV format and stores the output in the local staging area. It then reads the CSV file line by line and creates smaller chunks of bytes; the size of each chunk depends on the maximum request size allowed by Kafka. Finally, it serializes each chunk into an Avro-encoded format and publishes it to the Kafka cluster.
      Due to its architecture, the Distributed Collector can run on an edge node of the Big Data infrastructure as well as on a remote host (proxy server, vNSF, etc.).
      The Distributed Collector publishes only the CSV-converted file to Apache Kafka, not the original one; the binary file remains on the local filesystem of the current host.
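
      As a rough illustration of the chunk-and-publish step described above, the following sketch reads a CSV file, groups lines into chunks that fit under Kafka's maximum request size, Avro-encodes each chunk and publishes it. It assumes the kafka-python and (Python 2) avro packages; the schema, topic name and broker address are illustrative, not the actual Spot code.

{code:python}
# Minimal sketch of the Distributed Collector's chunk-and-publish step.
# Schema, topic and broker address are hypothetical.
import io

import avro.io
import avro.schema
from kafka import KafkaProducer

MAX_REQUEST_SIZE = 1048576  # Kafka's default max.request.size (1 MB)

# Hypothetical Avro schema: each message carries an array of CSV lines.
SCHEMA = avro.schema.parse('''{
    "type": "record", "name": "list",
    "fields": [{"name": "list", "type": {"type": "array", "items": "string"}}]
}''')

def chunks(csv_path, limit=MAX_REQUEST_SIZE):
    '''Yield lists of CSV lines whose combined size stays under the limit.'''
    batch, size = [], 0
    with open(csv_path) as csv_file:
        for line in csv_file:
            if batch and size + len(line) > limit:
                yield batch
                batch, size = [], 0
            batch.append(line)
            size += len(line)
    if batch:
        yield batch

def serialize(lines):
    '''Avro-encode one chunk of CSV lines into raw bytes.'''
    buf = io.BytesIO()
    avro.io.DatumWriter(SCHEMA).write({'list': lines},
                                      avro.io.BinaryEncoder(buf))
    return buf.getvalue()

producer = KafkaProducer(bootstrap_servers='kafka:9092',
                         max_request_size=MAX_REQUEST_SIZE)
for chunk in chunks('/tmp/staging/flow.csv'):
    producer.send('spot-ingest-flow', serialize(chunk))
producer.flush()
{code}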

      In contrast, the Streaming Listener can only run on the central infrastructure. It listens to a specific Kafka topic and consumes incoming messages. Streaming data is divided into batches according to a time interval; the Listener deserializes each batch according to the supported Avro schema, parses it and registers it in the corresponding Hive table.
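
      A minimal sketch of the listener side follows, assuming the pre-Spark-3 Kafka DStream API (pyspark.streaming.kafka) and the avro package: each micro-batch is Avro-decoded, split into fields and appended to a Hive table. The topic, table, column names and batch interval are hypothetical.

{code:python}
# Minimal sketch of the Streaming Listener's receive-deserialize-register
# loop. Topic, table, columns and batch interval are hypothetical.
import io

import avro.io
import avro.schema
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Must match the schema used by the Distributed Collector.
SCHEMA = avro.schema.parse('''{
    "type": "record", "name": "list",
    "fields": [{"name": "list", "type": {"type": "array", "items": "string"}}]
}''')

# Illustrative subset of flow columns; the real table has many more fields.
COLUMNS = ['treceived', 'sip', 'dip', 'sport', 'dport', 'proto']

def deserialize(raw_bytes):
    '''Decode one Avro-encoded message back into its list of CSV lines.'''
    decoder = avro.io.BinaryDecoder(io.BytesIO(raw_bytes))
    return avro.io.DatumReader(SCHEMA).read(decoder)['list']

sc = SparkContext(appName='StreamingListener')
ssc = StreamingContext(sc, 30)      # hypothetical 30-second micro-batches
hive = HiveContext(sc)

stream = KafkaUtils.createDirectStream(
    ssc, ['spot-ingest-flow'],
    {'metadata.broker.list': 'kafka:9092'},
    valueDecoder=deserialize)

def store(rdd):
    '''Parse one micro-batch and append it to the corresponding Hive table.'''
    if not rdd.isEmpty():
        rows = rdd.flatMap(lambda kv: kv[1]) \
                  .map(lambda line: line.strip().split(',')[:len(COLUMNS)])
        hive.createDataFrame(rows, COLUMNS) \
            .write.mode('append').saveAsTable('spot.flow')

stream.foreachRDD(store)
ssc.start()
ssc.awaitTermination()
{code}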

      For each pipeline, the following modules have been implemented:

      • processing - contains methods used to convert and prepare ingested data before it is sent to the Kafka cluster.
      • streaming - contains methods used during the streaming process (such as the table schema); a sketch of both modules follows this list.
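
      The sketch below shows what the two modules might contain for the flow pipeline. The layout, function names and column list are illustrative rather than the actual Spot source, and it assumes the nfdump tool is available for the binary-to-CSV conversion.

{code:python}
# Hypothetical sketch of the two per-pipeline modules for the flow pipeline;
# layout, names and columns are illustrative, not the actual Spot source.

# --- processing module: prepare data before it is sent to Kafka ------------
import os
import subprocess

def convert(nfcapd_file, staging_dir):
    '''Convert a binary nfcapd file to CSV in the local staging area
    (assumes the nfdump tool is installed on the collector host).'''
    csv_file = os.path.join(staging_dir,
                            os.path.basename(nfcapd_file) + '.csv')
    with open(csv_file, 'w') as output:
        subprocess.check_call(['nfdump', '-o', 'csv', '-r', nfcapd_file],
                              stdout=output)
    return csv_file

# --- streaming module: pieces the Streaming Listener needs -----------------
# Illustrative subset of the flow table schema.
COLUMNS = ['treceived', 'sip', 'dip', 'sport', 'dport', 'proto']

def parse(line):
    '''Split one CSV line into the fields of the Hive table.'''
    return line.strip().split(',')
{code}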

      Advantages

      A preliminary traffic-ingestion check of the distributed collector-worker modules was run for all pipelines (flow, dns, proxy) on an existing Spot v1.0 installation. In summary, the module performed flawlessly on netflow files (both .nfcapd and .csv), ingesting an order of magnitude faster than the standard version's ingestion. DNS and proxy ingestion were tested as well, with similar results.

            People

              Assignee: Unassigned
              Reporter: Kostas Tzoulas (ktzoulas)
