
[SPOT-141] Ingestion using Spark Streaming


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed

    Description


      A new branch of the Spot Ingest framework adds ingestion via Spark Streaming [*Spark2*] for the flow, dns and proxy pipelines. The code was developed without modifying the existing framework, so this feature ships as extra functionality: the current Spot Ingest framework keeps working as before, and Streaming Ingestion can be used as an alternative.

      Implementation

      A new collector class [*Distributed Collector*] has been added, which is the same for all pipelines. Its role is similar to the existing collectors': it processes the data before transmission. The Distributed Collector watches a directory for newly created files. When a file is detected, it converts it to CSV format and stores the output in the local staging area. It then reads the CSV file line by line and creates smaller chunks of bytes; the size of each chunk depends on the maximum request size allowed by Kafka. Finally, it serializes each chunk into an Avro-encoded format and publishes it to the Kafka cluster.
      Due to its architecture, the Distributed Collector can run on an edge node of the Big Data infrastructure as well as on a remote host (proxy server, vNSF, etc.).
      The Distributed Collector publishes only the CSV-converted file to Apache Kafka, not the original one; the binary file remains on the local filesystem of the current host.
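
      As a rough illustration of the chunk-and-publish step described above, the following sketch reads a CSV file, groups lines into chunks that fit under Kafka's maximum request size, Avro-encodes each chunk and publishes it. It assumes the kafka-python and (Python 2) avro packages; the schema, topic name and broker address are illustrative, not the actual Spot code.

{code:python}
# Minimal sketch of the Distributed Collector's chunk-and-publish step.
# Schema, topic and broker address are hypothetical.
import io

import avro.io
import avro.schema
from kafka import KafkaProducer

MAX_REQUEST_SIZE = 1048576  # Kafka's default max.request.size (1 MB)

# Hypothetical Avro schema: each message carries an array of CSV lines.
SCHEMA = avro.schema.parse('''{
    "type": "record", "name": "list",
    "fields": [{"name": "list", "type": {"type": "array", "items": "string"}}]
}''')

def chunks(csv_path, limit=MAX_REQUEST_SIZE):
    '''Yield lists of CSV lines whose combined size stays under the limit.'''
    batch, size = [], 0
    with open(csv_path) as csv_file:
        for line in csv_file:
            if batch and size + len(line) > limit:
                yield batch
                batch, size = [], 0
            batch.append(line)
            size += len(line)
    if batch:
        yield batch

def serialize(lines):
    '''Avro-encode one chunk of CSV lines into raw bytes.'''
    buf = io.BytesIO()
    avro.io.DatumWriter(SCHEMA).write({'list': lines},
                                      avro.io.BinaryEncoder(buf))
    return buf.getvalue()

producer = KafkaProducer(bootstrap_servers='kafka:9092',
                         max_request_size=MAX_REQUEST_SIZE)
for chunk in chunks('/tmp/staging/flow.csv'):
    producer.send('spot-ingest-flow', serialize(chunk))
producer.flush()
{code}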

      In contrast, the Streaming Listener can only run on the central infrastructure. It listens to a specific Kafka topic and consumes incoming messages. Streaming data is divided into batches according to a time interval; the Listener deserializes each batch according to the supported Avro schema, parses it and registers it in the corresponding Hive table.
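
      A minimal sketch of the listener side follows, assuming the pre-Spark-3 Kafka DStream API (pyspark.streaming.kafka) and the avro package: each micro-batch is Avro-decoded, split into fields and appended to a Hive table. The topic, table, column names and batch interval are hypothetical.

{code:python}
# Minimal sketch of the Streaming Listener's receive-deserialize-register
# loop. Topic, table, columns and batch interval are hypothetical.
import io

import avro.io
import avro.schema
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Must match the schema used by the Distributed Collector.
SCHEMA = avro.schema.parse('''{
    "type": "record", "name": "list",
    "fields": [{"name": "list", "type": {"type": "array", "items": "string"}}]
}''')

# Illustrative subset of flow columns; the real table has many more fields.
COLUMNS = ['treceived', 'sip', 'dip', 'sport', 'dport', 'proto']

def deserialize(raw_bytes):
    '''Decode one Avro-encoded message back into its list of CSV lines.'''
    decoder = avro.io.BinaryDecoder(io.BytesIO(raw_bytes))
    return avro.io.DatumReader(SCHEMA).read(decoder)['list']

sc = SparkContext(appName='StreamingListener')
ssc = StreamingContext(sc, 30)      # hypothetical 30-second micro-batches
hive = HiveContext(sc)

stream = KafkaUtils.createDirectStream(
    ssc, ['spot-ingest-flow'],
    {'metadata.broker.list': 'kafka:9092'},
    valueDecoder=deserialize)

def store(rdd):
    '''Parse one micro-batch and append it to the corresponding Hive table.'''
    if not rdd.isEmpty():
        rows = rdd.flatMap(lambda kv: kv[1]) \
                  .map(lambda line: line.strip().split(',')[:len(COLUMNS)])
        hive.createDataFrame(rows, COLUMNS) \
            .write.mode('append').saveAsTable('spot.flow')

stream.foreachRDD(store)
ssc.start()
ssc.awaitTermination()
{code}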

      For each pipeline, the following modules have been implemented:

      • processing - contains methods used to convert and prepare ingested data before it is sent to the Kafka cluster.
      • streaming - contains methods used during the streaming process (such as the table schema); a sketch of both modules follows this list.
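
      The sketch below shows what the two modules might contain for the flow pipeline. The layout, function names and column list are illustrative rather than the actual Spot source, and it assumes the nfdump tool is available for the binary-to-CSV conversion.

{code:python}
# Hypothetical sketch of the two per-pipeline modules for the flow pipeline;
# layout, names and columns are illustrative, not the actual Spot source.

# --- processing module: prepare data before it is sent to Kafka ------------
import os
import subprocess

def convert(nfcapd_file, staging_dir):
    '''Convert a binary nfcapd file to CSV in the local staging area
    (assumes the nfdump tool is installed on the collector host).'''
    csv_file = os.path.join(staging_dir,
                            os.path.basename(nfcapd_file) + '.csv')
    with open(csv_file, 'w') as output:
        subprocess.check_call(['nfdump', '-o', 'csv', '-r', nfcapd_file],
                              stdout=output)
    return csv_file

# --- streaming module: pieces the Streaming Listener needs -----------------
# Illustrative subset of the flow table schema.
COLUMNS = ['treceived', 'sip', 'dip', 'sport', 'dport', 'proto']

def parse(line):
    '''Split one CSV line into the fields of the Hive table.'''
    return line.strip().split(',')
{code}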

      Advantages

      A preliminary traffic-ingestion check of the distributed collector-worker modules was run for all pipelines (flow, dns, proxy) on an existing Spot v1.0 installation. In summary, the module performed flawlessly on netflow files (both .nfcapd and .csv), ingesting an order of magnitude faster than the standard version's ingestion. DNS and proxy ingestion were tested as well, with similar results.

            People

              Assignee: Unassigned
              Reporter: Kostas Tzoulas (ktzoulas)
