CarbonData / CARBONDATA-1072

Streaming Ingestion Feature




      High-level breakdown of work items / implementation phases:
      Design document will be attached soon.

      Phase – 1 : Spark Structured Streaming with regular CarbonData format
      This phase mainly focuses on supporting streaming ingestion using
      Spark Structured Streaming.
      1. Write Path Implementation

      • Integration with Spark’s Structured Streaming framework
        (FileStreamSink etc.)
      • StreamingOutputWriter (StreamingOutputWriterFactory)
      • Prepare write (schema validation, segment creation,
        streaming file creation etc.)
      • StreamingRecordWriter (data conversion from Catalyst InternalRow
        to a CarbonData-compatible format, making use of the new load path)
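As a rough illustration of the StreamingRecordWriter conversion step, the sketch below turns a positional engine-internal row into a byte-serialized form ready for a streaming file. The class name, the `Object[]` stand-in for Catalyst's InternalRow, and the naive UTF-8 encoding are all hypothetical, not the actual CarbonData API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the write-path conversion step: turning an
// engine-internal row (stand-in for Catalyst InternalRow) into a
// CarbonData-compatible row before it reaches the streaming file writer.
class StreamingRecordWriterSketch {
    // Every value is serialized to UTF-8 bytes; nulls are preserved as null
    // so the writer can emit an explicit null marker downstream.
    static List<byte[]> convert(Object[] row) {
        List<byte[]> fields = new ArrayList<>();
        for (Object v : row) {
            fields.add(v == null
                    ? null
                    : v.toString().getBytes(java.nio.charset.StandardCharsets.UTF_8));
        }
        return fields;
    }
}
```

A real implementation would dispatch on the table schema's data types rather than calling `toString()`, but the shape of the conversion is the same.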

      2. Read Path Implementation (some overlap with phase-2)

      • Modify getSplits() to read from the streaming segment
      • Read committed info from metadata to get correct offsets
      • Make use of the Min-Max index if available
      • Use sequential scan - data is unsorted, so the B-tree index
        cannot be used
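The interaction between the last two bullets can be sketched as follows: even though the unsorted data forces a sequential scan, per-blocklet min/max statistics (when present) let the reader skip blocklets whose value range cannot match the filter. Names here are illustrative, not CarbonData's actual index classes:

```java
// Sketch of min-max pruning during the sequential scan of a streaming
// segment: no B-tree is available for unsorted data, but blocklet-level
// min/max stats can still eliminate blocklets before scanning them.
class MinMaxPruneSketch {
    static final class BlockletStats {
        final long min, max;
        BlockletStats(long min, long max) { this.min = min; this.max = max; }
    }

    // Returns true if the blocklet may contain `value` and must be scanned;
    // a null stats object means "no index available" => full sequential scan.
    static boolean mustScan(BlockletStats stats, long value) {
        if (stats == null) return true;
        return stats.min <= value && value <= stats.max;
    }
}
```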

      3. Compaction

      • Minor Compaction
      • Major Compaction
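For minor compaction, one plausible planning policy is to group consecutive small streaming segments until a size threshold is reached, so readers open fewer files. The grouping rule and names below are an illustrative sketch, not the compaction policy CarbonData actually ships:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a minor-compaction planning pass over streaming segments.
class CompactionSketch {
    static final class Segment {
        final int id; final long sizeBytes;
        Segment(int id, long sizeBytes) { this.id = id; this.sizeBytes = sizeBytes; }
    }

    // Greedily group consecutive segments so each group's total size is at
    // least `threshold`; the trailing group may remain smaller.
    static List<List<Segment>> plan(List<Segment> segments, long threshold) {
        List<List<Segment>> groups = new ArrayList<>();
        List<Segment> current = new ArrayList<>();
        long currentSize = 0;
        for (Segment s : segments) {
            current.add(s);
            currentSize += s.sizeBytes;
            if (currentSize >= threshold) {   // group is big enough: close it
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
        }
        if (!current.isEmpty()) groups.add(current);
        return groups;
    }
}
```

Major compaction would apply the same idea across already-compacted segments with a larger threshold.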

      4. Metadata Management

      • Streaming metadata store (e.g. offsets, timestamps, etc.)
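The metadata store's contract can be sketched as: each successful micro-batch records its batch id, the source offset reached, and a timestamp, and readers consult the last committed entry to know how far they may safely read. The in-memory store below is illustrative only; a real implementation would persist this alongside the table status:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a streaming metadata store tracking per-batch commits.
class StreamingMetaStoreSketch {
    static final class Commit {
        final long batchId, offset, timestampMs;
        Commit(long batchId, long offset, long timestampMs) {
            this.batchId = batchId; this.offset = offset; this.timestampMs = timestampMs;
        }
    }

    private final List<Commit> commits = new ArrayList<>();

    // Batch ids must be strictly increasing; a re-delivered batch is
    // rejected so it is not committed twice.
    boolean commit(Commit c) {
        if (!commits.isEmpty() && commits.get(commits.size() - 1).batchId >= c.batchId) {
            return false;
        }
        commits.add(c);
        return true;
    }

    // The offset up to which data is safely readable, or -1 if none yet.
    long lastCommittedOffset() {
        return commits.isEmpty() ? -1 : commits.get(commits.size() - 1).offset;
    }
}
```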

      5. Failure Recovery

      • Rollback on failure
      • Handle asynchronous writes to CarbonData (using hflush)
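The rollback bullet can be made concrete with a small model: bytes are flushed (hflush-style) as they arrive but only become visible once a committed length is recorded, and recovery logically truncates the file back to that length, discarding the partial tail. This is a sketch of the idea, not the actual CarbonData recovery code:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of rollback-on-failure for an appendable streaming file.
class RecoverySketch {
    private final List<Byte> data = new ArrayList<>();
    private int committedLength = 0;

    void append(byte[] bytes) {            // flushed but not yet committed
        for (byte b : bytes) data.add(b);
    }

    void commit() {                        // publish the new visible length
        committedLength = data.size();
    }

    void recover() {                       // drop the uncommitted tail
        while (data.size() > committedLength) data.remove(data.size() - 1);
    }

    int physicalLength() { return data.size(); }
    int visibleLength() { return committedLength; }
}
```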
      Phase – 2 : Spark Structured Streaming with appendable CarbonData format
      1. Streaming File Format

      • Writers use the V3 file format for appending columnar unsorted
        data blocklets
      • Modify readers to read from the appendable streaming file format
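One way to picture the appendable layout: unsorted columnar blocklets are appended at the end of the file, and a small in-file index records each blocklet's start offset and row count so readers can locate blocklets without a footer rewrite. The layout and names below are an illustrative sketch, not the V3 format specification:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of an append-only streaming file of columnar blocklets.
class AppendableFormatSketch {
    static final class BlockletEntry {
        final long offset; final int rows;
        BlockletEntry(long offset, int rows) { this.offset = offset; this.rows = rows; }
    }

    private long length = 0;                        // bytes written so far
    private final List<BlockletEntry> index = new ArrayList<>();

    // Append one blocklet: record where it starts, then advance the length.
    void appendBlocklet(byte[] blocklet, int rows) {
        index.add(new BlockletEntry(length, rows));
        length += blocklet.length;
    }

    int totalRows() {
        int sum = 0;
        for (BlockletEntry e : index) sum += e.rows;
        return sum;
    }

    List<BlockletEntry> index() { return index; }
}
```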
      Phase – 3 :
      1. Interoperability Support

      • Functionality with other features/components
      • Concurrent queries with streaming ingestion
      • Concurrent operations with streaming ingestion (e.g. compaction,
        alter table, secondary index, etc.)
      2. Kafka Connect Ingestion / CarbonData connector

      • Direct ingestion from Kafka Connect without Spark Structured
        Streaming
      • Separate Kafka connector to receive data through a network port
      • Data commit and offset management
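For the commit/offset bullet, the essential ordering in a direct (non-Spark) ingestion path is: write and commit the data to the table first, and acknowledge the source offset only afterwards, which yields at-least-once delivery. The sketch below models that ordering with hypothetical names; it is not the Kafka Connect API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of data-commit-before-offset-ack ordering for a connector sink.
class ConnectorCommitSketch {
    private final List<String> committedRows = new ArrayList<>();
    private long ackedOffset = -1;

    void deliver(List<String> rows, long upToOffset) {
        committedRows.addAll(rows);   // 1. commit the data to the table
        ackedOffset = upToOffset;     // 2. only then acknowledge the offset
    }

    long ackedOffset() { return ackedOffset; }
    int rowCount() { return committedRows.size(); }
}
```

If the process dies between the two steps, the un-acked offset is re-delivered on restart, so duplicates are possible but data loss is not.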
      Phase – 4 : Support for other streaming engines

      • Analysis of streaming APIs/interfaces of other streaming engines
      • Implementation of connectors for different streaming engines (Storm,
        Flink, Flume, etc.)
      Phase – 5 : In-memory streaming table (probable feature)
      1. In-memory Cache for Streaming Data

      • Fault-tolerant in-memory buffering / checkpoint with WAL
      • Readers read from in-memory tables if available
      • Background threads for writing streaming data, etc.
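The WAL bullet above can be sketched as: every row is appended to the write-ahead log before it enters the in-memory cache, so the cache can be rebuilt from the WAL after a crash. No real persistence here; all names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a WAL-backed in-memory buffer for a streaming table.
class InMemoryTableSketch {
    final List<String> wal = new ArrayList<>();     // durable in a real system
    final List<String> cache = new ArrayList<>();   // what readers scan

    void write(String row) {
        wal.add(row);      // 1. log first
        cache.add(row);    // 2. then make it readable in memory
    }

    // Crash recovery: replay the WAL into a fresh cache.
    static InMemoryTableSketch recover(List<String> wal) {
        InMemoryTableSketch t = new InMemoryTableSketch();
        for (String row : wal) t.write(row);
        return t;
    }
}
```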





            • Assignee: Aniket Adnaik
                Time Tracking

                Original Estimate - Not Specified
                Remaining Estimate - 0h
                Time Spent - 8h 10m