[CARBONDATA-1072] Streaming Ingestion Feature - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: core, data-load, data-query, examples, file-format, spark-integration, sql
Labels: None

Description

High-level breakdown of work items and implementation phases:
A design document will be attached soon.

Phase 1: Spark Structured Streaming with regular CarbonData format
----------------------------
This phase mainly focuses on supporting streaming ingestion using
Spark Structured Streaming.
1. Write Path Implementation

Integration with Spark's Structured Streaming framework
(FileStreamSink, etc.)
StreamingOutputWriter (StreamingOutputWriterFactory)
Prepare write (schema validation, segment creation,
streaming file creation, etc.)
StreamingRecordWriter (data conversion from Catalyst InternalRow
to a CarbonData-compatible format, making use of the new load path)
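
The "prepare write" and record-conversion steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the schema representation (name/type pairs) and the function names validate_schema and rows_to_columns are hypothetical, and the real writer would operate on Catalyst InternalRow objects and CarbonData's load path rather than Python tuples.

```python
from typing import Any, Dict, List, Sequence, Tuple

# Hypothetical schema shape: (column name, Python type) pairs.
Schema = Sequence[Tuple[str, type]]


def validate_schema(table: Schema, incoming: Schema) -> None:
    """Raise ValueError if the stream schema does not match the table schema."""
    if list(table) != list(incoming):
        raise ValueError(
            f"schema mismatch: table={list(table)} incoming={list(incoming)}"
        )


def rows_to_columns(schema: Schema, rows: List[tuple]) -> Dict[str, List[Any]]:
    """Pivot row-oriented records into one value list per column."""
    columns: Dict[str, List[Any]] = {name: [] for name, _ in schema}
    for row in rows:
        if len(row) != len(schema):
            raise ValueError(f"row has {len(row)} fields, expected {len(schema)}")
        for (name, col_type), value in zip(schema, row):
            # None is allowed and stands for a null value.
            if value is not None and not isinstance(value, col_type):
                raise TypeError(
                    f"column {name!r}: expected {col_type.__name__}, "
                    f"got {type(value).__name__}"
                )
            columns[name].append(value)
    return columns
```

Validating once per micro-batch before any file is created keeps a malformed stream from leaving a half-written segment behind.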

2. Read Path Implementation (some overlap with phase-2)

Modify getSplits() to read from the streaming segment
Read committed info from metadata to get the correct offsets
Make use of the min-max index if available
Use a sequential scan: the data is unsorted, so the B-tree index cannot be used
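
As a rough illustration of the read path above, the sketch below prunes blocks with a min-max index when one is present and otherwise falls back to a sequential scan. The Block class and scan_equal function are hypothetical names used only for this example; they are not CarbonData APIs.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class Block:
    """Hypothetical unsorted block of values with an optional min-max index."""
    rows: List[int]
    min_value: Optional[int] = None
    max_value: Optional[int] = None


def scan_equal(blocks: List[Block], target: int) -> Iterator[Tuple[int, int]]:
    """Yield (block index, row index) for every row equal to target."""
    for block_idx, block in enumerate(blocks):
        has_index = block.min_value is not None and block.max_value is not None
        # Skip the block only when the index proves it cannot contain target.
        if has_index and not (block.min_value <= target <= block.max_value):
            continue
        # Rows are unsorted, so every row in a surviving block is checked.
        for row_idx, value in enumerate(block.rows):
            if value == target:
                yield block_idx, row_idx
```

A block without an index is always scanned, which matches the "if available" wording: the index is an optimization and never a correctness requirement.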

3. Compaction

Minor Compaction
Major Compaction

4. Metadata Management

Streaming metadata store (e.g. offsets, timestamps)

5. Failure Recovery

Rollback on failure
Handle asynchronous writes to CarbonData (using hflush)
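
One possible way to combine the metadata store (items 4) and rollback on failure (item 5) is sketched below: data is appended and flushed first, the committed length and source offset are then recorded atomically, and recovery truncates anything past the last commit. This is a local-filesystem sketch under assumptions: os.fsync stands in for an HDFS hflush, and the file layout and the function names append_batch, commit and recover are hypothetical.

```python
import json
import os


def append_batch(data_path: str, payload: bytes) -> int:
    """Append payload, force it to disk, and return the new file length."""
    with open(data_path, "ab") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # local stand-in for an HDFS hflush
    return os.path.getsize(data_path)


def commit(meta_path: str, source_offset: int, data_length: int) -> None:
    """Atomically record the source offset and the committed data length."""
    tmp_path = meta_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"source_offset": source_offset, "data_length": data_length}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, meta_path)  # atomic rename on the same filesystem


def recover(data_path: str, meta_path: str) -> int:
    """Roll back uncommitted bytes and return the source offset to resume from."""
    meta = {"source_offset": 0, "data_length": 0}
    if os.path.exists(meta_path):
        with open(meta_path) as f:
            meta = json.load(f)
    if os.path.exists(data_path):
        with open(data_path, "r+b") as f:
            f.truncate(meta["data_length"])
    return meta["source_offset"]
```

Because readers would only trust the committed length, a crash between append_batch and commit leaves extra bytes that are invisible to queries and removed on the next recovery.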
-----------------------------
Phase 2: Spark Structured Streaming with appendable CarbonData format
1. Streaming File Format
Writers use the V3 file format to append columnar, unsorted
data blocklets
Modify readers to read from the appendable streaming file format
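
The idea of an appendable file read up to a committed length can be sketched as follows. The layout here (a 4-byte big-endian length prefix followed by 8-byte integers) is invented for illustration and is not the CarbonData V3 layout; append_blocklet and read_blocklets are hypothetical names.

```python
import struct
from typing import BinaryIO, Iterator, List

_HEADER = struct.Struct(">I")  # 4-byte big-endian body length


def append_blocklet(f: BinaryIO, values: List[int]) -> None:
    """Append one length-prefixed blocklet of 64-bit integers at the current position."""
    body = struct.pack(f">{len(values)}q", *values)
    f.write(_HEADER.pack(len(body)))
    f.write(body)


def read_blocklets(f: BinaryIO, committed_length: int) -> Iterator[List[int]]:
    """Yield blocklets that lie entirely within the committed length."""
    f.seek(0)
    while f.tell() + _HEADER.size <= committed_length:
        (size,) = _HEADER.unpack(f.read(_HEADER.size))
        if f.tell() + size > committed_length:
            break  # blocklet extends past the commit point; ignore it
        body = f.read(size)
        yield list(struct.unpack(f">{size // 8}q", body))
```

Bounding the reader by a committed length lets a writer keep appending to the same file while concurrent readers see only complete, committed blocklets.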
-----------------------------
Phase 3:
1. Interoperability Support
Functionality with other features/components
Concurrent queries with streaming ingestion
Concurrent operations with streaming ingestion (e.g. compaction,
alter table, secondary index)
2. Kafka Connect Ingestion / CarbonData connector
Direct ingestion from Kafka Connect without Spark Structured
Streaming
Separate Kafka connector to receive data through a network port
Data commit and offset management
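
A minimal sketch of "data commit and offset management" for such a connector is shown below, assuming records arrive as (partition, offset, value) tuples and may be redelivered after a failure. The class OffsetTrackingSink is hypothetical and uses no Kafka client library; it only illustrates committing rows and offsets together so that redelivered records are skipped.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape: (partition, offset, value).
Record = Tuple[int, int, str]


class OffsetTrackingSink:
    """In-memory sink that commits rows and per-partition offsets together."""

    def __init__(self) -> None:
        self.committed: Dict[int, int] = {}  # partition -> next expected offset
        self.rows: List[str] = []

    def put(self, records: Iterable[Record]) -> int:
        """Commit new records, skip redelivered ones, return the number written."""
        staged_rows: List[str] = []
        staged_offsets = dict(self.committed)
        for partition, offset, value in records:
            if offset < staged_offsets.get(partition, 0):
                continue  # already committed in an earlier batch
            staged_rows.append(value)
            staged_offsets[partition] = offset + 1
        # Rows and offsets are published in one step; a real sink would
        # need a durable, atomic equivalent of these two assignments.
        self.rows.extend(staged_rows)
        self.committed = staged_offsets
        return len(staged_rows)
```

Storing the offsets alongside the data, rather than relying only on the broker's consumer offsets, is what makes a replayed batch harmless in this sketch.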
-----------------------------
Phase 4: Support for other streaming engines
Analysis of streaming APIs/interfaces of other streaming engines
Implementation of connectors for different streaming engines (Storm,
Flink, Flume, etc.)
----------------------------
Phase 5: In-memory streaming table (probable feature)
1. In-memory Cache for Streaming Data
Fault-tolerant in-memory buffering / checkpointing with a WAL
Readers read from in-memory tables if available
Background threads for writing streaming data, etc.
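
The WAL-backed in-memory buffer could look roughly like the sketch below. It assumes a local file with one JSON row per line; the class WalBuffer and its methods are hypothetical, and a real implementation would add background flushing threads and concurrency control.

```python
import json
import os
from typing import Callable, List


class WalBuffer:
    """In-memory row buffer whose contents survive a restart via a WAL file."""

    def __init__(self, wal_path: str) -> None:
        self.wal_path = wal_path
        self.rows: List[dict] = []
        if os.path.exists(wal_path):
            with open(wal_path) as f:
                for line in f:
                    try:
                        self.rows.append(json.loads(line))
                    except ValueError:
                        break  # torn final record from a crash; stop replay

    def append(self, row: dict) -> None:
        """Log the row durably first, then make it visible to readers."""
        with open(self.wal_path, "a") as f:
            f.write(json.dumps(row) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.rows.append(row)

    def checkpoint(self, persist: Callable[[List[dict]], None]) -> None:
        """Persist buffered rows via the callback, then clear buffer and WAL."""
        persist(list(self.rows))
        open(self.wal_path, "w").close()  # truncate only after persist returns
        self.rows = []
```

Truncating the log only after the persist callback returns means a crash during a checkpoint can replay rows twice but never lose them, so the persist step should be idempotent.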