Details
Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Description
High-level breakdown of work items / implementation phases:
Design document will be attached soon.
Phase 1: Spark Structured Streaming with regular CarbonData format
----------------------------
This phase mainly focuses on supporting streaming ingestion using Spark Structured Streaming.
1. Write Path Implementation
- Integration with Spark's Structured Streaming framework (FileStreamSink etc.)
- StreamingOutputWriter (StreamingOutputWriterFactory)
- Prepare write (schema validation, segment creation, streaming file creation etc.)
- StreamingRecordWriter (data conversion from Catalyst InternalRow to CarbonData-compatible format, making use of the new load path)
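The write-path contract above can be sketched as follows. This is a minimal Python illustration only: the real implementation would be Scala against Spark's OutputWriter/OutputWriterFactory API, and the class names, the dict-per-row representation (standing in for Catalyst InternalRow), and the JSON-lines on-disk format are all hypothetical stand-ins.

```python
import json
import os


class StreamingRecordWriter:
    """Illustrative record writer: validates each row against the declared
    schema and appends it to a segment file. Plain dicts stand in for
    Catalyst InternalRow; JSON lines stand in for the CarbonData format."""

    def __init__(self, segment_dir, schema):
        self.schema = schema
        self.path = os.path.join(segment_dir, "part-0.stream")
        self._out = open(self.path, "a")
        self.rows_written = 0

    def write(self, row):
        # Schema validation: every declared column must be present.
        missing = [c for c in self.schema if c not in row]
        if missing:
            raise ValueError("row missing columns: %s" % missing)
        self._out.write(json.dumps(row) + "\n")
        self.rows_written += 1

    def close(self):
        self._out.flush()
        self._out.close()


class StreamingOutputWriterFactory:
    """Mirrors the OutputWriterFactory idea: creates the segment directory
    and hands out one writer per task/partition."""

    def __init__(self, schema):
        self.schema = schema

    def new_writer(self, segment_dir):
        os.makedirs(segment_dir, exist_ok=True)
        return StreamingRecordWriter(segment_dir, self.schema)
```

The factory/writer split matters because Spark instantiates one writer per task; keeping segment creation in the factory keeps the per-task writer cheap.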
2. Read Path Implementation (some overlap with Phase 2)
- Modify getSplits() to read from the streaming segment
- Read committed info from metadata to get correct offsets
- Make use of the min-max index if available
- Use sequential scan - data is unsorted, so the B-tree index cannot be used
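The pruning logic in the read path above can be sketched like this. A hypothetical Python stand-in for getSplits(): blocklets carrying min-max statistics are pruned against the query range, while blocklets without statistics must always be scanned sequentially (the streaming data is unsorted, so there is no B-tree index to consult).

```python
def get_splits(blocklets, predicate_col, lo, hi):
    """Keep only blocklets whose [min, max] range for predicate_col
    overlaps the query range [lo, hi]. A blocklet with no stats cannot
    be pruned and is always included for a sequential scan."""
    splits = []
    for b in blocklets:
        stats = b.get("minmax", {}).get(predicate_col)
        if stats is None or (stats[0] <= hi and stats[1] >= lo):
            splits.append(b["path"])
    return splits
```

The overlap test (min <= hi and max >= lo) is the standard interval-intersection check, so a blocklet is skipped only when its whole value range lies outside the predicate.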
3. Compaction
- Minor Compaction
- Major Compaction
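The minor/major split above can be illustrated as follows. This is a sketch under stated assumptions, not the documented CarbonData behavior: it assumes minor compaction merges several small, already-sorted segments, while major compaction re-sorts all rows globally. In-memory row lists stand in for segment files.

```python
import heapq


def minor_compaction(segments, sort_key):
    """Merge several small, individually sorted segments into one larger
    segment. heapq.merge does an n-way merge lazily, so the inputs are
    never fully materialized at once."""
    return list(heapq.merge(*segments, key=sort_key))


def major_compaction(segments, sort_key):
    """Assumed semantics: inputs may be unsorted, so all rows are
    collected and sorted globally into a single segment."""
    rows = [r for s in segments for r in s]
    return sorted(rows, key=sort_key)
```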
4. Metadata Management
- Streaming metadata store (e.g. Offsets, timestamps etc.)
5. Failure Recovery
- Rollback on failure
- Handle asynchronous writes to CarbonData (using hflush)
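Items 4 and 5 above fit together, and a minimal sketch of both follows. All names and the single-file JSON layout are hypothetical; the point illustrated is that offsets/timestamps become durable only at an atomic rename, so after a crash the reader sees the last committed state, and rollback simply truncates the segment back to the last committed length, discarding data that was hflush'd but never committed.

```python
import json
import os


class StreamingCommitLog:
    """Illustrative streaming metadata store: batch id, offsets, and
    timestamp are written to a temp file and atomically renamed into
    place, so a crash mid-write cannot leave a torn status file."""

    def __init__(self, meta_dir):
        os.makedirs(meta_dir, exist_ok=True)
        self.path = os.path.join(meta_dir, "streaming_status")

    def commit(self, batch_id, offsets, timestamp):
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"batch_id": batch_id, "offsets": offsets,
                       "timestamp": timestamp}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)  # atomic commit point

    def last_committed(self):
        if not os.path.exists(self.path):
            return None
        with open(self.path) as f:
            return json.load(f)


def rollback(segment_file, committed_length):
    """Rollback on failure: truncate the segment file back to the last
    committed length, dropping the partially flushed tail."""
    with open(segment_file, "r+b") as f:
        f.truncate(committed_length)
```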
-----------------------------
Phase 2: Spark Structured Streaming with appendable CarbonData format
1. Streaming File Format
- Writers use the V3 file format for appending columnar unsorted data blocklets
- Modify readers to read from the appendable streaming file format
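The append/read handshake of Phase 2 can be sketched as follows. This is not the V3 format itself: length-prefixed byte payloads stand in for columnar blocklets, and the "valid end offset" stands in for whatever commit boundary the metadata records. The idea shown is that writers append blocklets and publish a new valid offset, and readers scan only up to the last published offset, ignoring any uncommitted tail.

```python
import struct


def append_blocklet(buf: bytearray, payload: bytes) -> int:
    """Append one length-prefixed blocklet; the return value is the new
    valid end offset, which the writer would publish as the commit
    boundary after a successful flush."""
    buf += struct.pack("<I", len(payload))
    buf += payload
    return len(buf)


def read_blocklets(buf, valid_end):
    """Readers scan length-prefixed blocklets strictly up to the last
    committed offset; bytes past valid_end are invisible."""
    out, pos = [], 0
    while pos + 4 <= valid_end:
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        out.append(bytes(buf[pos:pos + n]))
        pos += n
    return out
```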
-----------------------------
Phase 3:
1. Interoperability Support - functionality with other features/components
- Concurrent queries with streaming ingestion
- Concurrent operations with streaming ingestion (e.g. compaction, alter table, secondary index etc.)
2. Kafka Connect Ingestion / CarbonData connector
- Direct ingestion from Kafka Connect without Spark Structured Streaming
- Separate Kafka connector to receive data through a network port
- Data commit and offset management
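The "data commit and offset management" point above is the crux of a direct connector, and a minimal sketch of it follows. The class name and in-memory state are hypothetical; the idea shown is committing the offset together with the data, so that replayed records (delivered again after a crash or rebalance) are detected and skipped, making ingestion idempotent.

```python
class ConnectorSink:
    """Illustrative sink for a Kafka Connect-style path: the highest
    committed offset per partition is tracked alongside the data, so a
    replayed record at or below that offset is ignored."""

    def __init__(self):
        self.committed = {}  # partition -> last committed offset
        self.rows = []

    def put(self, partition, offset, row):
        # Skip records already committed (replay after crash/rebalance).
        if offset <= self.committed.get(partition, -1):
            return False
        self.rows.append(row)
        self.committed[partition] = offset
        return True
```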
-----------------------------
Phase 4: Support for other streaming engines
- Analysis of streaming APIs/interfaces of other streaming engines
- Implementation of connectors for different streaming engines (Storm, Flink, Flume, etc.)
----------------------------
Phase 5: In-memory streaming table (probable feature)
1. In-memory Cache for Streaming Data
- Fault-tolerant in-memory buffering / checkpoint with WAL
- Readers read from in-memory tables if available
- Background threads for writing streaming data, etc.
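The WAL/buffer relationship above can be sketched as follows (a hypothetical illustration; a plain list stands in for a durable write-ahead log, and background flushing is omitted). The property shown is that every row hits the WAL before the in-memory buffer, so after a crash the buffer can be rebuilt exactly by replaying the WAL, and readers can serve from memory whenever the table is available.

```python
class InMemoryStreamingTable:
    """Illustrative fault-tolerant in-memory streaming table: WAL-first
    appends make the in-memory buffer fully recoverable by replay."""

    def __init__(self, wal):
        self.wal = wal        # stand-in for a durable write-ahead log
        self.buffer = []      # in-memory table served to readers

    def append(self, row):
        self.wal.append(row)  # durable first: WAL before buffer
        self.buffer.append(row)

    def read(self):
        # Readers consult the in-memory table when it is available.
        return list(self.buffer)

    @classmethod
    def recover(cls, wal):
        # After a crash, rebuild the buffer by replaying the WAL.
        table = cls(wal)
        table.buffer = list(wal)
        return table
```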
Attachments
There are no Sub-Tasks for this issue.