[FLINK-19161] Port File Sources to FLIP-27 API - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.12.0
Component/s: Connectors / FileSystem
Labels:
- pull-request-available

Description

Porting the File sources to the FLIP-27 API means combining:

- The FileInputFormat from the DataSet Batch API
- The Monitoring File Source from the DataStream API

The two already share the same reader code and part of the enumeration code.

Structure

The new File Source will have three components:

- File enumerators, which discover the files.
- File split assigners, which decide which reader gets which split.
- File reader formats, which deal with the decoding.

The main difference between the bounded (batch) version and the unbounded (streaming) version is that the streaming version repeatedly invokes the file enumerator to search for new files.
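
As a rough illustration of how the three components could fit together, here is a minimal Java sketch. Every type and method name in it (FileSplit, FileEnumerator, FileSplitAssigner, FileReaderFormat, FifoAssigner, DiscoveryLoop) is hypothetical and chosen only for this example; it is not the interface set proposed in the issue. The sketch assumes that the bounded version would call discover() once, while the unbounded version would call the same method periodically.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// All names are hypothetical; they only illustrate the three roles.

/** A unit of work: here simply one file, identified by its path. */
final class FileSplit {
    final String path;

    FileSplit(String path) {
        this.path = path;
    }
}

/** Role 1: discovers the files. */
interface FileEnumerator {
    Collection<FileSplit> enumerateSplits();
}

/** Role 2: decides which reader gets which split. */
interface FileSplitAssigner {
    void addSplits(Collection<FileSplit> splits);

    Optional<FileSplit> getNext(int readerId);
}

/** Role 3: decodes the contents of a split into records. */
interface FileReaderFormat<T> {
    List<T> read(FileSplit split);
}

/** Simplest possible assigner: first come, first served, ignoring the reader id. */
final class FifoAssigner implements FileSplitAssigner {
    private final Deque<FileSplit> pending = new ArrayDeque<>();

    @Override
    public void addSplits(Collection<FileSplit> splits) {
        pending.addAll(splits);
    }

    @Override
    public Optional<FileSplit> getNext(int readerId) {
        return Optional.ofNullable(pending.poll());
    }
}

/** Connects an enumerator to an assigner. */
final class DiscoveryLoop {
    private final FileEnumerator enumerator;
    private final FileSplitAssigner assigner;
    private final Set<String> seenPaths = new HashSet<>();

    DiscoveryLoop(FileEnumerator enumerator, FileSplitAssigner assigner) {
        this.enumerator = enumerator;
        this.assigner = assigner;
    }

    /**
     * Bounded mode would call this once; continuous mode would call it repeatedly.
     * Returns the number of splits that were not seen in an earlier call.
     */
    int discover() {
        List<FileSplit> fresh = new ArrayList<>();
        for (FileSplit split : enumerator.enumerateSplits()) {
            if (seenPaths.add(split.path)) {
                fresh.add(split);
            }
        }
        assigner.addSplits(fresh);
        return fresh.size();
    }
}
```

Keeping the set of seen paths in the loop rather than in the enumerator means the same enumerator could serve both modes unchanged; that placement is a design choice of this sketch only.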

Checkpointing Enumerators

The enumerators need to checkpoint the not-yet-assigned splits and, if they are in continuous discovery mode (streaming), the paths / timestamps already processed.
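
A minimal Java sketch of such enumerator state follows. The class names (EnumeratorCheckpoint, ContinuousEnumeratorSketch) and the representation of splits as plain path strings are hypothetical. The sketch also assumes, purely for illustration, that a path is reconsidered only when its modification timestamp is newer than the one already recorded; the issue does not specify how timestamps are used.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical checkpoint contents: pending splits plus what was already processed. */
final class EnumeratorCheckpoint {
    final List<String> unassignedSplits;
    final Map<String, Long> processedPaths; // path -> modification timestamp

    EnumeratorCheckpoint(Collection<String> unassigned, Map<String, Long> processed) {
        // Defensive copies, so later enumerator activity cannot alter the snapshot.
        this.unassignedSplits = new ArrayList<>(unassigned);
        this.processedPaths = new HashMap<>(processed);
    }
}

/** Hypothetical enumerator in continuous discovery mode. */
final class ContinuousEnumeratorSketch {
    private final List<String> unassigned = new ArrayList<>();
    private final Map<String, Long> processed = new HashMap<>();

    /**
     * Registers a discovered file. Returns false if the path was already
     * handled with the same or a newer timestamp (an assumption of this sketch).
     */
    boolean offer(String path, long modificationTime) {
        Long previous = processed.get(path);
        if (previous != null && previous >= modificationTime) {
            return false;
        }
        processed.put(path, modificationTime);
        unassigned.add(path);
        return true;
    }

    /** Hands the oldest pending split to a reader, or returns null if none is left. */
    String assignNext() {
        return unassigned.isEmpty() ? null : unassigned.remove(0);
    }

    /** Captures exactly the two pieces of state named in the text above. */
    EnumeratorCheckpoint snapshot() {
        return new EnumeratorCheckpoint(unassigned, processed);
    }

    /** Rebuilds an enumerator from a checkpoint. */
    static ContinuousEnumeratorSketch restore(EnumeratorCheckpoint checkpoint) {
        ContinuousEnumeratorSketch enumerator = new ContinuousEnumeratorSketch();
        enumerator.unassigned.addAll(checkpoint.unassignedSplits);
        enumerator.processed.putAll(checkpoint.processedPaths);
        return enumerator;
    }
}
```

After a restore, a split that was already assigned before the checkpoint is not handed out again by the enumerator, and a file that was already processed is not rediscovered, because both facts are part of the snapshot.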

Checkpointing Readers

The new File Source needs to ensure that every reader can be checkpointed.
Some readers may be able to expose the position in the input file that corresponds to the latest emitted record, but many will not be able to do that because they:

- store compressed record batches
- use buffered decoders where exact position information is not accessible

We therefore suggest exposing a mechanism that combines a seekable file offset with a count of records to read and skip after that offset. In the extreme cases, a format may work only with seekable positions or only with records-to-skip. Some sources, like Avro, can have periodic seek points (sync markers) and count records-to-skip after these markers.
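
The combination of offset and records-to-skip can be sketched in Java as follows. ReaderPosition and BlockReaderSketch are hypothetical names for this example. The toy reader works on in-memory blocks of strings, where a block index stands in for a seekable byte offset such as a sync marker, so it only illustrates the bookkeeping and not real file I/O.

```java
import java.util.List;

/** Hypothetical position: where to seek, and how many records to skip afterwards. */
final class ReaderPosition {
    final long offset;        // seek point (here: a block index)
    final long recordsToSkip; // records to decode and discard after the seek point

    ReaderPosition(long offset, long recordsToSkip) {
        this.offset = offset;
        this.recordsToSkip = recordsToSkip;
    }
}

/** Toy reader over blocks of records; each block start acts as a seek point. */
final class BlockReaderSketch {
    private final List<List<String>> blocks;
    private int block;
    private int indexInBlock;

    BlockReaderSketch(List<List<String>> blocks, ReaderPosition start) {
        this.blocks = blocks;
        this.block = (int) start.offset; // "seek" to the recorded block
        this.indexInBlock = 0;
        // Re-read and discard the records that were emitted before the checkpoint.
        for (long i = 0; i < start.recordsToSkip; i++) {
            next();
        }
    }

    /** Returns the next record, or null at the end of the input. */
    String next() {
        while (block < blocks.size() && indexInBlock >= blocks.get(block).size()) {
            block++;
            indexInBlock = 0;
        }
        if (block >= blocks.size()) {
            return null;
        }
        return blocks.get(block).get(indexInBlock++);
    }

    /** Position to store in a checkpoint: last seek point plus records read since. */
    ReaderPosition position() {
        return new ReaderPosition(block, indexInBlock);
    }
}
```

In this sketch, a format with exact positions would always report zero records to skip, while a format without any seek points would always report the start of the input as its offset and rely on the skip count alone.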

Efficient and Convenient Readers

To balance efficiency (batch vectorized reading of ORC / Parquet for vectorized query processing) and convenience (plugging in a third-party CSV decoder over a stream), we offer three abstractions for record readers: