Details
Type: New Feature
Status: Closed
Priority: Major
Resolution: Implemented
Description
The BucketingSink currently has a number of deficiencies.
Given the long list of issues, I would suggest adding a new StreamingFileSink with a new, cleaner design.
Encoders, Parquet, ORC
- It only efficiently supports row-wise data formats (Avro, JSON, sequence files).
- Efforts to add (columnar) compression for blocks of data are inefficient, because blocks cannot span checkpoints due to persistence-on-checkpoint.
- The encoders are part of the flink-connector-filesystem project, rather than living in separate, format-specific projects. This bloats the dependencies of the flink-connector-filesystem project. As an example, the rolling file sink has dependencies on Hadoop and Avro, which messes up dependency management. (A sketch of a possible row/bulk split follows this list.)
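As a minimal sketch of the split the new design could make, with hypothetical interfaces (these are illustrative, not Flink's actual API): row-wise formats encode one record at a time onto an open stream, while bulk formats such as Parquet and ORC own the whole part file and control their own block layout, so columnar blocks need not end at every checkpoint.
{code:java}
import java.io.IOException;
import java.io.OutputStream;

/** Row-wise format: encodes one record at a time onto an already open stream. */
interface RowEncoder<T> {
    void encode(T element, OutputStream stream) throws IOException;
}

/** Bulk format (e.g. Parquet, ORC): the writer controls buffering and block
 *  boundaries itself, rather than being forced to flush at each checkpoint. */
interface BulkPartWriter<T> {
    void addElement(T element) throws IOException;
    void finish() throws IOException; // flush remaining blocks, write file footer
}
{code}
With such a split, the format implementations (and their Hadoop, Avro, or Parquet dependencies) could live in their own modules instead of in flink-connector-filesystem.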
Use of FileSystems
- The BucketingSink works only against Hadoop's FileSystem abstraction; it does not support Flink's own FileSystem abstraction and cannot work with the packaged S3, maprfs, and swift file systems. (A sketch of writing through Flink's abstraction follows this list.)
- The sink therefore needs Hadoop as a dependency.
- The sink relies on "trying out" whether truncation works, which requires write access to the user's working directory.
- The sink relies on enumerating and counting files, rather than maintaining its own state, which makes it less efficient.
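For illustration, writing through Flink's own FileSystem abstraction (org.apache.flink.core.fs) looks roughly like the sketch below; the URI scheme selects whichever packaged file system implementation applies, without a Hadoop dependency. The bucket and path are hypothetical.
{code:java}
// A minimal sketch, assuming Flink's core FileSystem API
// (org.apache.flink.core.fs): the URI scheme ("s3://", "maprfs://",
// "swift://", ...) picks the packaged implementation at runtime.
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

public class FlinkFsWriteSketch {
    public static void main(String[] args) throws Exception {
        Path part = new Path("s3://my-bucket/output/part-0-0"); // hypothetical path
        FileSystem fs = FileSystem.get(part.toUri());
        try (FSDataOutputStream out =
                 fs.create(part, FileSystem.WriteMode.NO_OVERWRITE)) {
            out.write("record\n".getBytes());
        }
    }
}
{code}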
Correctness and Efficiency on S3
- The BucketingSink relies on strongly consistent file enumeration and may therefore work incorrectly on S3.
- The BucketingSink relies on persisting streams at intermediate points. This does not work properly on S3, so there may be data loss on S3. (See the recoverable-writer sketch after this list.)
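The linked sub-issues (see Issue Links below) introduce a RecoverableWriter abstraction for this. As a sketch, with names following the API Flink later shipped (treat the exact signatures as an assumption), the sink stores an opaque resume handle in checkpoint state instead of enumerating files or truncating, and publishes the part file atomically once the checkpoint completes, which also maps well onto S3 multipart uploads.
{code:java}
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;
import org.apache.flink.core.fs.RecoverableFsDataOutputStream;
import org.apache.flink.core.fs.RecoverableWriter;

public class RecoverableWriteSketch {
    public static void main(String[] args) throws Exception {
        Path part = new Path("s3://my-bucket/output/part-0-0"); // hypothetical path
        FileSystem fs = FileSystem.get(part.toUri());
        RecoverableWriter writer = fs.createRecoverableWriter();

        RecoverableFsDataOutputStream out = writer.open(part);
        out.write("record\n".getBytes());

        // On checkpoint: store this handle in operator state; after a
        // failure, writer.recover(resumable) continues the same part file.
        RecoverableWriter.ResumeRecoverable resumable = out.persist();

        // On checkpoint completion: finalize and atomically publish the file.
        out.closeForCommit().commit();
    }
}
{code}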
.valid-length companion file
- The valid-length companion file makes it hard for consumers of the data and should be dropped. (A sketch of the resulting burden on readers follows.)
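To illustrate that burden with a minimal sketch (the companion file naming here is illustrative; the actual prefix and suffix are configurable in BucketingSink): every reader of the output must check for a companion file and stop at the recorded offset, or it may consume trailing bytes from an uncommitted write after a failure.
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class ValidLengthAwareReader {
    /** Reads a part file, honoring an illustrative "<part>.valid-length"
     *  companion file that holds the count of valid bytes. */
    static byte[] readValidBytes(Path part) throws IOException {
        byte[] data = Files.readAllBytes(part);
        Path companion = part.resolveSibling(part.getFileName() + ".valid-length");
        if (Files.exists(companion)) {
            long validLength = Long.parseLong(
                Files.readString(companion, StandardCharsets.UTF_8).trim());
            return Arrays.copyOf(data, (int) validLength);
        }
        return data; // no companion file: the whole file is valid
    }
}
{code}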
We track this design in a series of sub-issues.
Issue Links
- is related to: FLINK-11388 Add an OSS RecoverableWriter (Closed)
- supersedes: FLINK-5789 Make Bucketing Sink independent of Hadoop's FileSystem (Closed)