Details
Type: New Feature
Status: Closed
Priority: Major
Resolution: Implemented
Description
The BucketingSink currently has a number of deficiencies.
Given the long list of issues, I would suggest adding a new StreamingFileSink with a new, cleaner design.
Encoders, Parquet, ORC
- It only efficiently supports row-wise data formats (Avro, JSON, sequence files).
- Efforts to add (columnar) compression for blocks of data are inefficient, because blocks cannot span checkpoints due to persistence-on-checkpoint.
- The encoders are part of the flink-connector-filesystem project, rather than living in separate, format-specific projects. This bloats the dependencies of the flink-connector-filesystem project. As an example, the rolling file sink has dependencies on Hadoop and Avro, which messes up dependency management. (A sketch of a possible row/bulk split follows this list.)
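As a minimal sketch of the split the new design could make, with hypothetical interfaces (these are illustrative, not Flink's actual API): row-wise formats encode one record at a time onto an open stream, while bulk formats such as Parquet and ORC own the whole part file and control their own block layout, so columnar blocks need not end at every checkpoint.
{code:java}
import java.io.IOException;
import java.io.OutputStream;

/** Row-wise format: encodes one record at a time onto an already open stream. */
interface RowEncoder<T> {
    void encode(T element, OutputStream stream) throws IOException;
}

/** Bulk format (e.g. Parquet, ORC): the writer controls buffering and block
 *  boundaries itself, rather than being forced to flush at each checkpoint. */
interface BulkPartWriter<T> {
    void addElement(T element) throws IOException;
    void finish() throws IOException; // flush remaining blocks, write file footer
}
{code}
With such a split, the format implementations (and their Hadoop, Avro, or Parquet dependencies) could live in their own modules instead of in flink-connector-filesystem.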
Use of FileSystems
- The BucketingSink works only against Hadoop's FileSystem abstraction; it does not support Flink's own FileSystem abstraction and cannot work with the packaged S3, maprfs, and swift file systems. (A sketch of writing through Flink's abstraction follows this list.)
- The sink therefore needs Hadoop as a dependency.
- The sink relies on "trying out" whether truncation works, which requires write access to the user's working directory.
- The sink relies on enumerating and counting files, rather than maintaining its own state, which makes it less efficient.
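For illustration, writing through Flink's own FileSystem abstraction (org.apache.flink.core.fs) looks roughly like the sketch below; the URI scheme selects whichever packaged file system implementation applies, without a Hadoop dependency. The bucket and path are hypothetical.
{code:java}
// A minimal sketch, assuming Flink's core FileSystem API
// (org.apache.flink.core.fs): the URI scheme ("s3://", "maprfs://",
// "swift://", ...) picks the packaged implementation at runtime.
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

public class FlinkFsWriteSketch {
    public static void main(String[] args) throws Exception {
        Path part = new Path("s3://my-bucket/output/part-0-0"); // hypothetical path
        FileSystem fs = FileSystem.get(part.toUri());
        try (FSDataOutputStream out =
                 fs.create(part, FileSystem.WriteMode.NO_OVERWRITE)) {
            out.write("record\n".getBytes());
        }
    }
}
{code}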
Correctness and Efficiency on S3
- The BucketingSink relies on strongly consistent file enumeration and may therefore work incorrectly on S3.
- The BucketingSink relies on persisting streams at intermediate points. This does not work properly on S3, so there may be data loss on S3. (See the recoverable-writer sketch after this list.)
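The linked sub-issues (see Issue Links below) introduce a RecoverableWriter abstraction for this. As a sketch, with names following the API Flink later shipped (treat the exact signatures as an assumption), the sink stores an opaque resume handle in checkpoint state instead of enumerating files or truncating, and publishes the part file atomically once the checkpoint completes, which also maps well onto S3 multipart uploads.
{code:java}
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;
import org.apache.flink.core.fs.RecoverableFsDataOutputStream;
import org.apache.flink.core.fs.RecoverableWriter;

public class RecoverableWriteSketch {
    public static void main(String[] args) throws Exception {
        Path part = new Path("s3://my-bucket/output/part-0-0"); // hypothetical path
        FileSystem fs = FileSystem.get(part.toUri());
        RecoverableWriter writer = fs.createRecoverableWriter();

        RecoverableFsDataOutputStream out = writer.open(part);
        out.write("record\n".getBytes());

        // On checkpoint: store this handle in operator state; after a
        // failure, writer.recover(resumable) continues the same part file.
        RecoverableWriter.ResumeRecoverable resumable = out.persist();

        // On checkpoint completion: finalize and atomically publish the file.
        out.closeForCommit().commit();
    }
}
{code}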
.valid-length companion file
- The valid-length companion file makes it hard for consumers of the data and should be dropped. (A sketch of the resulting burden on readers follows.)
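To illustrate that burden with a minimal sketch (the companion file naming here is illustrative; the actual prefix and suffix are configurable in BucketingSink): every reader of the output must check for a companion file and stop at the recorded offset, or it may consume trailing bytes from an uncommitted write after a failure.
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class ValidLengthAwareReader {
    /** Reads a part file, honoring an illustrative "<part>.valid-length"
     *  companion file that holds the count of valid bytes. */
    static byte[] readValidBytes(Path part) throws IOException {
        byte[] data = Files.readAllBytes(part);
        Path companion = part.resolveSibling(part.getFileName() + ".valid-length");
        if (Files.exists(companion)) {
            long validLength = Long.parseLong(
                Files.readString(companion, StandardCharsets.UTF_8).trim());
            return Arrays.copyOf(data, (int) validLength);
        }
        return data; // no companion file: the whole file is valid
    }
}
{code}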
We track this design in a series of sub-issues.
Issue Links
- is related to: FLINK-11388 Add an OSS RecoverableWriter (Closed)
- supersedes: FLINK-5789 Make Bucketing Sink independent of Hadoop's FileSystem (Closed)