FLINK-9749

Rework Bucketing Sink


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented

    Description

      The BucketingSink has a number of deficiencies at the moment.

      Given this long list of issues, I suggest adding a new StreamingFileSink with a new, cleaner design.
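
      A minimal sketch of what using such a sink could look like. Apart from the StreamingFileSink name itself, everything here (the forRowFormat/SimpleStringEncoder builder style and the paths) is an assumption at this design stage, not a committed API:

      {code:java}
      import org.apache.flink.api.common.serialization.SimpleStringEncoder;
      import org.apache.flink.core.fs.Path;
      import org.apache.flink.streaming.api.datastream.DataStream;
      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
      import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

      public class StreamingFileSinkSketch {
          public static void main(String[] args) throws Exception {
              StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
              DataStream<String> stream = env.fromElements("a", "b", "c");

              // Row-wise sink: one encoder call per record; files roll and are
              // committed in lockstep with checkpoints.
              StreamingFileSink<String> sink = StreamingFileSink
                      .forRowFormat(new Path("file:///tmp/output"), new SimpleStringEncoder<String>("UTF-8"))
                      .build();

              stream.addSink(sink);
              env.execute("streaming-file-sink-sketch");
          }
      }
      {code}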

      Encoders, Parquet, ORC

      • It efficiently supports only row-wise data formats (Avro, JSON, sequence files); see the encoder sketch after this list.
      • Efforts to add (columnar) compression for blocks of data are inefficient, because blocks cannot span checkpoints due to persistence-on-checkpoint.
      • The encoders are part of the flink-connector-filesystem project, rather than living in orthogonal format projects. This blows up the dependencies of the flink-connector-filesystem project. As an example, the rolling file sink has dependencies on Hadoop and Avro, which messes up dependency management.
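
      To make the row-wise restriction concrete, here is a sketch of what a per-record encoder contract could look like. The Encoder interface and JsonLinesEncoder class are illustrative assumptions, not existing APIs:

      {code:java}
      import java.io.IOException;
      import java.io.OutputStream;
      import java.io.Serializable;
      import java.nio.charset.StandardCharsets;

      // Assumed per-record contract for row-wise formats: each record is written
      // independently, so the sink can cut the file at any record boundary.
      interface Encoder<IN> extends Serializable {
          void encode(IN element, OutputStream stream) throws IOException;
      }

      // Illustrative JSON-lines encoder. Block/columnar formats (Parquet, ORC)
      // do not fit this contract, because they buffer many records into blocks
      // that would have to span checkpoints.
      class JsonLinesEncoder implements Encoder<String> {
          @Override
          public void encode(String element, OutputStream stream) throws IOException {
              stream.write(element.getBytes(StandardCharsets.UTF_8));
              stream.write('\n');
          }
      }
      {code}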

      Use of FileSystems

      • The BucketingSink works only against Hadoop's FileSystem abstraction, does not support Flink's own FileSystem abstraction, and cannot work with the packaged S3, maprfs, and swift file systems (see the sketch after this list).
      • The sink hence needs Hadoop as a dependency.
      • The sink relies on "trying out" whether truncation works, which requires write access to the user's working directory.
      • The sink relies on enumerating and counting files, rather than maintaining its own state, making it less efficient.
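
      For contrast, a minimal sketch of writing through Flink's own FileSystem abstraction (org.apache.flink.core.fs), which a reworked sink could target; the path here is illustrative:

      {code:java}
      import java.nio.charset.StandardCharsets;

      import org.apache.flink.core.fs.FSDataOutputStream;
      import org.apache.flink.core.fs.FileSystem;
      import org.apache.flink.core.fs.Path;

      public class FlinkFileSystemSketch {
          public static void main(String[] args) throws Exception {
              // Flink resolves the URI scheme to one of its pluggable file systems
              // (e.g. the packaged s3://, maprfs://, swift:// implementations),
              // with no Hadoop dependency required on the user classpath.
              Path path = new Path("file:///tmp/bucket-0/part-0");
              FileSystem fs = FileSystem.get(path.toUri());

              try (FSDataOutputStream out = fs.create(path, FileSystem.WriteMode.OVERWRITE)) {
                  out.write("hello".getBytes(StandardCharsets.UTF_8));
              }
          }
      }
      {code}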

      Correctness and Efficiency on S3

      • The BucketingSink relies on strong consistency in the file enumeration and hence may work incorrectly on S3 (a state-based alternative is sketched after this list).
      • The BucketingSink relies on persisting streams at intermediate points. This does not work properly on S3, so data loss is possible on S3.
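
      One way to remove the dependency on consistent listings, sketched here as an assumption rather than the actual design (which is tracked in the sub-issues), is to record in-progress files in Flink state at checkpoint time and recover them from state instead of from a directory enumeration:

      {code:java}
      import java.util.ArrayList;
      import java.util.HashSet;
      import java.util.Set;

      import org.apache.flink.api.common.state.ListState;
      import org.apache.flink.api.common.state.ListStateDescriptor;
      import org.apache.flink.runtime.state.FunctionInitializationContext;
      import org.apache.flink.runtime.state.FunctionSnapshotContext;
      import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
      import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

      public abstract class StateTrackedSink<IN> extends RichSinkFunction<IN>
              implements CheckpointedFunction {

          // Files currently being written; maintained by the concrete sink.
          protected final Set<String> inProgressFiles = new HashSet<>();

          private transient ListState<String> checkpointedFiles;

          @Override
          public void initializeState(FunctionInitializationContext context) throws Exception {
              checkpointedFiles = context.getOperatorStateStore().getListState(
                      new ListStateDescriptor<>("in-progress-files", String.class));
              if (context.isRestored()) {
                  // Recover exactly the files this sink wrote, from its own
                  // checkpointed state, not from a (possibly stale) S3 listing.
                  for (String file : checkpointedFiles.get()) {
                      inProgressFiles.add(file);
                  }
              }
          }

          @Override
          public void snapshotState(FunctionSnapshotContext context) throws Exception {
              checkpointedFiles.update(new ArrayList<>(inProgressFiles));
          }
      }
      {code}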

      .valid-length companion file

      • The valid-length file makes it hard for consumers of the data and should be dropped (see the consumer-side sketch below).
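
      To illustrate the burden this puts on consumers, a sketch of what every downstream reader has to do today. It assumes the companion file sits next to the part file and contains the valid byte count as plain text; the exact naming conventions are configurable:

      {code:java}
      import java.io.IOException;
      import java.io.InputStream;
      import java.nio.charset.StandardCharsets;
      import java.nio.file.Files;
      import java.nio.file.Path;

      public class ValidLengthReader {
          // Illustrative only: every consumer must stop at the recorded valid
          // length, because bytes beyond it belong to a failed attempt.
          static byte[] readValidBytes(Path partFile, Path validLengthFile) throws IOException {
              long validLength = Long.parseLong(
                      new String(Files.readAllBytes(validLengthFile), StandardCharsets.UTF_8).trim());

              byte[] buffer = new byte[(int) validLength];
              try (InputStream in = Files.newInputStream(partFile)) {
                  int read = 0;
                  while (read < validLength) {
                      int n = in.read(buffer, read, (int) validLength - read);
                      if (n < 0) {
                          throw new IOException("Part file is shorter than its recorded valid length");
                      }
                      read += n;
                  }
              }
              return buffer;
          }
      }
      {code}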

      We track this design in a series of sub-issues.


          People

            Assignee: Kostas Kloudas (kkl0u)
            Reporter: Stephan Ewen (sewen)
            Votes: 0
            Watchers: 19


              Time Tracking

              Original Estimate: Not Specified
              Remaining Estimate: 0h
              Time Spent: 10m
