Flink / FLINK-24296 [Umbrella] Refactor Google PubSub to new Source/Sink interfaces / FLINK-20625

Refactor Google Cloud PubSub Source in accordance with FLIP-27

Details

    Description

      The Source implementation for Google Cloud Pub/Sub should be refactored in accordance with FLIP-27: Refactor Source Interface.

      Split Enumerator

      Pub/Sub doesn't expose any partitions to consuming applications. Therefore, the Pub/Sub Split Enumerator won't do any real work discovery. Instead, it hands the same static Source Split to every Source Reader that requests one. This static Source Split contains only the connection details and the concrete Pub/Sub subscription to use; it carries no split-specific information such as partitions or offsets because Pub/Sub does not make that information available.
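
      A minimal sketch of what such a static split and its enumerator logic could look like. All class and method names here are hypothetical and not taken from the connector; `SplitLike` is only a stand-in for Flink's `SourceSplit` interface (which declares `splitId()`) so that the example compiles without Flink on the classpath.

```java
import java.util.Objects;

/** Stand-in for Flink's SourceSplit interface (assumption, keeps the sketch self-contained). */
interface SplitLike {
    String splitId();
}

/** Hypothetical static split: connection details only, no partition or offset. */
final class StaticPubSubSplit implements SplitLike {
    private final String projectName;
    private final String subscriptionName;

    StaticPubSubSplit(String projectName, String subscriptionName) {
        this.projectName = Objects.requireNonNull(projectName);
        this.subscriptionName = Objects.requireNonNull(subscriptionName);
    }

    @Override
    public String splitId() {
        // All readers receive an equal split, so the ID is derived from the subscription alone.
        return "projects/" + projectName + "/subscriptions/" + subscriptionName;
    }
}

/** Hypothetical enumerator logic: every split request is answered with the same split. */
final class StaticSplitEnumerator {
    private final StaticPubSubSplit split;

    StaticSplitEnumerator(StaticPubSubSplit split) {
        this.split = Objects.requireNonNull(split);
    }

    StaticPubSubSplit handleSplitRequest(int subtaskId) {
        // No discovery happens here; the subtask ID does not influence the result.
        return split;
    }
}
```

      Because the enumerator holds nothing except this one immutable split, it has no state that could change between requests.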

      Source Reader

      A Source Reader will use Pub/Sub's pull mechanism to read new messages from the Pub/Sub subscription specified in the SourceSplit. In the case of parallel-running Source Readers in Flink, every Source Reader will be passed the same Source Split from the Enumerator. Because of this, all Source Readers use the same connection details and the same Pub/Sub subscription to receive messages. In this case, Pub/Sub will automatically load-balance messages between all Source Readers pulling from the same subscription. This way, parallel processing can be achieved in the Source.

      At-least-once guarantee

      Pub/Sub itself guarantees at-least-once message delivery, so the goal is to preserve this guarantee in the Source as well. The mechanism for doing so is acknowledgement: Pub/Sub expects the subscriber to acknowledge a message to signal that it has been consumed successfully. Any message that has not been acknowledged is automatically redelivered by Pub/Sub once its ack deadline has passed.

      After a certain time interval has elapsed:

      1. All pulled messages are checkpointed in the Source Reader.
      2. The same messages are acknowledged to Pub/Sub.
      3. The same messages are forwarded to downstream Flink tasks.

      This should ensure at-least-once delivery in the Source: in the case of failure, messages that were not checkpointed have not yet been acknowledged and will therefore be redelivered by the Pub/Sub service.
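
      The three steps could be sketched roughly as follows. This is an illustration under assumptions, not the connector's implementation: `PulledMessage`, `BufferingReaderSketch`, and the two callbacks are hypothetical names, the acknowledger stands in for an acknowledge call to Pub/Sub, and the sketch assumes acknowledgement is triggered only after the checkpoint containing the messages has completed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical pulled message: the payload plus the ack ID of this delivery. */
final class PulledMessage {
    final String ackId;
    final String payload;

    PulledMessage(String ackId, String payload) {
        this.ackId = ackId;
        this.payload = payload;
    }
}

/** Hypothetical reader-side bookkeeping for the checkpoint, acknowledge, forward cycle. */
final class BufferingReaderSketch {
    private final List<PulledMessage> buffer = new ArrayList<>();       // pulled, not yet checkpointed
    private final List<PulledMessage> checkpointed = new ArrayList<>(); // checkpointed, not yet acked
    private final Consumer<List<String>> acknowledger; // assumption: sends ack IDs to Pub/Sub
    private final Consumer<String> downstream;         // assumption: emits to downstream tasks

    BufferingReaderSketch(Consumer<List<String>> acknowledger, Consumer<String> downstream) {
        this.acknowledger = acknowledger;
        this.downstream = downstream;
    }

    void onPulled(PulledMessage message) {
        buffer.add(message);
    }

    /** Step 1: everything pulled so far becomes part of the reader's checkpointed state. */
    List<PulledMessage> snapshotState() {
        checkpointed.addAll(buffer);
        buffer.clear();
        return new ArrayList<>(checkpointed);
    }

    /** Steps 2 and 3: acknowledge the checkpointed messages, then forward them. */
    void onCheckpointComplete() {
        if (checkpointed.isEmpty()) {
            return;
        }
        List<String> ackIds = new ArrayList<>();
        for (PulledMessage message : checkpointed) {
            ackIds.add(message.ackId);
        }
        acknowledger.accept(ackIds);
        for (PulledMessage message : checkpointed) {
            downstream.accept(message.payload);
        }
        checkpointed.clear();
    }
}
```

      Messages that arrive after `snapshotState()` stay in the buffer and remain unacknowledged, so a failure before the next checkpoint leaves them to be redelivered by Pub/Sub.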

      Because the Source Split is static, checkpointing does not appear to be necessary in the Split Enumerator.

      Possible exactly-once guarantee

      It should even be possible to achieve exactly-once guarantees for the Source. Similar to the current RabbitMQ Source, an exactly-once mode could be offered alongside the at-least-once mode, provided the following requirements are met:

      • The system which publishes messages to Pub/Sub must add an id to each message so that messages can be deduplicated in the Source.
      • The Source must run in a non-parallel fashion (with parallelism=1).
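
      The deduplication implied by the first requirement could be sketched as below. `MessageDeduplicator` and its methods are hypothetical names, and the ID is assumed to be a string the publisher attaches to each message. The sketch keeps all seen IDs in a single in-memory set, which presumably is also why the second requirement exists: with several readers on one subscription, a redelivered message could reach a reader that has never seen its ID.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical deduplicator keyed on an ID that the publisher attaches to each message. */
final class MessageDeduplicator {
    private final Set<String> seenIds = new HashSet<>();

    /** Returns true the first time an ID is seen and false for any redelivery. */
    boolean isFirstDelivery(String publisherAssignedId) {
        // Set.add returns false when the element was already present.
        return seenIds.add(publisherAssignedId);
    }

    /** Copy of the seen IDs, to be stored with a checkpoint. */
    Set<String> snapshotState() {
        return new HashSet<>(seenIds);
    }

    /** Replaces the seen IDs with those restored from a checkpoint. */
    void restoreState(Set<String> restoredIds) {
        seenIds.clear();
        seenIds.addAll(restoredIds);
    }
}
```

      The set grows without bound in this sketch; a real implementation would need some way to expire IDs that can no longer be redelivered.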

            People

              Assignee: Unassigned
              Reporter: Jakob Edding