Bahir (Retired) / BAHIR-242

Support for more parameters and flexibility in Spark | Google Pub/Sub


Details

    Description

      Hi All,

      I am using Google Pub/Sub along with Spark Streaming.

      Following are my requirements:

      1. There are multiple publishers pushing messages, and I expect the topic to receive 7k-10k messages per second.

      2. On the subscriber side, I have a Spark Streaming job running on a high-memory cluster with one worker and one executor. In the streaming app the batch interval is 10 seconds, and every 10 seconds I need to pull all the data from the topic and write it to a file, roughly as sketched below.
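
      For context, this is a minimal sketch of how the job is wired up with the Bahir connector. The project ID, subscription name, and output path are placeholders, and credential setup may differ in your environment:

      {code:scala}
      import org.apache.spark.SparkConf
      import org.apache.spark.storage.StorageLevel
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.pubsub.{PubsubUtils, SparkGCPCredentials}

      object PubsubToFiles {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setAppName("PubsubToFiles")
          val ssc = new StreamingContext(conf, Seconds(10)) // 10-second batch interval

          // "my-project" and "my-subscription" are placeholders
          val stream = PubsubUtils.createStream(
            ssc,
            "my-project",
            None, // no topic: read from an existing subscription
            "my-subscription",
            SparkGCPCredentials.builder.build(), // application-default credentials
            StorageLevel.MEMORY_AND_DISK_SER_2)

          // Write each 10-second batch out as text files
          stream.map(msg => new String(msg.getData()))
            .saveAsTextFiles("/tmp/pubsub-output/batch")

          ssc.start()
          ssc.awaitTermination()
        }
      }
      {code}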

      With these requirements in mind I started on the code and assumed everything would work fine, but unfortunately it didn't, due to the following issues:

      1. There is a hardcoded parameter that controls how many messages are pulled per request; it is currently fixed at 1000:
      https://github.com/apache/bahir/blob/master/streaming-pubsub/src/main/scala/org/apache/spark/streaming/pubsub/PubsubInputDStream.scala#L234
      Because of this I cannot pull more than 1000 messages per request in a batch, and given my setup I cannot simply increase the number of executors (see the first sketch after this list).

      2. Various other configurations available in Google's Pub/Sub APIs (https://github.com/googleapis/java-pubsub) are missing here entirely, such as manual acknowledgement, ack-deadline extension, and asynchronous pull (see the second sketch below).

      3. Support for the latest Spark 2.x and 3.x versions.
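
      To illustrate issue 1: the receiver builds its pull request with a fixed maximum. Below is a minimal sketch of what a configurable limit could look like; maxMessagesPerPull is illustrative naming and does not exist in the current connector:

      {code:scala}
      import com.google.api.services.pubsub.model.PullRequest

      // Hypothetical: let the caller choose the pull size instead of the hardcoded 1000.
      // maxMessagesPerPull is illustrative naming, not part of the current connector API.
      def buildPullRequest(maxMessagesPerPull: Int): PullRequest =
        new PullRequest()
          .setReturnImmediately(false)        // block until messages are available
          .setMaxMessages(maxMessagesPerPull)

      // e.g. sized for a 10-second batch at ~10k messages/second
      val request = buildPullRequest(10000)
      {code}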

      Is there any plan to develop these?
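
      For reference on issue 2, this is roughly how the native google-cloud-pubsub client exposes manual acknowledgement, ack-deadline extension, and asynchronous pull. This is a sketch only; the names are placeholders and the exact builder methods vary by client version:

      {code:scala}
      import com.google.api.gax.batching.FlowControlSettings
      import com.google.cloud.pubsub.v1.{AckReplyConsumer, MessageReceiver, Subscriber}
      import com.google.pubsub.v1.{ProjectSubscriptionName, PubsubMessage}
      import org.threeten.bp.Duration

      val subscription = ProjectSubscriptionName.of("my-project", "my-subscription")

      // Asynchronous pull with manual acknowledgement: ack only after processing succeeds.
      val receiver = new MessageReceiver {
        override def receiveMessage(message: PubsubMessage, consumer: AckReplyConsumer): Unit = {
          // ... process message.getData ...
          consumer.ack() // or consumer.nack() to have the message redelivered
        }
      }

      val subscriber = Subscriber.newBuilder(subscription, receiver)
        .setFlowControlSettings(
          FlowControlSettings.newBuilder()
            .setMaxOutstandingElementCount(10000L) // cap messages buffered in memory
            .build())
        .setMaxAckExtensionPeriod(Duration.ofMinutes(10)) // keep extending the ack deadline
        .build()

      subscriber.startAsync().awaitRunning()
      {code}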


          People

            Assignee: Unassigned
            Reporter: shivakumar.ss (ShivaKumar SS)
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated: