Details
- Type: Improvement
- Priority: Major
- Status: Open
- Resolution: Unresolved
- Affects Version: Spark-2.3.0
Description
Hi All,
I am using Google Pub/Sub with Spark Streaming.
My requirements are as follows:
1. There are multiple publishers pushing messages, and I expect the topic to receive 7k - 10k messages per second.
2. On the subscriber side, I have a Spark Streaming job running on a high-memory cluster with one worker and one executor. The batch interval is 10 seconds; every 10 seconds I need to pull all the data from the topic and write it to a file.
With these requirements I started on the code and assumed everything would work fine, but unfortunately it didn't, due to the following issues:
1. There is a hardcoded parameter controlling how many messages are pulled per request; it is currently fixed at 1000:
https://github.com/apache/bahir/blob/master/streaming-pubsub/src/main/scala/org/apache/spark/streaming/pubsub/PubsubInputDStream.scala#L234
Because of this I cannot pull more than 1000 messages per request, and given my requirements I cannot simply increase the number of executors.
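To make the impact concrete, here is a small self-contained sketch (the name `maxMessagesPerPull` is my own illustration, not an actual connector setting) of how many pull round-trips a single batch would need under a fixed per-pull cap:

```scala
// Messages accumulating in the topic over one batch interval,
// given a publish rate and the batch length in seconds.
def messagesPerBatch(ratePerSec: Int, batchSecs: Int): Int =
  ratePerSec * batchSecs

// Pull round-trips needed to drain that backlog when each pull
// is capped at maxMessagesPerPull messages (ceiling division).
def pullsNeeded(backlog: Int, maxMessagesPerPull: Int): Int =
  (backlog + maxMessagesPerPull - 1) / maxMessagesPerPull
```

At 10k messages/sec with a 10-second batch, the backlog is 100k messages per batch, so a cap of 1000 forces 100 sequential pull requests every batch; making the cap configurable would shrink that proportionally.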
2. Various other configuration options available in Google's Pub/Sub APIs (https://github.com/googleapis/java-pubsub) are completely missing here, such as manual ack, increasing the ack deadline, and async ack.
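To illustrate the semantics being asked for (this is a toy model I wrote for the issue, not the google-cloud-pubsub API and not existing connector code): with manual ack, a message is only acknowledged after it has been durably written, and messages still in flight get their ack deadline extended so Pub/Sub does not redeliver them mid-processing:

```scala
import scala.collection.mutable

// One unacked message and the time (millis) its ack deadline expires.
final case class Pending(id: String, var deadlineMillis: Long)

// Tracks in-flight messages; acks manually, extends deadlines for
// messages whose processing outlives the current deadline.
final class ManualAckTracker(extensionMillis: Long) {
  private val pending = mutable.Map[String, Pending]()

  // Called when a message is received; it stays pending until acked.
  def record(id: String, now: Long): Unit =
    pending(id) = Pending(id, now + extensionMillis)

  // Manual ack: only invoked once the message is written to the file.
  def ack(id: String): Boolean = pending.remove(id).isDefined

  // Push back the deadline of any message nearing expiry
  // (the equivalent of Pub/Sub's modifyAckDeadline); returns how
  // many deadlines were extended.
  def extendExpiring(now: Long): Int = {
    val expiring =
      pending.values.filter(p => p.deadlineMillis - now < extensionMillis / 2)
    expiring.foreach(p => p.deadlineMillis = now + extensionMillis)
    expiring.size
  }

  def inFlight: Int = pending.size
}
```

The point is that the connector currently acks on receipt, so a slow batch can lose messages; exposing manual ack and deadline extension would let the app tie acknowledgement to a successful file write.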
3. Support for the latest Spark 2.x and 3.x versions.
Are there any plans to develop these?