Bahir (Retired) / BAHIR-242

Support for more parameters and flexibility in Spark | Google Pub/Sub


Details

    Description

      Hi All,

      I am using Google Pub/Sub along with Spark Streaming.

      Following are my requirements:

      1. There are multiple publishers pushing messages, and I expect the topic to receive 7k-10k messages per second.

      2. On the subscriber side, I have a Spark Streaming job running on a high-memory cluster with one worker and one executor. In the streaming app the batch interval is 10 seconds, and every 10 seconds I need to pull all the data from the topic and write it to a file, roughly as sketched below.
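
      For context, this is a minimal sketch of how the job is wired up with the Bahir connector. The project ID, subscription name, and output path are placeholders, and credential setup may differ in your environment:

      {code:scala}
      import org.apache.spark.SparkConf
      import org.apache.spark.storage.StorageLevel
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.pubsub.{PubsubUtils, SparkGCPCredentials}

      object PubsubToFiles {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setAppName("PubsubToFiles")
          val ssc = new StreamingContext(conf, Seconds(10)) // 10-second batch interval

          // "my-project" and "my-subscription" are placeholders
          val stream = PubsubUtils.createStream(
            ssc,
            "my-project",
            None, // no topic: read from an existing subscription
            "my-subscription",
            SparkGCPCredentials.builder.build(), // application-default credentials
            StorageLevel.MEMORY_AND_DISK_SER_2)

          // Write each 10-second batch out as text files
          stream.map(msg => new String(msg.getData()))
            .saveAsTextFiles("/tmp/pubsub-output/batch")

          ssc.start()
          ssc.awaitTermination()
        }
      }
      {code}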

      With these requirements in mind I started on the code and assumed everything would work fine, but unfortunately it didn't, due to the following issues:

      1. There is a hardcoded parameter that controls how many messages are pulled per request; it is currently fixed at 1000:
      https://github.com/apache/bahir/blob/master/streaming-pubsub/src/main/scala/org/apache/spark/streaming/pubsub/PubsubInputDStream.scala#L234
      Because of this I cannot pull more than 1000 messages per request in a batch, and given my setup I cannot simply increase the number of executors (see the first sketch after this list).

      2. Various other configurations available in Google's Pub/Sub APIs (https://github.com/googleapis/java-pubsub) are missing here entirely, such as manual acknowledgement, ack-deadline extension, and asynchronous pull (see the second sketch below).

      3. Support for the latest Spark 2.x and 3.x versions.
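
      To illustrate issue 1: the receiver builds its pull request with a fixed maximum. Below is a minimal sketch of what a configurable limit could look like; maxMessagesPerPull is illustrative naming and does not exist in the current connector:

      {code:scala}
      import com.google.api.services.pubsub.model.PullRequest

      // Hypothetical: let the caller choose the pull size instead of the hardcoded 1000.
      // maxMessagesPerPull is illustrative naming, not part of the current connector API.
      def buildPullRequest(maxMessagesPerPull: Int): PullRequest =
        new PullRequest()
          .setReturnImmediately(false)        // block until messages are available
          .setMaxMessages(maxMessagesPerPull)

      // e.g. sized for a 10-second batch at ~10k messages/second
      val request = buildPullRequest(10000)
      {code}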

      Is there any plan to develop these?
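
      For reference on issue 2, this is roughly how the native google-cloud-pubsub client exposes manual acknowledgement, ack-deadline extension, and asynchronous pull. This is a sketch only; the names are placeholders and the exact builder methods vary by client version:

      {code:scala}
      import com.google.api.gax.batching.FlowControlSettings
      import com.google.cloud.pubsub.v1.{AckReplyConsumer, MessageReceiver, Subscriber}
      import com.google.pubsub.v1.{ProjectSubscriptionName, PubsubMessage}
      import org.threeten.bp.Duration

      val subscription = ProjectSubscriptionName.of("my-project", "my-subscription")

      // Asynchronous pull with manual acknowledgement: ack only after processing succeeds.
      val receiver = new MessageReceiver {
        override def receiveMessage(message: PubsubMessage, consumer: AckReplyConsumer): Unit = {
          // ... process message.getData ...
          consumer.ack() // or consumer.nack() to have the message redelivered
        }
      }

      val subscriber = Subscriber.newBuilder(subscription, receiver)
        .setFlowControlSettings(
          FlowControlSettings.newBuilder()
            .setMaxOutstandingElementCount(10000L) // cap messages buffered in memory
            .build())
        .setMaxAckExtensionPeriod(Duration.ofMinutes(10)) // keep extending the ack deadline
        .build()

      subscriber.startAsync().awaitRunning()
      {code}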


          People

            Assignee: Unassigned
            Reporter: shivakumar.ss (ShivaKumar SS)
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated: