Details
Type: Bug
Status: Open
Priority: Blocker
Resolution: Unresolved
Affects Version/s: 3.3.0
Fix Version/s: None
Environment Description
Hudi version: 0.12.1-amzn-0
Spark version: 3.3.0
Hive version: 3.1.3
Hadoop version: 3.3.3 amz
Storage (HDFS/S3/GCS..): S3
Running on Docker? (yes/no): no (EMR 6.9.0)
Description
Hello, we are facing an issue where some PySpark jobs that rely on Hudi appear to be blocked. Looking at the Spark console we see the situation shown in the attachment: 71 completed jobs, even though these are CDC processes that should be reading from a Kafka topic continuously. We verified that there are messages queued on the Kafka topic. If we kill and restart the application, in some cases the job behaves normally again, while in other cases it remains stuck.
Our deployment setup is the following:
We read INSERT, UPDATE and DELETE operations from a Kafka topic and replicate them into a target Hudi table registered in Hive, via a PySpark job running 24/7.
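For reference, the read side of the job looks roughly like the sketch below. This is a reconstruction, not the actual code: the topic name and broker list are placeholders.

PYSPARK READ (sketch)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('hudi-cdc-replication').getOrCreate()

# Continuous read from the CDC Kafka topic (topic and brokers are placeholders).
df_source = spark.readStream \
    .format('kafka') \
    .option('kafka.bootstrap.servers', '<BROKER_LIST>') \
    .option('subscribe', '<CDC_TOPIC>') \
    .option('startingOffsets', 'latest') \
    .load()

# df_source then feeds the foreachBatch writer shown under PYSPARK WRITE below.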
PYSPARK WRITE
df_source.writeStream.foreachBatch(foreach_batch_write_function).start()

FOR EACH BATCH FUNCTION:
def foreach_batch_write_function(batchDF, batch_id):
    # splitting of batchDF into batchDF_deletes / batchDF_upserts omitted here

    # management of delete messages
    batchDF_deletes.write.format('hudi') \
        .option('hoodie.datasource.write.operation', 'delete') \
        .options(**hudiOptions_table) \
        .mode('append') \
        .save(S3_OUTPUT_PATH)

    # management of update and insert messages
    batchDF_upserts.write.format('org.apache.hudi') \
        .option('hoodie.datasource.write.operation', 'upsert') \
        .options(**hudiOptions_table) \
        .mode('append') \
        .save(S3_OUTPUT_PATH)
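The hudiOptions_table dictionary referenced above is not included in the report. For context, a minimal sketch of what such a Hive-synced upsert configuration typically contains is shown below; the keys are standard Hudi write/Hive-sync options, but the values are placeholders, not the actual configuration.

# Hypothetical example of hudiOptions_table (values are placeholders).
hudiOptions_table = {
    'hoodie.table.name': 'target_table',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.datasource.write.partitionpath.field': 'partition_col',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'default',
    'hoodie.datasource.hive_sync.table': 'target_table',
    'hoodie.datasource.hive_sync.mode': 'hms',
}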
SPARK SUBMIT
spark-submit --master yarn --deploy-mode cluster \
  --num-executors 1 --executor-memory 1G --executor-cores 2 \
  --conf spark.dynamicAllocation.enabled=false \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar \
  <path_to_script>
Attachments