Details
Type: Bug
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: 2.4.3
Fix Version/s: None
Environment: Hadoop 2.8.4
Description
I have been testing what happens to a running Structured Streaming query that writes to HDFS when all datanodes are down/stopped, or when the whole cluster is down (including the namenode).
So I created a structured stream from Kafka to a file output sink on HDFS and tested some scenarios.
The streaming query we used was very simple:
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.types.DataTypes;

// `spark` is an already-built SparkSession
spark.readStream()
     .format("kafka")
     .option("kafka.bootstrap.servers", "kafka.server:9092...")
     .option("subscribe", "test_topic")
     .load()
     .select(col("value").cast(DataTypes.StringType))
     .writeStream()
     .format("text")
     .option("path", "HDFS/PATH")
     .option("checkpointLocation", "checkpointPath")
     .start()
     .awaitTermination();
After stopping all the datanodes, the process starts logging errors saying the datanodes are bad.
That is expected:
2019-07-03 15:55:00 [spark-listener-group-eventLog] ERROR org.apache.spark.scheduler.AsyncEventQueue:91 - Listener EventLoggingListener threw an exception
java.io.IOException: All datanodes [DatanodeInfoWithStorage[10.2.12.202:50010,DS-d2fba01b-28eb-4fe4-baaa-4072102a2172,DISK]] are bad. Aborting...
	at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1530)
	at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1465)
	at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1237)
	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:657)
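To separate Spark's behavior from the raw HDFS client, the same failure can be exercised with a small standalone writer; this is only a sketch, and the namenode URI and output path are placeholders:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRecoveryCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode URI; replace with the real one
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        try (FSDataOutputStream out = fs.create(new Path("/tmp/recovery-check"))) {
            for (int i = 0; i < 600; i++) {
                out.writeBytes("line " + i + "\n");
                out.hflush();        // pushes the data through the datanode write pipeline
                Thread.sleep(1000);  // stop and restart the datanodes while this loop runs
            }
        }
    }
}

A writer like this makes it easy to see whether the bare HDFS client recovers once the datanodes come back.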
The problem is that even after starting the datanodes again, the process keeps logging the same error, over and over.
We checked, and the writeStream to HDFS recovered successfully once the datanodes were back up; the output sink worked again without problems.
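Note that the stack trace above comes from the EventLoggingListener (the writer for the Spark event log), not from the streaming sink, which would explain why the sink recovers while the error keeps repeating. As an untested mitigation sketch (an assumption on my part, at the cost of losing history-server logs), the event log could be disabled when building the session:

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("kafka-to-hdfs-test")
        // Assumption: with the event log disabled, no long-lived HDFS stream
        // is left behind pointing at the dead write pipeline.
        .config("spark.eventLog.enabled", "false")
        .getOrCreate();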
I have been trying some different HDFS client configurations to make sure it is not a client-side configuration problem, but I have no clue how to fix it.
It seems that something is stuck indefinitely in an error loop.
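For anyone trying to reproduce this, the standard HDFS client settings that govern replacing bad datanodes in a write pipeline are the dfs.client.block.write.replace-datanode-on-failure.* keys. A sketch of passing them through Spark's spark.hadoop.* passthrough follows; the values shown are illustrative, not a confirmed fix:

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        // spark.hadoop.* entries are forwarded to the Hadoop Configuration
        // used by the HDFS client.
        .config("spark.hadoop.dfs.client.block.write.replace-datanode-on-failure.enable", "true")
        .config("spark.hadoop.dfs.client.block.write.replace-datanode-on-failure.policy", "ALWAYS")
        .config("spark.hadoop.dfs.client.block.write.replace-datanode-on-failure.best-effort", "true")
        .getOrCreate();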