Affects Version/s: 2.4.0
Fix Version/s: None
Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
I ran into an issue similar to, and probably related to, SPARK-26128: org.apache.spark.sql.functions.input_file_name sometimes returns an empty string.
My environment is Databricks, and debugging the Log4j output showed that the issue occurs when the files are listed in parallel, i.e. when the number of input paths exceeds spark.sql.sources.parallelPartitionDiscovery.threshold.
Everything is fine as long as the files are listed sequentially on the driver.
Setting spark.sql.sources.parallelPartitionDiscovery.threshold to 9999 resolves the issue for me.
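The workaround above can be sketched as follows. This is a minimal example, not a confirmed fix for the underlying bug; it assumes an existing SparkSession named `spark` (as provided by a Databricks notebook) and uses a hypothetical input path:

```scala
// Assumes a live SparkSession `spark`; the input path below is hypothetical.
import org.apache.spark.sql.functions.input_file_name

// Raise the threshold so file listing stays on the driver instead of
// being dispatched as a distributed Spark job.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "9999")

val df = spark.read
  .parquet("/mnt/data/events") // hypothetical path — adjust for your environment
  .withColumn("source_file", input_file_name())

// Each row should now carry the non-empty path of the file it came from.
df.select("source_file").distinct().show(false)
```

Note that raising the threshold trades away parallel listing, which can slow down jobs that read a very large number of paths.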
edit: the problem is not exclusively linked to listing files in parallel. I've set up a larger cluster on which input_file_name returned the correct filename even after parallel file listing. After inspecting the Log4j output again, I suspect it's linked to some kind of MetaStore being full. I've attached the section of the Log4j output that I think indicates why it's failing; if you need more, please let me know.