Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Duplicate
- Affects Version/s: 3.0.0
- Fix Version/s: None
- Component/s: None
Description
I have run into an error, somewhat non-deterministically, where a query fails with

NoSuchElementException: None.get

which is thrown at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L182

getActiveSession is apparently returning None. I only use PySpark, and I suspect this is a threading issue, since the active session comes from an InheritableThreadLocal. I encounter it both when I manually use threading to run multiple jobs at the same time and occasionally when I have multiple streams active at the same time. I tried the PYSPARK_PIN_THREAD flag, but it didn't seem to help. For the manual-threading case I hacked around it by running

spark._jvm.SparkSession.setActiveSession(spark._jvm.SparkSession.builder().getOrCreate())

at the start of each new thread, but even that doesn't work reliably.
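To illustrate the failure mode, here is a minimal plain-Python sketch (no Spark required, names are illustrative) of why a value kept in a thread-local can come back as None in worker threads: Python's threading.local never propagates values to new threads, analogous to how the JVM's InheritableThreadLocal only copies its value at Java-thread creation, which threads launched from Python may bypass. The per-thread set call mirrors the setActiveSession workaround above.

```python
import threading

# Thread-local storage standing in for the JVM's active-session holder.
_active = threading.local()

def get_active():
    # Stand-in for SparkSession.getActiveSession: returns None when the
    # current thread has never set a value.
    return getattr(_active, "session", None)

results = {}

def worker():
    # The new thread starts with no inherited value, so get_active()
    # returns None -- mirroring the NoSuchElementException scenario.
    results["before"] = get_active()
    # Workaround, as in the report: explicitly set the value at the
    # start of each new thread.
    _active.session = "session"
    results["after"] = get_active()

_active.session = "session"  # set in the main thread
t = threading.Thread(target=worker)
t.start()
t.join()
# results["before"] is None; results["after"] == "session"
```

This only models the inheritance gap; it does not reproduce the race the report describes, where even per-thread setActiveSession sometimes fails.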
I see this was mentioned in this issue.
I'm not sure whether the problem (or the fix) lies in how Python threads are handled, in adding a default value, or in some other change to this function. One other note: I started encountering this when using Delta Lake OSS, which reads parquet files as part of its transaction log, and that is always where the error happens. It doesn't seem to be anything specific to that library, though, that would be doing something incorrectly to cause this issue.
Attachments
Issue Links
- duplicates SPARK-32813: Reading parquet rdd in non columnar mode fails in multithreaded environment (Resolved)