I am using PySpark 2.1.0 in a production environment and am trying to join two DataFrames, one of which is very large and has complex nested structures.
Basically, I load both DataFrames and cache them.
Then, in the large DataFrame, I extract three nested values and save them as direct (top-level) columns.
Finally, I join on these three columns with the smaller DataFrame.
A shortened version of the code:
And this is the error I get when execution reaches the count():
I have seen many tickets reporting similar issues here, but no proper solution. Most of the suggested fixes target versions up to Spark 2.1.0, so I don't know whether running on Spark 2.2.0 would fix it. In any case, I cannot change the Spark version, since it is in production.
I have also tried setting
but I still get the same error.
The job has worked well up to now, even with large datasets, but this batch apparently got larger, and that is the only thing that changed. Is there any workaround for this?