Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.2.0, 3.3.0, 3.4.0, 3.4.1, 3.5.0, 3.5.1
- Fix Version/s: None
- Component/s: None
Description
The auto-batching mechanism in PySpark's serialization can crash jobs. The batching logic increases the batch size whenever it can (https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L269-L285). However, this logic does not guard against the total serialized size of a batch growing larger than 2 GB, and when that happens the job fails with:
```
File "/databricks/spark/python/pyspark/worker.py", line 1876, in main process()
File "/databricks/spark/python/pyspark/worker.py", line 1868, in process serializer.dump_stream(out_iter, outfile)
File "/databricks/spark/python/pyspark/serializers.py", line 308, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream)
File "/databricks/spark/python/pyspark/serializers.py", line 158, in dump_stream self._write_with_length(obj, stream)
File "/databricks/spark/python/pyspark/serializers.py", line 172, in _write_with_length raise ValueError("can not serialize object larger than 2G")
ValueError: can not serialize object larger than 2G
```
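For context, here is a minimal sketch of the doubling heuristic linked above. This is not the actual Spark source; `dump_stream_sketch`, `best_size`, and the pickle framing are simplified stand-ins. It illustrates the problem: the batch is bounded only by record count, not by serialized bytes, so one batch of large records can exceed the 2 GB frame limit before the heuristic ever shrinks it.

```python
import itertools
import pickle

# Framed writes are length-prefixed with a signed 32-bit int, hence the 2 GB cap.
MAX_FRAME = (1 << 31) - 1


def dump_stream_sketch(iterator, stream, best_size=1 << 16):
    """Write pickled batches to `stream`, growing the batch count heuristically.

    The batch doubles whenever the previous serialized batch came out smaller
    than `best_size`. Nothing bounds the batch by its byte size, so a few very
    large records can push one batch past the 2 GB framing limit.
    """
    batch = 1
    iterator = iter(iterator)
    while True:
        values = list(itertools.islice(iterator, batch))
        if not values:
            break
        payload = pickle.dumps(values)
        if len(payload) > MAX_FRAME:
            # Corresponds to the ValueError in the traceback above.
            raise ValueError("can not serialize object larger than 2G")
        stream.write(len(payload).to_bytes(4, "big"))
        stream.write(payload)
        if len(payload) < best_size:
            batch *= 2        # previous batch looked small, so grow aggressively
        elif len(payload) > best_size * 10 and batch > 1:
            batch //= 2       # shrink only after a batch is already far too big
```

Because the batch only shrinks after a batch has already been serialized and found too large, a stream whose records suddenly grow (for example after a UDF or explode that expands each row) can hit the 2 GB limit before the heuristic reacts.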