Spark / SPARK-49189

pyspark auto batching serialization leads to job crash


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.0, 3.3.0, 3.4.0, 3.4.1, 3.5.0, 3.5.1
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      The auto-batching mechanism of serialization leads to job crashes in PySpark.

      The logic grows the batch size whenever it can: https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L269-L285

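      For context, the sketch below is a simplified, standalone paraphrase of the auto-batching idea in the linked dump_stream: serialize a batch, and if its serialized form is still small, double the batch size for the next one. The function name auto_batched_dump_stream, the use of pickle, and the exact grow/shrink thresholds are illustrative assumptions, not the actual PySpark source.

      ```
      import itertools
      import pickle
      import struct
      from io import BytesIO

      def auto_batched_dump_stream(iterator, stream, best_size=1 << 16):
          """Serialize objects in growing batches (simplified paraphrase)."""
          batch = 1
          iterator = iter(iterator)
          while True:
              objs = list(itertools.islice(iterator, batch))
              if not objs:
                  break
              data = pickle.dumps(objs)                   # one pickle per batch
              stream.write(struct.pack("!i", len(data)))  # 4-byte length prefix
              stream.write(data)
              if len(data) < best_size:
                  batch *= 2                              # batch still small: grow it
              elif len(data) > best_size * 10 and batch > 1:
                  batch //= 2                             # batch far too big: shrink it

      if __name__ == "__main__":
          buf = BytesIO()
          auto_batched_dump_stream(range(1000), buf)
          print("wrote", buf.tell(), "bytes")
      ```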

      However, this logic is vulnerable when the total serialized size of the objects in a batch exceeds 2 GB; when that happens, the job fails with:

      ```
      File "/databricks/spark/python/pyspark/worker.py", line 1876, in main
          process()
      File "/databricks/spark/python/pyspark/worker.py", line 1868, in process
          serializer.dump_stream(out_iter, outfile)
      File "/databricks/spark/python/pyspark/serializers.py", line 308, in dump_stream
          self.serializer.dump_stream(self._batched(iterator), stream)
      File "/databricks/spark/python/pyspark/serializers.py", line 158, in dump_stream
          self._write_with_length(obj, stream)
      File "/databricks/spark/python/pyspark/serializers.py", line 172, in _write_with_length
          raise ValueError("can not serialize object larger than 2G")
      ValueError: can not serialize object larger than 2G
      ```
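
      The frame that fails is the one written by _write_with_length in the traceback above: each batch is written as a length-prefixed block whose length is a signed 32-bit int, so any single serialized batch over 2**31 - 1 bytes is rejected. A minimal sketch of that guard, assuming the length-prefix framing described above (write_with_length here is an illustrative stand-in, not the exact Spark source):

      ```
      import struct

      # Largest length that fits in a signed 32-bit int (~2 GB): the frame's
      # length prefix is packed with struct.pack("!i", ...), so a bigger batch
      # cannot be framed at all and has to be rejected up front.
      MAX_FRAME = (1 << 31) - 1

      def write_with_length(serialized: bytes, stream) -> None:
          if len(serialized) > MAX_FRAME:
              raise ValueError("can not serialize object larger than 2G")
          stream.write(struct.pack("!i", len(serialized)))  # 4-byte big-endian length
          stream.write(serialized)
      ```

      In other words, the batch doubling can produce a single batch whose serialized size crosses this limit even when individual records are much smaller than 2 GB, which is the failure reported here.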

          People

            Assignee: Unassigned
            Reporter: Nan Zhu (codingcat)
            Votes: 0
            Watchers: 1
