Description
Calling toPandas() with spark.sql.execution.arrow.enabled: true fails for DataFrames with no partitions, raising an EOFError. With spark.sql.execution.arrow.enabled: false the conversion succeeds.
Repro (on current master branch):
>>> from pyspark.sql.types import *
>>> schema = StructType([StructField("field1", StringType(), True)])
>>> df = spark.createDataFrame(sc.emptyRDD(), schema)
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>>> df.toPandas()
/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py:2162: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.fallback.enabled' does not have an effect on failures in the middle of computation.
  warnings.warn(msg)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 2143, in toPandas
    batches = self._collectAsArrow()
  File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 2205, in _collectAsArrow
    results = list(_load_from_socket(sock_info, ArrowCollectSerializer()))
  File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 210, in load_stream
    num = read_int(stream)
  File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 810, in read_int
    raise EOFError
EOFError
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
>>> df.toPandas()
Empty DataFrame
Columns: [field1]
Index: []