Spark / SPARK-33189

Support PyArrow 2.0.0+


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.0.2, 3.1.0
    • Component/s: PySpark
    • Labels: None

      Description

      Some tests fail with PyArrow 2.0.0 in PySpark:

      ======================================================================
      ERROR [0.774s]: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
      ----------------------------------------------------------------------
      Traceback (most recent call last):
        File "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 595, in test_grouped_over_window_with_key
          .select('id', 'result').collect()
        File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in collect
          sock_info = self._jdf.collectToPython()
        File "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
          answer, self.gateway_client, self.target_id, self.name)
        File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco
          raise converted from None
      pyspark.sql.utils.PythonException: 
        An exception was thrown from the Python worker. Please see the stack trace below.
      Traceback (most recent call last):
        File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, in main
          process()
        File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, in process
          serializer.dump_stream(out_iter, outfile)
        File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 255, in dump_stream
          return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
        File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 81, in dump_stream
          for batch in iterator:
        File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 248, in init_stream_yield_batches
          for series in iterator:
        File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, in mapper
          return f(keys, vals)
        File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, in <lambda>
          return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
        File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, in wrapped
          result = f(key, pd.concat(value_series, axis=1))
        File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in wrapper
          return f(*args, **kwargs)
        File "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 590, in f
          "{} != {}".format(expected_key[i][1], window_range)
      AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, 15, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>), 'end': datetime.datetime(2018, 3, 20, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>)}
      

      We should verify and support PyArrow 2.0.0+.
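
      The diff in the assertion is the key clue: the expected key holds naive
      datetime objects, while the window key produced under PyArrow 2.0.0 comes
      back timezone-aware (tzinfo=<StaticTzInfo 'Etc/UTC'>), apparently due to a
      behavior change in how PyArrow 2.0.0 converts timestamps nested inside
      struct values such as the window's start/end. In Python, a naive datetime
      never compares equal to an aware one, so the equality check fails. A
      minimal sketch of the comparison, plain Python with no Spark involved
      ('Etc/UTC' mirrors the traceback):

      import datetime

      import pytz

      naive = datetime.datetime(2018, 3, 15)
      aware = pytz.timezone("Etc/UTC").localize(datetime.datetime(2018, 3, 15))

      # Equality between a naive and a timezone-aware datetime is always
      # False in Python, which is exactly the assertion failure above.
      print(naive == aware)       # False
      print(repr(aware.tzinfo))   # <StaticTzInfo 'Etc/UTC'>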

      See also https://github.com/apache/spark/runs/1278918780
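
      One possible mitigation while 2.0.0+ support is being verified (a hedged
      sketch, assuming the PYARROW_IGNORE_TIMEZONE environment variable that
      PyArrow 2.0.0 introduced to restore the legacy, timezone-naive conversion
      of nested timestamps): set the variable before pyarrow is imported, e.g.
      in the test environment.

      import os

      # Should be in the environment before pyarrow is first imported; with
      # it set, nested timestamps convert to naive datetimes as they did in
      # PyArrow < 2.0.0.
      os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

      import pyarrow as pa  # noqa: E402

      Since the conversion happens in the Python workers, the variable would
      presumably need to be set on both the driver and the executor sides.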


            People

            • Assignee:
              Bryan Cutler (bryanc)
            • Reporter:
              Hyukjin Kwon (hyukjin.kwon)
            • Votes:
              0
            • Watchers:
              3
