Spark / SPARK-31186

toPandas fails on simple query (collect() works)


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4.4
    • Fix Version/s: 2.4.6, 3.0.0
    • Component/s: PySpark
    • Labels: None

    Description

      My pandas is 0.25.1.

      I ran the following simple code (cross joins are enabled):

      spark.sql('''
      select t1.*, t2.* from (
        select explode(sequence(1, 3)) v
      ) t1 left join (
        select explode(sequence(1, 3)) v
      ) t2
      ''').toPandas()
      

      and got a ValueError from pandas:

      > ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

      Collect works fine:

      spark.sql('''
      select * from (
        select explode(sequence(1, 3)) v
      ) t1 left join (
        select explode(sequence(1, 3)) v
      ) t2
      ''').collect()
      # [Row(v=1, v=1),
      #  Row(v=1, v=2),
      #  Row(v=1, v=3),
      #  Row(v=2, v=1),
      #  Row(v=2, v=2),
      #  Row(v=2, v=3),
      #  Row(v=3, v=1),
      #  Row(v=3, v=2),
      #  Row(v=3, v=3)]
      

      I imagine it's related to the duplicate column names, but this doesn't fail:

      spark.sql("select 1 v, 1 v").toPandas()
      # v	v
      # 0	1	1
      

      Also no issue for multiple rows:

      spark.sql("select 1 v, 1 v union all select 1 v, 2 v").toPandas()

      It also works when not using a cross join but a janky, programmatically generated union all query:

      cond = []
      for ii in range(3):
          for jj in range(3):
              cond.append(f'select {ii+1} v, {jj+1} v')
      spark.sql(' union all '.join(cond)).toPandas()
      

      As near as I can tell, the output is identical to the explode output, which makes this issue all the more peculiar. I thought toPandas() was applied to the output of collect(), so if collect() gives the same output, how can toPandas() fail in one case and not the other? Further, the lazy DataFrame is the same in both cases: DataFrame[v: int, v: int]. I must be missing something.
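
      A plausible mechanism, sketched in plain pandas (this is an assumption about Spark's internal non-Arrow toPandas() dtype-correction path, not confirmed from the source): if that path looks up each column by name to check for nulls, then with duplicate labels `pdf[name]` returns a DataFrame rather than a Series, the null check yields a per-column Series, and using that Series in a boolean condition raises exactly the reported ValueError.

      import pandas as pd

      # Duplicate column labels, with a null as a left join could produce
      pdf = pd.DataFrame([[1, None], [2, 3]], columns=["v", "v"])

      # With duplicate labels, name-based selection returns a DataFrame,
      # not a Series
      col = pdf["v"]
      print(type(col))  # <class 'pandas.core.frame.DataFrame'>

      # .isnull().any() on a DataFrame yields a per-column boolean Series...
      mask = col.isnull().any()

      # ...and using that Series in a boolean context raises the same error
      try:
          if mask:
              pass
      except ValueError as e:
          print(e)  # The truth value of a Series is ambiguous. ...

      This would also be consistent with `select 1 v, 1 v` not failing: if the null check only runs for nullable columns (again an assumption), literal columns would skip it, while the left join makes t2's column nullable and triggers the lookup.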

          People

            Assignee: viirya L. C. Hsieh
            Reporter: michaelchirico Michael Chirico
