Description
My pandas version is 0.25.1.
I ran the following simple code (cross joins are enabled):
spark.sql('''
select t1.*, t2.* from (
select explode(sequence(1, 3)) v
) t1 left join (
select explode(sequence(1, 3)) v
) t2
''').toPandas()
and got a ValueError from pandas:
> ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
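For reference, pandas raises this error whenever a Series is used where Python expects a single boolean; a minimal illustration in plain pandas, with no Spark involved:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Coercing a multi-element Series to a single bool is ambiguous
# (any()? all()?), so pandas raises ValueError instead of guessing.
try:
    if s:
        pass
except ValueError as e:
    print(e)  # The truth value of a Series is ambiguous. Use a.empty, ...
```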
Collect works fine:
spark.sql('''
select * from (
select explode(sequence(1, 3)) v
) t1 left join (
select explode(sequence(1, 3)) v
) t2
''').collect()
# [Row(v=1, v=1),
#  Row(v=1, v=2),
#  Row(v=1, v=3),
#  Row(v=2, v=1),
#  Row(v=2, v=2),
#  Row(v=2, v=3),
#  Row(v=3, v=1),
#  Row(v=3, v=2),
#  Row(v=3, v=3)]
I imagine it's related to the duplicate column names, but this doesn't fail:
spark.sql("select 1 v, 1 v").toPandas()
#    v  v
# 0  1  1
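One way duplicate names could bite (a guess at the mechanism, not verified against the Spark source): selecting a duplicated label from a pandas DataFrame returns a DataFrame rather than a Series, so any internal conversion code that selects a column by name and then boolean-tests something like `.isnull().any()` ends up coercing a Series to bool and trips the ambiguity check:

```python
import pandas as pd

# A frame with duplicate column names, like the toPandas() result above.
pdf = pd.DataFrame([[1, 1]], columns=['v', 'v'])

# Label selection now returns a DataFrame, not a Series.
print(type(pdf['v']))  # <class 'pandas.core.frame.DataFrame'>

# Code written for a Series then breaks: .isnull().any() on a DataFrame
# yields a per-column Series, which cannot be coerced to a single bool.
try:
    if pdf['v'].isnull().any():
        pass
except ValueError as e:
    print(e)  # The truth value of a Series is ambiguous. ...
```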
Also no issue for multiple rows:
spark.sql("select 1 v, 1 v union all select 1 v, 2 v").toPandas()
It also works when the same result is produced not by a cross join but by a janky, programmatically generated union-all query:
cond = []
for ii in range(3):
    for jj in range(3):
        cond.append(f'select {ii+1} v, {jj+1} v')
spark.sql(' union all '.join(cond)).toPandas()
As near as I can tell, the output is identical to the explode output, which makes this issue all the more peculiar. I thought toPandas() was applied to the output of collect(), so if collect() gives the same output in both cases, how can toPandas() fail in one and not the other? The lazy DataFrame is also identical: DataFrame[v: int, v: int] in both cases. I must be missing something.
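A workaround, assuming duplicate column names really are the trigger, is to alias the columns before converting (e.g. `select t1.v as v1, t2.v as v2`). When the query can't easily be rewritten, a small helper can generate unique names to pass to the DataFrame's `toDF()`; the helper itself is plain Python (its suffixing scheme is my own choice, not anything Spark does):

```python
def dedupe_columns(columns):
    """Return the given column names with duplicates suffixed _2, _3, ..."""
    seen = {}
    out = []
    for name in columns:
        count = seen.get(name, 0) + 1
        seen[name] = count
        out.append(name if count == 1 else f'{name}_{count}')
    return out

# Spark-side usage: df.toDF(*dedupe_columns(df.columns)).toPandas()
print(dedupe_columns(['v', 'v']))  # ['v', 'v_2']
```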