Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Cannot Reproduce
- Affects Version: 2.2.1
- Fix Version: None
- Component: None
- Environment: Windows 7; tested in Jupyter notebooks using Python 2.7.14 and Python 3.6.3. Hardware specs not relevant to the issue.
Description
Coalescing an empty DataFrame to 1 partition raises an error.
The funny thing is that coalescing the same empty DataFrame to 2 or more partitions seems to work.
A minimal reproduction:
from pyspark.sql.types import StructType

# Empty DataFrame with an empty schema
df = spark.createDataFrame(spark.sparkContext.emptyRDD(), StructType([]))

print(df.coalesce(2).count())  # 0
print(df.coalesce(3).count())  # 0
print(df.coalesce(4).count())  # 0
df.coalesce(1).count()         # raises Py4JJavaError
Output:
0
0
0
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-5-c067400f2ef0> in <module>()
      7 print(df.coalesce(4).count())
      8
----> 9 print(df.coalesce(1).count())

C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\dataframe.py in count(self)
    425         2
    426         """
--> 427         return int(self._jdf.count())
    428
    429     @ignore_unicode_prefix

c:\python36\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134
   1135         for temp_arg in temp_args:

C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

c:\python36\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o176.count.
: java.util.NoSuchElementException: next on empty iterator
	at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
	at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
	at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
	at scala.collection.IterableLike$class.head(IterableLike.scala:107)
	at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186)
	at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
	at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2435)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2434)
	at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
	at org.apache.spark.sql.Dataset.count(Dataset.scala:2434)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Unknown Source)
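For readers less familiar with the Scala side of the trace: it bottoms out in `head()` called on an empty collection of per-partition results inside `Dataset.count`. A plain-Python analogy of that failure mode (this is just an illustration, not Spark code) is calling `next()` on an empty iterator:

```python
# Analogy only: Scala's head() on an empty collection throws
# NoSuchElementException ("next on empty iterator"), much like
# Python's next() on an exhausted iterator raises StopIteration.
it = iter([])
try:
    next(it)
    result = "got a value"
except StopIteration:
    result = "next on empty iterator"
print(result)  # next on empty iterator
```

The single-partition case presumably produces zero results where the multi-partition cases still yield per-partition counts, which would explain why only `coalesce(1)` hits this path.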
Shouldn't this be consistent?
Thank you very much.