Spark / SPARK-23950

Coalescing an empty dataframe to 1 partition

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.2.1
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None
    • Environment: Operating System: Windows 7

      Tested in Jupyter notebooks using Python 2.7.14 and Python 3.6.3.

      Hardware specs not relevant to the issue.

    Description

      Coalescing an empty dataframe to 1 partition and then calling count() raises a java.util.NoSuchElementException.

      The funny thing is that coalescing an empty dataframe to 2 or more partitions seems to work.

      The test case is the following:

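      from pyspark.sql.types import StructType
      
      df = spark.createDataFrame(spark.sparkContext.emptyRDD(), StructType([]))  # `spark` is the notebook's active SparkSession
      
      print(df.coalesce(2).count())  # prints 0
      print(df.coalesce(3).count())  # prints 0
      print(df.coalesce(4).count())  # prints 0
      
      print(df.coalesce(1).count())  # raises Py4JJavaError (NoSuchElementException)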

      Output:

      0
      0
      0
      ---------------------------------------------------------------------------
      Py4JJavaError Traceback (most recent call last)
      <ipython-input-5-c067400f2ef0> in <module>()
      7 print(df.coalesce(4).count())
      8 
      ----> 9 print(df.coalesce(1).count())
      
      C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\dataframe.py in count(self)
      425 2
      426 """
      --> 427 return int(self._jdf.count())
      428 
      429 @ignore_unicode_prefix
      
      c:\python36\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
      1131 answer = self.gateway_client.send_command(command)
      1132 return_value = get_return_value(
      -> 1133 answer, self.gateway_client, self.target_id, self.name)
      1134 
      1135 for temp_arg in temp_args:
      
      C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
      61 def deco(*a, **kw):
      62 try:
      ---> 63 return f(*a, **kw)
      64 except py4j.protocol.Py4JJavaError as e:
      65 s = e.java_exception.toString()
      
      c:\python36\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
      317 raise Py4JJavaError(
      318 "An error occurred while calling {0}{1}{2}.\n".
      --> 319 format(target_id, ".", name), value)
      320 else:
      321 raise Py4JError(
      
      Py4JJavaError: An error occurred while calling o176.count.
      : java.util.NoSuchElementException: next on empty iterator
      at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
      at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
      at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
      at scala.collection.IterableLike$class.head(IterableLike.scala:107)
      at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186)
      at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
      at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186)
      at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2435)
      at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2434)
      at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
      at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
      at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
      at org.apache.spark.sql.Dataset.count(Dataset.scala:2434)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
      at java.lang.reflect.Method.invoke(Unknown Source)
      at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
      at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
      at py4j.Gateway.invoke(Gateway.java:280)
      at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
      at py4j.commands.CallCommand.execute(CallCommand.java:79)
      at py4j.GatewayConnection.run(GatewayConnection.java:214)
      at java.lang.Thread.run(Unknown Source)
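
To make the inconsistency easier to check on other Spark versions, the test case above could be wrapped in a small probe that records, per partition count, either the count or the first line of the error. This is only a sketch: `probe_coalesce_counts` is a hypothetical helper name, not part of PySpark, and it relies only on the `coalesce()` and `count()` calls already used in the report.

```python
def probe_coalesce_counts(df, partition_counts=(1, 2, 3, 4)):
    """Call df.coalesce(n).count() for each n and record the outcome.

    Hypothetical helper for illustration. `df` is expected to be a
    PySpark DataFrame, but any object exposing coalesce(n).count() works.
    Returns a dict mapping n to the count, or to a short error summary.
    """
    results = {}
    for n in partition_counts:
        try:
            results[n] = df.coalesce(n).count()
        except Exception as exc:  # Py4JJavaError in the report above
            lines = str(exc).splitlines()
            first_line = lines[0] if lines else ""
            results[n] = "{}: {}".format(type(exc).__name__, first_line)
    return results
```

Calling `probe_coalesce_counts(df)` with the `df` from the test case returns one entry per partition count, so a version where the problem does not reproduce would show a count for every key.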

      Shouldn't this be consistent?

      Thank you very much.
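
Until the behavior is consistent, one possible workaround is to skip the coalesce when the dataframe has no rows. The following is an untested sketch, not a confirmed fix: `coalesce_unless_empty` is a hypothetical helper, and it assumes that `count()` on the uncoalesced empty dataframe succeeds, which this report does not state.

```python
def coalesce_unless_empty(df, num_partitions):
    """Return df.coalesce(num_partitions), or df unchanged if it has no rows.

    Hypothetical workaround sketch. `df` is expected to be a PySpark
    DataFrame; df.rdd.isEmpty() runs a small Spark job to check for rows.
    """
    if df.rdd.isEmpty():
        # Assumption: leaving an empty dataframe uncoalesced avoids the error.
        return df
    return df.coalesce(num_partitions)
```

With this helper the failing line would become `coalesce_unless_empty(df, 1).count()`. The extra emptiness check costs one additional job per call, so it is only worth using where empty inputs are actually expected.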

          People

            Assignee: Unassigned
            Reporter: João Neves (jonsnowseven)