
SPARK-22249: UnsupportedOperationException: empty.reduceLeft when caching a DataFrame


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.1, 2.2.0
    • Fix Version/s: 2.2.1, 2.3.0
    • Component/s: SQL
    • Labels: None

      Description

      It seems that the isin() method with an empty list as its argument works only if the DataFrame is not cached; if the DataFrame is cached, it raises an exception. To reproduce:

      $ pyspark
      >>> from pyspark.sql import Row
      >>> df = spark.createDataFrame([Row(KEY="value")])
      >>> df.where(df["KEY"].isin([])).show()
      +---+
      |KEY|
      +---+
      +---+
      
      >>> df.cache()
      DataFrame[KEY: string]
      >>> df.where(df["KEY"].isin([])).show()
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/usr/local/anaconda3/envs/<myenv>/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 336, in show
          print(self._jdf.showString(n, 20))
        File "/usr/local/anaconda3/envs/<myenv>/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
        File "/usr/local/anaconda3/envs/<myenv>/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
          return f(*a, **kw)
        File "/usr/local/anaconda3/envs/<myenv>/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
      py4j.protocol.Py4JJavaError: An error occurred while calling o302.showString.
      : java.lang.UnsupportedOperationException: empty.reduceLeft
      	at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180)
      	at scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:48)
      	at scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:74)
      	at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:48)
      	at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208)
      	at scala.collection.AbstractTraversable.reduce(Traversable.scala:104)
      	at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:107)
      	at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:71)
      	at scala.PartialFunction$Lifted.apply(PartialFunction.scala:223)
      	at scala.PartialFunction$Lifted.apply(PartialFunction.scala:219)
      	at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:112)
      	at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:111)
      	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
      	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
      	at scala.collection.immutable.List.foreach(List.scala:381)
      	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
      	at scala.collection.immutable.List.flatMap(List.scala:344)
      	at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.<init>(InMemoryTableScanExec.scala:111)
      	at org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307)
      	at org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307)
      	at org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:99)
      	at org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$.apply(SparkStrategies.scala:303)
      	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
      	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
      	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
      	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
      	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
      	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
      	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
      	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
      	at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
      	at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
      	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
      	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
      	at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
      	at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
      	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
      	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
      	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
      	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
      	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
      	at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
      	at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
      	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
      	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
      	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2832)
      	at org.apache.spark.sql.Dataset.head(Dataset.scala:2153)
      	at org.apache.spark.sql.Dataset.take(Dataset.scala:2366)
      	at org.apache.spark.sql.Dataset.showString(Dataset.scala:245)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
      	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
      	at py4j.Gateway.invoke(Gateway.java:280)
      	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
      	at py4j.commands.CallCommand.execute(CallCommand.java:79)
      	at py4j.GatewayConnection.run(GatewayConnection.java:214)
      	at java.lang.Thread.run(Thread.java:745)
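
      The topmost Spark frame points at InMemoryTableScanExec.scala:107, where the in-memory scan turns each literal in an In filter into a bounds check against the cached partition statistics and combines the checks with reduce. An empty isin() list leaves nothing to combine, and Scala's reduce/reduceLeft throws on an empty collection; the fix in 2.2.1/2.3.0 presumably guards that reduce against the empty list. A Python analogue of the failure mode (an illustration only, not the actual Spark code, which is Scala):

      >>> from functools import reduce
      >>> import operator
      >>> # Like Scala's Seq.reduce, functools.reduce with no initial value
      >>> # fails on an empty sequence -- the Python counterpart of the
      >>> # empty.reduceLeft above.
      >>> reduce(operator.or_, [])
      Traceback (most recent call last):
        ...
      TypeError: reduce() of empty sequence with no initial value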
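
      As a workaround until a fixed release is available, guarding against the empty list on the caller's side avoids the broken code path entirely. A minimal sketch (isin_or_false is a hypothetical helper, not part of the PySpark API):

      >>> from pyspark.sql import functions as F
      >>> def isin_or_false(col, values):
      ...     # An empty isin() list can never match anything, so short-circuit
      ...     # to a constant-false predicate instead of hitting the cached scan.
      ...     return col.isin(values) if values else F.lit(False)
      ...
      >>> df.where(isin_or_false(df["KEY"], [])).show()
      +---+
      |KEY|
      +---+
      +---+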
      

    People

    • Assignee: Marco Gaido (mgaido)
    • Reporter: Andreas Maier (asmaier)
    • Votes: 0
    • Watchers: 4
