Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-22249

UnsupportedOperationException: empty.reduceLeft when caching a dataframe

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.1.1, 2.2.0
    • 2.2.1, 2.3.0
    • SQL
    • None

    Description

      It seems that the isin() method with an empty list as argument only works, if the dataframe is not cached. If it is cached, it results in an exception. To reproduce

      $ pyspark
      >>> df = spark.createDataFrame([pyspark.Row(KEY="value")])
      >>> df.where(df["KEY"].isin([])).show()
      +---+
      |KEY|
      +---+
      +---+
      
      >>> df.cache()
      DataFrame[KEY: string]
      >>> df.where(df["KEY"].isin([])).show()
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/usr/local/anaconda3/envs/<myenv>/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 336, in show
          print(self._jdf.showString(n, 20))
        File "/usr/local/anaconda3/envs/<myenv>/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
        File "/usr/local/anaconda3/envs/<myenv>/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
          return f(*a, **kw)
        File "/usr/local/anaconda3/envs/<myenv>/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
      py4j.protocol.Py4JJavaError: An error occurred while calling o302.showString.
      : java.lang.UnsupportedOperationException: empty.reduceLeft
      	at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180)
      	at scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:48)
      	at scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:74)
      	at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:48)
      	at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208)
      	at scala.collection.AbstractTraversable.reduce(Traversable.scala:104)
      	at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:107)
      	at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:71)
      	at scala.PartialFunction$Lifted.apply(PartialFunction.scala:223)
      	at scala.PartialFunction$Lifted.apply(PartialFunction.scala:219)
      	at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:112)
      	at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:111)
      	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
      	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
      	at scala.collection.immutable.List.foreach(List.scala:381)
      	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
      	at scala.collection.immutable.List.flatMap(List.scala:344)
      	at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.<init>(InMemoryTableScanExec.scala:111)
      	at org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307)
      	at org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307)
      	at org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:99)
      	at org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$.apply(SparkStrategies.scala:303)
      	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
      	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
      	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
      	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
      	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
      	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
      	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
      	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
      	at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
      	at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
      	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
      	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
      	at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
      	at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
      	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
      	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
      	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
      	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
      	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
      	at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
      	at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
      	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
      	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
      	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2832)
      	at org.apache.spark.sql.Dataset.head(Dataset.scala:2153)
      	at org.apache.spark.sql.Dataset.take(Dataset.scala:2366)
      	at org.apache.spark.sql.Dataset.showString(Dataset.scala:245)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
      	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
      	at py4j.Gateway.invoke(Gateway.java:280)
      	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
      	at py4j.commands.CallCommand.execute(CallCommand.java:79)
      	at py4j.GatewayConnection.run(GatewayConnection.java:214)
      	at java.lang.Thread.run(Thread.java:745)
      

      Attachments

        Issue Links

          Activity

            People

              mgaido Marco Gaido
              asmaier Andreas Maier
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: