Spark / SPARK-21550

approxQuantile throws "next on empty iterator" on empty data

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.2.0
    • Component/s: SQL
    • Labels: None

      Description

      The documentation says:

      null and NaN values will be removed from the numerical column before calculation. If
      the dataframe is empty or the column only contains null or NaN, an empty array is returned.
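
The documented contract can be illustrated with a small pure-Python reference. This is only a sketch of the expected behaviour, not Spark's implementation: `exact_quantiles` is a hypothetical name, and it computes exact nearest-rank quantiles rather than the approximate summaries Spark uses.

```python
import math
from typing import List, Optional, Sequence


def exact_quantiles(values: Sequence[Optional[float]],
                    probabilities: Sequence[float]) -> List[float]:
    """Reference behaviour only: exact nearest-rank quantiles, not Spark's algorithm."""
    # Drop null (None) and NaN entries before any calculation, as the docs describe.
    cleaned = sorted(v for v in values if v is not None and not math.isnan(v))
    # Documented contract: no usable values means an empty result, not an exception.
    if not cleaned:
        return []
    result = []
    for p in probabilities:
        # Nearest-rank: smallest value whose rank is at least p * n (rank 1 for p == 0).
        rank = max(1, math.ceil(p * len(cleaned)))
        result.append(cleaned[rank - 1])
    return result
```

Under this reading of the documentation, an input with no rows, or with only null and NaN values, yields an empty list rather than an error.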
      

      However, this small PySpark example, which filters out every row before calling approxQuantile,

      from pyspark.sql.functions import col  # needed for col("id")
      sql_context.range(10).filter(col("id") == 42).approxQuantile("id", [0.99], 0.001)
      

      throws

      Py4JJavaError: An error occurred while calling o3493.approxQuantile.
      : java.util.NoSuchElementException: next on empty iterator
      	at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
      	at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
      	at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
      	at scala.collection.IterableLike$class.head(IterableLike.scala:107)
      	at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186)
      	at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
      	at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186)
      	at scala.collection.TraversableLike$class.last(TraversableLike.scala:431)
      	at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$last(ArrayOps.scala:186)
      	at scala.collection.IndexedSeqOptimized$class.last(IndexedSeqOptimized.scala:132)
      	at scala.collection.mutable.ArrayOps$ofRef.last(ArrayOps.scala:186)
      	at org.apache.spark.sql.catalyst.util.QuantileSummaries.query(QuantileSummaries.scala:207)
      	at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply$mcDD$sp(StatFunctions.scala:92)
      	at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92)
      	at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92)
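
The trace shows `QuantileSummaries.query` reaching `ArrayOps.last` on what appears to be an empty array. On an affected version, a caller could guard the call so that it returns the documented empty result instead of raising. The following is a minimal sketch of such a guard, assuming `df` is a PySpark DataFrame; `approx_quantile_or_empty` is a hypothetical helper and not part of Spark.

```python
from typing import Any, List, Sequence


def approx_quantile_or_empty(df: Any, column: str,
                             probabilities: Sequence[float],
                             relative_error: float) -> List[float]:
    """Hypothetical helper: return [] instead of raising when no usable rows exist."""
    # dropna() removes rows whose value is null; for numeric columns it is
    # assumed here to remove NaN as well.
    usable = df.select(column).dropna()
    # head(1) fetches at most one row, so the check avoids a full count.
    if not usable.head(1):
        return []
    return df.approxQuantile(column, list(probabilities), relative_error)
```

With the example from this report, `approx_quantile_or_empty(sql_context.range(10).filter(col("id") == 42), "id", [0.99], 0.001)` would take the early-return branch, since the filtered DataFrame has no rows.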
      

              People

              • Assignee: Unassigned
              • Reporter: peay
              • Votes: 0
              • Watchers: 2
