Spark / SPARK-16875

Add args checking for DataSet randomSplit and sample


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: SQL
    • Labels: None

    Description

      scala> data
      res73: org.apache.spark.sql.DataFrame = [label: double, features: vector]
      
      scala> data.count
      res74: Long = 150
      
      scala> val s = data.randomSplit(Array(1,2,-0.01))
      s: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = Array([label: double, features: vector], [label: double, features: vector], [label: double, features: vector])
      
      scala> s(0).count
      res75: Long = 51
      
      scala> s(2).count
      16/08/03 18:28:27 ERROR Executor: Exception in task 0.0 in stage 76.0 (TID 66)
      java.lang.IllegalArgumentException: requirement failed: Upper bound (1.0033444816053512) must be <= 1.0
      	at scala.Predef$.require(Predef.scala:224)
      
      scala> data.sample(false, -0.01)
      res80: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]
      
      scala> data.sample(false, -0.01).count
      16/08/03 18:30:33 ERROR Executor: Exception in task 0.0 in stage 84.0 (TID 71)
      java.lang.IllegalArgumentException: requirement failed: Lower bound (0.0) must be <= upper bound (-0.01)
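
      The out-of-range upper bound in the first stack trace can be reproduced outside Spark. The following is a sketch of the weight normalization randomSplit performs (the exact internals are an assumption); a negative weight pushes one of the cumulative bounds above 1.0, which is only detected when that split is evaluated:

      ```scala
      // Normalize the weights and build cumulative split boundaries,
      // as randomSplit does before sampling each partition.
      val weights = Array(1.0, 2.0, -0.01)
      val sum = weights.sum                                  // 2.99
      val bounds = weights.map(_ / sum).scanLeft(0.0)(_ + _)
      // bounds(2) = (1.0 + 2.0) / 2.99 ≈ 1.0033444816053512,
      // which exceeds 1.0 and later fails the "Upper bound must be <= 1.0" check.
      ```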
      

      val s = data.randomSplit(Array(1, 2, -0.01)) runs successfully, and even s(0).count works in the following lines; the invalid weight only surfaces as an executor error once s(2) is evaluated.
      data.sample(false, -0.01) should likewise fail immediately: both methods should validate their arguments eagerly instead of deferring the failure to the first action.
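
      The eager checks this issue asks for could look like the sketch below. ArgCheck and its two helper names are hypothetical, not Spark API; the point is that require fires at call time, before any job is scheduled:

      ```scala
      object ArgCheck {
        // Reject a negative sampling fraction up front.
        def checkSampleFraction(fraction: Double): Unit =
          require(fraction >= 0.0,
            s"Fraction must be nonnegative, but got $fraction")

        // Reject negative or all-zero split weights up front.
        def checkSplitWeights(weights: Array[Double]): Unit = {
          require(weights.forall(_ >= 0),
            s"Weights must be nonnegative, but got ${weights.mkString("[", ",", "]")}")
          require(weights.sum > 0,
            s"Sum of weights must be positive, but got ${weights.mkString("[", ",", "]")}")
        }
      }
      ```

      With such checks in place, data.sample(false, -0.01) and data.randomSplit(Array(1, 2, -0.01)) would throw IllegalArgumentException at the call site rather than inside a later task.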


          People

            Assignee: Ruifeng Zheng (podongfeng)
            Reporter: Ruifeng Zheng (podongfeng)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved: