SPARK-13473

Predicate can't be pushed through project with nondeterministic field


    Description

      The following Spark shell snippet reproduces this issue:

      import org.apache.spark.sql.functions._
      
      val parallelism = 8 // Adjust this to default parallelism
      val df = sqlContext.
        range(2 * parallelism). // 8 partitions, 2 elements per partition
        select(
          col("id"),
          monotonicallyIncreasingId().as("long_id")
        )
      
      df.show()
      
      // +---+-----------+
      // | id|    long_id|
      // +---+-----------+
      // |  0|          0|
      // |  1|          1|
      // |  2| 8589934592|
      // |  3| 8589934593|
      // |  4|17179869184|
      // |  5|17179869185|
      // |  6|25769803776|
      // |  7|25769803777|
      // |  8|34359738368|
      // |  9|34359738369|
      // | 10|42949672960|
      // | 11|42949672961|
      // | 12|51539607552|
      // | 13|51539607553|
      // | 14|60129542144|
      // | 15|60129542145|
      // +---+-----------+
      
      df.
        filter(col("id") === 3). // 2nd element in the 2nd partition
        show()
      
      // +---+----------+
      // | id|   long_id|
      // +---+----------+
      // |  3|8589934592|
      // +---+----------+
      //
      // Wrong: the full output above shows 8589934593 for id 3.

      monotonicallyIncreasingId is nondeterministic: the value it produces for a row depends on how many rows precede it within the same partition. When the filter is pushed below the Project, the expression is evaluated after rows have already been filtered out, so id 3 becomes the first surviving row of its partition and receives 8589934592 instead of 8589934593. Predicates therefore must not be pushed through a Project that contains nondeterministic fields.
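      The observed values are consistent with an ID scheme in which the partition index occupies the upper bits and a per-partition row counter the lower 33 bits. A minimal sketch of that model (inferred from the output above, not taken from Spark's source) shows how re-evaluating the expression after the pushed-down filter changes the result:

      object MonotonicIdSketch {
        // Hypothetical model inferred from the repro output:
        // partition index in the upper bits, row index within the
        // partition in the lower 33 bits.
        def id(partition: Long, rowInPartition: Long): Long =
          (partition << 33) | rowInPartition

        def main(args: Array[String]): Unit = {
          // id = 3 is row 1 of partition 1 in the unfiltered DataFrame:
          println(id(1, 1)) // 8589934593, the value in the full df.show()

          // With the filter pushed below the Project, id = 3 is the
          // FIRST surviving row of partition 1, so it gets row index 0:
          println(id(1, 0)) // 8589934592, the wrong value in the bug
        }
      }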

            People

              Assignee: Cheng Lian
              Reporter: Cheng Lian
