Spark / SPARK-6319

Should throw analysis exception when using binary type in groupby/join


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.0.2, 1.1.1, 1.2.1, 1.3.0
    • Fix Version/s: 1.5.0
    • Component/s: SQL
    • Labels: None

    Description

      Spark shell session for reproduction:

      scala> import sqlContext.implicits._
      scala> import org.apache.spark.sql.types._
      scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" cast BinaryType).distinct.show()
      ...
      CAST(c, BinaryType)
      [B@43f13160
      [B@5018b648
      [B@3be22500
      [B@476fc8a1
      

      Spark SQL uses plain byte arrays to represent binary values, but Java arrays are compared by reference rather than by value. Meanwhile, the DISTINCT operator uses a HashSet and its .contains method to check for duplicate values. Together, these two facts produce the duplicated rows shown above.
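A minimal Scala sketch (outside Spark) of the underlying issue: array equality and hashCode are identity-based on the JVM, so a HashSet treats byte arrays with identical contents as distinct elements.

```scala
object BinaryEqualityDemo extends App {
  val a: Array[Byte] = "1".getBytes
  val b: Array[Byte] = "1".getBytes

  // Arrays are compared by reference, not by content.
  println(a == b)                        // false
  println(java.util.Arrays.equals(a, b)) // true: element-wise comparison

  // A HashSet relies on hashCode/equals, so it cannot deduplicate arrays.
  val seen = scala.collection.mutable.HashSet[Array[Byte]]()
  seen += a
  println(seen.contains(b))              // false: b looks "new" to the set
}
```

Any fix therefore has to give binary values content-based equals/hashCode semantics (or compare them with something like java.util.Arrays.equals) before hash-based operators such as DISTINCT, GROUP BY, and joins can handle BinaryType correctly.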

    People

      Assignee: L. C. Hsieh (viirya)
      Reporter: Cheng Lian (lian cheng)
      Votes: 0
      Watchers: 10
