Spark / SPARK-21344

BinaryType comparison does signed byte array comparison

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0, 2.1.1
    • Fix Version/s: 2.0.3, 2.1.2, 2.2.1
    • Component/s: SQL
    • Labels: None

      Description

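      BinaryType in Spark SQL orders values by comparing their bytes as signed values, so any byte in the range 0x80 to 0xFF sorts before any byte in the range 0x00 to 0x7F. This can lead to unexpected results, for example from range filters on binary columns. The following snippet demonstrates the error: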

      import org.apache.spark.sql.functions.col

      case class TestRecord(col0: Array[Byte])

      // Encode a Long as 8 big-endian bytes (ByteBuffer's default byte order).
      def convertToBytes(i: Long): Array[Byte] = {
        val bb = java.nio.ByteBuffer.allocate(8)
        bb.putLong(i)
        bb.array
      }

      // Assumes a spark-shell session, where `spark` and `sc` are predefined.
      def test(): Unit = {
        val sql = spark.sqlContext
        import sql.implicits._
        val timestamp = 1498772083037L
        // 1001 consecutive values, each stored as an 8-byte array.
        val data = (timestamp to timestamp + 1000L).map(i => TestRecord(convertToBytes(i)))
        val testDF = sc.parallelize(data).toDF
        val filter1 = testDF.filter(col("col0") >= convertToBytes(timestamp) && col("col0") < convertToBytes(timestamp + 50L))
        val filter2 = testDF.filter(col("col0") >= convertToBytes(timestamp + 50L) && col("col0") < convertToBytes(timestamp + 100L))
        val filter3 = testDF.filter(col("col0") >= convertToBytes(timestamp) && col("col0") < convertToBytes(timestamp + 100L))
        // Counts expected when byte arrays are ordered by unsigned byte values.
        assert(filter1.count == 50)
        assert(filter2.count == 50)
        assert(filter3.count == 100)
      }
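
      The difference between signed and unsigned byte ordering can be seen without Spark. The following is a minimal sketch in plain Scala, not Spark's actual comparison code: the object name BinaryOrderingSketch and the helpers toBytes, compareSigned, and compareUnsigned are hypothetical names introduced only for illustration. It compares the 8-byte big-endian encodings of 127 and 128, which differ only in the last byte (0x7F versus 0x80).

```scala
object BinaryOrderingSketch {
  // Hypothetical helper: encode a Long as 8 big-endian bytes,
  // the same encoding convertToBytes uses in the report.
  def toBytes(i: Long): Array[Byte] =
    java.nio.ByteBuffer.allocate(8).putLong(i).array

  // Lexicographic comparison of two byte arrays using the supplied
  // per-byte comparison; the shorter array sorts first on a tie.
  private def compareWith(a: Array[Byte], b: Array[Byte])(cmp: (Byte, Byte) => Int): Int =
    a.iterator.zip(b.iterator)
      .map { case (x, y) => cmp(x, y) }
      .find(_ != 0)
      .getOrElse(Integer.compare(a.length, b.length))

  // Signed ordering: each byte is treated as a value in -128..127.
  def compareSigned(a: Array[Byte], b: Array[Byte]): Int =
    compareWith(a, b)((x, y) => java.lang.Byte.compare(x, y))

  // Unsigned ordering: each byte is masked to 0..255 before comparing.
  def compareUnsigned(a: Array[Byte], b: Array[Byte]): Int =
    compareWith(a, b)((x, y) => Integer.compare(x & 0xff, y & 0xff))

  def main(args: Array[String]): Unit = {
    val lo = toBytes(127L) // last byte 0x7F
    val hi = toBytes(128L) // last byte 0x80, which is -128 as a signed byte
    println(compareSigned(lo, hi))   // positive: 127 sorts after 128
    println(compareUnsigned(lo, hi)) // negative: 127 sorts before 128
  }
}
```

      Under signed ordering the smaller number sorts after the larger one, which is the kind of inversion that can make a range filter over encoded values return an unexpected count.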
      

            People

            • Assignee: Kazuaki Ishizaki (kiszk)
            • Reporter: Shubham Chopra (shubhamc)
            • Votes: 0
            • Watchers: 5
