Description
BinaryType as used by Spark SQL defines its ordering with signed byte comparisons: each byte is compared as a signed value, so 0x80 (-128) sorts before 0x7F (127). For byte arrays that encode increasing unsigned quantities, such as the big-endian timestamps below, this leads to unexpected behavior: once a byte in the encoding crosses from 0x7F to 0x80, range filters stop matching the rows they should. The following code snippet reproduces the error:
case class TestRecord(col0: Array[Byte])

def convertToBytes(i: Long): Array[Byte] = {
  val bb = java.nio.ByteBuffer.allocate(8)
  bb.putLong(i)
  bb.array
}

def test = {
  val sql = spark.sqlContext
  import sql.implicits._
  val timestamp = 1498772083037L
  val data = (timestamp to timestamp + 1000L).map(i => TestRecord(convertToBytes(i)))
  val testDF = sc.parallelize(data).toDF
  val filter1 = testDF.filter(col("col0") >= convertToBytes(timestamp) &&
    col("col0") < convertToBytes(timestamp + 50L))
  val filter2 = testDF.filter(col("col0") >= convertToBytes(timestamp + 50L) &&
    col("col0") < convertToBytes(timestamp + 100L))
  val filter3 = testDF.filter(col("col0") >= convertToBytes(timestamp) &&
    col("col0") < convertToBytes(timestamp + 100L))
  assert(filter1.count == 50)
  assert(filter2.count == 50)
  assert(filter3.count == 100)
}
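To make the root cause concrete, here is a minimal sketch contrasting signed and unsigned lexicographic comparison of the encodings produced by convertToBytes above. The compareSigned and compareUnsigned helpers are hypothetical illustrations, not Spark's actual comparator:

// Hypothetical helpers, not Spark's implementation.
// compareSigned mimics the signed byte-by-byte ordering described above;
// compareUnsigned is the ordering one would expect for binary data.
def compareSigned(x: Array[Byte], y: Array[Byte]): Int =
  x.zip(y).map { case (a, b) => a - b }            // bytes compared as signed values
    .find(_ != 0).getOrElse(x.length - y.length)

def compareUnsigned(x: Array[Byte], y: Array[Byte]): Int =
  x.zip(y).map { case (a, b) => (a & 0xff) - (b & 0xff) } // bytes compared as 0..255
    .find(_ != 0).getOrElse(x.length - y.length)

// convertToBytes is the big-endian encoder from the snippet above.
// timestamp + 34 encodes with a final byte of 0x7f (127);
// timestamp + 35 encodes with a final byte of 0x80, i.e. -128 as a signed byte.
val lo = convertToBytes(1498772083037L + 34L)
val hi = convertToBytes(1498772083037L + 35L)

assert(compareSigned(lo, hi) > 0)   // signed: the larger timestamp sorts first (the bug)
assert(compareUnsigned(lo, hi) < 0) // unsigned: the expected order

Since the filters in the reproduction span this 0x7F to 0x80 boundary, the rows whose encodings end in a "negative" byte fall outside the signed comparison range and the assertions fail.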