Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.2.2
- Fix Version/s: None
- Component/s: None
Description
array_contains([0.0], -0.0) returns true, while arrays_overlap([0.0], [-0.0]) returns false. I think we generally want to treat -0.0 and 0.0 as the same (see https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtil.scala#L28).
However, the Double::equals method does not: it compares bit patterns, so it treats -0.0 and 0.0 as unequal. Therefore, we should either mark double as false in TypeUtils#typeWithProperEquals, or wrap it with our own equals method that handles this case.
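A minimal standalone Java example of the discrepancy described above: primitive `==` follows IEEE 754 and treats -0.0 and 0.0 as equal, while `Double::equals` (which is bit-pattern based via `doubleToLongBits`) does not:

```java
public class NegativeZeroDemo {
    public static void main(String[] args) {
        // Primitive comparison follows IEEE 754: -0.0 == 0.0 is true.
        System.out.println(-0.0 == 0.0);                                  // true

        // Double.equals compares via doubleToLongBits, so the two zeros,
        // which have different bit patterns, compare as unequal.
        System.out.println(Double.valueOf(-0.0).equals(Double.valueOf(0.0))); // false

        // Double.compare likewise orders -0.0 strictly before 0.0.
        System.out.println(Double.compare(-0.0, 0.0));                    // -1
    }
}
```

This is the same mismatch that makes array_contains (which ends up on `==`-style semantics) and arrays_overlap (which goes through equals-based hashing) disagree.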
Java code snippets showing the issue:
Dataset<Row> dataset = sparkSession.createDataFrame(
    List.of(RowFactory.create(List.of(-0.0))),
    DataTypes.createStructType(ImmutableList.of(
        DataTypes.createStructField(
            "doubleCol", DataTypes.createArrayType(DataTypes.DoubleType), false))));
Dataset<Row> df = dataset.withColumn(
    "overlaps",
    functions.arrays_overlap(functions.array(functions.lit(+0.0)), dataset.col("doubleCol")));
List<Row> result = df.collectAsList(); // [[WrappedArray(-0.0),false]]
Dataset<Row> dataset = sparkSession.createDataFrame(
    List.of(RowFactory.create(-0.0)),
    DataTypes.createStructType(ImmutableList.of(
        DataTypes.createStructField("doubleCol", DataTypes.DoubleType, false))));
Dataset<Row> df = dataset.withColumn(
    "contains",
    functions.array_contains(functions.array(functions.lit(+0.0)), dataset.col("doubleCol")));
List<Row> result = df.collectAsList(); // [[-0.0,true]]
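A sketch of the second option proposed above: wrapping equality in a helper that normalizes -0.0 to 0.0 before delegating to Double::equals. The class and method names here (`DoubleEquality`, `sqlEquals`) are hypothetical, not existing Spark APIs:

```java
public final class DoubleEquality {
    private DoubleEquality() {}

    // Hypothetical helper: treats -0.0 and 0.0 as equal, matching the
    // ordering semantics in SQLOrderingUtil, while keeping NaN equal to
    // NaN (as Double.equals already does).
    public static boolean sqlEquals(double a, double b) {
        // Adding +0.0 maps -0.0 to 0.0 and leaves every other value,
        // including NaN and the infinities, unchanged.
        return Double.valueOf(a + 0.0).equals(Double.valueOf(b + 0.0));
    }
}
```

With a wrapper like this in place, both array_contains and arrays_overlap would agree that [0.0] and [-0.0] share an element.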