Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-26233

Incorrect decimal value with java beans and first/last/max... functions

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0
    • 2.2.3, 2.3.3, 2.4.1, 3.0.0
    • SQL

    Description

      Decimal values from Java beans are incorrectly scaled when used with functions like first/last/max...

      This problem came because Encoders.bean always set Decimal values as DecimalType(this.MAX_PRECISION(), 18).

      Usually it's not a problem if you use numeric functions like sum but for functions like first/last/max... it is a problem.

      How to reproduce this error:

      Using this class as an example:

      public class Foo implements Serializable {
      
        private String group;
        private BigDecimal var;
      
        public BigDecimal getVar() {
          return var;
        }
      
        public void setVar(BigDecimal var) {
          this.var = var;
        }
      
        public String getGroup() {
          return group;
        }
      
        public void setGroup(String group) {
          this.group = group;
        }
      }
      

       

      And a dummy code to create some objects:

      Dataset<Foo> ds = spark.range(5)
          .map(l -> {
            Foo foo = new Foo();
            foo.setGroup("" + l);
            foo.setVar(BigDecimal.valueOf(l + 0.1111));
            return foo;
          }, Encoders.bean(Foo.class));
      ds.printSchema();
      ds.show();
      
      +-----+------+
      |group| var|
      +-----+------+
      | 0|0.1111|
      | 1|1.1111|
      | 2|2.1111|
      | 3|3.1111|
      | 4|4.1111|
      +-----+------+
      

      We can see that the DecimalType is precision 38 and 18 scale and all values are show correctly.

      But if we use a first function, they are scaled incorrectly:

      ds.groupBy(col("group"))
          .agg(
              first("var")
          )
          .show();
      
      
      +-----+-----------------+
      |group|first(var, false)|
      +-----+-----------------+
      | 3| 3.1111E-14|
      | 0| 1.111E-15|
      | 1| 1.1111E-14|
      | 4| 4.1111E-14|
      | 2| 2.1111E-14|
      +-----+-----------------+
      

      This incorrect behavior cannot be reproduced if we use "numerical "functions like sum or if the column is cast a new Decimal Type.

      ds.groupBy(col("group"))
          .agg(
              sum("var")
          )
          .show();
      
      +-----+--------------------+
      |group| sum(var)|
      +-----+--------------------+
      | 3|3.111100000000000000|
      | 0|0.111100000000000000|
      | 1|1.111100000000000000|
      | 4|4.111100000000000000|
      | 2|2.111100000000000000|
      +-----+--------------------+
      
      ds.groupBy(col("group"))
          .agg(
              first(col("var").cast(new DecimalType(38, 8)))
          )
          .show();
      
      +-----+----------------------------------------+
      |group|first(CAST(var AS DECIMAL(38,8)), false)|
      +-----+----------------------------------------+
      | 3| 3.11110000|
      | 0| 0.11110000|
      | 1| 1.11110000|
      | 4| 4.11110000|
      | 2| 2.11110000|
      +-----+----------------------------------------+
      

         

       

      Attachments

        Activity

          People

            mgaido Marco Gaido
            mcanes Miquel Canes
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: