Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-24401

Aggreate on Decimal Types does not work

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.2.0
    • 2.3.0
    • SQL
    • None

    Description

      Hi, 

      I think I found a really ugly bug in spark when performing aggregations with Decimals

      To reproduce: 

       

      val df = spark.read.parquet("attached file")
      val first_agg = fact_df.groupBy("id1", "id2", "start_date").agg(mean("projection_factor").alias("projection_factor"))
      first_agg.show
      val second_agg = first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), min("projection_factor").alias("minf"))
      second_agg.show
      

      First aggregation works fine the second aggregation seems to be summing instead of max value. I tried with spark 2.2.0 and 2.3.0 same problem.

      The dataset as circa 800 Rows and the projection_factor has values from 0 until 100. the result should not be bigger that 5 but with get 265820543091454.... as result back.

       

       

      As Code not 100% the same but I think there is really a bug there: 

       

      BigDecimal [] objects = new BigDecimal[]{
              new BigDecimal(3.5714285714D),
              new BigDecimal(3.5714285714D),
              new BigDecimal(3.5714285714D),
              new BigDecimal(3.5714285714D)};
      Row dataRow = new GenericRow(objects);
      Row dataRow2 = new GenericRow(objects);
      StructType structType = new StructType()
              .add("id1", DataTypes.createDecimalType(38,10), true)
              .add("id2", DataTypes.createDecimalType(38,10), true)
              .add("id3", DataTypes.createDecimalType(38,10), true)
              .add("id4", DataTypes.createDecimalType(38,10), true);
      
      final Dataset<Row> dataFrame = sparkSession.createDataFrame(Arrays.asList(dataRow,dataRow2), structType);
      System.out.println(dataFrame.schema());
      dataFrame.show();
      final Dataset<Row>  df1 = dataFrame.groupBy("id1","id2")
              .agg( mean("id3").alias("projection_factor"));
      df1.show();
      final Dataset<Row> df2 = df1
              .groupBy("id1")
              .agg(max("projection_factor"));
      
      df2.show();
      

       

      The df2 should have:

      +------------+----------------------+
      | id1|max(projection_factor)|
      +------------+----------------------+
      |3.5714285714| 3.5714285714|
      +------------+----------------------+
      
      

      instead it returns: 

      +------------+----------------------+
      | id1|max(projection_factor)|
      +------------+----------------------+
      |3.5714285714| 0.00035714285714|
      +------------+----------------------+
      

       

       

      Attachments

        Activity

          People

            Unassigned Unassigned
            jomach Jorge Machado
            Votes:
            1 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: