SPARK-20356

Spark SQL group by returns incorrect results after join + distinct transformations

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.2.0, 2.3.0
    • Component/s: SQL
    • Labels:
    • Environment:

      Linux mint 18
      Python 3.5

      Description

      I'm experiencing a bug with the head version of Spark as of 4/17/2017. After joining two DataFrames, renaming a column, and invoking distinct, the result of an aggregation is incorrect once the DataFrame has been cached. The following code snippet consistently reproduces the error.

      from pyspark.sql import SparkSession
      import pyspark.sql.functions as sf
      import pandas as pd

      spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()

      mapping_sdf = spark.createDataFrame(pd.DataFrame([
          {"ITEM": "a", "GROUP": 1},
          {"ITEM": "b", "GROUP": 1},
          {"ITEM": "c", "GROUP": 2},
      ]))

      items_sdf = spark.createDataFrame(pd.DataFrame([
          {"ITEM": "a", "ID": 1},
          {"ITEM": "b", "ID": 2},
          {"ITEM": "c", "ID": 3},
      ]))

      mapped_sdf = \
          items_sdf.join(mapping_sdf, on='ITEM').select("ID", sf.col("GROUP").alias('ITEM')).distinct()

      print(mapped_sdf.groupBy("ITEM").count().count()) # Prints 2, correct
      mapped_sdf.cache()
      print(mapped_sdf.groupBy("ITEM").count().count()) # Prints 3, incorrect

      The next code snippet is almost the same as the first, except that I don't call distinct on the DataFrame. This snippet behaves as expected:

      mapped_sdf = \
          items_sdf.join(mapping_sdf, on='ITEM').select("ID", sf.col("GROUP").alias('ITEM'))

      print(mapped_sdf.groupBy("ITEM").count().count()) # Prints 2, correct
      mapped_sdf.cache()
      print(mapped_sdf.groupBy("ITEM").count().count()) # Prints 2, correct

      I don't experience this bug with Spark 2.1, or even with earlier builds of 2.2.
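
For reference, the expected count in both snippets can be worked out by hand without Spark. The sketch below is plain Python, not PySpark: it models the join, the aliased projection, distinct, and the group-by on the same six input rows. The names `joined`, `projected`, `distinct_rows`, and `group_count` are illustrative and do not come from the report.

```python
# Plain-Python model of the reproduction above (no Spark required).
# All helper names here are illustrative, not part of any Spark API.

mapping = [
    {"ITEM": "a", "GROUP": 1},
    {"ITEM": "b", "GROUP": 1},
    {"ITEM": "c", "GROUP": 2},
]
items = [
    {"ITEM": "a", "ID": 1},
    {"ITEM": "b", "ID": 2},
    {"ITEM": "c", "ID": 3},
]

# Inner join on ITEM.
joined = [
    {**i, **m} for i in items for m in mapping if i["ITEM"] == m["ITEM"]
]

# Model of select("ID", col("GROUP").alias("ITEM")):
# the GROUP value is exposed under the name ITEM, as (ID, ITEM) tuples.
projected = [(r["ID"], r["GROUP"]) for r in joined]

# Model of distinct(): drop duplicate rows, preserving first-seen order.
distinct_rows = list(dict.fromkeys(projected))


def group_count(rows):
    """Number of groups when grouping (ID, ITEM) rows by ITEM."""
    return len({item for _id, item in rows})


with_distinct = group_count(distinct_rows)
without_distinct = group_count(projected)
print(with_distinct, without_distinct)
```

Both values come out as 2, because the aliased ITEM column only holds the values 1 and 2. This matches the uncached results reported above, so the 3 printed after `cache()` is the outlier.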

            People

            • Assignee: Liang-Chi Hsieh (viirya)
            • Reporter: Chris Kipers (ckipers)