Spark / SPARK-17908

Column names corrupted in PySpark DataFrame groupBy


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      I have a DataFrame, say df:

      df1 = df.groupBy('key1', 'key2', 'key3').agg(func.count(func.col('val')).alias('total'))

      df3 = df.join(df1, ['key1', 'key2', 'key3'])\
          .withColumn('newcol', func.col('val') / func.col('total'))

      I get an error saying key2 is not present in df1, which is not true, because df1.show() displays the data including key2.
      Then I added this line before the join -- df1 = df1.withColumnRenamed('key2', 'key2') -- which renames the column to the same name. After that it works.

      The stack trace says a column is missing, but it is not.
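
      For context, a minimal self-contained sketch of the reported sequence, assuming Spark 2.0+ (SparkSession) and hypothetical sample data, since the report does not attach a dataset; the application name is arbitrary, and the commented-out withColumnRenamed line is the workaround described above.

      from pyspark.sql import SparkSession
      import pyspark.sql.functions as func

      spark = SparkSession.builder.appName("SPARK-17908-repro").getOrCreate()

      # Hypothetical sample data; the original report does not include a dataset.
      df = spark.createDataFrame(
          [("a", "b", "c", 1), ("a", "b", "c", 2), ("a", "x", "c", 3)],
          ["key1", "key2", "key3", "val"],
      )

      # Aggregate a count per key combination.
      df1 = df.groupBy("key1", "key2", "key3").agg(
          func.count(func.col("val")).alias("total"))

      # Reported workaround: rename key2 to the same name before the join.
      # df1 = df1.withColumnRenamed("key2", "key2")

      # Join back on the grouping keys and derive the ratio column.
      df3 = (df.join(df1, ["key1", "key2", "key3"])
             .withColumn("newcol", func.col("val") / func.col("total")))

      df3.show()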

    Attachments

    Activity

    People

        Assignee: Unassigned
        Reporter: Harish (harishk15)
        Votes: 0
        Watchers: 1

    Dates

        Created:
        Updated:
        Resolved: