  1. Spark
  2. SPARK-15127

Column names are handled incorrectly when they originate from a single Dataframe


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.6.1, 2.0.0
    • Fix Version/s: None
    • Component/s: PySpark, Spark Core, SQL
    • Environment: Mac OS X 10.11.4 and Ubuntu Linux 16.04 LTS

    Description

      I think I found a bug in the way columns are handled in (Py)Spark.

      How to reproduce

      df = sc.parallelize([[1, 'A', 'Not B'], [1, 'Not A', 'B']]).toDF(['id', 'a', 'b'])
      
      example = sc.parallelize([[1],[2]]).toDF(['id'])
      
      df_a = df.filter('a = "A"').alias('df_a')
      df_b = df.filter('b = "B"').alias('df_b')
      
      example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
      

      Results in:

      +---+---+-----+
      | id|  a|    b|
      +---+---+-----+
      |  1|  A|Not B|
      +---+---+-----+
      

      Expected result:

      +---+---+---+
      | id|  a|  b|
      +---+---+---+
      |  1|  A|  B|
      +---+---+---+
      

      When the aliases are used in the select statement, it works properly:

      example.join(df_a, 'id').join(df_b, 'id').select('id', 'df_a.a', 'df_b.b').show()
      

      This gives the expected result:

      +---+---+---+
      | id|  a|  b|
      +---+---+---+
      |  1|  A|  B|
      +---+---+---+
      

      I'm not sure if this is how you're supposed to select columns from this kind of DataFrame, but I think the first example should have worked just as well.

      I did some other experiments with this:

      It also works when creating a new DataFrame using toDF():

      df_a = df.filter('a = "A"').alias('df_a')
      df_b = df.filter('b = "B"').alias('df_b')
      df_a = df_a.toDF(*df_a.columns)
      df_b = df_b.toDF(*df_b.columns)
      example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
      

      This gives the expected result:

      +---+---+---+
      | id|  a|  b|
      +---+---+---+
      |  1|  A|  B|
      +---+---+---+
      

      But it does not work when doing the same with a select (which, according to the docs, should return a new DataFrame):

      df_a = df.filter('a = "A"').alias('df_a')
      df_b = df.filter('b = "B"').alias('df_b')
      df_a = df_a.select(*df_a.columns)
      df_b = df_b.select(*df_b.columns)
      example.join(df_a, 'id').drop(df_a['id']).join(df_b, 'id').drop(df_b['id']).select('id', df_a['a'], df_b['b']).show()
      

      Results in:

      +---+---+-----+
      | id|  a|    b|
      +---+---+-----+
      |  1|  A|Not B|
      +---+---+-----+
      

      At the very least, the documentation is unclear here, and this may be a Column handling bug too.

            People

              Assignee: Unassigned
              Reporter: Jurriaan Pruis (jurriaanpruis)
              Votes: 2
              Watchers: 5