Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-22335

Union for DataSet uses column order instead of types for union

    XMLWordPrintableJSON

Details

    • Documentation
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.2.0
    • 2.3.0
    • SQL
    • None

    Description

      I see union uses column order for a DF. This to me is "fine" since they aren't typed.
      However, for a dataset which is supposed to be strongly typed it is actually giving the wrong result. If you try to access the members by name, it will use the order. Heres is a reproducible case. 2.2.0

      
        case class AB(a : String, b : String)
      
        val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
        val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
        
        abDf.union(baDf).show() // as linked ticket states, its "Not a problem"
        
        val abDs = abDf.as[AB]
        val baDs = baDf.as[AB]
        
        abDs.union(baDs).show()  // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order
        
        abDs.union(baDs).map(_.a).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order
      
         abDs.union(baDs).rdd.take(2) // This also gives wrong result
      
        baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order.
        abDs.map(_.a).show() // This is correct too
      
        baDs.select("a","b").as[AB].union(abDs).show() // This is the same workaround for linked issue, slightly modified.  However this seems wrong since its supposed to be strongly typed
        
        baDs.rdd.toDF().as[AB].union(abDs).show()  // This however gives correct result, which is logically inconsistent behavior
        abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives correct result
      

      So its inconsistent and a bug IMO. And I'm not sure that the suggested work around is really fair, since I'm supposed to be getting of type `AB`. More importantly I think the issue is bigger when you consider that it happens even if you read from parquet (as you would expect). And that its inconsistent when going to/from rdd.

      I imagine its just lazily converting to typed DS instead of initially. So either that typing could be prioritized to happen before the union or unioning of DF could be done with column order taken into account. Again, this is speculation..

      Attachments

        Issue Links

          Activity

            People

              viirya L. C. Hsieh
              CBribiescas Carlos Bribiescas
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: