SPARK-15527: Duplicate column names with different case after join of DataFrames

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.4.1
    • Fix Version/s: 1.6.0
    • Component/s: SQL
    • Labels:
      None

      Description

      In 1.4.1, a join can produce duplicate column names when the join columns differ only in case (upper/lower/mixed). I have checked 1.6.0 and there Spark behaves as expected: join columns are matched case-sensitively. In 1.4.1, joins appear to be case-insensitive, yet the results are inconsistent.
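
      For reference, name resolution in the Spark SQL analyzer is controlled by the spark.sql.caseSensitive setting. The sketch below assumes that setting applies to this join path; the 1.x defaults differed between SQLContext and HiveContext, so verify before relying on it:

      // Sketch only: spark.sql.caseSensitive toggles case-sensitive name
      // resolution in the analyzer (1.x defaults vary by context type).
      sqlContext.setConf("spark.sql.caseSensitive", "true")  // "id" != "ID"
      sqlContext.setConf("spark.sql.caseSensitive", "false") // "id" == "ID"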

      I did not find a related ticket, so I'm opening this one even though it's technically fixed in 1.6.0, in case the fix there turns out to be coincidental.

      Here's a minimal example to check:

      // Runs in spark-shell, where sc and the toDF implicits are in scope.
      case class Test(id: Int, value: String)

      val lhs = sc.parallelize(List(Test(1, "A"), Test(2, "B"), Test(3, "C"))).toDF
      val rhs = sc.parallelize(List(Test(1, "AA"), Test(2, "BB"), Test(4, "D"))).toDF
      val rhsId = rhs.withColumnRenamed("id", "ID") // same data as rhs, upper-case key

      val full = lhs.join(rhs, "id")     // matching case: a single id column
      val fullId = lhs.join(rhsId, "id") // both id and ID in the result in 1.4.1
      val fullID = lhs.join(rhsId, "ID") // only id in the result in 1.4.1
      

      The last two joins fail to execute on 1.6.0 because "id" is not found in rhsId (the fullId case) and "ID" is not found in lhs (the fullID case). On 1.4.1 you can see the difference: the former yields a DataFrame containing both id and ID even though the rows were clearly matched on the key, while the latter yields only a single id column.
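
      On 1.4.x, a pragmatic workaround is to normalize all column names to one case before joining. A minimal sketch (the lowerCaseColumns helper is hypothetical, spark-shell assumed):

      import org.apache.spark.sql.DataFrame

      // Hypothetical helper: lower-cases every column name so a join key
      // cannot differ from its counterpart by case alone.
      def lowerCaseColumns(df: DataFrame): DataFrame =
        df.columns.foldLeft(df)((d, c) => d.withColumnRenamed(c, c.toLowerCase))

      val fullNorm = lowerCaseColumns(lhs).join(lowerCaseColumns(rhsId), "id")
      // fullNorm has a single id column regardless of the original casing.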

            People

            • Assignee: Unassigned
            • Reporter: hellstorm (Ian Hellstrom)
            • Votes: 0
            • Watchers: 0
