Spark / SPARK-17024

Weird behaviour of the DataFrame when a column name contains dots.


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      When a column name contains dots and one of the segments in the name is the same as another column's name, Spark treats this column as a nested structure, although the actual type of the column is String/Int/etc. Example:

            val df = sqlContext.createDataFrame(Seq(
              ("user1", "task1"),
              ("user2", "task2")
            )).toDF("user", "user.task")
      

      There are two columns, "user" and "user.task". Both of them are strings, and the schema resolution seems to be correct:

      root
       |-- user: string (nullable = true)
       |-- user.task: string (nullable = true)
      

      But when I try to query this DataFrame, e.g.:

            df.select(df("user"), df("user.task"))
      

      Spark throws an exception "Can't extract value from user#2;"
      It happens during the resolution of the LogicalPlan while processing the "user.task" column.

      Here is the full stacktrace:

      Can't extract value from user#2;
      org.apache.spark.sql.AnalysisException: Can't extract value from user#2;
      	at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
      	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276)
      	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275)
      	at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
      	at scala.collection.immutable.List.foldLeft(List.scala:84)
      	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275)
      	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191)
      	at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
      	at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708)
      	at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696)
      

      Is this actually expected behaviour?
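
      For reference, column names that contain dots can usually be referenced by escaping them with backticks, which makes the resolver treat the whole string as a single column name instead of a nested field access. A minimal sketch against the same DataFrame (not verified on 2.0.0 specifically):

            // Backticks make the analyzer treat "user.task" as one column name
            // rather than a field "task" nested inside column "user".
            df.select(df("user"), df("`user.task`")).show()

            // The same escaping works with the string-based select:
            df.select("user", "`user.task`").show()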

            People

              Assignee: Unassigned
              Reporter: Iaroslav Zeigerman (zyoma)
              Votes: 0
              Watchers: 2
