Spark / SPARK-17024

Weird behaviour of the DataFrame when a column name contains dots.


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      When a column name contains dots and one of the segments in the name is the same as another column's name, Spark treats this column as a nested structure, although the actual type of the column is String/Int/etc. Example:

            val df = sqlContext.createDataFrame(Seq(
              ("user1", "task1"),
              ("user2", "task2")
            )).toDF("user", "user.task")
      

      There are two columns, "user" and "user.task". Both of them are strings, and the schema resolution seems to be correct:

      root
       |-- user: string (nullable = true)
       |-- user.task: string (nullable = true)
      

      But when I try to query this DataFrame, e.g.:

            df.select(df("user"), df("user.task"))
      

      Spark throws an exception "Can't extract value from user#2;"
      It happens during the resolution of the LogicalPlan while processing the "user.task" column.

      Here is the full stacktrace:

      Can't extract value from user#2;
      org.apache.spark.sql.AnalysisException: Can't extract value from user#2;
      	at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
      	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276)
      	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275)
      	at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
      	at scala.collection.immutable.List.foldLeft(List.scala:84)
      	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275)
      	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191)
      	at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
      	at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708)
      	at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696)
      

      Is this actually expected behaviour?
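
      For reference, column names that contain dots can usually be referenced by escaping them with backticks, which makes the resolver treat the whole string as a single column name instead of a nested field access. A minimal sketch against the same DataFrame (not verified on 2.0.0 specifically):

            // Backticks make the analyzer treat "user.task" as one column name
            // rather than a field "task" nested inside column "user".
            df.select(df("user"), df("`user.task`")).show()

            // The same escaping works with the string-based select:
            df.select("user", "`user.task`").show()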

            People

              Assignee: Unassigned
              Reporter: Iaroslav Zeigerman (zyoma)
              Votes: 0
              Watchers: 2
