[SPARK-14948] Exception when joining DataFrames derived from the same DataFrame - ASF JIRA

Details

Type: Bug
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.6.0
Fix Version/s: None
Component/s: SQL
Labels: None

Description

The Spark analyzer throws the following exception in a specific scenario:

Exception:

org.apache.spark.sql.AnalysisException: resolved attribute(s) F1#3 missing from asd#5,F2#4,F1#6,F2#7 in operator !Project asd#5,F1#3;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)

Code:

SparkClient.java

    // Build a two-column schema; both fields are nullable strings.
    StructField[] fields = new StructField[2];
    fields[0] = new StructField("F1", DataTypes.StringType, true, Metadata.empty());
    fields[1] = new StructField("F2", DataTypes.StringType, true, Metadata.empty());

    // Create a single-row DataFrame and register it as t1.
    JavaRDD<Row> rdd =
        sparkClient.getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("a", "b")));
    DataFrame df = sparkClient.getSparkHiveContext().createDataFrame(rdd, new StructType(fields));
    sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t1");

    // Derive a second DataFrame from the same source, renaming F1 to asd.
    DataFrame aliasedDf = sparkClient.getSparkHiveContext().sql("select F1 as asd, F2 from t1");

    // Register both DataFrames so the equivalent SQL join can be run against t2 and t3.
    sparkClient.getSparkHiveContext().registerDataFrameAsTable(aliasedDf, "t2");
    sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t3");

    // Join the derived DataFrame back to its source on F2, then select one column from each side.
    DataFrame join = aliasedDf.join(df, aliasedDf.col("F2").equalTo(df.col("F2")), "inner");
    DataFrame select = join.select(aliasedDf.col("asd"), df.col("F1"));
    select.collect();

Observations:

The issue appears to depend on the data type of the fields in the initial DataFrame: if the fields are not of string type, the join works.
The join also works if both DataFrames are registered as temporary tables and the query is written in SQL (select a.asd, b.F1 from t2 a inner join t3 b on a.F2 = b.F2).
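
One possible reading of the exception, offered as an assumption rather than a confirmed diagnosis: the IDs in the message suggest that df produces F1#3 and F2#4, that aliasedDf produces asd#5 and F2#4, and that the right side of the join was given fresh IDs (F1#6, F2#7) because both sides share F2#4. A column reference captured before the join, such as df.col("F1"), would then still point at F1#3, while a name written in SQL is looked up after the join. The sketch below is a toy model of that idea in plain Java. SelfJoinModel, Attr, joinOutput, resolveById and resolveByName are hypothetical names made up for illustration and are not part of Spark.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of attribute IDs in a self-join. None of these types are Spark classes. */
final class SelfJoinModel {

    /** A column with a qualifier (table alias), a name and an expression ID, printed as name#id. */
    static final class Attr {
        final String qualifier;
        final String name;
        final int id;

        Attr(String qualifier, String name, int id) {
            this.qualifier = qualifier;
            this.name = name;
            this.id = id;
        }

        @Override
        public String toString() {
            return name + "#" + id;
        }
    }

    /**
     * Assumed behaviour: if the two sides share any expression ID, every
     * attribute on the right side receives a fresh ID before the join.
     */
    static List<Attr> joinOutput(List<Attr> left, List<Attr> right, int nextId) {
        Set<Integer> leftIds = new HashSet<>();
        for (Attr a : left) {
            leftIds.add(a.id);
        }
        boolean conflict = false;
        for (Attr a : right) {
            conflict |= leftIds.contains(a.id);
        }
        List<Attr> out = new ArrayList<>(left);
        for (Attr a : right) {
            out.add(conflict ? new Attr(a.qualifier, a.name, nextId++) : a);
        }
        return out;
    }

    /** Lookup by expression ID, the way a column reference captured before the join would behave. */
    static Attr resolveById(List<Attr> output, int id) {
        for (Attr a : output) {
            if (a.id == id) {
                return a;
            }
        }
        return null; // the ID is not part of the join output
    }

    /** Lookup by qualifier and name, the way "b.F1" in the SQL query would behave. */
    static Attr resolveByName(List<Attr> output, String qualifier, String name) {
        for (Attr a : output) {
            if (a.qualifier.equals(qualifier) && a.name.equals(name)) {
                return a;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // IDs inferred from the exception message; qualifiers follow the SQL aliases a and b.
        List<Attr> aliasedDf = Arrays.asList(new Attr("a", "asd", 5), new Attr("a", "F2", 4));
        List<Attr> df = Arrays.asList(new Attr("b", "F1", 3), new Attr("b", "F2", 4));

        List<Attr> joined = joinOutput(aliasedDf, df, 6);
        System.out.println(joined);                           // [asd#5, F2#4, F1#6, F2#7]
        System.out.println(resolveById(joined, 3));           // null
        System.out.println(resolveByName(joined, "b", "F1")); // F1#6
    }
}
```

In this model the join output matches the attribute list in the exception message, the lookup for F1#3 fails, and the lookup for b.F1 succeeds, which is consistent with the SQL form of the query working. The model does not account for the observation that non-string fields avoid the error.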

Issue Links

duplicates

SPARK-10925 Exception when joining DataFrames (Resolved)

is duplicated by

SPARK-10925 Exception when joining DataFrames (Resolved)

SPARK-23677 Selecting columns from joined DataFrames with the same origin yields wrong results (Resolved)

links to

[Github] Pull Request #20276 (cloud-fan)

People

Assignee: Unassigned

Reporter: Saurabh Santhosh

Shepherd: Michael Armbrust

Votes: 12

Watchers: 28

Dates

Created: 27/Apr/16 06:43

Updated: 15/Jun/20 21:25