-
Type:
Bug
-
Status: Resolved
-
Priority:
Blocker
-
Resolution: Fixed
-
Affects Version/s: 2.0.0, 2.0.1
-
Component/s: SQL
-
Labels:
import org.apache.spark.SparkConf import org.apache.spark.sql.SparkSession import org.apache.spark.sql.types.{StructField, StructType} import org.apache.spark.sql.types.DataTypes._ val sc = SparkSession.builder().config(new SparkConf().setMaster("local")).getOrCreate() val jsonRDD = sc.sparkContext.parallelize(Seq( """{"a":1,"b":1,"c":1}""", """{"a":1,"b":1,"c":2}""" )) sc.read.schema(StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", LongType) ))).json(jsonRDD).write.parquet("/tmp/test") val df = sc.read.load("/tmp/test") df.join(df, Seq("a", "b", "c"), "left_outer").show()
returns:
+---+---+---+ | a| b| c| +---+---+---+ | 1| 1| 1| | 1| 1| 1| | 1| 1| 2| | 1| 1| 2| +---+---+---+
Expected result:
+---+---+---+ | a| b| c| +---+---+---+ | 1| 1| 1| | 1| 1| 2| +---+---+---+
If I use this code without saving to parquet it works fine. If you change type of `c` column to `IntegerType` it also works fine.
- is duplicated by
-
SPARK-17962 DataFrame/Dataset join not producing correct results in Spark 2.0/Yarn
-
- Resolved
-
-
SPARK-17891 SQL-based three column join loses first column
-
- Closed
-
- links to