[SPARK-10896] Parquet join issue


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.5.0, 1.5.1
    • Fix Version/s: None
    • Component/s: SQL
    • Environment: spark-1.5.0-bin-hadoop2.6.tgz with HDP 2.3

    Description

After writing the two DataFrames to Parquet and reading them back, the same inner join returns wrong results: the row count drops from 8 to 4 and the output contains spurious nulls.
      How to reproduce:

import org.apache.spark.sql._
      import org.apache.spark.sql.types._

      // Build two small DataFrames with eight matching rows each.
      val arr1 = Array[Row](Row(0, 0), Row(1, 1), Row(2, 2), Row(3, 3), Row(4, 4), Row(5, 5), Row(6, 6), Row(7, 7))
      val schema1 = StructType(
            StructField("id", IntegerType) ::
            StructField("value1", IntegerType) :: Nil)
      val df1 = sqlContext.createDataFrame(sc.parallelize(arr1), schema1)

      val arr2 = Array[Row](Row(0, 0), Row(1, 1), Row(2, 2), Row(3, 3), Row(4, 4), Row(5, 5), Row(6, 6), Row(7, 7))
      val schema2 = StructType(
            StructField("otherId", IntegerType) ::
            StructField("value2", IntegerType) :: Nil)
      val df2 = sqlContext.createDataFrame(sc.parallelize(arr2), schema2)

      // Inner join on id == otherId; all eight rows should match.
      val res = df1.join(df2, df1("id") === df2("otherId"))
      df1.take(10)
      df2.take(10)
      res.count()
      res.take(10)

      // Write both DataFrames out as Parquet ...
      df1.write.format("parquet").save("hdfs:///tmp/df1")
      df2.write.format("parquet").save("hdfs:///tmp/df2")

      // ... read them back (note the *.parquet glob), and repeat the same join.
      val df1 = sqlContext.read.parquet("hdfs:///tmp/df1/*.parquet")
      val df2 = sqlContext.read.parquet("hdfs:///tmp/df2/*.parquet")

      val res = df1.join(df2, df1("id") === df2("otherId"))
      df1.take(10)
      df2.take(10)
      res.count()
      res.take(10)
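
      A diagnostic that could be appended to the repro (a sketch, not part of the original report) is to print the schemas and the join's physical plan in both runs and compare them:

      // Diagnostic sketch (not in the original report): inspect the schemas
      // and the plan Spark picks for the join, before and after the round trip.
      df1.printSchema()
      df2.printSchema()
      res.explain(true)   // prints both the logical and physical plans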
      

Output before writing to Parquet:

      Array[org.apache.spark.sql.Row] = Array([0,0], [1,1], [2,2], [3,3], [4,4], [5,5], [6,6], [7,7]) 
      Array[org.apache.spark.sql.Row] = Array([0,0], [1,1], [2,2], [3,3], [4,4], [5,5], [6,6], [7,7]) 
      Long = 8 
      Array[org.apache.spark.sql.Row] = Array([0,0,0,0], [1,1,1,1], [2,2,2,2], [3,3,3,3], [4,4,4,4], [5,5,5,5], [6,6,6,6], [7,7,7,7]) 
      

      After reading back:

      Array[org.apache.spark.sql.Row] = Array([0,0], [1,1], [2,2], [3,3], [4,4], [5,5], [6,6], [7,7]) 
      Array[org.apache.spark.sql.Row] = Array([0,0], [1,1], [2,2], [3,3], [4,4], [5,5], [6,6], [7,7]) 
      Long = 4 
      Array[org.apache.spark.sql.Row] = Array([0,0,0,5], [2,2,2,null], [4,4,4,5], [6,6,6,null])
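
      For comparison, Parquet output directories are normally read back by pointing at the directory itself rather than at a *.parquet glob; whether that changes the result here is an untested assumption, not something the report verifies:

      // Hypothetical variant (untested assumption): read the output
      // directories directly instead of globbing the part files.
      val df1b = sqlContext.read.parquet("hdfs:///tmp/df1")
      val df2b = sqlContext.read.parquet("hdfs:///tmp/df2")
      val resb = df1b.join(df2b, df1b("id") === df2b("otherId"))
      resb.count()   // 8 would be expected if the glob were the culprit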
      


          People

            Assignee: Unassigned
            Reporter: Tamas Szuromi (tromika)
            Votes: 0
            Watchers: 3
