Spark / SPARK-10896

Parquet join issue


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.5.0, 1.5.1
    • Fix Version/s: None
    • Component/s: SQL
    • Environment:

      spark-1.5.0-bin-hadoop2.6.tgz with HDP 2.3

      Description

      After writing two DataFrames to Parquet and reading them back, the same join returns wrong results.
      How to reproduce:

      import org.apache.spark.sql._
      import org.apache.spark.sql.types._

      // Two 8-row DataFrames with matching keys 0..7.
      val arr1 = Array[Row](Row(0, 0), Row(1, 1), Row(2, 2), Row(3, 3), Row(4, 4), Row(5, 5), Row(6, 6), Row(7, 7))
      val schema1 = StructType(
            StructField("id", IntegerType) ::
            StructField("value1", IntegerType) :: Nil)
      val df1 = sqlContext.createDataFrame(sc.parallelize(arr1), schema1)

      val arr2 = Array[Row](Row(0, 0), Row(1, 1), Row(2, 2), Row(3, 3), Row(4, 4), Row(5, 5), Row(6, 6), Row(7, 7))
      val schema2 = StructType(
            StructField("otherId", IntegerType) ::
            StructField("value2", IntegerType) :: Nil)
      val df2 = sqlContext.createDataFrame(sc.parallelize(arr2), schema2)

      // Join before the Parquet round trip: all 8 keys match.
      val res = df1.join(df2, df1("id") === df2("otherId"))
      df1.take(10)
      df2.take(10)
      res.count()
      res.take(10)

      // Write both DataFrames out as Parquet, then read them back (spark-shell
      // allows re-binding the vals).
      df1.write.format("parquet").save("hdfs:///tmp/df1")
      df2.write.format("parquet").save("hdfs:///tmp/df2")

      val df1 = sqlContext.read.parquet("hdfs:///tmp/df1/*.parquet")
      val df2 = sqlContext.read.parquet("hdfs:///tmp/df2/*.parquet")

      // Same join after the round trip: only 4 rows, with wrong values.
      val res = df1.join(df2, df1("id") === df2("otherId"))
      df1.take(10)
      df2.take(10)
      res.count()
      res.take(10)
      

      Output

      Array[org.apache.spark.sql.Row] = Array([0,0], [1,1], [2,2], [3,3], [4,4], [5,5], [6,6], [7,7]) 
      Array[org.apache.spark.sql.Row] = Array([0,0], [1,1], [2,2], [3,3], [4,4], [5,5], [6,6], [7,7]) 
      Long = 8 
      Array[org.apache.spark.sql.Row] = Array([0,0,0,0], [1,1,1,1], [2,2,2,2], [3,3,3,3], [4,4,4,4], [5,5,5,5], [6,6,6,6], [7,7,7,7]) 
      

      After reading back:

      Array[org.apache.spark.sql.Row] = Array([0,0], [1,1], [2,2], [3,3], [4,4], [5,5], [6,6], [7,7]) 
      Array[org.apache.spark.sql.Row] = Array([0,0], [1,1], [2,2], [3,3], [4,4], [5,5], [6,6], [7,7]) 
      Long = 4 
      Array[org.apache.spark.sql.Row] = Array([0,0,0,5], [2,2,2,null], [4,4,4,5], [6,6,6,null])
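
      The issue was closed as Not A Problem, and the recorded resolution is not shown here. As a hedged follow-up sketch only (the directory paths, the plan inspection, and the config toggle below are diagnostic assumptions, not the resolution), two standard steps for Spark 1.5.x when a Parquet round trip changes join results:

      // Hypothetical diagnostics, not part of the original report.
      // 1. Read the output directories directly instead of a "*.parquet" glob,
      //    letting Spark resolve the part files itself.
      val df1 = sqlContext.read.parquet("hdfs:///tmp/df1")
      val df2 = sqlContext.read.parquet("hdfs:///tmp/df2")

      // 2. Inspect the physical plan of the failing join; an unexpected join
      //    strategy or scan usually shows up here.
      val res = df1.join(df2, df1("id") === df2("otherId"))
      res.explain(true)

      // Toggling Parquet filter pushdown helps isolate whether the scan or
      // the join itself is at fault.
      sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
      res.count()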
      

            People

            • Assignee: Unassigned
            • Reporter: tromika (Tamas Szuromi)
            • Votes: 0
            • Watchers: 4
