
SPARK-25332: Sort merge join is selected instead of broadcast hash join after restarting spark-shell/JDBCServer for Hive-provider tables


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      spark.sql("create table x1(name string,age int) stored as parquet")
      spark.sql("insert into x1 select 'a',29")
      spark.sql("create table x2(name string,age int) stored as parquet")
      spark.sql("insert into x2 select 'a',29")
      scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain

      == Physical Plan ==
      *(2) BroadcastHashJoin [name#101], [name#103], Inner, BuildRight
      :- *(2) Project [name#101, age#102]
      :  +- *(2) Filter isnotnull(name#101)
      :     +- *(2) FileScan parquet default.x1[name#101,age#102] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1], PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:string,age:int>
      +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
         +- *(1) Project [name#103, age#104]
            +- *(1) Filter isnotnull(name#103)
               +- *(1) FileScan parquet default.x2[name#103,age#104] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2], PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:string,age:int>
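      In this first session the planner evidently still has a size estimate for the freshly written tables that falls below spark.sql.autoBroadcastJoinThreshold (10 MB by default), which makes the broadcast hash join eligible. A quick way to inspect both values from the same spark-shell (a sketch; the printed number is the stock default, not taken from this report):

      scala> spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
      res1: String = 10485760

      scala> // Size estimate the planner uses when deciding whether a join side is broadcastable
      scala> spark.sql("select * from x2").queryExecution.optimizedPlan.stats.sizeInBytes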

       

       

      Now restart spark-shell (or restart the JDBC server, or run the job again through spark-submit) and execute the same select query:

       

      scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
      == Physical Plan ==
      *(5) SortMergeJoin [name#43], [name#45], Inner
      :- *(2) Sort [name#43 ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(name#43, 200)
      :     +- *(1) Project [name#43, age#44]
      :        +- *(1) Filter isnotnull(name#43)
      :           +- *(1) FileScan parquet default.x1[name#43,age#44] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1], PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:string,age:int>
      +- *(4) Sort [name#45 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(name#45, 200)
            +- *(3) Project [name#45, age#46]
               +- *(3) Filter isnotnull(name#45)
                  +- *(3) FileScan parquet default.x2[name#45,age#46] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2], PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:string,age:int>
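      After the restart the Hive-provider tables apparently no longer expose a usable size estimate, so nothing falls under the broadcast threshold and the planner settles on a sort merge join. A possible workaround (a sketch, not part of the original report) is to compute table statistics explicitly, or to let Spark estimate sizes from the file system:

      scala> spark.sql("ANALYZE TABLE x1 COMPUTE STATISTICS")
      scala> spark.sql("ANALYZE TABLE x2 COMPUTE STATISTICS")
      scala> // Alternatively, fall back to file-system size estimates when catalog stats are missing
      scala> spark.conf.set("spark.sql.statistics.fallBackToHdfs", "true")
      scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain

      With either of these in place the planner again sees sizes below the threshold, and the plan should switch back to BroadcastHashJoin.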

       

       

      scala> spark.sql("desc formatted x1").show(200,false)

      col_name                      data_type  comment
      ----------------------------  ---------  -------
      name                          string     null
      age                           int        null

      # Detailed Table Information
      Database                      default
      Table                         x1
      Owner                         Administrator
      Created Time                  Sun Aug 19 12:36:58 IST 2018
      Last Access                   Thu Jan 01 05:30:00 IST 1970
      Created By                    Spark 2.3.0
      Type                          MANAGED
      Provider                      hive
      Table Properties              [transient_lastDdlTime=1534662418]
      Location                      file:/D:/spark_release/spark/bin/spark-warehouse/x1
      Serde Library                 org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
      InputFormat                   org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
      OutputFormat                  org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
      Storage Properties            [serialization.format=1]
      Partition Provider            Catalog

       

      With a datasource table the broadcast hash join is still selected after a restart, i.e. it works fine when the table is created with USING parquet instead of STORED AS parquet.
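      For comparison, the datasource-table variant would look like this (a sketch based on the sentence above; the y1/y2 table names are illustrative):

      spark.sql("create table y1(name string,age int) using parquet")
      spark.sql("insert into y1 select 'a',29")
      spark.sql("create table y2(name string,age int) using parquet")
      spark.sql("insert into y2 select 'a',29")
      scala> spark.sql("select * from y1 t1 ,y2 t2 where t1.name=t2.name").explain

      For such tables desc formatted reports Provider: parquet rather than hive, Spark maintains the size information itself, and the broadcast hash join survives a restart.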


          People

            Unassigned Unassigned
            Bjangir Babulal
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue
