SPARK-21216: Streaming DataFrames fail to join with Hive tables

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.1
    • Fix Version/s: 2.3.0
    • Component/s: Structured Streaming
    • Labels: None

    Description

      The following code will throw a cryptic exception:

      // Assumes a Spark SQL test suite with Hive support, where `spark`, `sql`,
      // and `testImplicits` are provided by the suite.
      import org.apache.spark.sql.execution.streaming.MemoryStream
      import testImplicits._

      // MemoryStream needs an implicit SQLContext in scope.
      implicit val _sqlContext = spark.sqlContext

      Seq((1, "one"), (2, "two"), (4, "four")).toDF("number", "word").createOrReplaceTempView("t1")
      // Make a table and ensure it will be broadcast.
      sql("""CREATE TABLE smallTable(word string, number int)
            |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
            |STORED AS TEXTFILE
          """.stripMargin)

      sql(
        """INSERT INTO smallTable
          |SELECT word, number from t1
        """.stripMargin)

      // Join the streaming input against the static Hive table.
      val inputData = MemoryStream[Int]
      val joined = inputData.toDS().toDF()
        .join(spark.table("smallTable"), $"value" === $"number")

      val sq = joined.writeStream
        .format("memory")
        .queryName("t2")
        .start()
      try {
        inputData.addData(1, 2)

        // Planning the first batch is where the exception surfaces.
        sq.processAllAvailable()
      } finally {
        sq.stop()
      }
      

      When the session is created with Hive support, the planner in `IncrementalExecution` does not take the Hive scan strategies into account, which is why the streaming join against the Hive table fails.
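
To illustrate the kind of failure described above, here is a minimal, self-contained sketch of strategy-based planning. It is a toy model only: `PlannerSketch`, `Planner`, `Strategy`, `StreamingStrategy`, `HiveStrategy`, and the plan node types are hypothetical names invented for this example and are not Spark's classes. It assumes that a planner tries each registered strategy in turn and fails when none of them can handle a plan node.

```scala
// Toy model of strategy-based planning. All names here are hypothetical
// and only mimic the general idea; none of them are Spark classes.
object PlannerSketch {
  sealed trait LogicalPlan
  case class MemoryStreamScan(name: String) extends LogicalPlan
  case class HiveTableScan(table: String) extends LogicalPlan

  // A strategy returns a physical plan description if it handles the node.
  trait Strategy {
    def apply(plan: LogicalPlan): Option[String]
  }

  object StreamingStrategy extends Strategy {
    def apply(plan: LogicalPlan): Option[String] = plan match {
      case MemoryStreamScan(n) => Some(s"StreamScanExec($n)")
      case _                   => None
    }
  }

  object HiveStrategy extends Strategy {
    def apply(plan: LogicalPlan): Option[String] = plan match {
      case HiveTableScan(t) => Some(s"HiveTableScanExec($t)")
      case _                => None
    }
  }

  // Tries each strategy in order and fails if none produces a plan.
  class Planner(strategies: Seq[Strategy]) {
    def plan(p: LogicalPlan): String =
      strategies.iterator
        .map(strategy => strategy(p))
        .collectFirst { case Some(physical) => physical }
        .getOrElse(throw new IllegalStateException(s"No plan for $p"))
  }

  // Stand-in for a session planner that knows about Hive scans.
  val sessionPlanner = new Planner(Seq(StreamingStrategy, HiveStrategy))
  // Stand-in for a streaming planner assembled without the Hive strategy.
  val incrementalPlanner = new Planner(Seq(StreamingStrategy))

  def main(args: Array[String]): Unit = {
    println(sessionPlanner.plan(HiveTableScan("smallTable")))
    // The planner without the Hive strategy cannot plan the same node.
    println(scala.util.Try(incrementalPlanner.plan(HiveTableScan("smallTable"))))
  }
}
```

In this sketch the two planners differ only in their strategy lists, so the same Hive scan node plans successfully with one and throws with the other. The sketch says nothing about how Spark resolved the issue.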

            People

              Assignee: Burak Yavuz (brkyvz)
              Reporter: Burak Yavuz (brkyvz)