[SPARK-21216] Streaming DataFrames fail to join with Hive tables - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.1.1
Fix Version/s: 2.3.0
Component/s: Structured Streaming
Labels: None
Target Version/s: 2.3.0

Description

The following code will throw a cryptic exception:

// Runs inside a Spark SQL test suite with Hive support, where `spark`, `sql`
// and `testImplicits` are provided by the suite.
import org.apache.spark.sql.execution.streaming.MemoryStream
import testImplicits._

// MemoryStream needs an implicit SQLContext in scope.
implicit val _sqlContext = spark.sqlContext

Seq((1, "one"), (2, "two"), (4, "four")).toDF("number", "word").createOrReplaceTempView("t1")
// Make a table and ensure it will be broadcast.
sql("""CREATE TABLE smallTable(word string, number int)
      |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
      |STORED AS TEXTFILE
    """.stripMargin)

sql(
  """INSERT INTO smallTable
    |SELECT word, number FROM t1
  """.stripMargin)

// Join the streaming input against the static Hive table.
val inputData = MemoryStream[Int]
val joined = inputData.toDS().toDF()
  .join(spark.table("smallTable"), $"value" === $"number")

val sq = joined.writeStream
  .format("memory")
  .queryName("t2")
  .start()
try {
  inputData.addData(1, 2)

  // The exception surfaces here, when the first batch is planned and run.
  sq.processAllAvailable()
} finally {
  sq.stop()
}

When the session is a HiveSession, the planner in `IncrementalExecution` does not take the Hive scan strategies into account, so the scan of the Hive table in the join above cannot be planned.
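The cause can be illustrated with a toy model. The sketch below does not use Spark's classes and is not the patch from the linked pull request; `Relation`, `Strategy`, `Planner`, and every other name in it are hypothetical stand-ins. It only shows why a planner built from a fixed strategy list cannot plan a Hive table scan, and how reusing the session planner's strategies would avoid that.

```scala
// Toy model only: these types are hypothetical stand-ins, not Spark classes.
sealed trait Relation
case object StreamSource extends Relation
case object HiveTable extends Relation

// A strategy either produces a physical operator name or declines.
trait Strategy {
  def plan(relation: Relation): Option[String]
}

object BasicScans extends Strategy {
  def plan(relation: Relation): Option[String] = relation match {
    case StreamSource => Some("StreamScan")
    case _            => None
  }
}

object HiveScans extends Strategy {
  def plan(relation: Relation): Option[String] = relation match {
    case HiveTable => Some("HiveTableScan")
    case _         => None
  }
}

// Tries each strategy in order and fails if none of them applies.
class Planner(val strategies: Seq[Strategy]) {
  def plan(relation: Relation): String =
    strategies.flatMap(s => s.plan(relation).toList).headOption
      .getOrElse(sys.error(s"No plan for $relation"))
}

object PlannerSketch {
  // Planner of a Hive-enabled session: it knows how to scan Hive tables.
  val sessionPlanner = new Planner(Seq(HiveScans, BasicScans))

  // Failure mode: the streaming planner uses a fixed list and ignores the session.
  val isolatedStreamingPlanner = new Planner(Seq(BasicScans))

  // Possible remedy (an assumption, not the actual fix): reuse the session's strategies.
  val sessionAwareStreamingPlanner = new Planner(sessionPlanner.strategies)

  def main(args: Array[String]): Unit = {
    println(sessionAwareStreamingPlanner.plan(HiveTable))
    println(scala.util.Try(isolatedStreamingPlanner.plan(HiveTable)).isFailure)
  }
}
```

In this model the isolated planner fails on `HiveTable` because no strategy in its list matches, which mirrors the situation the description attributes to `IncrementalExecution`.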

Issue Links

Blocked: SPARK-21279 stream join hive text batch not supported (Closed)

links to: [Github] Pull Request #18426 (brkyvz)

People

Assignee: Burak Yavuz

Reporter: Burak Yavuz

Votes: 0

Watchers: 3

Dates

Created: 26/Jun/17 18:57

Updated: 27/Jul/18 17:22

Resolved: 28/Jun/17 17:46