Spark / SPARK-21177

df.saveAsTable slows down linearly with the number of appends


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      In short, the following spark-shell transcript reproduces the issue.

      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
            /_/
               
      Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
      Type in expressions to have them evaluated.
      Type :help for more information.
      
      scala> def printTimeTaken(str: String, f: () => Unit) {
           |   val start = System.nanoTime()
           |   f()
           |   val end = System.nanoTime()
           |   val timetaken = end - start
           |   import scala.concurrent.duration._
           |   println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
           | }
      printTimeTaken: (str: String, f: () => Unit)Unit
      
      scala> for (i <- 1 to 100000) { printTimeTaken("time to append to hive:", () => { Seq(1, 2).toDF().write.mode("append").saveAsTable("t1") }) }
      Time taken for time to append to hive: is 284
      
      Time taken for time to append to hive: is 211
      
      ...
      ...
      
      Time taken for time to append to hive: is 2615
      
      ...
      Time taken for time to append to hive: is 3055
      ...
      Time taken for time to append to hive: is 22425
      
      ....
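      For reuse outside the REPL, the timing helper above can be written as a small self-contained object; this is a sketch, not part of the original report, and it additionally returns the elapsed milliseconds so the measurement can be checked programmatically:

```scala
object Timing {
  // Runs f, prints the elapsed time labelled by str, and returns
  // both f's result and the elapsed time in milliseconds.
  def timeTaken[A](str: String)(f: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = f
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    println(s"Time taken for $str is $elapsedMs ms")
    (result, elapsedMs)
  }
}
```

      Passing the body as a by-name parameter (`f: => A`) instead of a `() => Unit` function makes call sites read like a block: `Timing.timeTaken("append") { df.write.mode("append").saveAsTable("t1") }`.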
      

      Why does it matter?

      In a streaming job, this makes it impractical to append to Hive using this DataFrame operation, since each successive append takes longer than the last.
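      The timings above are consistent with a per-append cost that grows with the number of appends already performed (for example, if each append re-lists every file written so far). Under that assumption, the cumulative runtime of N appends grows quadratically. A toy model illustrating this (an illustration only, not Spark code):

```scala
// Toy model: suppose append i costs i * perAppendCost time units
// (e.g. proportional to the i files already in the table). Then the
// cumulative cost of n appends is 1 + 2 + ... + n = n(n+1)/2,
// i.e. quadratic in the number of appends.
def cumulativeCost(n: Int, perAppendCost: Double = 1.0): Double =
  (1 to n).map(_ * perAppendCost).sum
```

      So even a modest linear slowdown per append makes a long-running streaming job's total write time blow up quadratically.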


          People

            Assignee: Unassigned
            Reporter: Prashant Sharma (prashant)
            Votes: 2
            Watchers: 4
