Spark / SPARK-17706

DataFrame losing string data in yarn mode


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.5.0
    • Fix Version/s: None
    • Component/s: Spark Core, SQL, YARN
    • Labels: None
    • Environment: RedHat 6.6, CDH 5.5.2

    Description

      For some reason, when I add a new column, append a string to existing data/columns, or create a new DataFrame from code, string data is misinterpreted: show() doesn't display it properly, and operations such as withColumn, where, and when don't work either.

      Here is example code:

      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.{Row, SQLContext}
      import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

      object MissingValue {
        // Render each UTF-8 byte of the string as uppercase hex, e.g. "ABC" -> "41-42-43"
        def hex(str: String): String = str
          .getBytes("UTF-8")
          .map(b => Integer.toHexString(b & 0xFF).toUpperCase)
          .mkString("-")

        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setAppName("MissingValue")
          val sc = new SparkContext(conf)
          sc.setLogLevel("WARN")
          val sqlContext = new SQLContext(sc)
          import sqlContext.implicits._

          // Build a DataFrame with one Integer column and one String column
          val list = List((101, "ABC"), (102, "BCD"), (103, "CDE"))
          val rdd = sc.parallelize(list).map(f => Row(f._1, f._2))
          val schema = StructType(
            StructField("COL1", IntegerType, true)
            :: StructField("COL2", StringType, true)
            :: Nil
          )
          val df = sqlContext.createDataFrame(rdd, schema)
          df.show()

          // Read the first string value back and dump its bytes
          val str = df.first().getString(1)
          println(s"${str} == ${hex(str)}")

          sc.stop()
        }
      }
      

      When I run it in local mode, everything works as expected:

          +----+----+
          |COL1|COL2|
          +----+----+
          | 101| ABC|
          | 102| BCD|
          | 103| CDE|
          +----+----+
          
          ABC == 41-42-43
      
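      For context, the hex helper above just dumps the string's UTF-8 bytes; a standalone check (plain Scala, no Spark needed) confirms the expected local-mode value:

      ```scala
      // Same helper as in the repro, runnable without any Spark dependency
      object HexCheck {
        def hex(str: String): String = str
          .getBytes("UTF-8")
          .map(b => Integer.toHexString(b & 0xFF).toUpperCase)
          .mkString("-")

        def main(args: Array[String]): Unit =
          println(hex("ABC"))  // prints 41-42-43 ('A' = 0x41, 'B' = 0x42, 'C' = 0x43)
      }
      ```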

      But if I run the same code in yarn-client mode it produces:

          +----+----+
          |COL1|COL2|
          +----+----+
          | 101| ^E^@^@|
          | 102| ^E^@^@|
          | 103| ^E^@^@|
          +----+----+
      
          ^E^@^@ == 5-0-0
      

      This problem exists only for string values; the first column (Integer) is fine.

      Also, if I go through the underlying RDD of the DataFrame, everything is fine, i.e. df.rdd.take(1).apply(0).getString(1) returns the correct string.

      I'm using Spark 1.5.0 from CDH 5.5.2.

      It seems that this happens when the gap between driver memory and executor memory is too large (--driver-memory xxG --executor-memory yyG): when I decrease the executor memory or increase the driver memory, the problem disappears.
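      For reference, a submission along these lines triggers the issue; the memory values and jar name here are illustrative, not the exact ones I used:

      ```shell
      # Reproduces in yarn-client mode when driver memory is much larger
      # than executor memory (values below are illustrative)
      spark-submit \
        --master yarn-client \
        --class MissingValue \
        --driver-memory 8g \
        --executor-memory 1g \
        missing-value.jar
      ```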

            People

              Assignee: Unassigned
              Reporter: Andrey Dmitriev (biosetup)