Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Description
In Platform.java, methods of Java Unsafe are called directly without considering endianness.
In the thread 'Tungsten in a mixed endian environment', Adam Roberts reported data corruption when "spark.sql.tungsten.enabled" is set in a mixed endian environment.
Platform.java should take endianness into account.
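As a rough illustration of what taking endianness into account can look like, the sketch below serializes a long with an explicitly chosen byte order instead of whatever the host uses natively. The class `EndianAwareLongs`, its methods, and the choice of little-endian as the shared order are assumptions made for this example; they are not taken from Platform.java or from any fix applied to Spark.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration only; not part of Spark.
final class EndianAwareLongs {
    // Assumption: every node agrees on little-endian for exchanged data.
    static final ByteOrder WIRE_ORDER = ByteOrder.LITTLE_ENDIAN;

    // Write the value in the agreed order, regardless of the host's native order.
    static byte[] encode(long value) {
        return ByteBuffer.allocate(Long.BYTES).order(WIRE_ORDER).putLong(value).array();
    }

    // Read the value back using the same agreed order.
    static long decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(WIRE_ORDER).getLong();
    }

    public static void main(String[] args) {
        byte[] wire = encode(19L);
        System.out.println("native order: " + ByteOrder.nativeOrder());
        System.out.println("round trip:   " + decode(wire));
    }
}
```

Because both sides name the byte order explicitly, the round trip gives the same value on a big endian master and on little endian workers.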
Below is a copy of Adam's report:
I've been experimenting with DataFrame operations in a mixed endian environment - a big endian master with little endian workers. With tungsten enabled I'm encountering data corruption issues.
For example, with this simple test code:
import org.apache.spark.SparkContext
import org.apache.spark._
import org.apache.spark.sql.SQLContext

object SimpleSQL {
  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      println("Not enough args, you need to specify the master url")
    }
    val masterURL = args(0)
    println("Setting up Spark context at: " + masterURL)
    val sparkConf = new SparkConf
    val sc = new SparkContext(masterURL, "Unsafe endian test", sparkConf)

    println("Performing SQL tests")
    val sqlContext = new SQLContext(sc)
    println("SQL context set up")
    val df = sqlContext.read.json("/tmp/people.json")
    df.show()

    println("Selecting everyone's age and adding one to it")
    df.select(df("name"), df("age") + 1).show()

    println("Showing all people over the age of 21")
    df.filter(df("age") > 21).show()

    println("Counting people by age")
    df.groupBy("age").count().show()
  }
}
Instead of getting
+----+-----+
| age|count|
+----+-----+
|null| 1|
| 19| 1|
| 30| 1|
+----+-----+
I get the following with my mixed endian set up:
+-------------------+-----------------+
| age| count|
+-------------------+-----------------+
| null| 1|
|1369094286720630784|72057594037927936|
| 30| 1|
+-------------------+-----------------+
and on another run:
+-------------------+-----------------+
| age| count|
+-------------------+-----------------+
| 0|72057594037927936|
| 19| 1|
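The corrupted figures in the first mixed endian table are consistent with 8-byte values being read in the opposite byte order: 1369094286720630784 is 19 with its bytes reversed, and 72057594037927936 is 1 with its bytes reversed. The short check below reproduces both numbers; the class name `ByteSwapCheck` is made up for this example.

```java
// Hypothetical check for illustration only; not part of Spark.
final class ByteSwapCheck {
    // Reinterpret a long as if its 8 bytes had been read in the opposite order.
    static long swapped(long value) {
        return Long.reverseBytes(value);
    }

    public static void main(String[] args) {
        System.out.println(swapped(19L)); // prints 1369094286720630784 (19 * 2^56)
        System.out.println(swapped(1L));  // prints 72057594037927936 (2^56)
    }
}
```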