HIVE-7540: NotSerializableException encountered when using sortByKey transformation


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: Spark
    • Labels: None
    • Environment: Spark-1.0.1

    Description

      This exception is thrown when sortByKey is used as the shuffle transformation between MapWork and ReduceWork:

      org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable
      at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
      at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:772)
      at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:715)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:719)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:718)
      at scala.collection.immutable.List.foreach(List.scala:318)
      at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:718)
      at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:699)

      The root cause is that the RangePartitioner used by sortByKey contains rangeBounds: Array[BytesWritable], which Spark cannot serialize: org.apache.hadoop.io.BytesWritable does not implement java.io.Serializable, so the partitioner fails to ship with the shuffle tasks. The sketch below illustrates the failure.
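
      A minimal sketch reproducing the failure outside Hive (hypothetical code, not part of the patch; the explicit Ordering, the sample data, and the local master are made up for illustration):

        import org.apache.hadoop.io.BytesWritable
        import org.apache.spark.SparkContext._
        import org.apache.spark.{SparkConf, SparkContext}

        object Hive7540Repro {
          // BytesWritable brings no implicit Ordering with it, so supply one
          // backed by its own compareTo (it extends BinaryComparable).
          implicit val bwOrdering: Ordering[BytesWritable] =
            new Ordering[BytesWritable] {
              def compare(a: BytesWritable, b: BytesWritable): Int = a.compareTo(b)
            }

          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(
              new SparkConf().setAppName("HIVE-7540 repro").setMaster("local"))
            // Stand-in for MapWork output: (key, value) pairs of BytesWritable.
            val pairs = sc.parallelize(1 to 100)
              .map(i => (new BytesWritable(Array(i.toByte)), new BytesWritable(Array(i.toByte))))
            // sortByKey samples the keys to build a RangePartitioner whose
            // rangeBounds is an Array[BytesWritable]; shipping that partitioner
            // with the shuffle tasks triggers the NotSerializableException above.
            pairs.sortByKey().count()
          }
        }
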
      A workaround is to set the number of partitions to 1 when calling sortByKey, in which case rangeBounds is just an empty array and nothing non-serializable needs to be shipped (usage shown below).
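
      Usage sketch of the workaround (same hypothetical pairs RDD as above):

        // With a single partition RangePartitioner samples no bounds, so
        // rangeBounds stays an empty array and nothing non-serializable is shipped.
        pairs.sortByKey(true, 1).count()
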

      NO PRECOMMIT TESTS. This is for the Spark branch only.

      Attachments

        1. HIVE-7540.2-spark.patch
          2 kB
          Rui Li
        2. HIVE-7540.3-spark.patch
          3 kB
          Brock Noland
        3. HIVE-7540-spark.patch
          2 kB
          Rui Li


            People

              Assignee: Rui Li
              Reporter: Rui Li
              Votes: 0
              Watchers: 6
