Spark / SPARK-4019

Shuffling with more than 2000 reducers may drop all data when partitions are mostly empty or cause deserialization errors if at least one partition is empty


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.2.0
    • Component/s: Spark Core
    • Labels: None

Description

      sc.makeRDD(0 until 10, 1000).repartition(2001).collect()
      

      returns `Array()`.

      Spark 1.1.0 doesn't have this issue. Tried with both the HASH and SORT shuffle managers.
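
      The empty result is what you'd expect if the map output sizes are being averaged away. A minimal sketch of the arithmetic, assuming the over-2000-partition map output status keeps only a single average block size (hypothetical names, not Spark's actual MapStatus internals):

      // 10 records spread over 2001 shuffle blocks, almost all of them empty.
      val numBlocks  = 2001
      val blockSizes = Array.fill[Long](numBlocks)(0L)
      (0 until 10).foreach(i => blockSizes(i) = 40L)  // ~40 bytes per record

      // Averaging over ALL blocks (integer division) rounds down to zero:
      val avgSize = blockSizes.sum / numBlocks        // 400 / 2001 == 0

      // A reducer that treats an advertised size of 0 as "nothing to fetch"
      // skips every block, so collect() returns Array().
      val fetched = blockSizes.indices.count(_ => avgSize > 0)  // == 0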

      This problem can also manifest itself as Snappy deserialization errors if the average map output status size is non-zero but there is at least one empty partition, e.g.

      sc.makeRDD(0 until 100000, 1000).repartition(2001).collect()
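
      Under the same simplified model (hypothetical names again), with 100000 records the average is non-zero and the failure flips: an empty partition is advertised as the average size rather than 0, the reducer fetches it anyway, and decompressing the resulting zero-byte stream is what surfaces as the Snappy deserialization error:

      // Enough data that the average no longer rounds to zero:
      val totalBytes = 400000L            // rough serialized size of 100000 ints
      val avgSize    = totalBytes / 2001  // == 199, non-zero

      // An empty block is now advertised as 199 bytes instead of 0, so the
      // reducer fetches it, receives zero bytes, and the Snappy codec fails
      // while decompressing an empty stream.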


People

    Assignee: Josh Rosen (joshrosen)
    Reporter: Xiangrui Meng (mengxr)
    Votes: 0
    Watchers: 5
