
[SPARK-17777] Spark Scheduler Hangs Indefinitely


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None
    • Environment: AWS EMR 4.3, can also be reproduced locally

    Description

      We've identified a problem with Spark scheduling. The issue manifests itself when an RDD calls SparkContext.parallelize within its getPartitions method. This seemingly "recursive" call causes the scheduler to hang indefinitely. We have a repro case that can easily be run.

      Please advise on what the issue might be and how we can work around it in the meantime.

      I've attached repro.scala, which can simply be pasted into spark-shell to reproduce the problem.
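
      For reference, a minimal sketch of the kind of RDD the repro exercises is below. This is not the attached repro.scala itself; the class and value names are made up, but the pattern is the one described above: launching a job from inside getPartitions.

      import org.apache.spark.{Partition, SparkContext, TaskContext}
      import org.apache.spark.rdd.RDD

      // A partition that only records its index.
      case class ReproPartition(index: Int) extends Partition

      // An RDD whose getPartitions launches a nested job via parallelize/collect.
      class ReproRDD(sc0: SparkContext) extends RDD[Int](sc0, Nil) {

        override def getPartitions: Array[Partition] = {
          // The "recursive" call: running a job while this RDD's partitions
          // are still being computed. Per the report, this hangs on 1.6.0.
          val sizes = sparkContext.parallelize(1 to 8).map(_ * 128).collect()
          sizes.indices.map(i => ReproPartition(i): Partition).toArray
        }

        override def compute(split: Partition, context: TaskContext): Iterator[Int] =
          Iterator(split.index)
      }

      // In spark-shell: new ReproRDD(sc).collect()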

      Why are we calling sc.parallelize in production within getPartitions? Well, we have an RDD that is composed of several thousand Parquet files. To compute the partitioning strategy for this RDD, we create an RDD that reads all file sizes from S3 in parallel, so that we can quickly determine the proper partitions. We do this to avoid executing it serially on the master node, which can slow execution down significantly. Pseudo-code:

      val splitInfo = sc.parallelize(filePaths).map(f => (f, s3.getObjectSummary)).collect()

      Spark itself uses similar logic for DataFrames:
      https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L902
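
      To make the pseudo-code concrete, here is a sketch of the same idea using the Hadoop FileSystem API to stat the files from executor tasks. This is an assumption for illustration: the actual code uses an S3 client's getObjectSummary, and the bucket/path names below are made up.

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.Path

      val filePaths: Seq[String] =
        Seq("s3://my-bucket/data/part-00000.parquet")  // ... several thousand paths

      // Each task stats a subset of the files, so the driver does not issue
      // thousands of size lookups serially. Assumes the S3 filesystem is
      // configured in the default Hadoop configuration (as on EMR).
      val splitInfo: Array[(String, Long)] =
        sc.parallelize(filePaths, 64).map { p =>
          val path = new Path(p)
          val fs = path.getFileSystem(new Configuration())
          (p, fs.getFileStatus(path).getLen)  // file size in bytes
        }.collect()

      Note that in this sketch splitInfo is computed eagerly on the driver. One likely workaround, consistent with the "Not A Problem" resolution, would be to compute the split info before constructing the custom RDD and pass it into the constructor, rather than calling sc.parallelize from inside getPartitions.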

      Thanks,
      -Ameen

      Attachments

        1. jstack-dump.txt
          78 kB
          Ameen Tayyebi
        2. repro.scala
          0.7 kB
          Ameen Tayyebi


          People

            Assignee: Unassigned
            Reporter: Ameen Tayyebi (ameen.tayyebi@gmail.com)
            Votes: 0
            Watchers: 2
