[SPARK-8437] Using directory path without wildcard for filename slow for large number of files with wholeTextFiles and binaryFiles - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.3.1, 1.4.0
Fix Version/s: 1.4.1, 1.5.0
Component/s: Input/Output
Labels:
None
Environment:

Ubuntu 15.04 + local filesystem
Amazon EMR + S3 + HDFS

Target Version/s:

1.4.2, 1.5.0

Description

When calling wholeTextFiles or binaryFiles with a directory path with 10,000s of files in it, Spark hangs for a few minutes before processing the files.

If you add a * to the end of the path, there is no delay.

This happens for me on Spark 1.3.1 and 1.4 on the local filesystem, HDFS, and on S3.

To reproduce, create a directory with 50,000 files in it, then run:

val a = sc.binaryFiles("file:/path/to/files/")
a.count()

val b = sc.binaryFiles("file:/path/to/files/*")
b.count()

and monitor the different startup times.

For example, in the spark-shell these commands are pasted in together, so the delay at f.count() is from 10:11:08 t- 10:13:29 to output "Total input paths to process : 49999", then until 10:15:42 to being processing files:

scala> val f = sc.binaryFiles("file:/home/ewan/large/")
15/06/18 10:11:07 INFO MemoryStore: ensureFreeSpace(160616) called with curMem=0, maxMem=278019440
15/06/18 10:11:07 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 156.9 KB, free 265.0 MB)
15/06/18 10:11:08 INFO MemoryStore: ensureFreeSpace(17282) called with curMem=160616, maxMem=278019440
15/06/18 10:11:08 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.9 KB, free 265.0 MB)
15/06/18 10:11:08 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:40430 (size: 16.9 KB, free: 265.1 MB)
15/06/18 10:11:08 INFO SparkContext: Created broadcast 0 from binaryFiles at <console>:21
f: org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)] = file:/home/ewan/large/ BinaryFileRDD[0] at binaryFiles at <console>:21

scala> f.count()
15/06/18 10:13:29 INFO FileInputFormat: Total input paths to process : 49999
15/06/18 10:15:42 INFO FileInputFormat: Total input paths to process : 49999
15/06/18 10:15:42 INFO CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 0
15/06/18 10:15:42 INFO SparkContext: Starting job: count at <console>:24
15/06/18 10:15:42 INFO DAGScheduler: Got job 0 (count at <console>:24) with 49999 output partitions (allowLocal=false)
15/06/18 10:15:42 INFO DAGScheduler: Final stage: ResultStage 0(count at <console>:24)
15/06/18 10:15:42 INFO DAGScheduler: Parents of final stage: List()

Adding a * to the end of the path removes the delay:

scala> val f = sc.binaryFiles("file:/home/ewan/large/*")
15/06/18 10:08:29 INFO MemoryStore: ensureFreeSpace(160616) called with curMem=0, maxMem=278019440
15/06/18 10:08:29 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 156.9 KB, free 265.0 MB)
15/06/18 10:08:29 INFO MemoryStore: ensureFreeSpace(17309) called with curMem=160616, maxMem=278019440
15/06/18 10:08:29 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.9 KB, free 265.0 MB)
15/06/18 10:08:29 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:42825 (size: 16.9 KB, free: 265.1 MB)
15/06/18 10:08:29 INFO SparkContext: Created broadcast 0 from binaryFiles at <console>:21
f: org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)] = file:/home/ewan/large/* BinaryFileRDD[0] at binaryFiles at <console>:21

scala> f.count()
15/06/18 10:08:32 INFO FileInputFormat: Total input paths to process : 49999
15/06/18 10:08:33 INFO FileInputFormat: Total input paths to process : 49999
15/06/18 10:08:35 INFO CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 0
15/06/18 10:08:35 INFO SparkContext: Starting job: count at <console>:24
15/06/18 10:08:35 INFO DAGScheduler: Got job 0 (count at <console>:24) with 49999 output partitions

Attachments

Issue Links

links to

[Github] Pull Request #7036 (srowen)

[Github] Pull Request #7126 (srowen)

Activity

People

Assignee:: Sean R. Owen

Reporter:: Ewan Leith

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 18/Jun/15 09:48

Updated:: 13/Oct/16 16:47

Resolved:: 30/Jun/15 17:07