[SPARK-21137] Spark reads many small files slowly off local filesystem - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.1.1
Fix Version/s: 2.3.0
Component/s: Spark Core
Labels: None

Description

A very common use case in big data is to read a large number of small files. For example, the Enron email dataset has 1,227,645 small files.

Trying to read this data with Spark runs into several problems. First, even though the data is small (each file is only about 1K), a job can take a very long time. I have a simple job that has been running for 3 hours and has not yet reached the point of starting any tasks; I doubt it will ever finish.

It seems all the code in Spark that manages file listing is single-threaded and not well optimised. When I hand-crank the listing and reading myself and don't use Spark, my job runs much faster.
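For illustration only, a hand-rolled approach of the kind described above might look like the following Python sketch. It is not the reporter's code (that lives in the linked repository) and not how Spark lists files. The names `list_files`, `read_file` and `read_all`, and the default of 8 worker threads, are assumptions made for this example. It walks the directory tree once with `os.scandir` and then reads the small files from a thread pool.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List


def list_files(root: str) -> Iterator[str]:
    """Yield every regular file under root, scanning each directory once."""
    stack = [root]
    while stack:
        current = stack.pop()
        # os.scandir returns the entry type with the listing on most
        # platforms, so no extra per-file process or stat call is needed.
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path


def read_file(path: str) -> bytes:
    """Read one small file fully into memory."""
    with open(path, "rb") as handle:
        return handle.read()


def read_all(root: str, workers: int = 8) -> List[bytes]:
    """Read every file under root; results follow sorted path order.

    The worker count is a hypothetical default, not a tuned value.
    """
    paths = sorted(list_files(root))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_file, paths))
```

Calling `read_all("/path/to/maildir")` returns the file contents as a list of byte strings. For a dataset with over a million files, a real job would more likely stream or batch the contents than hold them all in one list.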

Is it possible that I'm missing some configuration option? It seems kinda surprising to me that Spark cannot read Enron data given that it's such a quintessential example.

It takes 1 hour to output the line "1,227,645 input paths to process", then another hour to output the same line again. It then outputs a comma-separated list of all the input paths, which floods the log.

It has now been stuck for 2.5 hours on the following log line:

17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]

I've provided full steps to reproduce (including code and cluster setup) at https://github.com/samthebest/scenron; scroll down to "Bug In Spark". You can clone the repository and follow the README to reproduce the problem exactly.

Issue Links

is broken by: HADOOP-14600 LocatedFileStatus constructor forces RawLocalFS to exec a process to get the permissions (Resolved)

is related to: SPARK-37530 Spark reads many paths very slow though newAPIHadoopFile (Resolved)

links to: [Github] Pull Request #18441 (srowen)

People

Assignee: Sean R. Owen
Reporter: sam
Votes: 0
Watchers: 5

Dates

Created: 19/Jun/17 10:48
Updated: 03/Dec/21 07:16
Resolved: 03/Jul/17 11:53