[SPARK-24610] wholeTextFiles broken for small files - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.2.1, 2.3.1
Fix Version/s: 2.4.0
Component/s: Input/Output
Labels:
None

Description

Spark is unable to read small files using the wholeTextFiles method when split size related configs are specified - either explicitly or if they are contained in other config files like hive-site.xml.

For small sized files, the computed maxSplitSize by `WholeTextFileInputFormat` is way smaller than the default or commonly used split size of 64/128M and spark throws an exception while trying to read them.

To reproduce the issue:

$SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client --conf "spark.hadoop.mapreduce.input.fileinputformat.split.minsize.per.node=123456"

scala> sc.wholeTextFiles("file:///etc/passwd").count
java.io.IOException: Minimum split size pernode 123456 cannot be larger than maximum split size 9962
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:200)
at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:50)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096)
at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
... 48 elided


// For hdfs
sc.wholeTextFiles("smallFile").count
java.io.IOException: Minimum split size pernode 123456 cannot be larger than maximum split size 15
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:200)
at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:50)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096)
at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
... 48 elided

Attachments

Issue Links

is related to

SPARK-25753 binaryFiles broken for small files

Resolved

links to

[Github] Pull Request #21601 (dhruve)

Activity

People

Assignee:: Dhruve Ashar

Reporter:: Dhruve Ashar

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 20/Jun/18 16:22

Updated:: 19/Oct/18 21:46

Resolved:: 12/Jul/18 20:37