Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.0.1
Fix Version/s: None
Component/s: None
Description
Background
A long time ago I wrote a way of reading a (usually large) gzipped file that allows better distribution of the load over an Apache Hadoop cluster: https://github.com/nielsbasjes/splittablegzip
It seems people still need this kind of functionality, and it turns out my code works without modification in conjunction with Apache Spark.
For example, nchammas contributed documentation to my project a while ago on how to use it with Spark:
https://github.com/nielsbasjes/splittablegzip/blob/master/README-Spark.md
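For reference, a minimal configuration sketch of that setup (the codec class name comes from the splittablegzip project; the linked README is the authoritative source, and the file path here is made up):

{code:scala}
import org.apache.spark.sql.SparkSession

// Register the splittable gzip codec so the Hadoop input layer used by Spark
// picks it up for .gz files.
val spark = SparkSession.builder()
  .appName("SplittableGzipExample")
  .config("spark.hadoop.io.compression.codecs",
    "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
  .getOrCreate()

// With the codec registered, a single large .gz file can be read as
// multiple splits instead of one.
val df = spark.read.csv("/data/large-file.csv.gz")
{code}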
The problem
Now some people have reported getting errors from this feature of mine.
The fact is that this functionality cannot read a split that is too small (the number of bytes read from disk and the number of bytes coming out of the decompression differ). So my code uses the io.file.buffer.size setting, but it also has a hard-coded lower limit on the split size of 4 KiB.
The problem I found when looking into these reports is that Spark does not enforce a minimum number of bytes in a split.
In fact, when I created a test file and then set spark.sql.files.maxPartitionBytes to exactly 1 byte less than the size of my test file, my library gave this error:
java.lang.IllegalArgumentException: The provided InputSplit (562686;562687] is 1 bytes which is too small. (Minimum is 65536)
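For illustration, a hypothetical reproduction (the file path is made up; the byte counts match the error above):

{code:scala}
// The test file is 562687 bytes long; capping partitions at 562686 bytes
// leaves a trailing split of exactly 1 byte, which the codec rejects.
spark.conf.set("spark.sql.files.maxPartitionBytes", "562686")
spark.read.csv("/tmp/test.csv.gz").count()  // throws IllegalArgumentException
{code}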
I found the code that does this calculation here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L74
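Paraphrased, that logic cuts each file into chunks of at most maxSplitBytes, and whatever is left over becomes the final (possibly tiny) split. A sketch of the effect (not the verbatim Spark code):

{code:scala}
// Chunking that produces the 1-byte split from the error above.
val fileLen = 562687L
val maxSplitBytes = 562686L
val splits = (0L until fileLen by maxSplitBytes).map { offset =>
  (offset, math.min(offset + maxSplitBytes, fileLen))
}
// splits == Vector((0,562686), (562686,562687)) -> a 1-byte trailing split
{code}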
Proposed enhancement
What I propose is a new setting (spark.sql.files.minPartitionBytes?) that guarantees that no split of a file is smaller than a configured number of bytes.
I also propose a default of something like 64 KiB.
Having some constraints on the values of spark.sql.files.minPartitionBytes, possibly in relation to spark.sql.files.maxPartitionBytes, would be fine.
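To illustrate the intent (spark.sql.files.minPartitionBytes is the proposed setting and does not exist yet), one possible way to honour such a minimum is to fold a too-small trailing split into its predecessor:

{code:scala}
// Hypothetical sketch of the proposed behaviour: never emit a split smaller
// than minSplitBytes; instead, extend the last split to the end of the file.
def splitFile(fileLen: Long, maxSplitBytes: Long, minSplitBytes: Long): Seq[(Long, Long)] = {
  val splits = scala.collection.mutable.ArrayBuffer.empty[(Long, Long)]
  var offset = 0L
  while (offset < fileLen) {
    val remaining = fileLen - offset
    // If the leftover after a full-size split would be below the minimum,
    // take everything that remains into this split instead.
    val size = if (remaining - maxSplitBytes < minSplitBytes) remaining else maxSplitBytes
    splits += ((offset, offset + size))
    offset += size
  }
  splits.toSeq
}

// splitFile(562687L, 562686L, 65536L) == Vector((0,562687)) -> no 1-byte split
{code}

Note that this makes the final split at most maxSplitBytes + minSplitBytes - 1 bytes long, which is the kind of trade-off the constraints mentioned above would have to account for.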
Notes
Hadoop already has code that does this: https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L456
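That FileInputFormat logic uses a SPLIT_SLOP factor of 1.1: it only starts a new full-size split while more than 1.1 times the split size remains, so the final split is never smaller than roughly 10% of the split size. A Scala paraphrase of that Java code:

{code:scala}
// Paraphrase of Hadoop's FileInputFormat splitting loop (SPLIT_SLOP = 1.1).
val SPLIT_SLOP = 1.1

def hadoopStyleSplits(fileLen: Long, splitSize: Long): Seq[(Long, Long)] = {
  val splits = scala.collection.mutable.ArrayBuffer.empty[(Long, Long)]
  var bytesRemaining = fileLen
  // Only cut a full split while more than 1.1 * splitSize is left...
  while (bytesRemaining.toDouble / splitSize > SPLIT_SLOP) {
    splits += ((fileLen - bytesRemaining, fileLen - bytesRemaining + splitSize))
    bytesRemaining -= splitSize
  }
  // ...so the final split ends up between 0.1 and 1.1 times splitSize.
  if (bytesRemaining != 0) splits += ((fileLen - bytesRemaining, fileLen))
  splits.toSeq
}
{code}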