SPARK-33534: Allow specifying a minimum number of bytes in a split of a file


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Background
      A long time ago I wrote a way to read a (usually large) gzipped file that allows better distribution of the load over an Apache Hadoop cluster: https://github.com/nielsbasjes/splittablegzip

      It seems people still need this kind of functionality, and it turns out my code works without modification in conjunction with Apache Spark.

      See, for example, the documentation nchammas contributed to my project a while ago on how to use it with Spark:
      https://github.com/nielsbasjes/splittablegzip/blob/master/README-Spark.md

      The problem
      Some people have reported getting errors from this feature of mine.

      The fact is that this functionality cannot read a split if it is too small (the number of bytes read from disk and the number of bytes coming out of the decompression are different). My code uses the io.file.buffer.size setting for this, but it also has a hard-coded lower limit on the split size of 4 KiB.

      The problem I found when looking into these reports is that Spark does not enforce a minimum number of bytes in a split.

      In fact, when I created a test file and set spark.sql.files.maxPartitionBytes to exactly 1 byte less than the size of that file, my library gave this error:
      java.lang.IllegalArgumentException: The provided InputSplit (562686;562687] is 1 bytes which is too small. (Minimum is 65536)
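
      For illustration, here is a rough sketch of how I reproduced this (the file name, the local master and the CSV read are made up for the example; the codec class name is the one I believe the splittablegzip README documents):

      import org.apache.spark.sql.SparkSession

      object TinySplitRepro {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .appName("splittablegzip-tiny-split-repro")
            .master("local[2]")
            // Register the splittable gzip codec (see README-Spark.md linked above).
            .config("spark.hadoop.io.compression.codecs",
              "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
            // One byte less than the 562687-byte test file: Spark creates a
            // trailing split of exactly 1 byte, which the codec cannot decode.
            .config("spark.sql.files.maxPartitionBytes", "562686")
            .getOrCreate()

          // Fails with:
          // java.lang.IllegalArgumentException: The provided InputSplit (562686;562687]
          // is 1 bytes which is too small. (Minimum is 65536)
          spark.read.csv("/path/to/testfile.csv.gz").count()
        }
      }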

      I found the code that does this calculation here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L74

      Proposed enhancement
      What I propose is a new setting (spark.sql.files.minPartitionBytes?) that guarantees that no split of a file is smaller than a configured number of bytes.

      I also propose setting this to something like 64 KiB by default.

      Having some constraints on the value of spark.sql.files.minPartitionBytes, possibly in relation to spark.sql.files.maxPartitionBytes, would be fine. A sketch of the idea follows below.
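
      To make the idea concrete, here is a rough sketch (plain Scala, not a patch against FilePartition.scala; the parameter names only mirror the existing and proposed settings) of how file splitting could honor such a minimum by folding a too-small trailing remainder into the previous split:

      object MinSplitSketch {
        /** Split a file of `fileLen` bytes into (offset, length) ranges of at most
         *  `maxSplitBytes`, guaranteeing that no range is smaller than `minSplitBytes`
         *  by letting the last range absorb a too-small remainder. */
        def splitFile(fileLen: Long, maxSplitBytes: Long, minSplitBytes: Long): Seq[(Long, Long)] = {
          require(maxSplitBytes >= minSplitBytes,
            "the maximum split size must be >= the minimum split size")
          val splits = Seq.newBuilder[(Long, Long)]
          var offset = 0L
          while (offset < fileLen) {
            val remaining = fileLen - offset
            // If cutting a full split would leave less than the minimum behind,
            // take the whole remainder now (so the last split may exceed the max).
            val size =
              if (remaining <= maxSplitBytes || remaining - maxSplitBytes < minSplitBytes) remaining
              else maxSplitBytes
            splits += ((offset, size))
            offset += size
          }
          splits.result()
        }

        def main(args: Array[String]): Unit = {
          // The scenario from the description: a 562687-byte file with a 562686-byte
          // maximum. Without a minimum this yields a 1-byte trailing split; with a
          // 64 KiB minimum the whole file becomes a single split.
          println(splitFile(562687L, 562686L, 64 * 1024L))
        }
      }

      This sketch lets the last split grow past the configured maximum when the alternative would be a split smaller than the minimum; Hadoop solves the same problem with a slop factor (see the note below).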

      Notes

      Hadoop already has code that does this: https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L456
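
      For reference, the relevant loop in Hadoop's FileInputFormat.getSplits works roughly like this (paraphrased from memory into Scala, not copied from the Java source): splits are only cut while the remainder exceeds the split size by a slop factor of 1.1, so when a file needs more than one split, the trailing split is never an arbitrarily small fragment.

      object HadoopSlopSketch {
        private val SPLIT_SLOP = 1.1

        def splitFile(fileLen: Long, splitSize: Long): Seq[(Long, Long)] = {
          val splits = Seq.newBuilder[(Long, Long)]
          var bytesRemaining = fileLen
          // Keep cutting full-size splits only while more than 1.1 * splitSize remains.
          while (bytesRemaining.toDouble / splitSize > SPLIT_SLOP) {
            splits += ((fileLen - bytesRemaining, splitSize))
            bytesRemaining -= splitSize
          }
          // Whatever is left (at most 1.1 * splitSize) becomes the final split.
          if (bytesRemaining != 0) {
            splits += ((fileLen - bytesRemaining, bytesRemaining))
          }
          splits.result()
        }

        def main(args: Array[String]): Unit = {
          // The 562687-byte file with a 562686-byte split size stays a single split.
          println(splitFile(562687L, 562686L))
        }
      }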

          People

            Assignee: Unassigned
            Reporter: Niels Basjes (nielsbasjes)
            Votes: 1
            Watchers: 4
