SPARK-33534: Allow specifying a minimum number of bytes in a split of a file


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Background
      A long time ago I wrote a way to read a (usually large) gzipped file that allows better distribution of the load over an Apache Hadoop cluster: https://github.com/nielsbasjes/splittablegzip

      It seems people still need this kind of functionality, and it turns out my code works without modification in conjunction with Apache Spark.

      See, for example, the documentation nchammas contributed to my project a while ago on how to use it with Spark:
      https://github.com/nielsbasjes/splittablegzip/blob/master/README-Spark.md

      The problem
      Some people have reported getting errors from this feature of mine.

      The fact is that this functionality cannot read a split if it is too small (the number of bytes read from disk and the number of bytes coming out of the decompression are different). My code uses the io.file.buffer.size setting for this, but it also has a hard-coded lower limit on the split size of 4 KiB.

      The problem I found when looking into these reports is that Spark does not enforce a minimum number of bytes in a split.

      In fact, when I created a test file and set spark.sql.files.maxPartitionBytes to exactly 1 byte less than the size of that file, my library gave this error:
      java.lang.IllegalArgumentException: The provided InputSplit (562686;562687] is 1 bytes which is too small. (Minimum is 65536)
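
      For illustration, here is a rough sketch of how I reproduced this (the file name, the local master and the CSV read are made up for the example; the codec class name is the one I believe the splittablegzip README documents):

      import org.apache.spark.sql.SparkSession

      object TinySplitRepro {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .appName("splittablegzip-tiny-split-repro")
            .master("local[2]")
            // Register the splittable gzip codec (see README-Spark.md linked above).
            .config("spark.hadoop.io.compression.codecs",
              "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
            // One byte less than the 562687-byte test file: Spark creates a
            // trailing split of exactly 1 byte, which the codec cannot decode.
            .config("spark.sql.files.maxPartitionBytes", "562686")
            .getOrCreate()

          // Fails with:
          // java.lang.IllegalArgumentException: The provided InputSplit (562686;562687]
          // is 1 bytes which is too small. (Minimum is 65536)
          spark.read.csv("/path/to/testfile.csv.gz").count()
        }
      }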

      I found the code that does this calculation here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L74

      Proposed enhancement
      What I propose is a new setting (spark.sql.files.minPartitionBytes?) that guarantees that no split of a file is smaller than a configured number of bytes.

      I also propose setting this to something like 64 KiB by default.

      Having some constraints on the value of spark.sql.files.minPartitionBytes, possibly in relation to spark.sql.files.maxPartitionBytes, would be fine. A sketch of the idea follows below.
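
      To make the idea concrete, here is a rough sketch (plain Scala, not a patch against FilePartition.scala; the parameter names only mirror the existing and proposed settings) of how file splitting could honor such a minimum by folding a too-small trailing remainder into the previous split:

      object MinSplitSketch {
        /** Split a file of `fileLen` bytes into (offset, length) ranges of at most
         *  `maxSplitBytes`, guaranteeing that no range is smaller than `minSplitBytes`
         *  by letting the last range absorb a too-small remainder. */
        def splitFile(fileLen: Long, maxSplitBytes: Long, minSplitBytes: Long): Seq[(Long, Long)] = {
          require(maxSplitBytes >= minSplitBytes,
            "the maximum split size must be >= the minimum split size")
          val splits = Seq.newBuilder[(Long, Long)]
          var offset = 0L
          while (offset < fileLen) {
            val remaining = fileLen - offset
            // If cutting a full split would leave less than the minimum behind,
            // take the whole remainder now (so the last split may exceed the max).
            val size =
              if (remaining <= maxSplitBytes || remaining - maxSplitBytes < minSplitBytes) remaining
              else maxSplitBytes
            splits += ((offset, size))
            offset += size
          }
          splits.result()
        }

        def main(args: Array[String]): Unit = {
          // The scenario from the description: a 562687-byte file with a 562686-byte
          // maximum. Without a minimum this yields a 1-byte trailing split; with a
          // 64 KiB minimum the whole file becomes a single split.
          println(splitFile(562687L, 562686L, 64 * 1024L))
        }
      }

      This sketch lets the last split grow past the configured maximum when the alternative would be a split smaller than the minimum; Hadoop solves the same problem with a slop factor (see the note below).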

      Notes

      Hadoop already has code that does this: https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L456
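
      For reference, the relevant loop in Hadoop's FileInputFormat.getSplits works roughly like this (paraphrased from memory into Scala, not copied from the Java source): splits are only cut while the remainder exceeds the split size by a slop factor of 1.1, so when a file needs more than one split, the trailing split is never an arbitrarily small fragment.

      object HadoopSlopSketch {
        private val SPLIT_SLOP = 1.1

        def splitFile(fileLen: Long, splitSize: Long): Seq[(Long, Long)] = {
          val splits = Seq.newBuilder[(Long, Long)]
          var bytesRemaining = fileLen
          // Keep cutting full-size splits only while more than 1.1 * splitSize remains.
          while (bytesRemaining.toDouble / splitSize > SPLIT_SLOP) {
            splits += ((fileLen - bytesRemaining, splitSize))
            bytesRemaining -= splitSize
          }
          // Whatever is left (at most 1.1 * splitSize) becomes the final split.
          if (bytesRemaining != 0) {
            splits += ((fileLen - bytesRemaining, bytesRemaining))
          }
          splits.result()
        }

        def main(args: Array[String]): Unit = {
          // The 562687-byte file with a 562686-byte split size stays a single split.
          println(splitFile(562687L, 562686L))
        }
      }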

          People

            Assignee: Unassigned
            Reporter: Niels Basjes (nielsbasjes)
            Votes: 1
            Watchers: 4
