Description
Files compressed with the gzip codec are not splittable due to the nature of the codec.
This limits the options for scaling out when reading large gzipped input files.
Given that gunzipping a 1 GiB file usually takes only about 2 minutes, I figured that for some use cases deliberately wasting some resources may result in a shorter overall job time.
So reading the entire input file from the start for each split, and discarding the decompressed bytes that belong to earlier splits (wasting CPU and I/O!), can still improve scalability, as the sketch below illustrates.
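To make the idea concrete, here is a minimal Java sketch of what a per-split reader could do. This is not the patch attached to this issue; the file name, split offset, and helper names are made up for illustration. Each worker opens the same gzip stream, decompresses from byte 0, skips the uncompressed prefix that belongs to earlier splits, and then reads only its own slice.

{code:java}
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;

/**
 * Minimal sketch (hypothetical, not the attached patch): every split
 * decompresses the gzip stream from the start and throws away the
 * uncompressed bytes that belong to earlier splits.
 */
public class NaiveSplittableGzipReader {

    /** Skip 'toSkip' uncompressed bytes, then read up to 'length' bytes. */
    static byte[] readUncompressedRange(String path, long toSkip, int length)
            throws IOException {
        try (InputStream in = new GZIPInputStream(new FileInputStream(path))) {
            long skipped = 0;
            byte[] scratch = new byte[64 * 1024];
            while (skipped < toSkip) {
                // skip() may return short counts, so loop until done.
                long n = in.skip(toSkip - skipped);
                if (n <= 0) {
                    // Some streams stall on skip(); read and discard instead.
                    int r = in.read(scratch, 0,
                            (int) Math.min(scratch.length, toSkip - skipped));
                    if (r < 0) break; // stream shorter than expected
                    n = r;
                }
                skipped += n;
            }
            byte[] out = new byte[length];
            int off = 0;
            while (off < length) {
                int r = in.read(out, off, length - off);
                if (r < 0) break;
                off += r;
            }
            return off == length ? out : Arrays.copyOf(out, off);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: the third of N workers reads its slice of the
        // uncompressed stream, paying the cost of decompressing the prefix.
        byte[] slice = readUncompressedRange("input.gz", 2L * 64_000_000, 1024);
        System.out.println("read " + slice.length + " bytes");
    }
}
{code}

The point of the sketch is the trade-off, not the code: every split re-does the decompression work of all earlier splits, but because gunzip is cheap relative to record processing, the splits can still finish sooner in parallel than one task reading the whole file alone.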
Attachments
Issue Links
is related to:
- HADOOP-7909 Implement a generic splittable signature-based compression format (Open)
- HADOOP-6153 RAgzip: multiple map tasks for a large gzipped file (Resolved)
- SPARK-29102 Read gzipped file into multiple partitions without full gzip expansion on a single-node (Resolved)