Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.4.4
Fix Version/s: None
Description
I have large CSV files that are gzipped and uploaded to S3 with Content-Encoding=gzip. The files keep the plain ".csv" extension, since most web clients will automatically decompress the response based on the Content-Encoding header. Reading these CSV files with PySpark does not mimic this behavior.
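For context, the upload side looks roughly like this (a sketch assuming boto3; the bucket and key names are placeholders):

import gzip
import boto3

# Sketch of the upload (assumes boto3; bucket/key are placeholders).
# The object body is gzip-compressed but keeps the plain ".csv" key;
# the Content-Encoding header tells HTTP clients to decompress it
# transparently on download.
s3 = boto3.client('s3')
with open('large.csv', 'rb') as f:
    body = gzip.compress(f.read())
s3.put_object(
    Bucket='bucket',
    Key='large.csv',
    Body=body,
    ContentEncoding='gzip',
    ContentType='text/csv',
)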
Works as expected:
df = spark.read.csv('s3://bucket/large.csv.gz', header=True)
Does not decompress, and tries to load the entire contents of the file as the first row:
df = spark.read.csv('s3://bucket/large.csv', header=True)
It looks like Spark relies on the file extension to determine whether the file is gzip-compressed. It would be great if S3 resources, and any other HTTP-based resources, could consult the Content-Encoding response header as well.
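In the meantime, the workaround I'm using is a server-side copy to a key ending in ".gz", so the extension-based detection applies (again a sketch assuming boto3; names are placeholders, and `spark` is an existing SparkSession):

import boto3

# Workaround sketch (assumes boto3; bucket/key names are placeholders):
# server-side copy the object to a key ending in ".gz" so Spark's
# extension-based codec detection picks the gzip codec. client.copy()
# is a managed transfer, so it also handles objects over the 5 GB
# single-copy limit.
s3 = boto3.client('s3')
s3.copy(
    CopySource={'Bucket': 'bucket', 'Key': 'large.csv'},
    Bucket='bucket',
    Key='large.csv.gz',
)

df = spark.read.csv('s3://bucket/large.csv.gz', header=True)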
I tried to find the code that determines this, but I'm not familiar with the code base. Any pointers would be helpful, and I can look into fixing it.