Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.4.4
Fix Version/s: None
Description
I have large CSV files that are gzipped and uploaded to S3 with Content-Encoding=gzip. The files keep the plain ".csv" extension, since most web clients will automatically decompress the response based on the Content-Encoding header. Reading these CSV files with PySpark does not mimic this behavior.
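For context, the upload side looks roughly like this (a sketch assuming boto3; the bucket and key names are placeholders):

import gzip
import boto3

# Sketch of the upload (assumes boto3; bucket/key are placeholders).
# The object body is gzip-compressed but keeps the plain ".csv" key;
# the Content-Encoding header tells HTTP clients to decompress it
# transparently on download.
s3 = boto3.client('s3')
with open('large.csv', 'rb') as f:
    body = gzip.compress(f.read())
s3.put_object(
    Bucket='bucket',
    Key='large.csv',
    Body=body,
    ContentEncoding='gzip',
    ContentType='text/csv',
)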
Works as expected:
df = spark.read.csv('s3://bucket/large.csv.gz', header=True)
Does not decompress, and tries to load the entire contents of the file as the first row:
df = spark.read.csv('s3://bucket/large.csv', header=True)
It looks like Spark relies on the file extension to determine whether the file is gzip-compressed. It would be great if S3 resources, and any other HTTP-based resources, could consult the Content-Encoding response header as well.
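In the meantime, the workaround I'm using is a server-side copy to a key ending in ".gz", so the extension-based detection applies (again a sketch assuming boto3; names are placeholders, and `spark` is an existing SparkSession):

import boto3

# Workaround sketch (assumes boto3; bucket/key names are placeholders):
# server-side copy the object to a key ending in ".gz" so Spark's
# extension-based codec detection picks the gzip codec. client.copy()
# is a managed transfer, so it also handles objects over the 5 GB
# single-copy limit.
s3 = boto3.client('s3')
s3.copy(
    CopySource={'Bucket': 'bucket', 'Key': 'large.csv'},
    Bucket='bucket',
    Key='large.csv.gz',
)

df = spark.read.csv('s3://bucket/large.csv.gz', header=True)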
I tried to find the code that determines this, but I'm not familiar with the code base. Any pointers would be helpful, and I can look into fixing it.