[ARROW-11161] [Python][C++] S3Filesystem: file Content-Type not set correctly? - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 5.0.0
Component/s: C++, Python
Labels:
- filesystem
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/27070

Description

I am using the FileSystem abstraction to write HTML / text files to the local filesystem as well as to S3.

I noticed that when using s3_fs.open_output_stream in combination with file.write(bytes), the created object has a Content-Type of 'application/xml' even though it is plain text, which is problematic for me.

Here is a minimal example:

import boto3
from pyarrow.fs import FileSystem

BUCKET = "my-bucket"
path = f"s3://{BUCKET}/pyarrow_encoding.txt"

# Write the object through pyarrow's filesystem abstraction
s3_fs, output_path = FileSystem.from_uri(path)
with s3_fs.open_output_stream(path=output_path, compression=None) as f:
    f.write('hello'.encode('UTF-8'))

# Read it back with boto3 and inspect the Content-Type
s3 = boto3.client('s3')
response = s3.get_object(Bucket=BUCKET, Key='pyarrow_encoding.txt')
print(response['ContentType']) # Output: application/xml
print(response['Body'].read().decode('UTF-8')) # Output: hello

# For comparison, write the same bytes directly with boto3
s3.put_object(Bucket=BUCKET,
              Key='boto3_encoding.txt',
              Body='hello'.encode('UTF-8'))
response = s3.get_object(Bucket=BUCKET, Key='boto3_encoding.txt')
print(response['ContentType']) # Output: binary/octet-stream
print(response['Body'].read().decode('UTF-8')) # Output: hello

I know that the S3Filesystem implementation of pyarrow might not have MIME type inference implemented, but I am wondering why the resulting Content-Type is always 'application/xml'. Maybe this is hardcoded somewhere?

Originally, I tried this with '.html' files, and there too the objects on S3 always got the 'application/xml' Content-Type. (Please also see the attached screenshot from the S3 console.)

Any help or pointer is appreciated.

Thank you,

Nicolas
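
Since the description notes that no MIME type inference happens on write, one possible workaround is to guess the type on the caller's side and pass it explicitly. The sketch below is only an illustration under stated assumptions: it assumes a pyarrow release whose open_output_stream accepts a `metadata` mapping and honors a "Content-Type" key for S3; the helper names `content_type_metadata` and `write_with_content_type`, and the "application/octet-stream" fallback, are hypothetical choices and not part of pyarrow.

```python
import mimetypes


def content_type_metadata(key, default="application/octet-stream"):
    """Return a metadata mapping with a Content-Type guessed from the extension.

    Hypothetical helper: falls back to `default` when the extension is unknown.
    """
    guessed, _encoding = mimetypes.guess_type(key)
    return {"Content-Type": guessed or default}


def write_with_content_type(fs, path, data):
    """Write `data` (bytes) to `path` on a pyarrow-style filesystem `fs`.

    Assumption: fs.open_output_stream accepts a `metadata` keyword.
    Returns the metadata mapping that was passed along.
    """
    metadata = content_type_metadata(path)
    with fs.open_output_stream(path, compression=None, metadata=metadata) as f:
        f.write(data)
    return metadata


# Usage against S3 (requires pyarrow and credentials, so left commented out):
# from pyarrow.fs import FileSystem
# s3_fs, output_path = FileSystem.from_uri("s3://my-bucket/pyarrow_encoding.txt")
# write_with_content_type(s3_fs, output_path, "hello".encode("UTF-8"))
```

Guessing from the extension keeps the choice visible at the call site; whether the stored object actually reports that type can be checked with the same boto3 get_object call used in the example above.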

Attachments

- boto3-metadata.png (07/Jan/21 15:23, 25 kB, Nicolas Renkamp)
- Screen Shot 2021-01-07 at 15.23.07.png (07/Jan/21 14:25, 24 kB, Nicolas Renkamp)

Issue Links

links to

GitHub Pull Request #10295

People

Assignee: Antoine Pitrou
Reporter: Nicolas Renkamp
Votes: 0
Watchers: 5

Dates

Created: 07/Jan/21 14:25
Updated: 11/Jan/23 08:17
Resolved: 01/Jun/21 08:09

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 2.5h