  1. Apache Arrow
  2. ARROW-11161

[Python][C++] S3Filesystem: file Content-Type not set correctly?

Details

    Description

      I am using the FileSystem abstraction to write HTML and text files to the local filesystem as well as to S3.

      I noticed that when using s3_fs.open_output_stream in combination with file.write(bytes), the created object has a Content-Type of 'application/xml' even though it is plain text, which is problematic for me.

      Here is a minimal example:

      import boto3
      from pyarrow.fs import FileSystem
      BUCKET = "my-bucket"
      path = f"s3://{BUCKET}/pyarrow_encoding.txt"
      s3_fs, output_path = FileSystem.from_uri(path)
      with s3_fs.open_output_stream(path=output_path, compression=None) as f:
          f.write('hello'.encode('UTF-8'))
      
      s3 = boto3.client('s3')
      response = s3.get_object(Bucket=BUCKET, Key='pyarrow_encoding.txt')
      print(response['ContentType']) # Output: application/xml
      print(response['Body'].read().decode('UTF-8')) # Output: hello
      
      s3.put_object(Bucket=BUCKET,
                    Key='boto3_encoding.txt',
                    Body='hello'.encode('UTF-8'))
      response = s3.get_object(Bucket=BUCKET, Key='boto3_encoding.txt')
      print(response['ContentType']) # Output: binary/octet-stream
      print(response['Body'].read().decode('UTF-8')) # Output: hello
      

      I know that pyarrow's S3FileSystem implementation might not have MIME type inference implemented, but I am wondering why the resulting Content-Type is always 'application/xml'. Maybe this is hardcoded somewhere?

      Originally, I tried this with '.html' files, and there too the objects on S3 always got the 'application/xml' Content-Type. (Please also see the attached screenshot from the S3 console.)
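
      For reference, here is a minimal sketch of the kind of MIME type inference I have in mind. It uses only the standard-library mimetypes module and is not taken from pyarrow. The helper name guess_content_type and the fallback constant are hypothetical names chosen for illustration.

```python
import mimetypes

# Hypothetical fallback for unknown extensions; it mirrors the
# 'binary/octet-stream' value that boto3 produced in the example above.
DEFAULT_CONTENT_TYPE = "binary/octet-stream"


def guess_content_type(key: str, default: str = DEFAULT_CONTENT_TYPE) -> str:
    """Return a MIME type inferred from the object key's file extension.

    Illustrative helper only; it is not part of pyarrow or boto3.
    """
    # guess_type returns a (type, encoding) tuple; type is None when the
    # extension is missing or not recognized.
    content_type, _encoding = mimetypes.guess_type(key)
    return content_type or default


if __name__ == "__main__":
    for key in ("pyarrow_encoding.txt", "report.html", "no_extension"):
        print(key, "->", guess_content_type(key))
```

      With boto3, a value like this could be passed explicitly through the ContentType argument of put_object. Whether the pyarrow S3 filesystem offers an equivalent hook is exactly what I am unsure about.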

       

      Any help or pointer is appreciated. 

      Thank you,

      Nicolas

      Attachments

        1. Screen Shot 2021-01-07 at 15.23.07.png
          24 kB
          Nicolas Renkamp
        2. boto3-metadata.png
          25 kB
          Nicolas Renkamp

            People

              apitrou Antoine Pitrou
              nicornk Nicolas Renkamp

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2.5h