Apache Arrow / ARROW-16864

[Python] external_id for S3 incorrectly set


Details

    Description

      Reading from S3 via pyarrow appears to fail whenever access goes through Assume Role and no `external_id` is passed to S3FileSystem.

      As I understand it, `external_id` is an optional string passed to the AWS API. However, S3FileSystem defaults it to `external_id=None` in `__init__` and later applies `tobytes()` to it, which fails when `external_id` is None.
      https://github.com/apache/arrow/blob/c72f84a48b4952796ab78a0c33b84a9fc8f893db/python/pyarrow/_s3fs.pyx#L230
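
      The failure mode described above can be illustrated without pyarrow or AWS. The following is a minimal sketch, not the code from `_s3fs.pyx`: `to_cpp_string` is a hypothetical stand-in for the bytes-to-string coercion that raises in the traceback, and `normalize_optional` is a hypothetical guard that treats None as "not provided". It shows one possible way to handle an optional argument and is not necessarily the fix adopted in Arrow.

```python
from typing import Optional


def to_cpp_string(value: object) -> bytes:
    """Hypothetical stand-in for a strict bytes coercion: rejects anything
    that is not bytes, similar to the TypeError seen in the report."""
    if not isinstance(value, bytes):
        raise TypeError(f"expected bytes, {type(value).__name__} found")
    return value


def normalize_optional(value: Optional[str]) -> bytes:
    """Hypothetical guard: map None to an empty byte string before coercion,
    otherwise encode the string as UTF-8."""
    if value is None:
        return b""
    return value.encode("utf-8")


if __name__ == "__main__":
    external_id = None  # the default when the caller passes nothing

    # Unguarded path: None reaches the strict coercion and raises.
    try:
        to_cpp_string(external_id)
    except TypeError as exc:
        print(f"unguarded: {exc}")

    # Guarded path: None is normalized first, so the coercion succeeds.
    print(f"guarded: {to_cpp_string(normalize_optional(external_id))!r}")
```

      Whether an empty string is an acceptable substitute for an omitted `external_id` on the AWS side is an assumption of this sketch. An alternative design would skip setting the field entirely when the value is None.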

      This then leads to an exception like this:

      (...)
          df = cursor.execute(query+';').as_pandas()
        File "/opt/conda/lib/python3.9/site-packages/pyathena/util.py", line 37, in _wrapper
          return wrapped(*args, **kwargs)
        File "/opt/conda/lib/python3.9/site-packages/pyathena/pandas/cursor.py", line 157, in execute
          self.result_set = AthenaPandasResultSet(
        File "/opt/conda/lib/python3.9/site-packages/pyathena/pandas/result_set.py", line 72, in __init__
          self._fs = self.__s3_file_system()
        File "/opt/conda/lib/python3.9/site-packages/pyathena/pandas/result_set.py", line 86, in __s3_file_system
          fs = fs.S3FileSystem(
        File "pyarrow/_s3fs.pyx", line 217, in pyarrow._s3fs.S3FileSystem.__init__
        File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string
      TypeError: expected bytes, NoneType found
      

      This exception occurs when using pyarrow with the pyathena library, whose code does not pass any `external_id`.


            People

              apitrou Antoine Pitrou
              nichoio Nicholas Kappel

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h