Apache Arrow / ARROW-5647

[Python] Reading a file on Databricks with pandas read_parquet and the pyarrow engine fails with "Passed non-file path: /mnt/aa/example.parquet"

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.13.0
    • Fix Version/s: 0.14.0
    • Component/s: Python
    • Labels:
      None
    • Environment:
      Azure Databricks

      Description

      Reading a file through a mount point that points to an Azure Blob Storage account fails with the following error:
      OSError: Passed non-file path: /mnt/aa/example.parquet
      ---------------------------------------------------------------------------
      OSError                                   Traceback (most recent call last)
      <command-1848295812523966> in <module>()
      ----> 1 pddf2 = pd.read_parquet("/mnt/aa/example.parquet", engine='pyarrow')
            2 display(pddf2)

      /databricks/python/lib/python3.5/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
          280
          281     impl = get_engine(engine)
      --> 282     return impl.read(path, columns=columns, **kwargs)

      /databricks/python/lib/python3.5/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
          127         kwargs['use_pandas_metadata'] = True
          128         result = self.api.parquet.read_table(path, columns=columns,
      --> 129                                              **kwargs).to_pandas()
          130         if should_close:
          131             try:

      /databricks/python/lib/python3.5/site-packages/pyarrow/parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, filesystem)
         1150         return fs.read_parquet(path, columns=columns,
         1151                                use_threads=use_threads, metadata=metadata,
      -> 1152                                use_pandas_metadata=use_pandas_metadata)
         1153
         1154     pf = ParquetFile(source, metadata=metadata)

      /databricks/python/lib/python3.5/site-packages/pyarrow/filesystem.py in read_parquet(self, path, columns, metadata, schema, use_threads, use_pandas_metadata)
          177         from pyarrow.parquet import ParquetDataset
          178         dataset = ParquetDataset(path, schema=schema, metadata=metadata,
      --> 179                                  filesystem=self)
          180         return dataset.read(columns=columns, use_threads=use_threads,
          181                             use_pandas_metadata=use_pandas_metadata)

      /databricks/python/lib/python3.5/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, memory_map)
          933          self.metadata_path) = _make_manifest(
          934             path_or_paths, self.fs, metadata_nthreads=metadata_nthreads,
      --> 935             open_file_func=self._open_file_func)
          936
          937         if self.common_metadata_path is not None:

      /databricks/python/lib/python3.5/site-packages/pyarrow/parquet.py in _make_manifest(path_or_paths, fs, pathsep, metadata_nthreads, open_file_func)
         1108             if not fs.isfile(path):
         1109                 raise IOError('Passed non-file path: {0}'
      -> 1110                               .format(path))
         1111             piece = ParquetDatasetPiece(path, open_file_func=open_file_func)
         1112             pieces.append(piece)

      OSError: Passed non-file path: /mnt/aa/example.parquet
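
      The last frame of the traceback shows that the error comes from an isfile check on the path, not from parsing the Parquet data. The sketch below imitates that guard so the message can be reproduced outside Databricks. It is an illustration only: check_is_file is a hypothetical helper, and os.path.isfile stands in for the fs.isfile call, on the assumption that the filesystem in use is the local one.

```python
import os
import tempfile


def check_is_file(path):
    """Hypothetical stand-in for the isfile guard shown in the traceback.

    Assumes the local filesystem; raises the same message for any path
    that is missing or is not a regular file.
    """
    if not os.path.isfile(path):
        raise IOError('Passed non-file path: {0}'.format(path))
    return path


with tempfile.TemporaryDirectory() as tmp:
    # A path that exists as a regular file passes the guard.
    existing = os.path.join(tmp, "example.parquet")
    open(existing, "wb").close()
    check_is_file(existing)

    # A path that does not exist triggers the error from the report.
    missing = os.path.join(tmp, "missing.parquet")
    try:
        check_is_file(missing)
    except OSError as exc:  # IOError is an alias of OSError in Python 3
        message = str(exc)
```

      Under this reading, the same message would appear for any path that the process cannot see as a regular file, including one that simply does not exist on the local filesystem.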
       
      I am using the following code from a Databricks notebook to reproduce the issue:
      %sh
      sudo apt-get -y install python3-pip
      /databricks/python3/bin/pip3 uninstall pandas -y
      /databricks/python3/bin/pip3 uninstall numpy -y

      /databricks/python3/bin/pip3 uninstall pyarrow -y
       
       
      %sh
      /databricks/python3/bin/pip3 install numpy==1.14.0
      /databricks/python3/bin/pip3 install pandas==0.24.1
      /databricks/python3/bin/pip3 install pyarrow==0.13.0

       
      dbutils.fs.mount(
        source = "wasbs://<mycontainer>@<mystorageaccount>.blob.core.windows.net",
        mount_point = "/mnt/aa",
        extra_configs = {"fs.azure.account.key.<mystorageaccount>.blob.core.windows.net":dbutils.secrets.get(scope = "storage", key = "blob_key")})

       
      import pandas as pd

      pddf2 = pd.read_parquet("/mnt/aa/example.parquet", engine='pyarrow')
      display(pddf2)
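
      One thing worth checking when reproducing this: dbutils.fs.mount creates the mount point inside DBFS, while pandas and pyarrow go through the local file API of the driver node. The sketch below assumes that DBFS is exposed to local file APIs under a /dbfs prefix (the usual Databricks FUSE mount); resolve_local_path and its fuse_root parameter are hypothetical names introduced for illustration and are not part of pandas, pyarrow, or Databricks.

```python
import os


def resolve_local_path(path, fuse_root="/dbfs"):
    """Return the path under `fuse_root` if the file exists only there.

    Hypothetical helper: assumes DBFS paths such as /mnt/aa/... are
    visible to local file APIs as <fuse_root>/mnt/aa/...
    """
    candidate = os.path.join(fuse_root, path.lstrip("/"))
    if not os.path.isfile(path) and os.path.isfile(candidate):
        return candidate
    # Fall back to the original path and let the caller report any error.
    return path
```

      With this helper, the call in the reproduction could be written as pd.read_parquet(resolve_local_path("/mnt/aa/example.parquet"), engine='pyarrow'). If the assumption about the /dbfs prefix does not hold in a given workspace, the path is returned unchanged and the original error is raised as before.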

       

        Attachments

        1. arrow_error.txt
          3 kB
          Simon Lidberg

            People

            • Assignee: Unassigned
            • Reporter: Simon Lidberg (simonlid)