  1. Apache Arrow
  2. ARROW-10462

[Python] ParquetDatasetPiece's path broken when using fsspec fs on Windows

Details

    Description

      Dask reported some failures starting with the pyarrow 2.0 release, and specifically on Windows: https://github.com/dask/dask/issues/6754

      After some investigation, it seems that this is due to the path attribute of ParquetDatasetPiece now returning a path that contains a mixture of \\ and /.

      It specifically happens when dask passes a posix-style base path pointing to the dataset base directory (so using only /) together with an fsspec-based (local) filesystem.
      From the debugging output of one of the dask tests:

      (Pdb) dataset
      <pyarrow.parquet.ParquetDataset object at 0x00000290D7506308>
      (Pdb) dataset.paths
      'C:/Users/joris/AppData/Local/Temp/pytest-of-joris/pytest-25/test_partition_on_pyarrow_0'
      (Pdb) dataset.pieces[0].path
      'C:/Users/joris/AppData/Local/Temp/pytest-of-joris/pytest-25/test_partition_on_pyarrow_0\\a1=A\\a2=X\\part.0.parquet'
      

      As you can see, the result here has a mix of \\ and /. With pyarrow 1.0, the path consistently used /.

      The reason for the change is that in pyarrow 2.0 we started to replace the fsspec LocalFileSystem with our own LocalFileSystem (assuming that the two should be equivalent for a local filesystem). But it seems that our own LocalFileSystem has a pathsep property that equals os.path.sep, which is \\ on Windows (https://github.com/apache/arrow/blob/9231976609d352b7050f5c706b86c15e8c604927/python/pyarrow/filesystem.py#L304-L306).
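
      The effect of the separator choice can be sketched without pyarrow. The snippet below is only an illustration, assuming that a piece path is formed by joining the base path and the partition components with the filesystem's pathsep; join_with_sep and the example path are hypothetical and not taken from the pyarrow source. ntpath.sep stands in for os.path.sep on Windows so the sketch behaves the same on any platform.

```python
import ntpath      # Windows path rules; ntpath.sep is "\\" on every platform
import posixpath   # POSIX path rules; posixpath.sep is "/"


def join_with_sep(base, parts, pathsep):
    """Hypothetical helper: join a base path and child components with pathsep."""
    return pathsep.join([base, *parts])


# Illustrative posix-style base path, as dask would pass it on Windows.
base = "C:/Users/joris/tmp/dataset"
parts = ["a1=A", "a2=X", "part.0.parquet"]

# pathsep equal to os.path.sep on Windows gives a mixed-separator path.
windows_style = join_with_sep(base, parts, ntpath.sep)

# pathsep equal to "/" keeps the path consistent with the base.
posix_style = join_with_sep(base, parts, posixpath.sep)

print(windows_style)
print(posix_style)
```

      With a backslash separator the base keeps its forward slashes while every appended component is introduced by a backslash, which matches the shape of the path shown in the debugging output above.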

      So note that while this started being broken in pyarrow 2.0 when using an fsspec filesystem, it was already "broken" before when using our own local filesystem (or when not passing any filesystem). But 1) dask always passes an fsspec filesystem, and 2) dask uses the piece's path as a dictionary key and is thus especially sensitive to the change (when the path is only used to read a file, it will probably still work even with the mixture of path separators).
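
      Why a dictionary key is more fragile than a file read can be shown with a small sketch. This is not dask's actual code: the dictionary contents, the example paths and the to_posix helper are assumptions made for illustration only.

```python
def to_posix(path):
    """Hypothetical normalizer: use forward slashes only."""
    return path.replace("\\", "/")


# Path as reported by the piece (mixed separators) versus the path a caller
# would build itself from a posix-style base. Both are illustrative.
mixed = "C:/data/dataset\\a1=A\\a2=X\\part.0.parquet"
expected = "C:/data/dataset/a1=A/a2=X/part.0.parquet"

# Keys are compared as plain strings, so the two spellings do not match.
stats_by_path = {mixed: "row-group metadata"}
found_before = expected in stats_by_path

# Normalizing the keys makes the lookup succeed again.
normalized = {to_posix(key): value for key, value in stats_by_path.items()}
found_after = expected in normalized

print(found_before, found_after)
```

      The two strings name the same file on Windows, yet they are different dictionary keys, so a lookup with the all-forward-slash spelling misses unless both sides are normalized first.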

            People

              jorisvandenbossche Joris Van den Bossche

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 40m