  Apache Arrow / ARROW-10029

[Python] Deadlock in the interaction of pyarrow FileSystem and ParquetDataset

    Description

      @martindurant good news (for you): I have a repro test case that is 100% pyarrow, so it looks like s3fs is not involved.

      @jorisvandenbossche how should I follow up on this, given that the repro is based on pyarrow.filesystem.LocalFileSystem?

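      If the file system directories are viewed as a tree, one thread is required for every non-leaf node in order to avoid deadlock. In the layout below, the six non-leaf directories are numbered and the leaf directories are marked with *.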

      1) dataset
      2) dataset/foo=1
      3) dataset/foo=1/bar=2
      4) dataset/foo=1/bar=2/baz=0
      5) dataset/foo=1/bar=2/baz=1
      6) dataset/foo=1/bar=2/baz=2
      *) dataset/foo=1/bar=2/baz=0/qux=false
      *) dataset/foo=1/bar=2/baz=1/qux=false
      *) dataset/foo=1/bar=2/baz=1/qux=true
      *) dataset/foo=1/bar=2/baz=0/qux=true
      *) dataset/foo=1/bar=2/baz=2/qux=false
      *) dataset/foo=1/bar=2/baz=2/qux=true
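
      The threshold described above can be checked against a concrete layout. The following is a minimal sketch, not part of the original repro: it recreates the directory skeleton from the listing in a temporary directory (empty directories only, no Parquet files) and counts the directories that contain at least one subdirectory. The helper name count_non_leaf_dirs is hypothetical.

```python
import os
import tempfile


def count_non_leaf_dirs(root):
    """Count directories under root (root included) that have at least one subdirectory."""
    return sum(1 for _dirpath, dirnames, _files in os.walk(root) if dirnames)


with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "dataset")
    # Recreate the skeleton from the listing: foo=1/bar=2/baz={0,1,2}/qux={false,true}
    for baz in range(3):
        for qux in ("false", "true"):
            os.makedirs(os.path.join(root, "foo=1", "bar=2", f"baz={baz}", f"qux={qux}"))
    non_leaf = count_non_leaf_dirs(root)

print(non_leaf)  # prints 6
```

      For this layout the count is 6, which is the smallest metadata_nthreads value that completed in the runs reported here.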

      import pyarrow.parquet as pq
      import pyarrow.filesystem as pafs  # aliased so the instance below does not shadow the module
      
      class LoggingLocalFileSystem(pafs.LocalFileSystem):
          """LocalFileSystem that prints every directory it walks."""
          def walk(self, path):
              print(path)
              return super().walk(path)
      
      local_fs = LoggingLocalFileSystem()
      dataset_url = "dataset"
      
      # Six threads, one per non-leaf directory: completes.
      threads = 6
      dataset = pq.ParquetDataset(dataset_url, filesystem=local_fs, validate_schema=False, metadata_nthreads=threads)
      print(len(dataset.pieces))
      
      # Five threads: hangs indefinitely.
      threads = 5
      dataset = pq.ParquetDataset(dataset_url, filesystem=local_fs, validate_schema=False, metadata_nthreads=threads)
      print(len(dataset.pieces))
      

      The call with 6 threads completes.

      The call with 5 threads hangs indefinitely.

      $ python repro.py 
      dataset
      dataset/foo=1
      dataset/foo=1/bar=2
      dataset/foo=1/bar=2/baz=0
      dataset/foo=1/bar=2/baz=1
      dataset/foo=1/bar=2/baz=2
      dataset/foo=1/bar=2/baz=0/qux=false
      dataset/foo=1/bar=2/baz=0/qux=true
      dataset/foo=1/bar=2/baz=1/qux=false
      dataset/foo=1/bar=2/baz=1/qux=true
      dataset/foo=1/bar=2/baz=2/qux=false
      dataset/foo=1/bar=2/baz=2/qux=true
      6
      dataset
      dataset/foo=1
      dataset/foo=1/bar=2
      dataset/foo=1/bar=2/baz=0
      dataset/foo=1/bar=2/baz=1
      dataset/foo=1/bar=2/baz=2
      ^C
      ...
      KeyboardInterrupt
      ^C
      ...
      KeyboardInterrupt
      

      *NOTE:* this also happens with the un-decorated LocalFileSystem, and when omitting the filesystem argument.

      Attachments

        1. repro.py
          1 kB
          David McGuire

          People

            Assignee: Unassigned
            Reporter: David McGuire (dmcguire)