Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.0.1
Fix Version/s: None
Description
@martindurant good news (for you): I have a repro test case that is 100% pyarrow, so it looks like s3fs is not involved.
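@jorisvandenbossche how should I follow up on this, now that it reproduces with pyarrow.filesystem.LocalFileSystem?
Viewing the file system directories as a tree, one thread is required for every non-leaf node in order to avoid deadlock. In the layout below, the numbered entries are the non-leaf directories (6 in total) and the entries marked * are leaves.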
1) dataset
2) dataset/foo=1
3) dataset/foo=1/bar=2
4) dataset/foo=1/bar=2/baz=0
5) dataset/foo=1/bar=2/baz=1
6) dataset/foo=1/bar=2/baz=2
*) dataset/foo=1/bar=2/baz=0/qux=false
*) dataset/foo=1/bar=2/baz=1/qux=false
*) dataset/foo=1/bar=2/baz=1/qux=true
*) dataset/foo=1/bar=2/baz=0/qux=true
*) dataset/foo=1/bar=2/baz=2/qux=false
*) dataset/foo=1/bar=2/baz=2/qux=true
import pyarrow.parquet as pq
import pyarrow.filesystem as fs


class LoggingLocalFileSystem(fs.LocalFileSystem):
    def walk(self, path):
        print(path)
        return super().walk(path)


fs = LoggingLocalFileSystem()
dataset_url = "dataset"

threads = 6
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False, metadata_nthreads=threads)
print(len(dataset.pieces))

threads = 5
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False, metadata_nthreads=threads)
print(len(dataset.pieces))

The call with metadata_nthreads=6 completes.
The call with metadata_nthreads=5 hangs indefinitely.
$ python repro.py
dataset
dataset/foo=1
dataset/foo=1/bar=2
dataset/foo=1/bar=2/baz=0
dataset/foo=1/bar=2/baz=1
dataset/foo=1/bar=2/baz=2
dataset/foo=1/bar=2/baz=0/qux=false
dataset/foo=1/bar=2/baz=0/qux=true
dataset/foo=1/bar=2/baz=1/qux=false
dataset/foo=1/bar=2/baz=1/qux=true
dataset/foo=1/bar=2/baz=2/qux=false
dataset/foo=1/bar=2/baz=2/qux=true
6
dataset
dataset/foo=1
dataset/foo=1/bar=2
dataset/foo=1/bar=2/baz=0
dataset/foo=1/bar=2/baz=1
dataset/foo=1/bar=2/baz=2
^C
...
KeyboardInterrupt
^C
...
KeyboardInterrupt
*NOTE:* This also happens with the unmodified LocalFileSystem, and when the filesystem argument is omitted.
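For anyone trying this on another layout, the sketch below counts the non-leaf directories of a tree. It takes the observation in this report (one thread per non-leaf directory) as a rule of thumb, not as a documented pyarrow guarantee. `make_layout` and `count_non_leaf_dirs` are hypothetical helper names, and `make_layout` creates only the partition directories, with no Parquet files in them.

```python
import os
import tempfile

# Leaf partition directories from the layout in this report, relative to the root.
LEAVES = [
    f"foo=1/bar=2/baz={b}/qux={q}"
    for b in range(3)
    for q in ("false", "true")
]


def make_layout(root):
    """Create the partition directories only (no Parquet files are written)."""
    for leaf in LEAVES:
        os.makedirs(os.path.join(root, leaf), exist_ok=True)


def count_non_leaf_dirs(root):
    """Count directories at or below `root` that contain a subdirectory."""
    return sum(1 for _path, dirnames, _files in os.walk(root) if dirnames)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dataset = os.path.join(tmp, "dataset")
        make_layout(dataset)
        # 6 for this layout: dataset, foo=1, bar=2 and the three baz=* directories
        print(count_non_leaf_dirs(dataset))
```

For this layout the count is 6, the smallest metadata_nthreads value that completed in the run above.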