I tried using pyarrow.dataset and pq.ParquetDataset(use_legacy_dataset=False), and my connection to Azure Blob fails.
I know the documentation says only hdfs and s3 are implemented, but I have been using Azure Blob by passing an fsspec filesystem when reading and writing parquet files/datasets with Pyarrow (with use_legacy_dataset=True). Dask also works with Azure Blob via storage_options.
I am hoping that Azure Blob will be supported because I'd really like to try out the new row filtering and non-hive partitioning schemes.
This is what I use for the filesystem when calling read_table() or write_to_dataset():
It seems like the class _ParquetDatasetV2 has a check that the original ParquetDataset does not have. Perhaps this is why the fsspec filesystem fails when I turn off the legacy dataset?
Line 1423 in arrow/python/pyarrow/parquet.py:
if filesystem is not None:
    filesystem = pyarrow.fs._ensure_filesystem(
        filesystem, use_mmap=memory_map)
I got this to work using fsspec on single files on Azure Blob:
When I try this on a partitioned dataset I made using write_to_dataset, however, I run into an error. I tried the same code as above and also the partitioning='hive' option.