Details
-
Bug
-
Status: Closed
-
Minor
-
Resolution: Fixed
-
2.0.0
-
Ubuntu 18.04
adal 1.2.5 pyh9f0ad1d_0 conda-forge
adlfs 0.5.9 pyhd8ed1ab_0 conda-forge
apache-airflow 1.10.14 pypi_0 pypi
azure-common 1.1.24 py_0 conda-forge
azure-core 1.9.0 pyhd3deb0d_0 conda-forge
azure-datalake-store 0.0.51 pyh9f0ad1d_0 conda-forge
azure-identity 1.5.0 pyhd8ed1ab_0 conda-forge
azure-nspkg 3.0.2 py_0 conda-forge
azure-storage-blob 12.6.0 pyhd3deb0d_0 conda-forge
azure-storage-common 2.1.0 py37hc8dfbb8_3 conda-forge
fsspec 0.8.5 pyhd8ed1ab_0 conda-forge
jupyterlab_pygments 0.1.2 pyh9f0ad1d_0 conda-forge
pandas 1.2.0 py37ha9443f7_0
pyarrow 2.0.0 py37h4935f41_6_cpu conda-forge
Description
In a Jupyter notebook, I have noticed that sometimes I am not able to read a dataset which certainly exists on Azure Blob.
fs = fsspec.filesystem("abfs", account_name=account_name, account_key=account_key)
One example of this is reading a dataset in one cell:
ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
Then in another cell I try to read the same dataset:
ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-514-bf63585a0c1b> in <module>
----> 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
    669     # TODO(kszucs): support InMemoryDataset for a table input
    670     if _is_path_like(source):
--> 671         return _filesystem_dataset(source, **kwargs)
    672     elif isinstance(source, (tuple, list)):
    673         if all(_is_path_like(elem) for elem in source):

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
    426         fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
    427     else:
--> 428         fs, paths_or_selector = _ensure_single_source(source, filesystem)
    429
    430     options = FileSystemFactoryOptions(

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _ensure_single_source(path, filesystem)
    402         paths_or_selector = [path]
    403     else:
--> 404         raise FileNotFoundError(path)
    405
    406     return filesystem, paths_or_selector

FileNotFoundError: dev/test-split
If I restart the kernel, it works again. It also works if I change the path slightly, for example by adding a "/" at the end (so basically it only fails if I read the same dataset twice with the identical path):
ds.dataset("dev/test-split/", partitioning="hive", filesystem=fs)
The other strange behavior I have noticed is that if I read a dataset inside my Jupyter notebook, it takes only a few seconds:
%%time
dataset = ds.dataset(
    "dev/test-split",
    partitioning=ds.partitioning(pa.schema([("date", pa.date32())]), flavor="hive"),
    filesystem=fs,
    exclude_invalid_files=False,
)

CPU times: user 1.98 s, sys: 0 ns, total: 1.98 s
Wall time: 2.58 s
Now, on the exact same server, when I try to run the same code against the same dataset in Airflow, it takes over 3 minutes (comparing the timestamps in my logs between right before I read the dataset and immediately after the dataset is available to filter):
[2021-01-14 03:52:04,011] INFO - Reading dev/test-split
[2021-01-14 03:55:17,360] INFO - Processing dataset in batches
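To put a number on the gap between those two log lines, here is a minimal sketch using only the standard library. The helper name elapsed_seconds and the constant LOG_TIME_FORMAT are made up for illustration; the format string assumes the "YYYY-MM-DD HH:MM:SS,mmm" layout seen in the log excerpt.

```python
from datetime import datetime

# Assumed timestamp layout, taken from the log excerpt: "YYYY-MM-DD HH:MM:SS,mmm"
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    started = datetime.strptime(start, LOG_TIME_FORMAT)
    finished = datetime.strptime(end, LOG_TIME_FORMAT)
    return (finished - started).total_seconds()


# The two timestamps from the Airflow log lines.
read_started = "2021-01-14 03:52:04,011"
batches_started = "2021-01-14 03:55:17,360"

elapsed = elapsed_seconds(read_started, batches_started)
print(f"{elapsed:.3f} s")
```

For these two timestamps the gap is a little over 193 seconds, compared with the 2.58 s wall time measured in the notebook.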
This is probably not a pyarrow issue, but what are some potential causes that I can look into? I have one example where it takes 9 seconds to read the dataset in Jupyter but 11 minutes in Airflow. I don't really know what to investigate. As I mentioned, the Jupyter notebook and Airflow are on the same server, and both are deployed using Docker. Airflow is using the CeleryExecutor.
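One possible way to narrow this down is to wrap the dataset read in a profiler in both environments and compare where the time goes. The sketch below uses only the standard library (cProfile, pstats, logging). The names profiled and load_dataset are hypothetical, and load_dataset is a stand-in that would need to be replaced with the real ds.dataset(...) call. It is a starting point under those assumptions, not a diagnosis.

```python
import cProfile
import io
import logging
import pstats
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def profiled(label, top=15):
    """Log wall time and the most expensive calls (by cumulative time) for a block."""
    profiler = cProfile.Profile()
    start = time.perf_counter()
    profiler.enable()
    try:
        yield profiler
    finally:
        profiler.disable()
        elapsed = time.perf_counter() - start
        report = io.StringIO()
        stats = pstats.Stats(profiler, stream=report)
        stats.sort_stats("cumulative").print_stats(top)
        logger.info("%s took %.2f s\n%s", label, elapsed, report.getvalue())


def load_dataset():
    # Hypothetical placeholder: replace the body with the real ds.dataset(...) call.
    return sum(range(100_000))


with profiled("Reading dev/test-split"):
    result = load_dataset()
```

Running the same wrapper in the notebook and inside the Airflow task should show whether the extra minutes are spent in filesystem listing calls, in network I/O, or somewhere else entirely. Note that cProfile only records the thread it was enabled in, and the report is emitted at INFO level, so the logger has to be configured to show it.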