Apache Arrow / ARROW-10694

[Python] ds.write_dataset() generates empty files for each final partition


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 2.0.0
    • Fix Version: 3.0.0
    • Components: C++, Python
    • Environment: Ubuntu 18.04, Python 3.8.6, adlfs master branch

    Description

      ds.write_dataset() is generating an empty file for each final partition folder, which causes errors when reading the dataset or converting it to a table.

      I believe this may be caused by fs.mkdir(). Without a trailing slash in the path, an empty file is created in the "dev" container:

      import fsspec

      # abfs filesystem via adlfs; base.login / base.password hold the storage account credentials
      fs = fsspec.filesystem(protocol='abfs', account_name=base.login, account_key=base.password)
      fs.mkdir("dev/test2")
      

       

      If a trailing slash is added, a proper folder is created:

      fs.mkdir("dev/test2/")

       

      Here is a full example of what happens with ds.write_dataset:

      import pyarrow as pa
      import pyarrow.dataset as ds
      from pyarrow.dataset import DirectoryPartitioning

      schema = pa.schema(
          [
              ("year", pa.int16()),
              ("month", pa.int8()),
              ("day", pa.int8()),
              ("report_date", pa.date32()),
              ("employee_id", pa.string()),
              ("designation", pa.dictionary(index_type=pa.int16(), value_type=pa.string())),
          ]
      )

      part = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())]))

      # `table` is a pyarrow.Table matching `schema`; its construction is omitted here
      ds.write_dataset(data=table,
                       base_dir="dev/test-dataset",
                       basename_template="test-{i}.parquet",
                       format="parquet",
                       partitioning=part,
                       schema=schema,
                       filesystem=fs)

      # re-open the written dataset (this step is implied by the use of `dataset` below)
      dataset = ds.dataset("dev/test-dataset", format="parquet", partitioning=part, filesystem=fs)

      dataset.files
      # sample output printed below; note the empty zero-byte entries
      [
       'dev/test-dataset/2018/1/1/test-0.parquet',
       'dev/test-dataset/2018/10/1',
       'dev/test-dataset/2018/10/1/test-27.parquet',
       'dev/test-dataset/2018/3/1',
       'dev/test-dataset/2018/3/1/test-6.parquet',
       'dev/test-dataset/2020/1/1',
       'dev/test-dataset/2020/1/1/test-2.parquet',
       'dev/test-dataset/2020/10/1',
       'dev/test-dataset/2020/10/1/test-29.parquet',
       'dev/test-dataset/2020/11/1',
       'dev/test-dataset/2020/11/1/test-32.parquet',
       'dev/test-dataset/2020/2/1',
       'dev/test-dataset/2020/2/1/test-5.parquet',
       'dev/test-dataset/2020/7/1',
       'dev/test-dataset/2020/7/1/test-20.parquet',
       'dev/test-dataset/2020/8/1',
       'dev/test-dataset/2020/8/1/test-23.parquet',
       'dev/test-dataset/2020/9/1',
       'dev/test-dataset/2020/9/1/test-26.parquet'
      ]

      As you can see, there is an empty file for each "day" partition. I was not able to read the dataset at all until I manually deleted the first empty file in the dataset (2018/1/1).

      I then get an error when I try to use the to_table() method:

      OSError                                   Traceback (most recent call last)
      <ipython-input-127-6fb0d79c4511> in <module>
      ----> 1 dataset.to_table()

      /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
      /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
      /opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
      /opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

      OSError: Could not open parquet input source 'dev/test-dataset/2018/10/1': Invalid: Parquet file size is 0 bytes
      

      If I manually delete the empty files, I can then use the to_table() method:

      dataset.to_table(filter=(ds.field("year") == 2020) & (ds.field("month") == 10)).to_pandas()
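
      For completeness, the manual cleanup can be scripted instead of deleting blobs by hand (a minimal sketch, assuming the fsspec fs handle from above; find() and rm() are standard fsspec methods):

      # remove every zero-byte "file" that shadows a partition directory
      for path, info in fs.find("dev/test-dataset", detail=True).items():
          if info["type"] == "file" and info["size"] == 0:
              fs.rm(path)

      Alternatively, ds.dataset() accepts exclude_invalid_files=True, which skips entries that are not readable Parquet files, at the cost of inspecting each file during discovery.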
      

      Is this a bug with pyarrow, adlfs, or fsspec?

       


          People

            Assignee: Unassigned
            Reporter: Lance Dacey