Apache Arrow / ARROW-7617

[Python] parquet.write_to_dataset creates empty partitions for non-observed dictionary items (categories)


Details

    Description

      Hello,

      It looks like a row selection on a dataframe is not properly respected by parquet.write_to_dataset when the partition column is categorical.

      For the following dummy dataframe:

       

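      import numpy as np
      import pandas as pd

      # 10000 daily rows starting 1990-01-01, four random columns plus the year
      d = pd.date_range('1990-01-01', freq='D', periods=10000)
      vals = np.random.randn(len(d), 4)  # pd.np is deprecated; use numpy directly
      x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
      x['Year'] = x.index.year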
      

      A slice by Year is written to a partitioned Parquet dataset correctly:

      import pyarrow as pa
      import pyarrow.parquet as pq
      table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
      pq.write_to_dataset(table, root_path='test_a.parquet', partition_cols=['Year'])
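
One way to check which partitions were actually written is to list the `Year=<value>` directories under the dataset root. The helper below is a minimal sketch using only the standard library; `list_partitions` is a hypothetical name introduced here for illustration and is not part of pyarrow.

```python
from pathlib import Path


def list_partitions(root_path, column='Year'):
    """Map each '<column>=<value>' directory under root_path to its count of .parquet files."""
    root = Path(root_path)
    prefix = f'{column}='
    return {
        p.name[len(prefix):]: len(list(p.glob('*.parquet')))
        for p in sorted(root.iterdir())
        if p.is_dir() and p.name.startswith(prefix)
    }
```

Calling `list_partitions('test_a.parquet')` after the write above should show a single entry for 1990 if the write behaved as described; a partition that maps to 0 is a directory with no data files in it.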

      However, if Year is converted to a pandas Categorical, the output covers every year of the original dataframe (with empty partitions for the categories not observed in the slice), not only the slice Year=1990:

      x['Year'] = x['Year'].astype('category')
      
      table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
      pq.write_to_dataset(table, root_path='test_b.parquet', partition_cols=['Year'])
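
The likely cause can be seen on the pandas side alone: filtering rows of a categorical column keeps the full category list, so the slice still carries every year as a category. The sketch below shows this and a possible workaround with `Series.cat.remove_unused_categories()`. The seeded generator and the variable names are illustrative choices, and whether trimming the categories avoids the empty partitions in a given pyarrow version is an assumption, not something confirmed in this report.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the report, with a seeded generator for reproducibility.
d = pd.date_range('1990-01-01', freq='D', periods=10000)
vals = np.random.default_rng(0).standard_normal((len(d), 4))
x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
x['Year'] = x.index.year
x['Year'] = x['Year'].astype('category')

subset = x[x.Year == 1990]

# Row filtering does not shrink the category list: one category per year of the full frame.
n_categories_before = len(subset['Year'].cat.categories)

# Possible workaround (assumption): drop unobserved categories before converting to Arrow.
trimmed = subset.assign(Year=subset['Year'].cat.remove_unused_categories())
n_categories_after = len(trimmed['Year'].cat.categories)

print(n_categories_before, n_categories_after)
```

`trimmed` can then be passed to `pa.Table.from_pandas(..., preserve_index=False)` in place of the raw slice.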
      


          People

            Assignee: Unassigned
            Reporter: Filimonov Vladimir
            Votes: 0
            Watchers: 6
