Details
Bug
Status: Reopened
Major
Resolution: Unresolved
0.15.1
None
Description
Hello,
It looks like a selection (view) on a categorical column is not respected when writing a partitioned Parquet dataset.
For the following dummy dataframe:
    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    d = pd.date_range('1990-01-01', freq='D', periods=10000)
    vals = np.random.randn(len(d), 4)
    x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
    x['Year'] = x.index.year
The slice for Year == 1990 is saved to a partitioned Parquet dataset properly:
    table = pa.Table.from_pandas(x[x.Year == 1990], preserve_index=False)
    pq.write_to_dataset(table, root_path='test_a.parquet', partition_cols=['Year'])
However, if we convert Year to a pandas Categorical, the whole original dataframe is saved, not only the Year == 1990 slice:
    x['Year'] = x['Year'].astype('category')
    table = pa.Table.from_pandas(x[x.Year == 1990], preserve_index=False)
    pq.write_to_dataset(table, root_path='test_b.parquet', partition_cols=['Year'])
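To compare the two outputs, one could list the partition directories under each root path. This is only a minimal sketch: it assumes the datasets use the `Year=<value>` directory layout that `write_to_dataset` creates for `partition_cols`, and `partition_dirs` is a hypothetical helper, not part of pyarrow.

```python
from pathlib import Path


def partition_dirs(root_path):
    """Return the sorted names of the first-level directories under root_path.

    Hypothetical helper for inspecting a partitioned dataset on disk;
    plain files directly under root_path are ignored.
    """
    root = Path(root_path)
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

Calling `partition_dirs('test_a.parquet')` and `partition_dirs('test_b.parquet')` after running the snippets above should show whether the second dataset contains partitions other than `Year=1990`.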