[ARROW-7617] [Python] parquet.write_to_dataset creates empty partitions for non-observed dictionary items (categories) - ASF JIRA

Details

Type: Bug
Status: Reopened
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.15.1
Fix Version/s: None
Component/s: Python
Labels:

External issue URL:
https://github.com/apache/arrow/issues/23870

Description

Hello,

It looks like views that select rows along a categorical column are not properly respected.

For the following dummy dataframe:

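import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# 10000 daily rows starting in 1990, with four random columns
d = pd.date_range('1990-01-01', freq='D', periods=10000)
vals = np.random.randn(len(d), 4)  # pd.np is deprecated; use numpy directly
x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
x['Year'] = x.index.year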

A slice by Year is written to a partitioned Parquet dataset correctly:

table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
pq.write_to_dataset(table, root_path='test_a.parquet', partition_cols=['Year'])

However, if Year is converted to a pandas Categorical, the write is no longer limited to the Year=1990 slice. Partitions are created for every category of the original dataframe, including years the slice does not contain:

x['Year'] = x['Year'].astype('category')

table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
pq.write_to_dataset(table, root_path='test_b.parquet', partition_cols=['Year'])
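
The behaviour above is consistent with the filtered view keeping the full category list of the Year column. The sketch below only inspects the pandas side. It shows that the slice still carries every year as a category, and that dropping the unobserved ones with remove_unused_categories() leaves only 1990. Using this as a workaround before Table.from_pandas is an assumption, not a confirmed fix, and the names sliced and trimmed are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the dummy dataframe from the report with a categorical Year column.
d = pd.date_range('1990-01-01', freq='D', periods=10000)
x = pd.DataFrame(np.random.randn(len(d), 4), index=d, columns=['A', 'B', 'C', 'D'])
x['Year'] = x.index.year
x['Year'] = x['Year'].astype('category')

# Boolean selection keeps the full category list, even though only 1990 is observed.
sliced = x[x.Year == 1990]
n_categories_before = len(sliced['Year'].cat.categories)

# Possible workaround (assumption): drop unobserved categories before converting to Arrow.
trimmed = sliced.assign(Year=sliced['Year'].cat.remove_unused_categories())
n_categories_after = len(trimmed['Year'].cat.categories)

print(n_categories_before, n_categories_after)
```

If the extra partitions come from these unobserved categories, passing trimmed rather than sliced to pa.Table.from_pandas should leave only the Year=1990 partition. That has not been verified against the affected version.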

People

Assignee: Unassigned

Reporter: Vladimir

Votes: 0

Watchers: 6

Dates

Created: 20/Jan/20 14:16

Updated: 11/Jan/23 07:54