Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Problem
Affects Version/s: 6.0.1
Fix Version/s: None
Description
While experimenting with persisting a partitioned dataset to parquet, I stumbled upon an interesting feature (or bug?): after restoring only a single partition and applying groupby, all the dates that were filtered out at read time suddenly reappear in the result.
The following code demonstrates the issue:
import numpy as np
import os
import pandas as pd  # 1.3.4
import pyarrow as pa  # 6.0.1
import random
import shutil
import string
import tempfile
from datetime import datetime, timedelta

if __name__ == '__main__':
    # 1. generate random data frame
    day_count = 5
    data_length = 10
    numpy_random_gen = np.random.default_rng()
    label_choices = [''.join(random.choices(string.ascii_uppercase + string.digits, k=8)) for _ in range(5)]
    partial_dfs = []
    start_date = datetime.today().date() - timedelta(days=day_count)
    for date in (start_date + timedelta(n) for n in range(day_count)):
        date_array = pd.to_datetime(np.full(data_length, date)).date
        label_array = np.full(data_length, [random.choice(label_choices) for _ in range(data_length)])
        value_array = numpy_random_gen.integers(low=1, high=500, size=data_length)
        partial_dfs.append(pd.DataFrame(data={'date': date_array, 'label': label_array, 'value': value_array}))
    df = pd.concat(partial_dfs, ignore_index=True)
    print(f"Unique dates before restore:\n{df.drop_duplicates(subset='date')['date']}")

    # 2. persist data frame partitioned by date
    dataset_dir = tempfile.mkdtemp()
    df.to_parquet(path=dataset_dir, engine='pyarrow', partition_cols=['date', 'label'])

    # 3. restore from parquet partitioned dataset
    restored_df = pd.read_parquet(dataset_dir, engine='pyarrow', filters=[
        ('date', '=', str(start_date))], use_legacy_dataset=False)
    print(f"Unique dates after restore:\n{restored_df.drop_duplicates(subset='date')['date']}")

    group_by_df = restored_df.groupby(by=['date', 'label'])['value'].sum().reset_index(name='val_sum')
    print(group_by_df)

    shutil.rmtree(dataset_dir)
It correctly reports five unique dates upon random df generation and correctly reports only one after reading back from parquet:
Unique dates after restore:
0    2021-11-13
Name: date, dtype: category
Categories (5, object): ['2021-11-13', '2021-11-14', '2021-11-15', '2021-11-16', '2021-11-17']
Note, however, that it still lists 5 categories. When I subsequently perform a groupby, all the dates that were filtered out at read time miraculously reappear:
group_by_df = restored_df.groupby(by=['date', 'label'])['value'].sum().reset_index(name='val_sum')
print(group_by_df)
With the following output:
          date     label  val_sum
0   2021-11-13  04LOXJCH      494
1   2021-11-13  4QOZ321D      819
2   2021-11-13  GG6YO5FS      394
3   2021-11-13  J7ZD3LDS      203
4   2021-11-13  TFVIXE6L      164
5   2021-11-14  04LOXJCH        0
6   2021-11-14  4QOZ321D        0
7   2021-11-14  GG6YO5FS        0
8   2021-11-14  J7ZD3LDS        0
9   2021-11-14  TFVIXE6L        0
10  2021-11-15  04LOXJCH        0
11  2021-11-15  4QOZ321D        0
12  2021-11-15  GG6YO5FS        0
13  2021-11-15  J7ZD3LDS        0
14  2021-11-15  TFVIXE6L        0
15  2021-11-16  04LOXJCH        0
16  2021-11-16  4QOZ321D        0
17  2021-11-16  GG6YO5FS        0
18  2021-11-16  J7ZD3LDS        0
19  2021-11-16  TFVIXE6L        0
20  2021-11-17  04LOXJCH        0
21  2021-11-17  4QOZ321D        0
22  2021-11-17  GG6YO5FS        0
23  2021-11-17  J7ZD3LDS        0
24  2021-11-17  TFVIXE6L        0
Perhaps I am using read_parquet incorrectly, but I would expect the filtered-out data to be gone entirely after the read operation.
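For what it's worth, the same symptom seems reproducible without parquet at all, which would point at how pandas groups categorical columns rather than at the filter itself. The following is only a minimal sketch under that assumption: the frame `restored_like` and its values are hypothetical stand-ins for the restored data, built so that the `date` column is a categorical that still lists dates not present in any row. It compares the default-style grouping with two possible workarounds, `observed=True` and `cat.remove_unused_categories()`.

```python
import pandas as pd

# Hypothetical stand-in for the restored frame: only one date is present,
# but the categorical dtype still lists every partition value.
dates = pd.Categorical(
    ["2021-11-13"] * 3,
    categories=["2021-11-13", "2021-11-14", "2021-11-15"],
)
labels = pd.Categorical(["A", "B", "A"], categories=["A", "B"])
restored_like = pd.DataFrame({"date": dates, "label": labels, "value": [1, 2, 3]})

# observed=False (the default in pandas 1.3.4) emits one row per category
# combination, including combinations that have no rows behind them.
all_combos = (
    restored_like.groupby(["date", "label"], observed=False)["value"]
    .sum()
    .reset_index(name="val_sum")
)

# Possible workaround 1: group only the category values actually present.
observed_only = (
    restored_like.groupby(["date", "label"], observed=True)["value"]
    .sum()
    .reset_index(name="val_sum")
)

# Possible workaround 2: drop the unused categories before grouping.
trimmed = restored_like.assign(
    date=restored_like["date"].cat.remove_unused_categories()
)
trimmed_groups = (
    trimmed.groupby(["date", "label"], observed=False)["value"]
    .sum()
    .reset_index(name="val_sum")
)

print(len(all_combos), len(observed_only), len(trimmed_groups))  # 6 2 2
```

If this is indeed the cause, the rows were filtered correctly at read time and only the category metadata survives, so either workaround should give the output I originally expected.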
Issue Links
- relates to ARROW-5436 [Python] expose filters argument in parquet.read_table (Resolved)