Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version: 1.0.0
Description
When loading a Parquet dataset with a partition column whose name starts with an underscore ('_'), nothing is loaded and no exception is raised. Loading this dataset worked for me in pyarrow 0.17.1, but no longer works in pyarrow 1.0.0.
On the other hand, reading such a dataset does work when passing the `use_legacy_dataset=True` option. Also, when the column that starts with an underscore is not a partition column, reading works as expected. Reproduction below:
```python
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pandas as pd

>>> df1 = pd.DataFrame(data={'_COL_1': [1, 2], 'COL_2': [3, 4], 'COL_3': [5, 6]})
>>> table1 = pa.Table.from_pandas(df1)
>>> pq.write_to_dataset(table1, partition_cols=['_COL_1', 'COL_2'], root_path='test_parquet1')

>>> df_pq1 = pq.read_table('test_parquet1')
>>> df_pq1
pyarrow.Table
>>> len(df_pq1)
0

>>> df_pq1_legacy = pq.read_table('test_parquet1', use_legacy_dataset=True)
>>> df_pq1_legacy
pyarrow.Table
COL_3: int64
_COL_1: dictionary<values=int64, indices=int32, ordered=0>
COL_2: dictionary<values=int64, indices=int32, ordered=0>
>>> len(df_pq1_legacy)
2

>>> df2 = pd.DataFrame(data={'COL_1': [1, 2], 'COL_2': [3, 4], '_COL_3': [5, 6]})
>>> table2 = pa.Table.from_pandas(df2)
>>> pq.write_to_dataset(table2, partition_cols=['COL_1', 'COL_2'], root_path='test_parquet2')

>>> df_pq2 = pq.read_table('test_parquet2')
>>> df_pq2
pyarrow.Table
_COL_3: int64
COL_1: int32
COL_2: int32
>>> len(df_pq2)
2
```
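Since the legacy reader above returns both rows, the data itself appears to be written correctly and only the new (non-legacy) read path is affected. A minimal sketch to double-check what `write_to_dataset` actually produced on disk, using only the standard library; the layout shown in the comments is my assumption of the usual Hive-style output, and the file names are generated so they will differ:

```python
# Sketch: walk the dataset written above and print its contents. The
# partition directories starting with '_COL_1=' should exist on disk even
# though pq.read_table('test_parquet1') returns an empty table.
import os

for root, dirs, files in os.walk('test_parquet1'):
    for name in dirs + files:
        print(os.path.join(root, name))

# Expected layout (assumed; generated file names will differ):
# test_parquet1/_COL_1=1
# test_parquet1/_COL_1=1/COL_2=3
# test_parquet1/_COL_1=1/COL_2=3/<generated>.parquet
# test_parquet1/_COL_1=2
# test_parquet1/_COL_1=2/COL_2=4
# test_parquet1/_COL_1=2/COL_2=4/<generated>.parquet
```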