  1. Apache Arrow
  2. ARROW-9573

[Python] Parquet doesn't load when partitioned column starts with '_'


Details

    Description

      When loading a Parquet dataset whose partition column name starts with an underscore '_', nothing is loaded. No exception is raised either. Loading this dataset worked for me in pyarrow 0.17.1, but it no longer works in pyarrow 1.0.0.

      On the other hand, loading a Parquet dataset with a partition column starting with '_' is possible by passing `use_legacy_dataset=True`. Also, when the column that starts with an underscore is not a partition column, loading seems to work as expected.

      >>> import pyarrow as pa
      >>> import pyarrow.parquet as pq
      >>> import pandas as pd
      >>> df1 = pd.DataFrame(data={'_COL_1': [1, 2], 'COL_2': [3, 4], 'COL_3': [5, 6]})
      >>> table1 = pa.Table.from_pandas(df1)
      >>> pq.write_to_dataset(table1, partition_cols=['_COL_1', 'COL_2'], root_path='test_parquet1')
      >>> df_pq1 = pq.read_table('test_parquet1')
      >>> df_pq1
      pyarrow.Table
      >>> len(df_pq1)
      0
      >>> df_pq1_legacy = pq.read_table('test_parquet1', use_legacy_dataset=True)
      >>> df_pq1_legacy
      pyarrow.Table
      COL_3: int64
      _COL_1: dictionary<values=int64, indices=int32, ordered=0>
      COL_2: dictionary<values=int64, indices=int32, ordered=0>
      >>> len(df_pq1_legacy)
      2
      >>> df2 = pd.DataFrame(data={'COL_1': [1, 2], 'COL_2': [3, 4], '_COL_3': [5, 6]})
      >>> table2 = pa.Table.from_pandas(df2)
      >>> pq.write_to_dataset(table2, partition_cols=['COL_1', 'COL_2'], root_path='test_parquet2')
      >>> df_pq2 = pq.read_table('test_parquet2')
      >>> df_pq2
      pyarrow.Table
      _COL_3: int64
      COL_1: int32
      COL_2: int32
      >>> len(df_pq2)
      2
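
One possible explanation for the session above, which the report itself does not confirm, is prefix-based filtering during dataset discovery: `pyarrow.dataset.FileSystemFactoryOptions` documents a `selector_ignore_prefixes` option whose default is `['.', '_']`, so a directory such as `_COL_1=1` might be skipped along with everything below it. The sketch below only models that idea in plain Python. The helpers `is_ignored` and `discover` and the `part-0.parquet` file names are hypothetical, and this is not pyarrow's actual implementation.

```python
from pathlib import PurePosixPath

# Assumption: mirrors the documented default of
# pyarrow.dataset.FileSystemFactoryOptions.selector_ignore_prefixes.
IGNORE_PREFIXES = (".", "_")


def is_ignored(path, base_dir, ignore_prefixes=IGNORE_PREFIXES):
    """Return True if any path segment below base_dir starts with an ignored prefix."""
    relative = PurePosixPath(path).relative_to(base_dir)
    return any(part.startswith(ignore_prefixes) for part in relative.parts)


def discover(paths, base_dir):
    """Keep only the paths that survive prefix filtering (hypothetical helper)."""
    return [p for p in paths if not is_ignored(p, base_dir)]


# Hypothetical file names; write_to_dataset generates its own.
files1 = [
    "test_parquet1/_COL_1=1/COL_2=3/part-0.parquet",
    "test_parquet1/_COL_1=2/COL_2=4/part-0.parquet",
]
files2 = [
    "test_parquet2/COL_1=1/COL_2=3/part-0.parquet",
    "test_parquet2/COL_1=2/COL_2=4/part-0.parquet",
]

print(len(discover(files1, "test_parquet1")))  # every file sits under a "_COL_1=..." directory
print(len(discover(files2, "test_parquet2")))  # no directory starts with "_" or "."
```

Under this model, the first layout yields no files and the second yields both, which matches the row counts of 0 and 2 seen above. It would also be consistent with `_COL_3` loading fine, since a non-partition column name never appears in a directory name.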
      

            People

              bkietz Ben Kietzman
              tonnamb Tonnam Balankura