Apache Arrow / ARROW-3861

[Python] ParquetDataset().read columns argument always returns partition column

Details

    Description

      No matter which columns are passed to `read()` when loading a partitioned dataset, the partition column is always returned as well. This can lead to surprising behaviour, because the resulting dataframe has more columns than requested:

      import os
      import shutil

      import numpy as np
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      
      PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
      
      if os.path.exists(PATH_PYARROW_MANUAL):
          shutil.rmtree(PATH_PYARROW_MANUAL)
      os.mkdir(PATH_PYARROW_MANUAL)
      
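      # dtype=object keeps the ragged lists and the NaN placeholders as Python objects
      arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan], dtype=object)
      strings = np.array([np.nan, np.nan, 'a', 'b'], dtype=object)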
      
      df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
      df.index.name='DPRD_ID'
      df['arrays'] = pd.Series(arrays)
      df['strings'] = pd.Series(strings)
      
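      # 'new_column' is left out of the schema because df has no such column
      my_schema = pa.schema([('DPRD_ID', pa.int64()),
                             ('partition_column', pa.int32()),
                             ('arrays', pa.list_(pa.int32())),
                             ('strings', pa.string())])
      
      # preserve_index=True stores the DPRD_ID index as a regular column
      table = pa.Table.from_pandas(df, schema=my_schema, preserve_index=True)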
      pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL, partition_cols=['partition_column'])
      
      df_pq = pq.ParquetDataset(PATH_PYARROW_MANUAL).read(columns=['DPRD_ID', 'strings']).to_pandas()
      # pd.read_parquet(PATH_PYARROW_MANUAL, columns=['DPRD_ID', 'strings'], engine='pyarrow')
      df_pq
      

      Although only `DPRD_ID` and `strings` were requested, `df_pq` also contains the column `partition_column`.
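
      The mismatch described above can be checked with a small helper. This is only a sketch: `unrequested_columns` is a hypothetical function, not part of pyarrow or pandas, and the `returned` list is written out by hand to mirror the report rather than read from an actual dataset.

```python
def unrequested_columns(requested, returned):
    """Return the names in `returned` that were not asked for, in order."""
    wanted = set(requested)
    return [name for name in returned if name not in wanted]


# Column names mirroring the report (hand-written, not read from disk).
requested = ['DPRD_ID', 'strings']
returned = ['DPRD_ID', 'strings', 'partition_column']

extra = unrequested_columns(requested, returned)
print(extra)  # ['partition_column']
```

      As a possible workaround until the reader honours the `columns` argument, the extra names could be dropped after loading, for example with `df_pq.drop(columns=unrequested_columns(['DPRD_ID', 'strings'], df_pq.columns))`.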

            People

              jorisvandenbossche Joris Van den Bossche
              cthi Christian Thiel
              Votes: 0
              Watchers: 5

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 0.5h