Apache Arrow / ARROW-8729

[C++][Dataset] Only selecting a partition column results in empty table

    Details

      Description

      Python reproducer:

      import pyarrow as pa
      import pyarrow.parquet as pq
      import pyarrow.dataset as ds
      path = "test_dataset"
      
      table = pa.table({'part': ['a', 'a', 'b', 'b'], 'col': [1, 2, 3, 4]})
      pq.write_to_dataset(table, str(path), partition_cols=["part"])
      

      gives

      In [38]: ds.dataset(str(path), partitioning="hive").to_table().num_rows
      Out[38]: 4

      In [39]: ds.dataset(str(path), partitioning="hive").to_table(columns=["part"]).num_rows
      Out[39]: 0
      

      The schema of the resulting table correctly contains only the "part" column, but the table has 0 rows instead of the expected 4.

      cc Ben Kietzman

              People

              • Assignee: Ben Kietzman (bkietz)
              • Reporter: Joris Van den Bossche (jorisvandenbossche)

                  Time Tracking

                  • Original Estimate: Not Specified
                  • Remaining Estimate: 0h
                  • Time Spent: 2.5h