Apache Arrow / ARROW-8613

[C++][Dataset] Raise error for unparsable partition value

Details

    Description

      Currently, when you specify a partitioning schema and one of the partition field values cannot be parsed according to the specified type, you silently get null values for that partition field.

      Python example:

      import pathlib
      import pyarrow as pa
      import pyarrow.dataset as ds
      import pyarrow.parquet as pq

      path = pathlib.Path(".") / "dataset_partition_schema_errors"
      path.mkdir(exist_ok=True)

      # The "part" values are strings that cannot be parsed as integers
      table = pa.table({"part": ["1_2", "1_2", "3_4", "3_4"], "values": range(4)})
      pq.write_to_dataset(table, str(path), partition_cols=["part"])
      
      In [17]: ds.dataset(path, partitioning="hive").to_table().to_pandas() 
      Out[17]: 
         values part
      0       0  1_2
      1       1  1_2
      2       2  3_4
      3       3  3_4
      
      In [18]: partitioning = ds.partitioning(pa.schema([("part", pa.int64())]), flavor="hive")                                                                                                                          
      
      In [19]: ds.dataset(path, partitioning=partitioning).to_table().to_pandas()   
      Out[19]: 
         values  part
      0       0   NaN
      1       1   NaN
      2       2   NaN
      3       3   NaN
      

      Silently ignoring such a parse error doesn't seem like the best default to me, since partition keys are quite essential. I think raising an error might be better?
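To make the proposed behavior concrete, here is a minimal pure-Python sketch of "raise instead of returning null" for hive-style path segments. It is not Arrow's implementation: `strict_int` and `parse_hive_path` are hypothetical names introduced only for illustration, and the real parsing happens in the C++ dataset layer.

```python
import re
from typing import Any, Callable, Dict


def strict_int(raw: str) -> int:
    """Parse a plain decimal integer, rejecting anything else.

    Python's built-in int() accepts underscores ("1_2" parses as 12),
    so the input is checked explicitly first.
    """
    if re.fullmatch(r"[+-]?[0-9]+", raw) is None:
        raise ValueError(f"not an integer: {raw!r}")
    return int(raw)


def parse_hive_path(path: str, parsers: Dict[str, Callable[[str], Any]]) -> Dict[str, Any]:
    """Extract key=value partition segments from a path (hypothetical helper).

    Raises ValueError when a value cannot be parsed with the parser
    declared for its field, instead of silently yielding a null.
    """
    result: Dict[str, Any] = {}
    for segment in path.split("/"):
        if "=" not in segment:
            continue  # not a partition directory, e.g. the file name
        key, _, raw = segment.partition("=")
        if key not in parsers:
            continue  # field is not part of the declared schema
        try:
            result[key] = parsers[key](raw)
        except ValueError as exc:
            raise ValueError(
                f"could not parse partition value {raw!r} for field {key!r}"
            ) from exc
    return result
```

With this sketch, `parse_hive_path("part=12/file.parquet", {"part": strict_int})` yields a parsed integer, while a segment such as `part=1_2` raises a `ValueError` that names the offending field and value.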

            People

              bkietz Ben Kietzman
              jorisvandenbossche Joris Van den Bossche

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h