Apache Arrow / ARROW-9363

[C++][Dataset] ParquetDatasetFactory schema: pandas metadata is lost

    Description

      When using the standard dataset factory (ds.dataset), the pandas metadata is included in the schema metadata of the resulting dataset, but when using the ParquetDatasetFactory (ds.parquet_dataset), it is not:

      Using dask to write a small partitioned dataset that includes a _metadata file:

      import pandas as pd
      import dask.dataframe as dd

      df = pd.DataFrame({"part": ["A", "A", "B", "B"], "col": [1, 2, 3, 4]})
      ddf = dd.from_pandas(df, npartitions=2)
      ddf.to_parquet("test_parquet_pandas_metadata/", engine="pyarrow")
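
      A quick check (hypothetical listing; the part file names are just what dask typically writes) that the shared _metadata file is actually present next to the data files:

      import os
      sorted(os.listdir("test_parquet_pandas_metadata/"))
      # e.g. ['_common_metadata', '_metadata', 'part.0.parquet', 'part.1.parquet']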
      
      In [9]: import pyarrow.dataset as ds                                                                                                                                                                               
      
      # with ds.dataset -> pandas metadata included
      In [11]: ds.dataset("test_parquet_pandas_metadata/", format="parquet", partitioning="hive").schema                                                                                                                 
      Out[11]: 
      part: string
        -- field metadata --
        PARQUET:field_id: '1'
      col: int64
        -- field metadata --
        PARQUET:field_id: '2'
      index: int64
        -- field metadata --
        PARQUET:field_id: '3'
      -- schema metadata --
      pandas: '{"index_columns": ["index"], "column_indexes": [{"name": null, "' + 558
      
      # with parquet_dataset -> pandas metadata not included
      In [14]: ds.parquet_dataset("test_parquet_pandas_metadata/_metadata",  partitioning="hive").schema                                                                                                                 
      Out[14]: 
      part: string
        -- field metadata --
        PARQUET:field_id: '1'
      col: int64
        -- field metadata --
        PARQUET:field_id: '2'
      index: int64
        -- field metadata --
        PARQUET:field_id: '3'
      
      # to show that the pandas metadata is present in the actual Parquet FileMetaData
      In [16]: import pyarrow.parquet as pq

      In [17]: pq.read_metadata("test_parquet_pandas_metadata/_metadata").metadata
      Out[17]: 
      {b'ARROW:schema': b'/////4ADAAAQAAAAAAAKAA4AB...',
       b'pandas': b'{"index_columns": ["index"], ...'}
      
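      A possible workaround sketch until ParquetDatasetFactory propagates the file-level key-value metadata: re-attach the pandas metadata manually. This sketch assumes the existing pyarrow APIs pq.read_metadata (used above), Schema.with_metadata and Table.replace_schema_metadata, and is not meant as the actual fix:

      import pyarrow.dataset as ds
      import pyarrow.parquet as pq

      dataset = ds.parquet_dataset("test_parquet_pandas_metadata/_metadata", partitioning="hive")

      # the key-value metadata (including b'pandas') is available in the _metadata file
      file_meta = pq.read_metadata("test_parquet_pandas_metadata/_metadata")

      # re-attach it on the schema ...
      schema = dataset.schema.with_metadata(file_meta.metadata)

      # ... or on the materialized table, so that to_pandas() restores the index
      table = dataset.to_table().replace_schema_metadata(file_meta.metadata)
      df = table.to_pandas()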

    People

      Assignee: Joris Van den Bossche (jorisvandenbossche)
      Reporter: Joris Van den Bossche (jorisvandenbossche)