  1. Apache Arrow
  2. ARROW-8802

[C++][Dataset] Schema metadata are lost when reading a subset of columns

Details

    Description

      Python example:

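      import pandas as pd
      import pyarrow.dataset as ds

      # Write a small DataFrame to Parquet; pandas stores its own metadata in the file
      df = pd.DataFrame({'a': [1, 2, 3]})
      df.to_parquet("test_metadata.parquet")

      # Open the Parquet file as a dataset
      dataset = ds.dataset("test_metadata.parquet")
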

      gives:

      >>> dataset.to_table().schema 
      a: int64
        -- field metadata --
        PARQUET:field_id: '1'
      -- schema metadata --
      pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 397
      ARROW:schema: '/////4ACAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 806
      
      >>> dataset.to_table(columns=['a']).schema 
      a: int64
        -- field metadata --
        PARQUET:field_id: '1'
      

      So when a subset of the columns is specified, the schema-level metadata entries (pandas, ARROW:schema) are lost, even though they can still be informative, e.g. for conversion to pandas. Only the field-level metadata survives.

            People

              fsaintjacques Francois Saint-Jacques
              jorisvandenbossche Joris Van den Bossche

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h 10m