  1. Apache Arrow
  2. ARROW-9009

[C++][Dataset] ARROW:schema should be removed from schema's metadata when reading Parquet files

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: C++

    Description

      When a Parquet file written by Arrow is read with the datasets API, the "ARROW:schema" field is preserved in the schema metadata:

      import pyarrow as pa
      import pyarrow.parquet as pq
      import pyarrow.dataset as ds
      
      table = pa.table({'a': [1, 2, 3]})
      pq.write_table(table, "test.parquet")
      
      dataset = ds.dataset("test.parquet", format="parquet")
      
      In [7]: dataset.schema
      Out[7]: 
      a: int64
        -- field metadata --
        PARQUET:field_id: '1'
      -- schema metadata --
      ARROW:schema: '/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABAwAMAAAACAAIAAAABA' + 114
      
      In [8]: dataset.to_table().schema
      Out[8]: 
      a: int64
        -- field metadata --
        PARQUET:field_id: '1'
      -- schema metadata --
      ARROW:schema: '/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABAwAMAAAACAAIAAAABA' + 114
      

      By contrast, reading the same file with the `pyarrow.parquet` reader does not preserve this metadata:

      In [9]: pq.read_table("test.parquet").schema
      Out[9]: 
      a: int64
        -- field metadata --
        PARQUET:field_id: '1'
      

      The "ARROW:schema" entry exists only so that the Arrow schema can be properly reconstructed from the Parquet schema. Once the Arrow schema is available, the entry is no longer needed, so it probably should not be kept as metadata in the Arrow schema.
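
      Until the datasets reader drops the entry itself, the key can be stripped on the user side. The following is a minimal sketch of such a workaround, not the fix made in the C++ code. The helper names `drop_arrow_schema_key` and `strip_from_table` are hypothetical. It assumes that schema metadata is exposed as a dict with bytes keys (or None), and that the table object provides `replace_schema_metadata`, as `pyarrow.Table` does.

```python
ARROW_SCHEMA_KEY = b"ARROW:schema"


def drop_arrow_schema_key(metadata):
    """Return a copy of `metadata` without the ARROW:schema entry.

    `metadata` is expected to look like `pyarrow.Schema.metadata`:
    a dict with bytes keys and values, or None when there is no metadata.
    Returns None if nothing is left after removing the key.
    """
    if not metadata:
        return None
    cleaned = {k: v for k, v in metadata.items() if k != ARROW_SCHEMA_KEY}
    return cleaned or None


def strip_from_table(table):
    """Hypothetical helper: return `table` with ARROW:schema removed.

    Assumes `table` is a pyarrow.Table (or anything with the same
    `schema.metadata` / `replace_schema_metadata` interface).
    """
    return table.replace_schema_metadata(
        drop_arrow_schema_key(table.schema.metadata)
    )
```

      With the example above, `strip_from_table(dataset.to_table()).schema` would be expected to show no schema-level "ARROW:schema" entry, while any other schema metadata keys are kept. Field-level metadata such as PARQUET:field_id is not touched by this sketch.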

            People

              Assignee: Wes McKinney (wesm)
              Reporter: Joris Van den Bossche (jorisvandenbossche)
              Votes: 0
              Watchers: 4
