Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Labels: None
Description
When reading a Parquet file (which was written by Arrow) with the datasets API, the resulting schema preserves the "ARROW:schema" key in its metadata:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({'a': [1, 2, 3]})
pq.write_table(table, "test.parquet")
dataset = ds.dataset("test.parquet", format="parquet")
In [7]: dataset.schema
Out[7]:
a: int64
-- field metadata --
PARQUET:field_id: '1'
-- schema metadata --
ARROW:schema: '/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABAwAMAAAACAAIAAAABA' + 114

In [8]: dataset.to_table().schema
Out[8]:
a: int64
-- field metadata --
PARQUET:field_id: '1'
-- schema metadata --
ARROW:schema: '/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABAwAMAAAACAAIAAAABA' + 114
whereas when reading with the `parquet` module reader, this metadata is not preserved:
In [9]: pq.read_table("test.parquet").schema
Out[9]:
a: int64
-- field metadata --
PARQUET:field_id: '1'
Since the "ARROW:schema" information is used to properly reconstruct the Arrow schema from the ParquetSchema, it is no longer needed once you already have the arrow schema, so it's probably not needed to keep it as metadata in the arrow schema.
Attachments
Issue Links
- is related to: ARROW-8980 [Python] Metadata grows exponentially when using schema from disk (Resolved)