[ARROW-3210] [Python] Creating ParquetDataset creates partitioned ParquetFiles with mismatched Parquet schemas - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.9.0
Fix Version/s: 0.10.0
Component/s: Python
Labels: parquet
Environment: Ubuntu 16.04 LTS, System76 Oryx Pro
External issue URL: https://github.com/apache/arrow/issues/19554

Description

STEPS TO REPRODUCE:

1. Create a conda environment from the attached environment.yml.

2. Run the script repro.py, replacing its config variables as needed, to write repro.csv to S3 as a partitioned ParquetDataset.

3. Open that dataset as a ParquetDataset with the script repro_2.py, again replacing its config variables.

EXPECTED:

The ParquetDataset is created without error.

GOT:

A ValueError from the validate_schemas() method reporting mismatched Arrow schemas:

```python

ValueError: Schema in partition[Draught=1, Name=1, VesselType=0, x=1, Heading=1] s3://kio-tests-files/_tmp/test_parquet_dataset/Draught=10.3/Name=MSC RAFAELA/VesselType=Cargo/x=130.43158/Heading=270.0/e9e3cea5a5c24c4da587c263ec817c98.parquet was different.
Record_ID: int64
y: double
TRACKID: string
MMSI: int64
IMO: int64
AgeMinutes: double
SoG: double
Width: int64
Length: int64
Callsign: string
Destination: string
ETA: int64
Status: string
ExtraInfo: string
TIMESTAMP: int64
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
b' [{"name": "Record_ID", "field_name": "Record_ID", "pandas_type"'
b': "int64", "numpy_type": "int64", "metadata": null}, {"name": "y'
b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f'
b'loat64", "metadata": null}, {"name": "TRACKID", "field_name": "T'
b'RACKID", "pandas_type": "unicode", "numpy_type": "object", "meta'
b'data": null}, {"name": "MMSI", "field_name": "MMSI", "pandas_typ'
b'e": "int64", "numpy_type": "int64", "metadata": null}, {"name": '
b'"IMO", "field_name": "IMO", "pandas_type": "int64", "numpy_type"'
b': "int64", "metadata": null}, {"name": "AgeMinutes", "field_name'
b'": "AgeMinutes", "pandas_type": "float64", "numpy_type": "float6'
b'4", "metadata": null}, {"name": "SoG", "field_name": "SoG", "pan'
b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
b', {"name": "Width", "field_name": "Width", "pandas_type": "int64'
b'", "numpy_type": "int64", "metadata": null}, {"name": "Length", '
b'"field_name": "Length", "pandas_type": "int64", "numpy_type": "i'
b'nt64", "metadata": null}, {"name": "Callsign", "field_name": "Ca'
b'llsign", "pandas_type": "unicode", "numpy_type": "object", "meta'
b'data": null}, {"name": "Destination", "field_name": "Destination'
b'", "pandas_type": "unicode", "numpy_type": "object", "metadata":'
b' null}, {"name": "ETA", "field_name": "ETA", "pandas_type": "int'
b'64", "numpy_type": "int64", "metadata": null}, {"name": "Status"'
b', "field_name": "Status", "pandas_type": "unicode", "numpy_type"'
b': "object", "metadata": null}, {"name": "ExtraInfo", "field_name'
b'": "ExtraInfo", "pandas_type": "unicode", "numpy_type": "object"'
b', "metadata": null}, {"name": "TIMESTAMP", "field_name": "TIMEST'
b'AMP", "pandas_type": "int64", "numpy_type": "int64", "metadata":'
b' null}, {"name": null, "field_name": "__index_level_0__", "panda'
b's_type": "int64", "numpy_type": "int64", "metadata": null}], "pa'
b'ndas_version": "0.21.0"}'}

vs

Record_ID: int64
y: double
TRACKID: string
MMSI: int64
IMO: int64
AgeMinutes: double
SoG: double
Width: int64
Length: int64
Callsign: string
Destination: string
ETA: int64
Status: string
ExtraInfo: null
TIMESTAMP: int64
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
b' [{"name": "Record_ID", "field_name": "Record_ID", "pandas_type"'
b': "int64", "numpy_type": "int64", "metadata": null}, {"name": "y'
b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f'
b'loat64", "metadata": null}, {"name": "TRACKID", "field_name": "T'
b'RACKID", "pandas_type": "unicode", "numpy_type": "object", "meta'
b'data": null}, {"name": "MMSI", "field_name": "MMSI", "pandas_typ'
b'e": "int64", "numpy_type": "int64", "metadata": null}, {"name": '
b'"IMO", "field_name": "IMO", "pandas_type": "int64", "numpy_type"'
b': "int64", "metadata": null}, {"name": "AgeMinutes", "field_name'
b'": "AgeMinutes", "pandas_type": "float64", "numpy_type": "float6'
b'4", "metadata": null}, {"name": "SoG", "field_name": "SoG", "pan'
b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
b', {"name": "Width", "field_name": "Width", "pandas_type": "int64'
b'", "numpy_type": "int64", "metadata": null}, {"name": "Length", '
b'"field_name": "Length", "pandas_type": "int64", "numpy_type": "i'
b'nt64", "metadata": null}, {"name": "Callsign", "field_name": "Ca'
b'llsign", "pandas_type": "unicode", "numpy_type": "object", "meta'
b'data": null}, {"name": "Destination", "field_name": "Destination'
b'", "pandas_type": "unicode", "numpy_type": "object", "metadata":'
b' null}, {"name": "ETA", "field_name": "ETA", "pandas_type": "int'
b'64", "numpy_type": "int64", "metadata": null}, {"name": "Status"'
b', "field_name": "Status", "pandas_type": "unicode", "numpy_type"'
b': "object", "metadata": null}, {"name": "ExtraInfo", "field_name'
b'": "ExtraInfo", "pandas_type": "empty", "numpy_type": "object", '
b'"metadata": null}, {"name": "TIMESTAMP", "field_name": "TIMESTAM'
b'P", "pandas_type": "int64", "numpy_type": "int64", "metadata": n'
b'ull}, {"name": null, "field_name": "__index_level_0__", "pandas_'
b'type": "int64", "numpy_type": "int64", "metadata": null}], "pand'
b'as_version": "0.21.0"}'}
```

The mismatch is in the column ExtraInfo. In the partitioned ParquetDatasetPiece that references the second Parquet file created, its Arrow type is string (pandas_type unicode). In the ParquetDataset schema, which references the first Parquet file created, the same column has Arrow type null (pandas_type empty).
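Because the two schemas in the error output differ in a single field, the difference is easy to miss by eye. The following is a minimal sketch of a field-by-field comparison. It does not use the pyarrow API: the schemas are transcribed by hand into plain dicts, and the names `PIECE_SCHEMA`, `DATASET_SCHEMA`, and `diff_schemas` are hypothetical helpers introduced only for illustration.

```python
# Field name -> Arrow type, transcribed from the first schema in the error
# output (the dataset piece). These dicts stand in for real schema objects.
PIECE_SCHEMA = {
    "Record_ID": "int64",
    "y": "double",
    "TRACKID": "string",
    "MMSI": "int64",
    "IMO": "int64",
    "AgeMinutes": "double",
    "SoG": "double",
    "Width": "int64",
    "Length": "int64",
    "Callsign": "string",
    "Destination": "string",
    "ETA": "int64",
    "Status": "string",
    "ExtraInfo": "string",
    "TIMESTAMP": "int64",
    "__index_level_0__": "int64",
}

# The second schema in the error output (the dataset-level schema) is
# identical except that ExtraInfo has type null.
DATASET_SCHEMA = dict(PIECE_SCHEMA, ExtraInfo="null")


def diff_schemas(left, right):
    """Return {field: (left_type, right_type)} for every field whose type
    differs, or that is present on only one side (missing side is None)."""
    names = list(left) + [name for name in right if name not in left]
    return {
        name: (left.get(name), right.get(name))
        for name in names
        if left.get(name) != right.get(name)
    }


mismatches = diff_schemas(PIECE_SCHEMA, DATASET_SCHEMA)
for name, (piece_type, dataset_type) in mismatches.items():
    print(f"{name}: piece={piece_type}, dataset={dataset_type}")
# prints "ExtraInfo: piece=string, dataset=null"
```

One plausible reading, stated here as an assumption rather than something the report confirms, is that the first file's partition contained no non-null ExtraInfo values, so its type was inferred as null while the second file's was inferred as string.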

Attachments

- environment.yml (3 kB), 10/Sep/18 19:41, Ying Wang
- repro.csv (4 kB), 10/Sep/18 19:42, Ying Wang
- repro_2.py (0.2 kB), 10/Sep/18 19:46, Ying Wang
- repro.py (0.4 kB), 10/Sep/18 19:47, Ying Wang

Issue Links

is duplicated by

- ARROW-2891 [Python] Preserve schema in write_to_dataset (Resolved)
